Digital: Data Handling
Accelerating Life Sciences Research and Delivery with Knowledge Graphs
Advanced data handling techniques may point the way to an innovation-rich future, as two case studies are already demonstrating
Alexander Jarasch at Neo4j
From aiding in early drug discovery, to better understanding the connections between genes, proteins, cells and tissues, life sciences researchers are applying the power of graph databases to what were previously intractably hard problems. The context here is the sector’s adoption of new standards. SDTM (Study Data Tabulation Model), a way of organising human clinical and nonclinical study data tabulations, and one of the required standards for data submission to the FDA (US) and PMDA (Japan), is of huge importance. Another important standard is ADaM (Analysis Data Model), which defines data set and metadata standards for the efficient generation, replication and review of clinical trial statistical analyses. Finally, there’s CDISC 360, an initiative aimed at implementing standards as linked metadata, which provides the additional semantics needed to support metadata-driven automation across the end-to-end clinical research data life cycle.
In parallel, there is an extensive and ongoing digital transformation of pharma industry regulatory processes. As a result, data features high on the pharma research and development agenda. And it’s in this context that knowledge graphs and their ability to draw insights from complex data correlations are proving highly useful to life sciences organisations.
The Power of Knowledge Graphs
Knowledge graphs are multidimensional and work on the basis that every data set is a connected element. Unlike traditional SQL databases, which store data in tables with fixed columns and rows, knowledge graphs store data as nodes (or entities) connected by edges (or relationships). It is in the power of those interconnections that the breakthrough insights lie. For example, in the Panama Papers the use of a knowledge graph made it possible to represent the complex network of offshore accounts, shell companies and individuals involved in the scandal.
Because knowledge graphs are designed to represent complex data, they can be used in a wide range of applications beyond just financial investigations. For example, they can be used in biological science to represent the complex interrelationships and correlations between information about diseases, genes, the environment, diet, behaviour and other factors. The more these interrelationships and correlations can be analysed, the richer the knowledge and the faster important deductions can be made. Modern native graph databases have made it possible to perform mass-scale cross-comparisons involving billions of connections, which can help researchers identify patterns and connections that might not be immediately obvious; this has the potential to transform fields such as medicine.
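To make the node-and-edge model concrete, here is a minimal sketch of how such biomedical relationships might be represented, using the open-source networkx library; the entity names and relationship types are purely illustrative, not drawn from any of the systems described in this article.

```python
# Illustrative biomedical knowledge-graph sketch using networkx.
# All entities and relations here are examples, not real project data.
import networkx as nx

g = nx.MultiDiGraph()  # nodes are entities, edges are typed relationships

# Nodes (entities), each labelled with its type
g.add_node("TCF7L2", kind="gene")
g.add_node("type 2 diabetes", kind="disease")
g.add_node("insulin secretion", kind="process")

# Edges (relationships) carry their own semantics
g.add_edge("TCF7L2", "type 2 diabetes", relation="ASSOCIATED_WITH")
g.add_edge("TCF7L2", "insulin secretion", relation="REGULATES")

# Insight comes from traversing connections, not joining tables
for _, target, data in g.out_edges("TCF7L2", data=True):
    print(f"TCF7L2 -[{data['relation']}]-> {target}")
```

The point of the model is that the relationships are first-class data: a query follows edges outward from an entity rather than reassembling context from fixed rows and columns.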
While there are existing use cases for knowledge graphs in life sciences, the potential of the technology is still largely untapped. Some 90% of pharma companies are now using knowledge graphs, but they still represent only a fraction of those companies’ databases, leaving considerable scope to extract further insight from their data.
Dealing with Multiple Pharma Data Challenges
Two use cases that demonstrate the application of knowledge graphs in life sciences are: GSK (formerly GlaxoSmithKline), which has been using knowledge graphs to improve its clinical reporting workflows and tackle new regulatory standards; and AstraZeneca, which has used knowledge graphs to help in the important area of reaction and synthesis prediction, to make the process of developing new organic molecules easier.
In the case of GSK, Clinical Programming leader Alexey Kuznetzov confirms: “Our aim is to eventually introduce end-to-end automation in clinical reporting.”

Clinical Programming director Jorine Putter states: “We expend a tremendous amount of time and effort progressing our data through its life cycle from initial collection, studying, reporting and ultimately delivering the results, and ensure we are standardising our data as required by regulatory agencies.”
Beyond GSK’s work improving clinical reporting workflows, there is AstraZeneca’s work with reaction and synthesis prediction in the drug discovery process.
Associate principal scientist Dr Christos Kannas says his team has a nine million node graph with 33 million relationships and growing, which is identifying regions in the chemical space where new reaction networks form.
Kannas explains that graphs help in drug discovery because, by default, chemical reactions create networks. When you have a reaction, the product of one reaction can enable other reactions, which is by default a kind of graph structure anyway. A data scientist can use path queries between two molecules and understand how the reactions are connected together. The information the scientist can glean from linked molecules helps train new lead prediction algorithms.
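A path query of the kind described might look like the following sketch, again using networkx rather than AstraZeneca’s actual system, with invented molecule names standing in for real chemical structures.

```python
# Illustrative sketch (not AstraZeneca's actual model): a reaction
# network where molecules are nodes and a directed edge means
# "can be converted into" via some reaction.
import networkx as nx

reactions = nx.DiGraph()
reactions.add_edges_from([
    ("A", "B"),  # molecule A reacts to give B
    ("B", "C"),
    ("A", "D"),
    ("D", "C"),
    ("C", "E"),
])

# A path query: how can molecule A be turned into molecule E?
route = nx.shortest_path(reactions, "A", "E")
print(" -> ".join(route))  # one possible synthesis route
```

Each hop in the returned path corresponds to a reaction step, which is exactly the kind of linked-molecule information that can feed lead prediction algorithms.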
Reduce the Impacts of Restandardisation of Data
AstraZeneca’s application of graph technology is being supplemented with data visualisation tools so that scientists can recognise important molecules and reactions they want to investigate.
Kannas says: “We use weakly connected components to identify general regions of the graph, and then we go down to strongly connected components to identify the core of the graph. By having this analysis, we can identify islands in the chemical space and we can explore what type of reactions they enable and if they are useful.”
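The two component analyses mentioned here are standard graph algorithms. A minimal sketch on a toy directed reaction graph (with invented molecule names, using networkx) shows the distinction:

```python
# Toy directed reaction graph; names are invented for illustration.
import networkx as nx

g = nx.DiGraph([
    ("m1", "m2"), ("m2", "m3"), ("m3", "m1"),  # a reaction cycle: a "core"
    ("m3", "m4"),                               # a downstream product
    ("m5", "m6"),                               # a separate "island"
])

# Weakly connected components: general regions ("islands") of the graph,
# ignoring edge direction
islands = [sorted(c) for c in nx.weakly_connected_components(g)]

# Strongly connected components: tightly coupled cores where every
# molecule is reachable from every other via directed reactions
cores = [sorted(c) for c in nx.strongly_connected_components(g) if len(c) > 1]

print(islands)  # two islands: {m1..m4} and {m5, m6}
print(cores)    # the m1-m2-m3 cycle forms the only core
```

Weak connectivity partitions the chemical space into regions, while strong connectivity exposes the mutually reachable cycles at each region’s core.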
GSK finds graph techniques and tools highly useful in overcoming the traditional problems of labour intensity and multiple handoffs.
Putter says, “We aim to disrupt our data silos and redesign how we standardise and harmonise our clinical trial data to reduce the impacts of restandardisation of our data during the life cycle of an asset.”
Applying the technology helps reduce the manual effort required to validate analyses to ‘nearly zero’ and ensures compliance with GDPR around informed consent, so that a patient’s data immediately disappears from all downstream renderings of the data.
Other ambitions in the use of knowledge graphs at the firm include performing risk-based monitoring proactively instead of reactively; a Google-like question-and-answer system that allows users to quickly get answers from their clinical trial data; and applying powerful AI algorithms developed for preclinical data sets to patient-level clinical data. The chosen tool for managing the data set is a clinical knowledge graph that offers a patient-centric data model, integrating all domain silos and allowing everyone to understand its clinical data.
Knowledge Graphs’ R&D Potential
GSK is definitely on the way to achieving this, says the team. The graph database does not yet represent a full industrial process, but the early results are consistently strong. The team also agrees that graph technology was a natural fit for this problem space.
“It is very hard to define all the schema requirements up front and we believe that graph technology would provide us with the flexibility of easily extending schema when some new data types arrive and need to be processed,” says Kuznetzov.
What’s especially useful about knowledge graphs is that they aren’t dependent on data sources having been prepared or formatted to a particular data schema. They can work with the native data structure, and queries can be performed by asking meaningful questions.
Queries can be performed at hyper speed, too: typically 3,000 times faster than an SQL database query, and across dense networks of knowledge. That could mean pinpointing the best clinician to target for a clinical trial to be successful, based not only on their area of expertise but also their current capacity, whether they have access to the right equipment and whether they may be working with a competitor.
Overcoming Gaps in Clinical Data
Knowledge graphs can be especially useful in the context of clinical trials, particularly for rare conditions where small patient populations can make it difficult to achieve statistical significance. As some of the growing body of work in diabetes research shows, knowledge graphs can help in phenotype mapping, where researchers are trying to understand the relationship between different phenotypes (observable characteristics or traits) in humans and animals. This can be particularly challenging when the clinical parameters and observations used to measure these phenotypes are not directly comparable between species.
As the clinical opportunity for pharma grows ever more rewarding, yet more demanding and complex in scope, knowledge graphs are potentially transformative. Understanding the value of relationships between data is every bit as important as understanding what those individual data points tell us in their own right. Without the ability to mine those correlations for new insights, companies will lack vital context and find themselves compromised in their ability to make accurate advanced predictions.
There can be no doubt that the future of life sciences is data-driven. Pharma now needs to look to a bolder use of data to determine its product roadmaps more strategically and navigate regulatory approvals swiftly. Undoubtedly, knowledge graphs are already helping R&D teams genuinely ‘move the needle’ in pharma discovery and reporting.
Dr Alexander Jarasch
is the technical consultant for Pharma and Life Sciences at native graph database leader Neo4j. He was previously head of Data Management and Knowledge Management at the German Center for Diabetes Research (DZD). He is a visionary speaker on the future of clinical investigation, plus AI and data management in the pharma and healthcare space, in particular the potential to advance pharmaceutical analytics and unlock terabytes of hard-to-parse research/trial data by revealing data relationships for better predictive accuracy.