Have we found a way to help non-coders understand complex biomedical data structure?
An exploration of the thoughts on the significance of generative AI within life sciences and pharma
Alexander Jarasch at Neo4j
A natural language understanding (NLU) interface to artificial intelligence (AI) is now at everyone’s fingertips – even if ChatGPT, one of the most successful introductions of technology of all time, isn’t the final version of AI every user has easy access to. The problem is how to use it. That, in turn, breaks down into two underlying issues: should we trust the answers these systems offer given the errors to which they are prone, and what is the best way to ask the questions to get the answers we want? Allied to this, the other main challenge in working with generative AI is establishing what value we get from integrating our data sets with it.
Generative AI works as a way to present the answers it derives from a large language model (LLM). With programmes like ChatGPT, the underlying training data is essentially the public internet. Anything that has been typed out online as an answer, or as an example of code, has been ingested into its LLM, and a subset of that is what comes back when a question is entered into the system.
ChatGPT lacks the accuracy needed when dealing with unstructured data
In life sciences and pharma, the danger is obvious. Within the field of drug discovery, it would be ideal to find out what a company’s competitors are doing to address the problem currently being worked on – and if ideas are put online, then they become available to everyone else once they have been ingested from the internet. Therefore, using ChatGPT-like tools for any kind of serious data gathering should be approached with caution. Nonetheless, generative AI is a highly useful tool for acquiring general-purpose domain knowledge at the start of a project.
It is undoubtedly more practical to keep the real data exploration in-house. By creating your own LLM based on your team’s results or field trials (or both) and supplementing it with publicly available scientific or medical database information, you can develop a robust, proprietary LLM. ChatGPT could be utilised to analyse this private LLM, ensuring that your organisation’s intellectual property (IP) is safeguarded.
However, ChatGPT may not provide the depth of answers or the level of accuracy required just on that unstructured heap of data. While it is possible to integrate PubMed’s 30 million peer-reviewed articles, the quality of the data is not guaranteed. Even in peer-reviewed articles, there may be errors or differences in scientific interpretation, leading to inaccurate information. As a result, erroneous data will be included in the final model.
To create a knowledge base that is useful for research purposes, it is essential to structure and organise the data in a manner that allows patterns and connections to be identified easily for ChatGPT to use. One approach that is becoming increasingly popular in this sector is the use of a knowledge graph, which can transform a private LLM into a powerful research engine.
The power graphs can bring to medical research
Knowledge graphs of specific, targeted parts of the biomedical domain can be used for a lot of valuable data analysis work. By structuring the data in this way, it becomes possible to quickly query and analyse relationships between modules or sub-graphs, for example among the genes in a virus, as well as connections between genes and proteins. In these cases, the ability to use a natural language interface instead of writing code would mean the research team could more easily navigate complex data structures and identify specific patterns or relationships of interest.
In an ideal scenario, once a model has been trained and tailored for an area of research, it will require some specific information. For instance, if there is a focus on female patients aged between 60 and 69, it is expected that the system will comprehend this requirement automatically. If ‘patient’ serves as a node label and ‘age’ and ‘gender’ are properties within the ‘patient’ node, it is essentially a database-like query. Graph technology – as the underlying knowledge representation model – enhances this process, layering a natural language interface on ChatGPT that will transform questions or statements into Cypher code directly. This enables the graph database to perform a ‘female patients aged between 60 and 69’ query and retrieve results within milliseconds.
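To make the ‘female patients aged between 60 and 69’ example concrete, here is a minimal sketch of the Cypher such an interface might produce. It assumes the node label is spelled `Patient` and that the `age` and `gender` properties hold a number and a lower-case string; the `PatientFilter` class and `build_patient_query` function are hypothetical names used only for illustration, not part of any Neo4j library.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class PatientFilter:
    """Structured form of the question (hypothetical helper type)."""
    gender: str
    min_age: int
    max_age: int


def build_patient_query(f: PatientFilter) -> Tuple[str, Dict[str, Any]]:
    """Return a parameterised Cypher query and its parameters.

    Assumes a 'Patient' node label with 'age' and 'gender' properties.
    Values travel as parameters rather than being pasted into the query text.
    """
    query = (
        "MATCH (p:Patient) "
        "WHERE p.gender = $gender AND p.age >= $min_age AND p.age <= $max_age "
        "RETURN p"
    )
    params = {"gender": f.gender, "min_age": f.min_age, "max_age": f.max_age}
    return query, params


query, params = build_patient_query(PatientFilter("female", 60, 69))
print(query)
print(params)
```

The query text and the parameter dictionary could then be handed to a graph database driver. Keeping the values as parameters means the language model only has to fill in a small structured filter, which is easier to check than free-form generated code.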
Getting useful answers to hard questions
Using a natural language interface to translate questions or statements into Cypher code, which then queries the in-house data store, is far better suited to research purposes than a vague or imprecise ChatGPT answer. Moreover, this NLU approach can make it much more accessible for non-computer scientists to work in a database and ask meaningful questions. Researchers do not ordinarily want to interact with the database querying layer, and do not feel comfortable with technical terms like ‘Cypher’ or ‘Python’. With this system, it is possible to use the patient data to ask if any patients in the sample have type 2 diabetes or have a comorbidity such as Alzheimer’s or cancer. This type of query is precisely what pharmaceutical companies aim to achieve, as they seek to ask meaningful questions and obtain useful answers from their clinical trials. For example, they may want to know whether all type 2 diabetic patients in a trial have a secondary clinical condition, such as liver cancer. Natural language or conversational queries that are translated into Cypher, and that then interrogate a knowledge graph based on a bounded, non-public LLM, can be a very powerful tool here.
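The comorbidity question can be sketched in the same way. The graph schema below is an assumption made purely for illustration: `Patient` nodes linked to `Condition` nodes through a `HAS_DIAGNOSIS` relationship, with an `id` property on patients and a `name` property on conditions. The function name `build_comorbidity_query` is likewise hypothetical.

```python
from typing import Any, Dict, List, Tuple


def build_comorbidity_query(
    primary: str, comorbidities: List[str]
) -> Tuple[str, Dict[str, Any]]:
    """Cypher for patients with a primary condition plus any listed comorbidity.

    Assumed schema (not from the article):
    (:Patient {id})-[:HAS_DIAGNOSIS]->(:Condition {name})
    """
    query = (
        "MATCH (p:Patient)-[:HAS_DIAGNOSIS]->(:Condition {name: $primary}) "
        "MATCH (p)-[:HAS_DIAGNOSIS]->(c:Condition) "
        "WHERE c.name IN $comorbidities "
        "RETURN p.id AS patient, collect(c.name) AS comorbidities"
    )
    params = {"primary": primary, "comorbidities": list(comorbidities)}
    return query, params


query, params = build_comorbidity_query(
    "type 2 diabetes", ["Alzheimer's disease", "cancer"]
)
print(query)
print(params)
```

A researcher would only ever type the question; the translation layer would be responsible for producing something like the query above and returning the matching patients.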
To sum up, using ChatGPT-like systems in the sense of translating natural language and human-level expressions into data queries can really lower the barriers to getting answers from complex data systems. Generative AI-based NLU in conjunction with a knowledge graph is the next step in using data in medical and bioscience research.
Dr Alexander Jarasch is technical consultant for Pharma and Life Sciences at graph database and analytics leader Neo4j. He was formerly head of Data and Knowledge Management for the German Centre for Diabetes Research (Deutsches Zentrum für Diabetesforschung, DZD) where he helped use graph data science to meet its target of developing novel strategies for successful, personalised detection, prevention and treatment of diabetes and its complications via an innovative, integrative research approach.
Innovations in Pharmaceutical Technology (IPT)
IPT provides a platform for cutting-edge ideas, concepts, and developments shaping the future of pharmaceutical R&D.