Have we found a way to help non-coders understand complex biomedical data structure?

A natural language understanding (NLU) interface to artificial intelligence (AI) is now at everyone’s fingertips – even if ChatGPT, one of the most successful introductions of technology of all time, isn’t the final version of AI every user has easy access to. The problem is how to use it. That, in turn, breaks down into two underlying issues: should we trust the answers these systems offer given the errors to which they are prone, and what is the best way to ask the questions to get the answers we want? Allied to this, the other main challenge in working with generative AI is establishing what value we get from integrating our data sets with it.

In life sciences and pharma, the danger is obvious. Within the field of drug discovery, it would be ideal to find out what a company’s competitors are doing to address the problem currently being worked on – and if ideas are put online, then they become available to everyone else through internet ingesting. Therefore, using ChatGPT-like tools for any kind of serious data gathering should be approached with caution. Nonetheless, generative AI presents a highly useful tool for general-purpose domain acquisition at the start of a project.

Graph-based knowledge graphs of specific, targeted pieces of biomedical domain can be used for a lot of valuable data analysis work. By structuring the data in this way, it becomes possible to quickly query and analyse relationships between modules or sub-graphs of the graph of genes in the virus, but also connections between genes and proteins. In these cases, the ability to use a natural language interface instead of writing code would mean the research team could more easily navigate complex data structures and identify specific patterns or relationships of interest.

In an ideal scenario, once a model has been trained and tailored for an area of research, it will require some specific information. For instance, if there is a focus on female patients aged between 60 and 69, it is expected that the system will comprehend this requirement automatically. If ‘patient’ serves as a node label and ‘age’ and ‘gender’ are properties within the ‘patient’ node, it is essentially a database-like query. Graph technology – as the underlying knowledge representation model – enhances this process, layering a natural language interface on ChatGPT that will transform questions or statements into Cypher code directly. This enables the graph database to perform a ‘female patients aged between 60 and 69’ query and retrieve results within milliseconds.

Using a natural language interface approach to translate questions or statements into Cypher code to then query the in-house data store is infinitely superior for research purposes, rather than a vague or imprecise ChatGPT answer. Moreover, this NLU approach can make it much more accessible for non-computer scientists to work in a database and ask meaningful questions. Typically, researchers do not ordinarily want to interact with the database querying layer, and do not feel comfortable if you use technical terms like ‘Cypher’ or ‘Python’. With this system, it is possible to use the patient data to ask if any patients in the sample have type 2 diabetes or have a comorbidity such as Alzheimer’s or cancer. This type of query is precisely what pharmaceutical companies aim to achieve, as they seek to ask meaningful questions and obtain useful answers from their clinical trials. For example, they may want to know whether all type 2 diabetic patients in a trial have a secondary clinical condition, such as liver cancer. The use of natural language interfaces/conversational queries that get translated into Cypher-type programmes that then interrogate a knowledge graph based on a bounded, non-public LLM can be a very powerful tool here.

Dr Alexander Jarasch is technical consultant for Pharma and Life Sciences at graph database and analytics leader Neo4j. He was formerly head of Data and Knowledge Management for the German Centre for Diabetes Research (Deutsches Zentrum für Diabetesforschung, DZD) where he helped use graph data science to meet its target of developing novel strategies for successful, personalised detection, prevention and treatment of diabetes and its complications via an innovative, integrative research approach.