Retrieval-Augmented Generation

Making sense of your scientific archive: how RAG helps biotechs do more with less

How is retrieval-augmented generation aiding smaller biotechs in collating and utilising disparate data systems?

Lisa Stehno-Bittel and Cole Bittel at Likarda

Biotechnology companies accumulate years or even decades of lab notes, reports, meeting minutes, regulatory submissions and research documentation. Over time, this wealth of knowledge becomes fragmented across different formats and storage locations, making it increasingly difficult for scientists to access and cross-reference critical information. Researchers often face the ‘needle in a haystack’ problem. Sifting through vast archives to find a specific insight or data point can be time-consuming and frustrating. Important contextual knowledge may be buried in an old lab notebook or a prior project report, and valuable know-how is lost when experienced employees leave if there is no effective knowledge retention programme in place. In short, biotechs struggle with knowledge silos – the technical complexity of their work and sheer volume of documents make it challenging to get the right information in a timely manner.


The opportunity created by modern AI

Over the past few years, advances in foundation models such as OpenAI’s GPT-5 have radically expanded what is feasible for small organisations adopting artificial intelligence (AI). A foundation model is a large, general-purpose system trained on vast and heterogeneous information spanning scientific literature, technical manuals, spreadsheets, code and everyday language. Because of this broad training, a single model can perform many tasks, such as answering questions, summarising diverse data sets, reasoning and predicting, simply through prompting rather than custom training.


For small biotechs, the practical significance is that these capabilities are now available through simple application programming interfaces (APIs), without the need for internal machine learning (ML) teams or massive training runs. By combining a modern foundation model with a retrieval-augmented generation (RAG) pipeline, companies can direct the model’s broad linguistic and reasoning ability to their own experimental reports, meeting notes and regulatory documents. In doing so, they transform a general-purpose model into a research assistant capable of synthesising years of internal knowledge. Foundation models like Anthropic’s Claude continue to improve through larger context windows, better optimisation and richer pre-training data, allowing organisations to upgrade their systems simply by switching to newer models while keeping the same RAG pipeline. This creates a rare opportunity: even small biotechs can now deploy powerful, continuously improving AI systems that meaningfully amplify scientific productivity at modest cost.
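To make that upgrade path concrete, a minimal sketch follows. It uses the OpenAI Python client as one example provider; the model name is illustrative, and the wrapper is an assumption about how a team might structure its code rather than a prescribed design.

```python
# Sketch: keep the model behind the pipeline as one configuration value,
# so upgrading to a newer model is a one-line change while the retrieval
# code stays untouched. The model name is illustrative.
from openai import OpenAI

MODEL = "gpt-4o"   # swap this identifier when a newer model ships

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate(prompt: str) -> str:
    """Send a prompt to the configured hosted model and return its reply."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Because every downstream component calls generate() rather than a specific model, swapping providers or versions touches a single line.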


The framework

RAG is an AI framework that combines a text retrieval system with the reasoning and generative abilities of a large language model (LLM). In a RAG setup, the language model is not expected to rely on its built-in training memory alone when answering questions. Instead, it actively pulls in relevant information from an external document repository (such as the company’s own data) and uses that as context to produce its answer. In practical terms, RAG gives an LLM access to a company’s internal knowledge base or document store, allowing the model’s outputs to be grounded in facts from those documents. This approach differs from a standard ‘closed book’ LLM that generates answers purely from its pre-trained knowledge.

By augmenting prompts with retrieved data, RAG can produce responses that are more accurate, contextually appropriate and up to date. Importantly, RAG does not require changing the underlying language model’s parameters at all.
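As a minimal sketch of that retrieve-then-generate loop, the Python below embeds a handful of documents, finds those most similar to a question and feeds them to the model as context. The example documents, the embedding model choice and the call_llm placeholder (which would wrap a hosted model API, like the generate() helper sketched earlier) are illustrative assumptions, not a prescribed implementation.

```python
# Minimal retrieve-then-generate sketch. Documents and model names are
# illustrative; a production system would use a vector database instead
# of an in-memory array.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "2008 report: compound X reduced tumour growth in mouse model Y.",
    "2015 protocol: cell cultures maintained at 37C in medium Z.",
    "2023 minutes: team agreed to revisit compound X dosing next quarter.",
]
doc_vectors = encoder.encode(documents, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the question."""
    q = encoder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ q  # cosine similarity: vectors are unit length
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder: wrap your provider's chat API

def answer(question: str) -> str:
    """Ground the model's answer in retrieved context, not trained memory."""
    context = "\n\n".join(retrieve(question))
    return call_llm(
        f"Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```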


Advantages for biotech

Organisations in life sciences commonly sit on huge archives including experimental protocols, assay results, project update slides, regulatory correspondence, literature analyses and more. Much of this information is unstructured (free-form text in PDFs, Word documents and emails) and not easily queryable with traditional keyword searches. RAG provides a powerful solution by allowing these legacy documents to be queried conversationally. A scientist could ask, for example, “Have we ever tested compound X in a mouse model of disease Y, and what were the results?” Instead of hunting through folders or relying on the memory of veteran team members, the RAG system can retrieve the relevant experiment reports and summarise the findings. Importantly, this requires no prior reorganisation of the documents because the RAG pipeline works on scans or text as-is, after an initial indexing step. In practical terms, this means a small biotech can achieve an interactive knowledge base that draws on all its past research, yielding insights that span projects and time. This capability is especially valuable in science, where connecting observations from different years or teams can lead to new hypotheses. By augmenting LLMs with company-specific data, RAG turns static document collections into a dynamic, ask-me-anything resource for scientists.
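That initial indexing step can be lightweight. The sketch below uses the pypdf library; the archive path and chunk sizes are illustrative assumptions, and the resulting records would feed the embedding step shown earlier.

```python
# Sketch of a one-time indexing pass: extract text from archived PDFs
# as-is, split it into overlapping chunks and keep source metadata so
# answers can cite the originating file. Path and sizes are illustrative.
from pathlib import Path
from pypdf import PdfReader

def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping windows so context survives chunk edges."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

index = []  # in practice these records feed a vector store
for pdf in Path("archive").glob("**/*.pdf"):
    text = "\n".join(page.extract_text() or "" for page in PdfReader(pdf).pages)
    for n, piece in enumerate(chunk(text)):
        index.append({"source": str(pdf), "chunk": n, "text": piece})
```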


Dependence on clean, useful documents

While RAG is powerful, its effectiveness is fundamentally limited by the quality of the documents it retrieves. The old computer science adage ‘garbage in, garbage out’ squarely applies. If the underlying reports and files are unclear, poorly organised or contain factual errors, a RAG system will faithfully surface those problems in its answers. Unlike a human expert who might integrate information and resolve contradictions, the LLM will treat retrieved text as authoritative context. Thus, the scope and clarity of what’s in your document repository place a hard ceiling on what RAG can achieve. Ensuring that the source materials are comprehensive, consistent and correct is a quiet but crucial determinant of success for any RAG deployment.

“ With a solid corpus, the RAG system will return highly relevant, precise context for the LLM, enabling it to give accurate and deeply grounded answers ”

Given these challenges, companies should start any RAG initiative with an honest evaluation of their data readiness. It is often valuable to involve advisors or partners who have experience with data cleaning and curation for AI. A trustworthy expert will not simply rush to deploy RAG on whatever is there; instead, they will help answer questions like: are your documents mostly digital and text-extractable? Do they contain the information you expect an AI to deliver? Are there major gaps or ambiguities? In some cases, the right first step may be corpus cleanup rather than AI integration. This could mean standardising terminology (creating a glossary of synonyms for the AI to recognise), merging duplicate records, resolving conflicting versions of documents or enriching files with metadata (such as adding labels for document type, project and date). An external consultant who is technically honest will tell you if your data is too sparse or too chaotic to yield good results with RAG in its current state. It’s better to pause and improve the knowledge base than to deploy an AI that answers with ‘I don’t know’ or, worse, confidently gives wrong answers based on faulty data.
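As one concrete example of such cleanup, a glossary of synonyms can be applied before documents are indexed, so the retriever sees consistent terminology. The sketch below is a toy illustration; the glossary entries are invented.

```python
# Toy sketch: normalise known abbreviations to canonical terms before
# indexing. The glossary entries are invented examples; a real glossary
# would be curated with the scientists who wrote the documents.
import re

GLOSSARY = {
    r"\bIHC\b": "immunohistochemistry",
    r"\bcmpd\b": "compound",
    r"\bMTD\b": "maximum tolerated dose",
}

def normalise(text: str) -> str:
    """Replace abbreviations with canonical terms, case-insensitively."""
    for pattern, canonical in GLOSSARY.items():
        text = re.sub(pattern, canonical, text, flags=re.IGNORECASE)
    return text

print(normalise("Cmpd 12 was positive by IHC at the MTD"))
# -> "compound 12 was positive by immunohistochemistry at the maximum tolerated dose"
```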


Integrating RAG into the biotech workforce

When a biotech’s documentation is clean, well structured and comprehensive, the payoff from RAG can be immense. With a solid corpus, the RAG system will return highly relevant, precise context for the LLM, enabling it to give accurate and deeply grounded answers. Users can start to get answers that span many documents and time periods, something that would be nearly impossible with manual search. For example, a RAG system might pull together a finding from a 2008 experiment report, a protocol from 2015 and a note from the minutes of a 2023 meeting – and synthesise an answer that links all three. This cross-document insight is where RAG becomes a true force multiplier for research.

The surge of interest in generative AI has led to a crowded marketplace of consultants, vendors and products, all claiming to have the solution for your AI needs. Small biotechs may find themselves approached by firms offering to ‘transform’ their operations with AI, or touting off-the-shelf solutions that supposedly work for any industry. It’s important to approach such claims with healthy scepticism. Many solution providers are rushing into the space, and not all of them appreciate the unique challenges of scientific and regulatory environments. AI systems are not one-off IT installations; they require tuning, evaluation and improvement over time. This is why having a consistent long-term partner can be highly valuable. A partner who is familiar with your objectives will be able to guide you incrementally – perhaps starting with a pilot on a subset of data, and then scaling up. For both small and medium biotech firms, RAG should be viewed as the default approach for leveraging internal textual data. It is currently the most cost-effective, scalable and immediately beneficial way to make years of written research accessible.



Lisa Stehno-Bittel, PhD, licensed her lab’s patents from the University of Kansas Medical Center, Kansas, US, and founded Likarda, where she serves as president. The patents had applications in drug screening for efficacy and toxicity, as well as the ability to transform cell-based therapeutics. Stehno-Bittel is a co-inventor on 33 global patents (issued and pending) and has received numerous awards, including the Jim Baxendale Commercialization Award and the Marjorie S. Sirridge, M.D., Excellence in Medicine and Science Award. She is a fellow of the American Institute for Medical and Biological Engineering, and in 2025 was named one of the ‘Top 50 Women Leaders of Missouri’.



Cole Bittel is an AI and MLOps software engineer at Embedded Engineering ApS. For the last two years, his career has centred on ML infrastructure and modern AI delivery workflows. His work spans cloud-agnostic environments and MLOps systems that support scalable, secure deployment of AI applications. At Likarda, Cole serves as the architect of RAG-based LLM systems built on private enterprise data. His consultancy provides specialised support for ML application delivery and virtual infrastructure. Across roles – from leading platform engineering at the LEGO Group to AI reliability teams in scale-up AI companies like Legora – he has consistently built automated, resilient systems and mentored teams adopting cloud, DevOps and AI-adjacent technologies.
