Digital Health
Calculating the true value of scientific data storage platforms is the goal of every developer. With millions of cells to analyse, digital solutions hold the key
Marilyn Matz and Zach Pitluk at Paradigm4
Assessing the potential ROI of significant capital and enterprise software investments with data-driven purchase decisions is an established practice in every enterprise. However, for cutting-edge technology – such as single-cell analytics – the innovative and dynamically evolving nature of the technology and its applications makes it harder to formulate comparative assessment and selection criteria. This has been an ongoing problem over the last few years: as more data become available, they also become more complex, and simply storing the data at hand is a major challenge. Infrastructure, cost, and scale are all problems that scientists have faced when handling large-scale datasets. A 2018 Inc. article stated: “A problem that’s emerging is that our ability to produce data is outstripping our ability to store it” (1). Since then, technology has continued to evolve, and having a reliable and effective storage platform has only become more crucial.
While more data are available to compare vendors and assess ROI for established analytical instruments – high-performance liquid chromatography–mass spectrometry systems, for example – users are also reporting how data model choice can have a dramatic impact on usability and costs (2,3). If pharmaceutical companies can achieve higher ROI and reduced costs with a new data storage model, the knock-on effect could accelerate drug discovery and development efforts.
Today, healthcare is set on the path towards precision medicine, and the life science research and pharma R&D communities are increasingly focused on the power of multimodal single-cell RNA sequencing (scRNA-seq) data in their quest for insight into disease aetiology, and actionable results that can inform better drug development.
Single-cell omics is accelerating the potential of precision medicine to target the right cell, in the right patient, with the right therapy as diseases evolve. With these new data, researchers and clinicians can, for example, better explore the transition from ‘healthy’ to ‘disease’ states, study potential biomarkers, understand the mechanics of disease pathways and more accurately assess response to drug targets.
However, it is important to realise that current single-cell datasets have been generated from a small number of individuals, whereas statistical significance relies on the number of patients studied, rather than the number of cells, because cells from the same patient are ‘siblings’ and not true biological replicates. The immune cell survey in the Human Cell Atlas Consortium, for example, currently contains 780,000 cells – but these are from just 16 individuals. To illustrate the numbers we will soon be facing, consider that the International Hundred Thousand Plus Cohort Consortium is bringing together more than 100 cohorts, comprising more than 50 million individuals from 43 countries – an astonishing 1 trillion cells if 20,000 cells per individual are analysed.
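The arithmetic behind that projection is straightforward:

$$5\times10^{7}\ \text{individuals}\times 2\times10^{4}\ \text{cells per individual}=10^{12}\ \text{cells}$$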
Consequently, the bioinformatics platforms and data management systems needed to store, compute on, and query the very large and complex datasets being generated will become one of the most significant investments scientists make. The need for these key tools to be assessed in a systematic way, and for users to have a clear picture ahead of any purchase decision, is evident. However, in many cases, data on the capabilities, performance, and costs of analysing multimodal population-scale datasets, for example, seem not to be available in the public domain. Moreover, ‘free’ web portals afford access to limited amounts of data and rely on academic-quality software to support users. All of this makes reaching informed choices about critical bioinformatics investments harder.
Those at the leading edge understand that there is a significant challenge here. Running through the single-cell advances reported in 2021 is a clear central theme, summed up in a publication from a wide consortium of researchers in the US, which states that: “The simultaneous measurement of multiple modalities, known as multimodal analysis, represents an exciting frontier for single-cell genomics and necessitates new computational methods that can define cellular states based on multiple data types” (7).
This is emphasised by many other groups. For example, researchers at Carnegie Mellon University, Pittsburgh, US, recently published their work on integrative single-cell spatial modelling for inferring cell identity (8). They note that: “The development of computational methods that capture the unique properties of single-cell spatial transcriptome data that can unveil cell identities remains a challenge.”
Ultimately, an organisation must assess its data storage and management software to see whether its data are being utilised to their full advantage. As such, evaluating performance, productivity, and price is crucial to a meaningful assessment. A variety of questions can be asked to define and compare the output of one platform against another.
With respect to productivity, and in the context of single-cell analysis, we must think of a database as part of a contemporary, high-performance, computational data management system. Against this background, for the majority of users, everyday usability, the ease of asking questions, and the time-to-answer for both common exploratory queries and large-scale computations will most often be the decisive factors. Moreover, with open-source solutions, another challenge is that bugs can persist for weeks, significantly impacting day-to-day productivity – with the burden of maintaining such tools often falling on the user, rather than the developer.
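As an illustration of what ‘time-to-answer’ means for an everyday exploratory query, the Python sketch below times a metadata filter followed by a per-gene summary on a deliberately scaled-down, in-memory cell-by-gene matrix. The dimensions, column names, and cell types are illustrative assumptions, not any vendor’s API:

```python
import time
import numpy as np
import pandas as pd

# Deliberately scaled-down, illustrative dimensions: 20,000 cells x 1,000 genes.
rng = np.random.default_rng(0)
expression = rng.poisson(1.0, size=(20_000, 1_000)).astype(np.float32)
metadata = pd.DataFrame({
    "cell_type": rng.choice(["T cell", "B cell", "monocyte"], size=20_000),
    "tissue": rng.choice(["lung", "liver", "spleen"], size=20_000),
})

# A routine exploratory question: mean expression per gene
# for T cells from lung tissue.
start = time.perf_counter()
mask = ((metadata["cell_type"] == "T cell") &
        (metadata["tissue"] == "lung")).to_numpy()
mean_expression = expression[mask].mean(axis=0)
elapsed = time.perf_counter() - start
print(f"answered over {mask.sum()} cells in {elapsed:.4f} s")
```

At population scale, the same question spans billions of cells held in distributed storage, and it is the platform – not the scientist’s laptop – that must push the filter and aggregation down to the data to keep the time-to-answer in seconds.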
In the context of using single-cell data to test hypotheses about disease formation or response to therapy, simple assessments – such as the ease of computing correlations across large datasets in seconds, or worked examples of analysis and query across more than 10 different tissue samples – can provide a realistic comparison between products and approaches. For example, some platforms require the user to take on much of the setup needed to run distributed computation, rather than automating and hiding it. Such assessments also indicate how straightforward it is to compute at a biologically meaningful scale on a day-to-day basis.
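To make that concrete, here is a minimal sketch – hypothetical, and not tied to any particular product – of a gene-versus-all-genes Pearson correlation computed chunk by chunk, the kind of streaming computation a platform would ideally automate and hide when a matrix is too large for memory:

```python
import numpy as np

def correlate_gene_chunked(get_chunk, n_chunks, target_gene):
    """Pearson correlation of one gene against all genes, streamed chunk by chunk.

    get_chunk(i) returns a (cells_in_chunk x n_genes) array; a stand-in for
    reading one tissue sample or one partition from storage.
    """
    # Accumulate sufficient statistics instead of holding all cells at once.
    n = 0
    sum_x = sum_x2 = 0.0
    sum_g = sum_g2 = sum_gx = None
    for i in range(n_chunks):
        chunk = get_chunk(i)
        x = chunk[:, target_gene]
        n += chunk.shape[0]
        sum_x += x.sum()
        sum_x2 += (x * x).sum()
        if sum_g is None:
            sum_g = chunk.sum(axis=0)
            sum_g2 = (chunk * chunk).sum(axis=0)
            sum_gx = (chunk * x[:, None]).sum(axis=0)
        else:
            sum_g += chunk.sum(axis=0)
            sum_g2 += (chunk * chunk).sum(axis=0)
            sum_gx += (chunk * x[:, None]).sum(axis=0)
    # r = cov(x, g) / sqrt(var(x) * var(g)), from the accumulated moments.
    cov = sum_gx / n - (sum_g / n) * (sum_x / n)
    var_x = sum_x2 / n - (sum_x / n) ** 2
    var_g = sum_g2 / n - (sum_g / n) ** 2
    return cov / np.sqrt(var_x * var_g)

# Ten chunks standing in for ten tissue samples (illustrative random data).
rng = np.random.default_rng(1)
chunks = [rng.normal(size=(5_000, 200)) for _ in range(10)]
r = correlate_gene_chunked(lambda i: chunks[i], n_chunks=10, target_gene=0)
print(r[:5])  # r[0] is the gene correlated with itself, so ~1.0
```

The design point is the division of labour: accumulating sufficient statistics across chunks is exactly the bookkeeping a well-designed platform performs behind the scenes, so the scientist only has to pose the question.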
Performance is another aspect that must be fully considered when assessing database software. Key characteristics include how well the platform scales to – and across – multiple, heterogeneous, large datasets, and how accessible it is to navigate for a user, rather than only a developer.
Raw storage and computing costs figure significantly in ROI assessment and budgeting, and platforms that add a surcharge on storage and computing often serve up huge, unanticipated cost surprises. Whether value for money is being gained to its full potential can be tested with even a simple cost model, as sketched below.
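As a purely illustrative example – every price and multiplier here is an assumption, not a quoted figure – the sketch below compares raw cloud costs with the same workload under a platform surcharge:

```python
# Hypothetical cost model: all prices and multipliers are illustrative assumptions.
RAW_STORAGE_PER_TB_MONTH = 25.0  # assumed cloud object-storage price, USD
RAW_COMPUTE_PER_HOUR = 3.0       # assumed price of one analysis node, USD

def annual_cost(tb_stored, compute_hours, storage_markup=1.0, compute_markup=1.0):
    """Annual cost of a workload, with optional platform surcharges as multipliers."""
    storage = tb_stored * RAW_STORAGE_PER_TB_MONTH * 12 * storage_markup
    compute = compute_hours * RAW_COMPUTE_PER_HOUR * compute_markup
    return storage + compute

# 200 TB of single-cell data and 5,000 node-hours of analysis per year.
baseline = annual_cost(200, 5_000)
with_surcharge = annual_cost(200, 5_000, storage_markup=3.0, compute_markup=2.0)
print(f"raw cloud: ${baseline:,.0f}/yr; with surcharges: ${with_surcharge:,.0f}/yr")
```

Even modest markups compound at scale; asking a vendor to populate such a model with real figures is a quick way to surface the surprises before purchase.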
The reality is that storing, computing on, interrogating, and validating research hypotheses across such heterogeneous datasets requires a new generation of software tools capable of working flexibly and efficiently with diverse data types from billions of cells. Multimodal single-cell data will be orders of magnitude larger than genomic data, and the associated data and analytical challenges will continue to increase. Many current ideas about database structure, as well as much of the existing computational toolbox, will simply be unable to cope with what is required of them.
Comparing the various software solutions for interrogating and analysing datasets – database options, analytical engines, tools, and approaches – must become more objective and more standardised.
If scientists and researchers can handle and process the ever more complex data at their disposal, drug discovery and development capabilities can be given a new lease of life. And if the power of data storage is fully utilised, precision medicine can be accelerated to new heights.
References
Marilyn Matz is CEO and Co-Founder, along with Turing laureate Michael Stonebraker, of Paradigm4. Prior to Paradigm4, after completing an MSc degree at the MIT Computer Science & Artificial Intelligence Laboratory, US, she was one of three co-founders of Cognex Corporation, now a publicly traded, global industrial machine vision company, where she was Senior Vice President and Business Unit Manager of its Vision Software Products Group. Marilyn was the recipient of the sixth annual Women Entrepreneurs in Science and Technology Leadership Award, a co-recipient of the SEMI industry award for outstanding technical contributions to the semiconductor industry, and was named to the 2020 NACD Directorship 100. She also serves on the Board of Directors of Teradyne, a leading supplier of automation equipment for test and industrial applications.
Dr Zachary Pitluk is Vice President of Life Sciences and Healthcare at Paradigm4. He has worked in sales and marketing for 23 years, from being a pharma representative for Bristol Myers Squibb to management roles in life science technology companies. Since 2003, his positions have included Vice President of Business Development at Gene Network Sciences and Chief Commercial Officer at Proveris Scientific. Zach has held academic positions at Yale University Department of Molecular Biophysics and Biochemistry: Associate Research Scientist, Postdoctoral Fellow, and Graduate Student, and has been named as co-inventor on numerous patents.