1 Trillion Cells, 50 Million Individuals

Making data-driven technology purchase decisions, based on an assessment of the potential ROI of significant capital and enterprise software investments, is an established practice in every enterprise. However, for cutting-edge technology such as single-cell analytics, the innovative and rapidly evolving nature of the technology and its applications is making it harder to formulate comparative assessment and selection criteria. This has been an ongoing problem over the last few years. As more data become available, they also become more complex, and simply storing them is a major challenge. Infrastructure, cost, and scale are all problems that scientists have faced when handling large-scale datasets. A 2018 Inc. article stated: “A problem that’s emerging is that our ability to produce data is outstripping our ability to store it (1).” Since then, a reliable and effective storage platform has become crucial for companies as the technology has continued to evolve.

However, it is important to realise that current single-cell datasets have been generated from a small number of individuals, whereas statistical significance depends on the number of patients studied rather than the number of cells. This is because cells from the same patient are ‘siblings’ and not true biological replicates. The immune cell survey in the Human Cell Atlas Consortium, for example, currently contains 780,000 cells, but these are from just 16 individuals. To illustrate the numbers we will soon be facing, consider that the International Hundred Thousand Plus Cohort Consortium is bringing together more than 100 cohorts, comprising more than 50 million individuals from 43 countries, which equates to an astonishing 1 trillion cells if 20,000 cells per individual are analysed.
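As a quick sanity check on the arithmetic above, the following Python sketch multiplies out the quoted figures. The variable names are illustrative only, and the 20,000 cells per individual is the sampling depth assumed in the text rather than a published consortium specification.

```python
# Back-of-envelope check of the figures quoted in the text.
# All variable names are illustrative; the inputs come from the article.

individuals = 50_000_000        # "more than 50 million individuals"
cells_per_individual = 20_000   # sampling depth assumed in the text

total_cells = individuals * cells_per_individual
print(f"Projected cells: {total_cells:,}")

# For comparison: the Human Cell Atlas immune cell survey cited above.
hca_cells = 780_000
hca_individuals = 16
hca_cells_per_individual = hca_cells / hca_individuals
print(f"HCA immune survey: {hca_cells_per_individual:,.0f} cells per individual")

# Statistical power tracks individuals, not cells, so the more telling
# ratio is how many more individuals the cohort consortium would cover.
print(f"Increase in individuals: {individuals / hca_individuals:,.0f}x")
```

The product comes to 10^12, i.e. the 1 trillion cells in the headline, while the existing immune survey averages just under 50,000 cells from each of its 16 individuals.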

Consequently, the bioinformatics platforms and data management systems needed to store, compute, and query the very large and complex datasets being generated will become one of the most significant investments scientists make. The need for these key tools to be assessed in a systematic way, and for users to have a clear picture ahead of any purchase decision, is evident. However, in many cases, data on capabilities, performance, and costs (for analysing multimodal population-scale datasets, for example) do not appear to be available in the public domain. Moreover, the ‘free’ web portals afford access to limited amounts of data and rely on academic-quality software to support users. This makes it harder to reach informed choices about critical bioinformatics investments.

Those at the leading edge understand that there is a significant challenge here. A clear central theme runs through the single-cell advances reported in 2021, summed up in a publication from a wide consortium of researchers in the US, which states: “The simultaneous measurement of multiple modalities, known as multimodal analysis, represents an exciting frontier for single-cell genomics and necessitates new computational methods that can define cellular states based on multiple data types (7).”

With respect to productivity, and in the context of single-cell analysis, we must think of a database as part of a contemporary, high-performance computational data management system. Against this background, for the majority of users, everyday usability, the ease of asking questions, and the time-to-answer for both common exploratory queries and large-scale computations will most often be the decisive factors. Moreover, with open-source solutions, bugs can often persist for weeks, significantly impacting day-to-day productivity, and the burden of maintaining such tools often falls on the user rather than the developer.

In the context of using single-cell data to test hypotheses of disease formation or response to therapy, simple assessments can provide a realistic comparison between products and approaches: for instance, how easy it is to run correlations across large datasets in seconds, or worked examples of analysis and querying across more than 10 different tissue samples. For example, some platforms require the user to take on much of the setup to run distributed computation, rather than automating and hiding it. Such data can also indicate how straightforward it is to do computations on a biologically meaningful scale on a day-to-day basis.
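One way to picture such a simple assessment is a timing harness like the Python sketch below. It is only an illustration under stated assumptions: the expression matrix is synthetic (Poisson counts, not real single-cell data), the function name time_gene_correlation and the matrix sizes are hypothetical, and it runs in memory on one machine with NumPy rather than on any particular vendor platform. A real comparison would run the same query against each candidate system at the scale of interest.

```python
import time

import numpy as np


def time_gene_correlation(n_cells=8_000, n_genes=200, seed=0):
    """Time a gene-by-gene Pearson correlation on a synthetic matrix.

    Returns the (n_genes x n_genes) correlation matrix and the elapsed
    wall-clock time in seconds. The data are random stand-ins, not real
    single-cell measurements.
    """
    rng = np.random.default_rng(seed)
    # Hypothetical cells-by-genes count matrix.
    counts = rng.poisson(lam=1.0, size=(n_cells, n_genes)).astype(np.float64)

    start = time.perf_counter()
    # rowvar=False: each column (gene) is treated as a variable.
    corr = np.corrcoef(counts, rowvar=False)
    elapsed = time.perf_counter() - start
    return corr, elapsed


if __name__ == "__main__":
    # Repeat at increasing scale to see how time-to-answer grows.
    for n_cells in (2_000, 8_000, 32_000):
        corr, elapsed = time_gene_correlation(n_cells=n_cells)
        print(f"{n_cells:>6,} cells x {corr.shape[0]} genes: {elapsed:.3f} s")
```

Recording the same measurement for each candidate product, on the same query and data volume, gives the kind of like-for-like comparison described above. The absolute timings from this sketch say nothing about any specific platform.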

The reality is that storing, computing on, and interrogating such heterogeneous datasets, and validating research hypotheses across them, requires a new generation of software tools capable of working flexibly and efficiently with diverse data types from billions of cells. Multimodal single-cell data will be orders of magnitude larger than genomic data. The associated data and analytical challenges will continue to increase. Many current ideas about database structure, as well as the existing computational toolbox, will simply be unable to cope with what is required of them.

References

1. Visit: www.inc.com/greg-satell/data-storage-is-becoming-a-massive-problem-this-startup-may-have-answer.html
2. Visit: www.msacl.org/presenter_slides/2018_US_deborah.french_82527_slides.pdf
3. Visit: www.nature.com/articles/s41598-020-65015-y
4. Alexander MJ et al, Breathing fresh air into respiratory research with single-cell RNA sequencing, Eur Respir Rev 29(156):200060, 2020
5. Plasschaert LW et al, A single-cell atlas of the airway epithelium reveals the CFTR-rich pulmonary ionocyte, Nature 560(7718): pp377-81, 2018
6. Visit: www.nature.com/articles/d41586-021-01994-w
7. Hao Y et al, Integrated analysis of multimodal single-cell data, Cell 184(13): pp3,573-87, 2021
8. Visit: www.biorxiv.org/content/biorxiv/early/2021/03/20/2020.11.29.383067.full.pdf

Marilyn Matz is CEO and Co-Founder, along with Turing laureate Michael Stonebraker, of Paradigm4. Prior to Paradigm4, after completing an MSc degree at the MIT Computer Science & Artificial Intelligence Laboratory, US, she was one of three co-founders of Cognex Corporation, now a publicly traded, global industrial machine vision company, where she was Senior Vice President and Business Unit Manager of its Vision Software Products Group. Marilyn was the recipient of the sixth annual Women Entrepreneurs in Science and Technology Leadership Award, a co-recipient of the SEMI industry award for outstanding technical contributions to the semiconductor industry, and a 2020 NACD Directorship 100 honoree. She also serves on the Board of Directors of Teradyne, a leading supplier of automation equipment for test and industrial applications.

Dr Zachary Pitluk is Vice President of Life Sciences and Healthcare at Paradigm4. He has worked in sales and marketing for 23 years, from being a pharma representative for Bristol Myers Squibb to management roles in life science technology companies. Since 2003, his positions have included Vice President of Business Development at Gene Network Sciences and Chief Commercial Officer at Proveris Scientific. Zach has held academic positions at Yale University Department of Molecular Biophysics and Biochemistry: Associate Research Scientist, Postdoctoral Fellow, and Graduate Student, and has been named as co-inventor on numerous patents.
