Sara Gawlinski

Sequences and Systems: The Convergence of Machine Learning and Biotech


When Next-Generation Sequencing (NGS) applied the power of massively parallel processing to the analysis of DNA, it transformed the biotech playing field. Overnight, researchers were able to ask questions that would have been prohibitively time-consuming and costly using traditional methods of DNA sequencing.

NGS could rapidly generate hundreds of megabases, even gigabases, of nucleotide sequence reads; an entire human genome could be sequenced in under a day. This fundamental shift in genomic research produced a vast amount of biological data. To derive insights from it, that data needed to be processed, transformed and analyzed, and machine learning has proved essential for those tasks.

Last week, the Union team hosted a webinar, “Unlocking value from sequencing data with workflow management.” We brought together several Flyte and Union Cloud users who are experts in data architecture, bioinformatics and machine learning to discuss:

  • The state of data in biotech
  • Challenges they face managing and collaborating on workflows
  • Differences between academic and industry applications of ML
  • The future of biotech tooling and platforms 

Read on for the highlights, or tune in to the recording below.

The impact of machine learning on biotech

As annotated genomic databases, NGS data repositories and protein structure information in the RCSB Protein Data Bank (PDB) became publicly available, researchers in bioinformatics and computational biology used ML methods to extract valuable insights. “The accumulation of large-scale data sets in a variety of different specific areas has been the key to unlocking a lot of breakthroughs in machine learning we’ve seen: sequence analysis, protein structure prediction, protein engineering etc.,” said Alex Ford, Head of Data Platform at AbCellera. One example is AlphaFold, the machine learning system that tackles the challenge of protein structure prediction. At the time of writing, the open-access AlphaFoldDB contains over 200 million entries that can be used to advance biological research.

More data, more problems

There are stark qualitative distinctions between life sciences data and the more familiar internet-scale image and text data. “With internet scale data, people use the products and it creates data whereas in the bio space…you have to actively create the data,” said Thomas Vetterli, Director of Machine Learning and Bioinformatics at Hedera Dx. The nuance of data in this domain also creates labeling and active learning problems.

“AI can generate a ton of structures, and articles will say it’s the equivalent of about 2 billion hours of grad student work making crystal structures … but there still needs to be a ground truth to the function and the purpose and the kind of evolutionary trajectory and forces that shape that molecule or pathway,” said Brian O’Donovan, Head of Bioinformatics at Delve Bio. Many in the ML community who encounter this biological data are taken by surprise at the sheer scale, noise and lingering vestiges of the academic origins of some of these file types, like FASTQ or PDB. Before they can apply ML methods to this domain-specific data, scientists and engineers alike have found they need specialized tooling and infrastructure to efficiently manage, store and analyze it. “The discovery of this data by the ML community has definitely changed processing needs,” said Eli Bixby, co-founder and ML engineer of Cradle Bio, “and this is also where the workflows get more complex from an ops perspective.”
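For readers who have not met these formats, here is a small illustrative sketch of what working with FASTQ involves. It assumes the common four-line record layout and Phred+33 quality encoding, which real files do not always follow; the names `FastqRecord`, `parse_fastq` and `mean_quality` are made up for this example and do not come from the panel or from any particular library.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    """One sequencing read: identifier, bases and per-base quality scores."""
    read_id: str
    sequence: str
    qualities: List[int]


def parse_fastq(text: str) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text, assuming four lines per record.

    Assumption: qualities use Phred+33 encoding (ASCII code minus 33).
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ text must contain four lines per record")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"Malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"Sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], seq, [ord(c) - 33 for c in qual])


def mean_quality(record: FastqRecord) -> float:
    """Average Phred score of a read; 0.0 for an empty read."""
    if not record.qualities:
        return 0.0
    return sum(record.qualities) / len(record.qualities)


if __name__ == "__main__":
    example = "@read1\nACGT\n+\nII5!\n"
    for rec in parse_fastq(example):
        print(rec.read_id, rec.sequence, rec.qualities, mean_quality(rec))
```

Even this toy parser has to validate structure and decode quality scores before any model sees the data, which hints at why teams reach for dedicated tooling at scale.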

Delivering results is a team effort 

When so many different specialized scientists and engineers have to develop workflows with such exacting and specific requirements, collaboration becomes a necessity. Ford said AbCellera thinks of collaboration across three layers: a scientific layer (often a domain expert running a particular lab workflow), a data-science layer and the platform infrastructure layer. “We often have a very fast-cycle exploratory collaboration in sort of a computational notebook context between a scientist and a data scientist to begin framing an amorphous scientific problem down into some kind of computational tool or solution,” Ford said.  “Then we have another handoff in collaboration between a data scientist and an engineer to really take that prototype system and knit that into our production infrastructure. We collaborate across that entire stack, and we need a tool set that supports that.”  

The role of workflow orchestration

One topic that surfaced in the discussion: What common challenges are driving biotech companies to adopt workflow orchestration platforms? When Union Software Engineer Jeev Balakrishnan started working in bioinformatics, he found that the popular approach was to write long Bash scripts for pipelines. These ran well on a laptop or a single machine, but as the product grew, so did the complexity of the pipeline. In response, scientists turned to widely known orchestration tools such as Snakemake, Nextflow and Airflow. “People started migrating their Bash scripts and breaking them up into smaller pieces and putting it in these different frameworks and still running it on a single machine,” Balakrishnan said. “Every single step of the way was a struggle.” To serve customers, he added, “distributing and running at scale was a painful job to run things reliably, resiliently, and without your infrastructure struggling.”

As a Kubernetes-based workflow orchestration engine, Flyte uses containers to isolate and scale tasks in a workflow. This is especially helpful when a workflow is made up of tasks that require drastically different languages, dependencies and compute resources. However, understanding Kubernetes and the underlying container technology shouldn’t be a prerequisite for a bioinformatics scientist to build a sequence analysis pipeline using tools they are familiar with. According to Ford, “You need an environment in which you can allow [scientists] to pull the appropriate tools off the shelf, stitch together the pipeline that solves the scientific application and then reliably and reproducibly roll that out into a compute environment.”

Summary and next steps

As organizations continue to productize biological research, the adoption of established software engineering practices is critical to enable speed, scalability, and reproducibility. If you found this recap intriguing, you can check out the full panel discussion below.

Curious to learn more about Flyte, the workflow orchestration engine powering these innovations in biotechnology research?