Cross-species gene comparison
Code available here.
This tutorial builds a bioinformatics pipeline that compares homologous genes across species. The workflow loads curated gene sequences (insulin, hemoglobin, or p53 by default), scores each sequence with the Carbon genomic language model, aligns DNA and translated protein sequences, folds proteins with ESMFold, and renders interactive HTML reports with identity heatmaps, phylogenetic trees, and 3D structure viewers.
Flyte makes the multi-stage GPU/CPU pipeline reliable:
- Separate CPU and GPU TaskEnvironments so alignment runs on modest CPU boxes while Carbon scoring and ESMFold run on GPUs.
- report=True on every stage for live HTML progress and final summaries in the Flyte UI.
- Cached data loading and orchestrated fan-out across pipeline stages.
Define the task environments
GPU tasks handle Carbon log-likelihood scoring and ESMFold structure prediction; CPU tasks load gene sets, run Needleman-Wunsch alignments, and generate the final summary.
main_img = flyte.Image.from_uv_script(__file__, name="genomic-gene-comparison", pre=True)

gpu_env = flyte.TaskEnvironment(
    name="genomic-gene-comparison-gpu",
    image=main_img,
    resources=flyte.Resources(cpu=4, memory="32Gi", gpu=1),
)

cpu_env = flyte.TaskEnvironment(
    name="genomic-gene-comparison-cpu",
    image=main_img,
    resources=flyte.Resources(cpu=2, memory="8Gi"),
    depends_on=[gpu_env],
)
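The Needleman-Wunsch step that runs in the CPU environment is a classic global alignment by dynamic programming. The tutorial's own alignment code is not shown in this section, so the following is only a minimal sketch of the technique: the function names, the scoring values (match=1, mismatch=-1, gap=-2), and the identity definition (matching columns divided by alignment length) are all assumptions chosen for illustration.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -2) -> tuple[str, str]:
    """Globally align a and b; return the two gapped strings (illustrative scoring)."""
    n, m = len(a), len(b)

    def sub(i: int, j: int) -> int:
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Fraction of alignment columns holding the same non-gap residue (one convention of several)."""
    if not aligned_a:
        return 0.0
    same = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return same / len(aligned_a)


row_a, row_b = needleman_wunsch("ACGT", "AGT")
print(row_a, row_b, percent_identity(row_a, row_b))
```

The full score matrix costs O(n·m) time and memory, which is one reason a modest CPU-only environment is a natural fit for this stage when the genes are short.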
Dependencies are declared at the top of the file using the uv script style:
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "flyte>=2.4.0",
#     "torch>=2.9.0",
#     "transformers>=4.49.0",
#     "accelerate>=0.34.0",
#     "numpy",
# ]
# ///
Orchestrate the pipeline
The top-level pipeline task chains five stages: gene loading, Carbon scoring, sequence alignment, ESMFold folding, and a cross-species summary report. Before each stage it replaces the live report with an updated progress indicator.
@cpu_env.task(report=True)
async def pipeline(
    gene_set: str = "insulin",
    model_name: str = "HuggingFaceBio/Carbon-3B",
    custom_json: str = "",
) -> tuple[str, str]:
    """
    End-to-end cross-species gene comparison pipeline.
    Returns (comparison JSON, structures JSON).

    1. Load homologous gene sequences across species
    2. Score with Carbon genomic language model
    3. Align sequences and compute pairwise similarity
    4. Fold translated proteins with ESMFold
    5. Generate comprehensive summary with phylogenetic trees
    """
    log.info(f"Starting cross-species gene comparison pipeline (gene_set={gene_set})...")

    def _pipeline_progress(step: int, label: str) -> str:
        # Render a one-line stepper: done (✓), current (●), pending (○).
        steps = [
            "Load Genes",
            "Carbon Scoring",
            "Sequence Alignment",
            "ESMFold Structures",
            "Generate Summary",
        ]
        dots = ""
        for i, s in enumerate(steps):
            if i + 1 < step:
                icon = '<span style="color:#2563eb;">✓</span>'
            elif i + 1 == step:
                icon = '<span style="color:#2563eb;">●</span>'
            else:
                icon = '<span style="color:#adb5bd;">○</span>'
            dots += f"<span style='margin:0 8px;'>{icon} {s}</span>"
        return f"""
        <h2>Cross-Species Gene Comparison</h2>
        <div class="card" style="text-align:center;">{dots}</div>
        <p>{label}</p>
        """

    # Stage 1
    await flyte.report.replace.aio(
        _wrap_report(_pipeline_progress(1, "Loading homologous gene sequences...")),
        do_flush=True,
    )
    genes_dir = await load_genes(gene_set=gene_set, custom_json=custom_json)

    # Stage 2
    await flyte.report.replace.aio(
        _wrap_report(_pipeline_progress(2, "Scoring sequences with Carbon...")),
        do_flush=True,
    )
    scores_json = await score_sequences(genes_dir=genes_dir, model_name=model_name)

    # Stage 3
    await flyte.report.replace.aio(
        _wrap_report(_pipeline_progress(3, "Aligning sequences with Needleman-Wunsch...")),
        do_flush=True,
    )
    comparison_json = await align_and_compare(scores_json=scores_json, genes_dir=genes_dir)

    # Stage 4
    await flyte.report.replace.aio(
        _wrap_report(_pipeline_progress(4, "Folding proteins with ESMFold...")),
        do_flush=True,
    )
    structures_json = await fold_proteins(comparison_json=comparison_json)

    # Stage 5
    await flyte.report.replace.aio(
        _wrap_report(_pipeline_progress(5, "Generating summary report...")),
        do_flush=True,
    )
    summary_json = await generate_summary(
        comparison_json=comparison_json,
        structures_json=structures_json,
    )

    # Final report
    summary = json.loads(summary_json)
    comparison = json.loads(comparison_json)
    final_html = f"""
    <h2>Pipeline Complete</h2>
    <div class="stat-grid">
        <div class="stat"><div class="value">{summary['gene_name']}</div><div class="label">Gene</div></div>
        <div class="stat"><div class="value">{summary['n_species']}</div><div class="label">Species</div></div>
        <div class="stat"><div class="value">{summary['avg_dna_identity']:.0%}</div><div class="label">Avg DNA Identity</div></div>
        <div class="stat"><div class="value">{summary['avg_protein_identity']:.0%}</div><div class="label">Avg Protein Identity</div></div>
        <div class="stat"><div class="value">{summary['avg_plddt']:.1f}</div><div class="label">Avg pLDDT</div></div>
        <div class="stat"><div class="value">{summary['n_structures']}</div><div class="label">3D Structures</div></div>
    </div>
    <div class="card">
        <b>Gene:</b> {summary['gene_name']} |
        <b>Species:</b> {', '.join(comparison['species'])} |
        <b>Model:</b> {model_name}
    </div>
    <div class="note">
        All 5 pipeline stages completed. View individual task reports for DNA/protein
        identity heatmaps, phylogenetic trees, interactive 3D protein structures with
        pLDDT confidence, Carbon log-likelihood scores, and evolutionary analysis.
    </div>
    """
    await flyte.report.replace.aio(_wrap_report(final_html), do_flush=True)
    log.info("Pipeline complete.")
    return comparison_json, structures_json
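Stage 4 folds proteins translated from the DNA sequences. The translation code is not part of this section, so here is a rough sketch of how a coding sequence could be turned into a protein string. It assumes the standard genetic code, a reading frame that starts at the first base, and translation that stops at the first stop codon; the names translate and CODON_TABLE are illustrative, not taken from the pipeline.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str, to_stop: bool = True) -> str:
    """Translate a coding sequence in frame 0; unknown codons become 'X'."""
    dna = dna.upper()
    protein = []
    # Step through complete codons only; trailing 1-2 bases are ignored.
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCCTGTGGATGCGC"))  # prints MALWMR
```

Mapping ambiguous codons to X keeps the output the same length as the codon count, which makes downstream protein alignment easier to reason about than silently dropping residues.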
Run the workflow
From the example directory:
cd v2/tutorials/genomic_gene_comparison
uv run --script genomic_gene_comparison.py
Or submit a specific gene set with the Flyte CLI:
flyte run genomic_gene_comparison.py pipeline --gene_set hemoglobin
This example needs a GPU for Carbon and ESMFold. Open the run URL and check each task’s report tab for heatmaps, dendrograms, and interactive 3D viewers.
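To make the identity heatmaps and the "Avg DNA Identity" figure easier to interpret, the sketch below shows one plausible way such numbers could be derived: a symmetric pairwise identity matrix over aligned sequences, averaged over its off-diagonal entries. This is an assumption about the general approach, not the tutorial's actual code. The function names and the toy sequences are invented for illustration, and the input is assumed to be pre-aligned to equal length.

```python
from itertools import combinations


def aligned_identity(a: str, b: str) -> float:
    """Fraction of alignment columns where both rows hold the same non-gap residue."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    same = sum(x == y and x != "-" for x, y in zip(a, b))
    return same / len(a)


def identity_matrix(aligned: dict[str, str]) -> tuple[list[str], list[list[float]]]:
    """Return species names and a symmetric identity matrix (diagonal = 1.0)."""
    names = list(aligned)
    n = len(names)
    matrix = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        value = aligned_identity(aligned[names[i]], aligned[names[j]])
        matrix[i][j] = matrix[j][i] = value
    return names, matrix


def mean_off_diagonal(matrix: list[list[float]]) -> float:
    """Average identity over all distinct species pairs."""
    pairs = [matrix[i][j] for i, j in combinations(range(len(matrix)), 2)]
    return sum(pairs) / len(pairs) if pairs else 0.0


# Toy, made-up aligned sequences purely for demonstration.
toy = {"human": "ACGT", "mouse": "ACGA", "chicken": "AC-T"}
names, matrix = identity_matrix(toy)
print(names)
for row in matrix:
    print([f"{v:.2f}" for v in row])
print(f"{mean_off_diagonal(matrix):.0%}")  # prints 67%
```

The same matrix, converted to distances as 1 minus identity, is the usual starting point for a distance-based dendrogram.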