Genomic variant effect prediction
Code available here.
This tutorial demonstrates zero-shot variant effect prediction (VEP) with HuggingFace Carbon. The pipeline loads clinically relevant variants across genes such as BRCA2, TP53, CFTR, KRAS, and HBB, scores each mutation with a log-likelihood ratio, and produces rich HTML reports with DNA tracks, lollipop plots, confusion matrices, and ranked pathogenicity tables.
Flyte provides:
- GPU-backed inference for Carbon scoring with live progress reports.
- CPU analysis tasks for visualization and accuracy metrics without holding a GPU.
- End-to-end orchestration from variant loading through summary reporting.
Define the task environments
main_img = flyte.Image.from_uv_script(__file__, name="genomic-variant-effect", pre=True)

gpu_env = flyte.TaskEnvironment(
    name="genomic-variant-effect-gpu",
    image=main_img,
    resources=flyte.Resources(cpu=4, memory="24Gi", gpu=1),
)

cpu_env = flyte.TaskEnvironment(
    name="genomic-variant-effect-cpu",
    image=main_img,
    resources=flyte.Resources(cpu=2, memory="6Gi"),
    depends_on=[gpu_env],
)
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "flyte>=2.4.0",
#     "torch>=2.9.0",
#     "transformers>=4.49.0",
#     "accelerate>=0.34.0",
#     "numpy",
# ]
# ///
Orchestrate the pipeline
The pipeline task loads variants, scores them with Carbon, analyzes classification accuracy against known labels, and generates a summary report.
@cpu_env.task(report=True)
async def pipeline(
    variants_json: str = "",
    model_name: str = "HuggingFaceBio/Carbon-3B",
) -> tuple[str, str]:
    """
    End-to-end genomic variant effect prediction pipeline.
    Returns (scores JSON, analysis JSON).

    1. Load and validate gene variants
    2. Score variants with Carbon (log-likelihood ratio)
    3. Analyze effects: accuracy, visualizations, classification
    4. Generate comprehensive summary report
    """
    log.info("Starting genomic variant effect prediction pipeline...")

    def _pipeline_progress(step: int, label: str) -> str:
        steps = [
            "Load Variants",
            "Carbon Scoring",
            "Analyze Effects",
            "Generate Summary",
        ]
        dots = ""
        for i, s in enumerate(steps):
            if i + 1 < step:
                icon = '<span style="color:#2563eb;">✓</span>'
            elif i + 1 == step:
                icon = '<span style="color:#2563eb;">●</span>'
            else:
                icon = '<span style="color:#adb5bd;">○</span>'
            dots += f"<span style='margin:0 8px;'>{icon} {s}</span>"
        return f"""
        <h2>Genomic Variant Effect Prediction</h2>
        <div class="card" style="text-align:center;">{dots}</div>
        <p>{label}</p>
        """

    # Stage 1: Load variants
    await flyte.report.replace.aio(
        _wrap_report(_pipeline_progress(1, "Loading and validating gene variants...")),
        do_flush=True,
    )
    var_dir = await load_variants(variants_json=variants_json)

    # Stage 2: Score with Carbon
    await flyte.report.replace.aio(
        _wrap_report(_pipeline_progress(2, "Running Carbon model for variant effect scoring...")),
        do_flush=True,
    )
    scores_json = await score_variants(variants_dir=var_dir, model_name=model_name)

    # Stage 3: Analyze effects
    await flyte.report.replace.aio(
        _wrap_report(_pipeline_progress(3, "Analyzing variant effects and generating visualizations...")),
        do_flush=True,
    )
    analysis_json = await analyze_effects(scores_json=scores_json, variants_dir=var_dir)

    # Stage 4: Summary
    await flyte.report.replace.aio(
        _wrap_report(_pipeline_progress(4, "Generating comprehensive summary report...")),
        do_flush=True,
    )
    summary_json = await generate_summary(scores_json=scores_json, analysis_json=analysis_json)

    # Final pipeline report
    analysis = json.loads(analysis_json)
    results = json.loads(scores_json)
    final_html = f"""
    <h2>Pipeline Complete</h2>
    <div class="stat-grid">
        <div class="stat"><div class="value">{len(results)}</div><div class="label">Genes Analyzed</div></div>
        <div class="stat"><div class="value">{analysis['total_variants']}</div><div class="label">Variants Scored</div></div>
        <div class="stat"><div class="value">{analysis['accuracy']:.0%}</div><div class="label">Direction Accuracy</div></div>
        <div class="stat"><div class="value">{analysis['precision']:.0%}</div><div class="label">Precision</div></div>
        <div class="stat"><div class="value">{analysis['recall']:.0%}</div><div class="label">Recall</div></div>
    </div>
    <div class="card">
        <b>Model:</b> HuggingFace Carbon |
        <b>Method:</b> Zero-shot log-likelihood ratio scoring |
        <b>Genes:</b> {', '.join(g.split('(')[0].strip() for g in results.keys())}
    </div>
    <div class="note">
        All 4 pipeline stages completed successfully. View individual task reports for detailed
        visualizations including DNA sequence tracks, variant lollipop plots, VEP score charts,
        confusion matrices, and ranked variant tables.
    </div>
    """
    await flyte.report.replace.aio(_wrap_report(final_html), do_flush=True)
    log.info("Pipeline complete.")
    return scores_json, analysis_json
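The final report reads `total_variants`, `accuracy`, `precision`, and `recall` from the analysis JSON. The `analyze_effects` task that produces them is not shown in this section, so the following is only a minimal sketch of how such metrics could be derived. It assumes a variant is called pathogenic when its score is below zero and that each variant carries a boolean ground-truth label. The function name `direction_metrics`, the zero threshold, and the sample values are all hypothetical.

```python
def direction_metrics(scores: list[float], labels: list[bool]) -> dict[str, float]:
    """Compare sign-based predictions against known labels.

    Assumption: a score below 0 is treated as a "pathogenic" prediction,
    and labels[i] is True when variant i is known to be pathogenic.
    """
    tp = fp = tn = fn = 0
    for score, is_pathogenic in zip(scores, labels):
        predicted_pathogenic = score < 0
        if predicted_pathogenic and is_pathogenic:
            tp += 1
        elif predicted_pathogenic and not is_pathogenic:
            fp += 1
        elif not predicted_pathogenic and not is_pathogenic:
            tn += 1
        else:
            fn += 1

    total = tp + fp + tn + fn
    return {
        "total_variants": total,
        # Guard each ratio against an empty denominator.
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up scores and labels, for illustration only.
scores = [-2.1, -0.4, 0.3, 1.2, -0.9]
labels = [True, True, True, False, False]
metrics = direction_metrics(scores, labels)
print(metrics)
```

Precision and recall are worth reporting next to accuracy because curated variant sets are often unbalanced between pathogenic and benign entries.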
Run the workflow
From the example directory:
cd v2/tutorials/genomic_variant_effect
uv run --script genomic_variant_effect.py
Use a smaller Carbon model for faster iteration:
flyte run genomic_variant_effect.py pipeline --model_name HuggingFaceBio/Carbon-500M
Negative VEP scores indicate that the model prefers the reference allele over the alternate, a signal correlated with pathogenicity in this zero-shot setup.
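To make the sign convention concrete, here is a minimal sketch of a log-likelihood ratio score computed as log P(alt) minus log P(ref). It does not call Carbon. Instead it assumes a toy model that assigns an independent probability to each base at each position, whereas a real language model would condition each base on its sequence context. The names `sequence_log_likelihood`, `llr_score`, and `toy_probs` are hypothetical and the probabilities are made up.

```python
import math


def sequence_log_likelihood(seq: str, position_probs: list[dict[str, float]]) -> float:
    """Sum the log-probability of each base under a toy per-position model."""
    return sum(math.log(position_probs[i][base]) for i, base in enumerate(seq))


def llr_score(ref_seq: str, alt_seq: str, position_probs: list[dict[str, float]]) -> float:
    """Log-likelihood ratio: log P(alt) - log P(ref).

    A negative value means the toy model finds the reference more likely.
    """
    return (
        sequence_log_likelihood(alt_seq, position_probs)
        - sequence_log_likelihood(ref_seq, position_probs)
    )


# Toy distributions over the four bases at three positions (illustrative only).
toy_probs = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]

ref = "AGT"  # reference allele context
alt = "ACT"  # single-base substitution G>C at the middle position

score = llr_score(ref, alt, toy_probs)
print(f"LLR = {score:.3f}")
```

In this toy case only the substituted position contributes to the difference, so the score reduces to log(0.1) - log(0.7), which is negative because the model favors the reference base.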