Genomic variant effect prediction

Code available here.

This tutorial demonstrates zero-shot variant effect prediction (VEP) with HuggingFace Carbon. The pipeline loads clinically relevant variants across genes such as BRCA2, TP53, CFTR, KRAS, and HBB, scores each mutation with a log-likelihood ratio, and produces rich HTML reports with DNA tracks, lollipop plots, confusion matrices, and ranked pathogenicity tables.

Flyte provides:

  • GPU-backed inference for Carbon scoring with live progress reports.
  • CPU analysis tasks for visualization and accuracy metrics without holding a GPU.
  • End-to-end orchestration from variant loading through summary reporting.

Define the task environments

genomic_variant_effect.py
# /// script
# requires-python = ">=3.12"
# dependencies = [
#    "flyte>=2.4.0",
#    "torch>=2.9.0",
#    "transformers>=4.49.0",
#    "accelerate>=0.34.0",
#    "numpy",
# ]
# ///
import flyte

# Build one container image from the inline script metadata above;
# both environments share it.
main_img = flyte.Image.from_uv_script(__file__, name="genomic-variant-effect", pre=True)

# GPU environment for Carbon inference.
gpu_env = flyte.TaskEnvironment(
    name="genomic-variant-effect-gpu",
    image=main_img,
    resources=flyte.Resources(cpu=4, memory="24Gi", gpu=1),
)

# CPU environment for orchestration, analysis, and reporting.
# depends_on declares that tasks here call tasks in gpu_env.
cpu_env = flyte.TaskEnvironment(
    name="genomic-variant-effect-cpu",
    image=main_img,
    resources=flyte.Resources(cpu=2, memory="6Gi"),
    depends_on=[gpu_env],
)

Orchestrate the pipeline

The pipeline task loads variants, scores them with Carbon, analyzes classification accuracy against known labels, and generates a summary report.

genomic_variant_effect.py
@cpu_env.task(report=True)
async def pipeline(
    variants_json: str = "",
    model_name: str = "HuggingFaceBio/Carbon-3B",
) -> tuple[str, str]:
    """
    End-to-end genomic variant effect prediction pipeline.

    Returns (scores JSON, analysis JSON).

    1. Load and validate gene variants
    2. Score variants with Carbon (log-likelihood ratio)
    3. Analyze effects — accuracy, visualizations, classification
    4. Generate comprehensive summary report
    """
    log.info("Starting genomic variant effect prediction pipeline...")

    def _pipeline_progress(step: int, label: str) -> str:
        steps = [
            "Load Variants",
            "Carbon Scoring",
            "Analyze Effects",
            "Generate Summary",
        ]
        dots = ""
        for i, s in enumerate(steps):
            if i + 1 < step:
                icon = '<span style="color:#2563eb;">&#10003;</span>'
            elif i + 1 == step:
                icon = '<span style="color:#2563eb;">&#9679;</span>'
            else:
                icon = '<span style="color:#adb5bd;">&#9675;</span>'
            dots += f"<span style='margin:0 8px;'>{icon} {s}</span>"
        return f"""
        <h2>Genomic Variant Effect Prediction</h2>
        <div class="card" style="text-align:center;">{dots}</div>
        <p>{label}</p>
        """

    # Stage 1: Load variants
    await flyte.report.replace.aio(
        _wrap_report(_pipeline_progress(1, "Loading and validating gene variants...")),
        do_flush=True,
    )
    var_dir = await load_variants(variants_json=variants_json)

    # Stage 2: Score with Carbon
    await flyte.report.replace.aio(
        _wrap_report(_pipeline_progress(2, "Running Carbon model for variant effect scoring...")),
        do_flush=True,
    )
    scores_json = await score_variants(variants_dir=var_dir, model_name=model_name)

    # Stage 3: Analyze effects
    await flyte.report.replace.aio(
        _wrap_report(_pipeline_progress(3, "Analyzing variant effects and generating visualizations...")),
        do_flush=True,
    )
    analysis_json = await analyze_effects(scores_json=scores_json, variants_dir=var_dir)

    # Stage 4: Summary
    await flyte.report.replace.aio(
        _wrap_report(_pipeline_progress(4, "Generating comprehensive summary report...")),
        do_flush=True,
    )
    summary_json = await generate_summary(scores_json=scores_json, analysis_json=analysis_json)

    # Final pipeline report
    analysis = json.loads(analysis_json)
    results = json.loads(scores_json)

    final_html = f"""
    <h2>Pipeline Complete</h2>
    <div class="stat-grid">
      <div class="stat"><div class="value">{len(results)}</div><div class="label">Genes Analyzed</div></div>
      <div class="stat"><div class="value">{analysis['total_variants']}</div><div class="label">Variants Scored</div></div>
      <div class="stat"><div class="value">{analysis['accuracy']:.0%}</div><div class="label">Direction Accuracy</div></div>
      <div class="stat"><div class="value">{analysis['precision']:.0%}</div><div class="label">Precision</div></div>
      <div class="stat"><div class="value">{analysis['recall']:.0%}</div><div class="label">Recall</div></div>
    </div>
    <div class="card">
      <b>Model:</b> {model_name} |
      <b>Method:</b> Zero-shot log-likelihood ratio scoring |
      <b>Genes:</b> {', '.join(g.split('(')[0].strip() for g in results.keys())}
    </div>
    <div class="note">
      All 4 pipeline stages completed successfully. View individual task reports for detailed
      visualizations including DNA sequence tracks, variant lollipop plots, VEP score charts,
      confusion matrices, and ranked variant tables.
    </div>
    """

    await flyte.report.replace.aio(_wrap_report(final_html), do_flush=True)

    log.info("Pipeline complete.")
    return scores_json, analysis_json
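
The final report reads `total_variants`, `accuracy`, `precision`, and `recall` from the analysis JSON. The body of `analyze_effects` is not shown in this tutorial, so the following is only a minimal sketch of how such metrics could be derived. It assumes each variant has a VEP score and a known label, and that a score below a threshold of 0 is called pathogenic. The function name `direction_metrics`, the input layout, and the example scores are hypothetical.

```python
def direction_metrics(scored, threshold=0.0):
    """Compute accuracy, precision, and recall for score-based calls.

    `scored` is a list of (vep_score, is_pathogenic) pairs. A variant is
    called pathogenic when its score is below `threshold` (an assumption,
    not necessarily the rule used by the tutorial's analysis task).
    """
    tp = fp = tn = fn = 0
    for score, is_pathogenic in scored:
        predicted = score < threshold
        if predicted and is_pathogenic:
            tp += 1
        elif predicted and not is_pathogenic:
            fp += 1
        elif not predicted and is_pathogenic:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "total_variants": total,
        # Guard every ratio against an empty denominator.
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up scores and labels, for illustration only.
example = [(-2.1, True), (-0.4, True), (0.3, True), (-0.2, False), (1.5, False)]
metrics = direction_metrics(example)
```

Guarding each ratio keeps the report from failing when no variant is called pathogenic or when the input is empty.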

Run the workflow

From the example directory:

cd v2/tutorials/genomic_variant_effect
uv run --script genomic_variant_effect.py

Use a smaller Carbon model for faster iteration:

flyte run genomic_variant_effect.py pipeline --model_name HuggingFaceBio/Carbon-500M

A negative VEP score means the model assigns a higher likelihood to the reference allele than to the alternate allele. In this zero-shot setup, that preference is a signal that correlates with pathogenicity.
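
The sign convention can be illustrated with a small sketch. It assumes the score is the log-likelihood of the alternate sequence minus the log-likelihood of the reference sequence, which matches the interpretation of negative scores given here. The helpers `apply_snv`, `vep_score`, and `toy_log_likelihood` are hypothetical: the toy scorer uses fixed base frequencies as a stand-in for Carbon and is not how the model computes likelihoods.

```python
import math


def apply_snv(reference, position, ref_base, alt_base):
    """Return `reference` with one base substituted at a 0-based position."""
    if reference[position] != ref_base:
        raise ValueError(
            f"expected {ref_base!r} at position {position}, found {reference[position]!r}"
        )
    return reference[:position] + alt_base + reference[position + 1:]


def vep_score(reference, position, ref_base, alt_base, log_likelihood):
    """Log-likelihood ratio: log P(alternate) - log P(reference).

    `log_likelihood` is any callable mapping a DNA string to a log-probability.
    """
    alternate = apply_snv(reference, position, ref_base, alt_base)
    return log_likelihood(alternate) - log_likelihood(reference)


def toy_log_likelihood(sequence):
    """Stand-in scorer: independent bases with fixed, made-up frequencies."""
    freqs = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}
    return sum(math.log(freqs[base]) for base in sequence)


# Substitute A with C at index 3. The toy scorer rates C as less likely
# than A, so the score is negative: the reference allele is preferred.
score = vep_score("ATGATTA", 3, "A", "C", toy_log_likelihood)
```

With a real genomic language model, `log_likelihood` would instead sum the model's per-token log-probabilities over the sequence window around the variant.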