Why DuckDB Inside the SDK

How the SDK uses DuckDB for caching, deduplication, and zero-cost re-querying.




⚠️ Preview Feature — The biolmai.pipeline module used in this guide is currently in preview and not yet publicly released. Access is available to early users on request. Contact us to get access.

What you'll learn:

  • How sequence deduplication and cache lookup work under the hood
  • How to query the DuckDB store directly with arbitrary SQL
  • How to resume a pipeline after a crash or session restart

Requirements:

pip install "biolmai[pipeline]"
export BIOLMAI_TOKEN=your-token-here

Setup

import os
from biolmai.pipeline import (
    DataPipeline, DuckDBDataStore,
    ThresholdFilter, RankingFilter,
    ValidAminoAcidFilter, EmbeddingSpec,
    DiversitySamplingFilter,
)

TOKEN = os.environ.get("BIOLMAI_TOKEN", "")
if not TOKEN:
    raise EnvironmentError(
        "Set BIOLMAI_TOKEN before running.\n"
        "Get one at https://biolm.ai/ui/accounts/user-api-tokens/"
    )

The cache in action

Run a pipeline, then re-run it. The second run fires zero API calls.

PEPTIDES = [
    "GIGKFLHSAKKFGKAFVGEIMNS",
    "GLFDIIKKIAESF",
    "KLAKLAKKLAKLAK",
    "RRWWRRWWRR",
    "KWKLFKKI",
]

ds = DuckDBDataStore("sdk_demo.duckdb")

pipeline = DataPipeline(sequences=PEPTIDES, datastore=ds, run_id="run_v1", verbose=True)
pipeline.add_prediction("temperature-regression", extractions="prediction", columns="melting_temperature")
pipeline.add_prediction("biolmsol", extractions="solubility_score", columns="solubility")
pipeline.run()
print("Run 1 complete — check the API call count above")

# Re-run identical pipeline — 0 API calls
pipeline2 = DataPipeline(sequences=PEPTIDES, datastore=ds, run_id="run_v1", verbose=True)
pipeline2.add_prediction("temperature-regression", extractions="prediction", columns="melting_temperature")
pipeline2.add_prediction("biolmsol", extractions="solubility_score", columns="solubility")
pipeline2.run()
print("Run 2 complete — all served from cache")
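
The guide does not show the SDK's internals, so the following is only a toy sketch of the general dedup-then-lookup idea behind the zero-call second run. Everything in it is hypothetical: the ToyPredictionCache class, the sequence_key hashing scheme, the case normalization, and fake_model (a local stand-in for a remote API call). The real store persists results in DuckDB rather than in a Python dict.

```python
import hashlib


def sequence_key(sequence: str) -> str:
    """Stable key for a sequence: SHA-256 of the stripped, upper-cased string (an assumption)."""
    return hashlib.sha256(sequence.strip().upper().encode("utf-8")).hexdigest()


class ToyPredictionCache:
    """Toy stand-in for a persistent store; results are keyed by (sequence key, model name)."""

    def __init__(self):
        self._results = {}
        self.api_calls = 0

    def predict(self, sequences, model_name, call_model):
        # 1. Deduplicate: identical sequences collapse to one key (first-seen order kept).
        unique = {}
        for seq in sequences:
            unique.setdefault(sequence_key(seq), seq)

        # 2. Cache lookup: only keys with no stored result trigger a model call.
        for key, seq in unique.items():
            if (key, model_name) not in self._results:
                self._results[(key, model_name)] = call_model(seq)
                self.api_calls += 1

        # 3. Answer every input, duplicates included, from the store.
        return [self._results[(sequence_key(seq), model_name)] for seq in sequences]


def fake_model(sequence: str) -> float:
    """Hypothetical scorer standing in for a remote API call."""
    return float(len(sequence))


cache = ToyPredictionCache()
peptides = ["KWKLFKKI", "RRWWRRWWRR", "KWKLFKKI"]  # one duplicate

first = cache.predict(peptides, "toy-model", fake_model)
calls_after_first = cache.api_calls

second = cache.predict(peptides, "toy-model", fake_model)
calls_after_second = cache.api_calls

print(calls_after_first, calls_after_second)
```

The call counter stops growing after the first pass: the duplicate costs nothing within a run, and the whole second run is answered from stored results.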

Query the DuckDB store directly

The .duckdb file is a standard DuckDB database. You can run arbitrary SQL against it.

# Schema overview
ds.conn.execute("""
    SELECT prediction_type, model_name, COUNT(*) AS n, 
           ROUND(AVG(value), 3) AS mean, ROUND(MIN(value), 3) AS min, ROUND(MAX(value), 3) AS max
    FROM predictions
    GROUP BY prediction_type, model_name
""").df()

# Full prediction history for every sequence
ds.conn.execute("""
    SELECT s.sequence, p.prediction_type, ROUND(p.value, 3) AS value
    FROM sequences s
    JOIN predictions p ON s.sequence_id = p.sequence_id
    ORDER BY s.sequence, p.prediction_type
""").df()

Re-filter without re-predicting

Add a filter or change its threshold, then re-run under a new run_id. The predictions are already cached, so the re-run needs no new API calls.

pipeline3 = DataPipeline(sequences=PEPTIDES, datastore=ds, run_id="run_v3_strict", verbose=True)
pipeline3.add_prediction("temperature-regression", extractions="prediction", columns="melting_temperature")
pipeline3.add_prediction("biolmsol", extractions="solubility_score", columns="solubility")
pipeline3.add_filter(ThresholdFilter("melting_temperature", min_value=50.0))  # stricter threshold
pipeline3.run()
pipeline3.summary()
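
Because a threshold only reads values that are already stored, trying several cutoffs is cheap. The sketch below illustrates that with a plain dict in place of the store. The cached_tm values are made-up placeholders, passing is a hypothetical helper, and the inclusive comparison (value >= min_value) is an assumption about how a minimum threshold behaves, not a statement about ThresholdFilter.

```python
# Placeholder values standing in for cached melting-temperature predictions.
cached_tm = {
    "KWKLFKKI": 48.5,
    "RRWWRRWWRR": 61.25,
    "GLFDIIKKIAESF": 55.0,
}


def passing(cached, min_value):
    """Return the sequences whose cached value is at least min_value, sorted alphabetically."""
    return sorted(seq for seq, value in cached.items() if value >= min_value)


# Sweep several thresholds over the same cached values; no model is called.
sweep = {threshold: passing(cached_tm, threshold) for threshold in (45.0, 50.0, 60.0)}

for threshold, kept in sweep.items():
    print(threshold, len(kept), kept)
```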

Cleanup

ds.close()
os.remove("sdk_demo.duckdb")  # os was already imported in Setup

Next Steps

Check out additional tutorials at jupyter.biolm.ai, or head over to our BioLM Documentation to explore additional models and functionality.

See more use-cases and APIs on your BioLM Console Catalog.



© 2022 - 2026 BioLM. All Rights Reserved.