Omni-DNA 1B is an autoregressive genomic foundation model (1B parameters) enabling GPU-accelerated inference for cross-modal and multi-task DNA sequence analysis. The model supports tasks including DNA classification, functional annotation generation (DNA-to-text), and DNA-to-image mapping, using a unified tokenizer with dynamic vocabulary expansion. Omni-DNA 1B achieves state-of-the-art accuracy on genomic benchmarks (average MCC 0.767 across 18 tasks) and simultaneously solves multiple acetylation and methylation tasks, providing streamlined annotation and prediction workflows for bioinformatics, genomic research, and functional genomics applications.

Predict

Predict log probabilities for input DNA sequences

python
from biolmai import BioLM
response = BioLM(
    entity="omni-dna-1b",
    action="predict",
    params={},
    items=[
      {
        "sequence": "ACTGACTG"
      }
    ]
)
print(response)
bash
curl -X POST https://biolm.ai/api/v3/omni-dna-1b/predict/ \
  -H "Authorization: Token YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "items": [
    {
      "sequence": "ACTGACTG"
    }
  ]
}'
python
import requests

url = "https://biolm.ai/api/v3/omni-dna-1b/predict/"
headers = {
    "Authorization": "Token YOUR_API_KEY",
    "Content-Type": "application/json"
}
payload = {
    "items": [
        {
            "sequence": "ACTGACTG"
        }
    ]
}

response = requests.post(url, headers=headers, json=payload)
print(response.json())
r
library(httr)

url <- "https://biolm.ai/api/v3/omni-dna-1b/predict/"
headers <- c("Authorization" = "Token YOUR_API_KEY", "Content-Type" = "application/json")
body <- list(
  items = list(
    list(
      sequence = "ACTGACTG"
    )
  )
)

res <- POST(url, add_headers(.headers = headers), body = body, encode = "json")
print(content(res))
POST /api/v3/omni-dna-1b/predict/

Predict endpoint for Omni-DNA 1B.

Request

  • params (object, optional) — Configuration parameters:

    • include (array of strings, default: [“mean”]) — Embedding types to include in response:

      • Enum values: “mean”, “last”

  • items (array of objects, min: 1, max: 2) — Input sequences:

    • sequence (string, min length: 1, max length: 2048, required) — DNA sequence containing only unambiguous nucleotides (A, C, G, T)
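
The request constraints above can be checked on the client before a call is made. The following is a minimal sketch, not part of the BioLM SDK: `build_items` is a hypothetical helper, and it assumes that only uppercase A, C, G, and T are accepted.

```python
import re

MAX_SEQUENCE_LENGTH = 2048   # per-sequence limit from the request schema
MAX_ITEMS_PER_REQUEST = 2    # per-request limit from the request schema
# Assumption: only uppercase unambiguous nucleotides are accepted.
_VALID_DNA = re.compile(r"[ACGT]+")


def build_items(sequences):
    """Validate sequences and return the `items` list for a request body."""
    if not 1 <= len(sequences) <= MAX_ITEMS_PER_REQUEST:
        raise ValueError(
            f"expected 1 to {MAX_ITEMS_PER_REQUEST} sequences, got {len(sequences)}"
        )
    items = []
    for index, sequence in enumerate(sequences):
        if not 1 <= len(sequence) <= MAX_SEQUENCE_LENGTH:
            raise ValueError(
                f"sequence {index} has length {len(sequence)}; "
                f"allowed range is 1 to {MAX_SEQUENCE_LENGTH}"
            )
        if not _VALID_DNA.fullmatch(sequence):
            raise ValueError(f"sequence {index} contains characters other than A, C, G, T")
        items.append({"sequence": sequence})
    return items
```

The returned list can be passed as the `items` value in any of the request examples shown for this endpoint.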

Example request:

http
POST /api/v3/omni-dna-1b/predict/ HTTP/1.1
Host: biolm.ai
Authorization: Token YOUR_API_KEY
Content-Type: application/json

{
  "items": [
    {
      "sequence": "ACTGACTG"
    }
  ]
}

Response

  • results (array of objects) — One result per input item, in the order requested:

    • mean (array of objects, optional) — Mean embedding representation across all tokens in the input sequence:

      • embedding (array of floats, size: 2048) — Mean embedding vector for the input sequence

    • last (array of objects, optional) — Embedding representation of the last token in the input sequence:

      • embedding (array of floats, size: 2048) — Embedding vector for the last token

    • log_prob (float, range: negative infinity to 0.0) — Log probability score of the input DNA sequence

Example response:

http
HTTP/1.1 200 OK
Content-Type: application/json

{
  "results": [
    {
      "log_prob": -11.62202262878418
    }
  ]
}
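
Because a longer sequence accumulates a more negative log probability, raw `log_prob` values are hard to compare across sequences of different lengths. A length-normalized score is one way to put them on a common scale. The sketch below rests on two assumptions that the schema does not state: that `log_prob` is a natural-log total over the whole sequence, and that dividing by the nucleotide count (rather than the model's internal token count) is an acceptable normalization. Both function names are hypothetical.

```python
import math


def per_base_log_prob(log_prob, sequence):
    """Average log probability per nucleotide (assumes log_prob is a sequence total)."""
    return log_prob / len(sequence)


def per_base_perplexity(log_prob, sequence):
    """Exponentiated negative average log probability (assumes natural log)."""
    return math.exp(-log_prob / len(sequence))


# Values taken from the example response above.
score = per_base_log_prob(-11.62202262878418, "ACTGACTG")
print(f"per-base log probability: {score:.4f}")
```

Lower perplexity means the model finds the sequence more predictable. Under these assumptions, a value close to 4 corresponds to a near-uniform guess over the four nucleotides.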

Encode

Generate embeddings for input DNA sequences

python
from biolmai import BioLM
response = BioLM(
    entity="omni-dna-1b",
    action="encode",
    params={},
    items=[
      {
        "sequence": "ACTGACTG"
      }
    ]
)
print(response)
bash
curl -X POST https://biolm.ai/api/v3/omni-dna-1b/encode/ \
  -H "Authorization: Token YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "items": [
    {
      "sequence": "ACTGACTG"
    }
  ]
}'
python
import requests

url = "https://biolm.ai/api/v3/omni-dna-1b/encode/"
headers = {
    "Authorization": "Token YOUR_API_KEY",
    "Content-Type": "application/json"
}
payload = {
    "items": [
        {
            "sequence": "ACTGACTG"
        }
    ]
}

response = requests.post(url, headers=headers, json=payload)
print(response.json())
r
library(httr)

url <- "https://biolm.ai/api/v3/omni-dna-1b/encode/"
headers <- c("Authorization" = "Token YOUR_API_KEY", "Content-Type" = "application/json")
body <- list(
  items = list(
    list(
      sequence = "ACTGACTG"
    )
  )
)

res <- POST(url, add_headers(.headers = headers), body = body, encode = "json")
print(content(res))
POST /api/v3/omni-dna-1b/encode/

Encode endpoint for Omni-DNA 1B.

Request

  • params (object, optional) — Configuration parameters:

    • include (array of strings, default: [“mean”]) — Embedding types to include in response

      • Allowed values:

        • “mean” — Mean embedding across all sequence positions

        • “last” — Embedding from the last sequence position

  • items (array of objects, min: 1, max: 2) — DNA sequences to encode:

    • sequence (string, required, min length: 1, max length: 2048) — DNA sequence containing only unambiguous nucleotide characters (A, T, G, C)
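
The request examples for this endpoint leave `params` empty, so only the default mean embedding is returned. A request body that asks for both embedding types could be assembled as in the sketch below. `build_encode_payload` is a hypothetical helper, and the placement of `include` under `params` follows the schema listed above.

```python
ALLOWED_INCLUDE = ("mean", "last")  # enum values from the request schema


def build_encode_payload(sequences, include=("mean",)):
    """Build a JSON-serializable request body for the encode endpoint."""
    unknown = [name for name in include if name not in ALLOWED_INCLUDE]
    if unknown:
        raise ValueError(f"unsupported include values: {unknown}")
    return {
        "params": {"include": list(include)},
        "items": [{"sequence": sequence} for sequence in sequences],
    }


payload = build_encode_payload(["ACTGACTG"], include=("mean", "last"))
# The payload can then be sent as in the requests example above, e.g.
# requests.post(url, headers=headers, json=payload)
```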

Example request:

http
POST /api/v3/omni-dna-1b/encode/ HTTP/1.1
Host: biolm.ai
Authorization: Token YOUR_API_KEY
Content-Type: application/json

{
  "items": [
    {
      "sequence": "ACTGACTG"
    }
  ]
}

Response

  • results (array of objects) — One result per input item, in the order requested:

    • mean (array of objects, optional) — Mean-pooled embeddings across the entire input DNA sequence:

      • embedding (array of floats, size: 2048) — Embedding vector representing the input sequence; dimensions correspond to the model’s hidden size (e.g., 2048 for Omni-DNA 1B); values typically range from approximately -10.0 to 10.0.

    • last (array of objects, optional) — Embeddings from the last token position of the input DNA sequence:

      • embedding (array of floats, size: 2048) — Embedding vector from the final token; dimensions correspond to the model’s hidden size (e.g., 2048 for Omni-DNA 1B); values typically range from approximately -10.0 to 10.0.

Example response:

http
HTTP/1.1 200 OK
Content-Type: application/json

{
  "results": [
    {
      "mean": [
        {
          "embedding": [
            0.1327531933784485,
            -0.3218590319156647,
            "... (truncated for documentation)"
          ]
        }
      ]
    }
  ]
}
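
A common use of these embeddings is to compare sequences. The sketch below reads the mean embedding out of a parsed response and computes the cosine similarity between two sequences. It assumes the nesting shown in the example response (`results[i]["mean"][0]["embedding"]`). The helper names are hypothetical, and the three-dimensional vectors are toy values standing in for the real 2048-dimensional ones.

```python
import math


def mean_embedding(result):
    """Extract the mean embedding vector from one entry of `results`."""
    return result["mean"][0]["embedding"]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy response with two items; real embeddings have 2048 entries.
response_json = {
    "results": [
        {"mean": [{"embedding": [0.1, -0.3, 0.5]}]},
        {"mean": [{"embedding": [0.2, -0.1, 0.4]}]},
    ]
}
first, second = (mean_embedding(r) for r in response_json["results"])
print(cosine_similarity(first, second))
```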

Performance

  • Omni-DNA 1B processes sequences with a maximum length of 2048 nucleotides and a batch size of 2 sequences per request.

  • Inference runs on NVIDIA L4 GPUs with 16 GB memory, optimized for transformer-based genomic inference tasks.

  • Omni-DNA 1B achieves state-of-the-art predictive accuracy on genomic classification tasks, outperforming DNABERT-2 and Nucleotide Transformer (NT) models on 13 of the 18 benchmark tasks in the NT downstream evaluation.

  • Among the Omni-DNA variants (20M, 60M, 116M, 300M, 700M, 1B), Omni-DNA 1B performs best on the NT downstream tasks, with an average MCC of 0.767 compared to 0.755 for Omni-DNA 116M, the next-best variant.

  • Omni-DNA 1B leverages RoPE positional embeddings, which result in faster convergence and lower test losses compared to ALiBi embeddings, enhancing overall predictive accuracy.

  • Non-parametric LayerNorm used in Omni-DNA 1B maintains more stable training dynamics and higher downstream accuracy compared to RMSNorm used in smaller Omni-DNA variants.

  • Omni-DNA 1B is highly robust to distribution shifts caused by vocabulary expansion during finetuning, showing minimal performance degradation compared to smaller Omni-DNA models.

  • For embedding generation (encode requests), Omni-DNA 1B outputs embeddings as floating-point vectors representing either the mean or the last hidden state of the sequence, depending on user specification.

  • For log probability prediction (predict requests), Omni-DNA 1B returns one floating-point log probability per input DNA sequence, indicating how likely the model considers that sequence.
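
The overview reports benchmark performance as an average Matthews correlation coefficient (MCC). For readers less familiar with the metric, the sketch below computes MCC for a binary classifier from confusion-matrix counts using the standard formula. It is not the benchmark's evaluation code, and returning 0.0 when the denominator is zero is a common convention adopted here as an assumption.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from binary confusion-matrix counts; ranges from -1.0 to 1.0."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention (assumption): treat the undefined case as no correlation.
        return 0.0
    return numerator / denominator


print(matthews_corrcoef(tp=40, tn=45, fp=5, fn=10))
```

Unlike plain accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the classes are imbalanced.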

Applications

  • Identification and annotation of regulatory DNA elements in genomic sequences, enabling researchers to rapidly pinpoint promoters, enhancers, and splice sites critical for gene expression control; valuable for companies developing synthetic gene circuits or optimizing expression vectors, although less optimal for extremely short sequences (<50 nucleotides) due to reduced contextual information.

  • Simultaneous prediction of multiple epigenetic modifications (e.g., acetylation and methylation) from DNA sequence alone, streamlining epigenomic profiling for biotech firms developing epigenetic therapies or diagnostics; particularly useful when experimental assays are costly or impractical, though accuracy may vary for modifications with limited training data.

  • Cross-modal mapping of DNA sequences to functional textual annotations, allowing rapid functional characterization of novel or engineered genetic constructs; beneficial for synthetic biology companies designing DNA-based biosensors or genetic circuits, but may require additional experimental validation for highly novel or synthetic sequences.

  • DNA sequence-based generation of visual representations (e.g., motif visualization or regulatory pattern recognition), enabling intuitive interpretation of regulatory motifs and facilitating rapid communication of genomic insights within biotech R&D teams; particularly helpful in exploratory analyses, though not optimal for precise quantitative motif strength assessments.

  • Multi-task genomic modeling to classify and annotate DNA sequences across diverse tasks simultaneously, significantly reducing computational overhead and simplifying bioinformatics pipelines for biotech companies performing large-scale genomic screening or synthetic DNA library optimization; however, performance may degrade slightly compared to single-task specialized models when extremely high accuracy is required for a specific task.

Limitations

  • Maximum Sequence Length: The Omni-DNA 1B model supports a maximum sequence length of 2048 nucleotides. Users must ensure that input sequences do not exceed this limit to avoid errors.

  • Batch Size: The API supports a maximum batch size of 2 sequences per request. This constraint is crucial for maintaining efficient processing and resource management.

  • GPU Type: Omni-DNA 1B requires an L4 GPU for optimal performance. Users should verify that their infrastructure meets this requirement to ensure efficient execution.

  • Cross-Modal Limitations: While Omni-DNA exhibits strong cross-modal capabilities, it may not perform optimally for tasks requiring high-fidelity cross-modality translation, such as detailed DNA-to-image mappings beyond basic motifs.

  • Pretraining Corpus Limitations: The model’s pretraining corpus, while extensive, may not fully capture the diversity of regulatory elements across all species and tissue types, potentially impacting performance on less-represented genomic features.

  • Use Case Suitability: Omni-DNA is not ideal for tasks requiring long-range interaction predictions or variant effect predictions, as these applications may demand specialized models or additional fine-tuning beyond the current capabilities.
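
Inputs that exceed the sequence-length and batch-size limits listed above need to be split on the client. The sketch below cuts a long sequence into non-overlapping windows and groups the windows into request-sized batches. Both helpers are hypothetical, and non-overlapping windows are a simplifying assumption: context that spans a window boundary is lost, which compounds the long-range limitation noted above.

```python
MAX_SEQUENCE_LENGTH = 2048  # maximum nucleotides per sequence
MAX_BATCH_SIZE = 2          # maximum sequences per request


def split_into_windows(sequence, window=MAX_SEQUENCE_LENGTH):
    """Cut a sequence into consecutive, non-overlapping windows."""
    return [sequence[i:i + window] for i in range(0, len(sequence), window)]


def batched(sequences, batch_size=MAX_BATCH_SIZE):
    """Yield consecutive groups of at most `batch_size` sequences."""
    for i in range(0, len(sequences), batch_size):
        yield sequences[i:i + batch_size]


long_sequence = "ACGT" * 1250  # 5000 nucleotides
for batch in batched(split_into_windows(long_sequence)):
    items = [{"sequence": window} for window in batch]
    # Each `items` list fits within the limits of a single request.
    print([len(item["sequence"]) for item in items])
```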

How We Use It

Omni-DNA 1B enables integrated genomic analysis within BioLM’s protein engineering workflows, accelerating the identification and validation of functional genetic elements. By providing consistent DNA-to-function annotation and multi-task genomic modeling through scalable API access, Omni-DNA 1B streamlines the discovery of regulatory motifs, promoter regions, and functional sequence annotations. When combined with BioLM’s predictive modeling and generative AI algorithms, Omni-DNA 1B supports efficient screening and prioritization of engineered protein candidates, directly improving experimental outcomes and reducing development timelines.

  • Integrates seamlessly with predictive and generative modeling workflows to inform protein design decisions.

  • Accelerates research by enabling simultaneous annotation, classification, and prioritization of genomic sequences through standardized API interfaces.
