DNABERT-2 is a GPU-accelerated Transformer-based foundation model for genome sequence analysis, utilizing Byte Pair Encoding (BPE) tokenization and Attention with Linear Biases (ALiBi) to efficiently process long, multi-species DNA sequences. Compared to previous methods, DNABERT-2 achieves comparable accuracy with approximately 21× fewer parameters and 56× less GPU time during pretraining, supporting tasks such as transcription factor binding prediction, epigenetic mark classification, and splice site detection.
Predict¶
Predict log probabilities for input DNA sequences
- POST /api/v3/dnabert2/predict/¶
Predict endpoint for DNABERT-2.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, optional) — Configuration parameters:
batch_size (int, default: 10) — Maximum number of sequences per request batch (max: 10)
max_sequence_len (int, default: 2048) — Maximum allowed length of DNA sequence (max: 2048)
items (array of objects, min: 1, max: 10) — Input DNA sequences:
sequence (string, min length: 1, max length: 2048, required) — DNA sequence containing only unambiguous nucleotide bases (A, T, C, G)
Example request:
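A representative request body, using the fields from the schema above (the sequence value is illustrative):

```json
{
  "params": {
    "batch_size": 10,
    "max_sequence_len": 2048
  },
  "items": [
    {"sequence": "ATCGGCTAAGCTTGGACCTAGGTACCATT"}
  ]
}
```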
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested:
log_prob (float, range: negative infinity to 0.0) — Predicted log-probability of the input DNA sequence
Example response:
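A representative response body matching the schema above (the log_prob value is illustrative):

```json
{
  "results": [
    {"log_prob": -42.7}
  ]
}
```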
Encode¶
Generate embeddings for input DNA sequences
- POST /api/v3/dnabert2/encode/¶
Encode endpoint for DNABERT-2.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, optional) — Configuration parameters:
batch_size (int, default: 10) — Maximum number of sequences per request (max: 10)
max_sequence_len (int, default: 2048) — Maximum allowed length of DNA sequence (max: 2048)
items (array of objects, min: 1, max: 10) — DNA sequences to encode:
sequence (string, required, min length: 1, max length: 2048) — DNA sequence containing only unambiguous nucleotides (A, T, C, G)
Example request:
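A representative request body, using the fields from the schema above (the sequence value is illustrative):

```json
{
  "params": {
    "batch_size": 10,
    "max_sequence_len": 2048
  },
  "items": [
    {"sequence": "ATCGGCTAAGCTTGGACCTAGGTACCATT"}
  ]
}
```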
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested:
embedding (array of floats, size: 768) — DNABERT-2 embedding vector representing the input DNA sequence
Example response:
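A representative response body matching the schema above (the embedding is truncated here for readability; the full vector contains 768 floats):

```json
{
  "results": [
    {
      "embedding": [0.0213, -0.117, 0.0842, 0.0056]
    }
  ]
}
```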
Performance¶
DNABERT-2 inference is GPU-accelerated on NVIDIA T4 GPUs, optimized specifically for efficient handling of DNA sequence embeddings and predictive tasks.
DNABERT-2 significantly outperforms the original DNABERT across multiple genomic classification benchmarks, achieving an average absolute score improvement of 6 points on the Genome Understanding Evaluation (GUE) benchmark.
Compared to DNABERT, DNABERT-2 delivers approximately 3× faster inference due to its adoption of Byte Pair Encoding (BPE) tokenization, Attention with Linear Biases (ALiBi), and FlashAttention optimizations.
DNABERT-2 achieves comparable predictive accuracy to the Nucleotide Transformer (NT-2500M-multi), but with approximately 21× fewer parameters and 56× lower GPU-time requirements during pre-training, making it significantly more efficient for rapid inference and fine-tuning.
The model architecture leverages FlashAttention, an IO-aware attention mechanism, reducing GPU memory usage and computational overhead, resulting in faster inference times compared to DNABERT and Nucleotide Transformer variants.
DNABERT-2’s use of ALiBi positional encoding allows it to effectively handle longer genomic sequences without performance degradation, unlike DNABERT which is limited by learned positional embeddings.
DNABERT-2’s optimized tokenization reduces sequence length by approximately 5× compared to overlapping k-mer tokenization used by DNABERT, significantly improving computational efficiency and throughput.
BioLM’s deployment of DNABERT-2 incorporates low-precision layer normalization and memory-efficient inference strategies, further enhancing inference speed and throughput on GPU hardware.
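The tokenization efficiency claim above can be illustrated with a toy comparison. This is a sketch, not DNABERT-2's actual tokenizer: overlapping k-mer tokenization emits one token per position, while BPE merges frequent substrings, so with an assumed average merged-token length of ~5 nt the token count drops roughly 5-fold:

```python
# Toy comparison of token counts: overlapping k-mers vs. BPE (illustrative only).
seq = "ATCG" * 128  # 512-nt toy sequence

# Overlapping k-mer tokenization (original DNABERT style, k=6):
# one token per sliding-window position -> len(seq) - k + 1 tokens.
k = 6
kmer_tokens = [seq[i:i + k] for i in range(len(seq) - k + 1)]

# BPE merges frequent substrings into single tokens; assuming an average
# merged-token length of ~5 nt, the same sequence needs ~len(seq)/5 tokens.
approx_bpe_token_count = len(seq) // 5

print(len(kmer_tokens), approx_bpe_token_count)  # 507 vs. ~102 tokens
```

The shorter token sequence is what drives the throughput gains: attention cost scales quadratically with token count, so a ~5× shorter input is substantially cheaper to process.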
Applications¶
Identification and classification of transcription factor binding sites in genomic sequences, enabling researchers to accurately map regulatory regions and understand gene expression patterns; valuable for companies developing targeted gene therapies or synthetic biology constructs, though performance may decrease with very short sequences (<100 bp).
Rapid detection and differentiation of viral genome variants, such as SARS-CoV-2 strains, allowing biotech firms to track viral evolution, inform vaccine updates, and monitor emerging infectious threats; however, accuracy may be limited if input sequences contain ambiguous nucleotides or sequencing errors.
Precise prediction of promoter and core promoter regions in human genomes, facilitating the design of synthetic gene circuits and expression vectors by accurately locating transcription initiation sites; particularly useful in optimizing gene expression for biomanufacturing applications, although less optimal for organisms with significantly divergent promoter architectures.
High-throughput identification of splice donor and acceptor sites in human DNA sequences, enabling accurate annotation of alternative splicing events; essential for biotech companies developing RNA-based therapeutics or gene-editing technologies, but performance may degrade when analyzing non-canonical or highly mutated splice sites.
Genome-wide prediction of epigenetic modifications, such as histone marks, in yeast and related organisms, assisting synthetic biology companies in engineering yeast strains with desired gene expression profiles for bioproduction; however, predictions may not directly translate to more complex eukaryotic systems.
Limitations¶
Maximum Sequence Length: Input sequences are limited to 2048 nucleotides. Longer genomic sequences must be split into smaller segments before submission.
Batch Size: The maximum number of sequences per request is 10. For larger datasets, split requests into multiple batches.
DNABERT-2 uses Byte Pair Encoding (BPE) tokenization, which can lose subtle positional information critical for tasks involving very short sequences (e.g., core promoter detection). For short sequences (<100 bp), consider alternative models or tokenization methods.
The model was pre-trained on multi-species genome data and may not perform optimally for highly specialized tasks requiring fine-grained species-specific genomic context.
DNABERT-2 is designed primarily for DNA sequence embedding and masked language modeling. It is not optimal for tasks requiring generative sequence design or causal prediction.
GPU Type: Inference runs on a T4 GPU, suitable for typical embedding and prediction tasks. For large-scale inference or ultra-low latency requirements, consider models optimized for higher-performance GPUs.
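The sequence-length and batch-size limits above imply a simple client-side chunking step before submission. This is a sketch (not an official client helper): it splits a long sequence into ≤2048-nt segments and groups the resulting items into request batches of ≤10:

```python
# Sketch: chunk a long DNA sequence to satisfy the documented API limits
# (max 2048 nt per sequence, max 10 items per request).
MAX_LEN = 2048
MAX_BATCH = 10

def make_batches(sequence: str) -> list[list[dict]]:
    """Split `sequence` into <=MAX_LEN segments, then group into <=MAX_BATCH item lists."""
    segments = [sequence[i:i + MAX_LEN] for i in range(0, len(sequence), MAX_LEN)]
    items = [{"sequence": s} for s in segments]
    return [items[i:i + MAX_BATCH] for i in range(0, len(items), MAX_BATCH)]

# A 24,000-nt toy sequence -> 12 segments -> 2 batches (10 items + 2 items).
batches = make_batches("ATCG" * 6000)
```

Each inner list can then be sent as the `items` field of one request; note that segmenting a sequence means each segment is embedded independently, so context spanning a cut point is lost.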
How We Use It¶
DNABERT-2 enables efficient and accurate modeling of DNA sequence data within BioLM’s protein engineering workflows, providing sequence embeddings and predictive insights that accelerate the identification and prioritization of genomic features relevant to enzyme and antibody design. The algorithm integrates seamlessly with BioLM’s generative models and downstream analysis tools, allowing researchers to rapidly assess biological sequences and inform experimental decisions.
Accelerates identification of key genomic regions influencing protein expression and function
Integrates smoothly with BioLM predictive modeling services to streamline multi-round protein optimization
References¶
Zhou, Z., Ji, Y., Li, W., Dutta, P., Davuluri, R., & Liu, H. (2023). DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome. bioRxiv.