DNABERT-2 is a GPU-accelerated, Transformer-based foundation model for genome sequence analysis. It uses Byte Pair Encoding (BPE) tokenization and Attention with Linear Biases (ALiBi) to process long, multi-species DNA sequences efficiently. Compared to the Nucleotide Transformer (NT-2500M-multi), DNABERT-2 achieves comparable accuracy with approximately 21× fewer parameters and 56× less GPU time during pretraining. Supported tasks include transcription factor binding prediction, epigenetic mark classification, and splice site detection.

Predict

Predict log probabilities for input DNA sequences

python
from biolmai import BioLM
response = BioLM(
    entity="dnabert2",
    action="predict",
    params={},
    items=[
      {
        "sequence": "ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"
      },
      {
        "sequence": "GGAAAACCCCACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"
      }
    ]
)
print(response)

bash
curl -X POST https://biolm.ai/api/v3/dnabert2/predict/ \
  -H "Authorization: Token YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "items": [
    {
      "sequence": "ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"
    },
    {
      "sequence": "GGAAAACCCCACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"
    }
  ],
  "params": {}
}'

python
import requests

url = "https://biolm.ai/api/v3/dnabert2/predict/"
headers = {
    "Authorization": "Token YOUR_API_KEY",
    "Content-Type": "application/json"
}
payload = {
    "items": [
        {
            "sequence": "ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"
        },
        {
            "sequence": "GGAAAACCCCACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"
        }
    ],
    "params": {}
}

response = requests.post(url, headers=headers, json=payload)
print(response.json())

r
library(httr)

url <- "https://biolm.ai/api/v3/dnabert2/predict/"
headers <- c("Authorization" = "Token YOUR_API_KEY", "Content-Type" = "application/json")
body <- list(
  items = list(
    list(
      sequence = "ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"
    ),
    list(
      sequence = "GGAAAACCCCACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"
    )
  ),
  params = structure(list(), names = character(0))  # named empty list serializes as {} instead of []
)

res <- POST(url, add_headers(.headers = headers), body = body, encode = "json")
print(content(res))

POST /api/v3/dnabert2/predict/

Predict endpoint for DNABERT-2.

Request Headers:

  • Authorization: Token YOUR_API_KEY

  • Content-Type: application/json

Request

  • params (object, optional) — Configuration parameters:

    • batch_size (int, default: 10) — Maximum number of sequences per request batch

    • max_sequence_len (int, default: 2048) — Maximum allowed length of DNA sequence

  • items (array of objects, min: 1, max: 10) — Input DNA sequences:

    • sequence (string, min length: 1, max length: 2048, required) — DNA sequence containing only unambiguous nucleotide bases (A, T, C, G)
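The constraints listed above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the BioLM SDK: the function name `validate_items` and the constants are hypothetical, and the limits are copied from the schema above. It assumes uppercase input, since this page does not say whether lowercase bases are accepted.

```python
MAX_ITEMS = 10           # items: min 1, max 10 (from the schema above)
MAX_SEQUENCE_LEN = 2048  # sequence: min length 1, max length 2048
ALLOWED_BASES = frozenset("ACGT")  # assumption: uppercase only


def validate_items(items):
    """Return a list of problems; an empty list means the items look valid."""
    problems = []
    if not 1 <= len(items) <= MAX_ITEMS:
        problems.append(f"expected 1 to {MAX_ITEMS} items, got {len(items)}")
    for index, item in enumerate(items):
        sequence = item.get("sequence")
        if not isinstance(sequence, str) or not sequence:
            problems.append(f"item {index}: 'sequence' must be a non-empty string")
            continue
        if len(sequence) > MAX_SEQUENCE_LEN:
            problems.append(
                f"item {index}: length {len(sequence)} exceeds {MAX_SEQUENCE_LEN}"
            )
        invalid = sorted(set(sequence) - ALLOWED_BASES)
        if invalid:
            problems.append(f"item {index}: invalid characters {invalid}")
    return problems
```

Running `validate_items(payload["items"])` before posting makes it possible to report all problems at once instead of relying on the first server-side error.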

Example request:

http
POST /api/v3/dnabert2/predict/ HTTP/1.1
Host: biolm.ai
Authorization: Token YOUR_API_KEY
Content-Type: application/json

{
  "items": [
    {
      "sequence": "ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"
    },
    {
      "sequence": "GGAAAACCCCACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"
    }
  ],
  "params": {}
}

Response

  • results (array of objects) — One result per input item, in the order requested:

    • log_prob (float, range: negative infinity to 0.0) — Predicted log-probability of the input DNA sequence

Example response:

http
HTTP/1.1 200 OK
Content-Type: application/json

{
  "results": [
    {
      "log_prob": -20.883956909179688
    },
    {
      "log_prob": -37.712059020996094
    }
  ]
}
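Because results come back in request order, each `log_prob` can be paired with the sequence that produced it. The sketch below shows one way to do that; the helper name `rank_by_log_prob` is hypothetical, and the per-nucleotide average is an illustrative normalization that this page does not prescribe.

```python
def rank_by_log_prob(sequences, response):
    """Pair each input sequence with its log_prob, most likely first."""
    results = response["results"]
    if len(results) != len(sequences):
        raise ValueError("expected one result per input sequence")
    scored = [
        {
            "sequence": seq,
            "log_prob": res["log_prob"],
            # Assumption: divide by length to compare sequences of different sizes.
            "log_prob_per_nt": res["log_prob"] / len(seq),
        }
        for seq, res in zip(sequences, results)
    ]
    return sorted(scored, key=lambda row: row["log_prob"], reverse=True)
```

Applied to the example response above, the first input would be listed ahead of the second, since -20.88 is greater than -37.71.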

Encode

Generate embeddings for input DNA sequences

python
from biolmai import BioLM
response = BioLM(
    entity="dnabert2",
    action="encode",
    params={},
    items=[
      {
        "sequence": "ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"
      },
      {
        "sequence": "GGAAAACCCCACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"
      }
    ]
)
print(response)

bash
curl -X POST https://biolm.ai/api/v3/dnabert2/encode/ \
  -H "Authorization: Token YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "items": [
    {
      "sequence": "ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"
    },
    {
      "sequence": "GGAAAACCCCACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"
    }
  ],
  "params": {}
}'

python
import requests

url = "https://biolm.ai/api/v3/dnabert2/encode/"
headers = {
    "Authorization": "Token YOUR_API_KEY",
    "Content-Type": "application/json"
}
payload = {
    "items": [
        {
            "sequence": "ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"
        },
        {
            "sequence": "GGAAAACCCCACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"
        }
    ],
    "params": {}
}

response = requests.post(url, headers=headers, json=payload)
print(response.json())

r
library(httr)

url <- "https://biolm.ai/api/v3/dnabert2/encode/"
headers <- c("Authorization" = "Token YOUR_API_KEY", "Content-Type" = "application/json")
body <- list(
  items = list(
    list(
      sequence = "ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"
    ),
    list(
      sequence = "GGAAAACCCCACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"
    )
  ),
  params = structure(list(), names = character(0))  # named empty list serializes as {} instead of []
)

res <- POST(url, add_headers(.headers = headers), body = body, encode = "json")
print(content(res))

POST /api/v3/dnabert2/encode/

Encode endpoint for DNABERT-2.

Request Headers:

  • Authorization: Token YOUR_API_KEY

  • Content-Type: application/json

Request

  • params (object, optional) — Configuration parameters:

    • batch_size (int, default: 10) — Maximum number of sequences per request (max: 10)

    • max_sequence_len (int, default: 2048) — Maximum allowed length of DNA sequence (max: 2048)

  • items (array of objects, min: 1, max: 10) — DNA sequences to encode:

    • sequence (string, required, min length: 1, max length: 2048) — DNA sequence containing only unambiguous nucleotides (A, T, C, G)

Example request:

http
POST /api/v3/dnabert2/encode/ HTTP/1.1
Host: biolm.ai
Authorization: Token YOUR_API_KEY
Content-Type: application/json

{
  "items": [
    {
      "sequence": "ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"
    },
    {
      "sequence": "GGAAAACCCCACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"
    }
  ],
  "params": {}
}

Response

  • results (array of objects) — One result per input item, in the order requested:

    • embedding (array of floats, size: 768) — DNABERT-2 embedding vector representing the input DNA sequence

Example response:

http
HTTP/1.1 200 OK
Content-Type: application/json

{
  "results": [
    {
      "embedding": [
        -0.38167357444763184,
        -0.020171739161014557,
        "... (truncated for documentation)"
      ]
    },
    {
      "embedding": [
        -0.3801211416721344,
        -0.01748736947774887,
        "... (truncated for documentation)"
      ]
    }
  ]
}
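One common use of the returned vectors is comparing sequences by cosine similarity. The sketch below is an illustration rather than a BioLM-provided utility; `cosine_similarity` and `pairwise_similarity` are hypothetical helpers that operate on a parsed encode response with full, untruncated embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def pairwise_similarity(response):
    """Compare every pair of embeddings in an encode response."""
    vectors = [result["embedding"] for result in response["results"]]
    return {
        (i, j): cosine_similarity(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    }
```

The keys of the returned dictionary are index pairs into the original `items` array, which works because results are returned in request order.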

Performance

  • DNABERT-2 inference is GPU-accelerated on NVIDIA T4 GPUs, optimized specifically for efficient handling of DNA sequence embeddings and predictive tasks.

  • DNABERT-2 significantly outperforms the original DNABERT across multiple genomic classification benchmarks, achieving an average absolute score improvement of 6 points on the Genome Understanding Evaluation (GUE) benchmark.

  • Compared to DNABERT, DNABERT-2 delivers approximately 3× faster inference due to its adoption of Byte Pair Encoding (BPE) tokenization, Attention with Linear Biases (ALiBi), and FlashAttention optimizations.

  • DNABERT-2 achieves comparable predictive accuracy to the Nucleotide Transformer (NT-2500M-multi), but with approximately 21× fewer parameters and 56× lower GPU-time requirements during pre-training, making it significantly more efficient for rapid inference and fine-tuning.

  • The model architecture leverages FlashAttention, an IO-aware attention mechanism, reducing GPU memory usage and computational overhead, resulting in faster inference times compared to DNABERT and Nucleotide Transformer variants.

  • DNABERT-2’s use of ALiBi positional encoding allows it to effectively handle longer genomic sequences without performance degradation, unlike DNABERT which is limited by learned positional embeddings.

  • DNABERT-2’s optimized tokenization reduces sequence length by approximately 5× compared to overlapping k-mer tokenization used by DNABERT, significantly improving computational efficiency and throughput.

  • BioLM’s deployment of DNABERT-2 incorporates low-precision layer normalization and memory-efficient inference strategies, further enhancing inference speed and throughput on GPU hardware.
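To make the ALiBi point above concrete, the sketch below builds the linear distance penalty that replaces learned positional embeddings. It is a simplified illustration under stated assumptions, not DNABERT-2's actual code: it uses the geometric slope schedule from the ALiBi paper for power-of-two head counts, a symmetric penalty on |i - j| as would suit a bidirectional encoder, and plain Python lists. The function names are hypothetical.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: a geometric sequence starting at 2 ** (-8 / num_heads)."""
    if num_heads < 1 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    ratio = 2.0 ** (-8.0 / num_heads)
    return [ratio ** (head + 1) for head in range(num_heads)]


def alibi_bias(seq_len, slope):
    """Bias added to attention scores: 0 on the diagonal, more negative with distance."""
    return [[slope * -abs(i - j) for j in range(seq_len)] for i in range(seq_len)]
```

Because the bias depends only on the distance between positions, the same function produces a valid matrix for any `seq_len`, whereas a learned positional table is fixed at the length used in training.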

Applications

  • Identification and classification of transcription factor binding sites in genomic sequences, enabling researchers to accurately map regulatory regions and understand gene expression patterns; valuable for companies developing targeted gene therapies or synthetic biology constructs, though performance may decrease with very short sequences (<100 bp).

  • Rapid detection and differentiation of viral genome variants, such as SARS-CoV-2 strains, allowing biotech firms to track viral evolution, inform vaccine updates, and monitor emerging infectious threats; however, accuracy may be limited if input sequences contain ambiguous nucleotides or sequencing errors.

  • Precise prediction of promoter and core promoter regions in human genomes, facilitating the design of synthetic gene circuits and expression vectors by accurately locating transcription initiation sites; particularly useful in optimizing gene expression for biomanufacturing applications, although less optimal for organisms with significantly divergent promoter architectures.

  • High-throughput identification of splice donor and acceptor sites in human DNA sequences, enabling accurate annotation of alternative splicing events; essential for biotech companies developing RNA-based therapeutics or gene-editing technologies, but performance may degrade when analyzing non-canonical or highly mutated splice sites.

  • Genome-wide prediction of epigenetic modifications, such as histone marks, in yeast and related organisms, assisting synthetic biology companies in engineering yeast strains with desired gene expression profiles for bioproduction; however, predictions may not directly translate to more complex eukaryotic systems.

Limitations

  • Maximum Sequence Length: Input sequences are limited to 2048 nucleotides. Longer genomic sequences must be split into smaller segments before submission.

  • Batch Size: The maximum number of sequences per request is 10. For larger datasets, split requests into multiple batches.

  • DNABERT-2 uses Byte Pair Encoding (BPE) tokenization, which can lose subtle positional information critical for tasks involving very short sequences (e.g., core promoter detection). For short sequences (<100 bp), consider alternative models or tokenization methods.

  • The model was pre-trained on multi-species genome data and may not perform optimally for highly specialized tasks requiring fine-grained species-specific genomic context.

  • DNABERT-2 is designed primarily for DNA sequence embedding and masked language modeling. It is not optimal for tasks requiring generative sequence design or causal prediction.

  • GPU Type: Inference runs on a T4 GPU, suitable for typical embedding and prediction tasks. For large-scale inference or ultra-low latency requirements, consider models optimized for higher-performance GPUs.
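The sequence-length and batch-size limits above can be handled with a small amount of client-side preparation. The following is a minimal sketch, assuming non-overlapping segments are acceptable for the task; the helper names `split_sequence` and `build_payloads` are hypothetical and not part of the BioLM SDK.

```python
MAX_SEQUENCE_LEN = 2048  # per-sequence limit stated above
MAX_BATCH_SIZE = 10      # per-request item limit stated above


def split_sequence(sequence, max_len=MAX_SEQUENCE_LEN):
    """Cut a sequence into consecutive, non-overlapping segments of at most max_len."""
    return [sequence[start:start + max_len] for start in range(0, len(sequence), max_len)]


def build_payloads(sequences, max_len=MAX_SEQUENCE_LEN, batch_size=MAX_BATCH_SIZE):
    """Yield request bodies, each holding at most batch_size segments."""
    segments = [seg for seq in sequences for seg in split_sequence(seq, max_len)]
    for start in range(0, len(segments), batch_size):
        batch = segments[start:start + batch_size]
        yield {"items": [{"sequence": seg} for seg in batch], "params": {}}
```

Each yielded dictionary can be posted as the JSON body of a predict or encode request. Non-overlapping cuts discard context at segment boundaries, so tasks that are sensitive to boundary effects may need overlapping windows instead; mapping segments back to their source sequences is left to the caller.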

How We Use It

DNABERT-2 enables efficient and accurate modeling of DNA sequence data within BioLM’s protein engineering workflows, providing sequence embeddings and predictive insights that accelerate the identification and prioritization of genomic features relevant to enzyme and antibody design. The algorithm integrates seamlessly with BioLM’s generative models and downstream analysis tools, allowing researchers to rapidly assess biological sequences and inform experimental decisions.

  • Accelerates identification of key genomic regions influencing protein expression and function

  • Integrates smoothly with BioLM predictive modeling services to streamline multi-round protein optimization

References