DNABERT-2 is a GPU-accelerated Transformer-based foundation model for genome sequence analysis, utilizing Byte Pair Encoding (BPE) tokenization and Attention with Linear Biases (ALiBi) to efficiently process long, multi-species DNA sequences. Compared to previous methods, DNABERT-2 achieves comparable accuracy with approximately 21× fewer parameters and 56× less GPU time during pretraining, supporting tasks such as transcription factor binding prediction, epigenetic mark classification, and splice site detection.
Predict¶
Predict log probabilities for input DNA sequences
- POST /api/v3/dnabert2/predict/¶
Predict endpoint for DNABERT-2.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, optional) — Configuration parameters:
batch_size (int, default: 10) — Maximum number of sequences per request batch (max: 10)
max_sequence_len (int, default: 2048) — Maximum allowed length of DNA sequence (max: 2048)
items (array of objects, min: 1, max: 10) — Input DNA sequences:
sequence (string, min length: 1, max length: 2048, required) — DNA sequence containing only unambiguous nucleotide bases (A, T, C, G)
Example request:
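A representative request body, using the fields from the schema above (the sequence value is illustrative):

```json
{
  "params": {
    "batch_size": 10,
    "max_sequence_len": 2048
  },
  "items": [
    {"sequence": "ATCGGCTAAGCTTGGACCTAGGTACCATT"}
  ]
}
```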
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested:
log_prob (float, range: negative infinity to 0.0) — Predicted log-probability of the input DNA sequence
Example response:
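A representative response body matching the schema above (the log_prob value is illustrative):

```json
{
  "results": [
    {"log_prob": -42.7}
  ]
}
```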
Encode¶
Generate embeddings for input DNA sequences
- POST /api/v3/dnabert2/encode/¶
Encode endpoint for DNABERT-2.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, optional) — Configuration parameters:
batch_size (int, default: 10) — Maximum number of sequences per request (max: 10)
max_sequence_len (int, default: 2048) — Maximum allowed length of DNA sequence (max: 2048)
items (array of objects, min: 1, max: 10) — DNA sequences to encode:
sequence (string, required, min length: 1, max length: 2048) — DNA sequence containing only unambiguous nucleotides (A, T, C, G)
Example request:
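A representative request body, using the fields from the schema above (the sequence value is illustrative):

```json
{
  "params": {
    "batch_size": 10,
    "max_sequence_len": 2048
  },
  "items": [
    {"sequence": "ATCGGCTAAGCTTGGACCTAGGTACCATT"}
  ]
}
```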
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested:
embedding (array of floats, size: 768) — DNABERT-2 embedding vector representing the input DNA sequence
Example response:
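A representative response body matching the schema above (the embedding is truncated here for readability; the full vector contains 768 floats):

```json
{
  "results": [
    {
      "embedding": [0.0213, -0.117, 0.0842, 0.0056]
    }
  ]
}
```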
Performance¶
DNABERT-2 inference is GPU-accelerated on NVIDIA T4 GPUs, optimized specifically for efficient handling of DNA sequence embeddings and predictive tasks.
DNABERT-2 significantly outperforms the original DNABERT across multiple genomic classification benchmarks, achieving an average absolute score improvement of 6 points on the Genome Understanding Evaluation (GUE) benchmark.
Compared to DNABERT, DNABERT-2 delivers approximately 3× faster inference due to its adoption of Byte Pair Encoding (BPE) tokenization, Attention with Linear Biases (ALiBi), and FlashAttention optimizations.
DNABERT-2 achieves comparable predictive accuracy to the Nucleotide Transformer (NT-2500M-multi), but with approximately 21× fewer parameters and 56× lower GPU-time requirements during pre-training, making it significantly more efficient for rapid inference and fine-tuning.
The model architecture leverages FlashAttention, an IO-aware attention mechanism, reducing GPU memory usage and computational overhead, resulting in faster inference times compared to DNABERT and Nucleotide Transformer variants.
DNABERT-2’s use of ALiBi positional encoding allows it to effectively handle longer genomic sequences without performance degradation, unlike DNABERT which is limited by learned positional embeddings.
DNABERT-2’s optimized tokenization reduces sequence length by approximately 5× compared to overlapping k-mer tokenization used by DNABERT, significantly improving computational efficiency and throughput.
BioLM’s deployment of DNABERT-2 incorporates low-precision layer normalization and memory-efficient inference strategies, further enhancing inference speed and throughput on GPU hardware.
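The tokenization efficiency claim above can be illustrated with a toy comparison. This is a sketch, not DNABERT-2's actual tokenizer: overlapping k-mer tokenization emits one token per position, while BPE merges frequent substrings, so with an assumed average merged-token length of ~5 nt the token count drops roughly 5-fold:

```python
# Toy comparison of token counts: overlapping k-mers vs. BPE (illustrative only).
seq = "ATCG" * 128  # 512-nt toy sequence

# Overlapping k-mer tokenization (original DNABERT style, k=6):
# one token per sliding-window position -> len(seq) - k + 1 tokens.
k = 6
kmer_tokens = [seq[i:i + k] for i in range(len(seq) - k + 1)]

# BPE merges frequent substrings into single tokens; assuming an average
# merged-token length of ~5 nt, the same sequence needs ~len(seq)/5 tokens.
approx_bpe_token_count = len(seq) // 5

print(len(kmer_tokens), approx_bpe_token_count)  # 507 vs. ~102 tokens
```

The shorter token sequence is what drives the throughput gains: attention cost scales quadratically with token count, so a ~5× shorter input is substantially cheaper to process.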
Applications¶
Identification and classification of transcription factor binding sites in genomic sequences, enabling researchers to accurately map regulatory regions and understand gene expression patterns; valuable for companies developing targeted gene therapies or synthetic biology constructs, though performance may decrease with very short sequences (<100 bp).
Rapid detection and differentiation of viral genome variants, such as SARS-CoV-2 strains, allowing biotech firms to track viral evolution, inform vaccine updates, and monitor emerging infectious threats; however, accuracy may be limited if input sequences contain ambiguous nucleotides or sequencing errors.
Precise prediction of promoter and core promoter regions in human genomes, facilitating the design of synthetic gene circuits and expression vectors by accurately locating transcription initiation sites; particularly useful in optimizing gene expression for biomanufacturing applications, although less optimal for organisms with significantly divergent promoter architectures.
High-throughput identification of splice donor and acceptor sites in human DNA sequences, enabling accurate annotation of alternative splicing events; essential for biotech companies developing RNA-based therapeutics or gene-editing technologies, but performance may degrade when analyzing non-canonical or highly mutated splice sites.
Genome-wide prediction of epigenetic modifications, such as histone marks, in yeast and related organisms, assisting synthetic biology companies in engineering yeast strains with desired gene expression profiles for bioproduction; however, predictions may not directly translate to more complex eukaryotic systems.
Limitations¶
Maximum Sequence Length: Input sequences are limited to 2048 nucleotides. Longer genomic sequences must be split into smaller segments before submission.
Batch Size: The maximum number of sequences per request is 10. For larger datasets, split requests into multiple batches.
DNABERT-2 uses Byte Pair Encoding (BPE) tokenization, which can lose subtle positional information critical for tasks involving very short sequences (e.g., core promoter detection). For short sequences (<100 bp), consider alternative models or tokenization methods.
The model was pre-trained on multi-species genome data and may not perform optimally for highly specialized tasks requiring fine-grained species-specific genomic context.
DNABERT-2 is designed primarily for DNA sequence embedding and masked language modeling. It is not optimal for tasks requiring generative sequence design or causal prediction.
GPU Type: Inference runs on a T4 GPU, suitable for typical embedding and prediction tasks. For large-scale inference or ultra-low latency requirements, consider models optimized for higher-performance GPUs.
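The sequence-length and batch-size limits above imply a simple client-side chunking step before submission. This is a sketch (not an official client helper): it splits a long sequence into ≤2048-nt segments and groups the resulting items into request batches of ≤10:

```python
# Sketch: chunk a long DNA sequence to satisfy the documented API limits
# (max 2048 nt per sequence, max 10 items per request).
MAX_LEN = 2048
MAX_BATCH = 10

def make_batches(sequence: str) -> list[list[dict]]:
    """Split `sequence` into <=MAX_LEN segments, then group into <=MAX_BATCH item lists."""
    segments = [sequence[i:i + MAX_LEN] for i in range(0, len(sequence), MAX_LEN)]
    items = [{"sequence": s} for s in segments]
    return [items[i:i + MAX_BATCH] for i in range(0, len(items), MAX_BATCH)]

# A 24,000-nt toy sequence -> 12 segments -> 2 batches (10 items + 2 items).
batches = make_batches("ATCG" * 6000)
```

Each inner list can then be sent as the `items` field of one request; note that segmenting a sequence means each segment is embedded independently, so context spanning a cut point is lost.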
How We Use It¶
DNABERT-2 enables efficient and accurate modeling of DNA sequence data within BioLM’s protein engineering workflows, providing sequence embeddings and predictive insights that accelerate the identification and prioritization of genomic features relevant to enzyme and antibody design. The algorithm integrates seamlessly with BioLM’s generative models and downstream analysis tools, allowing researchers to rapidly assess biological sequences and inform experimental decisions.
Accelerates identification of key genomic regions influencing protein expression and function
Integrates smoothly with BioLM predictive modeling services to streamline multi-round protein optimization
References¶
Zhou, Z., Ji, Y., Li, W., Dutta, P., Davuluri, R., & Liu, H. (2023). DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome. bioRxiv.