IgBert Paired is an antibody-specific language model trained on more than two million naturally paired variable-region antibody sequences from the Observed Antibody Space (OAS) database. It uses a BERT architecture that was first pretrained on unpaired antibody sequences and captures cross-chain interactions between heavy and light chains, which improves prediction accuracy for antibody binding affinity, antigen specificity, and sequence recovery tasks. The API supports GPU-accelerated embedding extraction and sequence inference for antibody engineering, affinity maturation, and therapeutic antibody development workflows.
Predict
Predict log probabilities (or other scores) for paired sequences
- POST /api/v3/igbert-paired/predict/
Predict endpoint for IgBert Paired.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, optional) — Configuration parameters:
include (array of strings, default: [“mean”]) — Embedding types to return; allowed values: “mean”, “residue”, “logits”
items (array of objects, min: 1, max: 32) — Input antibody sequences:
heavy (string, optional, min length: 1, max length: 256) — Heavy chain amino acid sequence (standard amino acid codes plus allowed extended characters)
light (string, optional, min length: 1, max length: 256) — Light chain amino acid sequence (standard amino acid codes plus allowed extended characters)
sequence (string, optional, min length: 1, max length: 512) — Unpaired amino acid sequence (standard amino acid codes plus allowed extended characters)
Example request:
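No concrete payload is shown here. As a minimal sketch, assuming the request body is a JSON object with the `params` and `items` fields described above, a paired input could be assembled as follows. The helper name `build_predict_payload` is hypothetical, and the sequences are short illustrative fragments rather than real antibodies.

```python
import json


def build_predict_payload(pairs, include=("logits",)):
    """Assemble a request body from (heavy, light) tuples.

    Field names follow the schema described above; the helper itself is a
    hypothetical convenience, not part of any official client.
    """
    if not 1 <= len(pairs) <= 32:
        raise ValueError("items must contain between 1 and 32 entries")
    return {
        "params": {"include": list(include)},
        "items": [{"heavy": heavy, "light": light} for heavy, light in pairs],
    }


# Illustrative fragments only (assumed inputs, not real antibodies).
payload = build_predict_payload([("EVQLVESGGGLVQPGG", "DIQMTQSPSSLSASVG")])
body = json.dumps(payload)  # JSON string to send as the POST body
```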
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested:
embeddings (array of floats, size: variable) — Mean embeddings of the sequence
residue_embeddings (array of arrays of floats, dimensions: [sequence_length][embedding_size]) — Per-residue embeddings
logits (array of arrays of floats, dimensions: [sequence_length][vocabulary_size]) — Logits for each residue position
Example response:
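No concrete response is shown here. Because the documented output field is raw `logits`, log probabilities would have to be derived on the client side. The sketch below applies a numerically stable log-softmax to each position, assuming each row holds one score per vocabulary token. The mapping from column index to amino acid is not specified in this section and would have to come from the tokenizer; the numbers are made up.

```python
import math


def log_softmax(row):
    """Convert one row of logits into log probabilities (numerically stable)."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def logits_to_log_probs(logits):
    """Apply log-softmax independently to every residue position."""
    return [log_softmax(row) for row in logits]


# Toy input: 2 positions, 3 vocabulary tokens (made-up values).
toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
log_probs = logits_to_log_probs(toy_logits)
```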
Encode
Generate embeddings (mean, per-residue, etc.) from paired sequences
- POST /api/v3/igbert-paired/encode/
Encode endpoint for IgBert Paired.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, optional) — Configuration parameters:
include (array of strings, default: [“mean”]) — Output types to include: “mean”, “residue”, or “logits”
items (array of objects, max: 32) — Input sequences:
heavy (string, min length: 1, max length: 256, optional) — Heavy chain sequence using standard amino acid codes
light (string, min length: 1, max length: 256, optional) — Light chain sequence using standard amino acid codes
sequence (string, min length: 1, max length: 512, optional) — Unpaired sequence using standard amino acid codes
Example request:
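No concrete request is shown here. The sketch below builds, but does not send, a POST request for this endpoint using only the Python standard library. The host is not stated in this section, so `base_url` is an assumption supplied by the caller, and `make_encode_request` is a hypothetical helper name.

```python
import json
import urllib.request


def make_encode_request(base_url, api_key, items, include=("mean",)):
    """Build (without sending) a POST request for the encode endpoint.

    base_url is an assumption: the API host is not given in this section.
    """
    body = json.dumps({"params": {"include": list(include)}, "items": items})
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v3/igbert-paired/encode/",
        data=body.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Token {api_key}",
        },
        method="POST",
    )


req = make_encode_request(
    "https://example.invalid",  # placeholder host (assumption)
    "YOUR_API_KEY",
    [{"heavy": "EVQLVESGGGLVQPGG", "light": "DIQMTQSPSSLSASVG"}],
)
```

Sending the request would then be a matter of passing `req` to `urllib.request.urlopen` and decoding the JSON body of the response.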
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested:
embeddings (array of floats, size: 1024, optional) — Mean embedding vector representing the input antibody sequence.
residue_embeddings (array of arrays of floats, shape: [sequence_length, 1024], optional) — Embedding vectors for each residue position in the input antibody sequence.
logits (array of arrays of floats, shape: [sequence_length, vocab_size], optional) — Raw prediction logits for each residue position; vocab_size corresponds to the tokenizer vocabulary (standard amino acids plus special tokens).
Example response:
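No concrete response is shown here. As a sketch of how the mean embeddings might be consumed, the code below pulls the `embeddings` vector out of each result and compares two antibodies by cosine similarity. The toy response uses 4 dimensions instead of 1024 to stay readable, and its values are made up; only the `results` and `embeddings` field names come from the schema above.

```python
import math


def mean_embeddings(response):
    """Collect the mean embedding vector of every result, in request order."""
    return [result["embeddings"] for result in response["results"]]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


# Toy response (made-up values, 4 dimensions instead of 1024).
toy_response = {
    "results": [
        {"embeddings": [1.0, 0.0, 0.0, 0.0]},
        {"embeddings": [1.0, 1.0, 0.0, 0.0]},
    ]
}
vectors = mean_embeddings(toy_response)
similarity = cosine_similarity(vectors[0], vectors[1])
```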
Generate
Generate new sequences from paired inputs containing mask placeholders (*)
- POST /api/v3/igbert-paired/generate/
Generate endpoint for IgBert Paired.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, optional) — Configuration parameters:
include (array of strings, default: [“mean”]) — Embedding types to return; allowed values: “mean”, “residue”, “logits”
items (array of objects, min: 1, max: 32) — Input sequences for generation:
heavy (string, optional, min length: 1, max length: 256) — Heavy-chain amino acid sequence containing standard amino acids and at least one “*” placeholder
light (string, optional, min length: 1, max length: 256) — Light-chain amino acid sequence containing standard amino acids and at least one “*” placeholder
sequence (string, optional, min length: 1, max length: 512) — Unpaired amino acid sequence containing standard amino acids and at least one “*” placeholder
Exactly one of the following must be provided:
- Both heavy and light
- sequence alone
The “*” placeholder must appear at least once in the provided sequence(s).
Example request:
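No concrete request is shown here. The input rules above can be checked on the client before a request is sent. The sketch below is one reading of those rules: it requires either both chains or `sequence` alone, applies the documented length limits, and requires at least one "*" across the provided sequence(s). The function name `validate_generate_item` is hypothetical, and the server may enforce the constraints differently.

```python
def validate_generate_item(item):
    """Raise ValueError if an item breaks the documented input constraints."""
    heavy = item.get("heavy")
    light = item.get("light")
    sequence = item.get("sequence")

    paired = heavy is not None and light is not None
    unpaired = sequence is not None
    # Reject: nothing given, everything given, or a lone or mixed chain.
    if paired == unpaired or (heavy is None) != (light is None):
        raise ValueError("provide either both heavy and light, or sequence alone")

    provided = {"heavy": heavy, "light": light} if paired else {"sequence": sequence}
    for name, value in provided.items():
        limit = 512 if name == "sequence" else 256
        if not 1 <= len(value) <= limit:
            raise ValueError(f"{name} length must be between 1 and {limit}")

    if not any("*" in value for value in provided.values()):
        raise ValueError('at least one "*" placeholder is required')


# Illustrative fragments only (assumed inputs, not real antibodies).
validate_generate_item({"heavy": "EVQLVESGGG*VQPGG", "light": "DIQMTQSPSSLSASVG"})
validate_generate_item({"sequence": "EVQ*VES*GG"})
```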
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested:
heavy (string, optional) — Generated heavy chain amino acid sequence; length: 1–256 residues, standard amino acid single-letter codes (20 standard amino acids)
light (string, optional) — Generated light chain amino acid sequence; length: 1–256 residues, standard amino acid single-letter codes (20 standard amino acids)
sequence (string, optional) — Generated unpaired amino acid sequence; length: 1–512 residues, standard amino acid single-letter codes (20 standard amino acids)
Example response:
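No concrete response is shown here. One way to inspect a result is to line the generated sequence up against the masked input and list what was filled in. The sketch below assumes that each "*" is replaced by exactly one residue and that unmasked positions are returned unchanged; neither behavior is confirmed in this section, so the function raises if the assumption does not hold. The example strings are made up.

```python
def filled_positions(masked, generated):
    """Return (index, residue) pairs for every "*" in the masked input.

    Assumes one generated residue per mask and unchanged unmasked positions.
    """
    if len(masked) != len(generated):
        raise ValueError("lengths differ; one-residue-per-mask assumption fails")
    fills = []
    for index, (m, g) in enumerate(zip(masked, generated)):
        if m == "*":
            fills.append((index, g))
        elif m != g:
            raise ValueError(f"unmasked position {index} changed: {m!r} -> {g!r}")
    return fills


# Made-up input and output strings for illustration.
fills = filled_positions("EVQ*VES*GG", "EVQLVESGGG")
```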
Performance
IgBert Paired inference runs on NVIDIA T4 GPUs, optimized for efficient GPU-accelerated transformer inference.
Typical inference completion time is approximately 2-3 seconds per batch of up to 32 paired antibody sequences.
IgBert Paired achieves superior predictive accuracy on antibody-specific tasks compared to general protein language models (e.g., ProtBert, ProtT5), particularly in complementarity-determining regions (CDRs) critical for antigen binding specificity.
IgBert Paired correctly recovers approximately 60% of residues in the challenging CDRH3 loop, outperforming general protein models (ProtBert: ~28%, ProtT5: ~34%) and antibody-specific models trained on unpaired sequences (IgBert Unpaired: ~59%).
Across all CDR loops, IgBert Paired consistently demonstrates higher sequence recovery accuracy than IgBert Unpaired, highlighting the value of paired heavy-light chain training data in capturing cross-chain interactions.
IgBert Paired embeddings yield higher correlation with experimentally measured antibody-antigen binding affinities (Pearson correlation up to 0.64) compared to IgBert Unpaired (up to 0.55) and general protein models (ProtBert: up to 0.47, ProtT5: up to 0.56).
IgBert Paired achieves a lower pseudo-perplexity (1.025) on antibody sequences than IgBert Unpaired (1.031), AbLang (1.109), AntiBERTy (1.161), and general protein models (ProtBert: 1.326, ProtT5: 1.349), indicating a closer fit to antibody sequence distributions. The margin over IgBert Unpaired is small; the gap to the general protein models is much larger.
BioLM has optimized IgBert Paired deployment through DeepSpeed ZeRO Stage 2 distributed training and inference strategies, enabling efficient scaling and rapid inference performance on hosted GPUs.
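Two of the metrics quoted above are simple to state in code. The sketch below uses the usual masked-language-model definition of pseudo-perplexity (the exponential of the negative mean log probability assigned to the true residue when each position is masked in turn) and defines sequence recovery as the fraction of positions where the prediction matches the native residue. The exact evaluation protocol behind the reported figures is not described in this section, so this is an illustration of the metrics rather than a reproduction of the benchmark.

```python
import math


def pseudo_perplexity(true_log_probs):
    """exp(-mean log p) over the true residue at each masked position."""
    if not true_log_probs:
        raise ValueError("at least one position is required")
    return math.exp(-sum(true_log_probs) / len(true_log_probs))


def recovery_rate(native, predicted):
    """Fraction of positions where the predicted residue equals the native one."""
    if not native or len(native) != len(predicted):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(native, predicted))
    return matches / len(native)


# Made-up values: every true residue gets probability 0.5.
ppl = pseudo_perplexity([math.log(0.5)] * 4)
# Made-up loop fragment with one mismatch out of five positions.
rate = recovery_rate("ARDYW", "ARGYW")
```

A pseudo-perplexity of 1.0 would mean the model assigns probability 1 to every true residue, which is why values close to 1 indicate a good fit.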
Applications
Antibody affinity maturation by predicting beneficial mutations in paired heavy and light chain variable regions, which lets researchers rapidly identify candidate sequences with improved antigen-binding properties. This is particularly valuable for therapeutic antibody development, where experimental screening is costly and time-consuming. The model is not optimal for predicting antibody expression levels or stability, which require additional experimental validation.
Identification of antibody sequences with high naturalness scores (low pseudo-perplexity), which lets researchers and biotech companies filter out sequences likely to have developability or immunogenicity issues. This is useful in early-stage antibody discovery pipelines to reduce downstream attrition. Predictions are sequence-based only and should be complemented with structural or experimental assays for a comprehensive developability assessment.
Generation of contextualized embeddings for antibody variable region sequences, which facilitates clustering and visualization of antibody repertoires. This is valuable for repertoire mining, where rare or novel antibody variants must be identified in large-scale next-generation sequencing datasets. Embeddings alone do not provide functional annotations, so further experimental characterization is necessary to confirm antigen specificity or functional properties.
Rapid in silico screening of antibody libraries for predicted antigen-binding affinity, which lets biotech companies prioritize sequences for experimental validation. This is particularly useful in large-scale antibody discovery campaigns to reduce experimental workload and costs. Predictions are most reliable for antibodies similar to sequences in the training dataset and may be less accurate for highly divergent or synthetic sequences.
Prediction and restoration of missing residues in antibody variable region sequences, which lets biotech companies reconstruct incomplete sequences that result from sequencing errors or low-quality data. This improves data quality in antibody repertoire analysis and supports accurate downstream analyses. It is not suitable for predicting large missing regions or entire variable domains, as accuracy decreases significantly with increasing gap size.
Limitations
Maximum Sequence Length: The IgBert API accepts up to 256 residues per chain for paired inputs (heavy and light) and up to 512 residues for unpaired inputs (sequence). Inputs that exceed these limits are rejected, so check sequence lengths before submitting.
Batch Size: The API can process up to 32 items per request. Exceeding this limit will result in a request failure. Plan your requests accordingly to fit within this batch size constraint.
GPU Type: The IgBert models are optimized for the T4 GPU. While this provides efficient processing for many tasks, users with different GPU setups might experience varying performance.
Algorithmic Limitations: IgBert is specifically designed for antibody sequence analysis. It may not perform optimally on non-antibody protein sequences or tasks that require embeddings for broader protein families. Consider using general protein models for such tasks.
Use Case Suitability: IgBert excels in tasks like sequence recovery and affinity prediction within antibody engineering. However, it may not be the best choice for tasks requiring sequence generation or structural predictions outside the antibody domain.
Output Options: Users can specify the desired output format via the include parameter, selecting from mean embeddings, residue embeddings, or logits. Choose the appropriate option based on your analysis needs.
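The 32-item batch limit means larger datasets have to be split across several requests. A minimal sketch of that splitting step is shown below; `chunk_items` is a hypothetical helper, and how the resulting batches are sent (sequentially, in parallel, with retries) is left to the caller.

```python
def chunk_items(items, batch_size=32):
    """Split a list of input items into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# 70 placeholder items become batches of 32, 32 and 6.
items = [{"sequence": "EVQLVESGGG"} for _ in range(70)]
batches = chunk_items(items)
batch_sizes = [len(batch) for batch in batches]
```

Because results are returned in the order requested, concatenating the per-batch `results` arrays in batch order keeps them aligned with the original input list.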
How We Use It
The integration of IgBert Paired into BioLM’s platform significantly accelerates research in antibody engineering by enabling precise analysis of paired antibody sequences. This model, trained on extensive datasets of paired heavy and light chains, facilitates improved binding affinity predictions and sequence recovery tasks. It seamlessly integrates with BioLM’s suite of biological AI models, allowing researchers to leverage cross-chain features for enhanced protein design and optimization. IgBert Paired’s scalable API access ensures consistent and efficient incorporation into diverse biological workflows, enabling faster iteration and validation cycles for therapeutic development.
Enhances predictive accuracy for binding affinity and expression tasks through advanced sequence embeddings.
Integrates seamlessly with BioLM’s existing AI models to support comprehensive protein engineering workflows.
References
Dreyer, F. A., & Kenlay, H. (2024). Large scale paired antibody language models. Zenodo.
Olsen, T. H., Boyles, F., & Deane, C. M. (2022). Observed antibody space: A diverse database of cleaned, annotated, and translated unpaired and paired antibody sequences. Protein Science, 31(1), 141–146.
