IgT5 Unpaired is an antibody language model for the analysis and design of immunoglobulin variable region sequences, served with GPU acceleration. It is derived from the ProtT5 architecture and pre-trained on approximately 700 million unpaired antibody sequences from the Observed Antibody Space (OAS). The model supports embedding generation, residue masking and recovery, and representation extraction from single-chain sequences. The API service targets antibody discovery, affinity maturation, and repertoire analysis workflows.

Encode

Generate embeddings for input sequences

python
from biolmai import BioLM
response = BioLM(
    entity="igt5-unpaired",
    action="encode",
    params={
      "include": [
        "mean",
        "residue"
      ]
    },
    items=[
      {
        "sequence": "EVQLVESGGGLVQPGGSLR"
      },
      {
        "sequence": "DIVMTQSPDSLAVSLGERG"
      }
    ]
)
print(response)
bash
curl -X POST https://biolm.ai/api/v3/igt5-unpaired/encode/ \
  -H "Authorization: Token YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "params": {
    "include": [
      "mean",
      "residue"
    ]
  },
  "items": [
    {
      "sequence": "EVQLVESGGGLVQPGGSLR"
    },
    {
      "sequence": "DIVMTQSPDSLAVSLGERG"
    }
  ]
}'
python
import requests

url = "https://biolm.ai/api/v3/igt5-unpaired/encode/"
headers = {
    "Authorization": "Token YOUR_API_KEY",
    "Content-Type": "application/json"
}
payload = {
    "params": {
        "include": [
            "mean",
            "residue"
        ]
    },
    "items": [
        {
            "sequence": "EVQLVESGGGLVQPGGSLR"
        },
        {
            "sequence": "DIVMTQSPDSLAVSLGERG"
        }
    ]
}

response = requests.post(url, headers=headers, json=payload)
print(response.json())
r
library(httr)

url <- "https://biolm.ai/api/v3/igt5-unpaired/encode/"
headers <- c("Authorization" = "Token YOUR_API_KEY", "Content-Type" = "application/json")
body <- list(
  params = list(
    include = list(
      "mean",
      "residue"
    )
  ),
  items = list(
    list(
      sequence = "EVQLVESGGGLVQPGGSLR"
    ),
    list(
      sequence = "DIVMTQSPDSLAVSLGERG"
    )
  )
)

res <- POST(url, add_headers(.headers = headers), body = body, encode = "json")
print(content(res))
POST /api/v3/igt5-unpaired/encode/

Encode endpoint for IgT5 Unpaired.

Request Headers:

  • Authorization: Token YOUR_API_KEY

  • Content-Type: application/json

Request

  • params (object, optional) — Configuration parameters:

    • include (array of strings, default: [“mean”]) — Embedding types to include in response; allowed values: “mean”, “residue”

  • items (array of objects, min: 1, max: 8) — Input antibody sequences:

    • heavy (string, optional, min length: 1, max length: 256) — Heavy chain amino acid sequence using extended amino acid codes

    • light (string, optional, min length: 1, max length: 256) — Light chain amino acid sequence using extended amino acid codes

    • sequence (string, optional, min length: 1, max length: 512) — Unpaired amino acid sequence using extended amino acid codes
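The limits in the request schema above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the BioLM SDK: `build_encode_payload` is a hypothetical helper name, and it assumes every item uses only the `sequence` field. It does not validate the amino acid alphabet, because the accepted extended codes are not enumerated here.

```python
MAX_ITEMS = 8            # items per request, from the request schema
MAX_SEQUENCE_LEN = 512   # residues allowed in the `sequence` field
ALLOWED_INCLUDE = {"mean", "residue"}


def build_encode_payload(sequences, include=("mean",)):
    """Build a JSON-serializable encode request body (hypothetical helper)."""
    sequences = list(sequences)
    if not 1 <= len(sequences) <= MAX_ITEMS:
        raise ValueError(f"expected 1-{MAX_ITEMS} sequences, got {len(sequences)}")

    unknown = set(include) - ALLOWED_INCLUDE
    if unknown:
        raise ValueError(f"unsupported include values: {sorted(unknown)}")

    for index, seq in enumerate(sequences):
        if not 1 <= len(seq) <= MAX_SEQUENCE_LEN:
            raise ValueError(
                f"sequence {index} has length {len(seq)}; "
                f"allowed range is 1-{MAX_SEQUENCE_LEN}"
            )

    return {
        "params": {"include": list(include)},
        "items": [{"sequence": seq} for seq in sequences],
    }
```

The returned dictionary has the same shape as the request bodies shown in the examples and can be passed as the `json` argument of `requests.post`.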

Example request:

http
POST /api/v3/igt5-unpaired/encode/ HTTP/1.1
Host: biolm.ai
Authorization: Token YOUR_API_KEY
Content-Type: application/json

{
  "params": {
    "include": [
      "mean",
      "residue"
    ]
  },
  "items": [
    {
      "sequence": "EVQLVESGGGLVQPGGSLR"
    },
    {
      "sequence": "DIVMTQSPDSLAVSLGERG"
    }
  ]
}

Response

  • results (array of objects) — One result per input item, in the order requested:

    • embeddings (array of floats, size: 1024) — Mean embedding vector representing the entire antibody sequence; numerical values unbounded.

    • residue_embeddings (array of arrays of floats, shape: [sequence_length, 1024]) — Embedding vectors for each residue position in the antibody sequence; numerical values unbounded.
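A client may want to confirm that a response matches the documented shapes before using it downstream. The sketch below is one way to do that under stated assumptions: `unpack_results` is a hypothetical helper, the vector size of 1024 and the one-row-per-residue layout are taken from the schema above, and a field that was not requested is assumed to be either absent or null.

```python
EMBEDDING_DIM = 1024  # vector size stated in the response schema


def unpack_results(response_json, sequences):
    """Pair each input sequence with its embeddings and check their shapes."""
    results = response_json["results"]
    if len(results) != len(sequences):
        raise ValueError("result count does not match input count")

    unpacked = []
    for seq, result in zip(sequences, results):
        mean = result.get("embeddings")                  # set when "mean" was requested
        per_residue = result.get("residue_embeddings")   # set when "residue" was requested

        if mean is not None and len(mean) != EMBEDDING_DIM:
            raise ValueError(f"mean embedding has {len(mean)} values")

        if per_residue is not None:
            # The schema gives the shape as [sequence_length, 1024].
            if len(per_residue) != len(seq):
                raise ValueError("one embedding row per residue expected")
            if any(len(row) != EMBEDDING_DIM for row in per_residue):
                raise ValueError("residue embedding row has unexpected size")

        unpacked.append({"sequence": seq, "mean": mean, "residue": per_residue})
    return unpacked
```

Results are matched to inputs by position, which relies on the documented guarantee that results are returned in the order requested.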

Example response:

http
HTTP/1.1 200 OK
Content-Type: application/json

{
  "results": [
    {
      "embeddings": [
        0.010778225027024746,
        -0.04208502173423767,
        "... (truncated for documentation)"
      ],
      "residue_embeddings": [
        [
          -0.10589797794818878,
          -0.1447126418352127,
          "... (truncated for documentation)"
        ],
        [
          0.10450524091720581,
          -0.02376823127269745,
          "... (truncated for documentation)"
        ],
        "... (truncated for documentation)"
      ]
    },
    {
      "embeddings": [
        0.01170163881033659,
        -0.0017411700682714581,
        "... (truncated for documentation)"
      ],
      "residue_embeddings": [
        [
          -0.12166369706392288,
          -0.04526447504758835,
          "... (truncated for documentation)"
        ],
        [
          0.08949843794107437,
          0.11369754374027252,
          "... (truncated for documentation)"
        ],
        "... (truncated for documentation)"
      ]
    }
  ]
}

Performance

  • IgT5 Unpaired API supports a batch size of up to 8 sequences per request, with a maximum sequence length of 512 amino acids for unpaired sequences.

  • Model inference runs on NVIDIA T4 GPUs.

  • Typical inference time is approximately 2-4 seconds per batch of 8 sequences.

  • IgT5 Unpaired outperforms general protein language models, such as ProtT5 and ProtBert, in antibody-specific sequence recovery tasks. The gap is largest in the hypervariable CDR loops (e.g., CDR-H3 recovery accuracy of 60.3% vs. ProtT5’s 33.9%).

  • Compared with other antibody-specific models provided by BioLM, IgT5 Unpaired reaches 91.2% total heavy chain sequence recovery accuracy, close to IgBert Unpaired (91.0%) and 1.2 percentage points above AntiBERTy (90.0%).

  • For antibody-antigen binding affinity prediction, IgT5 Unpaired embeddings perform marginally better than those of general protein models such as ProtT5 (R² of 0.299 vs. 0.290 on representative benchmark datasets).

  • While IgT5 Unpaired excels in antibody-specific predictive tasks, general protein language models (ProtT5, ProtBert) may yield higher accuracy for broader protein property predictions, such as expression level estimation (ProtT5 R² of 0.697 vs. IgT5 Unpaired’s 0.567).

  • BioLM’s deployment of IgT5 Unpaired includes optimizations such as distributed inference and memory-efficient model serving, intended to keep API performance stable at scale.
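The batch limit and latency figures above allow a rough wall-clock estimate for a larger dataset. The sketch below is back-of-the-envelope arithmetic only: `estimate_encode_seconds` is a hypothetical helper, and it assumes requests are sent one after another and ignores network latency, queueing, and retries.

```python
import math


def estimate_encode_seconds(n_sequences, batch_size=8, sec_per_batch=(2.0, 4.0)):
    """Return (number of requests, lower bound in s, upper bound in s)."""
    n_batches = math.ceil(n_sequences / batch_size)  # last batch may be partial
    low, high = sec_per_batch
    return n_batches, n_batches * low, n_batches * high


n_batches, low_s, high_s = estimate_encode_seconds(10_000)
print(n_batches, low_s, high_s)  # 1250 2500.0 5000.0
```

Under these assumptions, 10,000 sequences need 1,250 requests and roughly 42 to 83 minutes of inference time when processed sequentially.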

Applications

  • Antibody affinity maturation and optimization by predicting beneficial mutations in complementarity-determining regions (CDRs), enabling researchers to rapidly identify variants with improved antigen binding; valuable for accelerating therapeutic antibody development pipelines but less effective for predicting broader antibody expression or stability properties.

  • Identification of antibody paratope residues through sequence embeddings, allowing efficient prioritization of residues critical for antigen binding; useful for rational antibody engineering tasks, such as humanization or specificity tuning, although not optimal for comprehensive structural modeling without additional structure-based methods.

  • High-throughput antibody sequence filtering and ranking based on predicted antigen binding affinity, enabling rapid screening of large antibody libraries; particularly valuable for companies performing antibody discovery campaigns, though limited in predicting developability attributes like aggregation propensity or solubility.

  • Generation of antibody sequence embeddings for clustering and diversity analysis, facilitating informed selection of diverse antibody candidates; beneficial for maintaining antibody library diversity during selection processes, but embeddings alone are insufficient for detailed functional characterization without experimental validation.

  • Prediction of heavy-light chain pairing compatibility through learned cross-chain sequence representations, enabling more accurate pairing predictions in antibody repertoire sequencing data; valuable for single-cell sequencing workflows aimed at reconstructing native antibody pairs, although performance may degrade when applied to repertoires significantly divergent from human antibody sequences.
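The clustering and diversity use case above can be illustrated with mean embeddings and cosine similarity. This is a minimal sketch rather than a recommended pipeline: the greedy selection rule, the function names, and the 0.9 threshold are illustrative assumptions, and the toy vectors stand in for the 1024-dimensional embeddings returned by the encode endpoint.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pick_diverse(embeddings, threshold=0.9):
    """Greedily keep indices whose similarity to every kept vector is below threshold."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy 2-D vectors; in practice, pass the "embeddings" field of each result.
toy = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
print(pick_diverse(toy))  # [0, 2]
```

The second toy vector is dropped because it is nearly parallel to the first. The result depends on input order, so candidates are usually sorted by some priority score before selection.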

Limitations

  • Maximum Sequence Length: For paired inputs (heavy and light chains), each chain must be ≤ 256 amino acids. For unpaired inputs (sequence), the maximum length is 512 amino acids.

  • Batch Size: Each API request accepts at most 8 sequences. Larger workloads must be split into multiple requests.

  • IgT5 embeddings are optimized specifically for antibody sequences; general protein property predictions (e.g., expression levels) may yield lower accuracy compared to general protein language models like ProtT5.

  • The model is trained primarily on human antibody sequences from the Observed Antibody Space (OAS) dataset; performance may degrade for sequences significantly different from typical human antibodies (e.g., antibodies from other species or highly engineered sequences).

  • IgT5 is trained to generate embeddings suitable for downstream predictive tasks, but it is not directly optimized for generative tasks (such as de novo antibody sequence generation) or antibody-antigen structure prediction.

  • The model’s embeddings incorporate cross-chain features when provided with paired heavy and light chains; providing only unpaired sequences (sequence) will not leverage these cross-chain interactions, potentially reducing predictive accuracy for tasks sensitive to chain pairing.
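The batch-size limit above means larger inputs have to be split on the client. The sketch below shows one way to do so, assuming unpaired `sequence` items only; `iter_encode_payloads` is a hypothetical helper, not part of the BioLM SDK.

```python
def iter_encode_payloads(sequences, batch_size=8, include=("mean",)):
    """Yield one request body per batch of at most `batch_size` sequences."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        yield {
            "params": {"include": list(include)},
            "items": [{"sequence": seq} for seq in batch],
        }


# 20 placeholder sequences are split into batches of 8, 8, and 4.
payloads = list(iter_encode_payloads(["EVQLVESGGGLVQPGGSLR"] * 20))
print([len(p["items"]) for p in payloads])  # [8, 8, 4]
```

Each yielded body can be posted to the encode endpoint in turn, and the per-batch results concatenated in order to recover one result per input sequence.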

How We Use It

IgT5 Unpaired is integrated into BioLM’s protein design and antibody engineering workflows. Its embeddings, learned from unpaired antibody sequences, are used for binding affinity prediction and sequence recovery, and they inform design choices in related BioLM services such as enzyme design and antibody maturation. Access through BioLM’s standardized APIs allows the model to be deployed at scale across these applications.

  • IgT5 Unpaired enhances workflows by providing detailed sequence embeddings for improved design and prediction accuracy.

  • The model’s integration with BioLM’s suite of tools supports comprehensive protein engineering tasks, from initial design to optimization.
