nanoBERT is a GPU-accelerated transformer model tailored for predicting feasible amino acid substitutions within nanobody (VHH) sequences. Trained on 10 million camelid nanobody sequences from the Integrated Nanobody Database for Immunoinformatics (INDI), nanoBERT generates context-aware positional mutation predictions without relying on germline annotations. The API supports masked residue prediction (infilling), enabling computational nanobody design, humanization, and thermostability optimization.

Predict

Predict log probabilities for input sequences

python
from biolmai import BioLM
response = BioLM(
    entity="nanobert",
    action="predict",
    params={},
    items=[
      {
        "sequence": "EVQLVESGGG"
      }
    ]
)
print(response)
bash
curl -X POST https://biolm.ai/api/v3/nanobert/predict/ \
  -H "Authorization: Token YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "items": [
    {
      "sequence": "EVQLVESGGG"
    }
  ]
}'
python
import requests

url = "https://biolm.ai/api/v3/nanobert/predict/"
headers = {
    "Authorization": "Token YOUR_API_KEY",
    "Content-Type": "application/json"
}
payload = {
    "items": [
        {
            "sequence": "EVQLVESGGG"
        }
    ]
}

response = requests.post(url, headers=headers, json=payload)
print(response.json())
r
library(httr)

url <- "https://biolm.ai/api/v3/nanobert/predict/"
headers <- c("Authorization" = "Token YOUR_API_KEY", "Content-Type" = "application/json")
body <- list(
  items = list(
    list(
      sequence = "EVQLVESGGG"
    )
  )
)

res <- POST(url, add_headers(.headers = headers), body = body, encode = "json")
print(content(res))
POST /api/v3/nanobert/predict/

Predict endpoint for nanoBERT.

Request Headers:

  • Authorization — Token YOUR_API_KEY

  • Content-Type — application/json

Request

  • params (object, optional) — Configuration parameters:

    • include (array of strings, default: ["mean"]) — Output types to include in the response; allowed values: "mean", "residue", "logits"

  • items (array of objects, min: 1, max: 32) — Input sequences:

    • sequence (string, required, min length: 1, max length: 154) — Amino acid sequence using standard unambiguous amino acid codes; "*" placeholders are accepted only by the generate endpoint (where at least one "*" is required)

Example request:

http
POST /api/v3/nanobert/predict/ HTTP/1.1
Host: biolm.ai
Authorization: Token YOUR_API_KEY
Content-Type: application/json

{
  "items": [
    {
      "sequence": "EVQLVESGGG"
    }
  ]
}
Status Codes:

  • 200 OK — Request succeeded; results are returned in the order submitted

Response

  • results (array of objects) — One result per input item, in the order requested:

    • log_prob (float) — Log probability score for the input sequence, as shown in the example response below

    • embeddings (array of floats, size: 320, optional) — Mean embedding vector for the input sequence from the nanoBERT_small model (embedding dimension: 320)

    • residue_embeddings (array of arrays of floats, shape: [sequence_length, 320], optional) — Per-residue embedding vectors for each amino acid in the input sequence (embedding dimension: 320)

    • logits (array of arrays of floats, shape: [sequence_length, 21], optional) — Raw, unnormalized prediction logits per residue position for the 20 standard amino acids plus one special token (21 classes, no fixed range)

Example response:

http
HTTP/1.1 200 OK
Content-Type: application/json

{
  "results": [
    {
      "log_prob": -0.002538938308134675
    }
  ]
}
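
Beyond the default mean embedding, per-residue outputs can be requested through the include parameter documented above. The sketch below assumes that multiple include values may be combined in one request and that the response exposes the documented results[].logits field; the softmax conversion to per-position probabilities is an illustrative post-processing step, not part of the API.

python
import numpy as np
import requests

url = "https://biolm.ai/api/v3/nanobert/predict/"
headers = {
    "Authorization": "Token YOUR_API_KEY",
    "Content-Type": "application/json"
}
# Request per-residue logits alongside the default mean embedding
payload = {
    "params": {"include": ["mean", "logits"]},
    "items": [{"sequence": "EVQLVESGGG"}]
}

response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()
result = response.json()["results"][0]

# logits: [sequence_length, 21] unnormalized scores per position
logits = np.array(result["logits"])

# Softmax over the vocabulary axis gives per-position amino acid probabilities
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
print(probs.shape)  # (sequence_length, 21)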

Encode

Generate embeddings for input sequences

python
from biolmai import BioLM
response = BioLM(
    entity="nanobert",
    action="encode",
    params={},
    items=[
      {
        "sequence": "EVQLVESGGG"
      }
    ]
)
print(response)
bash
curl -X POST https://biolm.ai/api/v3/nanobert/encode/ \
  -H "Authorization: Token YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "items": [
    {
      "sequence": "EVQLVESGGG"
    }
  ]
}'
python
import requests

url = "https://biolm.ai/api/v3/nanobert/encode/"
headers = {
    "Authorization": "Token YOUR_API_KEY",
    "Content-Type": "application/json"
}
payload = {
    "items": [
        {
            "sequence": "EVQLVESGGG"
        }
    ]
}

response = requests.post(url, headers=headers, json=payload)
print(response.json())
r
library(httr)

url <- "https://biolm.ai/api/v3/nanobert/encode/"
headers <- c("Authorization" = "Token YOUR_API_KEY", "Content-Type" = "application/json")
body <- list(
  items = list(
    list(
      sequence = "EVQLVESGGG"
    )
  )
)

res <- POST(url, add_headers(.headers = headers), body = body, encode = "json")
print(content(res))
POST /api/v3/nanobert/encode/

Encode endpoint for nanoBERT.

Request Headers:

  • Authorization — Token YOUR_API_KEY

  • Content-Type — application/json

Request

  • params (object, optional) — Configuration parameters:

    • include (array of strings, default: ["mean"]) — Output types to include:

      • Allowed values:

        • "mean" — Mean embedding

        • "residue" — Per-residue embeddings

        • "logits" — Logits

  • items (array of objects, min: 1, max: 32) — Input sequences:

    • sequence (string, required, min length: 1, max length: 154) — Amino acid sequence using standard unambiguous amino acid codes

Example request:

http
POST /api/v3/nanobert/encode/ HTTP/1.1
Host: biolm.ai
Authorization: Token YOUR_API_KEY
Content-Type: application/json

{
  "items": [
    {
      "sequence": "EVQLVESGGG"
    }
  ]
}
Status Codes:

  • 200 OK — Request succeeded; results are returned in the order submitted

Response

  • results (array of objects) — One result per input item, in the order requested:

    • embeddings (array of floats, size: 320, optional) — Mean embedding vector (size: 320) representing the entire nanobody sequence

    • residue_embeddings (array of arrays of floats, shape: [sequence_length, 320], optional) — Per-residue embedding vectors; each residue represented by a 320-dimensional vector

    • logits (array of arrays of floats, shape: [sequence_length, 21], optional) — Raw, unnormalized prediction logits for each residue position; 21 values per residue (20 standard amino acids + 1 special token)

Example response:

http
HTTP/1.1 200 OK
Content-Type: application/json

{
  "results": [
    {
      "embeddings": [
        -0.29287976026535034,
        -0.00899506825953722,
        "... (truncated for documentation)"
      ]
    }
  ]
}
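
Mean embeddings are a convenient fixed-length representation for comparing nanobody sequences. The sketch below encodes two sequences in a single request (the second sequence is a hypothetical variant chosen for illustration) and computes the cosine similarity of their 320-dimensional mean embeddings; the similarity calculation is downstream analysis, not part of the API.

python
import numpy as np
import requests

url = "https://biolm.ai/api/v3/nanobert/encode/"
headers = {
    "Authorization": "Token YOUR_API_KEY",
    "Content-Type": "application/json"
}
# Up to 32 sequences may be sent per request; two are shown here
payload = {
    "params": {"include": ["mean"]},
    "items": [
        {"sequence": "EVQLVESGGG"},
        {"sequence": "QVQLVESGGG"}  # hypothetical variant for illustration
    ]
}

response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()
results = response.json()["results"]

# Each mean embedding is a 320-dimensional vector
a = np.array(results[0]["embeddings"])
b = np.array(results[1]["embeddings"])
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {cosine:.4f}")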

Generate

Generate new sequences based on prompts with placeholder symbols

python
from biolmai import BioLM
response = BioLM(
    entity="nanobert",
    action="generate",
    params={},
    items=[
      {
        "sequence": "EVQLVESGGG"
      }
    ]
)
print(response)
bash
curl -X POST https://biolm.ai/api/v3/nanobert/generate/ \
  -H "Authorization: Token YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "items": [
    {
      "sequence": "EVQLVESGGG"
    }
  ]
}'
python
import requests

url = "https://biolm.ai/api/v3/nanobert/generate/"
headers = {
    "Authorization": "Token YOUR_API_KEY",
    "Content-Type": "application/json"
}
payload = {
    "items": [
        {
            "sequence": "EVQLVESGG*"
        }
    ]
}

response = requests.post(url, headers=headers, json=payload)
print(response.json())
r
library(httr)

url <- "https://biolm.ai/api/v3/nanobert/generate/"
headers <- c("Authorization" = "Token YOUR_API_KEY", "Content-Type" = "application/json")
body <- list(
  items = list(
    list(
      sequence = "EVQLVESGG*"
    )
  )
)

res <- POST(url, add_headers(.headers = headers), body = body, encode = "json")
print(content(res))
POST /api/v3/nanobert/generate/

Generate endpoint for nanoBERT.

Request Headers:

  • Authorization — Token YOUR_API_KEY

  • Content-Type — application/json

Request

  • params (object, optional) — Configuration parameters:

    • include (array of strings, default: ["mean"]) — Output types to include:

      • Allowed values:

        • "mean" — Mean embedding

        • "residue" — Per-residue embeddings

        • "logits" — Logits

  • items (array of objects, min: 1, max: 32) — Input sequences:

    • sequence (string, required, min length: 1, max length: 154) — Amino acid sequence with '*' placeholders indicating positions to infill; allowed characters are the standard unambiguous amino acids and '*' (at least one '*' required)

Example request:

http
POST /api/v3/nanobert/generate/ HTTP/1.1
Host: biolm.ai
Authorization: Token YOUR_API_KEY
Content-Type: application/json

{
  "items": [
    {
      "sequence": "EVQLVESGG*"
    }
  ]
}
Status Codes:

  • 200 OK — Request succeeded; results are returned in the order submitted

Response

  • results (array of objects) — One result per input item, in the order requested:

    • sequence (string, length: 1-154 residues) — Nanobody amino acid sequence with masked positions ('*') infilled by nanoBERT predictions

Example response:

http
HTTP/1.1 200 OK
Content-Type: application/json

{
  "results": [
    {
      "sequence": "EVQLVESGGG"
    }
  ]
}
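
The generate endpoint infills '*' placeholders, so input sequences must contain at least one '*'. The sketch below masks two positions of the documented example sequence (the positions are chosen arbitrarily for illustration) and submits the masked sequence through the biolmai client shown above.

python
from biolmai import BioLM

# Mask two positions of the example sequence with '*' placeholders;
# the masked positions are chosen arbitrarily for illustration.
masked = "EVQLV*SGG*"

response = BioLM(
    entity="nanobert",
    action="generate",
    params={},
    items=[{"sequence": masked}]
)
# Each result contains the sequence with its '*' positions infilled
print(response)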

Performance

  • nanoBERT inference is GPU-accelerated using NVIDIA A100 GPUs, enabling rapid sequence embedding and mutation prediction.

  • nanoBERT is specifically optimized for nanobody sequences, achieving substantially higher predictive accuracy for nanobody positional mutations compared to general-purpose protein language models such as ESM-2 (76.9% vs. 57.4% mean V-region accuracy).

  • Compared to human antibody-specific models (e.g., AbLangHeavy), nanoBERT demonstrates significantly improved accuracy in predicting nanobody-specific residues, especially within highly variable CDR regions (44.9% vs. 30.3% accuracy).

  • nanoBERT_small (14M parameters) provides comparable accuracy to nanoBERT_big (86M parameters), offering a more computationally efficient option with negligible loss in predictive performance.

  • nanoBERT embeddings and logits outputs are returned as floating-point vectors:

    • Embeddings: Mean-pooled sequence embeddings, shape [embedding_dim] (320-dimensional vectors for nanoBERT_small).

    • Residue embeddings: Per-residue embeddings, shape [sequence_length, embedding_dim].

    • Logits: Per-residue logits for amino acid predictions, shape [sequence_length, 21] (20 standard amino acids plus one special token).

  • nanoBERT’s predictive accuracy on therapeutic nanobody sequences (77.4% mean V-region accuracy) outperforms human antibody models (73.9%) and general protein language models (62.5%), highlighting its suitability for therapeutic nanobody design and optimization tasks.

  • nanoBERT fine-tuning on downstream tasks (e.g., thermostability prediction) consistently outperforms randomly initialized models and human antibody-specific models, achieving Pearson correlations of approximately 0.5 vs. 0.39 on Circular Dichroism datasets, demonstrating its advantage for nanobody-specific predictive tasks.

Applications

  • Predicting feasible amino acid substitutions in therapeutic nanobodies, enabling researchers to efficiently explore mutational space and identify variants with improved binding affinity or specificity; particularly useful for antibody engineering teams developing novel nanobody-based therapeutics, although it is not optimal for canonical antibody sequences due to its nanobody-specific training.

  • Computational humanization of llama-derived nanobodies by suggesting mutations that balance human-like sequence characteristics with retention of nanobody hallmark residues; valuable for biotech companies aiming to reduce immunogenicity risk of therapeutic nanobodies while preserving their stability and binding properties, though experimental validation remains essential.

  • Fine-tuning nanoBERT on experimentally measured nanobody thermostability datasets (e.g., NBThermo) to predict stability-enhancing mutations, enabling protein engineering teams to prioritize candidate sequences for expression and characterization; especially beneficial when experimental screening capacity is limited, but less effective if training data is sparse or poorly annotated.

  • Generating nanobody-specific sequence embeddings for downstream machine learning tasks such as predicting aggregation propensity or solubility, allowing bioinformatics teams to rapidly screen large sequence libraries and filter out problematic candidates early in the discovery pipeline; however, embeddings may not generalize well to non-nanobody antibody formats. A minimal sketch of this embed-then-model pattern follows this list.

  • Assessing sequence nativeness for quality control of synthetic nanobody libraries, identifying sequences that deviate significantly from natural camelid repertoires; useful for companies synthesizing large-scale nanobody libraries to ensure diversity and biological relevance, though not suitable for evaluating canonical antibody or non-camelid-derived libraries.
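
As referenced in the embeddings bullet above, a common pattern is to use encode-endpoint mean embeddings as features for a downstream model. The sketch below is a minimal illustration assuming a small, hypothetical set of labeled sequences and scikit-learn's Ridge regressor; the sequences, labels, and model choice are placeholders, not recommendations.

python
import numpy as np
import requests
from sklearn.linear_model import Ridge

url = "https://biolm.ai/api/v3/nanobert/encode/"
headers = {
    "Authorization": "Token YOUR_API_KEY",
    "Content-Type": "application/json"
}

# Hypothetical labeled nanobody variants; sequences and label values are placeholders
sequences = ["EVQLVESGGG", "QVQLVESGGG", "EVQLVQSGGG"]
labels = np.array([0.8, 0.6, 0.7])

payload = {
    "params": {"include": ["mean"]},
    "items": [{"sequence": s} for s in sequences]
}
response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()

# Stack the 320-dimensional mean embeddings into a feature matrix
X = np.array([r["embeddings"] for r in response.json()["results"]])

# Fit a simple regularized linear model on the embeddings
model = Ridge().fit(X, labels)
print(model.predict(X))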

Limitations

  • Maximum Sequence Length: Input sequences are limited to 154 amino acids. Longer sequences must be truncated or split into smaller segments before submission.

  • Batch Size: Requests are limited to a maximum of 32 sequences per API call. For larger datasets, split the input into chunks of at most 32 sequences and submit one request per chunk (see the batching sketch after this list).

  • nanoBERT is specifically trained on camelid-derived nanobody sequences. It may not perform optimally on non-camelid antibodies, general protein sequences, or sequences significantly diverging from natural nanobody repertoires.

  • nanoBERT is designed for sequence infilling and predicting positional substitutions within nanobodies. It does not directly predict 3D structures, binding affinities, or biophysical properties without additional fine-tuning on relevant experimental datasets.

  • The model does not rely on germline gene assignments; thus, it may not capture subtle evolutionary or germline-specific constraints relevant to certain engineering tasks, such as precise humanization or germline-based affinity maturation.

  • Mean embeddings ("mean"), per-residue embeddings ("residue"), and logits ("logits") are available as output options via the include parameter. However, the embeddings are not optimized for visualization or clustering tasks, and downstream dimensionality reduction or clustering methods may be required for effective analysis.
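
As noted in the batch-size bullet above, larger datasets must be split across multiple API calls. A minimal batching sketch, assuming the encode endpoint and the documented 32-item limit, is shown below.

python
import requests

url = "https://biolm.ai/api/v3/nanobert/encode/"
headers = {
    "Authorization": "Token YOUR_API_KEY",
    "Content-Type": "application/json"
}
BATCH_SIZE = 32  # maximum number of items per request


def encode_all(sequences):
    """Encode an arbitrarily long list of sequences in chunks of 32."""
    results = []
    for start in range(0, len(sequences), BATCH_SIZE):
        chunk = sequences[start:start + BATCH_SIZE]
        payload = {"items": [{"sequence": s} for s in chunk]}
        response = requests.post(url, headers=headers, json=payload)
        response.raise_for_status()
        results.extend(response.json()["results"])
    return results


# Example: 100 copies of the documented example sequence -> 4 requests
results = encode_all(["EVQLVESGGG"] * 100)
print(len(results))  # 100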

How We Use It

nanoBERT enables rapid exploration and optimization of nanobody sequences by predicting biologically feasible amino acid substitutions, significantly accelerating nanobody engineering workflows. Integrated into BioLM’s standardized inference pipelines, nanoBERT facilitates efficient humanization, thermostability enhancement, and liability reduction tasks, ultimately streamlining lead optimization and experimental validation.

  • Integrates seamlessly with predictive models, sequence embedding tools, and structure-informed metrics to prioritize designs.

  • Accelerates iterative design-test cycles, reducing experimental overhead and shortening time-to-market for therapeutic nanobodies.

References