BioLMTox-2 is a GPU-accelerated toxin classification model fine-tuned from ESM-2. It returns binary predictions (toxin/not-toxin) directly from amino acid sequences in under one second per sequence, with no alignment-based preprocessing. Trained on a unified dataset spanning multiple domains of life, BioLMTox-2 achieves a validation accuracy of 0.964 and a recall of 0.984 and is applicable across diverse sequence lengths (6–15,639 residues), although the API accepts at most 2048 residues per sequence. Typical use cases include biosecurity screening, therapeutic peptide design, and toxin mechanism research.

Predict

Predict toxin vs. not-toxin for input sequences

python
from biolmai import BioLM
response = BioLM(
    entity="biolmtox2",
    action="predict",
    params={},
    items=[
      {
        "sequence": "MKTAYIAKQRQISFVKSHFSRQDILD"
      },
      {
        "sequence": "GQSYFQPTNGVGYQPTNGVGYQPTN"
      }
    ]
)
print(response)
bash
curl -X POST https://biolm.ai/api/v3/biolmtox2/predict/ \
  -H "Authorization: Token YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "items": [
    {
      "sequence": "MKTAYIAKQRQISFVKSHFSRQDILD"
    },
    {
      "sequence": "GQSYFQPTNGVGYQPTNGVGYQPTN"
    }
  ],
  "params": {}
}'
python
import requests

url = "https://biolm.ai/api/v3/biolmtox2/predict/"
headers = {
    "Authorization": "Token YOUR_API_KEY",
    "Content-Type": "application/json"
}
payload = {
    "items": [
        {"sequence": "MKTAYIAKQRQISFVKSHFSRQDILD"},
        {"sequence": "GQSYFQPTNGVGYQPTNGVGYQPTN"}
    ],
    "params": {}
}

response = requests.post(url, headers=headers, json=payload)
print(response.json())
r
library(httr)

url <- "https://biolm.ai/api/v3/biolmtox2/predict/"
headers <- c("Authorization" = "Token YOUR_API_KEY", "Content-Type" = "application/json")
body <- list(
  items = list(
    list(
      sequence = "MKTAYIAKQRQISFVKSHFSRQDILD"
    ),
    list(
      sequence = "GQSYFQPTNGVGYQPTNGVGYQPTN"
    )
  ),
  params = setNames(list(), character(0))  # named empty list, so it serializes as {} rather than []
)

res <- POST(url, add_headers(.headers = headers), body = body, encode = "json")
print(content(res))
POST /api/v3/biolmtox2/predict/

Predict endpoint for BioLMTox-2.

Request

  • items (array of objects, min: 1, max: 16) — Input protein sequences:

    • sequence (string, min length: 1, max length: 2048, required) — Protein sequence containing only standard unambiguous amino acid codes

Example request:

http
POST /api/v3/biolmtox2/predict/ HTTP/1.1
Host: biolm.ai
Authorization: Token YOUR_API_KEY
Content-Type: application/json

{
  "items": [
    {
      "sequence": "MKTAYIAKQRQISFVKSHFSRQDILD"
    },
    {
      "sequence": "GQSYFQPTNGVGYQPTNGVGYQPTN"
    }
  ],
  "params": {}
}

Response

  • results (array of objects) — One result per input item, in the order requested:

    • label (string, enum: “toxin”, “not-toxin”) — Predicted classification label

    • score (float, range: 0.0–1.0) — Prediction confidence score (probability) for the assigned label

Example response:

http
HTTP/1.1 200 OK
Content-Type: application/json

{
  "results": [
    {
      "label": "not-toxin",
      "score": 0.9999914169311523
    },
    {
      "label": "not-toxin",
      "score": 0.9999960660934448
    }
  ]
}
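
Because `score` is the confidence for the assigned label, code that needs a single toxin probability has to flip the score for `not-toxin` results. The following is a minimal sketch, assuming the response shape shown above and a binary classifier whose two label probabilities sum to 1. The helper names `toxin_probability` and `flag_toxins` and the 0.5 threshold are illustrative and not part of the BioLM API.

```python
def toxin_probability(result):
    """Convert one predict result into P(toxin).

    Assumes the score is the probability of the assigned label, so the
    complementary label has probability 1 - score.
    """
    score = result["score"]
    return score if result["label"] == "toxin" else 1.0 - score


def flag_toxins(sequences, response, threshold=0.5):
    """Return (sequence, P(toxin)) pairs whose probability meets the threshold.

    Results are matched to inputs by position, since the API returns one
    result per input item in the order requested.
    """
    results = response["results"]
    if len(results) != len(sequences):
        raise ValueError("expected one result per input sequence")
    flagged = []
    for seq, res in zip(sequences, results):
        p = toxin_probability(res)
        if p >= threshold:
            flagged.append((seq, p))
    return flagged


# Example response copied from the documentation above.
example_response = {
    "results": [
        {"label": "not-toxin", "score": 0.9999914169311523},
        {"label": "not-toxin", "score": 0.9999960660934448},
    ]
}
sequences = ["MKTAYIAKQRQISFVKSHFSRQDILD", "GQSYFQPTNGVGYQPTNGVGYQPTN"]
print(flag_toxins(sequences, example_response))
```

For the example response, both sequences are confidently labeled `not-toxin`, so nothing is flagged. A stricter screening workflow might lower the threshold to favor recall.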

Encode

Generate embeddings for input sequences

python
from biolmai import BioLM
response = BioLM(
    entity="biolmtox2",
    action="encode",
    params={},
    items=[
      {
        "sequence": "MKTAYIAKQRQISFVKSHFSRQDILD"
      },
      {
        "sequence": "GQSYFQPTNGVGYQPTNGVGYQPTN"
      }
    ]
)
print(response)
bash
curl -X POST https://biolm.ai/api/v3/biolmtox2/encode/ \
  -H "Authorization: Token YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "items": [
    {
      "sequence": "MKTAYIAKQRQISFVKSHFSRQDILD"
    },
    {
      "sequence": "GQSYFQPTNGVGYQPTNGVGYQPTN"
    }
  ],
  "params": {}
}'
python
import requests

url = "https://biolm.ai/api/v3/biolmtox2/encode/"
headers = {
    "Authorization": "Token YOUR_API_KEY",
    "Content-Type": "application/json"
}
payload = {
    "items": [
        {"sequence": "MKTAYIAKQRQISFVKSHFSRQDILD"},
        {"sequence": "GQSYFQPTNGVGYQPTNGVGYQPTN"}
    ],
    "params": {}
}

response = requests.post(url, headers=headers, json=payload)
print(response.json())
r
library(httr)

url <- "https://biolm.ai/api/v3/biolmtox2/encode/"
headers <- c("Authorization" = "Token YOUR_API_KEY", "Content-Type" = "application/json")
body <- list(
  items = list(
    list(
      sequence = "MKTAYIAKQRQISFVKSHFSRQDILD"
    ),
    list(
      sequence = "GQSYFQPTNGVGYQPTNGVGYQPTN"
    )
  ),
  params = setNames(list(), character(0))  # named empty list, so it serializes as {} rather than []
)

res <- POST(url, add_headers(.headers = headers), body = body, encode = "json")
print(content(res))
POST /api/v3/biolmtox2/encode/

Encode endpoint for BioLMTox-2.

Request

  • items (array of objects, min: 1, max: 16) — Input sequences:

    • sequence (string, min length: 1, max length: 2048, required) — Protein sequence using standard unambiguous amino acid codes

Example request:

http
POST /api/v3/biolmtox2/encode/ HTTP/1.1
Host: biolm.ai
Authorization: Token YOUR_API_KEY
Content-Type: application/json

{
  "items": [
    {
      "sequence": "MKTAYIAKQRQISFVKSHFSRQDILD"
    },
    {
      "sequence": "GQSYFQPTNGVGYQPTNGVGYQPTN"
    }
  ],
  "params": {}
}

Response

  • results (array of objects) — One result per input item, in the order requested:

    • mean_representation (array of floats, size: 1280) — Mean embedding vector for the input protein sequence, generated by the BioLMTox-2 model

Example response:

http
HTTP/1.1 200 OK
Content-Type: application/json

{
  "results": [
    {
      "mean_representation": [
        -0.03311070799827576,
        0.09799981117248535,
        "... (truncated for documentation)"
      ]
    },
    {
      "mean_representation": [
        -0.11304271966218948,
        0.0008354551391676068,
        "... (truncated for documentation)"
      ]
    }
  ]
}
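
One common use of mean embeddings is comparing sequences by cosine similarity. The sketch below is a plain-Python illustration of that comparison, not a method prescribed by BioLM. The function name `cosine_similarity` is hypothetical, and the toy 3-dimensional vectors stand in for real 1280-dimensional `mean_representation` arrays.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Made-up toy vectors for illustration only; in practice these would be the
# mean_representation arrays from two entries of the "results" list.
emb_a = [1.0, 0.0, 1.0]
emb_b = [1.0, 0.0, 0.0]
print(cosine_similarity(emb_a, emb_b))
```

Values near 1 indicate embeddings pointing in nearly the same direction. How well this tracks functional similarity between proteins is an empirical question the documentation does not address.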

Performance

  • Hardware Specifications: Deployed on NVIDIA T4 GPUs with 16 GB RAM and 4 vCPU cores for GPU-accelerated inference.

  • Inference Speed: BioLMTox-2 achieves sub-second inference times per sequence, significantly faster than traditional alignment-based methods (e.g., BLAST) and comparable to other transformer-based models such as ESM-2 650M.

  • Predictive Accuracy: BioLMTox-2 demonstrates state-of-the-art toxin classification performance, achieving an accuracy of 0.964 and recall of 0.984 on comprehensive validation sets, surpassing or matching contemporary methods including ToxDL, TOXIFY, UniDL4BioPep, and CSM-Toxin.

    • Outperforms TOXIFY by approximately 8.6% in accuracy on animal venom datasets.

    • Matches or slightly exceeds ToxDL and UniDL4BioPep accuracy on peptide and protein toxin datasets.

    • Underperforms ToxinPred2 by approximately 6.4% in accuracy on datasets containing diverse toxin types, potentially due to differences in toxin labeling criteria.

  • Model Architecture & Optimization: BioLMTox-2 is a transformer-based classifier fine-tuned from the ESM-2 650M pre-trained protein language model, optimized specifically for toxin classification tasks. It does not require computationally expensive alignment-based feature generation, enabling rapid inference.

  • Computational Efficiency: Leveraging fine-tuned transformer embeddings, BioLMTox-2 achieves high computational efficiency, offering faster inference compared to alignment-dependent methods (e.g., PSI-BLAST-based classifiers like ATSE), and comparable or better speed than other transformer-based classifiers (e.g., CSM-Toxin).

  • Embedding Quality: Fine-tuning significantly improves the separation of toxin and not-toxin embeddings compared to the base ESM-2 model, enhancing predictive reliability and interpretability.

  • Input and Output Types:

    • Input: Accepts protein sequences (1–2048 amino acids) as strings containing only unambiguous amino acid characters.

    • Output: The predict endpoint returns binary classification labels (“toxin” or “not-toxin”) with corresponding prediction confidence scores; the encode endpoint returns a 1280-dimensional mean embedding vector per sequence.
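
The input constraints above can be checked client-side before a request is sent. The following is a minimal sketch, assuming that "unambiguous amino acid characters" means the 20 canonical one-letter codes in upper case. The helper name `validate_sequence` is hypothetical, and the API's own validation may differ.

```python
# Assumed alphabet: the 20 canonical amino acids, upper case only.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")
MAX_LENGTH = 2048  # documented maximum sequence length


def validate_sequence(sequence):
    """Return a list of problems; an empty list means the sequence passes."""
    problems = []
    if not 1 <= len(sequence) <= MAX_LENGTH:
        problems.append(f"length {len(sequence)} outside 1-{MAX_LENGTH}")
    invalid = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
    if invalid:
        problems.append("non-standard characters: " + "".join(invalid))
    return problems


print(validate_sequence("MKTAYIAKQRQISFVKSHFSRQDILD"))
print(validate_sequence("MKXB"))
```

Ambiguity codes such as `X` or `B` are reported rather than silently removed, so the caller decides whether to drop or repair the sequence.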

Applications

  • Rapid computational screening of engineered therapeutic proteins and peptides for toxicity, enabling biotech companies to quickly filter candidate molecules before costly lab-based validations; particularly valuable for high-throughput synthetic biology workflows, though not a replacement for experimental safety testing required by regulatory agencies.

  • Biosecurity risk assessment of novel protein sequences generated by generative AI or directed evolution, allowing synthetic biology companies to proactively identify potentially hazardous sequences prior to synthesis; critical for mitigating dual-use concerns, although not optimal for detecting entirely novel toxin mechanisms without similarity to known toxins.

  • Computational prioritization of peptide-based drug candidates by identifying potential toxicity early in the development cycle, significantly reducing downstream attrition rates and resource expenditure; especially useful for biotech startups with limited budgets, but should be complemented by subsequent in vitro and in vivo assays.

  • Safety evaluation of agricultural biotechnology products, such as genetically engineered crops expressing novel proteins, facilitating regulatory compliance by providing rapid initial toxicity classification; valuable for ag-biotech companies aiming to streamline product development timelines, though regulatory approval still necessitates comprehensive experimental validation.

  • Identification and exclusion of potentially toxic sequences in protein-based biomaterials, such as scaffolds or hydrogels, enabling biomaterials companies to design safer products for medical applications; particularly relevant during early design phases, but not sufficient alone for medical-grade certification without additional experimental evidence.

Limitations

  • Maximum Sequence Length: Input sequences must be between 1 and 2048 amino acids; longer sequences must be truncated prior to submission.

  • Batch Size: Each API request supports between 1 and 16 sequences; larger datasets should be split into multiple requests.

  • BioLMTox-2 is a sequence-only classifier and does not incorporate structural, evolutionary, or physicochemical features; this may limit accuracy for highly novel or structurally unique toxins.

  • The model was primarily trained on eukaryotic and bacterial sequences; performance may decrease when classifying sequences from other domains, such as archaea or viruses.

  • BioLMTox-2 provides binary toxin classification (toxin or not-toxin) without additional information on toxin subtypes (e.g., neurotoxin, cardiotoxin) or toxicity mechanisms.

  • This model does not explicitly evaluate homologous sequences or sequence similarity biases; caution is advised when interpreting predictions for proteins closely related to known non-toxic homologs.
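
The length and batch-size limits can be handled when building requests. The sketch below splits a list of sequences into payloads of at most 16 items and truncates each sequence to its first 2048 residues. Keeping the N-terminal end is an assumption, since the documentation does not say which end to truncate. `make_payloads` is a hypothetical helper, not part of the BioLM SDK.

```python
def make_payloads(sequences, batch_size=16, max_length=2048):
    """Yield request payloads that respect the documented API limits.

    Each payload holds at most batch_size items, and every sequence is cut
    to its first max_length residues (assumed truncation strategy).
    """
    if not 1 <= batch_size <= 16:
        raise ValueError("batch_size must be between 1 and 16")
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        yield {
            "items": [{"sequence": seq[:max_length]} for seq in batch],
            "params": {},
        }


# 40 sequences, the first of which exceeds the length limit.
sequences = ["A" * 3000] + ["MK"] * 39
payloads = list(make_payloads(sequences))
print([len(p["items"]) for p in payloads])
```

Each yielded payload has the same shape as the request bodies in the examples above and can be posted to either endpoint. Because truncation discards residues, predictions for truncated sequences describe only the retained portion.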

How We Use It

BioLMTox-2 enables rapid, scalable protein toxin classification, supporting protein engineering and therapeutic development by screening candidate sequences for potential toxicity early in the design cycle. Its standardized API integrates with BioLM’s predictive modeling and generative AI workflows, so iterative optimization loops can filter and prioritize safer, viable candidates before downstream synthesis and experimental validation.

  • Integrates smoothly into multi-round optimization cycles alongside masked language models and structural prediction tools.

  • Accelerates therapeutic and antibody development by significantly reducing experimental screening overhead and improving candidate selection accuracy.
