Antibody Engineering

Comprehensive antibody engineering workflow using BioLM.ai models.

Antibodies are highly specific proteins with remarkable binding ability, important immune-mediated interactions, and high therapeutic potential. Benefiting from recent machine learning successes in biology, and in protein design specifically, computational antibody engineering has become highly accessible. In the following guides, motivated by the binder design challenges proposed with the BenchBB dataset and the expediency of the BioLM SDK and API, several ML methods for antibody engineering are demonstrated.


Previously

Using a structure- and antigen-conditioned candidate proposal method, 100 variants were generated for each of four proposed BenchBB targets ('IL7Ra', 'PDL-1', 'EGFR', and 'MBP') for which solved structures of antibodies in complex with the target are deposited in the PDB.


While the prerequisite goal of binder design is molecules that bind, there are additional factors to consider, especially with respect to more specific goals such as therapeutic potential and manufacturability. Predicting these properties in silico for generated variants then becomes another important task. Several of these additional factors can be labeled as intrinsic or extrinsic, and this distinction can change modeling approaches.

  • Intrinsic properties pertain only to the antibody sequences (and thus structures) themselves, or to standard interactions that are effectively conditioned on (think hydrophobicity: most proteins have to interact with H2O, and this shaped their evolutionary trajectory). These include thermostability (which can affect shelf life and transport considerations for therapeutics), solubility and expression (which can affect yield and manufacturing capability), self-aggregation (an antibody's propensity to bind to itself), viscosity, and immunogenicity / humanness (though some might argue this belongs in the extrinsic category).
  • Extrinsic properties pertain to the antibody's interactions with other molecules and environments. Binding falls into this category, as do other PPI-based properties including specificity and cross-reactivity / promiscuity.

Protein language models (PLMs) trained on large corpora of protein sequences can reasonably be expected to capture some intrinsic properties, as a base level of thermostability, solubility, etc. must have been selected for during evolution in order for these proteins to be functional. These models can score sequences using likelihoods, which can be thought of as a ranking proxy for how "natural" the scored sequence is with respect to the model's training data (for PLMs, natural proteins). One limitation of this scoring approach is that without further interpretive tools or analysis, individual intrinsic properties are difficult to decouple from the rankings. Furthermore, de novo and very dissimilar proteins may be scored poorly the further out of distribution they are, and again, without further analysis it can be difficult to determine why (is it a nonfunctional protein, or one that simply bypassed the natural evolutionary trajectory?).
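Concretely, the likelihood score computed in the code below is the sum of per-residue log-probabilities under a softmax over the amino-acid vocabulary:

$$\log p(x) = \sum_{i=1}^{L} \log \frac{\exp(z_{i,x_i})}{\sum_{a \in \mathcal{A}} \exp(z_{i,a})}$$

where $z_{i,a}$ is the model's logit for amino acid $a$ at position $i$, $x_i$ is the observed residue, and $\mathcal{A}$ is the amino-acid alphabet. Dividing by the length $L$ gives a length-normalized score that is comparable across sequences, as done later in this guide.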

Below, several PLM scoring methods are demonstrated.

Import Required Libraries

from IPython.display import JSON
import requests
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time

from pathlib import Path
from sklearn.decomposition import PCA
from Bio import PDB
from Bio.PDB import *

from biolmai import BioLM
from helpers.styling import apply_plot_styling
apply_plot_styling()

Sequence-Based Scoring of Candidates

Variants are scored using ESM-C (a high-performing general protein language model from EvolutionaryScale) and IgBERT, an antibody-specific paired language model that considers both antibody chains together. ESM-C scores should reflect general protein "naturalness" and IgBERT scores paired-antibody "naturalness", potentially capturing or biasing toward different intrinsic properties.

First, the generated variants are loaded and then scored using the BioLM SDK.

samples_df = pd.read_csv("data/protein/benchbb_antibody_targets/variant_candidates.csv")
def batch_and_score_esmc(
    df,
    column='sequence'
):
    """
    Sends sequences to the ESM-C endpoint (the SDK batches internally), collects logits, and computes log-probabilities.

    Assumes:
        - Response includes:
            - logits: list[list[float]] of shape (L+1, 20)
            - vocab_tokens: list[str] of canonical amino acids (length 20)
    
    Returns:
        List of dicts: {'sequence': str, 'log_prob_sum': float}
    """
    results = []
    sequences = df[column].tolist()

    start = time.time()
    result = BioLM(entity="esmc-300m", action="encode", type="sequence", items=sequences, params={"include": ['logits']})
    end = time.time()
    print(f"ESMC scoring took {end - start:.4f} seconds.")

    for seq, item in zip(sequences, result):
        logits = np.array(item["logits"])  # shape (L+1, 20)
        vocab_tokens = item["vocab_tokens"]

        log_prob_sum = 0.0
        for pos, aa in enumerate(seq):
            try:
                aa_idx = vocab_tokens.index(aa)
            except ValueError:
                # non-canonical residue: count it as zero probability
                log_prob_sum += float('-inf')
                continue

            logits_slice = logits[pos]
            # numerically stable log_softmax
            max_logit = np.max(logits_slice)
            exps = np.exp(logits_slice - max_logit)
            log_probs = logits_slice - max_logit - np.log(np.sum(exps))

            log_prob_sum += log_probs[aa_idx]

        results.append({
            "sequence": seq,
            "log_prob_sum": float(log_prob_sum)
        })

    return results
esmc_heavy = pd.DataFrame(batch_and_score_esmc(samples_df, "heavy")).rename(columns={"log_prob_sum": "esmc_lp_heavy"})
esmc_light = pd.DataFrame(batch_and_score_esmc(samples_df, "light")).rename(columns={"log_prob_sum": "esmc_lp_light"})
samples_df = samples_df.merge(esmc_heavy, how="left", left_on="heavy", right_on="sequence").drop(columns="sequence")  # drop the redundant merge key
samples_df = samples_df.merge(esmc_light, how="left", left_on="light", right_on="sequence").drop(columns="sequence")
ESMC scoring took 529.7984 seconds.
ESMC scoring took 680.2055 seconds.
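As an aside, the per-position loop above can be vectorized with NumPy, which helps when scoring many long sequences. A minimal sketch, assuming the same (L+1, 20) logits layout and vocab_tokens list returned by the encode endpoint:

def log_prob_sum_vectorized(seq, logits, vocab_tokens):
    """Vectorized equivalent of the per-position loop in batch_and_score_esmc."""
    aa_to_idx = {aa: i for i, aa in enumerate(vocab_tokens)}
    idx = np.array([aa_to_idx.get(aa, -1) for aa in seq])
    logits = np.asarray(logits)[:len(seq)]  # one row of logits per residue
    # numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # -inf for residues missing from the vocabulary, matching the loop above
    per_pos = np.where(idx >= 0, log_probs[np.arange(len(seq)), idx], -np.inf)
    return float(per_pos.sum())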
def batch_and_score_igbert(
    df,
    heavy_col='heavy',
    light_col='light'
):
    """
    Sends heavy/light chain pairs to the IgBERT endpoint (the SDK batches internally) and collects log-probs from the response.

    Assumes:
        - Response includes:
            - results: list of dicts
                - each has 'log_prob' key with a float
    Returns:
        List of dicts: {'heavy': str, 'light': str, 'log_prob': float}
    """
    results = []
    heavy_list = df[heavy_col].tolist()
    light_list = df[light_col].tolist()

    data = [{"heavy": h, "light": l} for h, l in zip(heavy_list, light_list)]

    start = time.time()
    result = BioLM(entity="igbert-paired", action="predict", items=data)
    end = time.time()
    
    print(f"IgBERT scoring took {end - start:.4f} seconds.")

    for (h, l), item in zip(zip(heavy_list, light_list), result):
        log_prob = item.get("log_prob", float('-inf'))
        results.append({
            "heavy": h,
            "light": l,
            "log_prob": log_prob
        })

    return results
igbert_paired = pd.DataFrame(batch_and_score_igbert(samples_df)).rename(columns={"log_prob": "igbert_lp"})
samples_df = samples_df.merge(igbert_paired, how="left", on=["heavy","light"])
IgBERT scoring took 847.0409 seconds.
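Since each scoring pass takes several minutes, it is worth persisting the merged scores so they can be reloaded without re-calling the APIs (the output filename below is illustrative):

scores_path = Path("data/protein/benchbb_antibody_targets/variant_scores.csv")  # hypothetical output file
samples_df.to_csv(scores_path, index=False)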

With the log-probability (lp) scores computed, the separate ESM-C scores for the heavy and light chains are summed, and then all scores are normalized by sequence length for comparison across targets, since the number of residues determines the number of terms in each sum.

normalized_samples_df = samples_df.copy()
normalized_samples_df["esmc_lp_combined"] = normalized_samples_df["esmc_lp_light"] + normalized_samples_df["esmc_lp_heavy"]
normalized_samples_df["heavy_len"] = normalized_samples_df["heavy"].str.len()
normalized_samples_df["light_len"] = normalized_samples_df["light"].str.len()
normalized_samples_df["esmc_lp_combined_normalized"] = normalized_samples_df["esmc_lp_combined"] / (normalized_samples_df["heavy_len"] + normalized_samples_df["light_len"])
normalized_samples_df["esmc_lp_heavy_normalized"] = normalized_samples_df["esmc_lp_heavy"] / normalized_samples_df["heavy_len"]
normalized_samples_df["esmc_lp_light_normalized"] = normalized_samples_df["esmc_lp_light"] / normalized_samples_df["light_len"]
normalized_samples_df["igbert_lp_normalized"] = normalized_samples_df["igbert_lp"] / (normalized_samples_df["heavy_len"] + normalized_samples_df["light_len"])

Finally, these normalized scores are visualized against the AntiFold scores. The AntiFold score and the IgBERT lp appear to follow a more consistent trend across targets, as might be expected of two antibody-specific models, while the ESM-C combined lps show much less overlap between targets and no obvious ranking corresponding to either of the antibody-specific scores.

score_features = ['score', "esmc_lp_combined_normalized", "esmc_lp_light_normalized","esmc_lp_heavy_normalized", "igbert_lp_normalized"]
g = sns.pairplot(normalized_samples_df[score_features + ['target']], hue='target', palette='Paired')
[Figure: pairplot of the AntiFold score and the normalized ESM-C and IgBERT log-probabilities, colored by target.]
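The visual impression can be quantified with rank correlations between the scores; Spearman correlation is a natural choice here since only the relative ordering of candidates matters:

corr = normalized_samples_df[score_features].corr(method="spearman")
print(corr.round(2))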
