Learn how to perform an in silico deep mutational scan with three of the five ESM-1v models. This method is suitable for finding the most likely variants one to three mutations away from a parent sequence. For going beyond three mutations, other methods such as inverse folding or finetuning a generative model are more appropriate and less time-consuming.




Predict Stabilizing Variants via Deep Mutational Scan

While ESM-2 has similar performance to ESM-1v, the latter is still a great option for performing an in silico deep mutational scan. ESM-1v is a collection of five models, all trained on sequences from UniRef, each with a different random seed. Ensembling the scores of these five models can produce better results than a single ESM-1v model for broad tasks such as identifying stabilizing mutations.

Since these models were only trained on functional molecules, they are capable of assessing whether a new molecule might also be functional, or whether it has a disastrous mutation.

Imagine only learning and speaking English your whole life. Could you identify whether a new English word is English? Probably. What about whether a Korean word is Korean, or whether a French word is French? You might not be able to identify a Korean word as specifically Korean, since you were only exposed to English, but you can tell that some foreign words are at least not English. The idea is similar with this zero-shot model, ESM-1v: how likely is a sequence to be functional, given that the model was only ever exposed to functional sequences?

Set Your API Token

In order to use the BioLM API, you need to have a token. You can get one from the User API Tokens page.

Paste the API token you generated in the cell below, as the value of the variable BIOLMAI_TOKEN.

In [1]:
BIOLMAI_TOKEN = " "  # YOUR TOKEN GOES HERE!!!

We run the deep mutational scan by masking each position in turn and requesting the scores for every residue at that position. For a sequence of even a couple hundred residues, this is thousands of predictions from each of the five GPU-based models. BioLM has APIs to get these results at scale much faster than spinning up these models on multiple GPUs yourself.
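As a minimal sketch of the masking step (an illustration, not BioLM's own implementation), the hypothetical helper `masked_variants` below builds one masked copy of a sequence per position. For the 238-residue sequence used in this tutorial, that would be 238 requests per model, each returning a score for every candidate residue.

```python
def masked_variants(seq, mask_token="<mask>"):
    """Yield (position, masked_sequence) pairs, masking one residue at a time."""
    for pos in range(len(seq)):
        # Replace exactly one residue with the mask token, keep the rest intact
        yield pos, seq[:pos] + mask_token + seq[pos + 1:]


toy = "ASKGE"  # toy sequence, for illustration only
for pos, masked in masked_variants(toy):
    print(pos, masked)
```

Each masked string can then be passed to a request function such as the `esm1v_pred` defined in this tutorial.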

In [2]:
# Install or import `requests` package 
try:
    # Install packages to make API requests in JLite
    import micropip
    await micropip.install('requests')
    await micropip.install('pyodide-http')
    # Patch requests for in-browser support
    import pyodide_http
    pyodide_http.patch_all()
    await micropip.install('pandas')
    await micropip.install('matplotlib')
    await micropip.install('seaborn')
except ModuleNotFoundError:
    pass  # Won't be using micropip outside of JLite

import requests
In [3]:
# Let's use the sequence from the paper
# https://www.biorxiv.org/content/10.1101/2021.07.09.450648v2.full.pdf
WILDTYPE = "ASKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKRHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK"

print("Sequence length: {}".format(len(WILDTYPE)))
Sequence length: 238

We can copy working Python code for making a request to BioLM.ai's ESM-1v API:

In [4]:
def esm1v_pred(seq, model_num=1):
    """Get un-masking predictions from any of the 5 ESM-1v models via REST API.
    
    From API docs: https://api.biolm.ai/#54d3367a-b5d1-4f18-aa44-23ae3914a446
    """
    SLUG = f'esm1v-n{model_num}'
    url = f'https://biolm.ai/api/v2/{SLUG}/predict/'
    
    assert '<mask>' in seq
    assert 5 >= model_num >= 1

    data = {
      "items": [
        {
          "sequence": seq
        }
      ]
    }
    
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Token {BIOLMAI_TOKEN.strip()}",
    }
    
    # Make the request!
    response = requests.post(
        url=url,
        headers=headers,
        json=data,
    )

    return response

Now test our function on the first residue and first model.

In [5]:
result = esm1v_pred('<mask>' + WILDTYPE[1:], model_num=1)  # First ESM-1v model

pred = result.json()
pred
Out[5]:
{'results': [[{'token': 20,
    'token_str': 'M',
    'score': 0.90923011302948,
    'sequence': 'M S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'},
   {'token': 4,
    'token_str': 'L',
    'score': 0.00893514882773161,
    'sequence': 'L S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'},
   {'token': 7,
    'token_str': 'V',
    'score': 0.007254290860146284,
    'sequence': 'V S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'},
   {'token': 6,
    'token_str': 'G',
    'score': 0.0071580721996724606,
    'sequence': 'G S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'},
   {'token': 12,
    'token_str': 'I',
    'score': 0.006673584226518869,
    'sequence': 'I S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'},
   {'token': 8,
    'token_str': 'S',
    'score': 0.00605518976226449,
    'sequence': 'S S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'},
   {'token': 15,
    'token_str': 'K',
    'score': 0.005998397711664438,
    'sequence': 'K S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'},
   {'token': 11,
    'token_str': 'T',
    'score': 0.00533581618219614,
    'sequence': 'T S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'},
   {'token': 5,
    'token_str': 'A',
    'score': 0.00529194762930274,
    'sequence': 'A S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'},
   {'token': 18,
    'token_str': 'F',
    'score': 0.005088561214506626,
    'sequence': 'F S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'},
   {'token': 13,
    'token_str': 'D',
    'score': 0.005014363676309586,
    'sequence': 'D S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'},
   {'token': 9,
    'token_str': 'E',
    'score': 0.004764892160892487,
    'sequence': 'E S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'},
   {'token': 17,
    'token_str': 'N',
    'score': 0.004458808805793524,
    'sequence': 'N S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'},
   {'token': 19,
    'token_str': 'Y',
    'score': 0.004456241149455309,
    'sequence': 'Y S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'},
   {'token': 14,
    'token_str': 'P',
    'score': 0.0035370574332773685,
    'sequence': 'P S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'},
   {'token': 10,
    'token_str': 'R',
    'score': 0.0032345277722924948,
    'sequence': 'R S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'},
   {'token': 16,
    'token_str': 'Q',
    'score': 0.0027003914583474398,
    'sequence': 'Q S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'},
   {'token': 21,
    'token_str': 'H',
    'score': 0.0017103749560192227,
    'sequence': 'H S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'},
   {'token': 22,
    'token_str': 'W',
    'score': 0.0015011748764663935,
    'sequence': 'W S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'},
   {'token': 23,
    'token_str': 'C',
    'score': 0.0014352151192724705,
    'sequence': 'C S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'}]]}
In [6]:
from IPython.display import JSON

JSON(pred)
Out[6]:
<IPython.core.display.JSON object>

We can see that "M" (methionine) is the most likely starting amino acid. Methionine is a common starting amino acid in many organisms.

Next, we can get predictions from the other ESM-1v models, then ensemble the likelihood scores for best results.
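To read off the top prediction programmatically, a small sketch follows. It assumes the response JSON has the `results` shape shown in the output above (a list with one list of per-residue dicts per submitted item); `top_residue` is a hypothetical helper, not part of the BioLM API.

```python
def top_residue(payload):
    """Return (token_str, score) of the highest-scoring residue for the first item."""
    results = payload.get("results")
    if not results or not results[0]:
        # Missing or empty results usually mean the request failed upstream
        raise ValueError("No predictions found in response payload")
    best = max(results[0], key=lambda r: r["score"])
    return best["token_str"], best["score"]


# Toy payload mimicking the shape of the output above (scores truncated)
example = {"results": [[
    {"token": 20, "token_str": "M", "score": 0.909},
    {"token": 4, "token_str": "L", "score": 0.009},
]]}
print(top_residue(example))
```

Raising on an empty payload makes a failed request visible immediately instead of surfacing later as a confusing KeyError.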

In [7]:
res = esm1v_pred('<mask>' + WILDTYPE[1:], model_num=2)  # Second ESM-1v model

pred = res.json()
JSON(pred)
Out[7]:
<IPython.core.display.JSON object>

You can observe the probabilities for each token (amino acid residue) differing between the two models.

Loading the data into a pandas DataFrame (DF) is simple.

In [8]:
import pandas as pd
In [9]:
pd.DataFrame.from_dict(pred['results'][0], orient='columns')
Out[9]:
token token_str score sequence
0 20 M 0.945713 M S K G E E L F T G V V P I L V E L D G D V N ...
1 4 L 0.004731 L S K G E E L F T G V V P I L V E L D G D V N ...
2 6 G 0.004624 G S K G E E L F T G V V P I L V E L D G D V N ...
3 8 S 0.004320 S S K G E E L F T G V V P I L V E L D G D V N ...
4 15 K 0.003869 K S K G E E L F T G V V P I L V E L D G D V N ...
5 7 V 0.003802 V S K G E E L F T G V V P I L V E L D G D V N ...
6 12 I 0.003527 I S K G E E L F T G V V P I L V E L D G D V N ...
7 5 A 0.003489 A S K G E E L F T G V V P I L V E L D G D V N ...
8 9 E 0.003188 E S K G E E L F T G V V P I L V E L D G D V N ...
9 11 T 0.003166 T S K G E E L F T G V V P I L V E L D G D V N ...
10 13 D 0.002908 D S K G E E L F T G V V P I L V E L D G D V N ...
11 18 F 0.002779 F S K G E E L F T G V V P I L V E L D G D V N ...
12 14 P 0.002708 P S K G E E L F T G V V P I L V E L D G D V N ...
13 17 N 0.002593 N S K G E E L F T G V V P I L V E L D G D V N ...
14 10 R 0.002212 R S K G E E L F T G V V P I L V E L D G D V N ...
15 19 Y 0.002082 Y S K G E E L F T G V V P I L V E L D G D V N ...
16 16 Q 0.001598 Q S K G E E L F T G V V P I L V E L D G D V N ...
17 21 H 0.001131 H S K G E E L F T G V V P I L V E L D G D V N ...
18 22 W 0.000831 W S K G E E L F T G V V P I L V E L D G D V N ...
19 23 C 0.000615 C S K G E E L F T G V V P I L V E L D G D V N ...

Let's scan the whole sequence now, using several of the five ESM-1v models. This should be fairly fast sequentially. Let's simply loop through each model and position. For a final workflow, we would want to average predictions from all five models (the ESM-1v all endpoint does this directly), but for the sake of speed, we'll only use three here.

In [10]:
import copy

results = []

def f(idx, seq, mdl):
    if idx % 20 == 0:
        print("Submitting twenty positions from {}/{} with ESM-1v Model {}".format(idx, len(seq), mdl))
    # Mask that position
    seq_masked = list(copy.copy(seq))
    seq_masked[idx] = '<mask>'
    seq_masked = ''.join(seq_masked)
    r = esm1v_pred(seq_masked, model_num=mdl)
    pred = r.json()
    # Request the probabilities of each AA at that mask
    r_df = pd.DataFrame.from_dict(pred['results'][0], orient='columns')
    r_df['model_num'] = mdl
    r_df['mask_pos'] = idx
    
    return r_df

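# We'll just query the first 3 models for now, that's enough
# NOTE: `results = ops` replaces the list on every pass, so only the last
# model's (Model 3) DataFrames are kept, which is what the outputs below show.
# Use `results.extend(ops)` instead to keep the predictions from all three models.
for mdl in range(1, 4):
    ops = [f(i, WILDTYPE, mdl) for i in range(len(WILDTYPE))]
    results = ops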
Submitting twenty positions from 0/238 with ESM-1v Model 1
Submitting twenty positions from 20/238 with ESM-1v Model 1
Submitting twenty positions from 40/238 with ESM-1v Model 1
Submitting twenty positions from 60/238 with ESM-1v Model 1
Submitting twenty positions from 80/238 with ESM-1v Model 1
Submitting twenty positions from 100/238 with ESM-1v Model 1
Submitting twenty positions from 120/238 with ESM-1v Model 1
Submitting twenty positions from 140/238 with ESM-1v Model 1
Submitting twenty positions from 160/238 with ESM-1v Model 1
Submitting twenty positions from 180/238 with ESM-1v Model 1
Submitting twenty positions from 200/238 with ESM-1v Model 1
Submitting twenty positions from 220/238 with ESM-1v Model 1
Submitting twenty positions from 0/238 with ESM-1v Model 2
Submitting twenty positions from 20/238 with ESM-1v Model 2
Submitting twenty positions from 40/238 with ESM-1v Model 2
Submitting twenty positions from 60/238 with ESM-1v Model 2
Submitting twenty positions from 80/238 with ESM-1v Model 2
Submitting twenty positions from 100/238 with ESM-1v Model 2
Submitting twenty positions from 120/238 with ESM-1v Model 2
Submitting twenty positions from 140/238 with ESM-1v Model 2
Submitting twenty positions from 160/238 with ESM-1v Model 2
Submitting twenty positions from 180/238 with ESM-1v Model 2
Submitting twenty positions from 200/238 with ESM-1v Model 2
Submitting twenty positions from 220/238 with ESM-1v Model 2
Submitting twenty positions from 0/238 with ESM-1v Model 3
Submitting twenty positions from 20/238 with ESM-1v Model 3
Submitting twenty positions from 40/238 with ESM-1v Model 3
Submitting twenty positions from 60/238 with ESM-1v Model 3
Submitting twenty positions from 80/238 with ESM-1v Model 3
Submitting twenty positions from 100/238 with ESM-1v Model 3
Submitting twenty positions from 120/238 with ESM-1v Model 3
Submitting twenty positions from 140/238 with ESM-1v Model 3
Submitting twenty positions from 160/238 with ESM-1v Model 3
Submitting twenty positions from 180/238 with ESM-1v Model 3
Submitting twenty positions from 200/238 with ESM-1v Model 3
Submitting twenty positions from 220/238 with ESM-1v Model 3
In [11]:
results[0]  # Look at the first result DF we made
Out[11]:
token token_str score sequence model_num mask_pos
0 20 M 0.859892 M S K G E E L F T G V V P I L V E L D G D V N ... 3 0
1 4 L 0.014622 L S K G E E L F T G V V P I L V E L D G D V N ... 3 0
2 7 V 0.011390 V S K G E E L F T G V V P I L V E L D G D V N ... 3 0
3 15 K 0.010787 K S K G E E L F T G V V P I L V E L D G D V N ... 3 0
4 6 G 0.010584 G S K G E E L F T G V V P I L V E L D G D V N ... 3 0
5 12 I 0.009044 I S K G E E L F T G V V P I L V E L D G D V N ... 3 0
6 8 S 0.008830 S S K G E E L F T G V V P I L V E L D G D V N ... 3 0
7 5 A 0.008566 A S K G E E L F T G V V P I L V E L D G D V N ... 3 0
8 11 T 0.008226 T S K G E E L F T G V V P I L V E L D G D V N ... 3 0
9 13 D 0.007893 D S K G E E L F T G V V P I L V E L D G D V N ... 3 0
10 9 E 0.007137 E S K G E E L F T G V V P I L V E L D G D V N ... 3 0
11 18 F 0.007050 F S K G E E L F T G V V P I L V E L D G D V N ... 3 0
12 17 N 0.007002 N S K G E E L F T G V V P I L V E L D G D V N ... 3 0
13 14 P 0.006728 P S K G E E L F T G V V P I L V E L D G D V N ... 3 0
14 19 Y 0.006330 Y S K G E E L F T G V V P I L V E L D G D V N ... 3 0
15 10 R 0.004718 R S K G E E L F T G V V P I L V E L D G D V N ... 3 0
16 16 Q 0.004328 Q S K G E E L F T G V V P I L V E L D G D V N ... 3 0
17 21 H 0.002733 H S K G E E L F T G V V P I L V E L D G D V N ... 3 0
18 22 W 0.002262 W S K G E E L F T G V V P I L V E L D G D V N ... 3 0
19 23 C 0.001721 C S K G E E L F T G V V P I L V E L D G D V N ... 3 0

Concatenate the resulting DFs together vertically, stacking their rows (axis=0).

In [12]:
results_df = pd.concat(results, axis=0)

results_df.shape
Out[12]:
(4760, 6)
In [13]:
results_df.sample(5)  # Sample some of the results to get a preview
Out[13]:
token token_str score sequence model_num mask_pos
1 11 T 0.088199 A S K G E E L F T G V V P I L V E L D G D V N ... 3 96
14 20 M 0.034169 A S K G E E L F T G V V P I L V E L D G D V N ... 3 80
13 12 I 0.036599 A S K G E E L F T G V V P I L V E L D G D V N ... 3 170
14 14 P 0.026359 A S K G E E L F T G V V P I L V E L D G D V N ... 3 117
19 23 C 0.004771 A S K G E E L F T G V V P I L V E L D G D V N ... 3 170
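For the ensembling step mentioned earlier, one option is to average the score of each (mask_pos, token_str) pair across models. The sketch below is plain Python and assumes rows shaped like those above (for example from `results_df.to_dict('records')`) that contain predictions from more than one model; `ensemble_scores` is a hypothetical helper, and simple averaging is only one possible way to combine the models.

```python
from collections import defaultdict


def ensemble_scores(rows):
    """Average `score` across models for each (mask_pos, token_str) pair."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["mask_pos"], row["token_str"])].append(row["score"])
    return {key: sum(vals) / len(vals) for key, vals in grouped.items()}


# Toy rows from two models at a single masked position
rows = [
    {"mask_pos": 0, "token_str": "M", "score": 0.9, "model_num": 1},
    {"mask_pos": 0, "token_str": "M", "score": 0.7, "model_num": 2},
    {"mask_pos": 0, "token_str": "L", "score": 0.1, "model_num": 1},
    {"mask_pos": 0, "token_str": "L", "score": 0.3, "model_num": 2},
]
avg = ensemble_scores(rows)
print(avg)
```

With pandas, the same averaging can be written as `results_df.groupby(['mask_pos', 'token_str'])['score'].mean()`.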

Let's quickly fill in the wild-type (WT) residue for each row so we can plot the results.

In [14]:
results_df['wt_token_str'] = results_df.mask_pos.apply(lambda x: WILDTYPE[x])

results_df.sample(3)
Out[14]:
token token_str score sequence model_num mask_pos wt_token_str
13 10 R 0.030813 A S K G E E L F T G V V P I L V E L D G D V N ... 3 82 F
12 6 G 0.026441 A S K G E E L F T G V V P I L V E L D G D V N ... 3 201 S
3 6 G 0.069592 A S K G E E L F T G V V P I L V E L D G D V N ... 3 153 A

Lastly, build a DF of scores with one row per variant residue and one column per sequence position, then plot it as a heatmap of the most probable stabilizing variants from the in silico deep mutational scan above.

In [15]:
heatmap_df = []  # Each row represents the full sequence, with a single token_str at each position

for residue in results_df.token_str.drop_duplicates().to_list():
    tmp_row = []
    for pos in range(len(WILDTYPE)):
        token_position_score = results_df.query('token_str == @residue & mask_pos == @pos').score.iloc[0]
        tmp_row.append(token_position_score)
    heatmap_df.append(tmp_row)
In [16]:
heatmap_df = pd.DataFrame(heatmap_df)

heatmap_df.shape
Out[16]:
(20, 238)
In [17]:
heatmap_df.head(1).iloc[0][0]
Out[17]:
0.8598923683166504
In [18]:
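from matplotlib import pyplot as plt
import seaborn as sns

fig = plt.figure(figsize=[15, 4])

ax = sns.heatmap(heatmap_df)
# Label every `step`-th column with its wild-type residue; cell centers sit at index + 0.5
step = max(1, len(WILDTYPE) // 50)
xticks = list(range(0, len(WILDTYPE), step))
ax.set_xticks([x + 0.5 for x in xticks])
_ = ax.set_xticklabels([WILDTYPE[x] for x in xticks])
# One row per variant residue, in the same order used to build heatmap_df
variants = results_df.token_str.drop_duplicates().to_list()
ax.set_yticks([y + 0.5 for y in range(len(variants))])
_ = ax.set_yticklabels(variants)
_ = ax.set(xlabel='Wildtype Residue', ylabel='Variant')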

This is a quick depiction of the results. What you want to look for in the DF are mutations whose score is greater than the score of the wild-type residue at the same position. These can be extracted with some quick slicing of results_df, which we'll leave up to you to explore. There appear to be several mutations with greater likelihood than the wild-type residue.
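As a starting point for that exploration, here is a minimal sketch in plain Python. It assumes rows with `mask_pos`, `token_str` and `score` keys from a single model or an ensemble (for example `results_df.to_dict('records')`); `beneficial_candidates` is a hypothetical helper, and ranking by the log ratio of variant to wild-type score is one common choice rather than the only one.

```python
import math


def beneficial_candidates(rows, wildtype):
    """Return (mutation, log_ratio) for variants scoring above the wild-type residue."""
    # Score of the wild-type residue at each masked position
    wt_score = {
        r["mask_pos"]: r["score"]
        for r in rows
        if r["token_str"] == wildtype[r["mask_pos"]]
    }
    hits = []
    for r in rows:
        pos, aa = r["mask_pos"], r["token_str"]
        if aa != wildtype[pos] and pos in wt_score and r["score"] > wt_score[pos]:
            # Name mutations as <wild-type><1-based position><variant>, e.g. A1M
            name = f"{wildtype[pos]}{pos + 1}{aa}"
            hits.append((name, math.log(r["score"] / wt_score[pos])))
    # Largest improvement over wild-type first
    return sorted(hits, key=lambda h: h[1], reverse=True)


toy_wt = "AS"  # toy two-residue wild-type, for illustration only
toy_rows = [
    {"mask_pos": 0, "token_str": "M", "score": 0.90},
    {"mask_pos": 0, "token_str": "A", "score": 0.01},
    {"mask_pos": 1, "token_str": "S", "score": 0.60},
    {"mask_pos": 1, "token_str": "T", "score": 0.20},
]
print(beneficial_candidates(toy_rows, toy_wt))
```

A high score at the first position, such as the methionine seen earlier, likely reflects the model's preference for a start residue rather than stability, so candidates there deserve extra scrutiny.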

Next Steps

Check out additional tutorials at jupyter.biolm.ai, or head over to our BioLM Documentation to explore additional models and functionality.

See more use-cases and APIs on your BioLM Console Catalog.

