Predict Stabilizing Variants via Deep Mutational Scan¶
While ESM2 has similar performance to ESM-1v, the latter remains a great option for performing an in silico deep mutational scan. ESM-1v is a collection of five models trained on UniRef90 sequences, each initialized with a different random seed. Ensembling the scores of all five models often produces better results than any single ESM-1v model on tasks such as identifying stabilizing mutations.
Since these models were trained only on sequences of functional proteins, they can assess whether a new sequence is likely to be functional, or whether it carries a deleterious mutation.
Imagine only learning and speaking English your whole life: could you identify whether a new English word is English? Probably. What about whether a Korean word is Korean? Or whether a French word is French? You might not be able to identify a Korean word as specifically Korean, since you were only exposed to English, but you can tell that some foreign words are at least not English. The idea is similar with a zero-shot model like ESM-1v: having seen only functional sequences during training, it can score how likely a new sequence is to be functional.
We do this by masking each position and requesting the scores for each residue at that position. Even for a sequence of a couple hundred residues, that amounts to thousands of predictions from each of the five GPU-based models: for the 238-residue sequence below, 238 positions × 20 amino acids is 4,760 scores per model, or 23,800 across all five. BioLM has APIs to get these results at scale, much faster than it would take to spin up these models on multiple GPUs yourself.
import time
from biolmai import BioLM
from IPython.display import JSON # Helpful UI for JSON display
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

# Let's use the sequence from the paper
# https://www.biorxiv.org/content/10.1101/2021.07.09.450648v2.full.pdf
WILDTYPE = "ASKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKRHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK"
print("Sequence length: {}".format(len(WILDTYPE)))We can copy working Python code for making a request to BioLM.ai's ESM-1v API:
def esm1v_pred(seq):
    """Get un-masking predictions from all 5 ESM-1v models via REST API.

    From API docs: https://api.biolm.ai/#54d3367a-b5d1-4f18-aa44-23ae3914a446
    """
    assert '<mask>' in seq
    response = BioLM(entity="esm1v-all", action="predict", type="sequence", items=[seq])
    return response

Now test our function on the first residue; the esm1v-all endpoint returns predictions from all five models at once.
result = esm1v_pred('<mask>' + WILDTYPE[1:])  # Mask the first residue
JSON(result)

We can see that "M", or Methionine, is the most likely starting AA. Methionine is a common starting AA among many organisms.
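As a quick programmatic check of the top-ranked token, we can take the argmax of one model's scores. A minimal sketch, assuming each model's entry is column-oriented with parallel 'score' and 'token_str' lists (the same structure we load into DataFrames below):

scores_n1 = result["esm1v-n1"]
# Pair each score with its token and take the highest-scoring pair
best_score, best_token = max(zip(scores_n1['score'], scores_n1['token_str']))
print(best_token, best_score)  # Expect Methionine ("M") on top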
Next, we can examine predictions from the other ESM-1v models, then ensemble the likelihood scores for best results.
JSON(result["esm1v-n2"])You can observe the probabilities for each token (amino acid residue) differing between the two models.
Loading the data into a DF is simple.
pd.DataFrame.from_dict(result["esm1v-n2"], orient='columns')

Let's scan the whole sequence now. We simply loop through each position, masking one residue at a time; the esm1v-all endpoint returns all five models' predictions per call, so we collect every model's scores and can ensemble them afterwards. This should be fairly fast even sequentially.
import copy

results = []

def f(idx, seq):
    if idx % 20 == 0:
        print("Submitting position {}/{} to the ESM-1v models".format(idx, len(seq)))
    # Mask that position
    seq_masked = list(copy.copy(seq))
    seq_masked[idx] = '<mask>'
    seq_masked = ''.join(seq_masked)
    r = esm1v_pred(seq_masked)
    # Collect the probabilities of each AA at that mask, one DF per model
    for mdl in range(1, 6):
        r_df = pd.DataFrame.from_dict(r[f"esm1v-n{mdl}"], orient='columns')
        r_df['model_num'] = mdl
        r_df['mask_pos'] = idx
        results.append(r_df)

ops = [f(i, WILDTYPE) for i in range(len(WILDTYPE))]

results[0]  # Look at the first result DF we made
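If the sequential loop is too slow for your use case, one option is to issue the per-position requests concurrently from the client. A minimal sketch with a thread pool; the mask_at() helper and the worker count of 8 are our own choices here, so tune them to your rate limits:

from concurrent.futures import ThreadPoolExecutor

def mask_at(idx, seq):
    # Replace the residue at position idx with the mask token
    return seq[:idx] + '<mask>' + seq[idx + 1:]

masked_seqs = [mask_at(i, WILDTYPE) for i in range(len(WILDTYPE))]
with ThreadPoolExecutor(max_workers=8) as pool:
    responses = list(pool.map(esm1v_pred, masked_seqs))
# Each response can then be unpacked into DataFrames exactly as in f() above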
Concatenate the resulting DFs together.

results_df = pd.concat(results, axis=0)
results_df.shape

results_df.sample(5)  # Sample some of the results to get a preview
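Because results_df keeps one row per model for every position and residue, ensembling the five models' scores is a simple groupby-average; a minimal sketch:

ensembled_df = (results_df
                .groupby(['mask_pos', 'token_str'], as_index=False)
                .score.mean())
ensembled_df.head()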
Let's quickly fill in the WT character for each row so we can plot the results.

results_df['wt_token_str'] = results_df.mask_pos.apply(lambda x: WILDTYPE[x])
results_df.sample(3)

Lastly, build a DF of per-position scores and display it as a heatmap to spot the most probable stabilizing variants from the above in silico deep mutational scan.
heatmap_df = []  # Each row represents the full sequence, with a single token_str at each position
for residue in results_df.token_str.drop_duplicates().to_list():
    tmp_row = []
    for pos in range(len(WILDTYPE)):
        # Average the five models' scores for this residue at this position
        token_position_score = results_df.query('token_str == @residue & mask_pos == @pos').score.mean()
        tmp_row.append(token_position_score)
    heatmap_df.append(tmp_row)

heatmap_df = pd.DataFrame(heatmap_df)
heatmap_df.shape

heatmap_df.head(1).iloc[0][0]

fig = plt.figure(figsize=[15, 4])
ax = sns.heatmap(heatmap_df)
_ = ax.set_xticklabels(list(WILDTYPE)[::int(len(WILDTYPE)/50)])
_ = ax.set_yticklabels(results_df.token_str.drop_duplicates())
_ = ax.set(xlabel='Wildtype Residue', ylabel='Variant')

This is a quick depiction of the results. What you want to look for are mutations whose score is greater than the score of the wild-type residue at the same position; these can be extracted with some quick slicing of results_df. A starting sketch is below, and we leave further exploration up to you. There appear to be several mutations with greater likelihood than the wild-type sequence.
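One possible slice, as a minimal sketch assuming the columns created above: average the models per position and residue, attach the wild-type residue's score at each position, and keep the variants that beat it.

mean_scores = (results_df
               .groupby(['mask_pos', 'wt_token_str', 'token_str'], as_index=False)
               .score.mean())
# Score of the wild-type residue at each position
wt_scores = (mean_scores.query('token_str == wt_token_str')[['mask_pos', 'score']]
             .rename(columns={'score': 'wt_score'}))
candidates = mean_scores.merge(wt_scores, on='mask_pos')
stabilizing = candidates.query('score > wt_score and token_str != wt_token_str')
stabilizing.sort_values('score', ascending=False).head(10)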
Next Steps¶
Check out additional tutorials at jupyter.biolm.ai, or head over to our BioLM Documentation to explore additional models and functionality.
See more use-cases and APIs on your BioLM Console Catalog.¶
BioLM hosts deep learning models and runs inference at scale. You do the science.¶
Contact us to learn more.¶