Building a Variant Effect Predictor for Antibody Affinity

Introduction

Antibody affinity maturation—the process of improving the binding strength between an antibody and its target antigen—is a critical step in therapeutic antibody development. Natural affinity maturation occurs in vivo during B cell development, where somatic hypermutation generates antibody variants with improved binding. In vitro affinity maturation mimics this process, using techniques like error-prone PCR, DNA shuffling, or computational design to generate variant libraries that are then screened for improved binding. However, experimental affinity maturation remains resource-intensive, requiring the expression and screening of hundreds to thousands of variants. Machine learning offers a transformative approach: predicting the effect of mutations on binding affinity before experimental testing, enabling intelligent design of focused variant libraries and dramatically accelerating the maturation process.

Understanding Antibody-Antigen Binding

Antibody binding affinity is determined by the complementarity between the antibody’s paratope (the binding site, primarily in the complementarity-determining regions or CDRs) and the antigen’s epitope. The binding energy arises from multiple non-covalent interactions:

  • Hydrogen bonds: Polar interactions between donor and acceptor groups
  • Van der Waals forces: Weak attractive forces from electron correlation
  • Electrostatic interactions: Coulombic attractions between charged residues
  • Hydrophobic effects: Entropic driving force from water exclusion

The dissociation constant $K_D$ quantifies binding affinity:

$$K_D = \frac{k_{off}}{k_{on}}$$

where $k_{off}$ is the dissociation rate and $k_{on}$ is the association rate. Lower $K_D$ indicates higher affinity (stronger binding). Affinity is often reported in molar concentrations, with typical therapeutic antibodies achieving low nanomolar to picomolar affinity.

The change in binding free energy upon mutation, $\Delta \Delta G$, is the key quantity to predict:

$$\Delta \Delta G = \Delta G_{mut} - \Delta G_{wt}$$

where $\Delta G_{wt}$ is the wild-type binding free energy and $\Delta G_{mut}$ is the mutant binding free energy. Positive $\Delta \Delta G$ indicates reduced affinity; negative values indicate improved affinity.
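
In practice, affinity is measured as $K_D$, not free energy, so the two formulas above are linked through $\Delta G = RT \ln K_D$: a fold-change in $K_D$ maps directly to a $\Delta \Delta G$. A minimal sketch of this conversion (at 298 K, in kcal/mol):

```python
import math

R = 1.987e-3  # gas constant, kcal/(mol*K)
T = 298.0     # temperature, K

def ddg_from_kd(kd_wt, kd_mut, temperature=T):
    """Convert a K_D fold-change into a binding free-energy change (kcal/mol).

    Positive ddG => the mutant binds more weakly than wild type.
    """
    return R * temperature * math.log(kd_mut / kd_wt)

# A 10-fold loss in affinity (10 nM -> 100 nM) costs ~1.4 kcal/mol
print(round(ddg_from_kd(10e-9, 100e-9), 2))  # -> 1.36
```

This is also the conversion used to turn published $K_D$ tables into $\Delta \Delta G$ training labels.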

Approaches to Variant Effect Prediction

1. Physics-Based Methods

Physics-based approaches use molecular mechanics force fields or quantum chemical calculations to compute binding energies:

$$E_{binding} = E_{complex} - E_{antibody} - E_{antigen}$$

Popular methods include:

  • Rosetta ddG
  • FoldX
  • MM/GBSA (molecular mechanics with a generalized Born solvent model)
  • Free energy perturbation (FEP)

While physically rigorous, these methods are computationally expensive and require structural models.

2. Knowledge-Based Potentials

Statistical potentials derive interaction preferences from known protein structures:

$$W_{ij} = -\ln \frac{p_{ij}}{p_i p_j}$$

where $p_{ij}$ is the observed frequency of residue pair $(i,j)$ and $p_i$ is the marginal frequency. These potentials can be used to score variants.
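
The formula above can be implemented in a few lines. The sketch below assumes the input is a symmetric matrix of observed residue-pair contact counts (e.g., from interface contacts in solved structures); negative $W_{ij}$ marks pairs seen more often than chance, i.e. favoured contacts:

```python
import numpy as np

def pairwise_potential(contact_counts):
    """Knowledge-based potential W_ij = -ln(p_ij / (p_i * p_j)).

    contact_counts: square symmetric array of observed pair counts.
    Negative W_ij => the pair occurs more often than expected by chance.
    """
    p_ij = contact_counts / contact_counts.sum()   # joint pair frequencies
    p_i = p_ij.sum(axis=1)                         # marginal frequencies
    expected = np.outer(p_i, p_i)                  # independence baseline
    return -np.log(p_ij / expected)
```

Scoring a variant then amounts to summing $W_{ij}$ over the interface contacts it introduces or removes.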

3. Machine Learning Approaches

ML approaches learn sequence-function relationships from experimental data:

  • Sequence-based models: CNNs, RNNs, or transformers on antibody sequences
  • Structure-based models: GNNs or 3D CNNs on antibody-antigen complexes
  • Hybrid models: Combining sequence and structural features

Building a Variant Effect Predictor

Data Preparation

The foundation of any ML model is high-quality training data. For antibody affinity prediction, relevant datasets include:

  1. Therapeutic antibody binding data: Published $K_D$ measurements for antibody variants
  2. Deep mutational scanning: High-throughput assays measuring the effect of all mutations at a specific position
  3. Phage display selections: Ranking data from enrichment experiments

Each data point consists of:

  • Wild-type antibody sequence (or structure)
  • Mutation(s) introduced
  • Measured affinity change ($\Delta \Delta G$ or fold-change)
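
Mutations in such datasets are usually recorded in compact wild-type/position/mutant notation such as Y102W. A small parser and applier for that convention (assuming 1-based positions, as typically reported) keeps the rest of the pipeline clean:

```python
import re
from typing import NamedTuple

class Mutation(NamedTuple):
    wt_aa: str
    position: int  # 1-based, as conventionally reported
    mut_aa: str

MUTATION_RE = re.compile(r"^([ACDEFGHIKLMNPQRSTVWY])(\d+)([ACDEFGHIKLMNPQRSTVWY])$")

def parse_mutation(code):
    """Parse a mutation code like 'Y102W' into its components."""
    m = MUTATION_RE.match(code)
    if m is None:
        raise ValueError(f"Unrecognised mutation code: {code!r}")
    return Mutation(m.group(1), int(m.group(2)), m.group(3))

def apply_mutation(seq, mut):
    """Return the mutated sequence, checking the wild-type residue matches."""
    idx = mut.position - 1
    if seq[idx] != mut.wt_aa:
        raise ValueError(f"Expected {mut.wt_aa} at {mut.position}, found {seq[idx]}")
    return seq[:idx] + mut.mut_aa + seq[idx + 1:]
```

The wild-type check catches off-by-one numbering errors early, which are a common source of silent label corruption.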

Feature Engineering

Features capture the relevant properties of antibodies and mutations:

import numpy as np

# Amino acid physicochemical properties
AA_PROPERTIES = {
    'A': {'hydrophobicity': 1.8, 'charge': 0, 'volume': 88.6, 'polarity': 0, 'mass': 89},
    'R': {'hydrophobicity': -4.5, 'charge': 1, 'volume': 173.4, 'polarity': 52, 'mass': 174},
    # ... complete dictionary
}

def extract_mutation_features(wt_seq, mut_seq, position):
    wt_aa = wt_seq[position]
    mut_aa = mut_seq[position]
    
    features = [
        AA_PROPERTIES[mut_aa]['hydrophobicity'] - AA_PROPERTIES[wt_aa]['hydrophobicity'],
        AA_PROPERTIES[mut_aa]['charge'] - AA_PROPERTIES[wt_aa]['charge'],
        AA_PROPERTIES[mut_aa]['volume'] - AA_PROPERTIES[wt_aa]['volume'],
        AA_PROPERTIES[mut_aa]['polarity'] - AA_PROPERTIES[wt_aa]['polarity'],
        AA_PROPERTIES[mut_aa]['mass'] - AA_PROPERTIES[wt_aa]['mass'],
        # Position-specific features (approximate heavy-chain CDR
        # boundaries; exact ranges depend on the numbering scheme)
        position / len(wt_seq),  # Relative position
        1 if 26 <= position <= 35 else 0,   # CDR-H1
        1 if 50 <= position <= 65 else 0,   # CDR-H2
        1 if 95 <= position <= 110 else 0,  # CDR-H3
    ]
    return np.array(features)

Sequence Representation

Beyond simple features, modern approaches use learned representations:

  1. One-hot encoding: Simple but high-dimensional
  2. k-mer features: Capturing local sequence context
  3. Pre-trained embeddings: Using ESM-2 or ProtGPT2 for rich contextual representations

from transformers import EsmModel, EsmTokenizer
import torch

class AntibodyEncoder:
    def __init__(self, model_name="facebook/esm2_t33_650M_UR50D"):
        self.model = EsmModel.from_pretrained(model_name)
        self.model.eval()  # disable dropout for deterministic embeddings
        self.tokenizer = EsmTokenizer.from_pretrained(model_name)
        
    def encode(self, sequence):
        inputs = self.tokenizer(sequence, return_tensors="pt", 
                               padding=True, truncation=True)
        with torch.no_grad():
            outputs = self.model(**inputs)
        # Use mean pooling of last hidden state
        embeddings = outputs.last_hidden_state.mean(dim=1)
        return embeddings.numpy()

Model Architectures

1. CNN for Sequence Classification

import torch
import torch.nn as nn

class CNNVariantPredictor(nn.Module):
    def __init__(self, vocab_size=21, embed_dim=64, kernel_sizes=[3, 5, 7]):
        super(CNNVariantPredictor, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        
        self.convs = nn.ModuleList([
            nn.Conv1d(embed_dim, 128, k) for k in kernel_sizes
        ])
        
        self.fc = nn.Sequential(
            nn.Linear(128 * len(kernel_sizes), 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 1)
        )
        
    def forward(self, x):
        x = self.embedding(x).transpose(1, 2)
        conv_outputs = [torch.relu(conv(x)) for conv in self.convs]
        pooled = [torch.max(conv_out, dim=2)[0] for conv_out in conv_outputs]
        concat = torch.cat(pooled, dim=1)
        return self.fc(concat)
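
The predictor above consumes integer token indices, not raw strings. A minimal encoder, assuming index 0 is reserved for padding (matching `padding_idx=0` in the embedding layer), might look like:

```python
import numpy as np

# 20 standard amino acids; index 0 is reserved for padding, matching
# nn.Embedding(vocab_size=21, ..., padding_idx=0) in the model.
AA_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_IDX = {aa: i + 1 for i, aa in enumerate(AA_ALPHABET)}

def encode_sequences(sequences, max_len=None):
    """Integer-encode sequences and right-pad with 0 to a common length."""
    if max_len is None:
        max_len = max(len(s) for s in sequences)
    batch = np.zeros((len(sequences), max_len), dtype=np.int64)
    for row, seq in enumerate(sequences):
        for col, aa in enumerate(seq):
            batch[row, col] = AA_TO_IDX[aa]
    return batch

batch = encode_sequences(["EVQLV", "QVQ"])  # shape (2, 5)
```

`torch.from_numpy(batch)` then feeds directly into the CNN's `forward`.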

2. Transformer Architecture

class TransformerVariantPredictor(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=4):
        super(TransformerVariantPredictor, self).__init__()
        self.embedding = nn.Embedding(21, d_model)
        self.pos_encoder = PositionalEncoding(d_model)
        
        # batch_first keeps tensors as (batch, seq, d_model) throughout
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=512,
                                                   batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)
        
        self.regressor = nn.Sequential(
            nn.Linear(d_model, d_model // 2),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(d_model // 2, 1)
        )
        
    def forward(self, x):
        x = self.embedding(x)
        x = self.pos_encoder(x)
        x = self.transformer(x)
        x = x.mean(dim=1)  # Global average pooling
        return self.regressor(x)
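
The `PositionalEncoding` module referenced above is not defined in the snippet. One standard sinusoidal implementation (Vaswani et al., 2017), written here for batch-first inputs of shape `(batch, seq_len, d_model)`:

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding for batch-first inputs."""

    def __init__(self, d_model, max_len=512):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2, dtype=torch.float)
            * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)  # even dims: sine
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dims: cosine
        self.register_buffer("pe", pe.unsqueeze(0))   # (1, max_len, d_model)

    def forward(self, x):
        return x + self.pe[:, : x.size(1)]
```

Registering `pe` as a buffer keeps it on the right device without treating it as a trainable parameter.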

3. Graph Neural Network for Structure-Based Prediction

When antibody-antigen complex structures are available, GNNs can model the binding interface:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import MessagePassing
from torch_geometric.utils import add_self_loops

class InterfaceGNN(MessagePassing):
    def __init__(self, node_dim, edge_dim, hidden_dim):
        super(InterfaceGNN, self).__init__(aggr='add')
        self.node_encoder = nn.Linear(node_dim, hidden_dim)
        self.edge_encoder = nn.Linear(edge_dim, hidden_dim)
        self.message_mlp = nn.Sequential(
            nn.Linear(3 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim)
        )
        self.update_mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim)
        )
        
    def forward(self, x, edge_index, edge_attr):
        # Self-loops need matching edge features, so pad them with zeros
        edge_index, edge_attr = add_self_loops(
            edge_index, edge_attr, fill_value=0., num_nodes=x.size(0)
        )
        x = self.node_encoder(x)
        edge_attr = self.edge_encoder(edge_attr)
        
        return self.propagate(edge_index, x=x, edge_attr=edge_attr)
    
    def message(self, x_i, x_j, edge_attr):
        # Combine both endpoint embeddings with the edge features
        return self.message_mlp(torch.cat([x_i, x_j, edge_attr], dim=-1))
    
    def update(self, aggr_out, x):
        return self.update_mlp(torch.cat([x, aggr_out], dim=-1))
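
The GNN above needs a graph as input. A common construction, sketched here with an assumed 8 Å cutoff, is a residue-contact graph over C-alpha coordinates with the inter-residue distance as the edge feature:

```python
import numpy as np

def interface_graph(coords, cutoff=8.0):
    """Build a residue-contact graph from C-alpha coordinates.

    coords: (n, 3) array of positions in angstroms. Returns
    (edge_index, edge_attr): edges connect residue pairs closer than
    `cutoff`, and each edge carries the pairwise distance as a feature.
    """
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    src, dst = np.where((dist < cutoff) & (dist > 0))  # exclude self-pairs
    edge_index = np.stack([src, dst])                  # shape (2, n_edges)
    edge_attr = dist[src, dst][:, None]                # shape (n_edges, 1)
    return edge_index, edge_attr
```

The arrays convert to tensors via `torch.from_numpy` before being passed to `InterfaceGNN.forward`.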

Training Strategy

1. Data Splitting

Critical considerations:

  • Avoid data leakage: variants of the same parent sequence should stay in the same split
  • Use sequence clustering to ensure generalization
  • Validate on held-out targets

from sklearn.model_selection import GroupKFold

# Group variants by parent sequence so related variants stay together
groups = variant_df['parent_sequence'].values
gkf = GroupKFold(n_splits=5)

for train_idx, val_idx in gkf.split(X, y, groups):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    # Train on this fold; parent groupings never straddle the split

2. Loss Functions

For regression of $\Delta \Delta G$:

  • MSE Loss: Standard mean squared error
  • MAE Loss: More robust to outliers
  • Ranking Loss: Optimize for correct ordering of variants

class RankingLoss(nn.Module):
    """Pairwise hinge loss rewarding concordant orderings of variants."""
    def forward(self, pred, true):
        n = pred.size(0)
        pred_expanded = pred.unsqueeze(1).expand(n, n)
        true_expanded = true.unsqueeze(1).expand(n, n)
        
        # ranking[i, j] > 0 when pred and true agree on the order of (i, j)
        ranking = (pred_expanded - pred_expanded.T) * (true_expanded - true_expanded.T)
        return F.relu(1 - ranking).mean()

3. Uncertainty Quantification

Essential for practical use—models should express confidence:

class EnsembleUncertainty:
    def __init__(self, models):
        self.models = models
        
    def predict_with_uncertainty(self, x):
        predictions = torch.stack([model(x) for model in self.models])
        mean = predictions.mean(dim=0)
        std = predictions.std(dim=0)
        return mean, std

Evaluation Metrics

Key metrics for variant effect prediction:

  1. Pearson Correlation: Measures linear relationship between predicted and observed
  2. Spearman Correlation: Measures rank correlation (often more relevant)
  3. RMSE/MAE: Prediction accuracy in energy units
  4. Top-K Recovery: Fraction of true top-K variants recovered in predicted top-K
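
These metrics are straightforward to compute with NumPy and SciPy. The helper below is illustrative and assumes lower $\Delta \Delta G$ means a better binder, so the "top" variants are those with the lowest values:

```python
import numpy as np
from scipy import stats

def evaluate_predictions(y_true, y_pred, k=10):
    """Headline metrics for a set of variant ddG predictions."""
    pearson = stats.pearsonr(y_true, y_pred)[0]
    spearman = stats.spearmanr(y_true, y_pred)[0]
    rmse = np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))
    # Top-K recovery: overlap between the true and predicted top-K sets
    true_top = set(np.argsort(y_true)[:k])  # lowest ddG = best binders
    pred_top = set(np.argsort(y_pred)[:k])
    top_k = len(true_top & pred_top) / k
    return {'pearson': pearson, 'spearman': spearman,
            'rmse': rmse, 'top_k_recovery': top_k}
```

Reporting all four together guards against models that fit the bulk of the data well but misrank the extreme variants that matter for library design.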

Practical Considerations

1. Data Augmentation

  • Reverse mutations: if A→R has a known $\Delta \Delta G$, the reverse R→A has approximately the negated value
  • Sequence masking: predict masked positions
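
Reverse-mutation augmentation exploits the antisymmetry of $\Delta \Delta G$: swapping wild type and mutant negates the label. A sketch, assuming a simple `(wt_seq, mut_seq, ddg)` record format:

```python
def augment_with_reverse(records):
    """Double a ddG dataset using antisymmetry: ddG(B->A) ~ -ddG(A->B).

    records: list of (wt_seq, mut_seq, ddg) tuples; the mutant sequence
    of each original becomes the 'wild type' of its reversed example.
    """
    augmented = list(records)
    for wt_seq, mut_seq, ddg in records:
        augmented.append((mut_seq, wt_seq, -ddg))
    return augmented
```

The reversed examples must stay in the same cross-validation fold as their originals, or the augmentation itself becomes a leakage source.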

2. Transfer Learning

Pre-train on general protein mutation data, fine-tune on antibody-specific data:

# Pre-trained on general protein variants
base_model = load_pretrained_model("protein_variant_predictor")
# Fine-tune on antibody data
for param in base_model.parameters():
    param.requires_grad = False
    
# Add antibody-specific head
head = nn.Linear(768, 1)
# Fine-tune head on antibody data

3. Active Learning

Iteratively select variants to test experimentally:

def select_variants_for_testing(model, candidates, n_select=10):
    predictions, uncertainties = model.predict_with_uncertainty(candidates)
    # Lower predicted ddG means stronger binding, so negate predictions;
    # the uncertainty bonus encourages exploration (UCB-style acquisition)
    scores = -predictions.squeeze(-1) + 0.5 * uncertainties.squeeze(-1)
    return candidates[torch.topk(scores, n_select).indices]

Applications in Affinity Maturation

A complete ML-driven affinity maturation workflow:

  1. Starting point: Antibody with moderate affinity ($K_D \approx 10$ nM)
  2. Generate candidates: Use ML to predict effects of all possible CDR mutations
  3. Filter: Remove variants with predicted developability issues
  4. Priority ranking: Select top candidates for experimental testing
  5. Iterate: Use new data to retrain models, repeat

This approach can reduce the number of variants that need experimental testing from thousands to hundreds.
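
Step 2 of the workflow is easy to make concrete. The hypothetical helper below enumerates every single-point variant at a chosen set of CDR positions, producing the candidate pool the model then scores:

```python
AA_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def enumerate_single_mutants(seq, positions):
    """All single-point variants of `seq` at the given 0-based positions.

    Returns (mutation_code, mutant_sequence) pairs, with codes in the
    conventional 1-based wt/position/mut notation (e.g. 'E1A').
    """
    variants = []
    for pos in positions:
        for aa in AA_ALPHABET:
            if aa != seq[pos]:
                variants.append((f"{seq[pos]}{pos + 1}{aa}",
                                 seq[:pos] + aa + seq[pos + 1:]))
    return variants
```

Each position yields 19 substitutions, so even a modest CDR footprint of 30 positions gives 570 single mutants to rank in silico.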

Challenges and Future Directions

  1. Data scarcity: High-quality affinity measurement data is limited
  2. Generalization: Models may not transfer across targets
  3. Epistasis: Interactions between multiple mutations are hard to capture
  4. Structural context: Modeling both bound and unbound states

Future directions include:

  • Foundation models for antibody variants
  • Integration with structure prediction (AlphaFold)
  • Diffusion models for variant generation
  • In silico screening with physics-based refinement

Conclusion

Building an effective variant effect predictor for antibody affinity requires careful attention to data quality, feature engineering, model architecture, and training strategy. While challenging, ML-based approaches offer transformative potential for antibody engineering, enabling rational design of improved therapeutics and dramatically accelerating the affinity maturation process. As training data accumulates and models improve, ML-guided affinity maturation will become standard practice in therapeutic antibody development.

Brook Tilahun
Computational Biology Scientist

Applying machine learning and AI to accelerate therapeutic antibody discovery and protein engineering.