Inference-Time Scaling for De Novo Antibody Design

Last updated on May 17, 2026

Introduction

The insight behind inference-time scaling is simple: generating one candidate and hoping it’s good is rarely optimal. A model with a fixed set of weights can produce dramatically better outputs if, at inference, you spend compute exploring the output space and selecting against a reward signal. This idea drove much of the performance gain behind reasoning-focused LLMs. The same principle applies directly to de novo antibody design — and arguably matters more, because the design objective is explicit (binding, stability, manufacturability) and the scoring models to evaluate it are increasingly capable.

The standard workflow for diffusion-based antibody design generates a batch of sequences, ranks them by some heuristic, and sends the top hits to the bench. That ranking step is where most of the opportunity lives. The question is not just how many candidates to generate, but how to navigate sequence space during generation to concentrate probability on high-reward regions.

The Design Problem

De novo design requires jointly optimizing over a combinatorial sequence space. For a CDR-H3 loop of length 15, the space of possible sequences is $20^{15} \approx 3 \times 10^{19}$ — unreachable by exhaustive search. Generative models approximate the distribution of viable binders, but the tails of that distribution, the exceptional candidates, are what actually matter.

Let $p_\theta(s)$ be the generative model’s distribution over sequences and $r(s)$ be a reward (predicted affinity, stability, or a composite). The goal is to sample from:

$$p^*(s) \propto p_\theta(s) \cdot \exp\left(\frac{r(s)}{\tau}\right)$$

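where $\tau$ controls the trade-off between diversity and exploitation: a small $\tau$ concentrates mass on high-reward sequences, while a large $\tau$ recovers the base model. This is the energy-based model view of guided generation. Inference-time strategies differ in how they approximate sampling from $p^*$.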
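One way to see what $\tau$ does is to reweight samples already drawn from $p_\theta$ with self-normalized importance weights $w_i \propto \exp(r_i / \tau)$. The sketch below is illustrative only: the reward values are made up, and `tilted_weights` and `effective_sample_size` are hypothetical helper names, not part of any library.

```python
import math

def tilted_weights(rewards, tau):
    """Self-normalized weights w_i proportional to exp(r_i / tau)."""
    m = max(rewards)  # subtract the max so exp() cannot overflow
    unnorm = [math.exp((r - m) / tau) for r in rewards]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def effective_sample_size(weights):
    """1 / sum(w_i^2): close to n for uniform weights, close to 1 when one sample dominates."""
    return 1.0 / sum(w * w for w in weights)

rewards = [0.1, 0.5, 0.9, 1.3]  # made-up rewards for four base-model samples
for tau in (100.0, 1.0, 0.01):
    w = tilted_weights(rewards, tau)
    print(tau, [round(x, 3) for x in w], round(effective_sample_size(w), 2))
```

The effective sample size is a rough proxy for how much diversity survives the reweighting at a given $\tau$.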

Strategy 1: Best-of-N with a Scoring Model

The simplest approach: generate $N$ candidates from the base model, score each, and return the best. For Gaussian-tailed reward distributions, the expected maximum over $N$ draws grows roughly as $\sqrt{2 \ln N}$ standard deviations above the mean, so returns diminish quickly but the early gains come without touching the model.
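To get a feel for how slowly the best-of-N gain grows, the following sketch estimates the expected maximum of $N$ standard normal rewards by simulation and prints it next to $\sqrt{2 \ln N}$, a standard upper bound for that quantity. The standard normal reward is an assumption made for illustration; real oracle scores will have a different distribution.

```python
import math
import random

def expected_max(n, trials=300, seed=0):
    """Monte Carlo estimate of E[max of n standard normal draws]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, 1.0) for _ in range(n))
    return total / trials

for n in (10, 100, 1000):
    bound = math.sqrt(2 * math.log(n))
    print(f"N={n:5d}  estimated E[max]={expected_max(n):.2f}  sqrt(2 ln N)={bound:.2f}")
```

Each tenfold increase in $N$ buys a smaller increment than the last, which is why best-of-N is usually paired with a cheap first-pass scorer.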

The key is the scoring function. ESM2 pseudo-log-likelihood works as a zero-shot plausibility filter before running more expensive structure-based scorers.

import torch
from transformers import EsmForMaskedLM, EsmTokenizer

tokenizer = EsmTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = EsmForMaskedLM.from_pretrained("facebook/esm2_t33_650M_UR50D").eval()

def pseudo_log_likelihood(seq):
    enc = tokenizer(seq, return_tensors="pt")
    ids = enc["input_ids"][0]
    score = 0.0
    for pos in range(1, len(ids) - 1):
        masked = enc["input_ids"].clone()
        masked[0, pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logp = model(input_ids=masked,
                         attention_mask=enc["attention_mask"]).logits[0, pos].log_softmax(-1)
        score += logp[ids[pos]].item()
    return score

def best_of_n(candidates):
    return max(candidates, key=pseudo_log_likelihood)

In practice, a tiered funnel works well: PLL filter at $N = 500$, structure prediction on the top 50, physics-based interface scoring on the top 10. Each later tier is too expensive to run on every candidate, so the cheap filter at the top is what makes a large $N$ affordable.
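The funnel itself is a few lines of code. The sketch below assumes each tier is a `(score_fn, keep)` pair where higher scores are better; the three toy scorers are placeholders standing in for PLL, structure prediction, and interface scoring, not real implementations of them.

```python
def funnel(candidates, tiers):
    """Apply (score_fn, keep) tiers in order, keeping the top `keep` after each.

    Returns the surviving candidates and the number of scorer calls per tier.
    """
    pool = list(candidates)
    calls = []
    for score_fn, keep in tiers:
        calls.append(len(pool))  # one scorer call per candidate still in the pool
        pool = sorted(pool, key=score_fn, reverse=True)[:keep]
    return pool, calls

# Toy stand-ins for the three tiers (hypothetical scorers, cheapest first).
cheap = lambda s: s.count("Y")
medium = lambda s: s.count("W")
costly = lambda s: -s.count("P")

candidates = ["ARDYYW", "ARDPYW", "GGGGGG", "ARWWYY", "PPPPPP"]
survivors, calls = funnel(candidates, [(cheap, 3), (medium, 2), (costly, 1)])
print(survivors, calls)
```

The `calls` list makes the cost structure explicit: the expensive scorer only ever sees what the cheaper tiers let through.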

Strategy 2: Guided Diffusion

Rather than scoring after generation, guidance steers the denoising trajectory itself. At each step $t$, the predicted noise $\epsilon_\theta$ is perturbed by the gradient of a differentiable reward:

$$\tilde{\epsilon}_t = \epsilon_\theta(x_t, t) - \lambda \cdot \nabla_{x_t} r(x_t)$$

Increasing $\lambda$ (the guidance scale) shifts probability mass toward high-reward regions. This is the inference knob: the model weights are frozen, but guidance strength is a free parameter at generation time.

def guided_denoise_step(model, x_t, t, guide_fn, guidance_scale=5.0):
    x_t = x_t.detach().requires_grad_(True)
    with torch.enable_grad():
        grad = torch.autograd.grad(guide_fn(x_t).sum(), x_t)[0]

    with torch.no_grad():
        eps = model(x_t.detach(), t)
        return eps - guidance_scale * grad

The main risk is mode collapse: very high guidance pushes every sample toward the same sequence. In practice, guidance scales of 3–8 improve mean affinity without collapsing diversity, but this is problem-dependent and worth sweeping.
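A sweep needs a diversity number to put next to the mean score. A minimal sketch, assuming equal-length sequences and a hypothetical `sample_fn(scale)` that wraps your guided sampler and returns decoded sequences, is to track mean pairwise Hamming distance per guidance scale:

```python
from itertools import combinations

def mean_pairwise_hamming(seqs):
    """Average fraction of mismatched positions over all pairs (equal-length sequences)."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return 0.0
    dists = [sum(a != b for a, b in zip(s, t)) / len(s) for s, t in pairs]
    return sum(dists) / len(dists)

def sweep_guidance(sample_fn, score_fn, scales):
    """Return (scale, mean score, diversity) for each guidance scale.

    sample_fn(scale) -> list of sequences is an assumed interface, not a library call.
    """
    rows = []
    for scale in scales:
        seqs = sample_fn(scale)
        mean_score = sum(score_fn(s) for s in seqs) / len(seqs)
        rows.append((scale, mean_score, mean_pairwise_hamming(seqs)))
    return rows
```

A diversity value that drops toward zero while the mean score keeps rising is the signature of collapse.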

Strategy 3: MCMC on Sequence Space

Starting from an initial candidate (from diffusion, retrieval, or a known binder), Metropolis-Hastings explores the local neighborhood. Each step proposes a point mutation and accepts or rejects based on a score ratio. Temperature controls the exploration-exploitation balance.

$$P(\text{accept}) = \min\left(1,\; \exp\left(\frac{r(s') - r(s)}{\tau}\right)\right)$$

import random, math

AA = list("ACDEFGHIKLMNPQRSTVWY")

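def mcmc_design(init_seq, score_fn, n_steps=2000, temp=1.0):
    seq = list(init_seq)
    current = score_fn("".join(seq))

    for _ in range(n_steps):
        # propose a single point mutation at a random position
        pos = random.randrange(len(seq))
        old, seq[pos] = seq[pos], random.choice(AA)
        proposed = score_fn("".join(seq))
        # Metropolis rule: always accept improvements; accept a worse score
        # with probability exp((proposed - current) / temp), otherwise revert
        if proposed < current and random.random() > math.exp((proposed - current) / temp):
            seq[pos] = old
        else:
            current = proposed

    # final state of the chain (not necessarily the best sequence visited)
    return "".join(seq)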

Running multiple independent chains from diverse starting points and pooling the survivors reduces the risk of getting stuck in local optima. Simulated annealing, where $\tau$ decreases over the run, is a practical improvement: explore broadly early, exploit late.
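An annealing schedule can be as simple as a geometric decay. The sketch below is one common choice rather than the only one; the start and end temperatures are arbitrary illustration values, and `geometric_schedule` and `accept_prob` are hypothetical helper names.

```python
import math

def geometric_schedule(t_start, t_end, n_steps):
    """Temperatures decaying geometrically from t_start to t_end over n_steps."""
    if n_steps == 1:
        return [t_start]
    ratio = (t_end / t_start) ** (1.0 / (n_steps - 1))
    return [t_start * ratio ** k for k in range(n_steps)]

def accept_prob(delta, tau):
    """Metropolis acceptance probability for a reward change delta = r(s') - r(s)."""
    if delta >= 0:
        return 1.0  # improvements are always accepted; also avoids overflow in exp
    return math.exp(delta / tau)

temps = geometric_schedule(2.0, 0.05, 5)
for tau in temps:
    # the same downhill move (delta = -1) becomes less acceptable as tau drops
    print(round(tau, 3), round(accept_prob(-1.0, tau), 4))
```

Passing `temps[step]` instead of a fixed temperature into the acceptance test is the only change an annealed chain needs.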

Strategy 4: Sequential Monte Carlo

SMC generalizes best-of-N by applying resampling inside the diffusion trajectory rather than only at the end. A population of particles denoises in parallel; at each step, particles are reweighted by incremental reward and low-weight particles are pruned and replaced by copies of high-weight ones.

$$w_t^{(i)} \propto \exp\left(r(x_t^{(i)}) - r(x_{t+1}^{(i)})\right)$$

def smc_design(model, score_fn, n_particles=64, n_steps=20):
    # model.sample_prior, model.denoise_step and a batched score_fn
    # (one score per particle) are assumed interfaces
    particles = model.sample_prior(n_particles)
    prev_score = score_fn(particles)

    for t in reversed(range(n_steps)):
        particles = model.denoise_step(particles, t)
        score = score_fn(particles)

        # incremental log-weight: r(x_t) - r(x_{t+1})
        log_w = score - prev_score

        # resample in proportion to the incremental weights; after
        # resampling, all particles carry equal weight again
        idx = torch.multinomial(log_w.softmax(0), n_particles, replacement=True)
        particles, prev_score = particles[idx], score[idx]

    return particles[prev_score.argmax()]

SMC tends to produce better diversity than pure guidance at equal compute, because resampling preserves multiple high-reward modes rather than collapsing toward the gradient’s peak.
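How much diversity survives also depends on the resampling scheme. Systematic resampling is a standard lower-variance alternative to multinomial resampling: it uses one random offset and evenly spaced thresholds, so a particle with normalized weight $w$ receives either $\lfloor Nw \rfloor$ or $\lceil Nw \rceil$ copies. The sketch below is a swapped-in technique for illustration, not something the listing above uses.

```python
import random

def systematic_resample(weights, rng=random):
    """Return resampled indices using a single random offset and even spacing.

    `weights` are non-negative and need not be normalized.
    """
    n = len(weights)
    total = sum(weights)
    u0 = rng.random() / n            # one uniform draw in [0, 1/n)
    idx = []
    j, cum = 0, weights[0] / total   # running cumulative normalized weight
    for i in range(n):
        u = u0 + i / n               # evenly spaced thresholds
        while u >= cum and j < n - 1:
            j += 1
            cum += weights[j] / total
        idx.append(j)
    return idx

print(systematic_resample([0.1, 0.1, 0.7, 0.1], random.Random(0)))
```

With uniform weights every particle is kept exactly once, whereas multinomial resampling would drop some particles by chance.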

The Oracle Problem

All of these strategies amplify the signal in the scoring function. If the oracle is a poor proxy for wet-lab affinity, you will efficiently design sequences that score well in silico and fail at the bench. This is the central risk.

A useful heuristic: if your oracle is a model trained on your own assay data, estimate its out-of-distribution confidence (ensemble variance, conformal prediction intervals) and treat high-uncertainty regions as low-reward. Inference-time scaling should be coupled with uncertainty quantification, not just point estimates.
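One simple way to fold uncertainty into the reward is a lower confidence bound over an ensemble: the mean prediction minus a multiple of the ensemble spread. The sketch below assumes `ensemble` is a list of callables that each map a sequence to a predicted score; the penalty weight `k` is a tuning choice, not a recommended value.

```python
from statistics import mean, pstdev

def pessimistic_reward(seq, ensemble, k=1.0):
    """Ensemble mean minus k standard deviations (a lower confidence bound)."""
    preds = [f(seq) for f in ensemble]
    return mean(preds) - k * pstdev(preds)

# Toy ensembles (hypothetical predictors): same mean, different agreement.
agreeing = [lambda s: 1.0, lambda s: 1.0, lambda s: 1.0]
disagreeing = [lambda s: 0.0, lambda s: 2.0]

print(pessimistic_reward("ARDYW", agreeing))     # members agree, no penalty
print(pessimistic_reward("ARDYW", disagreeing))  # members disagree, reward is reduced
```

Any of the four strategies can take this function in place of a point-estimate score, so the search is discouraged from drifting into regions where the oracle is guessing.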

Practical Tradeoffs

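Strategy	Compute	Diversity	Oracle calls
Best-of-N	$O(N)$	High	$N$
Guided diffusion	$O(N \cdot T)$	Medium	$N \cdot T$ (gradient evaluations)
MCMC	$O(\text{steps})$	Medium	one per step
SMC	$O(N \cdot T)$	High	$N \cdot T$

Here $N$ is the number of candidates or particles and $T$ is the number of denoising steps.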

For most lab budgets, a two-stage approach works well: SMC or guided diffusion for initial generation (captures diversity), followed by MCMC fine-tuning of top candidates (local optimization). Oracle calls, not wall-clock time, are usually the binding constraint.
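Because oracle calls are the quantity to budget, it helps to count them and to avoid paying twice for the same sequence. A minimal sketch of a caching wrapper, assuming the oracle is deterministic and takes a sequence string (`CountingOracle` is a hypothetical name):

```python
class CountingOracle:
    """Wrap a scoring function, cache results, and count unique evaluations."""

    def __init__(self, score_fn):
        self.score_fn = score_fn
        self.cache = {}
        self.calls = 0

    def __call__(self, seq):
        if seq not in self.cache:
            self.calls += 1  # only uncached sequences cost an oracle call
            self.cache[seq] = self.score_fn(seq)
        return self.cache[seq]

oracle = CountingOracle(lambda s: s.count("W"))  # toy scorer for illustration
for s in ["ARW", "ARW", "WWW", "ARW"]:
    oracle(s)
print(oracle.calls)
```

MCMC chains in particular revisit sequences often, so the cache can cut the real oracle bill well below the nominal step count.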

Conclusion

Inference-time scaling reframes the design problem: instead of asking “how do I train a better generative model?”, ask “given a fixed model, how do I extract better candidates at generation time?” For antibody design specifically, where the scoring signal is interpretable and the compute is manageable, this shift is practical today. The strategies above — Best-of-N, guided diffusion, MCMC, and SMC — span a tradeoff space between compute, diversity, and oracle reliance. Choosing among them depends less on the generative model architecture than on the quality and throughput of whatever scoring function you trust.

Brook Tilahun

Computational Biology Scientist