Introduction to Graph Neural Networks for Protein-Protein Interactions

Introduction

Proteins rarely act in isolation; instead, they participate in complex networks of physical and functional interactions that orchestrate cellular processes. Understanding these protein-protein interactions (PPIs) is fundamental to elucidating cellular pathways, identifying drug targets, and developing therapeutic interventions. Traditional experimental approaches to mapping PPIs—yeast two-hybrid screening, co-immunoprecipitation, affinity purification mass spectrometry—have generated valuable datasets but remain incomplete and cannot capture the full complexity of the interactome. Machine learning, particularly graph neural networks (GNNs), offers a powerful approach to predict PPIs, inferring interactions from protein properties and network topology.

Protein-Protein Interactions: Biological Context

Protein-protein interactions can be classified into several categories:

  1. Transient vs. Stable: Transient interactions are temporary and often regulatory, while stable interactions form permanent complexes
  2. Physical vs. Functional: Physical interactions involve direct molecular contact; functional interactions involve participation in the same pathway
  3. Domain-Mediated: Many PPIs are mediated by specific protein domains that recognize complementary motifs

The interactome—the complete set of protein interactions in an organism—varies significantly across species. The human interactome is estimated to contain between 100,000 and 650,000 interactions, though current experimental data covers only a fraction.

Why Graph Representations?

The natural representation of PPI data is a graph, or network, where:

  • Nodes represent proteins
  • Edges represent physical interactions between proteins
  • Node and edge features encode protein properties and interaction characteristics

This graph structure contains rich topological information that encodes functional relationships. Proteins in the same pathway or complex tend to cluster together, and the network position of a protein often correlates with its biological function. GNNs are specifically designed to learn from such graph-structured data, making them ideal for PPI prediction.

Graph Neural Network Fundamentals

GNNs generalize convolutional neural networks to irregular graph structures. The core principle is that nodes update their representations by aggregating information from their neighbors.

Message Passing Framework

The standard GNN operates through a message passing framework, also known as graph convolution. At each layer $l$, node representations are updated according to:

$$h_v^{(l+1)} = \text{UPDATE}\left(h_v^{(l)}, \text{AGG}\left(\{ m_{u \to v}^{(l)} : u \in \mathcal{N}(v) \}\right)\right)$$

where:

  • $h_v^{(l)}$ is the representation of node $v$ at layer $l$
  • $\mathcal{N}(v)$ is the set of neighbors of node $v$
  • $m_{u \to v}^{(l)}$ is the message from neighbor $u$ to node $v$
  • AGG is an aggregation function
  • UPDATE is an update function
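
This update rule can be sketched as a minimal PyTorch layer using mean aggregation over a dense adjacency matrix (the class name, shapes, and choice of ReLU are illustrative, not prescribed by the framework):

```python
import torch
import torch.nn as nn

class MeanMessagePassingLayer(nn.Module):
    """One round of message passing with mean aggregation (illustrative sketch)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.msg = nn.Linear(in_dim, out_dim)               # message function
        self.update = nn.Linear(in_dim + out_dim, out_dim)  # update function

    def forward(self, h, adj):
        # h: (N, in_dim) node features; adj: (N, N) binary adjacency matrix
        messages = self.msg(h)                               # per-node messages
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)      # avoid divide-by-zero
        agg = adj @ messages / deg                           # mean over neighbors
        return torch.relu(self.update(torch.cat([h, agg], dim=1)))
```

Stacking $k$ such layers lets each node incorporate information from its $k$-hop neighborhood.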

Message Functions

Common message function choices include:

  1. Graph Convolutional Network (GCN): $$m_{u \to v} = \frac{1}{\sqrt{d_u d_v}} W h_u$$

where $d_u$ and $d_v$ are node degrees and $W$ is a learnable weight matrix.

  2. GraphSAGE: $$m_{u \to v} = W \cdot \text{CONCAT}(h_v, h_u)$$

  3. Graph Attention Networks (GAT): $$m_{u \to v} = \alpha_{uv} W h_u$$

where attention coefficients $\alpha_{uv}$ are computed via: $$\alpha_{uv} = \frac{\exp(\text{LeakyReLU}(a^T[W h_u \| W h_v]))}{\sum_{k \in \mathcal{N}(v)} \exp(\text{LeakyReLU}(a^T[W h_k \| W h_v]))}$$
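
In practice, the degree normalization in the GCN message is often precomputed as a symmetrically normalized adjacency matrix. A minimal sketch (note that the common GCN variant adds self-loops before normalizing, which the per-message formula above omits):

```python
import torch

def gcn_normalize(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} as used by GCN (sketch)."""
    a_hat = adj + torch.eye(adj.size(0))  # add self-loops
    deg = a_hat.sum(dim=1)
    d_inv_sqrt = deg.pow(-0.5)
    d_inv_sqrt[torch.isinf(d_inv_sqrt)] = 0.0  # guard isolated nodes
    # Entry (u, v) becomes a_hat[u, v] / sqrt(d_u * d_v)
    return d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
```

A full GCN layer then reduces to `gcn_normalize(adj) @ h @ W` followed by a nonlinearity.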

Aggregation Functions

Aggregation functions combine messages from multiple neighbors:

  1. Mean Aggregation: $\text{AGG}(\{m_u\}) = \frac{1}{|\mathcal{N}|} \sum_{u \in \mathcal{N}} m_u$
  2. Max Aggregation: $\text{AGG}(\{m_u\}) = \max_{u \in \mathcal{N}} m_u$
  3. Sum Aggregation: $\text{AGG}(\{m_u\}) = \sum_{u \in \mathcal{N}} m_u$
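
These three aggregators can be sketched in a few lines of PyTorch (the function name and tensor layout are illustrative):

```python
import torch

def aggregate(messages, mode):
    """Combine neighbor messages of shape (num_neighbors, dim) into one vector."""
    if mode == "mean":
        return messages.mean(dim=0)
    if mode == "max":
        return messages.max(dim=0).values  # elementwise max over neighbors
    if mode == "sum":
        return messages.sum(dim=0)
    raise ValueError(f"unknown aggregation: {mode}")
```

The choice matters: sum preserves neighborhood size (useful when degree is informative), mean is invariant to it, and max captures the strongest signal per dimension.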

Architectures for PPI Prediction

1. Sequence-Based GNNs

In this approach, proteins are represented as nodes with features derived from amino acid sequences. Each protein sequence is processed to generate a feature vector:

  • One-hot encoding or k-mer frequencies
  • Physicochemical property vectors
  • Pre-trained protein language model embeddings (ESM, ProtBERT)

These features initialize node representations, which are then refined through message passing to incorporate information from the PPI network topology.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    """Single-head graph attention layer over a dense adjacency matrix."""

    def __init__(self, in_features, out_features, dropout=0.1, alpha=0.2):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.dropout = dropout

        # Shared linear transform W and attention vector a
        self.W = nn.Parameter(torch.zeros(size=(in_features, out_features)))
        nn.init.xavier_uniform_(self.W.data, gain=1.414)
        self.a = nn.Parameter(torch.zeros(size=(2 * out_features, 1)))
        nn.init.xavier_uniform_(self.a.data, gain=1.414)

        self.leakyrelu = nn.LeakyReLU(alpha)

    def forward(self, input_data, adj):
        # Project node features: (N, in_features) -> (N, out_features)
        h = torch.mm(input_data, self.W)
        N = h.size()[0]

        # Build all pairwise concatenations [W h_u || W h_v]: (N, N, 2 * out_features)
        a_input = torch.cat(
            [h.repeat(1, N).view(N * N, -1), h.repeat(N, 1)], dim=1
        ).view(N, -1, 2 * self.out_features)
        # Raw attention logits e_uv
        e = self.leakyrelu(torch.matmul(a_input, self.a).squeeze(2))

        # Mask non-edges with a large negative value so softmax assigns them ~0 weight
        zero_vec = -9e15 * torch.ones_like(e)
        attention = torch.where(adj > 0, e, zero_vec)
        attention = F.softmax(attention, dim=1)
        attention = F.dropout(attention, self.dropout, training=self.training)
        # Attention-weighted sum of neighbor features
        h_prime = torch.matmul(attention, h)

        return F.elu(h_prime)

2. Structure-Based GNNs

When 3D structural information is available, proteins can be represented as graphs of residues with edges connecting spatially proximate residues. This captures:

  • Domain architecture
  • Interface residues
  • Allosteric relationships
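
A common starting point is to connect residues whose Cα atoms fall within a distance cutoff; 8 Å is a typical but illustrative choice, and the function name below is hypothetical:

```python
import torch

def contact_graph(ca_coords, cutoff=8.0):
    """Build a residue adjacency matrix from C-alpha coordinates of shape (L, 3).
    Residues within `cutoff` angstroms are connected (illustrative sketch)."""
    dist = torch.cdist(ca_coords, ca_coords)  # (L, L) pairwise distances
    adj = (dist < cutoff).float()
    adj.fill_diagonal_(0.0)                   # no self-edges
    return adj
```

Node features for such residue graphs are typically one-hot amino acid identities or per-residue embeddings from a protein language model.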

3. Heterogeneous Networks

More sophisticated approaches incorporate multiple node and edge types:

  • Different protein types (enzyme, receptor, structural)
  • Different interaction types (binding, phosphorylation, transcriptional regulation)
  • Edge weights representing confidence or experimental evidence
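
One way to handle multiple interaction types, in the spirit of relational GCNs, is to learn a separate weight matrix per edge type. A minimal sketch (the class name and shapes are illustrative):

```python
import torch
import torch.nn as nn

class RelationalLayer(nn.Module):
    """R-GCN-style layer: one linear transform per interaction type (sketch)."""
    def __init__(self, in_dim, out_dim, num_relations):
        super().__init__()
        self.weights = nn.ModuleList(
            [nn.Linear(in_dim, out_dim, bias=False) for _ in range(num_relations)]
        )
        self.self_loop = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h, adjs):
        # adjs: list of (N, N) adjacency matrices, one per interaction type
        out = self.self_loop(h)
        for adj, lin in zip(adjs, self.weights):
            deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
            out = out + (adj @ lin(h)) / deg  # mean-aggregated per relation
        return torch.relu(out)
```

Edge confidence weights fit naturally into this scheme by replacing the binary adjacency with a weighted one.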

Training Objectives

PPI prediction models are typically trained with one of several objectives:

  1. Link Prediction: Binary classification of whether an edge exists between two proteins: $$ \mathcal{L}_{\text{link}} = -\sum_{(u,v) \in \mathcal{E}^+} \log \sigma(\hat{y}_{uv}) - \sum_{(u,v) \in \mathcal{E}^-} \log(1 - \sigma(\hat{y}_{uv})) $$

  2. Node Classification: Predict protein function based on network position $$ \mathcal{L}_{\text{node}} = -\sum_{v \in \mathcal{V}} \sum_{c} y_{vc} \log(\hat{y}_{vc}) $$

  3. Link Ranking: Predict interaction confidence scores with a margin loss $$ \mathcal{L}_{\text{rank}} = \max(0, \gamma - s_{\text{pos}} + s_{\text{neg}}) $$ where $\gamma$ is a margin hyperparameter and $s_{\text{pos}}$, $s_{\text{neg}}$ are scores for positive and negative pairs.
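
As an example, the link-prediction objective can be sketched with dot-product edge scores and binary cross-entropy (the function name and the dot-product scoring choice are illustrative; many models use an MLP over concatenated embeddings instead):

```python
import torch
import torch.nn.functional as F

def link_prediction_loss(z, pos_edges, neg_edges):
    """BCE link-prediction loss from node embeddings z of shape (N, d).
    pos_edges / neg_edges: (E, 2) tensors of node-index pairs (sketch)."""
    def scores(edges):
        # Dot product between the two endpoint embeddings of each edge
        return (z[edges[:, 0]] * z[edges[:, 1]]).sum(dim=1)

    pos = F.binary_cross_entropy_with_logits(
        scores(pos_edges), torch.ones(len(pos_edges)))
    neg = F.binary_cross_entropy_with_logits(
        scores(neg_edges), torch.zeros(len(neg_edges)))
    return pos + neg
```

Negative edges are usually sampled uniformly from non-interacting pairs, since true non-interactions are rarely annotated.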

Datasets and Evaluation

Common benchmark datasets for PPI prediction include:

  • STRING: Database of known and predicted protein interactions
  • BioGRID: General repository of genetic and protein interactions
  • HuRI: Human Reference Interactome
  • SIREN: Synthetic interactome for method development

Evaluation metrics typically include:

  • Area Under the ROC Curve (AUC-ROC)
  • Area Under the Precision-Recall Curve (AUC-PR)
  • Accuracy, Precision, Recall, F1
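
Assuming scikit-learn is available, these metrics can be computed directly from predicted interaction scores (toy labels and scores below are purely illustrative):

```python
from sklearn.metrics import roc_auc_score, average_precision_score

y_true = [1, 1, 0, 0, 1]             # toy labels: 1 = interaction, 0 = none
y_score = [0.9, 0.7, 0.4, 0.2, 0.6]  # toy predicted interaction scores

print(roc_auc_score(y_true, y_score))           # AUC-ROC -> 1.0 (perfect ranking)
print(average_precision_score(y_true, y_score))  # AUC-PR  -> 1.0
```

AUC-PR is generally more informative than AUC-ROC for PPI prediction, where true interactions are a small minority of all possible pairs.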

Applications in Computational Biology

1. Interactome Mapping

GNNs can predict novel interactions by analyzing patterns in known interaction networks, filling gaps in experimental maps.

2. Disease Module Detection

Proteins associated with the same disease often cluster in the interactome. GNNs can identify disease modules by detecting dense subnetworks.

3. Drug Target Identification

For a given disease, GNNs can identify proteins whose perturbation is likely to be therapeutic based on their network position and connectivity.

4. Protein Complex Prediction

GNNs can predict which proteins form stable complexes by identifying densely connected cliques or clusters in the interactome.

5. Antibody-Antigen Interaction Prediction

In therapeutic antibody development, predicting whether an antibody will bind a target antigen is crucial. GNN-based approaches can model both the antibody's complementarity-determining regions (CDRs) and the antigen epitope as graphs and predict binding affinity.

Challenges and Future Directions

  1. Data Quality: Network data contains noise, missing interactions, and false positives. Robust training strategies and confidence weighting are needed.

  2. Temporal Dynamics: PPIs are dynamic, changing with cellular conditions. Static network models cannot capture this dynamism.

  3. Cross-Species Generalization: Models trained on one species may not transfer to others due to evolutionary differences.

  4. Scalability: Whole-proteome networks are large; efficient GNN architectures and sampling strategies are needed.

  5. Integration: Combining sequence, structure, function, and network data in unified frameworks remains challenging.

Future Directions

  • Self-supervised Pretraining: Pre-training GNNs on large unlabeled networks using objectives like contrastive learning or graph autoencoders
  • Geometric Deep Learning: Incorporating 3D structural information more directly using equivariant message passing
  • Foundation Models: Large-scale pretrained models that can be fine-tuned for specific PPI tasks
  • Dynamic Graphs: Time-evolving GNNs that capture interaction changes across conditions

Conclusion

Graph neural networks provide a powerful framework for learning from protein-protein interaction networks. By leveraging both node features and topological information, GNNs can predict novel interactions, identify functional modules, and uncover relationships that are not apparent from sequence or structure alone. As experimental interactome data accumulates and GNN architectures mature, these methods will become increasingly valuable for understanding cellular biology and accelerating therapeutic protein development.

Brook Tilahun
Computational Biology Scientist

Applying machine learning and AI to accelerate therapeutic antibody discovery and protein engineering.