
AlphaFold: AI’s Biggest Breakthrough — with Dr Jennifer Fleming (AIBIO-UK Mini-Series)

  • Charlie Harrison
  • Oct 8
  • 8 min read

This episode features an interview with Dr Jennifer Fleming, coordinator of the Protein Data Bank in Europe (PDBe) and lead for the AlphaFold Protein Structure Database (AFDB). AlphaFold is a program that predicts the structures of proteins. If you follow computational biology at all, or even if you don’t, you might well have heard of it – it earned Google DeepMind co-founder Sir Demis Hassabis and senior researcher John Jumper a share of the 2024 Nobel Prize in Chemistry, one of the quickest turnarounds from discovery to prize in Nobel history! It represents such a major breakthrough that it’s been said to usher in a whole new era of biology.

 

Listen to the episode here, or read on for a breakdown of the topics covered, along with some extra background.








A Quick Bit of Background


Proteins and the protein folding problem

Proteins are the molecular machines that drive almost every process that happens in living cells. They are polymers, consisting of chains or sequences of amino acids, and proteins in all living things share a library of twenty different amino acids. A protein sequence can contain anything from a few dozen up to thousands or even tens of thousands of amino acids. These sequences are manufactured in cells, and they fold up into intricate three-dimensional structures with very specific arrangements of chemical and electrostatic properties. These structures can behave in astonishingly intricate ways – forming highly selective pores in cell membranes that only let certain molecules pass, functioning as motors that propel cells through fluids, or binding to multiple molecules and bringing them close together in just the right formation to allow them to react. The precise structure of a protein determines its function, and it must be optimised for the type of environment in which the protein operates.


Complex 3D shapes emerge from a string of amino acids (Google DeepMind, 2020).

 

Protein folding is the process by which a string of amino acids adopts a three-dimensional structure. Because the structure of the protein is so precise, folding is critical, and it can and does go wrong in living cells all the time, sometimes with grave consequences. Single mutations in a gene can change the sequence, and thus the structure, of a protein in ways that are difficult to predict. A misfolded protein is at best unable to perform its function, and at worst can be toxic to the cell or the organism. Some proteins can misfold in a way that templates the same misfolding in other copies of the same protein. The classic examples of such self-propagating forms are prions, and similar prion-like aggregation into plaques – usually in the brain and nervous system – is associated with neurodegenerative diseases with no known cure, including Alzheimer’s and Parkinson’s disease.

 

Understanding the process and the outcomes of protein folding is a central problem in biology, but determining the structure of a protein experimentally is very difficult – much harder than determining the sequence. Structure determination has traditionally relied on X-ray crystallography, in which a beam of X-rays is diffracted by protein crystals. By examining the diffraction patterns, it’s possible to determine the symmetries in the crystal lattice, and work back from there to calculate the structure of the individual proteins in the crystal. These calculations are complicated, but the hardest part is purifying and crystallising the protein in the first place. Forming the crystal lattices takes a lot of trial and error: varying solvents and precipitating agents, pH, temperature, and the presence of so-called chaperone molecules, and sometimes even removing portions of the protein sequence deemed unimportant to its function and likely to impede crystallisation (McPherson et al., 2014). In comparison, it’s relatively easy to determine the sequence of a protein, through mass spectrometry or Edman degradation. As an analogy, it’s easy to knock down a house and figure out what materials it was made of, but it’s very difficult to look at a pile of bricks, mortar, insulation, wiring and plumbing, windows and doorframes, and work out what the house is going to look like. The protein structure prediction problem is the key to the inner workings of biological systems.


CASP and DeepMind

The CASP competition – Critical Assessment of Structure Prediction – was founded in 1994 to formalise the process of structure prediction, rigorously compare the performance of different methods and algorithms, and accelerate research into this key bioscience problem. The CASP team works with research groups around the world to obtain newly determined protein structures before their publication. The corresponding sequences are given to competitors, and their predictions are assessed against the known structures using a specially developed metric called the Global Distance Test (GDT). GDT ranges from 0 to 100, and scores above 90 are considered to be about as accurate as experimentally determined structures, accounting for variation in the crystallography process. CASP competitions run every two years, and until the late 2010s the best predictions in competition scored GDTs of around 40.
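The GDT calculation itself is simple once the two structures are lined up. Here is a minimal sketch of the GDT_TS variant, assuming the predicted and reference structures are already superposed and residue-aligned (the official CASP evaluation additionally searches over superpositions to maximise the score, which this toy omits):

```python
# Minimal sketch of GDT_TS: the average fraction of residues whose
# predicted position falls within 1, 2, 4, and 8 angstroms of the
# reference position. Assumes pre-superposed, residue-aligned inputs.
import numpy as np

def gdt_ts(pred: np.ndarray, ref: np.ndarray) -> float:
    """GDT_TS (0-100) for two (N, 3) arrays of C-alpha coordinates."""
    dists = np.linalg.norm(pred - ref, axis=1)        # per-residue error in angstroms
    cutoffs = (1.0, 2.0, 4.0, 8.0)                    # the four GDT_TS thresholds
    fractions = [np.mean(dists <= c) for c in cutoffs]
    return 100.0 * float(np.mean(fractions))

ref = np.zeros((4, 3))
print(gdt_ts(ref, ref))  # a perfect prediction scores 100.0
```

A uniform 3 Å error, for example, passes only the 4 Å and 8 Å cutoffs and scores 50 – which gives a feel for why pre-AlphaFold scores of around 40 were so far from experimental accuracy.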

 

Enter DeepMind, under the leadership of former chess prodigy Demis Hassabis. Founded in 2010 and acquired by Google in 2014, DeepMind scored its first high-profile successes building AI models to play board games. Chess engines had dominated human players since the late 90s, but in 2016 DeepMind’s AlphaGo became the first computer program to defeat a world champion, Lee Sedol, in the even more complex game of Go. DeepMind then developed a more general system, AlphaZero, which learned to play multiple games using pure reinforcement learning and defeated the reigning champion AI engines in chess, shogi, and Go. DeepMind applied this expertise to the protein folding problem, and in 2018 it entered AlphaFold into CASP13, winning easily and achieving new benchmark GDT scores of around 60. At CASP14 in 2020, AlphaFold2 built on this success with a median GDT of around 92, prompting the organisers of CASP to declare that:


“The problem has been largely solved for single proteins.”


How AlphaFold Works

The first edition of AlphaFold applied a fairly generic deep learning architecture to the protein folding problem. AlphaFold2 was redesigned from end to end, purpose-built for the task. There were two keys to the breakthrough. One was hardware, with tensor processing units (TPUs) providing enormous parallel computing power. The other was a novel neural network architecture called the Evoformer, which leverages the evolutionary information encoded in related proteins from different species by building a multiple sequence alignment (MSA) based on the input protein sequence.


High-level overview of how AlphaFold2 predicts a protein’s structure from its amino acid sequence (EMBL-EBI, 2025).

 

An MSA is a matrix-like structure containing a set of similar sequences, one per row. The similar sequences are identified by searching a database, and gaps are inserted in the rows where necessary to match up the properties of the amino acids in each column as closely as possible. Related proteins from different organisms, sharing similar sequences, also likely share an evolutionary history, and similar structures and functions too. These related proteins can be viewed as viable examples of variation that preserve the key elements of a structure, allowing it to fold correctly and perform its function. An MSA shows sets of positions in the sequence that tend to vary together – that is, a change in one position is likely matched with a corresponding change in another position. This covariance implies that the amino acids at those positions in the sequence are likely to be closely located in the folded protein. AlphaFold2’s Evoformer network uses an MSA to build a representation of the relative positions of every pair of amino acids in the input sequence. It then iteratively updates both the MSA and the pair representation, with information flowing in both directions, eventually arriving at a final prediction of the protein structure. Predictions are accompanied by confidence scores at the local and global levels.
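The covariance signal described above can be illustrated with a toy example. This is not AlphaFold’s actual algorithm – the tiny MSA and the function below are purely hypothetical – but mutual information is one classical way to quantify how strongly two MSA columns vary together:

```python
# Toy illustration of co-variation between MSA columns.
# High mutual information between two columns suggests the corresponding
# residues change in a coordinated way, hinting at spatial contact.
import math
from collections import Counter

# A tiny hypothetical MSA: one aligned sequence per row.
msa = [
    "AKLDE",
    "AKLDE",
    "ARLEE",
    "ARLEE",
    "AKMDE",
]

def mutual_information(msa: list[str], i: int, j: int) -> float:
    """Mutual information (in bits) between columns i and j of an MSA."""
    n = len(msa)
    col_i = Counter(s[i] for s in msa)                # amino acid counts in column i
    col_j = Counter(s[j] for s in msa)                # amino acid counts in column j
    pairs = Counter((s[i], s[j]) for s in msa)        # joint counts
    mi = 0.0
    for (a, b), c in pairs.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi

# Columns 1 and 3 co-vary perfectly (K pairs with D, R with E),
# while column 0 is invariant and carries no co-variation signal.
print(round(mutual_information(msa, 1, 3), 3))  # 0.971
print(mutual_information(msa, 0, 1))            # 0.0
```

AlphaFold2’s pair representation captures a far richer version of this signal, learned end to end rather than computed column by column.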

 

Another important technical development was self-distillation. The Protein Data Bank (PDB) contained about 150,000 known structures at the time – a lot, but not enough to saturate the learning potential of an AI model. AlphaFold was trained on the known structures and then used to predict structures for sequences with no experimental data. Predictions with high confidence scores were fed back into the training set and used to further improve the accuracy of the model.
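The self-distillation loop can be sketched schematically. Everything below is an illustrative stand-in – the names, the toy “predictor”, and the single-round loop are hypothetical, and the real pipeline is far more involved – but it captures the shape of the idea:

```python
# Schematic, runnable toy of one round of self-distillation:
# predict structures for unlabelled sequences, then keep only
# high-confidence predictions as extra training examples.
from dataclasses import dataclass

@dataclass
class Prediction:
    sequence: str
    structure: str      # stand-in for predicted coordinates
    confidence: float   # stand-in for a pLDDT-style confidence score

def self_distil(train_set, unlabelled, predict, threshold=0.9):
    """Return the training set enlarged with confident predictions."""
    predictions = [predict(seq) for seq in unlabelled]
    confident = [(p.sequence, p.structure) for p in predictions
                 if p.confidence >= threshold]        # filter by confidence
    return train_set + confident                      # data for the next round

# Toy predictor: pretends to be confident only for short sequences.
toy_predict = lambda seq: Prediction(seq, "fold", 0.95 if len(seq) < 6 else 0.5)

augmented = self_distil([("AKLDE", "fold")], ["MKV", "MKVLAAGITT"], toy_predict)
print(len(augmented))  # 2 -> the original example plus one confident prediction
```

The confidence filter is what makes this safe: low-confidence predictions are discarded rather than allowed to pollute the training signal.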

 

AlphaFold3 extends AlphaFold2 with the ability to predict interactions between multiple molecules. This allows it to predict the structures of multi-chain protein complexes, as well as binding to ligands, DNA, and RNA.



The Impact of AlphaFold

AlphaFold is a computationally intensive model and requires significant resources to run. To help disseminate the benefits of AlphaFold’s predictions, Google DeepMind teamed up with EMBL-EBI to create AlphaFold DB. Scientists can now access predicted structures for over 200 million protein sequences – the majority of the sequences in UniProt. The predictions are free to use, and can be browsed, searched, and visualised online in the interactive portal or downloaded for further use.
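As a sketch of what programmatic access looks like, the snippet below builds a download URL for a predicted structure. The file-naming pattern and version number reflect the download convention the database publishes at the time of writing, but treat them as assumptions and check the AlphaFold DB documentation for the current values:

```python
# Hedged sketch: constructing a download URL for an AlphaFold DB
# predicted structure file. The AF-<accession>-F1-model_v<N> naming
# pattern and version number are assumptions based on the database's
# published download convention; verify against the current docs.
def afdb_model_url(uniprot_accession: str, version: int = 4,
                   fmt: str = "pdb") -> str:
    """Build the download URL for an AlphaFold DB predicted structure."""
    return (f"https://alphafold.ebi.ac.uk/files/"
            f"AF-{uniprot_accession}-F1-model_v{version}.{fmt}")

# Example: human haemoglobin subunit alpha (UniProt accession P69905).
url = afdb_model_url("P69905")
print(url)
# The file can then be fetched with any HTTP client, e.g.:
#   import urllib.request
#   urllib.request.urlretrieve(url, "P69905.pdb")
```

The same pattern serves mmCIF files by passing `fmt="cif"`, which matters because mmCIF is the richer of the two formats the PDB ecosystem uses.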

 

AlphaFold DB is managed by EMBL-EBI, which is also home to the Protein Data Bank in Europe (PDBe), containing experimentally determined structures, and the accompanying knowledge base (PDBe-KB), which provides rich annotations on how proteins function and interact with other molecules. This co-location allows for maximal knowledge exchange between the different, complementary databases. The PDB is over 50 years old, and its data is very high-quality, well curated, and well labelled – which was essential for the success of AlphaFold. AlphaFold DB uses the same file formats as the PDB, so the structural data is more or less interchangeable, though experimentally determined structures are accompanied by additional data such as electron densities or diffraction patterns.

 

By 2023, AlphaFold DB had over 2 million users in 190 countries. Open-sourcing the predictions has saved a huge amount of computational and research effort. All EBI data is released under CC-BY 4.0, which allows free use with attribution, to maximise access. AlphaFold3 caused some controversy because the code and the model weights were initially not open-sourced, though they were subsequently shared with the scientific community after protests. However, the multi-molecule interaction predictions from AlphaFold3 are released under a more restrictive licence than its predecessor’s, one which prohibits commercial use. In 2021, DeepMind’s parent company, Alphabet, established a spin-off called Isomorphic Labs with the aim of building on AlphaFold and applying it to the expensive enterprise of drug discovery.

 

Drug discovery and design is one of the areas of research where protein structure prediction shows the most potential, with the promise of speeding up development pipelines, reducing failure rates in clinical trials, and even personalising drugs to the needs of an individual. Thinking more broadly, if the goal of functional protein design is fully realised, the possibilities are almost endless – imagine enzymes that can break down plastics, produce biofuels, or make crops more resilient. This requires iterative pipelines in which candidate proteins are designed and their structures and functions are tested. Performing these steps in silico can make the infeasible feasible.

 

In the meantime, the AlphaFold DB team are working hard to make the resource as valuable as possible. There is a lot of potential for experimental data to improve the model, so they plan to add a feedback mechanism for continuous updates. They plan to add as many new features as possible, like linking to related resources to add context, and providing ways to compare structures, while making sure that every additional piece of information is high-quality and readily understandable. Most importantly, they want to involve as many people as possible. The real impacts of this work will continue to be felt for decades to come, and it’s the wider community of researchers that will realise them.



If you enjoyed reading, don’t forget to subscribe to our newsletter for more, share it with a friend or family member, and let us know your thoughts—whether it’s feedback, future topics, or guest ideas, we’d love to hear from you!


