How AI Understands Your DNA — with Prof Ewan Birney (AIBIO-UK Mini-Series)

Charlie Harrison
Aug 28

Updated: Sep 17

This episode, How AI Understands Your DNA, features an interview with Professor Ewan Birney. Prof Birney is Interim Executive Director of the European Molecular Biology Laboratory (EMBL) and Non-Executive Director of Genomics England, and played a pivotal role in the Human Genome Project. Read on for selected highlights from the new episode, along with some extra background. Catch up on the first episode by listening to or reading about it.

Episode highlights

DNA is like a massive book, written in a foreign language. It also has lots of obscure passages that we don’t really understand. To use Leo Tolstoy’s War and Peace as an analogy: do we really need all that stuff about farming in 19th-century Russia to explore the themes of love, war, and the human condition?
Before the Human Genome Project was completed, Prof Birney ran a sweepstake called GeneSweep, taking guesses on how many protein-coding genes would be found. Most guesses ranged from 40k to 100k, which seemed reasonable given that a nematode worm (the first multicellular organism to have its genome fully sequenced, in 1998) has around 20k. It turns out that the number in humans is also approximately 20k, far fewer than most thought. However, the exact number of genes is difficult to define because, like everything in biology, transcription is complicated! For example, cells perform alternative splicing of transcripts and post-translational modifications, genomes contain overlapping reading frames and bidirectional genes, and a large number of transcribed regions are functional but don’t code for proteins.
Protein-coding regions make up only around 1.5% of the human genome. The human genome has about 3 billion base pairs. Compare this to the nematode worm, which has about 100 million base pairs for the same number of genes! Much of the rest consists of transposons and retrotransposons – regions of the genome that copy and insert themselves into different positions within a genome, either directly or via an RNA intermediate. These act like genetic parasites, bloating the genome and imposing a metabolic burden on the cell, which has mechanisms to try to repress them; but they also play an important role in evolution.
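The comparison above can be made concrete with a little arithmetic. The following is a minimal sketch that uses only the round figures quoted in this post (they are approximations, not measurements), and the variable and function names are purely illustrative.

```python
# Round figures quoted in the text above; all are approximations.
HUMAN_GENOME_BP = 3_000_000_000   # ~3 billion base pairs
WORM_GENOME_BP = 100_000_000      # ~100 million base pairs
PROTEIN_CODING_GENES = 20_000     # roughly the same in both species
HUMAN_CODING_FRACTION = 0.015     # ~1.5% of the human genome


def bp_per_gene(genome_bp: int, genes: int) -> float:
    """Average stretch of genome per protein-coding gene."""
    return genome_bp / genes


# Bases that actually code for protein in the human genome.
human_coding_bp = HUMAN_GENOME_BP * HUMAN_CODING_FRACTION

human_bp_per_gene = bp_per_gene(HUMAN_GENOME_BP, PROTEIN_CODING_GENES)
worm_bp_per_gene = bp_per_gene(WORM_GENOME_BP, PROTEIN_CODING_GENES)

print(f"Human coding sequence: ~{human_coding_bp / 1e6:.0f} million bp")
print(f"Human genome per gene: ~{human_bp_per_gene:,.0f} bp")
print(f"Worm genome per gene:  ~{worm_bp_per_gene:,.0f} bp")
print(f"Ratio: ~{human_bp_per_gene / worm_bp_per_gene:.0f}x")
```

On these round numbers, the human genome spends about thirty times as much sequence per protein-coding gene as the worm does.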
Protein-coding regions turn out to be less central than expected. The non-coding regions of the genome are far from inert; in fact, they’re very active. Many regions that are completely dissociated from protein-coding genes are involved in protein binding or RNA synthesis. Structural changes to genome packing trigger changes in gene regulation that can lead to dramatic outcomes, including cancer.
We used to think that the genome behaved like a mould, or a blueprint – giving complete and explicit instructions to make a perfect replica. Now we know that it’s more like a script – there’s lots of room for interpretation, and context is important.
The Human Genome Project was a landmark achievement in the open data movement, thanks to the principles enshrined in the Bermuda Accord, which stated that human genome data would be released into the public domain within 24 hours of its production and would not be held under licence by private companies. The contemporary landscape is more varied. Large private companies play important collaborative roles in major bioinformatics projects, but they also have their own agendas and can’t be forced to share everything. In AI, “open” can mean different things – pseudocode, an open-source repository, or a fully trained model with weights.
The cost of sequencing a human genome has fallen from around $3 billion for the first one, completed in 2003, to around $200 in 2022 – roughly a fifteen-million-fold reduction. Large modern facilities can sequence 20 thousand human genomes per year, and the UK Biobank holds more than 500k complete genomes along with health data. This is made possible by improvements in technology, but also by the existence of the reference genome, which makes it simpler to assemble sequence data into a coherent structure.
A single human genome sequence stored in plain text or FASTA format would take up around 3 GB. It can be stored in a more compact TwoBit format, which represents each nucleotide base with two bits (T as 00, C as 01, A as 10, and G as 11), taking up just 800 MB. However, the full human reference genome contains much more information than just the basic sequence: there is structural information about how the sequence was assembled, which chromosome each element belongs to, and annotations about genes, RNA coding sequences, and the surrounding regions. There are also versions of the genome with highly repetitive sequences hidden, to avoid biasing search algorithms (this is called “masking”), and versions in multiple file formats. Many of these files can be compressed, which is often highly effective because the sequences contain many repeating regions. Taken all together, a full download of the reference genome occupies about 18 GB on disk.
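To illustrate the two-bits-per-base idea, here is a minimal sketch that packs four bases into each byte using the mapping given above. It is not a reader or writer for real TwoBit files, which also carry headers and extra blocks for unknown (N) and masked regions. The function names `pack` and `unpack` are illustrative, and the sketch assumes the input contains only uppercase T, C, A, and G.

```python
# Two-bit codes as described in the text: T=00, C=01, A=10, G=11.
CODES = {"T": 0b00, "C": 0b01, "A": 0b10, "G": 0b11}
BASES = "TCAG"  # index in this string == two-bit code


def pack(seq: str) -> bytes:
    """Pack a string of T/C/A/G into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk, padding the low bits with zeros.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])


packed = pack("GATTACA")
print(len(packed), "bytes for 7 bases")
print(unpack(packed, 7))

# Back-of-the-envelope size for ~3 billion bases at 2 bits each.
raw_bytes = 3_000_000_000 * 2 // 8
print(f"~{raw_bytes / 1e6:.0f} MB before any header or masking data")
```

Because the padding in the last byte is indistinguishable from real bases, the sequence length has to be stored separately, which is why `unpack` takes it as an argument.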
The breakthrough in genomic sequencing is multifaceted. Improvements in algorithms, a drop in sequencing cost leading to greater availability of data, and an increase in computing power all played crucial roles.
Modern AI research is very experimental. There is a lot of trial and error: researchers try to find a model that works well, then later try to explain why it’s so effective. This is a paradigm shift compared with earlier techniques like Bayesian modelling. Deep learning models can be very counterintuitive, but they can be interrogated.
Prof Birney cites three concerns about AI:
- Hype. To many people, “AI” now means “chatbots”, and these are both impressive and disappointing. There is a worry that disillusionment could overshadow some of the amazing achievements, like AlphaFold, advances in image analysis, and genome decoding.
- Reliability. In medicine especially, reliability is crucial. How do we avoid weird behaviours, like the hallucinations that ChatGPT and other LLMs sometimes display? One way is to put a human in the loop, but this has its own problems, like preventing boredom – especially when the AI is right most of the time. In airport security systems, computer-generated images of banned items are occasionally projected onto luggage scans to make sure staff remain alert, so maybe a similar method could work in medicine.
- Bioweapons. There is the potential for AI to invent new toxins or viruses that pose a real danger to humanity. This is not completely new, as lots of dangerous chemicals and organisms exist in nature, but AI changes the shape of the problem.
