In our recent Future of Drug Discovery podcast episodes with Rahmad Akbar and Tommy Tang (two prominent scientists in computational drug design), both guests raised the problems stemming from data silos.
A data silo is a metaphor describing data that is not shared publicly. Locked-away data benefits only those who imprison it. Consequently, researchers often unknowingly regenerate data that already exists elsewhere, causing great inefficiency in our collective scientific research efforts.
Data is particularly important for computational drug designers, who rely on it to train effective artificial intelligence models. Data silos are just the tip of the iceberg; there are plenty of interesting discussion points surrounding data in drug discovery. Therefore, we will use this newsletter to explore the importance of data in artificial intelligence (AI)-driven drug discovery and discuss the current challenges.
Before we begin, we'll clarify in a few sentences what we mean by AI and machine learning (ML).
Artificial intelligence most often refers to a computer performing human-like tasks. 'Human-like' tasks can take numerous forms, so the term is used very broadly.
Machine learning is one method for creating artificial intelligence. Computers are taught how to 'learn' and make decisions based on previous experiences, taking inspiration from how humans learn and make decisions.
Both are very complicated topics with multiple sublayers; these umbrella terms summarise them for swifter communication, much as physicists use 'black holes' or 'quantum mechanics' to describe very intricate phenomena.
Why is data important?
The performance of a machine learning model depends on the data used to train it. You can think of machine learning models as students and training data as textbooks. Studying textbooks helps students perform better in the exam, whose answers represent the model's outputs. The more good textbooks a student has to study from, the better their answers; likewise, the more training data a model has, the better its decisions. Therefore, machine learning engineers put immense effort into data curation and selection, taking on the role of tutors.
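The 'more textbooks' intuition can be shown with a deliberately tiny toy in pure Python (not any model used in real drug discovery): a centroid classifier places its class centres more reliably when trained on more examples, so its held-out accuracy tends to improve. The class names and numbers are invented for illustration.

```python
import random

random.seed(0)

def train_centroid_classifier(samples):
    """Compute the mean feature value per class (a minimal 'model')."""
    sums, counts = {}, {}
    for x, label in samples:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(centroids, x):
    # Assign x to the class whose centre is nearest.
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def make_samples(n):
    # Hypothetical data: "inactive" compounds cluster near 0, "active" near 1.
    data = []
    for _ in range(n):
        data.append((random.gauss(0.0, 0.6), "inactive"))
        data.append((random.gauss(1.0, 0.6), "active"))
    return data

test_set = make_samples(500)

# More training data -> centroids closer to the true centres -> better accuracy.
for n_train in (2, 20, 200):
    model = train_centroid_classifier(make_samples(n_train))
    correct = sum(predict(model, x) == y for x, y in test_set)
    print(n_train, round(correct / len(test_set), 3))
```

The same principle holds for real models, only with far more parameters and far more (and more diverse) data.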
Machine learning models can be built with various structures, which shape their performance. Just as the neurons in our brains encode different pathways to memorise data, influencing how 'smart' we are, machine learning models can be structured with different algorithms. This is a complex subject, unfortunately beyond the scope of this newsletter. Moving forward, we can assume our students can all get 100% on the exam (perhaps with the aid of coffee).
Data is difficult to access.
Creating training data costs a lot of money, especially in drug discovery, where generating biological data relies on expensive and time-consuming laboratory work. Because biology is so intricate, machine learning models need large volumes of highly diverse data to capture all the parameters required to design drugs effectively.
Because data costs money to generate, a culture has emerged in which data is rarely shared. Researchers who want to do computational drug design must generate data themselves, even though, as we mentioned at the beginning, the data likely already exists elsewhere. This is wasteful and dramatically slows our ability to bring therapies to patients.
Public data can be unreliable.
When data is accessible, it can be riddled with biases and errors. In our textbook analogy, we can imagine the chaos if parts of a textbook were incorrect. Naturally, the students who studied the textbook would provide the wrong answers in the exam.
Many forces inadvertently introduce bias into scientific data. In academia, positive and 'groundbreaking' results attract the most grants, encouraging researchers to publish successes and conceal failures to sustain their funding, sometimes repeating experiments until the hypothesis appears proven. Replication studies are the corrective force essential to removing bias and validating pre-existing research. Unfortunately, their lack of novelty leaves them underfunded, discouraging researchers from performing them. Because of these uncertainties, machine learning engineers must be selective with the public data they use to train algorithms, which often feels like playing Minesweeper without the hints. Public datasets can be appealing, particularly because of their immense size, but unvetted data poses a risk to the quality of the resulting ML model. One must also consider that quality assurance takes considerable time from engineers' already busy hands.
Some drug targets have no data.
For some targets, data is difficult to generate. Let's use G-protein coupled receptors (GPCRs) as a case study. Determining the crystal structure of GPCRs is challenging because they are membrane-bound receptors: tearing them from their hydrophobic membrane environment causes them to lose their native shape. However, drug designers generally need native structures to design drugs that will bind in the body. Fortunately, new technologies such as cryo-electron microscopy have helped the number of determined GPCR structures grow yearly, a trend we hope to see continue.
Currently, only three FDA-approved antibodies targeting GPCRs are on the market, despite GPCR-targeting drugs making up over a third of all FDA-approved drugs. Human GPCRs are numerous and diverse, so these antibodies capture only a minute fraction of the feasible GPCR-antibody interactions we could curate. Such sparse data is barely representative, making computational antibody discovery for GPCRs extremely difficult.
Prospects are slightly brighter for small molecules, but they cannot supply the full arsenal against GPCRs. Their small size makes them more likely than antibodies to exhibit off-target binding, leading to undesirable side effects and toxicity. Biologics remain a largely underexplored tool for many challenging targets; there are large gaps to bridge, but the result could transform millions of patients' lives, making them worth the investment. In addition, there are plenty of small-molecule databases, and the number of known GPCR structures continues to grow, so prospects are bright for both formats.
How do we solve the 'data problem'?
At Antiverse, we are addressing these issues by generating as much high-quality data as possible. Data generated from our ongoing internal and partnership programmes is fed back to our machine learning models. This is like giving students their exam results: by understanding where they went wrong, they can correct their mistakes and try again, gradually improving their grade each time.
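That feedback loop can be sketched in miniature. In the toy below, a one-parameter model repeatedly compares its predictions against 'lab measurements' and corrects itself a little each round, so its error shrinks over time. This is a deliberately simplified illustration of learning from feedback, not Antiverse's actual training procedure.

```python
# Toy feedback loop: learn a weight w so that prediction w * x
# matches the "lab measurement" y. The true relationship (w = 2)
# is invented for this example.
target_w = 2.0
data = [(x, target_w * x) for x in range(1, 6)]

w, learning_rate = 0.0, 0.01
for _ in range(100):                 # each pass = a new round of feedback
    for x, y in data:
        error = w * x - y            # where the model "went wrong"
        w -= learning_rate * error * x   # small correction toward the truth

print(round(w, 2))  # → 2.0
```

Real models correct millions of parameters rather than one, but the principle is the same: fresh experimental results tell the model where it went wrong.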
Because public datasets have unpredictable quality, our machine learning engineers scrutinise them before implementation. Guaranteeing data quality is challenging, so we have developed a suite of methods and models to assess it before incorporation.
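As a minimal sketch of the idea (the sequences, values, and thresholds below are made up, and this is not our actual assessment suite), a first-pass screen over a public dataset might drop duplicate deposits and implausible measurements before any training happens:

```python
def clean_records(records, valid_range=(0.0, 1.0)):
    """Drop exact duplicates and out-of-range values before training."""
    seen, cleaned = set(), []
    lo, hi = valid_range
    for sequence, affinity in records:
        if (sequence, affinity) in seen:
            continue                  # duplicate entry, likely re-deposited
        if not (lo <= affinity <= hi):
            continue                  # implausible value, likely an entry error
        seen.add((sequence, affinity))
        cleaned.append((sequence, affinity))
    return cleaned

# Hypothetical antibody-sequence records with normalised affinity scores.
raw = [
    ("EVQLVESGG", 0.82),
    ("EVQLVESGG", 0.82),   # duplicate deposit
    ("QVQLQQSGA", -3.0),   # impossible normalised affinity
    ("DIQMTQSPS", 0.41),
]
print(clean_records(raw))  # two records survive
```

Real quality assessment goes far beyond this, but even simple screens like these catch a surprising share of problems in public data.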
However, many of the most pressing data issues we have discussed require broader cultural changes: ending publish-or-perish incentives and dissolving data silos. These issues will take time to tackle, and we can only change them with collective force. Directing resources towards replication studies is a good start. Data silos are a more complex matter, rooted deeply in capitalism and intellectual property rights. Decentralised science would support an ideal future for drug discovery.
When is data not an issue?
As we have discussed, data is important, but we don't always need it. Traditional lab-based methods have matured immensely over the last two decades, delivering many new and life-changing therapies. The most famous example is Humira, an antibody prescribed for multiple diseases. As one of the first biologics on the market, it became one of the best-selling drugs of all time.
However, many GPCR targets have been challenging for traditional methods. Currently, around 220 GPCRs with known disease links remain undrugged. There is huge potential to create new therapies, but we need new methods to target them. Fortunately, artificial intelligence has let a sliver of light creep in through the door, providing a new opportunity to begin drugging disease-linked targets that currently lack treatments.
Although data is expensive, the scarcity of promising alternatives for challenging targets makes artificial intelligence worth the investment, especially given the potential impact on patients. Once the models have been established, the return on investment can be substantial, a vision we hope to realise for GPCRs and other challenging targets. In cases where traditional discovery approaches would work, taking a computational approach is like bringing a tank to a snowball fight; those resources could instead start new discovery programmes, especially for smaller markets such as rare diseases.
Despite the issues with data, AI in drug discovery is producing many new and promising therapies. Roughly 200 companies using AI have pushed 15 drugs into clinical trials, with the AI-derived drug pipeline growing at an annual rate of 40%, according to the Boston Consulting Group. These therapies are being created at a fraction of the time and cost seen in the past, a stellar achievement in the face of data uncertainties.
Achieve your biologics discovery goals for challenging drug targets. Contact us today.
Future of Drug Discovery Podcast
Check out our chat with Rahmad Akbar, Senior Data Scientist in Antibody Design at Novo Nordisk. Rahmad is among the most passionate and inspiring people we have spoken with on the show. His enthusiasm for collaboration and education is a defining feature of the conversation, making this episode informative and inspiring for anyone working in drug discovery.
Available on YouTube and Spotify.
In this episode, you'll learn:
The impact antibody design has on patients.
How AI is helping to push antibody design forward, bringing new therapies to patients.
What antibody design might look like in the future.
This Month In Antibody Discovery
𝗦𝗲𝗽 𝟮𝟯𝗿𝗱: Context Therapeutics Buys BioAtla’s T-Cell Engaging Antibody in Deal Worth Up to $133.5M
https://hubs.la/Q02VZ4zh0
𝗦𝗲𝗽 𝟮𝟳𝘁𝗵: Novartis Partners with Generate:Biomedicines in $1B+ AI-Driven Drug Discovery Deal
https://hubs.la/Q02VZ34J0
𝗢𝗰𝘁 𝟭𝘀𝘁: Lonza Acquires Vacaville Biologics Site from Roche for $1.2 Billion
https://hubs.la/Q02VZ4HQ0
𝗢𝗰𝘁 𝟮𝗻𝗱: Triveni Bio Secures $115M Series B Funding to Advance Preclinical Immunology Antibody Pipeline
https://hubs.la/Q02VY-FK0
𝗢𝗰𝘁 𝟴𝘁𝗵: Merck Signs $1.9B Deal with Mestag to Develop Antibody Therapies Targeting Fibroblasts for Inflammation
https://hubs.la/Q02VZ4yp0
𝗢𝗰𝘁 𝟵𝘁𝗵: ModeX Secures $35M from BARDA to Advance Multispecific COVID Antibodies
https://hubs.la/Q02VZ4GT0
𝗢𝗰𝘁 𝟭𝟬𝘁𝗵: Ono Pharmaceutical Enters $700M Deal with LigaChem for Preclinical ADC Development
https://hubs.la/Q02VZ30_0
𝗢𝗰𝘁 𝟭𝟭𝘁𝗵: XtalPi’s Ailux Biologics Partners with Janssen Biotech for AI-Powered Biologics Discovery
https://hubs.la/Q02VZ2Yq0
𝗢𝗰𝘁 𝟮𝟮𝗻𝗱: Roche Walks Away from $120M Bet on UCB’s Anti-Tau Alzheimer’s Antibody Bepranemab
https://hubs.la/Q02VZ0lg0
𝗢𝗰𝘁 𝟮𝟮𝗻𝗱: Samsung Biologics Secures Record $1.2B CDMO Contract with Asia-Based Pharma
https://hubs.la/Q02VZ5lV0
𝗢𝗰𝘁 𝟮𝟰𝘁𝗵: AI-Driven Protein Design Startup Fable Therapeutics Raises $53.5M for Next-Gen Obesity Drugs
https://hubs.la/Q02VZ2TH0
About Antiverse
Antiverse is an artificial intelligence-driven techbio company that specialises in antibody design against challenging targets, including G-protein coupled receptors (GPCRs) and ion channels. Headquartered in Cardiff, UK, with offices in Boston, MA, Antiverse combines state-of-the-art machine learning techniques and advanced cell line engineering to develop de novo antibody therapeutics. With a main focus on establishing long-term partnerships, Antiverse has collaborated with three top global pharmaceutical companies. In addition, the company is developing a strong internal pipeline of antibodies against several challenging drug targets across various indications. For more information, please visit:
https://www.antiverse.io