Solving the Big Data problem in pharma innovation
I originally published this article at PharmaPhorum.
The effectiveness of Artificial Intelligence applications can be undermined by the volumes of unstructured data prevalent in the pharma industry. What can be done to overcome this issue?
We live in an exciting time for the pharmaceutical industry. Cutting-edge technologies like artificial intelligence and blockchain are making headlines and revolutionizing everything from drug discovery to clinical trials. Many of these innovations are built upon the same foundation: Big Data. But a longstanding challenge within Big Data must be overcome in order for technologies like AI to achieve their full potential. That challenge is unstructured data.
Unstructured data and pharmaceutical Artificial Intelligence
The need to overcome this challenge can be illustrated by examining the consequences of unstructured data for the effectiveness of Artificial Intelligence applications within the pharmaceutical and life science industries.
As I’ve written in the past, the history of Artificial Intelligence can be seen through the lens of three distinct waves. The first wave brought ‘knowledge engineering’ software that enabled efficient solutions to practical challenges. The second wave brought machine learning programs that enabled automated pattern recognition and advanced statistical analysis. We’ve now entered the third wave of AI, which has the power to generate novel hypotheses by analyzing massive sets of data.
Third-wave AI has the potential to significantly accelerate the research and development process for new drugs, as companies like Merck & Co and Sanofi have begun to discover. Applications of third-wave Artificial Intelligence programs have powered medical discoveries such as the connection between fish oil and Raynaud’s disease.
But third-wave AI applications have also suffered a series of failures in healthcare and pharmaceutical contexts. MD Anderson’s problems with IBM Watson serve as a notable example. In that instance, the problems all started when MD Anderson changed its electronic medical record (EMR) provider, preventing Watson from accessing the data that it needed. This example illustrates the challenge posed by unstructured data and the corresponding need for greater data integrity within life science industries.
Data integrity in life sciences
Many of today’s Artificial Intelligence programs depend on good, clean data in order to operate effectively. If access to such data is compromised, the Artificial Intelligence program’s ability to conduct analysis and generate hypotheses is undermined.
Data sets within the pharmaceutical and life science industries pose a particular challenge for Artificial Intelligence programs because of the unusual density, depth, and diversity of biological data. Because the complexity of biological data renders it incomprehensible to many Artificial Intelligence programs, the majority of pharmaceutical research today is carried out manually. Human researchers curate data, generate hypotheses, and perform experiments in much the same way that they have for decades. Lacking automation, the drug discovery, development, and testing process is inefficient, expensive, and often inaccurate.
The inefficiency of this process causes prolonged delays between the completion of an experiment and the publication of its results in scientific journals or databases. These delays have contributed to a significant problem with publication bias and inaccuracy in the industry. Even the open-science movement, which is attempting to increase access to not-yet-published clinical research results, depends on manually curated data sets that are usually created by companies with proprietary interests.
Even heavily curated data sets are often too inconsistent to be meaningfully analyzed by Artificial Intelligence. Take, for example, the challenge posed by abbreviations and acronyms within the pharmaceutical industry. The same abbreviation may carry different meanings depending on its context. ‘Ca’, for instance, could mean ‘cancer’ in one context and ‘calcium’ in another. Most Artificial Intelligence programs depend on accurate and nuanced contextual information, and manually curated data sets often fall short of this mark.
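To make the ‘Ca’ problem concrete, here is a minimal sketch in Python of how a program might pick a meaning for the abbreviation from the words around it. The cue-word lists and the function name disambiguate_ca are hypothetical, chosen only for illustration; a real system would learn such cues from annotated text rather than from a hand-written table.

```python
import re

# Hypothetical cue words for each sense of "Ca" (illustrative only;
# a production system would learn these from annotated clinical text).
CONTEXT_CUES = {
    "cancer": {"tumor", "carcinoma", "metastasis", "biopsy", "oncology", "stage"},
    "calcium": {"serum", "mg/dl", "channel", "supplement", "bone", "ion"},
}


def tokenize(text):
    """Lowercase the text and split it into simple word-like tokens."""
    return re.findall(r"[a-z0-9/]+", text.lower())


def disambiguate_ca(sentence):
    """Guess the sense of 'Ca' by counting cue words in the same sentence.

    Returns "cancer", "calcium", or "unknown" when no cue words appear
    or when both senses score equally.
    """
    tokens = set(tokenize(sentence))
    scores = {sense: len(tokens & cues) for sense, cues in CONTEXT_CUES.items()}
    best_score = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best_score]
    if best_score == 0 or len(winners) > 1:
        return "unknown"
    return winners[0]


if __name__ == "__main__":
    print(disambiguate_ca("Biopsy confirmed Ca of the breast, stage II"))
    print(disambiguate_ca("Serum Ca was 9.2 mg/dL"))
    print(disambiguate_ca("Ca noted in chart"))
```

The last call returns "unknown" because the sentence carries no cue words at all, which is the situation a sparsely annotated data set leaves a program in.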
Overcoming the unstructured data challenge
Fortunately, some of the world’s leading firms have begun to explore two possible ways to overcome these challenges. One approach is to simply improve the state of available data sets. The HITECH Act of 2009 modeled this approach by standardizing EMR systems to create richer, more comprehensive, and more up-to-date biological data sets. As a result, diverse data from biological patents, clinical trials, academic theses, and other sources can increasingly be analyzed by advanced Artificial Intelligence programs.
The second way to overcome the unstructured data challenge is simply to build better Artificial Intelligence. Recent innovations have brought ‘context normalization’ Artificial Intelligence technology that can process and analyze unstructured, heterogeneous data points using a combination of natural language processing, machine learning, and cutting-edge text analytics. The most advanced Artificial Intelligence programs can now use disparate, incongruous data to generate novel hypotheses without the need for costly human curation.
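The article does not spell out how ‘context normalization’ works internally, so the following Python sketch shows only the simplest ingredient such a system would plausibly need: mapping differently written mentions of the same thing onto one canonical term. The lookup table and the names normalize_key and normalize_term are assumptions made for illustration, not a description of any vendor’s product.

```python
import re
import unicodedata

# Hypothetical lookup table keyed on a cleaned-up form of each mention.
# Entries cover spelling and formatting variants only.
CANONICAL_TERMS = {
    "raynauds disease": "Raynaud's disease",
    "raynaud disease": "Raynaud's disease",
    "fish oil": "fish oil",
    "emr": "electronic medical record",
    "electronic medical record": "electronic medical record",
    "electronic medical records": "electronic medical record",
}


def normalize_key(text):
    """Reduce a raw mention to a lowercase, punctuation-free lookup key."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"['\u2019]", "", text)      # drop straight and curly apostrophes
    text = re.sub(r"[^a-z0-9]+", " ", text)    # other punctuation becomes a space
    return " ".join(text.split())              # collapse repeated whitespace


def normalize_term(raw_mention):
    """Return the canonical term for a mention, or None if it is unknown."""
    return CANONICAL_TERMS.get(normalize_key(raw_mention))


if __name__ == "__main__":
    for mention in ["Raynaud\u2019s  Disease", "FISH-OIL", "EMR", "aspirin"]:
        print(repr(mention), "->", normalize_term(mention))
```

A table like this only handles surface variation. Deciding what a term means in context, as with ‘Ca’, is the harder part that the natural language processing and machine learning components would have to supply.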
Innovations like these are allowing researchers to analyze data, generate hypotheses, and conduct conclusive clinical trials at unprecedented levels of speed and accuracy. This is good news for pharmaceutical companies, medical professionals, and consumers alike.