AI Is All Hype If We Don’t Have Access to the Right Data
- AI could potentially speed drug discovery and save time in rejecting treatments that are unlikely to yield worthwhile results.
- AI has evolved to parse this heterogeneous data, with programs that combine machine learning, natural language processing, and advanced text analytics.
- AI has evolved from first-wave optimization programs or “knowledge engineering,” to second-wave statistical learning programs or “machine learning,” finally arriving at third-wave hypothesis generation programs or “contextual normalization.”
As the pioneering neuroscientist Vilayanur S. Ramachandran says, “The fact that hype exists doesn’t prove that something is not important.”
It is true that artificial intelligence is a hot topic. A lot of buzzwords and marketing dollars are currently going into “data science” and “machine learning.” However, throughout the short modern history of technological innovation, disruptive technologies have tended to undergo substantial revision as they evolve from what could happen to what does happen.
In this article, we will examine an underappreciated limitation of AI’s real-world utility: access to real-world datasets.
Artificial Intelligence in the Life Sciences
By definition, a disruptive technology is an innovation that generates a novel industry. In this context, the new AI programs constitute a significant future disruption. AI has evolved from first-wave optimization programs or “knowledge engineering,” to second-wave statistical learning programs or “machine learning,” finally arriving at third-wave hypothesis generation programs or “contextual normalization.” Now, third-wave AI programs have the potential to look at big data, find the statistical patterns in it, and then generate novel algorithms that explain why these patterns exist.
The potential of this technology to discover novel treatments has prompted pharmaceutical giants such as GlaxoSmithKline, Merck & Co, Johnson & Johnson, and Sanofi to invest in it as a potential competitive edge. Loose associations between findings have already led to novel treatments, as in the well-publicized, manually derived link between Raynaud’s disease and fish oil. Automating this process with the extended capacity of AI could potentially speed drug discovery and save time in rejecting treatments that are unlikely to yield worthwhile results.
Computers have certainly demonstrated superior performance, and even novel insights, in analyzing patterns in complex datasets. In 2011, IBM Watson defeated two human champions on Jeopardy!, a game that requires mastery of general knowledge and natural language processing. Pattern recognition programs are routinely used in ECG interpretation, although only as an adjunct, or “clinical decision support.” However, without context, and specifically vertical context, this analysis might have little transferability and even less real-world impact.
Indeed, AI has also had several well-publicized failures. MD Anderson’s recent problems with IBM Watson highlight a vital issue in the field, namely that of dataset integrity. According to a University of Texas audit, when MD Anderson changed its electronic medical record (EMR) provider, the Watson software couldn’t access the data, and its conclusions became out of date.
Perhaps looking at this data issue in more depth can guide us in defending AI for the future. It’s possible the baby is fine, and it’s the bathwater that’s the problem.
How Data-Access Limits AI
AI and machine learning on their own are not enough. While there has been a lot of progress on advanced algorithms, if there is not enough access to all the data that’s out there, the algorithms can’t genuinely do their job.
Biological data is deep, dense, and diverse. In the past, most life-sciences datasets were insufficient to represent biological data accurately. Biological research relied on manually curated datasets, collected and cleaned specifically to test a preconceived hypothesis. Curators offset the expense of generating these datasets through proprietary control over access or an incentive to market the results. Disseminating results through academic journals introduced significant delays, which made conclusions obsolete, and restricted access to profit-based portals that were industry- and discipline-specific.
The legacy system for creating and analyzing biological datasets has resulted in system-wide publication bias and inaccuracies in medical science.
Even the recent open-science movement — which attempts to democratize access to unpublished clinical research datasets and raw data from clinical trials — relies on narrow, manually curated datasets, often created by companies with proprietary interests.
While first-wave AI might be able to parse biased datasets, second-wave AI is heavily dependent on properly coded datasets for training. However, the real limitation is for third-wave AI software, which takes observations from seemingly unrelated contexts and normalizes them.
A classic example is abbreviations in medical terminology, where the same abbreviation can carry different meanings depending on context: does “Ca” mean “cancer” or “calcium”?
Third-wave AI needs complex contextual information in order to optimize its function, and manually curated datasets inherently reduce its utility.
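The “Ca” ambiguity can be illustrated with a minimal sketch. It is a toy keyword-overlap approach, not a description of how any production system works: the cue-word lists, the `SENSE_CUES` table, and the `disambiguate_ca` function are all hypothetical names and values chosen for illustration.

```python
import re

# Hypothetical cue words for each sense of "Ca"; illustrative only,
# not drawn from any real clinical vocabulary.
SENSE_CUES = {
    "cancer": {"tumor", "metastasis", "biopsy", "oncology", "carcinoma", "chemotherapy"},
    "calcium": {"serum", "mg/dl", "electrolyte", "hypocalcemia", "supplement", "bone"},
}


def disambiguate_ca(sentence):
    """Pick the sense of "Ca" whose cue words overlap most with the sentence.

    Returns None when the surrounding text offers no evidence or the
    senses are tied, i.e. when context is insufficient to decide.
    """
    # Lowercase and split into simple tokens; "/" is kept so "mg/dl" stays whole.
    tokens = set(re.findall(r"[a-z/]+", sentence.lower()))
    scores = {sense: len(cues & tokens) for sense, cues in SENSE_CUES.items()}
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    if best == 0 or len(winners) > 1:
        return None
    return winners[0]


print(disambiguate_ca("Biopsy confirmed Ca of the breast with metastasis."))
print(disambiguate_ca("Serum Ca was 7.9 mg/dL, consistent with hypocalcemia."))
print(disambiguate_ca("Ca noted in chart."))
```

In this sketch the third call returns no answer, because a note stripped of its surrounding text carries no cues. That is the situation a narrowly curated dataset tends to create.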
A Change In Data
With the 2009 HITECH Act, medicine began the legislated introduction of EMR systems. The result has been pooled datasets of real-time, comprehensive biological information. These come in addition to datasets from elsewhere in the innovation ecosystem: data from biological patents, clinical trials, congresses, theses, and more.
Previously, this unstructured data was inaccessible to computing systems without human collation, checkboxes, drop-down menus, or diagnosis codes. Now, AI has evolved to parse this heterogeneous data, with programs that combine machine learning, natural language processing, and advanced text analytics.
Previously, we had outdated, incomplete, and inaccessible structured data. Now, for the first time, we can structure previously unstructured data, enabling real-time analysis from a wider pool of loosely associated contexts.
With third-wave AI, we can get clean data, all in one place, that mirrors the complexity of true biological systems. Parsing this data gives us quick, crisp, and summarized snapshots of the current biomedical landscape in any given context.
About the author:
Gunjan Bhardwaj is the founder and CEO of Innoplexus, a leader in AI and analytics as a service for life science industries. With a background at Boston Consulting Group and Ernst & Young, he bridges the worlds of AI, consulting, and life science to drive innovation.