Training algorithms to identify and extract Life Sciences-specific data
The English dictionary is full of words and definitions that can be applied to various contexts. The second edition of the Oxford English Dictionary, published in 1989, recorded 228,132 words, including commonly used words, obsolete words, and derivatives. Words are used all over the world to discuss, convey, and explain millions of things, so word meanings are relative: the same word can come up in any number of topics. A word can be a noun as well as a verb. In fact, some nouns can be used in different contexts and mean something completely different. For example, the word ‘card’ refers to a playing card in a game of poker, while in the domain of computer hardware it may refer to a video card.
Humans understand these differences. But how will a computer differentiate between two documents that use the same word in completely separate contexts? How will it tell the words and phrases specific to a query apart from the relatively unimportant supporting text? This is a huge challenge when it comes to data extraction. Moreover, words can have separate meanings within the same domain. A classic example is abbreviations in medical terminology, where the same acronym is interpreted differently depending on context: “Ca”, for instance, could mean “cancer” or “calcium”.
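To make the “Ca” example concrete, here is a minimal sketch of context-based disambiguation. It assumes a small, hand-written set of cue words for each sense; the names SENSE_CUES and disambiguate_ca are hypothetical, and a real system would more likely learn such cues from annotated text or take them from an ontology.

```python
import re

# Hypothetical cue words for each sense of the abbreviation "Ca".
# These lists are illustrative assumptions, not a clinical vocabulary.
SENSE_CUES = {
    "cancer": {"tumor", "carcinoma", "metastasis", "biopsy", "oncology"},
    "calcium": {"serum", "mg/dl", "bone", "supplement", "vitamin"},
}


def disambiguate_ca(sentence):
    """Return the sense whose cue words overlap most with the sentence.

    Returns None when no cue word appears or when the senses tie,
    so the function never guesses without evidence.
    """
    tokens = set(re.findall(r"[a-z/]+", sentence.lower()))
    scores = {sense: len(tokens & cues) for sense, cues in SENSE_CUES.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, second_score) = ranked[0], ranked[1]
    if best_score == 0 or best_score == second_score:
        return None
    return best


print(disambiguate_ca("Patient with Ca of the breast, biopsy confirmed carcinoma."))
print(disambiguate_ca("Serum Ca was 9.2 mg/dL after a vitamin D supplement."))
```

The first sentence contains cancer cues only and the second contains calcium cues only, so the two calls resolve to different senses of the same two letters.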
Data extraction was made necessary by the explosive growth of big data, and was subsequently made practical by technologies such as Natural Language Processing (NLP), computer vision, and image recognition. It is estimated that in 2020, Life Sciences data will double approximately every 73 days (Densen, 2011). At this speed, the sheer volume of big data, both structured and unstructured, will make data extraction and data analysis a serious problem.
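To put the 73-day figure in perspective, the short calculation below converts a doubling time into a yearly growth factor. It is a back-of-the-envelope sketch that assumes the doubling time stays constant over the whole year; the function name growth_factor is made up for this illustration.

```python
DOUBLING_TIME_DAYS = 73  # figure quoted in the text (Densen, 2011)


def growth_factor(days, doubling_time_days=DOUBLING_TIME_DAYS):
    """Factor by which a quantity grows over `days` at a constant doubling time."""
    return 2 ** (days / doubling_time_days)


# 365 / 73 = 5 doublings in a year, so the factor is 2 ** 5.
print(growth_factor(365))  # 32.0
```

Under that assumption, a collection of documents would be roughly 32 times larger at the end of the year than at the start.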
AI enables computers to extract information from documents and images. NLP technology empowers computers to understand human language: with machine learning algorithms built for NLP, computers can identify the type of entity to be extracted, summarize text, and identify the sentiment expressed in a string of words. Computer vision, on the other hand, gives the computer object recognition capabilities, with which symbols, characters, sections, etc. can be differentiated. But all of this is about how a computer extracts information. Before that, we need to tackle another, even bigger question.
How does a computer know ‘what’ to extract?
The internet is flooded with data. This data is spread across various forms and formats. Data is not only unstructured within a domain, but also unstructured as a whole. Information on the web from different domains can intersect, which makes search and analytics much more difficult and increases the effort they require. Secondly, the language of a domain, its ontology, can be totally different from what we normally understand. For example, lawyers understand the terms used in law. The phrase ‘low temperature’ could mean totally different things to an astronaut and to a doctor. Similarly, only a medical researcher can understand the language of Life Sciences.
In order to find relevant information to extract, specific words are used. For example, information on cancer can be extracted using the word ‘cancer.’ Well, not really! What will you find when you search Google for the word ‘cancer’? Next to information on animals and horoscopes, there are gazillions of articles containing general information about the disease. However, to scientists, researchers, and medical practitioners, this information is of little use. What is most important for a patient to know about cancer, such as its physical symptoms, may be the most basic information to a scientist working to develop a cure for it. The information a researcher requires is often complex to understand. So, the information extracted using the word ‘cancer’ may not be helpful at all to generate insights or to facilitate research.
Moreover, domain experts usually use domain language. That means scientists and researchers use scientific names and codes, or refer to an indication with terms different from those a layman understands. How does the computer know whether a particular set of images and documents contains information that a domain expert can use to generate useful insights? Computer algorithms trained to extract information may not be able to verify whether that information is relevant for a specific person or domain expert. Here’s where ontology plays a big role.
An ontology narrows down the search to specific terms, which helps identify the most important and domain-specific articles. It also helps extract the in-depth information that researchers and medical practitioners all over the world are looking for. Therefore, teaching computers to understand the specific ontology of life sciences, and using this language of medical terms to extract information, will lead to much more useful and accurate information for research and development in life sciences.
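As a rough illustration of how domain terms narrow a search, the sketch below scores documents by how many specialist terms they contain and drops the ones that contain none. The term list in DOMAIN_TERMS and the function names are assumptions made for this example, not entries from an actual Life Sciences ontology, and plain substring matching stands in for proper NLP.

```python
# Toy ontology fragment: a layman's concept mapped to domain terms.
# The entries are illustrative assumptions, not a real ontology.
DOMAIN_TERMS = {
    "cancer": ["carcinoma", "neoplasm", "metastasis", "EGFR", "oncogene"],
}


def domain_score(text, concept):
    """Count how many domain terms for `concept` occur in `text` (case-insensitive)."""
    lowered = text.lower()
    return sum(1 for term in DOMAIN_TERMS[concept] if term.lower() in lowered)


def filter_documents(documents, concept, min_score=1):
    """Keep documents with at least `min_score` domain terms, highest score first."""
    scored = [(domain_score(doc, concept), doc) for doc in documents]
    kept = [pair for pair in scored if pair[0] >= min_score]
    return [doc for _, doc in sorted(kept, key=lambda pair: pair[0], reverse=True)]


documents = [
    "Cancer horoscope: a good week for love and travel.",
    "EGFR mutations in non-small-cell lung carcinoma and risk of metastasis.",
    "Ten early symptoms of cancer everyone should know.",
    "Oncogene expression in a rare neoplasm.",
]

for doc in filter_documents(documents, "cancer"):
    print(doc)
```

The horoscope and the general-interest article both mention ‘cancer’ but contain no domain terms, so only the two research-style titles are kept.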
What does all this mean? Using AI technologies such as computer vision, natural language processing, etc., without using the specific language of the domain may not be accurate for extracting meaningful content.
Another challenge faced during data extraction is understanding concepts in relation to one another. For instance, when you hear ‘Metformin’, what comes to mind is ‘Diabetes’. So, how can we build such a capability into a computer? The Oxford dictionary gives three different meanings for the word ‘intervention’, yet none of them relates it to the word ‘drug’. In addition, most Life Sciences terminology is not included in a dictionary. A person who is not from the field of Life Sciences may not know what HER-1 refers to, or that it can be discussed in relation to, for instance, breast cancer. Neither will a computer, unless it is trained to understand the terms, the concepts, and, above all, their relations and connections with one another.
For this purpose, a Life Sciences ontology is implemented. An ontology is a formal description of a domain’s common concepts, their principles, and the relationships between those concepts. It helps the computer focus on the most relevant Life Sciences content and identify topics using domain-focused words, synonyms, and terminology related to the subject, in order to extract only the data most relevant to the domain. An ontology further helps in segregating the extracted data, whether structured or unstructured. The end result: research that takes less time. A Life Sciences ontology helps establish relationships between biological entities such as genes, proteins, diseases, and drugs, and also helps in discovering new connections.
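One common way to represent such concepts and relationships is as (subject, relation, object) triples. The sketch below is a minimal, hypothetical example of that idea: the triples are drawn from the examples in this article, the relation names (treats, is_a, associated_with) are invented for the illustration, and a real Life Sciences ontology would be far larger and curated by domain experts.

```python
from collections import defaultdict

# Illustrative triples based on the examples in this article.
# The relation names are assumptions made for this sketch.
TRIPLES = [
    ("metformin", "treats", "diabetes"),
    ("metformin", "is_a", "drug"),
    ("drug", "is_a", "intervention"),
    ("HER-1", "associated_with", "breast cancer"),
]


def build_index(triples):
    """Index triples by lower-cased subject for quick lookup."""
    index = defaultdict(list)
    for subject, relation, obj in triples:
        index[subject.lower()].append((relation, obj))
    return index


def related(index, concept):
    """Return the (relation, object) pairs stored directly for a concept."""
    return index.get(concept.lower(), [])


def ancestors(index, concept):
    """Follow is_a links upward to collect every broader concept."""
    found = []
    frontier = [concept.lower()]
    while frontier:
        current = frontier.pop()
        for relation, obj in index.get(current, []):
            if relation == "is_a" and obj not in found:
                found.append(obj)
                frontier.append(obj.lower())
    return found


index = build_index(TRIPLES)
print(related(index, "Metformin"))
print(ancestors(index, "Metformin"))
```

With these four triples, a lookup for ‘Metformin’ returns its link to diabetes, and following the is_a links connects it to ‘intervention’, a connection that a general-purpose dictionary entry would not supply.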
On this note, here’s what Gunjan Bhardwaj, Founder and CEO of the Innoplexus group, recently said during a TED Talk:
“An AI that specializes to the deepest level in the specific language of life science can be more effective than a generic system that can address everything from science to architecture to the design of sneakers but is not optimized for anything.
Hence, an AI machine that has been trained on life science language can crawl, extract and aggregate that exploding data universe and understand the context. It will also be able to disambiguate cases where EGFR is a target, a biomarker, a gene or a protein. It will also be able to identify and analyze semantic associations between concepts.
Such a trained AI provides the tool to tame the huge flood of data. It enables users to filter the data only for relevant insights by specifying a context of interest. Researchers, physicians or patients no longer would have to read through hundreds of papers to finally find the right one.”