The importance of quality control in the vast data ocean of Life Sciences
Through the Internet, we have access to a vast ocean of life sciences data, and AI provides us with the tools to tame it. In data analytics, for example, it is important to collect the data most useful for generating relevant knowledge. AI enables this by specifying the context of interest to filter data by relevancy. Nonetheless, it is often difficult to determine the authenticity of the information available on the Internet. The World Wide Web is an unrestricted space, collecting data from all sources: patients, doctors, researchers, and amateurs of all kinds. Information can be false, modified, or manipulated. Therefore, validating the data we access and analyze is essential to ensuring only the most relevant data is used to make important business decisions.
In 2015, Forbes predicted that the Internet would grow to 44 Zettabytes (ZB) by 2020. The total volume of data on the Internet in May 2017 was 4.4 ZB, and at the rate at which it increased to 33 ZB in 2018, it would not be surprising if the Internet exceeded 50 ZB by next year. (In case you are wondering, one Zettabyte is equal to a trillion GB.) Scientific data alone flows continuously from thousands of new publications, patents, scientific conferences, dissertations, clinical trials, research papers, and patient forums. The volume of scientific data is expected to double every 73 days by next year. The problem with so much data is that it is unstructured and scattered. Although relevant information is available, it is dense and diverse, i.e., it comes in various types and formats such as sensor data, text, log files, click streams, video, and audio. It is therefore difficult to identify the data that is relevant to a given search query. Moreover, with such a massive amount of data, it’s highly likely that there will be duplicate content. So how can we check the quality of everything we crawl?
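To make the duplicate-content problem concrete, here is a minimal Python sketch of one common approach: hash a case- and whitespace-normalized copy of each crawled text and keep only the first document per hash. The function names `fingerprint` and `deduplicate` are hypothetical and chosen for illustration; this is not a description of how Innoplexus handles duplicates, and it only catches copies that differ in case or spacing.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Return a SHA-256 hash of a case- and whitespace-normalized copy of the text."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct fingerprint, preserving order."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        key = fingerprint(doc)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


if __name__ == "__main__":
    crawled = [
        "Aspirin  reduces fever.",
        "aspirin reduces fever.",   # same text, different case and spacing
        "Ibuprofen reduces pain.",
    ]
    for doc in deduplicate(crawled):
        print(doc)
```

Paraphrased or partially overlapping documents would need a fuzzier technique, such as comparing sets of word shingles, which this sketch does not attempt.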
Let’s first understand what kind of data we are talking about! As we have already established, data on the web is dense and diverse, which means it is present in structured as well as unstructured form. The data also comes in different formats, such as documents, images, and PDFs. To extract this data, we need a number of artificial intelligence technologies. Natural Language Processing helps us understand the context of the information. Computer vision and image recognition assist by recognizing characters and extracting data from PDFs and images. Entity normalization, in turn, reduces the share of entities that are missed because of misspellings and synonyms.
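As an illustration of entity normalization, the Python sketch below maps synonyms and misspellings to a single canonical name using a small lookup table plus fuzzy string matching from the standard library. The synonym table, the `normalize_entity` function, and the 0.85 similarity cutoff are all assumptions made for this example; a production system would rely on a far larger, curated vocabulary.

```python
import difflib
from typing import Optional

# Hypothetical synonym table: variant spelling or synonym -> canonical name.
SYNONYMS = {
    "acetylsalicylic acid": "aspirin",
    "asa": "aspirin",
    "paracetamol": "acetaminophen",
}
CANONICAL = sorted(set(SYNONYMS.values()))
KNOWN_TERMS = CANONICAL + list(SYNONYMS)


def normalize_entity(mention: str, cutoff: float = 0.85) -> Optional[str]:
    """Map a raw mention to its canonical name, or None if nothing is close enough."""
    term = " ".join(mention.lower().split())
    if term in SYNONYMS:          # exact synonym
        return SYNONYMS[term]
    if term in CANONICAL:         # already canonical
        return term
    # Fuzzy match to tolerate misspellings such as "asprin".
    matches = difflib.get_close_matches(term, KNOWN_TERMS, n=1, cutoff=cutoff)
    if not matches:
        return None
    best = matches[0]
    return SYNONYMS.get(best, best)


if __name__ == "__main__":
    for raw in ["Asprin", "ASA", "Paracetamol", "unknown compound"]:
        print(raw, "->", normalize_entity(raw))
```

Returning `None` for unmatched mentions, rather than guessing, keeps uncertain entities visible so they can be reviewed instead of being silently merged into the wrong concept.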
The Innoplexus life sciences data ocean is vast, consisting of 95% of publicly available data, including more than 35 million publications, 2.3 million grants, 1.1 million patents, 833k congress presentations, 681k theses & dissertations, 500k clinical trials, 73k drug profiles, and 40k gene profiles. This data contains information about authors, researchers, hospitals, regulatory body decisions, HTA body decisions, treatment guidelines, biological databases of genes, proteins, and pathways, patient advocacy groups, patient forums, social media posts, news, and blogs. We crawl, aggregate, analyze and visualize this data using AI technologies implemented through the use of our proprietary CAAV framework.
To ensure we have access to the most specific content and relevant information, quality control of the data we crawl and analyze is important. Before insights are generated, crawled results need to be checked and cleaned. Several methods can be used to implement this validation. The first is quality control of the data source, that is, assessing the credibility of a publication. The second is triangulation, or confirming the same result from several sources. The third is ontology, where the machine is given a specific context in which to work.
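The three validation methods can be sketched together in a few lines of Python. This is a simplified illustration under stated assumptions, not the actual Innoplexus process: the `Finding` record, the `validate` function, the set of trusted sources, and the set of ontology terms are all hypothetical, and real credibility assessment is far richer than a membership check.

```python
from collections import defaultdict
from typing import NamedTuple


class Finding(NamedTuple):
    entity: str   # subject of the finding, e.g. a drug name
    claim: str    # normalized statement about the entity
    source: str   # identifier of the source it was crawled from


def validate(findings, trusted_sources, ontology_terms, min_sources=2):
    """Return (entity, claim) pairs that pass all three checks."""
    confirmations = defaultdict(set)
    for f in findings:
        if f.source not in trusted_sources:   # 1. source quality control
            continue
        if f.entity not in ontology_terms:    # 3. ontology: stay within context
            continue
        confirmations[(f.entity, f.claim)].add(f.source)
    # 2. triangulation: require confirmation from several distinct sources
    return sorted(
        key for key, sources in confirmations.items()
        if len(sources) >= min_sources
    )


if __name__ == "__main__":
    findings = [
        Finding("aspirin", "reduces fever", "journal_a"),
        Finding("aspirin", "reduces fever", "journal_b"),
        Finding("aspirin", "cures cancer", "forum_x"),
        Finding("aspirin", "cures cancer", "journal_a"),
    ]
    trusted = {"journal_a", "journal_b"}
    ontology = {"aspirin"}
    print(validate(findings, trusted, ontology))
```

In this sketch, a claim seen in only one trusted source is withheld rather than rejected outright, so it can still be confirmed later when another source reports it.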
Innoplexus has both an automated and a manual validation process. Innoplexus crawls multiple sources using its self-learning life sciences ontology, an automated, self-updating database of Life Sciences terms and concepts. Once data is crawled and extracted, normalization begins. As new information is added in real time from various sources, e.g., new publications, it is verified with the help of AI technologies and algorithms. Data is aggregated into relevant datasets and structured. Relevant tags are then added so that search queries return accurate results later. Alongside the automated validation at every step, a manual review is carried out by PhD and postdoctoral personnel to ensure the accuracy of our Life Sciences data ocean.
With automated and manual quality control of data at every step of the crawling, aggregating, and analyzing process, we ensure that the data visualized is verified and relevant, making it possible for the pharmaceutical industry to generate the most relevant insights.