Standardizing Clinical Data with Python
2019-06-03, 11:45–12:10, Hall 2 (PyData)

The centralized database that holds clinical trial data needs standardization. Python tools are used to support this effort.


This is joint work with Joshua Schertz.

ClinicalTrials.gov is the database where clinical trials from all over the world are registered. Under U.S. law, some clinical trials are required to report their findings in this database. It currently holds over 300,000 clinical trials, more than 10% of which have numeric results. However, because many entities enter data into this fast-growing database, the data is not standardized. In particular, the numerical data cannot be comprehended because the units are not standardized. In 2019, over 23K different units were detected in the database, and many of those units are similar and differ only in how they are written. This talk will discuss how we use Python tools to 1) process and index the data, 2) find similar units using NLP and machine learning, and 3) create a web site that supports user mapping of those units. We created ClinicalUnitMapping.com to support this standardization effort. New elements of this presentation will discuss how units from existing medical standards such as UCUM, RTMMS, and CDISC are incorporated into the Python processing pipeline. The intention is to create a unit standard that can map all units reported by clinical trials. With such a standard in place, the data in ClinicalTrials.gov would become machine comprehensible.
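
To make the idea of "similar units written differently" concrete, here is a minimal sketch of how such variants could be grouped. It is not the pipeline presented in the talk, which uses NLP and machine learning; this version swaps in plain string normalization and the standard library's difflib similarity ratio. The function names, the sample unit strings, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher


def normalize_unit(raw):
    """Reduce a unit string to a canonical spelling (hypothetical rules)."""
    text = raw.strip().lower()
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    text = re.sub(r"\s*/\s*", "/", text)  # "mg / dl" -> "mg/dl"
    text = text.replace(" per ", "/")     # "mg per dl" -> "mg/dl"
    return text


def group_exact(units):
    """Group raw unit strings that normalize to the same spelling."""
    groups = defaultdict(list)
    for unit in units:
        groups[normalize_unit(unit)].append(unit)
    return dict(groups)


def similar_pairs(keys, threshold=0.8):
    """Return pairs of normalized units whose similarity ratio meets the threshold.

    The threshold is an illustrative assumption, not a tuned value.
    """
    ordered = sorted(keys)
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            score = SequenceMatcher(None, first, second).ratio()
            if score >= threshold:
                pairs.append((first, second, round(score, 2)))
    return pairs


if __name__ == "__main__":
    # Made-up sample strings, not taken from the ClinicalTrials.gov data.
    sample = ["mg/dL", "mg / dl", "MG per DL", "mmHg", "mm Hg", "participants"]
    groups = group_exact(sample)
    for key, members in groups.items():
        print(key, "<-", members)
    for first, second, score in similar_pairs(groups):
        print("candidate match:", first, "~", second, score)
```

Normalization merges the three spellings of milligrams per deciliter outright, while the similarity pass only flags "mm hg" and "mmhg" as a candidate pair for a person to confirm. A review step of that kind is the role the abstract assigns to the mapping web site. Comparing every pair scales quadratically, so a vocabulary of over 23K units would likely need indexing or blocking before a pass like this.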
