Medical Microdata Compendium (Open Biomedical Datasets with schema.org annotation) -- was: Re: New proposal: health & medical extensions to schema.org

Dear all,

I have published a first prototype of the "Medical Microdata Compendium", a collection of open medical and pharmacological datasets marked up in the microdata format according to the recently updated schema.org vocabulary. The long-term goal of this project is to provide structured medical and pharmacological information to search engines to enable better decision making by doctors and patients. The far more humble short-term goal is to research how microdata can be used for retrieving and querying biomedical information, and to come up with interesting demonstrations and use cases.

The data can be viewed here:

http://samwald.info/medical_microdata/

At the moment this is a flat list of web pages, with each page describing a formulated pharmaceutical or a substance. The data were derived from the DailyMed and DrugBank datasets from the LODD collection. 

Example of a DrugBank resource:
http://samwald.info/medical_microdata/drugbank_resource_drugs_DB00175.html

Example of a DailyMed resource:
http://samwald.info/medical_microdata/dailymed_resource_drugs_3580.html

You can extract the structured data from these pages with a variety of tools. For example, you can use the Sindice inspector:
http://inspector.sindice.com/inspect?url=http%3A%2F%2Fsamwald.info%2Fmedical_microdata%2Fdrugbank_resource_drugs_DB00175.html
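
For those who prefer to extract the data programmatically, here is a minimal sketch in Python using only the standard library. It is not the code used to build or inspect the compendium. The sample markup, the class name FlatMicrodataParser and the function extract_microdata are made up for illustration, and the real pages may use different types and properties. The sketch also ignores nested items and assumes that each property value is either an href/content attribute or the first text inside the element.

```python
from html.parser import HTMLParser

# Hypothetical sample markup for illustration only; the actual pages of the
# compendium may be structured differently.
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/Drug">
  <h1 itemprop="name">Example Drug</h1>
  <span itemprop="activeIngredient">Example Substance</span>
  <a itemprop="url" href="http://example.org/drug/1">link</a>
</div>
"""


class FlatMicrodataParser(HTMLParser):
    """Collects itemtype values and (itemprop, value) pairs, ignoring nesting."""

    def __init__(self):
        super().__init__()
        self.types = []        # every itemtype URL seen, in document order
        self.properties = []   # (property name, value) pairs, in document order
        self._pending = None   # itemprop name still waiting for its text value

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            self.types.append(attrs["itemtype"])
        prop = attrs.get("itemprop")
        if prop is None:
            return
        # Simplification: take the value from href or content if present,
        # otherwise wait for the first non-empty text inside the element.
        if "href" in attrs:
            self.properties.append((prop, attrs["href"]))
        elif "content" in attrs:
            self.properties.append((prop, attrs["content"]))
        else:
            self._pending = prop

    def handle_data(self, data):
        text = data.strip()
        if self._pending is not None and text:
            self.properties.append((self._pending, text))
            self._pending = None

    def handle_endtag(self, tag):
        # Drop a pending property if no text was found before a closing tag.
        self._pending = None


def extract_microdata(html):
    """Return (types, properties) found in the given HTML string."""
    parser = FlatMicrodataParser()
    parser.feed(html)
    return parser.types, parser.properties
```

To try this on one of the pages above, you could fetch the HTML with urllib.request.urlopen, decode it, and pass the resulting string to extract_microdata. For anything beyond a quick look, a full microdata parser that handles nested items would be the better choice.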

At the moment I am evaluating how different search engines can cope with the data. For example, the microdata can already be used by Google Custom Search Engines. Other 'semantic' search engines such as http://sindice.com/ or the medical search engine developed by the http://khresmoi.eu/ project should also be evaluated. 

If you are interested in joining the effort to evaluate how semantic markup can be used to improve medical information search and decision making, please send me an e-mail! I would like to see this work published as a journal paper, and could use some co-authors. I appreciate any feedback or ideas!

Regarding the Medical Microdata Compendium, there are several issues that still need to be taken care of:

1) The DailyMed resources are still riddled with character encoding issues -- this is a problem of the LODD data source and will be remedied by switching to a newer version of this dataset, Richard's 'Linked Structured Product Labels'.
2) Only a fraction of the properties of the source datasets have been mapped, namely those where a close fit between a property in the source dataset and schema.org could be found. This means that a lot of useful data is not captured. I will look into using the proposed schema.org extension mechanism to see if it could help to capture these additional properties and types.
3) More datasets need to be converted, such as ClinicalTrials.gov (and its linked data mirror http://linkedct.org/). This will also help to better demonstrate interlinking of different datasets (e.g., from disease to drug to ongoing clinical trials in the area).
4) The generation of http://schema.org/MedicalCode entities needs to be fixed. Also, we need to check how we can align with controlled vocabularies that already have URIs (e.g., BioPortal taxonomies).
5) General clean-up, code formatting and improvement of the web design.

Cheers,
Matthias Samwald

Received on Wednesday, 4 July 2012 21:22:07 UTC