CORD-19 semantic annotations - 11am Tuesday (Boston time) - Scott Malec on computable knowledge extraction

Tomorrow (Tuesday) 11am Boston time Scott Malec will discuss his work on 
computable knowledge extraction using the CORD-19 dataset that was 
released by the Allen Institute.

We will use this Google Hangout:
http://tinyurl.com/fhirrdf

More on Scott's work:
https://github.com/fhircat/CORD-19-on-FHIR/wiki/CORD-19-Semantic-Annotation-Projects#project-name-cord-semantictriples

We still have time for one other presentation tomorrow about CORD-19 
semantic annotation.  If anyone else is ready (with slides) to present 
for 20 minutes, please let me know.

Thanks,
David Booth

-----------------------------------------------

MEETING NOTES 7-Apr-2020
Present: David Booth <david@dbooth.org>, Sebastian Kohlmeier 
<sebastiank@allenai.org>, Lucy Lu Wang <lucyw@allenai.org>, Kyle Lo 
<kylel@allenai.org>, Jim McCusker <mccusker@gmail.com>, Scott Malec 
<sam413@pitt.edu>, Guoqian Jiang <jiang.guoqian@mayo.edu>, Todor Primov 
<todor.primov@ontotext.com>

Sebastian: Allen Institute, Semantic Scholar, Non-profit AI institute, with 
Lucy and Kyle.  Engaged in COVID-19 because as a non-profit we could develop 
a corpus that we can make available.  Created CORD-19 dataset.  Goal: 
Standardized format that's easy for machines to read, to enable quick 
analysis of the literature.  Working to extend it.  Weekly updates, but 
want to get to daily updates.  Also want to get to entity and 
relation extraction.

Guoqian: Identifiers used?  SHA numbers for full text, but also IDs 
linked to DOIs and PubMed IDs.  Should discuss best way to have unique 
ID for publication.

Kyle: Added unique IDs: cord_UID.  SHA is a hash of PDF, and sometimes 
there are multiple PDFs for a single paper.
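
A minimal sketch of the ID scheme Kyle describes, assuming the hash is 
SHA-1 over the raw PDF bytes and that one paper-level ID can own several 
PDF hashes.  The record layout, the toy IDs, and the helper names 
pdf_sha and paper_id_index are hypothetical, not the actual CORD-19 
tooling.

```python
import hashlib


def pdf_sha(pdf_bytes: bytes) -> str:
    """Hex digest of the raw PDF bytes (SHA-1 is an assumption here)."""
    return hashlib.sha1(pdf_bytes).hexdigest()


def paper_id_index(records):
    """Map each PDF hash to its paper-level ID.

    `records` is a list of (paper_id, [pdf_bytes, ...]) pairs.  A paper
    may have several PDFs, so several hashes can point to one paper ID.
    """
    index = {}
    for paper_id, pdfs in records:
        for pdf in pdfs:
            index[pdf_sha(pdf)] = paper_id
    return index


# Toy data: one paper with two PDF versions, one paper with a single PDF.
records = [
    ("uid-0001", [b"%PDF-1.4 preprint", b"%PDF-1.4 published"]),
    ("uid-0002", [b"%PDF-1.4 other paper"]),
]
index = paper_id_index(records)
```

Looking up either PDF hash of the first paper returns the same 
paper-level ID, which is why a hash alone cannot serve as the unique ID 
for a publication.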

Jim: DOIs?

Lucy: Some papers do not have a DOI.

Jim: Building a KG using generalized tools from another project, used 
in many domains.  Looking to do drug repurposing using CORD-19.  Using 
an extract of CORD-19.  Does deep extraction of named entities and 
relationships.  Use PROV ont and nanopublications, for rich modeling and 
provenance for probabilistic KG.  Arcs in picture are based on 
confidence level.  Allows high precision on drugs that have been tested 
on melanoma before.  Re-applying this to COVID-19.  We focus on open 
ontologies, and not using FHIR.  Not yet published.  PageRank-based 
analysis of PubMed citation graph, to compute community trust in a paper.
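
A minimal sketch of PageRank over a citation graph, to illustrate the 
kind of score Jim mentions.  This is textbook power iteration on a toy 
graph, not Jim's unpublished implementation; the damping factor, the 
iteration count, and the paper names are assumptions.

```python
def pagerank(citations, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {paper: [papers it cites]}."""
    nodes = set(citations)
    for cited in citations.values():
        nodes.update(cited)
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        # Rank held by papers that cite nothing is spread over all papers.
        dangling = sum(rank[p] for p in nodes if not citations.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in nodes}
        # Each paper passes an equal share of its rank to every paper it cites.
        for paper, cited in citations.items():
            for target in cited:
                new_rank[target] += damping * rank[paper] / len(cited)
        rank = new_rank
    return rank


# Toy citation graph: A and B cite C, and C cites D.
citations = {
    "A": ["C"],
    "B": ["C"],
    "C": ["D"],
    "D": [],
}
scores = pagerank(citations)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Papers that are cited by well-cited papers end up with higher scores, 
which is one way to approximate community trust in a paper.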

Guoqian: What ont?

Jim: Drugbank mostly.  Lots of targets.

Kyle: Relation-entity set.  Closed set?

Jim: We have drug graph, protein-protein interaction, and drugbank has 
drug-protein interaction.  Molecular interaction.  CTD (Comparative 
Toxicogenomics Database); Heng Ji Lab database started with it.

Kyle: Trying to add more KB entities?

Jim: Want to expand the interaction set.  Also entities.  We have all 
human proteins and drugbank drugs.  If you have a drug with an effect on 
a target similar to a protein in COVID-19, will there be hits, directly or 
indirectly?  To do that, we want to score it also based on confidence in 
the research.

Scott: My research approach is to integrate structured knowledge from 
literature or other curated sources, and combine with observational data 
to facilitate more reliable inference.  General idea is that contextual 
info can help interpret and identify confounders.  Confounders are 
common causes of the predictor and outcome.  What I did with CORD-19: 
took PubMed IDs, found what machine reading had been performed, and 
created a KG.  Machine reading can run for months.  Jim's work on citation 
analysis is cool.  Using SemRep, developed by NLM, over titles and 
abstracts in PubMed.  Using PubMed Central IDs from metadata table, in 
beginning of March, 31k papers, with 28k in PubMed Central.  Seemed like 
a good place to start building a KG quickly, to see the big picture 
quickly.  Pulled 106k semantic predications in the 21k docs, pulled into 
cytoscape and computed network centrality, and from that ranked.  Filtered 
by biomedical entities, diseases, syndromes, amino acids, peptides or 
pharm substances.  Ranked them by centrality to understand their 
importance.  Very prelim analysis.  Interested to see how I might expand 
on this and learn what others are doing.
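
A minimal sketch of the ranking step Scott describes: build a graph from 
subject-predicate-object predications, compute a centrality score, keep 
only entities of chosen semantic types, and rank them.  The notes do not 
say which centrality measure was computed in cytoscape, so degree 
centrality is used here as a stand-in.  The predications, entity names, 
and type labels are invented placeholders.

```python
from collections import defaultdict

# Toy predications (subject, predicate, object), invented for illustration.
predications = [
    ("drug_a", "TREATS", "disease_x"),
    ("drug_b", "TREATS", "disease_x"),
    ("disease_x", "CAUSES", "disease_y"),
    ("protein_p", "INTERACTS_WITH", "protein_q"),
    ("protein_q", "ASSOCIATED_WITH", "disease_x"),
]

# Hypothetical semantic-type label for each entity.
semantic_type = {
    "drug_a": "pharmacologic substance",
    "drug_b": "pharmacologic substance",
    "disease_x": "disease or syndrome",
    "disease_y": "disease or syndrome",
    "protein_p": "amino acid, peptide, or protein",
    "protein_q": "amino acid, peptide, or protein",
}


def degree_centrality(triples):
    """Fraction of the other nodes each node is directly connected to."""
    neighbors = defaultdict(set)
    for subj, _pred, obj in triples:
        neighbors[subj].add(obj)
        neighbors[obj].add(subj)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


def rank_entities(triples, types, keep):
    """Rank entities whose semantic type is in `keep`, most central first."""
    scores = degree_centrality(triples)
    kept = [(node, score) for node, score in scores.items()
            if types.get(node) in keep]
    # Sort by descending score, then by name so ties are deterministic.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


ranking = rank_entities(
    predications,
    semantic_type,
    keep={"disease or syndrome", "pharmacologic substance"},
)
```

Centrality is computed on the full graph before filtering, so an entity's 
score still reflects its links to entities of types that are later 
dropped from the ranking.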

Guoqian: Can cytoscape support RDF graphs?  David: Yes.  Jim: Yes, and 
you can form SPARQL queries to extract specific interactions.  Not 1:1 
mapping of RDF graph to bio network.

Todor: There are different plugins, one is SPARQL endpoint.  Others for 
other visualizations.  Keep expectations low.

Jim: It also includes a knowledge exploration interface, built on 
cytoscape.js, a re-implementation of cytoscape.  The implementation I'm 
using has some interface elements.

Lucy: How does Coronavirus ont relate?

Guoqian: Using this ont to annotate the papers.

Lucy: Where did that ont come from?

Jim: Built using OBO Foundry?  Guoqian: Yes.

Jim: We use OBO ont.  Oliver has a lot of tools for subsetting and 
extracting for app ontologies.

Guoqian: Also collaborating with Cochrane PICO ontology, developing 
COVID-19 PICO ont, specific subtypes of the high level types, eg, 
subtypes of population with particular co-morbidities.  The ont is 
also avail on github.

Guoqian: How to collaborate?  Need a registry for KG from this community?

Lucy: Working on semantic annotation of entity and rel.  Lots of people 
are doing bottom-up annotation, without formal vocab, then linking to 
UMLS.  But haven't seen COVID-19 ont.

Guoqian: Also should look at use cases that different groups have. 
Community said they want open vocab, such as UMLS, instead of SNOMED-CT.

Lucy: Also working with a group at AWS, KB of concepts, link to ICD-10 
and RxNorm, also lots of requests for protein and interactions.

Guoqian: Also procedure datasets.

Lucy: What use cases are these projects addressing?

Guoqian: For EBMonFHIR, they are focused on review of evidence, and 
clinical concepts.  Other team looking at using OBO ont to analyse DB to 
mine underlying mechanisms.  Ideally we should have linkage across 
vocabularies.  Eg UMLS can link many ont.  But for OBO it might be a 
challenge.

Jim: From microbio perspective, most useful from this group would be 
having cross mapping from clinical/FHIR/SNOMED-ish world and OBO bio 
world, with translation between the two.  E.g. I use uniprot IDs.  Is 
that a problem?  What about drug IDs?  IDs are the hardest part -- agree 
on some, and mappings for others.

Guoqian: If we can provide a list of ont each team prefers, we can discuss.

Lucy: Would be great to be able to share annotations.  Centralized 
vocab?  Central KB?  Use cases are key.

Scott: Mapping problems with COVID-19 are same as other mapping 
problems.  Should have a central place to share projects.  Should keep 
use cases in mind.

Sebastian: Please give us feedback on the dataset!

Todor: Focus on specific questions that you want to answer, then map 
using common IDs to address them.

Daniel: What formats?  Right now we're using FHIR.  Use others?

Jim: identifiers.org might be useful for mapping.
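
A minimal sketch of how shared identifiers could be expressed as compact 
identifiers (prefix:accession) and turned into identifiers.org resolver 
URLs.  The helper names to_curie and resolver_url are hypothetical, and 
the sketch assumes the prefixes shown (uniprot, drugbank, pubmed) are the 
ones registered with the resolver; it only builds strings and makes no 
network calls.

```python
def to_curie(prefix: str, accession: str) -> str:
    """Join a namespace prefix and a local accession into prefix:accession."""
    return f"{prefix.strip().lower()}:{accession.strip()}"


def resolver_url(curie: str) -> str:
    """Build an identifiers.org URL for a compact identifier."""
    prefix, sep, accession = curie.partition(":")
    if not sep or not prefix or not accession:
        raise ValueError(f"not a compact identifier: {curie!r}")
    return f"https://identifiers.org/{prefix}:{accession}"


# Example accessions from three namespaces that different groups might use.
local_ids = [
    ("UniProt", "P0DTC2"),
    ("DrugBank", "DB00001"),
    ("PubMed", "12345678"),
]
urls = [resolver_url(to_curie(prefix, acc)) for prefix, acc in local_ids]
```

Agreeing on the prefix:accession form for each entity type would let 
groups exchange annotations without first agreeing on a single ontology.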

David: Useful to have each group present use cases and vocab.

We'll meet weekly, same time, 1 hour.  Each group will present their 
work in more detail, with focus on:
what use cases they are addressing; and
what vocabularies / ontologies they're using.

Each group will present for 20 min, followed by 10 min of questions.

ADJOURNED

Received on Monday, 20 April 2020 15:56:57 UTC