Re: CORD-19 semantic annotations - 11am Tuesday (Boston time) - Jin-Dong Kim (Schedule change) from David Booth on 2020-04-21 (public-semweb-lifesci@w3.org from April 2020)

From: David Booth <david@dbooth.org>
Date: Tue, 21 Apr 2020 10:47:50 -0400
To: w3c semweb HCLS <public-semweb-lifesci@w3.org>
Message-ID: <1bb6d501-454f-ddad-5e01-d45e0ab0b890@dbooth.org>
Last minute schedule change for today's call: Instead of Scott Malec, 
Jin-Dong Kim will present his work on "An open collaboration for richly 
annotating Covid-19 Literature".  Slides are here:
https://docs.google.com/presentation/d/1ynoe1Xxc_-rTiebbvvuPBQMaktK-DX87McuDVaLbI1g/edit#slide=id.g726dbf02a0_0_0

David Booth

On 4/20/20 11:56 AM, David Booth wrote:
> Tomorrow (Tuesday) 11am Boston time Scott Malec will discuss his work on 
> computable knowledge extraction using the CORD-19 dataset that was 
> released by the Allen Institute.
> 
> We will use this google hangout:
> http://tinyurl.com/fhirrdf
> 
> More on Scott's work:
> https://github.com/fhircat/CORD-19-on-FHIR/wiki/CORD-19-Semantic-Annotation-Projects#project-name-cord-semantictriples 
> 
> 
> We still have time for one other presentation tomorrow about CORD-19 
> semantic annotation.  If anyone else is ready (with slides) to present 
> for 20 minutes, please let me know.
> 
> Thanks,
> David Booth
> 
> -----------------------------------------------
> 
> MEETING NOTES 7-Apr-2020
> Present: David Booth <david@dbooth.org>, Sebastian Kohlmeier 
> <sebastiank@allenai.org>, Lucy Lu Wang <lucyw@allenai.org>, Kyle Lo 
> <kylel@allenai.org>, Jim McCusker <mccusker@gmail.com>, Scott Malec 
> <sam413@pitt.edu>, Guoqian Jiang <jiang.guoqian@mayo.edu>, Todor Primov 
> <todor.primov@ontotext.com>
> 
> Sebastian: Allen Institute, Semantic Scholar, Non-profit AI institute, with 
> Lucy and Kyle.  Engaged in COVID-19 because as non-profit could develop 
> a corpus that we can make available.  Created CORD-19 dataset.  Goal: 
> Standardized format that's easy for machines to read, to enable quick 
> analysis of the literature.  Working to extend it.  Weekly updates, but 
> want to get to daily updates.  Want to also get to entity and 
> relation extraction.
> 
> Guoqian: Identifiers used?  SHA numbers for full text, but also IDs 
> linked to DOIs and Pubmed IDs.  Should discuss best way to have unique 
> ID for publication.
> 
> Kyle: Added unique IDs: cord_UID.  SHA is a hash of PDF, and sometimes 
> there are multiple PDFs for a single paper.
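A minimal sketch of the ID point above, not taken from the CORD-19 tooling: it assumes the SHA is a SHA-1 digest of the raw PDF bytes, and the paper IDs and PDF contents are made-up placeholders. It shows why a per-paper ID is needed on top of the per-file hash.

```python
import hashlib
from collections import defaultdict


def pdf_sha(pdf_bytes: bytes) -> str:
    """Return the hex digest of a PDF's raw bytes (SHA-1 is an assumption)."""
    return hashlib.sha1(pdf_bytes).hexdigest()


def group_by_paper(records):
    """Map each paper ID to the sorted list of distinct PDF hashes seen for it."""
    by_paper = defaultdict(set)
    for paper_id, pdf_bytes in records:
        by_paper[paper_id].add(pdf_sha(pdf_bytes))
    return {pid: sorted(hashes) for pid, hashes in by_paper.items()}


# Hypothetical records: (paper ID, PDF bytes). "paper-a" has two PDFs.
records = [
    ("paper-a", b"%PDF-1.4 main text"),
    ("paper-a", b"%PDF-1.4 supplement"),
    ("paper-b", b"%PDF-1.5 only file"),
]
index = group_by_paper(records)
for paper_id, hashes in index.items():
    print(paper_id, len(hashes), "PDF hash(es)")
```

The hash identifies a file, not a publication, so one paper can map to several hashes while the paper-level ID stays stable.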
> 
> Jim: DOIs?
> 
> Lucy: Some papers do not have a DOI.
> 
> Jim: Building a KG using generalized tools from another project, used 
> in many domains.  Looking to do drug repurposing using CORD-19.  Using 
> an extract of CORD-19.  Does deep extraction of named entities and 
> relationships.  Use PROV ont and nanopublications, for rich modeling and 
> provenance for probabilistic KG.  Arcs in picture are based on 
> confidence level.  Allows high precision on drugs that have been tested 
> on melanoma before.  Re-applying this to COVID-19.  We focus on open 
> ontologies, and not using FHIR.  Unpublished yet.  Page-rank based 
> analysis of pubmed citation graph, to compute community trust in a paper.
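A minimal sketch of PageRank over a citation graph, to illustrate the idea only. Jim's actual method is unpublished, so the damping factor, the iteration count, and the toy graph are all assumptions.

```python
def pagerank(citations, damping=0.85, iterations=50):
    """Score papers by PageRank.

    citations maps each paper to the list of papers it cites.
    Papers that cite nothing spread their score evenly over all papers.
    """
    papers = set(citations)
    for cited in citations.values():
        papers.update(cited)
    papers = sorted(papers)
    n = len(papers)
    rank = {p: 1.0 / n for p in papers}

    for _ in range(iterations):
        # Every paper starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in papers}
        for p in papers:
            cited = citations.get(p, [])
            if cited:
                share = damping * rank[p] / len(cited)
                for q in cited:
                    new_rank[q] += share
            else:
                share = damping * rank[p] / n
                for q in papers:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical citation graph: A and B cite C, D cites A.
toy = {"A": ["C"], "B": ["C"], "C": [], "D": ["A"]}
scores = pagerank(toy)
for paper in sorted(scores, key=scores.get, reverse=True):
    print(paper, round(scores[paper], 3))
```

A paper scores highly when it is cited by papers that are themselves well cited, which is one way to read "community trust".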
> 
> Guoqian: What ont?
> 
> Jim: Drugbank mostly.  Lots of targets.
> 
> Kyle: Relation-entity set.  Closed set?
> 
> Jim: We have drug graph, protein-protein interaction, and drugbank has 
> drug-protein interaction.  Molecular interaction.  CTD (Comparative 
> Toxicogenomics Database), Heng Ji Lab database started with it.
> 
> Kyle: Trying to add more KB entities?
> 
> Jim: Want to expand the interaction set.  Also entities.  We have all 
> human proteins and drugbank drugs.  If you have a drug with an effect on 
> a target similar protein in COVID-19, will there be hits, directly or 
> indirectly?  To do that, we want to score it also based on confidence in 
> the research.
> 
> Scott: My research approach is to integrate structured knowledge from 
> literature or other curated sources, and combine with observational data 
> to facilitate more reliable inference.  General idea is that contextual 
> info can help interpret and identify confounders.  Confounders are 
> common causes of the predictor and outcome.  What I did with CORD-19: 
> took pubmed IDs, and found what machine reading performed and created 
> KG.  Machine reading can run for months.  Jim's work on citation 
> analysis is cool.  Using semrep, developed by NLM, over titles and 
> abstracts in pubmed.  Using Pubmed central IDs from metadata table, in 
> beginning of March, 31k papers, with 28k in pubmed central.  Seemed like 
> a good place to start building a KG quickly, to see the big picture 
> quickly.  Pulled 106k semantic predications in the 21k docs, pulled into 
> cytoscape and computed network centrality, and from that ranked. Filtered 
> by biomedical entities, diseases, syndromes, amino acids, peptides or 
> pharm substances.  Ranked them by centrality to understand their 
> importance.  Very prelim analysis.  Interested to see how I might expand 
> on this and learn what others are doing.
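A minimal sketch of the rank-then-filter step described above. The notes do not say which centrality measure was used, so degree centrality is an assumption here, and the predications, entity names, and type labels are invented placeholders rather than SemRep output.

```python
from collections import defaultdict


def rank_by_degree(predications, entity_types, keep_types):
    """Rank entities by degree centrality, keeping only selected types.

    predications: iterable of (subject, predicate, object) triples.
    entity_types: maps an entity to its semantic type label.
    keep_types: set of type labels to keep in the final ranking.
    """
    neighbours = defaultdict(set)
    for subj, _pred, obj in predications:
        if subj != obj:  # ignore self-loops
            neighbours[subj].add(obj)
            neighbours[obj].add(subj)
    # Centrality is computed on the full graph; filtering happens afterwards.
    kept = [e for e in neighbours if entity_types.get(e) in keep_types]
    return sorted(kept, key=lambda e: (-len(neighbours[e]), e))


# Hypothetical predications and semantic types.
predications = [
    ("drug_a", "TREATS", "disease_x"),
    ("drug_b", "TREATS", "disease_x"),
    ("disease_x", "CAUSES", "syndrome_s"),
    ("place_p", "LOCATION_OF", "disease_x"),
    ("protein_p", "INTERACTS_WITH", "peptide_q"),
]
entity_types = {
    "drug_a": "pharm substance",
    "drug_b": "pharm substance",
    "disease_x": "disease",
    "syndrome_s": "syndrome",
    "place_p": "geographic area",
    "protein_p": "amino acid or peptide",
    "peptide_q": "amino acid or peptide",
}
keep = {"disease", "syndrome", "amino acid or peptide", "pharm substance"}
ranked = rank_by_degree(predications, entity_types, keep)
print(ranked)
```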
> 
> Guoqian: Can cytoscape support RDF graphs?  David: Yes.  Jim: Yes, and 
> you can form SPARQL queries to extract specific interactions.  Not 1:1 
> mapping of RDF graph to bio network.
> 
> Todor: There are different plugins, one is SPARQL endpoint.  Others for 
> other visualizations.  Keep expectations low.
> 
> Jim: It also includes a knowledge exploration interface, built on 
> cytoscape.js, a re-implementation of cytoscape.  The implementation I'm 
> using has some interface elements.
> 
> Lucy: How does Coronavirus ont relate?
> 
> Guoqian: Using this ont to annotate the papers.
> 
> Lucy: Where did that ont come from?
> 
> Jim: Built using OBO Foundry?  Guoqian: Yes.
> 
> Jim: We use OBO ont.  Oliver has a lot of tools for subsetting and 
> extracting for app ontologies.
> 
> Guoqian: Also collaborating with Cochrane PICO ontology, developing 
> COVID-19 PICO ont, specific subtypes of the high level types, eg, 
> subtypes of population with particular co-morbidities.  The ont is 
> also avail on github.
> 
> Guoqian: How to collaborate?  Need a registry for KG from this community?
> 
> Lucy: Working on semantic annotation of entity and rel.  Lots of people 
> are doing bottom-up annotation, without formal vocab, then linking to 
> UMLS.  But haven't seen COVID-19 ont.
> 
> Guoqian: Also should look at use cases that different groups have. 
> Community said they want open vocab instead of SNOMED-CT, such as UMLS.
> 
> Lucy: Also working with a group at AWS, KB of concepts, link to ICD-10 
> and RXNorm, also lots of requests for protein and interactions.
> 
> Guoqian: Also procedure datasets.
> 
> Lucy: What use cases are these projects addressing?
> 
> Guoqian: For EBMonFHIR, they are focused on review of evidence, and 
> clinical concepts.  Other team looking at using OBO ont to analyse DB to 
> mine underlying mechanisms.  Ideally we should have linkage across 
> vocabularies.  Eg UMLS can link many ont.  But for OBO it might be a 
> challenge.
> 
> Jim: From microbio perspective, most useful from this group would be 
> having cross mapping from clinical/FHIR/SNOMED-ish world and OBO bio 
> world, with translation between the two.  E.g. I use uniprot IDs.  Is 
> that a problem?  What about drug IDs?  IDs are the hardest part -- agree 
> on some, and mappings for others.
> 
> Guoqian: If we can provide a list of ont each team prefers, we can discuss.
> 
> Lucy: Would be great to be able to share annotations.  Centralized 
> vocab?  Central KB?  Use cases are key.
> 
> Scott: Mapping problems with COVID-19 are same as other mapping 
> problems.  Should have a central place to share projects.  Should keep 
> use cases in mind.
> 
> Sebastian: Please give us feedback on the dataset!
> 
> Todor: Focus on specific questions that you want to answer, then map 
> using common IDs to address them.
> 
> Daniel: What formats?  Right now we're using FHIR.  Use others?
> 
> Jim: identifiers.org might be useful for mapping.
> 
> David: Useful to have each group present use cases and vocab.
> 
> We'll meet weekly, same time, 1 hour.  Each group will present their 
> work in more detail, with focus on:
> what use cases they are addressing; and
> what vocabularies / ontologies they're using.
> 
> Each group will present for 20 min, followed by 10 min of questions.
> 
> ADJOURNED
Received on Tuesday, 21 April 2020 14:48:06 UTC