Re: CORD-19 semantic annotations - 11am Tuesday (Boston time)

If anyone else is doing semantic annotation of the CORD-19 dataset from 
Allen Institute, please let me know.

  -----------------------------------------------------------

Notes from today's call:

Tomas and Gollam presented their work: 
https://docs.google.com/presentation/d/1eX9eTb0C8roy7pYK8li3V5YcWhFByN6is6hz_b62AcA/edit?usp=sharing
Two projects.  The first does entity extraction on CORD-19 to build an 
occurrence matrix of which entities appear in which articles, then feeds 
this into EasyMiner to discover which entities cause an article to be 
more highly cited.
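
A minimal sketch of the occurrence-matrix step, assuming entities have 
already been extracted per article.  The article IDs and entity names 
are made up, and this is not the presenters' code:

```python
# Hypothetical input: article ID -> entities extracted from that article.
article_entities = {
    "paper_a": ["ACE2", "remdesivir", "ACE2"],
    "paper_b": ["remdesivir"],
    "paper_c": ["ACE2", "spike protein"],
}


def occurrence_matrix(article_entities):
    """Return (articles, entities, rows), where rows[i][j] is 1 if
    entity j occurs at least once in article i, else 0."""
    articles = sorted(article_entities)
    entities = sorted({e for ents in article_entities.values() for e in ents})
    column = {e: j for j, e in enumerate(entities)}
    rows = []
    for article in articles:
        row = [0] * len(entities)
        for entity in article_entities[article]:
            row[column[entity]] = 1  # repeated mentions still count once
        rows.append(row)
    return articles, entities, rows


articles, entities, rows = occurrence_matrix(article_entities)
```

Each row could then be joined with a citation count for that article 
before being handed to a rule miner.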

The second project works directly from RDF or relational data to derive 
rules from the data.  Implemented in Scala.

Q: What RDF dataset are you using? Tomas: Only recently applying this to 
COVID, using this dataset: 
https://github.com/Knowledge-Graph-Hub/kg-covid-19

Tomas: Can also predict missing triples (and create them) using rules 
that were derived from the data.  Citations are from MS.
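
To illustrate the idea of predicting missing triples from a rule, here 
is a small sketch.  The rule shape (a two-atom chain), the predicate 
names and the triples are all assumptions for illustration.  In the 
presented work the rules are derived from the data, whereas here one is 
simply written by hand:

```python
# Hypothetical triples (subject, predicate, object); not from kg-covid-19.
triples = {
    ("drugA", "targets", "protein1"),
    ("protein1", "interacts_with", "protein2"),
    ("drugB", "targets", "protein2"),
}


def apply_rule(triples, body1, body2, head):
    """Apply the chain rule
        (?x body1 ?y) AND (?y body2 ?z)  =>  (?x head ?z)
    and return the inferred triples that are not already present."""
    # Index the second body atom by its subject for quick joins.
    objects_by_subject = {}
    for s, p, o in triples:
        if p == body2:
            objects_by_subject.setdefault(s, set()).add(o)

    inferred = set()
    for s, p, o in triples:
        if p == body1:
            for z in objects_by_subject.get(o, ()):
                candidate = (s, head, z)
                if candidate not in triples:
                    inferred.add(candidate)
    return inferred


new_triples = apply_rule(triples, "targets", "interacts_with", "may_affect")
```

A real rule miner would also attach a confidence to each rule, so that 
predicted triples can be ranked rather than all accepted.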

Q: Ideas for how this might be used?  Scott: Bibliometric part was 
interesting, ranking confidence in articles, as a proxy for 
authoritativeness.  Spreading activation of articles that cite other 
articles.  Might allow us to see what the hub articles are, in the web 
of influence.  Perhaps to see what journals are involved.  Expand to 
look at what semantic types different journals work with.  Interested to 
see the scopus data with bibliometric info.

David: What RDF vocabularies are you using?  Tomas: Whatever 
vocabularies are in the data.  Our tool does not currently support OWL.

Scott: Looks like scispaCy has embeddings that are compatible with some 
of the standard biomed vocabularies.  Gollam: Looking into that module, 
but only finding a few of those entities in the articles.  Also using 
ConceptNet and experimenting with DBpedia.  David: This also points to 
the need for other groups, who are doing semantic annotation, to 
generate the annotations using those established biomed vocabularies, so 
that your work can pick up on them.

ADJOURNED

On 4/13/20 4:47 PM, David Booth wrote:
> Tomorrow (Tuesday) 11am Boston time Tomáš Kliegr and Gollam Rabb from 
> VSE University in Prague will discuss their work on extracting 
> associations from the CORD-19 dataset that was released by the Allen 
> Institute.
> 
> We will use this google hangout:
> http://tinyurl.com/fhirrdf
> 
> Below are notes from last week's call.
> 
> Please let me know if you are using CORD-19 so that I can add you to our 
> list.
> 
> Thanks,
> David Booth
> 
> -----------------------------------------------
> 
> MEETING NOTES 7-Apr-2020
> Present: David Booth <david@dbooth.org>, Sebastian Kohlmeier 
> <sebastiank@allenai.org>, Lucy Lu Wang <lucyw@allenai.org>, Kyle Lo 
> <kylel@allenai.org>, Jim McCusker <mccusker@gmail.com>, Scott Malec 
> <sam413@pitt.edu>, Guoqian Jiang <jiang.guoqian@mayo.edu>, Todor Primov 
> <todor.primov@ontotext.com>
> 
> Sebastian: Allen Institute, Semantic Scholar, non-profit AI institute, 
> with Lucy and Kyle.  Engaged in COVID-19 because as a non-profit we could 
> develop a corpus that we can make available.  Created CORD-19 dataset.  
> Goal: Standardized format that's easy for machines to read, to enable 
> quick analysis of the literature.  Working to extend it.  Weekly updates, 
> but want to get to daily updates.  Want to also get to entity and 
> relation extraction.
> 
> Guoqian: Identifiers used?  SHA numbers for full text, but also IDs 
> linked to DOIs and Pubmed IDs.  Should discuss best way to have unique 
> ID for publication.
> 
> Kyle: Added unique IDs: cord_UID.  SHA is a hash of PDF, and sometimes 
> there are multiple PDFs for a single paper.
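
A small sketch of why a content hash alone cannot identify a paper: one 
paper ID can map to several PDF hashes.  SHA-1 as the hash function, 
the lowercase name cord_uid and the records below are assumptions for 
illustration:

```python
import hashlib
from collections import defaultdict


def pdf_sha(pdf_bytes):
    """Hex digest of the PDF content (SHA-1 is an assumption here)."""
    return hashlib.sha1(pdf_bytes).hexdigest()


# Hypothetical records: (cord_uid, raw PDF bytes).
records = [
    ("uid001", b"%PDF-1.4 main text"),
    ("uid001", b"%PDF-1.4 supplement"),
    ("uid002", b"%PDF-1.4 other paper"),
]

# One paper-level ID can point at several PDF hashes.
shas_by_uid = defaultdict(list)
for cord_uid, pdf in records:
    shas_by_uid[cord_uid].append(pdf_sha(pdf))
```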
> 
> Jim: DOIs?
> 
> Lucy: Some papers do not have a DOI.
> 
> Jim: Building a KG using generalized tools from another project, used 
> in many domains.  Looking to do drug repurposing using CORD-19.  Using 
> an extract of CORD-19.  Does deep extraction of named entities and 
> relationships.  Use PROV ont and nanopublications, for rich modeling and 
> provenance for probabilistic KG.  Arcs in picture are based on 
> confidence level.  Allows high precision on drugs that have been tested 
> on melanoma before.  Re-applying this to COVID-19.  We focus on open 
> ontologies, and not using FHIR.  Unpublished yet.  Page-rank based 
> analysis of pubmed citation graph, to compute community trust in a paper.
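
For readers unfamiliar with PageRank on a citation graph, a minimal 
power-iteration sketch follows.  The damping factor, iteration count 
and toy graph are assumptions, and this is not Jim's implementation:

```python
def pagerank(citations, damping=0.85, iterations=50):
    """Power-iteration PageRank over a citation graph.

    citations maps each paper to the list of papers it cites."""
    papers = set(citations)
    for cited in citations.values():
        papers.update(cited)
    n = len(papers)
    rank = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in papers}
        for p in papers:
            cited = citations.get(p, [])
            if cited:
                share = damping * rank[p] / len(cited)
                for q in cited:
                    new_rank[q] += share
            else:
                # Dangling paper (cites nothing): spread its rank evenly.
                share = damping * rank[p] / n
                for q in papers:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical citation graph: A and B cite C, C cites D.
citations = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
rank = pagerank(citations)
```

Papers that are cited by well-cited papers end up with higher scores, 
which is the sense in which the score can stand in for community trust.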
> 
> Guoqian: What ont?
> 
> Jim: Drugbank mostly.  Lots of targets.
> 
> Kyle: Relation-entity set.  Closed set?
> 
> Jim: We have drug graph, protein-protein interaction, and drugbank has 
> drug-protein interaction.  Molecular interaction.  CTD (Comparative 
> Toxicogenomics Database); Heng Ji Lab database started with it.
> 
> Kyle: Trying to add more KB entities?
> 
> Jim: Want to expand the interaction set.  Also entities.  We have all 
> human proteins and drugbank drugs.  If you have a drug with an effect on 
> a target similar to a protein in COVID-19, will there be hits, directly or 
> indirectly?  To do that, we want to score it also based on confidence in 
> the research.
> 
> Scott: My research approach is to integrate structured knowledge from 
> literature or other curated sources, and combine with observational data 
> to facilitate more reliable inference.  General idea is that contextual 
> info can help interpret and identify confounders.  Confounders are 
> common causes of the predictor and outcome.  What I did with CORD-19: 
> took pubmed IDs, and found what machine reading performed and created 
> KG.  Machine reading can run for months.  Jim's work on citation 
> analysis is cool.  Using semrep, developed by NLM, over titles and 
> abstracts in pubmed.  Using Pubmed central IDs from metadata table, in 
> beginning of March, 31k papers, with 28k in pubmed central.  Seemed like 
> a good place to start building a KG quickly, to see the big picture 
> quickly.  Pulled 106k semantic predications in the 21k docs, pulled into 
> cytoscape and computed network centrality, and from that ranked.  Filtered 
> by biomedical entities, diseases, syndromes, amino acids, peptides or 
> pharm substances.  Ranked them by centrality to understand their 
> importance.  Very prelim analysis.  Interested to see how I might expand 
> on this and learn what others are doing.
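
A rough sketch of ranking entities by centrality over a set of 
predications.  The notes do not say which centrality measure was used, 
so degree centrality is an assumption here, and the predications are 
invented examples:

```python
from collections import defaultdict

# Hypothetical predications in (subject, predicate, object) form.
predications = [
    ("ACE2", "INTERACTS_WITH", "spike protein"),
    ("remdesivir", "TREATS", "COVID-19"),
    ("ACE2", "ASSOCIATED_WITH", "COVID-19"),
    ("chloroquine", "TREATS", "COVID-19"),
]


def degree_centrality(predications):
    """Fraction of the other nodes each node is directly linked to,
    treating predications as undirected edges."""
    neighbors = defaultdict(set)
    for s, _, o in predications:
        if s != o:  # ignore self-loops
            neighbors[s].add(o)
            neighbors[o].add(s)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}


centrality = degree_centrality(predications)
# Highest centrality first; ties broken alphabetically.
ranked = sorted(centrality, key=lambda node: (-centrality[node], node))
```

Filtering by semantic type, as described above, would simply drop 
predications whose subject or object is not of an allowed type before 
building the graph.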
> 
> Guoqian: Can cytoscape support RDF graphs?  David: Yes.  Jim: Yes, and 
> you can form SPARQL queries to extract specific interactions.  Not 1:1 
> mapping of RDF graph to bio network.
> 
> Todor: There are different plugins, one is SPARQL endpoint.  Others for 
> other visualizations.  Keep expectations low.
> 
> Jim: It also includes a knowledge exploration interface, built on 
> cytoscape.js, a re-implementation of cytoscape.  The implementation I'm 
> using has some interface element.
> 
> Lucy: How does Coronavirus ont relate?
> 
> Guoqian: Using this ont to annotate the papers.
> 
> Lucy: Where did that ont come from?
> 
> Jim: Built using OBO Foundry?  Guoqian: Yes.
> 
> Jim: We use OBO ont.  Oliver has a lot of tools for subsetting and 
> extracting for app ontologies.
> 
> Guoqian: Also collaborating with Cochrane PICO ontology, developing 
> COVID-19 PICO ont, specific subtypes of the high level types, eg, 
> subtypes of population with particular co-morbidities.  The ont is 
> also avail on github.
> 
> Guoqian: How to collaborate?  Need a registry for KG from this community?
> 
> Lucy: Working on semantic annotation of entity and rel.  Lots of people 
> are doing bottom-up annotation, without formal vocab, then linking to 
> UMLS.  But haven't seen COVID-19 ont.
> 
> Guoqian: Also should look at use cases that different groups have. 
> Community said they want open vocab, such as UMLS, instead of SNOMED-CT.
> 
> Lucy: Also working with a group at AWS, KB of concepts, link to ICD-10 
> and RxNorm, also lots of requests for protein and interactions.
> 
> Guoqian: Also procedure datasets.
> 
> Lucy: What use cases are these projects addressing?
> 
> Guoqian: For EBMonFHIR, they are focused on review of evidence, and 
> clinical concepts.  Other team looking at using OBO ont to analyse DB to 
> mine underlying mechanisms.  Ideally we should have linkage across 
> vocabularies.  Eg UMLS can link many ont.  But for OBO it might be a 
> challenge.
> 
> Jim: From microbio perspective, most useful from this group would be 
> having cross mapping from clinical/FHIR/SNOMED-ish world and OBO bio 
> world, with translation between the two.  E.g. I use uniprot IDs.  Is 
> that a problem?  What about drug IDs?  IDs are the hardest part -- agree 
> on some, and mappings for others.
> 
> Guoqian: If we can provide a list of ont each team prefers, we can discuss.
> 
> Lucy: Would be great to be able to share annotations.  Centralized 
> vocab?  Central KB?  Use cases are key.
> 
> Scott: Mapping problems with COVID-19 are same as other mapping 
> problems.  Should have a central place to share projects.  Should keep 
> use cases in mind.
> 
> Sebastian: Please give us feedback on the dataset!
> 
> Todor: Focus on specific questions that you want to answer, then map 
> using common IDs to address them.
> 
> Daniel: What formats?  Right now we're using FHIR.  Use others?
> 
> Jim: identifiers.org might be useful for mapping.
> 
> David: Useful to have each group present use cases and vocab.
> 
> We'll meet weekly, same time, 1 hour.  Each group will present their 
> work in more detail, with focus on:
> what use cases they are addressing; and
> what vocabularies / ontologies they're using.
> 
> Each group will have 20 min to present, plus 10 min for questions.
> 
> ADJOURNED

Received on Tuesday, 14 April 2020 16:05:40 UTC