Re: CORD-19 semantic annotations - 11am Tuesday (Boston time) - Jin-Dong Kim (Schedule change) from David Booth on 2020-04-21 (public-semweb-lifesci@w3.org from April 2020)

From: David Booth <david@dbooth.org>
Date: Tue, 21 Apr 2020 13:32:58 -0400
To: w3c semweb HCLS <public-semweb-lifesci@w3.org>
Message-ID: <4daa9e4c-f04c-e78c-dada-40a482ecc80a@dbooth.org>
[Apologies for reaching the google hangout participant limit today, and 
thank you to Victor Mireles-Chavez for allowing us to switch over to his 
zoom instead!  I will find a better solution for next week.]

Below are meeting notes from today's call.  If you would like to present 
your work on CORD-19 semantic annotations, please email me so that I can 
put you on the schedule.  You do not need to have results yet.  Even if 
you are just starting out, it is helpful to learn what others are doing.

                            ----------------------------

MEETING NOTES 21-Apr-2020

Present: David Booth, Jin-Dong Kim, Víctor Mireles, Oliver Giles, Harry 
Hochheiser, Franck Michel, James Malone, Kyle Lo, Sebastian Kohlmeier, 
Guoqian Jiang, Gaurav Vaidya, Gollam Rabby, Oliver, Tomas Kliegr

Introductions

David Booth: Many years in semantic web technology, applying it to 
healthcare and life sciences for the past 10 years.  Involved in 
standardizing the RDF representation of HL7 FHIR: 
https://www.hl7.org/fhir/rdf.html

Gaurav: U of NC (https://renci.org/staff/gaurav-vaidya/), sem web tech, 
using CORD-19, trying to annotate ont terms as part of Robokop 
(https://robokop.renci.org/).

Harry: U of Pittsburgh, involved w W3C, drug-drug interaction, cancer 
information models, not actively using CORD-19.

James Malone: CTO SciBite in UK, provide sem enrichment tooling to 
pharma, KG building.  Background, applying ont to public data, machine 
learning, building ontologies.

Kyle: Researcher at Allen Institute, NLP, working on CORD-19.

Oliver: Machine learning at SciBite w James, NLP, machine learning.

Sebastian: Prog mgr at Allen Institute, CORD-19.

Tomas: Working on rule learning, trying to apply it to CORD-19.

Victor Mireles: Researcher at sem web co in Austria, looking at 
annotations that others have been doing on CORD-19, trying to make them 
match, and our own annotations.

Presentation by Jin-Dong Kim

Slides: 
https://docs.google.com/presentation/d/1ynoe1Xxc_-rTiebbvvuPBQMaktK-DX87McuDVaLbI1g/edit#slide=id.g726dbf02a0_0_0 


Jin-Dong: Tokyo, database center for life science, Japan gov funded, 
bioinformatics, NLP, text mining, esp biomedical literature.

(Jin-Dong presents his slides)

Jin-Dong: Using multiple datasets.  Multiple groups producing 
annotations, isolation.  PubAnnotation is a 10-year-old project to 
integrate annotations to literature.  Collecting annotations for 
COVID-19 literature to integrate and release them for other use. 
PubAnnotation is an open repo of biomed text annotations.  Anyone can 
submit to it.  All annotations are aligned to the canonical texts.

Jin-Dong: PubAnnotation also provides RESTful web services. Many 
annotators compatible with PubAnnotation.  Also collecting manual 
annotations using TextAE.

Q: Are the annotations from a controlled vocab or ont?  Jin-Dong: Both 
free text or from ont.

Jin-Dong: Every text span has a URL.  You can see what projects include 
a doc.  And you can choose a span of text and see what projects used 
that span.

Q: What is a project?  Jin-Dong: We collect any kind of annotations. 
Project identifies the source of people who have contributed annotations.

Jin-Dong: Annotations can be accessed via a span URL.  Also converting 
annotations into RDF.  Still experimenting.  Also have a search 
interface.  SPARQL queries.
https://covid19.pubannotation.org/
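
A minimal sketch of the span lookup described above, assuming the
annotations arrive in the PubAnnotation JSON shape ("text" plus
"denotations" with character-offset spans).  The sample document, its
ID, the labels, and the helper names denotations_in_span and span_url
are made up for illustration, and the URL layout in span_url is an
assumption that should be checked against the PubAnnotation docs.

```python
import json

# Hand-written sample in the assumed PubAnnotation JSON shape.
# The sourceid and the labels are invented for this example.
SAMPLE = json.loads("""
{
  "sourcedb": "PubMed",
  "sourceid": "00000000",
  "text": "Remdesivir was tested in patients with COVID-19.",
  "denotations": [
    {"id": "T1", "span": {"begin": 0, "end": 10}, "obj": "Drug"},
    {"id": "T2", "span": {"begin": 39, "end": 47}, "obj": "Disease"}
  ]
}
""")


def denotations_in_span(doc, begin, end):
    """Return (surface text, label) pairs lying fully inside [begin, end)."""
    hits = []
    for d in doc["denotations"]:
        s = d["span"]
        if begin <= s["begin"] and s["end"] <= end:
            hits.append((doc["text"][s["begin"]:s["end"]], d["obj"]))
    return hits


def span_url(doc, begin, end, base="http://pubannotation.org"):
    """Build a span URL. The path layout is an assumption, not confirmed."""
    return (f"{base}/docs/sourcedb/{doc['sourcedb']}"
            f"/sourceid/{doc['sourceid']}/spans/{begin}-{end}")


print(denotations_in_span(SAMPLE, 0, 20))  # [('Remdesivir', 'Drug')]
print(span_url(SAMPLE, 0, 20))
```

Filtering by offsets like this is one way to ask "what is annotated in
this span" once the JSON has been downloaded; the live service may
offer the same lookup directly.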

Jin-Dong: Trying to add annotations for temporal expressions.

Jin-Dong: Literature includes CORD-19 and LitCovid, from NCBI.  Uploaded 
all the text to PubAnnotation 
(http://pubannotation.org/collections/LitCovid).  Anyone can contribute. 
To contribute, you can download, annotate, then create a new project 
and add it to the LitCovid collection and it will appear.  Open 
platform.  Same setup for CORD-19.  Received 6 contributions so far. 
Need to analyze them.  Planning to call for wider contributions soon, 
maybe next week.  Plan to continuously update.

Guoqian: Any specific research questions using these annotations? 
Particular use cases?  Jin-Dong: Need to find out. Clinicians began with 
manual annotations.  Will figure out missing parts and try to fill the 
gaps.  Many annotations are concept annotations using ont -- many 
similar.  But we think there are still important missing annotations, 
such as temporal expressions.  Looking to add those.  Also quantitative 
traits annotations are missing.  Looking for those too.

Q: How might these be used?

Franck: I'm in Inria/CNRS/Univ Côte d'Azur, contacts with Inserm (French 
NIH) point to the need to search literature with questions like: "What 
are the papers that link Coronavirus with other diseases like diabetes 
or cancer?"

James: Released COVID-specific annotations. Pharma using them: looking 
for co-risk factors, or drugs interacting.  Comes down to: want to 
narrow down to a set of papers to read.  Anything that gets them to the 
paper.  Want to read the o

Franck: Summarizing the main claim of the paper helps also, to narrow 
down the search.

Victor: Drug-drug interactions.  There are many other KGs; to link to 
drug-drug or protein-protein interaction databases we need URIs.  Can we 
query PubAnnotation and get URIs from it, so I can see what drugs are 
mentioned in a given span?  Is this supported?

Jin-Dong: Group in China is working on annotations for drug 
repurposing.  I think they're using drug ont.

Franck: How can we consume the annotations that have been contributed? 
Jin-Dong: Download in JSON or CSV, or access as RDF.

Tomas: We detect entities, then try to do semantic extension.  Would 
there be a way to use this for semantic extension of entities, or get a 
list of highly specific concepts that appear in the article.  Jin-Dong: 
Yes, because they're in RDF, could do that.  Tomas: How to match doc in 
your DB with doc in other DB?  Jin-Dong: Every doc is identified by a 
pair: DB identifier, and ID within that DB.
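
A minimal sketch of matching documents across collections by that
pair.  The DocKey type, the shared_documents helper, and all IDs below
are hypothetical and only illustrate the idea; they are not part of
PubAnnotation.

```python
from typing import NamedTuple


class DocKey(NamedTuple):
    sourcedb: str  # database the document comes from, e.g. "PubMed" or "PMC"
    sourceid: str  # the document's ID within that database


def shared_documents(ours, theirs):
    """Return the keys that appear in both collections, sorted."""
    return sorted(set(ours) & set(theirs))


# Made-up IDs, for illustration only.
ours = [DocKey("PubMed", "111"), DocKey("PMC", "222")]
theirs = [DocKey("PMC", "222"), DocKey("PubMed", "333")]

common = shared_documents(ours, theirs)
print(common)
```

Keeping the database name in the key matters because the same ID
string can refer to different documents in different databases.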

Tomas: How many annotations average per document?  Jin-Dong: Conversion 
is not entirely done.  RDF statements only partially done.  Jin-Dong: in 
CORD-PICO, for 26k docs, 69k annotations for PICO.
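
As a rough worked answer to the per-document question, using only the
rounded CORD-PICO figures quoted above (so the result is approximate
and covers PICO annotations only):

```python
# Rounded figures quoted in the notes for the CORD-PICO project.
num_docs = 26_000
num_annotations = 69_000

avg_per_doc = num_annotations / num_docs
print(f"about {avg_per_doc:.2f} PICO annotations per document")
```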

ADJOURNED

-----------------------------------------------------------------------

On 4/21/20 10:47 AM, David Booth wrote:
> Last minute schedule change for today's call: Instead of Scott Malec, 
> Jin-Dong Kim will present his work on "An open collaboration for richly 
> annotating Covid-19 Literature".  Slides are here:
> https://docs.google.com/presentation/d/1ynoe1Xxc_-rTiebbvvuPBQMaktK-DX87McuDVaLbI1g/edit#slide=id.g726dbf02a0_0_0 
> 
> 
> David Booth
> 
> On 4/20/20 11:56 AM, David Booth wrote:
>> Tomorrow (Tuesday) 11am Boston time Scott Malec will discuss his work 
>> on computable knowledge extraction using the CORD-19 dataset that was 
>> released by the Allen Institute.
>>
>> We will use this google hangout:
>> http://tinyurl.com/fhirrdf
>>
>> More on Scott's work:
>> https://github.com/fhircat/CORD-19-on-FHIR/wiki/CORD-19-Semantic-Annotation-Projects#project-name-cord-semantictriples 
>>
>>
>> We still have time for one other presentation tomorrow about CORD-19 
>> semantic annotation.  If anyone else is ready (with slides) to present 
>> for 20 minutes, please let me know.
>>
>> Thanks,
>> David Booth
>>
>> -----------------------------------------------
>>
>> MEETING NOTES 7-Apr-2020
>> Present: David Booth <david@dbooth.org>, Sebastian Kohlmeier 
>> <sebastiank@allenai.org>, Lucy Lu Wang <lucyw@allenai.org>, Kyle Lo 
>> <kylel@allenai.org>, Jim McCusker <mccusker@gmail.com>, Scott Malec 
>> <sam413@pitt.edu>, Guoqian Jiang <jiang.guoqian@mayo.edu>, Todor 
>> Primov <todor.primov@ontotext.com>
>>
>> Sebastian: Allen Institute, Semantic Scholar, Non-profit AI institute, 
>> w Lucy and Kyle.  Engaged in COVID-19 because as non-profit could 
>> develop a corpus that we can make available.  Created CORD-19 
>> dataset.  Goal: Standardized format that's easy for machines to read, 
>> to enable quick analysis of the literature.  Working to extend it.  
>> Weekly updates, but want to get to daily updates.  Want to also get to 
>> entity and relation extraction.
>>
>> Guoqian: Identifiers used?  SHA numbers for full text, but also IDs 
>> linked to DOIs and Pubmed IDs.  Should discuss best way to have unique 
>> ID for publication.
>>
>> Kyle: Added unique IDs: cord_UID.  SHA is a hash of PDF, and sometimes 
>> there are multiple PDFs for a single paper.
>>
>> Jim: DOIs?
>>
>> Lucy: Some papers do not have a DOI.
>>
>> Jim: Building a KG using generalized tools from another project, used 
>> in many domains.  Looking to do drug repurposing using CORD-19.  Using 
>> an extract of CORD-19.  Does deep extraction of named entities and 
>> relationships.  Use PROV ont and nanopublications, for rich modeling 
>> and provenance for probabilistic KG.  Arcs in picture are based on 
>> confidence level.  Allows high precision on drugs that have been 
>> tested on melanoma before.  Re-applying this to COVID-19.  We focus on 
>> open ontologies, and not using FHIR.  Unpublished yet.  Page-rank 
>> based analysis of pubmed citation graph, to compute community trust in 
>> a paper.
>>
>> Guoqian: What ont?
>>
>> Jim: Drugbank mostly.  Lots of targets.
>>
>> Kyle: Relation-entity set.  Closed set?
>>
>> Jim: We have drug graph, protein-protein interaction, and drugbank has 
>> drug-protein interaction.  Molecular interaction.  CTD (Comparative 
>> Toxicogenomics Database), Heng Ji Lab database started with it.
>>
>> Kyle: Trying to add more KB entities?
>>
>> Jim: Want to expand the interaction set.  Also entities.  We have all 
>> human proteins and drugbank drugs.  If you have a drug with an effect 
>> on a target similar protein in COVID-19, will there be hits, directly 
>> or indirectly?  To do that, we want to score it also based on 
>> confidence in the research.
>>
>> Scott: My research approach is to integrate structured knowledge from 
>> literature or other curated sources, and combine with observational 
>> data to facilitate more reliable inference.  General idea is that 
>> contextual info can help interpret and identify confounders.  
>> Confounders are common causes of the predictor and outcome.  What I 
>> did with CORD-19: took pubmed IDs, and found what machine reading 
>> performed and created KG.  Machine reading can run for months.  Jim's 
>> work on citation analysis is cool.  Using semrep, developed by NLM, 
>> over titles and abstracts in pubmed.  Using Pubmed central IDs from 
>> metadata table, in beginning of March, 31k papers, with 28k in pubmed 
>> central.  Seemed like a good place to start building a KG quickly, to 
>> see the big picture quickly.  Pulled 106k semantic predications in the 
>> 21k docs, pulled into cytoscape and computed network centrality, and 
>> from that ranked. Filtered by biomedical entities, diseases, syndromes, 
>> amino acids, peptides or pharm substances.  Ranked them by centrality 
>> to understand their importance.  Very prelim analysis.  Interested to 
>> see how I might expand on this and learn what others are doing.
>>
>> Guoqian: Can cytoscape support RDF graphs?  David: Yes.  Jim: Yes, and 
>> you can form SPARQL queries to extract specific interactions.  Not 1:1 
>> mapping of RDF graph to bio network.
>>
>> Todor: There are different plugins, one is SPARQL endpoint.  Others 
>> for other visualizations.  Keep expectations low.
>>
>> Jim: It also includes a knowledge exploration interface, built on 
>> cytoscape.js, a re-implementation of cytoscape.  The implementation 
>> I'm using has some interface element.
>>
>> Lucy: How does Coronavirus ont relate?
>>
>> Guoqian: Using this ont to annotate the papers.
>>
>> Lucy: Where did that ont come from?
>>
>> Jim: Built using OBO Foundry?  Guoqian: Yes.
>>
>> Jim: We use OBO ont.  Oliver has a lot of tools for subsetting and 
>> extracting for app ontologies.
>>
>> Guoqian: Also collaborating with Cochrane PICO ontology, developing 
>> COVID-19 PICO ont, specific subtypes of the high level types, eg, 
>> subtypes of population with particular co-morbidities.  The ont is 
>> also avail on github.
>>
>> Guoqian: How to collaborate?  Need a registry for KG from this community?
>>
>> Lucy: Working on semantic annotation of entity and rel.  Lots of 
>> people are doing bottom-up annotation, without formal vocab, then 
>> linking to UMLS.  But haven't seen COVID-19 ont.
>>
>> Guoqian: Also should look at use cases that different groups have. 
>> Community said they want open vocab, such as UMLS, instead of SNOMED-CT.
>>
>> Lucy: Also working with a group at AWS, KB of concepts, link to ICD-10 
>> and RXNorm, also lots of requests for protein and interactions.
>>
>> Guoqian: Also procedure datasets.
>>
>> Lucy: What use cases are these projects addressing?
>>
>> Guoqian: For EBMonFHIR, they are focused on review of evidence, and 
>> clinical concepts.  Other team looking at using OBO ont to analyse DB 
>> to mine underlying mechanisms.  Ideally we should have linkage across 
>> vocabularies.  Eg UMLS can link many ont.  But for OBO it might be a 
>> challenge.
>>
>> Jim: From microbio perspective, most useful from this group would be 
>> having cross mapping from clinical/FHIR/SNOMED-ish world and OBO bio 
>> world, with translation between the two.  E.g. I use uniprot IDs.  Is 
>> that a problem?  What about drug IDs?  IDs are the hardest part -- 
>> agree on some, and mappings for others.
>>
>> Guoqian: If we can provide a list of ont each team prefers, we can 
>> discuss.
>>
>> Lucy: Would be great to be able to share annotations.  Centralized 
>> vocab?  Central KB?  Use cases are key.
>>
>> Scott: Mapping problems with COVID-19 are same as other mapping 
>> problems.  Should have a central place to share projects.  Should keep 
>> use cases in mind.
>>
>> Sebastian: Please give us feedback on the dataset!
>>
>> Todor: Focus on specific questions that you want to answer, then map 
>> using common IDs to address them.
>>
>> Daniel: What formats?  Right now we're using FHIR.  Use others?
>>
>> Jim: identifiers.org might be useful for mapping.
>>
>> David: Useful to have each group present use cases and vocab.
>>
>> We'll meet weekly, same time, 1 hour.  Each group will present their 
>> work in more detail, with focus on:
>> what use cases they are addressing; and
>> what vocabularies / ontologies they're using.
>>
>> Each group will present for 20 min, followed by 10 min of questions.
>>
>> ADJOURNED
Received on Tuesday, 21 April 2020 17:33:13 UTC