CORD-19 semantic annotations - 11am Tuesday (Boston time) - Victor Mireles on RDFizing CORD-19 annotations from David Booth on 2020-04-27 (public-semweb-lifesci@w3.org from April 2020)

From: David Booth <david@dbooth.org>
Date: Mon, 27 Apr 2020 12:04:59 -0400
To: w3c semweb HCLS <public-semweb-lifesci@w3.org>
Cc: Victor Mireles <victor.mireles@semantic-web.com>
Message-ID: <8c89d311-3cb7-fc76-cf06-17655517ee43@dbooth.org>
Tomorrow (Tuesday) at 11am Boston time, Victor Mireles will present his 
work on RDFizing several existing annotations of the CORD-19 dataset that 
use different vocabularies. Current vocabularies: Gene Ontology, ChEBI, 
Human Phenotype Ontology, MeSH disease.

Details for joining the call will be posted in a follow-up message.

Thanks,
David Booth

On 4/21/20 1:32 PM, David Booth wrote:
> [Apologies for reaching the google hangout participant limit today, and 
> thank you to Victor Mireles-Chavez for allowing us to switch over to his 
> zoom instead!  I will find a better solution for next week.]
> 
> Below are meeting notes from today's call.  If you would like to present 
> your work on CORD-19 semantic annotations, please email me so that I can 
> put you on the schedule.  You do not need to have results yet.  Even if 
> you are just starting out, it is helpful to learn what others are doing.
> 
>                             ----------------------------
> 
> MEETING NOTES 21-Apr-2020
> 
> Present: David Booth, Jin-Dong Kim, Víctor Mireles, Oliver Giles, Harry 
> Hochheiser, Franck Michel, James Malone, Kyle Lo, Sebastian Kohlmeier, 
> Guoqian Jiang, Gaurav Vaidya, Gollam Rabby, Oliver, Tomas Kliegr
> 
> Introductions
> 
> David Booth: Many years in semantic web technology, applying it to 
> healthcare and life sciences for the past 10 years.  Involved in 
> standardizing the RDF representation of HL7 FHIR: 
> https://www.hl7.org/fhir/rdf.html
> 
> Gaurav: U of NC (https://renci.org/staff/gaurav-vaidya/), sem web tech, 
> using CORD-19, trying to annotate ont terms as part of Robokop 
> (https://robokop.renci.org/).
> 
> Harry: U of Pittsburgh, involved w W3C, drug-drug interaction, cancer 
> information models, not actively using CORD-19.
> 
> James Malone: CTO SciBite in UK, provide sem enrichment tooling to 
> pharma, KG building.  Background, applying ont to public data, machine 
> learning, building ontologies.
> 
> Kyle: Researcher at Allen Institute, NLP, working on CORD-19.
> 
> Oliver: Machine learning at SciBite w James, NLP, machine learning.
> 
> Sebastian: Prog mgr at Allen Institute, CORD-19.
> 
> Tomas: Working on rule learning, trying to apply it to CORD-19.
> 
> Victor Mireles: Researcher at sem web co in Austria, looking at 
> annotations that others have been doing on CORD-19, trying to make them 
> match, and our own annotations.
> 
> Presentation by Jin-Dong Kim
> 
> Slides: 
> https://docs.google.com/presentation/d/1ynoe1Xxc_-rTiebbvvuPBQMaktK-DX87McuDVaLbI1g/edit#slide=id.g726dbf02a0_0_0 
> 
> 
> Jin-Dong: Tokyo, database center for life science, Japan gov funded, 
> bioinformatics, NLP, text mining, esp biomedical literature.
> 
> (Jin-Dong presents his slides)
> 
> Jin-Dong: Using multiple datasets.  Multiple groups producing 
> annotations, isolation.  PubAnnotation is a 10-year-old project to 
> integrate annotations to literature.  Collecting annotations for 
> COVID-19 literature to integrate and release them for other use. 
> PubAnnotation is an open repo of biomed text annotations.  Anyone can 
> submit to it.  All annotations are aligned to the canonical texts.
> 
> Jin-Dong: PubAnnotation also provides RESTful web services. Many 
> annotators compatible with PubAnnotation.  Also collecting manual 
> annotations using TextAE.
> 
> Q: Are the annotations from a controlled vocab or ont?  Jin-Dong: Both 
> free text and from ont.
> 
> Jin-Dong: Every text span has a URL.  You can see what projects include 
> a doc.  And you can choose a span of text and see what projects used 
> that span.
> 
> Q: What is a project?  Jin-Dong: We collect any kind of annotations. 
> Project identifies the source of people who have contributed annotations.
> 
> Jin-Dong: Annotations can be accessed via a span URL.  Also converting 
> annotations into RDF.  Still experimenting.  Also have a search 
> interface.  SPARQL queries.
> https://covid19.pubannotation.org/
> 
> Jin-Dong: Trying to add annotations for temporal expressions.
> 
> Jin-Dong: Literature includes CORD-19 and LitCovid, from NCBI.  Uploaded 
> all the text to PubAnnotation 
> (http://pubannotation.org/collections/LitCovid).  Anyone can contribute. 
> To contribute, you can download, annotate, then create a new project 
> and add it to the LitCovid collection and it will appear.  Open 
> platform.  Same setup for CORD-19.  Received 6 contributions so far. 
> Need to analyze them.  Planning to call for wider contributions soon, 
> maybe next week.  Plan to continuously update.
> 
> Guoqian: Any specific research questions using these annotations? 
> Particular use cases?  Jin-Dong: Need to find out. Clinicians began with 
> manual annotations.  Will figure out missing parts and try to fill the 
> gaps.  Many annotations are concept annotations using ont -- many 
> similar.  But we think there are still important missing annotations, 
> such as temporal expressions.  Looking to add those.  Also quantitative 
> traits annotations are missing.  Looking for those too.
> 
> Q: How might these be used?
> 
> Franck: I'm in Inria/CNRS/Univ Côte d'Azur, contacts with Inserm (French 
> NIH) point to the need to search literature with questions like: "What 
> are the papers that link Coronavirus with other diseases like diabetes 
> or cancer?"
> 
> James: Released COVID-specific annotations. Pharma using them: looking 
> for co-risk factors, or drugs interacting.  Comes down to: want to 
> narrow down to a set of papers to read.  Anything that gets them to the 
> paper.  Want to read the o
> 
> Franck: Summarizing the main claim of the paper helps also, to narrow 
> down the search.
> 
> Victor: Drug-drug interactions.  There are many other KGs; to link to 
> drug-drug or protein-protein interaction databases we need URIs.  Can 
> we query PubAnnotation and get URIs from it, so I can see what drugs 
> are mentioned in this span?  Is this supported?
> 
> Jin-Dong: Group in China is working on annotations for drug repurposing. 
> I think they're using drug ont.
> 
> Franck: How can we consume the annotations that have been contributed? 
> Jin-Dong: Download in JSON or CSV, or access as RDF.
> 
> Tomas: We detect entities, then try to do semantic extension.  Would 
> there be a way to use this for semantic extension of entities, or get a 
> list of highly specific concepts that appear in the article?  Jin-Dong: 
> Yes, because they're in RDF, could do that.  Tomas: How to match doc in 
> your DB with doc in other DB?  Jin-Dong: Every doc is identified by a 
> pair: DB identifier, and ID within that DB.
> 
> Tomas: How many annotations average per document?  Jin-Dong: Conversion 
> is not entirely done.  RDF statements only partially done.  Jin-Dong: in 
> CORD-PICO, for 26k docs, 69k annotations for PICO.
> 
> ADJOURNED
> 
> -----------------------------------------------------------------------
> 
> On 4/21/20 10:47 AM, David Booth wrote:
>> Last minute schedule change for today's call: Instead of Scott Malec, 
>> Jin-Dong Kim will present his work on "An open collaboration for 
>> richly annotating Covid-19 Literature".  Slides are here:
>> https://docs.google.com/presentation/d/1ynoe1Xxc_-rTiebbvvuPBQMaktK-DX87McuDVaLbI1g/edit#slide=id.g726dbf02a0_0_0 
>>
>>
>> David Booth
>>
>> On 4/20/20 11:56 AM, David Booth wrote:
>>> Tomorrow (Tuesday) 11am Boston time Scott Malec will discuss his work 
>>> on computable knowledge extraction using the CORD-19 dataset that was 
>>> released by the Allen Institute.
>>>
>>> We will use this google hangout:
>>> http://tinyurl.com/fhirrdf
>>>
>>> More on Scott's work:
>>> https://github.com/fhircat/CORD-19-on-FHIR/wiki/CORD-19-Semantic-Annotation-Projects#project-name-cord-semantictriples 
>>>
>>>
>>> We still have time for one other presentation tomorrow about CORD-19 
>>> semantic annotation.  If anyone else is ready (with slides) to 
>>> present for 20 minutes, please let me know.
>>>
>>> Thanks,
>>> David Booth
>>>
>>> -----------------------------------------------
>>>
>>> MEETING NOTES 7-Apr-2020
>>> Present: David Booth <david@dbooth.org>, Sebastian Kohlmeier 
>>> <sebastiank@allenai.org>, Lucy Lu Wang <lucyw@allenai.org>, Kyle Lo 
>>> <kylel@allenai.org>, Jim McCusker <mccusker@gmail.com>, Scott Malec 
>>> <sam413@pitt.edu>, Guoqian Jiang <jiang.guoqian@mayo.edu>, Todor 
>>> Primov <todor.primov@ontotext.com>
>>>
>>> Sebastian: Allen Institute, Semantic Scholar, Non-profit AI 
>>> institute, w Lucy and Kyle.  Engaged in COVID-19 because as 
>>> non-profit could develop a corpus that we can make available.  
>>> Created CORD-19 dataset.  Goal: Standardized format that's easy for 
>>> machines to read, to enable quick analysis of the literature.  
>>> Working to extend it. Weekly updates, but want to get to daily 
>>> updates.  Want to also get to entity and relation extraction.
>>>
>>> Guoqian: Identifiers used?  SHA numbers for full text, but also IDs 
>>> linked to DOIs and Pubmed IDs.  Should discuss best way to have 
>>> unique ID for publication.
>>>
>>> Kyle: Added unique IDs: cord_UID.  SHA is a hash of PDF, and 
>>> sometimes there are multiple PDFs for a single paper.
>>>
>>> Jim: DOIs?
>>>
>>> Lucy: Some papers do not have a DOI.
>>>
>>> Jim: Building a KG using generalized tools from another project, 
>>> used in many domains.  Looking to do drug repurposing using CORD-19.  
>>> Using an extract of CORD-19.  Does deep extraction of named entities 
>>> and relationships.  Use PROV ont and nanopublications, for rich 
>>> modeling and provenance for probabilistic KG.  Arcs in picture are 
>>> based on confidence level.  Allows high precision on drugs that have 
>>> been tested on melanoma before.  Re-applying this to COVID-19.  We 
>>> focus on open ontologies, and not using FHIR.  Unpublished yet.  
>>> Page-rank based analysis of pubmed citation graph, to compute 
>>> community trust in a paper.
>>>
>>> Guoqian: What ont?
>>>
>>> Jim: Drugbank mostly.  Lots of targets.
>>>
>>> Kyle: Relation-entity set.  Closed set?
>>>
>>> Jim: We have drug graph, protein-protein interaction, and drugbank 
>>> has drug-protein interaction.  Molecular interaction.  CTD 
>>> Comparative Toxicogenomics Database, Heng Ji Lab database started with it.
>>>
>>> Kyle: Trying to add more KB entities?
>>>
>>> Jim: Want to expand the interaction set.  Also entities.  We have all 
>>> human proteins and drugbank drugs.  If you have a drug with an effect 
>>> on a target similar to a protein in COVID-19, will there be hits, directly 
>>> or indirectly?  To do that, we want to score it also based on 
>>> confidence in the research.
>>>
>>> Scott: My research approach is to integrate structured knowledge from 
>>> literature or other curated sources, and combine with observational 
>>> data to facilitate more reliable inference.  General idea is that 
>>> contextual info can help interpret and identify confounders. 
>>> Confounders are common causes of the predictor and outcome.  What I 
>>> did with CORD-19: took pubmed IDs, and found what machine reading 
>>> performed and created KG.  Machine reading can run for months.  Jim's 
>>> work on citation analysis is cool.  Using semrep, developed by NLM, 
>>> over titles and abstracts in pubmed.  Using Pubmed central IDs from 
>>> metadata table, in beginning of March, 31k papers, with 28k in pubmed 
>>> central.  Seemed like a good place to start building a KG quickly, to 
>>> see the big picture quickly.  Pulled 106k semantic predications in 
>>> the 21k docs, pulled into cytoscape and computed network centrality, 
>>> and from that ranked. Filtered by biomedical entities, diseases, 
>>> syndromes, amino acids, peptides or pharm substances.  Ranked them 
>>> by centrality to understand their importance.  Very prelim analysis.  
>>> Interested to see how I might expand on this and learn what others 
>>> are doing.
>>>
>>> Guoqian: Can cytoscape support RDF graphs?  David: Yes.  Jim: Yes, 
>>> and you can form SPARQL queries to extract specific interactions.  
>>> Not 1:1 mapping of RDF graph to bio network.
>>>
>>> Todor: There are different plugins, one is SPARQL endpoint.  Others 
>>> for other visualizations.  Keep expectations low.
>>>
>>> Jim: It also includes a knowledge exploration interface, built on 
>>> cytoscape.js, a re-implementation of cytoscape.  The implementation 
>>> I'm using has some interface elements.
>>>
>>> Lucy: How does Coronavirus ont relate?
>>>
>>> Guoqian: Using this ont to annotate the papers.
>>>
>>> Lucy: Where did that ont come from?
>>>
>>> Jim: Built using OBO Foundry?  Guoqian: Yes.
>>>
>>> Jim: We use OBO ont.  Oliver has a lot of tools for subsetting and 
>>> extracting for app ontologies.
>>>
>>> Guoqian: Also collaborating with Cochrane PICO ontology, developing 
>>> COVID-19 PICO ont, specific subtypes of the high level types, eg, 
>>> subtypes of population with particular co-morbidities.  The ont is 
>>> also avail on github.
>>>
>>> Guoqian: How to collaborate?  Need a registry for KG from this 
>>> community?
>>>
>>> Lucy: Working on semantic annotation of entity and rel.  Lots of 
>>> people are doing bottom-up annotation, without formal vocab, then 
>>> linking to UMLS.  But haven't seen COVID-19 ont.
>>>
>>> Guoqian: Also should look at use cases that different groups have. 
>>> Community said they want open vocab instead of SNOMED-CT, such as UMLS.
>>>
>>> Lucy: Also working with a group at AWS, KB of concepts, link to 
>>> ICD-10 and RxNorm, also lots of requests for protein and interactions.
>>>
>>> Guoqian: Also procedure datasets.
>>>
>>> Lucy: What use cases are these projects addressing?
>>>
>>> Guoqian: For EBMonFHIR, they are focused on review of evidence, and 
>>> clinical concepts.  Other team looking at using OBO ont to analyse DB 
>>> to mine underlying mechanisms.  Ideally we should have linkage across 
>>> vocabularies.  Eg UMLS can link many ont.  But for OBO it might be a 
>>> challenge.
>>>
>>> Jim: From microbio perspective, most useful from this group would be 
>>> having cross mapping from clinical/FHIR/SNOMED-ish world and OBO bio 
>>> world, with translation between the two.  E.g. I use uniprot IDs.  Is 
>>> that a problem?  What about drug IDs?  IDs are the hardest part -- 
>>> agree on some, and mappings for others.
>>>
>>> Guoqian: If we can provide a list of ont each team prefers, we can 
>>> discuss.
>>>
>>> Lucy: Would be great to be able to share annotations.  Centralized 
>>> vocab?  Central KB?  Use cases are key.
>>>
>>> Scott: Mapping problems with COVID-19 are same as other mapping 
>>> problems.  Should have a central place to share projects.  Should 
>>> keep use cases in mind.
>>>
>>> Sebastian: Please give us feedback on the dataset!
>>>
>>> Todor: Focus on specific questions that you want to answer, then map 
>>> using common IDs to address them.
>>>
>>> Daniel: What formats?  Right now we're using FHIR.  Use others?
>>>
>>> Jim: identifiers.org might be useful for mapping.
>>>
>>> David: Useful to have each group present use cases and vocab.
>>>
>>> We'll meet weekly, same time, 1 hour.  Each group will present their 
>>> work in more detail, with focus on:
>>> what use cases they are addressing; and
>>> what vocabularies / ontologies they're using.
>>>
>>> Each group will have 20 min to present and 10 min for questions.
>>>
>>> ADJOURNED
Received on Monday, 27 April 2020 16:05:14 UTC