Re: CORD-19 semantic annotations - 11am Tuesday (Boston time) - Victor Mireles on RDFizing CORD-19 annotations from David Booth on 2020-04-27 (public-semweb-lifesci@w3.org from April 2020)

From: David Booth <david@dbooth.org>
Date: Mon, 27 Apr 2020 16:00:56 -0400
To: w3c semweb HCLS <public-semweb-lifesci@w3.org>
Cc: Victor Mireles <victor.mireles@semantic-web.com>
Message-ID: <6210556c-2117-f363-7047-2908cf55c3bc@dbooth.org>
We will use this zoom:

Zoom Link: 
https://us02web.zoom.us/j/89011102533?pwd=SU9CdDYxUlRtUkNBdjFUN0x4MTRxUT09
password:  82AY02Rt66

Thanks,
David Booth

On 4/27/20 12:04 PM, David Booth wrote:
> Tomorrow (Tuesday) 11am Boston time Victor Mireles will present his work 
> on RDFizing several annotations on the CORD-19 dataset that use 
> different vocabularies. Current vocabularies: Gene Ontology, ChEBI, 
> Human Phenotype Ontology, MeSH disease.
> 
> Details for joining the call will be posted in a follow-up message.
> 
> Thanks,
> David Booth
> 
> On 4/21/20 1:32 PM, David Booth wrote:
>> [Apologies for reaching the google hangout participant limit today, 
>> and thank you to Victor Mireles-Chavez for allowing us to switch over 
>> to his zoom instead!  I will find a better solution for next week.]
>>
>> Below are meeting notes from today's call.  If you would like to 
>> present your work on CORD-19 semantic annotations, please email me so 
>> that I can put you on the schedule.  You do not need to have results 
>> yet.  Even if you are just starting out, it is helpful to learn what 
>> others are doing.
>>
>>                             ----------------------------
>>
>> MEETING NOTES 21-Apr-2020
>>
>> Present: David Booth, Jin-Dong Kim, Víctor Mireles, Oliver Giles, Harry 
>> Hochheiser, Franck Michel, James Malone, Kyle Lo, Sebastian Kohlmeier, 
>> Guoqian Jiang, Gaurav Vaidya, Gollam Rabby, Oliver, Tomas Kliegr
>>
>> Introductions
>>
>> David Booth: Many years in semantic web technology, applying it to 
>> healthcare and life sciences for the past 10 years.  Involved in 
>> standardizing the RDF representation of HL7 FHIR: 
>> https://www.hl7.org/fhir/rdf.html
>>
>> Gaurav: U of NC (https://renci.org/staff/gaurav-vaidya/), sem web 
>> tech, using CORD-19, trying to annotate ont terms as part of Robokop 
>> (https://robokop.renci.org/).
>>
>> Harry: U of Pittsburgh, involved w W3C, drug-drug interaction, cancer 
>> information models, not actively using CORD-19.
>>
>> James Malone: CTO SciBite in UK, provide sem enrichment tooling to 
>> pharma, KG building.  Background, applying ont to public data, machine 
>> learning, building ontologies.
>>
>> Kyle: Researcher at Allen Institute, NLP, working on CORD-19.
>>
>> Oliver: Machine learning at SciBite w James, NLP, machine learning.
>>
>> Sebastian: Prog mgr at Allen Institute, CORD-19.
>>
>> Tomas: Working on rule learning, trying to apply it to CORD-19.
>>
>> Victor Mireles: Researcher at sem web co in Austria, looking at 
>> annotations that others have been doing on CORD-19, trying to make 
>> them match, and our own annotations.
>>
>> Presentation by Jin-Dong Kim
>>
>> Slides: 
>> https://docs.google.com/presentation/d/1ynoe1Xxc_-rTiebbvvuPBQMaktK-DX87McuDVaLbI1g/edit#slide=id.g726dbf02a0_0_0 
>>
>>
>> Jin-Dong: Tokyo, database center for life science, Japan gov funded, 
>> bioinformatics, NLP, text mining, esp biomedical literature.
>>
>> (Jin-Dong presents his slides)
>>
>> Jin-Dong: Using multiple datasets.  Multiple groups producing 
>> annotations, isolation.  PubAnnotation is a 10-year-old project to 
>> integrate annotations to literature.  Collecting annotations for 
>> COVID-19 literature to integrate and release them for other use. 
>> PubAnnotation is an open repo of biomed text annotations.  Anyone can 
>> submit to it.  All annotations are aligned to the canonical texts.
>>
>> Jin-Dong: PubAnnotation also provides RESTful web services. Many 
>> annotators compatible with PubAnnotation.  Also collecting manual 
>> annotations using TextAE.
>>
>> Q: Are the annotations from a controlled vocab or ont?  Jin-Dong: Both: 
>> free text and from ont.
>>
>> Jin-Dong: Every text span has a URL.  You can see what projects 
>> include a doc.  And you can choose a span of text and see what 
>> projects used that span.
>>
>> Q: What is a project?  Jin-Dong: We collect any kind of annotations. 
>> Project identifies the source of people who have contributed annotations.
>>
>> Jin-Dong: Annotations can be accessed via a span URL.  Also converting 
>> annotations into RDF.  Still experimenting.  Also have a search 
>> interface.  SPARQL queries.
>> https://covid19.pubannotation.org/
>>
>> Jin-Dong: Trying to add annotations for temporal expressions.
>>
>> Jin-Dong: Literature includes CORD-19 and LitCovid, from NCBI.  
>> Uploaded all the text to PubAnnotation 
>> (http://pubannotation.org/collections/LitCovid).  Anyone can 
>> contribute. To contribute, you can download, annotate, then create a 
>> new project and add it to the LitCovid collection and it will appear.  
>> Open platform.  Same setup for CORD-19.  Received 6 contributions so 
>> far. Need to analyze them.  Planning to call for wider contributions 
>> soon, maybe next week.  Plan to continuously update.
>>
>> Guoqian: Any specific research questions using these annotations? 
>> Particular use cases?  Jin-Dong: Need to find out. Clinicians began 
>> with manual annotations.  Will figure out missing parts and try to 
>> fill the gaps.  Many annotations are concept annotations using ont -- 
>> many similar.  But we think there are still important missing 
>> annotations, such as temporal expressions.  Looking to add those.  
>> Also quantitative traits annotations are missing.  Looking for those too.
>>
>> Q: How might these be used?
>>
>> Franck: I'm in Inria/CNRS/Univ Côte d'Azur, contacts with Inserm 
>> (French NIH) point to the need to search literature with questions 
>> like: "What are the papers that link Coronavirus with other diseases 
>> like diabetes or cancer?"
>>
>> James: Released COVID-specific annotations. Pharma using them: looking 
>> for co-risk factors, or drugs interacting.  Comes down to: want to 
>> narrow down to a set of papers to read.  Anything that gets them to 
>> the paper.  Want to read the o
>>
>> Franck: Summarizing the main claim of the paper helps also, to narrow 
>> down the search.
>>
>> Victor: Drug-drug interactions.  Many other KGs, to link to drug-drug 
>> or protein-protein interaction databases we need URIs, so 
>> pubAnnotations can query and get URIs from it, so I can see what drugs 
>> are mentioned in this span.  Is this supported?
>>
>> Jin-Dong: Group in China is working on annotations for drug 
>> repurposing.   I think they're using drug ont.
>>
>> Franck: How can we consume the annotations that have been contributed? 
>> Jin-Dong: Download in JSON or CSV, or access as RDF.
>>
>> Tomas: We detect entities, then try to do semantic extension.  Would 
>> there be a way to use this for semantic extension of entities, or get 
>> a list of highly specific concepts that appear in the article.  
>> Jin-Dong: Yes, because they're in RDF, could do that.  Tomas: How to 
>> match doc in your DB with doc in other DB?  Jin-Dong: Every doc is 
>> identified by a pair: DB identifier, and ID within that DB.
>>
>> Tomas: How many annotations average per document?  Jin-Dong: 
>> Conversion is not entirely done.  RDF statements only partially done.  
>> Jin-Dong: in CORD-PICO, for 26k docs, 69k annotations for PICO.
>>
>> ADJOURNED
>>
>> -----------------------------------------------------------------------
>>
>> On 4/21/20 10:47 AM, David Booth wrote:
>>> Last minute schedule change for today's call: Instead of Scott Malec, 
>>> Jin-Dong Kim will present his work on "An open collaboration for 
>>> richly annotating Covid-19 Literature".  Slides are here:
>>> https://docs.google.com/presentation/d/1ynoe1Xxc_-rTiebbvvuPBQMaktK-DX87McuDVaLbI1g/edit#slide=id.g726dbf02a0_0_0 
>>>
>>>
>>> David Booth
>>>
>>> On 4/20/20 11:56 AM, David Booth wrote:
>>>> Tomorrow (Tuesday) 11am Boston time Scott Malec will discuss his 
>>>> work on computable knowledge extraction using the CORD-19 dataset 
>>>> that was released by the Allen Institute.
>>>>
>>>> We will use this google hangout:
>>>> http://tinyurl.com/fhirrdf
>>>>
>>>> More on Scott's work:
>>>> https://github.com/fhircat/CORD-19-on-FHIR/wiki/CORD-19-Semantic-Annotation-Projects#project-name-cord-semantictriples 
>>>>
>>>>
>>>> We still have time for one other presentation tomorrow about CORD-19 
>>>> semantic annotation.  If anyone else is ready (with slides) to 
>>>> present for 20 minutes, please let me know.
>>>>
>>>> Thanks,
>>>> David Booth
>>>>
>>>> -----------------------------------------------
>>>>
>>>> MEETING NOTES 7-Apr-2020
>>>> Present: David Booth <david@dbooth.org>, Sebastian Kohlmeier 
>>>> <sebastiank@allenai.org>, Lucy Lu Wang <lucyw@allenai.org>, Kyle Lo 
>>>> <kylel@allenai.org>, Jim McCusker <mccusker@gmail.com>, Scott Malec 
>>>> <sam413@pitt.edu>, Guoqian Jiang <jiang.guoqian@mayo.edu>, Todor 
>>>> Primov <todor.primov@ontotext.com>
>>>>
>>>> Sebastian: Allen Institute, Semantic Scholar, Non-profit AI 
>>>> institute, w Lucy and Kyle.  Engaged in COVID-19 because as 
>>>> non-profit could develop a corpus that we can make available. 
>>>> Created CORD-19 dataset.  Goal: Standardized format that's easy for 
>>>> machines to read, to enable quick analysis of the literature. 
>>>> Working to extend it. Weekly updates, but want to get to daily 
>>>> updates.  Want to also get to entity and relation extraction.
>>>>
>>>> Guoqian: Identifiers used?  SHA numbers for full text, but also IDs 
>>>> linked to DOIs and Pubmed IDs.  Should discuss best way to have 
>>>> unique ID for publication.
>>>>
>>>> Kyle: Added unique IDs: cord_UID.  SHA is a hash of PDF, and 
>>>> sometimes there are multiple PDFs for a single paper.
>>>>
>>>> Jim: DOIs?
>>>>
>>>> Lucy: Some papers do not have a DOI.
>>>>
>>>> Jim: Building a KG using generalized tools from another project, 
>>>> used in many domains.  Looking to do drug repurposing using CORD-19. 
>>>> Using an extract of CORD-19.  Does deep extraction of named entities 
>>>> and relationships.  Use PROV ont and nanopublications, for rich 
>>>> modeling and provenance for probabilistic KG.  Arcs in picture are 
>>>> based on confidence level.  Allows high precision on drugs that have 
>>>> been tested on melanoma before.  Re-applying this to COVID-19.  We 
>>>> focus on open ontologies, and not using FHIR.  Unpublished yet. 
>>>> Page-rank based analysis of pubmed citation graph, to compute 
>>>> community trust in a paper.
>>>>
>>>> Guoqian: What ont?
>>>>
>>>> Jim: Drugbank mostly.  Lots of targets.
>>>>
>>>> Kyle: Relation-entity set.  Closed set?
>>>>
>>>> Jim: We have drug graph, protein-protein interaction, and drugbank 
>>>> has drug-protein interaction.  Molecular interaction.  CTD 
>>>> Comparative Toxicogenomics Database, Heng Ji Lab database started with it.
>>>>
>>>> Kyle: Trying to add more KB entities?
>>>>
>>>> Jim: Want to expand the interaction set.  Also entities.  We have 
>>>> all human proteins and drugbank drugs.  If you have a drug with an 
>>>> effect on a target similar protein in COVID-19, will there be hits, 
>>>> directly or indirectly?  To do that, we want to score it also based 
>>>> on confidence in the research.
>>>>
>>>> Scott: My research approach is to integrate structured knowledge 
>>>> from literature or other curated sources, and combine with 
>>>> observational data to facilitate more reliable inference.  General 
>>>> idea is that contextual info can help interpret and identify 
>>>> confounders. Confounders are common causes of the predictor and 
>>>> outcome.  What I did with CORD-19: took pubmed IDs, and found what 
>>>> machine reading performed and created KG.  Machine reading can run 
>>>> for months.  Jim's work on citation analysis is cool.  Using semrep, 
>>>> developed by NLM, over titles and abstracts in pubmed.  Using Pubmed 
>>>> central IDs from metadata table, in beginning of March, 31k papers, 
>>>> with 28k in pubmed central.  Seemed like a good place to start 
>>>> building a KG quickly, to see the big picture quickly.  Pulled 106k 
>>>> semantic predications in the 21k docs, pulled into cytoscape and 
>>>> computed network centrality, and from that ranked. Filtered by 
>>>> biomedical entities, diseases, syndromes, amino acids, peptides or 
>>>> pharm substances.  Ranked them by centrality to understand their 
>>>> importance.  Very prelim analysis. Interested to see how I might 
>>>> expand on this and learn what others are doing.
>>>>
>>>> Guoqian: Can cytoscape support RDF graphs?  David: Yes.  Jim: Yes, 
>>>> and you can form SPARQL queries to extract specific interactions. 
>>>> Not 1:1 mapping of RDF graph to bio network.
>>>>
>>>> Todor: There are different plugins, one is SPARQL endpoint.  Others 
>>>> for other visualizations.  Keep expectations low.
>>>>
>>>> Jim: It also includes a knowledge exploration interface, built on 
>>>> cytoscape.js, a re-implementation of cytoscape.  The implementation 
>>>> I'm using has some interface element.
>>>>
>>>> Lucy: How does Coronavirus ont relate?
>>>>
>>>> Guoqian: Using this ont to annotate the papers.
>>>>
>>>> Lucy: Where did that ont come from?
>>>>
>>>> Jim: Built using OBO Foundry?  Guoqian: Yes.
>>>>
>>>> Jim: We use OBO ont.  Oliver has a lot of tools for subsetting and 
>>>> extracting for app ontologies.
>>>>
>>>> Guoqian: Also collaborating with Cochrane PICO ontology, developing 
>>>> COVID-19 PICO ont, specific subtypes of the high level types, eg, 
>>>> subtypes of population with particular co-morbidities.  The ont is 
>>>> also avail on github.
>>>>
>>>> Guoqian: How to collaborate?  Need a registry for KG from this 
>>>> community?
>>>>
>>>> Lucy: Working on semantic annotation of entity and rel.  Lots of 
>>>> people are doing bottom-up annotation, without formal vocab, then 
>>>> linking to UMLS.  But haven't seen COVID-19 ont.
>>>>
>>>> Guoqian: Also should look at use cases that different groups have. 
>>>> Community said they want open vocab instead of SNOMED-CT, such as UMLS.
>>>>
>>>> Lucy: Also working with a group at AWS, KB of concepts, link to 
>>>> ICD-10 and RxNorm, also lots of requests for protein and interactions.
>>>>
>>>> Guoqian: Also procedure datasets.
>>>>
>>>> Lucy: What use cases are these projects addressing?
>>>>
>>>> Guoqian: For EBMonFHIR, they are focused on review of evidence, and 
>>>> clinical concepts.  Other team looking at using OBO ont to analyse 
>>>> DB to mine underlying mechanisms.  Ideally we should have linkage 
>>>> across vocabularies.  Eg UMLS can link many ont.  But for OBO it 
>>>> might be a challenge.
>>>>
>>>> Jim: From microbio perspective, most useful from this group would be 
>>>> having cross mapping from clinical/FHIR/SNOMED-ish world and OBO bio 
>>>> world, with translation between the two.  E.g. I use uniprot IDs.  
>>>> Is that a problem?  What about drug IDs?  IDs are the hardest part 
>>>> -- agree on some, and mappings for others.
>>>>
>>>> Guoqian: If we can provide a list of ont each team prefers, we can 
>>>> discuss.
>>>>
>>>> Lucy: Would be great to be able to share annotations.  Centralized 
>>>> vocab?  Central KB?  Use cases are key.
>>>>
>>>> Scott: Mapping problems with COVID-19 are same as other mapping 
>>>> problems.  Should have a central place to share projects.  Should 
>>>> keep use cases in mind.
>>>>
>>>> Sebastian: Please give us feedback on the dataset!
>>>>
>>>> Todor: Focus on specific questions that you want to answer, then map 
>>>> using common IDs to address them.
>>>>
>>>> Daniel: What formats?  Right now we're using FHIR.  Use others?
>>>>
>>>> Jim: identifiers.org might be useful for mapping.
>>>>
>>>> David: Useful to have each group present use cases and vocab.
>>>>
>>>> We'll meet weekly, same time, 1 hour.  Each group will present their 
>>>> work in more detail, with focus on:
>>>> what use cases they are addressing; and
>>>> what vocabularies / ontologies they're using.
>>>>
>>>> Each group will present for 20 min, followed by 10 min of questions.
>>>>
>>>> ADJOURNED
Received on Monday, 27 April 2020 20:01:12 UTC