Re: CORD-19 semantic annotations - 11am Tuesday (Boston time) - Victor Mireles on RDFizing CORD-19 annotations

Notes from today's call:

MEETING NOTES 28-Apr-2020
Present: David Booth, Victor Mireles, Louis Rumanes, Tom Conlin, Franck 
Michel, Gollam Rabby, Jim McCusker, Lucy Wong, Sebastian Kohlmeier, 
Tomáš Kliegr

Introductions
David Booth: 10 years applying semantic web tech to healthcare and life 
sciences, working on Mayo Clinic / Johns-Hopkins University collaboration.

Louis Rumane: United Health Group, Doing COVID research, looking at 
making a KG

Tom Conlin: Working with Melissa Haendel (Monarch Initiative).

Franck: INRIA

Gollam: Prague, Univ

Jim: Research sci RPI, working on KG w bio

Lucy: Allen institute, research scientist.

Tomas: Assoc Prof, Prague, KG.

Sebastian: Sr Mgr on CORD-19.

Victor: Semantic Web company researcher

Victor's Presentation
Slides here: 
https://docs.google.com/presentation/d/1xaS_88sJ47iSrvv0ezOfjscIvG2VINUe7vqrUEMiaCA/edit?usp=sharing

victor: Semantic Web Company, 40+ FTEs.  Makes PoolParty. Works w 
companies in many countries.  Taxonomy helps extract entities from text. 
image search, data mgmt.

victor: Developing text and data mining tools for biomed, and CORD-19. 
We don't only annotate text.  What's useful about annotating text w 
entities is to use the knowledge, simplest is encoded in SKOS, such as 
broader/narrower.  But to do this we need to annotate the text into 
URIs, then import relationships into the graph.  Trying to link existing 
annotations w other knowledge sources.  Ont is a simplified version of 
NIF: documents have sections, sections have annotations that are SKOS 
concepts.
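
A minimal sketch of the document / section / annotation model described 
above, written as plain triples.  The namespace http://example.org/cord19/, 
the property names (hasSection, hasAnnotation, annotatedConcept, 
beginOffset, endOffset) and the concept URI are all hypothetical, chosen 
only for illustration; the notes do not record the actual terms used.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

# Hypothetical namespace; not the one used in the presentation.
EX = "http://example.org/cord19/"

Triple = Tuple[str, str, Union[str, int]]


@dataclass
class Annotation:
    concept_uri: str  # URI of a SKOS concept from a controlled vocabulary
    start: int        # character offset where the mention begins
    end: int          # character offset just past the mention


@dataclass
class Section:
    section_id: str
    annotations: List[Annotation] = field(default_factory=list)


@dataclass
class Document:
    doc_id: str
    sections: List[Section] = field(default_factory=list)


def to_triples(doc: Document) -> List[Triple]:
    """Flatten a document into (subject, predicate, object) triples."""
    triples: List[Triple] = []
    doc_uri = f"{EX}doc/{doc.doc_id}"
    for sec in doc.sections:
        sec_uri = f"{doc_uri}/section/{sec.section_id}"
        triples.append((doc_uri, EX + "hasSection", sec_uri))
        for i, ann in enumerate(sec.annotations):
            ann_uri = f"{sec_uri}/annotation/{i}"
            triples.append((sec_uri, EX + "hasAnnotation", ann_uri))
            triples.append((ann_uri, EX + "annotatedConcept", ann.concept_uri))
            triples.append((ann_uri, EX + "beginOffset", ann.start))
            triples.append((ann_uri, EX + "endOffset", ann.end))
    return triples


example = Document(
    "d1",
    [Section("s1", [Annotation("http://example.org/vocab/Coronavirus", 0, 11)])],
)
example_triples = to_triples(example)
```

Because each annotation points at a concept URI, any broader/narrower 
relationships loaded from the vocabularies can be joined to the text 
without further processing.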

victor: So far, we've set up a pipeline to take a document and it finds 
annotations with offsets.  So far imported ChEBI, GO, MeSH, HPO, but 
using them as controlled vocab.  Many are very specific, such as 
"COVID-19" -- not really NLP, because there are not inflections, 
plurals, etc.  Output is a bunch of triples in the simple SKOS ont 
previously mentioned. Put them into GraphDB, along with the vocabs.
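
A rough sketch of controlled-vocabulary matching with offsets, as 
described above.  This is not the actual pipeline: the function name, the 
case-insensitive matching, the word-boundary rule and the example URI are 
assumptions made for illustration.

```python
import re
from typing import Dict, List, Tuple


def annotate(text: str, vocab: Dict[str, str]) -> List[Tuple[int, int, str]]:
    """Return sorted (start, end, concept_uri) for each exact label match.

    Matching is literal: no stemming, plural handling or other NLP.
    Case-insensitivity is an assumption of this sketch.
    """
    hits: List[Tuple[int, int, str]] = []
    for label, uri in vocab.items():
        # Require that the label is not embedded inside a longer word.
        pattern = re.compile(
            r"(?<!\w)" + re.escape(label) + r"(?!\w)", re.IGNORECASE
        )
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), uri))
    return sorted(hits)


# Hypothetical vocabulary entry: label -> concept URI.
vocab = {"COVID-19": "http://example.org/vocab/COVID-19"}
sample = "COVID-19 patients; covid-19 again."
sample_hits = annotate(sample, vocab)
```

Each hit carries the character offsets of the mention, so it can be 
turned directly into an annotation attached to the section it came from.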

victor: Also looked at SciBite annotations.  They've done an excellent 
job annotating.  They also have their own controlled vocab that is very 
good.  JSON files have annotations. Put them into triples.  Combining 
them w bio DBs gives a graph DB.

(victor shows relationships in GraphDB viewer)

victor: you can navigate the hierarchy of concepts and link them to the 
paragraphs in CORD-19 DB.

(victor shows SPARQL queries)

victor: This allows us to pull up the titles and paragraphs of articles 
that both mention a kind of neoplasm and a kind of coronavirus.
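
The idea behind this kind of query can be sketched in plain Python, 
assuming a toy narrower-concept map and toy paragraph annotations (all 
names below are made up).  The SPARQL queries shown on the call are not 
recorded in these notes; in SPARQL the hierarchy walk would typically be 
written with a property path over skos:broader.

```python
from typing import Dict, List, Set


def narrower_closure(concept: str, narrower: Dict[str, Set[str]]) -> Set[str]:
    """Return the concept plus everything reachable via narrower links."""
    seen = {concept}
    stack = [concept]
    while stack:
        current = stack.pop()
        for child in narrower.get(current, set()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def paragraphs_mentioning_both(
    mentions: Dict[str, Set[str]],
    narrower: Dict[str, Set[str]],
    concept_a: str,
    concept_b: str,
) -> List[str]:
    """Paragraph ids annotated with some kind of A and some kind of B."""
    kinds_a = narrower_closure(concept_a, narrower)
    kinds_b = narrower_closure(concept_b, narrower)
    return sorted(
        pid
        for pid, concepts in mentions.items()
        if concepts & kinds_a and concepts & kinds_b
    )


# Toy data: concept -> directly narrower concepts.
narrower = {
    "Neoplasm": {"Lymphoma"},
    "Coronavirus": {"SARS-CoV-2", "MERS-CoV"},
}
# Toy data: paragraph id -> concepts annotated in that paragraph.
mentions = {
    "p1": {"Lymphoma", "SARS-CoV-2"},
    "p2": {"Lymphoma"},
    "p3": {"Neoplasm", "MERS-CoV"},
}
both = paragraphs_mentioning_both(mentions, narrower, "Neoplasm", "Coronavirus")
```

A paragraph qualifies even if it never uses the words "neoplasm" or 
"coronavirus", as long as it is annotated with a narrower concept of each.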

victor: Want to take other DBs and put them into GraphDB also.  Monarch 
Initiative is putting together KG, and also puts in SciBite.

victor: Missing from both our effort and Monarch: searchability.  I 
showed SPARQL queries using broader/narrower.  Also need to be more 
efficient for humans, working also on faceted search.  Monarch 
Initiative is very good for machine readable stuff.  Another thing 
missing: relation extraction, from the text.  How does human determine 
that some text is saying that a protein interacts with another.  JPL 
(Lewis Magidney?sp?) is using a Stanford NLP for relation extraction.
https://github.com/nasa-jpl-cord-19/covid19-knowledge-graph
It isn't perfect, but it indicates a relationship.  Both entities are in 
GO.  This adds new edges between entities.  Lots of interest in this 
topic now.

Franck: We're doing pretty close to this in INRIA, looking at named 
entities, wikidata entities, queries that gather all articles on cancer 
and any coronavirus.  Another thing we're doing: in addition to 
detecting named entities, we're running other tools to identify 
arguments, claims, evidence in articles and draw network of claims and 
evidence to see what supports the claims.  Hope to publish this network 
soon as RDF graph.

victor: PubAnnotation shown last week, showed epistemic analysis.

Franck: Argument, clinical trial analysis.  Query pubmed and platform 
analyzes those articles.  Want to apply them to CORD-19.

Vincent: Is RDF available? victor: Will take a couple more weeks. 
Vincent: Size? victor: 20GB RDF.

David: Overlap between efforts, helpful to learn about each other's work.

victor: After looking at Monarch initiative, it isn't new, names I 
recognized from Human Phenotype initiative.  Most of that summarizes work 
that others have done.  FHIR DB also have overlaps with SciBite.

david: SPARQL query was valuable, but biologists need simple UI.

jim: Working on faceted browser for various things, that can be reused. 
Based on SPARQL fragments, property path gives certain values, here's 
how to render it.  Potentially useful here.  Also integrated WHYIS Vega 
(JS framework for charts and visualization), can plug a SPARQL query in 
and get a chart.  People can share how they're exploring the graph.
https://github.com/tetherless-world/whyis
Faceted search is a view in WHYIS, but a lot of the capabilities are 
designed to use nanopub.

Email list for these calls: 
https://lists.w3.org/Archives/Public/public-semweb-lifesci/

Franck to present next week.

ADJOURNED


On 4/27/20 4:00 PM, David Booth wrote:
> We will use this zoom:
> 
> Zoom Link: 
> https://us02web.zoom.us/j/89011102533?pwd=SU9CdDYxUlRtUkNBdjFUN0x4MTRxUT09
> password:  82AY02Rt66
> 
> Thanks,
> David Booth
> 
> On 4/27/20 12:04 PM, David Booth wrote:
>> Tomorrow (Tuesday) 11am Boston time Victor Mireles will present his 
>> work on RDFizing several annotations on the Cord19 dataset that are 
>> around in different vocabularies. Current vocabularies: gene ontology, 
>> ChEBI, human phenotype ontology, MeSH disease.
>>
>> Details for joining the call will be posted in a follow-up message.
>>
>> Thanks,
>> David Booth
>>
>> On 4/21/20 1:32 PM, David Booth wrote:
>>> [Apologies for reaching the google hangout participant limit today, 
>>> and thank you to Victor Mireles-Chavez for allowing us to switch over 
>>> to his zoom instead!  I will find a better solution for next week.]
>>>
>>> Below are meeting notes from today's call.  If you would like to 
>>> present your work on CORD-19 semantic annotations, please email me so 
>>> that I can put you on the schedule.  You do not need to have results 
>>> yet.  Even if you are just starting out, it is helpful to learn what 
>>> others are doing.
>>>
>>>                             ----------------------------
>>>
>>> MEETING NOTES 21-Apr-2020
>>>
>>> Present: David Booth, Jin-Dong Kim, Víctor Mireles, Oliver 
>>> Giles, Harry Hochheiser, Franck Michel, James Malone, Kyle Lo, 
>>> Sebastian Kohlmeier, Guoqian Jiang, Gaurav Vaidya, Gollam Rabby, 
>>> Oliver, Tomas Kliegr
>>>
>>> Introductions
>>>
>>> David Booth: Many years in semantic web technology, applying it to 
>>> healthcare and life sciences for the past 10 years.  Involved in 
>>> standardizing the RDF representation of HL7 FHIR: 
>>> https://www.hl7.org/fhir/rdf.html
>>>
>>> Gaurav: U of NC (https://renci.org/staff/gaurav-vaidya/), sem web 
>>> tech, using CORD-19, trying to annotate ont terms as part of Robokop 
>>> (https://robokop.renci.org/).
>>>
>>> Harry: U of Pittsburgh, involved w W3C, drug-drug interaction, cancer 
>>> information models, not actively using CORD-19.
>>>
>>> James Malone: CTO SciBite in UK, provide sem enrichment tooling to 
>>> pharma, KG building.  Background, applying ont to public data, 
>>> machine learning, building ontologies.
>>>
>>> Kyle: Researcher at Allen Institute, NLP, working on CORD-19.
>>>
>>> Oliver: Machine learning at SciBite w James, NLP, machine learning.
>>>
>>> Sebastian: Prog mgr at Allen Institute, CORD-19.
>>>
>>> Tomas: Working on rule learning, trying to apply it to CORD-19.
>>>
>>> Victor Mireles: Researcher at sem web co in Austria, looking at 
>>> annotations that others have been doing on CORD-19, trying to make 
>>> them match, and our own annotations.
>>>
>>> Presentation by Jin-Dong Kim
>>>
>>> Slides: 
>>> https://docs.google.com/presentation/d/1ynoe1Xxc_-rTiebbvvuPBQMaktK-DX87McuDVaLbI1g/edit#slide=id.g726dbf02a0_0_0 
>>>
>>>
>>> Jin-Dong: Tokyo, database center for life science, Japan gov funded, 
>>> bioinformatics, NLP, text mining, esp biomedical literature.
>>>
>>> (Jin-Dong presents his slides)
>>>
>>> Jin-Dong: Using multiple datasets.  Multiple groups producing 
>>> annotations, isolation.  PubAnnotation is a 10-year-old project to 
>>> integrate annotations to literature.  Collecting annotations for 
>>> COVID-19 literature to integrate and release them for other use. 
>>> PubAnnotation is an open repo of biomed text annotations.  Anyone can 
>>> submit to it.  All annotations are aligned to the canonical texts.
>>>
>>> Jin-Dong: PubAnnotation also provides RESTful web services. Many 
>>> annotators compatible with PubAnnotation.  Also collecting manual 
>>> annotations using TextAE.
>>>
>>> Q: Are the annotations from a controlled vocab or ont?  Jin-Dong: 
>>> Both free text or from ont.
>>>
>>> Jin-Dong: Every text span has a URL.  You can see what projects 
>>> include a doc.  And you can choose a span of text and see what 
>>> projects used that span.
>>>
>>> Q: What is a project?  Jin-Dong: We collect any kind of annotations. 
>>> Project identifies the source of people who have contributed 
>>> annotations.
>>>
>>> Jin-Dong: Annotations can be accessed via a span URL.  Also 
>>> converting annotations into RDF.  Still experimenting.  Also have a 
>>> search interface.  SPARQL queries.
>>> https://covid19.pubannotation.org/
>>>
>>> Jin-Dong: Trying to add annotations for temporal notations.
>>>
>>> Jin-Dong: Literature includes CORD-19 and LitCovid, from NCBI. 
>>> Uploaded all the text to PubAnnotation 
>>> (http://pubannotation.org/collections/LitCovid)  Anyone can 
>>> contribute. To contribute, you can download, annotate, then create 
>>> a new project and add it to the LitCovid collection and it will 
>>> appear. Open platform.  Same setup for CORD-19.  Received 6 
>>> contributions so far. Need to analyze them.  Planning to call for 
>>> wider contributions soon, maybe next week.  Plan to continuously update.
>>>
>>> Guoqian: Any specific research questions using these annotations? 
>>> Particular use cases?  Jin-Dong: Need to find out. Clinicians began 
>>> with manual annotations.  Will figure out missing parts and try to 
>>> fill the gaps.  Many annotations are concept annotations using ont -- 
>>> many similar.  But we think there are still important missing 
>>> annotations, such as temporal expressions.  Looking to add those. 
>>> Also quantitative traits annotations are missing.  Looking for those 
>>> too.
>>>
>>> Q: How might these be used?
>>>
>>> Franck: I'm in Inria/CNRS/Univ Côte d'Azur, contacts with Inserm 
>>> (French NIH) point to the need to search literature with questions 
>>> like: "What are the papers that link Coronavirus with other diseases 
>>> like diabetes or cancer?"
>>>
>>> James: Released COVID-specific annotations. Pharma using them: 
>>> looking for co-risk factors, or drugs interacting.  Comes down to: 
>>> want to narrow down to a set of papers to read.  Anything that gets 
>>> them to the paper.  Want to read the o
>>>
>>> Franck: Summarizing the main claim of the paper helps also, to narrow 
>>> down the search.
>>>
>>> Victor: Drug-drug interactions.  Many other KGs, to link to drug-drug 
>>> or protein-protein interaction databases we need URIs, so 
>>> pubAnnotations can query and get URIs from it, so I can see what 
>>> drugs are mentioned in this span.  Is this supported?
>>>
>>> Jin-Dong: Group in China is working on annotations for drug 
>>> repurposing.   I think they're using drug ont.
>>>
>>> Franck: How can we consume the annotations that have been 
>>> contributed? Jin-Dong: Download in JSON or CSV, or access as RDF.
>>>
>>> Tomas: We detect entities, then try to do semantic extension.  Would 
>>> there be a way to use this for semantic extension of entities, or get 
>>> a list of highly specific concepts that appear in the article. 
>>> Jin-Dong: Yes, because they're in RDF, could do that.  Tomas: How to 
>>> match doc in your DB with doc in other DB?  Jin-Dong: Every doc is 
>>> identified by a pair: DB identifier, and ID within that DB.
>>>
>>> Tomas: How many annotations average per document?  Jin-Dong: 
>>> Conversion is not entirely done.  RDF statements only partially done. 
>>> Jin-Dong: in CORD-PICO, for 26k docs, 69k annotations for PICO.
>>>
>>> ADJOURNED
>>>
>>> -----------------------------------------------------------------------
>>>
>>> On 4/21/20 10:47 AM, David Booth wrote:
>>>> Last minute schedule change for today's call: Instead of Scott 
>>>> Malec, Jin-Dong Kim will present his work on "An open collaboration 
>>>> for richly annotating Covid-19 Literature".  Slides are here:
>>>> https://docs.google.com/presentation/d/1ynoe1Xxc_-rTiebbvvuPBQMaktK-DX87McuDVaLbI1g/edit#slide=id.g726dbf02a0_0_0 
>>>>
>>>>
>>>> David Booth
>>>>
>>>> On 4/20/20 11:56 AM, David Booth wrote:
>>>>> Tomorrow (Tuesday) 11am Boston time Scott Malec will discuss his 
>>>>> work on computable knowledge extraction using the CORD-19 dataset 
>>>>> that was released by the Allen Institute.
>>>>>
>>>>> We will use this google hangout:
>>>>> http://tinyurl.com/fhirrdf
>>>>>
>>>>> More on Scott's work:
>>>>> https://github.com/fhircat/CORD-19-on-FHIR/wiki/CORD-19-Semantic-Annotation-Projects#project-name-cord-semantictriples 
>>>>>
>>>>>
>>>>> We still have time for one other presentation tomorrow about 
>>>>> CORD-19 semantic annotation.  If anyone else is ready (with slides) 
>>>>> to present for 20 minutes, please let me know.
>>>>>
>>>>> Thanks,
>>>>> David Booth
>>>>>
>>>>> -----------------------------------------------
>>>>>
>>>>> MEETING NOTES 7-Apr-2020
>>>>> Present: David Booth <david@dbooth.org>, Sebastian Kohlmeier 
>>>>> <sebastiank@allenai.org>, Lucy Lu Wang <lucyw@allenai.org>, Kyle Lo 
>>>>> <kylel@allenai.org>, Jim McCusker <mccusker@gmail.com>, Scott Malec 
>>>>> <sam413@pitt.edu>, Guoqian Jiang <jiang.guoqian@mayo.edu>, Todor 
>>>>> Primov <todor.primov@ontotext.com>
>>>>>
>>>>> Sebastian: Allen Institute, Semantic Scholar, Non-profit AI 
>>>>> institute, w Lucy and Kyle.  Engaged in COVID-19 because as 
>>>>> non-profit could develop a corpus that we can make available. 
>>>>> Created CORD-19 dataset.  Goal: Standardized format that's easy for 
>>>>> machines to read, to enable quick analysis of the literature. 
>>>>> Working to extend it. Weekly updates, but want to get to daily 
>>>>> updates.  Want to also get to entity and relation extraction.
>>>>>
>>>>> Guoqian: Identifiers used?  SHA numbers for full text, but also IDs 
>>>>> linked to DOIs and Pubmed IDs.  Should discuss best way to have 
>>>>> unique ID for publication.
>>>>>
>>>>> Kyle: Added unique IDs: cord_UID.  SHA is a hash of PDF, and 
>>>>> sometimes there are multiple PDFs for a single paper.
>>>>>
>>>>> Jim: DOIs?
>>>>>
>>>>> Lucy: Some papers do not have a DOI.
>>>>>
>>>>> Jim: Building a KG using generalized tools from another project, 
>>>>> used in many domains.  Looking to do drug repurposing using 
>>>>> CORD-19. Using an extract of CORD-19.  Does deep extraction of 
>>>>> named entities and relationships.  Use PROV ont and 
>>>>> nanopublications, for rich modeling and provenance for 
>>>>> probabilistic KG.  Arcs in picture are based on confidence level.  
>>>>> Allows high precision on drugs that have been tested on melanoma 
>>>>> before.  Re-applying this to COVID-19.  We focus on open 
>>>>> ontologies, and not using FHIR.  Unpublished yet. Page-rank based 
>>>>> analysis of pubmed citation graph, to compute community trust in a 
>>>>> paper.
>>>>>
>>>>> Guoqian: What ont?
>>>>>
>>>>> Jim: Drugbank mostly.  Lots of targets.
>>>>>
>>>>> Kyle: Relation-entity set.  Closed set?
>>>>>
>>>>> Jim: We have drug graph, protein-protein interaction, and drugbank 
>>>>> has drug-protein interaction.  Molecular interaction.  CTD 
>>>>> Comparative Toxicogenomics Database, Heng Ji Lab database started with it.
>>>>>
>>>>> Kyle: Trying to add more KB entities?
>>>>>
>>>>> Jim: Want to expand the interaction set.  Also entities.  We have 
>>>>> all human proteins and drugbank drugs.  If you have a drug with an 
>>>>> effect on a target similar protein in COVID-19, will there be hits, 
>>>>> directly or indirectly?  To do that, we want to score it also based 
>>>>> on confidence in the research.
>>>>>
>>>>> Scott: My research approach is to integrate structured knowledge 
>>>>> from literature or other curated sources, and combine with 
>>>>> observational data to facilitate more reliable inference.  General 
>>>>> idea is that contextual info can help interpret and identify 
>>>>> confounders. Confounders are common causes of the predictor and 
>>>>> outcome.  What I did with CORD-19: took pubmed IDs, and found what 
>>>>> machine reading performed and created KG.  Machine reading can run 
>>>>> for months.  Jim's work on citation analysis is cool.  Using 
>>>>> semrep, developed by NLM, over titles and abstracts in pubmed.  
>>>>> Using Pubmed central IDs from metadata table, in beginning of 
>>>>> March, 31k papers, with 28k in pubmed central.  Seemed like a good 
>>>>> place to start building a KG quickly, to see the big picture 
>>>>> quickly.  Pulled 106k semantic predications in the 21k docs, pulled 
>>>>> into cytoscape and computed network centrality, and from that 
>>>>> ranked. Filtered by biomedical entities, diseases, syndromes, amino 
>>>>> acids, peptides or pharm substances.  Ranked them by centrality to 
>>>>> understand their importance.  Very prelim analysis. Interested to 
>>>>> see how I might expand on this and learn what others are doing.
>>>>>
>>>>> Guoqian: Can cytoscape support RDF graphs?  David: Yes.  Jim: Yes, 
>>>>> and you can form SPARQL queries to extract specific interactions. 
>>>>> Not 1:1 mapping of RDF graph to bio network.
>>>>>
>>>>> Todor: There are different plugins, one is SPARQL endpoint.  Others 
>>>>> for other visualizations.  Keep expectations low.
>>>>>
>>>>> Jim: It also includes a knowledge exploration interface, built on 
>>>>> cytoscape.js, a re-implementation of cytoscape.  The implementation 
>>>>> I'm using has some interface element.
>>>>>
>>>>> Lucy: How does Coronavirus ont relate?
>>>>>
>>>>> Guoqian: Using this ont to annotate the papers.
>>>>>
>>>>> Lucy: Where did that ont come from?
>>>>>
>>>>> Jim: Built using OBO Foundry?  Guoqian: Yes.
>>>>>
>>>>> Jim: We use OBO ont.  Oliver has a lot of tools for subsetting and 
>>>>> extracting for app ontologies.
>>>>>
>>>>> Guoqian: Also collaborating with Cochrane PICO ontology, developing 
>>>>> COVID-19 PICO ont, specific subtypes of the high level types, eg, 
>>>>> subtypes of population with particular co-morbidities.  The ont 
>>>>> is also avail on github.
>>>>>
>>>>> Guoqian: How to collaborate?  Need a registry for KG from this 
>>>>> community?
>>>>>
>>>>> Lucy: Working on semantic annotation of entity and rel.  Lots of 
>>>>> people are doing bottom-up annotation, without formal vocab, then 
>>>>> linking to UMLS.  But haven't seen COVID-19 ont.
>>>>>
>>>>> Guoqian: Also should look at use cases that different groups have. 
>>>>> Community said they want open vocab instead of SNOMED-CT, such as 
>>>>> UMLS.
>>>>>
>>>>> Lucy: Also working with a group at AWS, KB of concepts, link to 
>>>>> ICD-10 and RXNorm, also lots of requests for protein and interactions.
>>>>>
>>>>> Guoqian: Also procedure datasets.
>>>>>
>>>>> Lucy: What use cases are these projects addressing?
>>>>>
>>>>> Guoqian: For EBMonFHIR, they are focused on review of evidence, and 
>>>>> clinical concepts.  Other team looking at using OBO ont to analyse 
>>>>> DB to mine underlying mechanisms.  Ideally we should have linkage 
>>>>> across vocabularies.  Eg UMLS can link many ont.  But for OBO it 
>>>>> might be a challenge.
>>>>>
>>>>> Jim: From microbio perspective, most useful from this group would 
>>>>> be having cross mapping from clinical/FHIR/SNOMED-ish world and OBO 
>>>>> bio world, with translation between the two.  E.g. I use uniprot 
>>>>> IDs. Is that a problem?  What about drug IDs?  IDs are the hardest 
>>>>> part -- agree on some, and mappings for others.
>>>>>
>>>>> Guoqian: If we can provide a list of ont each team prefers, we can 
>>>>> discuss.
>>>>>
>>>>> Lucy: Would be great to be able to share annotations.  Centralized 
>>>>> vocab?  Central KB?  Use cases are key.
>>>>>
>>>>> Scott: Mapping problems with COVID-19 are same as other mapping 
>>>>> problems.  Should have a central place to share projects.  Should 
>>>>> keep use cases in mind.
>>>>>
>>>>> Sebastian: Please give us feedback on the dataset!
>>>>>
>>>>> Todor: Focus on specific questions that you want to answer, then 
>>>>> map using common IDs to address them.
>>>>>
>>>>> Daniel: What formats?  Right now we're using FHIR.  Use others?
>>>>>
>>>>> Jim: identifiers.org might be useful for mapping.
>>>>>
>>>>> David: Useful to have each group present use cases and vocab.
>>>>>
>>>>> We'll meet weekly, same time, 1 hour.  Each group will present 
>>>>> their work in more detail, with focus on:
>>>>> what use cases they are addressing; and
>>>>> what vocabularies / ontologies they're using.
>>>>>
>>>>> Each group will present for 20 min, followed by 10 min of questions.
>>>>>
>>>>> ADJOURNED

Received on Tuesday, 28 April 2020 16:09:24 UTC