Re: CORD-19 semantic annotations - 11am Tuesday (Boston time) - Franck Michel on Named Entities Knowledge Graph from David Booth on 2020-05-13 (public-semweb-lifesci@w3.org from May 2020)

From: David Booth <david@dbooth.org>
Date: Wed, 13 May 2020 10:46:04 -0400
To: w3c semweb HCLS <public-semweb-lifesci@w3.org>
Cc: Franck Michel <fmichel@i3s.unice.fr>
Message-ID: <0a1590dd-349a-2193-7473-cf0b7097b20f@dbooth.org>
Notes from yesterday's webinar by Franck Michel are below.  Thanks to 
Victor Mireles-Chavez, a recording of the call is available at the 
following URL.  Franck's presentation starts at 17:10.

https://tinyurl.com/y8kmfxhe
Recording password: 7t?N&*9+

--------------------------------------------------------------
MEETING NOTES 12-May-2020
Present: David Booth, Victor Mireles, Franck Michel, Albert Burger, 
Daniel Stone, Deborah McGuinness, Filip, Gaurav Vaidya, Gollam Rabby, 
Louis, Louis Rumanes, Marcin Joachimiak, Michael Liebman, 
Subhashis Das, Nico, Tom Conlin, Chuming Chen

Introductions
David Booth: 10 years applying semantic web tech to healthcare and life 
sciences, working on Mayo Clinic / Johns-Hopkins University collaboration.

Subhashis Das: Postdoctoral researcher at CeIC, DCU, Dublin. 
Specialization in domain ontology and healthcare data integration.

Franck's presentation
Slides: 
https://www.dropbox.com/s/nnyg1o45f9dvimk/20200512%20Covid-on-the-Web%20-%20CORD-19%20semantic%20annotations.pdf?dl=0 


Franck: Goal is to make it easier to find and make sense of COVID-19 
literature: both named entities, and argumentative graphs.  Using 
DBpedia Spotlight, Entity-fishing, BioPortal Annotator.

Franck: Releasing v1.1 shortly.  54M named entities, 564k URIs.
30M NEs, 155,651 URIs from Wikidata
21M NEs, 339,990 URIs from BioPortal
1.8M NEs from DBpedia
https://github.com/wimmics/cord19-nekg
Full modelling details: 
https://github.com/Wimmics/cord19-nekg/blob/master/doc/01-data-modeling.md
SPARQL endpoint: http://covid19.i3s.unice.fr/sparql
Virtuoso faceted browsing: http://covid19.i3s.unice.fr:8890/fct/

Franck: Web Annotation ontology and PROV-O used to annotate articles. 
Annotation points to article and position within the article where the 
entity was found.

Franck: Able to query for cancer entity and its subclasses or instances.
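
As an illustration of that kind of query, the sketch below builds a 
SPARQL request in Python for annotations whose body is the Wikidata 
cancer entity or anything below it.  It is not taken from Franck's 
slides.  It assumes that annotations follow the Web Annotation pattern 
oa:hasBody / oa:hasTarget / oa:hasSource, that Wikidata's P279 
(subclass of) and P31 (instance of) triples are loaded in the same 
store, that wd:Q12078 is the cancer entity, and that the endpoint 
accepts Virtuoso's "format" parameter.  The modelling document linked 
above is the authority on the actual property names.

```python
from urllib.parse import urlencode

ENDPOINT = "http://covid19.i3s.unice.fr/sparql"  # endpoint given in the notes


def build_subclass_query(entity_id: str, limit: int = 20) -> str:
    """Return a SPARQL query for articles annotated with an entity or its descendants.

    The oa: pattern and the Wikidata properties are assumptions about the
    dataset, not confirmed by the meeting notes.
    """
    return f"""
PREFIX oa:  <http://www.w3.org/ns/oa#>
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT DISTINCT ?article ?entity WHERE {{
  # the entity itself, a transitive subclass, or an instance of either
  ?entity wdt:P31?/wdt:P279* wd:{entity_id} .
  ?annotation oa:hasBody ?entity ;
              oa:hasTarget/oa:hasSource ?article .
}} LIMIT {limit}
""".strip()


def request_url(query: str) -> str:
    """Build a GET URL; the 'format' parameter is a Virtuoso convention (assumed)."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return ENDPOINT + "?" + urlencode(params)


cancer_query = build_subclass_query("Q12078", limit=5)
cancer_url = request_url(cancer_query)
```

The URL in cancer_url could then be fetched with any HTTP client.  The 
sketch stops short of sending the request so that it runs without 
network access.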

Franck: Also looking at co-mentions of named entities.
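
One simple way to look at co-mentions, sketched here in Python on toy 
data, is to count how many articles mention each pair of entities.  The 
article IDs and entity labels are made up for illustration, and this is 
not necessarily how the CORD19-NEKG team computes co-mentions.

```python
from collections import Counter
from itertools import combinations


def count_co_mentions(entities_by_article):
    """Count how many articles mention each unordered pair of entities."""
    pair_counts = Counter()
    for entities in entities_by_article.values():
        # sorted(set(...)) gives each pair one canonical order and counts
        # it at most once per article, however often the entities repeat
        for pair in combinations(sorted(set(entities)), 2):
            pair_counts[pair] += 1
    return pair_counts


# Toy input: hypothetical article IDs mapped to the entities found in them
sample = {
    "article-1": ["coronavirus", "cancer", "coronavirus"],
    "article-2": ["coronavirus", "cancer", "pneumonia"],
    "article-3": ["pneumonia"],
}
counts = count_co_mentions(sample)
```

Counting per article rather than per mention keeps long articles from 
dominating the result.  Other weightings are equally possible.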

Franck: Colleagues also working on ACTA: A Tool for Argumentative ... 
claims/evidence.  This would allow arguments/claims/evidence to be 
displayed in a graph.

David: What ontology are you using for determining the subclass 
relations of cancer, for example?
Franck: So far using Wikidata hierarchy.  One exception: viruses in 
Wikidata are not modeled as classes, so we regenerated them as classes.

Victor: Why can't DBpedia Spotlight process full text?
Franck: We have 54M NEs, 700M triples.  Not enough machine power to do 
full text.

Victor: If I find offsets, how can I be sure that I am aligned in my own 
data?
Franck: It refers specifically to the CORD-19 dataset.
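
A possible sanity check for alignment, sketched in Python: compare the 
text found at the annotation's offsets in your own copy of the article 
with the surface form recorded in the annotation.  This assumes each 
annotation carries a start offset, an end offset and the matched string 
(as with oa:start, oa:end and oa:exact in the Web Annotation 
vocabulary), with a zero-based start and an exclusive end.  Whether the 
dataset exposes all three is an assumption, not something stated on the 
call.

```python
def annotation_aligns(text: str, start: int, end: int, exact: str) -> bool:
    """Return True if text[start:end] equals the surface form of the annotation.

    start is zero-based and end is exclusive, which matches both Python
    slicing and the Web Annotation TextPositionSelector convention.
    """
    in_bounds = 0 <= start <= end <= len(text)
    return in_bounds and text[start:end] == exact


# Toy example with a made-up sentence and a hand-built annotation
local_text = "Coronavirus infection and cancer risk"
start = local_text.index("cancer")
end = start + len("cancer")

aligned = annotation_aligns(local_text, start, end, "cancer")
# Shifting the offsets by one character simulates a misaligned local copy
shifted = annotation_aligns(local_text, start + 1, end + 1, "cancer")
```

If many annotations fail the check, the local text probably differs 
from the CORD-19 version the offsets were computed against.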

Marcin: How are you extracting info about viral proteins?  There are 
polyproteins?
Franck: We rely on the results of the tools we're using.  If a protein 
is identified by those tools then we get them.  If an article mentions a 
gene name, would it show up?

Marcin: There are a few of these different entity extraction efforts. 
Should we try to merge them?

David: That's exactly the point of these teleconferences -- to start 
learning about each other's work and figure out how best to coordinate.

Michael: We compared analysis of abstracts vs full body, and found 
significant difference, because abstract is more of an advertisement. 
Also, in dealing with the full body, we found it necessary to parse the 
article, separate section on methods, results, conclusions.
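
A minimal Python sketch of that kind of section separation, assuming 
the CORD-19 full-text JSON layout in which each entry of "body_text" 
has a "text" and a "section" field.  The keyword-to-bucket mapping is 
hypothetical, since section titles vary a lot between articles, and 
this is not the parser Michael's group used.

```python
from collections import defaultdict

# Hypothetical mapping from title keywords to coarse buckets;
# the first matching keyword wins
SECTION_KEYWORDS = {
    "method": "methods",
    "result": "results",
    "conclusion": "conclusions",
    "discussion": "conclusions",
}


def bucket_for(section_title: str) -> str:
    """Map a free-form section title to a coarse bucket, or 'other'."""
    title = section_title.lower()
    for keyword, bucket in SECTION_KEYWORDS.items():
        if keyword in title:
            return bucket
    return "other"


def group_by_section(paper: dict) -> dict:
    """Group the paragraphs of one parsed article by coarse section bucket."""
    grouped = defaultdict(list)
    for paragraph in paper.get("body_text", []):
        bucket = bucket_for(paragraph.get("section", ""))
        grouped[bucket].append(paragraph["text"])
    return dict(grouped)


# Toy article with made-up content
sample_paper = {
    "body_text": [
        {"section": "Introduction", "text": "Background paragraph."},
        {"section": "Materials and Methods", "text": "How it was done."},
        {"section": "Results", "text": "What was found."},
        {"section": "Conclusions", "text": "What it means."},
    ]
}
sections = group_by_section(sample_paper)
```

Each bucket can then be annotated or analysed separately, for example 
to compare what the results sections say with what the abstract claims.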

Franck: For my colleagues working on argument extraction, quality 
varies a lot from one category of article to another.  They've noticed 
(anecdotally) that clinical trials have an abstract with a few clear 
statements about results, which are relatively easy to extract, but 
that is not the case for other articles.

Victor: Comment on avoiding duplication of effort, there is quite some 
effort in doing annotations.  Some are better prepared than others. 
Takes time.  By the time someone presents work, others have already 
spent time doing similar work.

David: We began these calls with very brief presentations by each 
participant, but after that, switched to deeper presentations of each 
project.

Deborah: When presenting, please say what of your work is ready for 
others to use.

Tom: Also interested in timing, how long things took, what was good/bad.

AGREED: Next week we will do 5-minute presentations of what we're doing 
or planning.

Speakers next week: Daniel, Deborah, Gaurav, Gollam, Marcin, John Z, 
Michael, Tom, David.

Subhashis: not next week, but later.

ADJOURNED


On 5/11/20 12:22 PM, David Booth wrote:
> Tomorrow (Tuesday) Franck Michel will present his work on CORD-19 Named 
> Entities Knowledge Graph (CORD19-NEKG).
> 
> Zoom Link:
> https://us02web.zoom.us/j/83815969391?pwd=Q0k4Nm9xc3V2K0djL0FYT2JMVTJmUT09
> 
> Thanks,
> David Booth
> 
> On 4/28/20 12:09 PM, David Booth wrote:
>> Notes from today's call:
>>
>> MEETING NOTES 28-Apr-2020
>> Present: David Booth, Victor Mireles, Louis Rumanes, Tom Conlin, 
>> Franck Michel, Gollam Rabby, Jim McCusker, Lucy Wong, Sebastian 
>> Kohlmeier, Tomáš Kliegr
>>
>> Introductions
>> David Booth: 10 years applying semantic web tech to healthcare and 
>> life sciences, working on Mayo Clinic / Johns-Hopkins University 
>> collaboration.
>>
>> Louis Rumanes: United Health Group, Doing COVID research, looking at 
>> making a KG
>>
>> Tom Conlin: Working with Melissa Haendel (Monarch Initiative),
>>
>> Franck: INRIA
>>
>> Gollam: Prague, Univ
>>
>> Jim: Research sci RPI, working on KG w bio
>>
>> Lucy: Allen institute, research scientist.
>>
>> Tomas: Assoc Prof, Prague, KG.
>>
>> Sebastian: Sr Mgr on CORD-19.
>>
>> Victor: Semantic Web company researcher
>>
>> Victor's Presentation
>> Slides here: 
>> https://docs.google.com/presentation/d/1xaS_88sJ47iSrvv0ezOfjscIvG2VINUe7vqrUEMiaCA/edit?usp=sharing 
>>
>>
>> victor: Semantic Web Company, 40+ FTEs.  Makes PoolParty. Works w 
>> companies in many countries.  Taxonomy helps extract entities from 
>> text. image search, data mgmt.
>>
>> victor: Developing text and data mining tools for biomed, and CORD-19. 
>> We don't only annotate text.  What's useful about annotating text w 
>> entities is to use the knowledge, simplest is encoded in SKOS, such as 
>> broader/narrower.  But to do this we need to annotate the text into 
>> URIs, then import relationships into the graph.  Trying to link 
>> existing annotations w other knowledge sources.  Ont is simplified 
>> version of NIF: documents have sections, sections have annotations 
>> that are SKOS concepts.
>>
>> victor: So far, we've set up a pipeline to take a document and it 
>> finds annotations with offsets.  So far imported ChEBI, GO, MeSH, HPO, 
>> but using them as controlled vocab.  Many are very specific, such as 
>> "COVID-19" -- not really NLP, because there are not inflections, 
>> plurals, etc.  Output is a bunch of triples in the simple SKOS ont 
>> previously mentioned. Put them into GraphDB, along with the vocabs.
>>
>> victor: Also looked at SciBite annotations.  They've done an excellent 
>> job annotating.  They also have their own controlled vocab that is 
>> very good.  JSON files have annotations. Put them into triples.  
>> Combining them w bio DBs gives a graph DB.
>>
>> (victor shows relationships in GraphDB viewer)
>>
>> victor: you can navigate the hierarchy of concepts and link them to 
>> the paragraphs in CORD-19 DB.
>>
>> (victor shows SPARQL queries)
>>
>> victor: This allows us to pull up the titles and paragraphs of 
>> articles that both mention a kind of neoplasm and a kind of coronavirus.
>>
>> victor: Want to take other DBs and put them into GraphDB also.  
>> Monarch Initiative is putting together KG, and also puts in SciBite.
>>
>> victor: Missing from both our effort and Monarch: searchability.  I 
>> showed SPARQL queries using broader/narrower.  Also need to be more 
>> efficient for humans, working also on faceted search.  Monarch 
>> Initiative is very good for machine readable stuff.  Another thing 
>> missing: relation extraction, from the text.  How does human determine 
>> that some text is saying that a protein interacts with another.  JPL 
>> (Lewis McGibbney) is using a Stanford NLP for relation extraction.
>> https://github.com/nasa-jpl-cord-19/covid19-knowledge-graph
>> It isn't perfect, but it indicates a relationship.  Both entities are 
>> in GO.  This adds new edges between entities.  Lots of interest in 
>> this topic now.
>>
>> Franck: We're doing pretty close to this in INRIA, looking at named 
>> entities, wikidata entities, queries that gather all articles on 
>> cancer and any coronavirus.  Another thing we're doing: in addition to 
>> detecting named entities, we're running other tools to identify 
>> arguments, claims, evidence in articles and draw network of claims and 
>> evidence to see what supports the claims.  Hope to publish this 
>> network soon as RDF graph.
>>
>> victor: PubAnnotation shown last week, showed epistemic analysis.
>>
>> Franck: Argument, clinical trial analysis.  Query pubmed and platform 
>> analyzes those articles.  Want to apply them to CORD-19.
>>
>> Vincent: Is RDF available? victor: Will take a couple more weeks. 
>> Vincent: Size? victor: 20GB RDF.
>>
>> David: Overlap between efforts, helpful to learn about each other's work.
>>
>> victor: After looking at Monarch initiative, it isn't new, names I 
>> recognized from Human Phenotype initiative.  Most of that summarizes 
>> work that others have done.  FHIR DB also have overlaps with SciBite.
>>
>> David: SPARQL query was valuable, but biologists need simple UI.
>>
>> Jim: Working on faceted browser for various things, that can be 
>> reused. Based on SPARQL fragments, property path gives certain values, 
>> here's how to render it.  Potentially useful here.  Also integrated 
>> WHYIS Vega (JS framework for charts and visualization), can plug a 
>> SPARQL query in and get a chart.  People can share how they're 
>> exploring the graph.
>> https://github.com/tetherless-world/whyis
>> Faceted search is a view in WHYIS, but a lot of the capabilities are 
>> designed to use nanopub.
>>
>> Email list for these calls: 
>> https://lists.w3.org/Archives/Public/public-semweb-lifesci/
>>
>> Franck to present next week.
>>
>> ADJOURNED
Received on Wednesday, 13 May 2020 14:46:19 UTC