Re: CORD-19 semantic annotations - 11am Tuesday (Boston time) - Lightning talks on CORD-19 work from David Booth on 2020-05-18 (public-semweb-lifesci@w3.org from May 2020)

From: David Booth <david@dbooth.org>
Date: Mon, 18 May 2020 15:43:53 -0400
To: w3c semweb HCLS <public-semweb-lifesci@w3.org>
Message-ID: <97b4249d-f124-6f67-f36c-5e8511e5fab3@dbooth.org>
Tomorrow (Tuesday) we will have a series of 5-minute overview 
presentations by people doing semantic annotation of the CORD-19 dataset:

    Daniel Stone,
    Gaurav Vaidya,
    Gollam Rabby,
    Marcin Joachimiak,
    Michael Liebman,
    Tom Conlin,
    David Booth.

Zoom Link:
https://us02web.zoom.us/j/83815969391?pwd=Q0k4Nm9xc3V2K0djL0FYT2JMVTJmUT09

If anyone else wishes to present their CORD-19 work, please let me know.
We will probably hold another, similar session next week or in a
following week, for people who are not able to present tomorrow.

CORD-19 is a dataset released by the Allen Institute containing
63,000 journal articles related to COVID-19.

Thanks,
David Booth

On 5/13/20 10:46 AM, David Booth wrote:
> Notes from yesterday's webinar by Franck Michel are below.  Thanks to 
> Victor Mireles-Chavez a recording of the call is available at the 
> following URL.  Franck's presentation starts at 17:10.
> 
> https://tinyurl.com/y8kmfxhe
> Recording password: 7t?N&*9+
> 
> --------------------------------------------------------------
> MEETING NOTES 12-May-2020
> Present: David Booth, Victor Mireles, Franck Michel, Albert Burger, 
> Daniel Stone, Deborah McGuinness, Filip, Gaurav Vaidya, Gollam Rabby, 
> Louis Rumanes, Marcin Joachimiak, Michael Liebman, 
> Subhashis Das, Nico, Tom Conlin, Chuming Chen
> 
> Introductions
> David Booth: 10 years applying semantic web tech to healthcare and life 
> sciences, working on Mayo Clinic / Johns Hopkins University collaboration.
> 
> Subhashis Das: Postdoctoral researcher at CeIC, DCU, Dublin. 
> Specialization in domain ontology and healthcare data integration.
> 
> Franck's presentation
> Slides: 
> https://www.dropbox.com/s/nnyg1o45f9dvimk/20200512%20Covid-on-the-Web%20-%20CORD-19%20semantic%20annotations.pdf?dl=0 
> 
> 
> Franck: Goal is to make it easier to find and make sense of COVID-19 
> literature: both named entities, and argumentative graphs.  Using 
> DBpedia Spotlight, Entity-fishing, BioPortal Annotator.
> 
> Franck: Releasing v1.1 shortly.  54M named entities (NEs), 564k URIs:
>   30M NEs, 155,651 URIs from Wikidata
>   21M NEs, 339,990 URIs from BioPortal
>   1.8M NEs from DBpedia
> https://github.com/wimmics/cord19-nekg
> Full modelling details: 
> https://github.com/Wimmics/cord19-nekg/blob/master/doc/01-data-modeling.md
> SPARQL endpoint: http://covid19.i3s.unice.fr/sparql
> Virtuoso faceted browsing: http://covid19.i3s.unice.fr:8890/fct/
> Franck: Web Annotation ontology and PROV-O used to annotate articles. 
> Annotation points to article and position within the article where the 
> entity was found.
> 
> Franck: Able to query for cancer entity and its subclasses or instances.
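
A minimal sketch of how such a query might be assembled for the SPARQL endpoint listed above. The property paths follow Wikidata conventions (wdt:P279 for subclass-of, wdt:P31 for instance-of). The Wikidata ID Q12078 for cancer, the use of oa:hasBody and schema:about to link an annotation to its entity and article, and the function name are all assumptions for illustration, not taken from Franck's slides; the modelling document linked above is the authority on the actual graph layout.

```python
# Assumed Wikidata identifier for "cancer"; verify before relying on it.
WIKIDATA_CANCER = "http://www.wikidata.org/entity/Q12078"


def build_entity_query(root_uri: str, limit: int = 20) -> str:
    """Return a SPARQL query for articles annotated with root_uri,
    any of its subclasses, or instances of those classes."""
    # wdt:P31? matches zero or one instance-of hop, then wdt:P279*
    # follows any number of subclass-of hops up to the root class.
    return f"""
PREFIX oa:     <http://www.w3.org/ns/oa#>
PREFIX schema: <http://schema.org/>
PREFIX wdt:    <http://www.wikidata.org/prop/direct/>

SELECT DISTINCT ?article ?entity WHERE {{
  ?entity wdt:P31?/wdt:P279* <{root_uri}> .
  ?annotation a oa:Annotation ;
              oa:hasBody ?entity ;
              schema:about ?article .
}}
LIMIT {limit}
""".strip()


query = build_entity_query(WIKIDATA_CANCER)
print(query)
```

The printed string could be pasted into the endpoint's query form, or sent with any SPARQL client, once the predicates have been checked against the published model.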
> 
> Franck: Also looking at co-mentions of named entities.
> 
> Franck: Colleagues also working on ACTA: A Tool for Argumentative ... 
> claims/evidence.  This would allow arguments/claims/evidence to be 
> displayed in a graph.
> 
> David: What ont are you using for determining the subclass relations of 
> cancer, for example?
> Franck: So far using Wikidata hierarchy.  One exception: viruses in 
> Wikidata are not modeled as classes, so we regenerated them as classes.
> 
> Victor: Why can't DBpedia Spotlight process full text?
> Franck: We have 54M NEs, 700M triples.  Not enough machine power to do 
> full text.
> 
> Victor: If I find offsets, how can I be sure that I am aligned in my own 
> data?
> Franck: It refers specifically to the CORD-19 dataset.
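
One way to check alignment locally, sketched under assumptions: if an annotation carries both character offsets and the matched surface text, the offsets can be verified against your own copy of the text. The Annotation class and its field names (start, end, exact) are hypothetical, loosely modelled on the Web Annotation text position and text quote selectors; they are not the CORD19-NEKG schema.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Annotation:
    """Hypothetical annotation record, for illustration only."""
    start: int  # offset of the first character (inclusive)
    end: int    # offset just past the last character (exclusive)
    exact: str  # surface text the annotator matched


def is_aligned(text: str, ann: Annotation) -> bool:
    """True if the offsets select exactly the annotated surface text."""
    return text[ann.start:ann.end] == ann.exact


def misaligned(text: str, annotations: Iterable[Annotation]) -> List[Annotation]:
    """Return the annotations whose offsets do not match the local text."""
    return [a for a in annotations if not is_aligned(text, a)]


# Toy example: one annotation lines up, one is shifted by a character.
abstract = "Coronavirus infections in patients with cancer."
start = abstract.index("cancer")
good = Annotation(start, start + len("cancer"), "cancer")
shifted = Annotation(start + 1, start + 1 + len("cancer"), "cancer")

print(misaligned(abstract, [good, shifted]))
```

A non-empty result would suggest the local text differs from the version that was annotated, for example in whitespace or encoding.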
> 
> Marcin: How are you extracting info about viral proteins?  There are 
> polyproteins?
> Franck: We rely on the results of the tools we're using.  If a protein 
> is identified by those tools then we get them.  If an article mentions a 
> gene name, would it show up?
> 
> Marcin: There are a few of these different entity extraction efforts. 
> Should we try to merge them?
> 
> David: That's exactly the point of these teleconferences -- to start 
> learning about each other's work and figure out how best to coordinate.
> 
> Michael: We compared analysis of abstracts vs full body, and found 
> significant difference, because abstract is more of an advertisement. 
> Also, in dealing with the full body, we found it necessary to parse the 
> article and separate the sections on methods, results, conclusions.
> 
> Franck: My colleagues working on argumentative extraction, quality 
> varies a lot from one category to another.  They've noticed 
> (anecdotally) that clinical trials have an abstract with a few clear 
> statements about results, and relatively easy to extract, but not for 
> other articles.
> 
> Victor: Comment on avoiding duplication of effort, there is quite some 
> effort in doing annotations.  Some are better prepared than others. 
> Takes time.  By the time someone presents work, others have already 
> spent time doing similar work.
> 
> David: We began these calls with very brief presentations by each 
> participant, but after that, switched to deeper presentations of each 
> project.
> 
> Deborah: When presenting, please say what of your work is ready for 
> others to use.
> 
> Tom: Also interested in timing, how long things took, what was good/bad.
> 
> AGREED: Next week we will do 5-minute presentations of what we're doing 
> or planning.
> 
> Speakers next week: Daniel, Deborah, Gaurav, Gollam, Marcin, John Z, 
> Michael, Tom, David.
> 
> Subhashis: not next week, but later.
> 
> ADJOURNED
> 
> 
> On 5/11/20 12:22 PM, David Booth wrote:
>> Tomorrow (Tuesday) Franck Michel will present his work on CORD-19 
>> Named Entities Knowledge Graph (CORD19-NEKG).
>>
>> Zoom Link:
>> https://us02web.zoom.us/j/83815969391?pwd=Q0k4Nm9xc3V2K0djL0FYT2JMVTJmUT09 
>>
>>
>> Thanks,
>> David Booth
>>
>> On 4/28/20 12:09 PM, David Booth wrote:
>>> Notes from today's call:
>>>
>>> MEETING NOTES 28-Apr-2020
>>> Present: David Booth, Victor Mireles, Louis Rumanes, Tom Conlin, 
>>> Franck Michel, Gollam Rabby, Jim McCusker, Lucy Wong, Sebastian 
>>> Kohlmeier, Tomáš Kliegr
>>>
>>> Introductions
>>> David Booth: 10 years applying semantic web tech to healthcare and 
>>> life sciences, working on Mayo Clinic / Johns Hopkins University 
>>> collaboration.
>>>
>>> Louis Rumanes: United Health Group, Doing COVID research, looking at 
>>> making a KG
>>>
>>> Tom Conlin: Working with Melissa Haendel (Monarch Initiative),
>>>
>>> Franck: INRIA
>>>
>>> Gollam: Prague, Univ
>>>
>>> Jim: Research sci RPI, working on KG w bio
>>>
>>> Lucy: Allen institute, research scientist.
>>>
>>> Tomas: Assoc Prof, Prague, KG.
>>>
>>> Sebastian: Sr Mgr on CORD-19.
>>>
>>> Victor: Semantic Web company researcher
>>>
>>> Victor's Presentation
>>> Slides here: 
>>> https://docs.google.com/presentation/d/1xaS_88sJ47iSrvv0ezOfjscIvG2VINUe7vqrUEMiaCA/edit?usp=sharing 
>>>
>>>
>>> victor: Semantic Web Company, 40+ FTEs.  Makes PoolParty. Works w 
>>> companies in many countries.  Taxonomy helps extract entities from 
>>> text, image search, data mgmt.
>>>
>>> victor: Developing text and data mining tools for biomed, and 
>>> CORD-19. We don't only annotate text.  What's useful about annotating 
>>> text w entities is to use the knowledge, simplest is encoded in SKOS, 
>>> such as broader/narrower.  But to do this we need to annotate the 
>>> text into URIs, then import relationships into the graph.  Trying to 
>>> link existing annotations w other knowledge sources.  Ont is 
>>> simplified version of NIF: documents have sections, sections have 
>>> annotations that are SKOS concepts.
>>>
>>> victor: So far, we've set up a pipeline to take a document and it 
>>> finds annotations with offsets.  So far imported ChEBI, GO, MeSH, 
>>> HPO, but using them as controlled vocab.  Many are very specific, 
>>> such as "COVID-19" -- not really NLP, because there are not 
>>> inflections, plurals, etc.  Output is a bunch of triples in the 
>>> simple SKOS ont previously mentioned. Put them into GraphDB, along 
>>> with the vocabs.
>>>
>>> victor: Also looked at SciBite annotations.  They've done an 
>>> excellent job annotating.  They also have their own controlled vocab 
>>> that is very good.  JSON files have annotations. Put them into 
>>> triples. Combining them w bio DBs gives a graph DB.
>>>
>>> (victor shows relationships in GraphDB viewer)
>>>
>>> victor: you can navigate the hierarchy of concepts and link them to 
>>> the paragraphs in CORD-19 DB.
>>>
>>> (victor shows SPARQL queries)
>>>
>>> victor: This allows us to pull up the titles and paragraphs of 
>>> articles that both mention a kind of neoplasm and a kind of coronavirus.
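
The logic behind such a query can be sketched in plain Python: expand each top concept to everything narrower than it, then keep the paragraphs annotated with at least one concept from each expansion. The concept identifiers, paragraph IDs, and function names below are made-up toy data for illustration, not content from the actual graph. In a triple store the same expansion would typically be done in SPARQL with a property path over skos:broader.

```python
from collections import defaultdict, deque

# (narrower, broader) pairs, as skos:broader would relate them. Toy data.
BROADER = [
    ("ex:lung-carcinoma", "ex:carcinoma"),
    ("ex:carcinoma", "ex:neoplasm"),
    ("ex:sars-cov-2", "ex:coronavirus"),
    ("ex:mers-cov", "ex:coronavirus"),
]

# Paragraph ID -> concepts annotated in that paragraph. Toy data.
ANNOTATIONS = {
    "p1": {"ex:lung-carcinoma", "ex:sars-cov-2"},
    "p2": {"ex:carcinoma"},
    "p3": {"ex:mers-cov"},
    "p4": {"ex:neoplasm", "ex:coronavirus"},
}


def with_narrower(root, pairs):
    """Return root plus every concept transitively narrower than it."""
    children = defaultdict(set)
    for narrower, broader in pairs:
        children[broader].add(narrower)
    seen, queue = {root}, deque([root])
    while queue:  # breadth-first walk down the hierarchy
        for child in children[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def co_mentions(annotations, pairs, root_a, root_b):
    """Paragraphs mentioning some kind of root_a and some kind of root_b."""
    set_a = with_narrower(root_a, pairs)
    set_b = with_narrower(root_b, pairs)
    return sorted(
        pid for pid, concepts in annotations.items()
        if concepts & set_a and concepts & set_b
    )


print(co_mentions(ANNOTATIONS, BROADER, "ex:neoplasm", "ex:coronavirus"))
# → ['p1', 'p4']
```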
>>>
>>> victor: Want to take other DBs and put them into GraphDB also. 
>>> Monarch Initiative is putting together KG, and also puts in SciBite.
>>>
>>> victor: Missing from both our effort and Monarch: searchability.  I 
>>> showed SPARQL queries using broader/narrower.  Also need to be more 
>>> efficient for humans, working also on faceted search.  Monarch 
>>> Initiative is very good for machine readable stuff.  Another thing 
>>> missing: relation extraction, from the text.  How does human 
>>> determine that some text is saying that a protein interacts with 
>>> another.  JPL (Lewis McGibbney) is using Stanford NLP for 
>>> relation extraction.
>>> https://github.com/nasa-jpl-cord-19/covid19-knowledge-graph
>>> It isn't perfect, but it indicates a relationship.  Both entities are 
>>> in GO.  This adds new edges between entities.  Lots of interest in 
>>> this topic now.
>>>
>>> Franck: We're doing pretty close to this in INRIA, looking at named 
>>> entities, wikidata entities, queries that gather all articles on 
>>> cancer and any coronavirus.  Another thing we're doing: in addition 
>>> to detecting named entities, we're running other tools to identify 
>>> arguments, claims, evidence in articles and draw network of claims 
>>> and evidence to see what supports the claims.  Hope to publish this 
>>> network soon as RDF graph.
>>>
>>> victor: PubAnnotation shown last week, showed epistemic analysis.
>>>
>>> Franck: Argument, clinical trial analysis.  Query pubmed and platform 
>>> analyzes those articles.  Want to apply them to CORD-19.
>>>
>>> Vincent: Is RDF available? victor: Will take a couple more weeks. 
>>> Vincent: Size? victor: 20GB RDF.
>>>
>>> David: Overlap between efforts, helpful to learn about each other's 
>>> work.
>>>
>>> victor: After looking at Monarch Initiative, it isn't new, names I 
>>> recognized from Human Phenotype initiative.  Most of that summarizes 
>>> work that others have done.  FHIR DB also has overlaps with SciBite.
>>>
>>> david: SPARQL query was valuable, but biologists need simple UI.
>>>
>>> jim: Working on faceted browser for various things, that can be 
>>> reused. Based on SPARQL fragments, property path gives certain 
>>> values, here's how to render it.  Potentially useful here.  Also 
>>> integrated WHYIS Vega (JS framework for charts and visualization), 
>>> can plug a SPARQL query in and get a chart.  People can share how 
>>> they're exploring the graph.
>>> https://github.com/tetherless-world/whyis
>>> Faceted search is a view in WHYIS, but a lot of the capabilities are 
>>> designed to use nanopub.
>>>
>>> Email list for these calls: 
>>> https://lists.w3.org/Archives/Public/public-semweb-lifesci/
>>>
>>> Franck to present next week.
>>>
>>> ADJOURNED
Received on Monday, 18 May 2020 19:44:07 UTC