Re: CORD-19 semantic annotations - 11am Tuesday (Boston time) - Lightning talks on CORD-19 work from David Booth on 2020-05-19 (public-semweb-lifesci@w3.org from May 2020)

From: David Booth <david@dbooth.org>
Date: Tue, 19 May 2020 10:25:35 -0400
To: w3c semweb HCLS <public-semweb-lifesci@w3.org>
Message-ID: <3b9bc30f-6924-3298-3766-493cddd9490a@dbooth.org>
Zoom Link for today's call:
  https://us02web.zoom.us/j/83815969391?pwd=Q0k4Nm9xc3V2K0djL0FYT2JMVTJmUT09

Slides for today's talks:

On 5/18/20 3:43 PM, David Booth wrote:
> Tomorrow (Tuesday) we will have a series of 5-minute overview 
> presentations by people doing semantic annotation of the CORD-19 dataset:

>     Gaurav Vaidya,
https://docs.google.com/presentation/d/1ghAqVwgrCO6moGyWNSZfRBApMZfJnoqa9Z5NwhRF53g/edit?usp=sharing

>     Gollam Rabby,
https://github.com/corei5/Entity-Based-Document-Classification-on-the-CORD---19-Corpus

>     Marcin Joachimiak,
https://lists.w3.org/Archives/Public/www-archive/2020May/att-0001/01-part

>     Michael Liebman,
(None sent yet)

>     Tom Conlin,
(None sent yet)

>     David Booth and Daniel Stone (Mayo Clinic & Johns-Hopkins University)
https://tinyurl.com/cord-19-on-fhir


Thanks,
David Booth

> 
> If anyone else wishes to present their CORD-19 work, please let me know. 
>   We will probably hold another, similar session next week or a 
> following week also, for people who were not able to present today.
> 
> The CORD-19 dataset is a dataset released by the Allen Institute 
> containing 63,000 journal articles related to COVID-19.
> 
> Thanks,
> David Booth
> 
> On 5/13/20 10:46 AM, David Booth wrote:
>> Notes from yesterday's webinar by Franck Michel are below.  Thanks to 
>> Victor Mireles-Chavez, a recording of the call is available at the 
>> following URL.  Franck's presentation starts at 17:10.
>>
>> https://tinyurl.com/y8kmfxhe
>> Recording password: 7t?N&*9+
>>
>> --------------------------------------------------------------
>> MEETING NOTES 12-May-2020
>> Present: David Booth, Victor Mireles, Franck Michel, Albert Burger, 
>> Daniel Stone, Deborah McGuinness, Filip, Gaurav Vaidya, Gollam Rabby, 
>> Louis Rumanes, Marcin Joachimiak, Michael Liebman, Subhashis Das, 
>> Nico, Tom Conlin, Chuming Chen
>>
>> Introductions
>> David Booth: 10 years applying semantic web tech to healthcare and 
>> life sciences, working on Mayo Clinic / Johns-Hopkins University 
>> collaboration.
>>
>> Subhashis Das: PostDoctoral researcher at CeIC, DCU, Dublin. 
>> Specialization in domain ontology and healthcare data integration.
>>
>> Franck's presentation
>> Slides: 
>> https://www.dropbox.com/s/nnyg1o45f9dvimk/20200512%20Covid-on-the-Web%20-%20CORD-19%20semantic%20annotations.pdf?dl=0 
>>
>>
>> Franck: Goal is to make it easier to find and make sense of COVID-19 
>> literature: both named entities, and argumentative graphs.  Using 
>> DBpedia Spotlight, Entity-fishing, BioPortal Annotator.
>>
>> Franck: Releasing v1.1 shortly.  54M named entities, 564k URIs.
>> 30M NEs, 155,651 URIs from Wikidata
>> 21M NEs, 339,990 URIs from BioPortal
>> 1.8M NEs from DBpedia
>> https://github.com/wimmics/cord19-nekg
>> Full modelling details: 
>> https://github.com/Wimmics/cord19-nekg/blob/master/doc/01-data-modeling.md 
>>
>> SPARQL endpoint: http://covid19.i3s.unice.fr/sparql
>> Virtuoso faceted browsing: http://covid19.i3s.unice.fr:8890/fct/
>> Franck: Web annotation ont and PROV-O used to annotate articles. 
>> Annotation points to article and position within the article where the 
>> entity was found.
>>
>> Franck: Able to query for cancer entity and its subclasses or instances.
>>
>> Franck: Also looking at co-mentions of named entities.
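
One way to picture the co-mention analysis mentioned above is to count how many articles mention each pair of entities together. This is only a minimal sketch under assumptions: the article IDs, entity labels, and the `co_mention_counts` helper are hypothetical and are not taken from CORD19-NEKG.

```python
from collections import Counter
from itertools import combinations


def co_mention_counts(article_entities):
    """Count, for each unordered entity pair, the articles mentioning both."""
    counts = Counter()
    for entities in article_entities.values():
        # De-duplicate within an article so repeated mentions count once,
        # and sort so each pair has a single canonical ordering.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


# Hypothetical toy data: article ID -> entities found in that article.
articles = {
    "paper-1": ["coronavirus", "cancer", "coronavirus"],
    "paper-2": ["coronavirus", "cancer", "ACE2"],
    "paper-3": ["ACE2", "coronavirus"],
}
counts = co_mention_counts(articles)
for pair, n in counts.most_common():
    print(pair, n)
```

In a real knowledge graph the entities would be URIs rather than labels, and the counting would more likely be done with a SPARQL aggregate than in memory.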
>>
>> Franck: Colleagues also working on ACTA: A Tool for Argumentative ... 
>> claims/evidence.  This would allow arguments/claims/evidence to be 
>> displayed in a graph.
>>
>> David: What ont are you using for determining the subclass relations 
>> of cancer, for example?
>> Franck: So far using wikidata hierarchy.  One exception: viruses in 
>> wikidata are not modeled as classes, so we regenerated them as classes.
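
The subclass-aware lookup discussed above amounts to a transitive closure over "subclass of" edges. The sketch below shows that idea in memory, assuming a hypothetical list of (child, parent) pairs; it is not the project's code, and the labels stand in for Wikidata URIs. In SPARQL 1.1 the same closure is usually expressed with a property path such as `rdfs:subClassOf*`.

```python
from collections import defaultdict, deque


def descendants(root, subclass_of):
    """Return root plus everything reachable by following subclass edges downward."""
    children = defaultdict(set)
    for child, parent in subclass_of:
        children[parent].add(child)

    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in seen:  # guards against cycles in the hierarchy
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical toy hierarchy: (child, parent) pairs.
edges = [
    ("lung cancer", "cancer"),
    ("small-cell lung cancer", "lung cancer"),
    ("leukemia", "cancer"),
    ("SARS-CoV-2", "coronavirus"),
]
cancer_terms = descendants("cancer", edges)
print(sorted(cancer_terms))
```

Articles annotated with any member of `cancer_terms` would then count as articles about cancer.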
>>
>> Victor: Why can't DBpedia Spotlight process full text?
>> Franck: We have 54M NEs, 700M triples.  Not enough machine power to do 
>> full text.
>>
>> Victor: If I find offsets, how can I be sure that I am aligned in my 
>> own data?
>> Franck: It refers specifically to the CORD-19 dataset.
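
The alignment concern can be checked mechanically if an annotation carries the matched text as well as its character offsets (the W3C Web Annotation model has both a TextPositionSelector and a TextQuoteSelector for this). The sketch below assumes that both are available; the `Annotation` class, the example URI, and the sample sentence are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    entity_uri: str
    start: int   # offset of the first character of the mention
    end: int     # offset one past the last character of the mention
    exact: str   # the text the annotator saw at [start, end)


def is_aligned(text, ann):
    """True if the local copy of the text has the expected string at the offsets."""
    return text[ann.start:ann.end] == ann.exact


abstract = "Coronavirus infection in patients with lung cancer."
start = abstract.index("cancer")
ann = Annotation("http://example.org/entity/cancer", start, start + len("cancer"), "cancer")

print(is_aligned(abstract, ann))               # same text the annotator used
print(is_aligned("Title. " + abstract, ann))   # a copy with extra leading text
```

If the check fails for a local copy, the offsets were computed against a different rendering of the article and should not be trusted without re-alignment.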
>>
>> Marcin: How are you extracting info about viral proteins?  There are 
>> polyproteins?
>> Franck: We rely on the results of the tools we're using.  If a protein 
>> is identified by those tools then we get them.  If an article mentions 
>> a gene name, would it show up?
>>
>> Marcin: There are a few of these different entity extraction efforts. 
>> Should we try to merge them?
>>
>> David: That's exactly the point of these teleconferences -- to start 
>> learning about each other's work and figure out how best to coordinate.
>>
>> michael: We compared analysis of abstracts vs full body, and found 
>> significant difference, because abstract is more of an advertisement. 
>> Also, in dealing with the full body, we found it necessary to parse 
>> the article, separate section on methods, results, conclusions.
>>
>> Franck: My colleagues working on argumentative extraction, quality 
>> varies a lot from one category to another.  They've noticed 
>> (anecdotally) that clinical trials have an abstract with a few clear 
>> statements about results, and relatively easy to extract, but not for 
>> other articles.
>>
>> Victor: Comment on avoiding duplication of effort, there is quite some 
>> effort in doing annotations.  Some are better prepared than others. 
>> Takes time.  By the time someone presents work, others have already 
>> spent time doing similar work.
>>
>> David: We began these calls with very brief presentations by each 
>> participant, but after that, switched to deeper presentations of each 
>> project.
>>
>> Deborah: When presenting, please say what of your work is ready for 
>> others to use.
>>
>> Tom: Also interested in timing, how long things took, what was good/bad.
>>
>> AGREED: Next week we will do 5-minute presentations of what we're 
>> doing or planning.
>>
>> Speakers next week: Daniel, Deborah, Gaurav, Gollam, Marcin, John Z, 
>> Michael, Tom, David.
>>
>> Subhashis: not next week, but later.
>>
>> ADJOURNED
>>
>>
>> On 5/11/20 12:22 PM, David Booth wrote:
>>> Tomorrow (Tuesday) Franck Michel will present his work on CORD-19 
>>> Named Entities Knowledge Graph (CORD19-NEKG).
>>>
>>> Zoom Link:
>>> https://us02web.zoom.us/j/83815969391?pwd=Q0k4Nm9xc3V2K0djL0FYT2JMVTJmUT09 
>>>
>>>
>>> Thanks,
>>> David Booth
>>>
>>> On 4/28/20 12:09 PM, David Booth wrote:
>>>> Notes from today's call:
>>>>
>>>> MEETING NOTES 28-Apr-2020
>>>> Present: David Booth, Victor Mireles, Louis Rumanes, Tom Conlin, 
>>>> Franck Michel, Gollam Rabby, Jim McCusker, Lucy Wong, Sebastian 
>>>> Kohlmeier, Tomáš Kliegr
>>>>
>>>> Introductions
>>>> David Booth: 10 years applying semantic web tech to healthcare and 
>>>> life sciences, working on Mayo Clinic / Johns-Hopkins University 
>>>> collaboration.
>>>>
>>>> Louis Rumanes: United Health Group, doing COVID research, looking at 
>>>> making a KG
>>>>
>>>> Tom Conlin: Working with Melissa Haendel (Monarch Initiative),
>>>>
>>>> Franck: INRIA
>>>>
>>>> Gollam: Prague, Univ
>>>>
>>>> Jim: Research sci RPI, working on KG w bio
>>>>
>>>> Lucy: Allen Institute, research scientist.
>>>>
>>>> Tomas: Assoc Prof, Prague, KG.
>>>>
>>>> Sebastian: Sr Mgr on CORD-19.
>>>>
>>>> Victor: Semantic Web company researcher
>>>>
>>>> Victor's Presentation
>>>> Slides here: 
>>>> https://docs.google.com/presentation/d/1xaS_88sJ47iSrvv0ezOfjscIvG2VINUe7vqrUEMiaCA/edit?usp=sharing 
>>>>
>>>>
>>>> victor: Semantic Web Company, 40+ FTEs.  Makes PoolParty. Works w 
>>>> companies in many countries.  Taxonomy helps extract entities from 
>>>> text. image search, data mgmt.
>>>>
>>>> victor: Developing text and data mining tools for biomed, and 
>>>> CORD-19. We don't only annotate text.  What's useful about 
>>>> annotating text w entities is to use the knowledge, simplest is 
>>>> encoded in SKOS, such as broader/narrower.  But to do this we need 
>>>> to annotate the text into URIs, then import relationships into the 
>>>> graph.  Trying to link existing annotations w other knowledge 
>>>> sources.  Ont is simplified version of NIF: documents have 
>>>> sections, sections have annotations that are SKOS concepts.
>>>>
>>>> victor: So far, we've set up a pipeline to take a document and it 
>>>> finds annotations with offsets.  So far imported ChEBI, GO, MeSH, 
>>>> HPO, but using them as controlled vocab.  Many are very specific, 
>>>> such as "COVID-19" -- not really NLP, because there are no 
>>>> inflections, plurals, etc.  Output is a bunch of triples in the 
>>>> simple SKOS ont previously mentioned. Put them into GraphDB, along 
>>>> with the vocabs.
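
The controlled-vocabulary matching described above can be sketched as exact label lookup that records character offsets. This is only an illustration under assumptions, not the PoolParty pipeline: the `annotate` function, the vocabulary, and the example.org URIs are hypothetical.

```python
import re


def annotate(text, vocab):
    """Return sorted (start, end, uri) tuples for whole-word label matches."""
    hits = []
    for label, uri in vocab.items():
        # Whole-word, case-insensitive match of the literal label.
        # No stemming is applied, so inflected forms are not found.
        pattern = re.compile(r"\b" + re.escape(label) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), uri))
    return sorted(hits)


# Hypothetical vocabulary: preferred label -> concept URI.
vocab = {
    "COVID-19": "http://example.org/concept/covid-19",
    "pneumonia": "http://example.org/concept/pneumonia",
}
text = "COVID-19 patients may develop pneumonia. Covid-19 testing expanded."
hits = annotate(text, vocab)
for start, end, uri in hits:
    print(start, end, text[start:end], uri)
```

Because the match is literal, a plural such as "pneumonias" is skipped, which is the limitation the notes point out for terms that do have inflections.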
>>>>
>>>> victor: Also looked at SciBite annotations.  They've done an 
>>>> excellent job annotating.  They also have their own controlled vocab 
>>>> that is very good.  JSON files have annotations. Put them into 
>>>> triples. Combining them w bio DBs gives a graph DB.
>>>>
>>>> (victor shows relationships in GraphDB viewer)
>>>>
>>>> victor: you can navigate the hierarchy of concepts and link them to 
>>>> the paragraphs in CORD-19 DB.
>>>>
>>>> (victor shows SPARQL queries)
>>>>
>>>> victor: This allows us to pull up the titles and paragraphs of 
>>>> articles that both mention a kind of neoplasm and a kind of 
>>>> coronavirus.
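
The "kind of neoplasm and kind of coronavirus" query relies on following SKOS broader links upward from each annotated concept. The sketch below reproduces that logic in memory with hypothetical data and function names; it is not the query shown on the call. In SPARQL the upward walk would typically be a property path such as `skos:broader*`.

```python
def broader_closure(concept, broader):
    """Return the concept plus all of its transitive broader concepts."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(broader.get(current, ()))
    return seen


def paragraphs_mentioning_both(paragraph_concepts, broader, first, second):
    """IDs of paragraphs with a concept under `first` and a concept under `second`."""
    result = []
    for paragraph_id, concepts in paragraph_concepts.items():
        ancestors = set()
        for concept in concepts:
            ancestors |= broader_closure(concept, broader)
        if first in ancestors and second in ancestors:
            result.append(paragraph_id)
    return sorted(result)


# Hypothetical toy data: concept -> directly broader concepts.
broader = {
    "lung carcinoma": ["carcinoma"],
    "carcinoma": ["neoplasm"],
    "SARS-CoV-2": ["coronavirus"],
    "MERS-CoV": ["coronavirus"],
}
# Hypothetical paragraph ID -> concepts annotated in that paragraph.
paragraphs = {
    "doc1#p3": {"lung carcinoma", "SARS-CoV-2"},
    "doc2#p1": {"MERS-CoV"},
    "doc3#p7": {"carcinoma", "coronavirus"},
}
matches = paragraphs_mentioning_both(paragraphs, broader, "neoplasm", "coronavirus")
print(matches)
```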
>>>>
>>>> victor: Want to take other DBs and put them into GraphDB also. 
>>>> Monarch Initiative is putting together KG, and also puts in SciBite.
>>>>
>>>> victor: Missing from both our effort and Monarch: searchability.  I 
>>>> showed SPARQL queries using broader/narrower.  Also need to be more 
>>>> efficient for humans, working also on faceted search.  Monarch 
>>>> Initiative is very good for machine readable stuff.  Another thing 
>>>> missing: relation extraction, from the text.  How does human 
>>>> determine that some text is saying that a protein interacts with 
>>>> another.  JPL (Lewis McGibbney) is using Stanford NLP for 
>>>> relation extraction.
>>>> https://github.com/nasa-jpl-cord-19/covid19-knowledge-graph
>>>> It isn't perfect, but it indicates a relationship.  Both entities 
>>>> are in GO.  This adds new edges between entities.  Lots of interest 
>>>> in this topic now.
>>>>
>>>> Franck: We're doing pretty close to this in INRIA, looking at named 
>>>> entities, wikidata entities, queries that gather all articles on 
>>>> cancer and any coronavirus.  Another thing we're doing: in addition 
>>>> to detecting named entities, we're running other tools to identify 
>>>> arguments, claims, evidence in articles and draw network of claims 
>>>> and evidence to see what supports the claims.  Hope to publish this 
>>>> network soon as RDF graph.
>>>>
>>>> victor: PubAnnotation shown last week, showed epistemic analysis.
>>>>
>>>> Franck: Argument, clinical trial analysis.  Query pubmed and 
>>>> platform analyzes those articles.  Want to apply them to CORD-19.
>>>>
>>>> Vincent: Is RDF available? victor: Will take a couple more weeks. 
>>>> Vincent: Size? victor: 20GB RDF.
>>>>
>>>> David: Overlap between efforts, helpful to learn about each other's 
>>>> work.
>>>>
>>>> victor: After looking at Monarch Initiative, it isn't new, names I 
>>>> recognized from Human Phenotype initiative.  Most of that summarizes 
>>>> work that others have done.  FHIR DB also have overlaps with SciBite.
>>>>
>>>> david: SPARQL query was valuable, but biologists need simple UI.
>>>>
>>>> jim: Working on faceted browser for various things, that can be 
>>>> reused. Based on SPARQL fragments, property path gives certain 
>>>> values, here's how to render it.  Potentially useful here.  Also 
>>>> integrated WHYIS Vega (JS framework for charts and visualization), 
>>>> can plug a SPARQL query in and get a chart.  People can share how 
>>>> they're exploring the graph.
>>>> https://github.com/tetherless-world/whyis
>>>> Faceted search is a view in WHYIS, but a lot of the capabilities are 
>>>> designed to use nanopub.
>>>>
>>>> Email list for these calls: 
>>>> https://lists.w3.org/Archives/Public/public-semweb-lifesci/
>>>>
>>>> Franck to present next week.
>>>>
>>>> ADJOURNED
Received on Tuesday, 19 May 2020 14:25:50 UTC