- From: David Booth <david@dbooth.org>
- Date: Tue, 19 May 2020 10:25:35 -0400
- To: w3c semweb HCLS <public-semweb-lifesci@w3.org>
Zoom Link for today's call:
https://us02web.zoom.us/j/83815969391?pwd=Q0k4Nm9xc3V2K0djL0FYT2JMVTJmUT09

Slides for today's talks:

On 5/18/20 3:43 PM, David Booth wrote:
> Tomorrow (Tuesday) we will have a series of 5-minute overview
> presentations by people doing semantic annotation of the CORD-19 dataset:
>   Gaurav Vaidya, https://docs.google.com/presentation/d/1ghAqVwgrCO6moGyWNSZfRBApMZfJnoqa9Z5NwhRF53g/edit?usp=sharing
>   Gollam Rabby, https://github.com/corei5/Entity-Based-Document-Classification-on-the-CORD---19-Corpus
>   Marcin Joachimiak, https://lists.w3.org/Archives/Public/www-archive/2020May/att-0001/01-part
>   Michael Liebman, (None sent yet)
>   Tom Conlin, (None sent yet)
>   David Booth and Daniel Stone (Mayo Clinic & Johns-Hopkins University)
https://tinyurl.com/cord-19-on-fhir

Thanks,
David Booth

>
> If anyone else wishes to present their CORD-19 work, please let me know.
> We will probably hold another, similar session next week or a
> following week also, for people who were not able to present today.
>
> The CORD-19 dataset is a dataset released by the Allen Institute
> containing 63,000 journal articles related to COVID-19.
>
> Thanks,
> David Booth
>
> On 5/13/20 10:46 AM, David Booth wrote:
>> Notes from yesterday's webinar by Franck Michel are below. Thanks to
>> Victor Mireles-Chavez, a recording of the call is available at the
>> following URL. Franck's presentation starts at 17:10.
>>
>> https://tinyurl.com/y8kmfxhe
>> Recording password: 7t?N&*9+
>>
>> --------------------------------------------------------------
>> MEETING NOTES 12-May-2020
>> Present: David Booth, Victor Mireles, Franck Michel, Albert Burger,
>> Daniel Stone, Deborah McGuinness, Filip, Gaurav Vaidya, Gollam Rabby,
>> Louis Rumanes, Marcin Joachimiak, Michael Liebman, Subhashis Das,
>> Nico, Tom Conlin, Chuming Chen
>>
>> Introductions
>> David Booth: 10 years applying semantic web tech to healthcare and
>> life sciences, working on Mayo Clinic / Johns-Hopkins University
>> collaboration.
>>
>> Subhashis Das: Postdoctoral researcher at CeIC, DCU, Dublin.
>> Specialization in domain ontology and healthcare data integration.
>>
>> Franck's presentation
>> Slides:
>> https://www.dropbox.com/s/nnyg1o45f9dvimk/20200512%20Covid-on-the-Web%20-%20CORD-19%20semantic%20annotations.pdf?dl=0
>>
>>
>> Franck: Goal is to make it easier to find and make sense of COVID-19
>> literature: both named entities, and argumentative graphs. Using
>> DBpedia Spotlight, Entity-fishing, BioPortal Annotator.
>>
>> Franck: Releasing v1.1 shortly. 54M named entities, 564k URIs.
>> 30M NEs, 155,651 URIs from Wikidata
>> 21M NEs, 339,990 URIs from BioPortal
>> 1.8M NEs, from DBpedia
>> https://github.com/wimmics/cord19-nekg
>> Full modelling details:
>> https://github.com/Wimmics/cord19-nekg/blob/master/doc/01-data-modeling.md
>>
>> SPARQL endpoint: http://covid19.i3s.unice.fr/sparql
>> Virtuoso faceted browsing: http://covid19.i3s.unice.fr:8890/fct/
>> Franck: Web annotation ont and PROV-O used to annotate articles.
>> Annotation points to article and position within the article where the
>> entity was found.
>>
>> Franck: Able to query for cancer entity and its subclasses or instances.
>>
>> Franck: Also looking at co-mentions of named entities.
>>
>> Franck: Colleagues also working on ACTA: A Tool for Argumentative ...
>> claims/evidence.
>> This would allow arguments/claims/evidence to be
>> displayed in a graph.
>>
>> David: What ont are you using for determining the subclass relations
>> of cancer, for example?
>> Franck: So far using Wikidata hierarchy. One exception: viruses in
>> Wikidata are not modeled as classes, so we regenerated them as classes.
>>
>> Victor: Why can't DBpedia Spotlight process full text?
>> Franck: We have 54M NEs, 700M triples. Not enough machine power to do
>> full text.
>>
>> Victor: If I find offsets, how can I be sure that I am aligned in my
>> own data?
>> Franck: It refers specifically to the CORD-19 dataset.
>>
>> Marcin: How are you extracting info about viral proteins? There are
>> polyproteins?
>> Franck: We rely on the results of the tools we're using. If a protein
>> is identified by those tools then we get them. If an article mentions
>> a gene name, would it show up?
>>
>> Marcin: There are a few of these different entity extraction efforts.
>> Should we try to merge them?
>>
>> David: That's exactly the point of these teleconferences -- to start
>> learning about each other's work and figure out how best to coordinate.
>>
>> Michael: We compared analysis of abstracts vs full body, and found
>> significant difference, because abstract is more of an advertisement.
>> Also, in dealing with the full body, we found it necessary to parse
>> the article, separate section on methods, results, conclusions.
>>
>> Franck: My colleagues working on argumentative extraction, quality
>> varies a lot from one category to another. They've noticed
>> (anecdotally) that clinical trials have an abstract with a few clear
>> statements about results, and relatively easy to extract, but not for
>> other articles.
>>
>> Victor: Comment on avoiding duplication of effort, there is quite some
>> effort in doing annotations. Some are better prepared than others.
>> Takes time. By the time someone presents work, others have already
>> spent time doing similar work.
>>
>> David: We began these calls with very brief presentations by each
>> participant, but after that, switched to deeper presentations of each
>> project.
>>
>> Deborah: When presenting, please say what of your work is ready for
>> others to use.
>>
>> Tom: Also interested in timing, how long things took, what was good/bad.
>>
>> AGREED: Next week we will do 5-minute presentations of what we're
>> doing or planning.
>>
>> Speakers next week: Daniel, Deborah, Gaurav, Gollam, Marcin, John Z,
>> Michael, Tom, David.
>>
>> Subhashis: not next week, but later.
>>
>> ADJOURNED
>>
>>
>> On 5/11/20 12:22 PM, David Booth wrote:
>>> Tomorrow (Tuesday) Franck Michel will present his work on CORD-19
>>> Named Entities Knowledge Graph (CORD19-NEKG).
>>>
>>> Zoom Link:
>>> https://us02web.zoom.us/j/83815969391?pwd=Q0k4Nm9xc3V2K0djL0FYT2JMVTJmUT09
>>>
>>>
>>> Thanks,
>>> David Booth
>>>
>>> On 4/28/20 12:09 PM, David Booth wrote:
>>>> Notes from today's call:
>>>>
>>>> MEETING NOTES 28-Apr-2020
>>>> Present: David Booth, Victor Mireles, Louis Rumanes, Tom Conlin,
>>>> Franck Michel, Gollam Rabby, Jim McCusker, Lucy Wong, Sebastian
>>>> Kohlmeier, Tomáš Kliegr
>>>>
>>>> Introductions
>>>> David Booth: 10 years applying semantic web tech to healthcare and
>>>> life sciences, working on Mayo Clinic / Johns-Hopkins University
>>>> collaboration.
>>>>
>>>> Louis Rumane: United Health Group, Doing COVID research, looking at
>>>> making a KG
>>>>
>>>> Tom Conlin: Working with Melissa Haendel (Monarch Initiative)
>>>>
>>>> Franck: INRIA
>>>>
>>>> Gollam: Prague, Univ
>>>>
>>>> Jim: Research sci RPI, working on KG w bio
>>>>
>>>> Lucy: Allen Institute, research scientist.
>>>>
>>>> Tomas: Assoc Prof, Prague, KG.
>>>>
>>>> Sebastian: Sr Mgr on CORD-19.
>>>>
>>>> Victor: Semantic Web Company researcher
>>>>
>>>> Victor's Presentation
>>>> Slides here:
>>>> https://docs.google.com/presentation/d/1xaS_88sJ47iSrvv0ezOfjscIvG2VINUe7vqrUEMiaCA/edit?usp=sharing
>>>>
>>>>
>>>> Victor: Semantic Web Company, 40+ FTEs. Makes PoolParty. Works w
>>>> companies in many countries. Taxonomy helps extract entities from
>>>> text. Image search, data mgmt.
>>>>
>>>> Victor: Developing text and data mining tools for biomed, and
>>>> CORD-19. We don't only annotate text. What's useful about
>>>> annotating text w entities is to use the knowledge, simplest is
>>>> encoded in SKOS, such as broader/narrower. But to do this we need
>>>> to annotate the text into URIs, then import relationships into the
>>>> graph. Trying to link existing annotations w other knowledge
>>>> sources. Ont is simplified version of NIFT: documents have
>>>> sections, sections have annotations that are SKOS concepts.
>>>>
>>>> Victor: So far, we've set up a pipeline to take a document and it
>>>> finds annotations with offsets. So far imported ChEBI, GO, MeSH,
>>>> HPO, but using them as controlled vocab. Many are very specific,
>>>> such as "COVID-19" -- not really NLP, because there are no
>>>> inflections, plurals, etc. Output is a bunch of triples in the
>>>> simple SKOS ont previously mentioned. Put them into GraphDB, along
>>>> with the vocabs.
>>>>
>>>> Victor: Also looked at SciBite annotations. They've done an
>>>> excellent job annotating. They also have their own controlled vocab
>>>> that is very good. JSON files have annotations. Put them into
>>>> triples. Combining them w bio DBs gives a graph DB.
>>>>
>>>> (Victor shows relationships in GraphDB viewer)
>>>>
>>>> Victor: You can navigate the hierarchy of concepts and link them to
>>>> the paragraphs in CORD-19 DB.
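
The kind of hierarchy-aware lookup described in these notes (annotations that are SKOS concepts, navigated via broader/narrower and linked to paragraphs) can be illustrated with a minimal sketch. This is not the pipeline presented on the call: the concept names, paragraph ids, and both function names below are hypothetical, and a plain Python dictionary stands in for the triple store.

```python
from collections import defaultdict

# Hypothetical toy data: concept -> its broader (parent) concepts,
# standing in for skos:broader triples loaded from a vocabulary.
BROADER = {
    "lung carcinoma": ["carcinoma"],
    "carcinoma": ["neoplasm"],
    "SARS-CoV-2": ["coronavirus"],
    "MERS-CoV": ["coronavirus"],
}

# Hypothetical annotations: paragraph id -> concepts found in that paragraph.
ANNOTATIONS = {
    "doc1#p3": {"lung carcinoma", "SARS-CoV-2"},
    "doc2#p1": {"MERS-CoV"},
    "doc3#p7": {"carcinoma"},
}


def narrower_closure(root, broader):
    """Return root plus every concept reachable from it via narrower links."""
    # Invert the broader map so we can walk downwards from a concept.
    narrower = defaultdict(set)
    for child, parents in broader.items():
        for parent in parents:
            narrower[parent].add(child)
    seen, stack = {root}, [root]
    while stack:
        for child in narrower[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def paragraphs_mentioning_both(root_a, root_b, broader, annotations):
    """Paragraphs annotated with some kind of root_a AND some kind of root_b."""
    kinds_a = narrower_closure(root_a, broader)
    kinds_b = narrower_closure(root_b, broader)
    return sorted(
        pid for pid, concepts in annotations.items()
        if concepts & kinds_a and concepts & kinds_b
    )


print(paragraphs_mentioning_both("neoplasm", "coronavirus", BROADER, ANNOTATIONS))
# prints ['doc1#p3']
```

In a real triple store the same traversal would more likely be written as a SPARQL 1.1 property path over `skos:broader` than as hand-rolled graph code; the sketch only shows the logic.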
>>>>
>>>> (Victor shows SPARQL queries)
>>>>
>>>> Victor: This allows us to pull up the titles and paragraphs of
>>>> articles that both mention a kind of neoplasm and a kind of
>>>> coronavirus.
>>>>
>>>> Victor: Want to take other DBs and put them into GraphDB also.
>>>> Monarch Initiative is putting together KG, and also puts in SciBite.
>>>>
>>>> Victor: Missing from both our effort and Monarch: searchability. I
>>>> showed SPARQL queries using broader/narrower. Also need to be more
>>>> efficient for humans, working also on faceted search. Monarch
>>>> Initiative is very good for machine readable stuff. Another thing
>>>> missing: relation extraction, from the text. How does human
>>>> determine that some text is saying that a protein interacts with
>>>> another. JPL (Lewis Magidney?sp?) is using a Stanford NLP for
>>>> relation extraction.
>>>> https://github.com/nasa-jpl-cord-19/covid19-knowledge-graph
>>>> It isn't perfect, but it indicates a relationship. Both entities
>>>> are in GO. This adds new edges between entities. Lots of interest
>>>> in this topic now.
>>>>
>>>> Franck: We're doing pretty close to this in INRIA, looking at named
>>>> entities, Wikidata entities, queries that gather all articles on
>>>> cancer and any coronavirus. Another thing we're doing: in addition
>>>> to detecting named entities, we're running other tools to identify
>>>> arguments, claims, evidence in articles and draw network of claims
>>>> and evidence to see what supports the claims. Hope to publish this
>>>> network soon as RDF graph.
>>>>
>>>> Victor: PubAnnotation shown last week, showed epistemic analysis.
>>>>
>>>> Franck: Argument, clinical trial analysis. Query PubMed and
>>>> platform analyzes those articles. Want to apply them to CORD-19.
>>>>
>>>> Vincent: Is RDF available? Victor: Will take a couple more weeks.
>>>> Vincent: Size? Victor: 20GB RDF.
>>>>
>>>> David: Overlap between efforts, helpful to learn about each other's
>>>> work.
>>>>
>>>> Victor: After looking at Monarch Initiative, it isn't new, names I
>>>> recognized from Human Phenotype initiative. Most of that summarizes
>>>> work that others have done. FHIR DB also has overlaps with SciBite.
>>>>
>>>> David: SPARQL query was valuable, but biologists need simple UI.
>>>>
>>>> Jim: Working on faceted browser for various things, that can be
>>>> reused. Based on SPARQL fragments, property path gives certain
>>>> values, here's how to render it. Potentially useful here. Also
>>>> integrated WHYIS Vega (JS framework for charts and visualization),
>>>> can plug a SPARQL query in and get a chart. People can share how
>>>> they're exploring the graph.
>>>> https://github.com/tetherless-world/whyis
>>>> Faceted search is a view in WHYIS, but a lot of the capabilities are
>>>> designed to use nanopub.
>>>>
>>>> Email list for these calls:
>>>> https://lists.w3.org/Archives/Public/public-semweb-lifesci/
>>>>
>>>> Franck to present next week.
>>>>
>>>> ADJOURNED
Received on Tuesday, 19 May 2020 14:25:50 UTC