[shex] How to process/generate ShEx reports (#115)

andrawaag has just created a new issue for https://github.com/shexSpec/shex:

== How to process/generate ShEx reports ==
I am applying ShEx validation on a large scale on Wikidata items, but I am struggling to aggregate the results in a sensible report. I am looking for best practices here. This is the approach I have followed so far, which is working for this specific use case.

Take for example the following use case:
* I want to validate all items in Wikidata that have a statement with the Disease Ontology Property (P699):
`https://w.wiki/387j`. 
* For those 13054 Wikidata items, I would like to know whether they conform to the following Shape Expression [E3](https://genewikibots.semscape.org/wiki/Special:EntitySchemaText/E3).
* For the resulting 13054 ShEx validation reports, I would like to have an aggregated report that summarizes the issues observed overall.

I have developed a script using WikidataIntegrator and PyShEx to do the validation. These are the steps:
* Do the validation: [script](https://github.com/andrawaag/wd_shex_batch/blob/main/diseaseShex.py)
* Capture the generated reports: [log](https://raw.githubusercontent.com/andrawaag/wd_shex_batch/main/disease_shex.json)
* Parse the log and generate an aggregated report: [notebook](https://github.com/andrawaag/wd_shex_batch/blob/main/ShEx_errors_reports.ipynb)

The current aggregated report is sufficient for its task, i.e. showing where the issues are. But getting there requires some suboptimal parsing of the validator's string output and some arbitrary clustering of the errors into types.

`{'No matching triples found for predicate p:P2888': 6608,
 'No matching triples found for predicate ps:P279': 6217,
 '2 triples exceeds max {1,1}': 3666,
 'No matching triples found for predicate prov:wasDerivedFrom': 2632,
 '{"values": ["http://www.wikidata.org/entity/Q5282129"], "typ...': 534,
 '{"values": ["http://www.wikidata.org/entity/Q27468140"], "ty...': 1304,
 'No matching triples found for predicate pr:P699': 772,
 '3 triples exceeds max {1,1}': 9,
 'No matching triples found for predicate pr:P5270': 1}`
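
A tally like the one above can be sketched with `collections.Counter`. This is only a minimal illustration, not the code from the linked notebook: it assumes the failure reasons have already been extracted from the log as plain strings (the helper names `normalize` and `tally`, the `sample` input and the 60-character cut-off are all hypothetical).

```python
from collections import Counter


def normalize(reason: str, max_len: int = 60) -> str:
    """Collapse whitespace and truncate long messages so similar ones share a key."""
    text = " ".join(reason.split())
    return text if len(text) <= max_len else text[:max_len] + "..."


def tally(reasons):
    """Count how often each normalized reason string occurs."""
    return Counter(normalize(r) for r in reasons)


# Hypothetical input: reason strings already pulled out of the validation log.
sample = [
    "No matching triples found for predicate p:P2888",
    "  No matching triples found for   predicate p:P2888 ",
    "2 triples exceeds max {1,1}",
]
counts = tally(sample)
```

Truncation keeps the keys readable, but it is exactly the kind of arbitrary clustering described above: two different messages that share a long prefix end up in the same bucket.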

I am looking for:
1. Suggestions to improve the pipeline, or alternatives to it.
2. A standard output format for ShEx validation pipelines from which reports can be generated. For example, can there be a finite set of error types, e.g. "No matching triples", "cardinality issue", etc.?
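
To make point 2 concrete, here is a rough sketch of what a finite set of error types could look like on the consumer side. It is not an existing ShEx or PyShEx feature: the `ErrorType` names and the regular expressions are assumptions derived only from the message strings shown above, and treating messages that start with `{"values":` as value-set mismatches is a guess.

```python
import re
from collections import Counter
from enum import Enum


class ErrorType(Enum):
    """Hypothetical finite set of error categories."""
    MISSING_TRIPLE = "missing_triple"
    CARDINALITY = "cardinality"
    VALUE_MISMATCH = "value_mismatch"  # assumption: a value set constraint was not met
    OTHER = "other"


# Patterns are guesses based on the messages observed in the report above.
_PATTERNS = [
    (re.compile(r"No matching triples found for predicate \S+"), ErrorType.MISSING_TRIPLE),
    (re.compile(r"\d+ triples exceeds max \{[^}]*\}"), ErrorType.CARDINALITY),
    (re.compile(r'^\{"values":'), ErrorType.VALUE_MISMATCH),
]


def classify(reason: str) -> ErrorType:
    """Map a free-text reason to one of the fixed error types."""
    for pattern, error_type in _PATTERNS:
        if pattern.search(reason):
            return error_type
    return ErrorType.OTHER


def summarize(reasons):
    """Count reasons per error type."""
    return Counter(classify(r) for r in reasons)


summary = summarize([
    "No matching triples found for predicate ps:P279",
    "3 triples exceeds max {1,1}",
    '{"values": ["http://www.wikidata.org/entity/Q5282129"], "typ...',
    "something unexpected",
])
```

If validators emitted such a type code directly, together with the predicate and the focus node, the regular expressions would no longer be needed and reports could be aggregated without string parsing.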

Please view or discuss this issue at https://github.com/shexSpec/shex/issues/115 using your GitHub account


-- 
Sent via github-notify-ml as configured in https://github.com/w3c/github-notify-ml-config

Received on Wednesday, 24 March 2021 13:53:24 UTC