Shape inference from Daniel Fernández Álvarez on 2019-03-01 (public-shex@w3.org from March 2019)

From: Daniel Fernández Álvarez <danifdezalvarez@gmail.com>
Date: Fri, 1 Mar 2019 20:31:52 +0100
To: "public-shex@w3.org" <public-shex@w3.org>
Message-ID: <5c7988a8.1c69fb81.9876c.2af8@mx.google.com>

Hi all,

I am Daniel Fernández-Álvarez, a PhD student at the University of Oviedo (Spain). I'd like to share with you a tool for shape inference that I mentioned in a call about this topic last week. I'm developing it in a public repository (Python), in case you want to check it out or open an issue.

Currently, the only way to use it is to clone the repo and run it locally. I am actively developing it, though, and my priority right now is to make the tool easier to use by:
- Offering a web service.
- Wrapping that web service in a web app.

I've run some experiments using Wikidata content (considering just direct properties). The results can be checked here.

Here is a brief description of the tool's current features:

- The main inputs are an RDF graph and a set of classes selected by the user. The output is a ShEx file containing a shape inferred for each one of those classes.

- The shape of each class is inferred w.r.t. the outgoing links of its instances. 

- A single triple may support several different constraints for a given shape, each more or less specific about the type of the object and the cardinality. For instance, a triple such as (:Harry :name "Harry") could produce constraints such as
     - :name xsd:string  ;
     - :name Literal ;
     - :name Literal + ;
     - :name . * ;
     - ...
The algorithm considers most of these possibilities and associates with each constraint a score that reflects the proportion of instances of a class that actually conform to the constraint. That lets us sort the constraints in the final shape by how trustworthy they are.
Most of these constraints do not appear in the final shape, only the most representative ones according to some config params. The rest are used to provide extra information, via comments, about specific cardinalities or objects, as shown here

- It links shapes to one another when there are links between instances whose classes each have an associated shape. In all other cases, it represents these relations using the IRI macro.
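
To make the scoring and interlinking ideas above a bit more concrete, here is a minimal sketch in plain Python. It is not the tool's actual code: the toy graph, the tuple encoding of literals, the "@:Person" shape labels and every function name are assumptions made only for illustration. It also simplifies conformance to "how many objects of this predicate match this object type", which is looser than real ShEx validation.

```python
from collections import Counter

# Toy graph as (subject, predicate, object) tuples. Literals are modelled as
# (value, datatype) pairs and IRIs as plain strings (an assumption of this sketch).
TRIPLES = [
    (":Harry", "rdf:type", ":Person"),
    (":Harry", ":name", ("Harry", "xsd:string")),
    (":Harry", ":friend", ":Ron"),
    (":Harry", ":pet", ":Hedwig"),
    (":Ron", "rdf:type", ":Person"),
    (":Ron", ":name", ("Ron", "xsd:string")),
    (":Ron", ":name", ("Ronald", "xsd:string")),
    (":Dobby", "rdf:type", ":Person"),
]


def object_kinds(obj, shape_of):
    """Object types that a value satisfies, most specific first."""
    if isinstance(obj, tuple):               # literal: datatype, Literal, wildcard
        return [obj[1], "Literal", "."]
    kinds = ["IRI", "."]
    if obj in shape_of:                      # object is an instance of a class with a shape
        kinds.insert(0, "@" + shape_of[obj])
    return kinds


def cardinalities(n):
    """Cardinality labels satisfied by n >= 1 occurrences ('' means exactly one)."""
    exact = "" if n == 1 else "{%d}" % n
    return [exact, "+", "*"]


def conforms(n, card):
    """Does an occurrence count n satisfy the cardinality label?"""
    if card == "*":
        return True
    if card == "+":
        return n >= 1
    if card == "":
        return n == 1
    return n == int(card.strip("{}"))


def score_constraints(triples, target_class, target_classes):
    """Map (predicate, kind, cardinality) to the fraction of conforming instances."""
    shape_of = {s: o for s, p, o in triples if p == "rdf:type" and o in target_classes}
    instances = sorted(s for s, c in shape_of.items() if c == target_class)
    if not instances:
        return {}
    # Per instance: how many objects of each predicate match each kind.
    counts = {i: Counter() for i in instances}
    for s, p, o in triples:
        if s in counts and p != "rdf:type":
            for kind in object_kinds(o, shape_of):
                counts[s][(p, kind)] += 1
    # Every observed (predicate, kind, count) yields several candidate constraints.
    candidates = {
        (p, kind, card)
        for c in counts.values()
        for (p, kind), n in c.items()
        for card in cardinalities(n)
    }
    return {
        cand: sum(conforms(counts[i][cand[:2]], cand[2]) for i in instances) / len(instances)
        for cand in candidates
    }


scores = score_constraints(TRIPLES, ":Person", {":Person"})
# List the candidates from most to least trustworthy.
for (pred, kind, card), score in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{pred} {kind} {card} -> {score:.2f}")
```

In this toy graph only two of the three :Person instances have a :name, so the "+" variants of the :name constraints score 2/3, while the Kleene-closure variants always score 1. The link from :Harry to :Ron yields a shape reference because :Ron is an instance of a class that gets a shape; the link to :Hedwig only yields IRI.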

There are several configuration params that I'm starting to document here. Those params allow you to do things like ignoring constraints that have a low score, ignoring certain triples, and choosing between shapes that are valid for every instance (using Kleene closures in the constraints when needed) and shapes that look more reasonable given the scores (even if that leaves some instances non-compliant with the shape), and so on.
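
As a rough illustration of how such params could interact, here is a small Python sketch. The parameter names (acceptance, min_score, all_compliant), the specificity ordering and the example scores are all hypothetical; they are not the tool's real configuration keys or results. The idea sketched is: per predicate, keep the most specific constraint whose score reaches a threshold, and force that threshold to 1.0 when every instance must comply.

```python
from collections import defaultdict

# Made-up scores, keyed by (predicate, object type, cardinality).
SCORES = {
    (":name", "xsd:string", ""): 0.9,
    (":name", "xsd:string", "+"): 0.95,
    (":name", "xsd:string", "*"): 1.0,
    (":name", "Literal", ""): 0.9,
    (":name", "Literal", "*"): 1.0,
    (":name", ".", "*"): 1.0,
    (":nickname", "xsd:string", ""): 0.05,
    (":nickname", "xsd:string", "*"): 1.0,
}


def specificity(kind, card):
    """Rank a constraint: a higher tuple means a more specific constraint."""
    kind_rank = 0 if kind == "." else 1 if kind in ("Literal", "IRI") else 2
    card_rank = {"*": 0, "+": 1}.get(card, 2)   # exact cardinalities rank highest
    return (kind_rank, card_rank)


def select_constraints(scores, acceptance=0.8, min_score=0.1, all_compliant=False):
    """Pick one constraint per predicate under the hypothetical params above."""
    threshold = 1.0 if all_compliant else acceptance
    by_pred = defaultdict(list)
    for (pred, kind, card), score in scores.items():
        by_pred[pred].append((kind, card, score))
    shape = {}
    for pred, cands in by_pred.items():
        # Support: best score among constraints that require at least one value.
        support = max((s for _, c, s in cands if c != "*"), default=0.0)
        if support < min_score:
            continue                             # low-score predicate is ignored
        accepted = [c for c in cands if c[2] >= threshold]
        if accepted:
            shape[pred] = max(accepted, key=lambda c: (specificity(c[0], c[1]), c[2]))
    return shape


def render(shape):
    """Render the selection as ShEx-like lines, with the score as a comment."""
    return [
        " ".join(part for part in (pred, kind, card) if part) + f" ;  # {score:.1%}"
        for pred, (kind, card, score) in sorted(shape.items())
    ]


print(render(select_constraints(SCORES)))
print(render(select_constraints(SCORES, all_compliant=True)))
```

With these made-up numbers, the default call keeps an exactly-one xsd:string constraint for :name even though 10% of instances would not comply, whereas all_compliant=True falls back to the Kleene closure. The rarely used :nickname is dropped in both cases because its support is below min_score.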

Any feedback, questions, suggestions or requests about this would be very welcome =)

Best regards,
Dani F.

Received on Friday, 1 March 2019 19:37:46 UTC