Salad: Semantic Annotations for Linked Avro Data - was: GRDDL for BigData or CSVW for Avro? from Henry Story on 2022-06-05 (semantic-web@w3.org from June 2022)

From: Henry Story <henry.story@bblfish.net>
Date: Sun, 5 Jun 2022 19:18:16 +0200
To: Eric Prud'hommeaux <eric@w3.org>
Cc: Joshua Shinavier <josh@fortytwo.net>, semantic-web@w3.org
Message-Id: <D08E998A-CBA5-4F34-9E4C-7703C3AFEFCC@bblfish.net>
> On 3. Jun 2022, at 17:51, Eric Prud'hommeaux <eric@w3.org> wrote:
> 
> On Fri, Jun 03, 2022 at 04:19:49PM +0200, Henry Story wrote:
>> 
>> 
>>> On 3. Jun 2022, at 12:58, Eric Prud'hommeaux <eric@w3.org> wrote:
>>> 
>>>> 
>>>> I’ll look into Schema Salad 
>>>> https://www.commonwl.org/v1.0/SchemaSalad.html
>>> 
>>> In principle, an accompanying JSON-LD @context does this for you, e.g.
>>> AVRO schema:
>> 
>> Thanks Eric for those very helpful examples. (I think the data you 
>> gave for the second example does not quite fit the schema, but I
>> get the point).
> 
> Yeah, I had .name as a sibling of .study .

Before looking at your ideas on avro-dl I wanted to look at Salad, 
as it has Avro in its title: "Semantic Annotations for Linked Avro Data". 
The problem it is trying to solve is the proliferation of different files 
doing nearly the same thing, which I think you also pointed out
earlier in this thread.

To understand Salad I worked on transforming your first example into 
Salad YAML format. I was more interested in getting it to work 
than in being faithful to your structure. So, for example, I renamed 
the "name" fields to "dname" and "fname" because of name clashes.
There is likely a way to solve that, but that is something
to do next.

The data files became the following:

## Trial.data.yaml

This is just the JSON data you gave me earlier, but now 
in YAML format, with the names disambiguated (to start with) and 
a base added (which may not be needed).

[[
$base: "https://mrna.com/"
study:
  dname: PARAMEDIC2
  corpus:
  - fname: Kathleen Cleaver
    status: initiated
  - fname: Fredricka Newton
    status: enrolled
]]

## Trial_schema.yaml

The Schema YAML file brings together both the
Avro schema and the JSON-LD markup, and also allows 
one to add comments. 
(Note: I started off with the complex nested structure
you had, but I could not get the jsonldPredicate to work 
that way, so I decomposed it into a flatter hierarchy, which
also makes it easier to read.)

[[
$base: "https://salad.egg/"

$namespaces: 
  ex: "http://example.org/ns/rct#"
  foaf: "http://xmlns.com/foaf/0.1/"
  doap: "http://usefulinc.com/ns/doap#"


$graph:
- name: Trial
  type: record
  documentRoot: true
  # namespace: example.avro <- not needed
  fields:
  - name: study
    jsonldPredicate: "ex:study"
    type: Study

- name: ParaMedic
  type: record
  fields:
  - name: fname #was "name", changed to avoid name clash
    jsonldPredicate: "foaf:name"
    type: string
  - name: status
    jsonldPredicate: "ex:status"
    type: StatusType

- name: StatusType
  type: enum
  symbols:
  - "enrolled"
  - "initiated"
  - "completed"

- name: Study  # changed from 'study' to avoid name clash
  type: record
  fields:
  - name: dname # was "name", changed to avoid name-clash
    type: "string" 
    doc: "name of study"
    jsonldPredicate: "doap:name"        
  - name: corpus
    doc: "the body of the study (made of people)"
    jsonldPredicate: 
      "_id": "ex:corpus"
      "_container": "@list"       
    type: 
      type: array
      items: ParaMedic
]]

After installing schema-salad-tool I can use its Python tooling to 
do the following:

## Extract the RDFS from the Salad Schema

[[
$ schema-salad-tool --print-rdfs Trial_schema.yaml
/Users/hjs/Library/Python/3.8/bin/schema-salad-tool Current version: 8.3.20220525163636
@prefix doap: <http://usefulinc.com/ns/doap#> .
@prefix ex: <http://example.org/ns/rct#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<https://salad.egg/#ParaMedic> a rdfs:Class .

<https://salad.egg/#StatusType> a rdfs:Class .

<https://salad.egg/#Study> a rdfs:Class .

<https://salad.egg/#Trial> a rdfs:Class .

ex:status a rdf:Property ;
    rdfs:domain <https://salad.egg/#ParaMedic> .

ex:study a rdf:Property ;
    rdfs:domain <https://salad.egg/#Trial> .

doap:name a rdf:Property ;
    rdfs:domain <https://salad.egg/#Study> .

foaf:name a rdf:Property ;
    rdfs:domain <https://salad.egg/#ParaMedic> .
]]

## Extract the Avro JSON schema from the Salad Schema

[[
$ schema-salad-tool --print-avro Trial_schema.yaml
/Users/hjs/Library/Python/3.8/bin/schema-salad-tool Current version: 8.3.20220525163636
[
    {
        "name": "egg.salad.Trial",
        "type": "record",
        "documentRoot": true,
        "fields": [
            {
                "name": "study",
                "jsonldPredicate": "ex:study",
                "type": {
                    "name": "egg.salad.Study",
                    "type": "record",
                    "fields": [
                        {
                            "name": "dname",
                            "type": "string",
                            "doc": "name of study",
                            "jsonldPredicate": "doap:name"
                        },
                        {
                            "name": "corpus",
                            "doc": "the body of the study (made of people)",
                            "jsonldPredicate": {
                                "_id": "http://example.org/ns/rct#corpus",
                                "_container": "@list"
                            },
                            "type": {
                                "type": "array",
                                "items": {
                                    "name": "egg.salad.ParaMedic",
                                    "type": "record",
                                    "fields": [
                                        {
                                            "name": "fname",
                                            "jsonldPredicate": "foaf:name",
                                            "type": "string"
                                        },
                                        {
                                            "name": "status",
                                            "jsonldPredicate": "ex:status",
                                            "type": {
                                                "name": "egg.salad.StatusType",
                                                "type": "enum",
                                                "symbols": [
                                                    "enrolled",
                                                    "initiated",
                                                    "completed"
                                                ]
                                            }
                                        }
                                    ]
                                },
                                "name": ""
                            }
                        }
                    ]
                }
            }
        ]
    }
]
]]
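
To make the relation between the flat Salad $graph and the nested Avro output easier to follow, here is a minimal Python sketch of one way such inlining could work: each named type is replaced by its definition the first time it is referenced. This is an illustration under stated assumptions, not how schema-salad-tool is implemented. The `inline` function is hypothetical, the graph is reduced to names and types (no jsonldPredicate or doc entries), union types are not handled, and the "egg.salad." name prefix seen in the tool output is not reproduced.

```python
import json

# The $graph from Trial_schema.yaml, reduced to names and types.
GRAPH = [
    {"name": "Trial", "type": "record",
     "fields": [{"name": "study", "type": "Study"}]},
    {"name": "ParaMedic", "type": "record",
     "fields": [{"name": "fname", "type": "string"},
                {"name": "status", "type": "StatusType"}]},
    {"name": "StatusType", "type": "enum",
     "symbols": ["enrolled", "initiated", "completed"]},
    {"name": "Study", "type": "record",
     "fields": [{"name": "dname", "type": "string"},
                {"name": "corpus",
                 "type": {"type": "array", "items": "ParaMedic"}}]},
]


def inline(type_ref, named, seen):
    """Replace a reference to a named type by its definition on first use."""
    if isinstance(type_ref, str):
        if type_ref not in named or type_ref in seen:
            return type_ref  # primitive type, or a name already defined
        seen.add(type_ref)
        return inline(dict(named[type_ref]), named, seen)
    if type_ref.get("type") == "array":
        return {**type_ref, "items": inline(type_ref["items"], named, seen)}
    if type_ref.get("type") == "record":
        fields = [{**f, "type": inline(f["type"], named, seen)}
                  for f in type_ref["fields"]]
        return {**type_ref, "fields": fields}
    return type_ref  # enum or other leaf definition


named = {t["name"]: t for t in GRAPH}
nested = inline("Trial", named, set())
print(json.dumps(nested, indent=4))
```

The `seen` set reflects the Avro rule that a named type is defined once and referred to by name afterwards.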

## Extract the JSON-LD context

This gives us the JSON-LD context that one can use with the YAML data 
Trial.data.yaml to produce RDF.

[[
$ schema-salad-tool --print-jsonld-context Trial_schema.yaml
/Users/hjs/Library/Python/3.8/bin/schema-salad-tool Current version: 8.3.20220525163636
{
    "@context": {
        "ParaMedic": "https://salad.egg/#ParaMedic",
        "StatusType": "https://salad.egg/#StatusType",
        "Study": "https://salad.egg/#Study",
        "Trial": "https://salad.egg/#Trial",
        "completed": "https://salad.egg/#StatusType/completed",
        "corpus": {
            "@container": "@list",
            "@id": "http://example.org/ns/rct#corpus"
        },
        "dname": "doap:name",
        "doap": "http://usefulinc.com/ns/doap#",
        "enrolled": "https://salad.egg/#StatusType/enrolled",
        "ex": "http://example.org/ns/rct#",
        "fname": "foaf:name",
        "foaf": "http://xmlns.com/foaf/0.1/",
        "initiated": "https://salad.egg/#StatusType/initiated",
        "status": "ex:status",
        "study": "ex:study"
    }
}
]]
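
To make the role of this context concrete, here is a minimal Python sketch of how a key such as "fname" resolves to a full IRI. It is a simplification rather than a JSON-LD processor: `expand_term` is a hypothetical helper that covers only the cases appearing in this context, and `CONTEXT` is an abridged copy of the output above.

```python
# Abridged copy of the context printed above.
CONTEXT = {
    "corpus": {"@container": "@list", "@id": "http://example.org/ns/rct#corpus"},
    "dname": "doap:name",
    "doap": "http://usefulinc.com/ns/doap#",
    "ex": "http://example.org/ns/rct#",
    "fname": "foaf:name",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "status": "ex:status",
    "study": "ex:study",
}


def expand_term(term, context):
    """Resolve a term or compact IRI (prefix:suffix) to a full IRI."""
    value = context.get(term, term)
    if isinstance(value, dict):  # expanded term definition, e.g. "corpus"
        value = value["@id"]
    prefix, sep, suffix = value.partition(":")
    # A compact IRI has a prefix that is itself defined in the context;
    # "http://..." is left alone because its suffix starts with "//".
    if sep and prefix in context and not suffix.startswith("//"):
        return expand_term(prefix, context) + suffix
    return value


for key in ("study", "dname", "corpus", "fname", "status"):
    print(key, "->", expand_term(key, CONTEXT))
```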

## Transform the Data using the schema to RDF

One can do the transformation to RDF directly from the YAML data:

[[
$ schema-salad-tool --print-rdf Trial_schema.yaml Trial.data.yaml
/Users/hjs/Library/Python/3.8/bin/schema-salad-tool Current version: 8.3.20220525163636
@prefix doap: <http://usefulinc.com/ns/doap#> .
@prefix ex: <http://example.org/ns/rct#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

[] ex:study [ ex:corpus ( [ ex:status "initiated" ;
                        foaf:name "Kathleen Cleaver" ] [ ex:status "enrolled" ;
                        foaf:name "Fredricka Newton" ] ) ;
            doap:name "PARAMEDIC2" ] .
]]

## Other options

I have not yet found out whether one can use this directly with Avro binary data.

$ schema-salad-tool -h
usage: schema-salad-tool [-h] [--rdf-serializer RDF_SERIALIZER] [--skip-schemas] [--strict-foreign-properties] [--print-jsonld-context] [--print-rdfs]
                         [--print-avro] [--print-rdf] [--print-pre] [--print-index] [--print-metadata] [--print-inheritance-dot] [--print-fieldrefs-dot]
                         [--codegen language] [--codegen-target CODEGEN_TARGET] [--codegen-examples directory] [--codegen-package dotted.package]
                         [--codegen-copyright copyright_string] [--codegen-parser-info parser_info] [--print-oneline] [--print-doc]
                         [--strict | --non-strict] [--verbose | --quiet | --debug] [--only ONLY] [--redirect REDIRECT] [--brand BRAND]
                         [--brandlink BRANDLINK] [--brandstyle BRANDSTYLE] [--brandinverse] [--primtype PRIMTYPE] [--version]
                         [schema] [document]

positional arguments:
  schema
  document

optional arguments:
  -h, --help            show this help message and exit
  --rdf-serializer RDF_SERIALIZER
                        Output RDF serialization format used by --print-rdf(one of turtle (default), n3, nt, xml)
  --skip-schemas        If specified, ignore $schemas sections.
  --strict-foreign-properties
                        Strict checking of foreign properties
  --print-jsonld-context
                        Print JSON-LD context for schema
  --print-rdfs          Print RDF schema
  --print-avro          Print Avro schema
  --print-rdf           Print corresponding RDF graph for document
  --print-pre           Print document after preprocessing
  --print-index         Print node index
  --print-metadata      Print document metadata
  --print-inheritance-dot
                        Print graphviz file of inheritance
  --print-fieldrefs-dot
                        Print graphviz file of field refs
  --codegen language    Generate classes in target language, currently supported: python, java, typescript
  --codegen-target CODEGEN_TARGET
                        Defaults to sys.stdout for python and ./ for Java
  --codegen-examples directory
                        Directory of example documents for test case generation (Java only).
  --codegen-package dotted.package
                        Optional override of the package name which is other derived from the base URL (Java only).
  --codegen-copyright copyright_string
                        Optional copyright of the input schema.
  --codegen-parser-info parser_info
                        Optional parser name which is accessible via resulted parser API (Python only)
  --print-oneline       Print each error message in oneline
  --print-doc           Print HTML schema documentation page
  --strict              Strict validation (unrecognized or out of place fields are error)
  --non-strict          Lenient validation (ignore unrecognized fields)
  --verbose             Default logging
  --quiet               Only print warnings and errors.
  --debug               Print even more logging
  --only ONLY           Use with --print-doc, document only listed types
  --redirect REDIRECT   Use with --print-doc, override default link for type
  --brand BRAND         Use with --print-doc, set the 'brand' text in nav bar
  --brandlink BRANDLINK
                        Use with --print-doc, set the link for 'brand' in nav bar
  --brandstyle BRANDSTYLE
                        Use with --print-doc, HTML code to link to an external style sheet
  --brandinverse        Use with --print-doc
  --primtype PRIMTYPE   Use with --print-doc, link to use for primitive types (string, int etc)
  --version, -v         Print version

Hope some of you find this helpful.


Henry Story

https://co-operating.systems
WhatsApp, Signal, Tel: +33 6 38 32 69 84
Twitter: @bblfish
Received on Sunday, 5 June 2022 17:18:32 UTC