Re: GRDDL for BigData or CSVW for Avro?

On Fri, Jun 03, 2022 at 04:19:49PM +0200, Henry Story wrote:
> 
> 
> > On 3. Jun 2022, at 12:58, Eric Prud'hommeaux <eric@w3.org> wrote:
> > 
> >> 
> >> I’ll look into Schema Salad 
> >> https://www.commonwl.org/v1.0/SchemaSalad.html
> > 
> > In principle, an accompanying JSON-LD @context does this for you, e.g.
> > AVRO schema:
> 
> Thanks Eric for those very helpful examples. (I think the data you 
> gave for the second example does not quite fit the schema, but I
> get the point).

Yeah, I had .name as a sibling of .study .


> So the idea here is that 
> 1) one can think 
>    of Binary Avro data
>    interpreted with its JSON Avro schema 
>    as isomorphic to a json file JF.
> 2) One can then just consider JF to have the right json ld 
> context resulting in RDF.
> 
> That is nice because it means one has all the tools to view 
> the Avro binary data as already being RDF. (ignoring the 
> disjunction problem you mention)
> 
> Of course one would want to avoid serializing 
> the data to JSON in order to view it as jsonld, as
> that feels a bit expensive.

I think that's an implementation issue. Granted, the segmentation of standards does kinda encourage re-serialization, but there's nothing that keeps an enterprising implementor from adding parallel JSON-LD processing to an Avro processor (e.g. some map/reduce function or whatnot). Such a processor would maintain its location in both the Avro schema and the @context.
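
A rough sketch of that kind of parallel walk, in Python. It is not tied to any Avro library: the datum is assumed to be already decoded into plain dicts, the schema is limited to records and primitives, and the walk function, the example.org IRIs and the nested (scoped) @context are all made up for illustration.

```python
def walk(schema, context, datum, path=()):
    """Yield (path, iri, value) for each mapped field, stepping through the
    Avro record schema and the JSON-LD @context together."""
    for field in schema["fields"]:
        name = field["name"]
        term = context.get(name)
        if term is None:
            continue  # this field has no term in the @context
        # A term definition is either a bare IRI or {"@id": ..., "@context": ...}.
        iri = term["@id"] if isinstance(term, dict) else term
        here = path + (name,)
        ftype = field["type"]
        if isinstance(ftype, dict) and ftype.get("type") == "record":
            # Move down in both structures at once: nested record, scoped @context
            # layered on top of the terms already in scope.
            scoped = {**context, **term.get("@context", {})} if isinstance(term, dict) else context
            yield here, iri, None
            yield from walk(ftype, scoped, datum[name], here)
        else:
            yield here, iri, datum[name]


# Hypothetical schema, @context and decoded datum (the IRIs are placeholders).
schema = {
    "type": "record", "name": "array_union",
    "fields": [
        {"name": "study", "type": {
            "type": "record", "name": "study",
            "fields": [{"name": "name", "type": "string"}]}},
    ],
}
context = {
    "study": {"@id": "http://example.org/study",
              "@context": {"name": "http://example.org/name"}},
}
datum = {"study": {"name": "trial-1"}}

for path, iri, value in walk(schema, context, datum):
    print("/".join(path), iri, value)
```

The only state carried along is the current schema node, the current context and the path, which is the "location in both" mentioned above; unions and arrays would need the same treatment.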

(Renaming @rdf to @id for consistency with JSON-LD) I think DRY is a slightly stronger argument for sticking an @id annotation into Avro. The Avro schema and the JSON-LD @context (redundantly) dictate the same JSON structure, and encoding that structure twice should annoy you. However, the @id annotation doesn't give you RDF; it only allows you to define some terms that would appear in the RDF graph (probably predicate names, given the snipped discussion). If you want to define the full mapping to RDF, you have to define the mapping of any Avro structure to RDF.
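
To illustrate the DRY point, here is a sketch that derives the @context from a schema carrying "@id" annotations, so the structure is written down only once. The to_context function and the example.org IRIs are hypothetical; the sketch assumes annotations sit on record fields, nests a scoped @context per field so that the two "name" fields don't collide, and ignores maps and named-type references.

```python
EX = "http://example.org/"            # placeholder namespace, not from the thread
FOAF = "http://xmlns.com/foaf/0.1/"


def to_context(avro_type):
    """Return a JSON-LD context for the record fields reachable from avro_type."""
    if isinstance(avro_type, list):            # union: merge all branches
        merged = {}
        for branch in avro_type:
            merged.update(to_context(branch))
        return merged
    if not isinstance(avro_type, dict):        # primitive or named reference
        return {}
    if avro_type.get("type") == "array":       # array: look at the item type
        return to_context(avro_type["items"])
    context = {}
    for field in avro_type.get("fields", []):
        if "@id" not in field:
            continue                           # unannotated fields get no term
        nested = to_context(field["type"])
        context[field["name"]] = (
            {"@id": field["@id"], "@context": nested} if nested else field["@id"]
        )
    return context


schema = {
    "type": "record", "name": "array_union", "namespace": "example.avro",
    "fields": [
        {"name": "study", "@id": EX + "study", "type": {
            "type": "record", "name": "study",
            "fields": [
                {"name": "name", "type": "string", "@id": EX + "name"},
                {"name": "corpus", "@id": EX + "corpus", "type": ["null", {
                    "type": "array",
                    "items": {"type": "record", "name": "_name_0", "fields": [
                        {"name": "name", "type": "string", "@id": FOAF + "name"},
                    ]}}]},
            ]}},
    ],
}

ctx = to_context(schema)
print({"@context": ctx})
```

This only recovers the terms, which is exactly the limitation noted above: it says nothing yet about how arrays, unions or enums turn into triples.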

As a starting proposition, you'd probably want to apply some sort of striping assumption and map avro:type to rdf:type. (Our goal is not to have an RDF representation of the schema, but a defined RDF graph for any instance data conforming to the schema.)

[[
{                                              <-- S is a fresh BNode (or steal @id conventions from JSON-LD)
  "type": "record",                            <-- emit { S rdf:type <record> .} (outer-most frame only)
  "namespace": "example.avro",                 <-- no effect; used for Avro resolution
  "name": "array_union",                       <-- no effect (I don't even know what this is for)
  "fields": [                                  <-- implies a list of nested statements with subject S
    { "name": "study",                         <-- V for a complex type is a fresh BNode
+     "@id": "http:...study",                  <-- emit { S <http:...study> V . }
      "type": {                                <-- S := V
        "name": "study",                       <-- no effect
        "type": "record",                      <-- no effect
        "fields": [                            <-- implies a list of nested statements with subject V
          { "name": "name",                    <-- validates a value V
            "type": "string",                  <--   with DT := xsd:string
+           "@id": "http:...name"},            <-- emit { S <http:...name> V^^DT . }
          { "name": "corpus",                  <-- V is a fresh BNode
+           "@id": "http:...corpus",           <-- P := <http:...corpus>
            "type": [                          <-- 
              "null",                          <-- No triple emitted if null (at least, that was the DirectMapping choice)
              { "type": "array",               <-- in a type "array", so keep track of tail of list: TAIL := { S <http:...corpus> @TBD . }
                "name": "corpus_name_0",       <--
                "items": {                     <-- for each item,
                  "name": "_name_0",           <-- no effect
                  "type": "record",            <-- LI := fresh BNode;
                                                   emit TAIL with LI substituted in for @TBD
                                                   emit { LI rdf:first S };
                                                   TAIL := { LI rdf:rest @TBD . }
                  "fields": [                  <-- S is a fresh BNode; 
                    { "name": "name",          <-- validates a value V
                      "type": "string",        <--   with DT := xsd:string
+                     "@id": "foaf:name"},     <-- emit { S foaf:name V^^DT };
                    { "name": "status",        <-- validates a value V
+                     "@id": "http:...status",
                      "type": {
                        "name": "StatusType",
                        "type": "enum",        <--   with termType := iri
                        "symbols": [
                          "enrolled",          <-- if matched, emit { S <http:...status> <enrolled> . }
                          "initiated",         <-- if matched, emit { S <http:...status> <initiated> . }
                          "completed" ] } }    <-- if matched, emit { S <http:...status> <completed> . }
                  ] } }                        <-- at end of items, emit TAIL with rdf:nil substituted in for @TBD.
            ] }
        ] } }
  ] }
]]

You may want to sprinkle in more @ directives for more control. What I'm trying to emphasize is that you're inventing AVRO-LD.
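
As a rough illustration of what such an AVRO-LD processor might look like, here is a Python sketch of the rules annotated in the schema above. Everything in it is an assumption rather than a spec: the to_triples function, the example.org IRIs, the small datatype table, the naive union matching, and the choice to work on an already-decoded datum (plain dicts and lists) instead of the binary encoding.

```python
import itertools

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XSD = "http://www.w3.org/2001/XMLSchema#"
EX = "http://example.org/"            # placeholder namespace, not from the thread
FOAF = "http://xmlns.com/foaf/0.1/"
# Assumption: only these Avro primitives are handled, with this datatype choice.
DATATYPES = {"string": XSD + "string", "int": XSD + "integer", "long": XSD + "integer"}


def to_triples(schema, datum):
    """Return (s, p, o) term strings for a datum that conforms to the schema."""
    counter = itertools.count()
    triples = []

    def bnode():
        return f"_:b{next(counter)}"

    def pick(union, value):
        """Naively choose a union branch by Python type; ambiguous unions are not handled."""
        for branch in union:
            kind = branch if isinstance(branch, str) else branch["type"]
            if ((value is None) == (kind == "null")
                    and isinstance(value, list) == (kind == "array")
                    and isinstance(value, dict) == (kind == "record")):
                return branch
        raise ValueError("no union branch matches the value")

    def record(rec, value):
        """S is a fresh BNode; emit one statement per "@id"-annotated field."""
        s = bnode()
        for field in rec["fields"]:
            if "@id" in field:
                for o in objects(field["type"], value[field["name"]]):
                    triples.append((s, f"<{field['@id']}>", o))
        return s

    def rdf_list(item_type, items):
        """Chain the items with rdf:first/rdf:rest and return the head of the list."""
        nil = f"<{RDF}nil>"
        cells = [bnode() for _ in items]
        for i, (cell, item) in enumerate(zip(cells, items)):
            for o in objects(item_type, item):
                triples.append((cell, f"<{RDF}first>", o))
            rest = cells[i + 1] if i + 1 < len(cells) else nil
            triples.append((cell, f"<{RDF}rest>", rest))
        return cells[0] if cells else nil

    def objects(avro_type, value):
        """Return zero or one object terms for a value of the given Avro type."""
        if isinstance(avro_type, list):
            avro_type = pick(avro_type, value)
        kind = avro_type if isinstance(avro_type, str) else avro_type["type"]
        if kind == "null":
            return []  # no triple for null
        if kind in DATATYPES:
            return [f'"{value}"^^<{DATATYPES[kind]}>']  # no literal escaping here
        if kind == "record":
            return [record(avro_type, value)]
        if kind == "enum":
            return [f"<{value}>"]  # enum symbol as a relative IRI
        if kind == "array":
            return [rdf_list(avro_type["items"], value)]
        raise ValueError(f"unmapped Avro type: {kind}")

    root = record(schema, datum)
    triples.append((root, f"<{RDF}type>", "<record>"))  # outer-most frame only
    return triples


schema = {
    "type": "record", "namespace": "example.avro", "name": "array_union",
    "fields": [{"name": "study", "@id": EX + "study", "type": {
        "name": "study", "type": "record", "fields": [
            {"name": "name", "type": "string", "@id": EX + "name"},
            {"name": "corpus", "@id": EX + "corpus", "type": ["null", {
                "type": "array", "name": "corpus_name_0",
                "items": {"name": "_name_0", "type": "record", "fields": [
                    {"name": "name", "type": "string", "@id": FOAF + "name"},
                    {"name": "status", "@id": EX + "status", "type": {
                        "name": "StatusType", "type": "enum",
                        "symbols": ["enrolled", "initiated", "completed"]}},
                ]}}]},
        ]}}],
}
datum = {"study": {"name": "trial-1",
                   "corpus": [{"name": "Alice", "status": "enrolled"}]}}

triples = to_triples(schema, datum)
for s, p, o in triples:
    print(s, p, o, ".")
```

Every branch in objects() is a design decision that JSON-LD already made for JSON and that would have to be made again here.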


> Ideally one would want BigData folks to work
> with the data as much as possible as they are used to, 
> without transforming it to RDF, but making it easy for their 
> tools to keep track all the time of the relations and RDF 
> types.
> 
> So there it seems like it would be better to have
> the Avro Schema directly do the mapping to RDF. 
> Or annotate it as Joshua suggested using the fact that
> Avro has java-like package namespaces. 
> 
> Perhaps that is what Salad [1] is attempting to do. I’ll
> be able to look at it more closely now.
> 
> 
> Henry Story
> [1] https://www.commonwl.org/v1.0/SchemaSalad.html
> 
> https://co-operating.systems
> WhatsApp, Signal, Tel: +33 6 38 32 69 84‬ 
> Twitter: @bblfish
> 

Received on Friday, 3 June 2022 15:51:50 UTC