Re: GRDDL for BigData or CSVW for Avro?

On Fri, Jun 03, 2022 at 05:59:38AM +0200, Henry Story wrote:
> 
> 
> > On 2. Jun 2022, at 21:08, Joshua Shinavier <josh@fortytwo.net> wrote:
> > 
> > Hi Henry,
> > 
> > The best thing about Avro's AVSC format is that it is "just JSON", and will accept any annotations you care to attach as properties. Uber (as of the time I left in July) enriches Avro schemas with annotations which link types and fields to standardized terms, add constraints such as cardinality, mark fields which correspond to primary keys, etc. See my presentation from that time. The annotations were even sufficient to map Avro data into RDF, with SHACL as the schema language. I have seen other companies including LinkedIn enriching Avro schemas in similar ways.
> 
> Thanks for reminding me of that talk. 
> I could not find a recording of it but I saw a shorter version
> https://twitter.com/bblfish/status/1162660513005953024
> 
> The slides by themselves are easy to read (though for some reason they don’t go full screen)
> https://eng.uber.com/dragon-schema-integration-at-uber-scale/
> 
> Actually, the way you present your idea using YAML around slide 40 of those
> slides is making me wonder whether this "schema salad" project I came across
> is actually doing what I am looking for:
> 
> Semantic Annotations for Linked Avro Data (SALAD) 
> https://github.com/common-workflow-language/schema_salad
> 
> It mentions JSON, JSON-LD and Avro, but their examples
> are in YAML, which may be what confused me.
> 
> > 
> > In some cases, however, it may be preferable not to enrich the Avro schema, but to write the schema in a formally-specified, strongly-typed language, and map that schema to AVSC without any extra bells and whistles. When you need the features of the stronger schema, you refer back to it via lineage.
> 
> Yes, I think there are many ways to create the schema. (I had a project at Sun Microsystems to
> enhance Java with simple @rdf annotations on classes and files to give those URLs and so map them
> to RDF). 
> 
> What I think would be useful is if Big Data Devs could easily see the ontology behind the schemas
> or the code just to make the meaning of the data as transparent as possible and easy to access. 
> 
> I’ll look into Schema Salad 
> https://www.commonwl.org/v1.0/SchemaSalad.html

In principle, an accompanying JSON-LD @context does this for you, e.g.
Avro schema:
[[
{ "type": "record",
  "namespace": "example.avro",
  "name": "Trial",
  "fields": [
    { "name": "name",
      "type": "string" },
    { "name": "study",
      "type": {
        "name": "study",
        "type": "record",
        "fields": [
          { "name": "corpus",
            "type": {
              "type": "array",
              "name": "corpus_name_0",
              "items": {
                "name": "_name_0",
                "type": "record",
                "fields": [
                  { "name": "name",
                    "type": "string" },
                  { "name": "status",
                    "type": {
                      "name": "StatusType",
                      "type": "enum",
                      "symbols": [
                        "enrolled",
                        "initiated",
                        "completed" ] } }
                ] } } }
        ] } }
  ] }
]]

Data:
[[
{
  "study": {
    "name": "PARAMEDIC2",
    "corpus": [
      { "name": "Kathleen Cleaver", "status": "initiated" },
      { "name": "Fredricka Newton", "status": "enrolled" }
    ]
  }
}
]]

@context (you can add this to the data in JSON-LD playground):
[[
  "@context": {
    "ex": "http://example.org/ns/rct#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "study": {
      "@id": "ex:study",
      "@context": {
        "name": "ex:name",
        "corpus": {
          "@id": "ex:corpus",
          "@container": "@list",
          "@context": {
            "name": "foaf:name",
            "status": "ex:status"
          }
        }
      }
    }
  }
]]

Turtle:
[[
_:b0 ex:study [
  ex:name "PARAMEDIC2" ;
  ex:corpus (
    [ foaf:name "Kathleen Cleaver" ;
      ex:status "initiated" ]
    [ foaf:name "Fredricka Newton" ;
      ex:status "enrolled" ]
  )
] .
]]

Upside: JSON-LD (1.1) is context-sensitive and so can capture the different semantics of .study.name and .study.corpus[].name .
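
To make the context-sensitivity concrete, here is a minimal Python sketch of how a scoped @context picks a different property for the same key depending on where it sits. It is a toy lookup only, not the JSON-LD expansion algorithm (no @container handling, no keyword processing); resolve() and expand() are hypothetical helper names, and the context is the one shown above.

```python
# Toy illustration of JSON-LD 1.1 scoped contexts.
# NOT the real JSON-LD expansion algorithm; helper names are hypothetical.
CONTEXT = {
    "ex": "http://example.org/ns/rct#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "study": {
        "@id": "ex:study",
        "@context": {
            "name": "ex:name",
            "corpus": {
                "@id": "ex:corpus",
                "@container": "@list",
                "@context": {"name": "foaf:name", "status": "ex:status"},
            },
        },
    },
}


def resolve(path, context=CONTEXT):
    """Return the compact IRI that the last key in `path` maps to.

    Each key is looked up in the currently active term definitions; if a
    definition carries its own "@context", those terms override the active
    ones for every key nested below it.
    """
    active = dict(context)
    term_id = None
    for key in path:
        definition = active[key]
        if isinstance(definition, str):
            term_id = definition
        else:
            term_id = definition["@id"]
            active = {**active, **definition.get("@context", {})}
    return term_id


def expand(curie, context=CONTEXT):
    """Expand a prefix:local compact IRI using the prefixes in `context`."""
    prefix, _, local = curie.partition(":")
    return context[prefix] + local


study_name = resolve(["study", "name"])             # term scoped under study
person_name = resolve(["study", "corpus", "name"])  # term scoped under corpus
print(study_name, person_name, expand(person_name))
```

The same key "name" resolves to ex:name directly under study and to foaf:name inside corpus, which is what the nested @context entries buy you.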

Downsides:
  1. not very DRY; you have essentially two representations describing the nestings in your structure.
  2. Union types in Avro schemas require a disambiguator in the instance data which will ultimately land in your RDF graph:

[[
{
  "type": "record",
  "namespace": "example.avro",
  "name": "array_union",
  "fields": [
    { "name": "name",
      "type": "string" },
    { "name": "study",
      "type": {
        "name": "study",
        "type": "record",
        "fields": [
          { "name": "corpus",
            "type": [              <-- union type of
              "null",                  null
              { "type": "array",       || array of records
                "name": "corpus_name_0",
                "items": {
                  "name": "_name_0",
                  "type": "record",
                  "fields": [
                    { "name": "fname",
                      "type": "string" },
                    { "name": "lname",
                      "type": "string" },
                    { "name": "status",
                      "type": {
                        "name": "StatusType",
                        "type": "enum",
                        "symbols": [
                          "enrolled",
                          "initiated",
                          "completed" ] } }
                  ] } }
            ] }
        ] } }
  ] }
]]

Data:
[[
{
  "study": {
    "name": "PARAMEDIC2",
    "corpus": {
      "array": [                    <-- obligatory union disambiguator
        { "name": "Kathleen Cleaver", "status": "initiated" },
        { "name": "Fredricka Newton", "status": "enrolled" }
      ]
    }
  }
}
]]

Turtle (some @context later):
[[
_:b0 ex:study [
  ex:name "PARAMEDIC2" ;
  ex:corpus [
    grumble:array (                 <-- can't get rid of this with JSON-LD
      [ foaf:name "Kathleen Cleaver" ;
        ex:status "initiated" ]
      [ foaf:name "Fredricka Newton" ;
        ex:status "enrolled" ]
    )
  ]
] .
]]
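
One possible workaround, outside JSON-LD itself, is to strip the union wrappers from the JSON-encoded data before applying the @context. The Python sketch below walks the schema and the data together and removes each single-key wrapper. It is a sketch under assumptions: unwrap_unions() and branch_type_name() are hypothetical helpers, namespaces on named types are ignored, the schema is a trimmed variant of the one above (the record name "Participant" is made up and status is simplified to a string), and keys not declared in the schema are passed through unchanged.

```python
# Sketch: remove Avro JSON-encoding union wrappers, guided by the schema.
# Helper names are hypothetical; namespaces on named types are ignored.
SCHEMA = {
    "type": "record", "name": "Trial",
    "fields": [
        {"name": "study", "type": {
            "type": "record", "name": "study",
            "fields": [
                {"name": "name", "type": "string"},
                {"name": "corpus", "type": [          # union: null | array
                    "null",
                    {"type": "array", "items": {
                        "type": "record", "name": "Participant",
                        "fields": [
                            {"name": "name", "type": "string"},
                            {"name": "status", "type": "string"},
                        ]}},
                ]},
            ]}},
    ],
}

DATA = {
    "study": {
        "name": "PARAMEDIC2",
        "corpus": {"array": [
            {"name": "Kathleen Cleaver", "status": "initiated"},
            {"name": "Fredricka Newton", "status": "enrolled"},
        ]},
    }
}


def branch_type_name(schema):
    """Key used for a union branch in Avro's JSON encoding (simplified)."""
    if isinstance(schema, str):
        return schema
    if schema["type"] in ("record", "enum", "fixed"):
        return schema["name"]
    return schema["type"]


def unwrap_unions(schema, value):
    """Return `value` with every union wrapper object removed."""
    if isinstance(schema, list):                 # a union of branch schemas
        if value is None:
            return None
        (branch_name, inner), = value.items()    # exactly one key expected
        for branch in schema:
            if branch_type_name(branch) == branch_name:
                return unwrap_unions(branch, inner)
        raise ValueError(f"no union branch named {branch_name!r}")
    if isinstance(schema, dict):
        kind = schema["type"]
        if kind == "record":
            fields = {f["name"]: f["type"] for f in schema["fields"]}
            return {k: unwrap_unions(fields[k], v) if k in fields else v
                    for k, v in value.items()}
        if kind == "array":
            return [unwrap_unions(schema["items"], v) for v in value]
    return value                                 # primitives pass through


plain = unwrap_unions(SCHEMA, DATA)
print(plain["study"]["corpus"][0]["name"])
```

After this step "corpus" is a plain JSON array again, so the first @context above would apply without the extra node in the graph. The cost is one more schema-driven processing step in the pipeline.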


> > Joshua
> > 
> > 
> > On Thu, Jun 2, 2022 at 11:50 AM Henry Story <henry.story@bblfish.net> wrote:
> > Hi all,
> > 
> >    I am working a bit with big data stacks and RDF recently.
> > The Big data crowd like to use binary formats such as Apache
> > Avro [2]. These completely separate the schema from the data, 
> > encoding the data in purely binary format which would be 
> > incomprehensible without the schema (for Avro this is a schema written in JSON).
> > 
> > What seems to be missing is a way to mark up the schema [2] the way 
> > CSVW [1] does it for tables, by allowing one to specify what the URIs of the
> > relations or classes are, or how to construct a URI from the data, so
> > that it could be easy to tie it to the linked data cloud.
> > 
> > The advantage of doing this for the BigData crowd would be that it 
> > would allow Big Data engineers to be able to find the definitions 
> > of the data they are using, and some logical infrastructure to 
> > find some established consequences of the relations. It could also
> > allow one to automate the construction of Avro files from the data
> > I guess… 
> > 
> > I looked around on the web but could not find anything clearly
> > going in that direction.
> > 
> > Henry Story
> > 
> > [1] https://twitter.com/bblfish/status/1531932840086077441
> > [2] https://avro.apache.org/docs/current/
> > 
> > 
> > 
> > https://co-operating.systems
> > WhatsApp, Signal, Tel: +33 6 38 32 69 84‬ 
> > Twitter: @bblfish
> > 
> > 
> 
> Henry Story
> 
> https://co-operating.systems
> WhatsApp, Signal, Tel: +33 6 38 32 69 84‬ 
> Twitter: @bblfish
> 
> 

Received on Friday, 3 June 2022 10:58:31 UTC