Re: GRDDL for BigData or CSVW for Avro?

From: Henry Story <henry.story@bblfish.net>
Date: Fri, 3 Jun 2022 05:59:38 +0200
Cc: semantic-web@w3.org
Message-Id: <6E88E785-F1B6-40D9-8ACA-DB40BC35AC64@bblfish.net>
To: Joshua Shinavier <josh@fortytwo.net>


> On 2. Jun 2022, at 21:08, Joshua Shinavier <josh@fortytwo.net> wrote:
> 
> Hi Henry,
> 
> The best thing about Avro's AVSC format is that it is "just JSON", and will accept any annotations you care to attach as properties. Uber (as of the time I left in July) enriches Avro schemas with annotations which link types and fields to standardized terms, add constraints such as cardinality, mark fields which correspond to primary keys, etc. See my presentation from that time. The annotations were even sufficient to map Avro data into RDF, with SHACL as the schema language. I have seen other companies including LinkedIn enriching Avro schemas in similar ways.

Thanks for reminding me of that talk. 
I could not find a recording of it but I saw a shorter version
https://twitter.com/bblfish/status/1162660513005953024

The slides by themselves are easy to read (though for some reason they don’t go full screen)
https://eng.uber.com/dragon-schema-integration-at-uber-scale/

Actually, the way you present your idea using YAML around slide 40 of those slides
makes me wonder whether the "Schema Salad" project I came across
is in fact doing what I am looking for:

Semantic Annotations for Linked Avro Data (SALAD) 
https://github.com/common-workflow-language/schema_salad

It mentions JSON, JSON-LD and Avro, but its examples
are in YAML, which may be what confused me.

> 
> In some cases, however, it may be preferable not to enrich the Avro schema, but to write the schema in a formally-specified, strongly-typed language, and map that schema to AVSC without any extra bells and whistles. When you need the features of the stronger schema, you refer back to it via lineage.

Yes, I think there are many ways to create the schema. (I had a project at Sun Microsystems to
enhance Java with simple @rdf annotations on classes and files to give those URLs and so map them
to RDF.)

What I think would be useful is if Big Data Devs could easily see the ontology behind the schemas
or the code just to make the meaning of the data as transparent as possible and easy to access. 
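
To make that concrete, here is a minimal sketch in Python of how such annotations could be read
back out of an AVSC file. It relies only on the fact that Avro permits extra attributes in a
schema as metadata. The property names "termIri" and "aboutUrl" (the latter loosely modelled on
CSVW) and the example.org URLs are invented for illustration; they are not defined by Avro, CSVW
or Schema Salad.

```python
import json

# Hypothetical annotation keys: "termIri" and "aboutUrl" are made up for this
# sketch. Avro itself only permits such unknown attributes as schema metadata.
AVSC = """
{
  "type": "record",
  "name": "Person",
  "namespace": "example",
  "termIri": "http://xmlns.com/foaf/0.1/Person",
  "aboutUrl": "https://example.org/people/{id}",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "name", "type": "string",
     "termIri": "http://xmlns.com/foaf/0.1/name"},
    {"name": "age", "type": ["null", "int"], "default": null,
     "termIri": "http://xmlns.com/foaf/0.1/age"}
  ]
}
"""

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def field_terms(schema):
    """Return {field name: term IRI} for every field carrying an annotation."""
    return {f["name"]: f["termIri"] for f in schema["fields"] if "termIri" in f}


def record_to_triples(schema, record):
    """Build (subject, predicate, object) tuples for one decoded record."""
    # The subject URI is constructed from the data via the URI template.
    subject = schema["aboutUrl"].format(**record)
    triples = [(subject, RDF_TYPE, schema["termIri"])]
    for name, iri in field_terms(schema).items():
        value = record.get(name)
        if value is not None:  # skip null / missing values
            triples.append((subject, iri, value))
    return triples


schema = json.loads(AVSC)
triples = record_to_triples(schema, {"id": "42", "name": "Alice", "age": None})
for triple in triples:
    print(triple)
```

A developer could call field_terms(schema) to look up the definition behind each field, while
record_to_triples shows roughly how the same annotations would tie a record to linked data.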

I’ll look into Schema Salad 
https://www.commonwl.org/v1.0/SchemaSalad.html

> 
> Joshua
> 
> 
> On Thu, Jun 2, 2022 at 11:50 AM Henry Story <henry.story@bblfish.net> wrote:
> Hi all,
> 
>    I am working a bit with big data stacks and RDF recently.
> The Big Data crowd like to use binary formats such as Apache
> Avro [2]. These completely separate the schema from the data,
> encoding the data in a purely binary format which would be
> incomprehensible without the schema (for Avro the schema is written in JSON).
> 
> What seems to be missing is a way to mark up the schema [2] the way
> CSVW [1] does it for tables, by allowing one to specify what the URIs of the
> relations or classes are, or how to construct a URI from the data, so
> that it would be easy to tie it to the linked data cloud.
> 
> The advantage of doing this for the Big Data crowd would be that it
> would let Big Data engineers find the definitions
> of the data they are using, and give them some logical infrastructure to
> find established consequences of the relations. It could also
> allow one to automate the construction of Avro files from the data,
> I guess…
> 
> I looked around on the web but could not find anything clearly
> going in that direction.
> 
> Henry Story
> 
> [1] https://twitter.com/bblfish/status/1531932840086077441
> [2] https://avro.apache.org/docs/current/
> 
> 
> 
> https://co-operating.systems
> WhatsApp, Signal, Tel: +33 6 38 32 69 84
> Twitter: @bblfish
> 
> 

Henry Story

https://co-operating.systems
WhatsApp, Signal, Tel: +33 6 38 32 69 84
Twitter: @bblfish
Received on Friday, 3 June 2022 03:59:54 UTC
