Re: GRDDL for BigData or CSVW for Avro?

From: Joshua Shinavier <josh@fortytwo.net>
Date: Thu, 2 Jun 2022 12:08:06 -0700
Message-ID: <CACrq4OHONa150B7Ad04iGc-oROuouZvndFA6gKuegCJg18yObw@mail.gmail.com>
To: Henry Story <henry.story@bblfish.net>
Cc: semantic-web@w3.org
Hi Henry,

The best thing about Avro's AVSC format is that it is "just JSON", and will
accept any annotations you care to attach as properties. Uber (as of the
time I left in July) enriches Avro schemas with annotations which link
types and fields to standardized terms, add constraints such as
cardinality, mark fields which correspond to primary keys, etc. See my
presentation <https://eng.uber.com/dragon-schema-integration-at-uber-scale/>
from that time. The annotations were even sufficient to map Avro data into
RDF, with SHACL as the schema language. I have seen other companies
including LinkedIn enriching Avro schemas in similar ways.
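
To make the idea concrete, here is a minimal sketch in Python. The
annotation keys "term" and "primaryKey", the example.org URIs, and the
to_ntriples helper are all assumptions made up for illustration; they are
not the conventions Uber or LinkedIn used. Avro ignores properties it does
not know, so any naming scheme would do.

```python
import json

# An AVSC record with hypothetical annotations: "term" links a type or field
# to a vocabulary URI, "primaryKey" marks the identifying field.
AVSC = json.loads("""
{
  "type": "record",
  "name": "Trip",
  "namespace": "com.example",
  "term": "http://example.org/ns#Trip",
  "fields": [
    {"name": "id", "type": "string", "primaryKey": true},
    {"name": "riderName", "type": "string", "term": "http://schema.org/name"},
    {"name": "fare", "type": "double"}
  ]
}
""")

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def to_ntriples(schema, record, base="http://example.org/trip/"):
    """Map one decoded Avro record (a dict) to N-Triples lines.

    Simplification: every value is emitted as a plain string literal, and
    JSON string escaping is used as an approximation of N-Triples escaping.
    """
    # The subject URI is built from the field annotated as the primary key.
    key_field = next(f["name"] for f in schema["fields"] if f.get("primaryKey"))
    subject = f"<{base}{record[key_field]}>"
    lines = [f"{subject} <{RDF_TYPE}> <{schema['term']}> ."]
    for field in schema["fields"]:
        value = record.get(field["name"])
        # Fields without a "term" annotation are skipped.
        if "term" in field and value is not None:
            lines.append(f"{subject} <{field['term']}> {json.dumps(str(value))} .")
    return lines


for line in to_ntriples(AVSC, {"id": "t1", "riderName": "Ada", "fare": 12.5}):
    print(line)
```

A real mapping would also need typed literals, nested records, and a SHACL
shape generated from the same annotations; none of that is attempted here.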

In some cases, however, it may be preferable not to enrich the Avro schema,
but to write the schema in a formally-specified, strongly-typed language,
and map that schema to AVSC without any extra bells and whistles. When you
need the features of the stronger schema, you refer back to it via lineage.
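
As a rough sketch of that second approach, the following uses a Python
dataclass as a stand-in for the stronger schema language (an assumption for
illustration only; a formally-specified language would take its place in
practice). The Trip class, the AVRO_TYPES table, and the to_avsc helper are
hypothetical names.

```python
import json
from dataclasses import dataclass, fields
from typing import get_type_hints

# Assumed mapping from Python annotations to Avro primitive types.
AVRO_TYPES = {str: "string", int: "long", float: "double", bool: "boolean"}


@dataclass
class Trip:
    """The source-of-truth schema; constraints and semantics would live here."""
    id: str
    rider_name: str
    distance_km: float


def to_avsc(cls, namespace="com.example"):
    """Emit a plain Avro record schema with no extra annotations."""
    hints = get_type_hints(cls)
    return {
        "type": "record",
        "name": cls.__name__,
        "namespace": namespace,
        "fields": [
            {"name": f.name, "type": AVRO_TYPES[hints[f.name]]}
            for f in fields(cls)
        ],
    }


print(json.dumps(to_avsc(Trip), indent=2))
```

The generated AVSC carries nothing beyond names and types; the link back to
the source schema would be recorded separately, as lineage.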

Joshua


On Thu, Jun 2, 2022 at 11:50 AM Henry Story <henry.story@bblfish.net> wrote:

> Hi all,
>
>    I have been working a bit with big data stacks and RDF recently.
> The Big Data crowd like to use binary formats such as Apache
> Avro [2]. These completely separate the schema from the data,
> encoding the data in a purely binary format which would be
> incomprehensible without the schema (for Avro, the schema is written
> in JSON).
>
> What seems to be missing is a way to mark up the schema [2] the way
> CSVW [1] does it for tables, by allowing one to specify what the URIs of
> the relations or classes are, or how to construct a URI from the data, so
> that it would be easy to tie the data to the linked data cloud.
>
> The advantage of doing this for the Big Data crowd would be that it
> would allow Big Data engineers to find the definitions of the data
> they are using, and would give them some logical infrastructure for
> finding established consequences of the relations. It could also
> allow one to automate the construction of Avro files from the data,
> I guess…
>
> I looked around on the web but could not find anything clearly
> going in that direction.
>
> Henry Story
>
> [1] https://twitter.com/bblfish/status/1531932840086077441
> [2] https://avro.apache.org/docs/current/
>
>
>
> https://co-operating.systems
> WhatsApp, Signal, Tel: +33 6 38 32 69 84
> Twitter: @bblfish
>
>
>
Received on Thursday, 2 June 2022 19:08:30 UTC