Re: GRDDL for BigData or CSVW for Avro? from Joshua Shinavier on 2022-06-03 (semantic-web@w3.org from June 2022)

From: Joshua Shinavier <josh@fortytwo.net>
Date: Thu, 2 Jun 2022 22:04:05 -0700
To: Henry Story <henry.story@bblfish.net>
Cc: semantic-web@w3.org
Message-ID: <CACrq4OFSn5n1J3iNjYPmmqR668LKL+ryLXXW_5aVARB-BtzDUg@mail.gmail.com>
On Thu, Jun 2, 2022 at 8:59 PM Henry Story <henry.story@bblfish.net> wrote:

>
> [...]
> Thanks for reminding me of that talk.
> I could not find a recording of it but I saw a shorter version
> https://twitter.com/bblfish/status/1162660513005953024



Ah, that is an earlier talk (at the first KGC) where I was essentially
complaining about how fragmented the company's data landscape was, and how
it forced me to work a couple of layers lower in the knowledge graph
hierarchy of needs than I would have preferred -- we needed solutions for
bridging the gaps between data languages and disconnected schemas in order
to treat our data as more of a cohesive entity, up to and including as a
graph. The schema software arose out of those needs.

The slides I linked are from a US2TS talk
<https://us2ts.org/2020/keynote-joshua-shinavier> about a year later. I
saved a recording in my company Google Drive at the time, but do not have
access now. Maybe Juan Sequeda has a recording somewhere.



> [...]
>
> Actually, the idea you present in those slides using YAML (around slide 40)
> is making me wonder whether this "Schema Salad" project I came across
> is not actually doing what I am looking for
>
> [...]



Schema Salad looks good, and I would imagine its approach could be extended
to other data exchange languages like Protobuf and Thrift, as well as
GraphQL and relational schemas. Most of the data languages used in the sort
of tech companies I have worked for have a similar flavor, and it
surprisingly is not an OO-like flavor, but an algebraic one: records,
unions, and literals.
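
To make "records, unions, and literals" concrete, here is a minimal sketch
in Python. The class names (Literal, Record, UnionType) and the describe
helper are hypothetical, chosen only for illustration; they are not taken
from Schema Salad, Avro, or any other tool mentioned in this thread.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Literal:
    """A primitive type such as "string", "int", or "null"."""
    name: str


@dataclass(frozen=True)
class UnionType:
    """A choice between several alternative types."""
    options: Tuple["Type", ...]


@dataclass(frozen=True)
class Record:
    """A named product of (field name, field type) pairs."""
    name: str
    fields: Tuple[Tuple[str, "Type"], ...]


Type = Union[Literal, UnionType, Record]


def describe(t: Type) -> str:
    """Render a type as a compact, human-readable string."""
    if isinstance(t, Literal):
        return t.name
    if isinstance(t, UnionType):
        return " | ".join(describe(o) for o in t.options)
    if isinstance(t, Record):
        body = ", ".join(f"{n}: {describe(ft)}" for n, ft in t.fields)
        return f"{t.name}{{{body}}}"
    raise TypeError(f"not a schema type: {t!r}")


# A record with one required field and one optional (null-or-int) field.
user = Record("User", (
    ("id", Literal("string")),
    ("age", UnionType((Literal("null"), Literal("int")))),
))
print(describe(user))  # User{id: string, age: null | int}
```

Nothing here involves inheritance or methods attached to the data, which is
the sense in which the flavor is algebraic rather than OO-like.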


 [...]

> Yes, I think there are many ways to create the schema. (I had a project at
> Sun Microsystems to
> enhance Java with simple @rdf annotations on classes and files to give
> those URLs and so map them
> to RDF.)


There, OWL might make a lot of sense as a schema language. I have found
shapes constraint languages (SHACL being the one I've used in practice, but
ShEx would work, too) to be a better fit for languages like Avro.


> What I think would be useful is if Big Data Devs could easily see the
> ontology behind the schemas
> or the code just to make the meaning of the data as transparent as
> possible and easy to access.



Avro is pretty good in that it natively allows you to define types in
global namespaces -- so identifying records with classes or shapes does not
even require an annotation. You do need to tell it where to find foreign
key references -- e.g. this string is actually a reference to a User by
uuid, that number is actually a reference to a geo place, etc. -- but it
doesn't take much. Even Protobuf schemas can be mapped pretty neatly to OWL
ontologies for querying or visualization, without any semantic annotation
at all. I used to use Dragon+Gra.fo as a way of turning a few hundred type
definitions into bubbles and lines which engineers could look at in order
to easily appreciate the structure of their schemas.
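
As a rough sketch of the first two points, the snippet below derives an IRI
for an Avro record from its namespace and name, and picks up foreign key
references from a field-level attribute. The base IRI, the "refersTo"
attribute name, and the example schema are all assumptions made for
illustration (Avro permits extra attributes as metadata, but it does not
define this one), and this is not the code of any tool mentioned above.

```python
BASE = "https://example.org/schema/"  # hypothetical base IRI


def full_name(schema: dict) -> str:
    """Avro full name: a dotted name stands alone, else namespace + name."""
    name = schema["name"]
    namespace = schema.get("namespace")
    if "." in name or not namespace:
        return name
    return f"{namespace}.{name}"


def record_iri(schema: dict, base: str = BASE) -> str:
    """Identify a record type with a class or shape IRI, no annotation needed."""
    return base + full_name(schema)


def field_links(schema: dict, base: str = BASE) -> dict:
    """Map each annotated field to the IRI of the record type it references."""
    links = {}
    for field in schema["fields"]:
        target = field.get("refersTo")  # hypothetical metadata attribute
        if target is not None:
            links[field["name"]] = base + target
    return links


# Example schema, written as the Python equivalent of an Avro JSON schema.
login = {
    "type": "record",
    "namespace": "com.example.events",
    "name": "Login",
    "fields": [
        {"name": "userUuid", "type": "string",
         "refersTo": "com.example.users.User"},
        {"name": "placeId", "type": "long",
         "refersTo": "com.example.geo.Place"},
        {"name": "timestamp", "type": "long"},
    ],
}

print(record_iri(login))
print(field_links(login))
```

Fields without the annotation (timestamp above) stay plain literals, so the
only extra information the schema author has to supply is the set of
references.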

Joshua




> I’ll look into Schema Salad
> https://www.commonwl.org/v1.0/SchemaSalad.html
>
> >
> > Joshua
> >
> >
> > On Thu, Jun 2, 2022 at 11:50 AM Henry Story <henry.story@bblfish.net>
> wrote:
> > Hi all,
> >
> >    I am working a bit with big data stacks and RDF recently.
> > The Big data crowd like to use binary formats such as Apache
> > Avro [2]. These completely separate the schema from the data,
> > encoding the data in purely binary format which would be
> > incomprehensible without the schema (for Avro the schema is written in JSON).
> >
> > What seems to be missing is a way to mark up the schema [2] the way
> > CSVW [1] does it for tables, by allowing one to specify what the URIs of
> > the relations or classes are, or how to construct a URI from the data, so
> > that it could be easy to tie it to the linked data cloud.
> >
> > The advantage of doing this for the BigData crowd would be that it
> > would allow Big Data engineers to be able to find the definitions
> > of the data they are using, and some logical infrastructure to
> > find some established consequences of the relations. It could also
> > allow one to automate the construction of Avro files from the data
> > I guess…
> >
> > I looked around on the web but could not find anything clearly
> > going in that direction.
> >
> > Henry Story
> >
> > [1] https://twitter.com/bblfish/status/1531932840086077441
> > [2] https://avro.apache.org/docs/current/
> >
> >
> >
> > https://co-operating.systems
> > WhatsApp, Signal, Tel: +33 6 38 32 69 84
> > Twitter: @bblfish
> >
> >
>
> Henry Story
>
> https://co-operating.systems
> WhatsApp, Signal, Tel: +33 6 38 32 69 84
> Twitter: @bblfish
>
>
Received on Friday, 3 June 2022 05:04:30 UTC