Re: Introducing CBOR-LD... from Nader Helmy on 2020-07-24 (public-credentials@w3.org from July 2020)

From: Nader Helmy <creator.nader@gmail.com>
Date: Fri, 24 Jul 2020 14:29:10 -0500
To: Orie Steele <orie@transmute.industries>
Cc: Leonard Rosenthol <lrosenth@adobe.com>, Manu Sporny <msporny@digitalbazaar.com>, "public-credentials@w3.org" <public-credentials@w3.org>
Message-ID: <CAKTXcdcXHS8b0Z5j1yjaA3cL=PdKnK_-4WU_5XNzxu-0gNB+Xg@mail.gmail.com>

Stepping back to ask a simple question:

What is the relationship or the difference between this new DB spec:
https://digitalbazaar.github.io/cbor-ld-spec/

And this existing (and sparsely populated) draft spec at the W3C:
https://w3c.github.io/json-ld-cbor/

Is the former spec simply an evolution of the latter? What’s the delta
between these approaches?

On Fri, Jul 24, 2020 at 11:57 AM Orie Steele <orie@transmute.industries>
wrote:

> Sorry I am late to the CBOR-LD Party!
>
> Very excited to have a semantic linked data format that is also usable in
> a compact binary representation, and to have bi-directional
> transformation out of the box... I have been playing with CBOR on the
> weekends, and I have a repo here:
> https://github.com/transmute-industries/decentralized-cbor/blob/master/src/__fixtures__/outputs/table.csv
>
> The repo compares JSON, JSON-LD, CBOR, DAG_CBOR and ZLIB_URDNA2015_CBOR
> (another approach to a compressed linked data format in CBOR)... I am
> eager to add tests for CBOR-LD.
>
> Both DAG_CBOR and CBOR-LD have some benefits over CBOR,
> ZLIB_URDNA2015_CBOR and JSON....
>
> Both are linked data formats where the linked data aspect is preserved at
> the binary level. ZLIB_URDNA2015_CBOR is just a compressed JSON-LD object
> encoded as CBOR; you cannot leverage its internal semantics... in much the
> same way you cannot leverage the internal semantics of "Pure JSON" and
> "Pure CBOR".... However, ZLIB_URDNA2015_CBOR is MUCH smaller than DAG_CBOR /
> "Pure CBOR" that was built from "Pure JSON", and CBOR-LD is MUCH smaller
> than ZLIB_URDNA2015_CBOR...
>
> Backing up for a second, one way to think about why CBOR-LD is awesome is
> to consider how all software that processes data has some opinion about
> that data... sometimes these opinions are encoded in schema validation of
> incoming data (using tools like JSON Schema or Protobuf)... If you
> consider that changes to data on the wire would cause the software to
> explode... you can see why agreeing to a common context is similar to
> agreeing to a data schema....
>
> And by relying on an existing context to build a compressed binary
> representation of a semantic object, we can leverage these "common
> dictionaries / vocabularies" not just for semantic disambiguation, but also
> for compression....
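A minimal sketch of the idea above, assuming a made-up context and a naive "sort the terms and number them" rule. This is not the CBOR-LD algorithm, and the `CONTEXT` terms, IRIs and helper names are hypothetical; it only shows how a context both sides already share can double as a compression dictionary.

```python
# Hypothetical, simplified context (term -> IRI). Not a real published context.
CONTEXT = {
    "issuer": "https://example.org/vocab#issuer",
    "issuanceDate": "https://example.org/vocab#issuanceDate",
    "credentialSubject": "https://example.org/vocab#credentialSubject",
}

def build_dictionary(context):
    """Assign each term a small integer, in sorted order so both sides agree."""
    return {term: code for code, term in enumerate(sorted(context))}

def compress(doc, term_to_code):
    """Replace known keys with integer codes; unknown keys pass through."""
    return {term_to_code.get(key, key): value for key, value in doc.items()}

def decompress(doc, term_to_code):
    """Invert compress() using the same shared dictionary."""
    code_to_term = {code: term for term, code in term_to_code.items()}
    return {code_to_term.get(key, key): value for key, value in doc.items()}

dictionary = build_dictionary(CONTEXT)
doc = {"issuer": "did:example:123", "issuanceDate": "2020-07-24T14:29:10Z"}
packed = compress(doc, dictionary)
restored = decompress(packed, dictionary)
```

Only the integer keys travel on the wire; the term names are recovered from the context the receiver already holds.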
>
> Obviously the IoT space has been waiting for something like this for a
> long time...
>
> - https://www.w3.org/WoT/
> - https://github.com/Azure/opendigitaltwins-dtdl/blob/master/DTDL/v2/dtdlv2.md
>
> We are now able to convert all these ontologies and semantic vocabularies,
> into compact, interoperable, binary representations for industries that
> have already committed to the semantic web:
> https://github.com/semantalytics/awesome-semantic-web#ontologies
>
> I'm not sure of the potential internal representation benefits for
> services like https://developers.google.com/knowledge-graph but
> obviously, a small IoT device that only speaks CBOR-LD would not need to
> crack out a JSON parser and all the attack surface associated with it, just
> to talk to the knowledge graph service.
>
> OS
>
>
>
> On Fri, Jul 24, 2020 at 10:54 AM Leonard Rosenthol <lrosenth@adobe.com>
> wrote:
>
>> It's not just specific schemas but also the order of the schemas, any
>> other keys you add, plus additional "techniques" you add.
>>
>> Using your presentation as a guide:
>> Slide 11:
>>
>> In that case you have picked a single schema, found all the items, and
>> given each a unique value (let's say 1-10). Now (not shown on the slide,
>> but...), I assume that you then pick another schema and start allocating
>> values for it in the dictionary (e.g. 11-20), and so on. At some point the
>> credentials schema is updated (1.1->1.2) - but you can't update the
>> existing entries in the dictionary, so you just add the new ones to the
>> end (e.g. 100-105). And then you encode something using that dictionary -
>> how does something downstream know that you are using the 1.2 version of
>> the context? It would simply have a 100 in there - but w/o that in the
>> dictionary, it's not decodable.
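The failure mode described above can be sketched as follows. The two code tables and the `evidence` term are hypothetical stand-ins, not the real credentials context or its actual code assignments.

```python
# Hypothetical code tables; real contexts and code assignments differ.
DICT_V1_1 = {1: "issuer", 2: "issuanceDate"}
DICT_V1_2 = {**DICT_V1_1, 100: "evidence"}  # new term appended at the end

def decode(packed, code_to_term):
    """Map integer codes back to terms; fail loudly on an unknown code."""
    out = {}
    for code, value in packed.items():
        if code not in code_to_term:
            raise KeyError(f"code {code} is not in this dictionary")
        out[code_to_term[code]] = value
    return out

# Encoded by a producer that already uses the 1.2 table.
packed = {1: "did:example:123", 100: "some evidence"}

decoded_new = decode(packed, DICT_V1_2)  # a 1.2-aware decoder succeeds

try:
    decode(packed, DICT_V1_1)            # a 1.1-only decoder hits code 100
    old_decoder_failed = False
except KeyError:
    old_decoder_failed = True
```

Nothing in `packed` itself says which table produced it, which is the gap the question points at.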
>>
>>
>> Slide 14:
>>
>> This is a good example of how to reduce size by switching from a string
>> representation to binary. I assume we will find more of those cases over
>> time. *BUT* a decoder needs to understand this encoding approach - and
>> again, how would it recognize something new?
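As an illustration of the string-to-binary trade, here is a sketch that packs an ISO 8601 UTC timestamp into a 4-byte integer. The choice of timestamps and of a 32-bit seconds-since-epoch layout is an assumption made for this example; it is not necessarily what slide 14 shows or what CBOR-LD does.

```python
import struct
from datetime import datetime, timezone

FORMAT = "%Y-%m-%dT%H:%M:%SZ"

def encode_timestamp(text):
    """Pack an ISO 8601 UTC timestamp string into 4 bytes (seconds since epoch)."""
    moment = datetime.strptime(text, FORMAT).replace(tzinfo=timezone.utc)
    return struct.pack(">I", int(moment.timestamp()))

def decode_timestamp(data):
    """Unpack the 4-byte form back into the original string."""
    (seconds,) = struct.unpack(">I", data)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(FORMAT)

text = "2020-07-24T14:29:10Z"
binary = encode_timestamp(text)
text_size = len(text.encode("utf-8"))   # size of the string form in bytes
binary_size = len(binary)               # size of the packed form in bytes
round_trip = decode_timestamp(binary)
```

The saving only works if the decoder knows that this field uses this packing, which is exactly the recognition problem raised above.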
>>
>>
>> At a minimum, we need a way to encode the version of the CBOR-SC
>> algorithm that is used to encode a given data set.   That would go a *long
>> way* to resolving my concerns.
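One way to picture the request above is an envelope that carries the encoding version next to the payload. The `"v"` and `"data"` field names, the `DICTIONARIES` registry and its contents are all hypothetical; this is a sketch under those assumptions, not a proposal for the actual wire format.

```python
# Hypothetical registry: encoding version -> code table.
DICTIONARIES = {
    1: {1: "issuer", 2: "issuanceDate"},
    2: {1: "issuer", 2: "issuanceDate", 100: "evidence"},
}

def encode(doc, version):
    """Compress keys with the chosen table and record which table was used."""
    term_to_code = {term: code for code, term in DICTIONARIES[version].items()}
    # The version travels with the payload so a decoder can pick the right table.
    return {"v": version, "data": {term_to_code[k]: v for k, v in doc.items()}}

def decode(envelope):
    """Pick the table named in the envelope; refuse versions we do not know."""
    version = envelope["v"]
    if version not in DICTIONARIES:
        raise ValueError(f"unsupported encoding version {version}")
    table = DICTIONARIES[version]
    return {table[code]: value for code, value in envelope["data"].items()}

doc = {"issuer": "did:example:123", "evidence": "some evidence"}
envelope = encode(doc, version=2)
restored = decode(envelope)
```

A decoder that only knows version 1 can then report "unsupported version" instead of tripping over an unexplained code 100.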
>>
>> Leonard
>>
>> On 7/24/20, 11:19 AM, "Manu Sporny" <msporny@digitalbazaar.com> wrote:
>>
>>     On 7/24/20 11:00 AM, Leonard Rosenthol wrote:
>>     > However, the main use case that you present in the presentation is
>>     > QRCodes - which exist as a mechanism to move from digital to analog
>>     > (and back).   The analog world is long lived - even if not
>>     > necessarily archival - and the data needs to be retrievable.  And
>>     > that can't happen w/o knowing the right (version of the) dictionary
>>     > to use.
>>
>>     ... which is why we strongly suggest that all production contexts
>>     should be versioned, frozen, and cryptographically hashed. There is a
>>     general mitigation for your concern. :)
>>
>>     To be clear, this issue is well known in the JSON-LD ecosystem and
>>     that ecosystem has thrived (deployed on tens of millions of domains)
>>     in spite of the danger. That community has learned how to manage
>>     constantly evolving vocabularies (schema.org), and how to lock
>>     vocabularies down (VCs).
>>
>>     There are solutions to the problem you outline; cryptographically
>>     hashing URLs is one thing we explored, but that bloats the size of the
>>     CBOR-LD bytes. Like any technology, CBOR-LD is a series of difficult
>>     design trade-offs.
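A rough sketch of the size trade-off mentioned above, using SHA-256 from Python's standard library. The context bytes, the example URL and the one-byte registry code are hypothetical, and the scheme is an illustration only, not the design that was explored.

```python
import hashlib

# Hypothetical frozen context document, serialized to bytes.
context_bytes = b'{"@context": {"issuer": "https://example.org/vocab#issuer"}}'
context_url = "https://example.org/contexts/v1"

# Hashing pins the exact context content that a URL refers to...
digest = hashlib.sha256(context_bytes).digest()

# ...but carrying the digest alongside the URL costs 32 extra bytes per context.
hashlinked_reference = context_url.encode("utf-8") + digest

# Hypothetical alternative: a one-byte code for a well-known, frozen context.
registry_reference = bytes([1])

hashlinked_size = len(hashlinked_reference)
registry_size = len(registry_reference)
```

For payloads meant to fit in something like a QR code, those extra bytes per referenced context are the bloat being weighed against the versioning risk.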
>>
>>     Just like we made the conscious decision in JSON-LD to be able to
>>     reference external JSON-LD Context files (which people insisted was
>>     madness and unworkable when we did it... and still do), we make the
>>     same conscious decision now (because it worked out pretty well for
>>     JSON-LD, and it's not clear how doing the same thing in CBOR-LD would
>>     be any different).
>>
>>     If we wanted to eliminate the risk you highlighted, we wouldn't be
>>     able to solve the most pressing use cases.
>>
>>     -- manu
>>
>>     --
>>     Manu Sporny - https://www.linkedin.com/in/manusporny/
>>     Founder/CEO - Digital Bazaar, Inc.
>>     blog: Veres One Decentralized Identifier Blockchain Launches
>>     https://tinyurl.com/veres-one-launches
>>
>>
>
> --
> *ORIE STEELE*
> Chief Technical Officer
> www.transmute.industries
>
> <https://www.transmute.industries>
>
Received on Tuesday, 28 July 2020 10:08:02 UTC