Efficient RDF Interchange, Re: Zippy

I worked on W3C Efficient XML Interchange (EXI) from before the formation of the working group almost all the way through 
standardization, until my work situation changed.  A number of my ideas are in there, although several that I felt strongly 
about are not (deltas, standardized interchange of a compiled-schema baseline, and byte alignment of byte data via a fast, 
efficient, novel peephole algorithm that adds almost no padding).  Toward the end, I became much more interested in the RDF 
interchange problem, but I have worked on other things since.

At that time, and somewhat since, I developed an architecture and design for efficient RDF / N-tuples interchange.  There are 
many tradeoffs, but we spent years on EXI examining very similar issues, albeit for a significantly different problem space.  
RDF and other graph data have a wider range of possible uses, data characteristics, and opportunities for both specific and 
general optimization.  I'm planning to finish that design and implementation soon for my own work, which leverages semantic web 
technologies, including RDF or RDF-like data.  I'm more focused on the user interface paradigm, app, and ecosystem design than 
on interchange, but interchange is a big part of the problem.

One of the main points of EXI, and of ERI, is a compact yet fast, directly usable representation: parsing is avoided or 
minimized, and decompression is not necessarily required.  Compression is optionally layered on top when repetition in names or 
values makes it worthwhile, but the structure is compact even without it.  In the case of triples (and quads, etc.), it is very 
easy to fully separate structure, at two levels, from values, which are naturally reused, dictionary-"compressed", and then 
optionally compressed further.
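
To make the structure/value separation concrete, here is a toy Python sketch.  The fixed-width IDs and NUL-separated 
dictionary are illustrative placeholders, not the actual encoding I have in mind:

import struct
import zlib

def encode_triples(triples):
    """Dictionary-encode (s, p, o) string triples.

    Each distinct term (IRI/literal) is stored once in a value
    dictionary; the structure stream is fixed-width integer IDs,
    so it can be scanned without parsing the values at all.
    """
    ids = {}              # term -> integer ID
    terms = []            # ID -> term, in first-seen order
    structure = bytearray()
    for s, p, o in triples:
        for term in (s, p, o):
            if term not in ids:
                ids[term] = len(terms)
                terms.append(term)
            structure += struct.pack("<I", ids[term])
    value_stream = "\x00".join(terms).encode("utf-8")
    return bytes(structure), value_stream

def decode_triples(structure, value_stream):
    terms = value_stream.decode("utf-8").split("\x00")
    ids = struct.unpack("<%dI" % (len(structure) // 4), structure)
    return [tuple(terms[i] for i in ids[n:n + 3])
            for n in range(0, len(ids), 3)]

triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob",   "foaf:knows", "ex:carol"),
    ("ex:alice", "foaf:name",  '"Alice"'),
]
structure, values = encode_triples(triples)
assert decode_triples(structure, values) == triples

# Compression is a separate, optional layer per stream, applied only
# where repetition makes it pay; any codec could stand in for zlib here.
packed = (zlib.compress(structure), zlib.compress(values))

A real design would use variable-length IDs, separate streams per position, and so on, but the separation itself is the point: 
the structure stream stays directly traversable whether or not the value dictionary is compressed.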

I'm very interested in quads or n-tuples (probably N-Quads), where I don't have to represent provenance, 
document/database/group membership, and other metadata strictly as triples (although they can always be recast as triples).  
I'm also interested in ways of chunking and delta-encoding graph data for efficiency of transport, memory, computation, etc.
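
To illustrate the delta idea, a toy sketch along the same lines; the prov:/ex: terms are made up for the example:

def diff(old, new):
    """Delta between two quad sets: (additions, removals)."""
    old, new = set(old), set(new)
    return new - old, old - new

def apply_delta(base, delta):
    additions, removals = delta
    return (set(base) - removals) | additions

# Quads are plain (s, p, o, g) tuples; the fourth position names the
# graph, which is where provenance/membership metadata can live.
v1 = {
    ("ex:alice",  "foaf:knows",          "ex:bob",     "ex:graph1"),
    ("ex:graph1", "prov:wasDerivedFrom", "ex:source1", "ex:meta"),
}
v2 = {
    ("ex:alice",  "foaf:knows",          "ex:bob",     "ex:graph1"),
    ("ex:alice",  "foaf:knows",          "ex:carol",   "ex:graph1"),
    ("ex:graph1", "prov:wasDerivedFrom", "ex:source2", "ex:meta"),
}
delta = diff(v1, v2)
assert apply_delta(v1, delta) == v2
# Only the delta (two additions, one removal) needs to cross the wire;
# chunks of quads can likewise be shipped and decoded independently.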

Has anyone been working on a compact, efficient binary representation of RDF/N-Quads or similar?  Chunking or deltas?
Does anyone want to work on these problems?  I'm deep into some projects, but I might be interested in some arrangement to push 
this forward: consulting, co-founding, or something otherwise mutually beneficial.  As with my early binary / efficient XML 
work, this is all independent research for me.

My main interest is in solving the user interface / visualization / mental model problem, for A) a much better experience when 
working with all kinds of large or complex knowledge, and B) interfacing to, representing, and creating organized semantic / 
linked data.  I'm working on a Knowledge Browser and related paradigms to complement the web browser and search paradigms.  My 
goal is to improve knowledge organization and access for everyone, from neophytes to advanced knowledge workers.

Thanks,
Stephen

On 2/18/14 12:53 AM, Ross Horne wrote:
> Hi Michel,
>
> I think your point is worth considering. Google makes heavy use of
> Zippy, rather than gzip, simply to reduce the latency of reading and
> sending large amounts of data; see [1]. (Of course, storage is not a
> limitation.)
>
> Could Zippy have a role in Linked Data protocols?
>
> Regards,
>
> Ross
>
> [1] Dean, Jeff. "Designs, lessons and advice from building large
> distributed systems." Keynote from LADIS (2009).
> http://www.lamsade.dauphine.fr/~litwin/cours98/CoursBD/doc/dean-keynote-ladis2009_scalable_distributed_google_system.pdf
>
>
> On 18 February 2014 10:42, Michel Dumontier <michel.dumontier@gmail.com> wrote:
>> Hi Tim,
>>    That folder contains 350GB of compressed RDF. I'm not about to unzip it
>> just because a crawler can't decompress it on the fly.  Honestly, it worries me
>> that people aren't considering the practicalities of storing, indexing, and
>> presenting all this data.
>>    Nevertheless, Bio2RDF does provide VoID definitions, URI resolution, and
>> access to SPARQL endpoints.  I can only hope our data gets discovered.
>>
>> m.
>>
>> Michel Dumontier
>> Associate Professor of Medicine (Biomedical Informatics), Stanford
>> University
>> Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group
>> http://dumontierlab.com
>>
>>
>> On Sat, Feb 15, 2014 at 10:31 PM, Tim Berners-Lee <timbl@w3.org> wrote:
>>> On 2014-02-14, at 09:46, Michel Dumontier wrote:
>>>
>>> Andreas,
>>>
>>>   I'd like to help by getting bio2rdf data into the crawl, really. but we
>>> gzip all of our files, and they are in n-quads format.
>>>
>>> http://download.bio2rdf.org/release/3/
>>>
>>> Think you can add gzip/bzip2 support?
>>>
>>> m.
>>>
>>> Michel Dumontier
>>> Associate Professor of Medicine (Biomedical Informatics), Stanford
>>> University
>>> Chair, W3C Semantic Web for Health Care and the Life Sciences Interest
>>> Group
>>> http://dumontierlab.com
>>>
>>>
>>> And on 2014-02-15, at 18:00, Hugh Glaser wrote:
>>>
>>> Hi Andreas and Tobias.
>>> Good luck!
>>> Actually, I think essentially ignoring dumps and doing a "real" crawl is
>>> a feature rather than a bug.
>>>
>>>
>>>
>>> Michel,
>>>
>>> Agree with Hugh. I would encourage you to unzip the data files on your
>>> own servers so the URIs will work and your data is really Linked Data.
>>> There are lots of advantages to the community to be compatible.
>>>
>>> Tim
>>>
>>>


-- 
Stephen D. Williams sdw@lig.net stephendwilliams@gmail.com LinkedIn: http://sdw.st/in
V:650-450-UNIX (8649) V:866.SDW.UNIX V:703.371.9362 F:703.995.0407
AIM:sdw Skype:StephenDWilliams Yahoo:sdwlignet Resume: http://sdw.st/gres
Personal: http://sdw.st facebook.com/sdwlig twitter.com/scienteer
