Re: Efficient RDF Interchange, Re: Zippy from Stephen D. Williams on 2014-02-21 (semantic-web@w3.org from February 2014)

From: Stephen D. Williams <sdw@lig.net>
Date: Fri, 21 Feb 2014 02:19:56 -0800
To: Miel Vander Sande <miel.vandersande@ugent.be>
CC: SWIG Web <semantic-web@w3.org>
Message-ID: <5307284C.3050802@lig.net>
Thanks!  That is a very helpful pointer.  I've been concentrating on other areas too long...

At first glance, I don't see any active standardization work, which is good, since it doesn't seem to have all the features I 
would want.  In particular: N-Quads support (I am especially interested in optimizing fine-grained named graph metadata handling), 
some possibly better encoding methods, and an in-place modifiable version, plus explicit support for deltas / chunks / baseline.
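
To make the deltas / baseline idea concrete, here is a rough Python sketch.  It assumes quads are plain (subject, predicate, 
object, graph) tuples of strings and ignores blank node identity.  The names quad_delta and apply_delta are made up for 
illustration; they are not part of HDT or of any standard.

```python
def quad_delta(baseline, current):
    """Return the quads added to and removed from `baseline` to reach `current`."""
    baseline, current = set(baseline), set(current)
    return {"added": current - baseline, "removed": baseline - current}


def apply_delta(baseline, delta):
    """Rebuild the current quad set from a baseline plus a delta."""
    return (set(baseline) - delta["removed"]) | delta["added"]
```

A receiver that already holds the baseline then only needs the (usually much smaller) delta, which is the transport saving I am 
after.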

Some very interesting choices (bitmap graph representation) and a lot of related papers to digest.  I'm glad people have recognized 
the need and have spent good effort solving the problem.  I'll see what I can add as I get into it soon.

Stephen

On 2/21/14, 1:26 AM, Miel Vander Sande wrote:
> Hi Stephen,
>
> I think DERI has created exactly what you're looking for. It's called http://www.rdfhdt.org/ and we've recently started using it. 
> It's not only compact, but it also allows incredibly fast lookup.
>
> Kind regards,
>
> Miel Vander Sande
> Researcher Semantic Web - Linked Open Data
> Multimedia Lab [Ghent University - iMinds]
>
> On Feb 18, 2014, at 7:20 PM, Stephen Williams <sdw@lig.net> wrote:
>
>> I worked on W3C Efficient XML Interchange (EXI) from before the formation of the working group almost all the way through 
>> standardization when my work situation changed.  A number of my ideas are in there, although a number that I felt strongly about 
>> are not (deltas, standardization of interchange of compiled schema baseline, byte alignment of byte data through a fast, 
>> efficient novel peephole algorithm that adds almost no padding).  At the end, I became much more interested in the RDF 
>> interchange problem, but have worked on other things since.
>>
>> At that time and somewhat since I developed an architecture and design for efficient RDF / N-tuples. There are many tradeoffs, 
>> but we spent years working together on EXI examining very similar issues but for a significantly different problem space.  RDF 
>> and other graph data has a wider range of possible uses, characteristics of data, and possibilities for specific and general 
>> optimization.  I'm planning to finish that design and implementation soon for my own work that leverages the semantic web 
>> technologies, including RDF or RDF-like data.  I'm more focused on the user interface paradigm, app, ecosystem design than 
>> interchange, but that is a big part of the problem.
>>
>> One of the main points of EXI, and of ERI, is a compact and fast usable representation that avoids or minimizes parsing, 
>> without necessarily requiring decompression. Compression is optionally layered on top when it makes sense because of repetition 
>> of data names/values, but the structure can be compact without compression.  In the case of triples (and quads, etc.), it is 
>> very easy to fully separate structure at two levels from values, which are naturally reused, dictionary "compressed", and then 
>> optionally compressed.
>>
>> I'm very interested in quads or n-tuples (probably N-Quads) where I don't have to represent provenance, document/database/group 
>> membership, and other metadata strictly as triples (although they can always be recast as triples).  I'm also interested in ways 
>> of chunking/delta graph data for efficiency of transport, memory, computation, etc.
>>
>> Has anyone been working on compact, efficient binary representation of RDF/N-Quads or similar?  Chunking / deltas?
>> Does anyone want to work on these problems?  I'm deep into some projects, but I might be interested in some arrangement to push 
>> this forward, consulting or co-founding or something otherwise mutually beneficial. As was my early binary / efficient XML work, 
>> this is all independent research for me.
>>
>> My main interest is in solving the user interface / visualization / mental model problem for A) a much better experience when 
>> working with all kinds of large/complex knowledge and B) interfacing to / representing / creating organized semantic / linked 
>> data.  I'm working on a Knowledge Browser and related paradigms to complement the web browser and search paradigms.  My goal is to 
>> improve knowledge organization and access for everyone, from neophytes to advanced knowledge-based workers.
>>
>> Thanks,
>> Stephen
>>
>> On 2/18/14 12:53 AM, Ross Horne wrote:
>>> Hi Michel,
>>>
>>> I think your point is worth considering. Google make heavy use of
>>> zippy, rather than gzip, simply to reduce the latency of reading and
>>> sending large amounts of data, see [1]. (Of course, storage is not a
>>> limitation.)
>>>
>>> Could zippy have a role in Linked Data protocols?
>>>
>>> Regards,
>>>
>>> Ross
>>>
>>> [1] Dean, Jeff. "Designs, lessons and advice from building large
>>> distributed systems." Keynote from LADIS (2009).
>>> http://www.lamsade.dauphine.fr/~litwin/cours98/CoursBD/doc/dean-keynote-ladis2009_scalable_distributed_google_system.pdf
>>>
>>>
>>> On 18 February 2014 10:42, Michel Dumontier <michel.dumontier@gmail.com> wrote:
>>>> Hi Tim,
>>>>    That folder contains 350GB of compressed RDF. I'm not about to unzip it
>>>> because a crawler can't decompress it on the fly.  Honestly, it worries me
>>>> that people aren't considering the practicalities of storing, indexing, and
>>>> presenting all this data.
>>>>    Nevertheless, Bio2RDF does provide VoID definitions, URI resolution, and
>>>> access to SPARQL endpoints.  I can only hope our data gets discovered.
>>>>
>>>> m.
>>>>
>>>> Michel Dumontier
>>>> Associate Professor of Medicine (Biomedical Informatics), Stanford
>>>> University
>>>> Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group
>>>> http://dumontierlab.com
>>>>
>>>>
>>>> On Sat, Feb 15, 2014 at 10:31 PM, Tim Berners-Lee <timbl@w3.org> wrote:
>>>>> On 2014-02-14, at 09:46, Michel Dumontier wrote:
>>>>>
>>>>> Andreas,
>>>>>
>>>>>   I'd like to help by getting bio2rdf data into the crawl, really. but we
>>>>> gzip all of our files, and they are in n-quads format.
>>>>>
>>>>> http://download.bio2rdf.org/release/3/
>>>>>
>>>>> think you can add gzip/bzip2 support ?
>>>>>
>>>>> m.
>>>>>
>>>>> Michel Dumontier
>>>>> Associate Professor of Medicine (Biomedical Informatics), Stanford
>>>>> University
>>>>> Chair, W3C Semantic Web for Health Care and the Life Sciences Interest
>>>>> Group
>>>>> http://dumontierlab.com
>>>>>
>>>>>
>>>>> And on 2014-02-15, at 18:00, Hugh Glaser wrote:
>>>>>
>>>>> Hi Andreas and Tobias.
>>>>> Good luck!
>>>>> Actually, I think essentially ignoring dumps and doing a "real" crawl, is
>>>>> a feature, rather than a bug.
>>>>>
>>>>>
>>>>>
>>>>> Michel,
>>>>>
>>>>> Agree with Hugh. I would encourage you to unzip the data files on your own
>>>>> servers
>>>>> so the URIs will work and your data is really Linked Data.
>>>>> There are lots of advantages to the community to be compatible.
>>>>>
>>>>> Tim
>>>>>
>>>>>
>>
>>
>> -- 
>> Stephen D. Williams sdw@lig.net stephendwilliams@gmail.com LinkedIn: http://sdw.st/in
>> V:650-450-UNIX (8649) V:866.SDW.UNIX V:703.371.9362 F:703.995.0407
>> AIM:sdw Skype:StephenDWilliams Yahoo:sdwlignet Resume: http://sdw.st/gres
>> Personal: http://sdw.st facebook.com/sdwlig twitter.com/scienteer
>
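
The separation of structure from dictionary-coded values described in the quoted message above can be sketched roughly as 
follows.  This is only an illustration under simple assumptions: quads are tuples of four strings, terms contain no newline 
characters, and the structure is a single flat array of integer IDs rather than the two-level layout mentioned there.  The 
function names are made up for the example and do not come from HDT, EXI, or any other specification.

```python
import zlib
from array import array


def encode_quads(quads):
    """Dictionary-encode (s, p, o, g) string tuples.

    Returns (terms, ids): `terms` lists each distinct term once, in
    first-seen order; `ids` holds four integer term IDs per quad.
    """
    term_to_id = {}
    terms = []
    ids = array("I")
    for quad in quads:
        for term in quad:
            idx = term_to_id.get(term)
            if idx is None:
                idx = len(terms)
                term_to_id[term] = idx
                terms.append(term)
            ids.append(idx)
    return terms, ids


def decode_quads(terms, ids):
    """Rebuild the list of quads from the dictionary and the ID array."""
    return [tuple(terms[i] for i in ids[n:n + 4]) for n in range(0, len(ids), 4)]


def pack(terms, ids):
    """Optional layer: compress the dictionary and the structure separately."""
    dictionary = zlib.compress("\n".join(terms).encode("utf-8"))
    structure = zlib.compress(ids.tobytes())
    return dictionary, structure
```

The ID array can be scanned or indexed directly without touching the strings, and pack() is only needed when the extra 
compression is worth the cost.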


-- 
Stephen D. Williams sdw@lig.net stephendwilliams@gmail.com LinkedIn: http://sdw.st/in
V:650-450-UNIX (8649) V:866.SDW.UNIX V:703.371.9362 F:703.995.0407
AIM:sdw Skype:StephenDWilliams Yahoo:sdwlignet Resume: http://sdw.st/gres
Personal: http://sdw.st facebook.com/sdwlig twitter.com/scienteer
Received on Friday, 21 February 2014 10:20:25 UTC