Re: Efficient RDF Interchange, Re: Zippy

Hello all,

Note: HDT developer here, my opinions might be biased :)

We haven’t used the SPARQL XML results format, but we can generate an HDT from the results of a SPARQL CONSTRUCT query, since the result is a graph. The size savings are similar to the N-Triples/HDT comparisons available at [1], with the added benefit that the result is also searchable. We haven’t worked on the results of SELECT queries yet (they are tables, not graphs).

HDT is designed to be modular. Indeed, you can send only the dictionary to another peer if you want to [2], and then transfer statements by encoding only their ids. If you have several heterogeneous sources, then you are right: you need a more advanced global id model.
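To make the dictionary/ids split concrete, here is a toy sketch of the principle in plain Java (class and method names are illustrative, this is not hdt-java's actual Dictionary API): each term is interned once, and triples travel as compact id tuples instead of repeated strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TripleDictionary {
    // Term -> id mapping; a term is assigned an id the first time it is seen.
    private final Map<String, Integer> ids = new HashMap<>();
    // id -> term mapping; this is the part you would ship to the other peer.
    private final List<String> terms = new ArrayList<>();

    public int encode(String term) {
        Integer id = ids.get(term);
        if (id == null) {
            id = terms.size();
            terms.add(term);
            ids.put(term, id);
        }
        return id;
    }

    public String decode(int id) {
        return terms.get(id);
    }

    // A triple becomes three small integers instead of three strings.
    public int[] encodeTriple(String s, String p, String o) {
        return new int[] { encode(s), encode(p), encode(o) };
    }
}
```

Once both peers hold the same dictionary, any term shared across triples costs only one integer per occurrence on the wire.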

I really agree with Michel Dumontier that RDF should be served compressed in most cases, especially dumps. The differences in disk IO, bandwidth and waiting time are huge [1], so it’s a win/win for everyone. Most browsers accept the HTTP header “Content-Encoding: gzip” and decompress the reply on the fly, and crawlers should do so too (in Java it is as easy as wrapping the stream with new GZIPInputStream(stream); it is similar in many other languages).
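For completeness, here is a minimal, JDK-only sketch of that on-the-fly decompression (class and method names are my own, the only real API used is java.util.zip):

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipFetch {
    // Decompress a gzip-encoded stream on the fly, as a crawler would after
    // seeing "Content-Encoding: gzip" on an HTTP response.
    public static String readGzipped(InputStream raw) {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(raw), "UTF-8"))) {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Compress a string, standing in for what a server would send.
    public static byte[] gzip(String s) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(s.getBytes("UTF-8"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }
}
```

In a real crawler you would pass the response body stream straight into readGzipped; no temporary files or full-buffer decompression are needed.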

We would love to participate in any standardization effort for more efficient RDF. Our experience with HDT shows that the difference can be huge compared to textual, triple-oriented alternatives in many common scenarios. Consumers must not only be able to interpret the data; they need to do it fast if we want to build interesting applications.

Best,

Mario Arias.
@MarioAriasGa
PhD Researcher.
Insight Centre for Data Analytics.
National University of Ireland, Galway.
(Formerly DERI)

[1] http://www.rdfhdt.org/technical-specification/#numbers
[2] https://code.google.com/p/hdt-java/source/browse/hdt-java-core/src/main/java/org/rdfhdt/hdt/dictionary/DictionaryPrivate.java?name=maven

On 21/02/2014, at 14:07, Eric Prud'hommeaux <eric@w3.org> wrote:

> * Axel Polleres <axel@polleres.net> [2014-02-21 13:58+0100]
>> Fwiw, HDT was also a W3C member submission, see http://www.w3.org/Submission/2011/03/
> 
> Did you guys try using HDT on SPARQL XML Results Format?
> Any idea whether you could share the dictionary between processes
> or machines? It could be a cool optimization for tightly-coupled
> systems, but I guess that means you need expensive mutexes on write
> or some predictable sharding system.  Thoughts?
> 
> 
>> Best,
>> Axel
>> 
>> 
>> 
>> (sent from my mobile)
>> --
>> Prof. Axel Polleres, WU
>> url: http://www.polleres.net/  twitter: @AxelPolleres
>> 
>>> On Feb 21, 2014, at 11:19, "Stephen D. Williams" <sdw@lig.net> wrote:
>>> 
>>> Thanks!  That is a very helpful pointer.  I've been concentrating on other areas too long...
>>> 
>>> On an initial glance, I don't see any active standardization work, which is good since it doesn't seem to have all the features I would want...
>>> In particular, N-quads support (and I have particular interests in optimizing fine-grained named graph metadata handling), some possibly better encoding methods, and an in-place modifiable version.  Plus explicit support for deltas / chunks / baseline.
>>> 
>>> Some very interesting choices (bitmap graph representation) and a lot of related papers to digest.  I'm glad people have recognized the need and have spent good effort solving the problem.  I'll see what I can add as I get into it soon.
>>> 
>>> Stephen
>>> 
>>>> On 2/21/14, 1:26 AM, Miel Vander Sande wrote:
>>>> Hi Stephen,
>>>> 
>>>> I think DERI has created exactly what you're looking for. It's called http://www.rdfhdt.org/ and we've recently started using it. It's not only compact, but it also allows incredibly fast lookup.
>>>> 
>>>> Kind regards,
>>>> 
>>>> Miel Vander Sande
>>>> Researcher Semantic Web - Linked Open Data
>>>> Multimedia Lab [Ghent University - iMinds]
>>>> 
>>>>> On Feb 18, 2014, at 7:20 PM, Stephen Williams <sdw@lig.net> wrote:
>>>>> 
>>>>> I worked on W3C Efficient XML Interchange (EXI) from before the formation of the working group almost all the way through standardization, until my work situation changed.  A number of my ideas are in there, although a number that I felt strongly about are not (deltas, standardization of interchange of compiled schema baseline, byte alignment of byte data through a fast, efficient novel peephole algorithm that adds almost no padding).  At the end, I became much more interested in the RDF interchange problem, but have worked on other things since.
>>>>> 
>>>>> At that time and somewhat since I developed an architecture and design for efficient RDF / N-tuples.  There are many tradeoffs, but we spent years working together on EXI examining very similar issues but for a significantly different problem space.  RDF and other graph data has a wider range of possible uses, characteristics of data, and possibilities for specific and general optimization.  I'm planning to finish that design and implementation soon for my own work that leverages the semantic web technologies, including RDF or RDF-like data.  I'm more focused on the user interface paradigm, app, ecosystem design than interchange, but that is a big part of the problem.
>>>>> 
>>>>> One of the main points of EXI, and of ERI, is compactness and fast usable representation, avoiding and/or minimizing parsing, without necessarily requiring decompression.  Decompression is optionally layered when it makes sense because of repetition or data names/values, but the structure can be compact without compression.  In the case of triples (and quads, etc.), it is very easy to fully separate structure at two levels from values which are naturally reused, dictionary "compressed", and then optionally compressed.
>>>>> 
>>>>> I'm very interested in quads or n-tuples (probably N-Quads) where I don't have to represent provenance, document/database/group membership, and other metadata strictly as triples (although they can always be recast as triples).  I'm also interested in ways of chunking/delta graph data for efficiency of transport, memory, computation, etc.
>>>>> 
>>>>> Has anyone been working on compact, efficient binary representation of RDF/N-Quads or similar?  Chunking / deltas?
>>>>> Does anyone want to work on these problems?  I'm deep into some projects, but I might be interested in some arrangement to push this forward, consulting or co-founding or something otherwise mutually beneficial.  As was my early binary / efficient XML work, this is all independent research for me.
>>>>> 
>>>>> My main interest is in solving the user interface / visualization / mental model problem for A) a much better experience when working with all kinds of large/complex knowledge and B) interfacing to / representing / creating organized semantic / linked data.  I'm working on a Knowledge Browser and related paradigms to complement the web browser and search paradigms.  My goal is to improve knowledge organization and access for everyone, from neophytes to advanced knowledge-based workers.
>>>>> 
>>>>> Thanks,
>>>>> Stephen
>>>>> 
>>>>>> On 2/18/14 12:53 AM, Ross Horne wrote:
>>>>>> Hi Michel,
>>>>>> 
>>>>>> I think your point is worth considering. Google makes heavy use of
>>>>>> Zippy (open-sourced as Snappy), rather than gzip, simply to reduce the latency of reading and
>>>>>> sending large amounts of data, see [1]. (Of course, storage is not a
>>>>>> limitation.)
>>>>>> 
>>>>>> Could Zippy have a role in Linked Data protocols?
>>>>>> 
>>>>>> Regards,
>>>>>> 
>>>>>> Ross
>>>>>> 
>>>>>> [1] Dean, Jeff. "Designs, lessons and advice from building large
>>>>>> distributed systems." Keynote from LADIS (2009).
>>>>>> http://www.lamsade.dauphine.fr/~litwin/cours98/CoursBD/doc/dean-keynote-ladis2009_scalable_distributed_google_system.pdf
>>>>>> 
>>>>>> 
>>>>>> On 18 February 2014 10:42, Michel Dumontier <michel.dumontier@gmail.com> wrote:
>>>>>>> Hi Tim,
>>>>>>>  That folder contains 350GB of compressed RDF. I'm not about to unzip it
>>>>>>> because a crawler can't decompress it on the fly.  Honestly, it worries me
>>>>>>> that people aren't considering the practicalities of storing, indexing, and
>>>>>>> presenting all this data.
>>>>>>>  Nevertheless, Bio2RDF does provide void definitions, URI resolution, and
>>>>>>> access to SPARQL endpoints.  I can only hope our data gets discovered.
>>>>>>> 
>>>>>>> m.
>>>>>>> 
>>>>>>> Michel Dumontier
>>>>>>> Associate Professor of Medicine (Biomedical Informatics), Stanford
>>>>>>> University
>>>>>>> Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group
>>>>>>> http://dumontierlab.com
>>>>>>> 
>>>>>>> 
>>>>>>> On Sat, Feb 15, 2014 at 10:31 PM, Tim Berners-Lee <timbl@w3.org> wrote:
>>>>>>>> On 2014-02-14, at 09:46, Michel Dumontier wrote:
>>>>>>>> 
>>>>>>>> Andreas,
>>>>>>>> 
>>>>>>>> I'd like to help by getting bio2rdf data into the crawl, really. but we
>>>>>>>> gzip all of our files, and they are in n-quads format.
>>>>>>>> 
>>>>>>>> http://download.bio2rdf.org/release/3/
>>>>>>>> 
>>>>>>>> think you can add gzip/bzip2 support ?
>>>>>>>> 
>>>>>>>> m.
>>>>>>>> 
>>>>>>>> Michel Dumontier
>>>>>>>> Associate Professor of Medicine (Biomedical Informatics), Stanford
>>>>>>>> University
>>>>>>>> Chair, W3C Semantic Web for Health Care and the Life Sciences Interest
>>>>>>>> Group
>>>>>>>> http://dumontierlab.com
>>>>>>>> 
>>>>>>>> 
>>>>>>>> And on 2014-02-15, at 18:00, Hugh Glaser wrote:
>>>>>>>> 
>>>>>>>> Hi Andreas and Tobias.
>>>>>>>> Good luck!
>>>>>>>> Actually, I think essentially ignoring dumps and doing a "real" crawl is
>>>>>>>> a feature, rather than a bug.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Michel,
>>>>>>>> 
>>>>>>>> Agree with Hugh. I would encourage you to unzip the data files on your own
>>>>>>>> servers
>>>>>>>> so the URIs will work and your data is really Linked Data.
>>>>>>>> There are lots of advantages to the community in being compatible.
>>>>>>>> 
>>>>>>>> Tim
>>>>>>>> 
>>>>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Stephen D. Williams sdw@lig.net stephendwilliams@gmail.com LinkedIn: http://sdw.st/in
>>>>> V:650-450-UNIX (8649) V:866.SDW.UNIX V:703.371.9362 F:703.995.0407
>>>>> AIM:sdw Skype:StephenDWilliams Yahoo:sdwlignet Resume: http://sdw.st/gres
>>>>> Personal: http://sdw.st facebook.com/sdwlig twitter.com/scienteer
>>> 
>>> 
>>> -- 
>>> Stephen D. Williams sdw@lig.net stephendwilliams@gmail.com LinkedIn: http://sdw.st/in
>>> V:650-450-UNIX (8649) V:866.SDW.UNIX V:703.371.9362 F:703.995.0407
>>> AIM:sdw Skype:StephenDWilliams Yahoo:sdwlignet Resume: http://sdw.st/gres
>>> Personal: http://sdw.st facebook.com/sdwlig twitter.com/scienteer
> 
> -- 
> -ericP
> 
> office: +1.617.599.3509
> mobile: +33.6.80.80.35.59
> 
> (eric@w3.org)
> Feel free to forward this message to any list for any purpose other than
> email address distribution.
> 
> There are subtle nuances encoded in font variation and clever layout
> which can only be seen by printing this message on high-clay paper.

Received on Saturday, 22 February 2014 22:01:44 UTC