Re: Efficient RDF Interchange, Re: Zippy from Michel Dumontier on 2014-02-26 (semantic-web@w3.org from February 2014)

From: Michel Dumontier <michel.dumontier@gmail.com>
Date: Wed, 26 Feb 2014 11:26:22 +0900
To: Mario Arias <mario.arias@deri.org>
Cc: "Eric Prud'hommeaux" <eric@w3.org>, Axel Polleres <axel@polleres.net>, "Stephen D. Williams" <sdw@lig.net>, Miel Vander Sande <miel.vandersande@ugent.be>, SWIG Web <semantic-web@w3.org>, Joachim Baran <joachim.baran@gmail.com>, "biohackathon@googlegroups.com" <biohackathon@googlegroups.com>
Message-ID: <CALcEXf688UKMib=vp0GFwNFA0UmLBvv2jnQgb1VQLCzgBZZHgA@mail.gmail.com>
We played with HDT during the 2013 Biohackathon [1] in Japan. We developed
and used an OWL ontology (FALDO [2]) for representing genomic positions and
stored this in HDT, from which a Ruby service [3] made content accessible
to the JBrowse genomic viewer [4]. A brief presentation can be found at
[5]. I think the general consensus was that having a compressed, indexed,
and queryable file was a significant advance for web/standalone application
development.

m.

[1] http://2013.biohackathon.org/
[2] https://github.com/JervenBolleman/FALDO
[3] https://github.com/joejimbo/GenomicHDT
[4] http://jbrowse.org/
[5]
https://github.com/joejimbo/GenomicHDT/raw/bde59a0a2211d042b756616e1a42b1e57ac26196/docs/BioHackathon2013.pdf

Michel Dumontier
Associate Professor of Medicine (Biomedical Informatics), Stanford
University
Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group
http://dumontierlab.com


On Sat, Feb 22, 2014 at 3:38 AM, Mario Arias <mario.arias@deri.org> wrote:

> Hello all,
>
> Note: HDT developer here, my opinions might be biased :)
>
> We haven't used SPARQL XML, but we are able to generate an HDT out of the
> results of a SPARQL CONSTRUCT query, since it is a graph. The size savings
> are similar to the N-Triples/HDT comparisons available at [1], with the plus
> that the result is also searchable. We haven't worked on results of SELECT
> queries yet (they are tables, not graphs).
>
> HDT is designed to be modular. Indeed, you can send only the dictionary to
> another peer if you want to [2], and then transfer statements by encoding
> only their ids. If you have several heterogeneous sources, then you are
> right, you need a more advanced global id model.
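
The dictionary-then-ids idea can be illustrated with a minimal Python sketch. This is not the hdt-java API, and a real HDT dictionary is organized and stored far more compactly; the function names and the single shared term-to-id mapping are assumptions made only for illustration.

```python
def build_dictionary(triples):
    """Assign an integer id (starting at 1) to every distinct term."""
    # Sorting makes the assignment deterministic, so two peers that
    # build the dictionary from the same terms get the same ids.
    terms = sorted({term for triple in triples for term in triple})
    return {term: i for i, term in enumerate(terms, start=1)}


def encode(triples, term_to_id):
    """Replace each term with its id; this is what would be transferred."""
    return [tuple(term_to_id[term] for term in triple) for triple in triples]


def decode(id_triples, term_to_id):
    """Recover the original terms on the receiving side."""
    id_to_term = {i: term for term, i in term_to_id.items()}
    return [tuple(id_to_term[i] for i in triple) for triple in id_triples]


# Hypothetical example data, not taken from any real dataset.
triples = [
    ("ex:gene1", "ex:locatedOn", "ex:chr1"),
    ("ex:gene2", "ex:locatedOn", "ex:chr1"),
]

dictionary = build_dictionary(triples)   # sent to the peer once
encoded = encode(triples, dictionary)    # sent afterwards, ids only
restored = decode(encoded, dictionary)   # done by the receiving peer
```

Once the peer holds the dictionary, each statement costs three small integers rather than three full terms. The sketch sidesteps the harder case raised above: with several heterogeneous sources, ids assigned independently would clash without some global id model.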
>
> I really agree with Michel Dumontier that RDF should be served compressed
> in most cases, especially dumps. The differences in disk IO, bandwidth
> and waiting time are huge [1], so it's a win/win for everyone. Most
> browsers accept the HTTP header "Content-Encoding: gzip" and decompress the
> reply on the fly, and crawlers should do it too (in Java it is as easy as
> wrapping it with new GZIPInputStream(stream), and similar in many other
> languages).
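
As a rough Python counterpart to the Java one-liner mentioned above, the sketch below decompresses a gzipped line-based RDF stream (N-Triples or N-Quads) on the fly and yields one statement at a time. The function name and the sample statements are hypothetical; only the standard-library `gzip` and `io` modules are used.

```python
import gzip
import io


def iter_statements(compressed_stream):
    """Yield non-empty, non-comment lines from a gzipped binary stream."""
    # gzip.open accepts an already-open binary file object and, in text
    # mode, decodes line by line without unpacking the whole dump first.
    with gzip.open(compressed_stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line and not line.startswith("#"):
                yield line


# Hypothetical N-Quads sample, gzipped in memory to stand in for a dump.
sample = (
    b'<http://example.org/s1> <http://example.org/p> "a" <http://example.org/g> .\n'
    b"# a comment line\n"
    b'<http://example.org/s2> <http://example.org/p> "b" <http://example.org/g> .\n'
)
compressed = gzip.compress(sample)

statements = list(iter_statements(io.BytesIO(compressed)))
```

The same function should work with any binary file-like object, for example the response object returned by `urllib.request.urlopen`, so a crawler would not need to store the uncompressed dump.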
>
> We would love to participate in any standardization effort for more
> efficient RDF. Our experience with HDT shows that the difference can be
> huge compared to textual, triple-oriented alternatives for many common
> scenarios. Consumers must not only be able to interpret the data, they need
> to do it really fast if we want to create interesting applications.
>
> Best,
>
> Mario Arias.
> @MarioAriasGa
> PhD Researcher.
> Insight Centre for Data Analytics.
> National University of Ireland, Galway.
> (Formerly DERI)
>
> [1] http://www.rdfhdt.org/technical-specification/#numbers
> [2]
> https://code.google.com/p/hdt-java/source/browse/hdt-java-core/src/main/java/org/rdfhdt/hdt/dictionary/DictionaryPrivate.java?name=maven
>
> On 21/02/2014, at 14:07, Eric Prud'hommeaux <eric@w3.org> wrote:
>
> * Axel Polleres <axel@polleres.net> [2014-02-21 13:58+0100]
>
> FWIW, HDT was also a W3C Member Submission, see
> http://www.w3.org/Submission/2011/03/
>
>
> Did you guys try using HDT on SPARQL XML Results Format?
> Any idea whether you could share the dictionary between processes
> or machines? It could be a cool optimization for tightly-coupled
> systems, but I guess that means you need expensive mutexes on write
> or some predictable sharding system.  Thoughts?
>
>
> Best,
> Axel
>
>
>
> (sent from my mobile)
> --
> Prof. Axel Polleres, WU
> url: http://www.polleres.net/  twitter: @AxelPolleres
>
> On Feb 21, 2014, at 11:19, "Stephen D. Williams" <sdw@lig.net> wrote:
>
> Thanks!  That is a very helpful pointer.  I've been concentrating on other
> areas too long...
>
> On an initial glance, I don't see any active standardization work, which
> is good since it doesn't seem to have all the features I would want...
> In particular, N-quads support (and I have particular interests in
> optimizing fine-grained named graph metadata handling), some possibly
> better encoding methods, and an in-place modifiable version.  Plus explicit
> support for deltas / chunks / baseline.
>
> Some very interesting choices (bitmap graph representation) and a lot of
> related papers to digest.  I'm glad people have recognized the
> need and have spent good effort solving the problem.  I'll see what I can
> add as I get into it soon.
>
> Stephen
>
> On 2/21/14, 1:26 AM, Miel Vander Sande wrote:
> Hi Stephen,
>
> I think DERI has created exactly what you're looking for. It's called
> http://www.rdfhdt.org/ and we've recently started using it. It's not only
> compact, but it also allows incredibly fast lookup.
>
> Kind regards,
>
> Miel Vander Sande
> Researcher Semantic Web - Linked Open Data
> Multimedia Lab [Ghent University - iMinds]
>
> On Feb 18, 2014, at 7:20 PM, Stephen Williams <sdw@lig.net> wrote:
>
> I worked on W3C Efficient XML Interchange (EXI) from before the formation
> of the working group almost all the way through standardization when my
> work situation changed.  A number of my ideas are in there, although a
> number that I felt strongly about are not (deltas, standardization of
> interchange of compiled schema baseline, byte alignment of byte data
> through a fast, efficient novel peephole algorithm that adds almost no
> padding).  At the end, I became much more interested in the RDF interchange
> problem, but have worked on other things since.
>
> At that time and somewhat since I developed an architecture and design for
> efficient RDF / N-tuples.  There are many tradeoffs, but we spent years
> working together on EXI examining very similar issues but for a
> significantly different problem space.  RDF and other graph data has a
> wider range of possible uses, characteristics of data, and possibilities
> for specific and general optimization.  I'm planning to finish that design
> and implementation soon for my own work that leverages the semantic web
> technologies, including RDF or RDF-like data.  I'm more focused on the user
> interface paradigm, app, ecosystem design than interchange, but that is
> a big part of the problem.
>
> One of the main points of EXI, and of ERI, is compactness and fast usable
> representation, avoiding and/or minimizing parsing, without necessarily
> requiring decompression.  Compression is optionally layered when it makes
> sense because of repetition of data names/values, but the structure can be
> compact without compression.  In the case of triples (and quads, etc.), it
> is very easy to fully separate structure at two levels from values which
> are naturally reused, dictionary "compressed", and then optionally
> compressed.
>
> I'm very interested in quads or n-tuples (probably N-Quads) where I don't
> have to represent provenance, document/database/group membership, and other
> metadata strictly as triples (although they can always be recast as
> triples).  I'm also interested in ways of chunking/delta graph data for
> efficiency of transport, memory, computation, etc.
>
> Has anyone been working on compact, efficient binary representation of
> RDF/N-Quads or similar?  Chunking / deltas?
> Does anyone want to work on these problems?  I'm deep into some projects,
> but I might be interested in some arrangement to push this forward,
> consulting or co-founding or something otherwise mutually beneficial.  As
> was my early binary / efficient XML work, this is all independent research
> for me.
>
> My main interest is in solving the user interface / visualization / mental
> model problem for A) a much better experience when working with all kinds
> of large/complex knowledge and B) interfacing to / representing / creating
> organized semantic / linked data.  I'm working on a Knowledge Browser and
> related paradigms to complement the web browser and search paradigms.  My
> goal is to improve knowledge organization and access for everyone, from
> neophytes to advanced knowledge-based workers.
>
> Thanks,
> Stephen
>
> On 2/18/14 12:53 AM, Ross Horne wrote:
> Hi Michel,
>
> I think your point is worth considering. Google make heavy use of
> zippy, rather than gzip, simply to reduce the latency of reading and
> sending large amounts of data, see [1]. (Of course, storage is not a
> limitation.)
>
> Could zippy have a role in Linked Data protocols?
>
> Regards,
>
> Ross
>
> [1] Dean, Jeff. "Designs, lessons and advice from building large
> distributed systems." Keynote from LADIS (2009).
>
> http://www.lamsade.dauphine.fr/~litwin/cours98/CoursBD/doc/dean-keynote-ladis2009_scalable_distributed_google_system.pdf
>
>
>
> On 18 February 2014 10:42, Michel Dumontier <michel.dumontier@gmail.com>
> wrote:
>
> Hi Tim,
>  That folder contains 350GB of compressed RDF. I'm not about to unzip it
> because a crawler can't decompress it on the fly.  Honestly, it worries me
> that people aren't considering the practicalities of storing, indexing, and
> presenting all this data.
>  Nevertheless, Bio2RDF does provide VoID definitions, URI resolution, and
> access to SPARQL endpoints.  I can only hope our data gets discovered.
>
> m.
>
> Michel Dumontier
> Associate Professor of Medicine (Biomedical Informatics), Stanford
> University
> Chair, W3C Semantic Web for Health Care and the Life Sciences Interest
> Group
> http://dumontierlab.com
>
>
> On Sat, Feb 15, 2014 at 10:31 PM, Tim Berners-Lee <timbl@w3.org> wrote:
>
> On 2014-02-14, at 09:46, Michel Dumontier wrote:
>
> Andreas,
>
> I'd like to help by getting Bio2RDF data into the crawl, really, but we
> gzip all of our files, and they are in N-Quads format.
>
> http://download.bio2rdf.org/release/3/
>
> Think you can add gzip/bzip2 support?
>
> m.
>
> Michel Dumontier
> Associate Professor of Medicine (Biomedical Informatics), Stanford
> University
> Chair, W3C Semantic Web for Health Care and the Life Sciences Interest
> Group
> http://dumontierlab.com
>
>
> And on 2014-02-15, at 18:00, Hugh Glaser wrote:
>
> Hi Andreas and Tobias.
> Good luck!
> Actually, I think essentially ignoring dumps and doing a "real" crawl, is
> a feature, rather than a bug.
>
>
>
> Michel,
>
> Agree with Hugh. I would encourage you to unzip the data files on your own
> servers
> so the URIs will work and your data is really Linked Data.
> There are lots of advantages to the community to be compatible.
>
> Tim
>
>
>
>
> --
> Stephen D. Williams sdw@lig.net stephendwilliams@gmail.com LinkedIn:
> http://sdw.st/in
> V:650-450-UNIX (8649) V:866.SDW.UNIX V:703.371.9362 F:703.995.0407
> AIM:sdw Skype:StephenDWilliams Yahoo:sdwlignet Resume: http://sdw.st/gres
> Personal: http://sdw.st facebook.com/sdwlig twitter.com/scienteer
>
>
>
> --
> -ericP
>
> office: +1.617.599.3509
> mobile: +33.6.80.80.35.59
>
> (eric@w3.org)
> Feel free to forward this message to any list for any purpose other than
> email address distribution.
>
> There are subtle nuances encoded in font variation and clever layout
> which can only be seen by printing this message on high-clay paper.
>
>
>
Received on Wednesday, 26 February 2014 02:27:13 UTC