- From: Michel Dumontier <michel.dumontier@gmail.com>
- Date: Wed, 26 Feb 2014 11:26:22 +0900
- To: Mario Arias <mario.arias@deri.org>
- Cc: "Eric Prud'hommeaux" <eric@w3.org>, Axel Polleres <axel@polleres.net>, "Stephen D. Williams" <sdw@lig.net>, Miel Vander Sande <miel.vandersande@ugent.be>, SWIG Web <semantic-web@w3.org>, Joachim Baran <joachim.baran@gmail.com>, "biohackathon@googlegroups.com" <biohackathon@googlegroups.com>
- Message-ID: <CALcEXf688UKMib=vp0GFwNFA0UmLBvv2jnQgb1VQLCzgBZZHgA@mail.gmail.com>
We played with HDT during the 2013 Biohackathon [1] in Japan. We developed and used an OWL ontology (FALDO [2]) for representing genomic positions and stored this in HDT, from which a Ruby service [3] made content accessible to the JBrowse genomic viewer [4]. A brief presentation can be found at [5].

I think the general consensus was that having a compressed, indexed, and queryable file was a significant advance for web/standalone application development.

m.

[1] http://2013.biohackathon.org/
[2] https://github.com/JervenBolleman/FALDO
[3] https://github.com/joejimbo/GenomicHDT
[4] http://jbrowse.org/
[5] https://github.com/joejimbo/GenomicHDT/raw/bde59a0a2211d042b756616e1a42b1e57ac26196/docs/BioHackathon2013.pdf

Michel Dumontier
Associate Professor of Medicine (Biomedical Informatics), Stanford University
Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group
http://dumontierlab.com

On Sat, Feb 22, 2014 at 3:38 AM, Mario Arias <mario.arias@deri.org> wrote:
> Hello all,
>
> Note: HDT developer here, my opinions might be biased :)
>
> We haven't used SPARQL XML, but we are able to generate an HDT out of the
> results of a SPARQL CONSTRUCT query, since it is a graph. The size savings
> are similar to the N-Triples/HDT comparisons available at [1], with the plus
> that the result is also searchable. We haven't worked on results of SELECT
> queries yet (they are tables, not graphs).
>
> HDT is designed to be modular. Indeed you can send only the dictionary to
> another peer if you want to [2], then you can transfer statements by only
> encoding their ids. If you have several heterogeneous sources, then you are
> right, you need a more advanced global id model.
>
> I really agree with Michel Dumontier that RDF should be served compressed
> in most cases, especially dumps. The differences in disk IO, bandwidth
> and waiting time are huge [1], so it's a win/win for everyone.
> Most browsers accept the HTTP header "Content-Encoding: gzip" and decompress
> the reply on the fly, and crawlers should do it too (in Java it is as easy as
> wrapping it with new GZIPInputStream(stream), and similar in many other
> languages).
>
> We would love to participate in any standardization effort for more
> efficient RDF. Our experience with HDT shows that the difference can be
> huge compared to textual, triple-oriented alternatives for many common
> scenarios. Consumers must not only be able to interpret the data, they need
> to do it really fast if we want to create interesting applications.
>
> Best,
>
> Mario Arias.
> @MarioAriasGa
> PhD Researcher.
> Insight Centre for Data Analytics.
> National University of Ireland, Galway.
> (Formerly DERI)
>
> [1] http://www.rdfhdt.org/technical-specification/#numbers
> [2] https://code.google.com/p/hdt-java/source/browse/hdt-java-core/src/main/java/org/rdfhdt/hdt/dictionary/DictionaryPrivate.java?name=maven
>
> On 21/02/2014, at 14:07, Eric Prud'hommeaux <eric@w3.org> wrote:
>
> * Axel Polleres <axel@polleres.net> [2014-02-21 13:58+0100]
>
> FWIW, HDT was also a W3C member submission, see
> http://www.w3.org/Submission/2011/03/
>
> Did you guys try using HDT on SPARQL XML Results Format?
> Any idea whether you could share the dictionary between processes
> or machines? It could be a cool optimization for tightly-coupled
> systems, but I guess that means you need expensive mutexes on write
> or some predictable sharding system. Thoughts?
>
> Best,
> Axel
>
> (sent from my mobile)
> --
> Prof. Axel Polleres, WU
> url: http://www.polleres.net/ twitter: @AxelPolleres
>
> On Feb 21, 2014, at 11:19, "Stephen D. Williams" <sdw@lig.net> wrote:
>
> Thanks! That is a very helpful pointer. I've been concentrating on other
> areas too long...
>
> On an initial glance, I don't see any active standardization work, which
> is good since it doesn't seem to have all the features I would want...
> In particular, N-quads support (and I have particular interests in
> optimizing fine-grained named graph metadata handling), some possibly
> better encoding methods, and an in-place modifiable version. Plus explicit
> support for deltas / chunks / baseline.
>
> Some very interesting choices (bitmap graph representation) and a
> lot of related papers to digest. I'm glad people have recognized the
> need and have spent good effort solving the problem. I'll see what I can
> add as I get into it soon.
>
> Stephen
>
> On 2/21/14, 1:26 AM, Miel Vander Sande wrote:
> Hi Stephen,
>
> I think DERI has created exactly what you're looking for. It's called
> http://www.rdfhdt.org/ and we've recently started using it. It's not only
> compact, but it also allows incredibly fast lookup.
>
> Kind regards,
>
> Miel Vander Sande
> Researcher Semantic Web - Linked Open Data
> Multimedia Lab [Ghent University - iMinds]
>
> On Feb 18, 2014, at 7:20 PM, Stephen Williams <sdw@lig.net>
> wrote:
>
> I worked on W3C Efficient XML Interchange (EXI) from before the formation
> of the working group almost all the way through standardization when my
> work situation changed. A number of my ideas are in there, although a
> number that I felt strongly about are not (deltas, standardization of
> interchange of compiled schema baseline, byte alignment of byte data
> through a fast, efficient novel peephole algorithm that adds almost no
> padding). At the end, I became much more interested in the RDF interchange
> problem, but have worked on other things since.
>
> At that time and somewhat since I developed an architecture and design for
> efficient RDF / N-tuples. There are many tradeoffs, but we spent years
> working together on EXI examining very similar issues but for a
> significantly different problem space.
> RDF and other graph data has a wider range of possible uses,
> characteristics of data, and possibilities for specific and general
> optimization. I'm planning to finish that design and implementation soon
> for my own work that leverages the semantic web technologies, including
> RDF or RDF-like data. I'm more focused on the user interface paradigm,
> app, ecosystem design than interchange, but that is a big part of the
> problem.
>
> One of the main points of EXI, and of ERI, is compactness and fast usable
> representation, avoiding and/or minimizing parsing, without necessarily
> requiring decompression. Decompression is optionally layered when it makes
> sense because of repetition or data names/values, but the structure can be
> compact without compression. In the case of triples (and quads, etc.), it
> is very easy to fully separate structure at two levels from values which
> are naturally reused, dictionary "compressed", and then optionally
> compressed.
>
> I'm very interested in quads or n-tuples (probably N-Quads) where I don't
> have to represent provenance, document/database/group membership, and other
> metadata strictly as triples (although they can always be recast as
> triples). I'm also interested in ways of chunking/delta graph data for
> efficiency of transport, memory, computation, etc.
>
> Has anyone been working on compact, efficient binary representation of
> RDF/N-Quads or similar? Chunking / deltas?
> Does anyone want to work on these problems? I'm deep into some projects,
> but I might be interested in some arrangement to push this forward,
> consulting or co-founding or something otherwise mutually beneficial. As
> was my early binary / efficient XML work, this is all independent research
> for me.
>
> My main interest is in solving the user interface / visualization / mental
> model problem for A) a much better experience when working with all kinds
> of large/complex knowledge and B) interfacing to / representing / creating
> organized semantic / linked data. I'm working on a Knowledge Browser and
> related paradigms to complement the web browser and search paradigms. My
> goal is to improve knowledge organization and access for everyone, from
> neophytes to advanced knowledge-based workers.
>
> Thanks,
> Stephen
>
> On 2/18/14 12:53 AM, Ross Horne wrote:
> Hi Michel,
>
> I think your point is worth considering. Google make heavy use of
> zippy, rather than gzip, simply to reduce the latency of reading and
> sending large amounts of data, see [1]. (Of course, storage is not a
> limitation.)
>
> Could zippy have a role in Linked Data protocols?
>
> Regards,
>
> Ross
>
> [1] Dean, Jeff. "Designs, lessons and advice from building large
> distributed systems." Keynote from LADIS (2009).
> http://www.lamsade.dauphine.fr/~litwin/cours98/CoursBD/doc/dean-keynote-ladis2009_scalable_distributed_google_system.pdf
>
> On 18 February 2014 10:42, Michel Dumontier <michel.dumontier@gmail.com>
> wrote:
>
> Hi Tim,
> That folder contains 350GB of compressed RDF. I'm not about to unzip it
> because a crawler can't decompress it on the fly. Honestly, it worries me
> that people aren't considering the practicalities of storing, indexing, and
> presenting all this data.
> Nevertheless, Bio2RDF does provide VoID definitions, URI resolution, and
> access to SPARQL endpoints. I can only hope our data gets discovered.
>
> m.
>
> Michel Dumontier
> Associate Professor of Medicine (Biomedical Informatics), Stanford
> University
> Chair, W3C Semantic Web for Health Care and the Life Sciences Interest
> Group
> http://dumontierlab.com
>
> On Sat, Feb 15, 2014 at 10:31 PM, Tim Berners-Lee <timbl@w3.org> wrote:
>
> On 2014-02-14, at 09:46, Michel Dumontier wrote:
>
> Andreas,
>
> I'd like to help by getting bio2rdf data into the crawl, really, but we
> gzip all of our files, and they are in n-quads format.
>
> http://download.bio2rdf.org/release/3/
>
> think you can add gzip/bzip2 support ?
>
> m.
>
> Michel Dumontier
> Associate Professor of Medicine (Biomedical Informatics), Stanford
> University
> Chair, W3C Semantic Web for Health Care and the Life Sciences Interest
> Group
> http://dumontierlab.com
>
> And on 2014-02-15, at 18:00, Hugh Glaser wrote:
>
> Hi Andreas and Tobias.
> Good luck!
> Actually, I think essentially ignoring dumps and doing a "real" crawl, is
> a feature, rather than a bug.
>
> Michel,
>
> Agree with Hugh. I would encourage you to unzip the data files on your own
> servers so the URIs will work and your data is really Linked Data.
> There are lots of advantages to the community to be compatible.
>
> Tim
>
> --
> Stephen D. Williams sdw@lig.net stephendwilliams@gmail.com LinkedIn:
> http://sdw.st/in
> V:650-450-UNIX (8649) V:866.SDW.UNIX V:703.371.9362 F:703.995.0407
> AIM:sdw Skype:StephenDWilliams Yahoo:sdwlignet Resume: http://sdw.st/gres
> Personal: http://sdw.st facebook.com/sdwlig twitter.com/scienteer
>
> --
> -ericP
>
> office: +1.617.599.3509
> mobile: +33.6.80.80.35.59
>
> (eric@w3.org)
> Feel free to forward this message to any list for any purpose other than
> email address distribution.
>
> There are subtle nuances encoded in font variation and clever layout
> which can only be seen by printing this message on high-clay paper.
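A note on the dictionary idea in Mario's quoted message (send the dictionary once, then transfer statements by encoding only their ids): the Java sketch below illustrates the general principle. It is a simplified illustration under my own assumptions, not HDT's actual layout and not the hdt-java API; the class name TripleDictionary and all of its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this class is not part of hdt-java.
public class TripleDictionary {
    private final Map<String, Integer> idsByTerm = new HashMap<>();
    private final List<String> termsById = new ArrayList<>();

    // Returns the id of a term, assigning the next free id (starting at 1) on first use.
    public int idOf(String term) {
        Integer existing = idsByTerm.get(term);
        if (existing != null) {
            return existing;
        }
        termsById.add(term);
        int id = termsById.size();
        idsByTerm.put(term, id);
        return id;
    }

    // Looks a term up by id; this is all a receiving peer needs once it holds the dictionary.
    public String termOf(int id) {
        return termsById.get(id - 1);
    }

    // Encodes one statement as three integer ids.
    public int[] encode(String subject, String predicate, String object) {
        return new int[] { idOf(subject), idOf(predicate), idOf(object) };
    }

    // Decodes three ids back into the original terms.
    public String[] decode(int[] ids) {
        return new String[] { termOf(ids[0]), termOf(ids[1]), termOf(ids[2]) };
    }

    // Number of distinct terms seen so far.
    public int size() {
        return termsById.size();
    }

    public static void main(String[] args) {
        TripleDictionary dict = new TripleDictionary();
        int[] first = dict.encode("ex:gene1", "ex:locatedOn", "ex:chr1");
        int[] second = dict.encode("ex:gene2", "ex:locatedOn", "ex:chr1");
        System.out.println(Arrays.toString(first));   // [1, 2, 3]
        System.out.println(Arrays.toString(second));  // [4, 2, 3]
        System.out.println(String.join(" ", dict.decode(second)));
    }
}
```

Repeated terms cost one integer each after their first occurrence, which is where the size savings come from. Sharing ids across several heterogeneous sources would need the more advanced global id model Mario mentions, and this sketch does not attempt it.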
Received on Wednesday, 26 February 2014 02:27:13 UTC
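On the point raised in the thread that a crawler can decompress gzipped dumps on the fly by wrapping the stream in new GZIPInputStream(stream): a minimal Java sketch follows. It assumes a line-oriented serialization such as N-Triples or N-Quads (one statement per line); the class name GzipRdfReader and its methods are hypothetical, and the counting step merely stands in for whatever a real crawler would do with each statement.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only.
public class GzipRdfReader {

    // Streams a gzip-compressed N-Triples/N-Quads document and counts its
    // statements, skipping blank lines and '#' comment lines. The data is
    // decompressed as it is read, so the dump is never unpacked to disk.
    public static long countStatements(InputStream gzipped) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            long count = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty() && !trimmed.startsWith("#")) {
                    count++;
                }
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: gzip a string in memory so the example needs no network or files.
    public static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        String nquads =
            "<http://example.org/s> <http://example.org/p> \"a\" <http://example.org/g> .\n"
            + "# a comment line\n"
            + "<http://example.org/s> <http://example.org/p> \"b\" <http://example.org/g> .\n";
        byte[] compressed = gzip(nquads);
        System.out.println(countStatements(new ByteArrayInputStream(compressed)));
    }
}
```

In a crawler the InputStream would come from the HTTP response body instead of an in-memory buffer; the wrapping step is the same.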