- From: Stephen Williams <sdw@lig.net>
- Date: Tue, 18 Feb 2014 10:20:42 -0800
- To: ross.horne@gmail.com, Michel Dumontier <michel.dumontier@gmail.com>
- CC: Tim Berners-Lee <timbl@w3.org>, Andreas Harth <andreas@harth.org>, SWIG Web <semantic-web@w3.org>
- Message-ID: <5303A47A.6010701@lig.net>
I worked on W3C Efficient XML Interchange (EXI) from before the formation of the working group almost all the way through standardization, until my work situation changed. A number of my ideas are in there, although several that I felt strongly about are not: deltas, standardized interchange of a compiled-schema baseline, and byte alignment of binary data via a fast, efficient, novel peephole algorithm that adds almost no padding.

At the end I became much more interested in the RDF interchange problem, but have worked on other things since. At that time, and somewhat since, I developed an architecture and design for efficient RDF / N-tuples. There are many tradeoffs, but we spent years working together on EXI examining very similar issues for a significantly different problem space. RDF and other graph data has a wider range of possible uses, data characteristics, and opportunities for both specific and general optimization. I'm planning to finish that design and implementation soon for my own work, which leverages semantic web technologies, including RDF or RDF-like data. I'm more focused on the user interface paradigm, app, and ecosystem design than on interchange, but interchange is a big part of the problem.

One of the main points of EXI, and of any ERI (Efficient RDF Interchange) analog, is compactness and a fast, usable representation: avoiding or minimizing parsing, without necessarily requiring decompression. Compression is optionally layered on when repetition of data names/values makes it worthwhile, but the structure can be compact without compression. In the case of triples (and quads, etc.), it is very easy to fully separate structure, at two levels, from values, which are naturally reused, dictionary "compressed", and then optionally compressed further (see the sketch after my sign-off below).

I'm very interested in quads or n-tuples (probably N-Quads) so that I don't have to represent provenance, document/database/group membership, and other metadata strictly as triples (although they can always be recast as triples). I'm also interested in ways of chunking and delta-encoding graph data for efficiency of transport, memory, computation, etc.

Has anyone been working on a compact, efficient binary representation of RDF/N-Quads or similar? Chunking / deltas? Does anyone want to work on these problems? I'm deep into some projects, but I might be interested in some arrangement to push this forward: consulting, co-founding, or something otherwise mutually beneficial. Like my early binary / efficient XML work, this is all independent research for me.

My main interest is in solving the user interface / visualization / mental model problem for A) a much better experience when working with all kinds of large/complex knowledge and B) interfacing to, representing, and creating organized semantic / linked data. I'm working on a Knowledge Browser and related paradigms to complement the web browser and search paradigms. My goal is to improve knowledge organization and access for everyone, from neophytes to advanced knowledge workers.

Thanks,
Stephen
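P.S. A minimal sketch of the structure/value separation idea, in Python. This is only an illustration, not EXI or any proposed format; QuadStore and every name in it are hypothetical. Each distinct term string is stored once in a dictionary, so the graph structure itself reduces to small integer 4-tuples.

    # Hypothetical sketch: dictionary-encode N-Quads so structure and
    # values are fully separated.  Term strings are interned once; the
    # structure is just (s, p, o, g) tuples of small integers.

    class QuadStore:
        def __init__(self):
            self.terms = []   # id -> term string (the value dictionary)
            self.ids = {}     # term string -> id
            self.quads = []   # structure: (s, p, o, g) integer tuples

        def _intern(self, term):
            # Return the id for a term, assigning one on first sight.
            if term not in self.ids:
                self.ids[term] = len(self.terms)
                self.terms.append(term)
            return self.ids[term]

        def add(self, s, p, o, g):
            self.quads.append(tuple(self._intern(t) for t in (s, p, o, g)))

    store = QuadStore()
    g = "<http://example.org/graph1>"
    store.add("<http://example.org/a>", "<http://example.org/knows>",
              "<http://example.org/b>", g)
    store.add("<http://example.org/b>", "<http://example.org/knows>",
              "<http://example.org/a>", g)
    print(store.quads)   # [(0, 1, 2, 3), (2, 1, 0, 3)]

Sorted, these tuples become highly repetitive, so delta coding between consecutive quads and chunk-level compression of the dictionary both pay off, while the structure stays usable without decompressing anything.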
On 2/18/14 12:53 AM, Ross Horne wrote:
> Hi Michel,
>
> I think your point is worth considering. Google makes heavy use of
> zippy, rather than gzip, simply to reduce the latency of reading and
> sending large amounts of data; see [1]. (Of course, storage is not a
> limitation.)
>
> Could zippy have a role in Linked Data protocols?
>
> Regards,
>
> Ross
>
> [1] Dean, Jeff. "Designs, lessons and advice from building large
> distributed systems." Keynote from LADIS (2009).
> http://www.lamsade.dauphine.fr/~litwin/cours98/CoursBD/doc/dean-keynote-ladis2009_scalable_distributed_google_system.pdf
>
> On 18 February 2014 10:42, Michel Dumontier <michel.dumontier@gmail.com> wrote:
>> Hi Tim,
>>
>> That folder contains 350GB of compressed RDF. I'm not about to unzip it
>> because a crawler can't decompress it on the fly. Honestly, it worries me
>> that people aren't considering the practicalities of storing, indexing,
>> and presenting all this data.
>>
>> Nevertheless, Bio2RDF does provide VoID definitions, URI resolution, and
>> access to SPARQL endpoints. I can only hope our data gets discovered.
>>
>> m.
>>
>> Michel Dumontier
>> Associate Professor of Medicine (Biomedical Informatics), Stanford University
>> Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group
>> http://dumontierlab.com
>>
>> On Sat, Feb 15, 2014 at 10:31 PM, Tim Berners-Lee <timbl@w3.org> wrote:
>>> On 2014-02-14, at 09:46, Michel Dumontier wrote:
>>>
>>> Andreas,
>>>
>>> I'd like to help by getting bio2rdf data into the crawl, really, but we
>>> gzip all of our files, and they are in n-quads format.
>>>
>>> http://download.bio2rdf.org/release/3/
>>>
>>> Think you can add gzip/bzip2 support?
>>>
>>> m.
>>>
>>> And on 2014-02-15, at 18:00, Hugh Glaser wrote:
>>>
>>> Hi Andreas and Tobias.
>>> Good luck!
>>> Actually, I think essentially ignoring dumps and doing a "real" crawl
>>> is a feature, rather than a bug.
>>>
>>>
>>> Michel,
>>>
>>> Agree with Hugh. I would encourage you to unzip the data files on your
>>> own servers so the URIs will work and your data is really Linked Data.
>>> There are lots of advantages to the community in being compatible.
>>>
>>> Tim
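On the zippy (Snappy) point in Ross's message above, here is a rough sketch of the ratio/latency tradeoff. It assumes the third-party python-snappy package (pip install python-snappy), and the file path is a placeholder for any large N-Quads dump.

    # Rough compression ratio/latency comparison: gzip vs. Snappy.
    import gzip
    import time

    import snappy  # python-snappy bindings

    with open("dump.nq", "rb") as f:   # placeholder dump file
        data = f.read()

    for name, compress in [("gzip", lambda d: gzip.compress(d, 6)),
                           ("snappy", snappy.compress)]:
        t0 = time.perf_counter()
        out = compress(data)
        dt = time.perf_counter() - t0
        print(f"{name}: {len(out) / len(data):.1%} of original, {dt:.3f}s")

Snappy typically compresses several times faster than gzip at a worse ratio, which is the right trade when the bottleneck is CPU and latency rather than storage, as in Dean's keynote.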
-- 
Stephen D. Williams  sdw@lig.net  stephendwilliams@gmail.com
LinkedIn: http://sdw.st/in
V:650-450-UNIX (8649)  V:866.SDW.UNIX  V:703.371.9362  F:703.995.0407
AIM:sdw  Skype:StephenDWilliams  Yahoo:sdwlignet
Resume: http://sdw.st/gres  Personal: http://sdw.st
facebook.com/sdwlig  twitter.com/scienteer

Received on Tuesday, 18 February 2014 18:21:26 UTC