- From: Austin William Wright <aaa@bzfx.net>
- Date: Fri, 11 Oct 2013 02:37:28 -0700
- To: Ruben Verborgh <ruben.verborgh@ugent.be>
- Cc: public-rdfjs@w3.org, Adrian Gschwend <ktk@netlabs.org>
- Message-ID: <CANkuk-VGP7MN5=JOVH+aaTbXD-j7rEZmdHP9GUebz-ANO5jNjQ@mail.gmail.com>
I haven't been impressed with the Node.js ecosystem. Not that there's too much choice, but that there are no guarantees of anything. Everyone releases 0.x versions of software with no indication of breakage, or entire packages simply go out of support, or you end up using two different packages that do the exact same thing, or two different versions of the same package, or two different instances of the same version of the same package (because npm insists on checking out multiple instances of the same package). It's a nightmare. (I'd like to move to an RDF-based package manager, generic enough for an entire OS, but that's another conversation to have.)

We should definitely take the time to index the functionality of the various libraries that exist, after first indexing the common use cases these libraries will be put towards. Perhaps after this, there are two things I would suggest:

1. Publishing specifications, so we can define the functionality we want in terms of what would be best for development. (Libraries are supposed to benefit the developers using them, not necessarily the developer writing the library.)

2. Changing and specializing the functionality of the libraries accordingly.

Specifically, I'm hoping we can adopt and work on the RDF Interfaces TR, and then specialize our libraries around different parts of RDF Interfaces. It already has the parsers and serializers that you talk about. My library is already very efficient at dealing with in-memory representations of triples and graphs. Later, I intend to publish a library that lets you query these stores using SPIN resources (themselves stored in a graph) and a map of variable bindings.

What my library is not so good with is Turtle. It passes much of the official test suite; however, for the sake of speed, it doesn't choke where it's supposed to (e.g. on spaces in IRIs). I'd like to use your Turtle library, but it doesn't support the RDF Interfaces API, and supporting it is not trivial. (My implementation: <https://github.com/Acubed/node-rdf>)

And specifically about specialization: I would adopt your Turtle library (or any compatible Turtle library) for parsing Turtle, and you might adopt my library for Graph, Node, and a query library. Both libraries would become smaller and more functional as a result.

You write about being asynchronous, but I'm not sure what you're referring to exactly. There is no need to have an asynchronous API, as there is nothing waiting on I/O operations. Likewise, "backpressure" isn't a well-defined term... I believe "congestion" is what you mean? But that's not something your library would be concerned with, only the application author would be (say they're reading from the filesystem and writing to a database; in that case, the application author would implement congestion control, because they're the one doing I/O).

What I have found lacking in RDF Interfaces is a stream API. There's no way to pass multiple write calls - each write call is expected to be a well-formed Turtle document (so far as I can tell). Ideally, I'd be able to pass segments of a document as they're emitted from the OS and write them to the Turtle parser. Before the write function call returns, the Turtle parser would in-line emit the parsed triples back to me or store them in a graph variable, and write to the environment object (so I can read things like @base and @prefix). When the file is fully read, I would then call end() to verify that the document was well formed.
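To make that concrete, here is a rough sketch of the kind of incremental interface I have in mind. All of the names below (TurtleParser, its callback, the environment wiring) are hypothetical - this is the desired API, not one that any existing library provides:

```js
var fs = require('fs');
var rdf = require('rdf'); // node-rdf; assumed to expose an RDF Interfaces environment

// Hypothetical incremental Turtle parser bound to an environment.
var env = rdf.environment;
var graph = env.createGraph();
var parser = new TurtleParser(env, function (triple) {
  // Called in-line, before write() returns, once per parsed triple.
  graph.add(triple);
});

var stream = fs.createReadStream('data.ttl', { encoding: 'utf8' });
stream.on('data', function (segment) {
  // Segments may split tokens mid-way; the parser buffers internally.
  parser.write(segment);
});
stream.on('end', function () {
  // Only now can well-formedness be fully verified.
  parser.end();
  // @base and @prefix declarations written during parsing are readable from env.
});
```

The essential points: write() accepts arbitrary segments rather than complete documents, triples are emitted synchronously as they are parsed, and well-formedness only becomes final at end().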
The WebApps WG is currently discussing a streams API for e.g. XMLHttpRequest. We should look into this - the Node.js Streams API isn't very appropriate for the Web.

I don't think performance matters in the least for our initial comparison. Performance can always improve; the feature set, functionality, compatibility, and security are much more important. (Perhaps there are some cases where a better API design allows for better performance; I would consider that functionality, not performance.)

Austin Wright.

On Thu, Oct 10, 2013 at 3:04 AM, Ruben Verborgh <ruben.verborgh@ugent.be> wrote:

> Hi all,
>
>> As mentioned before my goal is to see if we can have some common things
>> in the various JS libraries, especially parsers and serializers. For
>> example it doesn't make much sense if we have 4 different SPARQL parsers
>> for the various stores out there. There is one in rdfstore-js by
>> Antonio, one in the SPIN library by Austin, triplestoreJS might be happy
>> about it too and Matteo could use one for LevelGraph.
>
> TL;DR: Different non-functional requirements lead to different libraries;
> perhaps we should catalog them based on their strengths and weaknesses.
>
> Many modules that solve the same thing are unfortunately a common
> phenomenon in the current Node.js ecosystem (see e.g. [1] and a great
> blog post which I forgot).
> The high number of existing JavaScript programmers when Node.js was
> launched has led to an explosion of npm packages. There’s little reuse:
> everybody wanted their own package to solve things in a slightly
> different way.
> (Heck, I even did the same with my Turtle parser.)
>
> That said, I think it does make sense to have *some* different parsers
> (perhaps not 4).
> In addition to functional requirements, non-functional requirements
> influence a library as well.
> For instance, let’s talk about Turtle parsers.
> For my own node-n3 library, I had two main goals: asynchronicity and
> high performance (in that order).
> What this means to me is that, whenever I had to make a design decision,
> those two determined my choice.
>
> One could say: yes, but everybody wants the most high-performance
> library. Well, not always. For example, node-n3 is not running at
> maximum possible speed:
> if I’d drop the asynchronous requirement, then I could eliminate a lot
> of callbacks and speed up everything.
> This means that more performance is possible… for files up to around
> 2GB. All else would fail.
> However, that’s a crucial design decision you can’t turn on or off with
> a flag.
> I wanted to go beyond 2GB, so I chose asynchronicity over performance.
>
> On the other hand, I needed to make compromises to achieve this
> performance.
> For instance, as runtime classes heavily speed up JavaScript code [2],
> I decided that all triples would have the same rigid structure:
> { subject: <String>, predicate: <String>, object: <String>, context: <String> }
> This means I can’t rely on classes or hidden properties to determine
> whether the object is a literal or a URI,
> so the differentiation needs to happen inside the object string itself.
> Other parsers might output JSON-LD, which is going to be much slower,
> but way more handy to deal with in the application. If that’s your goal,
> then such a parser is better.
> However, that’s again a crucial design decision.
>
> A final example: implementing an RDF parser as a Node stream would have
> benefits as well.
> It’s handy because you can avoid backpressure, but it comes at a
> terrible performance cost [3].
>
> So while it is a good idea to look for common ground,
> I suggest starting by making an overview of the different libraries.
> This helps people decide which library they need for a specific purpose.
> For instance:
> - What does the library support (and which specs does it pass)?
> - What are its strengths and weaknesses?
> - What are typical use cases?
> To avoid too much marketing language, I propose standardized tests.
> Performance is fairly easy to measure, but not the only metric, as
> explained above.
> Circumstances matter: parsing small and large files from disk, memory,
> or network?
> Parsing straight to a triple-oriented model or directly to JSON-LD?
> Those things have to be decided to have objective measurements.
>
> No size fits all—but we don’t need more diversity than we can handle.
> And the diversity we have should be documented.
>
> Best,
>
> Ruben
>
> [1] https://medium.com/on-coding/6b6402216740
> [2] https://developers.google.com/v8/design#prop_access
> [3] https://github.com/RubenVerborgh/node-n3/issues/6#issuecomment-24010652
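For concreteness, the rigid triple layout Ruben describes might look like this in practice. This is a sketch only: the field values and the string encoding of the literal are assumptions for illustration, not node-n3's documented output:

```js
// One uniform shape for every triple, so the JavaScript engine can reuse a
// single hidden class across all triple objects (cf. [2] above).
var triple = {
  subject:   'http://example.org/book',
  predicate: 'http://purl.org/dc/terms/title',
  object:    '"A Tale of Two Cities"@en', // a literal, marked by the leading quote
  context:   'default'                    // placeholder context value
};

// With no classes or hidden properties available, literal-vs-URI
// differentiation happens inside the object string itself:
function isLiteral(entity) {
  return entity.charAt(0) === '"';
}

console.log(isLiteral(triple.object));                // true
console.log(isLiteral('http://example.org/dickens')); // false
```

Uniform shapes keep property access fast; the cost is that every consumer has to pick the strings apart again - exactly the trade-off described above.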
Received on Friday, 11 October 2013 09:37:57 UTC