- From: Austin William Wright <aaa@bzfx.net>
- Date: Fri, 11 Oct 2013 02:37:28 -0700
- To: Ruben Verborgh <ruben.verborgh@ugent.be>
- Cc: public-rdfjs@w3.org, Adrian Gschwend <ktk@netlabs.org>
- Message-ID: <CANkuk-VGP7MN5=JOVH+aaTbXD-j7rEZmdHP9GUebz-ANO5jNjQ@mail.gmail.com>
I haven't been impressed with the Node.js ecosystem. Not that there's too much choice, but that there are no guarantees of anything. Everyone releases 0.x versions of software with no indication of breakage, or entire packages simply go out of support, or you end up using two different packages that do the exact same thing, or two different versions of the same package, or two different instances of the same version of the same package (because npm insists on checking out multiple instances of the same package). It's a nightmare. (I'd like to move to an RDF-based package manager, generic enough for an entire OS, but that's another conversation to have.)

We should definitely take the time to index the functionality of the various libraries that exist, after first indexing the common use cases these libraries will be put towards. Perhaps after this, there are two things I would suggest:

1. Publishing specifications, so we can define the functionality we want in terms of what would be best for development. (Libraries are supposed to benefit the developers using them, not necessarily the developer writing the library.)

2. Changing and specializing the functionality of the libraries accordingly.

Specifically, I'm hoping we can adopt and work on the RDF Interfaces TR, and then specialize our libraries around different parts of RDF Interfaces. It already has the parsers and serializers that you talk about. My library is already very efficient at dealing with in-memory representations of triples and graphs. Later, I intend to publish a library that lets you query these stores using SPIN resources (themselves stored in a graph) and a map of variable bindings.

What my library is not so good with is Turtle. It passes much of the official test suite; however, for the sake of speed, it doesn't choke where it's supposed to (e.g. on spaces in IRIs). I'd like to use your Turtle library, but it doesn't support the RDF Interfaces API, and supporting it is not trivial. (My implementation: <https://github.com/Acubed/node-rdf>)

And specifically about specialization: I would adopt your Turtle library (or any compatible Turtle library) for parsing Turtle, and you might adopt my library for Graph, Node, and a query library. Both libraries would become smaller and more functional as a result.

You write about being asynchronous, but I'm not sure what you're referring to exactly. There is no need to have an asynchronous API, as there is nothing waiting on I/O operations. Likewise, "backpressure" isn't a well-defined term... I believe "congestion" is what you mean? But that's not something your library would be concerned with, only the application author would be (say they're reading from the filesystem and writing to a database; in that case, the application author would implement congestion control, because they're the one doing I/O).

What I have found lacking in RDF Interfaces is a stream API. There's no way to pass multiple write calls - each write call is expected to be a well-formed Turtle document (so far as I can tell). Ideally, I'd be able to pass segments of a document as they're emitted from the OS and write them to the Turtle parser. Before the write function call returns, the Turtle parser would in-line emit the parsed triples back to me or store them in a graph variable, and write to the environment object (so I can read things like @base and @prefix). When the file is fully read, I would then call end() to verify that the document was well formed.
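To make that concrete, here is a rough sketch of the kind of incremental interface I have in mind. All of the names below (TurtleParser, its callback, the environment wiring) are hypothetical - this is the desired API, not one that any existing library provides:

```js
var fs = require('fs');
var rdf = require('rdf'); // node-rdf; assumed to expose an RDF Interfaces environment

// Hypothetical incremental Turtle parser bound to an environment.
var env = rdf.environment;
var graph = env.createGraph();
var parser = new TurtleParser(env, function (triple) {
  // Called in-line, before write() returns, once per parsed triple.
  graph.add(triple);
});

var stream = fs.createReadStream('data.ttl', { encoding: 'utf8' });
stream.on('data', function (segment) {
  // Segments may split tokens mid-way; the parser buffers internally.
  parser.write(segment);
});
stream.on('end', function () {
  // Only now can well-formedness be fully verified.
  parser.end();
  // @base and @prefix declarations written during parsing are readable from env.
});
```

The essential points: write() accepts arbitrary segments rather than complete documents, triples are emitted synchronously as they are parsed, and well-formedness only becomes final at end().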
The WebApps WG is currently discussing a streams API for e.g. XMLHttpRequest. We should look into this - the Node.js Streams API isn't very appropriate for the Web.

I don't think performance matters in the least for our initial comparison. Performance can always improve; the feature set, functionality, compatibility, and security are much more important. (Perhaps there are some cases where a better API design allows for better performance; I would consider that functionality, not performance.)

Austin Wright.

On Thu, Oct 10, 2013 at 3:04 AM, Ruben Verborgh <ruben.verborgh@ugent.be> wrote:

> Hi all,
>
>> As mentioned before my goal is to see if we can have some common things
>> in the various JS libraries, especially parsers and serializers. For
>> example it doesn't make much sense if we have 4 different SPARQL parsers
>> for the various stores out there. There is one in rdfstore-js by
>> Antonio, one in the SPIN library by Austin, triplestoreJS might be happy
>> about it too and Matteo could use one for LevelGraph.
>
> TL;DR: Different non-functional requirements lead to different libraries;
> perhaps we should catalog them based on their strengths and weaknesses.
>
> Many modules that solve the same thing are unfortunately a common
> phenomenon in the current Node.js ecosystem (see e.g. [1] and a great
> blog post which I forgot).
> The high number of existing JavaScript programmers when Node.js was
> launched has led to an explosion of npm packages. There’s little reuse:
> everybody wanted their own package to solve things in a slightly
> different way.
> (Heck, I even did the same with my Turtle parser.)
>
> That said, I think it does make sense to have *some* different parsers
> (perhaps not 4).
> In addition to functional requirements, non-functional requirements
> influence a library as well.
> For instance, let’s talk about Turtle parsers.
> For my own node-n3 library, I had two main goals: asynchronicity and
> high performance (in that order).
> What this means to me is that, whenever I had to make a design decision,
> those two determined my choice.
>
> One could say: yes, but everybody wants the most high-performance
> library. Well, not always. For example, node-n3 is not running at
> maximum possible speed:
> if I’d drop the asynchronous requirement, then I could eliminate a lot
> of callbacks and speed up everything.
> This means that more performance is possible… for files up to around
> 2GB. All else would fail.
> However, that’s a crucial design decision you can’t turn on or off with
> a flag.
> I wanted to go beyond 2GB, so I chose asynchronicity over performance.
>
> On the other hand, I needed to make compromises to achieve this
> performance.
> For instance, as runtime classes heavily speed up JavaScript code [2],
> I decided that all triples would have the same rigid structure:
> { subject: <String>, predicate: <String>, object: <String>, context: <String> }
> This means I can’t rely on classes or hidden properties to determine
> whether the object is a literal or a URI,
> so the differentiation needs to happen inside the object string itself.
> Other parsers might output JSON-LD, which is going to be much slower,
> but way more handy to deal with in the application. If that’s your goal,
> then such a parser is better.
> However, that’s again a crucial design decision.
>
> A final example: implementing an RDF parser as a Node stream would have
> benefits as well.
> It’s handy because you can avoid backpressure, but it comes at a
> terrible performance cost [3].
>
> So while it is a good idea to look for common ground,
> I suggest starting by making an overview of the different libraries.
> This helps people decide which library they need for a specific purpose.
> For instance:
> - What does the library support (and which specs does it pass)?
> - What are its strengths and weaknesses?
> - What are typical use cases?
> To avoid too much marketing language, I propose standardized tests.
> Performance is fairly easy to measure, but not the only metric, as
> explained above.
> Circumstances matter: parsing small and large files from disk, memory,
> or network?
> Parsing straight to a triple-oriented model or directly to JSON-LD?
> Those things have to be decided to have objective measurements.
>
> No size fits all—but we don’t need more diversity than we can handle.
> And the diversity we have should be documented.
>
> Best,
>
> Ruben
>
> [1] https://medium.com/on-coding/6b6402216740
> [2] https://developers.google.com/v8/design#prop_access
> [3] https://github.com/RubenVerborgh/node-n3/issues/6#issuecomment-24010652
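For concreteness, the rigid triple layout Ruben describes might look like this in practice. This is a sketch only: the field values and the string encoding of the literal are assumptions for illustration, not node-n3's documented output:

```js
// One uniform shape for every triple, so the JavaScript engine can reuse a
// single hidden class across all triple objects (cf. [2] above).
var triple = {
  subject:   'http://example.org/book',
  predicate: 'http://purl.org/dc/terms/title',
  object:    '"A Tale of Two Cities"@en', // a literal, marked by the leading quote
  context:   'default'                    // placeholder context value
};

// With no classes or hidden properties available, literal-vs-URI
// differentiation happens inside the object string itself:
function isLiteral(entity) {
  return entity.charAt(0) === '"';
}

console.log(isLiteral(triple.object));                // true
console.log(isLiteral('http://example.org/dickens')); // false
```

Uniform shapes keep property access fast; the cost is that every consumer has to pick the strings apart again - exactly the trade-off described above.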
Received on Friday, 11 October 2013 09:37:57 UTC