Re: representing URIs and literals from Ruben Verborgh on 2013-11-03 (public-rdfjs@w3.org from November 2013)

From: Ruben Verborgh <ruben.verborgh@ugent.be>
Date: Sun, 3 Nov 2013 17:24:33 +0000
To: Sandro Hawke <sandro@w3.org>
Cc: Austin William Wright <aaa@bzfx.net>, "public-rdfjs@w3.org" <public-rdfjs@w3.org>
Message-Id: <EFA16A38-A2A2-4924-A522-B5C52C433CDC@ugent.be>

Hi Sandro,

> Ruben, I applaud your efforts in this.   I've been thinking about this issue for a while, but haven't had a chance to run benchmark experiments.

I had run ad-hoc benchmarks when I was designing node-n3;
i.e., I would try different triple representations to check out differences in speed.
The biggest gain when processing triples seems to come
from having only one hidden class { subject: string, predicate: string, object: string }
instead of the multiple variations that arise when the object is a JavaScript Object.
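
To make the two shapes concrete, here is a minimal sketch. The function names flatTriple and nestedTriple are hypothetical and chosen only for illustration; they are not part of node-n3, and the sketch makes no claim about how a particular engine optimizes either form.

```javascript
// Hypothetical constructor functions; the names are illustrative only.

// Every triple has the same three string-valued properties.
function flatTriple(subject, predicate, object) {
  return { subject: subject, predicate: predicate, object: object };
}

// Alternative for comparison: literals become nested objects, so the
// `object` property holds a string for IRIs and an Object for literals.
function nestedTriple(subject, predicate, value, language) {
  return {
    subject: subject,
    predicate: predicate,
    object: { value: value, language: language }
  };
}

const knows = flatTriple('http://example.org/alice',
                         'http://xmlns.com/foaf/0.1/knows',
                         'http://example.org/bob');
const name = flatTriple('http://example.org/alice',
                        'http://xmlns.com/foaf/0.1/name',
                        '"Alice"@en');
const nestedName = nestedTriple('http://example.org/alice',
                                'http://xmlns.com/foaf/0.1/name',
                                'Alice', 'en');
```

With the flat form, code that walks over triples always sees strings in all three positions, whether the object is an IRI or a literal.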

> I think it's also important to pay attention to memory footprint.

That could be an argument for using prefixed names, which results in shorter strings and less memory.

Another option is to store each unique IRI only once in a lookup hash,
and materialize only those triples that are currently needed.
This approach is taken in node-n3’s in-memory store [1].
There is one lookup hash from IRI (or literal) to integer,
and three two-level indexes (?s ?p / ?p ?o / ?o ?s) that use the integers as keys.
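
A rough sketch of that layout, to show the idea rather than the actual code behind [1]. The class TinyStore and its methods are hypothetical names, and I use Map and Set here as an assumption for readability.

```javascript
// Hypothetical sketch, not node-n3's actual implementation:
// one term dictionary plus three two-level indexes keyed by integer ids.
class TinyStore {
  constructor() {
    this.ids = new Map();  // term string (IRI or literal) -> integer id
    this.terms = [];       // integer id -> term string
    this.sp = new Map();   // subject id   -> predicate id -> Set of object ids
    this.po = new Map();   // predicate id -> object id    -> Set of subject ids
    this.os = new Map();   // object id    -> subject id   -> Set of predicate ids
  }

  // Return the integer id for a term, assigning a new one on first sight.
  idOf(term) {
    let id = this.ids.get(term);
    if (id === undefined) {
      id = this.terms.length;
      this.ids.set(term, id);
      this.terms.push(term);
    }
    return id;
  }

  // Insert c under index[a][b], creating the intermediate levels as needed.
  static addToIndex(index, a, b, c) {
    let level2 = index.get(a);
    if (!level2) index.set(a, (level2 = new Map()));
    let leaf = level2.get(b);
    if (!leaf) level2.set(b, (leaf = new Set()));
    leaf.add(c);
  }

  add(subject, predicate, object) {
    const s = this.idOf(subject);
    const p = this.idOf(predicate);
    const o = this.idOf(object);
    TinyStore.addToIndex(this.sp, s, p, o);
    TinyStore.addToIndex(this.po, p, o, s);
    TinyStore.addToIndex(this.os, o, s, p);
  }

  // Materialize only the objects for one subject/predicate pair.
  objects(subject, predicate) {
    const s = this.ids.get(subject);
    const p = this.ids.get(predicate);
    if (s === undefined || p === undefined) return [];
    const level2 = this.sp.get(s);
    const leaf = level2 && level2.get(p);
    return leaf ? Array.from(leaf, id => this.terms[id]) : [];
  }
}
```

Each distinct string is stored once in the dictionary; the indexes hold only integers, and strings are looked up again only for the triples a query actually returns.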

> I've been thinking of just using the first character of the string as a special signifier, eg "_" for blank node, "<" for iri, s for xsd:string, i for xsd:int, f for xsd:float, etc.

That’s interesting. However, since IRIs are by far the most common, I opted to not give them any special character.
That way, IRIs do not need any processing; their value is the string value and vice versa.
You then only have the parsing overhead for literals (and blank nodes; I use the “_:” convention).
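
As a minimal sketch of that convention (the function name termKind is hypothetical, not a node-n3 API):

```javascript
// Hypothetical helper: classify a term string by its leading characters.
// Literals start with a double quote, blank nodes with "_:",
// and anything else is taken to be an IRI whose value is the string itself.
function termKind(term) {
  if (term[0] === '"') return 'literal';
  if (term[0] === '_' && term[1] === ':') return 'blank';
  return 'iri';
}
```

For an IRI there is nothing to strip or decode: the string you compare against is the IRI.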

> It seems to me desirable to avoid needing to strip the closing quote of the string, or search in it for the datatype URI, as one would have to do with your design above.

Aha, never considered that. I had the extra closing quote for symmetry reasons, as it highlights the literal more nicely IMHO.
There is no difference in performance however [2], so I think the symmetry argument could work here:
    '"this is a literal'
    '"this is a literal"'
It might also be a little clearer if a datatype or language is included (not benchmarked, but it will likely be the same with or without the quote):
    '"this is a literal@en'
    '"this is a literal"@en'
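
For the variant with the closing quote, taking a literal apart could look roughly like the sketch below. The function parseLiteral is a hypothetical name, and the code assumes that neither a language tag nor a datatype IRI in angle brackets contains a double quote, so the last quote in the string always ends the value.

```javascript
// Hypothetical helper: split a quoted literal string into its parts.
// Assumes the forms "value", "value"@lang and "value"^^<datatype>.
function parseLiteral(term) {
  if (term[0] !== '"') return null;          // not a literal
  const end = term.lastIndexOf('"');         // closing quote of the value
  const value = term.slice(1, end);
  const suffix = term.slice(end + 1);        // '', '@en' or '^^<...>'
  let language = '';
  let datatype = '';
  if (suffix[0] === '@') {
    language = suffix.slice(1);
  } else if (suffix.startsWith('^^<') && suffix.endsWith('>')) {
    datatype = suffix.slice(3, -1);
  }
  return { value: value, language: language, datatype: datatype };
}
```

Searching from the end keeps quotes inside the value intact, since only the last quote is treated as the delimiter.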

> The datatype and language tag, and whether it's an iri or blank node will always be determinable from x[0].

My implementation so far lets you distinguish only between IRI, blank node, and literal from x[0].
I don’t know if it’s necessary to know the language tag this fast; could be handy for the datatype.

> For unknown datatypes and for language tags, I was considering just building a dynamic table of special first-characters.

The downside of this is that the representations lose their universal meaning.
On the one hand, this is not a problem, as this internal format is not meant to be exchanged (Turtle is).
On the other hand, it makes programming more difficult.

With the current node-n3 implementation, you know that
     x === '"5"^^<http://www.w3.org/2001/XMLSchema#integer>'
     x === '"something"^^<http://example.org/#mytype>'
will always be possible, regardless of which type you’re interested in.
With dynamic tables, the precise character assigned to mytype could be different depending on execution order.
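
To illustrate the execution-order problem, here is a small sketch of such a dynamic table. Everything in it is an assumption on my part: the function makeDatatypeTable is a hypothetical name, and handing out characters from the Unicode private use area starting at U+E000 is just one way the assignment could work.

```javascript
// Hypothetical sketch of a dynamic table that hands out one special
// first character per datatype, in the order the datatypes are first seen.
function makeDatatypeTable() {
  const codes = new Map();   // datatype IRI -> assigned character
  let next = 0xE000;         // assumption: start of the private use area
  return function codeFor(datatype) {
    let code = codes.get(datatype);
    if (code === undefined) {
      code = String.fromCharCode(next++);
      codes.set(datatype, code);
    }
    return code;
  };
}

// Two runs that encounter the datatypes in a different order.
const runA = makeDatatypeTable();
runA('http://example.org/#othertype');
const mytypeInA = runA('http://example.org/#mytype');

const runB = makeDatatypeTable();
const mytypeInB = runB('http://example.org/#mytype');
```

Here mytypeInA and mytypeInB are different characters, so a comparison against a fixed string cannot be written down in advance; you would always have to go through the table.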

> Since it's a unicode string, there are plenty of possibilities for that first character.

True, but that would also be difficult for programming
if you have to write those characters in comparisons such as the above.


It would be great to think about this some more and perhaps come up with some best practices!

Best,

Ruben

Received on Sunday, 3 November 2013 17:25:13 UTC