Re: representing URIs and literals from Sandro Hawke on 2013-11-03 (public-rdfjs@w3.org from November 2013)

From: Sandro Hawke <sandro@w3.org>
Date: Sun, 03 Nov 2013 10:59:23 -0500
To: Ruben Verborgh <ruben.verborgh@ugent.be>, Austin William Wright <aaa@bzfx.net>
CC: "public-rdfjs@w3.org" <public-rdfjs@w3.org>
Message-ID: <527672DB.4020003@w3.org>
On 11/03/2013 06:08 AM, Ruben Verborgh wrote:
> Hi Austin,
>
> Thanks for adding these considerations.
> I set up a benchmark to compare the performance and memory usage between representations [1].
> It appears that object/string-representations are considerably faster than polymorphism.
>
>> What you're describing, and what node-n3 appears to do, Java could do as well: Just look at the string and switch behavior depending on the detected input. This is quintessential object-oriented programming, it's polymorphism.
> No, polymorphism is "providing a single interface to entities of different types” [2].
> In the object-oriented paradigm, you have:
>      x.isLiteral()
> and this gives different results depending on the type of x.
> However, what I propose is switching behavior on objects on the same type:
>      /^”/.test(x) ? ‘x is a literal’ : ‘x is a URI’
> where x is always a string.
>
>> I'm not sure how this is faster. It should be slower, you have to actually perform string operations on the string to determine its type. This is far slower than builtin type/class polymorphism!
> That’s not correct: it is far faster, as the benchmark indicates.
> Checking whether a string-based representation is a literal
> is twice as fast as with built-in type/class polymorphism.
> - Check prototype-based triples for literals: 17.149s
> - Check object/string-based triples for literals: 8.559s
>
>> ECMAScript is distinctly *not* object-oriented, if the definition of OO includes polymorphic (most references I see define it this way, I'll adopt this usage here since it helps us distinguish things better). What ECMAScript has instead is prototypes, including for built-in primitives! What we tend to call 'classes' in ECMAScript aren't really classes, that's just for lack of a better term
> That’s true at design time, but the current generation of JavaScript engines
> uses hidden classes that greatly affect performance [3].
>
>> What this means is instead of encoding literals as an ECMAScript value such as any of [ 'uri' , '"string"' , '"string"^<datatype>' ], we can encode some literals using native datatypes:
>>
>> (50).type === 'http://www.w3.org/2001/XMLSchema#integer'
>> (50).toString() === '"50"^^<http://www.w3.org/2001/XMLSchema#integer>’
> No, that’s not possible:
>      var a = 50;
>      a.type = 'http://www.w3.org/2001/XMLSchema#integer';
>      console.log(a.type); //  undefined
>
>> This is far more powerful than merely encoding all literals as a string.
> If it were possible, yes. Ruby offers this; I think it’s really cool.
>
>> Remember, in ECMAScript, there's no polymorphism, just objects with a prototype chain
> Prototypes do provide polymorphism: regarding of what type an object is (= what prototype it has), the same interface can have different effects.
>
>> so there shouldn't be any difference between passing a string and passing an object per se, except that a string removes your ability to use a prototype chain. And when including other effects (not per se), strings appear to fare worse: As I pointed out, you have to parse them, but objects are pre-parsed in memory.
> Yes, but as the benchmarks show, string evaluation is faster than a prototype check.
> Have a look at the times for finding triples with specific subject and objects:
> - Find prototype-based triples with a given subject 18.848s
> - Find object/string-based triples with a given subject 9.505s
> - Find prototype-based triples with a given object 20.467s
> - Find object/string-based triples with a given object 10.571s
>
> Not to mention the creation time and memory usage of the triple structure in memory:
> - Generate prototype-based triples 12.692s (1241MB)
> - Generate object/string-based triples 4.867s (803MB)
>
> So the pre-parsing takes more time and actually harms performance instead of improving it.
>
>> Though intuitively tempting, it doesn't make sense to encode literals as just strings. Most literals will have types, and in RDF 1.1, all literals have types, untyped strings become the same as `xsd:string`.
> I do add types and languages to double-quoted literals:
>      N3Util.isLiteral('"Mickey Mouse"'); // true
>      N3Util.isLiteral('"Mickey Mouse"@en'); // true
>      N3Util.isLiteral('"3"^^<http://www.w3.org/2001/XMLSchema#integer>'); // true
>      N3Util.isLiteral('"http://example.org/"'); // true

Ruben, I applaud your efforts in this.   I've been thinking about this 
issue for a while, but haven't had a chance to run benchmark experiments.

I think it's also important to pay attention to memory footprint. 
Especially on mobile, my understanding is that exhausting RAM is key 
performance problem in client software today, at least given today's GC 
behavior.   So, running in a mobile browser, it's important to keep the 
in-memory representation of the data small.

My (untested) intuition agrees with your plan to just use strings for 
all RDF Terms.  I've been thinking of just using the first character of 
the string as a special signifier, eg "_" for blank node, "<" for iri, s 
for xsd:string, i for xsd:int, f for xsd:float, etc.     It seems to me 
desirable to avoid needing to strip the closing quote of the string, or 
search in it for the datatype URI, as one would have to do with your 
design above.

With this first-char design, the lexical part for a data value x is 
always just x.slice(1).   The datatype and language tag, and whether 
it's an iri or blank node will always be determinable from x[0].

For unknown datatypes and for language tags, I was considering just 
building a dynamic table of special first-characters.   Since it's a 
unicode string, there are plenty of possibilities for that first 
character.  Maybe just start at char 128 for the dynamically assigned 
ones, or maybe there's a better unicode range for that (something with 
some nice range of glyphs).

       -- Sandro


>
> The performance surprise has everything to do with the difference between typed languages
> and the dynamic compilation of JavaScript, which is counterintuitive at some times.
>
> Best,
>
> Ruben
>
> [1] https://github.com/RubenVerborgh/TripleRepresentationBenchmark
> [2] http://www.stroustrup.com/glossary.html#Gpolymorphism
> [3] https://developers.google.com/v8/design#prop_access
>
Received on Sunday, 3 November 2013 15:59:31 UTC