Re: representing URIs and literals from Ruben Verborgh on 2013-11-03 (public-rdfjs@w3.org from November 2013)

From: Ruben Verborgh <ruben.verborgh@ugent.be>
Date: Sun, 3 Nov 2013 11:08:40 +0000
To: Austin William Wright <aaa@bzfx.net>
Cc: "public-rdfjs@w3.org" <public-rdfjs@w3.org>
Message-Id: <1F18F043-8732-411A-9F9E-B4CBA96F52A4@ugent.be>

Hi Austin,

Thanks for adding these considerations.
I set up a benchmark to compare the performance and memory usage of the different representations [1].
It appears that object/string representations are considerably faster than polymorphism.

> What you're describing, and what node-n3 appears to do, Java could do as well: Just look at the string and switch behavior depending on the detected input. This is quintessential object-oriented programming, it's polymorphism.

No, polymorphism is "providing a single interface to entities of different types" [2].
In the object-oriented paradigm, you have:
    x.isLiteral()
and this gives different results depending on the type of x.
However, what I propose is switching behavior on objects of the same type:
    /^"/.test(x) ? 'x is a literal' : 'x is a URI'
where x is always a string.
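
To make the contrast concrete, here is a minimal sketch of both styles. The Uri and Literal constructors and the isLiteralString helper are hypothetical names chosen for illustration; they are not taken from node-n3 or from the benchmark.

```javascript
// Polymorphic style: two hypothetical types share one interface,
// and the answer depends on the type of the receiver.
function Uri(value) { this.value = value; }
Uri.prototype.isLiteral = function () { return false; };

function Literal(value) { this.value = value; }
Literal.prototype.isLiteral = function () { return true; };

// String style: every term is a plain string, and the answer
// depends on its content (literals start with a double quote).
function isLiteralString(term) {
  return /^"/.test(term);
}

var terms = [new Uri('http://example.org/mickey'), new Literal('Mickey Mouse')];
var strings = ['http://example.org/mickey', '"Mickey Mouse"'];

console.log(terms.map(function (t) { return t.isLiteral(); })); // [ false, true ]
console.log(strings.map(function (s) { return isLiteralString(s); })); // [ false, true ]
```

In the first case the dispatch happens on the prototype of each object; in the second there is only one type, and a single function inspects the value.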

> I'm not sure how this is faster. It should be slower, you have to actually perform string operations on the string to determine its type. This is far slower than builtin type/class polymorphism!

That’s not correct: it is far faster, as the benchmark indicates.
Checking whether a string-based representation is a literal
is twice as fast as with built-in type/class polymorphism.
- Check prototype-based triples for literals: 17.149s
- Check object/string-based triples for literals: 8.559s

> ECMAScript is distinctly *not* object-oriented, if the definition of OO includes polymorphic (most references I see define it this way, I'll adopt this usage here since it helps us distinguish things better). What ECMAScript has instead is prototypes, including for built-in primitives! What we tend to call 'classes' in ECMAScript aren't really classes, that's just for lack of a better term

That’s true at design time, but the current generation of JavaScript engines
uses hidden classes that greatly affect performance [3].

> What this means is instead of encoding literals as an ECMAScript value such as any of [ 'uri' , '"string"' , '"string"^<datatype>' ], we can encode some literals using native datatypes:
> 
> (50).type === 'http://www.w3.org/2001/XMLSchema#integer'
> (50).toString() === '"50"^^<http://www.w3.org/2001/XMLSchema#integer>'

No, that’s not possible:
    var a = 50;
    a.type = 'http://www.w3.org/2001/XMLSchema#integer';
    console.log(a.type); //  undefined
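
A slightly longer sketch of the same behavior, assuming a plain Node.js or browser environment. The assignment is wrapped in try/catch because strict-mode code throws a TypeError there, whereas sloppy-mode code silently drops the property. The boxed variable is only an illustration of what it costs to attach a property to an individual number.

```javascript
var XSD_INTEGER = 'http://www.w3.org/2001/XMLSchema#integer';

// A primitive number cannot hold its own properties: the assignment
// targets a temporary wrapper object that is discarded right away.
var a = 50;
try {
  a.type = XSD_INTEGER; // silently ignored in sloppy mode
} catch (e) {
  // strict-mode code throws a TypeError here instead
}
console.log(a.type);   // undefined
console.log(typeof a); // number

// An explicit wrapper object can carry the property,
// but the value is then no longer a primitive.
var boxed = new Number(50);
boxed.type = XSD_INTEGER;
console.log(boxed.type);   // http://www.w3.org/2001/XMLSchema#integer
console.log(typeof boxed); // object
console.log(boxed === 50); // false
console.log(boxed + 1);    // 51
```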

> This is far more powerful than merely encoding all literals as a string.

If it were possible, yes. Ruby offers this; I think it’s really cool.

> Remember, in ECMAScript, there's no polymorphism, just objects with a prototype chain

Prototypes do provide polymorphism: regardless of what type an object is (= what prototype it has), the same interface can have different effects.

> so there shouldn't be any difference between passing a string and passing an object per se, except that a string removes your ability to use a prototype chain. And when including other effects (not per se), strings appear to fare worse: As I pointed out, you have to parse them, but objects are pre-parsed in memory.

Yes, but as the benchmarks show, string evaluation is faster than a prototype check.
Have a look at the times for finding triples with specific subjects and objects:
- Find prototype-based triples with a given subject: 18.848s
- Find object/string-based triples with a given subject: 9.505s
- Find prototype-based triples with a given object: 20.467s
- Find object/string-based triples with a given object: 10.571s

Not to mention the creation time and memory usage of the triple structure in memory:
- Generate prototype-based triples: 12.692s (1241MB)
- Generate object/string-based triples: 4.867s (803MB)

So the pre-parsing takes more time and actually harms performance instead of improving it.
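
For readers who want to see the two structures side by side, here is a rough sketch of how triples could be generated and searched in each representation. This is not the code from [1]; the constructors, the function names, and the example.org data are assumptions made for illustration, and the sketch only checks that both lookups return the same result without measuring anything.

```javascript
// Prototype-based representation (hypothetical constructors).
function Uri(value) { this.value = value; }
function Literal(value) { this.value = value; }
function Triple(subject, predicate, object) {
  this.subject = subject;
  this.predicate = predicate;
  this.object = object;
}

function generatePrototypeTriples(n) {
  var triples = [];
  for (var i = 0; i < n; i++) {
    triples.push(new Triple(new Uri('http://example.org/s' + (i % 10)),
                            new Uri('http://example.org/p'),
                            new Literal('value ' + i)));
  }
  return triples;
}

// Object/string-based representation: plain objects with string fields,
// where literals are recognizable by their leading double quote.
function generateStringTriples(n) {
  var triples = [];
  for (var i = 0; i < n; i++) {
    triples.push({ subject: 'http://example.org/s' + (i % 10),
                   predicate: 'http://example.org/p',
                   object: '"value ' + i + '"' });
  }
  return triples;
}

// Lookup needs a type check plus a value comparison for prototype-based terms...
function findPrototypeBySubject(triples, uri) {
  return triples.filter(function (t) {
    return t.subject instanceof Uri && t.subject.value === uri;
  });
}

// ...and a single string comparison for string-based terms.
function findStringBySubject(triples, uri) {
  return triples.filter(function (t) { return t.subject === uri; });
}

var prototypeTriples = generatePrototypeTriples(1000);
var stringTriples = generateStringTriples(1000);
console.log(findPrototypeBySubject(prototypeTriples, 'http://example.org/s3').length); // 100
console.log(findStringBySubject(stringTriples, 'http://example.org/s3').length);       // 100
```

The prototype-based variant allocates three extra objects per triple, which is one plausible contributor to the difference in generation time and memory; actual numbers depend on the engine and its version.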

> Though intuitively tempting, it doesn't make sense to encode literals as just strings. Most literals will have types, and in RDF 1.1, all literals have types, untyped strings become the same as `xsd:string`.

I do add types and languages to double-quoted literals:
    N3Util.isLiteral('"Mickey Mouse"'); // true
    N3Util.isLiteral('"Mickey Mouse"@en'); // true
    N3Util.isLiteral('"3"^^<http://www.w3.org/2001/XMLSchema#integer>'); // true
    N3Util.isLiteral('"http://example.org/"'); // true
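
As an illustration that the type and language remain recoverable from such a string, here is a minimal sketch. The parseLiteral helper and its regular expression are hypothetical and simplified; they are not the node-n3 implementation. The default datatypes follow RDF 1.1, where plain literals are xsd:string and language-tagged literals are rdf:langString.

```javascript
var XSD_STRING = 'http://www.w3.org/2001/XMLSchema#string';
var RDF_LANGSTRING = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#langString';

// "value", optionally followed by @language or ^^<datatype>.
// [^]* matches any character, including newlines and inner quotes.
var LITERAL = /^"([^]*)"(?:@([a-zA-Z]+(?:-[a-zA-Z0-9]+)*)|\^\^<([^>]*)>)?$/;

// Hypothetical helper: returns the parts of a literal, or null for non-literals.
function parseLiteral(term) {
  var match = LITERAL.exec(term);
  if (!match) return null;
  return {
    value: match[1],
    language: match[2] || null,
    type: match[3] || (match[2] ? RDF_LANGSTRING : XSD_STRING)
  };
}

console.log(parseLiteral('"Mickey Mouse"@en').language); // en
console.log(parseLiteral('"3"^^<http://www.w3.org/2001/XMLSchema#integer>').type);
// http://www.w3.org/2001/XMLSchema#integer
console.log(parseLiteral('"http://example.org/"').value); // http://example.org/
console.log(parseLiteral('http://example.org/'));         // null
```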


The performance surprise has everything to do with the difference between statically typed languages
and the dynamic compilation of JavaScript, which can be counterintuitive at times.

Best,

Ruben

[1] https://github.com/RubenVerborgh/TripleRepresentationBenchmark
[2] http://www.stroustrup.com/glossary.html#Gpolymorphism
[3] https://developers.google.com/v8/design#prop_access
Received on Sunday, 3 November 2013 11:09:18 UTC