Re: representing URIs and literals from Austin William Wright on 2013-11-04 (public-rdfjs@w3.org from November 2013)

From: Austin William Wright <aaa@bzfx.net>
Date: Sun, 3 Nov 2013 20:38:34 -0700
To: Ruben Verborgh <ruben.verborgh@ugent.be>
Cc: "public-rdfjs@w3.org" <public-rdfjs@w3.org>
Message-ID: <CANkuk-Vx8xJbOp3L9mXcaTAFegZOvF8LfXrQ6rRyoM7Az_yrrA@mail.gmail.com>
On Sun, Nov 3, 2013 at 4:08 AM, Ruben Verborgh <ruben.verborgh@ugent.be>wrote:

> Hi Austin,
>
> Thanks for adding these considerations.
> I set up a benchmark to compare the performance and memory usage between
> representations [1].
> It appears that object/string-representations are considerably faster than
> polymorphism.


> > What you're describing, and what node-n3 appears to do, Java could do as
> well: Just look at the string and switch behavior depending on the detected
> input. This is quintessential object-oriented programming, it's
> polymorphism.
>
> No, polymorphism is "providing a single interface to entities of different
> types” [2].
> In the object-oriented paradigm, you have:
>     x.isLiteral()
> and this gives different results depending on the type of x.
> However, what I propose is switching behavior on objects on the same type:
>     /^”/.test(x) ? ‘x is a literal’ : ‘x is a URI’
> where x is always a string.
>

I made a distinction between the standard, "built-in" language construct,
and the pesudo-polypormhism that is being done here, where we are
switch-ing logic in code, based on how the string is formatted. I don't
believe there's a substantial underlying difference.


>
> > I'm not sure how this is faster. It should be slower, you have to
> actually perform string operations on the string to determine its type.
> This is far slower than builtin type/class polymorphism!
>
> That’s not correct: it is far faster, as the benchmark indicates.
> Checking whether a string-based representation is a literal
> is twice as fast as with built-in type/class polymorphism.
> - Check prototype-based triples for literals: 17.149s
> - Check object/string-based triples for literals: 8.559s
>
> > ECMAScript is distinctly *not* object-oriented, if the definition of OO
> includes polymorphic (most references I see define it this way, I'll adopt
> this usage here since it helps us distinguish things better). What
> ECMAScript has instead is prototypes, including for built-in primitives!
> What we tend to call 'classes' in ECMAScript aren't really classes, that's
> just for lack of a better term
>
> That’s true at design time, but the current generation of JavaScript
> engines
> uses hidden classes that greatly affect performance [3].


That's seems to be an argument in favor of using instances, instead of
primitive types or plain Object maps which don't get the benefit from this
V8 compile-time logic.


>


> > What this means is instead of encoding literals as an ECMAScript value
> such as any of [ 'uri' , '"string"' , '"string"^<datatype>' ], we can
> encode some literals using native datatypes:
> >
> > (50).type === 'http://www.w3.org/2001/XMLSchema#integer'
> > (50).toString() === '"50"^^<http://www.w3.org/2001/XMLSchema#integer>’
>
> No, that’s not possible:
>     var a = 50;
>     a.type = 'http://www.w3.org/2001/XMLSchema#integer';
>     console.log(a.type); //  undefined
>

Yet this is what I do with 'rdf'. Try it yourself:

> require('rdf').setBuiltins();
> var a = 4;
> a.datatype
'http://www.w3.org/2001/XMLSchema#integer'
> a.nodeType()
'TypedLiteral'
> (4.5).datatype
'http://www.w3.org/2001/XMLSchema#decimal'
> "_:x".nodeType()
'BlankNode'

Magic!


> > This is far more powerful than merely encoding all literals as a string.
>
> If it were possible, yes. Ruby offers this; I think it’s really cool.
>

More magic, brought to you by the magic `valueOf` method:

Number.prototype.tl = function(t) {
  Object.defineProperty(this,'datatype', { writable: false, configurable :
false, enumerable: false, value: t } );
  return this;
}
> var x = (5).dt('http://www.w3.org/2001/XMLSchema#int');
> x+2
7
> x.datatype
'http://www.w3.org/2001/XMLSchema#int'

... kind of.

> typeof x
'object'
> ++x
6
> typeof x
'number'



>
> > Remember, in ECMAScript, there's no polymorphism, just objects with a
> prototype chain
>
> Prototypes do provide polymorphism: regarding of what type an object is (=
> what prototype it has), the same interface can have different effects.
>
> > so there shouldn't be any difference between passing a string and
> passing an object per se, except that a string removes your ability to use
> a prototype chain. And when including other effects (not per se), strings
> appear to fare worse: As I pointed out, you have to parse them, but objects
> are pre-parsed in memory.
>
> Yes, but as the benchmarks show, string evaluation is faster than a
> prototype check.
> Have a look at the times for finding triples with specific subject and
> objects:
> - Find prototype-based triples with a given subject 18.848s
> - Find object/string-based triples with a given subject 9.505s
> - Find prototype-based triples with a given object 20.467s
> - Find object/string-based triples with a given object 10.571s
>
> Not to mention the creation time and memory usage of the triple structure
> in memory:
> - Generate prototype-based triples 12.692s (1241MB)
> - Generate object/string-based triples 4.867s (803MB)
>
> So the pre-parsing takes more time and actually harms performance instead
> of improving it.


The times you list are cumulative, and include not just the read-time
polymorphism logic (I believe slower than the built-in prototype
polymorphism), but also the creation logic for Objects... It very well may
be faster to mint a String than it is to create an instance of an Object
with more than one property.

Pre-parsing depends on your operation. Your tests are greatly biased
towards creating objects rather than searching for them - by over five
orders of magnitude! This is completely unrealistic on most production
systems. Even "write-heavy" systems will usually be reading each entry at
least once.

When indexing my graphs, my search performs three orders of magnitude
faster than yours. It does, however, use much more memory due indexes --
about four times as much, because you add three indexes. If I wrote my own
search algorithm, this number could go down significantly (a reduction of
redundant information), with faster inserts (the vast majority of insert
time is creating blank objects, this would go away with a custom index,
instead of relying on Object).

$ node --expose-gc benchmark.js 3
0. Empty test environment: 0.00006099s 31MB
1. Generate prototype-based triples: 52.89283712s 386MB
3. Find prototype-based triples with a given subject: 0.039233295s 387MB
$ node --expose-gc benchmark.js 4
0. Empty test environment: 0.000058762s 31MB
2. Generate object/string-based triples: 3.296475518s 122MB
4. Find object/string-based triples with a given subject: 75.158390913s
124MB

Please note for these tests I made many changes, most significantly I
changed to the Node.js-specific `process.hrtime` for benchmarking, the
timer reports for each test individually, I replace your 'prototype'
examples with my own 'rdf' library (which is better optimized), I load more
realistic data (I select among ~100 predicates using a log distribution),
and I perform slightly more reads (though inserts still outnumber reads by
10k times, which is completely unrealistic, but good enough to get the idea
for our purposes).

You appear to be interested in testing just the raw performance of one
construct versus another, and as such use straight loops that run O(n). But
I don't think it's fair to write a case that's going to be a tiny fraction
of runtime once we begin to implement indexes and other computation
structures. A fair comparison will compare the /best/ implementation
against the /best/ implementation (and admittedly, I haven't improved upon
the string-storage examples with indexing, so these tests are still unfair.)

Nonetheless, these updates pushed to:
<https://github.com/Acubed/TripleRepresentationBenchmark>


> > Though intuitively tempting, it doesn't make sense to encode literals as
> just strings. Most literals will have types, and in RDF 1.1, all literals
> have types, untyped strings become the same as `xsd:string`.
>
> I do add types and languages to double-quoted literals:
>     N3Util.isLiteral('"Mickey Mouse"'); // true
>     N3Util.isLiteral('"Mickey Mouse"@en'); // true
>     N3Util.isLiteral('"3"^^<http://www.w3.org/2001/XMLSchema#integer>');
> // true
>     N3Util.isLiteral('"http://example.org/"'); // true
>

Right, here I'm speaking of encoding untyped literals using String. I.e.
encoding the first instance using simply { "Mickey Mouse" }. webr3's
original proof of concept had it, I reason against it.


>
> The performance surprise has everything to do with the difference between
> typed languages
> and the dynamic compilation of JavaScript, which is counterintuitive at
> some times.
>

And compounded by the fact that there's no one engine that we're supposed
to be using, and even within e.g. V8, it varies from release to release.

But I'm slightly confused, your post was about making appropriate design
decisions, now we've discussed what's more efficient. Well, what are we
looking for?

I demonstrated some very impressive functionality advantages to utilizing
instances of objects, and I believe also demonstrated that performance
differences in storing as a String vs. an Object is negligible, if not
favoring Objects, due to that very functionality (e.g. I can implement
indexing natively).

And please pardon me reducing everything down to JavaScript 101, it's not
just experts reading this list ;)

Austin.
Received on Monday, 4 November 2013 03:39:02 UTC