RDFa DOM API feedback from Philip Jägenstedt on 2011-07-02 (public-rdfa-wg@w3.org from July 2011)

From: Philip Jägenstedt <philip@foolip.org>
Date: Sat, 2 Jul 2011 23:03:08 +0200
To: public-rdfa-wg@w3.org
Message-ID: <CAKHWUkbbo8i7EyrcA3Yh6Ec5WD9ER2MjwCo33WuoUe7iQ2JysA@mail.gmail.com>
I've been following RDFa and Microdata for a while now, and have toyed
around a bit with things like
<https://gitorious.org/microdatajs/microdatajs> and
<http://foolip.org/microdatajs/live/>. As you might guess, I'm rather
interested in DOM APIs, so I thought I'd take a look at
<http://dev.w3.org/rdfa/specs/rdfa-dom-api.html> and provide some
feedback. (Although I work for Opera Software, I'm not representing
Opera in any way in this feedback.)

Since I'm not subscribed to public-rdfa-wg, please try to CC me in replies.

== HTML ==

The spec only references the 2008 RDFa in XHTML REC, not
<http://dev.w3.org/html5/rdfa/>. Is this an oversight?

Note that just deferring to the two RDFa specs implies different
processing requirements depending on XHTML and HTML. Different
behavior of APIs depending on XHTML/HTML is not going to be well
received by browser implementors, as it creates all kinds of problems.
(Consider, for example, what happens when a DOM is created entirely by
script or when a subtree of a text/html document is moved by script to
an application/xhtml+xml document. It just doesn't make sense to switch
between different modes here.)

I'm going to assume in the following that the intention is for there
to be a single API spec covering both serializations.

== getTriplesByType type? ==

Some underlying assumptions about the model appear to be unstated
here. Specifically, is it type as in @datatype or as in @typeof? (I'd
guess it's @typeof.)

What if a <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> predicate
is used explicitly, not using the @typeof shorthand?
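
To make the question concrete, here is a minimal sketch in plain
JavaScript of one possible reading, in which "type" is judged purely on
the graph. It assumes triples are plain objects; the helper name
subjectsOfType, the "via" field and the sample data are hypothetical
and not taken from the spec.

```javascript
const RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";
const FOAF = "http://xmlns.com/foaf/0.1/";

// Hypothetical parsed triples. "via" records how the markup expressed
// the triple and exists only for illustration.
const triples = [
  { subject: "#alice", predicate: RDF_TYPE, object: FOAF + "Person", via: "typeof" },
  { subject: "#bob", predicate: RDF_TYPE, object: FOAF + "Person", via: "explicit predicate" },
  { subject: "#alice", predicate: FOAF + "name", object: "Alice", via: "property" },
];

// Hypothetical helper: subjects that have an rdf:type triple for `type`,
// regardless of which attribute produced that triple.
function subjectsOfType(allTriples, type) {
  return allTriples
    .filter((t) => t.predicate === RDF_TYPE && t.object === type)
    .map((t) => t.subject);
}

const people = subjectsOfType(triples, FOAF + "Person");
console.log(people);
```

Under this reading both subjects match. If the spec intends only
@typeof to count, the second subject would be excluded, which is why
the definition needs to be explicit.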

== getTriplesByType order ==

Which order should triples be returned in? This must be well-defined
in order for the API to be interoperably implementable.

== Merging of triples ==

If the same triple is expressed several times in the document, are
the instances merged, or will two instances of the same triple be
returned by getTriplesByType?

== Triple.language ==

Is the language normalized, and how? If the language is given with
lang="sv_FI" in markup, what does .language return? If there is any
merging of triples, does that happen before or after language
normalization?
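
The following sketch shows why the order matters. It is plain
JavaScript, not the spec's API: normalizeLanguage, tripleKey and
mergeTriples are hypothetical helpers, and the normalization rule
(lowercase, "_" treated as "-") is only an assumption chosen for
illustration.

```javascript
// Hypothetical normalization rule; the spec would have to say whether
// anything like this happens at all.
function normalizeLanguage(tag) {
  return tag.toLowerCase().replace(/_/g, "-");
}

// Key used to decide whether two triples count as "the same".
function tripleKey(t) {
  return JSON.stringify([t.subject, t.predicate, t.object, t.language]);
}

// Merge duplicates, keeping the first occurrence of each key.
function mergeTriples(triples) {
  const seen = new Map();
  for (const t of triples) {
    const key = tripleKey(t);
    if (!seen.has(key)) seen.set(key, t);
  }
  return [...seen.values()];
}

const TITLE = "http://purl.org/dc/terms/title";
const parsed = [
  { subject: "#doc", predicate: TITLE, object: "Hej", language: "sv_FI" },
  { subject: "#doc", predicate: TITLE, object: "Hej", language: "sv-fi" },
];

// Merge first, normalize afterwards: the two triples stay distinct.
const mergeThenNormalize = mergeTriples(parsed).map((t) => ({
  ...t,
  language: normalizeLanguage(t.language),
}));

// Normalize first, merge afterwards: they collapse into one triple.
const normalizeThenMerge = mergeTriples(
  parsed.map((t) => ({ ...t, language: normalizeLanguage(t.language) }))
);

console.log(mergeThenNormalize.length, normalizeThenMerge.length);
```

With these assumptions the first strategy yields two triples with
identical content, while the second yields one.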

== Dynamic changes ==

getTriplesByType returns a static array of Triples. In contrast, a lot
of HTML APIs return a live NodeList, so that changes to the document
are reflected in that NodeList. This is the case with e.g.
getElementsByTagName and the Microdata getItems API. If returning a
static array is intentional, can you make that more explicit by saying
that it is the triples that are in the document at the time of
invocation that are returned?
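
The difference can be sketched without a browser. In the plain
JavaScript below, fakeDocument, getStaticTriples and getLiveTriples are
hypothetical stand-ins, not real DOM or RDFa API names; the live
variant only imitates how a NodeList re-evaluates its query on access.

```javascript
// A stand-in for a document: just a mutable list of typed subjects.
const fakeDocument = { triples: [{ subject: "#a", type: "Person" }] };

// Static: a snapshot of the matching triples at the time of invocation.
function getStaticTriples(doc, type) {
  return doc.triples.filter((t) => t.type === type);
}

// Live: every read re-evaluates the query against the current document.
function getLiveTriples(doc, type) {
  const current = () => doc.triples.filter((t) => t.type === type);
  return {
    get length() {
      return current().length;
    },
    item(index) {
      return current()[index] || null;
    },
  };
}

const staticResult = getStaticTriples(fakeDocument, "Person");
const liveResult = getLiveTriples(fakeDocument, "Person");

// The document changes after both calls have returned.
fakeDocument.triples.push({ subject: "#b", type: "Person" });

console.log(staticResult.length); // still the snapshot
console.log(liveResult.length);   // reflects the change
```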

== RDF graph vs DOM disconnect ==

The API is readonly and seems to completely disconnect the RDF graph
from the DOM from which it is parsed. This makes it impossible to use
the API to, e.g., change the style of all elements that declare a
subject with type <http://xmlns.com/foaf/0.1/Person>, which would seem
to be one of the main use cases for having an API at all.

There's another serious issue here, best illustrated by an example:

1. getTriplesByType() is called. If this is the first time it is
called, the entire document must be traversed to build an RDF graph.
2. Elements/attributes are added/removed by script.
3. getTriplesByType() is called again.

At step 3, does the entire document need to be traversed again? In
other words, is it possible to efficiently cache the graph? Caching
would amount to storing bindings between each element/attribute and
the role it plays in the graph. Consider for example if the @lang
attribute of some element is changed. To update the graph, it's
necessary to know which triples have their language sourced from the
element or any of its children. Adding/removing/updating xmlns
attributes would be similarly messy. With many attributes influencing
the graph, there's going to be a *lot* of bindings to keep track of,
and in practice the graph is going to be extremely tightly coupled to
the DOM.
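
As a rough illustration of the bookkeeping for @lang alone, the sketch
below models a tiny element tree in plain JavaScript. All names
(makeElement, effectiveLang, affectedByLangChange) are hypothetical,
and it assumes language is simply inherited from the nearest ancestor
that sets it.

```javascript
// Hypothetical element: a name, an optional lang, and child elements.
function makeElement(name, lang, children = []) {
  const el = { name, lang, parent: null, children };
  for (const child of children) child.parent = el;
  return el;
}

// Language in effect for an element: nearest ancestor-or-self with a lang.
function effectiveLang(el) {
  for (let node = el; node; node = node.parent) {
    if (node.lang) return node.lang;
  }
  return "";
}

// Elements whose triples would need recomputing when `changed.lang` is
// modified: the element itself plus every descendant that inherits from
// it. A descendant with its own lang shields its whole subtree.
function affectedByLangChange(changed) {
  const names = [];
  function walk(el) {
    names.push(el.name);
    for (const child of el.children) {
      if (!child.lang) walk(child);
    }
  }
  walk(changed);
  return names;
}

const em = makeElement("em", "");
const note = makeElement("note", "en", [em]);
const title = makeElement("title", "");
const section = makeElement("section", "sv", [title, note]);

console.log(effectiveLang(title), effectiveLang(em));
console.log(affectedByLangChange(section));
```

A cache would need comparable invalidation rules for every other
attribute that influences the graph, each with its own scoping.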

Note that browsers have some infrastructure for updating collections
dynamically for things like getElementsByTagName, but it's usually a
lot simpler as it's only a single aspect of the element that is
considered and it maintains a collection of elements directly, not a
separate structure (RDF graph) parsed from them.

IMO, a better approach would be an API returning a live NodeList where
the criteria for inclusion/exclusion are much simpler.

== Triple.children ==

children seems a very strange thing to call all triples involving the
same subject, is this really intentional? Regardless, my main question
is related to the above. If getTriplesByType returns a static array,
what does later inspecting triple.children return? Is it the
"children" that triple had at the time getTriplesByType was called, or
something else? If it is the former, then it implies that each call to
getTriplesByType must find all "children" up-front, as it's not
possible to wait until the children IDL attribute is actually read to
find the "children". This seems extremely wasteful. If it is something
else, then the result array isn't static at all. Either way, this must
be defined.
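
The two readings can be compared in a small sketch. This is plain
JavaScript over an in-memory array; eagerResults, lazyResults and
findChildren are hypothetical names, and "children" is taken to mean
all triples sharing the same subject, as described above.

```javascript
let scans = 0; // counts how often the store is scanned for "children"

function findChildren(store, subject) {
  scans += 1;
  return store.filter((t) => t.subject === subject);
}

// Eager: children are computed for every result up-front, giving a true
// snapshot at the cost of one scan per returned triple.
function eagerResults(store) {
  return store.map((t) => ({ ...t, children: findChildren(store, t.subject) }));
}

// Lazy: children are computed when the attribute is read, so they
// reflect the store at read time rather than at query time.
function lazyResults(store) {
  return store.map((t) => ({
    ...t,
    get children() {
      return findChildren(store, t.subject);
    },
  }));
}

const store = [
  { subject: "#a", predicate: "name" },
  { subject: "#a", predicate: "mbox" },
  { subject: "#b", predicate: "name" },
];

const eager = eagerResults(store);
const scansAfterEager = scans;
const lazy = lazyResults(store);
const scansAfterLazyQuery = scans;

// The document changes after both queries have returned.
store.push({ subject: "#a", predicate: "homepage" });

const eagerCount = eager[0].children.length;
const lazyCount = lazy[0].children.length;
console.log(scansAfterEager, scansAfterLazyQuery, eagerCount, lazyCount);
```

The eager variant pays for every result even if children is never
read; the lazy variant is cheap but is no longer a static result.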

== RDFa Profiles ==

Is it intended that the DOM API work with RDFa Profiles? Supporting it
in browsers seems fairly problematic.

1. Consider this script:

var e = document.querySelector("[profile]");
e.profile = "http://example.com/previously-unseen-profile";
document.getTriplesByType("http://example.com/some-type");

This clearly will not work, since the browser won't synchronously
download and parse the profile while the script is running. Given
this, how is a script to know when the API is safe to use?

2. Should browsers preemptively fetch/parse all profiles in a
document, even though 99% of documents won't use the getTriplesByType
API? Should that delay the document load event? (related to the above
question)

3. If a profile actually becomes widely used, aren't you worried about
the DDoS that will result? Compare to the problems with DTDs outlined in
<http://lists.w3.org/Archives/Public/public-html/2008Jul/0269.html>.

4. Should browsers have Turtle and RDF/XML parsers to handle the case
where the profile is using those syntaxes? MAY is a keyword for
interoperability disaster, at least in the context of web browsers...

== The End ==

Thanks for reading all the way through! If I discover more issues,
I'll follow up with more mail. Finally, I'd like to invite everyone to
provide technical feedback about issues with Microdata, if you haven't
already.

-- 
Philip Jägenstedt
Received on Monday, 4 July 2011 12:21:07 UTC