Re: RDFa DOM API feedback from Manu Sporny on 2011-07-10 (public-rdfa-wg@w3.org from July 2011)

From: Manu Sporny <msporny@digitalbazaar.com>
Date: Sun, 10 Jul 2011 18:14:30 -0400
To: Philip Jägenstedt <philip@foolip.org>
CC: public-rdfa-wg@w3.org
Message-ID: <4E1A2446.3050908@digitalbazaar.com>

On 07/04/2011 06:02 PM, Philip Jägenstedt wrote:
> I was mistaken, some of this is still problematic with the
> DataDocument interface, which has getElementsByType,
> getElementsBySubject and getElementsByProperty methods. These now
> return NodeLists, but is it intentional that these collections are not
> live?

Yes, it was intentional.

We /could/ return a live node list, but were concerned that it would
hurt browser performance. This is an area where we could really use some
of your input.

My understanding is that the Microdata spec suffers from the same issue
- if you add an element to the DOM that contains an itemscope statement,
the code managing the live NodeList that getItems() returns would have
to detect the addition and re-parse at least part of the document in
order to update the .properties collection, no?

We attempted to prevent this sort of mandatory re-parsing of the
document unless the Web developer specifically requested it.

> For all three methods, the order must also be defined.

Would it be acceptable if sorted in triple generation order? That is, as
triples are generated by the processor, they're added to the default
Graph in order? That's deterministic and should be easy to do if people
follow the processing rules. Any additional triples added to the graph,
from say a Turtle parser or JSON-LD parser, could be added to the "end".
We would still need to discuss the ramifications of this, of course.
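
As a rough sketch of what "triple generation order" could mean in
practice (OrderedGraph and its methods are names made up for this
example, not anything in the API draft), the graph just keeps an
append-only list and later parsers append after the RDFa triples:

```javascript
// Hypothetical sketch: a graph that keeps triples in generation order.
// OrderedGraph, add() and subjects() are illustrative names only.
class OrderedGraph {
  constructor() {
    this.triples = [];     // insertion order doubles as the sort order
    this.seen = new Set(); // guards against duplicate triples
  }

  add(subject, predicate, object) {
    const key = JSON.stringify([subject, predicate, object]);
    if (this.seen.has(key)) return false; // keep the first position
    this.seen.add(key);
    this.triples.push({ subject, predicate, object });
    return true;
  }

  subjects() {
    // A Set iterates in first-insertion order, so subjects come back
    // in the order their first triple was generated.
    return [...new Set(this.triples.map(t => t.subject))];
  }
}

const graph = new OrderedGraph();
graph.add("#b", "foaf:name", "Bob");
graph.add("#a", "foaf:name", "Alice");
graph.add("#b", "foaf:knows", "#a");
// Triples from a later parser (say, Turtle) simply go to the end.
graph.add("#c", "foaf:name", "Carol");
const orderedSubjects = graph.subjects();
```

Duplicate triples keep their first position here, which is one choice
among several we would have to settle on.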

> All of the mess I originally outlined also applies to
> DocumentData.getSubjects or getValues. Unless the information can be
> cached, implementation is not feasible.

Define "cached". Can there be a delay to the cache? To propose an overly
simplistic strawman mechanism: the first call to getSubjects() forces a
parse of the document, but each subsequent call for the next 1000ms
uses the cached values?
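
To make the strawman concrete, here is a minimal sketch. The
makeCachedGetter helper and the parseDocument callback are
hypothetical, and the clock is injectable only so the example can be
run without waiting:

```javascript
// Hypothetical sketch of the time-limited cache strawman.
// parseDocument stands in for a full RDFa parse of the document.
function makeCachedGetter(parseDocument, ttlMs, now = Date.now) {
  let cached = null;
  let parsedAt = -Infinity;
  return function getSubjects() {
    const t = now();
    // Re-parse only if nothing is cached or the window has expired.
    if (cached === null || t - parsedAt >= ttlMs) {
      cached = parseDocument();
      parsedAt = t;
    }
    return cached;
  };
}

let parseCount = 0;
let fakeTime = 0;
const getSubjects = makeCachedGetter(
  () => { parseCount += 1; return ["#me"]; },
  1000,
  () => fakeTime
);

getSubjects();   // first call forces a parse
fakeTime = 500;
getSubjects();   // inside the 1000ms window: served from the cache
fakeTime = 1500;
getSubjects();   // window expired: parsed again
```

The obvious downside is that a caller can see values that are up to
1000ms stale.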

Would you be okay if the document is re-parsed completely if a new RDFa
or Microdata attribute is detected in the inserted DOM elements? What
about a .structuredDataDirty flag that notifies the web developer that
they should manually re-parse?
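
A sketch of the dirty-flag variant, with heavy assumptions: elements
are modeled as plain objects so this runs outside a browser, the
attribute list is illustrative rather than complete, and
StructuredDataTracker is a made-up name:

```javascript
// Hypothetical sketch of a .structuredDataDirty flag.
// The attribute list below is illustrative, not exhaustive.
const STRUCTURED_ATTRS = [
  "about", "typeof", "property", "resource", // RDFa
  "itemscope", "itemprop",                   // Microdata
];

class StructuredDataTracker {
  constructor() {
    this.structuredDataDirty = false;
  }

  // Would be called for every inserted element; a browser would hook
  // this into its DOM mutation handling.
  elementInserted(element) {
    if (STRUCTURED_ATTRS.some(name => name in element.attributes)) {
      this.structuredDataDirty = true;
    }
  }

  // The developer re-parses explicitly, which clears the flag.
  reparse() {
    this.structuredDataDirty = false;
  }
}

const tracker = new StructuredDataTracker();
tracker.elementInserted({ attributes: { class: "note" } });
const dirtyAfterPlain = tracker.structuredDataDirty;
tracker.elementInserted({ attributes: { property: "foaf:name" } });
const dirtyAfterRdfa = tracker.structuredDataDirty;
```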

I don't see how either Microdata or RDFa would be able to give anyone
/live/ updates, as both seem to have algorithms that require either part
or all of the document to be re-processed. That could kill performance
if the DOM is being updated with Microdata/RDFa items 100+ times per second.

We could introduce a delay/throttle to the cache and a callback when new
RDFa data is detected. Which one of these strategies seems most likely
to address your concerns? Is there another approach that would be better?

>>> == getTriplesByType type? ==
>>>
>>> Some underlying assumptions about the model appear to be unstated
>>> here. Specifically, is it type as in @datatype or as in @typeof? (I'd
>>> guess it's @typeof.)
>>>
>>> What if a <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> predicate
>>> is used explicitly, not using the @typeof shorthand?
> 
> This question still applies to getElementsByType

Ah, good catch. Yes, it is referring to @typeof, not @datatype. If
someone uses <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> instead
of the @typeof shorthand to specify the type of the subject,
getElementsByType() should still return the element. The underlying
model is expected to query rdf:type regardless of how it was set.
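
In other words, something like the following sketch, where the lookup
only ever looks at rdf:type triples. The triple shape, including the
"element" field that records which element produced the triple, is an
assumption made for this example:

```javascript
const RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";
const FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person";

// Hypothetical sketch: query rdf:type triples, however they were stated.
function getElementsByType(triples, type) {
  const elements = [];
  for (const t of triples) {
    const matches = t.predicate === RDF_TYPE && t.object === type;
    if (matches && !elements.includes(t.element)) {
      elements.push(t.element);
    }
  }
  return elements;
}

const typeTriples = [
  // generated from the @typeof shorthand
  { subject: "_:a", predicate: RDF_TYPE, object: FOAF_PERSON, element: "div#a" },
  // generated from an explicitly written rdf:type predicate
  { subject: "#b", predicate: RDF_TYPE, object: FOAF_PERSON, element: "div#b" },
  // unrelated triple, ignored by the lookup
  { subject: "#b", predicate: "http://xmlns.com/foaf/0.1/name", object: "Bob", element: "span#n" },
];

const people = getElementsByType(typeTriples, FOAF_PERSON);
```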

>>> == RDFa Profiles ==
>>>
>>> Is it intended that the DOM API work with RDFa Profiles? Supporting it
>>> in browsers seems fairly problematic.
>>>
>>> 1. Consider this script:
>>>
>>> var e = document.querySelector("[profile]");
>>> e.profile = "http://example.com/previously-unseen-profile";
>>> document.getTriplesByType("http://examples.com/some-type");
>>>
>>> This clearly will not work, since the browser won't synchronously
>>> download and parse the profile while the script is running. Given
>>> this, how is a script to know when the API is safe to use?

We had been discussing this a few months ago and had thought that we
could perform some of this work in something like a Web Worker and block
the RDFa API until all profiles are loaded.

>>> 2. Should browsers preemptively fetch/parse all profiles in a
>>> document, even though 99% of documents won't use the getTriplesByType
>>> API?

Well, the RDFa profile isn't just for getTriplesByType(). The profile
can define prefixes and terms, like so:

foaf -> http://xmlns.com/foaf/0.1/
name -> http://xmlns.com/foaf/0.1/name

so that people can mark up content like so:

<span property="foaf:name">Philip Jägenstedt</span>

or like so:

<span property="name">Philip Jägenstedt</span>
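
Both forms end up as the same IRI. A simplified sketch of that
resolution follows; the profile object layout and resolveProperty are
hypothetical, and the real processing rules cover more cases than this:

```javascript
// Hypothetical in-memory form of the profile shown above.
const profile = {
  prefixes: { foaf: "http://xmlns.com/foaf/0.1/" },
  terms: { name: "http://xmlns.com/foaf/0.1/name" },
};

const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Simplified: resolves a term or a prefix:reference pair to an IRI.
function resolveProperty(value, profile) {
  const colon = value.indexOf(":");
  if (colon === -1) {
    // No colon: the value must be a term defined by the profile.
    return has(profile.terms, value) ? profile.terms[value] : null;
  }
  const prefix = value.slice(0, colon);
  const reference = value.slice(colon + 1);
  if (has(profile.prefixes, prefix)) {
    return profile.prefixes[prefix] + reference;
  }
  return value; // unknown prefix: assume it is already an absolute IRI
}

const fromPrefix = resolveProperty("foaf:name", profile);
const fromTerm = resolveProperty("name", profile);
```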

Ideally, we wanted to delay the fetching of profiles until the Web
developer called one of the RDFa API methods. That way, not everyone has
to pay the structured data tax.

>>> Should that delay the document load event? (related to the above
>>> question)

I think we should avoid delaying the document load event. Perhaps there
should be a new event fired when the RDFa document is ready to be
processed? Or perhaps we should delay the retrieval of the profile
documents until a program makes a call to the RDFa API?
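
A sketch of that last option, where nothing is fetched until the
first API call. makeLazyProfiles, whenReady and the fetchProfile
callback are hypothetical stand-ins (fetchProfile for a real network
fetch):

```javascript
// Hypothetical sketch: defer profile retrieval until the API is used.
function makeLazyProfiles(profileUrls, fetchProfile) {
  let ready = null; // no fetch has been started yet
  return {
    whenReady() {
      if (ready === null) {
        // First API call: start fetching every profile exactly once.
        ready = Promise.all(profileUrls.map(url => fetchProfile(url)));
      }
      return ready; // later calls share the same promise
    },
  };
}

const fetched = [];
const lazyProfiles = makeLazyProfiles(
  ["http://example.com/profile"],
  url => { fetched.push(url); return Promise.resolve({ url, terms: {} }); }
);

const fetchedBeforeUse = fetched.length; // nothing fetched at load time
lazyProfiles.whenReady();
lazyProfiles.whenReady();                // reuses the pending promise
const fetchedAfterUse = fetched.length;
```

A "ready" event could be fired when the promise resolves, which
would cover the first option as well.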

> Should I perhaps just file individual bugs? Discussing so many issues
> in a single email thread is probably going to be messy...

Unfortunately, we don't have a Bugzilla bug tracker. I'll open issues for
each of these items and point you to them. That will help us ensure that
we deal with all of them as a Working Group.

Thanks for the detailed feedback, Philip - it's very much appreciated. :)

-- manu

-- 
Manu Sporny (skype: msporny, twitter: manusporny)
President/CEO - Digital Bazaar, Inc.
blog: PaySwarm Developer Tools and Demo Released
http://digitalbazaar.com/2011/05/05/payswarm-sandbox/
Received on Sunday, 10 July 2011 22:15:18 UTC