Re: RDFa DOM API feedback

Sorry for the slow reply, but better late than never...

On Mon, Jul 11, 2011 at 00:14, Manu Sporny <msporny@digitalbazaar.com> wrote:
> On 07/04/2011 06:02 PM, Philip Jägenstedt wrote:
>> I was mistaken, some of this is still problematic with the
>> DataDocument interface, which has getElementsByType,
>> getElementsBySubject and getElementsByProperty methods. These now
>> return NodeLists, but is it intentional that these collections are not
>> live?
>
> Yes, it was intentional.
>
> We /could/ return a live node list, but were concerned that it would
> hurt browser performance. This is an area where we could really use some
> of your input.
>
> My understanding is that the Microdata spec suffers from the same issue
> - if you add an element to the DOM that contains an itemscope statement,
> the code managing the live NodeList that getItems() returns would have
> to detect the addition and re-parse at least part of the document in
> order to update the .properties collection, no?

The live NodeList returned by getItems() contains all top-level
microdata items with a matching type. When the DOM is modified, it's
simple to tell whether the modification changes this collection: you
just check for the presence of the itemscope attribute, the absence
of itemprop, and potentially itemtype. There's no need to traverse
the entire document; you can update the collection directly.

It's a similar story with the properties collection: any element with
a matching itemprop attribute in a certain subtree is a match.
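
A minimal sketch of the two checks described above, written in Python
purely for illustration rather than against a real DOM. An element is
modelled as a plain mapping of attribute names to values, and the
function names are hypothetical. Treating itemtype and itemprop as
whitespace-separated tokens is an assumption of this sketch, and the
"in a certain subtree" condition for properties is deliberately left
out. The point is only that each decision reads nothing but the
element's own attributes.

```python
from typing import Mapping, Optional


def is_top_level_item(attrs: Mapping[str, str],
                      wanted_type: Optional[str] = None) -> bool:
    """Membership test for the item collection, using only the element itself."""
    if "itemscope" not in attrs:
        return False  # does not create an item at all
    if "itemprop" in attrs:
        return False  # it is a property of another item, so not top-level
    if wanted_type is not None:
        # Assumption: itemtype holds whitespace-separated type tokens.
        return wanted_type in attrs.get("itemtype", "").split()
    return True


def has_property(attrs: Mapping[str, str], name: str) -> bool:
    """Membership test for a named property, again from the element alone."""
    return name in attrs.get("itemprop", "").split()
```

Neither function looks at parents, children or siblings, which is
what makes it cheap to re-run on just the element that was modified.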

> We attempted to prevent this sort of mandatory re-parsing of the
> document unless the Web developer specifically requested it.

If the definition of the collection makes it too slow to have live
collections, then it's going to be slow for subsequent calls to the
API as well. AFAIK, all existing HTML collections can be expressed as
applying a simple test to each element in a certain subtree. If that
test requires looking at anything but the element itself (e.g.
traversing parents to find xmlns) then it'll be messy to implement
efficiently, it seems.
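
To illustrate the contrast, here is a rough Python sketch of a test
that cannot be answered from the element alone: resolving a prefix
such as "foaf" by walking up through the ancestors looking for an
xmlns: declaration. The Node class and resolve_prefix function are
hypothetical stand-ins, not part of any proposed API, and real RDFa
prefix handling has more rules than this.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element: attributes plus a parent link."""
    attrs: Dict[str, str] = field(default_factory=dict)
    parent: Optional["Node"] = None


def resolve_prefix(node: Optional[Node], prefix: str) -> Optional[str]:
    """Walk from the node towards the root until the prefix is declared."""
    key = "xmlns:" + prefix
    current = node
    while current is not None:
        if key in current.attrs:
            return current.attrs[key]
        current = current.parent
    return None  # prefix is not declared on the node or any ancestor


root = Node({"xmlns:foaf": "http://xmlns.com/foaf/0.1/"})
middle = Node(parent=root)
leaf = Node({"typeof": "foaf:Person"}, parent=middle)
```

Here the meaning of typeof on leaf depends on an attribute of root,
so a change to root alone can alter what leaf matches without leaf
itself being touched.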

>> For all three methods, the order must also be defined.
>
> Would it be acceptable if sorted in triple generation order? That is, as
> triples are generated by the processor, they're added to the default
> Graph in order? That's deterministic and should be easy to do if people
> follow the processing rules. Any additional triples added to the graph,
> from say a TURTLE parser or JSON-LD parser, could be added to the "end".
> We would still need to discuss the ramifications of this, of course.

Any order which is efficiently implementable would be fine, as long as
it is defined. Microdata uses tree order.

>> All of the mess I originally outlined also applies to
>> DocumentData.getSubjects or getValues. Unless the information can be
>> cached, implementation is not feasible.
>
> Define "cached". Can there be a delay to the cache? To propose an overly
> simplistic strawman mechanism: the first call to the getSubjects()
> mechanism forces a parse of the document, but each subsequent call for
> the next 1000ms uses the cached values?

Updating it periodically would cause the results to vary at random
depending on how fast the script runs, which would not be very nice.

> Would you be okay if the document is re-parsed completely if a new RDFa
> or Microdata attribute is detected in the inserted DOM elements? What
> about a .structuredDataDirty flag that notifies the web developer that
> they should manually re-parse?

Traversing the entire document for any change is going to be very
slow. Maintaining a structuredDataDirty flag is also not trivial:
wouldn't you need to know what the new collection is in order to say
that it has changed? If you do, why not just make the collection live?

> I don't see how both Microdata and RDFa would be able to give anyone
> /live/ updates as both seem to have algorithms that require either part
> or all of the document to be re-processed. That could kill performance
> if the DOM is being updated with Microdata/RDFa items 100+ times per second.
>
> We could introduce a delay/throttle to the cache and a callback when new
> RDFa data is detected. Which one of these strategies seems most likely
> to address your concerns? Is there another approach that would be better?

If you want something that works like existing NodeList collections,
then the key is really to have a simple criterion for inclusion, such
that one can easily check, when the DOM is modified, whether the
collection(s) also need updating.
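
As a rough illustration of that idea (a Python sketch, not how any
browser actually implements it): a collection that is given a simple
per-element test and is told about each element that was inserted,
removed or had an attribute changed. The LiveCollection class, its
element_changed method and the dict-based elements are all
hypothetical, and ordering is ignored here even though a real
collection would need a defined order.

```python
from typing import Callable, Dict, List

Element = Dict[str, str]  # hypothetical stand-in: an element is just its attributes


def creates_top_level_item(element: Element) -> bool:
    """Simple inclusion criterion that only inspects the element itself."""
    return "itemscope" in element and "itemprop" not in element


class LiveCollection:
    def __init__(self, matches: Callable[[Element], bool]) -> None:
        self._matches = matches
        self._items: List[Element] = []

    def element_changed(self, element: Element, in_document: bool) -> None:
        """Re-test one element after an insert, removal or attribute change."""
        # Identity comparison, because two distinct elements may have equal attributes.
        present = any(item is element for item in self._items)
        wanted = in_document and self._matches(element)
        if wanted and not present:
            self._items.append(element)
        elif present and not wanted:
            self._items = [item for item in self._items if item is not element]

    def __len__(self) -> int:
        return len(self._items)


items = LiveCollection(creates_top_level_item)
person = {"itemscope": ""}
items.element_changed(person, in_document=True)   # inserted: joins the collection
person["itemprop"] = "author"
items.element_changed(person, in_document=True)   # now a property: drops out
```

Each notification costs one test on one element, and no part of the
document is re-parsed.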

>>>> == getTriplesByType type? ==
>>>>
>>>> Some underlying assumptions about the model appear to be unstated
>>>> here. Specifically, is it type as in @datatype or as in @typeof? (I'd
>>>> guess it's @typeof.)
>>>>
>>>> What if a <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> predicate
>>>> is used explicitly, not using the @typeof shorthand?
>>
>> This question still applies to getElementsByType
>
> Ah, good catch. Yes, it is referring to @typeof, not @datatype. If
> someone uses <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> instead
> of the @typeof shorthand to specify the type of the subject,
> getElementsByType() should still return the element. That is, it doesn't
> matter how rdf:type was set, getElementsByType should return the element
> regardless. The underlying model is expected to query rdf:type
> regardless of how it is set.

OK, thanks for clarifying. This implies that it's necessary to create
an internal RDF graph for the entire document in order to support
getTriplesByType, right?

>>>> == RDFa Profiles ==

It looks like profiles are being dropped, so let's ignore my feedback
on that for now.

Finally, a disclaimer. While I do work for Opera, I'm not aware of any
plans for Opera to support RDFa or this API. My feedback should not be
taken with the full weight of a potential implementor; this is just
the private me trying to understand RDFa a little better.

-- 
Philip Jägenstedt

Received on Thursday, 21 July 2011 18:30:15 UTC