microdata: the properties of an item


The recent changes to this definition go a bit overboard in throwing away  
properties in order to prevent itemref loops.

Any kind of duplicate element when crawling properties on the current item  
or any of its subitems at any level causes all properties to be thrown  
away. This is much more than is needed to avoid loops, and it is easy to
trigger by accident in cases that are quite harmless:

<div itemscope itemref="x">
<div id="x" itemprop="p">foo</div>
</div>

(easy to get if you rearrange your markup a bit after adding microdata)

<div itemscope itemref="x x"></div>
<div id="x" itemprop="p">foo</div>

(much like duplicate class names, probably easy to get with  
machine-generated markup)
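
Both cases come down to the same element being collected twice. As a rough
illustration of why dropping the repeat is enough here, a minimal Python
sketch follows; the lists of element ids are a made-up stand-in for the
elements found while crawling, not a real DOM API:

```python
def dedupe(elements):
    """Keep the first occurrence of each collected element, in order."""
    return list(dict.fromkeys(elements))

# First example: "x" is found once as a descendant and once via itemref="x".
nested = dedupe(["x"] + "x".split())

# Second example: itemref="x x" names the same element twice.
repeated = dedupe("x x".split())

print(nested, repeated)  # ['x'] ['x']
```

In both cases the item ends up with the single property p and no loop is
involved.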

It's possible to implement this [1], but implementations following the  
spec strictly would be at a disadvantage to tools that don't do full  
checks. With cascading errors it also makes it risky to include subitems  
generated by third parties (code or people) without strict validation of  
these (compare XML and U+FFFE).

I suggest we do something closer to the bare minimum necessary to avoid
loops.

To crawl the properties of an item:

input: top-level item, current item and memory. on first invocation,  
top-level item=current item and memory=[]

1. if memory is not [] and current item is top-level item, it is
self-referring, fail.

2. if current item is in memory, return (to stop recursion).

3. collect all itemprop'd elements in child nodes and itemref'd
elements recursively into properties (stopping at itemscope)

4. remove any duplicates (these two steps can be optimized easily)

5. for each property which is an item, crawl the properties of that item  
with current item added to memory, top-level item unchanged, and current  
item=this property/item. if that fails, remove the property/item.

6. return properties.

This isn't exactly how I implemented it [2] and the algorithm may have  
bugs, but the general idea should be clear. You only need to consider  
elements with itemscope="" and itemprop="". If you think of these as  
creating a graph, remove any properties that are part of a loop (not those
that just lead into a loop).
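
One possible reading of the steps above, as a Python sketch rather than a
definitive implementation. The Element class and its candidates list (the
raw result of step 3) are hypothetical stand-ins for real DOM elements.
Two assumptions are made: step 1 is read as applying only once memory is
non-empty, and a failure in a nested crawl propagates upwards until it
removes a property of the top-level item, which is what the graph
description seems to imply.

```python
from dataclasses import dataclass, field


class LoopToTop(Exception):
    """Raised when crawling arrives back at the top-level item (step 1)."""


@dataclass(eq=False)  # compare by identity, so looping structures are safe
class Element:
    name: str                                        # itemprop name
    is_item: bool = False                            # True if it has itemscope
    candidates: list = field(default_factory=list)   # raw result of step 3


def crawl(top, current=None, memory=()):
    """Return the properties of current, minus those looping back to top."""
    if current is None:
        current = top
    if memory and current is top:                    # step 1 (assumed reading)
        raise LoopToTop
    if any(current is seen for seen in memory):      # step 2
        return []
    props = list(dict.fromkeys(current.candidates))  # step 4
    kept = []
    for prop in props:                               # step 5
        if prop.is_item:
            try:
                crawl(top, prop, memory + (current,))
            except LoopToTop:
                if memory:
                    raise      # assumption: nested failures propagate upwards
                continue       # at the top level: drop this property
        kept.append(prop)
    return kept                                      # step 6


top = Element("top", is_item=True)
a = Element("a", is_item=True)   # part of a loop: top -> a -> b -> top
b = Element("b", is_item=True)
c = Element("c", is_item=True)   # only leads into the loop d <-> e
d = Element("d", is_item=True)
e = Element("e", is_item=True)
p = Element("p")                 # plain property, collected twice
top.candidates = [p, p, a, c]
a.candidates = [b]
b.candidates = [top]
c.candidates = [d]
d.candidates = [e]
e.candidates = [d]

print([el.name for el in crawl(top)])  # ['p', 'c']
```

The duplicate p is collapsed, a is removed because it is part of a loop
through the top-level item, and c survives because it only leads into a
loop.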

Another somewhat sane option is ignoring all properties that lead to  
infinite recursion, i.e. as above but also including properties that lead  
into a loop. However I don't think this is a good idea as it propagates  
the error further than necessary and isn't really easier to implement in  
practice, in my experience.
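
To make the difference concrete, here is a small Python sketch of this
stricter option. The children mapping is a hypothetical graph from each
item to the items it has as properties, and leads_to_loop is an
illustrative helper, not part of any spec text:

```python
# Hypothetical graph: each item maps to the items it has as properties.
children = {
    "top": ["a", "c", "f"],
    "a": ["top"],   # a is part of a loop through top
    "c": ["d"],     # c only leads into the loop d <-> e
    "d": ["e"],
    "e": ["d"],
    "f": [],        # f is unaffected
}


def leads_to_loop(item, path=()):
    """True if crawling from item would eventually revisit an item."""
    if item in path:
        return True
    return any(leads_to_loop(child, path + (item,)) for child in children[item])


kept = [child for child in children["top"] if not leads_to_loop(child)]
print(kept)  # ['f']
```

Here c is thrown away even though only d and e form the loop, which is the
extra propagation mentioned above. Removing only the properties that are
part of a loop would keep both c and f.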

It's possible that this will have to be tweaked for performance after we  
have feedback from native browser implementations and that we will end up  
throwing away slightly more properties, but for now I think my suggestion  
above will suffice.


Philip Jägenstedt

Received on Tuesday, 2 February 2010 23:49:07 UTC