- From: Emiliano Martinez Luque <martinezluque@gmail.com>
- Date: Sun, 16 Jan 2011 11:04:02 -0300
First of all, I would like to say hello to the whatwg community and introduce myself. My name is Emiliano Martínez Luque and I used to be (sort of) active in the microformats community (I wrote a parser/extractor/validator: http://code.google.com/p/xmfp). I have been reviewing the microdata specification (and the mailing list archives) and I'm interested in writing a parser/extractor. I have a variety of questions regarding implementation and regarding certain details of the microdata spec.

Before I ask them, I want to say that I consider the microdata specification to be a huge step forward. As somebody who is interested in writing applications that consume structured data from the web, I consider the clear separation of the syntax for representing the data from the vocabularies being represented a definite (and qualitative) advance. I would also like to praise the simplicity and clarity of the specification.

These are my questions:

1) The specification does not define any mechanism for an application using microdata to deal with possible misuses of data vocabularies. For example, let's say a web developer intends to mark up a data vocabulary for cats (I'm basing this on the examples in the spec). The name-value pairs he intends to mark up are the following (expressed in JSON notation):

    { name:"Hedral", color:"black" }

Based on the examples in the spec this could be marked up as:

    <section itemscope itemtype="http://example.org/animals#cat">
      <h1 itemprop="name">Hedral</h1>
      <p itemprop="color">black</p>
    </section>

However, we could assume that authors might sometimes mistype the names of the item properties.
In the example:

    <section itemscope itemtype="http://example.org/animals#cat">
      <h1 itemprop="nme">Hedral</h1>
      <p itemprop="colr">black</p>
    </section>

Which a processor might interpret as:

    { nme:"Hedral", colr:"black" }

I could easily imagine other misuses, for example an itemprop that should be represented as a simple name-value pair being represented as a full item with itemscope, or vice versa. Since there is no mechanism specified in the spec for defining and validating the vocabularies being extracted from the microdata, what is the proposed course of action for an implementation in a case like this? Or should applications always assume that the data has been correctly marked up? Which brings me to question 2.

2) The spec specifies that item types should be identified by URLs. It is not completely clear (or at least not clear to me) whether the URL string serves as a URI that unambiguously identifies the item type, as a URL for a document that defines that item type, or both. Which is the case?

In the case that it represents a document, I would like to know which formats are being considered, and whether the general idea is to have a single format or to let data vocabularies be defined in a variety of formats. I would also like to know if there is any working group/community/forum that is working specifically on producing a simple, machine-processable format for defining and validating data vocabularies, and what documentation they are producing.

If there is no work on this I would like to propose the following. For the purpose of simply validating:

- correct names
- correct types (whether it's a name:value pair or a full item)
- correct number of occurrences (whether it can be an array of values or just a single value, whether it is required or not)

it would suffice to specify a data structure with the following attributes: property-name, occurrences and childs.
This assumes that if a property has childs then its value is a full item, rather than a simple text value. This could easily be represented in JSON with something like:

    {
      property_name:"name of the property as used in itemprop",
      occurrences:"*",
      childs:[ {}, {}, {}... ]
    }

Where childs could be an array of data property definitions, for example:

    {
      property_name:"name of the property as used in itemprop",
      occurrences:"*",
      childs:[
        {
          property_name:"name of the property as used in itemprop for the first child",
          occurrences:"1"
        },
        {
          property_name:"name of the property as used in itemprop for the second child",
          occurrences:"*",
          childs:[ {}, ...]
        }
      ]
    }

This could even be represented in microdata itself:

    <div itemscope itemtype="datavocabularies.com/microdata">
      <p itemprop="property_name">name of the property</p>
      <p itemprop="occurrences">*</p>
      <div itemprop="childs" itemscope itemtype="datavocabularies.com/microdata">
        <p itemprop="property_name">name of the property for the first child</p>
        <p itemprop="occurrences">1</p>
      </div>
      <div itemprop="childs" itemscope itemtype="datavocabularies.com/microdata">
        <p itemprop="property_name">name of the property for the second child</p>
        <p itemprop="occurrences">1</p>
      </div>
    </div>

An application could easily implement this. For example, an implementation in C of this simple recursive data structure could be:

    struct data_prop {
        char property_name[ PROPERTY_NAME_MAX_LENGTH ];
        char occurrences[1];
        struct data_prop *childs[ DATA_PROPERTY_MAX_CHILDS ];
    };

Where occurrences could be represented by a subset of Unix regexp constants (say: *, +, ?, 1). (Of course an extra attribute of (int) number_of_childs would be needed for this to be of any use in an actual C program; I'm just trying to provide an example in a common language.)

In this sense an application consuming microdata could receive 2 inputs: the html document containing the microdata and the set of data vocabulary definitions against which to validate the represented microdata.
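As a rough illustration of how such definitions could drive validation, here is a minimal C sketch. It is not part of the microdata spec or of any existing library. The item_prop structure (a stand-in for whatever an extractor produces), the functions find_def, count_ok, validate_item and demo_validate_cat, and the two size limits are all hypothetical. The sketch also assumes the number_of_childs field mentioned above and stores occurrences as a single char.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical limits; the struct in the text leaves them undefined. */
#define PROPERTY_NAME_MAX_LENGTH 64
#define DATA_PROPERTY_MAX_CHILDS 16

/* Vocabulary definition as proposed in the text, plus number_of_childs. */
struct data_prop {
    char property_name[PROPERTY_NAME_MAX_LENGTH];
    char occurrences;                 /* one of '*', '+', '?', '1' */
    size_t number_of_childs;          /* 0 means a plain text value */
    const struct data_prop *childs[DATA_PROPERTY_MAX_CHILDS];
};

/* Hypothetical shape of a property produced by a microdata extractor. */
struct item_prop {
    const char *name;
    int is_item;                      /* non-zero if the value is a full item */
    size_t number_of_childs;
    const struct item_prop *childs[DATA_PROPERTY_MAX_CHILDS];
};

/* Look up the definition of a child property by name, or NULL if unknown. */
static const struct data_prop *find_def(const struct data_prop *def,
                                        const char *name)
{
    for (size_t i = 0; i < def->number_of_childs; i++)
        if (strcmp(def->childs[i]->property_name, name) == 0)
            return def->childs[i];
    return NULL;
}

/* Check a count n against one of the occurrence constants. */
static int count_ok(char occurrences, size_t n)
{
    switch (occurrences) {
    case '*': return 1;
    case '+': return n >= 1;
    case '?': return n <= 1;
    case '1': return n == 1;
    default:  return 0;               /* unknown constant: reject */
    }
}

/* Return 1 if item matches def (names, kinds, occurrence counts), else 0. */
int validate_item(const struct data_prop *def, const struct item_prop *item)
{
    assert(def != NULL && item != NULL);

    /* Every extracted property must be defined and be of the right kind. */
    for (size_t i = 0; i < item->number_of_childs; i++) {
        const struct item_prop *p = item->childs[i];
        const struct data_prop *d = find_def(def, p->name);
        if (d == NULL)
            return 0;                 /* unknown name, e.g. "nme" */
        if ((d->number_of_childs > 0) != (p->is_item != 0))
            return 0;                 /* text value where item expected, or vice versa */
        if (p->is_item && !validate_item(d, p))
            return 0;
    }

    /* Every defined property must occur an allowed number of times. */
    for (size_t i = 0; i < def->number_of_childs; i++) {
        size_t n = 0;
        for (size_t j = 0; j < item->number_of_childs; j++)
            if (strcmp(def->childs[i]->property_name, item->childs[j]->name) == 0)
                n++;
        if (!count_ok(def->childs[i]->occurrences, n))
            return 0;
    }
    return 1;
}

/* Validate a cat item with two text properties named first and second. */
int demo_validate_cat(const char *first, const char *second)
{
    static const struct data_prop name_def  = { "name",  '1', 0, { NULL } };
    static const struct data_prop color_def = { "color", '?', 0, { NULL } };
    static const struct data_prop cat_def   = { "cat", '1', 2,
                                                { &name_def, &color_def } };
    struct item_prop p1  = { first,  0, 0, { NULL } };
    struct item_prop p2  = { second, 0, 0, { NULL } };
    struct item_prop cat = { "cat", 1, 2, { &p1, &p2 } };
    return validate_item(&cat_def, &cat);
}
```

Under these assumptions, demo_validate_cat("name", "color") accepts the correctly marked-up cat, while demo_validate_cat("nme", "colr") rejects the mistyped one because neither name appears in the vocabulary.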
It would be very simple to build a validator on top of this. Besides, having a simple syntax for defining data vocabularies and validating microdata would also be very helpful for coordinating the work of data vocabulary authors. Going further, we could also think about a datatype property for specialised applications that may require it.

Again, if no work has been done on this, I would like to know if there is interest in the community in starting work on it (within the community forums provided by the whatwg or outside as an independent project).

3) The specification states that itemref references a node within the html tree, referencing it by its id. However, it specifies nothing regarding how the referenced node should be marked up. Since the referenced nodes may appear before the itemrefs, an application discovering microdata may have to do multiple passes through the html tree to extract this information. I would like to know if any thought has been given to using itemscope within the referenced node, i.e.:

    <div itemscope id="a">
      <p itemprop="a1">value of a1</p>
      <p itemprop="a2">value of a2</p>
    </div>
    <div itemscope id="b">
      <p itemprop="b1">value of b1</p>
      <div itemscope id="d" itemref="a"></div>
    </div>

Where a1="value of a1" and a2="value of a2" are childs belonging to the item identified as d, which is itself a child of b. The advantage of this is that an application extracting the microdata could extract all elements marked up with itemscope and then merge them according to the itemref references without having to do multiple passes. This might not be very important, but it could improve efficiency when extracting microdata from large quantities of deeply referenced documents or when dealing with limited resources.

4) What is the intended behaviour of an application when encountering a loop within the itemref references?
i.e.:

    <div itemscope id="a" itemref="b c d"></div>
    <p id="b"><span itemprop="x">x value</span></p>
    <div id="c">
      <p>Y:<span itemprop="y">y value</span></p>
      <p>Z: <span itemprop="z">z value</span></p>
    </div>
    <div itemscope id="d" itemref="a"></div>

In a case like this, should the whole node with id="a" be discarded, or only the subnode with id="d"? Or is this up to the implementor? I would like to point out that this is another reason to have some (however loose) mechanism for data vocabulary validation for dealing with user errors.

5) The specification states: "The itemref attribute, if specified, must have a value that is an unordered set of unique space-separated tokens that are case-sensitive, consisting of IDs of elements in the same home subtree." (5.2.2 of http://www.whatwg.org/specs/web-apps/current-work/#microdata)

I would like to know if any thought has been given to referencing fragments in an outside document. For example, a document with URL http://www.personaldata.com/me.html might contain the following fragment:

    <div itemscope itemtype="http://www.datavocabulary.com/person">
      <p>My name is <span itemprop="name">Pepe</span> and I used to work at <a itemprop="org" href="http://www.organization.com/about_us.html#org_data">organization</a></p>
    </div>

While at http://www.organization.com/about_us.html#org_data you could have the following fragment:

    <div id="org_data" itemtype="http://www.datavocabulary.com/org">
      <p itemprop="legal_name">Organization XYZ</p>
      ....
    </div>

Or something similar for referencing specific data vocabularies outside of the node tree. Or maybe I'm missing something and this is already covered by the general use of href?

My question is whether there is a mechanism for referencing items from a document outside the home subtree as subproperties of a microdata item. Is it correct to use href for this? And should an application dealing with microdata be aware of this?
Other than that, thank you for this great spec and best regards,

--
Emiliano Martínez Luque
http://www.metonymie.com
Received on Sunday, 16 January 2011 06:04:02 UTC