- From: <bugzilla@jessica.w3.org>
- Date: Fri, 16 Apr 2010 00:01:21 +0000
- To: public-html-bugzilla@w3.org
http://www.w3.org/Bugs/Public/show_bug.cgi?id=9533

           Summary: The Microdata extraction algorithm should include image
                    alt-text when extracting the contents of an element as a
                    string
           Product: HTML WG
           Version: unspecified
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HTML future versions
        AssignedTo: dave.null@w3.org
        ReportedBy: jackalmage@gmail.com
         QAContact: public-html-bugzilla@w3.org
                CC: mjs@apple.com, Paul.Cotton@microsoft.com,
                    rubys@intertwingly.net, mike@w3.org

In the "Values" section of the Microdata section of the spec (currently section 5.2.4), the "Otherwise" clause says that the value of the Microdata property should be the textContent of the element. The textContent extraction algorithm defined in DOM3CORE does not include the value of the @alt attribute on an <img> element in the returned string.

This is a problem for many common cases on the web, where Microdata may be used to extract information from a page that uses an image logo with appropriate alt-text. For example, it is common for corporate pages to have markup resembling "<h1><img src=foo alt='Example Corp'></h1>".

Currently, using this markup to get the company name as the value of some Microdata property is impossible. If you set an @itemprop on the <img>, the value for the property is the value of the @src attribute. If you set an @itemprop on the <h1>, the value for the property is the empty string. The only way to get the company name as the value of a Microdata property is to duplicate the company name in a <meta> element and set the @itemprop on that instead. This is precisely the type of duplication that Microdata is intended to prevent.

Ideally, you would be able to set an @itemprop on the <h1> and get the value of the <img>'s @alt attribute, since you are asking for the text inside the element and @alt is the textual replacement for the image.
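
A minimal sketch of the requested extraction, for illustration only. It assumes Python's standard xml.dom.minidom and well-formed (XHTML-style) markup instead of a real HTML parser, and the function name text_with_alt is hypothetical, not something defined by the spec. It walks the tree the way the DOM3CORE textContent definition does (concatenating child text, skipping comments and processing instructions) and adds one extra case for <img>.

```python
from xml.dom import minidom
from xml.dom.minidom import Node


def text_with_alt(node):
    """Return textContent-style text, with each <img> replaced by its alt text."""
    if node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
        return node.data
    if node.nodeType in (Node.COMMENT_NODE, Node.PROCESSING_INSTRUCTION_NODE):
        # textContent ignores comments and processing instructions.
        return ""
    if node.nodeType == Node.ELEMENT_NODE and node.tagName.lower() == "img":
        # The single additional case: an image contributes its alt text.
        # getAttribute returns "" when the attribute is absent.
        return node.getAttribute("alt")
    # Elements (and the document itself): concatenate the children in order.
    return "".join(text_with_alt(child) for child in node.childNodes)


doc = minidom.parseString('<h1><img src="foo" alt="Example Corp"/></h1>')
h1 = doc.documentElement
print(text_with_alt(h1))  # Example Corp
```

With plain textContent the same <h1> would yield the empty string, which is the case this bug describes.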

It can be argued that more elements could benefit from special handling when formatting their text content. For example, the <q> element could emit its contents with quotes, the <bdo> element could emit its contents with Unicode directionality characters, or the <br> element could substitute itself with a linebreak. However, these elements will still emit *something* useful if they just provide their plain textContent, even if it ends up being somewhat misformatted. <img alt> provides *nothing* and will require data duplication in the current algorithm, and thus is much more important to address.

The change required of algorithms that extract Microdata from a document is minimal. If one is using DOM methods, one has to manually iterate through an element's nodes per the DOM3Core algorithm for textContent, and add a single additional case to extract the @alt value from <img alt>. This is somewhat more work than just requesting the .textContent value from a node, but is still quite trivial. If one is using lower-level or alternate methods to parse a page and extract Microdata from it, then the change is simpler still: a single additional case while building the text content string, as described earlier in this paragraph.

-- 
Configure bugmail: http://www.w3.org/Bugs/Public/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the QA contact for the bug.
Received on Friday, 16 April 2010 00:01:23 UTC