- From: <bugzilla@jessica.w3.org>
- Date: Fri, 16 Apr 2010 00:01:21 +0000
- To: public-html-bugzilla@w3.org
http://www.w3.org/Bugs/Public/show_bug.cgi?id=9533
Summary: The Microdata extraction algorithm should include image
alt-text when extracting the contents of an element as
a string
Product: HTML WG
Version: unspecified
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: HTML future versions
AssignedTo: dave.null@w3.org
ReportedBy: jackalmage@gmail.com
QAContact: public-html-bugzilla@w3.org
CC: mjs@apple.com, Paul.Cotton@microsoft.com,
rubys@intertwingly.net, mike@w3.org
In the "Values" section of the Microdata section of the spec (currently section
5.2.4), the "Otherwise" clause says that the value of the Microdata property
should be the textContent of the element. The textContent extraction algorithm
defined in DOM3CORE does not include the value of the @alt attribute on an
<img> element in the returned string.
This is a problem for many common cases on the web, where Microdata may be used
to extract information from a page that uses an image logo with appropriate
alt-text. For example, it is common for corporate pages to have markup
resembling "<h1><img src=foo alt='Example Corp'></h1>". Currently, using this
markup to get the company name as the value of some Microdata property is
impossible. If you set an @itemprop on the <img>, the value for the property
is the value of the @src attribute. If you set an @itemprop on the <h1>, the
value for the property is the empty string.
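To make the empty-string result concrete, here is a minimal sketch in Python. It only approximates DOM textContent by concatenating character data with the standard library's html.parser; it is not a Microdata implementation, and the names TextContentCollector and text_content are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser


class TextContentCollector(HTMLParser):
    """Approximates DOM textContent: concatenates character data only.

    Attributes such as alt are never consulted, mirroring the behaviour
    described above. Illustrative sketch, not a spec implementation.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; never called for attribute values.
        self.parts.append(data)


def text_content(markup):
    """Return the concatenated character data found in markup."""
    collector = TextContentCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.parts)


logo_heading = "<h1><img src=foo alt='Example Corp'></h1>"
text_heading = "<h1>Example Corp</h1>"

print(repr(text_content(logo_heading)))  # the image-logo heading has no text
print(repr(text_content(text_heading)))  # a plain-text heading works fine
```

Under these assumptions the image-logo heading yields the empty string, while the equivalent plain-text heading yields the company name.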
Currently, the only way to get the company name as the value of a Microdata
property is to duplicate the company name in a <meta> element and set the
@itemprop on that instead. This is precisely the type of duplication that
Microdata is intended to prevent.
Ideally, you would be able to set an @itemprop on the <h1> and get the value of
the <img>'s @alt attribute: you are asking for the text inside the element, and
@alt is the textual replacement for the image.
It can be argued that more elements could benefit from special handling when
formatting their text content. For example, the <q> element could emit its
contents with quotes, the <bdo> element could emit its contents with Unicode
directionality characters, or the <br> element could substitute itself with a
linebreak. However, these elements will still emit *something* useful if they
just provide their plain textContent, even if it ends up being somewhat
misformatted. <img alt> provides *nothing* and will require data duplication
in the current algorithm, and thus is much more important to address.
The actual change to algorithms that extract Microdata from a document is
minimal. If one is using DOM methods, one has to manually iterate through an
element's nodes per the DOM3Core algorithm for textContent, and add a single
additional case to extract the @alt value from <img alt>. This is somewhat more
work than just requesting the .textContent value from a node, but is still
quite trivial. If one is using lower-level or alternate methods to parse a page
and extract Microdata from it, then the change is simpler still: a single
additional case while building the text content string, as described earlier in
this paragraph.
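As a rough sketch of the "single additional case", the following Python uses the standard library's html.parser to build the text content string and appends the @alt value whenever an <img> is encountered. The names AltAwareTextCollector and text_content_with_alt are hypothetical, and skipping images with a missing or empty @alt is an assumption of this sketch, not something the spec text currently says.

```python
from html.parser import HTMLParser


class AltAwareTextCollector(HTMLParser):
    """Builds a text content string, substituting <img> with its alt text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # The ordinary case: character data is appended as-is.
        self.parts.append(data)

    def handle_starttag(self, tag, attrs):
        # The single additional case: an <img> contributes its alt text.
        # Assumption: a missing or empty alt contributes nothing.
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append(alt)


def text_content_with_alt(markup):
    """Return character data plus image alt text, in document order."""
    collector = AltAwareTextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.parts)


print(text_content_with_alt("<h1><img src=foo alt='Example Corp'></h1>"))
print(text_content_with_alt('<p>Visit <img src=logo.png alt="Example Corp"> today</p>'))
```

With this one extra branch, an @itemprop on the <h1> from the earlier example could yield the company name without any duplicated <meta> element.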
Received on Friday, 16 April 2010 00:01:23 UTC