[Bug 9533] New: The Microdata extraction algorithm should include image alt-text when extracting the contents of an element as a string

http://www.w3.org/Bugs/Public/show_bug.cgi?id=9533

           Summary: The Microdata extraction algorithm should include image
                    alt-text when extracting the contents of an element as
                    a string
           Product: HTML WG
           Version: unspecified
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HTML future versions
        AssignedTo: dave.null@w3.org
        ReportedBy: jackalmage@gmail.com
         QAContact: public-html-bugzilla@w3.org
                CC: mjs@apple.com, Paul.Cotton@microsoft.com,
                    rubys@intertwingly.net, mike@w3.org


In the "Values" section of the Microdata section of the spec (currently section
5.2.4), the "Otherwise" clause says that the value of the Microdata property
should be the textContent of the element.  The textContent extraction algorithm
defined in DOM3CORE does not include the value of the @alt attribute on an
<img> element in the returned string.

This is a problem for many common cases on the web, where Microdata may be used
to extract information from a page that uses an image logo with appropriate
alt-text.  For example, it is common for corporate pages to have markup
resembling "<h1><img src=foo alt='Example Corp'></h1>".  Currently, using this
markup to get the company name as the value of some Microdata property is
impossible.  If you set an @itemprop on the <img>, the value for the property
is the value of the @src attribute.  If you set an @itemprop on the <h1>, the
value for the property is the empty string.  
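The behaviour described above can be reproduced with a minimal sketch.  It
assumes Python's standard xml.dom.minidom as a stand-in DOM and an
XML-serialised form of the example markup; text_content is a hypothetical
helper approximating the DOM3CORE textContent rules, not any parser's actual
API, and URL resolution of @src is left out.

```python
from xml.dom import minidom, Node


def text_content(node):
    """Approximate DOM3 Core textContent: concatenate descendant text,
    skipping comment and processing-instruction nodes."""
    if node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
        return node.data
    if node.nodeType in (Node.COMMENT_NODE, Node.PROCESSING_INSTRUCTION_NODE):
        return ""
    return "".join(text_content(child) for child in node.childNodes)


# The example markup, written as well-formed XML so minidom can parse it.
doc = minidom.parseString('<h1><img src="foo" alt="Example Corp"/></h1>')
h1 = doc.documentElement
img = h1.getElementsByTagName("img")[0]

# @itemprop on the <h1>: the value is the element's textContent.
print(repr(text_content(h1)))         # ''
# @itemprop on the <img>: the value comes from @src (unresolved here).
print(repr(img.getAttribute("src")))  # 'foo'
```

Neither choice yields "Example Corp", which is the gap this bug is about.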

Currently, the only way to get the company name as the value of a Microdata
property is to duplicate the company name in a <meta> element and set the
@itemprop on that instead.  This is precisely the type of duplication that
Microdata is intended to prevent.

Ideally, you would be able to set an @itemprop on the <h1> and get the value of
the <img>'s @alt attribute, as you are getting the text inside the element, and
@alt is the textual replacement for the image.

It can be argued that more elements could benefit from special handling when
formatting their text content.  For example, the <q> element could emit its
contents with quotes, the <bdo> element could emit its contents with Unicode
directionality characters, or the <br> element could substitute itself with a
linebreak.  However, these elements will still emit *something* useful if they
just provide their plain textContent, even if it ends up being somewhat
misformatted.  <img alt> provides *nothing* and will require data duplication
in the current algorithm, and thus is much more important to address.

The actual change to algorithms extracting Microdata from a document is so
minimal as to be trivial.  If one is using DOM methods, one has to manually
iterate through an element's nodes per the DOM3Core algorithm for textContent,
and add a single additional case to extract the @alt value from <img alt>.
This is somewhat more work than just requesting the .textContent value
from a node, but is still quite simple.  If one is using lower-level or
alternate methods to parse a page and extract Microdata from it, then the
change should be even more trivial: a single additional case while building
the text content string, as described earlier in this paragraph.
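
As a rough illustration of how small the change is, here is a sketch of the
manual walk with the one extra case.  It again assumes Python's standard
xml.dom.minidom as a stand-in DOM; text_content_with_alt is a hypothetical
name, and this is one possible reading of the proposal rather than spec text.

```python
from xml.dom import minidom, Node


def text_content_with_alt(node):
    """textContent-style walk with one added case: an <img> element
    contributes the value of its alt attribute."""
    if node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
        return node.data
    if node.nodeType in (Node.COMMENT_NODE, Node.PROCESSING_INSTRUCTION_NODE):
        return ""
    # The single additional case proposed in this bug.
    if node.nodeType == Node.ELEMENT_NODE and node.tagName.lower() == "img":
        return node.getAttribute("alt")  # empty string if alt is absent
    return "".join(text_content_with_alt(child) for child in node.childNodes)


# The example markup, written as well-formed XML so minidom can parse it.
h1 = minidom.parseString(
    '<h1><img src="foo" alt="Example Corp"/></h1>'
).documentElement
print(text_content_with_alt(h1))  # Example Corp

# Alt text is spliced in where the image sits among ordinary text.
p = minidom.parseString(
    '<p>Made by <img src="logo.png" alt="Example Corp"/>, Inc.<!-- note --></p>'
).documentElement
print(text_content_with_alt(p))  # Made by Example Corp, Inc.
```

An <img> with no @alt contributes the empty string here, so markup without
alt-text would produce the same value as it does today.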


Received on Friday, 16 April 2010 00:01:23 UTC