Re: Heuristic Tests for Data Tables (Discussion)

From: Philip Taylor <philip@zaynar.demon.co.uk>
Date: Fri, 17 Aug 2007 17:12:50 +0100
Message-ID: <46C5C902.4010908@zaynar.demon.co.uk>
To: Ben 'Cerbera' Millard <cerbera@projectcerbera.com>
CC: HTMLWG <public-html@w3.org>

Ben 'Cerbera' Millard wrote:
> Sadly, there are tables where <td> is used for everything.
> [...]
> HTML4 does not try and guess header relationships. Doing this in HTML5 
> would be a new ability for HTML. <td>-only data tables are about as 
> common as tables which use <th>, in those I've seen.
> [...]
> You'd need to determine which tables are data tables and which are 
> layout tables, too. From what I've read, screen readers are already 
> quite capable at that. It would be cool to get them more involved in this.

I would have expected tables with no <th> to be more likely layout than 
those that do use <th>, but I'm not sure now if that expectation is 
true. With the data I collected a while ago [1], I saw <table> on 75% of 
pages and <th> on 4% of pages. Looking at the first six pages that used 
<th> [2], none use it in a useful way - they are all layout tables, and 
the <th> cells all contain some combination of whitespace, non-header 
text and images. So it seems as important to determine layout vs data 
for tables that do have <th> as for those that don't.

Are there any details on how existing implementations distinguish 
between layout and data tables? I guess a first approximation would be 
to see whether the table contains any block-level or image elements, in 
which case it is more likely layout, but I don't know if more subtlety 
is needed.
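
As a rough illustration of that first approximation (not a description of 
how any existing implementation works), the sketch below marks a table as 
"layout" if it directly contains any block-level or image element, and as 
"data" otherwise. The element list in BLOCK_OR_IMAGE and the names 
TableClassifier and classify_tables are assumptions made for this example; 
a nested table is treated as block-level content of its parent.

```python
from html.parser import HTMLParser

# Assumed set of "block-level or image" elements; the exact list is a guess
# and would need tuning against real pages.
BLOCK_OR_IMAGE = {
    "p", "div", "ul", "ol", "dl", "pre", "blockquote", "form", "hr",
    "h1", "h2", "h3", "h4", "h5", "h6", "img", "table",
}


class TableClassifier(HTMLParser):
    """Hypothetical helper: labels each closed <table> as layout or data."""

    def __init__(self):
        super().__init__()
        self.open_tables = []  # one "saw block/image" flag per open <table>
        self.results = []      # labels, in the order the tables are closed

    def handle_starttag(self, tag, attrs):
        # Flag the innermost open table when a block or image element starts.
        if tag in BLOCK_OR_IMAGE and self.open_tables:
            self.open_tables[-1] = True
        # A new table starts unflagged; its parent was flagged just above.
        if tag == "table":
            self.open_tables.append(False)

    def handle_endtag(self, tag):
        if tag == "table" and self.open_tables:
            flagged = self.open_tables.pop()
            self.results.append("layout" if flagged else "data")


def classify_tables(html):
    """Return a 'layout'/'data' label for every properly closed table."""
    parser = TableClassifier()
    parser.feed(html)
    parser.close()
    return parser.results


if __name__ == "__main__":
    sample = (
        "<table><tr><td>1</td><td>2</td></tr></table>"
        "<table><tr><td><img src='spacer.gif'></td></tr></table>"
    )
    print(classify_tables(sample))
```

Tables that are never closed are not reported, and the sketch ignores the 
"more subtlety" question entirely (cell counts, whitespace-only cells, 
presentational attributes and so on).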

I believe that would be very useful if we want to investigate the 
properties of a large collection of 'normal' data tables, and develop 
algorithms to work on them - if we can automatically recognise data 
tables then we can build up that collection fairly easily, and see if 
various header-association tests (like those you suggested) can be 
implemented in a way that's useful (matches a reasonable number of 
tables correctly) and safe (doesn't match an unreasonable number 
incorrectly). I'd be happy to help with collecting that kind of data if 

[1] http://canvex.lazyilluminati.com/survey/2007-07-17/analyse.cgi/index

Philip Taylor
Received on Friday, 17 August 2007 16:12:58 UTC
