Heuristic Tests for Data Tables (Discussion)

From: T.V Raman <raman@google.com>
Date: Tue, 14 Aug 2007 11:13:15 -0700
Message-ID: <18113.61627.579759.692899@retriever.corp.google.com>
To: cerbera@projectcerbera.com
Cc: public-html@w3.org

The other interesting question to answer would be "how many of the
data tables have a real data alternative as XML" on the Web.

For the MLB example, you can actually get at the XML -- and a few
years ago I figured out the recipe and implemented it in
Emacspeak -- after which I've never gone back to scraping the

The goal of saying this is not to diminish the value of the
examples you found, but rather to add to what we have to think of.
Though there is value in figuring out how to scrape HTML, it's
often useful to ask:

A)  Where did the HTML come from?
B)  Is there a better alternative?

As a second example, consider weather forecasts -- though one
can scrape HTML pages from wunderground, their RSS feed is a lot
easier to use.
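
To illustrate why a feed is easier to consume than scraped HTML, here
is a minimal sketch in Python. The feed content, the forecast_items
name and the item layout are all assumptions made up for
illustration; this is not wunderground's actual feed, only a generic
RSS 2.0 document with <item> elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document. A real feed would be fetched over HTTP;
# the titles and descriptions here are invented sample data.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Forecast</title>
    <item>
      <title>Tuesday</title>
      <description>Sunny, high 75F</description>
    </item>
    <item>
      <title>Wednesday</title>
      <description>Cloudy, high 68F</description>
    </item>
  </channel>
</rss>"""


def forecast_items(rss_text):
    """Return (title, description) pairs for each <item> in the feed."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("title", default=""),
         item.findtext("description", default=""))
        for item in root.iter("item")
    ]


if __name__ == "__main__":
    for title, description in forecast_items(SAMPLE_RSS):
        print(title, "->", description)
```

The element names carry the structure, so no guessing about rows,
columns or headers is needed.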

Ben 'Cerbera' Millard writes:
 > Sadly, there are tables where <td> is used for everything. For example:
 > * <http://sports.espn.go.com/nhl/players/profile?statsId=1339>
 > * <http://www.fleetlions.org.uk/events.html>
 > * <http://esw.w3.org/topic/RecentChanges>
 > More examples easily found from web searches like 
 > <http://www.google.com/search?q=table+2>:
 > * <http://www.osha.gov/pls/oshaweb/owadisp.show_document?p_id=9993&p_table=STANDARDS>
 > * <http://www.biomedcentral.com/1471-2105/6/60/table/T2>
 > HTML4 does not try and guess header relationships. Doing this in HTML5 would 
 > be a new ability for HTML. <td>-only data tables are about as common as 
 > tables which use <th>, in those I've seen. This ability, if possible, would add 
 > significant accessibility to these present-day tables with no authoring 
 > effort.
 > In the first example from ESPN:
 > 1. The first row is several columns wide and spans the entire table. It 
 > contains a text string. You could imply this is a <caption>.
 > 2. The data types of the second row are all text strings with no numbers. 
 > This is different to the subsequent rows. You could imply these are <th> 
 > cells for those subsequent rows.
 > 3. In those subsequent rows, several formats are used. These formats are 
 > fairly consistent down each column. You could imply these are the <td> cells 
 > which correspond to the <th> cells you implied in step 2, without 
 > associating them to the implied <caption> from step 1.
 > I don't have a concrete proposal for what these tests should be. The above 
 > steps might not work in other tables.
 > You'd need to determine which tables are data tables and which are layout 
 > tables, too. From what I've read, screen readers are already quite capable 
 > at that. It would be cool to get them more involved in this.
 > Ideas and research are welcome. :-)
 > --
 > Ben 'Cerbera' Millard
 > Collections of Interesting Data Tables
 > <http://sitesurgeon.co.uk/tables/readme.html> 
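
The three steps quoted above for the ESPN table could be sketched
roughly as follows. This is only one possible reading of them, not a
concrete proposal: the row representation (lists of (text, colspan)
pairs), the guess_structure and looks_numeric names, the regular
expression used to decide what "looks numeric", and the sample rows
are all assumptions made up for illustration. As the quoted message
notes, such steps might not work in other tables.

```python
import re

# Assumption: a cell "looks numeric" if it starts with a digit (optionally
# signed) and contains only digits and common numeric punctuation.
_NUMERIC = re.compile(r"^[+-]?\d[\d,.:/%-]*$")


def looks_numeric(text):
    return bool(_NUMERIC.match(text.strip()))


def guess_structure(rows):
    """Guess caption, header row and data rows of a <td>-only table.

    rows is a list of rows; each row is a list of (text, colspan) pairs.
    """
    body = list(rows)
    width = max((sum(span for _, span in row) for row in body), default=0)

    # Step 1: a single cell spanning the whole table is treated as a caption.
    caption = None
    if body and width > 1 and len(body[0]) == 1 and body[0][0][1] >= width:
        caption = body[0][0][0]
        body = body[1:]

    # Step 2: a text-only row followed by rows containing numbers is
    # treated as the header row.
    headers = None
    if len(body) >= 2:
        first = [text for text, _ in body[0]]
        rest_has_numbers = any(
            looks_numeric(text) for row in body[1:] for text, _ in row
        )
        if rest_has_numbers and not any(looks_numeric(t) for t in first):
            headers = first
            body = body[1:]

    # Step 3: whatever remains is treated as data cells.
    data = [[text for text, _ in row] for row in body]
    return {"caption": caption, "headers": headers, "data": data}


# Invented sample rows, loosely shaped like a player statistics table.
SAMPLE_ROWS = [
    [("Career Stats", 3)],
    [("Season", 1), ("Team", 1), ("Goals", 1)],
    [("2005-06", 1), ("ABC", 1), ("12", 1)],
    [("2006-07", 1), ("ABC", 1), ("20", 1)],
]

if __name__ == "__main__":
    print(guess_structure(SAMPLE_ROWS))
```

The sketch does not check that formats are consistent down each
column, and it does not attempt to tell data tables from layout
tables.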

Best Regards,

Title:  Research Scientist      
Email:  raman@google.com
WWW:    http://emacspeak.sf.net/raman/
Google: tv+raman 
GTalk:  raman@google.com, tv.raman.tv@gmail.com
PGP:    http://emacspeak.sf.net/raman/raman-almaden.asc
Received on Tuesday, 14 August 2007 18:37:35 UTC
