HTML Semantics extractor

Hi folks,

Some time ago I wrote a little XSLT stylesheet to show what kind of
semantics an HTML document can carry. It can help show how using HTML
for structure instead of presentation is a win-win for everybody. I am
releasing it now since XHTML 2.0 seems to present clearer semantics than
the previous versions did.

It could be much improved, of course, and I welcome any
suggestions/corrections/comments.

The XSLT: http://www.w3.org/2002/08/extract-semantic.xsl

Examples of results:
* the HTML Activity:
http://www.w3.org/2000/06/webdata/xslt?xmlfile=http://cgi.w3.org/cgi-bin/tidy?docAddr=http://www.w3.org/MarkUp/&xslfile=http://www.w3.org/2002/08/extract-semantic.xsl&
* the Semantic Web Activity:
http://www.w3.org/2000/06/webdata/xslt?xmlfile=http://cgi.w3.org/cgi-bin/tidy?docAddr=http://www.w3.org/2001/sw/&xslfile=http://www.w3.org/2002/08/extract-semantic.xsl&
* the W3C Manual of Style:
http://www.w3.org/2000/06/webdata/xslt?xmlfile=http://cgi.w3.org/cgi-bin/tidy?docAddr=http://www.w3.org/2001/06/manual/&xslfile=http://www.w3.org/2002/08/extract-semantic.xsl&
* the W3C Quality Assurance Activity:
http://www.w3.org/2000/06/webdata/xslt?xmlfile=http://cgi.w3.org/cgi-bin/tidy?docAddr=http://www.w3.org/QA/&xslfile=http://www.w3.org/2002/08/extract-semantic.xsl&
* a (not so) random personal web site:
http://www.w3.org/2000/06/webdata/xslt?xmlfile=http://cgi.w3.org/cgi-bin/tidy?docAddr=http://www.la-grange.net/&xslfile=http://www.w3.org/2002/08/extract-semantic.xsl&
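
The example URLs above all follow one pattern: the page address is given
to the tidy service, and that result is handed to the XSLT service
together with the stylesheet. A minimal Python sketch of the pattern, in
case you want to try other pages. The function name semantics_url is
made up for illustration, and the URL is built by plain concatenation
(no percent-encoding) only so that it matches the examples as written:

```python
# Hypothetical helper, not part of the stylesheet or of any W3C service.
XSLT_SERVICE = "http://www.w3.org/2000/06/webdata/xslt"
TIDY_SERVICE = "http://cgi.w3.org/cgi-bin/tidy"
STYLESHEET = "http://www.w3.org/2002/08/extract-semantic.xsl"


def semantics_url(page):
    """Return the extraction URL for `page`, in the same form as the examples."""
    # Inner URL: the tidy service is asked for the page.
    tidied = TIDY_SERVICE + "?docAddr=" + page
    # Outer URL: the XSLT service applies the stylesheet to the tidy output.
    return XSLT_SERVICE + "?xmlfile=" + tidied + "&xslfile=" + STYLESHEET + "&"


print(semantics_url("http://www.w3.org/MarkUp/"))
```

Called with http://www.w3.org/MarkUp/, this gives the first URL in the
list above.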

Currently, the stylesheet detects the following semantics:
- generic metadata: title, description, contact, languages
- citations and quotes, with their source if provided (cite attribute)
- definitions marked with <dfn> and <dt>/<dd>
- the outline of the document when it correctly uses the h1/h2/h3/...
order
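
To give a rough idea of what extracting these semantics involves, here
is a small sketch in Python rather than XSLT. It is not the stylesheet
and does not claim to reproduce its logic: the class name
SemanticsSketch and the sample markup are made up, and it covers only a
subset of the list above (title, cite attributes, <dfn> terms, heading
outline), without checking that the heading order is correct:

```python
from html.parser import HTMLParser


class SemanticsSketch(HTMLParser):
    """Collects the title, cite sources, <dfn> terms and heading outline."""

    HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")
    TEXT_TAGS = ("title", "dfn") + HEADINGS

    def __init__(self):
        super().__init__()
        self.title = ""
        self.cited = []        # cite attribute values found on <q>/<blockquote>
        self.definitions = []  # text of <dfn> elements
        self.outline = []      # (level, heading text) pairs in document order
        self._tag = None       # text-bearing element currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in ("q", "blockquote"):
            cite = dict(attrs).get("cite")
            if cite:
                self.cited.append(cite)
        # Only track the outermost text-bearing element; nested ones are
        # folded into its text.
        if tag in self.TEXT_TAGS and self._tag is None:
            self._tag = tag
            self._text = []

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = " ".join("".join(self._text).split())  # normalize whitespace
        if tag == "title":
            self.title = text
        elif tag == "dfn":
            self.definitions.append(text)
        else:
            self.outline.append((int(tag[1]), text))
        self._tag = None


# Made-up sample document, for illustration only.
sample = """<html><head><title>Demo</title></head><body>
<h1>Intro</h1>
<p>A <dfn>stylesheet</dfn> describes presentation.</p>
<blockquote cite="http://example.org/source"><p>Quoted text</p></blockquote>
<h2>Details</h2>
</body></html>"""

parser = SemanticsSketch()
parser.feed(sample)
parser.close()
for level, text in parser.outline:
    print("  " * (level - 1) + text)  # indent each heading by its level
```

The same information is available only because the sample uses
structural markup; a page that fakes headings with font sizes would
yield an empty outline.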

If you have suggestions for other interesting semantics to extract, I'll
try to implement them.

Dom
-- 
Dominique Hazaël-Massieux - http://www.w3.org/People/Dom/
W3C/INRIA
mailto:dom@w3.org

Received on Tuesday, 27 August 2002 09:07:44 UTC