
HTML Semantics extractor

From: Dominique Hazaël-Massieux <dom@w3.org>
Date: 27 Aug 2002 15:07:40 +0200
To: www-html@w3.org
Cc: karl@w3.org
Message-Id: <1030453662.10227.207.camel@stratustier>

Hi folks,

Some time ago I wrote a small XSLT stylesheet to show what kind of semantics
an HTML document can carry. It can help show how using HTML for structure
instead of presentation is a win-win for everybody. I am releasing it now
because XHTML 2.0 seems to offer clearer semantics than the
previous versions did.

It could be much improved of course, and I welcome any
suggestions/corrections/comments. 

The XSLT: http://www.w3.org/2002/08/extract-semantic.xsl

Examples of results:
* the HTML Activity:
http://www.w3.org/2000/06/webdata/xslt?xmlfile=http://cgi.w3.org/cgi-bin/tidy?docAddr=http://www.w3.org/MarkUp/&xslfile=http://www.w3.org/2002/08/extract-semantic.xsl&
* the Semantic Web Activity:
http://www.w3.org/2000/06/webdata/xslt?xmlfile=http://cgi.w3.org/cgi-bin/tidy?docAddr=http://www.w3.org/2001/sw/&xslfile=http://www.w3.org/2002/08/extract-semantic.xsl&
* the W3C Manual of Style:
http://www.w3.org/2000/06/webdata/xslt?xmlfile=http://cgi.w3.org/cgi-bin/tidy?docAddr=http://www.w3.org/2001/06/manual/&xslfile=http://www.w3.org/2002/08/extract-semantic.xsl&
* the W3C Quality Assurance Activity:
http://www.w3.org/2000/06/webdata/xslt?xmlfile=http://cgi.w3.org/cgi-bin/tidy?docAddr=http://www.w3.org/QA/&xslfile=http://www.w3.org/2002/08/extract-semantic.xsl&
* a (not so) random personal web site:
http://www.w3.org/2000/06/webdata/xslt?xmlfile=http://cgi.w3.org/cgi-bin/tidy?docAddr=http://www.la-grange.net/&xslfile=http://www.w3.org/2002/08/extract-semantic.xsl&
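
The example links above all follow one pattern: the page address is handed to the online tidy service (presumably so that the XSLT processor receives well-formed markup), and that result is passed to the online XSLT service together with the stylesheet. A minimal sketch of how such a link could be assembled for another page follows. The function name `extraction_url` is hypothetical, and the sketch copies the links above literally, without percent-encoding the nested addresses; it does not check whether the services are still reachable.

```python
# Service and stylesheet addresses as they appear in the example links.
TIDY_SERVICE = "http://cgi.w3.org/cgi-bin/tidy"
XSLT_SERVICE = "http://www.w3.org/2000/06/webdata/xslt"
STYLESHEET = "http://www.w3.org/2002/08/extract-semantic.xsl"


def extraction_url(page: str) -> str:
    """Build a link in the same shape as the examples (hypothetical helper)."""
    # Step 1: ask tidy to clean up the page.
    tidied = f"{TIDY_SERVICE}?docAddr={page}"
    # Step 2: apply the stylesheet to tidy's output via the XSLT service.
    # The trailing "&" mirrors the example links exactly.
    return f"{XSLT_SERVICE}?xmlfile={tidied}&xslfile={STYLESHEET}&"


if __name__ == "__main__":
    print(extraction_url("http://www.w3.org/MarkUp/"))
```

Called with http://www.w3.org/MarkUp/, this reproduces the HTML Activity link listed above.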

Currently, the stylesheet detects the following semantics:
- generic metadata: title, description, contact, languages
- citations and quotes, with their source if provided (cite attribute)
- definitions marked with <dfn> and <dt>/<dd>
- the outline of the document, when it uses the h1/h2/h3/... order
correctly
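
To make the list above concrete, here is a rough sketch of the same kind of extraction. It is written in Python with the standard-library html.parser module rather than XSLT, so it is not the stylesheet itself and makes no claim about how the stylesheet works internally. The names `SemanticsExtractor` and `extract_semantics` are hypothetical, and the sketch covers only the title, quotes with their cite attribute, <dfn> terms, and the heading outline; the other metadata and <dt>/<dd> pairs are left out.

```python
from html.parser import HTMLParser


class SemanticsExtractor(HTMLParser):
    """Collects the title, quotes, <dfn> terms and headings of a document."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    QUOTES = {"q", "blockquote"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.quotes = []       # (text, cite attribute or None)
        self.definitions = []  # text of each <dfn>
        self.outline = []      # (heading level, heading text)
        self._open = []        # captured elements still open: (tag, attrs, parts)

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "dfn") or tag in self.HEADINGS or tag in self.QUOTES:
            self._open.append((tag, dict(attrs), []))

    def handle_data(self, data):
        # Text belongs to every captured element that is currently open,
        # so a <dfn> nested in a <blockquote> feeds both.
        for _, _, parts in self._open:
            parts.append(data)

    def handle_endtag(self, tag):
        # Only close the innermost captured element; ignore everything else.
        if not self._open or self._open[-1][0] != tag:
            return
        _, attrs, parts = self._open.pop()
        text = " ".join("".join(parts).split())  # collapse whitespace
        if tag == "title":
            self.title = text
        elif tag == "dfn":
            self.definitions.append(text)
        elif tag in self.QUOTES:
            self.quotes.append((text, attrs.get("cite")))
        else:
            self.outline.append((int(tag[1]), text))


def extract_semantics(html: str) -> dict:
    """Run the extractor over an HTML string and return what it found."""
    parser = SemanticsExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "quotes": parser.quotes,
        "definitions": parser.definitions,
        "outline": parser.outline,
    }


if __name__ == "__main__":
    sample = (
        "<html><head><title>Demo</title></head><body>"
        "<h1>Intro</h1><p>A <dfn>stylesheet</dfn> transforms documents.</p>"
        '<h2>Quotes</h2><blockquote cite="http://example.org/src">'
        "Structure  first.</blockquote><q>No source</q></body></html>"
    )
    for key, value in extract_semantics(sample).items():
        print(key, value)
```

A page that relies on presentational markup instead of these elements would come back almost empty, which is the point being illustrated: the more the markup describes structure, the more a simple tool can recover from it.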

If you have suggestions for other interesting semantics to extract, I'll
try to implement them.

Dom
-- 
Dominique Hazaël-Massieux - http://www.w3.org/People/Dom/
W3C/INRIA
mailto:dom@w3.org
Received on Tuesday, 27 August 2002 09:07:44 GMT
