HTML Semantics extractor

From: Dominique Hazaël-Massieux <dom@w3.org>
Date: 27 Aug 2002 15:07:40 +0200
To: www-html@w3.org
Cc: karl@w3.org
Message-Id: <1030453662.10227.207.camel@stratustier>

Hi folks,

Some time ago I wrote a little XSLT stylesheet to show what kind of
semantics an HTML document can carry. It can help show how using HTML for
structure instead of presentation is a win-win for everybody. I am
releasing it now since XHTML 2.0 seems to present clearer semantics than
the previous versions did.

It could be much improved of course, and I welcome any suggestions.

The XSLT: http://www.w3.org/2002/08/extract-semantic.xsl

Examples of results:
* the HTML Activity:
* the Semantic Web Activity:
* the W3C Manual of Style:
* the W3C Quality Assurance Activity:
* a (not so) random personal web site:

Currently, the stylesheet detects the following semantics:
- generic metadata: title, description, contact, languages
- citations and quotes, with their source if provided (cite attribute)
- definitions marked with <dfn> and <dt>/<dd>
- the outline of the document, when it correctly uses the h1/h2/h3/...
  headings
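
To give a rough idea of what extracting these semantics involves, here is
a minimal sketch in Python. It is not the XSLT linked above and does not
claim to mirror its logic: the function name extract_semantics, the
dictionary keys, and the SAMPLE document are all made up for illustration,
and it assumes well-formed XHTML in the XHTML 1.x namespace. Contact
detection is left out, since how it is detected is not described here.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def text(el):
    """Collapse all text inside an element into one normalised string."""
    return " ".join("".join(el.itertext()).split())


def extract_semantics(xhtml_source):
    """Hypothetical helper: pull a few structural semantics out of XHTML."""
    root = ET.fromstring(xhtml_source)

    # Generic metadata: title, description, document language.
    title_el = root.find(f".//{XHTML}title")
    meta = {m.get("name"): m.get("content")
            for m in root.iter(f"{XHTML}meta") if m.get("name")}

    # Quotes, paired with their source when a cite attribute is present.
    quote_tags = (XHTML + "q", XHTML + "blockquote")
    quotes = [(text(q), q.get("cite"))
              for q in root.iter() if q.tag in quote_tags]

    # Terms marked with <dfn>, plus <dt>/<dd> pairs from definition lists.
    terms = [text(d) for d in root.iter(f"{XHTML}dfn")]
    glossary = []
    for dl in root.iter(f"{XHTML}dl"):
        term = None
        for child in dl:
            if child.tag == XHTML + "dt":
                term = text(child)
            elif child.tag == XHTML + "dd" and term is not None:
                glossary.append((term, text(child)))

    # Outline: (level, heading text) in document order.
    levels = {XHTML + f"h{n}": n for n in range(1, 7)}
    outline = [(levels[h.tag], text(h))
               for h in root.iter() if h.tag in levels]

    return {
        "title": text(title_el) if title_el is not None else None,
        "description": meta.get("description"),
        "language": root.get(XML_LANG),
        "quotes": quotes,
        "terms": terms,
        "glossary": glossary,
        "outline": outline,
    }


# Made-up sample document, only here to exercise the sketch.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head><title>Demo</title>
<meta name="description" content="A small test page"/></head>
<body>
<h1>Intro</h1>
<p>A <dfn>stylesheet</dfn> transforms a document.</p>
<h2>Quotes</h2>
<blockquote cite="http://example.org/source"><p>Structure first.</p></blockquote>
<dl><dt>XSLT</dt><dd>A transformation language</dd></dl>
</body></html>"""

if __name__ == "__main__":
    for key, value in extract_semantics(SAMPLE).items():
        print(f"{key}: {value}")
```

A document that uses <font> and <br> for everything would come out of
this nearly empty, which is the point: the more structural the markup,
the more there is to extract.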

If you have suggestions for other interesting semantics to extract, I'll
try to implement them.

Dominique Hazaël-Massieux - http://www.w3.org/People/Dom/
Received on Tuesday, 27 August 2002 09:07:44 UTC
