- From: Karl Dubost <karl@w3.org>
- Date: Wed, 11 Jul 2007 21:31:12 +0900
- To: Dominique Hazael-Massieux <dom@w3.org>
- Cc: QA Dev <public-qa-dev@w3.org>
The Semantic Data Extractor helps you understand the semantic data
contained in your document.
http://www.w3.org/2003/12/semantic-extractor.html
It is XSLT-driven:
http://www.w3.org/2002/08/extract-semantic (make sure to view source)
There might be interest in
- adding some rules
- reimplementing it in a language other than XSLT
The tool could be used to help create a content "Conformance/Sanity"
checker for authors.
What does it do?
Extraction of
* Document title
- content of "title" element [PCDATA]
* Document Author Information
- content of "address" element [inline]
- in "meta" element, when name="author", content of "content"
attribute [CDATA]
* Languages used throughout the document
- content of "lang" attributes throughout the document [CDATA]
* Metadata Profiles
- content of "profile" attribute [URIs]
* Document Description (abstract)
- in "meta" element, when name="description", content of "content"
attribute [CDATA]
* Available Translations
- in "link" element, when rel="alternate" and lang="" NOT empty,
content of "href" attribute [URIs]
* Alternate Stylesheets
- in "link" element, when rel="alternate" and media="" NOT empty,
content of "href" attribute [URIs]
* Alternate formats
- in "link" element, when rel="alternate" and NOT lang and NOT
media and type="" NOT empty, content of "href" attribute [URIs]
* Navigation of the site
- in "link" element, when rel="something", content of "href"
attribute [URIs]
(where "something" is one of)
start: start page
next: next page
previous: previous page
content: table of contents
index: index of the site
glossary: glossary of the site
copyright: copyright for the page
chapter: chapter for this section
section: section of the site
subsection: subsection of the site
appendix: appendix for this page
help: help for this page
bookmark: bookmarkable resource for this page
* Definitions (sorted)
- content of "dfn" element [inline]
- content of "dt" element [inline]
* Acronyms and Abbreviations
- content of "acronym" element and associated "title" attribute
[inline]
- content of "abbr" element and associated "title" attribute [inline]
* Quotes
- content of "blockquote" element and associated "cite" attribute
[block]
- content of "q" element and associated "cite" attribute [inline]
* Sources and references
- content of "cite" element [inline]
* Outline of the document
- if an "h1" element is present, content of the h1, h2, h3, h4, h5,
h6 elements and of any contained "a" element [inline]
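
As a possible starting point for a version in a language other than XSLT, here is a minimal Python sketch of a few of the rules above (title, author and description "meta", "lang" attributes, "link" classification, outline). It is not a port of the stylesheet and may well differ from what the XSLT actually does. The names SemanticSketch, extract, NAV_RELS, HEADINGS and META_KEYS are made up for illustration, and it assumes each "rel" attribute holds a single value. It only uses the standard library's html.parser.

```python
from html.parser import HTMLParser

# rel values listed in the message for site navigation.
NAV_RELS = {"start", "next", "previous", "content", "index", "glossary",
            "copyright", "chapter", "section", "subsection", "appendix",
            "help", "bookmark"}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
META_KEYS = {"author": "authors", "description": "descriptions"}


class SemanticSketch(HTMLParser):
    """Hypothetical partial re-implementation; not the XSLT's actual logic."""

    def __init__(self):
        super().__init__()
        self.data = {"title": "", "authors": [], "descriptions": [],
                     "languages": [], "translations": [], "stylesheets": [],
                     "formats": [], "navigation": [], "outline": []}
        self._capture = None  # "title" or "h1".."h6" while collecting text
        self._text = []

    def handle_starttag(self, tag, attrs):
        a = {name: (value or "") for name, value in attrs}
        # Every non-empty lang attribute counts as a used language.
        if a.get("lang") and a["lang"] not in self.data["languages"]:
            self.data["languages"].append(a["lang"])
        if tag == "meta" and a.get("name") in META_KEYS:
            self.data[META_KEYS[a["name"]]].append(a.get("content", ""))
        elif tag == "link":
            self._classify_link(a)
        elif tag == "title" or tag in HEADINGS:
            self._capture, self._text = tag, []

    def _classify_link(self, a):
        # Assumption: rel holds a single value and is compared exactly.
        rel, href = a.get("rel", "").lower(), a.get("href", "")
        if rel == "alternate":
            if a.get("lang"):
                self.data["translations"].append(href)
            if a.get("media"):
                self.data["stylesheets"].append(href)
            if not a.get("lang") and not a.get("media") and a.get("type"):
                self.data["formats"].append(href)
        elif rel in NAV_RELS:
            self.data["navigation"].append((rel, href))

    def handle_data(self, data):
        if self._capture:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._capture:
            return
        text = " ".join("".join(self._text).split())  # collapse whitespace
        if tag == "title":
            self.data["title"] = text
        else:
            self.data["outline"].append((int(tag[1]), text))
        self._capture = None


def extract(html):
    """Return a dict of extracted items for an HTML string."""
    parser = SemanticSketch()
    parser.feed(html)
    parser.close()
    # The outline is only reported when the document has an h1.
    if not any(level == 1 for level, _ in parser.data["outline"]):
        parser.data["outline"] = []
    return parser.data
```

For example, extract('<title>Demo</title><link rel="next" href="p2.html">')["navigation"] should give [("next", "p2.html")]. Definitions, abbreviations, quotes and "cite" are left out here; they could probably be handled with the same capture mechanism used for headings.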
--
Karl Dubost - http://www.w3.org/People/karl/
W3C Conformance Manager, QA Activity Lead
QA Weblog - http://www.w3.org/QA/
*** Be Strict To Be Cool ***
Received on Wednesday, 11 July 2007 12:31:59 UTC