- From: Karl Dubost <karl@w3.org>
- Date: Wed, 11 Jul 2007 21:31:12 +0900
- To: Dominique Hazael-Massieux <dom@w3.org>
- Cc: QA Dev <public-qa-dev@w3.org>

The Semantic Data Extractor helps you understand the semantic data contained in your document.
http://www.w3.org/2003/12/semantic-extractor.html

It is XSLT driven (make sure to view source):
http://www.w3.org/2002/08/extract-semantic

There might be interest in
- adding some rules
- re-creating it in a language other than XSLT

The tool could be used to help create a Content "Conformance/Sanity" checker for authors.

What does it do?

Extraction of

* Document title
  - content of "title" element [PCDATA]

* Document Author Information
  - content of "address" element [inline]
  - in "meta" element, when name="author", content of "content" attribute [CDATA]

* Languages used throughout the document
  - content of "lang" attributes throughout the document [CDATA]

* Metadata Profiles
  - content of "profile" attribute [URIs]

* Document Description (abstract)
  - in "meta" element, when name="description", content of "content" attribute [CDATA]

* Available Translations
  - in "link" element, when rel="alternate" and lang="" NOT empty, content of "href" attribute [URIs]

* Alternate Stylesheets
  - in "link" element, when rel="alternate" and media="" NOT empty, content of "href" attribute [URIs]

* Alternate formats
  - in "link" element, when rel="alternate" and NOT lang and NOT media and type="" NOT empty, content of "href" attribute [URIs]

* Navigation of the site
  - in "link" element, when rel="something", content of "href" attribute [URIs]
    (something is)
      start: start page
      next: next page
      previous: previous page
      content: table of contents
      index: index of the site
      glossary: glossary of the site
      copyright: copyright for the page
      chapter: chapter for this section
      section: section of the site
      subsection: subsection of the site
      appendix: appendix for this page
      help: help for this page
      bookmark: bookmarkable resource for this page

* Definitions (sorted)
  - content of "dfn" element [inline]
  - content of "dt" element [inline]

* Acronyms and Abbreviations
  - content of "acronym" element and associated "title" attribute [inline]
  - content of "abbr" element and associated "title" attribute [inline]

* Quotes
  - content of "blockquote" element and associated "cite" attribute [block]
  - content of "q" element and associated "cite" attribute [inline]

* Sources and references
  - content of "cite" element [inline]

* Outline of the document
  - if an "h1" element is present, extract the content of the h1, h2, h3, h4, h5, h6 elements and of any "a" element they contain [inline]

-- 
Karl Dubost - http://www.w3.org/People/karl/
W3C Conformance Manager, QA Activity Lead
QA Weblog - http://www.w3.org/QA/
*** Be Strict To Be Cool ***
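
As a rough illustration of how a few of the rules above might look in a language other than XSLT, here is a minimal Python sketch built on the standard library's html.parser module. It is not the XSLT implementation and does not claim to match its output. The class name SemanticExtractor, the helper extract(), and the result field names are hypothetical, and only the title, meta, lang, and link rules are covered, with the rel keywords taken as listed in this message.

```python
from html.parser import HTMLParser

# rel keywords as listed in the message (assumption: matched case-insensitively)
NAV_RELS = {
    "start", "next", "previous", "content", "index", "glossary", "copyright",
    "chapter", "section", "subsection", "appendix", "help", "bookmark",
}


class SemanticExtractor(HTMLParser):
    """Hypothetical collector for a subset of the extraction rules."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}           # "author" / "description" -> content value
        self.languages = set()   # every non-empty lang attribute seen
        self.translations = []   # rel="alternate" with lang
        self.stylesheets = []    # rel="alternate" with media
        self.formats = []        # rel="alternate" with type only
        self.navigation = {}     # rel keyword -> list of href values
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # Attributes without a value arrive as None; normalise to "".
        a = {name: (value or "") for name, value in attrs}
        if a.get("lang"):
            self.languages.add(a["lang"])
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = a.get("name", "").lower()
            if name in ("author", "description"):
                self.meta[name] = a.get("content", "")
        elif tag == "link":
            self._handle_link(a)

    def _handle_link(self, a):
        rels = a.get("rel", "").lower().split()
        href = a.get("href", "")
        lang, media, mime = a.get("lang"), a.get("media"), a.get("type")
        if "alternate" in rels:
            if lang:
                self.translations.append(href)
            if media:
                self.stylesheets.append(href)
            if not lang and not media and mime:
                self.formats.append(href)
        for rel in rels:
            if rel in NAV_RELS:
                self.navigation.setdefault(rel, []).append(href)

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
            self.title = self.title.strip()


def extract(html):
    """Parse an HTML string and return the populated extractor."""
    parser = SemanticExtractor()
    parser.feed(html)
    parser.close()
    return parser
```

Calling extract() on a page's source returns an object whose fields (title, meta, languages, translations, stylesheets, formats, navigation) can be inspected directly. The remaining rules (definitions, abbreviations, quotes, outline) would need the same kind of start/end tag bookkeeping used here for "title".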
Received on Wednesday, 11 July 2007 12:31:59 UTC