Functional Documentation of Semantic Data Extractor from Karl Dubost on 2007-07-11 (public-qa-dev@w3.org from July 2007)

From: Karl Dubost <karl@w3.org>
Date: Wed, 11 Jul 2007 21:31:12 +0900
To: Dominique Hazael-Massieux <dom@w3.org>
Cc: QA Dev <public-qa-dev@w3.org>
Message-Id: <F51C68CB-2BEF-49B8-95E7-8ADA305ACF2A@w3.org>

The Semantic Data Extractor helps you understand the semantic data
contained in your document.
http://www.w3.org/2003/12/semantic-extractor.html

It is XSLT-driven:
http://www.w3.org/2002/08/extract-semantic (make sure to view source)

There might be interest in
	- adding some rules
	- reimplementing it in a language other than XSLT

The tool could be used to help create a content "Conformance/Sanity"
checker for authors.



What does it do?

Extraction of

* Document title
   - content of "title" element [PCDATA]

* Document Author Information
   - content of "address" element [inline]
   - in "meta" element, when name="author", content of "content"  
attribute [CDATA]

* Languages used in the document
   - content of "lang" attributes throughout the document [CDATA]

* Metadata Profiles
   - content of "profile" attribute [URIs]

* Document Description (abstract)
   - in "meta" element, when name="description", content of "content"  
attribute [CDATA]

* Available Translations
   - in "link" element, when rel="alternate" and the "lang" attribute
     is NOT empty, content of "href" attribute [URIs]

* Alternate Stylesheets
   - in "link" element, when rel="alternate" and the "media" attribute
     is NOT empty, content of "href" attribute [URIs]

* Alternate formats
   - in "link" element, when rel="alternate", there is NO "lang" and NO
     "media" attribute, and the "type" attribute is NOT empty, content
     of "href" attribute [URIs]

* Navigation of the site
   - in "link" element, when rel="something", content of "href"
     attribute [URIs]
   ("something" is one of)
   start: start page
   next: next page
   previous: previous page
   content: table of contents
   index: Index of the site
   glossary: glossary of the site
   copyright: copyright statement for the page
   chapter: chapter for this section
   section: section of the site
   subsection: subsection of the site
   appendix: appendix for this page
   help: help for this page
   bookmark: bookmarkable resource for this page

* Definitions (sorted)
   - content of "dfn" element [inline]
   - content of "dt" element [inline]

* Acronyms and Abbreviations
   - content of "acronym" element and associated "title" attribute  
[inline]
   - content of "abbr" element and associated "title" attribute [inline]

* Quotes
   - content of "blockquote" element and associated "cite" attribute  
[block]
   - content of "q" element and associated "cite" attribute [inline]

* Sources and references
   - content of "cite" element [inline]

* Outline of the document
   - if an "h1" element is present, content of the h1, h2, h3, h4, h5,
     h6 elements and of any "a" element they contain [inline]
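
For anyone who wants to try this in a language other than XSLT, here is
a minimal Python sketch of a few of the rules above (title, author,
description, languages, the three rel="alternate" cases, and the
outline). It is not a port of the stylesheet. The function names, the
result dictionary and the sample page are made up for illustration, and
it assumes well-formed XHTML in the XHTML namespace, with no DOCTYPE and
no named entities.

```python
import xml.etree.ElementTree as ET

# Assumption: the input is well-formed XHTML using the XHTML namespace.
XHTML = "{http://www.w3.org/1999/xhtml}"


def text_of(element):
    """Return the text content of an element, whitespace-normalized."""
    return " ".join("".join(element.itertext()).split())


def classify_alternate(link):
    """Return the result buckets a "link" element falls into."""
    labels = []
    if link.get("rel") != "alternate":
        return labels
    lang, media = link.get("lang"), link.get("media")
    if lang:
        labels.append("translations")
    if media:
        labels.append("stylesheets")
    if not lang and not media and link.get("type"):
        labels.append("formats")
    return labels


def extract(xhtml_source):
    """Apply a subset of the extraction rules to an XHTML string."""
    root = ET.fromstring(xhtml_source)
    result = {"title": None, "authors": [], "description": None,
              "languages": [], "translations": [], "stylesheets": [],
              "formats": [], "outline": []}

    title = root.find(f".//{XHTML}title")
    if title is not None:
        result["title"] = text_of(title)

    # Author information: "address" content, then meta name="author".
    for address in root.iter(f"{XHTML}address"):
        result["authors"].append(text_of(address))
    for meta in root.iter(f"{XHTML}meta"):
        if meta.get("name") == "author" and meta.get("content"):
            result["authors"].append(meta.get("content"))
        elif meta.get("name") == "description":
            result["description"] = meta.get("content")

    # Every distinct "lang" value, in document order.
    for element in root.iter():
        lang = element.get("lang")
        if lang and lang not in result["languages"]:
            result["languages"].append(lang)

    for link in root.iter(f"{XHTML}link"):
        for label in classify_alternate(link):
            result[label].append(link.get("href"))

    # Outline: only produced when the document has an "h1".
    headings = {f"{XHTML}h{n}" for n in range(1, 7)}
    if root.find(f".//{XHTML}h1") is not None:
        for element in root.iter():
            if element.tag in headings:
                level = int(element.tag[-1])
                result["outline"].append((level, text_of(element)))
    return result


# Hypothetical sample page, for illustration only.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head>
  <title>Sample   page</title>
  <meta name="author" content="A. Author"/>
  <meta name="description" content="A short test page."/>
  <link rel="alternate" lang="fr" href="sample.fr.html"/>
  <link rel="alternate" media="print" href="print.css"/>
  <link rel="alternate" type="application/pdf" href="sample.pdf"/>
</head>
<body>
  <h1>Sample <a href="#top">page</a></h1>
  <h2>Details</h2>
  <address>Written by A. Author</address>
</body>
</html>"""

if __name__ == "__main__":
    for key, value in extract(SAMPLE).items():
        print(f"{key}: {value}")
```

Note that a "link" carrying both "lang" and "media" lands in both the
translations and the stylesheets buckets, since the rules as listed do
not exclude each other. The remaining rules (navigation links,
definitions, abbreviations, quotes, "cite") would follow the same
pattern.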


-- 
Karl Dubost - http://www.w3.org/People/karl/
W3C Conformance Manager, QA Activity Lead
   QA Weblog - http://www.w3.org/QA/
      *** Be Strict To Be Cool ***

Received on Wednesday, 11 July 2007 12:31:59 UTC