Validator Architecture: now, and in the future

Dear all,

The first HTML Validator [1], announced almost 15 years ago, was  
basically a simple CGI wrapper around James Clark's SGML parser. The  
W3C Markup Validator, which started as a “Kinder, Gentler” tool [2], has
remained in essence a wrapper around sgmls (and later the open version,
onsgmls), with many layers of UI, heuristics, pre-parsing, guessing,
etc.

[1] http://lists.w3.org/Archives/Public/www-html/1994Jul/0015
[2] http://validator.w3.org/about.html

The Web and the validator have evolved, and it is now much more
difficult to explain “what the validator does”. Even more complicated
is “what the validator should do”. Why so complicated? Because the
markup validator aims to be a tool to check almost any kind of markup
on the web, from legacy tag soup to cutting-edge HTML5 to XML documents
mixing and matching languages and namespaces.

How can a single tool cater for such varied types of markup? “It's
complicated” is an answer, but not a very satisfying one. Recently, I
have tried to use my meager flowchart skills to explain how the
validator works, and how it should work. I think I have reached a
reasonably comprehensive and usable point, and have added the chart to
the roadmap of the validator:

http://qa-dev.w3.org/wmvs/HEAD/todo.html#roadmap
(the flowchart itself is available in PNG, SVG, PDF and Graffle formats)

The chart itself is still in flux, and I would like your help in  
checking if the flow as it is described makes sense. Anything  
illogical? Anything missing?


Next, of course, comes the actual implementation of the flow. I added a
quick summary of the necessary steps, soon to be added as Bugzilla
entries. Each of these steps constitutes a semi-independent project, so
if anyone is interested in getting involved, work on this would be
extremely useful, in particular for those willing to improve the
adoption rate of SVG, Math on the web, RDF, compound XML document
formats, etc.

-- 
olivier

Received on Tuesday, 24 February 2009 19:04:56 UTC