
Re: [whatwg] Thesis draft about HTML5 conformance checking

From: Henri Sivonen <hsivonen@iki.fi>
Date: Wed, 28 Mar 2007 15:24:09 +0300
Message-Id: <C32365C0-D5C1-4BE9-8185-801F93793935@iki.fi>
Cc: WHAT WG List <whatwg@whatwg.org>, www-validator Community <www-validator@w3.org>
To: olivier Thereaux <ot@w3.org>

On Mar 12, 2007, at 05:27, olivier Thereaux wrote:

> On Mar 11, 2007, at 02:15 , Henri Sivonen wrote:
>> The draft of my master's thesis is available for commenting at:
>> http://hsivonen.iki.fi/thesis/
>
> Henri, congratulations on your work on the HTML conformance checker  
> and on the Thesis.

Thanks.

> It's been a truly informative and enlightening reading, especially  
> the parts where you develop on the (im)possibility of using only  
> schemas to describe conformance to the html5 specs. This is a  
> question that has been bothering me for a long time, especially as  
> there is only one (as of today) production-ready conformance  
> checking tool not based on some kind (or combination) of schema- 
> based parsers,

I take it that you mean the Feed Validator?

>> [2.3.2] I share the view of the Web that holds WebKit, Presto,  
>> Gecko and Trident (the engines of Safari, Opera, Mozilla/Firefox  
>> and IE, respectively) to be the most important browser engines.
>
> Did you have a chance to look at engines in authoring tools?

I didn't investigate them beyond mentioning three authoring tools  
that have a RELAX NG-driven auto-completion feature.

> What type of parser do NVU, Amaya, golive etc work on?

For authoring tools, the key thing is that their serializers work  
with browser parsers. The details of how authoring tools recover  
from bad markup are not as crucial as recovery in browsers, because  
with authoring tools the author has a chance to review the recovery  
result.

> How about parsing engines for search engine robots? These are  
> probably as important, if not more as some of the browser engines  
> in defining the "generic" engine for the web today.

Search engines are secretive about what they do, but I would assume  
that they'd want to be compatible with browsers in order to fight  
SEO cloaking.

>> [4.1] The W3C Validator sticks strictly to the SGML validity  
>> formalism. It is often argued that it would be inappropriate for a  
>> program to be called a “validator” unless it checks exactly for  
>> validity in the SGML sense of the word – nothing more, nothing less.
>
> That's very true; there's a strong reluctance in part of the  
> validator user community to do anything other than formal  
> validation, mostly (?) out of fear that it would eventually make  
> the term "validation" meaningless. The only checks the validator  
> does beyond DTD validation are the preparse checks on encoding,  
> presence of doctype, media type, etc.

ISO and the W3C have already expanded the notion of validation to  
cover schema languages other than DTDs. In colloquial usage,  
"validation" is already understood to mean checking in general. The  
notion of a "schema" could be detached from any particular schema  
language and taken to be an abstract partitioning of the set of  
possible XML documents into two disjoint sets: valid and invalid.  
Calling the process of deciding which set a given document instance  
belongs to "validation" would give a formal definition that matches  
the colloquial usage.
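That abstract view can be sketched directly in code. A minimal illustration (in Python rather than the checker's Java; the names here are toy examples of mine, not anything from the thesis):

```python
# A "schema", viewed abstractly, is just a predicate that partitions
# the set of possible XML documents into two disjoint sets: valid
# and invalid.  (Toy example; names are illustrative.)
import xml.etree.ElementTree as ET
from typing import Callable

Schema = Callable[[str], bool]  # document instance -> is it in the valid set?

def root_is_greeting(doc: str) -> bool:
    """A toy schema: valid iff the document is well-formed XML
    whose root element is <greeting>."""
    try:
        return ET.fromstring(doc).tag == "greeting"
    except ET.ParseError:
        return False

def validate(schema: Schema, doc: str) -> str:
    # "Validation" is then just deciding which of the two sets the
    # given document instance belongs to.
    return "valid" if schema(doc) else "invalid"

print(validate(root_is_greeting, "<greeting>hi</greeting>"))  # valid
print(validate(root_is_greeting, "<p>hi</p>"))                # invalid
```

Under this definition any decidable predicate counts as a schema, regardless of whether it is expressible in a grammar-based schema language.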

I do sympathize with Hixie's reluctance to call "HTML5 conformance  
checking" "HTML5 validation", though. Calling it "conformance  
checking" makes sure that others don't have a claim on defining what  
it means. Fighting the colloquial usage will probably be futile,  
though, outside spec lawyerism.

>> [6.1.3] Erroneous Source Is Not Shown
>> The error messages do not show the erroneous markup. For this  
>> reason it is unnecessarily hard for the user to see where the  
>> problem is.
>
> Was this by lack of time?

Yes. Showing the source code based on the SAX-reported line and  
column numbers is useful, but it isn't novel enough, or central  
enough to proving the feasibility of the chosen implementation  
approach, to justify delaying the publication of the thesis.
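The idea could be sketched roughly like this (a sketch of mine in Python's xml.sax, which mirrors the Java SAX interfaces, not the checker's actual code):

```python
# Sketch of "Showing the Erroneous Source Markup": keep the source
# lines around and, when SAX reports an error position, echo the
# offending line with a caret under the reported column.
# (Illustrative names; not from the actual conformance checker.)
import io
import xml.sax

class SourceShowingErrorHandler(xml.sax.handler.ErrorHandler):
    def __init__(self, source_lines):
        self.source_lines = source_lines
        self.reports = []

    def fatalError(self, exc):
        line, col = exc.getLineNumber(), exc.getColumnNumber()  # col is 0-based
        src = self.source_lines[line - 1]
        self.reports.append(
            f"{line}:{col}: {exc.getMessage()}\n{src}\n{' ' * col}^")
        raise exc  # a fatal error must stop the parse

doc = "<root><p>unclosed</root>"
handler = SourceShowingErrorHandler(doc.splitlines())
parser = xml.sax.make_parser()
parser.setErrorHandler(handler)
try:
    parser.parse(io.BytesIO(doc.encode("utf-8")))
except xml.sax.SAXParseException:
    pass
print(handler.reports[0])
```

The caret placement assumes the parser reports 0-based column numbers, as expat does; a real implementation would also need to handle tabs and very long lines.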

Observing the thesis projects of my friends who started before me has  
taught me that it is a mistake to promise a complete software product  
as a precondition for the completion of the thesis. Software always  
has one more bug to fix or one more feature to add. On the other  
hand, as far as the academic requirements go, one could even write a  
thesis explaining why a project failed.

> Did you have a look at existing implementations?

On this particular point, not yet.

> Oh, I see [8.10 Showing the Erroneous Source Markup] as future  
> work. If you're looking for a decent, though by no means perfect,  
> implementation, look for sub truncate_line in  
> http://dev.w3.org/cvsweb/~checkout~/validator/httpd/cgi-bin/check

Thanks. I'll keep this in mind.

>> [8.1] Even though the software developed in this project is Free  
>> Software / Open Source, it has not been developed in a way that  
>> would make it easily approachable to potential contributors.  
>> Perhaps the most pressing need for change in order to move the  
>> software forward after the completion of this thesis is moving the  
>> software to a public version control system and making building  
>> and deploying the software easy.
>
> Making it available on a more open-sourcey system, with a  
> multi-user revision system, will probably not create an explosion  
> of code contributors (you've had very helpful contributions from  
> e.g. Elika, and most OS projects, even successful ones, never have  
> more than a handful of coders), but you may be able to create a  
> healthy community of users, reviewers, bug spotters, translators,  
> and document editors, beyond the whatwg community.

I am not expecting an explosion of contributors. However, I have  
reason to believe that my current arrangement has caused at least one  
potential contributor to walk away. I'd rather avoid turning people  
away.

Also, in the future, I'd like to make it super-easy for CMS  
developers to integrate the conformance checker back end into their  
products. To enable this, the barrier to getting a runnable copy  
should be low.

I'm very pessimistic about translations. Even the online markup  
checkers whose authors have borne the burden of making the messages  
translatable aren't getting numerous translation contributions.

> If you're interested in using w3c logistics, and benefit from the  
> existing communities around w3c, I'm happy to invite you.

Thank you. I'll keep your offer in mind when it is time to figure out  
where to put the source.

>> [8.8] To support the use of the conformance checker back end from  
>> other applications (non-Java applications in particular), a Web  
>> service would be useful.
>
> Indeed. Did you have a chance to look at EARL?

I did. I also had a look at the SOAP and Unicorn outputs of the W3C  
Validator. I like EARL the least of the three, because its  
assumptions about the nature of the checker software do not work well  
with implementations that have a grammar-based schema inside.  
Grammar-based implementations cannot cite an exact conformance  
criterion when a derivation in the grammar fails, as the EARL output  
of the W3C Validator demonstrates. The SOAP and Unicorn formats, even  
if crufty to my taste, better match the SAX ErrorHandler interface.
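The fit is visible in the shape of the interface: an ErrorHandler receives a flat stream of warning/error/fatalError calls, each carrying only a message and a position, which maps directly onto a flat message list of the SOAP/Unicorn kind but offers no slot for the conformance-criterion pointer EARL would want. A sketch of mine, using Python's xml.sax mirror of the Java interface:

```python
# A SAX-style ErrorHandler naturally yields a flat list of
# {severity, line, column, message} records -- the shape of the
# SOAP/Unicorn outputs -- with no field identifying which exact
# conformance criterion was violated.
import io
import xml.sax

class CollectingErrorHandler(xml.sax.handler.ErrorHandler):
    def __init__(self):
        self.messages = []

    def _record(self, severity, exc):
        self.messages.append({
            "severity": severity,
            "line": exc.getLineNumber(),
            "column": exc.getColumnNumber(),
            "message": exc.getMessage(),
        })

    def warning(self, exc):
        self._record("warning", exc)

    def error(self, exc):
        self._record("error", exc)

    def fatalError(self, exc):
        self._record("fatal", exc)
        raise exc

handler = CollectingErrorHandler()
parser = xml.sax.make_parser()
parser.setErrorHandler(handler)
try:
    parser.parse(io.BytesIO(b"<a><b></a>"))  # mismatched tag -> fatal
except xml.sax.SAXParseException:
    pass
print(handler.messages)
```

Serializing such records as SOAP or Unicorn messages is a straightforward mapping; producing an EARL assertion that names the violated criterion would require information the grammar-driven parse never had.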

I think I saw Relaxed having its own SAX ErrorHandler-friendly  
format, but now I can't find it.

> I wrote some basic notes at  
> http://lists.w3.org/Archives/Public/www-validator/2007Mar/0005

Thanks. My notes are at  
http://lists.w3.org/Archives/Public/www-validator/2006Dec/0060.html  
and at  
http://wiki.whatwg.org/wiki/Conformance_Checker_Web_Service_Interface_Ideas

Thank you for your comments.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Wednesday, 28 March 2007 12:24:13 GMT
