Re: HTML5 Authoring Conformance Study from Maciej Stachowiak on 2010-03-21 (public-html@w3.org from March 2010)

From: Maciej Stachowiak <mjs@apple.com>
Date: Sun, 21 Mar 2010 06:53:28 -0700
To: Sam Ruby <rubys@intertwingly.net>
Cc: HTMLwg WG <public-html@w3.org>
Message-id: <87766B8D-4D9E-4BF2-93A3-6BA3D1B675E2@apple.com>

On Mar 21, 2010, at 5:22 AM, Sam Ruby wrote:

> On 03/21/2010 03:20 AM, Maciej Stachowiak wrote:
>>
>> I've decided that it's worthwhile to review the HTML5 conformance errors
>> reported on notable sites in more detail. I started the following wiki
>> page to collect data:
>>
>> http://www.w3.org/html/wg/wiki/HTML5_Authoring_Conformance_Study
>>
>> Thanks to Aryeh Gregor and myself, we now have a full classification of
>> HTML5 conformance errors on the Alexa Top 10. Thanks also to Sam Ruby
>> for his blog post that inspired the set of sites chosen and links to
>> similar data in raw form. If anyone would like to help with gathering
>> the data for the remaining sites, it would be much appreciated. The
>> methodology is documented on the wiki page.
>
> "full"?  Not hardly.  <grin>


Note that this list (so far) is only an attempt to classify the
categories that validator errors fall into. It's not an attempt to
justify them. I found that in a number of cases, I personally had no
idea why something was disallowed.

> I still remain deeply concerned about a "Ready?  Fire!  Aim?"
> approach to solving these problems.  The first thing that needs to
> be done is to decide what problems Author Conformance
> Requirements address, and how having them makes things
> better.  In short, we would be best served by requiring a change
> proposal for such things.

If we were at the start of the project, that would be a fine approach.  
As things stand now, I would personally prefer not to spend several  
additional years on getting authoring conformance requirements just  
right. It's true that if we could get consensus on removing them all  
and replacing them with nothing all the way to REC, then that might  
save time on the whole. But I would be highly surprised if we could  
quickly get consensus on such a radical approach. Consider:
validator.w3.org would become a tautology machine. It seems like a tough
sell to get that through Last Call.


In the course of reviewing these errors, I concluded that there is at  
least one other good reason for document conformance errors besides  
interoperability. Namely, situations where it is likely the author has  
made a mistake that may have unintended consequences, even if those  
consequences are 100% consistent between user agents.

For example, I think duplicate IDs are a legitimate error. Even if  
they don't break the page in an obvious way, they will have surprising  
effects the moment you call getElementById or attempt to use them as  
fragment identifiers. It seems reasonable to me that every conformance  
checker should be required to report that error, at least unless it is  
specially configured to silence it.
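
To make the duplicate ID case concrete, here is a minimal sketch of the kind of check a conformance checker might perform. It uses Python's standard html.parser module, which is not an HTML5 parser, so treat it as an approximation. The names IdCollector and duplicate_ids and the sample markup are made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collects every id attribute value in document order."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None when the
        # attribute is written without a value.
        for name, value in attrs:
            if name == "id" and value is not None:
                self.ids.append(value)


def duplicate_ids(markup):
    """Return the sorted list of id values that occur more than once."""
    collector = IdCollector()
    collector.feed(markup)
    collector.close()
    counts = Counter(collector.ids)
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical markup: "intro" is used twice, so a lookup by ID or a
# fragment identifier can only ever reach one of the two elements.
sample = '<p id="intro">one</p><p id="intro">two</p><p id="end">three</p>'
print(duplicate_ids(sample))
```

Nothing about such a page necessarily renders incorrectly, which is why this kind of error is easy to miss without a checker.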


> Meanwhile, I've selected one issue each from the top ten list to  
> explore further here.

In your comments below, some point out errors in the Wiki page, which  
I have endeavored to fix. Others question the motivation for  
particular conformance requirements, so I didn't change the  
classification for those, though in some cases I had a comment about  
likely reasoning.

>
> google.com:
>
> the script tag is not unclosed, the html and body tags are unclosed.  
> HTML5 has many elements which do not require close tags.  It even  
> has many tags that are entirely optional.  Both of these tags are  
> entirely optional, but apparently if present must be explicitly  
> closed.  What operational interop problem does this solve?

Actually, the close tags for html and body are both optional. On  
closer review of the markup, I believe the unclosed tag is <center>  
(there are two <center> open tags but only one close.) The validator  
error message could clearly be improved here at the very least. Fixed  
in the wiki page.
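
A rough way to spot this kind of imbalance by hand is to count open and close tags for a given element. The sketch below does only that, using Python's standard html.parser; it does not implement the HTML5 tree construction rules, so it says nothing about which tags may legitimately be omitted. TagCounter, unclosed_count and the sample markup are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags and end tags per element name."""

    def __init__(self):
        super().__init__()
        self.opened = Counter()
        self.closed = Counter()

    def handle_starttag(self, tag, attrs):
        self.opened[tag] += 1

    def handle_endtag(self, tag):
        self.closed[tag] += 1


def unclosed_count(markup, tag):
    """Return (number of open tags) - (number of close tags) for tag."""
    counter = TagCounter()
    counter.feed(markup)
    counter.close()
    return counter.opened[tag] - counter.closed[tag]


# Hypothetical markup with two <center> open tags and one close tag.
sample = "<center><center>logo</center><p>text</p>"
print(unclosed_count(sample, "center"))
```

A nonzero result for an element whose end tag is required is a hint about where to look; a real checker would report the position of the unmatched open tag.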

>
> facebook.com:
>
> How is this a "bad doctype"?  What operational interop problem does  
> it solve to identify this doctype as non-conforming?  I thought the  
> HTML5 strategy was that the web is to be considered as non-versioned.

The XHTML 1.0 Strict doctype is actually allowed in general - it's not  
flagged as an error on other pages that use it. I believe the  
validator is complaining about the newline in the doctype string -  
that's the only difference I can find compared to the msn.com doctype  
which is not flagged as an error.

>
> yahoo.com:
>
> y-pkgid could arguably conform to "proposal Y" for issue-41.   
> Allowing "modid" would both inhibit the ability of the validator to  
> catch misspellings, and the ability for future versions of the spec  
> to define new attributes.

This seems to hint at a possible third reason for conformance  
requirements besides interoperability and catching likely authoring  
errors: protecting future ability to evolve the language.

Splitting out custom attributes with a hyphen seems reasonable if we  
are considering treating them differently. I updated the wiki page to  
reflect that.

I believe that in this particular case, a data-* attribute would be an  
appropriate replacement for modid and y-pkgid though.
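
As a sketch of the split described above, the function below sorts an attribute name into one of three buckets. It assumes the name has already been determined not to be an attribute defined by the spec; the bucket labels and the example name data-pkgid are my own illustration, not wording from the wiki page or from yahoo.com.

```python
def classify_custom_attribute(name):
    """Classify a non-standard attribute name by its spelling.

    Assumes the caller has already established that the attribute is
    not one the spec defines.
    """
    name = name.lower()
    if name.startswith("data-") and len(name) > len("data-"):
        return "data-*"
    if "-" in name:
        return "hyphenated"
    return "unhyphenated"


for attr in ("modid", "y-pkgid", "data-pkgid"):
    print(attr, classify_custom_attribute(attr))
```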

>
> youtube.com:
>
> What interop issues are solved by disallowing div elements inside of  
> span elements?
>
> live.com:
>
> This issue has already been widely discussed.  Additional  
> information can be found here: http://philip.html5.org/data/xmlns-bindings.txt

Philip just pointed me to a newer data set: http://philip.html5.org/data/xmlns-attributes.txt

Added a link to the wiki page.


>
> wikipedia.org:
>
> While there are no errors, there is a warning, and getting the  
> definition of IRI correct is definitely something that is relevant  
> to HTML5.

The warning is not mandated by HTML5 as far as I know. So it seems  
irrelevant to discussion of HTML5 conformance requirements.

>
> blogger.com:
>
> What interop issues are solved by disallowing blank targets?

I don't think this is an interop issue, but it does seem like a likely  
author oversight, since target="" has no effect.

>
> baidu.com:
>
> What interop issues are solved by requiring script elements to come  
> before </body>?

I don't know about <script> itself, but for any element with visible  
rendering, putting it after </body> is likely not to give the results  
the author intended. This particular <script> uses document.write. In  
this case, it writes another <script> tag, but in general  
document.write could end up putting arbitrary content after </body>.

>
> msn.com:
>
> Separate issues: whitespace within query and whitespace either
> before or after the IRI.

It looks like there is only one case of trailing whitespace  
(mysteriously reported as whitespace in query) and none of leading.  
The rest all seemed to be internal whitespace. I split the two  
categories. On qq.com all the URL errors were internal whitespace.
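
The distinction between the categories can be sketched as follows. This is only an illustration of leading, trailing and internal whitespace in an attribute value, not the validator's actual logic; the function name and the example.com URLs are made up.

```python
def whitespace_problems(url):
    """Return the set of whitespace categories found in a URL string."""
    problems = set()
    if url != url.lstrip():
        problems.add("leading")
    if url != url.rstrip():
        problems.add("trailing")
    # Whatever whitespace remains after trimming both ends is internal.
    if any(ch.isspace() for ch in url.strip()):
        problems.add("internal")
    return problems


# Hypothetical attribute values, one per category.
for value in (" http://example.com/",
              "http://example.com/?q=1 ",
              "http://example.com/a b"):
    print(repr(value), sorted(whitespace_problems(value)))
```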

>
> qq.com:
>
> I realize that X-UA-compatible is controversial, but non-conforming?

It seems like this actually *does* meet your standard of creating an  
interop problem, in that it is a vendor-specific feature which invokes  
a nonstandard rendering mode. However, it is being used on 3 of the  
top 10 sites.


Regards,
Maciej
Received on Sunday, 21 March 2010 13:54:02 UTC