[whatwg] Make quoted attributes a conformance criteria from Henri Sivonen on 2009-08-04 (public-whatwg-archive@w3.org from August 2009)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Tue, 4 Aug 2009 14:59:00 +0300
Message-ID: <7869C41F-3975-4A58-91A9-8C1349A465FE@iki.fi>
On Aug 3, 2009, at 05:45, Ian Hickson wrote:

> On Thu, 23 Jul 2009, Keryx Web wrote:
>> No suggested text, but a rewrite will be necessary if quotation marks
>> becomes a conformance criterion.
>
> Instead of preventing anyone from not using quote marks, I would instead
> recommend asking your validator vendor to offer you an option to require
> quote marks and warn you when you have forgotten them.

There's a usability cost and a QA cost to adding optional features to  
a validator, which is why I try to resist requests to add more  
configuration and optional features to Validator.nu.

I've gotten requests to add checks inspired by XHTML. These requests  
generally aren't about true polyglot checking (since few people know  
how long the actual polyglot checking corner case list is). Also,  
these requests aren't about code style in general. (Well, actually,  
Anne asked for indent style checking on Sam's blog comments and  
another commenter thought Anne was making fun of the quote issue...)

Since the requests happen to be about the most prominent syntactic
features of XML, as opposed to being about code style across the board,
I suspect the requests are partly motivated by unease about letting go
of extra requirements taught as part of XHTML-as-text/html evangelism.
When adding optional warnings to Validator.nu, I'd like to tell apart
actual problems from unease about letting go of XHTML-as-text/html
before proceeding. (I expect "actual problems" to be with us always, but
I expect the unease to pass with a little time.)

The top 4 requests are:
  * Flagging unquoted attributes.
  * Flagging implied tags.
  * Flagging non-lowercase element and attribute names.
  * Flagging inconsistent use of /> on void elements.

I think the implied tags are different from the rest, and I think
an option to flag implied tags would be a useful feature to have. I  
want to implement it, but I have some higher-priority Gecko items on  
my plate first.

Implied tags are different from the rest, because tag inference doesn't
necessarily work like authors expect, so automatically generating the  
tags might not do the right thing.

OTOH, the other cases can be safely fixed automatically. (Except some  
quoting issues; more on that later.) In fact, indent style is also  
something that could be made consistent automatically by an HTML-aware  
text editor. When issues are code style issues in nature and don't  
need human intervention to change to a particular style, I think it's  
more useful to have an editor that simply reformats code than to have  
a validator that flags failure to comply with code style without
performing the reformatting. Consider Eclipse JDT: If you have bad  
indents, you don't get warning or error markers in the margin.  
Instead, you can ask Eclipse to reformat code according to a wide  
variety of settings.

This creates a mild issue: If different people collaborate and have  
different code formatter settings, having another person's editor  
reformat code creates some source control issues. However, you don't  
get error messages, so you are still avoiding the problem that I
raised earlier on public-html and that Maciej mentioned here about tool
interop (so that you can swap tools without getting a huge bunch of  
errors).

Also note that we can't really eliminate this source control
re-indenting issue on the spec level. There's no way we could get
everyone to agree on One True HTML indent style. And as long as there  
isn't One True indent style consistently applied everywhere, I think  
it doesn't matter much if other syntax is used consistently.

Now, people are going to say that it's good to use /> consistently,  
because it helps you see which elements are void elements. It doesn't  
work that way, though. Because /> has no effect on HTML elements, you  
still need to *know* which elements are void elements, and pretending  
that /> means something poisons the mental model people have and is  
actually bad for teaching. (People write <div class="foo"/> or <script  
src="..."/> having heard that /> closes the element.) Unfortunately,  
we need to keep /> conforming to make it easy to upgrade
XHTML-as-text/html-emitting systems to HTML5.
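
To illustrate the point, here is a rough Python sketch of how the stack of open elements evolves when the trailing slash is ignored on HTML elements. The function name open_elements, the regular expression and the short VOID_ELEMENTS set are assumptions made for this illustration; the sketch ignores implied tags, script content and foreign (SVG/MathML) content, so it is not a real tree builder.

```python
import re

# Illustrative subset of the void elements; the real list is longer.
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}

# Matches one start or end tag; quoted attribute values may contain ">".
TAG = re.compile(r'''<(/?)([a-zA-Z][a-zA-Z0-9]*)((?:"[^"]*"|'[^']*'|[^>"'])*)>''')


def open_elements(markup):
    """Return the names of elements still open at the end of markup."""
    stack = []
    for match in TAG.finditer(markup):
        is_end_tag = match.group(1) == "/"
        name = match.group(2).lower()
        if is_end_tag:
            if name in stack:
                # Close everything up to and including the nearest match.
                while stack.pop() != name:
                    pass
        elif name not in VOID_ELEMENTS:
            # A trailing "/" is deliberately ignored: only knowing that an
            # element is void keeps it off the stack.
            stack.append(name)
    return stack


print(open_elements('<div class="foo"/><p>text'))
print(open_elements('<br/><br><p>text'))
```

In the first call the div stays open and the p ends up inside it, which is the surprise awaiting authors who believe that /> closes the element.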

I think it would be rather arbitrary to add a feature for checking the  
*consistent* use of <br> vs. <br/>. Why not <br/> vs. <br />?  
foo='bar' vs. foo="bar"? Or indent style?

As for lower-case names, I don't think people *really* want lower-case
names. I think, as a matter of code style, they want *canonical-case*
names, which aren't all lower-case for SVG-in-text/html and
MathML-in-text/html (definitionURL). I think adding checking for this
would have a disproportionately bad impact on the parser code base
compared to the benefit. In a reasonable general-purpose HTML parser
implementation, the case information is lost before it is decided
whether a tag belongs to an HTML, SVG or MathML element, and maintaining
a special-purpose parser for validation wouldn't be good. On the benefit
side, I don't think accidentally holding down the shift key when typing
a name is a notable practical authoring problem. Because the benefit is
small and the added code complexity is large, I'm not planning on
implementing this check.
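
As a rough illustration of why canonical case depends on more than the name itself, here is a small Python sketch. The function names and the three tables are assumptions for this example and hold only a few entries of the case-adjustment lists; the point is that the namespace argument is only known after tree construction has decided where the tag belongs, by which time a typical tokenizer has already lower-cased the name.

```python
# Short excerpts of the case-adjustment tables; the full lists are longer.
SVG_ELEMENT_NAMES = {
    "foreignobject": "foreignObject",
    "lineargradient": "linearGradient",
    "clippath": "clipPath",
}
SVG_ATTRIBUTE_NAMES = {"viewbox": "viewBox"}
MATHML_ATTRIBUTE_NAMES = {"definitionurl": "definitionURL"}


def canonical_element_name(name, namespace):
    """Return the canonical-case element name for 'html', 'svg' or 'mathml'."""
    lowered = name.lower()
    if namespace == "svg":
        return SVG_ELEMENT_NAMES.get(lowered, lowered)
    return lowered


def canonical_attribute_name(name, namespace):
    """Return the canonical-case attribute name for the element's namespace."""
    lowered = name.lower()
    if namespace == "svg":
        return SVG_ATTRIBUTE_NAMES.get(lowered, lowered)
    if namespace == "mathml":
        return MATHML_ATTRIBUTE_NAMES.get(lowered, lowered)
    return lowered


print(canonical_attribute_name("VIEWBOX", "svg"))
print(canonical_attribute_name("VIEWBOX", "html"))
print(canonical_attribute_name("definitionurl", "mathml"))
```

A checker built on these helpers would have to compare the author's original spelling against the canonical one, which means carrying the original spelling through the tokenizer.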

Back to the unquoted attributes request. I think it's the hardest one  
of the four to decide whether to implement or not. I think it is easy  
to decide unquoted attributes shouldn't be errors, and it's easy to  
decide that if the feature were available, it should be optional.  
(Making it mandatory would annoy people updating existing sites using  
quote omission and people who know just fine that they can omit quotes  
on stuff like type=radio and don't want to type extra.)

There's one case that clearly needs an unconditional warning, though:  
<foo bar=baz/> when the format of the value of bar doesn't exclude a  
trailing slash. In this case, the /> feature interacts badly with the  
quote omission feature.

I think a good way to proceed here is to first write more complex code
for detecting <foo bar=baz/> and then see whether that, together with
more precise datatyping than in old DTD-based validators, is enough to
catch actual problems without introducing more UI options.
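
A minimal Python sketch of what such detection could look like follows. It is not Validator.nu code: parse_start_tag, StartTag and the warning text are made up for this illustration, and the function handles a single start tag with simplified rules (no character references, no error recovery beyond the basics). It shows that a slash at the end of an unquoted value becomes part of the value instead of marking the tag as self-closing.

```python
from typing import NamedTuple

WHITESPACE = " \t\n\f\r"


class StartTag(NamedTuple):
    name: str
    attributes: dict
    self_closing: bool
    warnings: list


def parse_start_tag(source):
    """Parse one start tag such as '<foo bar=baz/>' (simplified rules)."""
    if not (source.startswith("<") and source.endswith(">")):
        raise ValueError("expected a single start tag")
    body = source[1:-1]
    n = len(body)
    i = 0
    while i < n and body[i] not in WHITESPACE + "/":
        i += 1
    name = body[:i].lower()
    attributes, warnings, self_closing = {}, [], False
    while i < n:
        ch = body[i]
        if ch in WHITESPACE:
            i += 1
            continue
        if ch == "/":
            # Outside a value, a slash only counts directly before ">".
            self_closing = i == n - 1
            i += 1
            continue
        start = i
        i += 1  # the first character always belongs to the name
        while i < n and body[i] not in WHITESPACE + "/=":
            i += 1
        attr = body[start:i].lower()
        while i < n and body[i] in WHITESPACE:
            i += 1
        value = ""
        if i < n and body[i] == "=":
            i += 1
            while i < n and body[i] in WHITESPACE:
                i += 1
            if i < n and body[i] in "\"'":
                end = body.find(body[i], i + 1)
                if end == -1:
                    end = n
                value = body[i + 1:end]
                i = end + 1
            else:
                start = i
                # An unquoted value runs to the next whitespace; "/" is data.
                while i < n and body[i] not in WHITESPACE:
                    i += 1
                value = body[start:i]
                if i == n and value.endswith("/"):
                    warnings.append(
                        f"unquoted value of '{attr}' swallowed the trailing slash")
        attributes.setdefault(attr, value)  # the first duplicate wins
    return StartTag(name, attributes, self_closing, warnings)


print(parse_start_tag("<foo bar=baz/>"))
print(parse_start_tag('<foo bar="baz"/>'))
```

Whether the warning is a false positive depends on the datatype of the attribute, which is where more precise datatyping would come in.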

If after that change users of Validator.nu still face uncaught  
problems due to quote omission (e.g. class or alt eating up whatever  
follows and somehow managing not to generate any subsequent error), I  
think exploring a feature for optionally warning about unquoted  
attributes would make sense.

> This would address your use case, as far as I can tell, without preventing
> anyone who _likes_ omitting quote marks from doing so.
[...]
> Omitting quotes would also make a large number of pages invalid for more
> or less stylistic reasons, which would make it harder for people to
> transition to HTML5, and may annoy them ("Why do I have to add these
> quotes, they don't really add anything -- bah! I hate html5").

I think that quote omission should stay conforming for these reasons.

> (Tools, of course, can just quote everything. There's no reason other than
> user preference for the authoring tool to not quote values, as far as I
> can tell.)

I encourage anyone who is writing an HTML serializer to use double  
quotes for attributes unconditionally, unless there's a specific need  
to optimize file size to the point of counting bytes. (Single quotes  
are worse, because developers are tempted to escape ' as &apos; in  
attribute values, but &apos; has compat issues with IE versions still
out there.)
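
For serializer authors, the advice could look roughly like the following Python sketch. The names escape_attribute_value and serialize_start_tag are hypothetical, and the escaping is deliberately minimal: inside double quotes only & and " need escaping, so an apostrophe can stay literal and &apos; never comes up.

```python
def escape_attribute_value(value):
    """Escape a value for use inside a double-quoted attribute."""
    # "&" must be replaced first so the "&" in "&quot;" is not re-escaped.
    return value.replace("&", "&amp;").replace('"', "&quot;")


def serialize_start_tag(name, attributes):
    """Serialize a start tag, double-quoting every attribute value."""
    parts = [name]
    for attr_name, attr_value in attributes.items():
        parts.append(f'{attr_name}="{escape_attribute_value(attr_value)}"')
    return "<" + " ".join(parts) + ">"


print(serialize_start_tag("input", {"type": "radio", "value": 'it\'s "a" & b'}))
```

Quoting unconditionally costs two bytes per attribute but removes the need to inspect each value for characters that would end an unquoted value.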

> On Sat, 25 Jul 2009, Keryx Web wrote:
>>
>> Consider this PHP template:
>>
>> <input type=text value=$login name=login>
>>
>> Value is the suggested text, if no user data is available it says "login".
>> Otherwise its the users login name (no spaces allowed). All is well.
>>
>> One day a developer decides that "login name" is a better value, and hard
>> codes it into the PHP business logic, producing this HTML:
>>
>> <input type=text value=login name name=login>
>>
>> All of a sudden you *effectively* have produced this:
>>
>> <input type=text value=login name="">
>>
>> And it stops working.
>
> I agree that this is an issue, and I would strongly recommend that people
> who write templates not make assumptions about the values they are
> inserting.

If you aren't manually typing both the attribute name and the value in  
a text editor, you should always use double quotes for generated  
values to avoid trouble.
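
The template failure quoted above can be reproduced with a short Python sketch that stands in for the PHP template. The helper names (render_unquoted, render_quoted, effective_attributes) are invented for this example. Python's html.parser reports duplicate attributes as-is, so effective_attributes applies the "first duplicate wins" rule that HTML parsers in browsers use.

```python
from html import escape
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Records the attribute list of the last start tag seen."""

    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs = attrs


def effective_attributes(markup):
    """Parse markup and keep only the first occurrence of each attribute."""
    collector = AttributeCollector()
    collector.feed(markup)
    collector.close()
    result = {}
    for name, value in collector.attrs:
        result.setdefault(name, value or "")  # a bare attribute has value ""
    return result


def render_unquoted(login):
    # Mirrors the fragile template: the value is inserted without quotes.
    return f"<input type=text value={login} name=login>"


def render_quoted(login):
    # Double quotes plus escaping keep the value in one piece.
    return f'<input type="text" value="{escape(login, quote=True)}" name="login">'


print(effective_attributes(render_unquoted("login name")))
print(effective_attributes(render_quoted("login name")))
```

With the unquoted template the second word of the value turns into an empty name attribute that shadows the real one; with the quoted template both attributes survive.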

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/
Received on Tuesday, 4 August 2009 04:59:00 UTC