Attribute syntax feedback from Ian Hickson on 2008-12-29 (public-html@w3.org from December 2008)

From: Ian Hickson <ian@hixie.ch>
Date: Mon, 29 Dec 2008 13:32:34 +0000 (UTC)
To: 'HTML WG' <public-html@w3.org>
Message-ID: <Pine.LNX.4.62.0812291043570.24109@hixie.dreamhostps.com>

On Sat, 12 Jul 2008, Frank Ellermann wrote:
> Julian wrote:
> > 
> > Can we please stick to the contents of the HTML5 spec?
> 
> ...stick to STD 66, don't invent new URLs.  Don't do whatever some 
> browsers do if existing standards are better for the job at hand.

The existing standards aren't better -- they don't define how to handle 
errors, which we need to define because errors in URIs are amongst the 
most common errors on the Web. For example, over 2% of pages seem to 
include whitespace in the query component of a URL in the <a href=""> 
attribute, and the URI specs don't define how to handle that. [1]

[1] http://lists.w3.org/Archives/Public/public-html/2008Aug/0923.html


On Sat, 12 Jul 2008, Frank Ellermann wrote:
> 
> The OP asked for standard terminology, mentioning "protocol" as an 
> example.  And I tried to figure out what he had in mind after you had 
> explained that "protocol" is a traditional name that can't be changed.  
> If your <hostport> would be different from the RFC 2396 <hostport> this 
> would be confusing.  I fear it is, you are talking about <ihost> ":" 
> <port>, aren't you ?  Ditto <ihost> vs. <host>.

In HTML5 the parsing of URLs (including strings that aren't valid URLs) is 
defined purely in terms of the URI spec, not the IRI spec, so as to 
side-step the issue of character encodings, since we need to preserve the 
encodings to get compatibility with legacy content. So in fact, the use of 
<host> rather than <ihost> is intentional.


> Ian Hickson wrote:
> > It's not clear to me how you tested this.
> 
> By clicking on the quoted URLs, http://example.com:0x50/ reports an 
> error (displaying http://example.com:50/), and the same test with 0x80 
> reaches http://example.com (at port 80).  That matches what you specify 
> in 2.3.5, "ignore all non-digits in <port>".
> 
> Similarly, I tested http://example.com:/, and arrived at example.com (port 
> 80).  Your draft says port would be set to 0, apparently.  Is the 0 only 
> a trick to indicate an erroneous port for the purposes of chapter 2.3.5?
> 
> | Remove any characters in the new value that are not in the range
> | U+0030 DIGIT ZERO .. U+0039 DIGIT NINE. If the resulting string
> | is empty, set it to a single U+0030 DIGIT ZERO character ('0').

I don't understand what you mean when you say that you "clicked the quoted 
URLs". The section in question is defining a DOM API, not the parsing of 
attributes.
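
For illustration only (this is not spec text), the setter rule quoted 
above can be sketched in Python. The function name sanitize_port is 
hypothetical, and the sketch covers only the quoted step of stripping 
non-digits and defaulting to "0"; it says nothing about how attribute 
values or clicked links are parsed.

```python
def sanitize_port(new_value: str) -> str:
    """Sketch of the quoted setter step (hypothetical helper name).

    Keep only characters in U+0030 DIGIT ZERO .. U+0039 DIGIT NINE;
    if nothing is left, fall back to a single '0'.
    """
    digits = "".join(ch for ch in new_value if "0" <= ch <= "9")
    return digits or "0"


if __name__ == "__main__":
    for candidate in ("8080", "0x50", ""):
        print(repr(candidate), "->", repr(sanitize_port(candidate)))
```

Under this sketch an empty new value becomes "0", which is the case the 
question about http://example.com:/ was asking about.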


> > I recommend testing with a modern browser
> 
> I like old browsers, they enjoy security by obscurity, I know their bugs 
> (after some years), they are smaller and faster than "popular" monsters 
> wanting me to get a modern OS with modern hardware for the purpose of 
> watching modern ads.  I figured out how stuff works, changing browsers 
> is almost as bad as changing text editors.

I'm not suggesting using a modern browser for your browsing needs, but 
from a purely pragmatic perspective, only browsers with significant market 
share actually matter when reverse engineering behavior for HTML5. This is 
because that's what the browser developers want to be compatible with, and 
thus that is what HTML5 must be compatible with if it is to be relevant.


On Thu, 21 Aug 2008, Phillips, Addison wrote:
> 
> During a recent discussion of this thread, our working group noted that 
> Section 2.5.1 defines the term "valid URL" using four bullet points. The 
> third bullet point says:
> 
> --
> The URL is a valid IRI reference and its query component contains no 
> unescaped non-ASCII characters.
> --
> 
> This definition isn't quite complete. "Non-ASCII characters" can be 
> escaped in lots of ways using a wide variety of character encodings. You 
> should mention the use of UTF-8 to escape non-ASCII characters or 
> (better?) reference section 3.1 of IRI (3987).

As far as I can tell, IRIs are valid even if they contain escaped 
sequences that are not valid UTF-8. Is this wrong?


On Fri, 22 Aug 2008, Phillips, Addison wrote:
>
> I have to admit that I read Section 2.5.1 of [1] as trying to define a 
> 'valid URL' to mean (essentially) an IRI in which the full string uses 
> the UTF-8 encoding. Although I'm very much aware of the 'query part 
> encoding issue', I certainly didn't read the section that way. In fact, 
> I was kind of pleased to think that 'valid URL' meant "IRI from 
> end-to-end". Presumably "invalid URLs" could use non-UTF-8 encoded query 
> parts, which *are* handled by the processing instructions.

No, the spec is just trying to define "valid URL" as a shortcut for "valid 
URI or IRI that will be interpreted as specified by the URI and IRI 
specs", something which is non-trivial to define due to the legacy 
encoding issues when parsing in non-UTF-8 documents.


> If, as you suggest, the third bullet point instead means to allow the 
> query part to use any character encoding (e.g. the document encoding), 
> then the section is problematic.

The bulleted list is not trying to do anything but reflect the exact cases 
where a string will be processed as defined by the URI and IRI specs while 
still being valid according to those specs. It's not trying to define 
anything new (unlike the "parsing" and "resolving" sections, which define 
the details that are missing from the URI and IRI specs).


> 1. The phrase "non-ASCII characters" isn't correct. What is meant is 
> that there are no unescaped non-'unreserved' production characters. (See 
> RFC 3986 and yes I mean to cite the URI production here). ASCII covers 
> more characters than are permitted in the query portion of *any* URI. 
> For example, the character '#' is both an ASCII character and one that 
> terminates a query part.

This is covered by virtue of the "valid URL" definition also requiring 
that the string be a valid URI or IRI per the URI or IRI specs.


> 2. The bullet point probably doesn't mean characters anyway. If we 
> intend to allow non-UTF-8 encodings into a "valid URL's" query part, 
> then we should be clear and say 'byte values'. Those byte values may 
> represent characters using some encoding (the document's encoding). But 
> both URI and IRI permit the transmission of non-character data.

Characters are what is meant here, because this is post-decoding.


On Mon, 20 Oct 2008 noah_mendelsohn@us.ibm.com wrote:
>
> The first concern we discussed is that the semantics of microsyntaxes 
> like signed integer [1] are a) unduly buried in the imperative parsing 
> rules and b) thus at some risk of not making it into any authoring 
> specification.

I've tried to fix this. Please do let me know if I did not do so to your 
satisfaction (notwithstanding that the definitions aren't yet split out 
into an entirely distinct document).


> So, to reiterate the concern, now that the details have been set out:
> 
> a) There are probably clearer and simpler ways of conveying the intended 
> semantic than burying them in the parsing rules.  Alternatives range 
> from informal "these strings have the obvious interpretation as 
> integers, high order digits on the left, etc., with '-' indicating 
> negative numbers" to more rigorous or even formal mappings using the 
> appropriate polynomial. I'm not here recommending which of the many 
> options should be chosen, just suggesting that burying the semantics in 
> the parsing rules is suboptimal.

I went with the simple phrase "interpreted as a number in base ten"; I can 
be more detailed if you like, let me know.
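
As a rough illustration of what "interpreted as a number in base ten" 
means (a sketch, not spec text), the following Python assumes the input 
has already been checked to be an optional '-' followed by one or more 
ASCII digits. The function name base_ten_value is hypothetical.

```python
def base_ten_value(text: str) -> int:
    """Interpret an already-validated signed integer string in base ten.

    Assumption: text is an optional '-' followed by one or more ASCII
    digits; no validation is done here.
    """
    negative = text.startswith("-")
    digits = text[1:] if negative else text
    value = 0
    for ch in digits:
        # High-order digits are on the left, so shift what we have so
        # far by one decimal place and add the next digit.
        value = value * 10 + (ord(ch) - ord("0"))
    return -value if negative else value


if __name__ == "__main__":
    print(base_ten_value("-120"))
```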


On Thu, 4 Dec 2008, Simon Pieters wrote:
> 
> The HTML5 microsyntax parsing rules for numbers basically are to first 
> skip whitespace and then ignore trailing garbage. However, this doesn't 
> match IE and Mozilla for all attributes. We have a bug about a page 
> using the following markup:
> 
>    <td><input type="text" maxLength=3D59 size=3D30=20 name="email"></td>
> 
> ...and the page doesn't work in Opera (can only enter 3 characters) but 
> works in IE and Mozilla.
> 
> IE checks if the next character is [a-zA-Z] or [\192-\214] or 
> [\216-\246] or [\248-\501] or [\506-\535] or [\592-\680] or \902 or 
> [\905-\906] or \908 or [\910-\929] .... etc, and if so, drop the 
> attribute. It does this for maxlength, hspace, vspace, cols, rows, span, 
> colspan, rowspan, scrollamount, scrolldelay, start, value.
> 
> Mozilla checks if the next character is [a-fA-F] and if so, act as if 
> the attribute was absent. It does this for maxlength, hspace, vspace, 
> border, cols, rows, size, span, colspan, rowspan, cellpadding, 
> cellspacing, topmargin, leftmargin, marginwidth, marginheight, 
> scrollamount, scrolldelay, start, value.
> 
> For other attributes (e.g. width and height) IE and Mozilla match HTML5.

On Thu, 4 Dec 2008, Boris Zbarsky wrote:
> 
> Odd.  We do the same thing for width and height that we do for 
> everything else, as far as I can tell...
> 
> So <input width="500f"> will look just like <input> in Mozilla 
> (certainly does over here).
> 
> I do agree that treating a-fA-F garbage as special is a bit weird; it's 
> an artifact of using a general-purpose string-to-integer function which 
> treats this case as a hex number where a decimal one was expected and 
> returns an "unable to parse string" error.

On Thu, 4 Dec 2008, Simon Pieters wrote:
> 
> So does <input width=500> since width isn't an attribute for input. But 
> <img width=1a> and <img> (in quirks mode) render the same (and different 
> from <img width=1>).

On Thu, 4 Dec 2008, Boris Zbarsky wrote:
> 
> Er, good point.  I just looked at the attribute parsing, not the usage.  
> Need an <input type="image"> to see the width attribute there in action, 
> of course.
> 
> But yes, "1a" is ignored any time we're parsing an integer, while "1g" 
> is parsed as "1".  I don't think we want to spec this behavior.  ;)

I haven't changed the spec, because Firefox's behavior is only for 
characters in the ranges a-f A-F and only in quirks, and Safari doesn't do 
this at all, quirks or not.

I am _very_ dubious about doing different things for different attributes 
here.
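
To make the difference concrete, here is a simplified Python sketch of 
the two behaviors discussed in this thread for non-negative integers. It 
is an illustration under stated assumptions, not the spec algorithm or 
any browser's code: the function names are hypothetical, whitespace is 
taken to be space, tab, LF, FF and CR, and the second function models 
only the a-f/A-F behavior as described above.

```python
import re
from typing import Optional

# Leading whitespace, then one or more ASCII digits; whatever follows
# is treated as trailing garbage.
_LEADING_INT = re.compile(r"[ \t\n\f\r]*([0-9]+)")


def parse_skip_garbage(value: str) -> Optional[int]:
    """Skip whitespace, read digits, ignore whatever comes after."""
    match = _LEADING_INT.match(value)
    return int(match.group(1)) if match else None


def parse_hex_letter_quirk(value: str) -> Optional[int]:
    """As above, but act as if the attribute were absent (None) when
    the character right after the digits is in a-f or A-F."""
    match = _LEADING_INT.match(value)
    if match is None:
        return None
    next_char = value[match.end():match.end() + 1]
    if next_char and next_char in "abcdefABCDEF":
        return None
    return int(match.group(1))


if __name__ == "__main__":
    for sample in ("3D59", "1a", "1g", "59"):
        print(sample, parse_skip_garbage(sample),
              parse_hex_letter_quirk(sample))
```

With the maxLength value from the bug report, "3D59", the first function 
yields 3 (hence only three characters can be entered), while the second 
yields None, i.e. no limit at all.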


On Thu, 4 Dec 2008, Jonas Sicking wrote:
> 
> Holy crap, we need to just purge that function from our codebase and 
> write a real one. I've had way too many fights with it at this point :(

I would be very interested in hearing the results of you doing this. If 
the problem Opera ran into is an isolated incident, that would be great. 
If it turns out we need some weird parsing scheme here, I should spec it.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Monday, 29 December 2008 13:33:12 UTC