
[whatwg] Parsing, syntax, and content model feedback

From: Edward Z. Yang <edwardzyang@thewritingpot.com>
Date: Thu, 25 Dec 2008 08:37:31 -0500
Message-ID: <49538C9B.20303@thewritingpot.com>
Ian Hickson wrote:
> On Mon, 22 Dec 2008, Edward Z. Yang wrote:
>> "in the range 0x0000 to 0x0008, U+000B, U+000E to 0x001F, 0x007F to 
>> 0x009F, 0xD800 to 0xDFFF , 0xFDD0 to 0xFDDFin the range 0x0000 to 
>> 0x0008, U+000B, U+000E to 0x001F, 0x007F to 0x009F, 0xD800 to 0xDFFF, 
>> 0xFDD0 to 0xFDDF"
>>
>> U+000B is not a range.
> 
> While this is technically true, I don't really see a better way to phrase 
> this that isn't verbose (e.g. "ranges and codepoints" or some such).
> 
> If it helps, consider the whole set of subranges and code points to be a 
> single discontinuous range, hence the use of the singular "range". :-)

The spec made me double-take when I read it (since it fairly clearly
separates ranges from code points). Also, I messed up the copy-paste while
quoting, so the text I cited is not actually what's there. The actual text is:

> in the ranges U+0001 to U+0008, U+000B, U+000E to U+001F, U+007F to U+009F, U+D800 to U+DFFF, U+FDD0 to U+FDDF, and characters U+FFFE...

It seems fairly clear to me that U+000B should be moved to the list of
characters (at the cost of the nice ordering) or we should collapse
ranges/characters into one "range".

> On Tue, 23 Dec 2008, Edward Z. Yang wrote:
> You're still checking the next input character at that point, so "P" is 
> still the "next input character", so the next six are "PUBLIC".
> 
> At least, that's how I'm defending what the spec says. :-)

The spec is pretty unambiguous about this:

> The next input character is the first character in the input stream that has not yet been consumed. Initially, the next input character is the first character in the input.

and, at the beginning of the section:

> Consume the next input character:

So, the spec is wrong.

> In practice I think having the text be clear ("PUBLIC") is less confusing 
> than having it be pedantic ("P" and "UBLIC" or "this and the next five" or 
> some such). It's not like people are going to assume the spec is allowing 
> "XPUBLIC" or "*PUBLIC" and so forth, right?

I understand this consideration, and there are several ways we could go
about fixing it. I think the easiest would be to un-consume a
character, then perform the checks, and then reconsume the character.
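
A rough Python sketch of the problem and of the un-consume approach,
assuming a toy input stream where "next input character" means exactly
what the quoted definition says (the first character not yet consumed).
The InputStream class and its method names are hypothetical and only serve
to show the off-by-one; the sample doctype text is just an example input.

```python
class InputStream:
    """Toy stream; pos is the index of the next input character."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def consume(self):
        """Consume and return the next input character."""
        ch = self.text[self.pos]
        self.pos += 1
        return ch

    def unconsume(self):
        """Put the most recently consumed character back."""
        self.pos -= 1

    def next_chars(self, n):
        """Peek at the next n characters without consuming them."""
        return self.text[self.pos:self.pos + n]


stream = InputStream('PUBLIC "-//W3C//DTD HTML 4.01//EN"')

# The section begins with "Consume the next input character", so "P" is gone.
stream.consume()
after_consume = stream.next_chars(6)    # starts at "U", not "P"

# Un-consume, perform the check, then reconsume.
stream.unconsume()
after_unconsume = stream.next_chars(6)  # now starts at "P"
stream.consume()
```

Read literally, a check of "the next six characters" made after the
consume step compares against after_consume, which can never match
"PUBLIC". Only the check made after un-consuming sees the intended text.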

As for people making this mistake... well, you're looking at one. :-)

Cheers,
Edward

(accidentally emailed only Ian; re-sending to WHATWG list)
Received on Thursday, 25 December 2008 05:37:31 UTC
