Re: html5 nitpicks

On 5/12/08, Ian Hickson <ian@hixie.ch> wrote:
> On Mon, 30 Jul 2007, Jim Jewett wrote:
>> 3.2.1. Common parser idioms
>> (and again in 3.2.6. Tokens)

[I now find the list of integers in 3.2.3.6]

>> As I read this, "here 'is the' string" tokenizes to the

[the 4-member set {here, string, the', 'is} ]

>> ... and the single-quote
>> marks around 'is the' do not function to group.  This
>> should be called out explicitly.

> Why would one use quote marks in the token attributes?

Normally you wouldn't -- but the main reason to support both " and '
is that sometimes people do want one of them within the string.  I
wouldn't recommend using either

    don't
or
    do not

as tokens, but they are valid in some languages, at least with special
quoting.  If they are invalid in HTML, that should be called out
explicitly.  Right now, the first is a valid token, but the second is
not, because the space splits it into two tokens.
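
To make the behaviour concrete, here is a rough sketch of splitting on
spaces the way I read the draft.  The function name and the exact set
of space characters are my assumptions, not text from the spec; the
point is only that quote marks get no special treatment.

```python
import re

# Assumed set of "space characters": space, tab, LF, FF, CR.
SPACE_RUN = re.compile("[ \t\n\f\r]+")

def split_on_spaces(value):
    """Split a string into tokens on runs of space characters.

    Quote marks are ordinary characters here, so they never group words.
    """
    return [token for token in SPACE_RUN.split(value) if token]

print(split_on_spaces("here 'is the' string"))  # ['here', "'is", "the'", 'string']
print(split_on_spaces("don't"))                 # ["don't"]
print(split_on_spaces("do not"))                # ['do', 'not']
```

So "don't" survives as one token with its apostrophe intact, while
"do not" can only ever be read as two tokens.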

I think it would be reasonable to limit tokens to (a subset of)
Unicode identifier characters (basically letters, numbers, and
underscore, but not starting with a number; see
http://unicode.org/reports/tr31/).

But the algorithm doesn't do that.

Since it doesn't, I think this should be called out, particularly for
quotation marks and commas, because they often do have other meanings
when parsing a string.
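
As an illustration of the identifier-style restriction I have in mind
(not something the draft specifies), Python's str.isidentifier() is a
handy stand-in: it follows Python's own identifier rule, which is
close to the TR31 one but not guaranteed to be identical.  The helper
name is hypothetical.

```python
def looks_like_identifier(token):
    # Python's rule: first character is XID_Start or underscore, the rest
    # are XID_Continue.  Used here only as an approximation of TR31.
    return token.isidentifier()

for token in ["here", "string", "don't", "'is", "2nd", "a,b"]:
    print(repr(token), looks_like_identifier(token))
```

Under such a rule the quote and comma cases would simply be invalid
tokens instead of silently accepted ones.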

>> 3.2.3.6. Lists of integers
>> """
>> A valid list of integers is a number of valid integers
>> separated by U+002C COMMA characters, with no
>> other characters (e.g. no space characters).
>> """
>> but the algorithm allows spaces.

>> I personally think spaces should be allowed, but
>> if they aren't, then the parsing should be explicit
>> that this allowance is for error recovery.

> The algorithm does a whole lot of stuff for error
> recovery. I'm not sure it would be possible to
> cover each case accurately.

Spaces (and sometimes tabs) are a special category that people will
assume are valid if they aren't told otherwise.  (And as I said
before, they probably should be valid.)
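
Roughly, the gap between "valid" and "parseable" looks like the sketch
below.  Both functions are my own illustrations, not the spec's
algorithm, and I am assuming an integer is an optional "-" followed by
ASCII digits.

```python
import re

# Strict reading of the quoted definition: integers separated by commas,
# with no other characters at all.
STRICT_LIST = re.compile(r"-?[0-9]+(?:,-?[0-9]+)*")

def is_valid_list(value):
    return STRICT_LIST.fullmatch(value) is not None

def parse_leniently(value):
    # Hypothetical forgiving parser: int() ignores surrounding whitespace,
    # so "1, 2, 3" is accepted.  Non-numeric items raise ValueError.
    return [int(item) for item in value.split(",") if item.strip()]

print(is_valid_list("1,2,3"))      # True
print(is_valid_list("1, 2, 3"))    # False
print(parse_leniently("1, 2, 3"))  # [1, 2, 3]
```

An author who only ever sees the second behaviour has no way to know
that the spaced form is being tolerated and is not actually valid.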

For many people, the most natural way to write "a list of ..." is to
separate the items with a comma *and* a space.

For some people, comma alone does not separate numbers, because it is
used for grouping digits.  (In fairness, some people now use space for
the same purpose.)

    "1,234,567"
    "1, 234, 567"
    "1 234 567"

All of the above *could* represent a single number much larger than
one thousand.

Without context, the last two could also represent a list of 3 numbers.
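
A small sketch of the two readings, with hypothetical helper names.
Mechanically all three strings can be read either way; which reading a
person finds natural is the part that depends on context.

```python
def as_single_number(text):
    # Treat commas and spaces as digit-group separators.
    return int(text.replace(",", "").replace(" ", ""))

def as_list(text):
    # Treat commas and/or spaces as item separators.
    return [int(item) for item in text.replace(",", " ").split()]

for text in ["1,234,567", "1, 234, 567", "1 234 567"]:
    print(repr(text), as_single_number(text), as_list(text))
```

Each line prints the same pair of results, 1234567 and [1, 234, 567],
which is exactly the ambiguity.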

But writing a list the first way -- which is the only currently valid
way -- would normally be considered a typo.  If it has to be done that
way for backwards compatibility, then so be it -- but at least make it
obvious with an example.

-jJ

Received on Wednesday, 14 May 2008 19:57:30 UTC