Re: html5 nitpicks

On Wed, 14 May 2008, Jim Jewett wrote:
> On 5/12/08, Ian Hickson <ian@hixie.ch> wrote:
> > On Mon, 30 Jul 2007, Jim Jewett wrote:
> >> 3.2.1. Common parser idioms
> >> (and again in 3.2.6. Tokens)
> 
> [I now find the list of integers in 3.2.3.6]
> 
> >> As I read this, "here 'is the' string" tokenizes to the
> 
> [the 4-member set {here, string, the', 'is} ]
> 
> >> ... and the single-quote marks around 'is the' do not function to 
> >> group.  This should be called out explicitly.
> 
> > Why would one use quote marks in the token attributes?
> 
> Normally you wouldn't -- but the main reason to support both " and '
> is that sometimes people do want one of them within the string.  I
> wouldn't recommend using either
> 
>     don't
> or
>     do not
> 
> as tokens, but they are valid in some languages, at least with special 
> quoting.  If they are invalid in HTML, that should be called out 
> explicitly.  Right now, the first is valid, but the second is not.

I'm not really sure how to call this out. This doesn't seem to be about 
the apostrophe _not_ being used as a token separator; it seems more about 
the space being one. But then that's the whole point of the token 
definition. I'm not sure how to point out that two words are treated as 
two words without sounding overly condescending.
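
For concreteness, here is a rough sketch in Python of what "split on spaces, and only on spaces" does to the example string above. The name split_on_spaces and the exact set of space characters used (space, tab, LF, FF, CR) are assumptions made for illustration, not text from the spec.

```python
import re

# Assumed set of "space characters": space, tab, LF, FF, CR.
SPACE_CHARS = " \t\n\f\r"
SPACE_RUN = re.compile("[" + SPACE_CHARS + "]+")


def split_on_spaces(value):
    """Return the space-separated tokens of value, in order.

    Quote marks and commas get no special treatment: they are
    ordinary characters that end up inside whichever token holds them.
    """
    return [token for token in SPACE_RUN.split(value) if token]


print(split_on_spaces("here 'is the' string"))
# ['here', "'is", "the'", 'string']
print(split_on_spaces("don't"))   # ["don't"]        (one token)
print(split_on_spaces("do not"))  # ['do', 'not']    (two tokens)
```

The apostrophes survive as part of the tokens 'is and the', which is the behaviour being discussed: only the space separates anything.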


> I think it would be reasonable to limit tokens to (a subset of) unicode 
> identifier characters (basically, letters, numbers, and underscore, but 
> not starting with a number http://unicode.org/reports/tr31/).
> 
> But the algorithm doesn't do that.
> 
> Since it doesn't, I think this should be called out, particularly for 
> quotation marks and commas, because they often do have other meanings 
> when parsing a string.

All the terms are called some variation of "space-separated tokens". I 
don't really know how to call more attention to the fact that tokens are 
space separated (and only space separated) than by repeating it every time 
we talk about them.

Is there any reason to believe that authors are actually being confused by 
this and are trying to use other characters to delimit tokens?
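
If someone wanted to look for evidence of that kind of confusion, one crude approach would be to flag tokens that contain anything other than identifier characters. A minimal sketch, assuming Python's str.isidentifier() is an acceptable stand-in for the UAX #31 rules mentioned above (it additionally allows a leading underscore); suspicious_tokens is a hypothetical helper name, not anything from the spec.

```python
def suspicious_tokens(value):
    """Return the tokens in value that are not identifier-like.

    Assumption: str.split() with no arguments splits on any run of
    whitespace, which is broader than HTML's space characters but is
    close enough for a rough survey.
    """
    return [token for token in value.split() if not token.isidentifier()]


print(suspicious_tokens("nav main,menu 'is the' 2col"))
# ['main,menu', "'is", "the'", '2col']
```

A check like this would also flag hyphenated tokens such as nav-bar, so any survey built on it would need to treat hyphens separately before drawing conclusions.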


> >> 3.2.3.6. Lists of integers
> >> """
> >> A valid list of integers is a number of valid integers
> >> separated by U+002C COMMA characters, with no
> >> other characters (e.g. no space characters).
> >> """
> >> but the algorithm allows spaces.
> 
> >> I personally think spaces should be allowed, but if they aren't, then 
> >> the parsing should be explicit that this allowance is for error 
> >> recovery.
> 
> > The algorithm does a whole lot of stuff for error recovery. I'm not 
> > sure it would be possible to cover each case accurately.
> 
> Spaces (and sometimes tabs) are a special category that people will 
> assume are valid if they aren't told otherwise.  (And as I said before, 
> they probably should be valid.)

I agree. The definition actually calls this out explicitly: "with no other 
characters (e.g. no space characters)".
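
To illustrate the gap between what is valid and what gets parsed, here is a rough Python sketch. It is not the spec's parsing algorithm: the strict check follows the quoted definition (assuming a valid integer is an optional "-" followed by ASCII digits), and the lenient parser is a deliberately simplified stand-in that treats commas and space characters alike as separators. Both function names are hypothetical.

```python
import re

# Assumption: a valid integer is an optional "-" followed by ASCII digits.
INTEGER = r"-?[0-9]+"
VALID_LIST = re.compile(INTEGER + "(," + INTEGER + ")*")
SEPARATORS = re.compile(r"[ \t\n\f\r,]+")


def is_valid_integer_list(value):
    """Strict check: integers separated by commas, nothing else."""
    return VALID_LIST.fullmatch(value) is not None


def parse_integer_list_leniently(value):
    """Simplified error recovery: split on commas and spaces, and
    keep only the pieces that look like integers."""
    pieces = SEPARATORS.split(value)
    return [int(piece) for piece in pieces if re.fullmatch(INTEGER, piece)]


for text in ("1,234,567", "1, 234, 567", "1 234 567"):
    print(repr(text), is_valid_integer_list(text),
          parse_integer_list_leniently(text))
# '1,234,567' True [1, 234, 567]
# '1, 234, 567' False [1, 234, 567]
# '1 234 567' False [1, 234, 567]
```

Under these assumptions all three spellings yield the same three numbers, but only the first is valid.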


> For many people, the most natural way to write "a list of ..." is to 
> separate the items with a comma *and* a space.
> 
> For some people, comma alone does not separate numbers, because it is 
> used for grouping.  (In fairness, some people now use space for the same 
> purpose.)
> 
>     "1,234,567"
>     "1, 234, 567"
>     "1 234 567"
> 
> All of the above *could* represent a single number much larger than one 
> thousand.

I don't think people will be writing coordinates for image maps that are 
so large that they will feel the need to comma-group (or space-group) 
the digits. If they do, we probably have bigger problems.


> Without context, the last two could also represent a list of 3 numbers.
> 
> But writing a list the first way -- which is the only currently valid
> way -- would normally be considered a typo.  If it has to be done that
> way for backwards compatibility, then so be it -- but at least make it
> obvious with an example.

Adding examples for image maps is a good idea. I've added one.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Friday, 22 August 2008 00:48:40 UTC