- From: Jim Jewett <jimjjewett@gmail.com>
- Date: Wed, 14 May 2008 15:56:55 -0400
- To: "Ian Hickson" <ian@hixie.ch>
- Cc: public-html@w3.org
On 5/12/08, Ian Hickson <ian@hixie.ch> wrote:
> On Mon, 30 Jul 2007, Jim Jewett wrote:

>> 3.2.1. Common parser idioms
>> (and again in 3.2.6. Tokens)

[I now find the list of integers in 3.2.3.6]

>> As I read this, "here 'is the' string" tokenizes to the

[the 4-member set {here, string, the', 'is} ]

>> ... and the single-quote
>> marks around 'is the' do not function to group. This
>> should be called out explicitly.

> Why would one use quote marks in the token attributes?

Normally you wouldn't -- but the main reason to support both " and ' is that sometimes people do want one of them within the string.

I wouldn't recommend using either don't or do not as tokens, but they are valid in some languages, at least with special quoting. If they are invalid in HTML, that should be called out explicitly. Right now, the first is valid, but the second is not.

I think it would be reasonable to limit tokens to (a subset of) unicode identifier characters (basically, letters, numbers, and underscore, but not starting with a number http://unicode.org/reports/tr31/). But the algorithm doesn't do that. Since it doesn't, I think this should be called out, particularly for quotation marks and commas, because they often do have other meanings when parsing a string.

>> 3.2.3.6. Lists of integers
>> """
>> A valid list of integers is a number of valid integers
>> separated by U+002C COMMA characters, with no
>> other characters (e.g. no space characters).
>> """
>> but the algorithm allows spaces.
>> I personally think spaces should be allowed, but
>> if they aren't, then the parsing should be explicit
>> that this allowance is for error recovery.

> The algorithm does a whole lot of stuff for error
> recovery. I'm not sure it would be possible to
> cover each case accurately.

Spaces (and sometimes tabs) are a special category that people will assume are valid if they aren't told otherwise. (And as I said before, they probably should be valid.)
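To illustrate the earlier point about quote marks in token attributes, here is a minimal Python sketch of splitting a string on spaces. It assumes the draft's space characters are space, tab, LF, FF and CR; the name `split_on_spaces` is made up for this example and is not taken from the spec.

```python
import re

# Assumption: the separator set is space, tab, LF, FF and CR.
SPACE_CHARS = re.compile(r"[ \t\n\f\r]+")

def split_on_spaces(value):
    """Return the space-separated tokens of value, in order.

    Quote marks get no special treatment; they are ordinary token characters.
    """
    return [token for token in SPACE_CHARS.split(value) if token]

# The single quotes do not group 'is the' into one token.
print(split_on_spaces("here 'is the' string"))

# "don't" survives as one token, while "do not" becomes two.
print(split_on_spaces("don't"))
print(split_on_spaces("do not"))
```

Under these assumptions the first call yields four tokens, two of which carry a stray quote mark, which is the behaviour that seems worth calling out.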
For many people, the most natural way to write "a list of ..." is to separate the items with a comma *and* a space.

For some people, comma alone does not separate numbers, because it is used for grouping. (In fairness, some people now use space for the same purpose.)

"1,234,567"
"1, 234, 567"
"1 234 567"

All of the above *could* represent a single number much larger than one thousand. Without context, the last two could also represent a list of 3 numbers. But writing a list the first way -- which is the only currently valid way -- would normally be considered a typo.

If it has to be done that way for backwards compatibility, then so be it -- but at least make it obvious with an example.

-jJ
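A rough Python sketch of the gap between "valid" and "accepted" for the three strings above. The strict check assumes a valid integer is an optional "-" followed by ASCII digits, joined by commas only. The lenient parser is a hypothetical stand-in that treats runs of commas and spaces as separators; it is not the draft's actual parsing algorithm, and both function names are invented for this example.

```python
import re

# Assumption: strict form is integers joined by U+002C COMMA, nothing else.
VALID_LIST = re.compile(r"-?[0-9]+(?:,-?[0-9]+)*")

def is_valid_integer_list(value):
    """True only for the comma-only form that the quoted definition allows."""
    return VALID_LIST.fullmatch(value) is not None

def parse_leniently(value):
    """Hypothetical error-tolerant parse: commas and spaces both separate.

    Raises ValueError if a piece is not an integer.
    """
    return [int(part) for part in re.split(r"[ ,]+", value) if part]

for sample in ("1,234,567", "1, 234, 567", "1 234 567"):
    print(repr(sample), is_valid_integer_list(sample), parse_leniently(sample))
```

With these assumptions only the first string passes the strict check, yet all three parse to the same three integers, so an author gets no signal that the spaced forms are invalid.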
Received on Wednesday, 14 May 2008 19:57:30 UTC