- From: Thomas Broyer <t.broyer@gmail.com>
- Date: Wed, 22 Aug 2007 16:44:08 +0200
- To: public-html@w3.org
In the close tag open state [1]:

    If the content model flag is set to the RCDATA or CDATA states but no start tag token has ever been emitted by this instance of the tokeniser (fragment case), or, if the content model flag is set to the RCDATA or CDATA states and the next few characters do not match the tag name of the last start tag token emitted (case insensitively), or if they do but they are not immediately followed by one of the following characters […]

What does "case insensitive" really mean here (i.e. which algorithm should be applied)?

My opinion is that characters in the range A-Z should be lowercased (i.e. add 0x0020 to the character's code point; in other words, lowercase ASCII characters only) prior to comparing: the other tokenization states lowercase only those characters, and the tag name of the last start tag token emitted has already been lowercased with this algorithm (because of how the other states lowercase the tag name).

I'm not very familiar with Unicode's definition of case insensitivity: maybe non-ASCII characters could compare equal to ASCII characters (if we're in the RCDATA or CDATA content model, the tag name of the last start tag token emitted is part of a known set of names, which all happen to be ASCII-only). I doubt Unicode defines such things, but it would be clearer if the spec used the same "lowercasing algorithm" as elsewhere in the tokenization section.

This also applies to the other mentions of "case-insensitive" in the tokenization section (comparing the next few characters with DOCTYPE, PUBLIC or SYSTEM).

[1] http://www.w3.org/html/wg/html5/#close1
For multipage version: http://www.whatwg.org/specs/web-apps/current-work/multipage/section-tokenisation.html#close1

-- 
Thomas Broyer
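A minimal sketch, in Python, of the ASCII-only comparison proposed above. It is not taken from the spec: the function names `ascii_lower`, `ascii_case_insensitive_equal` and `matches_last_start_tag` are hypothetical, and the check for the character that must follow the tag name is deliberately left out. The last two lines illustrate why the choice of algorithm could matter: Python's Unicode-aware `str.casefold()` folds U+017F (LATIN SMALL LETTER LONG S) to `s`, so a Unicode case-insensitive match and an ASCII-only match can disagree.

```python
def ascii_lower(text: str) -> str:
    """Lowercase only A-Z by adding 0x0020; leave every other character as-is."""
    return "".join(
        chr(ord(ch) + 0x0020) if "A" <= ch <= "Z" else ch
        for ch in text
    )


def ascii_case_insensitive_equal(a: str, b: str) -> bool:
    """Compare two strings after ASCII-only lowercasing of both."""
    return ascii_lower(a) == ascii_lower(b)


def matches_last_start_tag(next_chars: str, last_start_tag_name: str) -> bool:
    """Hypothetical helper: do the next few characters match the tag name?

    Assumes last_start_tag_name was already lowercased by the tokeniser.
    The follow-up check on the character after the name is omitted here.
    """
    candidate = next_chars[: len(last_start_tag_name)]
    return ascii_case_insensitive_equal(candidate, last_start_tag_name)


# ASCII letters match regardless of case.
ascii_match = matches_last_start_tag("TextArea>", "textarea")

# U+017F is not in A-Z, so the ASCII-only comparison rejects it ...
long_s_ascii_match = matches_last_start_tag("\u017fcript>", "script")

# ... whereas Unicode case folding treats it as equal to "s".
long_s_unicode_match = "\u017fcript".casefold() == "script".casefold()
```

With the ASCII-only rule the result depends solely on the range A-Z, which is consistent with how the tag name was lowercased when the start tag token was emitted.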
Received on Wednesday, 22 August 2007 14:44:16 UTC