Case-insensitive comparison (part of my detailed review of tokenization)

In the close tag open state [1], the spec says:
If the content model flag is set to the RCDATA or CDATA states but no
start tag token has ever been emitted by this instance of the
tokeniser (fragment case), or, if the content model flag is set to the
RCDATA or CDATA states and the next few characters do not match the
tag name of the last start tag token emitted (case insensitively), or
if they do but they are not immediately followed by one of the
following characters […]

What does "case insensitive" really mean here (i.e. which algorithm
should be applied)?

My opinion is that only characters in the range A-Z should be
lowercased (i.e. add 0x0020 to the character's code point; in other
words, lowercase ASCII letters only) before comparing. The other
tokenization states only lowercase those characters, so the tag name
of the last start tag token emitted has already been lowercased with
this algorithm.
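
To make the proposal concrete, here is a minimal sketch in Python of the
comparison I have in mind. The function names ascii_lower and
matches_last_start_tag are my own and do not come from the spec, and the
sketch leaves out the check on the character that follows the tag name.

```python
def ascii_lower(text: str) -> str:
    """Lowercase only A-Z by adding 0x20; every other character is kept as-is."""
    return "".join(
        chr(ord(ch) + 0x20) if "A" <= ch <= "Z" else ch
        for ch in text
    )


def matches_last_start_tag(next_chars: str, last_tag_name: str) -> bool:
    """Compare the upcoming input with the tag name of the last start tag.

    Assumes last_tag_name was already ASCII-lowercased by the tokenizer.
    """
    candidate = next_chars[: len(last_tag_name)]
    return ascii_lower(candidate) == last_tag_name
```

With this, matches_last_start_tag("TEXTAREA>", "textarea") is true, while
any non-ASCII character in the input can never match an ASCII tag name.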

I'm not very familiar with Unicode's definition of case-insensitivity:
maybe some non-ASCII characters could compare equal to ASCII
characters. (If we're in the RCDATA or CDATA content model, the tag
name of the last start tag token emitted belongs to a known set of
names, which all happen to be ASCII-only.) I doubt Unicode defines
such things, but it would be clearer if the spec used the same
"lowercasing algorithm" as elsewhere in the tokenization section.
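
As a quick sanity check of the concern above, the following Python
sketch compares Unicode-aware folding with ASCII-only lowercasing.
Python's str.casefold() and str.lower() follow the Unicode case
mappings; ascii_lower is a hypothetical helper of mine, not something
the spec defines. At least in this implementation, a non-ASCII
character can fold to an ASCII letter, so the two algorithms can
disagree.

```python
LONG_S = "\u017f"  # LATIN SMALL LETTER LONG S
KELVIN = "\u212a"  # KELVIN SIGN


def ascii_lower(text: str) -> str:
    """Lowercase only A-Z (add 0x20); leave all other characters untouched."""
    return "".join(
        chr(ord(ch) + 0x20) if "A" <= ch <= "Z" else ch
        for ch in text
    )


candidate = LONG_S + "cript"

# Unicode case folding maps the long s to a plain ASCII "s".
unicode_match = candidate.casefold() == "script"

# ASCII-only lowercasing leaves the long s alone, so there is no match.
ascii_match = ascii_lower(candidate) == "script"

# The Kelvin sign likewise lowercases to an ASCII "k".
kelvin_is_k = KELVIN.lower() == "k"
```

Here unicode_match and kelvin_is_k are true while ascii_match is false,
which is why naming the exact algorithm matters.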

This also applies to the other mentions of "case-insensitive" in the
tokenization section (comparing the next few characters with DOCTYPE,
PUBLIC or SYSTEM).
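
The same ASCII-only rule could be applied to those keywords. A rough
sketch, assuming a hypothetical helper keyword_at that looks at the
input starting at a given position (the name and signature are mine,
not the spec's):

```python
import string
from typing import Optional

# Translation table that maps A-Z to a-z and nothing else.
ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

KEYWORDS = ("DOCTYPE", "PUBLIC", "SYSTEM")


def keyword_at(data: str, pos: int) -> Optional[str]:
    """Return the keyword found at data[pos:], compared ASCII case-insensitively."""
    for keyword in KEYWORDS:
        chunk = data[pos : pos + len(keyword)]
        if chunk.translate(ASCII_LOWER) == keyword.translate(ASCII_LOWER):
            return keyword
    return None
```

For example, keyword_at("<!doctype html>", 2) returns "DOCTYPE", and an
input that only matches under Unicode folding returns None.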

[1] http://www.w3.org/html/wg/html5/#close1
For multipage version:
http://www.whatwg.org/specs/web-apps/current-work/multipage/section-tokenisation.html#close1

-- 
Thomas Broyer

Received on Wednesday, 22 August 2007 14:44:16 UTC