HTML5: more comments

1. Section 2.1.6. This section talks about character encodings. It displays HTML5's distressing tendency to define absolutely everything in its own terms :-).

It would be useful if this section referred to charmod, since underlying terminology such as "character encoding", "character", "code point", etc. is defined there.

2. Section 2.2.2. Contains this note:

--
This specification might have certain additional requirements on character encodings, image formats, audio formats, and video formats in the respective sections.
--

This note is not clear. I'm not sure what an "additional requirement on character encodings" would be. Does it mean that certain character encodings will be required or banned?

3. Section 2.3. Introduces this term:

--
Comparing two strings in a compatibility caseless manner means using the Unicode compatibility caseless match operation to compare the two strings.
--

a. The specific reference to compatibility caseless matching should be provided (D146 in chapter 3 of the Unicode Standard).
b. I am unsure that compatibility caseless matching is desirable. It is only used twice in the whole document that I can find:

2.5.9 (hashname reference value matching)
4.10.7.1.17 (radio button name attribute matching)

In both cases, name attributes defined in the document are being matched. I think that compatibility decomposition in the matching operation would be a surprise to users, who might expect, for example, these to be separate values: ①⑴⒈. More to the point, the Korean Hangul script has a complex relationship with compatibility decomposition.

I would suggest replacing compatibility caseless matching with canonical caseless matching.

c. It seems to me that this is a stab at making 'name' attributes into 'identifiers', in which case compatibility decomposition is to be desired, with identifier caseless matching (which does use compatibility normalization but is slightly simpler). But then I note that this flies in the face of CSS3 Selectors matching that we previously discussed.
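To make the difference in point b concrete, here is a minimal Python sketch of canonical versus compatibility caseless matching. It assumes that Python's str.casefold() is an acceptable stand-in for the Unicode toCasefold operation and follows the definitions as I read them in chapter 3; the function names are hypothetical and are not taken from HTML5.

```python
import unicodedata


def nfd(s):
    return unicodedata.normalize("NFD", s)


def nfkd(s):
    return unicodedata.normalize("NFKD", s)


def canonical_caseless(x, y):
    # Canonical caseless match: NFD(toCasefold(NFD(X))) on both sides.
    return nfd(nfd(x).casefold()) == nfd(nfd(y).casefold())


def compat_key(s):
    # Compatibility caseless match key:
    # NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))
    return nfkd(nfkd(nfd(s).casefold()).casefold())


def compatibility_caseless(x, y):
    return compat_key(x) == compat_key(y)


if __name__ == "__main__":
    pairs = [
        ("\u2460", "1"),       # CIRCLED DIGIT ONE vs. ASCII "1"
        ("\u2474", "(1)"),     # PARENTHESIZED DIGIT ONE vs. "(1)"
        ("\u3131", "\u1100"),  # Hangul compatibility jamo vs. conjoining jamo
        ("Name", "NAME"),      # plain case difference
    ]
    for a, b in pairs:
        print(repr(a), repr(b),
              "canonical:", canonical_caseless(a, b),
              "compatibility:", compatibility_caseless(a, b))
```

Note that ①, ⑴ and ⒈ still differ from one another under compatibility matching (they decompose to "1", "(1)" and "1." respectively). The surprise is that each one matches its plain ASCII expansion, and that Hangul compatibility jamo match conjoining jamo.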

4. Section 2.6.3. Deals with processing URLs. The steps here result in defining the "character encoding" of the URL, which applies to the query portion of the URL. I put character encoding in quotes, because what it really is is the character encoding of the document or script containing the URL as a string. Step 8.2 contains an implicit encoding conversion (to the document character encoding). A health warning should be supplied about what to do when the character cannot be encoded into the target encoding.
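As a rough illustration of why the containing document's encoding matters here, the following Python sketch percent-encodes the same query string under two encodings. The helper name and the choice of "safe" characters are assumptions made for this example; this is not the algorithm from the draft.

```python
from urllib.parse import quote


def percent_encode_query(query, document_encoding):
    # Hypothetical helper: convert the query to bytes in the document's
    # encoding, then percent-encode every byte outside the unreserved set
    # (keeping "=" and "&" literal).
    raw = query.encode(document_encoding)
    return quote(raw, safe="=&")


if __name__ == "__main__":
    query = "q=caf\u00e9"
    print(percent_encode_query(query, "windows-1252"))
    print(percent_encode_query(query, "utf-8"))
```

The same characters in the source produce different bytes on the wire, so a server cannot interpret the query without knowing the encoding of the page that contained the link.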

5. Section 2.6.3. Step 8.1 replaces characters that cannot be encoded into the target encoding with the question mark character (0x3F). Should this be, instead, the replacement character for the target encoding? For example, UTF-8 would use U+FFFD. Some encodings use _. Inserting an ASCII question mark into a stateful encoding, such as ISO-2022, will also require shift sequences to be inserted.
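The data loss caused by question mark substitution can be sketched in a few lines of Python. This relies on the fact that Python's "replace" error handler substitutes "?" when encoding, which happens to mirror the behavior described above; the function name is hypothetical and the code is only an approximation of the draft's steps.

```python
from urllib.parse import quote


def encode_with_question_mark(query, target_encoding):
    # Characters that the target encoding cannot represent become "?"
    # (Python's "replace" handler for encoders), then the bytes are
    # percent-encoded, keeping "=" and "&" literal.
    raw = query.encode(target_encoding, errors="replace")
    return quote(raw, safe="=&")


if __name__ == "__main__":
    snowman = "q=snow\u2603"   # U+2603 is not representable in ISO-8859-1
    literal = "q=snow?"        # the user really typed a question mark
    print(encode_with_question_mark(snowman, "iso-8859-1"))
    print(encode_with_question_mark(literal, "iso-8859-1"))
    print(encode_with_question_mark(snowman, "utf-8"))
```

Under ISO-8859-1 the first two inputs produce the same output, so the receiver cannot tell a substituted character from a genuine question mark.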

6. Section 2.7.4. This section should probably be titled "Extracting character encodings from meta elements" (and ditto in the text). The word "encoding" by itself could mean other things, such as transfer encoding.

7. Section 2.7.4, step 3, step 5. These characters are the "space characters" defined elsewhere. Why not use that term?

8. Section 2.7.4. Shouldn't the terminators include apostrophe and double quote? That is, consider this meta tag:

  <meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />

There is no opening double quote, but there is a closing one. So step 6 would include U+0022 and U+0027 in the list of characters to find at the end.
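A hedged sketch of what I have in mind, in Python. This is my own simplification for discussion, not the text of the draft's algorithm: the function name, the extra_terminators parameter, and the exact step order are assumptions. It uses the "space characters" set (tab, LF, FF, CR, space) plus semicolon as terminators, and optionally adds U+0022 and U+0027.

```python
SPACE_CHARACTERS = "\t\n\f\r "  # the "space characters" as defined in HTML5


def extract_charset(content, extra_terminators="\"'"):
    # Find "charset", skip spaces, require "=", skip spaces, then read the
    # value up to the first terminator. Returns None if nothing is found.
    pos = content.lower().find("charset")
    if pos == -1:
        return None
    pos += len("charset")
    while pos < len(content) and content[pos] in SPACE_CHARACTERS:
        pos += 1
    if pos >= len(content) or content[pos] != "=":
        return None
    pos += 1
    while pos < len(content) and content[pos] in SPACE_CHARACTERS:
        pos += 1
    terminators = SPACE_CHARACTERS + ";" + extra_terminators
    end = pos
    while end < len(content) and content[end] not in terminators:
        end += 1
    return content[pos:end] or None


if __name__ == "__main__":
    value = 'text/html;charset=UTF-8"'
    print(repr(extract_charset(value)))                        # with quote terminators
    print(repr(extract_charset(value, extra_terminators="")))  # without them
```

Without the quote terminators, the stray closing quote ends up inside the extracted encoding name, which would then fail to match any known encoding label.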

== stopped at 2.8.3 ==




Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.

Received on Sunday, 17 July 2011 17:20:53 UTC