- From: Julian Reschke <julian.reschke@gmx.de>
- Date: Sat, 05 Feb 2011 21:33:16 +0100
- To: Maciej Stachowiak <mjs@apple.com>
- CC: "public-html@w3.org LIST" <public-html@w3.org>
SUMMARY

The spec is currently very pedantic when talking about Unicode code points. Frequently, this hurts readability for everybody except perhaps someone in the middle of writing an implementation. Example:

"a valid non-negative integer, followed by a U+003B SEMICOLON character (;), followed by one or more space characters, followed by a substring that is an ASCII case-insensitive match for the string "URL", followed by a U+003D EQUALS SIGN character (=), followed by a valid URL."

(taken from [1]).

RATIONALE

We can reduce verbosity without losing precision; repeating the full Unicode code point name each and every time is not useful, in particular when talking about common things like LF, CR, SPACE or 0-9, A-Z and a-z.

DETAILS

Rough proposal:

1) Introduce more named character classes, such as "ASCII digits", "ASCII lowercase letters", and "ASCII uppercase letters" (the spec already does this for white space).

2) Collapse the long format

   U+nnnn UNICODE CODE POINT NAME (c)

to

   "c" (U+nnnn)

...for those characters that appear frequently.

(The bug [1] contains proposals from Aryeh and Anne, and I'm happy with other notations if they get broader support).

Detailed proposal for 1):

In 2.5.1, replace

"The alphanumeric ASCII characters are those in the ranges U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z, U+0061 LATIN SMALL LETTER A to U+007A LATIN SMALL LETTER Z."

by

"The uppercase ASCII letters are those in the range U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z. The lowercase ASCII letters are those in the range U+0061 LATIN SMALL LETTER A to U+007A LATIN SMALL LETTER Z. The ASCII digits are those in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9). The alphanumeric ASCII characters are those in the ranges uppercase ASCII letters, lowercase ASCII letters, or ASCII digits."

Use these three classes throughout, wherever the exact ranges were previously cited.

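As an illustration only, the three proposed classes can be written down as plain sets of code points, with the alphanumeric class defined as their union. This is a minimal Python sketch; the constant names and the helper is_alphanumeric_ascii are made up for the example and do not come from the spec.

```python
# Named ASCII character classes, mirroring the proposed definitions.
# All names here are hypothetical; the spec defines prose terms, not code.
UPPERCASE_ASCII_LETTERS = frozenset(chr(c) for c in range(0x41, 0x5A + 1))  # U+0041..U+005A
LOWERCASE_ASCII_LETTERS = frozenset(chr(c) for c in range(0x61, 0x7A + 1))  # U+0061..U+007A
ASCII_DIGITS = frozenset(chr(c) for c in range(0x30, 0x39 + 1))  # U+0030..U+0039

# The alphanumeric class is simply the union of the three classes above.
ALPHANUMERIC_ASCII = UPPERCASE_ASCII_LETTERS | LOWERCASE_ASCII_LETTERS | ASCII_DIGITS


def is_alphanumeric_ascii(ch: str) -> bool:
    """Return True if ch is a single alphanumeric ASCII character."""
    return ch in ALPHANUMERIC_ASCII
```

Defining the ranges once and referring to them by name is the same move the proposal makes for the prose.
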
(I can provide a diff for this change if this is considered useful.)

(We may want to add additional classes for all ASCII letters, and maybe for hex digits.)

Detailed proposal for 2):

The proposed shorthand notation applied to the example above yields:

"a valid non-negative integer, followed by a ";" (U+003B), followed by one or more space characters, followed by a substring that is an ASCII case-insensitive match for the string "URL", followed by a "=" (U+003D), followed by a valid URL."

Note that there are spec sections where keeping the long notation may be the right thing, for instance in the parser definition. The aim of this proposal is to reduce the amount of redundant information in prose.

IMPACT

1. Positive Effects

Increases readability by not repeating character names over and over.

2. Negative Effects

Making these changes will be quite some work, and it may not be possible to automate it fully.

3. Conformance Classes Changes

None.

4. Risks

None.

REFERENCES

[1] <http://www.w3.org/Bugs/Public/show_bug.cgi?id=11124>
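
To give a feel for how much of change 2) could be automated, here is a rough Python sketch that rewrites the long form into the shorthand. It is not a tool that exists anywhere; the names LONG_FORM and collapse are made up, and the pattern assumes the long form always looks like "U+nnnn NAME (c)" or "U+nnnn NAME character (c)". A match is only rewritten when the code point, the name, and the literal character all agree, so inconsistent spots are left for manual review.

```python
import re
import unicodedata

# Hypothetical pattern for the long form, e.g. "U+003B SEMICOLON character (;)".
LONG_FORM = re.compile(
    r"U\+([0-9A-F]{4,6}) ([A-Z][A-Z0-9 -]*[A-Z0-9])(?: character)? \((.)\)"
)


def collapse(text: str) -> str:
    """Rewrite 'U+nnnn NAME (c)' as '"c" (U+nnnn)' where the parts are consistent."""

    def shorten(match):
        code, name, ch = match.group(1), match.group(2), match.group(3)
        # Leave the text alone if code point, name and character disagree.
        if ord(ch) != int(code, 16) or unicodedata.name(ch, "") != name:
            return match.group(0)
        return f'"{ch}" (U+{code})'

    return LONG_FORM.sub(shorten, text)
```

Running collapse over the example sentence from the summary should reproduce the shorthand version quoted above. Sections that are meant to keep the long notation, such as the parser definition, would have to be excluded by hand.
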
Received on Saturday, 5 February 2011 20:33:58 UTC