Change Proposal for ISSUE-150, was: ISSUE-150 code-point-verbosity: Chairs Solicit Proposals

SUMMARY

The spec is currently very pedantic when talking about Unicode code 
points. This frequently hurts readability for everybody except, perhaps, 
somebody who is in the middle of writing an implementation.

Example:

"a valid non-negative integer, followed by a U+003B SEMICOLON character 
(;), followed by one or more space characters, followed by a substring 
that is an ASCII case-insensitive match for the string "URL", followed 
by a U+003D EQUALS SIGN character (=), followed by a valid URL." (taken 
from [1]).

RATIONALE

We can reduce verbosity without losing precision; repeating the full 
Unicode code point name each and every time is not useful, in particular 
when talking about common things like LF, CR, SPACE or 0-9, A-Z and a-z.

DETAILS

Rough proposal:

1) Introduce more named character classes such as for "ASCII digits", 
"ASCII lowercase letters", and "ASCII uppercase letters" (the spec 
already does this for white space).

2) Collapse the long format

   U+nnnn UNICODE CODE POINT NAME (c)

to

   "c" (U+nnnn)

...for those characters that appear frequently.

(The bug [1] contains proposals from Aryeh and Anne, and I'm happy with 
other notations if they get broader support.)

Detailed proposal for 1):

In 2.5.1, replace

"The alphanumeric ASCII characters are those in the ranges U+0030 DIGIT 
ZERO (0) to U+0039 DIGIT NINE (9), U+0041 LATIN CAPITAL LETTER A to 
U+005A LATIN CAPITAL LETTER Z, U+0061 LATIN SMALL LETTER A to U+007A 
LATIN SMALL LETTER Z."

by

"The uppercase ASCII letters are those in the range U+0041 LATIN CAPITAL 
LETTER A to U+005A LATIN CAPITAL LETTER Z.

The lowercase ASCII letters are those in the range U+0061 LATIN SMALL 
LETTER A to U+007A LATIN SMALL LETTER Z.

The ASCII digits are those in the range U+0030 DIGIT ZERO (0) to U+0039 
DIGIT NINE (9).

The alphanumeric ASCII characters are those that are either uppercase 
ASCII letters, lowercase ASCII letters, or ASCII digits."

Use these three classes throughout, wherever the exact ranges were 
previously cited. (I can provide a diff for this change if that would 
be useful.)

(We may also want to add classes for all ASCII letters, and perhaps for 
hexadecimal digits.)
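
To illustrate how the named classes compose, here is a minimal Python 
sketch. The constant and function names are hypothetical and chosen only 
to mirror the proposed spec terms; they are not part of the spec or of 
any implementation.

```python
# Hypothetical names mirroring the proposed spec terms.
# Each class is defined once, by its code point range.
UPPERCASE_ASCII_LETTERS = frozenset(chr(c) for c in range(0x41, 0x5A + 1))  # A-Z
LOWERCASE_ASCII_LETTERS = frozenset(chr(c) for c in range(0x61, 0x7A + 1))  # a-z
ASCII_DIGITS = frozenset(chr(c) for c in range(0x30, 0x39 + 1))             # 0-9

# The composite class refers to the named classes instead of
# repeating the ranges.
ALPHANUMERIC_ASCII_CHARACTERS = (
    UPPERCASE_ASCII_LETTERS | LOWERCASE_ASCII_LETTERS | ASCII_DIGITS
)


def is_alphanumeric_ascii(ch: str) -> bool:
    """Return True if ch is a single alphanumeric ASCII character."""
    return ch in ALPHANUMERIC_ASCII_CHARACTERS
```

The point of the sketch is only that the ranges are spelled out once and 
referenced by name everywhere else, which is what the proposed text does.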

Detailed proposal for 2):

The proposed shorthand notation applied to the example above yields:

"a valid non-negative integer, followed by a ";" (U+003B), followed by 
one or more space characters, followed by a substring that is an ASCII 
case-insensitive match for the string "URL", followed by a "=" (U+003D), 
followed by a valid URL."

Note that there are spec sections where keeping the long notation may be 
the right thing; for instance in the parser definition. The aim of this 
proposal is to reduce the amount of redundant information in prose.
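
As a rough idea of how part of this rewrite might be automated, here is 
a Python sketch that collapses the long format into the shorthand. The 
function name shorten and the regular expression are assumptions made 
for illustration; the pattern assumes the long form always ends with the 
literal character in parentheses, optionally preceded by the word 
"character". A match is rewritten only when the code point, the name and 
the literal agree, so anything unusual is left for manual review.

```python
import re
import unicodedata

# Assumed shape of the long form:
#   U+nnnn UNICODE CODE POINT NAME [character] (c)
LONG_FORM = re.compile(
    r"U\+([0-9A-F]{4,6})\s+"            # code point in hex
    r"([A-Z0-9-]+(?:\s+[A-Z0-9-]+)*)"   # character name in capitals
    r"(?:\s+character)?"                # optional trailing word
    r"\s+\((.)\)"                       # the literal character
)


def shorten(text: str) -> str:
    """Rewrite long-form code point references as "c" (U+nnnn)."""

    def replace(match: re.Match) -> str:
        hex_value, name, literal = match.groups()
        name = " ".join(name.split())  # normalise wrapped whitespace
        consistent = (
            ord(literal) == int(hex_value, 16)
            and unicodedata.name(literal, "") == name
        )
        if not consistent:
            return match.group(0)  # leave untouched for manual review
        return f'"{literal}" (U+{hex_value})'

    return LONG_FORM.sub(replace, text)
```

Range endpoints that carry no literal in parentheses, such as "U+0041 
LATIN CAPITAL LETTER A", do not match the pattern and are left as they 
are.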


IMPACT

1. Positive Effects

Increases readability by not repeating character names over and over 
again.

2. Negative Effects

Making these changes will take considerable work, and it may not be 
possible to automate them fully.

3. Conformance Classes Changes

None.

4. Risks

None.

REFERENCES

[1] <http://www.w3.org/Bugs/Public/show_bug.cgi?id=11124>

Received on Saturday, 5 February 2011 20:33:58 UTC