Errata: EncName definition for XML 1.0 and XML 1.1

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Wed, 28 Jul 2010 20:34:11 +0300
To: xml-editor@w3.org
Message-ID: <20100728203411761283.5e8d2cdf@xn--mlform-iua.no>

I'm sorry if this has been reported before. However, both XML 1.0 and 
XML 1.1 have the following line: [1][2]

[81] EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')* 
/* Encoding name contains only Latin characters */

The expression "Latin characters" is confusing. It makes it possible to 
read the comment as saying that Latin letters only - thus no numbers, 
no punctuation and no non-Latin letters - are found in an encoding 
name. It may also be misinterpreted to mean that even non-ASCII "Latin 
characters" are allowed.

I assume that the expression is an attempt to assure that no one 
interprets [A-Za-z] to mean "any uppercase or lowercase letter, 
irrespective of script or charset". I.e. readers are to understand 
[A-Za-z] as referring to ASCII Latin letters only. Probably, in the 
days when non-Unicode encodings dominated text editing, it was 
common for RegEx implementations to use [A-Za-z] as a reference to 
UPPERCASE/lowercase letters irrespective of the script in use. (At the 
very least, before Mac OS X arrived, the Macintosh text editor I used 
myself worked like that.)
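
To make the intended reading concrete, here is a minimal sketch of 
production [81] as a Python regular expression. It is an illustration 
only and not part of either specification; the helper name is_enc_name 
is hypothetical. It assumes Python's re module, where an explicit range 
such as [A-Za-z] covers only the ASCII code points as long as 
case-insensitive matching is not switched on.

```python
import re

# Production [81] EncName, transcribed as a Python regex.
# Without re.IGNORECASE, [A-Za-z] matches only the 52 ASCII letters.
ENC_NAME = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")


def is_enc_name(name: str) -> bool:
    """Return True if the whole string matches the EncName production."""
    return ENC_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    # Digits and punctuation are allowed after the first letter,
    # non-ASCII letters (Latin or otherwise) are not.
    for candidate in ["UTF-8", "ISO-8859-1", "8859-1", "Étude", "ΥΤF-8"]:
        print(candidate, is_enc_name(candidate))
```

Under this reading, "UTF-8" and "ISO-8859-1" are accepted, while a name 
starting with a digit, a name containing a non-ASCII Latin letter such 
as "É", and a name containing Greek letters are all rejected.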

If I am obliged to suggest a replacement, then I'd say:

/* A letter in an encoding name is always a Latin ASCII letter. */

	or - probably better:

/* [A-Za-z] refers to ASCII Latin letters only. */

[1] http://www.w3.org/TR/2008/REC-xml-20081126/#NT-EncName 
[2] http://www.w3.org/TR/2006/REC-xml11-20060816/#NT-EncName

leif halvard silli
Received on Thursday, 29 July 2010 12:51:51 UTC