
IRI language in a few specs

From: François Yergeau <francois@yergeau.com>
Date: Wed, 11 Jan 2006 07:49:20 -0800
To: public-xml-core-wg@w3.org
Message-id: <43C52900.5010801@yergeau.com>

I have an action to craft uniform language for IRIs in XLink 1.1, XLink 
1.0, xml:base, xinclude, XML 1.0, and XML 1.1 (as errata for all but 
XLink 1.1).  This is my attempt (still missing XLink 1.0, and not 
perfectly uniform):


XLink 1.1
===========================================================
section 5.4 [http://www.w3.org/TR/2005/WD-xlink11-20050707/#link-locators]

Change from:
-------------------------
The value of the href attribute must be an IRI reference as defined in 
[IETF RFC 3987] or must result in an IRI reference after the escaping 
procedure described below is applied. (By design, all URIs (Uniform 
Resource Identifiers) as defined in [IETF RFC 3986] are also IRIs.)

XLink 1.0 described a procedure for escaping characters found in the 
href attribute value that were not allowed in URIs. For XLink 1.1, those 
details are normatively described in Section 3.1 of [IETF RFC 3987]. 
However, for backwards compatibility, XLink 1.1 processors must escape 
one additional character, the space. All occurrences of a space in the 
value of an href attribute must be replaced by %20.
-------------------------
to:
-------------------------
The value of the href attribute must be an IRI reference as defined in 
[IETF RFC 3987] or must result in an IRI reference after the escaping 
procedure described below is applied. (By design, all URIs (Uniform 
Resource Identifiers) as defined in [IETF RFC 3986] are also IRIs.)

To convert the value of the href attribute to an IRI reference, the 
following characters must be escaped:

     * the control characters #x0 to #x1F and #x7F (most of which cannot 
appear in XML)
     * space #x20

       Note: Authors are advised to avoid unescaped spaces, as XML 
Schema has identified them as an interoperability risk.

     * the delimiters < #x3C, > #x3E and " #x22
     * the unwise characters { #x7B, } #x7D, | #x7C, \ #x5C, ^ #x5E and 
` #x60

These characters are escaped by applying to them steps 2.1 to 2.3 of 
Section 3.1 of [IETF RFC 3987].

If necessary for the implementation, an IRI reference is converted to a 
URI reference according to the prescriptions of Section 3.1 of [IETF RFC 
3987].  The two conversions (href value to IRI reference, IRI reference 
to URI reference) may be merged.
-------------------------
[first para unchanged, rest adapted from XInclude]
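
As an illustration only (not part of the proposed spec text), the escaping 
described above could be sketched in Python roughly as follows. The names 
ESCAPE_SET, escape_char and href_to_iri_reference are hypothetical. The 
sketch assumes that steps 2.1 to 2.3 of Section 3.1 of RFC 3987 amount to 
encoding each listed character in UTF-8 and writing each resulting octet 
as %HH, here with uppercase hex digits.

```python
# Hypothetical sketch: escape only the characters listed in the proposed text.
ESCAPE_SET = (
    {chr(c) for c in range(0x20)}  # control characters #x0 to #x1F
    | {"\x7f", " "}                # #x7F and space #x20
    | set('<>"')                   # the delimiters
    | set("{}|\\^`")               # the unwise characters
)


def escape_char(ch: str) -> str:
    """Encode the character in UTF-8, then write each octet as %HH."""
    return "".join(f"%{octet:02X}" for octet in ch.encode("utf-8"))


def href_to_iri_reference(value: str) -> str:
    """Replace each listed character by its escaped form; keep the rest."""
    return "".join(escape_char(ch) if ch in ESCAPE_SET else ch for ch in value)
```

Under these assumptions, non-ASCII characters and existing % signs pass 
through untouched, since neither appears in the list of characters to be 
escaped.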


XML Base 1.0
===========================================================
section 3.0 [http://www.w3.org/TR/2001/REC-xmlbase-20010627/#syntax]

Change from:
-------------------------
The value of this attribute is interpreted as a URI Reference as defined 
in RFC 2396 [IETF RFC 2396], after processing according to Section 3.1.
-------------------------
to:
-------------------------
The value of this attribute is interpreted as an IRI Reference as 
defined in RFC 3987 [IETF RFC 3987], after processing according to 
Section 3.1.
-------------------------


section 3.1 [http://www.w3.org/TR/2001/REC-xmlbase-20010627/#escaping]

Change from:
-------------------------
The set of characters allowed in xml:base attributes is the same as for 
XML, namely [Unicode]. However, some Unicode characters are disallowed 
from URI references, and thus processors must encode and escape these 
characters to obtain a valid URI reference from the attribute value.

The disallowed characters include all non-ASCII characters, plus the 
excluded characters listed in Section 2.4 of [IETF RFC 2396], except for 
the number sign (#) and percent sign (%) characters and the square 
bracket characters re-allowed in [IETF RFC 2732]. Disallowed characters 
must be escaped as follows:

    1. Each disallowed character is converted to UTF-8 [IETF RFC 2279] 
as one or more bytes.

    2. Any bytes corresponding to a disallowed character are escaped 
with the URI escaping mechanism (that is, converted to %HH, where HH is 
the hexadecimal notation of the byte value).

    3. The original character is replaced by the resulting character 
sequence.
-------------------------
to:
-------------------------
To convert the value of the xml:base attribute to an IRI reference, the 
following characters must be escaped:

     * the control characters #x0 to #x1F and #x7F (most of which cannot 
appear in XML)
     * space #x20

       Note:

       Authors are advised to avoid unescaped spaces, as XML Schema has 
identified them as an interoperability risk.

     * the delimiters < #x3C, > #x3E and " #x22
     * the unwise characters { #x7B, } #x7D, | #x7C, \ #x5C, ^ #x5E and 
` #x60

These characters are escaped by applying to them steps 2.1 to 2.3 of 
Section 3.1 of [IETF RFC 3987].

If necessary for the implementation, an IRI reference is converted to a 
URI reference according to the prescriptions of Section 3.1 of [IETF RFC 
3987].  The two conversions (xml:base value to IRI reference, IRI 
reference to URI reference) may be merged.
-------------------------
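
For illustration only, a merged conversion (attribute value straight to 
URI reference) might look like the Python sketch below. The names 
ESCAPE_SET and to_uri_reference are hypothetical. The sketch makes two 
simplifying assumptions that are not in the text above: it escapes every 
character above #x7F rather than checking the exact character ranges 
that RFC 3987 allows in IRIs, and it leaves out any special handling of 
internationalized domain names.

```python
# Hypothetical sketch: merge both conversions into a single pass.
ESCAPE_SET = (
    {chr(c) for c in range(0x20)}  # control characters #x0 to #x1F
    | {"\x7f", " "}                # #x7F and space #x20
    | set('<>"')                   # the delimiters
    | set("{}|\\^`")               # the unwise characters
)


def to_uri_reference(value: str) -> str:
    """Escape listed characters and (simplification) all non-ASCII ones."""
    out = []
    for ch in value:
        if ch in ESCAPE_SET or ord(ch) > 0x7F:
            # UTF-8 encode the character, then write each octet as %HH.
            out.append("".join(f"%{octet:02X}" for octet in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)
```

With these assumptions, a value such as "résumé 1.xml" would come out as 
"r%C3%A9sum%C3%A9%201.xml", while "#" and "%" are left alone.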


XInclude 1.0
===========================================================
section 4.1.1 [http://www.w3.org/TR/2004/REC-xinclude-20041220/#IRIs]

Change from:
-------------------------
The href attribute value is converted to either a URI reference or an 
IRI reference, as appropriate to the implementation.

Work is currently in progress to produce an RFC defining 
Internationalized Resource Identifiers (IRIs). Since this work is not 
yet complete, in this section we define IRI references syntactically. We 
expect to issue an erratum replacing portions of this section with a 
reference to the RFC when it is published. For a more general definition 
and discussion of IRIs see [IRI draft] (work in progress).

[Definition: An IRI reference is a string that can be converted to a URI 
reference by escaping the following additional characters:]

     * the Unicode plane 0 characters #xA0 - #xD7FF, #xF900-#xFDCF, 
#xFDF0-#xFFEF
     * the Unicode plane 1-14 characters #x10000-#x1FFFD ... #xE0000-#xEFFFD

To convert the value of the href attribute to an IRI reference, the 
following characters must be escaped:

     * space #x20

       Note:

       Authors are advised to avoid unescaped spaces, as XML Schema has 
identified them as an interoperability risk.

     * the delimiters < #x3C, > #x3E and " #x22
     * the unwise characters { #x7B, } #x7D, | #x7C, \ #x5C, ^ #x5E and 
` #x60

These characters are escaped as follows:


    1. Each additional character is converted to UTF-8 [Unicode] as one 
or more bytes.

    2. The resulting bytes are escaped with the URI escaping mechanism 
(that is, converted to %HH, where HH is the hexadecimal notation of the 
byte value).

    3. The original character is replaced by the resulting character 
sequence.

To convert an IRI reference to a URI reference, the additional 
characters allowed in IRIs must be escaped using the same method.
-------------------------
to:
-------------------------
The value of the href attribute must be an IRI reference as defined in 
[IETF RFC 3987] or must result in an IRI reference after the escaping 
procedure described below is applied. (By design, all URIs (Uniform 
Resource Identifiers) as defined in [IETF RFC 3986] are also IRIs.)

To convert the value of the href attribute to an IRI reference, the 
following characters must be escaped:

     * the control characters #x0 to #x1F and #x7F (most of which cannot 
appear in XML)
     * space #x20

       Note:

       Authors are advised to avoid unescaped spaces, as XML Schema has 
identified them as an interoperability risk.

     * the delimiters < #x3C, > #x3E and " #x22
     * the unwise characters { #x7B, } #x7D, | #x7C, \ #x5C, ^ #x5E and 
` #x60

These characters are escaped by applying to them steps 2.1 to 2.3 of 
Section 3.1 of [IETF RFC 3987].

If necessary for the implementation, an IRI reference is converted to a 
URI reference according to the prescriptions of Section 3.1 of [IETF RFC 
3987].  The two conversions (href value to IRI reference, IRI reference 
to URI reference) may be merged.
-------------------------


XML 1.0 and XML 1.1
===========================================================
XML 1.0 section 4.2.2 
[http://www.w3.org/TR/2004/REC-xml-20040204/#sec-external-ent]
XML 1.1 section 4.2.2 
[http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-external-ent]

Change from:
-------------------------
System identifiers (and other XML strings meant to be used as URI 
references) MAY contain characters that, according to [IETF RFC 2396] 
and [IETF RFC 2732], must be escaped before a URI can be used to 
retrieve the referenced resource. The characters to be escaped are the 
control characters #x0 to #x1F and #x7F (most of which cannot appear in 
XML), space #x20, the delimiters '<' #x3C, '>' #x3E and '"' #x22, the 
unwise characters '{' #x7B, '}' #x7D, '|' #x7C, '\' #x5C, '^' #x5E and 
'`' #x60, as well as all characters above #x7F. Since escaping is not 
always a fully reversible process, it MUST be performed only when 
absolutely necessary and as late as possible in a processing chain. In 
particular, neither the process of converting a relative URI to an 
absolute one nor the process of passing a URI reference to a process or 
software component responsible for dereferencing it SHOULD trigger 
escaping. When escaping does occur, it MUST be performed as follows:

    1. Each character to be escaped is represented in UTF-8 [Unicode3] 
as one or more bytes.

    2. The resulting bytes are escaped with the URI escaping mechanism 
(that is, converted to %HH, where HH is the hexadecimal notation of the 
byte value).

    3. The original character is replaced by the resulting character 
sequence.
-------------------------
to:
-------------------------
System identifiers (and other XML strings meant to be used as URI 
references) MAY contain characters that, according to [IETF RFC 3986], 
must be escaped before a URI can be used to retrieve the referenced 
resource. This escaping MUST be performed following the prescriptions of 
Section 3.1 of [IETF RFC 3987], including the escaping (optional in RFC 
3987) of the following characters:

     * the control characters #x0 to #x1F and #x7F (most of which cannot 
appear in XML)
     * space #x20

       Note:

       Authors are advised to avoid unescaped spaces, as XML Schema has 
identified them as an interoperability risk.

     * the delimiters < #x3C, > #x3E and " #x22
     * the unwise characters { #x7B, } #x7D, | #x7C, \ #x5C, ^ #x5E and 
` #x60
-------------------------
Received on Wednesday, 11 January 2006 15:49:14 GMT
