proposal for Issue #23 (relax requirement for NFC transcoding) from Phillips, Addison on 2010-09-29 (public-iri@w3.org from September 2010)

From: Phillips, Addison <addison@lab126.com>
Date: Wed, 29 Sep 2010 12:05:36 -0400
To: "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <C7A5719F1E562149BA9171F58BEE2CA412A12C0853@EX-IAD6-B.ant.amazon.com>

(co-chair hat OFF, individual contribution)

During a recent editorial meeting, I took an action item to propose text to address Issue #23 [1]. This issue concerns step 1b of Section 3.1, where RFC 3987 requires that a "normalizing" transcoder be used when converting an IRI from a legacy encoding to Unicode:

b. If the IRI is in some digital representation (e.g., an
octet stream) in some known non-Unicode character
encoding, convert the IRI to a sequence of characters
from the UCS normalized according to NFC.

For reference, the text in the current draft (no longer numbered in a list) reads:

If the IRI or IRI reference is an octet stream in some known non-
Unicode character encoding, convert the IRI to a sequence of
characters from the UCS; this sequence SHOULD also be normalized
according to Unicode Normalization Form C (NFC, [UTR15]). In this
case, retain the original character encoding as the "document
character encoding". (DESIGN QUESTION: NOT WHAT MOST IMPLEMENTATIONS
DO, CHANGE? )

In other cases (written on paper, read aloud, or otherwise
represented independent of any character encoding) represent the IRI
as a sequence of characters from the UCS normalized according to
Unicode Normalization Form C (NFC, [UTR15]).

Previously [2], it was proposed that this be relaxed from a firm requirement to a SHOULD [as you see just above]. However, the editors and co-chairs now feel that even this is too strong. Attempts to require NFC elsewhere have met with resistance, both because of the practicality of implementation and because significant harm to interoperability has not been observed. We therefore felt that normalization should not be normative: it should still be recommended to users where appropriate, but not imposed as an implementation requirement.

I propose the following changes.

1. Change the current draft text quoted above to read:

If the IRI or IRI reference is an octet stream in some known non-
Unicode character encoding, convert the IRI to a sequence of
characters from the UCS.

In other cases (written on paper, read aloud, or otherwise
represented independent of any character encoding) represent the IRI
as a sequence of characters from the UCS.

2. Add the following text just after the second paragraph above:

NOTE: Some character encodings or transcriptions can be converted to or represented by more than one sequence of Unicode characters. Ideally the resulting IRI would use a normalized form, such as Unicode Normalization Form C (NFC, [UTR15]), since that ensures a stable, consistent representation that is most likely to produce the intended results. Implementers and users are cautioned that, while denormalized character sequences are valid, they might be difficult for other users or processes to guess and might produce unexpected results.
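
For illustration only (this is not part of the proposed text), here is a minimal Python sketch of what the change and the NOTE describe. The helper name `transcode`, the sample string, and the choice of ISO-8859-1 as the legacy encoding are assumptions made for the example. It shows a plain, non-normalizing conversion to Unicode, and then two valid but different character sequences for the same visible text that only compare equal after NFC normalization.

```python
import unicodedata
from urllib.parse import quote


def transcode(octets: bytes, encoding: str) -> str:
    """Convert legacy-encoded octets to Unicode characters.

    Hypothetical helper: it performs the conversion only and does
    NOT normalize the result, as in the relaxed text of change 1.
    """
    return octets.decode(encoding)


# An IRI path segment stored as ISO-8859-1 octets (assumed sample data).
from_legacy = transcode(b"r\xe9sum\xe9", "iso-8859-1")

# The same visible text as two different Unicode sequences:
composed = "r\u00e9sum\u00e9"      # precomposed U+00E9
decomposed = "re\u0301sume\u0301"  # "e" followed by combining U+0301

# Both are valid, but they are not equal as character sequences.
sequences_differ = composed != decomposed

# NFC maps the decomposed form onto the composed one.
nfc_of_decomposed = unicodedata.normalize("NFC", decomposed)

# Percent-encoding the UTF-8 octets yields different strings too.
print(quote(composed))    # r%C3%A9sum%C3%A9
print(quote(decomposed))  # re%CC%81sume%CC%81
```

In this sketch the ISO-8859-1 input happens to convert directly to the precomposed form. A user who types or pastes the decomposed form would produce a different, equally valid sequence, which is the situation the NOTE cautions about.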

Thanks,

Addison

[1] http://trac.tools.ietf.org/wg/iri/trac/ticket/23
[2] http://lists.w3.org/Archives/Public/public-iri/2007Jul/0008.html

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N, IETF IRI WGs)

Internationalization is not a feature.
It is an architecture.

Received on Wednesday, 29 September 2010 16:07:36 UTC