RE: proposal for Issue #23 (relax requirement for NFC transcoding) from Phillips, Addison on 2010-10-01 (public-iri@w3.org from October 2010)

From: Phillips, Addison <addison@lab126.com>
Date: Fri, 1 Oct 2010 00:42:32 -0400
To: Bjoern Hoehrmann <derhoermi@gmx.net>
CC: "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <C7A5719F1E562149BA9171F58BEE2CA412A13D407C@EX-IAD6-B.ant.amazon.com>

(personal response)

Bjoern wrote:

> >1. Change the text above to read:
> >
> >   If the IRI or IRI reference is an octet stream in some known
> >   non-Unicode character encoding, convert the IRI to a sequence of
> >   characters from the UCS.
> >
> >   In other cases (written on paper, read aloud, or otherwise
> >   represented independent of any character encoding) represent the
> >   IRI as a sequence of characters from the UCS.
> 
> IRIs are by definition a sequence of characters from the UCS. With
> the requirement gone, I do not think there is a point in having this
> section in the document.

3.1 is not superfluous! The first step in processing an IRI is to obtain it as a sequence of Unicode characters. It will not occur to all users (or implementers) that the IRI is the UCS sequence, not necessarily the octets you find floating in your tag soup.
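
As a rough illustration of that first step, here is a minimal Python sketch. The function name and the example IRI are hypothetical, not taken from the draft, and it assumes the source encoding is already known (here ISO-8859-1):

```python
def iri_from_octets(octets: bytes, encoding: str) -> str:
    """Turn an octet stream in a known encoding into a sequence of
    Unicode characters. No normalization is applied here."""
    return octets.decode(encoding)


# Hypothetical example: the octets as they might appear in a legacy page.
raw = "http://example.org/r\u00e9sum\u00e9".encode("iso-8859-1")

# The IRI is the resulting character sequence, not the octets themselves.
iri = iri_from_octets(raw, "iso-8859-1")
```

The octets and the IRI are different things: the same character sequence would come out of a different octet stream if the page had been in UTF-8 instead.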

> 
> >2. Add the following text just after the second paragraph above:
> >
> >NOTE: Some character encodings or transcriptions can be converted to
> >or represented by more than one sequence of Unicode characters.
> >Ideally the resulting IRI would use a normalized form, such as
> >Unicode Normalization Form C (NFC, [UTR15]), since that ensures a
> >stable, consistent representation that is most likely to produce the
> >intended results. Implementers and users are cautioned that, while
> >denormalized character sequences are valid, they might be difficult
> >for other users or processes to guess and might produce unexpected
> >results.
> 
> Normalization is already discussed in 5.3.2.2 "Character
> Normalization", any discussion of it should be moved there if it's
> not already covered.

We could do that. However, it is probably worth pointing out that there are normalizing and non-normalizing transcodings. Perhaps something more terse and aimed at the normalization section:

NOTE: Sometimes a particular character encoding or transcription can be represented by more than one sequence of Unicode characters. To help ensure interoperability, ideally the resulting IRI would use a normalized form. See Character Normalization [5.3.2.2].
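
To make the "more than one sequence" point concrete, here is a small Python sketch using the standard unicodedata module. The example IRIs are hypothetical. It only shows that two valid character sequences can look the same yet compare unequal until one is normalized to NFC:

```python
import unicodedata

# Two ways to write the same visible text (hypothetical IRIs):
composed = "http://example.org/caf\u00e9"      # U+00E9, precomposed e-acute
decomposed = "http://example.org/cafe\u0301"   # "e" followed by U+0301

# Both are valid character sequences, but they differ code point by
# code point.
same_before = composed == decomposed

# Normalizing to NFC yields the precomposed form.
nfc = unicodedata.normalize("NFC", decomposed)
same_after = composed == nfc
```

Here same_before is False and same_after is True, which is why a transcoder that happens to emit the decomposed form can surprise a process expecting the composed one.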

Best Regards,

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N, IETF IRI WGs)

Internationalization is not a feature.
It is an architecture.

Received on Friday, 1 October 2010 04:43:03 UTC