RE: proposal for Issue #23 (relax requirement for NFC transcoding) from Larry Masinter on 2010-10-01 (public-iri@w3.org from October 2010)

From: Larry Masinter <masinter@adobe.com>
Date: Fri, 1 Oct 2010 00:27:41 -0700
To: "Phillips, Addison" <addison@lab126.com>, Bjoern Hoehrmann <derhoermi@gmx.net>
CC: "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <C68CB012D9182D408CED7B884F441D4D04768564CC@nambxv01a.corp.adobe.com>

I don't mind most of what you propose, Addison, but I think there's an
additional thing that would make things clearer, I hope:

======================================
IDEA
====

The key idea is to separate the notion of "an IRI" as a sequence
of Unicode code points from the different concepts of "the presentation
of an IRI" (as visible glyphs, signs, spoken sounds, etc.) and
"a sequence of bytes which is intended to represent an IRI in a
character encoding".

This document would then define normatively what "an IRI" is, but only
give "best practice" descriptions of reasonable ways of transforming
to and from the other forms in general.

This also sets a different context for the "bidi" discussion because
many of the concerns are about visual presentation rather than
codepoint representation.

Since the attributes and content of XML are nominally also "sequences
of Unicode code points", no transformation is needed for IRIs in
XML.

This distinction isn't quite as important for URIs because there is
little (but not no) ambiguity or choice in the transformation between
URI and "presentation of URI" (except perhaps l vs I and O vs 0).

To be precise, then:

" The first step in processing an IRI is to obtain
 it as a sequence of Unicode characters"

No, this is not a step in "processing an IRI". This is a step
in processing a "presentation of an IRI" or of processing a
"sequence of bytes which is intended to represent an IRI in
a character encoding". 

Yes, 3.1 is not superfluous, but it belongs as a separate
processing step, not part of processing "an IRI".
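
A minimal sketch of that separation, in Python. The helper names
(iri_from_octets, code_points) and the example.org address are
hypothetical, chosen only for illustration; nothing here comes from
the draft. The point is that two different byte sequences can stand
for the same IRI, and that decoding them happens before, not as part
of, processing the IRI.

```python
from typing import List


def iri_from_octets(octets: bytes, encoding: str) -> str:
    """Separate, earlier step: turn bytes in a known character
    encoding into a sequence of Unicode code points (the IRI)."""
    return octets.decode(encoding)


def code_points(iri: str) -> List[str]:
    """List the code points of an IRI in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in iri]


# One IRI (hypothetical example), carried as two different byte sequences.
text = "http://example.org/r\u00e9sum\u00e9"
latin1_octets = text.encode("iso-8859-1")
utf8_octets = text.encode("utf-8")

# The byte sequences differ ...
assert latin1_octets != utf8_octets

# ... but once decoded they are the same sequence of code points.
iri_a = iri_from_octets(latin1_octets, "iso-8859-1")
iri_b = iri_from_octets(utf8_octets, "utf-8")
assert iri_a == iri_b
```

Anything that then operates on iri_a only ever sees code points; it
never needs to know which encoding the bytes arrived in.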


====
END IDEA
=======

I'll propose more specific wording to make the idea clearer, if
you think this would help.


Larry
--
http://larry.masinter.net



-----Original Message-----
From: public-iri-request@w3.org [mailto:public-iri-request@w3.org] On Behalf Of Phillips, Addison
Sent: Thursday, September 30, 2010 9:43 PM
To: Bjoern Hoehrmann
Cc: public-iri@w3.org
Subject: RE: proposal for Issue #23 (relax requirement for NFC transcoding)

(personal response)

Bjoern wrote:

> >1. Change the text above to read:
> >
> >   If the IRI or IRI reference is an octet stream in some known
> >   non-Unicode character encoding, convert the IRI to a sequence of
> >   characters from the UCS.
> >
> >   In other cases (written on paper, read aloud, or otherwise
> >   represented independent of any character encoding) represent
> >   the IRI as a sequence of characters from the UCS.
> 
> IRIs are by definition a sequence of characters from the UCS. With
> the requirement gone, I do not think there is a point in having this
> section in the document.

3.1 is not superfluous! The first step in processing an IRI is to obtain it as a sequence of Unicode characters. It will not occur to all users (or implementers) that the IRI is the UCS sequence, not necessarily the octets you find floating in your tag soup.

> 
> >2. Add the following text just after the second paragraph above:
> >
> >NOTE: Some character encodings or transcriptions can be converted to or
> >represented by more than one sequence of Unicode characters. Ideally the
> >resulting IRI would use a normalized form, such as Unicode Normalization
> >Form C (NFC, [UTR15]), since that ensures a stable, consistent
> >representation that is most likely to produce the intended results.
> >Implementers and users are cautioned that, while denormalized character
> >sequences are valid, they might be difficult for other users or
> >processes to guess and might produce unexpected results.
> 
> Normalization is already discussed in 5.3.2.2 "Character
> Normalization", any discussion of it should be moved there if it's not already
> covered.

We could do that. However, it is probably worth pointing out that there are normalizing and non-normalizing transcodings. Perhaps something more terse and aimed at the normalization section:

NOTE: Sometimes a particular character encoding or transcription can be represented by more than one sequence of Unicode characters. To help ensure interoperability, ideally the resulting IRI would use a normalized form. See Character Normalization [5.3.2.2].
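
As a rough illustration of why the note matters, here is a small
Python sketch using the standard unicodedata module. The strings are
made-up examples, not taken from the draft. The same visible text can
be two different code point sequences, which compare unequal until
both are brought to NFC.

```python
import unicodedata

# Two spellings of the same visible text (illustrative values only).
precomposed = "caf\u00e9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"    # "e" followed by U+0301 COMBINING ACUTE ACCENT

# As raw code point sequences they are different strings.
assert precomposed != decomposed
assert len(precomposed) == 4 and len(decomposed) == 5

# After normalizing both to NFC they are identical.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
assert nfc_a == nfc_b
```

Both inputs are valid; the sketch only shows that a consumer comparing
code points would treat them as different unless a normalized form is
used.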

Best Regards,

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N, IETF IRI WGs)

Internationalization is not a feature.
It is an architecture.
Received on Friday, 1 October 2010 08:00:38 UTC