RE: proposal for Issue #23 (relax requirement for NFC transcoding)

(co-chair hat OFF)

Trac issue: http://trac.tools.ietf.org/wg/iri/trac/ticket/23 

So here is a slightly different version of the same proposal. Bjoern, does this suitably allay your concerns? I recognize that you would prefer us to eliminate the conversion step altogether, but some feel that the conversion step is not altogether obvious, even if it is intrinsic to what follows.

---
An IRI or IRI reference is a sequence of characters from the UCS. For IRIs that are not already encoded in Unicode (as when written on paper, read aloud, or represented in a text stream using a legacy character encoding), convert the IRI to Unicode. Note that some character encodings or transcriptions can be converted to or represented by more than one sequence of Unicode characters. Ideally the resulting IRI would use a normalized form, such as Unicode Normalization Form C [UAX15] (see [Section 5] Normalization and Comparison), since that ensures a stable, consistent representation that is most likely to produce the intended results. Implementers and users are cautioned that, while denormalized character sequences are valid, they might be difficult for other users or processes to reproduce and might lead to unexpected results.
---
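
To illustrate the point about multiple Unicode sequences, here is a minimal Python sketch. It is not part of the proposed text. The sample octets, the choice of ISO-8859-1, and the helper name transcode_to_ucs are all assumptions made for the example. It transcodes a legacy-encoded string to Unicode, shows that a decomposed spelling of the same text is a different (but still valid) character sequence, and shows that NFC brings both to the same form.

```python
import unicodedata


def transcode_to_ucs(octets: bytes, encoding: str) -> str:
    """Hypothetical helper: decode octets in a known legacy encoding to Unicode."""
    return octets.decode(encoding)


# Assumed sample data: "/résumé" as ISO-8859-1 octets.
legacy_octets = b"/r\xe9sum\xe9"

# Conversion step: the IRI is the sequence of UCS characters, not the octets.
precomposed = transcode_to_ucs(legacy_octets, "iso-8859-1")

# The same text spelled with combining accents (as a non-normalizing
# transcoder or a manual transcription might produce it).
decomposed = unicodedata.normalize("NFD", precomposed)

# Both sequences are valid, but they are not equal code point for code point.
print(precomposed == decomposed)          # False
print(len(precomposed), len(decomposed))  # 7 9

# Normalizing to NFC gives one stable representation for both spellings.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

A consumer comparing the two spellings character by character would treat them as different IRIs, which is the "unexpected results" case the caution is about.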

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N, IETF IRI WGs)

Internationalization is not a feature.
It is an architecture.


> -----Original Message-----
> From: "Martin J. Dürst" [mailto:duerst@it.aoyama.ac.jp]
> Sent: Friday, October 01, 2010 1:33 AM
> To: Phillips, Addison
> Cc: Bjoern Hoehrmann; public-iri@w3.org
> Subject: Re: proposal for Issue #23 (relax requirement for NFC
> transcoding)
> 
> Hello Addison, Björn,
> 
> On 2010/10/01 13:42, Phillips, Addison wrote:
> > (personal response)
> >
> > Bjoern wrote:
> >
> >>> 1. Change the text above to read:
> >>>
> >>>    If the IRI or IRI reference is an octet stream in some known
> >>>    non-Unicode character encoding, convert the IRI to a sequence of
> >>>    characters from the UCS.
> >>>
> >>>    In other cases (written on paper, read aloud, or otherwise
> >>>    represented independent of any character encoding) represent the
> >>>    IRI as a sequence of characters from the UCS.
> >>
> >> IRIs are by definition a sequence of characters from the UCS.
> 
> Well, they can also be on paper or in sound. The "from the UCS" is
> relevant even for paper and sound because even there, IRIs cannot
> contain characters that aren't part of Unicode (e.g. logo-like
> stuff,...).
> 
> >> With the requirement gone, I do not think there is a point in having
> >> this section in the document.
> >
> > 3.1 is not superfluous! The first step in processing an IRI is to
> > obtain it as a sequence of Unicode characters. It will not occur to
> > all users (or implementers) that the IRI is the UCS sequence, not
> > necessarily the octets you find floating in your tag soup.
> >
> >>
> >>> 2. Add the following text just after the second paragraph above:
> >>>
> >>> NOTE: Some character encodings or transcriptions can be converted to
> >>> or represented by more than one sequence of Unicode characters.
> >>> Ideally the resulting IRI would use a normalized form, such as Unicode
> >>> Normalization Form C (NFC, [UTR15]), since that ensures a stable,
> >>> consistent representation that is most likely to produce the intended
> >>> results. Implementers and users are cautioned that, while denormalized
> >>> character sequences are valid, they might be difficult for other users
> >>> or processes to guess and might produce unexpected results.
> >>
> >> Normalization is already discussed in 5.3.2.2 "Character
> >> Normalization", any discussion of it should be moved there if it's
> >> not already covered.
> 
> Reading section 5.3.2.2, it contains MUSTs that are very much predicated
> on the assumption of normalizing transcoders and use of NFC when
> creating IRIs. I think this section requires some rework.
> 
> > We could do that. However, it is probably worth pointing out that
> > there are normalizing and non-normalizing transcodings. Perhaps
> > something more terse and aimed at the normalization section:
> >
> > NOTE: Sometimes a particular character encoding or transcription
> > can be represented by more than one sequence of Unicode characters.
> > To help ensure interoperability, ideally the resulting IRI would
> > use a normalized form. See Character Normalization [5.3.2.2].
> 
> I don't understand "character encoding or transcription". "Character
> encoding" refers to things such as UTF-8, Shift_JIS,...
> Of course I can represent 'UTF-8' also by 'utf-8', but that's not what
> we are talking about :-).
> 
> Regards,   Martin.
> 
> 
> --
> #-# Martin J. Dürst, Professor, Aoyama Gakuin University
> #-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp

Received on Friday, 22 October 2010 21:41:36 UTC