Issue #5 "presentation of IRI" (was: Re: proposal for Issue #23 (relax requirement for NFC transcoding)) from Martin J. Dürst on 2010-10-05 (public-iri@w3.org from October 2010)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Tue, 05 Oct 2010 13:52:14 +0900
To: Larry Masinter <masinter@adobe.com>, "Phillips, Addison" <addison@lab126.com>
CC: Bjoern Hoehrmann <derhoermi@gmx.net>, "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <4CAAAEFE.8020200@it.aoyama.ac.jp>
On 2010/10/01 16:27, Larry Masinter wrote:
> I don't mind most of what you propose, Addison, but I think there's an
> additional thing that would make things clearer, I hope:
>
> ======================================
> IDEA
> ====
>
> The key idea is to separate the notion of "an IRI" as a sequence
> of Unicode code points from the different concepts of "the presentation
> of an IRI" (as visible glyphs, signs, spoken sounds, etc.) and
> "a sequence of bytes which is intended to represent an IRI in a
> character encoding".

This is actually issue #5, so I changed the subject.

[The rest of this mail is mostly my individual opinion.]

I think that the distinction that Larry makes here is very helpful as a 
working theory for us while we write the spec.

However, if the final spec explicitly distinguishes between "an IRI" and 
"the presentation of an IRI", it will be very difficult to read, and we 
will get lots of questions. To start, if anything on paper, on the 
airwaves, or in some character encoding is just a "presentation of an 
IRI", then what's actually left to be the IRI itself? The term IRI would 
then be reserved for a Platonic concept, without being applicable to 
actual instances.

It's like the following dialog:
A (pointing to a table): This is a table, yes?
B: No, this is not a table, this is just a presentation of a table.
A: So, show me a table!
B: Sorry, I can't, tables don't exist, they are just an abstract concept.

(replace table with IRI, or anything else)


> This document would then define normatively what "an IRI" is, but only
> give "best practice" descriptions of reasonable ways of transforming
> to and from the other forms in general.

The document defines IRIs mainly through the syntax. It would be weird 
to say "this thing here on paper isn't an IRI, just a presentation of an 
IRI, but it conforms to IRI syntax".

Of course I agree that we have to distinguish where to be normative and 
where not. But it may to a large extent be an orthogonal issue. For 
example, we don't want to introduce fuzziness in the case of conversions 
that are straightforward (those where Unicode Normalization Forms and 
other character variant issues are not relevant). On the other hand, 
there surely are also non-normative issues for whatever would, according 
to this proposal, be called "an IRI".
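
As a rough illustration of the difference between straightforward 
conversions and those where normalization matters, here is a minimal 
Python sketch. The example IRI is made up, and the choice of ISO-8859-1 
as the legacy encoding is an assumption made only for illustration; it 
is not wording proposed for the spec.

```python
import unicodedata

# Hypothetical example IRI (host and path are invented for illustration).
iri = "http://example.org/r\u00e9sum\u00e9"

# Straightforward case: octets in a known character encoding map to
# exactly one sequence of Unicode code points, whatever the encoding.
latin1_bytes = iri.encode("iso-8859-1")
utf8_bytes = iri.encode("utf-8")
assert latin1_bytes != utf8_bytes                  # different octets
assert latin1_bytes.decode("iso-8859-1") == iri    # same code points
assert utf8_bytes.decode("utf-8") == iri

# Less straightforward case: a transcription (e.g. from paper) might
# yield "e" + COMBINING ACUTE ACCENT instead of the precomposed letter.
decomposed = unicodedata.normalize("NFD", iri)
assert decomposed != iri                           # different code points

# Normalizing to NFC brings both variants back to one sequence.
assert unicodedata.normalize("NFC", decomposed) == iri
```

The first half involves no choice at all, so being strict there costs 
nothing; only the second half involves the kind of variation that the 
NFC discussion is about.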


> This also sets a different context for the "bidi" discussion because
> many of the concerns are about visual presentation rather than
> codepoint representation.

Well, but that shows that the "presentation of an IRI" term is a 
slippery slope. In the end, we may have to make some compromises, but 
first and foremost, we should try to make sure there are unique visual 
representations for all IRIs.


> Since attributes and content of XML is nominally also "sequence
> of Unicode code points", there is no transformation for IRIs in
> XML.
>
> This distinction isn't quite as important for URIs because there is
> little (but not no) ambiguity or choice in the transformation between
> URI and "presentation of URI" (except perhaps l vs I and O vs 0).

I guess the URI spec made the distinction whenever it was relevant, 
without introducing a new term. I think we can and should do the same.


> To be precise, then:
>
> " The first step in processing an IRI is to obtain
>   it as a sequence of Unicode characters"
>
> No, this is not a step in "processing an IRI". This is a step
> in processing a "presentation of an IRI" or of processing a
> "sequence of bytes which is intended to represent an IRI in
> a character encoding".
>
> Yes, 3.1 is not superfluous, but it belongs as a separate
> processing step, not part of processing "an IRI".
>
>
> ====
> END IDEA
> =======
>
> I'll propose more specific wording to make the idea clearer, if
> you think this would help.

I think it might make sense to get an actual wording proposal, but 
mainly to see that it will make things much more wordy and convoluted 
than necessary.

Regards,   Martin.


> Larry
> --
> http://larry.masinter.net
>
>
>
> -----Original Message-----
> From: public-iri-request@w3.org [mailto:public-iri-request@w3.org] On Behalf Of Phillips, Addison
> Sent: Thursday, September 30, 2010 9:43 PM
> To: Bjoern Hoehrmann
> Cc: public-iri@w3.org
> Subject: RE: proposal for Issue #23 (relax requirement for NFC transcoding)
>
> (personal response)
>
> Bjoern wrote:
>
>>> 1. Change the text above to read:
>>>
>>>    If the IRI or IRI reference is an octet stream in some known
>>>    non-Unicode character encoding, convert the IRI to a sequence of
>>>    characters from the UCS.
>>>
>>>    In other cases (written on paper, read aloud, or otherwise
>>>    represented independent of any character encoding) represent
>>>    the IRI as a sequence of characters from the UCS.
>>
>> IRIs are by definition a sequence of characters from the UCS. With
>> the requirement gone, I do not think there is a point in having this
>> section in the document.
>
> 3.1 is not superfluous! The first step in processing an IRI is to obtain it as a sequence of Unicode characters. It will not occur to all users (or implementers) that the IRI is the UCS sequence, not necessarily the octets you find floating in your tag soup.
>
>>
>>> 2. Add the following text just after the second paragraph above:
>>>
>>> NOTE: Some character encodings or transcriptions can be converted to
>>> or represented by more than one sequence of Unicode characters.
>>> Ideally the resulting IRI would use a normalized form, such as
>>> Unicode Normalization Form C (NFC, [UTR15]), since that ensures a
>>> stable, consistent representation that is most likely to produce the
>>> intended results. Implementers and users are cautioned that, while
>>> denormalized character sequences are valid, they might be difficult
>>> for other users or processes to guess and might produce unexpected
>>> results.
>>
>> Normalization is already discussed in 5.3.2.2 "Character
>> Normalization", any discussion of it should be moved there if it's not already
>> covered.
>
> We could do that. However, it is probably worth pointing out that there are normalizing and non-normalizing transcodings. Perhaps something more terse and aimed at the normalization section:
>
> NOTE: Sometimes a particular character encoding or transcription can be represented by more than one sequence of Unicode characters. To help ensure interoperability, ideally the resulting IRI would use a normalized form. See Character Normalization [5.3.2.2].
>
> Best Regards,
>
> Addison
>
> Addison Phillips
> Globalization Architect (Lab126)
> Chair (W3C I18N, IETF IRI WGs)
>
> Internationalization is not a feature.
> It is an architecture.
>
>

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
Received on Tuesday, 5 October 2010 04:53:06 UTC