Re: Issue #5 "presentation of IRI" (was: Re: proposal for Issue #23 (relax requirement for NFC transcoding))

From: Mark Davis ☕ <mark@macchiato.com>
Date: Tue, 5 Oct 2010 08:04:29 -0700
Message-ID: <AANLkTimijkqzW8A-KfU=6G60Dd0SN32vT6OpqB9ocG3N@mail.gmail.com>
To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Cc: Larry Masinter <masinter@adobe.com>, "Phillips, Addison" <addison@lab126.com>, Bjoern Hoehrmann <derhoermi@gmx.net>, "public-iri@w3.org" <public-iri@w3.org>

> but first and foremost, we should try to make sure there are unique visual
> representations for all IRIs.

Depending on what you mean by this, it can vary from impossible to trivially
true.

That is, for example, suppose you mean that if Unicode string 1 can appear
with the same display as a different Unicode string 2 in some font(s), then
they cannot both be valid IRIs. But then many perfectly reasonable Cyrillic
strings could not be used in IRIs, because they look the same as (or
confusably close to) some ASCII strings, and vice versa. For example, take
the following with IDNA2008:

http://unicode.org/cldr/utility/confusables.jsp?a=scope&r=IDNA2008

Even with the tighter restrictions of UTS46+UTS39, there are still two cases
left:

http://unicode.org/cldr/utility/confusables.jsp?a=scope&r=UTS46%2BUTS39

And of course, Cyrillic vs ASCII characters are not the only ones that have
this characteristic. And they don't have to look identical; for example, at
Google we've seen the use of σ (sigma) to spoof "o", because in context at
address-bar sizes, they are confusable.
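
To make the point concrete, here is a minimal Python sketch. The strings `latin`, `cyrillic`, and `sigma_spoof` are made-up illustrations, not cases taken from the confusables pages above, and `rough_scripts` is a hypothetical helper that guesses the script from the first word of each character name. That is only a crude stand-in for the real script and confusable data in UTS #39.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

def rough_scripts(text):
    """Guess the scripts in use from the first word of each character name.

    Assumption: a heuristic for illustration only, not the Unicode
    Script property and not the UTS #39 confusable mappings.
    """
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}

latin = "cap"                     # LATIN SMALL LETTER C, A, P
cyrillic = "\u0441\u0430\u0440"   # CYRILLIC SMALL LETTER ES, A, ER
sigma_spoof = "g\u03c3\u03c3gle"  # GREEK SMALL LETTER SIGMA in place of "o"

# The two strings render almost identically in many fonts,
# yet they are different sequences of code points.
print(latin == cyrillic)            # False
print(describe(cyrillic))
print(rough_scripts(latin))         # {'LATIN'}
print(rough_scripts(cyrillic))      # {'CYRILLIC'}
print(rough_scripts(sigma_spoof))   # mixes LATIN and GREEK

# Unicode normalization does not help here: the Cyrillic letters have
# no decomposition, so NFKC leaves the string unchanged.
print(unicodedata.normalize("NFKC", cyrillic) == cyrillic)  # True
```

Both `latin` and `cyrillic` are single-script strings, so a rule that only rejects mixed scripts would accept each of them. That is why "unique visual representation" cannot be had by restricting code points alone.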

Now, I know that you are familiar with these cases, so you must mean
something else by your statement, but I'm not exactly sure what it is.

Mark

*— Il meglio è l’inimico del bene —*


On Mon, Oct 4, 2010 at 21:52, "Martin J. Dürst" <duerst@it.aoyama.ac.jp> wrote:

> On 2010/10/01 16:27, Larry Masinter wrote:
>
>> I don't mind most of what you propose, Addison, but I think there's an
>> additional thing that would make things clearer, I hope:
>>
>> ======================================
>> IDEA
>> ====
>>
>> The key idea is to separate the notion of "an IRI" as a sequence
>> of Unicode code points from the different concepts of "the presentation
>> of an IRI" (as visible glyphs, signs, spoken sounds, etc.) and
>> "a sequence of bytes which is intended to represent an IRI in a
>> character encoding".
>>
>
> This is actually issue #5, so I changed the subject.
>
> [The rest of this mail is mostly my individual opinion.]
>
> I think that the distinction that Larry makes here is very helpful as a
> working theory, for us during the writing of the spec.
>
> However, if the final spec explicitly distinguishes between "an IRI" and
> "the presentation of an IRI", it will be very difficult to read, and we will
> get lots of questions. To start, if anything on paper, on the airwaves, or
> in some character encoding is just a "presentation of an IRI", then what's
> actually left to be the IRI itself? The term IRI would then be reserved for
> a Platonic concept, without being applicable to actual instances.
>
> It's like the following dialog:
> A (pointing to a table): This is a table, yes?
> B: No, this is not a table, this is just a presentation of a table.
> A: So, show me a table!
> B: Sorry, I can't, tables don't exist, they are just an abstract concept.
>
> (replace table with IRI, or anything else)
>
>
>> This document would then define normatively what "an IRI" is, but only
>> give "best practice" descriptions of reasonable ways of transforming
>> to and from the other forms in general.
>>
>
> The document defines IRIs mainly through the syntax. It would be weird to
> say "this thing here on paper isn't an IRI, just a presentation of an IRI,
> but it conforms to IRI syntax".
>
> Of course I agree that we have to distinguish where to be normative and
> where not. But it may to a large extent be an orthogonal issue. For example,
> we don't want to introduce fuzziness in the case of conversions that are
> straightforward (those where Unicode Normalization Forms and other character
> variant issues are not relevant). On the other hand, there surely are also
> non-normative issues for whatever, according to this proposal, would be
> called "an IRI".
>
>
>> This also sets a different context for the "bidi" discussion because
>> many of the concerns are about visual presentation rather than
>> codepoint representation.
>>
>
> Well, but that shows that the "presentation of an IRI" term is a slippery
> slope. In the end, we may have to make some compromises, but first and
> foremost, we should try to make sure there are unique visual representations
> for all IRIs.
>
>
>> Since attributes and content of XML are nominally also a "sequence
>> of Unicode code points", there is no transformation for IRIs in
>> XML.
>>
>> This distinction isn't quite as important for URIs because there is
>> little (but not no) ambiguity or choice in the transformation between
>> URI and "presentation of URI" (except perhaps l vs I and O vs 0).
>>
>
> I guess the URI spec made the distinction whenever it was relevant, without
> introducing a new term. I think we can and should do the same.
>
>
>> To be precise, then:
>>
>> "The first step in processing an IRI is to obtain
>> it as a sequence of Unicode characters"
>>
>> No, this is not a step in "processing an IRI". This is a step
>> in processing a "presentation of an IRI" or of processing a
>> "sequence of bytes which is intended to represent an IRI in
>> a character encoding".
>>
>> Yes, 3.1 is not superfluous, but it belongs as a separate
>> processing step, not part of processing "an IRI".
>>
>>
>> ====
>> END IDEA
>> =======
>>
>> I'll propose more specific wording to make the idea clearer, if
>> you think this would help.
>>
>
> I think it might make sense to get an actual wording proposal, but mainly
> to see that it will make things much more wordy and convoluted than
> necessary.
>
> Regards,   Martin.
>
>
>> Larry
>> --
>> http://larry.masinter.net
>>
>>
>>
>> -----Original Message-----
>> From: public-iri-request@w3.org [mailto:public-iri-request@w3.org] On
>> Behalf Of Phillips, Addison
>> Sent: Thursday, September 30, 2010 9:43 PM
>> To: Bjoern Hoehrmann
>> Cc: public-iri@w3.org
>> Subject: RE: proposal for Issue #23 (relax requirement for NFC
>> transcoding)
>>
>> (personal response)
>>
>> Bjoern wrote:
>>
>>>> 1. Change the text above to read:
>>>>
>>>>   If the IRI or IRI reference is an octet stream in some known
>>>>   non-Unicode character encoding, convert the IRI to a sequence of
>>>>   characters from the UCS.
>>>>
>>>>   In other cases (written on paper, read aloud, or otherwise
>>>>   represented independent of any character encoding) represent
>>>>   the IRI as a sequence of characters from the UCS.
>>>>
>>>
>>> IRIs are by definition a sequence of characters from the UCS. With
>>> the requirement gone, I do not think there is a point in having this
>>> section in the document.
>>>
>>
>> 3.1 is not superfluous! The first step in processing an IRI is to obtain
>> it as a sequence of Unicode characters. It will not occur to all users (or
>> implementers) that the IRI is the UCS sequence, not necessarily the octets
>> you find floating in your tag soup.
>>
>>
>>>> 2. Add the following text just after the second paragraph above:
>>>>
>>>> NOTE: Some character encodings or transcriptions can be converted
>>>> to or represented by more than one sequence of Unicode characters.
>>>> Ideally the resulting IRI would use a normalized form, such as
>>>> Unicode Normalization Form C (NFC, [UTR15]), since that ensures a
>>>> stable, consistent representation that is most likely to produce
>>>> the intended results. Implementers and users are cautioned that,
>>>> while denormalized character sequences are valid, they might be
>>>> difficult for other users or processes to guess and might produce
>>>> unexpected results.
>>>>
>>>
>>> Normalization is already discussed in 5.3.2.2 "Character
>>> Normalization", any discussion of it should be moved there if it's
>>> not already covered.
>>>
>>
>> We could do that. However, it is probably worth pointing out that there
>> are normalizing and non-normalizing transcodings. Perhaps something more
>> terse and aimed at the normalization section:
>>
>> NOTE: Sometimes a particular character encoding or transcription can be
>> represented by more than one sequence of Unicode characters. To help ensure
>> interoperability, ideally the resulting IRI would use a normalized form. See
>> Character Normalization [5.3.2.2].
>>
>> Best Regards,
>>
>> Addison
>>
>> Addison Phillips
>> Globalization Architect (Lab126)
>> Chair (W3C I18N, IETF IRI WGs)
>>
>> Internationalization is not a feature.
>> It is an architecture.
>>
>>
>>
>>
>>
>>
>>
> --
> #-# Martin J. Dürst, Professor, Aoyama Gakuin University
> #-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
>
>
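
The two steps discussed in the quoted thread, obtaining the IRI as a sequence of Unicode characters from a legacy-encoded octet stream and then optionally normalizing it to NFC, can be sketched roughly as follows in Python. The path `/résumé`, the choice of ISO-8859-1, and the variable names are all assumptions made for illustration. Nothing here is wording or an algorithm from the draft.

```python
import unicodedata

# Step 1 (section 3.1 in the thread): the octets are not the IRI yet.
# Assumption: the stream is known to be ISO-8859-1.
octets = b"/r\xe9sum\xe9"
from_latin1 = octets.decode("iso-8859-1")   # uses precomposed U+00E9

# The "same" text from another source, e.g. a keyboard or transcription
# that produces e + COMBINING ACUTE ACCENT (U+0301).
decomposed = "/re\u0301sume\u0301"

# Both are valid character sequences, but they are not equal.
print(from_latin1 == decomposed)            # False
print(len(from_latin1), len(decomposed))    # 7 9

# Step 2 (the NOTE): normalizing to NFC gives one consistent form.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == from_latin1)                   # True
```

This shows why the proposed NOTE recommends a normalized form without requiring one: the denormalized sequence is still a legal sequence of UCS characters, but another user or process is less likely to guess it.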
Received on Tuesday, 5 October 2010 15:05:05 GMT