
Re: Cool IRIs & diacritics, for a change

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Tue, 8 Feb 2011 21:08:22 +0100
To: Richard Ishida <ishida@w3.org>
Cc: www-international@w3.org
Message-ID: <20110208210822104542.d17ece18@xn--mlform-iua.no>
Richard Ishida, Tue, 08 Feb 2011 12:09:01 +0000:
> It's not clear to me how many resources you have on the server.  For 
> example does http://example.com/ link to the same resource as 
> http://example.com/å.html ?

Yes, both link to the same resource.

> If so, we need to also be sure that the problem doesn't occur because 
> of content negotiation problems.  That may mean testing differently 
> named resources in different directories.

Apache on a Mac is set up with content negotiation enabled. But I have 
not tested what happens if I disable content negotiation.

Meanwhile, perhaps it only causes confusion, but I have uploaded two of 
the tests here:



Those pages are identical, except that the resource/file names in the 
second test are "Web encoded" (with composed letters), while the 
resource/file names in the former are "Mac encoded" (decomposed). The 
conversion of the file names to "Web encoding" in the latter test was 
done automatically by my FTP program, Yummy FTP. The conversion seems 
to work perfectly.
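
For anyone who wants to reproduce the two link variants, here is a minimal Python 3 sketch. It is not part of the tests themselves, and the helper name `link_variants` is made up for illustration. It produces the composed (NFC) and decomposed (NFD) form of a letter and percent-encodes each as UTF-8:

```python
import unicodedata
from urllib.parse import quote

def link_variants(letter):
    """Return (composed, decomposed) percent-encoded forms of `letter`."""
    composed = unicodedata.normalize("NFC", letter)    # "Web encoded"
    decomposed = unicodedata.normalize("NFD", letter)  # "Mac encoded"
    return quote(composed), quote(decomposed)

# Letters written as escapes so the source file's own normalization
# cannot change what is being tested.
letters = {
    "a with ring": "\u00e5",
    "Cyrillic short i": "\u0439",
    "ae ligature": "\u00e6",
    "Cyrillic ya": "\u044f",
}

for name, letter in letters.items():
    composed, decomposed = link_variants(letter)
    print(name, composed, decomposed, "differs" if composed != decomposed else "same")
```

For 'å' and 'й' the two columns differ (%C3%A5 vs. a%CC%8A, and %D0%B9 vs. %D0%B8%CC%86), which matches the values in the quoted test results. For 'æ' and 'я' both forms are identical.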

Interestingly, having uploaded the UTF-8-Mac tests, I can say that they 
work much better in Apache on my Mac than they do in Apache on my 
ISP's Linux server. Thus, the test results that I have 
recorded in the test are only valid for UTF-8-Mac in Apache on my Mac. 
(On the Linux server, the UTF-8-Mac encoded tests barely work at all, 
except for some of the decomposed percent-encoded links.)
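
One possible explanation, not confirmed by anything in this thread, is that the Mac file system treats composed and decomposed names as equivalent, while a typical Linux file system compares file names byte for byte. The following Python 3 sketch only illustrates that difference. The helper `find_by_normalized_name` and the file list are hypothetical, and this is not how Apache resolves requests:

```python
import unicodedata

def find_by_normalized_name(requested, stored_names):
    """Return the stored name that equals `requested` under NFC, or None."""
    wanted = unicodedata.normalize("NFC", requested)
    for name in stored_names:
        if unicodedata.normalize("NFC", name) == wanted:
            return name
    return None

# Hypothetical directory listing with decomposed ("Mac encoded") names.
stored = ["a\u030a.html", "\u0438\u0306.html"]

# A request that uses the composed ("Web encoded") letter.
composed_request = "\u00e5.html"

# Byte-exact comparison: the composed request does not match.
byte_exact_hit = composed_request in stored

# Normalization-insensitive comparison: the same request does match.
normalized_hit = find_by_normalized_name(composed_request, stored)

print(byte_exact_hit, normalized_hit is not None)
```

With a byte-exact lookup the composed request misses the decomposed file name, whereas the normalizing lookup finds it.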

Leif Halvard Silli

> RI
> Richard Ishida
> Internationalization Activity Lead
> W3C (World Wide Web Consortium)
> http://www.w3.org/International/
> http://rishida.net/

> On 05/02/2011 00:11, Leif Halvard Silli wrote:
>> The article 'Normalization in HTML and CSS' states: [1]
>> ]]  To improve interoperability, the W3C recommends
>>      the use of NFC normalized text on the Web.        [[
>> However, with the exception of browsing Wikipedia, in my experience as
>> a Mac OS X user, this advice is only accurate for letters where the
>> Unicode Normalization is irrelevant.
>> Here are some results from some tests involving the letters å (Latin a
>> with ring), й (Cyrillic short i), æ (Latin ae ligature) and я (Cyrillic
>> ya). The 'å' and the 'й' have two Unicode representations - composed
>> and decomposed, and they are thus the only letters for which the advice
>> to use composed normalization is relevant. The file names of the target
>> files followed the normal Mac OS X convention (which is decomposed). I
>> tested with both composed and decomposed values inside the @href
>> attribute.
>> 	A) Results, when *clicking* the links:
>> 		Unsuccessful link resolving:
>> Composed link variant: http://example.com/ (%C3%A5)
>> Composed link variant: http://example.com/ (%D0%B9)
>> 		Successful link resolving:
>> Decomposed link variant: http://example.com/ (a%CC%8A)
>> Decomposed link variant: http://example.com/ (%D0%B8%CC%86)
>> For the rest, composed vs. decomposed didn't matter:
>> 	http://example.com/ (%D0%B8%CC%86)
>> 	http://example.com/ (a%CC%8A)
>> 	http://example.com/й.html
>> 	http://example.com/å.html
>> 	http://example.com/å (%C3%A5%C3%A5)
>> 	http://example.com/й (%D0%B9%D0%B9)
>> 	http://example.com/åå.html
>> 	http://example.com/йй.html
>> 	http://example.com/æ
>> 	http://example.com/я
>> 	B) Results, when *typing* the IRI in the URL bar:
>> 		Unsuccessful link resolving:
>> Same as when clicking, except that two more links did not work:
>> Decomposed link variant: http://example.com/åå (%C3%A5%C3%A5)
>> Decomposed link variant: http://example.com/йй
>> (%D0%B8%CC%86%D0%B8%CC%86)
>> 		Successful link resolving:
>> For the rest, composed vs. decomposed @href value didn't matter.
>> Summary of results:
>> 	1) Clicking: when the @href value was a composed IRI and the IRI
>> 	   was cool and consisted of a single letter, then if you
>> 	   **clicked** the link, it failed. Otherwise it worked.
>> 	2) Typing the link: Same as for 1) except that one more
>> 	   link failed (see above)
>> 	3) In both cases 1) and 2), the link would work if it contained
>>         a suffix.
>> Questions and conclusions:
>> 	- I tested both Cyrillic and Latin letters with diacritics, to
>> 	  prove that it is the combination of base letter plus diacritic
>> 	  which is the problem (but only when the letter is composed).
>> 	- is the article [1] simply outdated? Have new things happened?
>> 	  Or perhaps it doesn't speak about how to link to *filenames*?
>> 	- Unfortunately, I must say, I did not find any way to *type*
>> 	  **COOL** URLs that contain letters that can be decomposed.
>> 	- why does Wikipedia work, then? I suppose that a *composed*
>> 	  'å', such as when you type an 'å' in the URL bar,
>> 	  is *ambiguous*: it can be interpreted two ways, perhaps.
>> 	  But Wikipedia has probably hardcoded 'å' (%C3%A5) to mean
>> 	  'å'. OTOH, I don't understand why browsers consider '%C3%A5'
>> 	  ambiguous when the page is UTF-8 encoded ... ???
>> If someone on other platforms, such as Linux or Windows, has got other
>> or confirming results, please share.  (For instance, it is difficult to
>> test file names with composed letters on Mac.)
>> Most Latin written non-ASCII letters of European languages *can* be
>> decomposed. Thus it is an important problem. In most/many languages,
>> diacritics are not something 'extra' that one can choose to not
>> include. E.g. the letter 'å' is perceived not as an 'a' with a
>> 'ring-above'. In my own case, it took me a long time to understand the
>> fact that the problems I constantly see only relate to 3 of the many
>> non-ASCII letters I use.
>> [1] http://www.w3.org/International/questions/qa-html-css-normalization

Received on Tuesday, 8 February 2011 20:08:58 UTC
