Re: Cool IRIs & diacritics, for a change

From: Richard Ishida <ishida@w3.org>
Date: Tue, 08 Feb 2011 12:09:01 +0000
Message-ID: <4D51325D.50703@w3.org>
To: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
CC: www-international@w3.org
It's not clear to me how many resources you have on the server.  For 
example does http://example.com/ link to the same resource as 
http://example.com/å.html ?

If so, we need to also be sure that the problem doesn't occur because of 
content negotiation problems.  That may mean testing differently named 
resources in different directories.

RI



Richard Ishida
Internationalization Activity Lead
W3C (World Wide Web Consortium)

http://www.w3.org/International/
http://rishida.net/

On 05/02/2011 00:11, Leif Halvard Silli wrote:
> The article 'Normalization in HTML and CSS' states: [1]
>
> ]]  To improve interoperability, the W3C recommends
>      the use of NFC normalized text on the Web.        [[
>
> However, with the exception of browsing Wikipedia, in my experience as
> a Mac OS X user, this advice is only accurate for letters where the
> Unicode Normalization is irrelevant.
>
> Here are some results from some tests involving the letters å (Latin a
> with ring), й (Cyrillic short i), æ (Latin ae ligature) and я (Cyrillic
> ya). The 'å' and the 'й' have two Unicode representations - composed
> and decomposed, and they are thus the only letters for which the advice
> to use composed normalization is relevant. The file names of the target
> files followed the normal Mac OS X convention (which is decomposed). I
> tested with both composed and decomposed values inside the @href
> attribute.
>
> 	A) Results, when *clicking* the links:
>
> 		Unsuccessful link resolving:
> Composed link variant: http://example.com/ (%C3%A5)
> Composed link variant: http://example.com/ (%D0%B9)
>
> 		Successful link resolving:
> Decomposed link variant: http://example.com/ (a%CC%8A)
> Decomposed link variant: http://example.com/ (%D0%B8%CC%86)
> For the rest, composed vs. decomposed didn't matter:
> 	http://example.com/ (%D0%B8%CC%86)
> 	http://example.com/ (a%CC%8A)
> 	http://example.com/й.html
> 	http://example.com/å.html
> 	http://example.com/å (%C3%A5%C3%A5)
> 	http://example.com/й (%D0%B9%D0%B9)
> 	http://example.com/åå.html
> 	http://example.com/йй.html
> 	http://example.com/
> 	http://example.com/я
>
> 	B) Results, when *typing* the IRI in the URL bar:
>
> 		Unsuccessful link resolving:
> Same as when clicking, except that two more links did not work:
> Decomposed link variant: http://example.com/åå (%C3%A5%C3%A5)
> Decomposed link variant: http://example.com/йй
> (%D0%B8%CC%86%D0%B8%CC%86)
> 		Successful link resolving:
> For the rest, composed vs. decomposed @href value didn't matter.
>
> Summary of results:
> 	1) Clicking: when the @href value was a composed IRI and the IRI
> 	   was cool and consisted of a single letter, then if you
> 	   **clicked** the link, it failed. Otherwise it worked.
> 	2) Typing the link: Same as for 1) except that two more
> 	   links failed (see above)
> 	3) In both cases 1) and 2), the link would work if it contained
> 	   a suffix.
>
> Questions and conclusions:
> 	- I tested both Cyrillic and Latin letters with diacritics, to
> 	  prove that it is the combination of base letter plus diacritic
> 	  which is the problem (but only when the letter is composed).
> 	- is the article [1] simply outdated? Have new things happened?
> 	  or perhaps it doesn't speak about how to link to *filenames*?
> 	- Unfortunately, I must say, I did not find any way to *type*
> 	  **COOL** URLs that contain letters that can be decomposed.
> 	- why does Wikipedia work, then? I suppose that a *composed*
> 	  'å', such as when you type an 'å' in the URL bar,
> 	  is *ambiguous*: it can be interpreted two ways, perhaps.
> 	  But Wikipedia has probably hardcoded 'å' (%C3%A5) to mean
> 	  'å'. OTOH, I don't understand why browsers consider '%C3%A5'
> 	  ambiguous when the page is UTF-8 encoded ... ???
>
> If someone on other platforms, such as Linux or Windows, has got other
> or confirming results, please share.  (For instance, it is difficult to
> test file names with composed letters on Mac.)
>
> Most non-ASCII letters used in the Latin-script European languages
> *can* be decomposed. Thus it is an important problem. In most/many
> languages, diacritics are not something 'extra' that one can choose to
> not include. E.g. the letter 'å' is perceived not as an 'a' with a
> 'ring-above'. In my own case, it took me a long time to understand the
> fact that the problems I constantly see only relate to 3 of the many
> non-ASCII letters I use.
>
> [1] http://www.w3.org/International/questions/qa-html-css-normalization
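
For anyone wanting to reproduce the escapes quoted above on another platform, here is a minimal sketch in Python. It uses only the standard library (unicodedata.normalize and urllib.parse.quote). The helper name encoded_forms is made up for illustration and is not from the original tests. It shows the percent-encoded UTF-8 of the composed (NFC) and decomposed (NFD) forms of the four test letters. It says nothing about how a given browser or server actually resolves either form.

```python
import unicodedata
from urllib.parse import quote


def encoded_forms(text):
    """Return (NFC, NFD) percent-encoded UTF-8 forms of text.

    Hypothetical helper for illustration only.
    """
    nfc = unicodedata.normalize("NFC", text)  # composed form
    nfd = unicodedata.normalize("NFD", text)  # decomposed form
    return quote(nfc), quote(nfd)


# The four letters from the tests: a-ring, Cyrillic short i, ae, Cyrillic ya.
for letter in ["\u00e5", "\u0439", "\u00e6", "\u044f"]:
    composed, decomposed = encoded_forms(letter)
    note = "differs" if composed != decomposed else "same in both forms"
    print(f"U+{ord(letter):04X}  NFC={composed}  NFD={decomposed}  ({note})")
```

Only 'å' and 'й' should come out with two different escapes. 'æ' and 'я' have no canonical decomposition, so normalization leaves them unchanged.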
Received on Tuesday, 8 February 2011 12:09:30 GMT