- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Sat, 5 Feb 2011 01:11:04 +0100
- To: www-international@w3.org
The article 'Normalization in HTML and CSS' states: [1]

]] To improve interoperability, the W3C recommends the use of NFC normalized text on the Web. [[

However, with the exception of browsing Wikipedia, in my experience as a Mac OS X user, this advice is only accurate for letters where Unicode normalization is irrelevant.

Here are some results from tests involving the letters å (Latin a with ring), й (Cyrillic short i), æ (Latin ae ligature) and я (Cyrillic ya). The 'å' and the 'й' have two Unicode representations, composed and decomposed, and they are thus the only letters of the four for which the advice to use composed normalization is relevant.

The file names of the target files followed the normal Mac OS X convention (which is decomposed). I tested with both composed and decomposed values inside the @href attribute.

A) Results when *clicking* the links:

Unsuccessful link resolving:

  Composed link variant: http://example.com/å (%C3%A5)
  Composed link variant: http://example.com/й (%D0%B9)

Successful link resolving:

  Decomposed link variant: http://example.com/å (a%CC%8A)
  Decomposed link variant: http://example.com/й (%D0%B8%CC%86)

  For the rest, composed vs. decomposed didn't matter:

  http://example.com/й (%D0%B8%CC%86)
  http://example.com/å (a%CC%8A)
  http://example.com/й.html
  http://example.com/å.html
  http://example.com/åå (%C3%A5%C3%A5)
  http://example.com/йй (%D0%B9%D0%B9)
  http://example.com/åå.html
  http://example.com/йй.html
  http://example.com/æ
  http://example.com/я

B) Results when *typing* the IRI in the URL bar:

Unsuccessful link resolving:

  Same as when clicking, except that two more links did not work:

  Decomposed link variant: http://example.com/åå (%C3%A5%C3%A5)
  Decomposed link variant: http://example.com/йй (%D0%B8%CC%86%D0%B8%CC%86)

Successful link resolving:

  For the rest, composed vs. decomposed @href value didn't matter.
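
For anyone who wants to check the percent-encodings quoted in the results, here is a minimal Python sketch. It assumes the IRIs are encoded as UTF-8 before percent-escaping; the helper name `percent_encoded_forms` is made up for illustration and is not taken from any of the tools tested above. It only shows what the composed (NFC) and decomposed (NFD) forms of each letter look like on the wire, not how any browser or server resolves them.

```python
import unicodedata
from urllib.parse import quote


def percent_encoded_forms(text):
    """Return (NFC, NFD) percent-encoded UTF-8 forms of text.

    Hypothetical helper for illustration only.
    """
    nfc = unicodedata.normalize("NFC", text)  # composed form
    nfd = unicodedata.normalize("NFD", text)  # decomposed form
    return quote(nfc), quote(nfd)


# The four test letters: å, й, æ, я (written as escapes so that the
# source file itself cannot be silently re-normalized by an editor).
for letter in ("\u00e5", "\u0439", "\u00e6", "\u044f"):
    composed, decomposed = percent_encoded_forms(letter)
    print(f"U+{ord(letter):04X}  NFC={composed}  NFD={decomposed}")
```

For 'å' and 'й' the two forms differ (%C3%A5 vs. a%CC%8A, and %D0%B9 vs. %D0%B8%CC%86), while for 'æ' and 'я' they are identical, which matches the observation that normalization only matters for the first two letters.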

Summary of results:

1) Clicking: when the @href value was a composed IRI, and the IRI was cool and consisted of a single letter, the link failed when you **clicked** it. Otherwise it worked.
2) Typing the link: same as for 1), except that two more links failed (see above).
3) In both case 1) and 2), the link would work if it contained a suffix.

Questions and conclusions:

- I tested both Cyrillic and Latin letters with diacritics, to prove that it is the combination of base letter plus diacritic which is the problem (but only when the letter is composed).
- Is the article [1] simply outdated? Have new things happened? Or perhaps it doesn't speak about how to link to *filenames*?
- Unfortunately, I must say, I did not find any way to *type* **COOL** URLs that contain letters that can be decomposed.
- Why does Wikipedia work, then? I suppose that a *composed* 'å', such as the one you get when you type an 'å' in the URL bar, is *ambiguous*: it can be interpreted two ways, perhaps. But Wikipedia has probably hardcoded 'å' (%C3%A5) to mean 'å'. OTOH, I don't understand why browsers consider '%C3%A5' ambiguous when the page is UTF-8 encoded ... ???

If someone on other platforms, such as Linux or Windows, has got other or confirming results, please share. (For instance, it is difficult to test file names with composed letters on Mac.)

Most non-ASCII letters used in Latin-script European languages *can* be decomposed. Thus it is an important problem. In most/many languages, diacritics are not something 'extra' that one can choose to not include. E.g. the letter 'å' is not perceived as an 'a' with a 'ring-above'. In my own case, it took me a long time to understand that the problems I constantly see only relate to 3 of the many non-ASCII letters I use.

[1] http://www.w3.org/International/questions/qa-html-css-normalization

-- 
leif halvard silli
Received on Saturday, 5 February 2011 00:11:40 UTC