
Cool IRIs & diacritics, for a change

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Sat, 5 Feb 2011 01:11:04 +0100
To: www-international@w3.org
Message-ID: <20110205011104587134.b94ef84b@xn--mlform-iua.no>
The article 'Normalization in HTML and CSS' states: [1]

]]  To improve interoperability, the W3C recommends
    the use of NFC normalized text on the Web.        [[

However, in my experience as a Mac OS X user, and with the exception 
of browsing Wikipedia, this advice only holds for letters where 
Unicode normalization is irrelevant.

Here are some results from some tests involving the letters å (Latin a 
with ring), й (Cyrillic short i), æ (Latin ae ligature) and я (Cyrillic 
ya). The 'å' and the 'й' have two Unicode representations - composed 
and decomposed, and they are thus the only letters for which the advice 
to use composed normalization is relevant. The file names of the target 
files followed the normal Mac OS X convention (which is decomposed). I 
tested with both composed and decomposed values inside the @href 
attribute.
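
For reference, the percent-encoded forms that appear in the results can be 
reproduced with a short Python sketch. This is only an illustration: the 
helper name percent_forms is made up here, and it assumes that UTF-8 
percent-encoding is applied to the NFC (composed) and NFD (decomposed) forms 
of each letter.

```python
import unicodedata
from urllib.parse import quote

def percent_forms(letter):
    """Return (composed, decomposed) UTF-8 percent-encodings of a letter.

    Hypothetical helper for illustration; not part of the original tests.
    """
    nfc = unicodedata.normalize("NFC", letter)  # composed form
    nfd = unicodedata.normalize("NFD", letter)  # decomposed form
    return quote(nfc), quote(nfd)

# The four test letters, written as escapes so that the normalization
# form of this source file itself does not matter.
letters = {
    "a with ring": "\u00e5",
    "short i": "\u0439",
    "ae ligature": "\u00e6",
    "ya": "\u044f",
}

for name, letter in letters.items():
    composed, decomposed = percent_forms(letter)
    differs = composed != decomposed
    print(f"{name}: composed={composed} decomposed={decomposed} differs={differs}")
```

Only the first two letters yield two different encodings, which is why they 
are the only ones affected by the choice of normalization form.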

	A) Results, when *clicking* the links:

		Unsuccessful link resolving:
			Composed link variant: http://example.com/ (%C3%A5)
			Composed link variant: http://example.com/ (%D0%B9)

		Successful link resolving:
			Decomposed link variant: http://example.com/ (a%CC%8A)
			Decomposed link variant: http://example.com/ (%D0%B8%CC%86)

		For the rest, composed vs. decomposed didn't matter:
			http://example.com/ (%D0%B8%CC%86)
			http://example.com/ (a%CC%8A)

			http://example.com/å (%C3%A5%C3%A5)
			http://example.com/й (%D0%B9%D0%B9)




	B) Results, when *typing* the IRI in the URL bar:

		Unsuccessful link resolving:
			Same as when clicking, except that two more links did not work:
			Decomposed link variant: http://example.com/åå (%C3%A5%C3%A5)
			Decomposed link variant: http://example.com/йй 

		Successful link resolving:
			For the rest, composed vs. decomposed @href value didn't matter.

Summary of results:
	1) Clicking: when the @href value was a composed IRI and the IRI
	   was cool and consisted of a single letter, then if you 
	   **clicked** the link, it failed. Otherwise it worked.
	2) Typing the link: Same as for 1) except that two more
	   links failed (see above).
	3) In both cases 1) and 2), the link would work if it contained 
	   a suffix.

Questions and conclusions:
	- I tested both Cyrillic and Latin letters with diacritics, to 
	  prove that it is the combination of base letter plus diacritic
	  which is the problem (but only when the letter is composed).
	- is the article [1] simply outdated? Have new things happened?
	  or perhaps it doesn't speak about how to link to *filenames*?
	- Unfortunately, I must say, I did not find any way to *type*
	  **COOL** URLs that contain letters that can be decomposed.
	- why does Wikipedia work, then? I suppose that a *composed*
	  'å', such as the one you get when you type an 'å' in the URL
	  bar, is *ambiguous*: perhaps it can be interpreted two ways.
	  But Wikipedia has probably hardcoded 'å' (%C3%A5) to mean
	  'å'. OTOH, I don't understand why browsers consider '%C3%A5'
	  ambiguous when the page is UTF-8 encoded ... ???
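
To illustrate the last point: decoded as UTF-8, '%C3%A5' gives exactly one 
code point, so the escape itself is not ambiguous. A mismatch only arises 
when it is compared code point by code point with a decomposed name. The 
Python sketch below shows this, and shows how the two spellings compare 
equal once both are normalized. The function same_letter is hypothetical, 
and whether any browser, server or Wikipedia does something like this is 
only an assumption.

```python
import unicodedata
from urllib.parse import unquote

def same_letter(path_a, path_b):
    """Compare two percent-encoded strings after NFC normalization.

    Hypothetical helper; an assumption about how a lenient resolver
    might compare names, not a description of any real server.
    """
    a = unicodedata.normalize("NFC", unquote(path_a))
    b = unicodedata.normalize("NFC", unquote(path_b))
    return a == b

composed = unquote("%C3%A5")     # decodes to a single code point
decomposed = unquote("a%CC%8A")  # decodes to base letter + combining ring

print(len(composed), len(decomposed))    # number of code points in each
print(composed == decomposed)            # raw comparison
print(same_letter("%C3%A5", "a%CC%8A"))  # comparison after normalization
```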

If someone on other platforms, such as Linux or Windows, has got other 
or confirming results, please share.  (For instance, it is difficult to 
test file names with composed letters on Mac.)

Most non-ASCII letters used by European languages written in the Latin 
script *can* be decomposed. Thus it is an important problem. In 
most/many languages, diacritics are not something 'extra' that one can 
choose to not include. E.g. the letter 'å' is not perceived as an 'a' 
with a 'ring-above'. In my own case, it took me a long time to 
understand the fact that the problems I constantly see only relate to 
3 of the many non-ASCII letters I use.

[1] http://www.w3.org/International/questions/qa-html-css-normalization

leif halvard silli
Received on Saturday, 5 February 2011 00:11:40 UTC
