Re: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization from Andrew Cunningham on 2009-02-04 (www-style@w3.org from February 2009)

From: Andrew Cunningham <andrewc@vicnet.net.au>
Date: Thu, 05 Feb 2009 10:02:37 +1100
To: Henri Sivonen <hsivonen@iki.fi>
CC: Jonathan Kew <jonathan@jfkew.plus.com>, public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
Message-ID: <498A1E8D.6000702@vicnet.net.au>
Henri Sivonen wrote:
> On Feb 4, 2009, at 00:44, Andrew Cunningham wrote:
>
>> Henri Sivonen wrote:
>>
>>> The central reason for using NFC for interchange (i.e. what goes 
>>> over HTTP) is that legacy software (including the text rendering 
>>> code in legacy browsers) works better with NFC.
>>>
>> I'd be interested in knowing what you'd define as legacy browsers, 
>> and which operating systems you have in mind when you mention it.
>
> For the purpose of the quoted sentence, I'd define versions of the 
> major browsers other than the most recent release version of each one 
> as "legacy". (For the purpose of specifying new behaviors in specs, 
> even the now-current versions should be considered legacy from the 
> point of view of the potential future spec feature.)
>

Although you identified text rendering code as one of the problem areas of 
"legacy" browsers, given your definition of legacy that would be a furphy, 
and irrelevant to the discussion. From the point of view of changes to a spec 
or future spec, even the current release and possibly the next scheduled 
release or two would also be considered legacy. I.e. in the sense you 
use legacy, ALL browsers are legacy, possibly including versions under 
development.

But you do make a point about stability and backwards compatibility. If 
changes were to occur regarding normalisation, obviously they wouldn't 
affect existing versions of CSS 1.0 or 2.1. But I'm not sure that would 
necessarily be a valid argument for CSS 3.0.
>> if a browser can't render combining diacritics, then it will not be 
>> able to render NFC data when the NFC data uses combining diacritics.
>
> Right. However, that does not make those browsers useless, because 
> there are lot of things that can be communicated with precomposed 
> characters.
>

Depends on the language. If heavy use of combining diacritics is 
required by a particular language, and if those diacritics play a 
critical role in the legibility and readability of the text ... then it's 
a significant issue for those languages.
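
To make the earlier point concrete (NFC data can still contain combining diacritics), here is a minimal sketch using Python's standard unicodedata module. The sample characters and the helper names are illustrative assumptions, not taken from any particular project.

```python
import unicodedata


def code_points(text):
    """Return the code points of text as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


def has_combining_marks(text):
    """True if any character has a non-zero canonical combining class."""
    return any(unicodedata.combining(ch) for ch in text)


# e + combining acute: a precomposed character exists, so NFC composes it.
e_acute = unicodedata.normalize("NFC", "e\u0301")

# open o (U+0254) + combining grave: no precomposed character exists,
# so the combining mark survives NFC normalisation.
open_o_grave = unicodedata.normalize("NFC", "\u0254\u0300")

print(code_points(e_acute))
print(code_points(open_o_grave))
print(has_combining_marks(open_o_grave))
```

A browser that cannot position combining marks would therefore still fail on the second string, whichever normalisation form the page was served in.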

But then again, for the languages I tend to work with, there are other 
issues such as text layout, line breaking and ordered lists, where browsers 
have serious defects.

Rule of thumb is that not even the most recent browsers are fully ready for a 
truly multilingual environment. There are quite a few writing scripts 
where much more work needs to be done, both in terms of text layout in 
web browsers and various aspects of CSS. But this is starting to get very 
off topic, so probably best to leave those comments as is.
>> So for a "legacy" browser when a document contains combining 
>> diacritics it doesn't matter if the text is NFC or NFD, it will not 
>> correctly render it.
>>
>> For legacy browsers, Unicode will always be a barrier regardless of 
>> normalisation form.
>
> Only for cases where the mapping from characters to graphemes is not 
> one-to-one. In a lot of cases that have utility, the mapping is 
> one-to-one.
>

Most writing scripts aren't simple one-to-one mappings.

Even the Latin and Cyrillic scripts need to use GSUB features to be able 
to support various languages. Sticking with European languages for a second:

The new crop of core fonts from Windows (those in Vista or the EU Font 
update) use GSUB features for Romanian.

If you have content in Russian, Macedonian and Serbian, and part of that 
text needs to display in an italic typeface, and you want to use the 
same font for all Cyrillic text, 
then you need a browser that can utilise OpenType language systems and 
GSUB lookups to provide the correct glyphs for Serbian and Macedonian vs 
Russian.

Then if you add all the Latin script languages unsupported by OpenType 
language systems that need culturally appropriate glyph alternatives for 
the same character, then you have a bigger problem.

Current browsers do not allow a user any control over rendering when 
alternatives are available in a font, nor does CSS. Nor is there a way 
of distinguishing between versions of a font, when the differences

This is really getting off topic for the thread. Sorry. But it's a 
bugbear of mine. I tend to mainly work on web content in lesser used 
languages and come across the limitations of web browsers and CSS on a 
daily basis. For some projects I may be working with a number of 
languages. The fonts I use are OpenType fonts; they contain all the 
OpenType rules needed to display each of those languages correctly. But 
when I need to display that text in a web browser, I'm forced to take 
the original OpenType font, make a copy of that font, edit the 
OpenType table in the copy to change the default values of some of 
those features, save the modified font, and in the CSS files specify 
either the original font or the modified font. In theory all the text 
should display fine with the original font. But CSS gives me no way to 
control the alternative behaviours, and web browsers have no UI to expose 
or control such behaviours either. But then they also sometimes struggle 
with obligatory OpenType features. Sometimes this may be a problem with 
the underlying OS font rendering / text layout system, and sometimes it's 
a problem with the application.

I suppose that is enough of that rant. Just tired from trying to tweak 
some fonts for a Sudanese project.

>>>
>> There are discrepancies between canonical ordering in normalisation 
>> for some Myanmar characters compared to the data storage order 
>> recommended in UTN11. Current Unicode 5.1 fonts for the Myanmar block 
>> are based on UTN11. I believe Martin H is working on a draft of 
>> version 3 of UTN11 (esp. since UTN11 was Burmese centric and also 
>> needs to address a range of issues with ethnic minority languages, 
>> Pali and Sanskrit). Very few if any web sites actually normalise 
>> content, Wikipedia and the MediaWiki platform being among those that do. 
>> From memory the problem came to light when trying to work out 
>> rendering problems in the Burmese version of Wikipedia. Haven't 
>> followed the discussion in any detail and have only had second hand 
>> reports on the meetings in Yangon last year.
>
> So this is not a fundamental objection to normalization but a bug in 
> coordinating between UTNs and the bug is expected to be transient?
we hope ;) although it would be interesting to see if the ethnic languages 
will throw a spanner in the works.
>
>>>> In a number of African languages it is useful to work with NFD data,
>>>
>>> Even if it is useful to perform in-RAM editing operations on NFD in a 
>>> text editor, it doesn't follow that NFD should be used for interchange.
>>>
>> except where it is useful to process NFD data in a client side script.
>
> That's indeed an interesting case. Considering the subject matter of 
> this thread, it isn't an argument *for* browsers normalizing without 
> the author asking them to but *against* browsers normalizing in 
> CSS/HTML/JS parsers without the author asking them to.
>
yep, personally I think default wholesale normalisation would be 
interesting, defaulting to NFC. But I'd want a mechanism in CSS and in 
the browsers for the web developer to specify alternative behaviour when 
required. I think normalisation is required. But I'd also like to have 
the flexibility of using the normalisation form appropriate to the web 
development project at hand.
>>>> esp if you also want to comply with certain AAA checkpoints in WCAG 
>>>> 2.0.
>>>
>>> Hold on. What WCAG 2.0 checkpoints require content *not* to be in 
>>> NFC? If that's the case, there is a pretty serious defect *somewhere*.
>>>
>> As far as I know WCAG 2.0 is normalisation form agnostic, it doesn't 
>> require any particular normalisation form. But there is some guidance 
>> about pronunciation, and for tonal African languages this 
>> means dealing with tone marking (which in day to day usage isn't 
>> included), partly for language learners and students, and in some cases 
>> to aid in disambiguating ideas or words. It could be handled at the 
>> server end or at the client end. To handle it at the client end, it is 
>> easier to use NFD data and, for languages like Igbo, etc., run a simple 
>> regex to toggle between tonal versions and standard versions.
>
> I see. This doesn't mean that serving content in NFD is *required* 
> only that one implementation strategy for a case that is unusual on a 
> global scale becomes *easier* if the DOM data is in NFD.
>
yes, nor is it an argument against normalisation, rather a 
recommendation for some control of normalisation forms by the web developer.
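
As a rough sketch of the tone-marking case quoted above: with NFD data each tone mark is a separate code point, so a simple character-class regex can remove tone marking while leaving other diacritics (such as the dot below) alone. It is written in Python rather than browser script for brevity. The set of tone marks, the function name and the sample strings are assumptions for illustration only.

```python
import re
import unicodedata

# Assumption: tone is written with combining grave (U+0300),
# acute (U+0301) and macron (U+0304).
TONE_MARKS = re.compile("[\u0300\u0301\u0304]")


def strip_tone_marks(text):
    """Return text without tone marks, keeping other diacritics."""
    # Decompose first so every tone mark is its own code point.
    decomposed = unicodedata.normalize("NFD", text)
    without_tones = TONE_MARKS.sub("", decomposed)
    # Recompose whatever remains, e.g. o + dot below.
    return unicodedata.normalize("NFC", without_tones)


# a-acute k w a-grave: both tone marks are removed.
print(strip_tone_marks("\u00e1kw\u00e0"))
# o with dot below + combining acute: only the acute is removed.
print(strip_tone_marks("\u1ecd\u0301"))
```

Doing the same on NFC data would need a lookup table of every precomposed letter and tone combination, which is why NFD is the easier form for this kind of script.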
>>> I can see how the editing buffer in RAM would need to be in a form 
>>> other than NFC and perhaps in UTF-16 or UTF-32, but why is it 
>>> desirable to write something other than NFC-normalized UTF-8 to 
>>> persistent storage or to a network socket?
>>>
>> So you are suggesting all files should use NFC when transmitted to 
>> the browser, and at the client end convert to NFD when needed to be 
>> processed in that form?
>
> Yes, that's what I was suggesting. Your client side scripting point 
> is, however, somewhat persuasive against that suggestion. However, I 
> think your client side scripting point supports the suggestion I had 
> for how browsers should deal (i.e. not normalizing on the 
> HTML/XML/DOM/JS/CSS layers and only performing (or delegating to 
> system) normalization-related operations when deciding which glyphs to 
> *render* and when cooking user text input for placing it into the DOM).
>
possibly, but then I think rendering issues would be affected by the fonts.

If you look at combining diacritics with Cyrillic and Latin script text, 
Microsoft for instance didn't update its core fonts for this type of use 
until version 5.0 of the core fonts.

And currently we tend to avoid using those core fonts for NFC data that 
may have combining diacritics in the text, since in some of the fonts 
(those that have appropriate default glyph shapes) the position of the 
diacritics in precomposed characters is noticeably different from the 
positioning of the equivalent combining diacritics. For some of the core 
Vista fonts I get better typographic display using combining diacritics. 
But then if you are accessing the site using an older version of Windows, 
then you are screwed.

Although we don't have this problem with the third party fonts we use.

> I see. Still, I think it's more reasonable that teams whose multipart 
> graphemes don't have an obvious order for the subparts of the grapheme 
> bear the cost of dealing with this complex feature of their writing 
> system and for the sake of performance every browser, XML parser, etc. 
> around the world on all kinds of devices doesn't burn cycles 
> (time/electricity/CO₂) just *in case* there happens to be a string 
> compare where combining characters might have been inconsistently 
> ordered.
>
That assumes that the development team are even aware of the issue. I 
wonder how many non-Vietnamese web developers know or understand the 
impact different input systems will have on a Vietnamese project they 
may be working on. We tend to be fussy about the developers we work 
with. I expect them to know the input systems in use. I expect them to 
know the limitations and bugs in mlang.dll and the impact that will have 
on default font settings in IE. I expect them to know the differences 
between version 3.06 and version 5.0 of Windows core fonts and the 
impact that would have on font declarations in stylesheets. I would 
expect them to know what HTML elements to avoid in a multilingual web 
development project. I would expect them to know what default web 
browser rendering needs to be styled per language. I would expect them 
to know which CSS attributes and properties to avoid for various 
languages and writing systems. I would expect them ... and the list goes 
on ... should give you an idea .... but then the average web developer 
here has limited knowledge of these issues. It's a stretch to get them to 
know the basics, let alone the complexities of web typography and 
rendering in a truly multilingual environment.
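
To give an idea of the input system problem, here is a small sketch in Python (standard unicodedata module). The same Vietnamese letter can arrive as several different code point sequences, which compare unequal code point for code point but equal once normalised. Which sequence any given input system actually emits is not claimed here; the variants and the helper name are illustrative assumptions.

```python
import unicodedata

# Four canonically equivalent spellings of the letter e with
# circumflex and dot below.
precomposed = "\u1ec7"               # single precomposed code point
circumflex_first = "\u00ea\u0323"    # e-circumflex + combining dot below
fully_decomposed = "e\u0323\u0302"   # e + dot below + circumflex
marks_swapped = "e\u0302\u0323"      # e + circumflex + dot below

variants = [circumflex_first, fully_decomposed, marks_swapped]


def same_text(a, b, form="NFC"):
    """Compare two strings after normalising both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Plain code point comparison treats every variant as a different string.
print(any(precomposed == v for v in variants))

# Normalised comparison treats them all as the same text.
print(all(same_text(precomposed, v) for v in variants))
```

A selector, ID or search string typed with one input system will silently fail to match content typed with another unless somebody in the chain normalises.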
> Before anyone accuses me of “Western bias”, I'd like to point out that 
> there is precedent for preferring (in my opinion quite reasonably) a 
> faster kind of Unicode string equality relation over one that could be 
> claimed to make more sense to users of Western languages: XML 
> deliberately uses code point for code point string comparison for 
> start/end tag names instead of Unicode-wise correct case-insensitive 
> comparison for performance reasons (and to avoid having to pick a 
> collation locale). (I'm assuming here that we can agree that bicameral 
> scripts are big in the “West” and case-insensitive equality makes some 
> intuitive sense for the users of bicameral scripts.)
>
no arguments there regarding bicameral scripts ... ;)

-- 
Andrew Cunningham
Senior Manager, Research and Development
Vicnet
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000

Ph: +61-3-8664-7430
Fax: +61-3-9639-2175

Email: andrewc@vicnet.net.au
Alt email: lang.support@gmail.com

http://home.vicnet.net.au/~andrewc/
http://www.openroad.net.au
http://www.vicnet.net.au
http://www.slv.vic.gov.au
Received on Wednesday, 4 February 2009 23:04:12 UTC