- From: Andrew Cunningham <andrewc@vicnet.net.au>
- Date: Thu, 05 Feb 2009 10:02:37 +1100
- To: Henri Sivonen <hsivonen@iki.fi>
- CC: Jonathan Kew <jonathan@jfkew.plus.com>, public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
- Message-ID: <498A1E8D.6000702@vicnet.net.au>
Henri Sivonen wrote:
> On Feb 4, 2009, at 00:44, Andrew Cunningham wrote:
>
>> Henri Sivonen wrote:
>>
>>> The central reason for using NFC for interchange (i.e. what goes
>>> over HTTP) is that legacy software (including the text rendering
>>> code in legacy browsers) works better with NFC.
>>>
>> I'd be interested in knowing what you'd define as legacy browsers,
>> and which operating systems you have in mind when you mention it.
>
> For the purpose of the quoted sentence, I'd define versions of the
> major browsers other than the most recent release version of each one
> as "legacy". (For the purpose of specifying new behaviors in specs,
> even the now-current versions should be considered legacy from the
> point of view of the potential future spec feature.)
>

Although you identified one of the problem areas of "legacy" browsers as text rendering code, which in this instance is a furphy given your definition of legacy, and irrelevant to the discussion. From the point of view of changes to a spec, or a future spec, even the current release and possibly the next scheduled release or two would also be considered legacy. I.e. in the sense you use "legacy", ALL browsers are legacy, possibly including versions under development.

But you do make a point about stability and backwards compatibility. If changes were to occur regarding normalisation, obviously they wouldn't affect existing versions of CSS 1.0 or 2.1. But I'm not sure that would necessarily be a valid argument for CSS 3.0.

>> if a browser can't render combining diacritics, then it will not be
>> able to render NFC data when the NFC data uses combining diacritics.
>
> Right. However, that does not make those browsers useless, because
> there are a lot of things that can be communicated with precomposed
> characters.
>

Depends on the language. If heavy use of combining diacritics is required by a particular language, and if those diacritics play a critical role in the legibility and readability of the text, then it's a significant issue for those languages.

But then again, for the languages I tend to work with there are other issues, such as text layout, line breaking and ordered lists, where browsers have serious defects. My rule of thumb is that not even the most recent browsers are fully ready for a truly multilingual environment. There are quite a few writing scripts where much more work needs to be done, both in terms of text layout in web browsers and various aspects of CSS. But this is starting to get very off topic, so probably best to leave those comments as is.
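To put that more concretely: for many African orthographies NFC doesn't remove the combining diacritics at all, because no precomposed character exists for the combination. A rough sketch in JavaScript (normalize() is a later API than anything shipping in the browsers under discussion, and the Yoruba/Igbo vowel is an illustrative choice of mine, not something from the thread):

    // "e with dot below and acute" has no single precomposed code point,
    // so even NFC text keeps a combining mark the renderer must position.
    var nfd = "e\u0323\u0301";        // e, combining dot below, combining acute
    var nfc = nfd.normalize("NFC");   // composes to U+1EB9 + U+0301

    console.log(nfd === nfc);                    // false - different code point sequences
    console.log(nfd.normalize("NFC") === nfc);   // true  - canonically equivalent
    // A browser that cannot position combining diacritics fails on this
    // grapheme whether the page is served as NFD or NFC.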
>> So for a "legacy" browser when a document contains combining
>> diacritics it doesn't matter if the text is NFC or NFD, it will not
>> correctly render it.
>>
>> For legacy browsers, Unicode will always be a barrier regardless of
>> normalisation form.
>
> Only for cases where the mapping from characters to graphemes is not
> one-to-one. In a lot of cases that have utility, the mapping is
> one-to-one.
>

Most writing scripts aren't simple one-to-one mappings. Even the Latin and Cyrillic scripts need to use GSUB features to be able to support various languages. Sticking with European languages for a second: the new crop of core fonts from Windows (those in Vista or the EU font update) use GSUB features for Romanian.

If you have content in Russian, Macedonian and Serbian, part of that text needs to display in an italic typeface, and you want to use the same font for all Cyrillic text, then you need a browser that can utilise OpenType language systems and GSUB lookups to provide the correct glyphs for Serbian and Macedonian versus Russian. Then if you add all the Latin script languages unsupported by OpenType language systems that need culturally appropriate glyph alternatives for the same character, you have a bigger problem. Current browsers don't allow a user any control over rendering when alternatives are available in a font, nor does CSS. Nor is there a way of distinguishing between versions of a font when the differences matter.

This is really getting off topic for the thread. Sorry. But it's a bugbear of mine. I tend to mainly work on web content in lesser-used languages and come across the limitations of web browsers and CSS on a daily basis. For some projects I may be working with a number of languages. The fonts I use are OpenType fonts; they contain all the OpenType rules needed to display each of those languages correctly. But when I need to display that text in a web browser, I'm forced to take the original OpenType font, make a copy of it, edit the OpenType tables to change the default values of some of those features, save the modified font, and in the CSS files specify either the original font or the modified font. In theory all the text should display fine with the original font. But CSS gives me no way to control the alternative behaviours, and web browsers have no UI to expose or control such behaviours either. But then they also sometimes struggle with obligatory OpenType features. Sometimes this may be a problem with the underlying OS font rendering / text layout system, and sometimes it's a problem with the application.

I suppose that is enough of that rant. Just tired from trying to tweak some fonts for a Sudanese project.
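Before going on, a quick aside on what canonical ordering actually buys you, since it comes up in the next exchange: the same grapheme typed with its combining marks in two different orders is canonically equivalent, and normalisation (NFC or NFD) reorders the marks into one canonical order so that a plain code point comparison succeeds. Another rough JavaScript sketch (again relying on the later normalize() API, with Latin marks standing in for the Myanmar case):

    var typedA = "e\u0301\u0323";   // e + acute + dot below, as one input method might emit
    var typedB = "e\u0323\u0301";   // e + dot below + acute, as another might

    console.log(typedA === typedB);                                   // false
    console.log(typedA.normalize("NFD") === typedB.normalize("NFD")); // true
    console.log(typedA.normalize("NFC") === typedB.normalize("NFC")); // true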
>> There are discrepancies between canonical ordering in normalisation
>> for some Myanmar characters compared to the data storage order
>> recommended in UTN11. Current Unicode 5.1 fonts for the Myanmar block
>> are based on UTN11. I believe Martin H is working on a draft of
>> version 3 of UTN11 (esp. since UTN11 was Burmese-centric and also
>> needs to address a range of issues with ethnic minority languages,
>> Pali and Sanskrit). Very few if any web sites actually normalise
>> content, Wikipedia and the MediaWiki platform being one of those.
>> From memory the problem came to light when trying to work out
>> rendering problems in the Burmese version of Wikipedia. Haven't
>> followed the discussion in any detail and have only had second hand
>> reports on the meetings in Yangon last year.
>
> So this is not a fundamental objection to normalization but a bug in
> coordinating between UTNs and the bug is expected to be transient?

we hope ;) although it would be interesting to see whether the ethnic languages will throw a spanner in the works.

>>>> In a number of African languages it is useful to work with NFD data,
>>>
>>> Even if it is useful to perform in-RAM editing operations on NFD in a
>>> text editor, it doesn't follow that NFD should be used for interchange.
>>>
>> except where it is useful to process NFD data in a client side script.
>
> That's indeed an interesting case. Considering the subject matter of
> this thread, it isn't an argument *for* browsers normalizing without
> the author asking them to but *against* browsers normalizing in
> CSS/HTML/JS parsers without the author asking them to.
>

Yep, personally I think default wholesale normalisation would be interesting, defaulting to NFC. But I'd want a mechanism in CSS and in the browsers for the web developer to specify alternative behaviour when required. I think normalisation is required, but I'd also like to have the flexibility of using the normalisation form appropriate to the web development project at hand.

>>>> esp if you also want to comply with certain AAA checkpoints in WCAG
>>>> 2.0.
>>>
>>> Hold on. What WCAG 2.0 checkpoints require content *not* to be in
>>> NFC? If that's the case, there's a pretty serious defect *somewhere*.
>>>
>> As far as I know WCAG 2.0 is normalisation form agnostic, it doesn't
>> require any particular normalisation form. But there is stuff about
>> guidance for pronunciation, and for tonal African languages this
>> means dealing with tone marking (where in day to day usage it isn't
>> included) - partly for language learners and students, and in some
>> cases to aid in disambiguating ideas or words. It could be handled at
>> the server end or at the client end. To handle it at the client end,
>> it's easier to use NFD data, and for languages like Igbo, etc. run a
>> simple regex to toggle between tonal versions and standard versions.
>
> I see. This doesn't mean that serving content in NFD is *required*,
> only that one implementation strategy for a case that is unusual on a
> global scale becomes *easier* if the DOM data is in NFD.
>

Yes, nor is it an argument against normalisation, rather a recommendation for some control of normalisation forms by the web developer.
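To make the Igbo example concrete, this is the kind of client-side toggle I mean, assuming the content is served fully tone-marked and in NFD. It is only a sketch; the mark set and the sample word are illustrative rather than taken from a real project:

    // With NFD text the tone marks are separate combining characters, so the
    // everyday (unmarked) view is a simple regular-expression substitution;
    // toggling back just restores the original marked string.
    var TONE_MARKS = /[\u0300\u0301\u0304]/g;   // combining grave, acute, macron

    function stripTones(nfdText) {
      return nfdText.replace(TONE_MARKS, "");
    }

    var marked = "akwa\u0300";        // illustrative word form with a low tone mark
    var plain  = stripTones(marked);  // "akwa"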
>>> I can see how the editing buffer in RAM would need to be in a form
>>> other than NFC and perhaps in UTF-16 or UTF-32, but why is it
>>> desirable to write something other than NFC-normalized UTF-8 to
>>> persistent storage or to a network socket?
>>>
>> So you are suggesting all files should use NFC when transmitted to
>> the browser, and at the client end convert to NFD when needed to be
>> processed in that form?
>
> Yes, that's what I was suggesting. Your client side scripting point
> is, however, somewhat persuasive against that suggestion. However, I
> think your client side scripting point supports the suggestion I had
> for how browsers should deal (i.e. not normalizing on the
> HTML/XML/DOM/JS/CSS layers and only performing (or delegating to
> system) normalization-related operations when deciding which glyphs to
> *render* and when cooking user text input for placing it into the DOM).
>

Possibly, but then I think rendering issues would be affected by the fonts. If you look at combining diacritics with Cyrillic and Latin script text, Microsoft for instance didn't update its core fonts for this type of use until version 5.0 of the core fonts. And currently we tend to avoid using those core fonts for NFC data that may contain combining diacritics, since in some of the fonts (that have appropriate default glyph shapes) the position of the diacritics in precomposed characters is noticeably different from the positioning of the equivalent combining diacritics. For some of the core Vista fonts I get better typographic display using combining diacritics. But then if you are accessing the site using an older version of Windows, you are screwed. Although we don't have this problem with the third party fonts we use.

> I see. Still, I think it's more reasonable that teams whose multipart
> graphemes don't have an obvious order for the subparts of the grapheme
> bear the cost of dealing with this complex feature of their writing
> system and for the sake of performance every browser, XML parser, etc.
> around the world on all kinds of devices doesn't burn cycles
> (time/electricity/CO₂) just *in case* there happens to be a string
> compare where combining characters might have been inconsistently
> ordered.
>

That assumes that the development team are even aware of the issue. I wonder how many non-Vietnamese web developers know or understand the impact different input systems will have on a Vietnamese project they may be working on.

We tend to be fussy about the developers we work with. I expect them to know the input systems in use. I expect them to know the limitations and bugs in mlang.dll and the impact that will have on default font settings in IE. I expect them to know the differences between version 3.06 and version 5.0 of the Windows core fonts and the impact that would have on font declarations in stylesheets. I would expect them to know which HTML elements to avoid in a multilingual web development project. I would expect them to know what default web browser rendering needs to be styled per language. I would expect them to know which CSS attributes and properties to avoid for various languages and writing systems. I would expect them ... and the list goes on ... which should give you an idea.

But then the average web developer here has limited knowledge of these issues. It's a stretch to get them to know the basics, let alone the complexities of web typography and rendering in a truly multilingual environment.

> Before anyone accuses me of “Western bias”, I'd like to point out that
> there is precedent for preferring (in my opinion quite reasonably) a
> faster kind of Unicode string equality relation over one that could be
> claimed to make more sense to users of Western languages: XML
> deliberately uses code point for code point string comparison for
> start/end tag names instead of Unicode-wise correct case-insensitive
> comparison for performance reasons (and to avoid having to pick a
> collation locale). (I'm assuming here that we can agree that bicameral
> scripts are big in the “West” and case-insensitive equality makes some
> intuitive sense for the users of bicameral scripts.)
>

no arguments there regarding bicameral scripts ... ;)

-- 
Andrew Cunningham
Senior Manager, Research and Development
Vicnet
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000

Ph: +61-3-8664-7430
Fax: +61-3-9639-2175

Email: andrewc@vicnet.net.au
Alt email: lang.support@gmail.com

http://home.vicnet.net.au/~andrewc/
http://www.openroad.net.au
http://www.vicnet.net.au
http://www.slv.vic.gov.au
Received on Wednesday, 4 February 2009 23:04:12 UTC