- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Sun, 12 Sep 2004 01:25:27 +0300 (EEST)
- To: W3C CSS List <www-style@w3.org>
On Sat, 11 Sep 2004, Anne van Kesteren wrote:

> From the specification[1]:

(Technically, CSS 2.1 is still a draft, not a specification. But there's
probably no change here from CSS 2.0 anyway.)

> # Some languages may have specific rules about how to treat certain
> # letter combinations. In Dutch, for example, if the letter combination
> # "ij" appears at the beginning of a word, both letters should be
> # considered within the :first-letter pseudo-element.

Given the current state of the art, such notes are pointless (or worse,
since they may mislead people). Browsers mostly don't even recognize
language markup to know the language of a piece of text; still less do
they try to do meaningful language-dependent processing, even in simple
details.

> Robbert Broersma just told me that there are two Unicode characters
> defined for the "Dutch ij", a uppercase and lowercase variant.
>
> They are: \u0132 and \u0133. See also Bugzilla Bug 92176[2].

They are compatibility characters, with IJ and ij as the compatibility
decompositions. In effect, they were included in Unicode because they
belonged to some existing character code standards, and Unicode was
meant to be a universal code, so that you can map data from any encoding
into Unicode, and vice versa, without losing a distinction made in the
other encoding. Note the difference between e.g. IJ and the letter AE
(Æ), which is historically a ligature of A and E but classified as an
independent letter, not as a compatibility character.

This means that U+0132 and U+0133 are effectively just IJ and ij as
ligatures, as typographic variants of certain character pairs. Whether
you use them or IJ and ij (with or without some mechanism, such as a
style sheet, that renders them as ligatures) is a practical choice, and
in Web authoring, there are good reasons to favor the letter pairs IJ
and ij, which are universally supported.
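
The compatibility status of these characters can be checked against the
Unicode Character Database. The following is only a minimal sketch,
assuming Python's standard unicodedata module; the helper name describe
is made up for illustration and is not part of any specification.

```python
import unicodedata

def describe(ch):
    """Return (name, decomposition field, NFKD form) for one character.

    'describe' is a hypothetical helper used only for this illustration.
    """
    return (
        unicodedata.name(ch),
        unicodedata.decomposition(ch),      # '' when there is no decomposition
        unicodedata.normalize("NFKD", ch),  # applies compatibility mappings
    )

# U+0132 and U+0133 are the IJ characters; U+00C6 is the letter AE.
for ch in ("\u0132", "\u0133", "\u00c6"):
    name, decomp, nfkd = describe(ch)
    print(f"U+{ord(ch):04X} {name}: decomposition={decomp!r}, NFKD={nfkd!r}")
```

The two IJ characters report a <compat> decomposition into the plain
letter pairs, whereas U+00C6 reports no decomposition at all, which is
the distinction drawn above.
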
Even if U+0132 and U+0133 were preferred over IJ and ij - they aren't -
it would still be the case that Dutch texts contain IJ and ij quite
often. (As an educated guess, I would say _far_ more often than U+0132
and U+0133.)

> I was wondering if this note is still needed, since "certain letter
> combinations" apparently have Unicode equivalents. (At least, the "Dutch
> ij" has.)

Thus, I think the logic does not apply, but the statement should be
removed for other reasons. The specification should simply define what
:first-letter really means - hopefully in a realistic way. The current
formulation is vague, and partly deviates from browser practice.
(Currently the specification does not actually define at all what
:first-letter corresponds to in document content. We are expected to
guess this from hints and indirect references, such as the
pseudo-element's name.)

The note should probably be turned into a realistic warning: the
:first-letter pseudo-element is defined in a simple way (or, if the
current wording is kept: is not strictly defined), and authors should
note that it does not capture the orthographic and stylistic conventions
of several languages, where two consecutive characters (e.g., IJ in
Dutch) might be treated as a single letter.

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
Received on Saturday, 11 September 2004 22:26:01 UTC