Re: [css21] 5.12.2 The :first-letter pseudo-element (the Dutch "ij") from Jukka K. Korpela on 2004-09-11 (www-style@w3.org from September 2004)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Sun, 12 Sep 2004 01:25:27 +0300 (EEST)
To: W3C CSS List <www-style@w3.org>
Message-ID: <Pine.GSO.4.58.0409120100510.28965@korppi.cs.tut.fi>

On Sat, 11 Sep 2004, Anne van Kesteren wrote:

>  From the specification[1]:

(Technically, CSS 2.1 is still a draft, not a specification.
But there's probably no change here from CSS 2.0 anyway.)

> # Some languages may have specific rules about how to treat certain
> # letter combinations. In Dutch, for example, if the letter combination
> # "ij" appears at the beginning of a word, both letters should be
> # considered within the :first-letter pseudo-element.

Given the current state of the art, such notes are pointless (or worse,
since they may mislead people).

Browsers mostly don't even recognize language markup to know the language
of a piece of text; still less do they attempt any meaningful
language-dependent processing, even in simple details.

> Robbert Broersma just told me that there are two Unicode characters
> defined for the "Dutch ij", a uppercase and lowercase variant.
>
> They are: \u0132 and \u0133. See also Bugzilla Bug 92176[2].

They are compatibility characters, with IJ and ij as the compatibility
decompositions. In effect, they were included in Unicode because they
belonged to some existing character code standards, and Unicode was meant
to be a universal code, so that you can map data from any encoding into
Unicode, and vice versa, without losing a distinction made in the other
encoding. Note the difference between e.g. IJ and the letter AE (Æ), which
is historically a ligature of A and E but classified as an independent
letter, not as a compatibility character.
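
The difference can be checked against the Unicode character database. The
following is a minimal sketch in Python using the standard unicodedata
module; the helper name compat_info is made up for this illustration.

```python
import unicodedata

def compat_info(ch):
    """Return (decomposition field, NFKC form) for one character."""
    # decomposition() returns an empty string when the character has
    # no decomposition mapping in the Unicode character database.
    return unicodedata.decomposition(ch), unicodedata.normalize("NFKC", ch)

# U+0132 and U+0133 are the IJ ligatures; U+00C6 is the letter AE.
for ch in ("\u0132", "\u0133", "\u00C6"):
    decomp, nfkc = compat_info(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"decomposition={decomp or '(none)'} NFKC={nfkc!r}")
```

The two IJ characters carry a <compat> decomposition and turn into the
plain letter pairs under NFKC normalization, whereas U+00C6 has no
decomposition and is left unchanged.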

This means that U+0132 and U+0133 are effectively just IJ and ij as
ligatures, as typographic variants of certain character pairs.
Whether you use them or IJ and ij (with or without some mechanism, such as
a style sheet, that renders them as ligatures) is a practical choice, and
in Web authoring, there are good reasons to favor the letter pairs IJ and
ij, which are universally supported.

Even if U+0132 and U+0133 were preferred over IJ and ij - they aren't - it
would still be the case that Dutch texts contain IJ and ij quite often.
(As an educated guess, I would say _far_ more often than U+0132 and
U+0133.)

> I was wondering if this note is still needed, since "certain letter
> combinations" apparently have Unicode equivalents. (At least, the "Dutch
> ij" has.)

Thus, I think that reasoning does not apply, but the note should still be
removed, for other reasons. The specification should simply define
what :first-letter really means - hopefully in a realistic way.
The current formulation is vague, and partly deviates from browser
practice. (Currently the specification does not actually define at all
what :first-letter corresponds to in document content. We are expected to
guess this from hints and indirect references, such as the
pseudo-element's name.)

The note should probably be turned into a realistic warning: the
:first-letter pseudo-element is defined in a simple way (or, if the
current wording is kept: is not strictly defined), and authors should
note that it does not capture the orthographic and stylistic conventions
of several languages, where two consecutive characters (e.g., IJ in Dutch)
might be treated as a single letter.
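
To make the gap concrete, here is a rough sketch in Python of what a
language-aware selection of the first letter would involve. It is purely
illustrative: the DIGRAPHS table and the first_letter function are
hypothetical, they do not describe what any browser does or what the
CSS drafts define, and punctuation handling is ignored entirely.

```python
# Hypothetical table: language code -> letter pairs that the language
# treats as a single initial letter. Only the Dutch "ij" is listed here.
DIGRAPHS = {"nl": ("ij",)}

def first_letter(text, lang=None):
    """Return the prefix of text that counts as its first letter."""
    if not text:
        return ""
    for digraph in DIGRAPHS.get(lang, ()):
        # Case-insensitive comparison, so that "IJ" and "ij" both match.
        if text[:len(digraph)].lower() == digraph:
            return text[:len(digraph)]
    # The simple definition: just the first character.
    return text[:1]

print(first_letter("IJsselmeer", "nl"))  # language known: two characters
print(first_letter("IJsselmeer"))        # language unknown: one character
```

Even this toy version depends on knowing the language of the text, which
is exactly the information that is usually missing or ignored.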

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/

Received on Saturday, 11 September 2004 22:26:01 UTC