Re: first-word pseudo-element from Julien Quint on 2001-05-17 (www-style@w3.org from May 2001)

From: Julien Quint <julien.quint@imag.fr>
Date: Thu, 17 May 2001 20:05:13 +0200
To: www-style@w3.org
Cc: Julien.Quint@imag.fr
Message-Id: <01051720051309.01363@houdini.imag.fr>

On Thursday 17 May 2001 19:25, Bjoern Hoehrmann wrote:
> * Daniel Glazman wrote:
> >My answer, posted also on a regular basis, is the following one : what
> >is a word ?
>
> That's script-dependent, as is the word-spacing property and general
> white-space handling in e.g. XHTML. We could have a simple definition
> for ::first-word like: "everything before the first application of
> word-spacing", so yes, it's a problem to be script-dependent but this
> problem already existed in CSS Level 1 and does exist in XHTML and so
> on, so I see no problem to add such a pseudo-element. User-agents could
> be free to implement this feature for a distinct range of script
> families (like "Latin").

The definition of a word is not script-dependent, nor even language-dependent,
but person-dependent. See [1], p. 393 for instance -- three native
speakers from Taiwan and three native speakers from mainland China do not
agree on the segmentation into words of the same text written in Chinese;
that is, they have different perceptions of what a word is even though they
speak the same language and write it with the same script.

Even if you decide to restrict this to scripts like Latin, you would have to
take into account compound words, abbreviations, ambiguous punctuation... And
there are an awful lot of languages that use the Latin alphabet, each with
different rules and exceptions.
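
To make the point concrete, here is a minimal Python sketch of a naive rule in the spirit of "everything before the first word space". The function name first_word and the sample strings are hypothetical, chosen only for illustration; this is not a proposed definition and not how any user agent behaves.

```python
def first_word(text: str) -> str:
    """Naive rule: return everything before the first run of whitespace."""
    parts = text.split(maxsplit=1)  # leading whitespace is ignored
    return parts[0] if parts else ""


# Hypothetical samples; each one raises a question the naive rule cannot settle.
samples = [
    "well-known fact",     # compound: one word or two?
    "l'école est fermée",  # French elision: article and noun end up in one token
    "e.g. this case",      # abbreviation: does the trailing period belong to the word?
    "Hello, world",        # punctuation stays glued to the token
]

for s in samples:
    print(repr(s), "->", repr(first_word(s)))
```

Each result is defensible under the whitespace rule, yet a reader could reasonably expect "well", "l'", "e.g" or "Hello" instead, and the expected answer would change again from one language to the next.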

Julien

[1] Richard Sproat, Chilin Shih, William Gale and Nancy Chang, ``A Stochastic 
Finite-State Word-Segmentation Algorithm for Chinese,'' in Computational 
Linguistics vol. 22, no. 3, pp. 377-404, September 1996.

Received on Thursday, 17 May 2001 14:05:31 UTC