Re: tag for notion and compound indication

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Sat, 6 Aug 2005 01:26:09 +0300 (EEST)
To: XHTML-Liste <www-html@w3.org>
Message-ID: <Pine.GSO.4.63.0508060114590.18530@korppi.cs.tut.fi>

On Fri, 5 Aug 2005, Devin Bayer wrote:

> On Aug 5, 2005, at 1:15, Simon Siemens wrote:
>
>> However what has not been addressed by HTML up to
>> now are compound words. I suppose this is because English does not have
>> them to a relevant extent. But German, e.g., has many compound words (like
>> "Bundesregierung": "Bund" + "Regierung"). Thus an ability to indicate
>> such compositions could really enhance search engine results from a
>> German point of view.
>
> Why not use the word joiner (U+2060) character?

Because it means roughly the opposite of the desired meaning.

> Excerpted from Unicode:
>
> Word Joiner behaves like &nbsp; in that it indicates the absence of word 
> boundaries; however, the word joiner has no width. For example, the word 
> joiner can be inserted after the fourth character in the text "base+delta" to
> indicate that there should be no line break between the "e" and the "+". The
> word joiner should be ignored in contexts other than word or line breaking.

The quotation is from
http://www.unicode.org/versions/Unicode4.0.0/ch15.pdf#page=6
and we can see that there's a sentence omitted in the midst of the 
quotation, namely:
"The function of the character is to indicate that line breaks are not 
allowed between the adjoining characters, except next to hard line 
breaks."
This is confirmed later in the text in a manner that looks rather 
definite:
"The word joiner should be ignored in contexts other than word or line 
breaking."

Thus, Bundes&#x2060;regierung would differ from Bundesregierung only by
disallowing a line break - at the most preferable point of division!
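
As a rough illustration (a sketch in Python, not anything taken from the
Unicode text): the code point number 2060 is hexadecimal, so a character
reference needs the form &#x2060;, whereas the decimal form &#2060; denotes a
different character. The variable names below are made up for the example.

```python
import html
import unicodedata

# U+2060 WORD JOINER; the number 2060 is hexadecimal.
WJ = "\u2060"

# A decimal reference and a hexadecimal reference with the same digits
# resolve to different characters.
decimal_ref = html.unescape("&#2060;")   # decimal 2060 = U+080C
hex_ref = html.unescape("&#x2060;")      # U+2060

print(f"decimal: U+{ord(decimal_ref):04X}")
print(f"hex:     U+{ord(hex_ref):04X} {unicodedata.name(hex_ref)}")

# The word joiner is a format character (general category Cf): it adds a
# code point to the string but no visible content.
joined = "Bundes" + WJ + "regierung"
visible = joined.replace(WJ, "")
print(len(joined), len(visible), unicodedata.category(WJ))
```

The string with the joiner is one code point longer than the plain word,
and that extra code point only affects word and line breaking. It says
nothing about the word being a compound.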

Indicating the internal structure of a word, such as its being a compound
word, belongs most logically at the markup level, not the character level.
The practical reason for not wanting to deal with it at the character level
is that even if the Unicode Consortium were willing to add such a
character (which it probably wouldn't be), the process would take many years.

As a rule, empty elements indicate design flaws in a markup language.
The logical markup would be something like

<compound><part>Bund<case>es</case></part><part>regierung</part></compound>

but I'm afraid it has no chance of being accepted into a specification,
still less of being implemented in a useful way in browsers, search engines,
etc.
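
To show what a search engine could in principle do with such markup, here
is a minimal Python sketch. It is purely hypothetical: the elements
compound, part and case exist in no specification, and the function name
compound_parts is made up for the example.

```python
import xml.etree.ElementTree as ET

# The hypothetical markup from the message; none of these elements exist
# in any HTML or XHTML specification.
markup = (
    "<compound><part>Bund<case>es</case></part>"
    "<part>regierung</part></compound>"
)

def compound_parts(xml_text):
    """Return (stem, linking_suffix) pairs for each <part> of a compound."""
    root = ET.fromstring(xml_text)
    parts = []
    for part in root.findall("part"):
        stem = part.text or ""
        # A <case> child holds the inflectional or linking suffix, if any.
        linking = "".join(c.text or "" for c in part.findall("case"))
        parts.append((stem, linking))
    return parts

parts = compound_parts(markup)

# The written word is simply all text content in document order.
word = "".join(ET.fromstring(markup).itertext())

print(word)
print(parts)
```

An indexer could then index the stems "Bund" and "regierung" in addition to
the full word, which is the kind of benefit the original proposal was after.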

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
Received on Friday, 5 August 2005 22:26:16 GMT
