- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Sat, 6 Aug 2005 01:26:09 +0300 (EEST)
- To: XHTML-Liste <www-html@w3.org>
On Fri, 5 Aug 2005, Devin Bayer wrote:

> On Aug 5, 2005, at 1:15, Simon Siemens wrote:
>
>> However what has not been addressed by HTML up to
>> now are compound words. I suppose, this is because English does not have
>> them to a relevant extent. But German e.g. has many compound words (like
>> "Bundesregierung": "Bund" + "Regierung"). Thus an ability to indicate
>> such compositions could really enhance search engine results from a
>> German point of view.
>
> Why not use the word joiner (2060) character?

Because it means roughly the opposite of the desired meaning.

> Excerpted from Unicode:
>
> Word Joiner behaves like [U+00A0 NO-BREAK SPACE] in that it indicates the
> absence of word boundaries; however, the word joiner has no width. For
> example, the word joiner can be inserted after the fourth character in the
> text "base+delta" to indicate that there should be no line break between
> the "e" and the "+". The word joiner should be ignored in contexts other
> than word or line breaking.

The quotation is from
http://www.unicode.org/versions/Unicode4.0.0/ch15.pdf#page=6
and we can see that there's a sentence omitted in the midst of the
quotation, namely:
"The function of the character is to indicate that line breaks are not
allowed between the adjoining characters, except next to hard line breaks."

This is confirmed later in the text in a manner that looks rather definite:
"The word joiner should be ignored in contexts other than word or line
breaking."

Thus, Bundes&#x2060;regierung would differ from Bundesregierung only by
disallowing a line break - at the most preferable point of division!

Indicating the internal structure of a word, such as its being a compound
word, would most logically belong to markup, not character level. The
practical reason for not wanting to deal with it at character level is that
even if the Unicode Consortium were willing to add such a character (which
it probably wouldn't be), the process would take many years.
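To make the point about U+2060 concrete, here is a minimal sketch in Python. It assumes nothing beyond the standard `unicodedata` module; the helper name `strip_word_joiners` is made up for this illustration and is not part of any specification. It only shows that the word joiner is an invisible format character that changes the string without carrying any "compound boundary" meaning, so software that wants the plain word would have to strip it again.

```python
import unicodedata

WORD_JOINER = "\u2060"  # U+2060 WORD JOINER

def strip_word_joiners(text: str) -> str:
    """Hypothetical helper: remove every U+2060 from the text."""
    return text.replace(WORD_JOINER, "")

plain = "Bundesregierung"
joined = "Bundes" + WORD_JOINER + "regierung"

# The Unicode character database describes U+2060 as a format character.
name = unicodedata.name(WORD_JOINER)
category = unicodedata.category(WORD_JOINER)  # "Cf" = Other, format

# The two strings differ by one invisible code point, so a naive
# string comparison or search would treat them as different words.
same_as_plain = (joined == plain)
restored = strip_word_joiners(joined)

print(name, category, same_as_plain, restored == plain)
```

Nothing in the character data says "the word is composed of two parts here"; per the text quoted above, the only defined effect is on word and line breaking.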
As a rule, empty elements indicate design flaws in a markup language. The
logical markup would be something like

  <compound><part>Bund<case>es</case></part><part>regierung</part></compound>

but I'm afraid it has no chance of being accepted into a specification,
still less of being implemented in a useful way in browsers, search engines,
etc.

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
Received on Friday, 5 August 2005 22:26:16 UTC