Re: tag for notion and compound indication

On Fri, 5 Aug 2005, Devin Bayer wrote:

> On Aug 5, 2005, at 15:26, Jukka K. Korpela wrote:
>
>> Thus, Bundes&#x2060;regierung would differ from Bundesregierung only by
>> disallowing a line break - at the most preferable point of division!
>
> Words are not supposed to be divided in the middle.

They are, when needed typographically. Web browsers traditionally do not
break words, but this is just a symptom of their being primitive in handling
text. Consult suitable general and language-specific manuals on typography
and orthography if you disagree.

>> Indicating the internal structure of a word, such as its being a compound 
>> word, would most logically belong to markup, not character level.
>
> I disagree.  The internal structure of a word is the job for character data. 
> Markup is much better at the external structure, at the word level and above.

Characters, as defined by Unicode and other character standards, 
correspond to units of written text, such as letters, digits, punctuation, 
syllabic characters, ideographic characters, and mathematical symbols.
Characters are not meant to act as invisible descriptors of structure.
There are deviations from this, mainly for historical reasons, but you 
would find it impossible to persuade the relevant standardization bodies 
on a fundamental change towards using characters to indicate the internal 
structure of a word.

>> As a rule, empty elements indicate design flaws in markup language.
>> The logical markup would be something like
>> 
>> <compound><part>Bund<case>es</case></part><part>regierung</part></compound>
>
> This solution is way over-engineered.

I was simply drawing a logical conclusion. If you wish to have the internal 
structure of a word indicated, that would be about the _simplest_
engineering that is consistent with modern design principles.
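
To illustrate how little processing such markup would need, here is a
minimal Python sketch. It assumes the fictitious element names used above
(compound, part, case), which belong to no real markup language; the helper
names parts_of and break_offsets are likewise made up for this example.

```python
import xml.etree.ElementTree as ET

# The fictitious markup from the message; not part of any real vocabulary.
MARKUP = ("<compound><part>Bund<case>es</case></part>"
          "<part>regierung</part></compound>")

def parts_of(compound_xml):
    """Return the text of each <part>, including nested <case> text."""
    root = ET.fromstring(compound_xml)
    return ["".join(part.itertext()) for part in root.findall("part")]

def break_offsets(parts):
    """Return character offsets of the boundaries between parts."""
    offsets, pos = [], 0
    for text in parts[:-1]:
        pos += len(text)
        offsets.append(pos)
    return offsets

parts = parts_of(MARKUP)
word = "".join(parts)
print(parts, word, break_offsets(parts))
```

A formatter could treat the offsets as the preferred points of division,
which is exactly the information the markup was meant to carry.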

> I can't imagine you want to work with 
> that kind of mess.

Does anyone work with the XML mess "by hand"? Besides, my fictitious 
markup is very simple and logical; consider MathML, which is much more 
messy, yet taken seriously by many.

> Do you support using markup on the first letter of a 
> sentence instead of having capital characters?

No, but I could well imagine using markup for a sentence, with a style 
sheet used to capitalize the first character of the first word of a 
sentence, if that's the style.

The real question is what kind of structures should be indicated in 
markup. (Just having some markup element does not mean one would need to 
use it for every occasion where it might be used. So this is partly a 
matter of markup language design, partly a matter of authors' choices.)
I would need a good reason to indicate the "fine structure" of text in the 
first place. But once you have decided on such a matter, the rest is 
rather obvious, except perhaps for the choice of minor details like 
element and attribute names. (You might prefer write-only markup with 
cryptic single-letter names, or long and descriptive names, or something 
between - HTML is currently an awkward mixture in this respect.)

> I do agree I had the semantics a little off.

No, you had it completely wrong. The Unicode standard says that the word 
joiner is for line break control only.
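
As a small check of what the character is, the following Python sketch
looks up U+2060 in the Unicode character database shipped with Python. The
variable names are only illustrative, and the sketch shows the character's
name and general category, not its line breaking behavior.

```python
import unicodedata

WORD_JOINER = "\u2060"

# Name and general category as recorded in the Unicode character database.
wj_name = unicodedata.name(WORD_JOINER)
wj_category = unicodedata.category(WORD_JOINER)  # Cf = format character
print(wj_name, wj_category)

# The joiner is invisible; removing it leaves the plain word.
joined = "Bundes" + WORD_JOINER + "regierung"
plain = joined.replace(WORD_JOINER, "")
print(len(joined), len(plain), plain)
```

The character is a format character with no meaning of its own, so it says
nothing about where the parts of a compound begin or end.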

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/

Received on Saturday, 6 August 2005 05:07:32 UTC