Forwarded message 1
Andreas Prilop wrote on the Unicode mailing list[1]:
> Does the Unicode standard only deal with plain text or
> does it also deal with text in markup languages like SGML/HTML?
>
> I wonder whether Arabic letters should join when they are
> separated by markup. Here's an example:
>
> http://www.unics.uni-hannover.de/nhtcapri/temp/nastaliq.html
>
> Current programs display the letters separated by markup
> differently: Internet Explorer 6 and StarOffice 7 join the
> letters, but Mozilla 1.7 does not.
>
> Is it left to the rules of SGML/HTML to decide or
> has the Unicode standard any opinion about this?
In semantic markup languages like HTML, it's really the domain of the
formatting system used to process the markup, not the markup system
itself. [2] So, for web pages, this behavior would be governed by the
Unicode and CSS specs. I haven't read the Unicode book cover to cover,
but since there's an argument here, I'm guessing it's not covered by
Unicode quite yet. :)
Like many other people here, I think that the goal should be to make
the text as readable as possible, even if it means ignoring some of
the styling.
Therefore, these are the rules I suggest:
For characters within the same inline sequence:
1. Shaping and joining behavior MUST NOT be affected by element
boundaries.
2. Ligatures, including obligatory ligatures, MUST be broken if
the formatting rules introduce extra space between the affected
characters (e.g. by putting a border and margin around one of
the characters).
3. Optional ligatures SHOULD be broken if the formatting rules
cannot otherwise be accommodated.
4. Obligatory ligatures MUST NOT be broken if the formatting rules
introduce no extra space between the affected characters, even
if this means some of the characters are rendered in the wrong
font or as part of the wrong visual element.
5. Combining characters MUST be rendered as the combined grapheme
cluster if the system is capable of rendering the combination,
even if this means some of the characters are rendered in the
wrong font or as part of the wrong visual element. The combined
grapheme cluster SHOULD be rendered as part of the base
character's element, or, in the case of combining jamos, the
initial character's element.
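To make rule 5 a bit more concrete, here is a rough Python sketch of how a
formatter might attribute combined clusters to elements. It is only an
illustration under simplifying assumptions: the function name
clusters_by_element and the (element, text) run representation are made up
for this example, and "combining character" is approximated as "nonzero
canonical combining class". That approximation does not cover combining
jamos (their combining class is 0); a real implementation would need the
full grapheme cluster rules from UAX #29.

```python
import unicodedata


def clusters_by_element(runs):
    """Group characters from consecutive inline runs into clusters.

    `runs` is a list of (element_name, text) pairs in document order
    (a hypothetical representation, chosen for this sketch).
    A combining mark is attached to the preceding base character, and
    the whole cluster is attributed to the base character's element,
    even when the mark itself came from a later element.
    """
    clusters = []  # each entry is [element_name, cluster_text]
    for element, text in runs:
        for ch in text:
            # Simplification: treat any character with a nonzero
            # canonical combining class as a combining mark.
            if clusters and unicodedata.combining(ch) != 0:
                clusters[-1][1] += ch
            else:
                clusters.append([element, ch])
    return [(element, cluster) for element, cluster in clusters]


# "e" is in one element; the combining acute accent starts the next one.
runs = [("span1", "e"), ("span2", "\u0301x")]
result = clusters_by_element(runs)
```

Here the accent from span2 ends up in the same cluster as the "e" from
span1, and that cluster is reported as belonging to span1, which is what
the SHOULD in rule 5 asks for.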
I'm quite certain of #1, but as I don't have extensive background
in this stuff, I am not so certain of the others. Comments are
appreciated. I can ask the CSS Working Group to consider adding a
recommendation to the next revision of CSS2.1 if there seems to
be a consensus around a particular set of rules, and/or to refer
to relevant parts of the Unicode standard.
~fantasai
[1] http://www.unicode.org/mail-arch/unicode-ml/y2005-m06/0110.html
username: unicode-ml ; pass: unicode
[2] CSS determines whether an element visually behaves as a
    block or an inline or a table cell. Given the CSS rule
      * { display: inline; }
    both
      <div>ARA</div><div>BIC</div>
    and
      <span>ARA</span><span>BIC</span>
    would result in the exact same rendering.
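As a small illustration of the point in [2], the Python sketch below
collects the character sequence that a shaping engine would see once every
element is treated as inline. It assumes the `* { display: inline; }` rule
above is in effect and ignores whitespace handling and everything else CSS
does. The names InlineTextCollector and inline_sequence are invented for
this example.

```python
from html.parser import HTMLParser


class InlineTextCollector(HTMLParser):
    """Collect text content while ignoring all element boundaries."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Tags are skipped entirely; only character data is kept.
        self.parts.append(data)


def inline_sequence(markup):
    """Return the text of `markup` as one uninterrupted sequence."""
    collector = InlineTextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.parts)


div_text = inline_sequence("<div>ARA</div><div>BIC</div>")
span_text = inline_sequence("<span>ARA</span><span>BIC</span>")
```

Both calls yield the same sequence, so under rule 1 the shaping and joining
results would be identical for the two pieces of markup.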