Re: Transcoding Tamil in the presence of markup

At 23:16 03/12/07 +0900, Jungshik Shin wrote:

>On Sun, 7 Dec 2003, Peter Jacobi wrote:

> > So, I'm still wondering whether Unicode and HTML4 will consider
> >   <span style='color:#00f'>&#x0BB2;</span>&#x0BBE;
> > valid and it is the task of the user agent to make the best out of it.
>
>   I think this is valid.

I agree. It is the task of the user agent to make the best out of it,
and different user agents may currently do different things with it.
Because this is a matter of rendering and styling, it would make
sense to clarify it in the CSS spec (either 2.1 or 3).


>A more interesting case has to do with
>W3 CHARMOD in which NFC is required/recommended (it's not yet complete
>and W3C I18N-WG has been discussing it).  Consider the following case.
>
>   &#x0BB2;<span class="left_part">&#x0BC7;</span>
>  <span class="right_part">&#x0BBE;</span>
>
>Because <U+0BC7, U+0BBE> is equivalent to U+0BCB, we couldn't use
>the above if NFC is required even though in legacy TSCII encoding,
>it's possible.

Yes, splitting a canonically equivalent sequence with markup like
this is a bad idea. But there is Web technology that can style
parts of a glyph (see below).
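
To make the equivalence concrete, here is a minimal Python sketch.
It is only an illustration: it takes the three characters from the
example above with the markup stripped, and uses the standard
unicodedata module. The helper name codepoints is my own.

```python
import unicodedata

# The example with the markup stripped:
# TAMIL LETTER LA, VOWEL SIGN EE, VOWEL SIGN AA.
split_form = "\u0BB2\u0BC7\u0BBE"

# NFC composes <U+0BC7, U+0BBE> into the single VOWEL SIGN OO (U+0BCB).
nfc_form = unicodedata.normalize("NFC", split_form)


def codepoints(text):
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(codepoints(split_form))  # ['U+0BB2', 'U+0BC7', 'U+0BBE']
print(codepoints(nfc_form))    # ['U+0BB2', 'U+0BCB']
```

After normalization there is no longer a character boundary at
which the two spans could be opened and closed.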

The basic problem is that one has to draw the line somewhere.
Sometimes, one would for example like to color the dot on an 'i'.
In Unicode, this may theoretically be possible (with a dotless 'i',
U+0131, followed by a combining dot above, U+0307, or some such),
but it wouldn't be a real 'i' anymore.
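
A small Python check of that last point, as a sketch only (the
variable names are mine, and it uses the standard unicodedata
module): the dotless 'i' plus combining dot is not canonically
equivalent to a plain 'i', because 'i' has no decomposition.

```python
import unicodedata

real_i = "i"             # U+0069 LATIN SMALL LETTER I
fake_i = "\u0131\u0307"  # LATIN SMALL LETTER DOTLESS I + COMBINING DOT ABOVE

# NFC leaves the two-character sequence alone; it never becomes U+0069.
same_after_nfc = unicodedata.normalize("NFC", fake_i) == real_i

# An empty decomposition mapping means 'i' cannot be taken apart.
has_decomposition = unicodedata.decomposition(real_i) != ""

print(same_after_nfc, has_decomposition)
```

So searching, sorting and spell checking would all treat the
colorable sequence as something other than 'i'.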

And there is of course a slippery slope. For example, consider
the crossbar on a 't'. You can't color that in any encoding.
But a font designer may want to do that for some instructional
material, or may want to color all serifs in a font, and so on.

Similar examples exist in almost any other script. For most
intents and purposes, most people are okay with what they
can and can't do, but occasionally, we come close to the
dividing line, and some of us are quite surprised. But somehow,
we have to agree on what's a character and what's only a glyph,
and we have to agree which combinations are canonically equivalent.


>The same is true of Korean syllables(see below) as
>Philippe pointed out.
>
>   &#x1100;<span class="vowel">&#x1161;</span>&#x11a8;

Yes. Korean is particularly difficult because it is the most
logical, well-designed script in the world. It has more
clearly identifiable hierarchical levels than any other
script, so it is very difficult to agree on the level at
which characters should be encoded.
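
For the syllable quoted above, the following Python sketch (an
illustration only, using the standard unicodedata module, with
variable names of my own) shows two of these levels: NFC turns
the three conjoining jamo into one precomposed syllable, and NFD
takes it apart again.

```python
import unicodedata

# The jamo from the example with the markup stripped:
# CHOSEONG KIYEOK, JUNGSEONG A, JONGSEONG KIYEOK.
jamo = "\u1100\u1161\u11A8"

# NFC composes them into a single precomposed Hangul syllable.
syllable = unicodedata.normalize("NFC", jamo)
print(len(syllable), f"U+{ord(syllable):04X}")  # 1 U+AC01

# NFD recovers the original jamo sequence.
round_trip = unicodedata.normalize("NFD", syllable)
print(round_trip == jamo)  # True
```

Once the text is in NFC, the vowel is no longer a separate
character that a span could enclose.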

As an example, the vowel pairs a/ya, o/yo, u/yu, and so on
are distinguished by changing from one small stroke to two
small strokes. A Web page for children or foreigners may
want to color these strokes separately. With the current
encoding(s) in Unicode this is not possible, but I'm sure
somebody has designed an encoding where this would be possible.


So while this does not solve Peter's immediate problem,
starting to change Unicode to color characters, glyphs,
or character parts would be an extremely slippery slope.

Working on better font technology seems to be much better
suited to the job. And such technology is actually already
around. It's part of SVG. Chris Lilley had a very nice
example once, but it got lost in a hard disk crash.
Chris, any chance of getting a new example?

SVG (http://www.w3.org/Graphics/SVG/ http://www.w3.org/TR/SVG11/)
is the XML-based vector graphics format for the Web.
Here is more or less how it works (as far as I understand it):

In SVG Fonts (http://www.w3.org/TR/SVG11/fonts.html),
SVG itself is used to describe glyph shapes. This means
that all kinds of graphic features are available, including
of course coloring, but also animation and so on.
But of course you don't want colors to be fixed.
So glyphs in a font, or parts of glyphs, also allow
the 'class' attribute. So you can mark glyphs or glyph
components with things such as class='accent' or
class='crossbar', and so on. The rendering of pieces
in this class can then be controlled from a CSS
stylesheet.  (I hope I got the details right.)
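
A rough sketch of what such a font might look like, wrapped in
Python so that it can at least be checked for well-formedness.
Everything specific here is my own assumption: the font name
DemoFont, the class names 'stem' and 'crossbar', and the path
data are made up, and I have not verified how any particular
SVG viewer applies the stylesheet to glyph parts.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical SVG font with one glyph for 't', drawn as two
# rectangles that carry different class attributes.
svg_doc = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="120">
  <defs>
    <style type="text/css">
      .stem { fill: #000; }
      .crossbar { fill: #00f; }
    </style>
    <font id="DemoFont" horiz-adv-x="600">
      <font-face font-family="DemoFont" units-per-em="1000"/>
      <missing-glyph horiz-adv-x="600"/>
      <glyph unicode="t" horiz-adv-x="600">
        <path class="stem" d="M250 0 H350 V800 H250 Z"/>
        <path class="crossbar" d="M100 500 H500 V580 H100 Z"/>
      </glyph>
    </font>
  </defs>
  <text x="20" y="100" font-family="DemoFont" font-size="100">t</text>
</svg>"""

# Parsing confirms the document is well-formed XML.
root = ET.fromstring(svg_doc)

# Collect the class of each component of the 't' glyph.
glyph = root.find(".//svg:glyph", {"svg": SVG_NS})
classes = [part.get("class") for part in glyph]
print(classes)
```

The text content stays a plain 't'. Only the stylesheet rule for
.crossbar would need to change to recolor that part of the glyph.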


Regards,    Martin.

Received on Sunday, 7 December 2003 12:54:06 UTC