[Bug 13502] Text run starting with composing character should be valid from bugzilla@jessica.w3.org on 2011-10-01 (public-html-bugzilla@w3.org from October 2011)

From: <bugzilla@jessica.w3.org>
Date: Sat, 01 Oct 2011 21:22:18 +0000
To: public-html-bugzilla@w3.org
Message-Id: <E1RA70k-0003ne-RN@jessica.w3.org>
http://www.w3.org/Bugs/Public/show_bug.cgi?id=13502

--- Comment #20 from Shai Berger <shai@platonix.com> 2011-10-01 21:22:17 UTC ---
(In reply to comment #19)
> (In reply to comment #18)
> 
> > Anyone who can object to "acce<b>&#x0301;</b>nt" should also object to the
> > equivalent with Shin Dot.
> > 
> > However, characters in the range 05B0--05BC (inclusive) are not diacritics in
> > any sense but visual; they are our vowels.
> 
> How is that an argument? There is no such thing as "right to have styled
> vowels" ... ;-)
> 

There is in Latin scripts... 

> Besides, even if disallowed in HTML, you can get all you need via CSS. [...]
> For Opera, I was unable to style the accent different from the base character -
> but at least I was able to hold its hand: http://tinyurl.com/6yk2m9b
> 

1) This example relies on moving the combining character to a css "content"
text run (which, then, starts with a combining character). It turns semantics
into presentation, and assumes that an invalid HTML text run will still be a
valid CSS text run.

2) This example doesn't work in Chromium (I mean the actual code, not just the
redirect). It can probably be fixed to work there too, but I fear the specter
of browser-specific code.

3) Since the graphic capability is, as you say, present in all browsers (I
didn't check IE myself), and since nobody is seriously contemplating
forbidding the marking of single letters in a word via markup, why, then, is
it so important to forbid it for symbols which are combining characters?

I actually found an answer for this question in the charmod-norm draft
(http://www.w3.org/TR/charmod-norm, linked earlier by Henri). It is required
there that fully-normalized text does not include text-runs which begin with a
combining character, because when such text-runs are concatenated (appended) to
another text-run, normalization may change the characters involved or their
order. As an example, "acce"+"&#x301;nt" should normalize into "accént". Hebrew
vowels (like many other combining characters) do not combine with their base
into a single character when normalized, but when there is more than one
combining character, their order may still change: Using capitals for the
combining characters, "acceB"+"Ant" may normalize into "acceABnt".

As was demonstrated here, this is not a real issue for browsers presenting
pages. I suppose it may be an issue for other processing of HTML pages. But
even then, the limitation seems far too strict: An overwhelming majority of
text runs in HTML documents will never be concatenated to anything but the
preceding text run in the same document; I could live perfectly well with "the
concatenation of all text-runs in a document should be fully-normalized" rather
than "every text run". Actually, according to the "background" subsection of
charmod-norm, there is little reason to apply it to HTML at all ("When data
transfer on the Web remained mostly unidirectional (from server to browser),
and where the main purpose was to render documents, the use of Unicode without
specifying additional details was sufficient". This still describes HTML, as
far as I am aware).
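To make the difference between the two rules concrete, here is a rough Python
sketch. The function names are hypothetical, NFC stands in for "normalized",
and a non-zero canonical combining class stands in for "combining character";
charmod-norm's own definition of a composing character is somewhat broader, so
this is only an approximation of the draft's wording.

```python
import unicodedata


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


def starts_with_combining(run: str) -> bool:
    # Approximation: treat any character with a non-zero canonical
    # combining class as a combining character.
    return bool(run) and unicodedata.combining(run[0]) != 0


def every_run_ok(runs: list[str]) -> bool:
    """Strict rule: each text run must pass on its own."""
    return all(nfc(r) == r and not starts_with_combining(r) for r in runs)


def whole_document_ok(runs: list[str]) -> bool:
    """Relaxed rule: only the concatenation of all runs must pass."""
    text = "".join(runs)
    return nfc(text) == text and not starts_with_combining(text)


# Shin, then a separately marked-up Shin Dot, then the rest of the word.
marked_up = ["\u05E9", "\u05C1", "\u05DC\u05D5\u05DD"]
print(every_run_ok(marked_up), whole_document_ok(marked_up))

# The Latin example fails both rules, since e + U+0301 composes under NFC.
latin = ["acce", "\u0301", "nt"]
print(every_run_ok(latin), whole_document_ok(latin))
```

Under these assumptions the marked-up Shin Dot is rejected only by the
per-run rule, while the Latin example is rejected by both.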

So, as far as I can see, the issue is this: does W3C prefer a use-case that is
already supported by major browsers, or the promise that concatenating
text-runs from valid pages will not, in itself, create non-normalized text? For
the "normalized" promise, note that no such promises are made about the
text-runs themselves; nobody requires those to be normalized.

> <rant>Each writing script has its advantages and disadvantages. For instance,
> Hebrew text runs are shorter than Latin runs, since there are no vowels there
> (and even if you have vowels, the text length doesn't increase).  As a user of
> of the Latin script where I must write vowels, I feel discriminated - for
> instance on Twitter!  It is even �worse�: last I checked, Twitter seemed (at
> least on the profile page) to not count combined chars, but instead to only
> count the letter they are combined with. And judging from that, you can add
> Hebrew vowels on Twitter without being punished! :-D </rant>

Check out https://dev.twitter.com/docs/counting-characters. Twitter counts
normalized characters; accents on Latin vowels are free, but Hebrew vowels will
still cost you. Just sayin'.
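A quick way to see the difference, assuming the count is simply the number of
code points after NFC normalization (my reading of the page above; the
function name tweet_length is made up for this sketch):

```python
import unicodedata


def tweet_length(text: str) -> int:
    # Assumption: length = number of code points in the NFC form.
    return len(unicodedata.normalize("NFC", text))


# Latin: e + COMBINING ACUTE ACCENT composes into a single code point.
latin_plain = tweet_length("cafe")
latin_accented = tweet_length("cafe\u0301")

# Hebrew: Bet + Patah has no precomposed form, so the vowel stays separate.
hebrew_plain = tweet_length("\u05D1")
hebrew_voweled = tweet_length("\u05D1\u05B7")

print(latin_plain, latin_accented)    # the accent adds nothing
print(hebrew_plain, hebrew_voweled)   # the vowel adds one
```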

-- 
Configure bugmail: http://www.w3.org/Bugs/Public/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the QA contact for the bug.
Received on Saturday, 1 October 2011 21:22:20 UTC