Re: CSS3 Text and UAX14 from Asmus Freytag on 2007-02-21 (www-style@w3.org from February 2007)

From: Asmus Freytag <asmusf@ix.netcom.com>
Date: Wed, 21 Feb 2007 15:11:22 -0800
To: fantasai <fantasai.lists@inkedblade.net>
CC: "Paul Nelson (ATC)" <paulnel@winse.microsoft.com>, www-style@w3.org, WWW International <www-international@w3.org>
Message-ID: <45DCD19A.1020200@ix.netcom.com>
On 2/20/2007 10:37 PM, fantasai wrote:
> Well, no, it wouldn't because ZW mustn't be automatically inserted in
> this case:
>
>    <p>Line feeds (<code>U+000A</code>) are...</p>
I suspected as much.
> I think that if you merely recommend the currently-specified behavior
> and specify it as the default, but allow breaks within whitespace
> sequences as a tailoring, then it is reasonable to trust that the
> implementor or higher-level protocol will do what is most appropriate
> for its application. That would certainly solve the problem for CSS
> 'pre'.
>
If we can establish that this is the one thing that we need to allow, then
the best way forward is indeed to allow HL protocols to override the
breaking behavior *within* sequences of whitespace.

In order to keep the bar high to encourage reliable data interchange,
I would like to keep the exception limited to what is actually required.

Therefore, I want to make sure that nobody is aware of any other
behavior involving sequences of SP and any of the other non-tailorable
classes that presents issues. (The interaction of SP with tailorable
classes is not an issue, as those rules are tailorable.)
>> As a first pass, I've gone over the existing text and replaced the
>> use of "must" where it was used as ordinary language, usually by
>> 'needs be', or other, less normative sounding statements.
>
> Excellent. :)
>
>> I've taken a hard look at the "should"s and replaced many of them
>> by "is recommended", because that's what most of them were used
>> as.
>
> "should" and "is recommended" are equivalent in RFC2119, 
but not in UAX#14 - we explicitly state that we use "recommended"
when we talk about 'best practice' in a general sense. However, we
use "should" when talking about character usage, in those cases where
we strongly recommend that a particular character or sequence be used
to encode a specific text feature.

This is where it matters that UAX#14 is both part of the Unicode Standard
and a specification of a particular algorithm. Therefore, section 5 contains
occasional recommendations to authors of texts about their choice of
characters, which are of no concern to implementers of the algorithm,
because to them the sequence of input characters is given.
> and carry
> the same basic meaning in English as well. If you add it as alternative
> wording for recommendations in your conformance section, there's no
> problem with using it. The thing to be careful about is, as with
> "must", to only use it in contexts that are recommending a behavior
> for the implementation ("the UA should treat Q as Z in some languages",
> or "Q should be treated as Z in some languages").
>
>> The 'descriptive statement' issue is of course a bit trickier to track
>> down. I've strengthened the language in the introduction to section 5
>> to clarify that much of the material is informative.
>
> Ok. I'll take a closer look and see if I find any more examples of
> this problem.
OK, let's take these details offline, at least until next week, when I
hope to have a proposed update for UAX#14 released.
(I'll respond offline on these.)
>
>>> Another problem I've noticed: SP is specified as not tailorable, but it
>>> is left out of the list of non-tailorable character classes in the list
>>> at the top of 6.1. What is the intent of the spec? Can membership in SP
>>> be tailored or not?
>>
>> It looks like the list at the head of 6.1 is in error (It's marked
>> non-tailorable in both Table 1 and Section 5 and the intent is to only
>> have non-tailorable classes participating in the rules in Section 6.1).
>> That's an erratum.
>
> Ok. Is there any reason why SP's membership cannot be augmented?
The main reason is that hardly any established practice exists of treating
any other character codes the same as U+0020. At least in the many line
break and DTP implementations that I've worked on, SP was always limited
to a single character.

In my view, allowing this class to be tailored is also not needed.
There are existing line break classes, like BA, that will give you most of
what SP does, without its special status as a (visible) linebreak control.
All the fixed-width spaces have been assigned to this BA ("break after")
class by default. You could easily create another class, e.g. BS
("breaking space"), and give it rules that are more specific to what you'd
like to do.
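
To make the idea concrete, here is a minimal sketch in Python of what such
a tailoring could look like. Everything in it is illustrative: the table is
a hand-picked subset of class assignments, the tailor() helper is
hypothetical, and BS is the made-up class from the paragraph above, not
anything defined in UAX#14. The sketch covers class assignment only; the
pair rules for BS would still need to be written separately.

```python
# Sketch only: a tiny, hand-picked subset of default line break classes.
DEFAULT_CLASSES = {
    0x0020: "SP",  # SPACE
    0x200B: "ZW",  # ZERO WIDTH SPACE
    0x2002: "BA",  # EN SPACE
    0x2003: "BA",  # EM SPACE
    0x2009: "BA",  # THIN SPACE
}

# Subset of the non-tailorable classes, enough for this illustration.
NON_TAILORABLE = {"SP", "ZW"}


def tailor(defaults, overrides):
    """Return a tailored copy of `defaults` (hypothetical helper).

    Raises ValueError if an override would move a character into or out of
    a non-tailorable class.
    """
    tailored = dict(defaults)
    for code_point, new_class in overrides.items():
        old_class = defaults.get(code_point)
        if old_class in NON_TAILORABLE:
            blocked = old_class
        elif new_class in NON_TAILORABLE:
            blocked = new_class
        else:
            blocked = None
        if blocked is not None:
            raise ValueError(
                f"U+{code_point:04X}: membership of {blocked} is not tailorable"
            )
        tailored[code_point] = new_class
    return tailored


# Move EM SPACE from BA into the hypothetical BS class.
tailored = tailor(DEFAULT_CLASSES, {0x2003: "BS"})
print(tailored[0x2003])  # BS
```

An attempt to add another character to SP, for example
tailor(DEFAULT_CLASSES, {0x00A0: "SP"}), raises ValueError in this sketch,
which mirrors the restriction discussed here.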

The two aspects that you couldn't override (and I'd argue that you
shouldn't) are as follows. First, a ZW character will force a break
opportunity anywhere (except in front of a "true" SP).

The second one is that CM will be treated as AL following a true SP, but
not following any tailorable class. The reason why we include SP in that
rule is that we earlier recommended that SP be used as a placeholder to
show a combining mark in isolation. This recommendation caused a lot of
grief, precisely because SP is treated as special in so many algorithms.
Therefore, we now recommend NBSP. However, as a result of that
earlier recommendation, you can expect to see sequences of SP followed
by combining marks. Sequences of EM SPACE followed by combining marks,
by contrast, are ill-formed on the text level (they are legal on the
encoding level), and it makes little sense to try to fine-tune their
treatment in a linebreaking algorithm.
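
The CM behavior described above can be sketched roughly as follows. This is
a simplified illustration in Python, not the text of the algorithm: the
function name is hypothetical, the input is assumed to be a list of
already-assigned class names, and only SP (plus the start of text) is
handled as a context in which a mark does not attach. The full rules list
further classes and treat a base plus its marks as a single unit rather
than relabeling each mark.

```python
def resolve_combining_marks(classes):
    """Replace each CM by the class it is treated as (simplified sketch).

    A CM at the start of the text or directly after SP has no usable base
    and is treated as AL. Any other CM inherits the (already resolved)
    class of whatever precedes it.
    """
    resolved = []
    for cls in classes:
        if cls != "CM":
            resolved.append(cls)
        elif not resolved or resolved[-1] == "SP":
            # Isolated mark: SP (or nothing) is not treated as a base.
            resolved.append("AL")
        else:
            # Attached mark: behaves like its base, e.g. BA for EM SPACE.
            resolved.append(resolved[-1])
    return resolved


print(resolve_combining_marks(["SP", "CM", "BA", "CM"]))
# ['SP', 'AL', 'BA', 'BA']
```

In this sketch a mark after a BA space simply inherits BA, which is the
"no special treatment" outcome for EM SPACE plus combining mark.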

OK, that was the long answer. As always, if you have an example of an
existing protocol or implementation where not being able to tailor SP
makes a crucial difference, I'd like to know about it.

A./
Received on Wednesday, 21 February 2007 23:12:51 UTC