Re: [css3-text] line-break questions/comments from Glenn Adams on 2012-08-27 (www-style@w3.org from August 2012)

From: Glenn Adams <glenn@skynav.com>
Date: Mon, 27 Aug 2012 15:59:55 +0800
To: Koji Ishii <kojiishi@gluesoft.co.jp>
Cc: W3C Style <www-style@w3.org>, "public-i18n-cjk@w3.org" <public-i18n-cjk@w3.org>
Message-ID: <CACQ=j+cUrMqqr-cbTEXVY6m21xvLnAMj59ndx=yZSayj2vF1qg@mail.gmail.com>
On Mon, Aug 27, 2012 at 1:49 PM, Koji Ishii <kojiishi@gluesoft.co.jp> wrote:

> >>> (1) "known to be Chinese or Japanese" is not defined in a manner
> >>> sufficient to obtain testability or interoperability at any level; some
> >>> default algorithm should be defined, e.g., "use the 'lang' attribute
> ..."
> >>> or "use the default language of the font if any" or "if there are any
> >>> hiragana or katakana characters, then treat as Japanese; if any
> >>> hangul character, treat as Korean, otherwise ...", etc
> >>
> >> This refers to content language[1], and when such is not in the
> document,
> >> the spec says "it is possible for the content language of an element to
> be
> >> unknown", so this portion does not apply. This part of the spec is
> informative
> >> (as it is recommended) so UA may rely on other methods to determine if
> >> unknown such as automatic language detection.
> >>
> >> I guess we should change the "language" to "content language" with link
> to
> >> the terminology.
> >
> > Yes, please change "language" to a link to "content language". It would
> also
> > be useful to add a NOTE under the first occurrence of "known to be
> Chinese
> > or Japanese" to the following effect:
> >
> > "For the purpose of resolving 'known to be Chinese or Japanese', it is
> >  sufficient to determine that the governing @lang attribute (or
> equivalent)
> >  specifies a language tag containing 'ja' or 'zh' (or equivalent) as its
> primary
> >  language subtag."
>
> Fixed the link part.
>
> I don't think adding a note to this section is a good idea. First, it's
> more complex than one might imagine, like what to do when both @lang and
> @xml:lang were specified with different values. We should be consistent
> with what, for instance, lang selector does, and with what i18n WG says.
> Second, this property is not the only property that uses content language;
> see underline-position for instance. It doesn't look smart if the same
> notes appear everywhere we refer to the content language, does it?
>

A NOTE is informative, and it never hurts to add unless it is blatantly
wrong. It provides valuable guidance.

The phrases "known to be X [language]" are completely undefined as far as
the current text is concerned. If you want to have one note that covers all
X, then by all means do so, but don't just leave it in such an undefined
state.

For example, using lang (or equivalent) to satisfy "known" actually has
nothing to do with interpreting the text; that is, one might have <p
lang="ja">This is not Japanese</p> which would satisfy the "use @lang or
equivalent" interpretation but does not satisfy the more implicit
interpretation "if it is really Japanese-language text".
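
As a rough illustration of the check the proposed NOTE describes, the
following Python sketch tests only the primary language subtag of a declared
language tag. The function name is_known_cj and the plain split on "-" are
assumptions made for illustration; the sketch does not resolve conflicts
between @lang and @xml:lang, and it never inspects the text itself.

```python
from typing import Optional

# Primary language subtags named in the proposed NOTE.
CJ_PRIMARY_SUBTAGS = {"ja", "zh"}


def is_known_cj(lang_tag: Optional[str]) -> bool:
    """Return True if the declared tag's primary subtag is 'ja' or 'zh'.

    Hypothetical helper: it looks only at the declared language tag,
    never at the content it is attached to.
    """
    if not lang_tag:
        return False  # content language is unknown
    # The primary subtag is everything before the first hyphen.
    primary = lang_tag.strip().split("-", 1)[0].lower()
    return primary in CJ_PRIMARY_SUBTAGS
```

Under this sketch a tag such as "ja-JP" or "zh-Hant" counts as "known", and
an absent tag does not, regardless of what the text actually contains.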


> >>> (3) speaking of "breaks between some inseparable characters: ‥ U+2025,
> >>> … U+2026" what exactly does "between" mean here? does it mean
> >>> between only the following four pairs or something else?
> >>>
> >>> &#x2025;&#x2025;
> >>> &#x2025;&#x2026;
> >>> &#x2026;&#x2025;
> >>> &#x2026;&#x2026;
> >>
> >>Correct. This refers to IN (Inseparable Characters)[2] class in UAX#14.
> >
> > Please add some text making reference to this definition, e.g.,
> change
> > "between some inseparable characters" to read "between characters of
> > the IN (Inseparable Characters) class of [UAX14]".
>
> Ok, will do.
>
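
To make the "between" reading in (3) concrete, here is a small Python sketch
that enumerates every ordered pair of the two characters named in the draft
text. The names LISTED_IN_CHARS and inseparable_pairs are illustrative
assumptions; the IN class in UAX #14 may contain further characters, which
could be passed in instead.

```python
from itertools import product

# The two inseparable characters named in the draft text.
TWO_DOT_LEADER = "\u2025"
HORIZONTAL_ELLIPSIS = "\u2026"
LISTED_IN_CHARS = (TWO_DOT_LEADER, HORIZONTAL_ELLIPSIS)


def inseparable_pairs(chars=LISTED_IN_CHARS):
    """Return every ordered (before, after) pair of the given characters."""
    return list(product(chars, repeat=2))


# Print each pair as code points, one pair per line.
for before, after in inseparable_pairs():
    print(f"U+{ord(before):04X} U+{ord(after):04X}")
```

With the two listed characters this yields exactly the four pairs given in
(3), in the same order.
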
> >>> (4) is it permissible for 'auto' behavior to differ from all of
> >>> normal|strict|loose? e.g., map to 'foo' (where foo is defined
> internally by UA)?
> >>
> >> I didn't think about this, but as far as spec says, I think yes. From
> author
> >> perspective, I think yes too; authors should use the property if they
> want
> >> specific behavior, possibly along with lang attribute.
> >
> > Since many UAs make use of ICU, which uses UAX #14 for its default LB
> rules,
> > I would suggest adding an additional keyword value to this property
> "uax14", and
> > further specify that, "in the absence of any other relevant criteria, a
> UA should
> > treat 'auto' as if 'uax14' were specified". This will improve
> interoperability and
> > testability for the 'auto' value, which is the default 'initial' value
> for this property.
>
> IE, for instance, uses 'normal' as 'auto', and doesn't use ICU. It's
> mostly the same as the normal ja settings of ICU50, but not exactly the
> same. I don't see much user value for IE to implement exactly the same line
> breaking as UAX#14 and changing 'auto' to it.
>

I doubt it. That is, I doubt that IE uses *only* the rules defined by
'normal' if set to 'auto'. At present, auto (and the other values) are
pretty much *completely* non-interoperable and untestable, except for the
extremely small number of rules explicitly defined. I very much doubt that
users would be satisfied if *only* the rules specified under the definition
of this property were applied to text of any language, including CJK.


>
> UAX#14 has several character classes whose behavior changes depending on
> the application's input, AI or CJ for example, so there's no single
> UAX#14 line breaking. Also UAX#14 changes over time, ICU changes how to
> interpret classes like AI or CJ by versions, and UAs take different
> versions of ICU, so just adding 'uax14' doesn't help much.
>

I disagree. It provides a concrete point of departure when what is
currently specified is "whatever you want". The issues of versioning and
AI/CJ are not sufficient reason to avoid referencing or defining a uax14
keyword.

At minimum, we should define a "uax14" keyword, intentionally leaving
version and AI/CJ UA dependent (which could be resolved or parameterized in
the future).

Whether auto should map to uax14 "absent other criteria" is a separate
issue.
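
The proposed fallback can be sketched in Python as follows. This is only an
illustration of the suggestion made in this thread: the 'uax14' keyword is
proposed here and is not part of the current draft, and the function name
resolve_line_break and its ua_preference parameter are hypothetical.

```python
from typing import Optional

# Existing line-break keywords plus the proposed (not yet specified) 'uax14'.
KEYWORDS = {"auto", "loose", "normal", "strict", "uax14"}


def resolve_line_break(specified: str,
                       ua_preference: Optional[str] = None) -> str:
    """Resolve a line-break value under the proposed 'auto' fallback.

    ua_preference stands in for "any other relevant criteria" a UA may
    have; when it is None, 'auto' is treated as if 'uax14' were specified.
    """
    if specified not in KEYWORDS:
        raise ValueError(f"unknown line-break value: {specified!r}")
    if specified != "auto":
        return specified  # explicit values are used as given
    return ua_preference if ua_preference is not None else "uax14"
```

A UA that prefers 'normal' for 'auto' would pass ua_preference="normal"; a
UA with no such criteria would end up at 'uax14', which gives tests a
defined starting point.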


>
> We can test differences between loose, normal, and strict. We can't test
> the exact set of code points, as you said, but we could test common characters
> without requiring conformance.
>

Sure, but that is only a very small part of the rules available in uax14
and expected by authors/users. In other words, it isn't enough.


>
>
> > It might also be useful to either specify (in the property definition)
> or write in a
> > note something like: "in the absence of any other relevant criteria, a
> UA should
> > interpret 'loose', 'normal', and 'strict' in accordance with the default
> rules of
> > [UAX14] modified as required to satisfy the additional constraints
> specified in
> > this section".
>
> Because we don't define the baseline, we can't say this. The baseline is
> UA dependent, and we define the minimum differences between values.
>

We define baselines (i.e., default behavior) in many other cases. We should
do so here as well. Failing to do so just produces non-interoperability.
What is the point of defining line-break if 90% of the behavior is
unspecified? That seems to be a poor design principle: i.e., throw out some
tokens (loose, strict), and let UAs do whatever they want. We can do better
and should.


>
>
> >> The line break rules should apply cross-elements boundary, so the rule
> should
> >> apply in this case too. I know some implementations are broken in this
> regard
> >> though. As far as I discussed this with fantasai last time, 5.1. Line
> Breaking
> >> Details[3] says "a replaced element or other atomic inline is
> equivalent to that
> >> of the Object Replacement Character (U+FFFC)" so if one of the adjacent
> >> elements is inline-block, this will not apply.
> >
> > It would be useful to add a NOTE that distills this information.
>
> Hm. The information is just one page above the text, maybe examples help
> better. I'll be working on it later.
>
>
> Regards,
> Koji
>
>
Received on Monday, 27 August 2012 08:00:49 UTC