Re: [css3-writing-modes] referring to Unicode from John Daggett on 2011-05-04 (www-style@w3.org from May 2011)

From: John Daggett <jdaggett@mozilla.com>
Date: Tue, 3 May 2011 23:34:38 -0700 (PDT)
To: Addison Phillips <addison@lab126.com>
Cc: fantasai <fantasai.lists@inkedblade.net>, www-style@w3.org, WWW International <www-international@w3.org>
Message-ID: <465832219.1274.1304490878342.JavaMail.root@zimbra1.shared.sjc1.mozilla.com>
Addison Phillips wrote:

> > > Since CSS specs are both explaining behavior and defining
> > > implementation, referring to a Unicode technical note is fine
> > > for referring to a deeper explanation of a concept but is *not*
> > > sufficient for defining implementation behavior. Implementation
> > > behavior should be defined in terms of the Unicode database [1]
> > > instead, by referencing specific data fields in specific files,
> > > e.g. the EastAsianWidth.txt file in your example here.  The
> > > technical notes often don't always cover all the subtleties
> > > implicit in using this data and that's something any definition
> > > of implementation behavior needs to cover explicitly, otherwise
> > > you end up with untestable muddle.
> > 
> > The EastAsianWidth.txt file is referenced from UAX11. UAX11 gives
> > the explanation of what it means, how to use it, etc. So I think
> > that referring to UAX11 is the correct thing to do here. I'll let
> > Addison correct me if I'm wrong.
> 
> In my opinion, you are correct to use UAX11 as a reference. UAX
> means "Unicode Standard Annex", i.e. it is an integral part of the
> Unicode Standard. John Daggett's comments do apply to some other
> classes of Unicode Technical Report and sometimes an Annex (or
> Technical Standard) may not be complete as a reference unto itself.
> But, in this case, UAX11 deals with East Asian Widths and focuses on
> defining the Unicode informative property in question. It is thus
> probably the best reference to EastAsianWidth.txt, although a
> separate reference to the latter file might also be useful for
> implementers.

I wasn't arguing that we shouldn't refer to Unicode annexes or
technical reports. I'm saying that such a reference is not sufficient,
by itself, to define the implementation of a given CSS property. For
that I think we should be including more detail: the definition of a
given CSS property should reference the specific property in the
Unicode database rather than relying on the property and its handling
being "obvious" from a reference to a given portion of the Unicode
specification, annex or technical report.

In the case of the 'text-orientation' property, the reasons for this
are evidenced by the issue noted at the end of the property description:

"Issue: Need to define handling of EAW Ambiguous (A) symbols and punctuation."

In other words, the decision as to whether to rotate the glyphs for 
a given character in vertical text needs to be more clearly specified 
since this is *not* explicitly covered as part of the text of UAX11.
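
To make the gap concrete, here is a minimal sketch (not taken from any
spec draft) that reads the East_Asian_Width property through Python's
standard unicodedata module. The function name orientation() and its
rule, upright for F and W and rotated for everything else, are
assumptions made up for illustration. The ambiguous_upright parameter
stands in for the decision that the issue above leaves open.

```python
import unicodedata


def orientation(ch: str, ambiguous_upright: bool) -> str:
    """Hypothetical rule: return 'upright' or 'rotated' for one character."""
    eaw = unicodedata.east_asian_width(ch)  # 'F', 'W', 'A', 'H', 'Na' or 'N'
    if eaw in ("F", "W"):
        return "upright"
    if eaw == "A":
        # UAX11 assigns the value A but says nothing about rotation,
        # so the caller has to pick a policy.
        return "upright" if ambiguous_upright else "rotated"
    return "rotated"


if __name__ == "__main__":
    # HIRAGANA LETTER A, LATIN CAPITAL LETTER A,
    # LEFT SINGLE QUOTATION MARK, SECTION SIGN
    for ch in ("\u3042", "A", "\u2018", "\u00a7"):
        print(f"U+{ord(ch):04X}", unicodedata.east_asian_width(ch),
              orientation(ch, True), orientation(ch, False))
```

The two policies disagree only on characters such as U+2018 and
U+00A7, which is the set a property definition would have to address
explicitly.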

Perhaps a better example of the same issue exists in the definition of
the 'text-transform' property in the current Editor's Draft of CSS3 Text:

http://dev.w3.org/cvsweb/~checkout~/csswg/css3-text/Overview.html?rev=1.128;content-type=text%2Fhtml#text-transform

The 'fullwidth' value is defined as:

    Puts all characters in fullwidth form. If the character does not
    have a corresponding fullwidth form, it is left as is. This value
    is typically used to typeset Latin characters and digits like
    ideographic characters. 

Additional description:

    The definition of fullwidth and halfwidth forms can be found on
    the Unicode consortium web site at [UAX11]. The mapping to
    fullwidth form is defined by <wide> tag of Character Decomposition
    Mapping in [UAX44].

But this doesn't really define the precise mapping function; it only
implies it obliquely.  The data in the UnicodeData.txt file looks like this:

FF41;FULLWIDTH LATIN SMALL LETTER A;Ll;0;L;<wide> 0061;;;;N;;;FF21;;FF21
FF42;FULLWIDTH LATIN SMALL LETTER B;Ll;0;L;<wide> 0062;;;;N;;;FF22;;FF22
FF43;FULLWIDTH LATIN SMALL LETTER C;Ll;0;L;<wide> 0063;;;;N;;;FF23;;FF23
FF44;FULLWIDTH LATIN SMALL LETTER D;Ll;0;L;<wide> 0064;;;;N;;;FF24;;FF24

The mapping is *from* the codepoint contained in the
Decomposition_Mapping property when '<wide>' is present.  So 'a'
(U+0061) would map to its fullwidth version (U+FF41).  When you look
at the data you also discover this:

3000;IDEOGRAPHIC SPACE;Zs;0;WS;<wide> 0020;;;;N;;;;;

So the mapping would also map spaces to ideographic spaces.  Since
this has implications for white space collapsing, the point in the
text handling pipeline where text-transform occurs needs to be defined
precisely.  This has been noted as an issue and discussed on www-style
[1].
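
As a rough sketch of what an explicit definition could look like, the
following builds the reverse mapping directly from the
Decomposition_Mapping field (the sixth semicolon-separated field) of
the UnicodeData.txt records quoted above. The names SAMPLE,
build_fullwidth_map and to_fullwidth are hypothetical, and the table
covers only the five sample records, not the full file.

```python
# Records copied from the UnicodeData.txt excerpts quoted above.
SAMPLE = """\
FF41;FULLWIDTH LATIN SMALL LETTER A;Ll;0;L;<wide> 0061;;;;N;;;FF21;;FF21
FF42;FULLWIDTH LATIN SMALL LETTER B;Ll;0;L;<wide> 0062;;;;N;;;FF22;;FF22
FF43;FULLWIDTH LATIN SMALL LETTER C;Ll;0;L;<wide> 0063;;;;N;;;FF23;;FF23
FF44;FULLWIDTH LATIN SMALL LETTER D;Ll;0;L;<wide> 0064;;;;N;;;FF24;;FF24
3000;IDEOGRAPHIC SPACE;Zs;0;WS;<wide> 0020;;;;N;;;;;
"""


def build_fullwidth_map(lines):
    """Map each <wide> decomposition target back to its fullwidth form."""
    mapping = {}
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 6:
            continue
        decomposition = fields[5]
        if not decomposition.startswith("<wide>"):
            continue
        targets = decomposition.split()[1:]
        if len(targets) != 1:
            continue  # only one-to-one mappings are handled in this sketch
        narrow = chr(int(targets[0], 16))
        wide = chr(int(fields[0], 16))
        mapping[narrow] = wide
    return mapping


def to_fullwidth(text, mapping):
    """Replace each character that has a fullwidth form; leave others as is."""
    return "".join(mapping.get(ch, ch) for ch in text)


FULLWIDTH_MAP = build_fullwidth_map(SAMPLE.splitlines())

if __name__ == "__main__":
    result = to_fullwidth("a b", FULLWIDTH_MAP)
    print([f"U+{ord(ch):04X}" for ch in result])
```

Note that the ordinary space in "a b" comes out as U+3000, which is
why the position of this step relative to white space collapsing
matters.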

The precise behavior of 'uppercase' and 'lowercase' should also
probably be defined explicitly.  Should only the
Simple_Uppercase_Mapping and Simple_Lowercase_Mapping properties be
used?  Or should the mappings contained in SpecialCasing.txt also
apply? (My answer: yes please!)
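
The difference between the two choices can be sketched as follows. The
two tables are tiny hand-copied excerpts, assumed here for
illustration and to be checked against the real data files:
SIMPLE_UPPER stands for the Simple_Uppercase_Mapping field of
UnicodeData.txt and SPECIAL_UPPER for the unconditional uppercase
entries of SpecialCasing.txt. The function names are hypothetical.

```python
# Excerpt of Simple_Uppercase_Mapping (UnicodeData.txt); illustrative only.
SIMPLE_UPPER = {
    0x0061: 0x0041,  # a -> A
    0x0066: 0x0046,  # f -> F
}

# Excerpt of unconditional uppercase mappings from SpecialCasing.txt.
SPECIAL_UPPER = {
    0x00DF: (0x0053, 0x0053),  # LATIN SMALL LETTER SHARP S -> "SS"
    0xFB00: (0x0046, 0x0046),  # LATIN SMALL LIGATURE FF -> "FF"
}


def upper_simple(text):
    """Uppercase using one-to-one simple mappings only."""
    return "".join(chr(SIMPLE_UPPER.get(ord(ch), ord(ch))) for ch in text)


def upper_full(text):
    """Uppercase using SpecialCasing entries first, then simple mappings."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp in SPECIAL_UPPER:
            out.extend(chr(c) for c in SPECIAL_UPPER[cp])
        else:
            out.append(chr(SIMPLE_UPPER.get(cp, cp)))
    return "".join(out)


if __name__ == "__main__":
    for sample in ("\u00df", "\ufb00a"):
        print(repr(sample), repr(upper_simple(sample)), repr(upper_full(sample)))
```

With simple mappings alone, U+00DF and U+FB00 pass through unchanged;
with SpecialCasing.txt applied they expand to two characters each, so
the two definitions are observably different.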

Instead the current draft just writes:

    Although limited, the case mapping process has some language
    dependencies. Some well known examples are Turkish and Greek. If
    the content language is known then any such language-specific
    rules must be used.  The case mapping rules for the character
    repertoire specified by the Unicode Standard can be found on the
    Unicode Consortium Web site. [UNICODE]

This is simply not sufficient to define what 'uppercase' and
'lowercase' mean in implementation terms.

Depending on how you define the case-mapping properties, there's also 
a possible ordering issue, since text-transform can be multi-valued:

  p { text-transform: fullwidth lowercase; }

  <p>&#xfb00;</p> /* codepoint for ff presentational ligature */

Does a viewer see the ff-ligature or fullwidth FF?  This *might* be
determined by the order in which these mappings are applied.
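
A minimal sketch of the ordering dependence, under two assumptions
that are not in the draft: case mapping uses the full mappings
(Python's str.upper() does), and the case value in play is
'uppercase', since U+FB00 appears to be unaffected by lowercasing. The
fullwidth table is rebuilt from Python's unicodedata module, and the
helper names are hypothetical.

```python
import unicodedata


def build_fullwidth_map():
    """Collect one-to-one <wide> decompositions from the BMP, reversed."""
    mapping = {}
    for cp in range(0x10000):
        decomposition = unicodedata.decomposition(chr(cp))
        if decomposition.startswith("<wide> "):
            targets = decomposition.split()[1:]
            if len(targets) == 1:
                mapping[chr(int(targets[0], 16))] = chr(cp)
    return mapping


FULLWIDTH_MAP = build_fullwidth_map()


def to_fullwidth(text):
    """Replace each character that has a fullwidth form; leave others as is."""
    return "".join(FULLWIDTH_MAP.get(ch, ch) for ch in text)


def codepoints(text):
    """Format a string as a list of U+XXXX labels for inspection."""
    return [f"U+{ord(ch):04X}" for ch in text]


LIGATURE = "\ufb00"  # LATIN SMALL LIGATURE FF

# Order 1: fullwidth first (the ligature has no fullwidth form), then uppercase.
fullwidth_then_upper = to_fullwidth(LIGATURE).upper()

# Order 2: uppercase first (ligature expands to "FF"), then fullwidth.
upper_then_fullwidth = to_fullwidth(LIGATURE.upper())

if __name__ == "__main__":
    print(codepoints(fullwidth_then_upper))
    print(codepoints(upper_then_fullwidth))
```

Under these assumptions the first order yields ordinary "FF" and the
second yields two fullwidth capitals, so the result depends on the
order in which the values are applied.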

My point here is simply that implementors need more detail than simple
references to parts of Unicode.  Rather than rely on folks like Boris
Zbarsky, David Baron and Sergey Malkin to point these details out when
they actually dig through the references for a given property and
ponder on them for a bit, I think it would be better (and simpler!)
for the specs to detail these algorithms precisely so that issues like
these are clear to not just implementors steeped in Unicode lore but
to authors, QA folks and other mere mortals.

Regards,

John Daggett

[1] Effect of text-transform on spaces
http://lists.w3.org/Archives/Public/www-style/2011Feb/thread.html#msg470
Received on Wednesday, 4 May 2011 06:35:07 UTC