Re: [css-fonts-4] Author-friendly unicode-range values? from Tab Atkins Jr. on 2013-09-04 (www-style@w3.org from September 2013)

From: Tab Atkins Jr. <jackalmage@gmail.com>
Date: Wed, 4 Sep 2013 10:10:22 -0700
To: Lea Verou <lea@verou.me>
Cc: www-style list <www-style@w3.org>
Message-ID: <CAAWBYDD+UZWE3=+gR=X0+rpzsdc7puAcskwPthzGO=Ecm2K0GA@mail.gmail.com>

On Wed, Sep 4, 2013 at 9:40 AM, Lea Verou <lea@verou.me> wrote:
> Today’s telcon discussion on unicode-range reminded me of most authors’
> biggest gripes with it: That ranges need to be defined as unicode codepoints
> instead of strings, requiring a unicode table lookup for every single use of
> that descriptor. Strings seemed like a no-brainer: Figuring out the unicode
> codepoint for a specific character is something machines do much better than
> humans. I researched it and it seems that the reason this was not allowed
> was this [1]:
>
>> Makes sense but I think the implementation details could easily get a
>> bit hairy, you would end up with ambiguous situations involving things
>> like combining diacritics and shaped vowels.  To authors they would
>> appear as a single character but underneath they would in fact be
>> multi-character strings:
>>  unicode-range: 'å'; /* could be a-ring or 'a' followed by ring diacritic */
>
> I understand the complexity, but it seems like one of those cases where we
> couldn’t decide what to do, so we did nothing and ended up with very
> author-unfriendly syntax. Most use cases would not be ambiguous, so as long
> as we define something reasonable for the ones that are, authors would
> rarely stumble on any complexity. On the plus side, authors would be able to
> use unicode-range without any need to look up unicode tables for the vast
> majority of their use cases. For the cases that are ambiguous, when authors
> want to do something different than the way we’ve defined strings to work,
> they can always use codepoints. Requiring them to use codepoints all the
> time because strings might be confusing in some cases does not seem
> reasonable to me.
>
> The way I picture it working, ranges would also be available (such as
> "a"-"z"). Single character strings would just be a shortcut to their unicode
> codepoint and they could be combined (e.g. ranges like "a"-U+7F).
> Multi-character strings would be invalid (including letters followed by
> diacritics).
>
> Thoughts? Is there any other reason this was not allowed, that I missed?
>
> [1]: http://lists.w3.org/Archives/Public/www-style/2009Jun/0000.html

I'm not a fan of the syntax, particularly when you mix codepoints and
characters, like "a"-U+7F, which parses as STRING(a) IDENT(-U)
DIMENSION(+7, F) rather than the intended STRING(a) DELIM(-)
UNICODE-RANGE(U+7F).  A function would work better, such as
"unicode-range('a', u+7f)".

I was considering suggesting that we could do normalization, to avoid
the "precomposed character or base character + combining character"
ambiguity, but any normalization would also do unfortunate things with
characters like the Angstrom sign (U+212B, which normalizes to the
precomposed A with ring, U+00C5).  So yeah, just let a string of two
characters be invalid.  Some keyboards/OSes have trouble typing
precomposed characters, but oh well, that's what the codepoint form is
for.
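
As a rough illustration only (nothing here comes from a spec or an
implementation), the two points above can be sketched in Python: a
function-style range whose endpoints are either single-character strings
or U+XXXX codepoints, with multi-character strings rejected rather than
normalized.  The helper names from_string, from_codepoint and
unicode_range are hypothetical, and Python's unicodedata module stands in
for whatever normalization a browser might have applied.

```python
import re
import unicodedata
from typing import Tuple

# Hypothetical helpers for illustration; not taken from any spec or browser.
_CODEPOINT = re.compile(r"[uU]\+([0-9a-fA-F]{1,6})")


def from_string(text: str) -> int:
    """Map a string to its code point; anything but one code point is invalid."""
    # No normalization is applied, so 'a' + combining ring (two code points)
    # is rejected instead of being silently merged into U+00E5.
    if len(text) != 1:
        raise ValueError(f"{text!r} is not a single code point")
    return ord(text)


def from_codepoint(token: str) -> int:
    """Parse the existing U+XXXX form (case-insensitive) into an integer."""
    match = _CODEPOINT.fullmatch(token)
    if match is None:
        raise ValueError(f"{token!r} is not of the form U+XXXX")
    value = int(match.group(1), 16)
    if value > 0x10FFFF:
        raise ValueError(f"{token} is beyond U+10FFFF")
    return value


def unicode_range(start: int, end: int) -> Tuple[int, int]:
    """Return an inclusive (start, end) range, rejecting reversed endpoints."""
    if start > end:
        raise ValueError("range start is greater than range end")
    return (start, end)


# The mixed example from above, written in function style.
mixed = unicode_range(from_string("a"), from_codepoint("u+7f"))

decomposed = "a\u030a"  # 'a' followed by COMBINING RING ABOVE
angstrom = "\u212b"     # ANGSTROM SIGN

# NFC would merge the decomposed pair into one code point (U+00E5) ...
composed = unicodedata.normalize("NFC", decomposed)
# ... but it also rewrites the Angstrom sign into a different character (U+00C5).
folded = unicodedata.normalize("NFC", angstrom)
```

Here from_string(decomposed) raises ValueError, which is the "just let a
string of two characters be invalid" behavior; an author who needs the
decomposed pair would fall back to from_codepoint for each codepoint.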

~TJ

Received on Wednesday, 4 September 2013 17:11:10 UTC