Re: convertKeyIdentifier from Mark Davis ☕ on 2009-09-30 (www-dom@w3.org from July to September 2009)

From: Mark Davis ☕ <mark@macchiato.com>
Date: Wed, 30 Sep 2009 16:04:47 -0700
To: Doug Schepers <schepers@w3.org>
Cc: public-i18n-core@w3.org, "www-dom@w3.org" <www-dom@w3.org>, w3c-i18n-ig <w3c-i18n-ig@w3.org>
Message-ID: <30b660a20909301604t57da51feo920e1a894d5de0de@mail.gmail.com>
Note: this may bounce from the dom list, so please forward.
Mark


On Tue, Sep 22, 2009 at 23:27, Doug Schepers <schepers@w3.org> wrote:

> Hi, Mark-
>
> Here's the facts as I see them... others should feel free to jump in with
> corrections or clarifications, or even other use cases or requirements.
>
> The context is the DOM3 Events spec [1].  We are discussing what the values
> for some of the key values should be... we have constructed a level of
> abstraction above scan codes, keyCode/charCode, key mappings, dead keys, and
> all that, into a set of "key identifiers".  (Note that a "key identifier" is
> actually the current modified and mapped value for the key, nothing like a
> keyCode or scan code).  We are discussing what the nature of the value
> should be, in the context of an API for key events.
>
> For historical context, key identifiers were specified originally as a set
> of key names (for control keys, etc., like "Enter", "Tab", or "F1") and a
> set of Unicode strings ("U+xxxx").  This is how it last stood when work was
> stopped on the spec in 2003, and published as a W3C Note. (It was a bit
> muddled in some details, but a good start.)
>
>
> We have recently taken up the spec again and are revisiting the key
> identifier issue, and in looking at the model, are discussing what the best
> return values are for a particular keystroke.  In the current model, each
> key identifier has one or more of the following "value types": a key name
> (like "Shift"), a character value (like "w"), or a Unicode string (like
> "U+0308").
>

Unless there is something I am missing, the only distinction between the
latter two is that

   - a character value consists of exactly one code point, such as
      -  "à", or "&#xE0;"
      -  "𝄌", or "&#x1D10C;"
   - a string value consists of two or more code points, such as
      - "sà", or "&#x73;&#xE0;"
      - "𝄌𝄒", or equivalently "&#x1D10C;&#x1D112;"
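
To make the one-versus-several code point distinction concrete, here is a minimal JavaScript sketch. The helper names `codePointCount` and `valueType` are hypothetical and used only for illustration; they are not part of the DOM3 Events draft.

```javascript
// Count Unicode code points, not UTF-16 code units.
// Spreading a string iterates it code point by code point.
function codePointCount(s) {
  return [...s].length;
}

// Hypothetical helper: label a value with the two categories discussed above.
function valueType(s) {
  const n = codePointCount(s);
  if (n === 0) {
    throw new RangeError("a key value must not be empty");
  }
  return n === 1 ? "character" : "string";
}

console.log(valueType("\u00E0"));     // U+00E0: one code point
console.log(valueType("\u{1D10C}"));  // U+1D10C: one code point, two UTF-16 code units
console.log(valueType("s\u00E0"));    // U+0073 U+00E0: two code points
```

Note that `"\u{1D10C}".length` is 2, so a test based on `.length` alone would misclassify supplementary characters.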

Issue1: you could use some convention for representing characters other than
by themselves (like  "\p{308}"). If this is in XML, the only reason you
would have to do this is where XML (unfortunately!!) cannot represent all
Unicode characters. The U+ notation wouldn't be a particularly good one,
because you don't know where it terminates -- that notation is meant for
plain text, where the last hex digit would be followed by a space or other
character. If it is not XML, but intended for APIs like JavaScript or Java,
those have datatypes that can represent any Unicode string. So I'm not quite
seeing the problem.
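
A small sketch of the termination problem, assuming a hypothetical parser `parseUPlus` that reads as many hex digits as follow "U+" (the name and the greedy rule are illustrative assumptions, not taken from any spec):

```javascript
// Hypothetical greedy parser for the "U+XXXX" notation.
function parseUPlus(text) {
  const m = /^U\+([0-9A-Fa-f]+)/.exec(text);
  return m ? parseInt(m[1], 16) : null;
}

// On its own, the notation is unambiguous.
console.log(parseUPlus("U+0308").toString(16));

// Followed directly by text that happens to be a hex digit, the
// boundary is lost: this reads as U+308A, not U+0308 then "A".
console.log(parseUPlus("U+0308A").toString(16));

// A native string carries the code point itself, so there is no
// notation to terminate.
const s = "\u0308" + "A";
console.log([...s].map((c) => c.codePointAt(0).toString(16)));
```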

Issue2: I don't see that there is a particularly good reason to have two
different elements for the single vs multiple code point cases.


> With the key names (those that don't represent a character or dead key that
> has a Unicode code point), we seem to be resolved on the full
> multi-character string.
>
> However, with the Unicode value, there is some contention about which is
> the best way to do this.  We know we want to make it very easy for script
> authors to get the character representation of the code point, but we also
> want them to be able to get the code point, for things like "\u0308"
> (diaeresis dead key).  This is obviously important for internationalization.
>

I don't understand what you mean. Could you give some examples of why
regular XML notation (either the literal character or an NCR) wouldn't work,
other than for the characters that XML doesn't handle?


>
> We also know that we want people to easily be able to represent the key
> identifiers in markup (such as the value for @accesskey, perhaps), so it
> should be representable there as well.  (My assumption there is that they
> should be able to use any of the character value, the code point string, or
> the key name, as appropriate and convenient.)
>
> We also want to allow people to test the Unicode range of the key
> identifier, to see if it falls into a certain orthography, general category,
> etc..
>

There are common libraries (like ICU) that supply a full set of Unicode
property tests for characters. There are many, many possible properties that
people may want to test on. Does this have to be in the protocol?
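
As an illustration of doing such tests in script rather than in the protocol: JavaScript regular expressions gained Unicode property escapes in ECMAScript 2018, well after this thread, so the sketch below uses those instead of ICU. The function names are hypothetical.

```javascript
// Property tests on a single character, using regex property
// escapes (these require the "u" flag).
function isCombiningMark(ch) {
  return /^\p{M}$/u.test(ch);
}

function isGreek(ch) {
  return /^\p{Script=Greek}$/u.test(ch);
}

function isLetter(ch) {
  return /^\p{L}$/u.test(ch);
}

console.log(isCombiningMark("\u0308")); // combining diaeresis
console.log(isGreek("\u03BB"));         // Greek small letter lambda
console.log(isLetter("w"));
```

None of this needs the code point exposed as a "U+XXXX" string; the character value itself is enough.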


>
> Finally, we want this to be as simple as possible to understand, implement,
> and use, while not sacrificing any potential functionality.
>
> As a secondary goal, we may want to allow people to convert the resulting
> key identifier (or, indeed, any string) into different characterizations
> (entity, code point, character, key name), not as part of an event, but in
> the general case.
>

I'm not sure what you mean; can you supply an example?


> A few mechanisms have been proposed.  One centers around a single attribute
> value on the keyboard event (".key"), which may resolve to the most
> author-friendly value, which can then perhaps be manipulated (there are
> various permutations of this idea).  The other involves an additional
> attribute (".codePoint"), that gives the Unicode code point, if one exists.
>

I don't know why you would want to do this, rather than just have a unified
representation: a key either has a name (e.g. "Shift") or has a value (any
non-empty Unicode string of code points).
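
A minimal sketch of what such a unified representation could look like to a script author. The set `KEY_NAMES` and the function `describeKey` are assumptions made up for this example; the real list of key names would come from the spec.

```javascript
// Hypothetical, deliberately tiny set of key names.
const KEY_NAMES = new Set(["Shift", "Enter", "Tab", "F1"]);

// A key is either a name or a non-empty string of code points.
function describeKey(key) {
  if (KEY_NAMES.has(key)) {
    return { kind: "name", name: key };
  }
  if (key.length === 0) {
    throw new RangeError("a key value must not be empty");
  }
  // Array.from iterates by code point, so surrogate pairs stay whole.
  const codePoints = Array.from(key, (c) => c.codePointAt(0));
  return { kind: "value", value: key, codePoints };
}

console.log(describeKey("Shift"));
console.log(describeKey("w"));
console.log(describeKey("\u0308")); // diaeresis dead key as a plain value
```

With this shape, the numeric code point is derived from the value when needed, so no separate attribute is required.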


>
> Any insight the i18n folks could shed here would be appreciated.
>
>
> [1] http://dev.w3.org/2006/webapi/DOM-Level-3-Events/html/DOM3-Events.html
>
> Regards-
> -Doug Schepers
> W3C Team Contact, SVG and WebApps WGs
>
>
>
> Mark Davis ☕ wrote (on 9/23/09 1:27 AM):
>
>> I don't know enough about the context here, but there appear to be a
>> number of misperceptions, among them that U+0308 (or "\u0308" or
>> whatever the syntax is) alone does not constitute a valid Unicode
>> string. It is absolutely a valid string.
>>
>> Perhaps someone can point me to some background here.
>>
>> Mark
>>
>>
>> On Tue, Sep 22, 2009 at 22:19, Doug Schepers <schepers@w3.org
>> <mailto:schepers@w3.org>> wrote:
>>
>>    Hi, Maciej-
>>
>>    Maciej Stachowiak wrote (on 9/22/09 11:23 PM):
>>
>>
>>        On Sep 22, 2009, at 7:53 PM, Ian Hickson wrote:
>>
>>            On Tue, 22 Sep 2009, Maciej Stachowiak wrote:
>>
>>                On Sep 22, 2009, at 9:27 AM, Anne van Kesteren wrote:
>>
>>                I agree with Anne. I think we should remove the U+XXXX
>>                format entirely. If you have a string like Q, you can
>>                convert it to a unicode numeric value for range checking
>>                like this: [...]
>>
>>                I don't think the U+XXXX string format adds any value.
>>
>>
>>            Are dead keys represented in some way? The string "\x0308"
>>            is not a valid Unicode string (it has a combining character
>>            with no base), but I don't see how else we would represent
>>            the diaeresis dead key.
>>
>>
>>        I hadn't thought of dead keys.
>>
>>
>>    I did mention that as one of the use cases at the beginning of this
>>    thread [1], but I probably could have expressed it more clearly.
>>
>>    Another case I mentioned is making sure that a character is in a
>>    certain range (such as in a certain code block or language group).
>>    This is possible with the Unicode code point (and some regex), but
>>    not with the character (I think), because a given character
>>    representation can actually appear in multiple ranges, so you can't
>>    say for certain that some particular character belongs
>>    unequivocally to a certain range. That might not be correct... I'll
>>    look into it and report back (unless someone already knows for sure).
>>
>>
>>        According to the spec, the key identifier for the diaeresis
>>        dead key is the string "DeadUmlaut". I can see a few possible
>>        ways to deal with this:
>>
>>
>>    We could remove the "key name" from the Unicode values, or replace
>>    it with something more appropriate, perhaps.
>>
>>
>>        1) Have a way to get the unicode code point for a dead key. But
>>        I think a numeric value would be more useful than the U+XXXX
>>        format string.
>>        1.a) This could be a global method that takes strings like
>>        "DeadUmlaut" and returns code points as numeric values; OR
>>        1.b) There could be an attribute on key events that gives the
>>        code point, if any, separate from the key identifier. long
>>        unicodeCodePoint for instance.
>>
>>
>>    When we discussed this in the telcons, we decided that a utility
>>    function was better than an event attribute, because you could use it
>>    at any time, not just when a keyboard event had occurred... (there
>>    was some other reason that Travis brought up that escapes me at the
>>    moment).
>>
>>    However, that was my first thought as well, so I'm amenable to that
>>    (maybe just ".codepoint"?).
>>
>>
>>        2) Alternately - even though "\x0308" is not a valid Unicode
>>        string, it can still be represented as a DOM string and as a
>>        JavaScript string, since both the DOM and JavaScript define
>>        strings as sequences of 16-bit UTF-16 code units, and may
>>        represent invalid strings (including even such things as
>>        containing only one code unit of the two that comprise a
>>        surrogate pair). Thus, identifiers like "DeadUmlaut" could be
>>        replaced with ones like "\x0308".
>>
>>
>>    What's the advantage of this over the U+XXXX format string?  I don't
>>    get it.
>>
>>    My own thought in putting this together was that we don't know all
>>    the uses it will be put to, so enabling the most general and generic
>>    approach is probably the safest bet.  Cutting corners now might
>>    inadvertently exclude some use case down the line, and enabling
>>    access to all the key identifier value types doesn't seem to be much
>>    more overhead (if any).  Please correct me if I'm wrong.
>>
>>
>>    [1] http://lists.w3.org/Archives/Public/www-dom/2009JulSep/0406.html
>>
>>    Regards-
>>    -Doug Schepers
>>    W3C Team Contact, SVG and WebApps WGs
>>
>
>
>
Received on Wednesday, 30 September 2009 23:05:22 UTC