Re: [webauthn] truncation to 64-byte upper limit doesn't mention character boundaries

@equalsJeffH Thanks for this. I think it is useful to separate some concerns here.

Counted storage limits are generally required by low-level protocols or data structures, and are typically implemented in serialization/deserialization code or as part of, e.g., network protocols. At this level you have to define the length limit in some unit. I18N folks generally prefer *characters* (by which we mean Unicode code points) rather than bytes, because limits defined in characters do not disadvantage languages/scripts that use the 2-, 3-, or 4-byte forms of UTF-8 the way byte counts do. The 64-byte limit probably goes unnoticed by English speakers, while Chinese users (with an effective 21-character limit) notice it more often. Users don't understand why they can type a lot of text but only one (complex) emoji.
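To make the arithmetic concrete, here is a quick Python sketch (the sample strings are just illustrations, not anything from the spec) showing how the same byte budget translates into very different character counts:

```python
# Illustrative only: how a 64-byte UTF-8 limit plays out across scripts.
samples = {
    "English": "username",   # 1 byte per code point
    "Chinese": "用户名",      # 3 bytes per code point (~21 fit in 64 bytes)
    "Emoji": "👩‍💻",          # ZWJ sequence: 3 code points, 11 bytes
}
for label, text in samples.items():
    print(f"{label}: {len(text)} code points, {len(text.encode('utf-8'))} UTF-8 bytes")
```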

If you *must* have a byte-based limit (usually due to a protocol requirement), then at a very minimum I expect to see code-point-based truncation, because splitting a multi-byte sequence produces extra U+FFFD replacement characters when the value is decoded, and that is a Bad Thing; hence my comment above. I'm usually fine with low-level specs that define length-limited fields and require truncation on character boundaries.
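Purely as an illustration (a hypothetical helper, not something the spec needs to define), code-point-based truncation under a byte budget can be as simple as backing the cut point up past any UTF-8 continuation bytes:

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Truncate to at most max_bytes of UTF-8 without splitting a code point."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    cut = max_bytes
    # UTF-8 continuation bytes look like 0b10xxxxxx; back up until the cut
    # lands on a lead byte, i.e. on a code point boundary.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    # The prefix now ends on a complete code point, so no U+FFFD is created.
    return data[:cut].decode("utf-8")
```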

So... regarding extended grapheme clusters (EGCs), yes, I agree that truncating on grapheme boundaries is ideal. If you truncate certain character sequences, you change the meaning of the user-perceived character (grapheme), such as when you remove a vowel from an Indic character sequence. So ideally truncation would be on grapheme boundaries. UAX #29 talks about this, as do some other specs. But, to be honest, I don't think a low-level implementation needs to require this or define it in amazing detail. It's a health warning. An example of a W3C spec that deals with the higher-level problem is [CSS3-Text](https://www.w3.org/TR/css-text-3/#characters) (at the given location). Note that even there they allow for vagaries with notes like:

> Authors are forewarned that dividing grapheme clusters by element boundaries may give inconsistent or undesired results. 
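For completeness, grapheme-aware truncation might look something like the following sketch, using the third-party `regex` module, whose `\X` pattern matches an extended grapheme cluster per UAX #29 (the spec itself wouldn't need to mandate any particular mechanism):

```python
import regex  # third-party; the stdlib `re` module does not support \X

def truncate_graphemes(text: str, max_bytes: int) -> str:
    """Keep whole grapheme clusters while staying within a UTF-8 byte budget."""
    out, used = [], 0
    for cluster in regex.findall(r"\X", text):
        size = len(cluster.encode("utf-8"))
        if used + size > max_bytes:
            break
        out.append(cluster)
        used += size
    return "".join(out)
```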

So:

* I would prefer you define the minimum limit in Unicode code points ("characters")
* I would prefer you require (`MUST`) truncation only on character boundaries, regardless of whether you are counting code units ("bytes") or code points ("characters")
* I would like it if you encouraged truncation only on grapheme boundaries

-- 
GitHub Notification of comment by aphillips
Please view or discuss this issue at https://github.com/w3c/webauthn/issues/973#issuecomment-400877378 using your GitHub account

Received on Thursday, 28 June 2018 01:00:25 UTC