- From: aphillips via GitHub <sysbot+gh@w3.org>
- Date: Sat, 28 May 2016 21:42:27 +0000
- To: public-annotation@w3.org
Per [I18N-ACTION-527](https://www.w3.org/International/track/actions/527), I looked into the text in 4.2.5 and the related text elsewhere in the current ED. Here's the current text:

> The text MUST be normalized before recording. Thus HTML/XML tags should be removed, character entities should be replaced with the character that they encode, and unnecessary whitespace should be normalized. The normalization routine may be performed automatically by a browser, and other applications should implement the DOM String Comparisons method. This allows the Selector to be used with different encodings and user agents and still have the same semantics and utility. The selection MUST be based on the logical order of the text, rather than the visual order, especially for bidirectional text. The normalized value MUST be recorded as UTF-8 in the JSON serialization of the Annotation.

I also agree with @r12a's comment above: rather than a random list of potential operations, there should be a clearly defined set of operations. The problem is that the text has a normative-sounding _must_ in it, but then follows with some random suggestions and some "should" text. It also includes a reference to DOM String. As an implementer, I would be confused about what exactly is required.

If the concern here is whether to apply a Unicode Normalization Form, the WG's current position, as described in [Charmod-Norm](http://w3c.github.io/charmod-norm/), is *not* to apply a Unicode Normalization Form to the text. In Charmod-Norm, pay particular attention to section 3.2. Please note that the annotation specs are of the "non-normalizing" type.

The text in DOM Strings referenced in the current text requires NFC and, additionally, requires _fully normalized_ and _include normalized_ checking. In summary, these normalization requirements are meant to prevent selections from starting with a combining mark.
While this is noble in intent, in most implementations it is difficult for the user to select text that begins with a combining mark. I'm not convinced (although I could be) that requiring _fully normalized_ checking at the model level is helpful. If there is a reason to apply this checking, it should be explicitly stated in Annotation Model, not indirectly and obliquely through DOM Strings (where it will be misunderstood).

In addition, by applying it to the text _normalization_ step, you miss the important point: the normalizing algorithm probably cannot adjust the boundaries between the `exact`, `prefix`, and `suffix` text. The best it can probably do is mutate the text to have an extra non-combining mark (generally an NBSP) at the start of the given segment of text, which probably does more to break the text than doing nothing at all. In that case, you'd be better off supplying a MUST requirement on the quote or position selector locations, or just noting the potential problem for implementers to try to avoid (but permitting non-fully-normalized quotes or positions).

In the editor's copy, I note the addition of text referring to logical (vs. visual) order and of text mentioning the use of UTF-8 for the JSON serialization. In my opinion, both of these requirements are superfluous and should be removed. The eventual use of UTF-8 is already a requirement of JSON serialization (and text can also be `\u` escaped in JSON). It presents no actual requirement for implementers of Text Quote Selector. Similarly, it would be better to introduce logical encoding globally in the document, perhaps in the discussion of principles or by reference to [Charmod-Fundamentals](https://www.w3.org/TR/charmod) and Charmod-Norm.

I would suggest using this as a basis for a revision:

> The text MUST be normalized before recording by applying the following operations in this order to the source text:
>
> 1. Convert the source text to a sequence of Unicode code points, including expansion of character entities and escapes to Unicode.
> 2. Remove all markup, such as HTML or XML tags. _Question: what to do about dir?_
> 3. Normalize whitespace by collapsing each run of whitespace to a single ASCII space character (U+0020). Note that the text MAY begin or end with a space character. _i.e. no trim is implied_
> 4. Adjust the boundaries between `exact`, `prefix`, and `suffix` such that none of the three begins with a combining mark and, if possible, so that they coincide with grapheme boundaries.
> 5. Extract the `exact` and, if present, the `prefix` and `suffix` text.

--
GitHub Notification of comment by aphillips

Please view or discuss this issue at https://github.com/w3c/web-annotation/issues/227#issuecomment-222330988 using your GitHub account
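
For illustration only, the five proposed steps above might be sketched in Python roughly as follows. This is not text from any spec, and the names `normalize_source`, `select`, and `context` are hypothetical. It assumes HTML input, assumes "whitespace" means the ASCII whitespace characters, treats any character in a Unicode Mark category (Mn, Mc, Me) as a combining mark, and leaves the open question about `dir` unaddressed. Steps 1 and 2 are done in a single parser pass so that escaped text such as `&lt;b&gt;` is not mistaken for markup after expansion. No Unicode Normalization Form is applied.

```python
import re
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Keeps character data and drops tags; the parser expands entities."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def normalize_source(markup: str) -> str:
    """Steps 1-3: code points, no markup, whitespace runs collapsed (no trim)."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    text = "".join(parser.chunks)
    # Assumption: only ASCII whitespace is collapsed; NBSP etc. are left alone.
    return re.sub(r"[ \t\n\r\f]+", " ", text)


def _is_mark(ch: str) -> bool:
    # General_Category Mn, Mc, or Me
    return unicodedata.category(ch).startswith("M")


def _skip_marks(text: str, i: int) -> int:
    """Move index i forward past any combining marks."""
    while i < len(text) and _is_mark(text[i]):
        i += 1
    return i


def select(text: str, start: int, end: int, context: int = 8) -> dict:
    """Steps 4-5: adjust boundaries, then extract exact/prefix/suffix.

    start and end are code point offsets into the normalized text.
    """
    # Pull the start back so `exact` begins on a base character.
    while 0 < start < len(text) and _is_mark(text[start]):
        start -= 1
    # Push the end forward so `suffix` does not begin with a mark.
    end = _skip_marks(text, max(end, start))
    # Shorten the prefix if it would begin with a mark.
    p_start = min(_skip_marks(text, max(0, start - context)), start)
    # Extend the suffix so it does not cut off trailing marks.
    s_end = _skip_marks(text, min(len(text), end + context))
    return {
        "exact": text[start:end],
        "prefix": text[p_start:start],
        "suffix": text[end:s_end],
    }
```

Note that this only covers the combining-mark half of step 4. Aligning with full grapheme cluster boundaries would need a segmentation algorithm that the Python standard library does not provide.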
Received on Saturday, 28 May 2016 21:42:28 UTC