Re: [web-annotation] Reference to text encoding in spec perhaps not appropriate from aphillips via GitHub on 2016-05-28 (public-annotation@w3.org from May 2016)

From: aphillips via GitHub <sysbot+gh@w3.org>
Date: Sat, 28 May 2016 21:42:27 +0000
To: public-annotation@w3.org
Message-ID: <issue_comment.created-222330988-1464471746-sysbot+gh@w3.org>
Per [I18N-ACTION-527](https://www.w3.org/International/track/actions/527),
I looked into the text in 4.2.5 and the related text elsewhere in the
current ED. Here's the current text:

> The text MUST be normalized before recording. Thus HTML/XML tags 
should be removed, character entities should be replaced with the 
character that they encode, and unnecessary whitespace should be 
normalized. The normalization routine may be performed automatically 
by a browser, and other applications should implement the DOM String 
Comparisons method. This allows the Selector to be used with different
 encodings and user agents and still have the same semantics and 
utility. The selection MUST be based on the logical order of the text,
 rather than the visual order, especially for bidirectional text. The 
normalized value MUST be recorded as UTF-8 in the JSON serialization 
of the Annotation. 

I also agree with @r12a's comment above: rather than a random list of 
potential operations, there should be a clearly defined set of 
operations. The problem is that the text has a normative-sounding 
_must_ in it, but then follows it with some random suggestions and some 
"should" text. It also includes a reference to DOM String. As an 
implementer, I would be confused about what exactly is required. 

If the concern here is whether to apply a Unicode Normalization Form, 
the WG's current position, as described in 
[Charmod-Norm](http://w3c.github.io/charmod-norm/), is *not* to apply a
Unicode Normalization Form to the text. In Charmod-Norm, pay 
particular attention to section 3.2. Please note that the annotation 
specs are of the "non-normalizing" type. 

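One practical consequence of a non-normalizing approach can be sketched as follows. This is an illustration only, using Python's standard `unicodedata` module, and the sample string is an assumption chosen for the demonstration. It shows that applying NFC changes the code point sequence, so offsets computed on NFC-normalized text no longer line up with the source:

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT (a decomposed "é").
text = "cafe\u0301 au lait"

# NFC composes the pair into the single code point U+00E9.
nfc_text = unicodedata.normalize("NFC", text)

# The same word sits at a different code point offset after NFC.
offset_in_source = text.index("au")
offset_in_nfc = nfc_text.index("au")

# A quote taken from the NFC form is not a substring of the source.
quote_from_nfc = nfc_text[:4]
found_in_source = quote_from_nfc in text
```
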
The text in DOM Strings referenced in the current text requires NFC 
and, additionally, requires _fully normalized_ and _include 
normalized_ checking. In summary, these normalization requirements are
 meant to prevent selections from starting with a combining mark. 
While noble in intent, in most implementations it is difficult for the
 user to select text that begins with a combining mark. I'm not 
convinced (although I could be) that requiring _fully normalized_ 
checking at the model level is helpful. If there is a reason to apply 
this checking, it should be explicitly stated in Annotation Model, not
 indirectly and obliquely through DOM Strings (where it will be 
misunderstood). In addition, by applying it to the text 
_normalization_ step, you miss the important point: the normalizing 
algorithm probably cannot adjust the boundaries between the `exact`, 
`prefix`, and `suffix` text. The best it can probably do is mutate the
 text to have an extra non-combining mark (generally an NBSP) at the 
start of the given segment of text, which probably does more to break 
the text than doing nothing at all. In that case, you'd be better off 
supplying a MUST requirement on the quote or position selector 
locations--or just noting the potential problem for implementers to 
try to avoid (but permitting non-fully-normalized quotes or 
positions).
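
A check on the selector locations themselves could look roughly like the sketch below. The function name `starts_with_combining_mark` is hypothetical, and treating "combining mark" as any code point in Unicode general category M (Mn, Mc, Me) is an assumption, not something the model defines:

```python
import unicodedata


def starts_with_combining_mark(segment: str) -> bool:
    """Return True if the first code point of segment is a combining mark.

    Assumption: "combining mark" means general category Mn, Mc, or Me.
    """
    if not segment:
        return False
    return unicodedata.category(segment[0]).startswith("M")


# A boundary placed between "e" and U+0301 leaves the mark at the
# start of the following segment.
source = "cafe\u0301"
prefix, exact = source[:4], source[4:]
needs_attention = starts_with_combining_mark(exact)
```

An implementation could use such a check either to reject the selection or simply to warn, depending on whether the requirement ends up as a MUST or as a note.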

In the editor's copy, I note that there is an addition of text 
referring to logical (vs. visual) order and also one mentioning the 
use of UTF-8 for the JSON serialization. In my opinion, both of these 
requirements are superfluous and should be removed. The eventual use 
of UTF-8 is already a requirement of JSON serialization (and text can 
also be `\u` escaped in JSON). It presents no actual requirement for 
implementers of Text Quote Selector. Similarly, it would be better to 
introduce logical encoding globally in the document, perhaps in the 
discussion of principles or by reference to 
[Charmod-Fundamentals](https://www.w3.org/TR/charmod/) and Charmod-Norm.
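
The point about `\u` escapes can be illustrated with Python's standard `json` module. This is a minimal sketch with an assumed sample value: the escaped and unescaped serializations decode to the same string, so the character encoding of the serialization places no extra burden on the selector itself.

```python
import json

quote = {"exact": "caf\u00e9"}

# Default output is pure ASCII, with non-ASCII characters \u-escaped.
escaped = json.dumps(quote)

# With ensure_ascii=False the character is written literally and is
# encoded as UTF-8 only when the document is turned into bytes.
literal = json.dumps(quote, ensure_ascii=False)
literal_bytes = literal.encode("utf-8")

# Both forms parse back to the same value.
same_value = json.loads(escaped) == json.loads(literal_bytes) == quote
```
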

I would suggest using this as a basis for a revision:

> The text MUST be normalized before recording by applying the 
following operations in this order to the source text:
> 1. Convert the source text to a sequence of Unicode code 
points, expanding character entities and escapes to Unicode.
> 2. Remove all markup, such as HTML or XML tags. _Question: what to 
do about dir?_
> 3. Normalize whitespace by collapsing each run of whitespace 
to a single ASCII space character (U+0020). Note that the text MAY 
begin or end with a space character. _i.e. no trim is implied_
> 4. Adjust the boundaries between `exact`, `prefix`, and `suffix` such 
that none of the three begins with a combining mark and, if possible, 
so that they coincide with grapheme boundaries.
> 5. Extract the `exact` and, if present, the `prefix` and `suffix` 
text.
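
The steps above can be sketched in Python using only the standard library. This is a rough illustration under several assumptions, not a reference implementation: the source is HTML, steps 1 and 2 are handled together by `html.parser` (which decodes character references while dropping tags), "whitespace" means whatever Python's Unicode-aware `\s` matches, and a combining mark is any code point in general category M. The names `normalize_source`, `extract_quote`, and the `context` parameter are hypothetical. Grapheme boundaries beyond the combining-mark rule are not handled.

```python
import re
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text content, discarding tags (steps 1 and 2)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # expand entities
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def normalize_source(markup: str) -> str:
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    text = "".join(parser.parts)
    # Step 3: collapse each whitespace run to U+0020, without trimming.
    return re.sub(r"\s+", " ", text)


def _is_mark(ch: str) -> bool:
    return unicodedata.category(ch).startswith("M")


def _back_to_base(text: str, i: int) -> int:
    # Move a segment start left until it no longer sits on a mark.
    while 0 < i < len(text) and _is_mark(text[i]):
        i -= 1
    return i


def _past_marks(text: str, i: int) -> int:
    # Move a segment end right so trailing marks stay with their base.
    while i < len(text) and _is_mark(text[i]):
        i += 1
    return i


def extract_quote(text: str, start: int, end: int, context: int = 8) -> dict:
    # Step 4: adjust boundaries so no segment begins with a mark.
    start = _back_to_base(text, start)
    end = _past_marks(text, end)
    prefix_start = _back_to_base(text, max(0, start - context))
    suffix_end = min(len(text), end + context)
    # Step 5: extract the three segments.
    return {
        "prefix": text[prefix_start:start],
        "exact": text[start:end],
        "suffix": text[end:suffix_end],
    }
```

For example, `normalize_source("<p>Caf&eacute;  au\n <b>lait</b></p>")` yields `Café au lait`, and `extract_quote("cafe\u0301 au lait", 4, 8)` moves the start back one code point so that `exact` begins with the base letter `e` rather than with U+0301.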



-- 
GitHub Notification of comment by aphillips
Please view or discuss this issue at 
https://github.com/w3c/web-annotation/issues/227#issuecomment-222330988
 using your GitHub account
Received on Saturday, 28 May 2016 21:42:28 UTC