Re: [web-annotation] Reference to text encoding in spec perhaps not appropriate from aphillips via GitHub on 2016-05-30 (public-annotation@w3.org from May 2016)

From: aphillips via GitHub <sysbot+gh@w3.org>
Date: Mon, 30 May 2016 15:04:01 +0000
To: public-annotation@w3.org
Message-ID: <issue_comment.created-222510692-1464620640-sysbot+gh@w3.org>

I don't believe, @iherman, that this is what the snippeting is used 
for. Or at least, there exist a number of use cases where the text 
is later presented to a human and not merely used by a machine for 
comparison.

When thinking about this problem, I have to admit that I was thinking 
about use cases from my day job, where I have been involved in 
actually implementing annotations and snippeting (but which I can't 
talk about here). One snippeting process is capturing user highlights 
outside a document ("scrapbooking"). That, along with the case shown 
in the document, is a situation where you want not just the positions 
but also a copy of the text.

@fsasaki: I don't necessarily agree. Removing markup from the quoted 
text, regardless of the source format, is desirable, since you don't 
want markup in the plain text. I'm very mindful that not just *ML is a 
target here. While PDF content doesn't contain markup generally, 
other non-HTML/XML content types that might appear in a Web context 
would also want their markup removed when quoting. For example, you 
wouldn't want WebVTT, CSV, or RTF markup in the snippets either. I 
think the goal is to only present the user-facing text.
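
As a rough illustration of "only present the user-facing text", here 
is a minimal sketch for the HTML case only, using Python's standard 
`html.parser`. The names `_TextCollector` and `user_facing_text` are 
hypothetical, and skipping `script`/`style` content is an assumption 
about what counts as user-facing. Other formats (WebVTT, CSV, RTF) 
would each need their own extractor.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects character data and drops tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when it is not inside a skipped element.
        if not self._skip_depth:
            self.parts.append(data)


def user_facing_text(html):
    """Return the text content of an HTML fragment with markup removed."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()  # flush any buffered text
    return "".join(collector.parts)
```

Note that this step does no whitespace handling at all; it only 
removes the markup.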

I also recognize that whitespace normalization would destroy "layout" 
such as that represented by `pre`. I think this is expected. If one 
wants document fidelity, use text positions and extract the layout, 
not just the plain text. The problem is that, once you get into doing _some_ 
whitespace normalization, you can't just leave it up to the 
implementation to decide. Some will send spaces while some won't. 
Later comparison, such as @iherman suggests, is more difficult if the 
normalization isn't, er, normalized.

@duerst: the algorithm doesn't introduce any spaces that weren't 
already there in the text. If the original text contains spaces, those 
spaces will remain (with collapsing) in the final text. The one 
exception is line/paragraph breaks: these would become a space with 
the proposed text as written. It's a valid question whether line 
breaks should be converted to a space.
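
The behavior described above (collapse runs, convert line breaks to a 
space, no trim) can be sketched as follows. This is one assumed 
reading, not the proposed spec text itself: the function name 
`collapse_whitespace` is hypothetical, and the character set (ASCII 
space, tab, LF, CR, FF) is an assumption.

```python
import re

# Assumed whitespace set: space, tab, line feed, carriage return, form feed.
_WS_RUN = re.compile(r"[ \t\n\r\f]+")


def collapse_whitespace(text):
    """Replace each run of whitespace with one space; do not trim."""
    return _WS_RUN.sub(" ", text)


print(repr(collapse_whitespace("a  b\n\nc")))  # 'a b c'
print(repr(collapse_whitespace("  a ")))       # ' a '
print(repr(collapse_whitespace("日本\n語")))    # '日本 語'
```

The last call shows the open question: a line break inside Japanese 
text turns into a space in the output.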

That said, not performing trim helps East Asian texts because it 
prevents implementers from _introducing_ spaces algorithmically when 
reassembling text later.

I must admit that I'm curious why Text Quote Selectors exist without 
reference to position. If they were a special case of Text Position 
Selector, wouldn't that work more reliably? After all, some texts are 
highly repetitive. Without a position number, the quote selector might 
match many places in the source document.
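
To make the ambiguity concrete, here is a small sketch that finds 
every occurrence of a quoted string. The function name `find_quote` 
and the sample sentence are made up for illustration; this is a plain 
substring search, not how any particular selector implementation 
works.

```python
def find_quote(source, exact):
    """Return the start offset of every occurrence of `exact` in `source`."""
    if not exact:
        return []
    offsets = []
    start = source.find(exact)
    while start != -1:
        offsets.append(start)
        # Advance by one so overlapping occurrences are also found.
        start = source.find(exact, start + 1)
    return offsets


source = "the cat sat on the mat near the door"
print(find_quote(source, "the"))      # [0, 15, 28]
print(find_quote(source, "the mat"))  # [15]
```

With only the quote "the", there are three candidate matches and 
nothing to choose between them; a start position of 15 (or more 
surrounding text) picks out a single one.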

-- 
GitHub Notification of comment by aphillips
Please view or discuss this issue at 
https://github.com/w3c/web-annotation/issues/227#issuecomment-222510692
 using your GitHub account

Received on Monday, 30 May 2016 15:04:03 UTC