- From: aphillips via GitHub <sysbot+gh@w3.org>
- Date: Mon, 30 May 2016 15:04:01 +0000
- To: public-annotation@w3.org
I don't believe, @iherman, that this is what the snippeting is used
for. Or at least, there are a number of use cases where the text
is later presented to a human and not merely used by a machine for
comparison.
When thinking about this problem, I have to admit that I was thinking
about use cases from my day job, where I have been involved in
actually implementing annotations and snippeting (but which I can't
talk about here). One snippeting process is capturing user highlights
outside a document ("scrapbooking"). Like the case shown in the
document, this is one where you want not just the positions but also
a copy of the text.
@fsasaki: I don't necessarily agree. Removing markup from the quoted
text, regardless of the source format, is desirable, since you don't
want markup in the plain text. I'm very mindful that *ML is not the
only target here. While PDF content generally doesn't contain markup,
other non-HTML/XML content types that might appear in a Web context
would also want their markup removed when quoting. For example, you
wouldn't want WebVTT, CSV, or RTF markup in the snippets either. I
think the goal is to only present the user-facing text.
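
As a rough illustration of what "only the user-facing text" could mean for HTML input, here is a minimal Python sketch built on the standard library's `html.parser`. The names `TextOnly` and `user_facing_text` are made up for this example, and the choice to skip `script` and `style` content is an assumption, not something the proposal specifies. Other formats (WebVTT, CSV, RTF) would each need their own extractor.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects character data and drops all tags (hypothetical helper)."""

    # Assumption: content of these elements is never shown to the user.
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def user_facing_text(html):
    """Return the text content of an HTML fragment with markup removed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)
```

Note that this only removes markup. Whitespace in the result is left exactly as it was in the source, so normalization remains a separate step.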
I also recognize that whitespace normalization would destroy "layout"
such as represented by `pre`. I think this is expected. If one wants
document fidelity, use text positions and extract the layout, not just
the plain text. The problem is that, once you get into doing _some_
whitespace normalization, you can't just leave it up to the
implementation to decide. Some will send spaces while some won't.
Later comparison, such as @iherman suggests, is more difficult if the
normalization isn't, er, normalized.
@duerst: the algorithm doesn't introduce any spaces that weren't
already there in the text. If the original text contains spaces, those
spaces will remain (with collapsing) in the final text. The one
exception is line/paragraph breaks: these would become a space with
the proposed text as written. It's a valid question whether line
breaks should be converted to space.
That said, not performing trim helps East Asian texts because it
prevents implementers from _introducing_ spaces algorithmically when
reassembling text later.
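
The behavior described above (collapse runs of whitespace, turn line breaks into a space, do not trim) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the proposed spec text: the function name `collapse_whitespace` is invented here, and "whitespace" is taken to be whatever Python's `\s` matches for `str`, which includes Unicode spaces such as U+3000.

```python
import re

# Assumption: a "whitespace run" is one or more characters matched by \s.
# For str patterns this covers spaces, tabs, line breaks, and other
# Unicode whitespace (including U+3000 IDEOGRAPHIC SPACE).
_WS_RUN = re.compile(r"\s+")


def collapse_whitespace(text):
    """Replace each whitespace run with a single U+0020. No trimming."""
    return _WS_RUN.sub(" ", text)


# Leading and trailing whitespace is collapsed but not removed.
print(repr(collapse_whitespace("  foo \n\n bar\t")))  # ' foo bar '

# Text without whitespace is returned unchanged: no spaces are introduced.
print(collapse_whitespace("日本語テキスト"))  # 日本語テキスト

# A line break inside the text does become a space, which is the open
# question raised above.
print(collapse_whitespace("日本\n語"))  # 日本 語
```

If every implementation applied the same function before sending or comparing a quote, exact string comparison would work again, which is the point of specifying the normalization rather than leaving it to implementers.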
I must admit that I'm curious why Text Quote Selectors exist without
reference to position. If they were a special case of Text Position
Selector, wouldn't that work more reliably? After all, some texts are
highly repetitive. Without a position number, the quote selector might
match many places in the source document.
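
To make the ambiguity concrete, here is a small Python sketch. It is a hypothetical illustration, not how any selector is actually specified: `find_all` lists every offset at which a quote occurs, and `pick_nearest` shows how a position hint could choose among them. Both function names and the "nearest offset wins" rule are assumptions made for this example.

```python
def find_all(source, quote):
    """Return the start offset of every occurrence of quote in source."""
    if not quote:
        return []
    offsets = []
    start = source.find(quote)
    while start != -1:
        offsets.append(start)
        # Step forward by one so overlapping occurrences are found too.
        start = source.find(quote, start + 1)
    return offsets


def pick_nearest(offsets, hint):
    """Pick the candidate offset closest to a position hint (assumed rule)."""
    if not offsets:
        return None
    return min(offsets, key=lambda offset: abs(offset - hint))


text = "the cat sat on the mat near the door"
candidates = find_all(text, "the")
print(candidates)                    # [0, 15, 28]
print(pick_nearest(candidates, 14))  # 15
```

With the quote alone there are three equally valid matches. With even an approximate position, the intended one can be singled out.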
--
GitHub Notification of comment by aphillips
Please view or discuss this issue at
https://github.com/w3c/web-annotation/issues/227#issuecomment-222510692
using your GitHub account
Received on Monday, 30 May 2016 15:04:03 UTC