- From: aphillips via GitHub <sysbot+gh@w3.org>
- Date: Mon, 30 May 2016 15:04:01 +0000
- To: public-annotation@w3.org
I don't believe, @iherman, that this is what the snippeting is used for. Or at least, there are a number of use cases where the text is later presented to a human and not merely used by a machine for comparison. When thinking about this problem, I have to admit that I was thinking about use cases from my day job, where I have been involved in actually implementing annotations and snippeting (but which I can't talk about here). One snippeting process is capturing user highlights outside a document ("scrapbooking"). Along with the case shown in the document, this is where you don't just want the positions, but also a copy of the text.

@fsasaki: I don't necessarily agree. Removing markup from the quoted text, regardless of the source format, is desirable, since you don't want markup in the plain text. I'm very mindful that not just *ML is a target here. While PDF content generally doesn't contain markup, other non-HTML/XML content types that might appear in a Web context would also want their markup removed when quoting. For example, you wouldn't want WebVTT, CSV, or RTF markup in the snippets either. I think the goal is to present only the user-facing text.

I also recognize that whitespace normalization would destroy "layout" such as that represented by `pre`. I think this is expected. If one wants document fidelity, use text positions and extract the layout, not just the plain text. The problem is that, once you get into doing _some_ whitespace normalization, you can't just leave it up to the implementation to decide. Some will send spaces while some won't. Later comparison, such as @iherman suggests, is more difficult if the normalization isn't, er, normalized.

@duerst: the algorithm doesn't introduce any spaces that weren't already there in the text. If the original text contains spaces, those spaces will remain (with collapsing) in the final text. The one exception is line/paragraph breaks: these would become a space with the proposed text as written.
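
As a rough illustration of the behavior described above, here is a minimal sketch, not the proposed spec text. It assumes that "whitespace" means whatever Python's `\s` matches (which includes line and paragraph breaks) and that markup has already been stripped. The name `normalize_quote` is hypothetical.

```python
import re

# Assumption: a "whitespace run" is one or more characters matched by \s,
# which covers spaces, tabs, and line/paragraph breaks.
_WS_RUN = re.compile(r"\s+")


def normalize_quote(text: str) -> str:
    """Collapse each whitespace run to a single space, without trimming."""
    return _WS_RUN.sub(" ", text)


# Existing spaces are kept (collapsed); leading/trailing ones are not trimmed.
print(repr(normalize_quote("  foo \n\n bar\t")))

# No space is introduced where the source had none...
print(repr(normalize_quote("日本語テキスト")))

# ...except that a line break becomes a space.
print(repr(normalize_quote("日本語\nテキスト")))
```

The last call shows the one case where a space appears that was not literally a space in the source: the line break.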
It's a valid question whether line breaks should be converted to a space. That said, not performing trim helps East Asian texts because it prevents implementers from _introducing_ spaces algorithmically when reassembling text later.

I must admit that I'm curious why Text Quote Selectors exist without reference to position. If they were a special case of Text Position Selector, wouldn't that work more reliably? After all, some texts are highly repetitive. Without a position number, the quote selector might match many places in the source document.

--
GitHub Notification of comment by aphillips

Please view or discuss this issue at https://github.com/w3c/web-annotation/issues/227#issuecomment-222510692 using your GitHub account
Received on Monday, 30 May 2016 15:04:03 UTC