Re: HTML tags, normalizations and text selectors from Robert Sanderson on 2013-05-24 (public-openannotation@w3.org from May 2013)

From: Robert Sanderson <azaroth42@gmail.com>
Date: Fri, 24 May 2013 09:01:25 -0600
To: Paolo Ciccarese <paolo.ciccarese@gmail.com>
Cc: public-openannotation <public-openannotation@w3.org>
Message-ID: <CABevsUES6ZDoNNbFkpMSo_mi1CU_uDLda4s3TADFqUddMwPQzg@mail.gmail.com>

Tags and Character Entities in text/plain:  No, as text/plain does not have
the concept of tags or character entities.  The documentation should make
it clearer that those rules only apply to formats that have those concepts
(notably SGML/XML derivatives).
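
As a rough illustration of that rule (a sketch only, not anything taken from
the spec): the function name selector_text and the MARKUP_TYPES set below are
made up for this example, and the regex-based tag removal is a naive stand-in
for a real parser.

```python
import html
import re

# Hypothetical list of media types that have tags and character entities.
MARKUP_TYPES = {"text/html", "application/xhtml+xml", "application/xml", "text/xml"}


def selector_text(content: str, media_type: str) -> str:
    """Return the text a text selector would be evaluated against."""
    if media_type in MARKUP_TYPES:
        # Naive tag removal followed by entity decoding.
        # A real implementation would use a proper HTML/XML parser.
        without_tags = re.sub(r"<[^>]+>", "", content)
        return html.unescape(without_tags)
    # text/plain has no tags or entities: "<b>" and "&amp;" are just characters.
    return content


sample = "<b>A &amp; B</b>"
print(selector_text(sample, "text/html"))   # tags stripped, entity decoded
print(selector_text(sample, "text/plain"))  # left untouched
```

The same byte sequence gives different selector text depending on the media
type it was served as, which is the distinction being drawn above.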

Whitespace in text/plain: Yes, as text/plain has the concept of whitespace.
 So whether you see "Two  spaces  between  words" in html or in plain text,
it should be treated as "Two spaces between words" using
TextPositionSelector.  If you want to preserve those spaces, then use
DataPositionSelector (or &nbsp; or the unicode equivalent).
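
To make the whitespace case concrete, here is a minimal sketch. It assumes
that "normalizing" means collapsing runs of ordinary ASCII whitespace into a
single space, and that a no-break space (U+00A0) is left alone; the helper
names normalize_whitespace and text_position are hypothetical, not part of
the spec.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of space, tab, CR and LF into one space.

    U+00A0 (no-break space) is deliberately not matched, so it survives.
    """
    return re.sub(r"[ \t\r\n]+", " ", text)


def text_position(text: str, quote: str):
    """Return (start, end) character offsets of quote in the normalized text,
    or None if the quote is not found."""
    normalized = normalize_whitespace(text)
    needle = normalize_whitespace(quote)
    start = normalized.find(needle)
    if start == -1:
        return None
    return (start, start + len(needle))


raw = "Two  spaces  between  words"
print(normalize_whitespace(raw))      # Two spaces between words
print(text_position(raw, "between"))  # (11, 18), offsets in normalized text
print(raw.find("between"))            # 13, offset in the exact representation
```

The offsets differ (11 versus 13) because the normalized text has lost two
spaces before "between"; an offset into the exact representation is the kind
of thing DataPositionSelector would be used for.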

I agree that maintaining the exact representation is an important use case,
and one that is covered already, I think :)

Rob





On Fri, May 24, 2013 at 8:35 AM, Paolo Ciccarese
<paolo.ciccarese@gmail.com> wrote:

> Some additional thoughts after the discussion thread
> http://lists.w3.org/Archives/Public/public-openannotation/2013May/0042.html
>
> Currently both selectors in the spec ask for removal of HTML/XML tags.
> That is also what I do in most of my applications as the goal is to
> annotate 'content'. And it works pretty well for HTML<->PDF.
>
> However, OA has a broader scope.
>
> I was wondering if tags have to be removed when I get a document of type
> 'text/plain'.  I personally don't remove tags in the 'text/plain'
> representation of a web page as, in that format, I don't see them as tags.
> Of course it comes down to the fact that I annotate content within the
> browser and the browser does not see those as tags.
>
> Vice versa, if I get the content as text/html or application/xml I can use
> fragment selectors to point to specific elements. But I might still resort
> to plain text for very specific reasons.
>
> Now, does the 'HTML/XML tags should be removed' (in
> http://www.openannotation.org/spec/core/specific.html#TextQuoteSelector and
> http://www.openannotation.org/spec/core/specific.html#TextPositionSelector)
> apply to 'text/plain'? If the Selectors and the State have to be
> orthogonal, if the implementation does not matter and given the current
> description, it probably should.
>
> Also, what happens with whitespace for 'text/plain' vs 'text/html'?
> In the former does it make sense to normalize? Should that be an option
> (normalization: on/off) or would that make things more complicated?
>
> In summary, the two existing text selectors are tailored for text documents
> that are accessed through the DOM… and we care about the pure content. So,
> if I want to record that I've annotated an HTML/XML page negotiated as
> plain text - recorded in the State - or better accessed as a file, it
> currently cannot keep/count the HTML/XML tags as part of the selection. And
> that is an important use case for some (including myself).
>
> As a start, these are 'text selectors'. So if you're dealing with
> 'application/xml' you need to strip tags and normalize to get to the text.
> If you deal with 'text/html', the same applies to get to the textual
> content. In other words I would modify the statement 'HTML/XML tags should
> be removed'. Other formats might require similar normalization which is not
> covered by this specification.
>
> Then we can probably think of additional selectors?
>
> --
> Dr. Paolo Ciccarese
> http://www.paolociccarese.info/
> Biomedical Informatics Research & Development
> Instructor of Neurology at Harvard Medical School
> Assistant in Neuroscience at Mass General Hospital
> Member of the MGH Biomedical Informatics Core
> +1-857-366-1524 (mobile)   +1-617-768-8744 (office)
>

Received on Friday, 24 May 2013 15:01:53 UTC