
Re: HTML tags, normalizations and text selectors

From: David Wood <david@3roundstones.com>
Date: Fri, 24 May 2013 11:06:47 -0400
Cc: Paolo Ciccarese <paolo.ciccarese@gmail.com>, public-openannotation <public-openannotation@w3.org>
Message-Id: <85C43532-1B87-4B1B-928D-25B4019F426B@3roundstones.com>
To: Robert Sanderson <azaroth42@gmail.com>
On May 24, 2013, at 11:01, Robert Sanderson <azaroth42@gmail.com> wrote:

> 
> Tags and Character Entities in text/plain:  No, as text/plain does not have the concept of tags or character entities.  The documentation should make clearer that those rules only apply to formats that have those concepts (notably SGML/XML derivatives).

+1.  I wouldn't want HTML or XML tags stripped from my text/plain documents.

Regards,
Dave
--
http://about.me/david_wood

> 
> Whitespace in text/plain: Yes, as text/plain has the concept of whitespace.  So whether you see "Two  spaces  between  words" in HTML or in plain text, it should be treated as "Two spaces between words" using TextPositionSelector.  If you want to preserve those spaces, then use DataPositionSelector (or &nbsp; or the Unicode equivalent).
> 
> I agree that maintaining the exact representation is an important use case, and one that is covered already, I think :)
> 
> Rob
> 
> 
> 
> 
> 
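[Editorial sketch, not part of the original thread.] The whitespace rule described above can be illustrated in Python. This is a minimal sketch under stated assumptions: the helper names normalize_whitespace and text_position are hypothetical, offsets are taken to be zero-based with an exclusive end, and only ASCII whitespace is collapsed so that U+00A0 (the Unicode equivalent of &nbsp;) survives, as the message suggests it should.

```python
import re

# Assumption: only ASCII whitespace collapses. U+00A0 is deliberately
# excluded so that no-break spaces are preserved.
_COLLAPSIBLE = re.compile(r"[ \t\r\n\f]+")


def normalize_whitespace(text):
    """Collapse each run of collapsible whitespace into one space."""
    return _COLLAPSIBLE.sub(" ", text)


def text_position(text, quote):
    """Return (start, end) of quote within the normalized text.

    Both values are offsets into the normalized string; end is exclusive.
    Returns None when the quote does not occur.
    """
    normalized = normalize_whitespace(text)
    needle = normalize_whitespace(quote)
    start = normalized.find(needle)
    if start < 0:
        return None
    return (start, start + len(needle))


if __name__ == "__main__":
    print(normalize_whitespace("Two  spaces  between  words"))
    print(text_position("Two  spaces  between  words", "between"))
```

Because the offsets refer to the normalized string, the same pair would be obtained whether the source had one space or several between words, which is the behaviour the reply argues for.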
> On Fri, May 24, 2013 at 8:35 AM, Paolo Ciccarese <paolo.ciccarese@gmail.com> wrote:
> Some additional thoughts after the discussion thread http://lists.w3.org/Archives/Public/public-openannotation/2013May/0042.html
> 
> Currently both selectors in the spec ask for removal of HTML/XML tags. That is also what I do in most of my applications as the goal is to annotate 'content'. And it works pretty well for HTML<->PDF.
> 
> However, OA has a broader scope.
> 
> I was wondering if tags have to be removed when I get a document of type 'text/plain'.  I personally don't remove tags in the 'text/plain' representation of a web page as, in that format, I don't see them as tags. Of course it comes down to the fact that I annotate content within the browser and the browser does not see those as tags.
> 
> Vice versa, if I get the content as text/html or application/xml I can use fragment selectors to point to specific elements. But I might still resort to plain text for very specific reasons.
> 
> Now, does the 'HTML/XML tags should be removed' rule (in http://www.openannotation.org/spec/core/specific.html#TextQuoteSelector and http://www.openannotation.org/spec/core/specific.html#TextPositionSelector) apply to 'text/plain'? If the Selectors and the State have to be orthogonal, if the implementation does not matter and given the current description, it probably should.
> 
> Also, what happens with whitespace for 'text/plain' vs 'text/html'? In the former, does it make sense to normalize? Should that be an option (normalization: on/off), or would that make things more complicated?
> 
> In summary, the two existing text selectors are tailored for text documents that are accessed through the DOM and where we care about the pure content. So, if I want to record that I've annotated an HTML/XML page negotiated as plain text - recorded in the State - or better accessed as a file, I currently cannot keep/count the HTML/XML tags as part of the selection. And that is an important use case for some (including myself).
> 
> As a start, these are 'text selectors'. So if you are dealing with 'application/xml', you need to strip tags and normalize to get to the text. If you deal with 'text/html', the same applies to get to the textual content. In other words, I would modify the statement 'HTML/XML tags should be removed'.  Other formats might require similar normalization which is not covered by this specification.
> 
> Then we can probably think of additional selectors?
> 
> -- 
> Dr. Paolo Ciccarese
> http://www.paolociccarese.info/
> Biomedical Informatics Research & Development
> Instructor of Neurology at Harvard Medical School
> Assistant in Neuroscience at Mass General Hospital
> Member of the MGH Biomedical Informatics Core
> +1-857-366-1524 (mobile)   +1-617-768-8744 (office)
> 
> 
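[Editorial sketch, not part of the original thread.] The position both messages converge on, that tag and entity handling depends on the media type while whitespace handling does not, could be sketched as follows. Everything here is an assumption for illustration: the function name selectable_text, the set of media types treated as markup, and the use of Python's lenient html.parser for XML as well as HTML (a real implementation would likely use a proper XML parser and might skip script or style content).

```python
import re
from html.parser import HTMLParser

# Assumption: these media types "have the concept of tags"; text/plain does not.
_MARKUP_TYPES = {"text/html", "application/xhtml+xml", "application/xml"}

# Assumption: only ASCII whitespace collapses; U+00A0 is left untouched.
_COLLAPSIBLE = re.compile(r"[ \t\r\n\f]+")


class _TextCollector(HTMLParser):
    """Collects character data, dropping tags and decoding entities."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def selectable_text(content, media_type):
    """Return the text a text selector would address for this media type."""
    if media_type in _MARKUP_TYPES:
        collector = _TextCollector()
        collector.feed(content)
        collector.close()
        content = "".join(collector.parts)
    # Whitespace is normalized for every type, including text/plain.
    return _COLLAPSIBLE.sub(" ", content)


if __name__ == "__main__":
    source = "<p>Tags &amp;  entities</p>"
    print(selectable_text(source, "text/html"))
    print(selectable_text(source, "text/plain"))
```

With the same bytes served as text/plain, the angle brackets and "&amp;" stay in the selectable text and count toward offsets, which is the use case raised in the original question.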




Received on Friday, 24 May 2013 15:07:10 UTC