Re: HTML tags, normalizations and text selectors from David Wood on 2013-05-24 (public-openannotation@w3.org from May 2013)

From: David Wood <david@3roundstones.com>
Date: Fri, 24 May 2013 11:45:09 -0400
To: Robert Sanderson <azaroth42@gmail.com>
Cc: Paolo Ciccarese <paolo.ciccarese@gmail.com>, public-openannotation <public-openannotation@w3.org>
Message-Id: <242D866F-BDC0-41B4-B7E4-2F09EDB2E16E@3roundstones.com>
On May 24, 2013, at 11:16, Robert Sanderson <azaroth42@gmail.com> wrote:

> 
> Thanks David.  Did you have any thoughts on the whitespace issue, which I think is more contentious?  I recall that we had a conversation about this issue at the W3C eBook meeting in February.
> 
> It might be clearer if the documentation for the Selectors included the rationale behind them, so that Text*Selector said that it was exclusively about the content and tries to be format agnostic in order to permit the same selection (and selector class) to be used in a meaningful way in multiple situations.


It depends on who you wish to inconvenience :)  Arbitrary whitespace anywhere around or in a quotation inconveniences implementors, but is potentially better for users, especially across various serializations of a document.  I always lean toward helping users, even though I hate that as an implementor.
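
To make the trade-off concrete, here is a rough sketch of the whitespace handling being discussed. It assumes "normalization" means collapsing each run of ordinary whitespace to a single space and trimming the ends, and that non-breaking spaces are left alone (per Rob's note about &nbsp;). The function names are hypothetical and do not come from the OA spec.

```python
import re

# Ordinary whitespace only; U+00A0 (non-breaking space) is deliberately excluded.
_WS_CHARS = " \t\r\n\f"
_WS_RUN = re.compile("[" + re.escape(_WS_CHARS) + "]+")


def normalize_whitespace(text: str) -> str:
    """Collapse each run of ordinary whitespace to one space and trim the ends."""
    return _WS_RUN.sub(" ", text).strip(_WS_CHARS)


def find_normalized(document: str, quote: str) -> int:
    """Return the offset of the quote within the normalized document, or -1."""
    return normalize_whitespace(document).find(normalize_whitespace(quote))
```

The cost falls on the implementor: offsets are counted against the normalized text, so every client has to apply the same normalization before counting.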


> 
> Another example to consider would be markdown or the various wiki syntaxes.  Should the client process them into the expected human readable form before counting the characters, even though the syntax is not strictly "tags" or "character entities"?  I would say that it should; for example, you wouldn't count "---" as three characters but would treat it the same as <hr/>.   There might be a more understandable, but less prescriptive, way to define the normalization in those sorts of terms?


That's an ugly corner case, Robert, and one I would love to pretend didn't exist.  *sigh*  How does one know that a document has markdown in it and isn't just arbitrary text?  The only answer is a priori context.  There is no MIME type or file extension for it.  Yicks.
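
A rough sketch of what "a priori context" means in practice, assuming the client picks its normalization from the declared media type alone. The names `selectable_text` and `MARKUP_TYPES` are hypothetical, and only text/html is handled here (XML would need a real XML parser). Anything not declared as markup, including markdown served as text/plain, passes through untouched.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects character data and drops tags; entities are decoded."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Media types for which tags and character entities are meaningful (assumption).
MARKUP_TYPES = {"text/html"}


def selectable_text(body: str, media_type: str) -> str:
    """Return the text a text selector would count, given the declared type."""
    if media_type in MARKUP_TYPES:
        parser = _TextExtractor()
        parser.feed(body)
        parser.close()
        return "".join(parser.chunks)
    # text/plain and unknown types: no concept of tags, so nothing is stripped.
    return body
```

With this, "---" in a text/plain body stays three characters, because nothing tells the client that it was meant as a horizontal rule.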

Regards,
Dave
--
http://about.me/david_wood


> 
> Rob
> 
> 
> 
> On Fri, May 24, 2013 at 9:06 AM, David Wood <david@3roundstones.com> wrote:
> On May 24, 2013, at 11:01, Robert Sanderson <azaroth42@gmail.com> wrote:
> 
>> 
>> Tags and Character Entities in text/plain:  No, as text/plain does not have the concept of tags or character entities.  The documentation should make clearer that those rules only apply to formats that have those concepts (notably SGML/XML derivatives).
> 
> +1.  I wouldn't want HTML or XML tags stripped from my text/plain documents.
> 
> Regards,
> Dave
> --
> http://about.me/david_wood
> 
>> 
>> Whitespace in text/plain: Yes, as text/plain has the concept of whitespace.  So whether you see "Two  spaces  between  words" in HTML or in plain text, it should be treated as "Two spaces between words" using TextPositionSelector.  If you want to preserve those spaces, then use DataPositionSelector (or &nbsp; or the Unicode equivalent).
>> 
>> I agree that maintaining the exact representation is an important use case, and one that is covered already, I think :)
>> 
>> Rob
>> 
>> 
>> 
>> 
>> 
>> On Fri, May 24, 2013 at 8:35 AM, Paolo Ciccarese <paolo.ciccarese@gmail.com> wrote:
>> Some additional thoughts after the discussion thread http://lists.w3.org/Archives/Public/public-openannotation/2013May/0042.html
>> 
>> Currently both selectors in the spec ask for removal of HTML/XML tags. That is also what I do in most of my applications as the goal is to annotate 'content'. And it works pretty well for HTML<->PDF.
>> 
>> However, OA has a broader scope.
>> 
>> I was wondering if tags have to be removed when I get a document of type 'text/plain'.  I personally don't remove tags in the 'text/plain' representation of a web page as, in that format, I don't see them as tags. Of course it comes down to the fact that I annotate content within the browser and the browser does not see those as tags.
>> 
>> Vice versa if I get the content as text/html or application/xml I can use fragments selectors to point to specific elements. But I might still resort to plain text for very specific reasons.
>> 
>> Now, does the 'HTML/XML tags should be removed' (in http://www.openannotation.org/spec/core/specific.html#TextQuoteSelector and http://www.openannotation.org/spec/core/specific.html#TextPositionSelector) apply to 'text/plain'? If the Selectors and the State have to be orthogonal, if the implementation does not matter and given the current description, it probably should.
>> 
>> Also, what happens with whitespace for 'text/plain' vs 'text/html'? In the former does it make sense to normalize? Should that be an option (normalization: on/off) or would that make things more complicated?
>> 
>> In summary, the two existing text selectors are tailored for text documents that are accessed through the DOM… and we care about the pure content. So, if I want to record that I've annotated an HTML/XML page negotiated as plain text - recorded in the State - or better accessed as a file, I currently cannot keep/count the HTML/XML tags as part of the selection. And that is an important use case for some (including myself).
>> 
>> As a start, these are 'text selectors'. So if you're dealing with 'application/xml' you need to strip tags and normalize to get to the text. If you deal with 'text/html', the same applies to get to the textual content. In other words I would modify the statement 'HTML/XML tags should be removed'.  Other formats might require similar normalization which is not covered by this specification.
>> 
>> Then we can probably think of additional selectors?
>> 
>> -- 
>> Dr. Paolo Ciccarese
>> http://www.paolociccarese.info/
>> Biomedical Informatics Research & Development
>> Instructor of Neurology at Harvard Medical School
>> Assistant in Neuroscience at Mass General Hospital
>> Member of the MGH Biomedical Informatics Core
>> +1-857-366-1524 (mobile)   +1-617-768-8744 (office)
>> 
>> 
> 
>
Received on Friday, 24 May 2013 15:45:38 UTC