Re: HTML tags, normalizations and text selectors from Robert Sanderson on 2013-05-24 (public-openannotation@w3.org from May 2013)

From: Robert Sanderson <azaroth42@gmail.com>
Date: Fri, 24 May 2013 09:16:11 -0600
To: David Wood <david@3roundstones.com>
Cc: Paolo Ciccarese <paolo.ciccarese@gmail.com>, public-openannotation <public-openannotation@w3.org>
Message-ID: <CABevsUEKg9bhRxnmoVz+yZpJbTynBnhG-Uzi3mrAnVkQcsY1sA@mail.gmail.com>
Thanks David.  Did you have any thoughts on the whitespace issue, which I
think is more contentious?  I recall that we had a conversation about this
issue at the W3C eBook meeting in February.

It might be clearer if the documentation for the Selectors included the
rationale behind them, so that Text*Selector said that it was exclusively
about the content and tries to be format agnostic in order to permit the
same selection (and selector class) to be used in a meaningful way in
multiple situations.
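
As an illustration only (nothing below is taken from the spec), here is a
minimal Python sketch of what "format agnostic" could mean in practice.  The
function names normalize_html, normalize_plain and position_of are made up
for this example, the regex-based tag removal is a crude stand-in for a real
DOM, and the whitespace rule is the one under discussion in this thread, not
a settled one.

```python
import html
import re

# Assumption: only ordinary ASCII whitespace is collapsed, so a
# non-breaking space (U+00A0) is left alone.
_WS = " \t\r\n\f"

def collapse_whitespace(text):
    # Runs of whitespace become one space; leading/trailing runs are trimmed.
    return re.sub("[" + _WS + "]+", " ", text).strip(_WS)

def normalize_html(markup):
    # Crude tag removal for illustration; a real client would walk the DOM.
    without_tags = re.sub(r"<[^>]+>", "", markup)
    # Replace character entities (e.g. &amp;) with the characters they denote.
    return collapse_whitespace(html.unescape(without_tags))

def normalize_plain(text):
    # text/plain has no tags or entities, so only whitespace is normalized.
    return collapse_whitespace(text)

def position_of(quote, normalized):
    # Start/end character offsets, in the style of a TextPositionSelector.
    start = normalized.find(quote)
    if start == -1:
        return None
    return (start, start + len(quote))

html_doc = "<p>Two  spaces <b>between</b>  words &amp; more</p>"
plain_doc = "Two  spaces  between  words & more"

html_text = normalize_html(html_doc)
plain_text = normalize_plain(plain_doc)
```

Under these assumptions both representations reduce to the same string, so
position_of("between words", ...) yields the same offsets for either one.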

Another example to consider would be markdown or the various wiki syntaxes.
 Should the client process them into the expected human readable
form before counting the characters, even though the syntax is not strictly
"tags" or "character entities"?  I would say that it should: for example,
you wouldn't count "---" as three characters, but treat it the same as <hr/>.   There
might be a more understandable, but less prescriptive, way to define the
normalization in those sorts of terms?
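
A rough sketch of that idea in Python, under loud assumptions: the three
rules below (horizontal rule, "#" headings, emphasis markers) are a
simplification chosen for this example, not a Markdown parser and not
anything the spec defines, and markdown_to_text is a hypothetical name.

```python
import re

def markdown_to_text(source):
    lines = []
    for line in source.splitlines():
        # A line of three or more dashes is treated as a rule, like <hr/>,
        # so it contributes no characters.  (Assumption: setext heading
        # underlines are not distinguished from rules here.)
        if re.fullmatch(r"\s*-{3,}\s*", line):
            continue
        # Drop leading heading markers such as "## ".
        line = re.sub(r"^#{1,6}\s+", "", line)
        # Drop paired emphasis markers: **text**, *text*, _text_.
        # (Assumption: this naive rule would also mangle snake_case_words.)
        line = re.sub(r"(\*\*|\*|_)(.+?)\1", r"\2", line)
        lines.append(line)
    # Same whitespace collapsing as discussed for the other formats.
    return re.sub(r"[ \t\r\n\f]+", " ", " ".join(lines)).strip()

sample = "# Title\n\nSome *marked* text\n\n---\n\nAfter the rule"
readable = markdown_to_text(sample)
```

With this sample, the "---" line adds nothing to the character count, and
offsets would be computed against the readable form rather than the source.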

Rob



On Fri, May 24, 2013 at 9:06 AM, David Wood <david@3roundstones.com> wrote:

> On May 24, 2013, at 11:01, Robert Sanderson <azaroth42@gmail.com> wrote:
>
>
> Tags and Character Entities in text/plain:  No, as text/plain does not
> have the concept of tags or character entities.  It should be clearer in
> the documentation that those rules only apply to formats that have those
> concepts (notably sgml/xml derivatives).
>
>
> +1.  I wouldn't want HTML or XML tags stripped from my text/plain
> documents.
>
> Regards,
> Dave
> --
> http://about.me/david_wood
>
>
> Whitespace in text/plain: Yes, as text/plain has the concept of
> whitespace.  So whether you see "Two  spaces  between  words" in html or in
> plain text, it should be treated as "Two spaces between words" using
> TextPositionSelector.  If you want to preserve those spaces, then use
> DataPositionSelector (or &nbsp; or the unicode equivalent).
>
> I agree that maintaining the exact representation is an important use
> case, and one that is covered already, I think :)
>
> Rob
>
>
>
>
>
> On Fri, May 24, 2013 at 8:35 AM, Paolo Ciccarese <
> paolo.ciccarese@gmail.com> wrote:
>
>> Some additional thoughts after the discussion thread
>> http://lists.w3.org/Archives/Public/public-openannotation/2013May/0042.html
>>
>> Currently both selectors in the spec ask for removal of HTML/XML tags.
>> That is also what I do in most of my applications as the goal is to
>> annotate 'content'. And it works pretty well for HTML<->PDF.
>>
>> However, OA has a broader scope.
>>
>> I was wondering if tags have to be removed when I get a document of type
>> 'text/plain'.  I personally don't remove tags in the 'text/plain'
>> representation of a web page as, in that format, I don't see them as tags.
>> Of course it goes down to the fact that I annotate content within the
>> browser and the browser does not see those as tags.
>>
>> Vice versa if I get the content as text/html or application/xml I can use
>> fragments selectors to point to specific elements. But I might still resort
>> to plain text for very specific reasons.
>>
>> Now, does the 'HTML/XML tags should be removed' (in
>> http://www.openannotation.org/spec/core/specific.html#TextQuoteSelector and
>> http://www.openannotation.org/spec/core/specific.html#TextPositionSelector)
>> apply to 'text/plain'? If the Selectors and the State have to be
>> orthogonal, if the implementation does not matter and given the current
>> description, it probably should.
>>
>> Also, what about whitespace for 'text/plain' vs 'text/html'?
>> In the former does it make sense to normalize? Should that be an option
>> (normalization: on/off) or would that make things more complicated?
>>
>> In summary, the two existing text selectors are tailored for text
>> documents that are accessed through the DOM… and we care about the pure
>> content. So, if I want to record that I've annotated an HTML/XML page
>> negotiated as plain text - recorded in the State - or better accessed as a
>> file, the selectors currently cannot keep/count the HTML/XML tags as part of the
>> selection. And that is an important use case for some (including myself).
>>
>> As a start, these are 'text selectors'. So if you're dealing with
>> 'application/xml' you need to strip tags and normalize to get to the text.
>> If you deal with 'text/html', the same applies to get to the textual
>> content. In other words I would modify the statement 'HTML/XML tags
>> should be removed'.
>> Other formats might require similar normalization which is not covered by
>> this specification.
>>
>> Then we can probably think of additional selectors?
>>
>> --
>> Dr. Paolo Ciccarese
>> http://www.paolociccarese.info/
>> Biomedical Informatics Research & Development
>> Instructor of Neurology at Harvard Medical School
>> Assistant in Neuroscience at Mass General Hospital
>> Member of the MGH Biomedical Informatics Core
>> +1-857-366-1524 (mobile)   +1-617-768-8744 (office)
>>
>
>
>
Received on Friday, 24 May 2013 15:16:43 UTC