HTML tags, normalizations and text selectors from Paolo Ciccarese on 2013-05-24 (public-openannotation@w3.org from May 2013)

From: Paolo Ciccarese <paolo.ciccarese@gmail.com>
Date: Fri, 24 May 2013 10:35:17 -0400
To: public-openannotation <public-openannotation@w3.org>
Message-ID: <CAFPX2kBgb+XjXWGf-KAFS8s1FM8uESe=4QbBNDBhAzeoXaPLzA@mail.gmail.com>

Some additional thoughts after the discussion thread
http://lists.w3.org/Archives/Public/public-openannotation/2013May/0042.html

Currently both selectors in the spec ask for removal of HTML/XML tags. That
is also what I do in most of my applications as the goal is to annotate
'content'. And it works pretty well for HTML<->PDF.

However, OA has a broader scope.

I was wondering if tags have to be removed when I get a document of type
'text/plain'. I personally don't remove tags in the 'text/plain'
representation of a web page as, in that format, I don't see them as tags.
Of course it comes down to the fact that I annotate content within the
browser and the browser does not see those as tags.

Vice versa, if I get the content as text/html or application/xml I can use
fragment selectors to point to specific elements. But I might still resort
to plain text for very specific reasons.

Now, does the 'HTML/XML tags should be removed' statement (in
http://www.openannotation.org/spec/core/specific.html#TextQuoteSelector and
http://www.openannotation.org/spec/core/specific.html#TextPositionSelector)
apply to 'text/plain'? If the Selectors and the State have to be
orthogonal, if the implementation does not matter, and given the current
description, it probably should.
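
To make the question concrete, here is a minimal sketch of the behaviour
I describe above (tags stripped for markup types, left alone for
'text/plain'). The helper name selectable_text is hypothetical and not
part of the spec, and using Python's HTML parser for 'application/xml' is
a simplification:

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects character data and silently drops the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def selectable_text(body: str, media_type: str) -> str:
    """Return the text a selector would be evaluated against.

    Hypothetical helper: the media type is assumed to come from the
    negotiated representation (what the State would record).
    """
    if media_type in ("text/html", "application/xml"):
        collector = _TextCollector()
        collector.feed(body)
        collector.close()
        return "".join(collector.chunks)
    # text/plain: angle brackets are ordinary characters, nothing is stripped
    return body


page = "<p>Hello <b>world</b></p>"
print(selectable_text(page, "text/html"))   # Hello world
print(selectable_text(page, "text/plain"))  # <p>Hello <b>world</b></p>
```

Read literally, the current wording would require the second call to
return 'Hello world' as well.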

Also, what happens with white space for 'text/plain' vs 'text/html'? In
the former, does it make sense to normalize? Should that be an option
(normalization: on/off), or does that make things more complicated?
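
A small sketch of what such an on/off option could look like, and why it
matters for position-based selection. The function name and the choice to
collapse every run of white space into a single space are assumptions for
illustration; the spec does not define this switch:

```python
import re


def normalize_whitespace(text: str, enabled: bool = True) -> str:
    """Collapse runs of white space into one space when enabled."""
    if not enabled:
        return text
    return re.sub(r"\s+", " ", text).strip()


raw = "Some  text\n\twith   irregular spacing"
normalized = normalize_whitespace(raw)

print(normalized)                    # Some text with irregular spacing
print(raw.index("irregular"))        # 19
print(normalized.index("irregular")) # 15
```

The same word sits at a different character offset depending on the
switch, so producer and consumer would have to agree on its value.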

In summary, the two existing text selectors are tailored for text documents
that are accessed through the DOM… and we care about the pure content. So,
if I want to record that I've annotated an HTML/XML page negotiated as
plain text - recorded in the State - or better accessed as a file, I
currently cannot keep/count the HTML/XML tags as part of the selection. And
that is an important use case for some (including myself).
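
As an illustration of that use case, the sketch below computes start/end
offsets for the same page twice: once over the tag-stripped content and
once over the raw source treated as plain text. The regex tag stripping is
deliberately naive, and the helper and the dictionary layout are
assumptions for illustration, not the spec's serialization:

```python
import re

source = "<p>Hello <b>world</b></p>"
# Naive tag removal, good enough for this tiny example only.
stripped = re.sub(r"<[^>]+>", "", source)


def position_selector(text: str, exact: str) -> dict:
    """Build a start/end pair for the first occurrence of `exact`."""
    start = text.index(exact)
    return {"start": start, "end": start + len(exact)}


# Selection over the pure content, as the selectors are described today.
print(position_selector(stripped, "world"))        # {'start': 6, 'end': 11}

# Selection over the raw file, where the tags are part of the selection.
print(position_selector(source, "<b>world</b>"))   # {'start': 9, 'end': 21}
```

The second selection has no equivalent once the tags are removed, which is
the gap described above.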

As a start, these are 'text selectors'. So if you are dealing with
'application/xml' you need to strip tags and normalize to get to the text.
If you deal with 'text/html', the same applies to get to the textual
content. In other words, I would modify the statement 'HTML/XML tags should
be removed'. Other formats might require similar normalization, which is
not covered by this specification.

Then we can probably think of additional selectors?

-- 
Dr. Paolo Ciccarese
http://www.paolociccarese.info/
Biomedical Informatics Research & Development
Instructor of Neurology at Harvard Medical School
Assistant in Neuroscience at Mass General Hospital
Member of the MGH Biomedical Informatics Core
+1-857-366-1524 (mobile)   +1-617-768-8744 (office)

Received on Friday, 24 May 2013 14:35:49 UTC