Re: oa:start and end

Dear Sebastian and Kristof,
I've been applying TextPositionSelector, and even more often
TextQuoteSelector, mostly for annotating HTML documents as they show up in
the browser. In both cases I've always stripped out the HTML and normalized
the whitespace, as that guarantees better matching and, in the case of
TextQuoteSelector, makes it possible to transfer annotations back and forth
between HTML and PDF.
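
For illustration only, here is a minimal Python sketch of that kind of
preprocessing: strip the markup, collapse whitespace runs, then build a
prefix/exact/suffix selector from the normalized text. The names
visible_text and quote_selector, the 20-character context window, and the
bare "prefix"/"exact"/"suffix" keys are assumptions made for this example;
this is not Domeo's actual code, nor something the spec mandates.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects character data, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html):
    """Strip the markup and collapse every whitespace run to one space."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.chunks)).strip()


def quote_selector(text, exact, context=20):
    """Build a prefix/exact/suffix selector for the first match of `exact`.

    Returns None when `exact` does not occur in `text`.
    """
    start = text.find(exact)
    if start < 0:
        return None
    end = start + len(exact)
    return {
        "prefix": text[max(0, start - context):start],
        "exact": exact,
        "suffix": text[end:end + context],
    }
```

One known limitation of this sketch: adjacent block elements with no
whitespace between them (e.g. "<p>a</p><p>b</p>") run together, so a real
implementation would need to decide how to separate them.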

Domeo, like Hypothes.is and many other projects, is interested in annotating
online content from an end-user perspective; normally these projects are not
interested in annotating the underlying HTML code.

However, there are others who want to annotate the actual source
code. That has happened to me in the past, and in that case the source was
treated as plain text (text/plain) rather than as HTML. For instance:

    "ao:prefix": "</title>\n <link rel=\"stylesheet\" type=\"text/",
    "ao:exact": "css\" href=\"styles/main.css\" />",
    "ao:suffix": "\n <link rel=\"stylesheet\" type=\"text/css\" href=\"styles/skin/main.css\" />\n <style>\n h1 { text-align: center; "
    }
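
As a rough sketch of how such a quote could be resolved against the raw
source treated as text/plain, the Python below looks for the first place
where prefix, exact and suffix occur back to back. The function name
resolve_quote and the sample source string are made up for this example;
offsets count Python str characters (Unicode code points), which is one
possible convention and not something the spec settles.

```python
def resolve_quote(text, prefix, exact, suffix):
    """Return (start, end) of `exact` where it sits between `prefix` and
    `suffix` in `text`, or None if that sequence is absent.

    `end` is exclusive; offsets are counted in Unicode code points.
    """
    hit = text.find(prefix + exact + suffix)
    if hit < 0:
        return None
    start = hit + len(prefix)
    return start, start + len(exact)


# Hypothetical page source, read as plain text rather than parsed as HTML.
source = (
    '<head>\n <title>Demo</title>\n'
    ' <link rel="stylesheet" type="text/css" href="styles/main.css" />\n'
    ' <link rel="stylesheet" type="text/css" href="styles/skin/main.css" />\n'
    ' <style>\n h1 { text-align: center; }\n </style>\n</head>\n'
)

prefix = '</title>\n <link rel="stylesheet" type="text/'
exact = 'css" href="styles/main.css" />'
suffix = (
    '\n <link rel="stylesheet" type="text/css" href="styles/skin/main.css" />'
    '\n <style>\n h1 { text-align: center; '
)

span = resolve_quote(source, prefix, exact, suffix)
if span is not None:
    start, end = span
    print(source[start:end])
```

The same selector would not resolve against the text a browser renders for
this page, since none of the markup in the prefix, exact or suffix survives
there.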

The two selectors in the spec are very much oriented toward the final
presentation in the browser.



On Fri, May 24, 2013 at 8:07 AM, Kristof Csillag <csillag@hypothes.is> wrote:

> Hi Sebastian,
>
> We are moving beyond the borders of my experience here, so please take
> my answers with a grain of salt.
> I really hope some of the creators of the standard will speak up soon.
>
> At 2013-05-24 12:51, Sebastian Hellmann wrote:
> > Hi Kristof,
> > it seems that we are talking about two different things.
> > I am talking about documents as unicode sequences.
> I think that is a perfectly valid use case, and so it should be
> supported by OA; however, Annotator & Hypothes.is are currently only
> working on the text of documents that can be accessed as a DOM tree in
> browsers.
>
> > For example, what you would get in the Body part of a HTTP GET
> > Response for Content-Type: text/*
> > Or the string that you transfer in the Header as parameters during a
> > GET/POST request e.g. ?form-field=Some+text+from+a+form or
> > ?text=text+to+be+annotated+by+nlp+tool
> > or a plain text log-file on the hard disk using file://
> Those use-cases are outside the current scope of Hypothes.is, but the OA
> standard should definitely apply nevertheless.
>
> > When doing so it is possible to define a way to generate character
> > position numbers in a way, that:
> > (1) different programming languages already have functions that allow
> > me to generate these numbers e.g. python or java
> > (2) allow two programmers who read the spec to produce the same
> > implementation with matching numbers for the same text
> >
> >
> > The thing that you describe is a little bit outside of my expertise. I
> > wouldn't call it "text", however. I think, you are talking about "text
> > nodes" in a DOM tree or more specifically text nodes in an XML
> > document. As long as the OA annotation doesn't  leave your Hypothes.is
> > tool chain, everything works fine. I am unsure, however, whether a
> > developer reading the current description of the TextPositionSelector
> > could anchor the annotations in the way it was intended by Hypothes.is
> > when exporting annotations.
> I am not sure whether any of them helps, but here are a few considerations:
>
> 1. Besides TextPositionSelector, there is also
>
> 1.a) DataPositionSelector[1], which works on byte-level, instead of
> character-level, and does not mention any kind of normalization, and
> 1.b) TextQuoteSelector[2], which does not mention character positions at
> all.
> Maybe those could be used in these situations? Please note that these
> selectors are optional; any given target might have any combination of
> selectors. So if in a given situation, a given selector is not suitable,
> you can just use something else.
>
> 2. You might want to wrap the data into an HTML <pre> tag, which
> disables any formatting in the browser. In this case, normalization does not
> take place. However, this still does not solve the question of illegal
> characters / HTML entities in the text.
>
>
> [1]
> http://www.openannotation.org/spec/core/specific.html#DataPositionSelector
> [2]
> http://www.openannotation.org/spec/core/specific.html#TextQuoteSelector
>
> > It is more an interoperability issue and normally requires test cases.
> >
> >
> > I was assuming that OA is interested in being universal,
> I think it is.
> > so natural language processing tools and annotations should definitely
> > be in scope.
> > They are working on text not XML nodes.  I attached the respective
> > snippet from ISO 24612:2012(E) of the Linguistic annotation framework
> > (LAF).
> >
> > My question is what the underlying assumption of the current
> > TextPositionSelector is. If you have a document such as:
> > http://persistence.uni-leipzig.org/nlp2rdf/examples/doc/LinkedData.txt
> > Do I have to normalize whitespace as is required by the definition?
> When opening the referenced document in either Firefox or Chrome, the
> browsers display the TXT file by wrapping a <pre> element around it.
> Which means that the workaround I mentioned above is kind of
> automatically in place.
>
> Selecting the title ("Linked Data") should yield these values:
> {type:"TextPositionSelector", start:313, end:324}
> I believe that's in line with most sane definitions.
>
> > And how would I normalize it so that your position numbers match mine?
> You don't have to, since the <pre> element (automatically added by the
> browser) disables automatic formatting in browsers, which the
> normalization is supposed to approximate.
>
> >
> > I would offer to write another selector which clearly defines what is
> > needed for NLP.
> That is certainly an option, but I don't have enough information to form
> an opinion about that.
>
> Best wishes:
>
>    Kristof
>
>
>


-- 
Dr. Paolo Ciccarese
http://www.paolociccarese.info/
Biomedical Informatics Research & Development
Instructor of Neurology at Harvard Medical School
Assistant in Neuroscience at Mass General Hospital
Member of the MGH Biomedical Informatics Core
+1-857-366-1524 (mobile)   +1-617-768-8744 (office)


Received on Friday, 24 May 2013 12:14:03 UTC