Re: oa:start and end

Hi Sebastian,

We are moving beyond the borders of my experience here, so please take
my answers with a grain of salt.
I really hope some of the creators of the standard will speak up soon.

At 2013-05-24 12:51, Sebastian Hellmann wrote:
> Hi Kristof,
> it seems that we are talking about two different things.
> I am talking about documents as unicode sequences.
I think that is a perfectly valid use case, and so it should be
supported by OA; however, Annotator & Hypothes.is currently work only
on the text of documents that can be accessed as a DOM tree in
browsers.

> For example, what you would get in the Body part of a HTTP GET
> Response for Content-Type: text/*
> Or the string that you transfer in the Header as parameters during a
> GET/POST request e.g. ?form-field=Some+text+from+a+form or
> ?text=text+to+be+annotated+by+nlp+tool
> or a plain text log-file on the hard disk using file://
Those use-cases are outside the current scope of Hypothes.is, but the OA
standard should definitely apply nevertheless.

> When doing so it is possible to define a way to generate character
> position numbers in a way, that:
> (1) different programming languages already have functions that allow
> me to generate these numbers e.g. python or java
> (2) allow two programmers who read the spec to produce the same
> implementation with matching numbers for the same text
>
>
> The thing that you describe is a little bit outside of my expertise. I
> wouldn't call it "text", however. I think, you are talking about "text
> nodes" in a DOM tree or more specifically text nodes in an XML
> document. As long as the OA annotation doesn't  leave your Hypothes.is
> tool chain, everything works fine. I am unsure, however, whether a
> developer reading the current description of the TextPositionSelector
> could anchor the annotations in the way it was intended by Hypothes.is
> when exporting annotations.
I am not sure whether any of these help, but here are a few considerations:

1. Besides TextPositionSelector, there are also

1.a) DataPositionSelector[1], which works at the byte level instead of
the character level, and does not mention any kind of normalization, and
1.b) TextQuoteSelector[2], which does not mention character positions at
all.
Maybe those could be used in these situations? Please note that these
selectors are optional; any given target might have any combination of
selectors. So if a given selector is not suitable in a given situation,
you can just use something else.

2. You might want to wrap the data in an HTML <pre> tag, which
disables automatic formatting in the browser. In this case,
normalization does not take place. However, this still does not solve
the question of illegal characters / HTML entities in the text.
 

[1] http://www.openannotation.org/spec/core/specific.html#DataPositionSelector
[2] http://www.openannotation.org/spec/core/specific.html#TextQuoteSelector
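
To make the difference between the three selectors concrete, here is a
minimal Python sketch. It is only an illustration under my own
assumptions: the helper name selectors_for and the sample string are
made up, UTF-8 is assumed for the byte offsets, and the key names
(start, end, exact) follow my reading of the spec rather than any
particular serialization.

```python
def selectors_for(text, quote, encoding="utf-8"):
    """Describe the first occurrence of `quote` in `text` in three ways.

    Hypothetical helper, not part of Annotator, Hypothes.is or the OA spec.
    """
    # Character (code point) offsets, as Python's str indexing counts them.
    start = text.index(quote)
    end = start + len(quote)

    # Byte offsets: encode the prefix and the quote separately.
    # Assumption: the document is served as UTF-8.
    byte_start = len(text[:start].encode(encoding))
    byte_end = byte_start + len(quote.encode(encoding))

    return [
        {"type": "TextPositionSelector", "start": start, "end": end},
        {"type": "DataPositionSelector", "start": byte_start, "end": byte_end},
        {"type": "TextQuoteSelector", "exact": quote},
    ]


# Made-up sample text; "\u00eb" is a precomposed e-with-diaeresis.
sample = "Zo\u00eb wrote: Linked Data"
for selector in selectors_for(sample, "Linked Data"):
    print(selector)
```

For this sample the character offsets are 11 and 22, while the byte
offsets are 12 and 23, because the non-ASCII letter takes two bytes in
UTF-8. The quote selector carries no numbers at all.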

> It is more an interoperability issue and normally requires test cases.
>
>
> I was assuming that OA is interested in being universal,
I think it is.
> so natural language processing tools and annotations should definitely
> be in scope.
> They are working on text not XML nodes.  I attached the respective
> snippet from ISO 24612:2012(E) of the Linguistic annotation framework
> (LAF).
>
> My question is what the underlying assumption of the current
> TextPositionSelector is. If you have a document such as:
> http://persistence.uni-leipzig.org/nlp2rdf/examples/doc/LinkedData.txt
> Do I have to normalize whitespace as is required by the definition?
When opening the referenced document in either Firefox or Chrome, the
browser displays the TXT file by wrapping a <pre> element around it,
which means that the workaround I mentioned above is more or less
automatically in place.

Selecting the title ("Linked Data") should yield these values:
{type:"TextPositionSelector", start:313, end:324}
I believe that's in line with most sane definitions.
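
As a sanity check on those numbers, here is a small Python sketch. It
assumes that start is zero-based and end is exclusive, which is what
the values suggest (324 - 313 = 11, the length of "Linked Data"). The
document string is a made-up stand-in, not the real LinkedData.txt, and
resolve is a hypothetical helper.

```python
def resolve(text, selector):
    """Return the characters a TextPositionSelector points at.

    Assumption: zero-based start, exclusive end, counted in characters.
    """
    return text[selector["start"]:selector["end"]]


# Stand-in document: 313 filler characters, then the title.
# This is NOT the content of the real file.
document = "x" * 313 + "Linked Data\nrest of the text"

selector = {"type": "TextPositionSelector", "start": 313, "end": 324}
print(repr(resolve(document, selector)))
```

If another implementation counts from 1, or treats end as inclusive, the
same numbers would select a different span, so this is exactly the kind
of detail that test cases would need to pin down.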

> And how would I normalize it so that your position numbers match mine?
You don't have to, since the <pre> element (automatically added by the
browser) disables automatic formatting in browsers, which the
normalization is supposed to approximate.
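
To illustrate why this matters for the position numbers, here is a
rough Python sketch. The function collapse_whitespace is my own
stand-in (collapse every whitespace run to a single space and trim the
ends); I am not claiming that it matches the normalization in the spec
or what browsers really do. The sample text is made up as well.

```python
import re


def collapse_whitespace(text):
    """Stand-in normalization, assumed for illustration only:
    collapse each run of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


# Made-up plain-text sample with blank lines and indentation.
raw = "Intro\n\n   Linked   Data\n"
normalized = collapse_whitespace(raw)

# Offset of the same word, counted on the raw text (as inside <pre>)
# and on the normalized text.
raw_start = raw.index("Linked")
normalized_start = normalized.index("Linked")

print(raw_start, normalized_start)
```

Here the word starts at offset 10 in the raw text but at offset 6 after
collapsing, so two implementations only agree if they count on the same
version of the text.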

>
> I would offer to write another selector which clearly defines what is
> needed for NLP.
That is certainly an option, but I don't have enough information to form
an opinion about that.

Best wishes:

   Kristof

Received on Friday, 24 May 2013 12:07:39 UTC