Re: oa:start and end from Sebastian Hellmann on 2013-05-24 (public-openannotation@w3.org from May 2013)

From: Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
Date: Fri, 24 May 2013 12:51:03 +0200
To: Kristof Csillag <csillag@hypothes.is>
CC: public-openannotation@w3.org
Message-ID: <519F4617.5090504@informatik.uni-leipzig.de>
Hi Kristof,

it seems that we are talking about two different things.
I am talking about documents as Unicode sequences: for example, what you
would get in the body of an HTTP GET response for Content-Type: text/*,
or the string that you transfer as parameters of a
GET/POST request, e.g. ?form-field=Some+text+from+a+form or
?text=text+to+be+annotated+by+nlp+tool,
or a plain-text log file on the hard disk using file://

When doing so, it is possible to define character position numbers in
such a way that:
(1) different programming languages, e.g. Python or Java, already have
functions that generate these numbers, and
(2) two programmers who read the spec produce the same implementation,
with matching numbers for the same text.
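
As a minimal sketch of such a convention in Python 3, assume that positions
count Unicode code points of the decoded text, start at 0, and exclude the
end position. The helper name select and the sample string are made up for
illustration. Note that Java's String.length() and substring() count UTF-16
code units, so matching numbers for characters outside the Basic
Multilingual Plane would need codePointCount() and offsetByCodePoints().

```python
def select(text: str, start: int, end: int) -> str:
    """Return the characters between two cursor positions (end excluded)."""
    if not 0 <= start <= end <= len(text):
        raise ValueError("positions out of range")
    return text[start:end]


# Decode the bytes first, then count: positions refer to code points, not bytes.
raw = "Füße sind groß".encode("utf-8")  # hypothetical response body
text = raw.decode("utf-8")

print(len(raw), len(text))   # byte count and code point count differ
print(select(text, 0, 4))
print(select("abcdefghijklmnopqrstuvwxyz", 4, 7))
```

The byte length and the code point length differ as soon as the text
contains non-ASCII characters, which is why the counting unit has to be
stated in the spec.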


What you describe is a little outside my expertise. I wouldn't call it
"text", however. I think you are talking about "text nodes" in a DOM
tree or, more specifically, text nodes in an XML document.
As long as the OA annotation doesn't leave your Hypothes.is tool chain,
everything works fine. I am unsure, however, whether a developer who
reads the current description of the TextPositionSelector could anchor
annotations exported from Hypothes.is in the way Hypothes.is intended.
This is more of an interoperability issue and normally requires test cases.


I was assuming that OA aims to be universal, so natural language
processing tools and their annotations should definitely be in scope.
They work on text, not XML nodes. I attached the respective snippet
from ISO 24612:2012(E), the Linguistic annotation framework (LAF).

My question is what the underlying assumption of the current 
TextPositionSelector is. If you have a document such as:
http://persistence.uni-leipzig.org/nlp2rdf/examples/doc/LinkedData.txt
Do I have to normalize whitespace as is required by the definition? And 
how would I normalize it so that your position numbers match mine?
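
To make the problem concrete, here is a small Python sketch. The function
collapse_whitespace is only one possible reading of "normalize whitespace"
(it is an assumption, not what the spec defines), and the sample string is
invented rather than taken from the file above.

```python
import re


def collapse_whitespace(text: str) -> str:
    """One possible normalization: trim, then squeeze whitespace runs to one space."""
    return re.sub(r"\s+", " ", text.strip())


raw = "Linked  Data\n\nis a method"   # invented sample, two spaces and a blank line
norm = collapse_whitespace(raw)

target = "method"
raw_start = raw.find(target)
norm_start = norm.find(target)
print(raw_start, norm_start)   # the same word gets two different start positions
```

Any other rule (keeping line breaks, not trimming, treating tabs
differently) would shift the numbers again, so both sides need the
identical algorithm.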

I can offer to write another selector that clearly defines what is
needed for NLP.

All the best,
Sebastian




3.3.2 Primary data
Primary data consists of electronic data in any format, including
character (text), image, audio and video.
Primary data in a LAF-compliant resources are frozen as "read-only" to
preserve the integrity of references to locations within the document or
documents. Corrections and modifications to the primary data are treated
as annotations and stored in a separate annotation document. Primary
data documents containing textual data are encoded in UTF-8 (default) or
UTF-16.
In the general case, primary data does not contain markup of any kind.
If markup does exist in primary data (e.g. HTML or XML tags), it is
treated as a part of the data stream by referring annotations; no
distinction is made between markup and other characters in the data when
referring to locations in the document.
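
The last paragraph of the snippet can be illustrated with a short Python
sketch. The HTML fragment is invented, and the regular expression is only
a crude stand-in for tag removal (an assumption for illustration, not a
real HTML parser and not the normalization used by any particular tool).

```python
import re

source = "<p>Use <b>textarea</b> here.</p>"   # invented primary data

# LAF-style reading: tags are ordinary characters of the data stream.
start = source.find("textarea")
end = start + len("textarea")
print(start, end, source[start:end])

# Tag-stripping reading: positions are counted after removing the markup.
stripped = re.sub(r"<[^>]+>", "", source)
start_stripped = stripped.find("textarea")
print(start_stripped, stripped)
```

The two readings give different positions for the same word, so a
selector has to state which one it assumes.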



On 24.05.2013 10:48, Kristof Csillag wrote:
> At 2013-05-24 09:35, Sebastian Hellmann wrote:
>> Hi Rob and Kristof,
>> ok, I am glad that it is just a matter of changing the documentation.
>> [...]
>> Issue 2 :
>> *******
>> Could you explain your rationale behind the text normalization?
>> "The text MUST be normalized before counting characters. HTML/XML 
>> tags should be removed, character entities should be replaced with 
>> the character that they encode, ...."
> Dear Sebastian,
>
> As far as I can understand, the goal here is to work with a text 
> string as close to what the user sees as possible. That is why we are 
> hiding away elements of the HTML that are not visible to the user.
>
> We are annotating text, not the technical elements of the document 
> that represent that text. If the same text is represented in a 
> different document, in a different format (different structure, 
> different tags, etc), but the normalized text is still the same, then 
> the TextPositionSelector should still apply.
>
>> That is a part which I do not understand. Does the 
>> TextPositionSelector assume text or HTML ?
> I think it is supposed to be applied to text. (Which is most often 
> rendered from HTML or PDF.)
>
>> Does it mean that I can never use it to annotate source code? E.g. an
>> annotation such as:
>> """You might want to use <textarea> instead of <input type="text"> in
>> this HTML form."""
> I am not qualified to give a definitive answer to this question; what I
> can confirm is that when working with Annotator or Hypothes.is, you
> cannot annotate the source code of the active document. (Of course, if
> you escape the source code, then it's fine, but then it's no longer
> real source code.)
>
>> I am unaware of a standardized normalization algorithm except for 
>> Unicode Normal Form[4]. NIF requires such a normalization, but then 
>> the text is supposed to be put into the RDF as rdf:Literal and 
>> indexes are counted afterwards.  Your approach only links to the 
>> resource, which makes counting difficult and depends on the encoding.
> I am not sure about the theory here. In practice, here at Hypothes.is, 
> to do this normalization, we have been using the browser selection API 
> (as implemented by Firefox and Chrome) to get a string representation 
> of (various parts of) the document. (If anybody is interested in the 
> details, this was explained in a blog post: 
> http://hypothes.is/blog/fuzzy-anchoring , section "Implementation 
> details". )
>
> Best wishes:
>
>    Kristof
>
>
>> For HTML/XML you might want to create a selector from the never 
>> finished XPointer/XPointer:
>> http://www.w3.org/TR/xptr-xpointer/#b2b1b1b3b6b6
>>
>> All the best,
>> Sebastian
>>
>> [0] http://www.w3.org/TR/its20/
>> [1] 
>> http://www.openannotation.org/spec/core/specific.html#TextPositionSelector
>> [2] http://tools.ietf.org/html/rfc5147#section-2.2.1
>> [3] 
>> http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core/version-1.0/nif-core.ttl
>> [4] http://unicode.org/reports/tr15/#Norm_Forms
>>
>> On 23.05.2013 22:33, Kristof Csillag wrote:
>>> Hi All,
>>>
>>> I think the wording in the spec[1] is quite clear:
>>>
>>> -------------
>>> The start can be thought of as the position of a cursor in the
>>> list. Position 0 would be immediately before the first character,
>>> position 1 would be immediately before the second character, and so
>>> on. The start character is thus included in the list, but the end
>>> character is not as the cursor stops immediately before it.
>>> For example, if the document was "abcdefghijklmnopqrstuvwxyz", the
>>> start was 4, and the end was 7, then the selection would be "efg".
>>> -------------
>>>
>>> However, the description in the ontology[2] is not so clear:
>>>
>>> -------------
>>> An oa:Selector which describes a range of text based on its start
>>> and end positions.
>>> -------------
>>>
>>> Indeed, this section could be a bit more detailed, to convey the 
>>> intent more precisely.
>>>
>>> I can confirm that the implementation in Hypothes.is (which is 
>>> hopefully going to land in Annotator soon) is in sync with the first 
>>> spec, so it behaves as Robert has described. (So from 
>>> "abcdefghijkl", start:0, end:1 gives "a" .)
>>>
>>> 1: 
>>> http://www.openannotation.org/spec/core/specific.html#TextPositionSelector
>>> 2: http://www.w3.org/ns/oa#d4e667
>>>
>>> Best wishes:
>>>
>>>    Kristof Csillag
>>>
>>>
>>> At 2013-05-23 21:50, Robert Sanderson wrote:
>>>>
>>>> Hi Sebastian,
>>>>
>>>> If that's true, then we need to fix the description in the spec[1], 
>>>> as it's intended to be the same as RFC 5147 using only characters 
>>>> mode, with some
>>>>
>>>> To use "abcdefghijkl" as the document:
>>>>
>>>> start:0, end:1 should be "a"   -- start before the first
>>>> character, end before the second character
>>>> start:4, end:7 should be "efg" -- start before the 5th character
>>>> (i.e. e) and end before the 8th character (i.e. h)
>>>>
>>>> So I *think* it's the same as your tools?
>>>>
>>>>
>>>> 1: 
>>>> http://www.openannotation.org/spec/core/specific.html#TextPositionSelector
>>>>
>>>> Rob
>>>>
>>>>
>>>
>>
>>
>> -- 
>> Dipl. Inf. Sebastian Hellmann
>> Department of Computer Science, University of Leipzig
>> Events: NLP & DBpedia 2013 (http://nlp-dbpedia2013.blogs.aksw.org, 
>> Deadline: *July 8th*)
>> Come to Germany as a PhD: http://bis.informatik.uni-leipzig.de/csf
>> Projects: http://nlp2rdf.org , http://linguistics.okfn.org , 
>> http://dbpedia.org/Wiktionary , http://dbpedia.org
>> Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
>> Research Group: http://aksw.org
>


-- 
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Events: NLP & DBpedia 2013 (http://nlp-dbpedia2013.blogs.aksw.org, 
Deadline: *July 8th*)
Come to Germany as a PhD: http://bis.informatik.uni-leipzig.de/csf
Projects: http://nlp2rdf.org , http://linguistics.okfn.org , 
http://dbpedia.org/Wiktionary , http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org
Received on Friday, 24 May 2013 10:51:48 UTC