Re: oa:start and end from Sebastian Hellmann on 2013-05-24 (public-openannotation@w3.org from May 2013)

From: Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
Date: Fri, 24 May 2013 09:35:05 +0200
To: Kristof Csillag <csillag@hypothes.is>
CC: public-openannotation@w3.org
Message-ID: <519F1829.8040408@informatik.uni-leipzig.de>
Hi Rob and Kristof,
OK, I am glad that it is just a matter of changing the documentation.
NIF will soon be a normative feature in a W3C Recommendation[0], so we 
are making sure that the core properties are stable, sound, and 
correctly linked to other vocabularies such as OA and the Provenance 
Ontology.

Here are the sentences that caused my misunderstanding. For me, 
only Issue 1 is important. Issue 2 is handled very well in NIF, but 
is still fuzzy in OA.

Issue 1 "plain text definition":
*******
In [1], you use the terms list, point, stream, position, cursor, and 
segment to explain this selector.
It is often unclear what is meant.

The cursor position is first defined as pointing to the elements:
"position of a cursor in the list"
It is then defined as stopping immediately "before" the character:
"as the cursor stops immediately before it."
Directly below that, you refer to:
"The first character in the full text is character position 0, and the 
character is included within the segment."

According to your definition, this means that start 0, end 0 includes 
the character at position 0 in the segment or sublist.
Note, however, that in the RFC[2] a "character position" is defined as 
the gap between two characters, and "character ranges" are then defined 
in terms of these "character positions".

Here is the NIF definition[3]:

nif:beginIndex
     a owl:DatatypeProperty ;
     vs:term_status "testing" ;
     rdfs:label "begin index"@en ;
     rdfs:comment """The begin index of a character range as defined in http://tools.ietf.org/html/rfc5147#section-2.2.1 and http://tools.ietf.org/html/rfc5147#section-2.2.2, measured as the gap between two characters, starting to count from 0 (the position before the first character of a text).
     Example: Index "2" is the postion between "Mr" and "."  in "Mr. Sandman".
     Note: RFC 5147 is re-used for the definition of character ranges. RFC 5147 is assuming a text/plain MIME type. NIF builds upon Unicode and is content agnostic.
     Requirement (1): This property has the same value the "Character position" of RFC 5147 and it must therefore be an xsd:nonNegativeInteger .
     Requirement (2): The index of the subject string MUST be calculated relative to the nif:referenceContext of the subject. If available, this is the rdf:Literal of the nif:isString property.""" ;
     # still being discussed:
     rdfs:subPropertyOf oa:start ;
     rdfs:range <http://www.w3.org/2001/XMLSchema#nonNegativeInteger> ;
     rdfs:domain nif:String .
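
As a rough illustration of Requirement (2), the following Python sketch 
computes a begin and end index for a substring relative to a reference 
context string. It assumes indexes count Unicode code points, as Python 
str does; the function name char_range is hypothetical and is not taken 
from NIF.

```python
# Sketch: gap-based begin/end indexes relative to a reference context.
# `char_range` is a hypothetical helper, not part of the NIF ontology.

def char_range(context, substring):
    """Return (begin, end) gap indexes of the first match of `substring`."""
    begin = context.find(substring)
    if begin < 0:
        raise ValueError("substring does not occur in the context")
    return begin, begin + len(substring)

context = "Mr. Sandman"

print(char_range(context, "."))        # (2, 3)
print(char_range(context, "Sandman"))  # (4, 11)
print(context[:2], context[2:])        # Mr . Sandman
```

Index 2 splits the context into "Mr" and ". Sandman", which matches the 
example given in the rdfs:comment.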



Issue 2 :
*******
Could you explain your rationale behind the text normalization?
"The text MUST be normalized before counting characters. HTML/XML tags 
should be removed, character entities should be replaced with the 
character that they encode, ...."

That is the part I do not understand. Does the TextPositionSelector 
assume text or HTML?
Does it mean that I can never use it to annotate source code? E.g. an 
annotation such as:
"""You might want to use <textarea> instead of <input type="text"> in 
this HTML form."""
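
The effect of such a normalization on offsets can be sketched in Python. 
This is only a rough approximation of the quoted rule (remove tags, 
replace character entities), built on the standard html.parser module; 
it is an assumption for illustration, not the algorithm of the OA spec, 
and the names TextOnly and strip_markup are hypothetical.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect text content only; tags are dropped, entities decoded."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_markup(source):
    parser = TextOnly()
    parser.feed(source)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)

raw = "<p>Use <b>textarea</b> &amp; friends</p>"
plain = strip_markup(raw)

print(plain)                                         # Use textarea & friends
print(raw.find("textarea"), plain.find("textarea"))  # 10 4

# Markup that is itself the annotation target vanishes entirely.
print(repr(strip_markup('<form><input type="text"></form>')))  # ''
```

The same word sits at offset 10 in the raw markup and at offset 4 after 
normalization, and the input element leaves no characters to select.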

I am unaware of a standardized normalization algorithm other than the 
Unicode Normalization Forms[4]. NIF requires such a normalization, but 
the text is then supposed to be put into the RDF as an rdf:Literal, and 
indexes are counted afterwards.  Your approach only links to the 
resource, which makes counting difficult and dependent on the encoding.
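
A small Python sketch of why the counts depend on normalization and 
encoding. It uses the standard unicodedata module; the sample string is 
an assumption chosen for illustration.

```python
import unicodedata

# "Cafe" followed by a combining acute accent (U+0301).
text = "Cafe\u0301"

nfc = unicodedata.normalize("NFC", text)  # accent composed into U+00E9
nfd = unicodedata.normalize("NFD", text)  # accent kept as separate code point

print(len(nfc))                      # 4 code points
print(len(nfd))                      # 5 code points
print(len(nfc.encode("utf-8")))      # 5 bytes
print(len(nfc.encode("utf-16-le")))  # 8 bytes
```

An end index for the same visible word is therefore 4 or 5 depending on 
the normal form, and different again if bytes are counted instead of 
code points.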

For HTML/XML, you might want to create a selector based on the 
never-finished XPointer xpointer() scheme:
http://www.w3.org/TR/xptr-xpointer/#b2b1b1b3b6b6

All the best,
Sebastian

[0] http://www.w3.org/TR/its20/
[1] 
http://www.openannotation.org/spec/core/specific.html#TextPositionSelector
[2] http://tools.ietf.org/html/rfc5147#section-2.2.1
[3] 
http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core/version-1.0/nif-core.ttl
[4] http://unicode.org/reports/tr15/#Norm_Forms

On 23.05.2013 22:33, Kristof Csillag wrote:
> Hi All,
>
> I think the wording in the spec[1] is quite clear:
>
> -------------
> The start can be thought of as the position of a cursor in the list. 
> Position 0 would be immediately before the first character, position 1 
> would be immediately before the second character, and so on. The start 
> character is thus included in the list, but the end character is not 
> as the cursor stops immediately before it.
> For example, if the document was "abcdefghijklmnopqrstuvwxyz", the 
> start was 4, and the end was 7, then the selection would be "efg".
> -------------
>
> However, the description in the ontology[2] is not so clear:
>
> -------------
> An oa:Selector which describes a range of text based on its start and 
> end positions.
> -------------
>
> Indeed, this section could be a bit more detailed, to convey the 
> intent more precisely.
>
> I can confirm that the implementation in Hypothes.is (which is 
> hopefully going to land in Annotator soon) is in sync with the first 
> spec, so it behaves as Robert has described. (So from "abcdefghijkl", 
> start:0, end:1 gives "a" .)
>
> 1: 
> http://www.openannotation.org/spec/core/specific.html#TextPositionSelector
> 2: http://www.w3.org/ns/oa#d4e667
>
> Best wishes:
>
>    Kristof Csillag
>
>
> At 2013-05-23 21:50, Robert Sanderson wrote:
>>
>> Hi Sebastian,
>>
>> If that's true, then we need to fix the description in the spec[1], 
>> as it's intended to be the same as RFC 5147 using only characters 
>> mode, with some
>>
>> To use "abcdefghijkl" as the document:
>>
>> start:0, end:1 should be "a"      -- start before the first 
>> character, end before the second character
>> start:4, end:7 should be "efg" -- start before the 5th character (eg 
>> e) and end before the 8th character (eg h)
>>
>> So I *think* it's the same as your tools?
>>
>>
>> 1: 
>> http://www.openannotation.org/spec/core/specific.html#TextPositionSelector
>>
>> Rob
>>
>>
>


-- 
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Events: NLP & DBpedia 2013 (http://nlp-dbpedia2013.blogs.aksw.org, 
Deadline: *July 8th*)
Come to Germany as a PhD: http://bis.informatik.uni-leipzig.de/csf
Projects: http://nlp2rdf.org , http://linguistics.okfn.org , 
http://dbpedia.org/Wiktionary , http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org
Received on Friday, 24 May 2013 07:35:35 UTC