Re: oa:start and end from Kristof Csillag on 2013-05-24 (public-openannotation@w3.org from May 2013)

From: Kristof Csillag <csillag@hypothes.is>
Date: Fri, 24 May 2013 10:48:15 +0200
To: Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
CC: public-openannotation@w3.org
Message-ID: <519F294F.7080308@hypothes.is>
At 2013-05-24 09:35, Sebastian Hellmann wrote:
> Hi Rob and Kristof,
> ok, I am glad that it is just a matter of changing the documentation.
> [...]
> Issue 2 :
> *******
> Could you explain your rationale behind the text normalization?
> "The text MUST be normalized before counting characters. HTML/XML tags
> should be removed, character entities should be replaced with the
> character that they encode, ...."
Dear Sebastian,

As far as I understand, the goal here is to work with a text string that
is as close as possible to what the user sees. That is why we hide the
elements of the HTML that are not visible to the user.

We are annotating text, not the technical elements of the document that
represent that text. If the same text is represented in a different
document, in a different format (different structure, different tags,
etc), but the normalized text is still the same, then the
TextPositionSelector should still apply.
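
To illustrate the point, here is a rough Python sketch (not the
Hypothes.is code, which relies on the browser). The names TextExtractor,
normalize and select are made up for this example, and the sketch only
strips tags and decodes character entities; it ignores whitespace
handling and hidden elements such as scripts.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects character data only; tags themselves are dropped."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into "&"
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def normalize(markup):
    """Hypothetical helper: reduce markup to the text a reader would see."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


def select(text, start, end):
    """Apply start/end offsets to the normalized text."""
    return text[start:end]


# Two different representations of the same visible text.
doc_a = "<p>Tom &amp; <b>Jerry</b> run</p>"
doc_b = "<div><span>Tom &#38; Jerry</span> run</div>"

text_a = normalize(doc_a)
text_b = normalize(doc_b)

# The same offsets pick the same characters in both documents.
picked_a = select(text_a, 6, 11)
picked_b = select(text_b, 6, 11)
```

Both documents normalize to the same string, so the offsets 6 and 11
select "Jerry" in either one.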

> That is a part which I do not understand. Does the
> TextPositionSelector assume text or HTML ?
I think it is supposed to be applied to text. (Which is most often
rendered from HTML or PDF.)
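
Applied to a plain text string, the start and end values behave like
cursor positions, as in the spec example quoted further down in this
thread. A minimal sketch, assuming that Python string indices are an
acceptable stand-in for the spec's character counting (the function
name is invented for this example):

```python
def apply_text_position_selector(text, start, end):
    """Return the characters between cursor positions start and end.

    The character at start is included; the character at end is not.
    """
    if not 0 <= start <= end <= len(text):
        raise ValueError("need 0 <= start <= end <= len(text)")
    return text[start:end]


document = "abcdefghijklmnopqrstuvwxyz"

# Example from the spec: start 4, end 7.
selection = apply_text_position_selector(document, 4, 7)

# Cursor before the first character up to before the second one.
first = apply_text_position_selector(document, 0, 1)
```

Here selection is "efg" and first is "a", matching the values given in
the spec and in Rob's mail.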

> Does it mean that I can never use it to annotate source code? E.g. an
> annotation such as:
> """You might want to use <textarea> instead of <input type="text"> in
> this HTML form."""
I am not qualified to give a definitive answer to this question; what I
can confirm is that when working with Annotator or Hypothes.is, you
cannot annotate the source code of the active document. (Of course, if
you escape the source code, then it's fine, but then it's no longer real
source code.)
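
A small sketch of what I mean by escaping, using Python's standard html
module as a stand-in for what a browser does when it renders the page
(the variable names are only for illustration):

```python
import html

# Source code that we want to show to the reader as text.
snippet = "<textarea>"

# Escaped form, as it would have to appear in the page markup.
escaped_fragment = "Use " + html.escape(snippet) + " here"

# Decoding the entities gives the text the reader actually sees.
rendered = html.unescape(escaped_fragment)

# Offsets are counted in the rendered text, not in the markup.
start = rendered.index(snippet)
end = start + len(snippet)
picked = rendered[start:end]
```

The escaped markup contains "&lt;textarea&gt;", but the rendered text
contains "<textarea>", and that rendered text is what the offsets refer
to.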

> I am unaware of a standardized normalization algorithm except for
> Unicode Normal Form[4]. NIF requires such a normalization, but then
> the text is supposed to be put into the RDF as rdf:Literal and indexes
> are counted afterwards.  Your approach only links to the resource,
> which makes counting difficult and depends on the encoding.
I am not sure about the theory here. In practice, at Hypothes.is we do
this normalization with the browser selection API (as implemented by
Firefox and Chrome), which gives us a string representation of (various
parts of) the document. (If anybody is interested in the details, this
is explained in a blog post: http://hypothes.is/blog/fuzzy-anchoring ,
section "Implementation details".)

Best wishes:

   Kristof


> For HTML/XML you might want to create a selector from the never
> finished XPointer/XPointer:
> http://www.w3.org/TR/xptr-xpointer/#b2b1b1b3b6b6
>
> All the best,
> Sebastian
>
> [0] http://www.w3.org/TR/its20/
> [1]
> http://www.openannotation.org/spec/core/specific.html#TextPositionSelector
> [2] http://tools.ietf.org/html/rfc5147#section-2.2.1
> [3]
> http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core/version-1.0/nif-core.ttl
> [4] http://unicode.org/reports/tr15/#Norm_Forms
>
> On 23.05.2013 22:33, Kristof Csillag wrote:
>> Hi All,
>>
>> I think the wording in the spec[1] is quite clear:
>>
>> -------------
>> The start can be thought of as the position of a cursor in the list.
>> Position 0 would be immediately before the first character, position
>> 1 would be immediately before the second character, and so on. The
>> start character is thus included in the list, but the end character
>> is not as the cursor stops immediately before it.
>> For example, if the document was "abcdefghijklmnopqrstuvwxyz", the
>> start was 4, and the end was 7, then the selection would be "efg".
>> -------------
>>
>> However, the description in the ontology[2] is not so clear:
>>
>> -------------
>> An oa:Selector which describes a range of text based on its start
>> and end positions.
>> -------------
>>
>> Indeed, this section could be a bit more detailed, to convey the
>> intent more precisely.
>>
>> I can confirm that the implementation in Hypothes.is (which is
>> hopefully going to land in Annotator soon) is in sync with the first
>> spec, so it behaves as Robert has described. (So from "abcdefghijkl",
>> start:0, end:1 gives "a" .)
>>
>> 1:
>> http://www.openannotation.org/spec/core/specific.html#TextPositionSelector
>> 2: http://www.w3.org/ns/oa#d4e667
>>
>> Best wishes:
>>
>>    Kristof Csillag
>>
>>
>> At 2013-05-23 21:50, Robert Sanderson wrote:
>>>
>>> Hi Sebastian,
>>>
>>> If that's true, then we need to fix the description in the spec[1],
>>> as it's intended to be the same as RFC 5147 using only characters
>>> mode, with some 
>>>
>>> To use "abcdefghijkl" as the document:
>>>
>>> start:0, end:1 should be "a" -- start before the first
>>> character, end before the second character
>>> start:4, end:7 should be "efg" -- start before the 5th character
>>> (i.e. e) and end before the 8th character (i.e. h)
>>>
>>> So I *think* it's the same as your tools?
>>>
>>>
>>> 1: http://www.openannotation.org/spec/core/specific.html#TextPositionSelector
>>>
>>> Rob
>>>
>>>
>>
>
>
> -- 
> Dipl. Inf. Sebastian Hellmann
> Department of Computer Science, University of Leipzig
> Events: NLP & DBpedia 2013 (http://nlp-dbpedia2013.blogs.aksw.org,
> Deadline: *July 8th*)
> Come to Germany as a PhD: http://bis.informatik.uni-leipzig.de/csf
> Projects: http://nlp2rdf.org , http://linguistics.okfn.org ,
> http://dbpedia.org/Wiktionary , http://dbpedia.org
> Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
> Research Group: http://aksw.org
Received on Friday, 24 May 2013 08:48:51 UTC