Re: oa:start and end from Robert Sanderson on 2013-05-24 (public-openannotation@w3.org from May 2013)

From: Robert Sanderson <azaroth42@gmail.com>
Date: Fri, 24 May 2013 08:51:56 -0600
To: Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
Cc: public-openannotation <public-openannotation@w3.org>
Message-ID: <CABevsUFm2Y1SbUJg8T0-cgbqidUOyF8ppXGQpisFq=sfBguNbA@mail.gmail.com>

Hi Sebastian and all,

1.  For the documentation, it could definitely be clearer.  We'll work on
this for the meeting in Manchester.

2.  The TextPositionSelector assumes that there will be text in some
HTML/XML based format, which for a web standard seems like a reasonable
thing to do, especially given that other traditional document formats are
also covered by this, including EPUB and the XML based office formats.

So the rationale is that it's the best solution we could find for
interoperability between formats that primarily convey text intended to be
read by humans, being the primary use case for web annotation.  As more and
more systems become browser based or even browser native (PDF.js for
example), this interoperability will become easier and easier.
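
To make the normalization step concrete, here is a minimal Python sketch of
how a TextPositionSelector could be resolved against an HTML document: strip
the tags, decode the character entities, then count characters in what
remains. The names normalize and select_text are made up for this
illustration, and the sketch ignores any further normalization the spec may
require (whitespace handling, for instance), so treat it as an approximation
rather than the specified algorithm.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects character data only; tags are dropped, entities are decoded."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def normalize(markup: str) -> str:
    # Hypothetical helper: returns the text with tags removed and
    # character entities replaced by the characters they encode.
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


def select_text(markup: str, start: int, end: int) -> str:
    # start and end are cursor positions between characters of the
    # normalized text: the start character is included, the end one is not.
    return normalize(markup)[start:end]


markup = "<p>abcd<b>efg</b>hij &amp; k</p>"
print(normalize(markup))
print(select_text(markup, 4, 7))
```

The offsets are counted in the normalized text, not in the markup, which is
why this selector is a poor fit for annotating the source itself.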

For other formats:
* For plain text you could use a FragmentSelector with RFC 5147.
* For XML you can use a FragmentSelector as well, with XPointer (but not
for X/HTML, as its fragment specification doesn't allow it).
* For everything else, there's DataPositionSelector which doesn't normalize
at all.  This could be used to annotate source code (HTML or other), plain
text, or anything else.
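
As a rough sketch of the difference, a DataPositionSelector can be read as
offsets into the raw bytes, with no normalization, while an RFC 5147 fragment
counts characters of the plain text. The helper names below are hypothetical,
and the byte offsets assume the resource is served as UTF-8.

```python
def select_bytes(data: bytes, start: int, end: int) -> bytes:
    """Resolve start/end as raw byte offsets (no normalization)."""
    return data[start:end]


def char_fragment(start: int, end: int) -> str:
    """Build an RFC 5147 character-range fragment such as 'char=4,7'."""
    return f"char={start},{end}"


# Annotating HTML source code: the tag itself is the target.
source = b'<form><input type="text"></form>'
start = source.index(b"<input")
end = source.index(b">", start) + 1
print(select_bytes(source, start, end))

# Character offsets and byte offsets diverge once a multi-byte
# character appears before the selection.
text = "café au lait"
print(text[5:7], char_fragment(5, 7))
print(select_bytes(text.encode("utf-8"), 6, 8))
```

The same two letters sit at characters 5 to 7 but at bytes 6 to 8, so the
two kinds of offsets are not interchangeable.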

Between those options I think we have the vast majority of use cases
covered, including the text mining case?  So I don't think it's fuzzy, it's
just that you don't use TextPositionSelector if you want to annotate raw
data, of any form.  We could add text as an example in
DataPositionSelector's description, if that would help?

Rob



On Fri, May 24, 2013 at 1:35 AM, Sebastian Hellmann <
hellmann@informatik.uni-leipzig.de> wrote:

>  Hi Rob and Kristof,
> OK, I am glad that it is just a matter of changing the documentation.
> NIF will be a normative feature in a W3C recommendation[0] soon, so we are
> making sure that the core properties are stable and sound and correctly
> linked to any vocabularies like OA and the Provenance Ontology.
>
> Here are the sentences that caused my misunderstanding. For me
> only Issue 1 is important. Issue 2 is very well handled in NIF, but still
> fuzzy in OA.
>
> Issue 1 "plain text definition":
> *******
> In [1], you are using the terms list, point, stream, position, cursor,
> segment to explain this selector.
> Often it is unclear what is meant.
>
> Cursor position is first defined as pointing to the elements:
>
> "*position of a cursor in the list*"
> Then defined as immediately stopping "before" the character:
>
> "as the cursor stops immediately before it."
> Directly below that, you then refer to
> "The first character in the full text is character position 0, and the
> character is included within the segment. "
>
> According to your definition, this means that start 0, end 0 includes the
> character at position 0 in the segment or sublist.
> Note that, however, in the RFC[2] "character position" is defined as the
> gap between two characters and then based on these "character positions"
> "character ranges" are defined.
>
> Here is the NIF definition[3]:
>
> nif:beginIndex
>     a owl:DatatypeProperty ;
>     vs:term_status "testing" ;
>     rdfs:label "begin index"@en ;
>     rdfs:comment """The begin index of a character range as defined in http://tools.ietf.org/html/rfc5147#section-2.2.1 and http://tools.ietf.org/html/rfc5147#section-2.2.2, measured as the gap between two characters, starting to count from 0 (the position before the first character of a text).
>     Example: Index "2" is the position between "Mr" and "." in "Mr. Sandman".
>     Note: RFC 5147 is re-used for the definition of character ranges. RFC 5147 is assuming a text/plain MIME type. NIF builds upon Unicode and is content agnostic.
>     Requirement (1): This property has the same value as the "Character position" of RFC 5147 and it must therefore be an xsd:nonNegativeInteger .
>     Requirement (2): The index of the subject string MUST be calculated relative to the nif:referenceContext of the subject. If available, this is the rdf:Literal of the nif:isString property.""" ;
>     # still being discussed:
>     rdfs:subPropertyOf oa:start ;
>     rdfs:range <http://www.w3.org/2001/XMLSchema#nonNegativeInteger> ;
>     rdfs:domain nif:String .
>
>
>
> Issue 2 :
> *******
> Could you explain your rationale behind the text normalization?
> "The text MUST be normalized before counting characters. HTML/XML tags
> should be removed, character entities should be replaced with the character
> that they encode, ...."
>
> That is a part which I do not understand. Does the TextPositionSelector
> assume text or HTML ?
> Does it mean that I can never use it to annotate source code? E.g. an
> annotation such as:
> """You might want to use <textarea> instead of <input type="text"> in this
> HTML form."""
>
> I am unaware of a standardized normalization algorithm except for Unicode
> Normal Form[4]. NIF requires such a normalization, but then the text is
> supposed to be put into the RDF as rdf:Literal and indexes are counted
> afterwards.  Your approach only links to the resource, which makes counting
> difficult and depends on the encoding.
>
> For HTML/XML you might want to create a selector from the never finished
> XPointer/XPointer:
> http://www.w3.org/TR/xptr-xpointer/#b2b1b1b3b6b6
>
> All the best,
> Sebastian
>
> [0] http://www.w3.org/TR/its20/
> [1]
> http://www.openannotation.org/spec/core/specific.html#TextPositionSelector
> [2] http://tools.ietf.org/html/rfc5147#section-2.2.1
> [3]
> http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core/version-1.0/nif-core.ttl
> [4] http://unicode.org/reports/tr15/#Norm_Forms
>
> On 23.05.2013 22:33, Kristof Csillag wrote:
>
>  Hi All,
>
> I think the wording in the spec[1] is quite clear:
>
> -------------
> The start can be thought of as the position of a cursor in the list.
> Position 0 would be immediately before the first character, position 1
> would be immediately before the second character, and so on. The start
> character is thus included in the list, but the end character is not as the
> cursor stops immediately before it.
> For example, if the document was "abcdefghijklmnopqrstuvwxyz", the
> start was 4, and the end was 7, then the selection would be "efg".
> -------------
>
> However, the description in the ontology[2] is not so clear:
>
> -------------
> *An oa:Selector which describes a range of text based on its start and
> end positions.*
> -------------
>
> Indeed, this section could be a bit more detailed, to convey the intent
> more precisely.
>
> I can confirm that the implementation in Hypothes.is (which is hopefully
> going to land in Annotator soon) is in sync with the first spec, so it
> behaves as Robert has described. (So from "abcdefghijkl", start:0, end:1
> gives "a".)
>
> 1:
> http://www.openannotation.org/spec/core/specific.html#TextPositionSelector
> 2: http://www.w3.org/ns/oa#d4e667
>
> Best wishes:
>
>    Kristof Csillag
>
>
> At 2013-05-23 21:50, Robert Sanderson wrote:
>
>
> Hi Sebastian,
>
>  If that's true, then we need to fix the description in the spec[1], as
> it's intended to be the same as RFC 5147 using only characters mode, with
> some
>
>  To use "abcdefghijkl" as the document:
>
>  start:0, end:1 should be "a"      -- start before the first character,
> end before the second character
> start:4, end:7 should be "efg" -- start before the 5th character (i.e. e)
> and end before the 8th character (i.e. h)
>
>  So I *think* it's the same as your tools?
>
>
>  1:
> http://www.openannotation.org/spec/core/specific.html#TextPositionSelector
>
>  Rob
>
>
>
>
>
> --
> Dipl. Inf. Sebastian Hellmann
> Department of Computer Science, University of Leipzig
> Events: NLP & DBpedia 2013 (http://nlp-dbpedia2013.blogs.aksw.org,
> Deadline: *July 8th*)
> Come to Germany as a PhD: http://bis.informatik.uni-leipzig.de/csf
> Projects: http://nlp2rdf.org , http://linguistics.okfn.org ,
> http://dbpedia.org/Wiktionary , http://dbpedia.org
> Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
> Research Group: http://aksw.org
>
Received on Friday, 24 May 2013 14:52:28 UTC