- From: Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
- Date: Fri, 24 May 2013 12:51:03 +0200
- To: Kristof Csillag <csillag@hypothes.is>
- CC: public-openannotation@w3.org
- Message-ID: <519F4617.5090504@informatik.uni-leipzig.de>
Hi Kristof,

it seems that we are talking about two different things. I am talking about documents as Unicode sequences. For example, what you would get in the Body part of an HTTP GET Response for Content-Type: text/*

Or the string that you transfer in the Header as parameters during a GET/POST request, e.g.
?form-field=Some+text+from+a+form
or
?text=text+to+be+annotated+by+nlp+tool

or a plain text log-file on the hard disk using file://

When doing so it is possible to define a way to generate character position numbers in a way that:
(1) different programming languages already have functions that allow me to generate these numbers, e.g. Python or Java
(2) allows two programmers who read the spec to produce the same implementation with matching numbers for the same text

The thing that you describe is a little bit outside of my expertise. I wouldn't call it "text", however. I think you are talking about "text nodes" in a DOM tree or, more specifically, text nodes in an XML document.

As long as the OA annotation doesn't leave your Hypothes.is tool chain, everything works fine. I am unsure, however, whether a developer reading the current description of the TextPositionSelector could anchor the annotations in the way it was intended by Hypothes.is when exporting annotations. It is more an interoperability issue and normally requires test cases.

I was assuming that OA is interested in being universal, so natural language processing tools and annotations should definitely be in scope. They are working on text, not XML nodes. I attached the respective snippet from ISO 24612:2012(E) of the Linguistic annotation framework (LAF).

My question is what the underlying assumption of the current TextPositionSelector is. If you have a document such as:
http://persistence.uni-leipzig.org/nlp2rdf/examples/doc/LinkedData.txt

Do I have to normalize whitespace as is required by the definition? And how would I normalize it so that your position numbers match mine?
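To make the question concrete, here is a minimal Python sketch. It assumes positions count Unicode code points with the start included and the end excluded. The helper names (select, utf16_length, collapse_whitespace), the sample strings, and the whitespace rule itself are illustrative assumptions only; none of them is taken from the OA spec or from the Hypothes.is implementation.

```python
import re


def select(text: str, start: int, end: int) -> str:
    """Half-open selection over code points: start included, end excluded."""
    return text[start:end]


def utf16_length(text: str) -> int:
    """Length in UTF-16 code units (the unit Java's String.length() counts)."""
    return len(text.encode("utf-16-le")) // 2


def collapse_whitespace(text: str) -> str:
    """One hypothetical normalization: whitespace runs become a single space."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical sample text with a double space and a line break.
raw = "Linked  Data\n is great"
normalized = collapse_whitespace(raw)

# The same (start, end) pair selects different characters before and after
# normalization, so both sides must agree on the normalization first.
print(repr(select(raw, 8, 12)))
print(repr(select(normalized, 8, 12)))

# Code points and UTF-16 code units disagree once a character outside the
# Basic Multilingual Plane appears in the text.
astral = "a\U0001F600b"
print(len(astral), utf16_length(astral))
```

With these sample strings the offsets 8 and 12 cover "Data" in the raw text but a shifted span in the normalized text, which is the kind of mismatch I am worried about for exported annotations.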
I would offer to write another selector which clearly defines what is needed for NLP.

All the best,
Sebastian

3.3.2 Primary data
Primary data consists of electronic data in any format, including character (text), image, audio and video. Primary data in a LAF-compliant resources are frozen as "read-only" to preserve the integrity of references to locations within the document or documents. Corrections and modifications to the primary data are treated as annotations and stored in a separate annotation document. Primary data documents containing textual data are encoded in UTF-8 (default) or UTF-16. In the general case, primary data does not contain markup of any kind. If markup does exist in primary data (e.g. HTML or XML tags), it is treated as a part of the data stream by referring annotations; no distinction is made between markup and other characters in the data when referring to locations in the document.

On 24.05.2013 10:48, Kristof Csillag wrote:
> At 2013-05-24 09:35, Sebastian Hellmann wrote:
>> Hi Rob and Kristof,
>> ok, I am glad that it is just a matter of changing the documentation.
>> [...]
>> Issue 2 :
>> *******
>> Could you explain your rationale behind the text normalization?
>> "The text MUST be normalized before counting characters. HTML/XML
>> tags should be removed, character entities should be replaced with
>> the character that they encode, ...."
> Dear Sebastian,
>
> As far as I can understand, the goal here is to work with a text
> string as close to what the user sees as possible. That is why we are
> hiding away elements of the HTML that are not visible to the user.
>
> We are annotating text, not the technical elements of the document
> that represent that text. If the same text is represented in a
> different document, in a different format (different structure,
> different tags, etc), but the normalized text is still the same, then
> the TextPositionSelector should still apply.
>
>> That is a part which I do not understand.
>> Does the TextPositionSelector assume text or HTML ?
> I think it is supposed to be applied to text. (Which is most often
> rendered from HTML or PDF.)
>
>> Does it mean, that I can never use it annotate source code? E.g. an
>> annotation such as:
>> """You might want to use <textarea> instead of <input type="text" in
>> this HTML form."""
> I am not qualified to give a definite answer to this question; what I
> can confirm is that when working with Annotator or Hypothes.is, you
> can not annotate the source code of the active document. (Of course if
> you escape the source code, then it's fine, but then it's no longer
> real source code.)
>
>> I am unaware of a standardized normalization algorithm except for
>> Unicode Normal Form[4]. NIF requires such a normalization, but then
>> the text is supposed to be put into the RDF as rdf:Literal and
>> indexes are counted afterwards. Your approach only links to the
>> resource, which makes counting difficult and depends on the encoding.
> I am not sure about the theory here. In practice, here at Hypothes.is,
> to do this normalization, we have been using the browser selection API
> (as implemented by Firefox and Chrome) to get a string representation
> of (various parts of) the document. (If anybody is interested in the
> details, this was explained in a blog post:
> http://hypothes.is/blog/fuzzy-anchoring , section "Implementation
> details".)
>
> Best wishes:
>
> Kristof
>
>
>> For HTML/XML you might want to create a selector from the never
>> finished XPointer/XPointer:
>> http://www.w3.org/TR/xptr-xpointer/#b2b1b1b3b6b6
>>
>> All the best,
>> Sebastian
>>
>> [0] http://www.w3.org/TR/its20/
>> [1] http://www.openannotation.org/spec/core/specific.html#TextPositionSelector
>> [2] http://tools.ietf.org/html/rfc5147#section-2.2.1
>> [3] http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core/version-1.0/nif-core.ttl
>> [4] http://unicode.org/reports/tr15/#Norm_Forms
>>
>> On 23.05.2013 22:33, Kristof Csillag wrote:
>>> Hi All,
>>>
>>> I think the wording in the spec[1] is quite clear:
>>>
>>> -------------
>>> The start can be thought of as the position of a cursor in the
>>> list. Position 0 would be immediately before the first character,
>>> position 1 would be immediately before the second character, and so
>>> on. The start character is thus included in the list, but the end
>>> character is not as the cursor stops immediately before it.
>>> For example, if the document was "abcdefghijklmnopqrstuvwxyz", the
>>> start was 4, and the end was 7, then the selection would be "efg".
>>> -------------
>>>
>>> However, the description in the ontology[2] is not so clear:
>>>
>>> -------------
>>> An oa:Selector which describes a range of text based on its start
>>> and end positions.
>>> -------------
>>>
>>> Indeed, this section could be a bit more detailed, to convey the
>>> intent more precisely.
>>>
>>> I can confirm that the implementation in Hypothes.is (which is
>>> hopefully going to land in Annotator soon) is in sync with the first
>>> spec, so it behaves as Robert has described. (So from
>>> "abcdefghijkl", start:0, end:1 gives "a".)
>>>
>>> 1: http://www.openannotation.org/spec/core/specific.html#TextPositionSelector
>>> 2: http://www.w3.org/ns/oa#d4e667
>>>
>>> Best wishes:
>>>
>>> Kristof Csillag
>>>
>>>
>>> At 2013-05-23 21:50, Robert Sanderson wrote:
>>>>
>>>> Hi Sebastian,
>>>>
>>>> If that's true, then we need to fix the description in the spec[1],
>>>> as it's intended to be the same as RFC 5147 using only characters
>>>> mode, with some
>>>>
>>>> To use "abcdefghijkl" as the document:
>>>>
>>>> start:0, end:1 should be "a" -- start before the first
>>>> character, end before the second character
>>>> start:4, end:7 should be "efg" -- start before the 5th character
>>>> (eg e) and end before the 8th character (eg h)
>>>>
>>>> So I *think* it's the same as your tools?
>>>>
>>>>
>>>> 1: http://www.openannotation.org/spec/core/specific.html#TextPositionSelector
>>>>
>>>> Rob
>>>>
>>>>
>>>
>>
>>
>> --
>> Dipl. Inf. Sebastian Hellmann
>> Department of Computer Science, University of Leipzig
>> Events: NLP & DBpedia 2013 (http://nlp-dbpedia2013.blogs.aksw.org,
>> Deadline: *July 8th*)
>> Come to Germany as a PhD: http://bis.informatik.uni-leipzig.de/csf
>> Projects: http://nlp2rdf.org , http://linguistics.okfn.org ,
>> http://dbpedia.org/Wiktionary , http://dbpedia.org
>> Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
>> Research Group: http://aksw.org
>

--
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Events: NLP & DBpedia 2013 (http://nlp-dbpedia2013.blogs.aksw.org, Deadline: *July 8th*)
Come to Germany as a PhD: http://bis.informatik.uni-leipzig.de/csf
Projects: http://nlp2rdf.org , http://linguistics.okfn.org , http://dbpedia.org/Wiktionary , http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org
Received on Friday, 24 May 2013 10:51:48 UTC