- From: Kristof Csillag <csillag@hypothes.is>
- Date: Fri, 24 May 2013 14:07:00 +0200
- To: Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
- CC: public-openannotation@w3.org
Hi Sebastian,

We are moving beyond the borders of my experience here, so please take my answers with a grain of salt. I really hope some of the creators of the standard will speak up soon.

At 2013-05-24 12:51, Sebastian Hellmann wrote:

> Hi Kristof,
> it seems that we are talking about two different things.
> I am talking about documents as unicode sequences.

I think that is a perfectly valid use case, and so it should be supported by OA; however, Annotator & Hypothes.is currently only work on the text of documents that can be accessed as a DOM tree in browsers.

> For example, what you would get in the Body part of a HTTP GET
> Response for Content-Type: text/*
> Or the string that you transfer in the Header as parameters during a
> GET/POST request e.g. ?form-field=Some+text+from+a+form or
> ?text=text+to+be+annotated+by+nlp+tool
> or a plain text log-file on the hard disk using file://

Those use cases are outside the current scope of Hypothes.is, but the OA standard should definitely apply nevertheless.

> When doing so it is possible to define a way to generate character
> position numbers in a way, that:
> (1) different programming languages already have functions that allow
> me to generate these numbers e.g. python or java
> (2) allow two programmers who read the spec to produce the same
> implementation with matching numbers for the same text
>
> The thing that you describe is a little bit outside of my expertise. I
> wouldn't call it "text", however. I think, you are talking about "text
> nodes" in a DOM tree or more specifically text nodes in an XML
> document. As long as the OA annotation doesn't leave your Hypothes.is
> tool chain, everything works fine. I am unsure, however, whether a
> developer reading the current description of the TextPositionSelector
> could anchor the annotations in the way it was intended by Hypothes.is
> when exporting annotations.

I am not sure whether any of these help, but here are a few considerations:

1. Besides TextPositionSelector, there is also

   1.a) DataPositionSelector[1], which works at the byte level instead of the character level, and does not mention any kind of normalization, and

   1.b) TextQuoteSelector[2], which does not mention character positions at all.

   Maybe those could be used in these situations? Please note that these selectors are optional; any given target might have any combination of selectors. So if a given selector is not suitable in a given situation, you can just use something else.

2. You might want to wrap the data in an HTML <pre> tag, which disables any formatting in the browser. In this case, normalization does not take place. However, this still does not solve the question of illegal characters / HTML entities in the text.

[1] http://www.openannotation.org/spec/core/specific.html#DataPositionSelector
[2] http://www.openannotation.org/spec/core/specific.html#TextQuoteSelector

> It is more an interoperability issue and normally requires test cases.
>
> I was assuming that OA is interested in being universal,

I think it is.

> so natural language processing tools and annotations should definitely
> be in scope.
> They are working on text not XML nodes. I attached the respective
> snippet from ISO 24612:2012(E) of the Linguistic annotation framework
> (LAF).
>
> My question is what the underlying assumption of the current
> TextPositionSelector is. If you have a document such as:
> http://persistence.uni-leipzig.org/nlp2rdf/examples/doc/LinkedData.txt
> Do I have to normalize whitespace as is required by the definition?

When opening the referenced document in either Firefox or Chrome, the browsers display the TXT file by wrapping a <pre> element around it. This means that the workaround I mentioned above is automatically in place.

Selecting the title ("Linked Data") should yield these values:

{type:"TextPositionSelector", start:313, end:324}

I believe that's in line with most sane definitions.
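To make the difference between character-level and byte-level positions concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the helper names (text_position, data_position) and the sample string are made up for this example and are not part of the OA spec, Annotator, or Hypothes.is. It assumes that a TextPositionSelector-style offset counts Unicode code points in the unnormalized text, and that a DataPositionSelector-style offset counts bytes of the UTF-8 encoded document.

```python
def text_position(text: str, quote: str) -> dict:
    """Character offsets (Unicode code points) of the first occurrence of quote.

    Assumption: no whitespace normalization is applied, as in the <pre> case.
    """
    start = text.find(quote)
    if start < 0:
        raise ValueError("quote not found in text")
    return {"type": "TextPositionSelector",
            "start": start, "end": start + len(quote)}


def data_position(text: str, quote: str, encoding: str = "utf-8") -> dict:
    """Byte offsets of the first occurrence of quote in the encoded document.

    Assumption: the document is stored as UTF-8, where a byte-level search
    for a valid UTF-8 needle always matches on character boundaries.
    """
    data = text.encode(encoding)
    needle = quote.encode(encoding)
    start = data.find(needle)
    if start < 0:
        raise ValueError("quote not found in data")
    return {"type": "DataPositionSelector",
            "start": start, "end": start + len(needle)}


# Hypothetical sample document: one non-ASCII character before the title.
sample = "Caf\u00e9 notes\nLinked Data\n"

print(text_position(sample, "Linked Data"))  # start 11, end 22
print(data_position(sample, "Linked Data"))  # start 12, end 23
```

The two results differ by one because "é" is a single code point but two bytes in UTF-8. A JavaScript implementation would count UTF-16 code units instead, so offsets can also diverge for characters outside the Basic Multilingual Plane; that is the kind of mismatch a shared test case would need to pin down.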
> And how would I normalize it so that your position numbers match mine?

You don't have to, since the <pre> element (automatically added by the browser) disables automatic formatting in browsers, which the normalization is supposed to approximate.

> I would offer to write another selector which clearly defines what is
> needed for NLP.

That is certainly an option, but I don't have enough information to form an opinion about that.

Best wishes:

Kristof
Received on Friday, 24 May 2013 12:07:39 UTC