- From: Robert Sanderson <azaroth42@gmail.com>
- Date: Thu, 26 Jul 2012 07:42:56 -0600
- To: Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
- Cc: Reto Bachmann-Gmür <reto@apache.org>, public-openannotation <public-openannotation@w3.org>
Hi Sebastian, Reto,

Yes, as Reto says, it's just the plain text without any elements. If you look at, for example, the second paragraph of http://openannotation.org/spec/extension/#SelectorOffset it says to normalize the text first.

We went for this approach based on a straw poll of implementations, all three of which used the raw text and did not count the characters for elements and so forth. (These were Domeo from Annotation Ontology, Annotator from the Open Knowledge Foundation, and an internal OAC implementation.) The reason is that the DOM that gets presented to a browser can differ substantially from the bytestream of the HTML. For example, some browsers inject <tbody> into tables, and others don't.

Also, I realise that your question was about offsets; however, for the TextQuoteSelector you would not want to include HTML elements in the quotation, as the 'find in document' process would be significantly more difficult.

> <h2 title="Begrüßung" id="welcomeheader" >Hallöchen!</h2>
> How are you measuring offset and range for "Hallöchen!" then?

So the answer to your question would be an offset of 0 and a range of 10.

Hope that helps, and if you have any evidence that contradicts the reasoning behind our current approach, please do bring it up :)

Thanks,

Rob

On Thu, Jul 26, 2012 at 12:04 AM, Sebastian Hellmann <hellmann@informatik.uni-leipzig.de> wrote:
> On 26.07.2012 00:04, Reto Bachmann-Gmür wrote:
>
>> On Jul 25, 2012 2:19 AM, "Sebastian Hellmann" <
>> hellmann@informatik.uni-leipzig.de> wrote:
>> ....
>>>
>>> E.g. <h2 title="Begrüßung" id="welcomeheader" >Hallöchen!</h2>
>>>
>>> I assume that your TextOffsetSelector assumes plain text and works on the
>>
>> HTML sources?
>>
>> I would have assumed it works on the actual text represented, so that
>> ö, <b>o</b> and ö in the html source all count as one character.
>
> What do you mean by actual text represented? Do you mean text nodes in the
> DOM?
> This doesn't seem feasible.
> If this is your primary data:
>
>
> <h2 title="Begrüßung" id="welcomeheader" >Hallöchen!</h2>
>
> How are you measuring offset and range for "Hallöchen!" then?
>
> <_:Selector1> a oax:TextOffsetSelector ;
> oax:offset 44 ;
> oax:range 15 .
>
> Sebastian
>
>>
>> Cheers,
>> Reto
>>
>
>
> --
> Dipl. Inf. Sebastian Hellmann
> Department of Computer Science, University of Leipzig
> Events:
> * http://sabre2012.infai.org/mlode (Leipzig, Sept. 23-24-25, 2012)
> * http://wole2012.eurecom.fr (*Deadline: July 31st 2012*)
> Projects: http://nlp2rdf.org , http://dbpedia.org
> Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
> Research Group: http://aksw.org
>
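
As a rough illustration of the plain-text counting described in the reply, the sketch below strips elements and attributes from the markup and measures the offset and range against what is left. It is not taken from the specification or from any of the implementations mentioned: the names `TextExtractor`, `plain_text` and `offset_and_range` are hypothetical, and the use of NFC as the normalization step is an assumption made only so that "ö" counts as a single character however it was encoded.

```python
import unicodedata
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects character data only; tags and attributes are ignored."""

    def __init__(self):
        # convert_charrefs=True decodes references such as &ouml; before
        # they reach handle_data, so they count as one character.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def plain_text(html):
    """Return the text content of an HTML fragment without any elements."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Assumption: NFC stands in for the normalization step in this sketch.
    return unicodedata.normalize("NFC", "".join(parser.parts))


def offset_and_range(html, quote):
    """Locate quote in the plain text and return (offset, range)."""
    text = plain_text(html)
    needle = unicodedata.normalize("NFC", quote)
    start = text.find(needle)
    if start < 0:
        raise ValueError("quote not found in the plain text")
    return start, len(needle)


source = '<h2 title="Begrüßung" id="welcomeheader" >Hallöchen!</h2>'
print(offset_and_range(source, "Hallöchen!"))  # (0, 10)
```

Under these assumptions the title attribute and the tag itself contribute nothing to the count, and writing the heading as `Hall&ouml;chen!` in the source gives the same result as the literal character.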
Received on Thursday, 26 July 2012 13:43:34 UTC