Re: HTML and text Re: Question on annotating a text section from Robert Sanderson on 2012-07-26 (public-openannotation@w3.org from July 2012)

From: Robert Sanderson <azaroth42@gmail.com>
Date: Thu, 26 Jul 2012 07:42:56 -0600
To: Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
Cc: Reto Bachmann-Gmür <reto@apache.org>, public-openannotation <public-openannotation@w3.org>
Message-ID: <CABevsUHM-M2V6mJ8-OWoK_XfvW7YAOytM3wvLzRXt5_g0aCdpg@mail.gmail.com>

Hi Sebastian, Reto,

Yes, as Reto says, it's just the plain text without any elements.

If you look at, for example, the second paragraph of
http://openannotation.org/spec/extension/#SelectorOffset it says to
normalize the text first.  We went for this approach based on a straw
poll of three implementations, all of which used the raw text and did
not count the characters of element markup.  (The three were Domeo from
Annotation Ontology, Annotator from the Open Knowledge Foundation, and
an internal OAC implementation.)

The reason is that the DOM presented by a browser can differ
substantially from the bytestream of the HTML.  For example, some
browsers inject <tbody> into tables and others don't.  I realise that
your question was about offsets, but the same applies to the
TextQuoteSelector: you would not want to include HTML elements in the
quotation, as the 'find in document' process would be significantly
more difficult.
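
To illustrate the point, here is a rough sketch in Python using only the
standard library's html.parser.  The helper name text_only and the sample
table markup are made up for this example, and no whitespace
normalization is attempted.  It suggests that the extracted text is the
same with or without an injected <tbody>, and that a quotation is found
in the extracted text but not in the raw HTML source.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects character data only; tags and attributes are ignored."""

    def __init__(self):
        # convert_charrefs=True turns references such as &ouml; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def text_only(markup):
    """Hypothetical helper: return the plain text of an HTML fragment."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()  # flush any buffered text
    return "".join(collector.parts)


# Two serializations of the same table, one with an injected <tbody>.
without_tbody = "<table><tr><td>cell</td></tr></table>"
with_tbody = "<table><tbody><tr><td>cell</td></tr></tbody></table>"
same_text = text_only(without_tbody) == text_only(with_tbody)

# A quotation is hard to find in the raw source, but easy in the text.
raw = '<h2 title="Begrüßung" id="welcomeheader" >Hall&ouml;chen!</h2>'
quote = "Hall\u00f6chen!"
found_in_raw = quote in raw
found_in_text = quote in text_only(raw)
```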

> <h2 title="Begrüßung" id="welcomeheader" >Hall&ouml;chen!</h2>
> How are you measuring offset and range for "Hallöchen!" then?

So the answer to your question would be an offset of 0 and a range of
10, counting only the ten characters of the plain text "Hallöchen!".
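
As a minimal sketch of how those two numbers could be computed, assuming
Python's standard html.parser and counting Unicode code points: the
function names plain_text and offset_and_range are invented for this
example and are not part of any Open Annotation implementation, and the
normalization step from the spec is reduced here to dropping markup and
decoding character references.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keeps text content and discards element markup."""

    def __init__(self):
        # Character references (e.g. &ouml;) are decoded before handle_data.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def plain_text(html):
    """Hypothetical helper: text of the document without any elements."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush text still held in the parser's buffer
    return "".join(parser.chunks)


def offset_and_range(html, target):
    """Return (offset, range) of target in the plain text, or None."""
    start = plain_text(html).find(target)
    if start < 0:
        return None
    return start, len(target)


source = '<h2 title="Begrüßung" id="welcomeheader" >Hall&ouml;chen!</h2>'
result = offset_and_range(source, "Hall\u00f6chen!")
```

The title attribute and the tag itself contribute nothing to the count,
and &ouml; counts as a single character.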

Hope that helps, and if you have any evidence that contradicts the
reasoning behind our current approach, please do bring it up :)

Thanks,

Rob

On Thu, Jul 26, 2012 at 12:04 AM, Sebastian Hellmann
<hellmann@informatik.uni-leipzig.de> wrote:
> Am 26.07.2012 00:04, schrieb Reto Bachmann-Gmür:
>
>> On Jul 25, 2012 2:19 AM, "Sebastian Hellmann" <
>> hellmann@informatik.uni-leipzig.de> wrote:
>> ....
>>>
>>> E.g. <h2 title="Begrüßung" id="welcomeheader" >Hall&ouml;chen!</h2>
>>>
>>> I assume that your TextOffsetSelector assumes plain text and works on the
>>
>> HTML sources?
>>
>> I would have assumed it works on the actual text represented, so that
>> &ouml;, <b>o</b> and ö in the html source all count as one character.
>
> What do you mean by actual text represented? Do you mean text nodes in the
> DOM?
> This doesn't seem feasible. If this is your primary data:
>
>
> <h2 title="Begrüßung" id="welcomeheader" >Hall&ouml;chen!</h2>
>
> How are you measuring offset and range for "Hallöchen!" then?
>
> <_:Selector1> a oax:TextOffsetSelector ;
>    oax:offset 44 ;
>    oax:range 15 .
>
> Sebastian
>
>>
>> Cheers,
>> Reto
>>
>
>
> --
> Dipl. Inf. Sebastian Hellmann
> Department of Computer Science, University of Leipzig
> Events:
>   * http://sabre2012.infai.org/mlode (Leipzig, Sept. 23-24-25, 2012)
>   * http://wole2012.eurecom.fr (*Deadline: July 31st 2012*)
> Projects: http://nlp2rdf.org , http://dbpedia.org
> Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
> Research Group: http://aksw.org
>

Received on Thursday, 26 July 2012 13:43:34 UTC