
Re: issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup

From: Jörg Schütz <joerg@bioloom.de>
Date: Mon, 28 Jan 2013 11:43:42 +0100
Message-ID: <5106565E.9050402@bioloom.de>
To: public-multilingualweb-lt@w3.org

Hi Felix and all,

Thanks a lot for your comments, and I certainly like your approach to
resolving issue-68. That this direction might open the discussion to
further changes within the future evolution of ITS is not a bad thing,
and it does not even call into question the current evolutionary stage
of ITS 2.0 and its design decisions.

Well, as already pointed out by Yves and particularly by Dave, there are
some aspects of your proposal that might influence the existing
stand-off/standoff mechanism because now we allow for

(1) a separate part for the annotations
(<its:textAnalyticsAnnotations>), which might even be outside the ITS
document.
(2) the distinction between "term" and "entity" only, i.e. an identified
term candidate could be marked as an "entity" if that fact isn't further
indicated by a confidence value (i.e. terms marked within a certain
confidence range might be considered term candidates). The
its:disambigGranularity attribute and its associated attributes are no
longer supported in this scenario.
(3) its:tanInfoRef to play the role of its:termInfoRef, i.e. an
attribute that contains an IRI referring to the resource providing
information about the term, as well as the role of its:disambigIdentRef.
This also means that its:termInfoPointer isn't needed anymore.
(4) its:tanIdentRef to link to an actual term database entry.
(5) its:tanClassRef as the equivalent of its:tanInfoRef for entities 
(e.g. an entity concept, or a lexical entry, or ... but now not 
predefined which is beneficial!).
(6) its:tanConfidence as the replacement of its:termConfidence and 
its:disambigConfidence.
(7) establishing the relationship between content marked up for
analytics annotation with "id" and "ref", which is a deviation from the
standoff/stand-off annotations provided for the provenance and data
categories.

Since (7) with the constraints (1) to (6) is a totally different
approach to standoff/stand-off annotation than the one we already have
for provenance (provenance standoff provides the information
explicitly, whereas tan standoff extends information that is already
marked up), we might follow/accept your proposal for merging the two
data categories. Nevertheless, the approach certainly has an impact on
how ITS information for terms and general natural language expressions
is encoded, because it is now a sort of staged/phased annotation, which
differs from previous views and might be more complex, also because of
some (unwanted) side effects such as conflicting information.

Last but not least, would such a change (merge of data categories) be 
within the current Last Call process?

Cheers -- Jörg


On Jan 27, 2013, at 18:22 (CET), Felix Sasaki wrote:
> Hi Jörg,
>
> Am 27.01.13 17:55, schrieb Jörg Schütz:
>> Hi Felix and all,
>>
>> Thanks for opening the discussion about employing standoff/stand-off
>> annotations in the ITS framework. For the envisaged data categories,
>> i.e. term and disambiguation, a stand-off annotation (markup) approach
>> has several advantages which we already know from corpus linguistics
>> and other (hierarchical) annotation challenges (e.g. from language
>> proofing to text stream analytics). Now, there are several questions:
>>
>> 1. Do we want to maintain a separate type system for each data
>> category that might benefit from stand-off annotations? You suggested
>> only "term" and "entity" but we could easily extend this to a
>> full-fledged (hierarchical) type system across all current ITS data
>> categories.
>
> No, we don't want that now. We are in the last call stage, that is: we
> aim at being feature complete. Other data categories IMO are not broken,
> and we can move on without changing them. But the confusion about
> terminology and disambiguation is huge, and hence we need to fix it.
>
>>
>> 2. As of yet your suggestion only allows for "in-document"
>> annotations. Since many stand-off annotation approaches in real life
>> applications make use of several documents, this might be an
>> interesting option to maintain several document views (annotation
>> categories and granularities) within localization and translation.
>
>
> Again, I'd see this as an additional step for ITS2.1 maybe. At the
> moment I would only introduce a mechanism that fixes what is still
> broken in ITS2 - without disallowing other things.
>
>>
>> Well, and last but not least -- and this is kind of a nasty comment --
>> the further discussion of stand-off markup/annotation might lead to an
>> entire revision of the current ITS framework.
>
> I like nasty comments :)
>
> But see my answers below - I think the last call comments
> http://tinyurl.com/its20-comments-handling
> only still ask for changes in three areas: ruby, directionality, and
> terminology / disambiguation. Nobody asked for a complete revision, and
> we have running code and tests for all other areas - so I would restrict
> the changes discussed to issue-68.
>
> Given the above explanations, what do you think about this proposed solution
> for issue-68?
>
> Best,
>
> Felix
>
>>
>> I'm looking forward to other thoughts and comments.
>>
>> All the best -- Jörg
>>
>> On Jan 27, 2013, at 08:24 (CET), Felix Sasaki wrote:
>>> Hi all,
>>>
>>> sorry, this is going to be long ... but please have a look, esp. the
>>> implementers (both consumers and producers) of terminology and
>>> disambiguation.
>>>
>>> in the last 10 1/2 months, since Tadej's presentation at the Dublin
>>> workshop, we had a lot of discussions on disambiguation, and sometimes
>>> (as now) including terminology. But it seems that we never discussed
>>> whether the ITS2 approach of selection (global, local, inheritance,
>>> overriding (partial or not)...) is suitable for this type of
>>> information.
>>>
>>> By "this type" I mean annotation of linguistic information. Most ITS2
>>> and ITS1 data categories are process related (e.g. "Don't translate this
>>> ..."), but both terminology and what's now called disambiguation are
>>> information that you find in linguistic corpora and processing tools.
>>> Now, my point is that both in such natural language processing tool
>>> chains and in related corpora, a representation of information *inline
>>> per document node* is rather the exception. Mostly you have *standoff
>>> information*, that is a complete separation of information from actual
>>> content - as in NIF.
>>>
>>> Why is that? In linguistic annotation it is common that you have several
>>> layers of information, like our lexical, ontological etc. information.
>>> Some of these might be complex in itself (e.g. named entities), some of
>>> these might be related to others (e.g. an ontological concept related to
>>> a lexical item). I won't try to define these layers here - but my point
>>> is that due to the complexity of representing such information inline,
>>> nearly nobody is trying to represent several layers at the same time
>>> inline. The common approach is rather to have a base layer, and then
>>> pointers from the various annotation layers.
>>>
>>> In a sense you can describe NIF as an approach of taking character
>>> offsets as the implicit base layer (implicit because characters don't
>>> need explicit anchors). The TEI here
>>> http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO
>>> provides an example for an offset using words as the base unit, with
>>> explicit xml:id attributes.
>>>
>>> So far we haven't taken this approach for terminology or disambiguation.
>>> This is why we had to come up with 16+ attributes: if you want to do
>>> everything "inline", you need to differentiate attribute names and come
>>> up with a monster data category. Inline annotations are just not
>>> suitable for such information.
>>>
>>> So, the first idea behind the approach below is: if you want to
>>> represent just one linguistic layer (or "qualifier" in Christian's
>>> mail at
>>> http://lists.w3.org/Archives/Public/public-multilingualweb-lt-comments/2013Jan/0014.html
>>>
>>> ), you use the "tan-type" attribute to differentiate annotations. That
>>> leads to the following inline models:
>>>
>>> 1) A term has its-tan-type with value "term" and optional
>>> its-tan-confidence, its-tan-ident-ref, and its-tan-info-ref. Example:
>>> <span its-tan-type="term"
>>> its-tan-ident-ref="http://termdatabase.example.com/entry37"
>>> its-tan-info-ref="http://termdatabase.example.com/entry37/description"
>>> its-tan-confidence="1.0">Dublin</span>
>>> Comparison to the current ITS1 "Terminology":
>>> its-tan-type="term" plays the role of term="yes". its-tan-info-ref plays
>>> the role of termInfoRef. its-tan-ident-ref links to a term database.
>>> its-tan-confidence provides confidence information.
>>> (Esp. at Marcis: I know that "Dublin" is a bad candidate for a term, I'm
>>> just trying to exemplify the annotation approach here)
>>>
>>> 2) An entity has its-tan-type with value "entity" and optional
>>> its-tan-confidence, its-tan-ident-ref, and its-tan-class-ref. Example:
>>> <span its-tan-type="entity"
>>> its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>>> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>>> its-tan-confidence="0.7">Dublin</span>
>>>
>>> So the above is only different naming compared to the current
>>> "Terminology" and Disambiguation. Below is now the standoff approach.
>>> The processing expectation for tools *producing the annotation* is
>>> like this:
>>> - If there is no inline annotation, just create it (e.g. 1) or 2))
>>> - If there is inline annotation, check if there is an id attribute (in
>>> HTML) or xml:id (if XML serialization of HTML is used, and with lower
>>> precedence compared to id). For formats other than HTML, add xml:id if
>>> possible or use the id attribute appropriate for that format.
>>>
>>> Then, for creating standoff annotations, add an
>>> "its:textAnalyticsAnnotations" element to the document, e.g. in HTML
>>> "script" if needed.
>>>
>>> Let's assume before annotation we have
>>> <span its-tan-type="entity"
>>> its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>>> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>>> its-tan-confidence="0.7">Dublin</span>
>>> Then after annotation we would have
>>> <span its-tan-type="entity"
>>> its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>>> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>>> its-tan-confidence="0.7" *id="a8"*>Dublin</span>
>>> and this:
>>> <its:textAnalyticsAnnotations>
>>> <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="term"
>>> its-tan-ident-ref="http://termdatabase.example.com/entry37"
>>> its-tan-info-ref="http://termdatabase.example.com/entry37/description"
>>> its-tan-confidence="1.0"/>
>>> </its:textAnalyticsAnnotations>
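
The producer-side expectation above (create inline markup when none exists; otherwise make sure the node carries an id and add a standoff entry pointing to it) could be sketched in Python roughly as follows. This is only an illustration of the proposal, not an agreed algorithm: the function name annotate, the id_source generator, and the choice to append the standoff container to the document root are assumptions made for the example. The namespace URI is the ITS one.

```python
import itertools
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
ET.register_namespace("its", ITS_NS)
GROUP = f"{{{ITS_NS}}}textAnalyticsAnnotations"
ITEM = f"{{{ITS_NS}}}textAnalyticsAnnotation"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def annotate(root, target, fields, id_source):
    """Attach its-tan-* `fields` to `target`, inline or as standoff.

    `id_source` is a hypothetical iterator of fresh id values; a real tool
    would have to make sure they do not collide with existing ids.
    """
    if not any(name.startswith("its-tan-") for name in target.attrib):
        # No inline annotation yet: just create it.
        target.attrib.update(fields)
        return target

    # Inline annotation exists: make sure the node can be referenced.
    # "id" takes precedence over xml:id, as described in the proposal.
    node_id = target.get("id") or target.get(XML_ID)
    if node_id is None:
        node_id = next(id_source)
        target.set("id", node_id)

    # Assumption: one standoff container, kept as a direct child of the root.
    group = root.find(GROUP)
    if group is None:
        group = ET.SubElement(root, GROUP)
    return ET.SubElement(group, ITEM, {"ref": node_id, **fields})


if __name__ == "__main__":
    root = ET.fromstring("<doc><span>Dublin</span></doc>")
    span = root.find("span")
    ids = (f"a{n}" for n in itertools.count(8))
    annotate(root, span, {"its-tan-type": "entity",
                          "its-tan-confidence": "0.7"}, ids)
    annotate(root, span, {"its-tan-type": "term",
                          "its-tan-confidence": "1.0"}, ids)
    print(ET.tostring(root, encoding="unicode"))
```

The first call writes the entity annotation inline; the second leaves that inline markup untouched, adds an id to the span, and records the term annotation as standoff, which mirrors the before/after example above.
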
>>>
>>> Let's now assume that before annotation we have
>>> <span its-tan-type="term"
>>> its-tan-ident-ref="http://termdatabase.example.com/entry37"
>>> its-tan-info-ref="http://termdatabase.example.com/entry37/description"
>>> its-tan-confidence="1.0">Dublin</span>
>>> Then after annotation we would have
>>> <span its-tan-type="term"
>>> its-tan-ident-ref="http://termdatabase.example.com/entry37"
>>> its-tan-info-ref="http://termdatabase.example.com/entry37/description"
>>> its-tan-confidence="1.0" *id="a8"*>Dublin</span>
>>> and this:
>>> <its:textAnalyticsAnnotations annotatorsRef="tan|tool-x">
>>> <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity"
>>> its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>>> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>>> its-tan-confidence="0.7"/>
>>> </its:textAnalyticsAnnotations>
>>>
>>> Now, if several "entity" annotation tools have been used, we could
>>> also have
>>> <its:textAnalyticsAnnotations>
>>> <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity"
>>> its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>>> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>>> its-tan-confidence="0.7" annotatorsRef="tan|tool-x"/>
>>> <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity"
>>> its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>>> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>>> its-tan-confidence="0.4" annotatorsRef="tan|tool-y"/>
>>> </its:textAnalyticsAnnotations>
>>>
>>> Above approach would also influence the consumption of this data
>>> category, and of annotatorsRef:
>>>
>>> - A consuming tool goes through the document and gathers all
>>> textAnalyticsAnnotations elements
>>> - It then goes through the document. For each element node
>>> * check for existing inline markup. If it's available, add it to the
>>> list of annotations for that node. Assume the inline version up in the
>>> document tree of annotatorsRef to be responsible for the annotation of
>>> that markup.
>>> * check the accumulated standoff textAnalyticsAnnotations elements for
>>> occurrences of IDs that match the node. If there is such an ID, add the
>>> related annotation to the list for the node, including the additional
>>> annotatorsRef tool, e.g. tool-x or tool-y in the above case.
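
The consumer steps above can be sketched in Python along these lines. It is a minimal sketch under assumptions: the sample document, the function names, and the use of the HTML-style its-annotators-ref attribute for the inherited inline annotator are illustrative choices, and the fallback from a standoff item to the annotatorsRef of its container is my reading of the examples, not something the proposal spells out.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

ITS_NS = "http://www.w3.org/2005/11/its"
GROUP = f"{{{ITS_NS}}}textAnalyticsAnnotations"
ITEM = f"{{{ITS_NS}}}textAnalyticsAnnotation"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
TAN_ATTRS = ("its-tan-type", "its-tan-ident-ref", "its-tan-info-ref",
             "its-tan-class-ref", "its-tan-confidence")

# Hypothetical sample document, modelled on the "Dublin" examples.
DOC = """<doc xmlns:its="http://www.w3.org/2005/11/its"
     its-annotators-ref="tan|tool-a">
  <p>Welcome to <span id="a8" its-tan-type="term"
      its-tan-ident-ref="http://termdatabase.example.com/entry37"
      its-tan-confidence="1.0">Dublin</span>.</p>
  <its:textAnalyticsAnnotations>
    <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
        its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
        its-tan-confidence="0.7" annotatorsRef="tan|tool-x"/>
    <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
        its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
        its-tan-confidence="0.4" annotatorsRef="tan|tool-y"/>
  </its:textAnalyticsAnnotations>
</doc>"""


def tan_fields(element):
    """Return the its-tan-* attributes present on an element."""
    return {name: element.get(name)
            for name in TAN_ATTRS if name in element.attrib}


def collect_annotations(root):
    """Return (element, [annotation records]) pairs for annotated nodes."""
    # Step 1: gather all standoff annotations, keyed by the id they reference.
    standoff = defaultdict(list)
    for group in root.iter(GROUP):
        group_tool = group.get("annotatorsRef")
        for item in group.findall(ITEM):
            record = tan_fields(item)
            record["annotator"] = item.get("annotatorsRef", group_tool)
            standoff[item.get("ref")].append(record)

    collected = []

    # Step 2: walk the element nodes and accumulate inline plus standoff data.
    def visit(node, inherited_tool):
        if node.tag == GROUP:
            return  # standoff containers are not content nodes
        tool = node.get("its-annotators-ref", inherited_tool)
        records = []
        inline = tan_fields(node)
        if inline:
            inline["annotator"] = tool  # nearest inline annotator up the tree
            records.append(inline)
        node_id = node.get("id") or node.get(XML_ID)
        if node_id:
            records.extend(standoff.get(node_id, []))
        if records:
            collected.append((node, records))
        for child in node:
            visit(child, tool)

    visit(root, None)
    return collected


if __name__ == "__main__":
    for node, records in collect_annotations(ET.fromstring(DOC)):
        for record in records:
            print(node.get("id"), record)
```

Nothing is overridden here: each node simply ends up with a list of annotations, one per inline or standoff source, which is the accumulation behaviour the proposal aims for.
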
>>>
>>>
>>> In summary, this standoff tries to solve several issues:
>>>
>>> - avoid the 16+ inline attribute monster data category
>>> - allow for multiple annotations of the same span, with different tools
>>> - avoid the ITS1/2 or general inline annotation issues with inheritance
>>> and overriding - as with the standoff approach exemplified at
>>> http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO
>>> annotation information is just accumulated for a given base item (in our
>>> case, element nodes with an ID).
>>>
>>> I'm not yet asking for this change, but I see it as a way forward that
>>> could make the life of both annotation producers (Marcis and Tadej) and
>>> consumers (Yves et al.) simpler. So I'm eager to hear thoughts on
>>> this :)
>>>
>>> Thoughts?
>>>
>>> - Felix
Received on Monday, 28 January 2013 10:43:42 UTC
