Re: Multilayer proposal, not standoff revision (Re: issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup) from Phil Ritchie on 2013-01-28 (public-multilingualweb-lt@w3.org from January 2013)

From: Phil Ritchie <philr@vistatec.ie>
Date: Mon, 28 Jan 2013 11:32:53 +0000
To: Felix Sasaki <fsasaki@w3.org>
Cc: Dave Lewis <dave.lewis@cs.tcd.ie>, public-multilingualweb-lt@w3.org
Message-ID: <OF49A1D8C7.D2E86916-ON80257B01.003DED58-80257B01.003F6F8B@vistatec.ie>
1) If I'm understanding this right, you seem to invert the reference 
mechanism between the element being annotated and the local stand-off 
element when compared to the similar mechanisms we already have for 
locQualityIssue and provenance. i.e. the standoff element here references 
the annotated element rather than the other way around in the current 
standoff mechanisms. 

Could you clarify why this different approach is needed? 

First part of the answer is: to resolve a last call comment. We can't
easily move forward without doing that.
Second part of the answer: to have a broad consensus that ITS2 is a good
thing. This is not an aspect of the W3C process. But if people think from
the start that ITS2 is broken, we have a problem.
Third part: see my response to Yves about the "similar mechanisms" topic.
I don't think we need to change other parts of the spec to accommodate
issue-68. Christian, can you correct me if I'm wrong?

I agree with Dave: it seems as though the referencing mechanism is
reversed. Felix's answer (nevertheless importantly) speaks more to the
reasons for keeping the project moving forward than to the question of why
the pointer mechanism is reversed.


Second part of the answer: to have a broad consensus that ITS2 is a good
thing. This is not an aspect of the W3C process. But if people think from
the start that ITS2 is broken, we have a problem.

ITS2 has good intentions, but it is perhaps on the cusp of trying to do
too much. I do not mean to be heretical here. It is a factor that we all
see the enormous potential opportunities of having such metadata for our
particular use cases. If we can confirm that the stand-off processing
algorithm is consistent across data categories (and I can read this
lengthy post again), I think I'd be happy to accept Felix's proposal.

Phil.





From:   Felix Sasaki <fsasaki@w3.org>
To:     Dave Lewis <dave.lewis@cs.tcd.ie>, 
Cc:     public-multilingualweb-lt@w3.org
Date:   28/01/2013 07:39
Subject:        Multilayer proposal, not standoff revision (Re: issue-68 
from an  annotation representation point of view, with  potential 
implications for  annotatorsRef and standoff markup)



Hi Dave, all,

On 28.01.13 01:23, Dave Lewis wrote:
Hi Felix,
Some thoughts on this proposal, primarily in comparison to the existing 
stand-off mechanisms:

1) If I'm understanding this right, you seem to invert the reference 
mechanism between the element being annotated and the local stand-off 
element when compared to the similar mechanisms we already have for 
locQualityIssue and provenance. i.e. the standoff element here references 
the annotated element rather than the other way around in the current 
standoff mechanisms. 

Could you clarify why this different approach is needed? 

First part of the answer is: to resolve a last call comment. We can't
easily move forward without doing that.
Second part of the answer: to have a broad consensus that ITS2 is a good
thing. This is not an aspect of the W3C process. But if people think from
the start that ITS2 is broken, we have a problem.
Third part: see my response to Yves about the "similar mechanisms" topic.
I don't think we need to change other parts of the spec to accommodate
issue-68. Christian, can you correct me if I'm wrong?


As it stands I see the following problems with this inverted approach:
i) it means implementors potentially need to support two mechanisms for 
handling standoff mark-up in different data categories (and therefore 
introduces a lot of uncertainties  for the LC compared to reusing the 
existing mechanism)

No, see above.

ii) this would get complex if other (non-ITS) functions are
creating/rewriting the id values

No, if we restrict this to where it has a real added value (see answers 
1-3 above): disambig+terminology.

iii) you lose the ability to associate standoff elements and content
through global ITS rules, and hence lose the ability to annotate content
in attributes.

True - but that ability is not needed for disambiguation and terminology
anyway: as I understand it, most annotation tools in both areas work on
text content. Also, I haven't seen global rules for terminology working
on attribute content. Others, have you?

iv) assuming the confidence attribute stays optional (or the confidence
applies to several occurrences), for compactness you may want to refer to
several elements where the annotated text reoccurs from the same
textAnalyticsAnnotation - this approach doesn't allow that, I think

Good point - you would need to allow confidence at the
"textAnalyticsAnnotations" element as well to accommodate that, see below:

<its:textAnalyticsAnnotations annotatorsRef="tan|tool-x" 
its-tan-confidence="0.7">
<its:textAnalyticsAnnotation ref="a8" its-tan-type="entity" 
its-tan-ident-ref="http://dbpedia.org/resource/Dublin" its-tan-class-ref=
"http://nerd.eurecom.fr/ontology#Place" />
</its:textAnalyticsAnnotations>

Of course, if "its-tan-confidence" appears at both a
"textAnalyticsAnnotation" element and a "textAnalyticsAnnotations"
element, the former appearance takes precedence.
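
A minimal sketch of that precedence rule, assuming the element and attribute names from the example above and the ITS namespace URI http://www.w3.org/2005/11/its. The wrapper element "doc", the second annotation with ref="a9", and the function name effective_confidences are made up for illustration and are not part of the proposal:

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"

# Hypothetical document: "a8" has no confidence of its own, "a9" does.
DOC = """
<doc xmlns:its="http://www.w3.org/2005/11/its">
  <its:textAnalyticsAnnotations annotatorsRef="tan|tool-x" its-tan-confidence="0.7">
    <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"/>
    <its:textAnalyticsAnnotation ref="a9" its-tan-type="entity" its-tan-confidence="0.4"/>
  </its:textAnalyticsAnnotations>
</doc>
"""

def effective_confidences(xml_text):
    """Return (ref, confidence) pairs; the value on the single annotation
    wins over the value on the enclosing textAnalyticsAnnotations element."""
    root = ET.fromstring(xml_text)
    pairs = []
    for container in root.iter(f"{{{ITS_NS}}}textAnalyticsAnnotations"):
        default = container.get("its-tan-confidence")
        for ann in container.findall(f"{{{ITS_NS}}}textAnalyticsAnnotation"):
            value = ann.get("its-tan-confidence", default)
            pairs.append((ann.get("ref"), None if value is None else float(value)))
    return pairs
```

Here "a8" would pick up 0.7 from the container, while "a9" keeps its own 0.4.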



On the other hand we still don't have a clear idea of how to apply 
annotatorsRef for multiple annotations with the current standoff pattern 
from lqi and provenance

Sorry, I may have missed that: could you point me to the issue number for 
that problem?

, and we can't duck that here because it's needed when confidence scores
are used. One approach could be to apply annotatorsRef only to
mtConfidence score, and use a dedicated tan-annotator attribute here.

2) a more minor issue, in your processing expectation for adding the 
annotation, you state that if there isn't an inline attribute then you 
should add it inline before adding a new textAnalyticsAnnotation. However  
with the current stand-off approaches we don't mandate this. You could in 
fact put _all_ your annotations in the stand-off, and for the XLIFF 
mapping for lqi and provenance we need to keep that option available to 
implementors.

See above - I wouldn't change anything with regards to the current 
stand-off approach.


3) one general, more philosophical point. You correctly note we didn't 
explicitly discuss whether ITS annotation mechanisms were suitable for 
term and disambig. We've implicitly limited ourselves to data categories 
that make sense with the existing annotation mechanisms. We've stretched 
this a bit with local standoff and annotatorsRef individually, but
combining these new mechanisms is not something we've figured out how to
do yet (a symptom of that stretching).

Again, can you point me to the open issues? See the question above. I had 
thought we had resolved these at the f2f (even if not with perfect 
solutions), but if there is something to follow up on I'm eager to see it 
rather soon.

The approach you suggest here seems to add a new type of annotation 
pattern, stretching us further. 

Only if we try to make it a general principle. Otherwise I think it is a 
pretty easy implementation approach (see the "resolve all ID attributes"
in the reply to Yves' mail).


Perhaps it's better to restrict ourselves to what makes sense to do with
existing, tested ITS mechanisms while adding pointers to external formats
that can be used for the more complicated cases that these can't handle. 
We took this approach with provenance, where we support simple agent 
provenance inline and provide an external link which gives us the 
possibility to build best practice for combining ITS and the W3C PROV 
model for more complex cases (see ISSUE-71). 


Good point - however, this would leave us open with issue-68, see my 
answers 1-3 above.



So we could allow a link to NIF to deal with cases where we have multiple 
annotations for the same text, or nested annotations or overlapping 
annotations, which NIF is designed to deal with. 

NIF too has open issues, and we have a weak implementation commitment for
it: just two, Sebastian and I, and Sebastian obviously is busy with other
stuff too. So I'm not 100% sure whether we can count on NIF here.
Besides, see the comment from Yves on this: NIF is yet another
implementation branch. The multilayer proposal (now using a different
terminology, and a different mail subject) is ideally just one additional
method while traversing the document tree: getTanAnnotations(nodeID), with
nodeID being the xml:id or ID attribute.

With some sensible best practice, we could use a combination of 
termInfoRef  and NIF to deal with many of these more complex cases. 

See again Yves' reply. As a prototype implementer, I think that the 
multilayer approach is easy to do.
Also, note that it relies on element nodes and their IDs, and not on 
character offsets. The ID attributes have of course stability issues, but 
with character offsets these are even worse - a pure re-formatting 
destroys the offsets. See again the NIF issue.
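
To illustrate the stability point, here is a small sketch under stated assumptions: the two sample documents and the helper names text_offset and text_by_id are hypothetical. It compares a character offset with an ID lookup before and after a pure re-indentation of the same content:

```python
import xml.etree.ElementTree as ET

# The same content, once on a single line and once re-indented.
ORIGINAL = '<p>Welcome to <span id="a8">Dublin</span>.</p>'
REFORMATTED = '<p>\n  Welcome to\n  <span id="a8">Dublin</span>.\n</p>'

def text_offset(xml_text, word):
    """Character offset of word within the concatenated text content."""
    return "".join(ET.fromstring(xml_text).itertext()).find(word)

def text_by_id(xml_text, node_id):
    """Text of the element whose id attribute equals node_id, or None."""
    node = ET.fromstring(xml_text).find(f".//*[@id='{node_id}']")
    return None if node is None else node.text
```

The offset of "Dublin" differs between the two versions because of the added whitespace, while text_by_id(..., "a8") returns "Dublin" for both. The ID only breaks if a tool rewrites or drops it.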


Essentially I'm arguing for the status quo here, living with the limited
but still useful scope of the current inline term+disambig and using best
practice to give us the time to work out upgrade paths from this to
supporting more complex use cases with term+disambig+NIF.

Thanks a lot for the feedback - I understand your position I think, but I 
hope that we can continue the discussion on this, to avoid a monster data 
category or to avoid concerns on ITS2 from the start.


Best,

Felix


cheers,
Dave

On 27/01/2013 07:24, Felix Sasaki wrote:
Hi all,

sorry, this is going to be long ... but please have a look, esp. the 
implementers (both consumers and producers) of terminology and 
disambiguation.

In the last 10 1/2 months, since Tadej's presentation at the Dublin
workshop, we had a lot of discussions on disambiguation, sometimes (as
now) including terminology. But it seems that we never discussed whether
the ITS2 approach of selection (global, local, inheritance, overriding
(partial or not)...) is suitable for this type of information.

By "this type" I mean annotation of linguistic information. Most ITS2 and 
ITS1 data categories are process related (e.g. "Don't translate this 
..."), but both terminology and what's now called disambiguation are 
information that you find in linguistic corpora and processing tools. Now,
my point is that both in such natural language processing tool chains
and in related corpora, a representation of information *inline per document
node* is rather the exception. Mostly you have *standoff information*,
that is, a complete separation of information from actual content - as in
NIF.

Why is that? In linguistic annotation it is common that you have several 
layers of information, like our lexical, ontological etc. information. 
Some of these might be complex in itself (e.g. named entities), some of 
these might be related to others (e.g. an ontological concept related to a 
lexical item). I won't try to define these layers here - but my point is 
that due to the complexity of representing such information inline, nearly 
nobody is trying to represent several layers at the same time inline. The 
common approach is rather to have a base layer, and then pointers from the 
various annotation layers.

In a sense you can describe NIF as an approach of taking character offsets 
as the implicit base layer (implicit because characters don't need 
explicit anchors). The TEI here
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO
provides an example for an offset using words as the base unit, with
explicit xml:id attributes.

So far we haven't taken this approach for terminology or disambiguation.
This is why we had to come up with 16+ attributes: if you want to do
everything "inline", you need to differentiate attribute names and come up
with a monster data category. Inline annotations are just not suitable for
such information.

So, the first idea behind the approach below is: if you want to represent just
one linguistic layer (or "qualifier" in Christian's mail at
http://lists.w3.org/Archives/Public/public-multilingualweb-lt-comments/2013Jan/0014.html
), you use the "tan-type" attribute to differentiate annotations. That leads
to the following inline models:

1) A term has its-tan-type with value "term" and optional 
its-tan-confidence, its-tan-ident-ref, and its-tan-info-ref. Example:
<span its-tan-type="term" its-tan-ident-ref=
"http://termdatabase.example.com/entry37" its-tan-info-ref=
"http://termdatabase.example.com/entry37/description" 
its-tan-confidence="1.0">Dublin</span>
Comparison to current ITS1 "Terminology":
its-tan-type="term" plays the role of term="yes". its-tan-info-ref plays
the role of termInfoRef. its-tan-ident-ref links to a term database.
its-tan-confidence provides confidence information.
(Esp. at Marcis: I know that "Dublin" is a bad candidate for a term, I'm 
just trying to exemplify the annotation approach here)

2) An entity has its-tan-type with value "entity" and optional 
its-tan-confidence, its-tan-ident-ref, and its-tan-class-ref. Example:
<span its-tan-type="entity" its-tan-ident-ref=
"http://dbpedia.org/resource/Dublin" its-tan-class-ref=
"http://nerd.eurecom.fr/ontology#Place"
its-tan-confidence="0.7">Dublin</span>

So the above only differs in naming from the current "Terminology" and
"Disambiguation". Below is the standoff approach. The processing
expectation for tools *producing the annotation* is like this:
- If there is no inline annotation, just create it (e.g. 1) or 2))
- If there is inline annotation, check if there is an id attribute (in
HTML) or xml:id (if the XML serialization of HTML is used, and with lower
precedence compared to id). If there is none, add one. For formats other
than HTML, add xml:id if possible or use the id attribute appropriate for
that format.

Then, for creating standoff annotations, add an 
"its:textAnalyticsAnnotations" element to the document, e.g. in HTML 
"script" if needed.

Let's assume before annotation we have
<span its-tan-type="entity" its-tan-ident-ref=
"http://dbpedia.org/resource/Dublin" its-tan-class-ref=
"http://nerd.eurecom.fr/ontology#Place" 
its-tan-confidence="0.7">Dublin</span>
Then after annotation we would have
<span its-tan-type="entity" its-tan-ident-ref=
"http://dbpedia.org/resource/Dublin" its-tan-class-ref=
"http://nerd.eurecom.fr/ontology#Place" its-tan-confidence="0.7" id="a8"
>Dublin</span>
and this:
<its:textAnalyticsAnnotations>
<its:textAnalyticsAnnotation ref="a8" its-tan-type="term" 
its-tan-ident-ref="http://termdatabase.example.com/entry37" 
its-tan-info-ref="http://termdatabase.example.com/entry37/description" 
its-tan-confidence="1.0"/>
</its:textAnalyticsAnnotations>

Let's now assume that before annotation we have
<span its-tan-type="term" its-tan-ident-ref=
"http://termdatabase.example.com/entry37" its-tan-info-ref=
"http://termdatabase.example.com/entry37/description" 
its-tan-confidence="1.0">Dublin</span>
Then after annotation we would have
<span its-tan-type="term" its-tan-ident-ref=
"http://termdatabase.example.com/entry37" its-tan-info-ref=
"http://termdatabase.example.com/entry37/description" 
its-tan-confidence="1.0" id="a8">Dublin</span>
and this:
<its:textAnalyticsAnnotations annotatorsRef="tan|tool-x">
<its:textAnalyticsAnnotation ref="a8" its-tan-type="entity" 
its-tan-ident-ref="http://dbpedia.org/resource/Dublin" its-tan-class-ref=
"http://nerd.eurecom.fr/ontology#Place" its-tan-confidence="0.7"/>
</its:textAnalyticsAnnotations>
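
The producer-side expectation and the two before/after examples above can be sketched roughly as follows. This is only an illustration under assumptions: the function name annotate, the caller-supplied new_id, and placing the "its:textAnalyticsAnnotations" container directly under the root (rather than, say, inside an HTML "script" element) are choices made for the sketch, not part of the proposal.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
CONTAINER = f"{{{ITS_NS}}}textAnalyticsAnnotations"
ITEM = f"{{{ITS_NS}}}textAnalyticsAnnotation"

def annotate(root, target, attrs, new_id):
    """Write attrs inline if target has no annotation yet; otherwise
    make sure target has an id and add a standoff annotation for it."""
    if target.get("its-tan-type") is None:
        target.attrib.update(attrs)          # no inline annotation: create it
        return "inline"
    if target.get("id") is None:
        target.set("id", new_id)             # hypothetical caller-supplied ID
    container = root.find(CONTAINER)
    if container is None:                    # assumption: container under root
        container = ET.SubElement(root, CONTAINER)
    ET.SubElement(container, ITEM, {"ref": target.get("id"), **attrs})
    return "standoff"
```

Calling annotate on a span that already carries its-tan-type="term" with attrs for an "entity" annotation leaves the inline markup untouched, adds id="a8" if that is the new_id passed in, and appends one "its:textAnalyticsAnnotation" with ref="a8".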

Now, if several "entity" annotation tools have been used, we could also 
have
<its:textAnalyticsAnnotations>
<its:textAnalyticsAnnotation ref="a8" its-tan-type="entity" 
its-tan-ident-ref="http://dbpedia.org/resource/Dublin" its-tan-class-ref=
"http://nerd.eurecom.fr/ontology#Place" its-tan-confidence="0.7" 
annotatorsRef="tan|tool-x"/>
<its:textAnalyticsAnnotation ref="a8" its-tan-type="entity" 
its-tan-ident-ref="http://dbpedia.org/resource/Dublin" its-tan-class-ref=
"http://nerd.eurecom.fr/ontology#Place" its-tan-confidence="0.4" 
annotatorsRef="tan|tool-y"/>
</its:textAnalyticsAnnotations>

The above approach would also influence the consumption of this data
category, and of annotatorsRef:

- A consuming tool goes through the document and gathers all
textAnalyticsAnnotations elements
- It then goes through the document. For each element node
* check for existing inline markup. If it's available, add it to the list
of annotations for that node. Assume the annotatorsRef found inline on the
node or further up in the document tree to be responsible for that markup.
* check the accumulated standoff textAnalyticsAnnotations elements for
occurrences of IDs that match the node. If there is such an ID, add the
related annotation to the list for the node, including the additional
annotatorsRef tool, e.g. tool-x or tool-y in the above case.
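
The two consumer steps above could be sketched like this. It is a rough illustration, not a reference implementation: the function names collect_standoff and get_tan_annotations, the TAN_ATTRS tuple, and passing the inherited annotatorsRef in as a parameter (ElementTree has no parent pointers) are assumptions of the sketch. Falling back to the annotatorsRef on the container follows the examples above.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
TAN_ATTRS = ("its-tan-type", "its-tan-ident-ref", "its-tan-class-ref",
             "its-tan-info-ref", "its-tan-confidence")

def collect_standoff(root):
    """Step 1: gather all standoff annotations, keyed by the ID they reference."""
    by_ref = {}
    for container in root.iter(f"{{{ITS_NS}}}textAnalyticsAnnotations"):
        for ann in container.findall(f"{{{ITS_NS}}}textAnalyticsAnnotation"):
            record = {k: ann.get(k) for k in TAN_ATTRS if ann.get(k) is not None}
            # Annotation-level annotatorsRef wins, else the container's value.
            record["annotatorsRef"] = ann.get("annotatorsRef",
                                              container.get("annotatorsRef"))
            by_ref.setdefault(ann.get("ref"), []).append(record)
    return by_ref

def get_tan_annotations(node, standoff, inherited_annotators_ref=None):
    """Step 2: inline markup first, then standoff entries matching the node ID."""
    annotations = []
    if node.get("its-tan-type") is not None:
        inline = {k: node.get(k) for k in TAN_ATTRS if node.get(k) is not None}
        inline["annotatorsRef"] = inherited_annotators_ref
        annotations.append(inline)
    node_id = node.get("id") or node.get(XML_ID)   # id takes precedence over xml:id
    annotations.extend(standoff.get(node_id, []))
    return annotations
```

For a span with id="a8" carrying an inline "term" annotation plus the two standoff "entity" annotations from tool-x and tool-y, the returned list has three entries, in that order. Annotations are only accumulated; nothing is overridden.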


In summary, this standoff tries to solve several issues:

- avoid the 16+ inline attribute monster data category
- allow for multiple annotations of the same span, with different tools
- avoid the ITS1/2 or general inline annotation issues with inheritance
and overriding - as with the standoff approach exemplified at
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO
annotation information is just accumulated for a given base item (in our
case, element nodes with an ID).

I'm not yet asking for this change, but I see it as a way forward that 
could make the life of both annotation producers (Marcis and Tadej) and 
consumers (Yves et al.) simpler. So I'm eager to hear thoughts on this :)

Thoughts?

- Felix  



Received on Monday, 28 January 2013 11:33:24 UTC