Re: issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup from Felix Sasaki on 2013-01-29 (public-multilingualweb-lt@w3.org from January 2013)

From: Felix Sasaki <fsasaki@w3.org>
Date: Tue, 29 Jan 2013 10:26:39 +0100
To: Mārcis Pinnis <marcis.pinnis@Tilde.lv>
CC: "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>, Artūrs Vasiļevskis <arturs.vasilevskis@Tilde.lv>
Message-ID: <510795CF.8040508@w3.org>
Hi Mārcis, all,

even if this discussion has now continued in a different thread, let me 
give further feedback here too - it may help to clarify things, and to 
continue the discussion in general.


Am 28.01.13 11:18, schrieb Mārcis Pinnis:
>
> Hi Felix, all,
>
> I see that there have been a lot of opinion exchanges on the proposal 
> brought up by Felix.
>
> I have some comments to add. I am now speaking as a data producer and 
> later maybe also a data consumer (and I am not speaking as a linguist! 
> ... that has to be understood as well).
>
> First of all, I would like to ask whether we agreed that ITS 2.0 
> should be able to represent data in the structure as TEI, NIF, XCES or 
> other NLP related standards do – that is, as far as I understand, the 
> direction where this discussion is heading. Should ITS 2.0 try to 
> re-invent these data standards? I would incline to saying – no!
>


As far as I understand, these standards are not yet implemented in 
localization tool chains. However, the "multilayer annotation" proposal 
brought one feature from these standards into such tool chains: the 
standoff mechanism. I'd rather see this as a value than a problem: 
bringing NLP friendly representations into localization workflows. Would 
you disagree?

> Secondly, as we are in a last call phase, I understand that such 
> significant change to the ITS 2.0 data categories would rewrite them 
> (and maybe it will get clearer when you read my comments till the 
> end). I as a data producer now will have to rewrite my parsers and 
> data producing systems just to accommodate the „stand-off” mechanisms, 
> which is in a content providers and content consumers perspective a 
> diametric change to just adding additional independent attributes or 
> changing the names of attributes (which was actually the initial 
> proposal by Tadej and me). I would like for others to understand that 
> this solution asks for re-development rather than simple adjustments.
>

I agree - this would be quite some work, and we need to justify the 
benefit clearly.


> Other comments are inline below...
>
> After reading the comments here is a summary:
>
> In my understanding the proposal complicates data production and 
> consumption significantly as it creates possibilities for a lot of 
> ambiguity, which I guess is the opposite of what initially was meant 
> by the disambiguation data category(!) and at least in our Use Case it 
> requires revision of parser logics and ITS 2.0 metadata annotation logics.
>

The proposal basically says: here is a way to represent ambiguity, 
created by several tools annotating the same document. However, I'd see 
this as a value, not a problem: with separate 
"its:textAnalyticsAnnotations" elements, including each its own 
annotatorsRef, you can clearly identify which tool created what 
annotation. This may be even clearer than the current annotatorsRef.

> However, I will have a discussion with my colleagues in order to 
> estimate how much changes would be required to our use case from a 
> development perspective.
>
> I also understand that this proposal wants to fuse all types of 
> possible NLP-related text analyses together, but I did not have the 
> feeling that ITS 2.0 should be used as a TEI, XCES, NIF, etc. clone? 
> This is how I see where the changes will lead us.
>
> However, I also do not say that that is a bad thing... we would 
> definitely make linguists happier, but I as a content provider 
> and later also a consumer would have difficulties working with the 
> data as I would have to agree accepting uncertainty/ambiguity in the 
> ITS 2.0 metadata by default (except external resources as those are 
> defined between consumers/producers and not ITS 2.0).
>
> Best regards,
>
> Mārcis ;o)
>
> *From:*Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Sunday, January 27, 2013 9:25 AM
> *To:* public-multilingualweb-lt@w3.org
> *Subject:* issue-68 from an annotation representation point of view, 
> with potential implications for annotatorsRef and standoff markup
>
> Hi all,
>
> sorry, this is going to be long ... but please have a look, esp. the 
> implementers (both consumers and producers) of terminology and 
> disambiguation.
>
> in the last 10 1/2 months, since Tadej's presentation at the Dublin 
> workshop, we had a lot of discussions on disambiguation, and sometimes 
> (as now) including terminology. But it seems that we never discussed 
> whether the ITS2 approach of selection (global, local, inheritance, 
> overriding (partial or not)...) is suitable for this type of information.
>
> By "this type" I mean annotation of linguistic information. Most ITS2 
> and ITS1 data categories are process related (e.g. "Don't translate 
> this ..."), but both terminology and what's now called disambiguation 
> are information that you find in linguistic corpora and processing 
> tools. Now, my point is that both in such natural language 
> processing tool chains and related corpora, a representation of 
> information *inline per document node* is rather the exception. Mostly 
> you have *standoff information*, that is a complete separation of 
> information from actual content - as in NIF.
>
> Mārcis:
>
> Parsing and understanding of the mark-up is the main difference (how 
> overriding and inheritance work) that requires this „stand-off” 
> mechanism for „this type” of annotation. If there would be only flat 
> level annotation, we would not have this discussion at all. Also, 
> “stand-off” is only good if you really have to add a lot of complex 
> data, but here we have to add just a flag or a reference (if put in 
> simple words). In Prague me and Tadej discussed that if hierarchical 
> information is needed, that should be encoded in the external resources.
>
> If I understand correctly, stand-off mark-up has no inheritance and it 
> has no overriding – it describes a span?
>

Correct.

> If so, I assume that with your proposal we are back at requiring 
> hierarchical annotation, overlapping annotation and contradictive 
> annotation, which will allow all kinds of text analysis annotations 
> (without restricted types – term, entity, ontology, lexical, etc.). 
> This will require data consumers to re-think their data consumption 
> strategies as they will have to disambiguate the 
> “disambiguation-style” annotations (which means that at the end we do 
> not help data consumers, but make the life rather more difficult).
>

As said above: if a consumer doesn't want to deal with several layers of 
annotations, it can just say: I want to consume the annotations made by 
Tilde or by JSI. This is guaranteed by the annotatorsRef attribute.
The status quo creates this situation: if Tilde already has 
annotated a text, and JSI wants to add annotations, and you want to 
compare them: how to do this? You can say "one creates terminology 
markup, the other disambiguation markup". But what about even more tools?
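
To make the "just consume one tool's annotations" point concrete, here is a minimal sketch in Python. It assumes the standoff form from the proposal quoted further below; the namespace URI, the helper names collect_standoff and from_tool, and the tool names tool-x / tool-y are illustrative assumptions, not something the proposal fixes.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumption: the proposal does not fix a namespace for the standoff
# elements; the ITS namespace URI is used here for illustration only.
ITS_NS = "http://www.w3.org/2005/11/its"

STANDOFF = f"""
<its:textAnalyticsAnnotations xmlns:its="{ITS_NS}">
  <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
      its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
      its-tan-confidence="0.7" annotatorsRef="tan|tool-x"/>
  <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
      its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
      its-tan-confidence="0.4" annotatorsRef="tan|tool-y"/>
</its:textAnalyticsAnnotations>
"""


def collect_standoff(xml_text):
    """Group standoff annotations by the id of the node they point at."""
    root = ET.fromstring(xml_text)
    by_ref = defaultdict(list)
    for ann in root.iter(f"{{{ITS_NS}}}textAnalyticsAnnotation"):
        by_ref[ann.get("ref")].append(dict(ann.attrib))
    return by_ref


def from_tool(annotations, tool):
    """Keep only annotations whose annotatorsRef names the given tool."""
    # annotatorsRef has the shape "datacategory|tool"; compare the tool part.
    return [a for a in annotations
            if a.get("annotatorsRef", "").partition("|")[2] == tool]


layers = collect_standoff(STANDOFF)
only_x = from_tool(layers["a8"], "tool-x")
```

A consumer that only trusts one tool would work with only_x and never look at the other layer; a consumer that wants to compare tools keeps the full list in layers["a8"].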

> In the current ITS 2.0 draft the annotation is flat - it is simple to 
> parse, simple to consume, simple to produce – it is not hierarchical 
> and it does not overlap.
>

See above - if a consumer does not want to consume relations between 
annotation tools or levels, it doesn't have to, and annotatorsRef gives 
it the ability to differentiate the annotations.

Btw., current disambiguation and terminology also don't inherit, see the 
table at
http://www.w3.org/TR/2012/WD-its20-20121206/#datacategories-defaults-etc
that is: the annotations of both data categories don't inherit to nested 
markup. So we could resolve the issue also via something like this:
Input before annotation: Dublin
First annotation: <span its-term="yes">Dublin</span>
Second annotation: <span its-term="yes"><span its-term="no">Dublin</span></span>

But that has the annotatorsRef issue if several "term annotation" tools 
have been used.

> From this perspective, the proposed change is a complete overhaul of 
> the 2 data categories in something different.
>
> Also – we do require the flag. That is something that will be heavily 
> complicated with the “stand-off” mechanism (that has to be 
> understood), or won’t be possible at all?!
>

Setting the type would give you the flag. I know that in the proposal at
http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2013Jan/0212.html
Tadej dropped the flag. But we could have, instead of fixed values such as 
"term", a URI. You could then interpret that URI as a term flag, e.g.
<span tan-type="http://example.com/term">


> Having a simple attribute inline is the simplest you can achieve. 
> Having a “stand-off” on the other hand is the most complex you can 
> achieve.
>
> And ... if I remember correctly, we did not want to make life 
> difficult for producers/consumers if they did not care about the other 
> data categories?
>


Correct, but here we have the situation that two data categories might 
be just too similar for keeping everything as is.

>
>
> Why is that? In linguistic annotation it is common that you have 
> several layers of information, like our lexical, ontological etc. 
> information. Some of these might be complex in itself (e.g. named 
> entities), some of these might be related to others (e.g. an 
> ontological concept related to a lexical item). I won't try to define 
> these layers here - but my point is that due to the complexity of 
> representing such information inline, nearly nobody is trying to 
> represent several layers at the same time inline. The common approach 
> is rather to have a base layer, and then pointers from the various 
> annotation layers.
>
> In a sense you can describe NIF as an approach of taking character 
> offsets as the implicit base layer (implicit because characters don't 
> need explicit anchors). The TEI here
> http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO
> provides an example for an offset using words as the base unit, with 
> explicit xml:id attributes.
>
> So far we haven't taken this approach for terminology or 
> disambiguation. This is why we had to come up with 16+ attributes: if 
> you want to do everything "inline", you need to differentiate 
> attribute names and come up with a monster data category. Inline 
> annotations are just not suitable for such information.
>
> Mārcis:
>
> I disagree that 16+ attributes are the difficulty here. The difficulty 
> from the beginning were the questions: 1) how many types of annotation 
> should be supported (we narrowed the list down to 4 – terminology, 
> named entities, ontology concepts, lexical concepts)? 2) should 
> overlapping be supported? 3) should hierarchical annotation be 
> supported? 4) should contradicting annotation be supported?
>

about 1): no type at all would be one solution, but the term identifier 
issue would come up. about 2): if a consumer just takes up one 
annotation, e.g. the output of Tilde's tool, there is no need to process 
overlap. And we can leave that to consumers IMO. 3): same as 2). 4): 
same as 2).


> Also ... data producers would have to worry just about a maximum of 5 
> attributes simultaneously and they would be able to ignore the rest. 
> For instance, I have no use for the attributes for disambiguation 
> categories.
>

I think that's the heart of issue-68: there are two quite similar pieces 
of information, but consumers separate them.


> Although I would agree writing a parser that parses all these 
> attributes (just for compliancy with the data category), I would as a 
> consumer consume only the ones related to terminology and I as a 
> producer would produce only those related to terminology. I would nor 
> consume, nor produce the disambiguation related attributes.
>

That wouldn't work if we have one data category: our conformance 
requirements say: you implement it global or local or both. You then can 
also decide whether you implement it in HTML or XML or both. But you 
cannot cherry pick attributes for consumption. We don't say anything wrt 
production - but our schema helps us to verify that the "right data" has 
been produced.

> From that perspective, I disagree to the complexity in the attribute 
> scenario.
>

I think part of the disagreement comes from the "free spirit" you have 
as data producer and consumer, see above.


> For terminology I require a flagging mechanism (with the possibility 
> to add either a reference, a confidence score, or both).
>
> I do agree that we are limiting the annotation with having separate 
> attributes, but then again ... ITS 2.0 does not have to represent 
> every possible text analysis annotation type. It is supposed to aid in 
> localisation processes and not all text analysis types have a valid 
> use case (or a necessary or even a potentially useful use case) in 
> localisation.
>
> Also ... if we are re-inventing terminology and disambiguation, maybe 
> we should analyse which other data categories fall under the type 
> “text analysis”? Domain is a suitable candidate as well (and if we 
> create a suitable text analysis category, maybe domain analysis can be 
> subcategorized under that as well in order to support automated domain 
> analysis solutions (EuroVoc has an automated domain classifier, for 
> instance))?).
>


Here I would disagree: our domain data category is just for transporting 
domain information between content and tools, including a potential 
mapping of domain identifiers inbetween. The "terminology vs 
disambiguation" discussion came from the observations that two data 
categories in ITS2 have a huge overlap. I don't see that situation for 
domain.

> With this I would like to emphasize that overgeneralization is not the 
> best approach as we are creating data categories for different 
> consumption scenarios.
>

But are they so different? It sounds to me rather that in your scenario, 
many opportunities are lost because you don't consume disambiguation at 
all ... so having one umbrella data category might even give you more 
data consumption opportunities.


>
>
> So, the first idea behind below approach is: if you want to represent 
> just one linguistic layer (or "qualifier" in Christian's mail at
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt-comments/2013Jan/0014.html
> ) , you use the "tan-type" attribute to differentiate annotations. That 
> leads to the following inline models:
>
> 1) A term has its-tan-type with value "term" and optional 
> its-tan-confidence, its-tan-ident-ref, and its-tan-info-ref. Example:
> <span its-tan-type="term" 
> its-tan-ident-ref="http://termdatabase.example.com/entry37" 
> its-tan-info-ref="http://termdatabase.example.com/entry37/description" 
> its-tan-confidence="1.0">Dublin</span>
> Comparison to current ITS1 "Terminology":
> its-tan-type="term" plays the role of term="yes". its-tan-info-ref 
> plays the role of termInfoRef. its-tan-ident-ref links to a term data 
> base. its-tan-confidence provides confidence information.
> (Esp. at Marcis: I know that "Dublin" is a bad candidate for a term, 
> I'm just trying to exemplify the annotation approach here)
>
> Mārcis:
>
> Also one thing I tried to emphasize at lunchtime in Prague, 
> TermInfoRef is not necessarily an identity reference. It does not 
> always point to something unique (if we understand that a set is not 
> unique). You can have multiple term entries from multiple user 
> collections in a term bank relating to one semantic term. In the case 
> if you do not specify a domain you could end up having a reference 
> that points to totally different (also contrasting) terms or if you do 
> not specify a target language you may end up having multiple entries 
> because most of the collections are bilingual and not multilingual. 
> Why is that so? It is because a term-bank is not a disambiguator – it 
> acts like a search engine (more or less) – the disambiguation for the 
> “external” information (the meaning; the term unithood is defined by 
> the flag term=”yes” itself) has to be done by the consumers 
> (translation engines or human translators). In most cases (as in the 
> biggest term-banks – IATE, ETB) it does not have a hierarchical 
> understanding of terms as some lexical (WordNet, f.i.) or ontological 
> resources may have. For MT engines a valuable information is already – 
> term=“yes” as that defines the term unithood, which means that the 
> term should be translated as a non-breakable phrase. So ... the MT 
> engine could ignore the TermInfoRef at all if it does not have a 
> suitable disambiguation module and maybe leave the disambiguation to 
> human post-editors...
>
> So ... “ident” is misleading (at least in the case of Terminology 
> annotation)!
>
> Also important: HOW WOULD YOU REPRESENT term=”no”? This is a very 
> important feature of the flag type annotation.
>
> would you say: its-tan-type="not-a-term"? That would require data 
> producers to handle higher complexity annotation!
>


I don't have a clear answer to above questions - others, feel free to 
chime in if you do.

> 2) An entity has its-tan-type with value "entity" and optional 
> its-tan-confidence, its-tan-ident-ref, and its-tan-class-ref. Example:
> <span its-tan-type="entity" 
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
> its-tan-confidence="0.7">Dublin</span>
>
> So above is only different naming compared to current "Terminology" 
> and Disambiguation. Below is now the standoff approach. The processing 
> expectation for tools *producing the annotation* is like this:
> - If there is no inline annotation, just create it (e.g. 1) or 2))
> - If there is inline annotation, check if there is an id attribute (in 
> HTML) or xml:id (if XML serialization of HTML is used and with lower 
> precedence compared to id). For formats other than HTML, add xml:id if 
> possible or use the id attribute appropriate for that format.
>
> Then, for creating standoff annotations, add an 
> "its:textAnalyticsAnnotations" element to the document, e.g. in HTML 
> "script" if needed.
>
> Let's assume before annotation we have
> <span its-tan-type="entity" 
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
> its-tan-confidence="0.7">Dublin</span>
> Then after annotation we would have
> <span its-tan-type="entity" 
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
> its-tan-confidence="0.7" *id="a8"*>Dublin</span>
> and this:
> <its:textAnalyticsAnnotations>
> <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="term" 
> its-tan-ident-ref="http://termdatabase.example.com/entry37" 
> its-tan-info-ref="http://termdatabase.example.com/entry37/description" 
> its-tan-confidence="1.0"/>
> </its:textAnalyticsAnnotations>
>
>
> Let's now assume that before annotation we have
> <span its-tan-type="term" 
> its-tan-ident-ref="http://termdatabase.example.com/entry37" 
> its-tan-info-ref="http://termdatabase.example.com/entry37/description" 
> its-tan-confidence="1.0">Dublin</span>
> Then after annotation we would have
> <span its-tan-type="term" 
> its-tan-ident-ref="http://termdatabase.example.com/entry37" 
> its-tan-info-ref="http://termdatabase.example.com/entry37/description" 
> its-tan-confidence="1.0" *id="a8"*>Dublin</span>
> and this:
> <its:textAnalyticsAnnotations annotatorsRef="tan|tool-x">
> <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity" 
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
> its-tan-confidence="0.7"/>
> </its:textAnalyticsAnnotations>
>
> Now, if several "entity" annotation tools have been used, we could 
> also have
> <its:textAnalyticsAnnotations>
> <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity" 
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
> its-tan-confidence="0.7" annotatorsRef="tan|tool-x"/>
> <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity" 
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
> its-tan-confidence="0.4" annotatorsRef="tan|tool-y"/>
> </its:textAnalyticsAnnotations>
>
> Above approach would also influence the consumption of this data 
> category, and of annotatorsRef:
>
> - A consuming tool goes through the document and gathers all 
> textAnalyticsAnnotations elements
> - It then goes through the document. For each element node
> * check for existing inline markup. If it's available, add it to the 
> list of annotations for that node. Assume the inline version up in the 
> document tree of annotatorsRef to be responsible for the annotation of 
> that markup.
> * check the accumulated standoff textAnalyticsAnnotations elements for 
> occurrences of IDs that match the node. If there is such an ID, add 
> the related annotation to the list for the node, including the 
> additional annotatorsRef tool, e.g. tool-x or tool-y in the above case.
>
> Mārcis:
>
> Do I understand you correctly that we may end up having contradicting 
> annotations also, for instance term=”yes” and term=”no”? This would 
> require a data consumer to be able to handle a lot of ambiguity in the 
> data.
>
>

Sure. But they could identify the ambiguity with a multilayer annotation 
that clearly identifies the tool used, via annotatorsRef.
Currently, what would you do with this
<span its-term="yes"><span its-term="no">screwdriver</span></span>
how would you resolve the ambiguity here? "Terminology" has no 
inheritance. This makes sense, otherwise in the following
<span its-term="yes"><span class="em">screw</span>driver</span>
the embedded "span" element would constitute a span. But that leads to 
this test suite output for
<span its-term="yes"><span its-term="no">screwdriver</span></span>
/span[1] term="yes"
/span[1]/span[1] term="no"
and both "span" nodes contain the same string "screwdriver". So how do 
you resolve the ambiguity here?
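
The per-node output quoted above can be reproduced with a short Python sketch. This is only an illustration of the "no inheritance" reading, not the actual test suite code; the function name term_values and the way paths are built are my own assumptions.

```python
import xml.etree.ElementTree as ET


def term_values(elem, path):
    """Yield (path, its-term value) for every element that carries its-term.

    Nothing is inherited: each node reports only its own attribute.
    """
    value = elem.get("its-term")
    if value is not None:
        yield path, value
    counts = {}  # per-tag position counter to build /span[1]-style paths
    for child in elem:
        counts[child.tag] = counts.get(child.tag, 0) + 1
        yield from term_values(child, f"{path}/{child.tag}[{counts[child.tag]}]")


doc = ET.fromstring(
    '<span its-term="yes"><span its-term="no">screwdriver</span></span>'
)
rows = list(term_values(doc, "/span[1]"))
for path, value in rows:
    print(f'{path} term="{value}"')

# Text content of the outer and the inner span.
texts = ["".join(e.itertext()) for e in doc.iter("span")]
```

Both entries in texts are the same string, which is exactly the problem: the two nodes carry contradicting values for identical content, and nothing in the markup says which tool produced which value.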

> In summary, this standoff tries to solve several issues:
>
> - avoid the 16+ inline attribute monster data category
>
> Mārcis:
>
> Again, I did not understand why this is worse than having a heavy 
> “stand-off” mechanism.
>
>
> - allow for multiple annotations of the same span, with different tools
> Mārcis:
>
> In Prague Tadej and I had a discussion whether there is a use case for 
> using two tools producing contradicting mark-up and we came to the 
> conclusion that neither of us would produce such data and if such a 
> scenario exists, then the content producer should fuse (disambiguate) 
> the outputs of the two separate tools prior to ITS 2.0 metadata 
> application. I am talking about the same type (for instance, two term 
> annotation tools on the same span) of annotation, not two separate types.
>
> Then my question: does such a scenario exist? Who is implementing it?
>

If both you and Tadej would agree on one data category: everybody who 
wants to use both your tools would implement it. And this has the value 
that people could compare the outcome of the tools.
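
As a rough sketch of what "comparing the outcome of the tools" could look like once annotations are attributed per tool: the data below is hypothetical (node ids, tool names and types are made up), and compare_tools is an illustrative helper, not part of any proposal.

```python
from collections import defaultdict

# Hypothetical, already-parsed annotations:
# (node id, tool, annotation type, confidence)
annotations = [
    ("a8", "tool-x", "entity", 0.7),
    ("a8", "tool-y", "entity", 0.4),
    ("a9", "tool-x", "term", 1.0),
    ("a9", "tool-y", "entity", 0.6),
]


def compare_tools(rows):
    """Map each node id to {tool: type} and list the ids where tools disagree."""
    per_node = defaultdict(dict)
    for node_id, tool, ann_type, _confidence in rows:
        per_node[node_id][tool] = ann_type
    conflicts = sorted(
        node_id for node_id, by_tool in per_node.items()
        if len(set(by_tool.values())) > 1
    )
    return dict(per_node), conflicts


per_node, conflicts = compare_tools(annotations)
```

Here the tools agree on a8 and disagree on a9; whether a consumer resolves such a conflict or simply picks one tool is left to the consumer.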


>
> - avoid the ITS1/2 or general inline annotation issues with 
> inheritance and overriding - as with the standoff approach 
> exemplified at
> http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO
> annotation information is just accumulated for a given base item (in 
> our case, element nodes with an ID).
> Mārcis:
>
> So ... at the end, with this method we would allow:
>
> 1) Hierarchical annotation
>
> 2) Contradicting annotation
>
> 3) (possibly also) overlapping annotation
>


Correct.

>
> I'm not yet asking for this change, but I see it as a way forward that 
> could make the life of both annotation producers (Marcis and Tadej) 
> and consumers (Yves et al.) simpler. So I'm eager to hear thoughts on 
> this :)
> Mārcis:
>
> As I understand the proposal – it is the complete opposite from being 
> simple (or simplifying things as they are right now having Terminology 
> and Disambiguation separately), it complicates things significantly 
> from the Terminology standpoint as now I do not see where term=”yes” 
> fits in, we have to deal with contradicting annotation (allow or 
> prohibit it is now a question to the consumers – I as a consumer would 
> ask to prohibit it as I do not see a use case for term=”yes” and 
> term=”no” at the same time), and what is more, we have to re-implement 
> the parsers so that instead of overriding and inheritance they would 
> work with accumulating information (and this is a complete revision of 
> the parser logics for the Terminology data category).
>

I understand the burden on implementation you emphasize - but it seems 
that one scenario - annotation using different tools even for the 
terminology data category, see the nested "terminology" annotations 
above - is not resolved by your proposal. You say this would not be 
implemented before ITS2 annotation - but if the tool providers are not 
from the same organization?

Best,

Felix


>
>
> Thoughts?
>
> - Felix
>
Received on Tuesday, 29 January 2013 09:27:05 UTC