Re: issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup from Felix Sasaki on 2013-01-28 (public-multilingualweb-lt@w3.org from January 2013)

From: Felix Sasaki <fsasaki@w3.org>
Date: Mon, 28 Jan 2013 13:45:08 +0100
To: Mārcis Pinnis <marcis.pinnis@Tilde.lv>
CC: "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>, Artūrs Vasiļevskis <arturs.vasilevskis@Tilde.lv>
Message-ID: <510672D4.9020106@w3.org>
Hi Mārcis, all,

thanks a lot for your detailed feedback. Due to other commitments I 
can't respond to this in detail today, and probably not tomorrow either. 
But maybe we can discuss it on the Wednesday call. That may not work for 
you, so maybe on the Monday call next week?

In the meantime I will try to come up with examples and test files for 
my proposal. Mārcis, Tadej, could I encourage you to do the same? I know 
you provided slides for Prague, but not everyone might have seen them, 
and having a summary of your options might help to move things forward. 
Short is fine, just having the examples on the list would help a great deal.

Just a judgment from my side: I think at the moment we don't have 
consensus for

- leaving everything as is (Dave's proposal)
- adding the 16 plus attributes (Tadej+Marcis proposal)
- defining the multilayer feature for terminology + disambiguation (my 
proposal)
- adding that feature for all data categories (I think Phil was saying 
that?)

So I'm looking forward to seeing more discussion on this today and later.


Best,

Felix

Am 28.01.13 11:18, schrieb Mārcis Pinnis:
>
> Hi Felix, all,
>
> I see that there have been a lot of opinion exchanges on the proposal 
> brought up by Felix.
>
> I have some comments to add. I am now speaking as a data producer and 
> later maybe also a data consumer (and I am not speaking as a linguist! 
> ... that has to be understood as well).
>
> First of all, I would like to ask whether we agreed that ITS 2.0 
> should be able to represent data in the structure as TEI, NIF, XCES or 
> other NLP related standards do – that is, as far as I understand, the 
> direction where this discussion is heading. Should ITS 2.0 try to 
> re-invent these data standards? I would be inclined to say – no!
>
> Secondly, as we are in a last call phase, I understand that such a 
> significant change to the ITS 2.0 data categories would rewrite them 
> (and maybe it will get clearer when you read my comments till the 
> end). I as a data producer will now have to rewrite my parsers and 
> data producing systems just to accommodate the „stand-off” mechanisms, 
> which is, from a content provider's and content consumer's perspective, 
> a completely different kind of change from just adding independent 
> attributes or changing the names of attributes (which was actually the 
> initial proposal by Tadej and me). I would like others to understand 
> that this solution asks for re-development rather than simple adjustments.
>
> Other comments are inline below...
>
> After reading the comments here is a summary:
>
> In my understanding the proposal complicates data production and 
> consumption significantly as it creates possibilities for a lot of 
> ambiguity, which I guess is the opposite of what initially was meant 
> by the disambiguation data category(!), and at least in our use case it 
> requires a revision of the parser logic and the ITS 2.0 metadata 
> annotation logic.
>
> However, I will have a discussion with my colleagues in order to 
> estimate how many changes our use case would require from a 
> development perspective.
>
> I also understand that this proposal wants to fuse all types of 
> possible NLP-related text analyses together, but I did not have the 
> feeling that ITS 2.0 should be used as a TEI, XCES, NIF, etc. clone? 
> This is where I see the changes leading us.
>
> However, I am not saying that this is a bad thing... we would 
> definitely make linguists happier, but I as a content provider 
> and later also a consumer would have difficulties working with the 
> data, as I would have to accept uncertainty/ambiguity in the 
> ITS 2.0 metadata by default (except for external resources, as those 
> are defined between consumers/producers and not by ITS 2.0).
>
> Best regards,
>
> Mārcis ;o)
>
> *From:*Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Sunday, January 27, 2013 9:25 AM
> *To:* public-multilingualweb-lt@w3.org
> *Subject:* issue-68 from an annotation representation point of view, 
> with potential implications for annotatorsRef and standoff markup
>
> Hi all,
>
> sorry, this is going to be long ... but please have a look, esp. the 
> implementers (both consumers and producers) of terminology and 
> disambiguation.
>
> in the last 10 1/2 months, since Tadej's presentation at the Dublin 
> workshop, we had a lot of discussions on disambiguation, sometimes 
> (as now) including terminology. But it seems that we never discussed 
> whether the ITS2 approach of selection (global, local, inheritance, 
> overriding (partial or not)...) is suitable for this type of information.
>
> By "this type" I mean annotation of linguistic information. Most ITS2 
> and ITS1 data categories are process related (e.g. "Don't translate 
> this ..."), but both terminology and what's now called disambiguation 
> are information that you find in linguistic corpora and processing 
> tools. Now, my point is that in both such natural language 
> processing tool chains and related corpora, a representation of 
> information *inline per document node* is rather the exception. Mostly 
> you have *standoff information*, that is, a complete separation of 
> information from actual content - as in NIF.
>
> Mārcis:
>
> Parsing and understanding of the mark-up is the main difference (how 
> overriding and inheritance work) that requires this „stand-off” 
> mechanism for „this type” of annotation. If there were only flat 
> annotation, we would not have this discussion at all. Also, 
> “stand-off” is only good if you really have to add a lot of complex 
> data, but here we have to add just a flag or a reference (to put it 
> simply). In Prague Tadej and I discussed that if hierarchical 
> information is needed, it should be encoded in the external resources.
>
> If I understand correctly, stand-off mark-up has no inheritance and 
> no overriding – it describes a span? If so, I assume that with 
> your proposal we are back at requiring hierarchical annotation, 
> overlapping annotation and contradicting annotation, which will allow 
> all kinds of text analysis annotations (without restricted types – 
> term, entity, ontology, lexical, etc.). This will require data 
> consumers to re-think their data consumption strategies, as they will 
> have to disambiguate the “disambiguation-style” annotations (which 
> means that in the end we do not help data consumers, but rather make 
> their life more difficult).
>
> In the current ITS 2.0 draft the annotation is flat - it is simple to 
> parse, simple to consume, simple to produce – it is not hierarchical 
> and it does not overlap.
>
> From this perspective, the proposed change is a complete overhaul of 
> the 2 data categories into something different.
>
> Also – we do require the flag. That is something that will be heavily 
> complicated with the “stand-off” mechanism (that has to be 
> understood), or won’t be possible at all?! Having a simple attribute 
> inline is the simplest you can achieve. Having a “stand-off” on the 
> other hand is the most complex you can achieve.
>
> And ... if I remember correctly, we did not want to make life 
> difficult for producers/consumers if they did not care about the other 
> data categories?
>
>
>
> Why is that? In linguistic annotation it is common that you have 
> several layers of information, like our lexical, ontological etc. 
> information. Some of these might be complex in itself (e.g. named 
> entities), some of these might be related to others (e.g. an 
> ontological concept related to a lexical item). I won't try to define 
> these layers here - but my point is that due to the complexity of 
> representing such information inline, nearly nobody is trying to 
> represent several layers at the same time inline. The common approach 
> is rather to have a base layer, and then pointers from the various 
> annotation layers.
>
> In a sense you can describe NIF as an approach of taking character 
> offsets as the implicit base layer (implicit because characters don't 
> need explicit anchors). The TEI here
> http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO
> provides an example for an offset using words as the base unit, with 
> explicit xml:id attributes.
>
> So far we haven't taken this approach for terminology or 
> disambiguation. This is why we had to come up with 16+ attributes: if 
> you want to do everything "inline", you need to differentiate 
> attribute names and come up with a monster data category. Inline 
> annotations are just not suitable for such information.
>
> Mārcis:
>
> I disagree that 16+ attributes are the difficulty here. The difficulties 
> from the beginning were the questions: 1) how many types of annotation 
> should be supported (we narrowed the list down to 4 – terminology, 
> named entities, ontology concepts, lexical concepts)? 2) should 
> overlapping be supported? 3) should hierarchical annotation be 
> supported? 4) should contradicting annotation be supported?
>
> Also ... data producers would have to worry about a maximum of 5 
> attributes simultaneously and would be able to ignore the rest. 
> For instance, I have no use for the attributes for disambiguation 
> categories. Although I would agree to write a parser that parses all 
> these attributes (just for compliance with the data category), I would 
> as a consumer consume only the ones related to terminology and I as a 
> producer would produce only those related to terminology. I would 
> neither consume nor produce the disambiguation related attributes. From 
> that perspective, I disagree about the complexity in the attribute scenario.
>
> For terminology I require a flagging mechanism (with the possibility 
> to add either a reference, a confidence score, or both).
>
> I do agree that we are limiting the annotation by having separate 
> attributes, but then again ... ITS 2.0 does not have to represent 
> every possible text analysis annotation type. It is supposed to aid in 
> localisation processes and not all text analysis types have a valid 
> use case (or a necessary or even a potentially useful use case) in 
> localisation.
>
> Also ... if we are re-inventing terminology and disambiguation, maybe 
> we should analyse which other data categories fall under the type 
> “text analysis”? Domain is a suitable candidate as well (and if we 
> create a suitable text analysis category, maybe domain analysis can be 
> subcategorized under that as well in order to support automated domain 
> analysis solutions (EuroVoc has an automated domain classifier, for 
> instance))? With this I would like to emphasize that 
> overgeneralization is not the best approach, as we are creating data 
> categories for different consumption scenarios.
>
>
>
> So, the first idea behind the approach below is: if you want to represent 
> just one linguistic layer (or "qualifier" in Christian's mail at
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt-comments/2013Jan/0014.html
> ), you use the "tan-type" attribute to differentiate annotations. That 
> leads to the following inline models:
>
> 1) A term has its-tan-type with value "term" and optional 
> its-tan-confidence, its-tan-ident-ref, and its-tan-info-ref. Example:
> <span its-tan-type="term" 
> its-tan-ident-ref="http://termdatabase.example.com/entry37" 
> its-tan-info-ref="http://termdatabase.example.com/entry37/description" 
> its-tan-confidence="1.0">Dublin</span>
> Comparison to current ITS1 "Terminology":
> its-tan-type="term" plays the role of term="yes". its-tan-info-ref 
> plays the role of termInfoRef. its-tan-ident-ref links to a term data 
> base. its-tan-confidence provides confidence information.
> (Esp. at Marcis: I know that "Dublin" is a bad candidate for a term, 
> I'm just trying to exemplify the annotation approach here)
>
> Mārcis:
>
> Also one thing I tried to emphasize at lunchtime in Prague, 
> TermInfoRef is not necessarily an identity reference. It does not 
> always point to something unique (if we understand that a set is not 
> unique). You can have multiple term entries from multiple user 
> collections in a term bank relating to one semantic term. If you do 
> not specify a domain, you could end up with a reference 
> that points to totally different (even contrasting) terms, and if you do 
> not specify a target language, you may end up with multiple entries 
> because most of the collections are bilingual and not multilingual. 
> Why is that so? It is because a term-bank is not a disambiguator – it 
> acts like a search engine (more or less) – the disambiguation for the 
> “external” information (the meaning; the term unithood is defined by 
> the flag term=”yes” itself) has to be done by the consumers 
> (translation engines or human translators). In most cases (as in the 
> biggest term-banks – IATE, ETB) it does not have a hierarchical 
> understanding of terms as some lexical (WordNet, f.i.) or ontological 
> resources may have. For MT engines, term=“yes” alone is already valuable 
> information, as it defines the term unithood, which means that the 
> term should be translated as a non-breakable phrase. So ... the MT 
> engine could ignore the TermInfoRef altogether if it does not have a 
> suitable disambiguation module and maybe leave the disambiguation to 
> human post-editors...
>
> So ... “ident” is misleading (at least in the case of Terminology 
> annotation)!
>
> Also important: HOW WOULD YOU REPRESENT term=”no”? This is a very 
> important feature of the flag type annotation.
>
> Would you say: its-tan-type="not-a-term"? That would require data 
> producers to handle annotation of higher complexity!
>
> 2) An entity has its-tan-type with value "entity" and optional 
> its-tan-confidence, its-tan-ident-ref, and its-tan-class-ref. Example:
> <span its-tan-type="entity" 
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
> its-tan-confidence="0.7">Dublin</span>
>
> So the above is only a different naming compared to the current 
> "Terminology" and Disambiguation. Below is now the standoff approach. 
> The processing expectation for tools *producing the annotation* is like this:
> - If there is no inline annotation, just create it (e.g. 1) or 2))
> - If there is inline annotation, check if there is an id attribute (in 
> HTML) or xml:id (if the XML serialization of HTML is used, and with lower 
> precedence compared to id). For formats other than HTML, add xml:id if 
> possible or use the id attribute appropriate for that format.
>
> Then, for creating standoff annotations, add an 
> "its:textAnalyticsAnnotations" element to the document, e.g. in HTML 
> "script" if needed.
>
> Let's assume before annotation we have
> <span its-tan-type="entity" 
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
> its-tan-confidence="0.7">Dublin</span>
> Then after annotation we would have
> <span its-tan-type="entity" 
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
> its-tan-confidence="0.7" 
> id="a8">Dublin</span>
> and this:
> <its:textAnalyticsAnnotations>
> <its:textAnalyticsAnnotation ref="a8" its-tan-type="term" 
> its-tan-ident-ref="http://termdatabase.example.com/entry37" 
> its-tan-info-ref="http://termdatabase.example.com/entry37/description" 
> its-tan-confidence="1.0"/>
> </its:textAnalyticsAnnotations>
>
>
> Let's now assume that before annotation we have
> <span its-tan-type="term" 
> its-tan-ident-ref="http://termdatabase.example.com/entry37" 
> its-tan-info-ref="http://termdatabase.example.com/entry37/description" 
> its-tan-confidence="1.0">Dublin</span>
> Then after annotation we would have
> <span its-tan-type="term" 
> its-tan-ident-ref="http://termdatabase.example.com/entry37" 
> its-tan-info-ref="http://termdatabase.example.com/entry37/description" 
> its-tan-confidence="1.0" id="a8">Dublin</span>
> and this:
> <its:textAnalyticsAnnotations annotatorsRef="tan|tool-x">
> <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity" 
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
> its-tan-confidence="0.7"/>
> </its:textAnalyticsAnnotations>
>
> Now, if several "entity" annotation tools have been used, we could 
> also have
> <its:textAnalyticsAnnotations>
> <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity" 
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
> its-tan-confidence="0.7" 
> annotatorsRef="tan|tool-x"/>
> <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity" 
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin" 
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" 
> its-tan-confidence="0.4" 
> annotatorsRef="tan|tool-y"/>
> </its:textAnalyticsAnnotations>
>
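The producer-side expectation described above (create inline markup if there is none; otherwise make sure the node has an id and add a standoff entry pointing at it) could be sketched in Python roughly as follows. This is only an illustration under assumptions: the function names, the "a1, a2, ..." id scheme and the choice to append the standoff container to the root element are made up for the example, and the element and attribute names are simply those used in the snippets above.

```python
import xml.etree.ElementTree as ET

# Assumption: the "its" prefix is bound to the ITS namespace URI.
ITS_NS = "http://www.w3.org/2005/11/its"
WRAPPER = f"{{{ITS_NS}}}textAnalyticsAnnotations"
ITEM = f"{{{ITS_NS}}}textAnalyticsAnnotation"
ET.register_namespace("its", ITS_NS)


def unused_id(root, prefix="a"):
    """Return an id value not yet used in the document (hypothetical scheme)."""
    taken = {el.get("id") for el in root.iter() if el.get("id") is not None}
    n = 1
    while f"{prefix}{n}" in taken:
        n += 1
    return f"{prefix}{n}"


def add_annotation(root, node, attrs, annotators_ref=None):
    """Annotate `node` inline if it is still unannotated, else via standoff."""
    if node.get("its-tan-type") is None:
        # No inline annotation yet: just create it on the node itself.
        for name, value in attrs.items():
            node.set(name, value)
        return None
    # Inline annotation exists: reuse the node's id or give it a new one.
    node_id = node.get("id")
    if node_id is None:
        node_id = unused_id(root)
        node.set("id", node_id)
    # Create the standoff container on first use, then add one entry to it.
    wrapper = root.find(WRAPPER)
    if wrapper is None:
        wrapper = ET.SubElement(root, WRAPPER)
    item = ET.SubElement(wrapper, ITEM, {"ref": node_id, **attrs})
    if annotators_ref is not None:
        item.set("annotatorsRef", annotators_ref)
    return item
```

Calling add_annotation twice on the same span, first with the "term" attributes and then with the "entity" attributes, would leave the term annotation inline and put the entity annotation into the standoff container, as in the second before/after pair above.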
> The above approach would also influence the consumption of this data 
> category, and of annotatorsRef:
>
> - A consuming tool goes through the document and gathers all 
> textAnalyticsAnnotations elements
> - It then goes through the document. For each element node
> * check for existing inline markup. If it's available, add it to the 
> list of annotations for that node. Assume that the nearest annotatorsRef 
> up the document tree is responsible for the annotation of 
> that markup.
> * check the accumulated standoff textAnalyticsAnnotations elements for 
> occurrences of IDs that match the node. If there is such an ID, add 
> the related annotation to the list for the node, including the 
> additional annotatorsRef tool, e.g. tool-x or tool-y in the above case.
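The two consumption passes above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the sample document, the tool name "tool-inline" and the function name collect_annotations are made up for the example, and the element and attribute names are those of the proposal as quoted in this thread.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumption: the "its" prefix is bound to the ITS namespace URI.
ITS_NS = "http://www.w3.org/2005/11/its"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
WRAPPER = f"{{{ITS_NS}}}textAnalyticsAnnotations"
ITEM = f"{{{ITS_NS}}}textAnalyticsAnnotation"
TAN_ATTRS = ("its-tan-type", "its-tan-confidence", "its-tan-ident-ref",
             "its-tan-info-ref", "its-tan-class-ref")

# Hypothetical input: one inline "term" plus two standoff "entity" entries.
DOC = """<doc xmlns:its="http://www.w3.org/2005/11/its"
     annotatorsRef="tan|tool-inline">
  <p><span id="a8" its-tan-type="term"
        its-tan-ident-ref="http://termdatabase.example.com/entry37"
        its-tan-confidence="1.0">Dublin</span></p>
  <its:textAnalyticsAnnotations>
    <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
        its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
        its-tan-confidence="0.7" annotatorsRef="tan|tool-x"/>
    <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
        its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
        its-tan-confidence="0.4" annotatorsRef="tan|tool-y"/>
  </its:textAnalyticsAnnotations>
</doc>"""


def tan_attrs(el):
    """Copy the its-tan-* attributes that are present on an element."""
    return {k: el.get(k) for k in TAN_ATTRS if el.get(k) is not None}


def collect_annotations(root):
    """Return a list of (element, annotations) for every annotated node."""
    # Pass 1: gather all standoff entries, keyed by the id they point to.
    standoff = defaultdict(list)
    for wrapper in root.iter(WRAPPER):
        default_tool = wrapper.get("annotatorsRef")
        for item in wrapper.iter(ITEM):
            ann = tan_attrs(item)
            ann["annotatorsRef"] = item.get("annotatorsRef", default_tool)
            standoff[item.get("ref")].append(ann)

    # Pass 2: walk the content and accumulate annotations per element node.
    found = []

    def walk(node, inherited_tool):
        if node.tag in (WRAPPER, ITEM):
            return  # standoff containers are not content nodes
        tool = node.get("annotatorsRef", inherited_tool)
        anns = []
        if node.get("its-tan-type") is not None:
            inline = tan_attrs(node)
            inline["annotatorsRef"] = tool  # nearest one up the tree
            anns.append(inline)
        node_id = node.get("id") or node.get(XML_ID)  # id wins over xml:id
        if node_id is not None:
            anns.extend(standoff.get(node_id, []))
        if anns:
            found.append((node, anns))
        for child in node:
            walk(child, tool)

    walk(root, None)
    return found
```

Running collect_annotations(ET.fromstring(DOC)) yields a single entry for the span with id "a8", carrying the inline term annotation followed by the two standoff entity annotations. Nothing is overridden: the list per node only grows, which is the accumulation behaviour under discussion.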
>
> Mārcis:
>
> Do I understand you correctly that we may end up having contradicting 
> annotations also, for instance term=”yes” and term=”no”? This would 
> require a data consumer to be able to handle a lot of ambiguity in the 
> data.
>
>
> In summary, this standoff tries to solve several issues:
>
> - avoid the 16+ inline attribute monster data category
>
> Mārcis:
>
> Again, I did not understand why this is worse than having a heavy 
> “stand-off” mechanism.
>
>
> - allow for multiple annotations of the same span, with different tools
> Mārcis:
>
> In Prague Tadej and I had a discussion whether there is a use case for 
> using two tools producing contradicting mark-up and we came to the 
> conclusion that neither of us would produce such data and if such a 
> scenario exists, then the content producer should fuse (disambiguate) 
> the outputs of the two separate tools prior to ITS 2.0 metadata 
> application. I am talking about the same type (for instance, two term 
> annotation tools on the same span) of annotation, not two separate types.
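One way a content producer might fuse the outputs of two tools of the same type before applying ITS 2.0 metadata is sketched below in Python. The thread does not say how the fusion should be done; keeping the candidate with the highest confidence per span and annotation type is purely an assumption made for this example, as are the dictionary keys and the function name.

```python
def fuse(candidates):
    """Keep one annotation per (span, type), preferring higher confidence.

    Assumed policy, not something agreed in the thread: on a tie, the
    candidate seen first is kept.
    """
    best = {}
    for ann in candidates:
        key = (ann["ref"], ann["type"])
        if key not in best or ann["confidence"] > best[key]["confidence"]:
            best[key] = ann
    return list(best.values())


# Hypothetical tool outputs for the span with id "a8".
CANDIDATES = [
    {"ref": "a8", "type": "entity", "confidence": 0.4, "tool": "tool-y"},
    {"ref": "a8", "type": "entity", "confidence": 0.7, "tool": "tool-x"},
    {"ref": "a8", "type": "term", "confidence": 1.0, "tool": "tool-z"},
]
```

With these candidates, fuse(CANDIDATES) keeps the tool-x entity annotation and the term annotation, so a consumer never sees two annotations of the same type on one span.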
>
> Then my question: does such a scenario exist? Who is implementing it?
>
>
> - avoid the ITS1/2 or general inline annotation issues with 
> inheritance and overriding - as with the standoff approach 
> exemplified at
> http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO
> annotation information is just accumulated for a given base item (in 
> our case, element nodes with an ID).
> Mārcis:
>
> So ... at the end, with this method we would allow:
>
> 1) Hierarchical annotation
>
> 2) Contradicting annotation
>
> 3) (possibly also) overlapping annotation
>
>
> I'm not yet asking for this change, but I see it as a way forward that 
> could make the life of both annotation producers (Marcis and Tadej) 
> and consumers (Yves et al.) simpler. So I'm eager to hear thoughts on 
> this :)
> Mārcis:
>
> As I understand the proposal, it is the complete opposite of being 
> simple (or of simplifying things as they are right now, with Terminology 
> and Disambiguation kept separate). It complicates things significantly 
> from the Terminology standpoint: I do not see where term=”yes” 
> fits in, and we have to deal with contradicting annotation (whether to 
> allow or prohibit it is now a question for the consumers – I as a 
> consumer would ask to prohibit it, as I do not see a use case for 
> term=”yes” and term=”no” at the same time). What is more, we have to 
> re-implement the parsers so that instead of overriding and inheritance 
> they work with accumulated information (and this is a complete revision 
> of the parser logic for the Terminology data category).
>
>
>
> Thoughts?
>
> - Felix
>
Received on Monday, 28 January 2013 12:59:50 UTC