- From: Felix Sasaki <fsasaki@w3.org>
- Date: Tue, 29 Jan 2013 18:24:24 +0100
- To: public-multilingualweb-lt@w3.org
- Message-ID: <510805C8.5020103@w3.org>
Hi Mārcis, all,

On 29.01.13 15:48, Mārcis Pinnis wrote:
>
> Hi Felix,
>
> My comments are inline.
>
> Best regards,
>
> Mārcis ;o)
>
> *From:* Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Tuesday, January 29, 2013 11:27 AM
> *To:* Mārcis Pinnis
> *Cc:* public-multilingualweb-lt@w3.org; Artūrs Vasiļevskis
> *Subject:* Re: issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup
>
> Hi Mārcis, all,
>
> even if this discussion has now continued in a different thread, let me give further feedback here too - it may help to clarify things, and to continue the discussion in general.
>
> On 28.01.13 11:18, Mārcis Pinnis wrote:
>
> Hi Felix, all,
>
> I see that there have been a lot of opinion exchanges on the proposal brought up by Felix.
>
> I have some comments to add. I am now speaking as a data producer and later maybe also a data consumer (and I am not speaking as a linguist! ... that has to be understood as well).
>
> First of all, I would like to ask whether we agreed that ITS 2.0 should be able to represent data in the structure as TEI, NIF, XCES or other NLP related standards do – that is, as far as I understand, the direction where this discussion is heading. Should ITS 2.0 try to re-invent these data standards? I would incline to saying – no!
>
> As far as I understand, these standards are not yet implemented in localization tool chains. However, the "multilayer annotation" proposal brought one feature from these standards into such tool chains: the standoff mechanism. I'd rather see this as a value than a problem: bringing NLP friendly representations into localization workflows. Would you disagree?
>
> Mārcis: From the perspective of adding all kinds of annotation, overlapping, contradicting, hierarchical, it certainly is beneficial (I do agree in this aspect).
> > Mârcis: From the perspective of implementation: > > Mârcis: 1) for consumers you suggest reading only known mark-up. I do > agree that if we would care only about one tool then we could ignore > the rest. But this asks consumers to know who produced the mark-up. By > having a flat level flag the consumers did not have to worry about who > produced the annotation (also – human and machine users could apply > annotation and have an effect on the data); they just read the > annotation and used it as is. This is not possible in the stand-off > mechanism – the consumer has to know which producer to trust in order > to consume the data; otherwise the consumer has to have a > disambiguation module at hand that tries to find some reason in all > the annotations. > > Mârcis: 2) for producers the stand-off mark-up requires adaptation > (more than just adding attributes, but still adaptation), which > probably is not a big issue. > > Mârcis: We (Tilde) are doing both right now – we consume and we > produce Terminology. But ... we could switch to just producing and not > consuming (which is the part that worries me more...). So we would not > have to deal with the disambiguation of the stand-off mark-up and also > which annotator to trust or not. > > Secondly, as we are in a last call phase, I understand that such > significant change to the ITS 2.0 data categories would rewrite > them (and maybe it will get clearer when you read my comments till > the end). I as a data producer now will have to rewrite my parsers > and data producing systems just to accommodate the „stand-off” > mechanisms, which is in a content providers and content consumers > perspective a diametric change to just adding additional > independent attributes or changing the names of attributes (which > was actually the initial proposal by Tadej and me). I would like > for others to understand that this solution asks for > re-development rather than simple adjustments. 
>
> I agree - this would be quite some work, and we need to justify the benefit clearly.
>
> Mārcis: The change will affect our Showcase the most as right now we rely on inline mark-up. If we won’t have the inline mark-up at the end (or we will have additional stand-off mark-up) then we will have to re-think the Showcase design and the visualisation possibilities in the Showcase.
>

Is this because of CSS used for visualization? I agree that with standoff markup visualization gets more complicated - but not that much. See the JavaScript bit used for localization quality issue here

http://www.w3.org/TR/2012/WD-its20-20121206/#EX-locQualityIssue-html5-local-2
http://www.w3.org/TR/2012/WD-its20-20121206/examples/html5/EX-locQualityIssue-html5-local-2.html

The resolution of the ID is only a few lines of JavaScript code.

>
> Other comments are inline below...
>
> After reading the comments here is a summary:
>
> In my understanding the proposal complicates data production and consumption significantly as it creates possibilities for a lot of ambiguity, which I guess is the opposite of what initially was meant by the disambiguation data category(!) and at least in our Use Case it requires revision of parser logics and ITS 2.0 metadata annotation logics.
>
> The proposal basically says: here is a way to represent ambiguity, created by several tools annotating the same document. However, I'd see this as a value, not a problem: with separate "its:textAnalyticsAnnotations" elements, including each its own annotatorsRef, you can clearly identify which tool created what annotation. This may be even clearer than the current annotatorsRef.
>
> Mārcis: I do agree, however see above for my comment related to consumers. For them the consumption is different – they will have to know whom to trust. If you think that consumers have to know who produced the data all the time then it is fine by me... (but it is a change from the Terminology as it is right now).
>

I got your point about trust. And - trying to bring the discussion back - the initial comment was not about trust or no trust. It was rather about unifying terminology and disambiguation - and by reflecting the levels mentioned in disambiguation in different annotation levels, I tried to find a work-around for this. But that work-around and multilayer annotation are not the main topic.

So, if we drop disambiguation granularity and keep the term yes / no requirement, we may have this representation as a unified approach for both data categories:

<span its-tan-confidence="0.7" its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" its-tan-ident-ref="http://dbpedia.org/resource/Dublin" its-term="no">Dublin</span>

If we define that its-term="yes" triggers its-tan-ident-ref to be interpreted as a reference to a termDB, we would have unified the data categories. Well, I guess I'm trying to use an axe for moving this forward ... but let's see what you think.

Best,

Felix

>
> However, I will have a discussion with my colleagues in order to estimate how many changes would be required to our use case from a development perspective.
>
> I also understand that this proposal wants to fuse all types of possible NLP-related text analyses together, but I did not have the feeling that ITS 2.0 should be used as a TEI, XCES, NIF, etc. clone? This is how I see where the changes will lead us.
>
> However, I also do not say that that is a bad thing... we would definitely make linguists happier, but I as a content provider and later also a consumer would have difficulties working with the data as I would have to agree accepting uncertainty/ambiguity in the ITS 2.0 metadata by default (except external resources as those are defined between consumers/producers and not ITS 2.0).
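A minimal sketch of the unified reading proposed in this reply, where its-term="yes" makes its-tan-ident-ref a term database reference and any other value leaves it an entity reference. The attribute names are the ones used in the thread; the function name interpret_span and the labels "term-db" / "entity" are illustrative assumptions, not ITS 2.0 vocabulary.

```python
from xml.etree import ElementTree as ET


def interpret_span(markup: str) -> dict:
    """Read one annotated element under the proposed unified rule (hypothetical helper)."""
    el = ET.fromstring(markup)
    is_term = el.get("its-term") == "yes"
    ident = el.get("its-tan-ident-ref")
    conf = el.get("its-tan-confidence")
    # Assumed rule from the reply: its-term="yes" switches how the
    # identity reference is interpreted. The two labels are made up here.
    if ident is None:
        ident_kind = None
    else:
        ident_kind = "term-db" if is_term else "entity"
    return {
        "text": "".join(el.itertext()),
        "is_term": is_term,
        "ident_kind": ident_kind,
        "ident_ref": ident,
        "class_ref": el.get("its-tan-class-ref"),
        "confidence": float(conf) if conf is not None else None,
    }


# The example element from the reply, with its-term="no".
example = (
    '<span its-tan-confidence="0.7" '
    'its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" '
    'its-tan-ident-ref="http://dbpedia.org/resource/Dublin" '
    'its-term="no">Dublin</span>'
)
result = interpret_span(example)
print(result["is_term"], result["ident_kind"])  # False entity
```

Under this reading a consumer that only cares about terminology can still use its-term as a plain yes/no flag and ignore the remaining attributes.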
> > Best regards, > > Mârcis ;o) > > *From:*Felix Sasaki [mailto:fsasaki@w3.org] > *Sent:* Sunday, January 27, 2013 9:25 AM > *To:* public-multilingualweb-lt@w3.org > <mailto:public-multilingualweb-lt@w3.org> > *Subject:* issue-68 from an annotation representation point of > view, with potential implications for annotatorsRef and standoff > markup > > Hi all, > > sorry, this is going to be long ... but please have a look, esp. > the implementers (both consumers and producers) of terminology and > disambiguation. > > in the last 10 1/2 months, since Tadej's presentation at the > Dublin workshop, we had a lot of discussions on disambiguation, > and sometimes (as now) including terminology. But it seems that we > never discussed whether ITS2 approach of selection (global, local, > inheritence, overriding (partial or not)...) is suitable for this > type of information. > > By "this type" I mean annotation of linguistic information. Most > ITS2 and ITS1 data categories are process related (e.g. "Don't > translate this ..."), but both terminology and what's now called > disambiguation are information that you find in linguistic corpora > and processing tools. Now, my point is that in both in such > natural language processing tool chains and related corpora, a > representation of information *inline per document node* is rather > the exception. Mostly you have *standoff information*, that is a > complete seperation of information from actual content - as in NIF. > > Mârcis: > > Parsing and understanding of the mark-up is the main difference > (how overriding and inheritance work) that requires this > „stand-off” mechanism for „this type” of annotation. If there > would be only flat level annotation, we would not have this > discussion at all. Also, “stand-off” is only good if you really > have to add a lot of complex data, but here we have to add just a > flag or a reference (if put in simple words). 
In Prague me and > Tadej discussed that if hierarchical information is needed, that > should be encoded in the external resources. > > If I understand correctly, stand-off mark-up has no inheritance > and it has no overriding – it describes a span? > > > Correct. > > > If so, I assume that with your proposal we are back at requiring > hierarchical annotation, overlapping annotation and contradictive > annotation, which will allow all kinds of text analysis > annotations (without restricted types – term, entity, ontology, > lexical, etc.). This will require data consumers to re-think their > data consumption strategies as they will have to disambiguate the > “disambiguation-style” annotations (which means that at the end we > do not help data consumers, but make the life rather more difficult). > > > As said above: if a consumer doesn't want to deal with several layers > of annotations, it can just say: I want to consume the annotations > made by Tilde or by JSI. This is guaranteed by the annotatorsRef > attribute. > > > Mârcis: Again, see my comment above about the consumer having to know > whom to trust. > > > The current state of quo creates this situation: if Tilde already has > annotated a text, and JSI wants to add annotations, and you want to > compare them: how to do this? You can say "one creates terminology > markup, the other disambiguation markup". But what about even more tools? > > Mârcis: I agree, currently you can have only one Terminology > annotation tool (the disambiguation is not a nice example), but in the > current version we acknowledge that there can be only one Terminology > annotator for a single phrase (I am fine with that). I understand that > the stand-off is a possible way how to solve this issue, but for the > previous consumer this will create non-comparable annotations (or he > will have to update to consuming the ambiguous annotations or just > trust one of the annotations). For future consumers this might as well > be acceptable. 
> > In the current ITS 2.0 draft the annotation is flat - it is simple > to parse, simple to consume, simple to produce – it is not > hierarchical and it does not overlap. > > > See above - if a consumer does not want to consume relations between > annotation tools or levels, you don't have to, and annotatorsRef gives > you the ability to differentiate the annotations. > > Btw., current disambiguation and terminology also don't inherit, see > the table at > http://www.w3.org/TR/2012/WD-its20-20121206/#datacategories-defaults-etc > that is: the annotations of both data categories don't inherit to > nested markup. So we could resolve the issue also via something like this: > Input before annotation: Dublin > First annotation: <span its-term="yes">Dublin</span> > Second annotation: <span its-term="yes"><span > its-term="no">Dublin</span></span> > > Mârcis: As I understand this disambiguates Terminology mark-up from > different producers automatically? This is possible in the current > version. > > The non-existent inheritance, of course requires the marked span not > to contain other mark-up. This is, of course, is a limitation of the > current version and I agree, might be solved by the stand-off mechanism. > > > But that has the annotatorsRef issue if several "term annotation" > tools have been used. > > Mârcis: In the current version we as a consumer treat all Terminology > annotators as equally important, thus the annotatorsRef for us is not > necessary. However, for data tracking purposes this might be important > and with the stand-off mechanism the annotatorsRef becomes mandatory > (at least in our consumption scenario). > > From this perspective, the proposed change is a complete overhaul > of the 2 data categories in something different. > > Also – we do require the flag. That is something that will be > heavily complicated with the “stand-off” mechanism (that has to be > understood), or won’t be possible at all?! 
> > > Setting the type would give you the flat. I know that in the proposal at > http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2013Jan/0212.html > Tadej dropped the flat. But we could have instead of fixed values, > e.g. "term", an URI. You could then interpret that URI as a term flag, > e.g. > <span tan-type="http://example.com/term" <http://example.com/term>> > > Mârcis: I agree that it gives term=“yes”, but not term=”no”. > > Having a simple attribute inline is the simplest you can achieve. > Having a “stand-off” on the other hand is the most complex you can > achieve. > > And ... if I remember correctly, we did not want to make life > difficult for producers/consumers if they did not care about the > other data categories? > > > > Correct, but here we have the situation that two data categories might > be just too similar for keeping everthing as is. > > > > > Why is that? In linguistic annotation it is common that you have > several layers of information, like our lexical, ontological etc. > information. Some of these might be complex in itself (e.g. named > entities), some of these might be related to others (e.g. an > ontological concept related to a lexical item). I won't try to > define these layers here - but my point is that due to the > complexity of representing such information inline, nearly nobody > is trying to represent several layers at the same time inline. The > common approach is rather to have a base layer, and then pointers > from the various annotation layers. > > In a sense you can describe NIF as an approach of taking character > offsets as the implicit base layer (implicit because characters > don't need explicit anchors). The TEI here > http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO > provides an example for an offset using words as the base unit, > with exlicit xml:id attributes. > > So far we haven't taken this approach for terminology or > disambiguation. 
This is why we had to came of with 16+ attributes: > if you want to do everything "inline", you need to differenciate > attribute names and come up with a monster data category. Inline > annotations are just not suitable for such information. > > Mârcis: > > I disagree that 16+ attributes are the difficulty here. The > difficulty from the beginning were the questions: 1) how many > types of annotation should be supported (we narrowed the list down > to 4 – terminology, named entities, ontology concepts, lexical > concepts)? 2) should overlapping be supported? 3) should > hierarchical annotation be supported? 4) should contradicting > annotation be supported? > > > about 1): no type at all would be one solution, but the term identifer > issue would come up. about 2): if a consumer just takes up one > annotation, e.g. the output of Tilde's tool, there is no need to > process overlap. And we can leave that to consumers IMO. 3): same like > 2). 4) Same like 2). > > Mârcis: 1) I agree that the type itself is important. I know that > Tadej said that a Ref URI might have the Type embedded, but for > Terminology we do not always have the URI available. > > Mârcis: 2-4) I agree, but only if you ask the consumers to know which > producer to trust. If that is not an issue, then it is fine by me (it > is a compromise as we lose the ability to not have to trust anyone at all) > > Also ... data producers would have to worry just about a maximum > of 5 attributes simultaneously and they would be able to ignore > the rest. For instance, I have no use for the attributes for > disambiguation categories. > > > I think that's the heart of issue-68: there are two quite similar > pieces of information, but consumers separate them. 
> > > > Although I would agree writing a parser that parses all these > attributes (just for compliancy with the data category), I would > as a consumer consume only the ones related to terminology and I > as a producer would produce only those related to terminology. I > would nor consume, nor produce the disambiguation related attributes. > > > That wouldn't work if we have one data category: our conformance > requirements say: you implement it global or local or both. You then > can also decide whether you implement it in HTML or XML or both. But > you cannot cherry pick attributes for consumption. We don't say > anything wrt production - but our schema helps us to verify that the > "right data" has been produced. > > Mârcis: I think you are misunderstanding – parsing content for > consumption and production is a totally different architectural level > than the logics that makes any use of the content. So ... in my > understanding we are not failing on conformance. We are if there is a > requirement that we have to really consume and really produce the > other types of data in the application logics layers. Is this the case? > > > > From that perspective, I disagree to the complexity in the > attribute scenario. > > > I think part of the disagreement comes from the "free spirit" you have > as data producer and consumer, see above. > > > > For terminology I require a flagging mechanism (with the > possibility to add either a reference, a confidence score, or both). > > I do agree that we are limiting the annotation with having > separate attributes, but then again ... ITS 2.0 does not have to > represent every possible text analysis annotation type. It is > supposed to aid in localisation processes and not all text > analysis types have a valid use case (or a necessary or even a > potentially useful use case) in localisation. > > Also ... 
if we are re-inventing terminology and disambiguation, > maybe we should analyse which other data categories fall under the > type “text analysis”? Domain is a suitable candidate as well (and > if we create a suitable text analysis category, maybe domain > analysis can be subcategorized under that as well in order to > support automated domain analysis solutions (EuroVoc has an > automated domain classifier, for instance))?). > > > > Here I would disagree: our domain data category is just for > transporting domain information between content and tools, including a > potential mapping of domain identifiers inbetween. The "terminology vs > disambiguation" discussion came from the observations that two data > categories in ITS2 have a huge overlap. I don't see that situation for > domain. > > Mârcis: I do not agree that domain identification is in structure > different than terminology annotation or named entity recognition or > even sentence breaking, but fair enough ... I brought this up to show > that in general domain annotation is also annotation... an equal in > structure task as term tagging or named entity recognition (usually > just in bigger spans – but not necessarily). > > With this I would like to emphasize that overgeneralization is not > the best approach as we are creating data categories for different > consumption scenarios. > > > But are they so different? It sounds to me rather that in your > scenario, many opportunities are lost because you don't consume > disamgiuatino at all ... so having one umbrella data category might > even give you more data consumption opportunities. > > Mârcis: :) ... of course, the more annotation, the more possibilities > (this is a philosophical truth – I cannot argue). But for producing > Terminology in our Use Case we do not require knowledge on the purely > semantic lexical, ontological or entity level. 
Our Use Case uses > knowledge from the tools that are applied in the process and they do > not ask for information supplied by those data categories (we do > require domain, language and others though ... that we, of course, > use). The use of Disambiguation data categories would require > re-thinking of the modules that do not deal with ITS 2.0 explicitly – > the term extraction, term weighing, term retrieval methods which are > out-of-scope in this project. > > > > So, the first idea behind below approach is: if you want to > represent just one linguistic layer (or "qualifier" in Christian's > mail at > http://lists.w3.org/Archives/Public/public-multilingualweb-lt-comments/2013Jan/0014.html > ) , you use "tan-type" attribute to differentiate annotations. > That leads to following models inline models: > > 1) A term has its-tan-type with value "term" and optional > its-tan-confidence, its-tan-ident-ref, and its-tan-info-ref. Example: > <span its-tan-type="term" > its-tan-ident-ref="http://termdatabase.example.com/entry37" > <http://termdatabase.example.com/entry37> > its-tan-info-ref="http://termdatabase.example.com/entry37/description" > <http://termdatabase.example.com/entry37/description> > its-tan-confidence="1.0">Dublin</span> > Comparison to current ITS1 "Terminology": > its-tan-type="term" plays the role of term="yes". its-tan-info-ref > plays the role of termInfoRef. its-tan-ident-ref links to a term > data base. its-tan-confidence provide confidence information. > (Esp. at Marcis: I know that "Dublin" is a bad candidate for a > term, I'm just trying to exemplify the annotation approach here) > > > Mârcis: > > Also one thing I tried to emphasize at lunchtime in Prague, > TermInfoRef is not necessarily an identity reference. It does not > always point to something unique (if we understand that a set is > not unique). You can have multiple term entries from multiple user > collections in a term bank relating to one semantic term. 
In the > case if you do not specify a domain you could end up having a > reference that points to totally different (also contrasting) > terms or if you do not specify a target language you may end up > having multiple entries because most of the collections are > bilingual and not multilingual. Why is that so? It is because a > term-bank is not a disambiguator – it acts like a search engine > (more or less) – the disambiguation for the “external” information > (the meaning; the term unithood is defined by the flag term=”yes” > itself) has to be done by the consumers (translation engines or > human translators). In most cases (as in the biggest term-banks – > IATE, ETB) it does not have a hierarchical understanding of terms > as some lexical (WordNet, f.i.) or ontological resources may have. > For MT engines a valuable information is already – term=“yes” as > that defines the term unithood, which means that the term should > be translated as a non-breakable phrase. So ... the MT engine > could ignore the TermInfoRef at all if it does not have a suitable > disambiguation module and maybe leave the disambiguation to human > post-editors... > > So ... “ident” is misleading (at least in the case of Terminology > annotation)! > > Also important: HOW WOULD YOU REPRESENT term=”no”? This is a very > important feature of the flag type annotation. > > would you say: its-tan-type="not-a-term"? That would require data > producers to handle higher complexity annotation! > > > > I don't have a clear answer to above questions - others, feel free to > chime in if you do. > > Mârcis: This is important to understand. Will this be dropped at all > or will there be an alternative mechanism? > > 2) An entity has its-tan-type with value "entity" and optional > its-tan-confidence, its-tan-ident-ref, and its-tan-class-ref. 
> Example:
> <span its-tan-type="entity" its-tan-ident-ref="http://dbpedia.org/resource/Dublin" its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" its-tan-confidence="0.7">Dublin</span>
>
> So above is only different naming compared to current "Terminology" and Disambiguation. Below is now the standoff approach. The processing expectation for tools *producing the annotation* is like this:
> - If there is no inline annotation, just create it (e.g. 1) or 2))
> - If there is inline annotation, check if there is an id attribute (in HTML) or xml:id (if XML serialization of HTML is used and with lower precedence compared to id). For formats other than HTML, add xml:id if possible or use the id attribute appropriate for that format.
>
> Then, for creating standoff annotations, add an "its:textAnalyticsAnnotations" element to the document, e.g. in HTML "script" if needed.
>
> Let's assume before annotation we have
> <span its-tan-type="entity" its-tan-ident-ref="http://dbpedia.org/resource/Dublin" its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" its-tan-confidence="0.7">Dublin</span>
> Then after annotation we would have
> <span its-tan-type="entity" its-tan-ident-ref="http://dbpedia.org/resource/Dublin" its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" its-tan-confidence="0.7" *id="a8"*>Dublin</span>
> and this:
> <its:textAnalyticsAnnotations>
> <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="term" its-tan-ident-ref="http://termdatabase.example.com/entry37" its-tan-info-ref="http://termdatabase.example.com/entry37/description" its-tan-confidence="1.0"/>
> </its:textAnalyticsAnnotations>
>
> Let's now assume that before annotation we have
> <span its-tan-type="term" its-tan-ident-ref="http://termdatabase.example.com/entry37" its-tan-info-ref="http://termdatabase.example.com/entry37/description" its-tan-confidence="1.0">Dublin</span>
> Then after annotation we would have
> <span its-tan-type="term" its-tan-ident-ref="http://termdatabase.example.com/entry37" its-tan-info-ref="http://termdatabase.example.com/entry37/description" its-tan-confidence="1.0" *id="a8"*>Dublin</span>
> and this:
> <its:textAnalyticsAnnotations annotatorsRef="tan|tool-x">
> <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity" its-tan-ident-ref="http://dbpedia.org/resource/Dublin" its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" its-tan-confidence="0.7"/>
> </its:textAnalyticsAnnotations>
>
> Now, if several "entity" annotation tools have been used, we could also have
> <its:textAnalyticsAnnotations>
> <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity" its-tan-ident-ref="http://dbpedia.org/resource/Dublin" its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" its-tan-confidence="0.7" annotatorsRef="tan|tool-x"/>
> <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity" its-tan-ident-ref="http://dbpedia.org/resource/Dublin" its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" its-tan-confidence="0.4" annotatorsRef="tan|tool-y"/>
> </its:textAnalyticsAnnotations>
>
> Above approach would also influence the consumption of this data category, and of annotatorsRef:
>
> - A consuming tool goes through the document and gathers all textAnalyticsAnnotations elements
> - It then goes through the document. For each element node
>   * check for existing inline markup. If it's available, add it to the list of annotations for that node. Assume the inline version up in the document tree of annotatorsRef to be responsible for the annotation of that markup.
>   * check the accumulated standoff textAnalyticsAnnotations elements for occurrences of IDs that match the node. If there is such an ID, add the related annotation to the list for the node, including the additional annotatorsRef tool, e.g. tool-x or tool-y in the above case.
>
> Mārcis:
>
> Do I understand you correctly that we may end up having contradicting annotations also, for instance term=”yes” and term=”no”? This would require a data consumer to be able to handle a lot of ambiguity in the data.
>
> Sure. But they could identify the ambiguity with a multilayer annotation that clearly identifies the tool used, via annotatorsRef. Currently, what would you do with this
> <span its-term="yes"><span its-term="no">screwdriver</span></span>
> how would you resolve the ambiguity here? "Terminology" has no inheritance. This makes sense, otherwise in the following
> <span its-term="yes"><span class="em">screw</span>driver</span>
> the embedded "span" element would constitute a span. But that leads to this test suite output for
> <span its-term="yes"><span its-term="no">screwdriver</span></span>
> /span[1] term="yes"
> /span[1]/span[1] term="no"
> and both "span" nodes contain the same string "screwdriver". So how do you resolve the ambiguity here?
>
> Mārcis: I do not see the issue in the above example. As you said, Terminology does not inherit, therefore, the only thing that is stated is that the “screwdriver” is not a term.
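The nested "screwdriver" case above can be reproduced with a small sketch. It assumes only what the thread states, namely that its-term is read per element and never inherited; the function name local_term_values and the path format are illustrative and merely mimic the test suite output quoted above.

```python
from xml.etree import ElementTree as ET


def local_term_values(markup: str) -> list[tuple[str, str]]:
    """List (path, its-term value) for every element that carries its-term itself."""
    root = ET.fromstring(markup)
    results: list[tuple[str, str]] = []

    def walk(el, path):
        value = el.get("its-term")
        # No inheritance: only an attribute set on this very element counts.
        if value is not None:
            results.append((path, value))
        counts: dict[str, int] = {}
        for child in el:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    walk(root, f"/{root.tag}[1]")
    return results


nested = '<span its-term="yes"><span its-term="no">screwdriver</span></span>'
print(local_term_values(nested))
# [('/span[1]', 'yes'), ('/span[1]/span[1]', 'no')]

emphasis = '<span its-term="yes"><span class="em">screw</span>driver</span>'
print(local_term_values(emphasis))
# [('/span[1]', 'yes')]
```

Both elements in the first case wrap the same string, so the two values are reported side by side; which one a consumer acts on is exactly the open question in this exchange.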
> > Mârcis: However, one thing I have not understood so far – is there a > limitation of how many annotations can be done by the same producer > (human or machine). Even the annotatorsRef in my understanding does > not always resolve contradictions. Or ... is there a precedence rule > if there are equal, but contradicting stand-off annotations, for > instance (I made this up to simplify the under): > > <its:textAnalyticsAnnotations> > <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity" > its-tan-ident-ref="http://dbpedia.org/resource/Dublin" > <http://dbpedia.org/resource/Dublin> > its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place" > <http://nerd.eurecom.fr/ontology#Place> its-tan-confidence="0.7" > annotatorsRef="tan|annotator-1"/> > <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity" > its-tan-ident-ref="http://dbpedia.org/resource/Dublin" > <http://dbpedia.org/resource/Dublin> > its-tan-class-ref="http://nerd.eurecom.fr/ontology#Person" > <%22http:/nerd.eurecom.fr/ontology#Person%22> its-tan-confidence="0.4" > annotatorsRef="tan|annotator-1"/> > <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity" > its-tan-ident-ref="http://dbpedia.org/resource/Dublin" > <http://dbpedia.org/resource/Dublin> > its-tan-class-ref="http://nerd.eurecom.fr/ontology#Organisation" > <%22http:/nerd.eurecom.fr/ontology#Organisation%22> > its-tan-confidence="0.4" annotatorsRef="tan|annotator-1"/> > </its:textAnalyticsAnnotations> > Mârcis: Here “Dublin” can be all three (Place, Person, Organisation) > simultaneously, right? > > In summary, this standoff tries to solve several issues: > > - avoid the 16+ inline attribute monster data category > > Mârcis: > > Again, I did not understand why this is worse than having a heavy > “stand-off” mechanism. 
>
> - allow for multiple annotations of the same span, with different
> tools
>
> Mârcis:
>
> In Prague Tadej and I had a discussion about whether there is a use
> case for using two tools producing contradicting mark-up and we came to
> the conclusion that neither of us would produce such data and if
> such a scenario exists, then the content producer should fuse
> (disambiguate) the outputs of the two separate tools prior to ITS
> 2.0 metadata application. I am talking about the same type (for
> instance, two term annotation tools on the same span) of
> annotation, not two separate types.
>
> Then my question: does such a scenario exist? Who is implementing it?
>
> If both you and Tadej agreed on one data category, everybody who
> wants to use both of your tools would implement it. And this has the
> value that people could compare the outcome of the tools.
>
> Mârcis: So you would ask the consumers to disambiguate or choose (in
> this way they would not use both if both would produce Terminology),
> right? If yes, it is totally fine, I just want to make sure I
> understand your idea.
>
> - avoid the ITS1/2 or general inline annotation issues with
> inheritance and overriding - as with the standoff approach
> exemplified at
> http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO
> annotation information is just accumulated for a given base item
> (in our case, element nodes with an ID).
>
> Mârcis:
>
> So ... at the end, with this method we would allow:
>
> 1) Hierarchical annotation
> 2) Contradicting annotation
> 3) (possibly also) overlapping annotation
>
> Correct.
>
> I'm not yet asking for this change, but I see it as a way forward
> that could make the life of both annotation producers (Marcis and
> Tadej) and consumers (Yves et al.) simpler.
> So I'm eager to hear
> thoughts on this :)
>
> Mârcis:
>
> As I understand the proposal – it is the complete opposite of
> being simple (or simplifying things as they are right now having
> Terminology and Disambiguation separately), it complicates things
> significantly from the Terminology standpoint as now I do not see
> where term=”yes” fits in, we have to deal with contradicting
> annotation (whether to allow or prohibit it is now a question for the
> consumers – I as a consumer would ask to prohibit it as I do not
> see a use case for term=”yes” and term=”no” at the same time), and
> what is more, we have to re-implement the parsers so that instead
> of overriding and inheritance they would work with accumulating
> information (and this is a complete revision of the parser logic
> for the Terminology data category).
>
> I understand the burden on implementation you emphasize - but it seems
> that one scenario - annotation using different tools even for the
> terminology data category, see the nested "terminology" annotations
> above - is not resolved by your proposal. You say this would not be
> implemented before ITS2 annotation - but what if the tool providers are
> not from the same organization?
>
> Mârcis: Our proposal did not allow nested annotations. Nor does the
> current ITS 2.0 version. Also – this was my question – is there a
> necessity to produce 2 Terminology annotations or 2 named entity
> annotations on top of each other? I see that you are saying – Yes,
> there is.
>
> Best,
>
> Felix
>
> Thoughts?
>
> - Felix
Received on Tuesday, 29 January 2013 17:24:53 UTC