Re: issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup

Hi Mârcis, all,

just a small comment.

Am 29.01.13 19:26, schrieb Mârcis Pinnis:
>
> Hi Felix, all,
>
> I have replied inline (in a color close to cyan).
>
> Best regards,
>
> Mârcis ;o)
>
> *From:*Felix Sasaki [mailto:fsasaki@w3.org]
> *Sent:* Tuesday, January 29, 2013 7:24 PM
> *To:* public-multilingualweb-lt@w3.org
> *Subject:* Re: issue-68 from an annotation representation point of 
> view, with potential implications for annotatorsRef and standoff markup
>
> Hi Mârcis, all,
>
>
> Am 29.01.13 15:48, schrieb Mârcis Pinnis:
>
>     Hi Felix,
>
>     My comments are inline.
>
>     Best regards,
>
>     Mârcis ;o)
>
>     *From:*Felix Sasaki [mailto:fsasaki@w3.org]
>     *Sent:* Tuesday, January 29, 2013 11:27 AM
>     *To:* Mârcis Pinnis
>     *Cc:* public-multilingualweb-lt@w3.org
>     <mailto:public-multilingualweb-lt@w3.org>; Artūrs Vasiļevskis
>     *Subject:* Re: issue-68 from an annotation representation point of
>     view, with potential implications for annotatorsRef and standoff
>     markup
>
>     Hi Mârcis, all,
>
>     even if this discussion has now continued in a different thread,
>     let me give further feedback here too - it may help to clarify
>     things, and to continue the discussion in general.
>
>
>     Am 28.01.13 11:18, schrieb Mârcis Pinnis:
>
>         Hi Felix, all,
>
>         I see that there have been a lot of opinion exchanges on the
>         proposal brought up by Felix.
>
>         I have some comments to add. I am now speaking as a data
>         producer and later maybe also a data consumer (and I am not
>         speaking as a linguist! ... that has to be understood as well).
>
>         First of all, I would like to ask whether we agreed that ITS
>         2.0 should be able to represent data in the structure as TEI,
>         NIF, XCES or other NLP related standards do – that is, as far
>         as I understand, the direction where this discussion is
>         heading. Should ITS 2.0 try to re-invent these data standards?
>         I would be inclined to say – no!
>
>
>
>     As far as I understand, these standards are not yet implemented in
>     localization tool chains. However, the "multilayer annotation"
>     proposal brought one feature from these standards into such tool
>     chains: the standoff mechanism. I'd rather see this as a value
>     than a problem: bringing NLP friendly representations into
>     localization workflows. Would you disagree?
>
>     Mârcis: From the perspective of adding all kinds of annotation,
>     overlapping, contradicting, hierarchical, it certainly is
>     beneficial (I do agree in this aspect).
>
>     Mârcis: From the perspective of implementation:
>
>     Mârcis: 1) for consumers you suggest reading only known mark-up. I
>     do agree that if we would care only about one tool then we could
>     ignore the rest. But this asks consumers to know who produced the
>     mark-up. By having a flat level flag the consumers did not have to
>     worry about who produced the annotation (also – human and machine
>     users could apply annotation and have an effect on the data); they
>     just read the annotation and used it as is. This is not possible
>     in the stand-off mechanism – the consumer has to know which
>     producer to trust in order to consume the data; otherwise the
>     consumer has to have a disambiguation module at hand that tries to
>     find some reason in all the annotations.
>
>     Mârcis: 2) for producers the stand-off mark-up requires adaptation
>     (more than just adding attributes, but still adaptation), which
>     probably is not a big issue.
>
>     Mârcis: We (Tilde) are doing both right now – we consume and we
>     produce Terminology. But ... we could switch to just producing and
>     not consuming (which is the part that worries me more...). So we
>     would not have to deal with the disambiguation of the stand-off
>     mark-up and also which annotator to trust or not.
>
>         Secondly, as we are in a last call phase, I understand that
>         such significant change to the ITS 2.0 data categories would
>         rewrite them (and maybe it will get clearer when you read my
>         comments till the end). I as a data producer now will have to
>         rewrite my parsers and data producing systems just to
>         accommodate the „stand-off” mechanisms, which is, from a content
>         provider's and content consumer's perspective, a diametric change
>         compared to just adding additional independent attributes or changing
>         the names of attributes (which was actually the initial
>         proposal by Tadej and me). I would like for others to
>         understand that this solution asks for re-development rather
>         than simple adjustments.
>
>
>     I agree - this would be quite some work, and we need to justify
>     the benefit clearly.
>
>     Mârcis: The change will affect our Showcase the most as right now
>     we rely on inline mark-up. If we won’t have the inline mark-up at
>     the end (or we will have additional stand-off mark-up) then we
>     will have to re-think the Showcase design and the visualisation
>     possibilities in the Showcase.
>
>
>
> Is this because of CSS used for visualization? I agree that with 
> standoff markup visualization gets more complicated - but not that 
> much. See the JavaScript bit used for localization quality issue here
> http://www.w3.org/TR/2012/WD-its20-20121206/#EX-locQualityIssue-html5-local-2
> http://www.w3.org/TR/2012/WD-its20-20121206/examples/html5/EX-locQualityIssue-html5-local-2.html
>
> The resolution of the ID is only a few lines of javascript code.
>
> Mârcis: Thank you for the hint to the javascript code. If we will have 
> an agreement that the stand-off mechanism has to be applied, we will 
> definitely look into this (if our developers won’t already have a 
> solution).
>
>         Other comments are inline below...
>
>         After reading the comments here is a summary:
>
>         In my understanding the proposal complicates data production
>         and consumption significantly as it creates possibilities for
>         a lot of ambiguity, which I guess is the opposite of what
>         initially was meant by the disambiguation data category(!) and
>         at least in our Use Case it requires revision of parser logics
>         and ITS 2.0 metadata annotation logics.
>
>
>     The proposal basically says: here is a way to represent ambiguity,
>     created by several tools annotating the same document. However,
>     I'd see this as a value, not a problem: with separate
>     "its:textAnalyticsAnnotations" elements, including each its own
>     annotatorsRef, you can clearly identify which tool created what
>     annotation. This may be even clearer than the current annotatorsRef.
>
>
>     Mârcis: I do agree, however see above for my comment related to
>     consumers. For them the consumption is different – they will have
>     to know whom to trust. If you think that consumers have to know
>     who produced the data all the time then it is fine by me... (but
>     it is a change from the Terminology as it is right now).
>
>
> I got your point about trust. And - trying to bring the discussion 
> back - the initial comment was not about trust or no trust. It was 
> rather about unifying terminology and disambiguation - and by 
> reflecting the levels mentioned in disambiguation in different 
> annotation levels, I tried to find a work-around for this. But that 
> work-around and multilayer annotation are not the main topic.
>
> So, if we drop disambiguation granularity, keep the term yes / no 
> requirement, we may have this representation as a unified approach for 
> both data categories:
>
>
> <span its-tan-confidence="0.7"
> its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
> its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
> its-term="no">Dublin</span>
>
> If we define that its-term="yes" triggers its-tan-ident-ref to be 
> interpreted as a reference to a termDB, we would have unified the data 
> categories. Well, I guess I'm trying to use an axe for moving this 
> forward ... but let's see what you think.
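
A minimal sketch of how a consumer might read the unified attributes just described, assuming the attributes of one span are already available as a dictionary. The names `Interpretation` and `interpret` are hypothetical and only illustrate the rule that its-term="yes" switches the meaning of its-tan-ident-ref; they are not part of ITS 2.0 or of the proposal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interpretation:
    kind: str                      # "term" or "disambiguation"
    term_db_ref: Optional[str]     # set only when its-term="yes"
    entity_ref: Optional[str]      # set only when its-term is not "yes"
    class_ref: Optional[str]
    confidence: Optional[float]


def interpret(attrs: dict) -> Interpretation:
    """Interpret one span's attributes under the unified reading."""
    is_term = attrs.get("its-term") == "yes"
    ident = attrs.get("its-tan-ident-ref")
    conf = attrs.get("its-tan-confidence")
    return Interpretation(
        kind="term" if is_term else "disambiguation",
        # The same attribute is read as a term database reference
        # or as an entity identifier, depending on the its-term flag.
        term_db_ref=ident if is_term else None,
        entity_ref=None if is_term else ident,
        class_ref=attrs.get("its-tan-class-ref"),
        confidence=float(conf) if conf is not None else None,
    )


dublin = interpret({
    "its-term": "no",
    "its-tan-confidence": "0.7",
    "its-tan-class-ref": "http://nerd.eurecom.fr/ontology#Place",
    "its-tan-ident-ref": "http://dbpedia.org/resource/Dublin",
})
```

For the Dublin span, `dublin.kind` is "disambiguation" and the identifier ends up in `entity_ref`. With its-term="yes" the same value would land in `term_db_ref` instead.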
>
> Mârcis: 1. Let’s have a summary on this:
>
> 1.1.We have inline mark-up and stand-off mark-up
>
> a.The stand-off mark-up has no inheritance and no overriding, but 
> works on spans allowing hierarchical annotation and also conflicting 
> annotations - OK
>
> b.The inline mark-up has no inheritance, which means that basically 
> nested mark-ups are treated differently than in the alternative 
> stand-off mark-up (won’t we create two ways to understand the 
> annotation? Seems like the inline is not compatible with stand-off)
>

In other data categories with stand-off, we have either stand-off or 
local. So maybe the incompatibility is not an issue if we do the same here.

> 1.2.We drop the granularity and let the class-ref or the ident-ref 
> indicate what the disambiguation categories are (only if there isn’t 
> a term=”yes”) - OK
>
> 1.3.If there is term=”yes”, we ignore the class-ref as that is not 
> necessary, have the confidence optional and have the ident-ref 
> optional. If the ident-ref is used, it points to a term-bank. -OK
>
> 2.There are generally 3 possibilities for its-term as I see it if we 
> do not have conflicting annotations (but I understand that there may be):
>
> 2.1.It is not given at all – that means that we can apply mark-up if 
> we would like to
>
> 2.2.It says its-term=”yes” – we do not apply mark-up, because an 
> existing mark-up exists saying that a phrase is a term (we might 
> however try referencing to a term base if the reference does not exist 
> and the borders of the spans match)
>
> 2.3.It says its-term=”no” – we do not apply mark-up, because an 
> existing mark-up exists saying that a phrase is definitely not a term
>
> 3. Keeping in mind that we may have the stand-off principle, this 
> creates the following possibilities for data production (somewhat 
> different to the previous):
>
>       3.1. If there is no mark-up – we apply it if we want to
>
>       3.2.a) If there is mark-up – we ignore it and apply our mark-up 
> if we want to (this way we would create the conflicting mark-ups), or:
>
>       3.2.b) If there is mark-up – we do not apply mark-up as there 
> already is existing mark-up for a given span (this way we would 
> accumulate all stand-off mark-ups and inline mark-ups and mark only 
> the fragments that have no span under a term mark-up)
>
> For consumers –3.2.a is better if we ask consumers to trust producers, 
> 3.2.b is better if we do not have to express trust to producers.
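
The producer options 3.1, 3.2.a and 3.2.b above can be sketched as a small decision function. This is only an illustration under stated assumptions: `Policy` and `should_annotate` are hypothetical names, and the existing mark-up on a span is reduced to the list of its-term values already found on it.

```python
from enum import Enum
from typing import Sequence


class Policy(Enum):
    OVERLAY = "3.2.a"    # annotate regardless of existing mark-up
    FILL_GAPS = "3.2.b"  # annotate only spans without existing mark-up


def should_annotate(existing: Sequence[str], policy: Policy) -> bool:
    """Decide whether a producer adds its own term mark-up to a span.

    `existing` holds the its-term values ("yes" or "no") that are
    already present on the span, inline or stand-off.
    """
    if not existing:
        return True  # case 3.1: no mark-up yet, annotate if we want to
    # Existing mark-up: only the overlay policy adds a further,
    # possibly conflicting, annotation.
    return policy is Policy.OVERLAY
```

Under `Policy.FILL_GAPS`, a span that already carries its-term="no" is left alone, which matches case 2.3 above. Under `Policy.OVERLAY` the producer would add its own annotation next to it.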
>
> To summarize the whole comment: 1) to simply get rid of the „red” 
> 1.1.b. issue above, we would apply only stand-off mark-up, ignoring 
> in-line mark-up possibilities; 2) depending on the decision on whether 
> consumers have to trust producers, we would produce data according to 
> 3.2.a) or 3.2.b); and 3) we would consume data trusting only our 
> annotation tool if we have to trust producers, or consolidate 
> annotations if trust is not needed.
>
> This is how we would proceed if this scenario moves forward.
>

Good to know. Thanks a lot for looking into this in detail! Now looking 
forward to further feedback.

Best,

Felix

>
> Best,
>
> Felix
>
>
>
>
>
>         However, I will have a discussion with my colleagues in order
>         to estimate how much changes would be required to our use case
>         from a development perspective.
>
>         I also understand that this proposal wants to fuse all types
>         of possible NLP-related text analyses together, but I did not
>         have the feeling that ITS 2.0 should be used as a TEI, XCES,
>         NIF, etc. clone? This is how I see where the changes will lead us.
>
>         However, I also do not say that that is a bad thing... we
>         would definitely make linguists happier, but I as a
>         content provider and later also a consumer would have
>         difficulties working with the data as I would have to agree
>         accepting uncertainty/ambiguity in the ITS 2.0 metadata by
>         default (except external resources as those are defined
>         between consumers/producers and not ITS 2.0).
>
>         Best regards,
>
>         Mârcis ;o)
>
>         *From:*Felix Sasaki [mailto:fsasaki@w3.org]
>         *Sent:* Sunday, January 27, 2013 9:25 AM
>         *To:* public-multilingualweb-lt@w3.org
>         <mailto:public-multilingualweb-lt@w3.org>
>         *Subject:* issue-68 from an annotation representation point of
>         view, with potential implications for annotatorsRef and
>         standoff markup
>
>         Hi all,
>
>         sorry, this is going to be long ... but please have a look,
>         esp. the implementers (both consumers and producers) of
>         terminology and disambiguation.
>
>         in the last 10 1/2 months, since Tadej's presentation at the
>         Dublin workshop, we had a lot of discussions on
>         disambiguation, and sometimes (as now) including terminology.
>         But it seems that we never discussed whether the ITS2 approach
>         of selection (global, local, inheritance, overriding (partial
>         or not)...) is suitable for this type of information.
>
>         By "this type" I mean annotation of linguistic information.
>         Most ITS2 and ITS1 data categories are process related (e.g.
>         "Don't translate this ..."), but both terminology and what's
>         now called disambiguation are information that you find in
>         linguistic corpora and processing tools. Now, my point is that
>         in both such natural language processing tool chains and
>         related corpora, a representation of information *inline per
>         document node* is rather the exception. Mostly you have
>         *standoff information*, that is a complete separation of
>         information from actual content - as in NIF.
>
>         Mârcis:
>
>         Parsing and understanding of the mark-up is the main
>         difference (how overriding and inheritance work) that requires
>         this „stand-off” mechanism for „this type” of annotation. If
>         there would be only flat level annotation, we would not have
>         this discussion at all. Also, “stand-off” is only good if you
>         really have to add a lot of complex data, but here we have to
>         add just a flag or a reference (if put in simple words). In
>         Prague me and Tadej discussed that if hierarchical information
>         is needed, that should be encoded in the external resources.
>
>         If I understand correctly, stand-off mark-up has no
>         inheritance and it has no overriding – it describes a span?
>
>
>     Correct.
>
>
>
>         If so, I assume that with your proposal we are back at
>         requiring hierarchical annotation, overlapping annotation and
>         contradictive annotation, which will allow all kinds of text
>         analysis annotations (without restricted types – term, entity,
>         ontology, lexical, etc.). This will require data consumers to
>         re-think their data consumption strategies as they will have
>         to disambiguate the “disambiguation-style” annotations (which
>         means that at the end we do not help data consumers, but make
>         the life rather more difficult).
>
>
>     As said above: if a consumer doesn't want to deal with several
>     layers of annotations, it can just say: I want to consume the
>     annotations made by Tilde or by JSI. This is guaranteed by the
>     annotatorsRef attribute.
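
A minimal sketch of the "consume only one annotator" strategy mentioned above, assuming the stand-off layers have already been read into plain dictionaries. The in-memory layout, the annotator URIs and the function name `annotations_from` are hypothetical; the point is only that a consumer can filter on the annotatorsRef value and ignore every other layer.

```python
from typing import List


def annotations_from(layers: List[dict], trusted: str) -> List[dict]:
    """Return only the annotations whose layer was produced by `trusted`."""
    return [
        ann
        for layer in layers
        if layer["annotator"] == trusted
        for ann in layer["annotations"]
    ]


# Hypothetical annotator URIs and a simplified in-memory layout:
# one entry per its:textAnalyticsAnnotations element.
layers = [
    {"annotator": "http://example.com/tilde-term-tagger",
     "annotations": [{"ref": "a8", "its-tan-type": "term"}]},
    {"annotator": "http://example.com/jsi-entity-tagger",
     "annotations": [{"ref": "a8", "its-tan-type": "entity"}]},
]

tilde_only = annotations_from(layers, "http://example.com/tilde-term-tagger")
```

Here `tilde_only` contains just the term annotation for span a8. The consumer still has to know which annotator URI to pass in, which is the trust question raised in this thread.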
>
>
>     Mârcis: Again, see my comment above about the consumer having to
>     know whom to trust.
>
>
>     The current status quo creates this situation: if Tilde already
>     has annotated a text, and JSI wants to add annotations, and you
>     want to compare them: how to do this? You can say "one creates
>     terminology markup, the other disambiguation markup". But what
>     about even more tools?
>
>
>     Mârcis: I agree, currently you can have only one Terminology
>     annotation tool (the disambiguation is not a nice example), but in
>     the current version we acknowledge that there can be only one
>     Terminology annotator for a single phrase (I am fine with that). I
>     understand that the stand-off is a possible way how to solve this
>     issue, but for the previous consumer this will create
>     non-comparable annotations (or he will have to update to consuming
>     the ambiguous annotations or just trust one of the annotations).
>     For future consumers this might as well be acceptable.
>
>         In the current ITS 2.0 draft the annotation is flat - it is
>         simple to parse, simple to consume, simple to produce – it is
>         not hierarchical and it does not overlap.
>
>
>     See above - if a consumer does not want to consume relations
>     between annotation tools or levels, you don't have to, and
>     annotatorsRef gives you the ability to differentiate the annotations.
>
>     Btw., current disambiguation and terminology also don't inherit,
>     see the table at
>     http://www.w3.org/TR/2012/WD-its20-20121206/#datacategories-defaults-etc
>     that is: the annotations of both data categories don't inherit to
>     nested markup. So we could resolve the issue also via something
>     like this:
>     Input before annotation: Dublin
>     First annotation: <span its-term="yes">Dublin</span>
>     Second annotation: <span its-term="yes"><span
>     its-term="no">Dublin</span></span>
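
A minimal sketch of how a consumer could read the nested example above when annotations do not inherit: each piece of text takes the its-term value of its directly enclosing element only. It uses Python's standard html.parser and assumes well-nested markup without void elements such as br, and it assumes "no" as the value when the attribute is absent. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TermReader(HTMLParser):
    """Collect (text, its-term) pairs without inheritance."""

    def __init__(self):
        super().__init__()
        self.stack = []    # its-term value of each open element, or None
        self.result = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(dict(attrs).get("its-term"))

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            # Only the directly enclosing element counts; an ancestor's
            # its-term is deliberately not consulted.
            own = self.stack[-1] if self.stack else None
            self.result.append((data, own or "no"))


def read_terms(html: str):
    parser = TermReader()
    parser.feed(html)
    parser.close()
    return parser.result


first = read_terms('<span its-term="yes">Dublin</span>')
second = read_terms(
    '<span its-term="yes"><span its-term="no">Dublin</span></span>'
)
```

After the first annotation the text is read as a term; after the second, the inner span alone decides and the text is read as not a term. Which tool wrote which span is not recoverable from this markup, which is the annotatorsRef issue raised next.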
>
>
>     Mârcis: As I understand this disambiguates Terminology mark-up
>     from different producers automatically? This is possible in the
>     current version.
>
>     The non-existent inheritance, of course, requires the marked span
>     not to contain other mark-up. This is, of course, a limitation
>     of the current version and, I agree, might be solved by the
>     stand-off mechanism.
>
>
>     But that has the annotatorsRef issue if several "term annotation"
>     tools have been used.
>
>     Mârcis: In the current version we as a consumer treat all
>     Terminology annotators as equally important, thus the
>     annotatorsRef for us is not necessary. However, for data tracking
>     purposes this might be important and with the stand-off mechanism
>     the annotatorsRef becomes mandatory (at least in our consumption
>     scenario).
>
>         From this perspective, the proposed change is a complete
>         overhaul of the 2 data categories in something different.
>
>         Also – we do require the flag. That is something that will be
>         heavily complicated with the “stand-off” mechanism (that has
>         to be understood), or won’t be possible at all?!
>
>
>     Setting the type would give you the flag. I know that in the
>     proposal at
>     http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2013Jan/0212.html
>     Tadej dropped the flag. But instead of fixed values, e.g. "term",
>     we could have a URI. You could then interpret that URI as a term
>     flag, e.g.
>     <span tan-type="http://example.com/term">
>
>     Mârcis: I agree that it gives term=“yes”, but not term=”no”.
>
>
>         Having a simple attribute inline is the simplest you can
>         achieve. Having a “stand-off” on the other hand is the most
>         complex you can achieve.
>
>         And ... if I remember correctly, we did not want to make life
>         difficult for producers/consumers if they did not care about
>         the other data categories?
>
>
>
>     Correct, but here we have the situation that two data categories
>     might be just too similar for keeping everything as is.
>
>
>
>
>
>         Why is that? In linguistic annotation it is common that you
>         have several layers of information, like our lexical,
>         ontological etc. information. Some of these might be complex
>         in itself (e.g. named entities), some of these might be
>         related to others (e.g. an ontological concept related to a
>         lexical item). I won't try to define these layers here - but
>         my point is that due to the complexity of representing such
>         information inline, nearly nobody is trying to represent
>         several layers at the same time inline. The common approach is
>         rather to have a base layer, and then pointers from the
>         various annotation layers.
>
>         In a sense you can describe NIF as an approach of taking
>         character offsets as the implicit base layer (implicit because
>         characters don't need explicit anchors). The TEI here
>         http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO
>         provides an example for an offset using words as the base
>         unit, with explicit xml:id attributes.
>
>         So far we haven't taken this approach for terminology or
>         disambiguation. This is why we had to come up with 16+
>         attributes: if you want to do everything "inline", you need to
>         differentiate attribute names and come up with a monster data
>         category. Inline annotations are just not suitable for such
>         information.
>
>         Mârcis:
>
>         I disagree that 16+ attributes are the difficulty here. The
>         difficulty from the beginning were the questions: 1) how many
>         types of annotation should be supported (we narrowed the list
>         down to 4 – terminology, named entities, ontology concepts,
>         lexical concepts)? 2) should overlapping be supported? 3)
>         should hierarchical annotation be supported? 4) should
>         contradicting annotation be supported?
>
>
>     about 1): no type at all would be one solution, but the term
>     identifier issue would come up. about 2): if a consumer just takes
>     up one annotation, e.g. the output of Tilde's tool, there is no
>     need to process overlap. And we can leave that to consumers IMO.
>     3): same as 2). 4): same as 2).
>
>     Mârcis: 1) I agree that the type itself is important. I know that
>     Tadej said that a Ref URI might have the Type embedded, but for
>     Terminology we do not always have the URI available.
>
>     Mârcis: 2-4) I agree, but only if you ask the consumers to know
>     which producer to trust. If that is not an issue, then it is fine
>     by me (it is a compromise as we lose the ability to not have to
>     trust anyone at all)
>
>         Also ... data producers would have to worry just about a
>         maximum of 5 attributes simultaneously and they would be able
>         to ignore the rest. For instance, I have no use for the
>         attributes for disambiguation categories.
>
>
>     I think that's the heart of issue-68: there are two quite similar
>     pieces of information, but consumers separate them.
>
>
>
>
>         Although I would agree to writing a parser that parses all
>         these attributes (just for compliance with the data category),
>         I would as a consumer consume only the ones related to
>         terminology and I as a producer would produce only those
>         related to terminology. I would neither consume nor produce
>         the disambiguation related attributes.
>
>
>     That wouldn't work if we have one data category: our conformance
>     requirements say: you implement it global or local or both. You
>     then can also decide whether you implement it in HTML or XML or
>     both. But you cannot cherry pick attributes for consumption. We
>     don't say anything wrt production - but our schema helps us to
>     verify that the "right data" has been produced.
>
>
>     Mârcis: I think you are misunderstanding – parsing content for
>     consumption and production is a totally different architectural
>     level than the logics that makes any use of the content. So ... in
>     my understanding we are not failing on conformance. We are if
>     there is a requirement that we have to really consume and really
>     produce the other types of data in the application logics layers.
>     Is this the case?
>
>
>
>
>         From that perspective, I disagree to the complexity in the
>         attribute scenario.
>
>
>     I think part of the disagreement comes from the "free spirit" you
>     have as data producer and consumer, see above.
>
>
>
>
>         For terminology I require a flagging mechanism (with the
>         possibility to add either a reference, a confidence score, or
>         both).
>
>         I do agree that we are limiting the annotation with having
>         separate attributes, but then again ... ITS 2.0 does not have
>         to represent every possible text analysis annotation type. It
>         is supposed to aid in localisation processes and not all text
>         analysis types have a valid use case (or a necessary or even a
>         potentially useful use case) in localisation.
>
>         Also ... if we are re-inventing terminology and
>         disambiguation, maybe we should analyse which other data
>         categories fall under the type “text analysis”? Domain is a
>         suitable candidate as well (and if we create a suitable text
>         analysis category, maybe domain analysis can be subcategorized
>         under that as well in order to support automated domain
>         analysis solutions (EuroVoc has an automated domain
>         classifier, for instance))?).
>
>
>
>     Here I would disagree: our domain data category is just for
>     transporting domain information between content and tools,
>     including a potential mapping of domain identifiers in between. The
>     "terminology vs disambiguation" discussion came from the
>     observations that two data categories in ITS2 have a huge overlap.
>     I don't see that situation for domain.
>
>
>     Mârcis: I do not agree that domain identification is in structure
>     different than terminology annotation or named entity recognition
>     or even sentence breaking, but fair enough ... I brought this up
>     to show that in general domain annotation is also annotation... an
>     equal in structure task as term tagging or named entity
>     recognition (usually just in bigger spans – but not necessarily).
>
>         With this I would like to emphasize that overgeneralization is
>         not the best approach as we are creating data categories for
>         different consumption scenarios.
>
>
>     But are they so different? It sounds to me rather that in your
>     scenario, many opportunities are lost because you don't consume
>     disambiguation at all ... so having one umbrella data category
>     might even give you more data consumption opportunities.
>
>     Mârcis: :) ... of course, the more annotation, the more
>     possibilities (this is a philosophical truth – I cannot argue).
>     But for producing Terminology in our Use Case we do not require
>     knowledge on the purely semantic lexical, ontological or entity
>     level. Our Use Case uses knowledge from the tools that are applied
>     in the process and they do not ask for information supplied by
>     those data categories (we do require domain, language and others
>     though ... that we, of course, use). The use of Disambiguation
>     data categories would require re-thinking of the modules that do
>     not deal with ITS 2.0 explicitly – the term extraction, term
>     weighing, term retrieval methods which are out-of-scope in this
>     project.
>
>
>
>         So, the first idea behind below approach is: if you want to
>         represent just one linguistic layer (or "qualifier" in
>         Christian's mail at
>         http://lists.w3.org/Archives/Public/public-multilingualweb-lt-comments/2013Jan/0014.html
>         ) , you use the "tan-type" attribute to differentiate annotations.
>         That leads to the following inline models:
>
>         1) A term has its-tan-type with value "term" and optional
>         its-tan-confidence, its-tan-ident-ref, and its-tan-info-ref.
>         Example:
>         <span its-tan-type="term"
>         its-tan-ident-ref="http://termdatabase.example.com/entry37"
>         its-tan-info-ref="http://termdatabase.example.com/entry37/description"
>         its-tan-confidence="1.0">Dublin</span>
>         Comparison to current ITS1 "Terminology":
>         its-tan-type="term" plays the role of term="yes".
>         its-tan-info-ref plays the role of termInfoRef.
>         its-tan-ident-ref links to a term data base.
>         its-tan-confidence provides confidence information.
>         (Esp. at Marcis: I know that "Dublin" is a bad candidate for a
>         term, I'm just trying to exemplify the annotation approach here)
>
>
>
>         Mârcis:
>
>         Also one thing I tried to emphasize at lunchtime in Prague,
>         TermInfoRef is not necessarily an identity reference. It does
>         not always point to something unique (if we understand that a
>         set is not unique). You can have multiple term entries from
>         multiple user collections in a term bank relating to one
>         semantic term. In the case if you do not specify a domain you
>         could end up having a reference that points to totally
>         different (also contrasting) terms or if you do not specify a
>         target language you may end up having multiple entries because
>         most of the collections are bilingual and not multilingual.
>         Why is that so? It is because a term-bank is not a
>         disambiguator – it acts like a search engine (more or less) –
>         the disambiguation for the “external” information (the
>         meaning; the term unithood is defined by the flag term=”yes”
>         itself) has to be done by the consumers (translation engines
>         or human translators). In most cases (as in the biggest
>         term-banks – IATE, ETB) it does not have a hierarchical
>         understanding of terms as some lexical (WordNet, f.i.) or
>         ontological resources may have. For MT engines a valuable
>         information is already – term=“yes” as that defines the term
>         unithood, which means that the term should be translated as a
>         non-breakable phrase. So ... the MT engine could ignore the
>         TermInfoRef at all if it does not have a suitable
>         disambiguation module and maybe leave the disambiguation to
>         human post-editors...
>
>         So ... “ident” is misleading (at least in the case of
>         Terminology annotation)!
>
>         Also important: HOW WOULD YOU REPRESENT term=”no”? This is a
>         very important feature of the flag type annotation.
>
>         would you say: its-tan-type="not-a-term"? That would require
>         data producers to handle higher complexity annotation!
>
>
>
>     I don't have a clear answer to above questions - others, feel free
>     to chime in if you do.
>
>
>     Mârcis: This is important to understand. Will this be dropped at
>     all or will there be an alternative mechanism?
>
>         2) An entity has its-tan-type with value "entity" and optional
>         its-tan-confidence, its-tan-ident-ref, and its-tan-class-ref.
>         Example:
>         <span its-tan-type="entity"
>         its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>         its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>         its-tan-confidence="0.7">Dublin</span>
>
>         So above is only different naming compared to current
>         "Terminology" and Disambiguation. Below is now the standoff
>         approach. The processing expectation for tools *producing the
>         annotation* is like this:
>         - If there is no inline annotation, just create it (e.g. 1) or 2))
>         - If there is inline annotation, check if there is an id
>         attribute (in HTML) or xml:id (if XML serialization of HTML
>         is used and with lower precedence compared to id). For formats
>         other than HTML, add xml:id if possible or use the id
>         attribute appropriate for that format.
>
>         Then, for creating standoff annotations, add an
>         "its:textAnalyticsAnnotations" element to the document, e.g.
>         in HTML "script" if needed.
>
>         Let's assume before annotation we have
>         <span its-tan-type="entity"
>         its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>         its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>         its-tan-confidence="0.7">Dublin</span>
>         Then after annotation we would have
>         <span its-tan-type="entity"
>         its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>         its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>         its-tan-confidence="0.7" *id="a8"*>Dublin</span>
>         and this:
>         <its:textAnalyticsAnnotations>
>         <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="term"
>         its-tan-ident-ref="http://termdatabase.example.com/entry37"
>         its-tan-info-ref="http://termdatabase.example.com/entry37/description"
>         its-tan-confidence="1.0"/>
>         </its:textAnalyticsAnnotations>
>
>
>         Let's now assume that before annotation we have
>         <span its-tan-type="term"
>         its-tan-ident-ref="http://termdatabase.example.com/entry37"
>         its-tan-info-ref="http://termdatabase.example.com/entry37/description"
>         its-tan-confidence="1.0">Dublin</span>
>         Then after annotation we would have
>         <span its-tan-type="term"
>         its-tan-ident-ref="http://termdatabase.example.com/entry37"
>         its-tan-info-ref="http://termdatabase.example.com/entry37/description"
>         its-tan-confidence="1.0"
>         *id="a8"*>Dublin</span>
>         and this:
>         <its:textAnalyticsAnnotations annotatorsRef="tan|tool-x">
>         <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity"
>         its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>         its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>         its-tan-confidence="0.7"/>
>         </its:textAnalyticsAnnotations>
>
>         Now, if several "entity" annotation tools have been used, we
>         could also have
>         <its:textAnalyticsAnnotations>
>         <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity"
>         its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>         its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>         its-tan-confidence="0.7" annotatorsRef="tan|tool-x"/>
>         <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity"
>         its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>         its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>         its-tan-confidence="0.4" annotatorsRef="tan|tool-y"/>
>         </its:textAnalyticsAnnotations>
>
>         Above approach would also influence the consumption of this
>         data category, and of annotatorsRef:
>
>         - A consuming tool goes through the document and gathers all
>         textAnalyticsAnnotations elements
>         - It then goes through the document. For each element node
>         * check for existing inline markup. If it's available, add it
>         to the list of annotations for that node. Assume that the
>         nearest annotatorsRef found inline up the document tree is
>         responsible for the annotation of that markup.
>         * check the accumulated standoff textAnalyticsAnnotations
>         elements for occurrences of IDs that match the node. If there
>         is such an ID, add the related annotation to the list for the
>         node, including the additional annotatorsRef tool, e.g. tool-x
>         or tool-y in the above case.
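
The two consumption steps quoted above could be sketched roughly as follows. This is only an illustration under assumptions, not part of any agreed proposal: it uses plain XML rather than HTML with a "script" element, the "doc" root element and the inline "its-annotators-ref" attribute name are hypothetical choices for the example, and the its-tan-* attributes are the ones proposed in this thread.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
CONTAINER = f"{{{ITS_NS}}}textAnalyticsAnnotations"

# Hypothetical sample document; "doc" and "its-annotators-ref" are assumptions.
DOC = """<doc xmlns:its="http://www.w3.org/2005/11/its"
     its-annotators-ref="tan|tool-a">
  <span id="a8" its-tan-type="term" its-tan-confidence="1.0">Dublin</span>
  <its:textAnalyticsAnnotations>
    <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
        its-tan-confidence="0.7" annotatorsRef="tan|tool-x"/>
    <its:textAnalyticsAnnotation ref="a8" its-tan-type="entity"
        its-tan-confidence="0.4" annotatorsRef="tan|tool-y"/>
  </its:textAnalyticsAnnotations>
</doc>"""


def tan_attrs(elem):
    """Return only the its-tan-* attributes of an element."""
    return {k: v for k, v in elem.attrib.items() if k.startswith("its-tan-")}


def gather_standoff(root):
    """Step 1: index all standoff annotations by the id they point to."""
    by_ref = {}
    for container in root.iter(CONTAINER):
        for ann in container:
            ref = ann.get("ref")
            if ref is None:
                continue
            entry = tan_attrs(ann)
            # annotatorsRef may sit on the annotation or on its container.
            entry["annotator"] = (ann.get("annotatorsRef")
                                  or container.get("annotatorsRef"))
            by_ref.setdefault(ref, []).append(entry)
    return by_ref


def accumulate(root):
    """Step 2: list (element, annotations) for every annotated element."""
    by_ref = gather_standoff(root)
    results = []

    def walk(elem, inherited):
        if elem.tag.startswith(f"{{{ITS_NS}}}"):
            return  # standoff containers are not content nodes
        # The nearest inline annotator information up the tree applies.
        annotator = elem.get("its-annotators-ref", inherited)
        found = []
        inline = tan_attrs(elem)
        if inline:
            inline["annotator"] = annotator
            found.append(inline)
        # Standoff annotations are accumulated, never overridden.
        found.extend(by_ref.get(elem.get("id"), []))
        if found:
            results.append((elem, found))
        for child in elem:
            walk(child, annotator)

    walk(root, None)
    return results


annotated = accumulate(ET.fromstring(DOC))
```

For the sample document, the "Dublin" span ends up with three accumulated annotations: the inline "term" one attributed to the inherited tool, plus the two standoff "entity" ones from tool-x and tool-y.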
>
>
>
>         Mârcis:
>
>         Do I understand you correctly that we may end up having
>         contradicting annotations also, for instance term=”yes” and
>         term=”no”? This would require a data consumer to be able to
>         handle a lot of ambiguity in the data.
>
>
>     Sure. But they could identify the ambiguity with a multilayer
>     annotation that clearly identifies the tool used, via annotatorsRef.
>     Currently, what would you do with this
>     <span its-term="yes"><span its-term="no">screwdriver</span></span>
>     how would you resolve the ambiguity here? "Terminology" has no
>     inheritance. This makes sense, otherwise in the following
>     <span its-term="yes"><span class="em">screw</span>driver</span>
>     the embedded "span" element would constitute a term. But that
>     leads to this test suite output for
>     <span its-term="yes"><span its-term="no">screwdriver</span></span>
>     /span[1] term="yes"
>     /span[1]/span[1] term="no"
>     and both "span" nodes contain the same string "screwdriver". So
>     how do you resolve the ambiguity here?
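
A small sketch of the non-inheriting behaviour described above: each element reports only its own its-term value, so the nested example yields two separate results. The function name and the path format are made up for this illustration and only mimic the test suite output quoted above.

```python
import xml.etree.ElementTree as ET


def term_values(elem, path):
    """Yield (path, value) for each element with its own its-term attribute.

    Nothing is inherited from the parent: an element without its-term
    simply yields no result of its own.
    """
    value = elem.get("its-term")
    if value is not None:
        yield path, value
    counts = {}
    for child in elem:
        # Position of the child among siblings with the same tag name.
        counts[child.tag] = counts.get(child.tag, 0) + 1
        child_path = f"{path}/{child.tag}[{counts[child.tag]}]"
        yield from term_values(child, child_path)


nested = ET.fromstring(
    '<span its-term="yes"><span its-term="no">screwdriver</span></span>'
)
nested_results = list(term_values(nested, f"/{nested.tag}[1]"))

emphasis = ET.fromstring(
    '<span its-term="yes"><span class="em">screw</span>driver</span>'
)
emphasis_results = list(term_values(emphasis, f"/{emphasis.tag}[1]"))
```

In the nested case both paths are reported with different values for the same string, which is exactly the ambiguity asked about; in the "em" case only the outer span is reported.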
>
>
>     Mârcis: I do not see the issue in the above example. As you said,
>     Terminology does not inherit, therefore, the only thing that is
>     stated is that the “screwdriver” is not a term.
>
>     Mârcis: However, one thing I have not understood so far – is there
>     a limitation of how many annotations can be done by the same
>     producer (human or machine). Even the annotatorsRef in my
>     understanding does not always resolve contradictions. Or ... is
>     there a precedence rule if there are equal, but contradicting
>     stand-off annotations, for instance (I made this up to simplify
>     the understanding):
>
>     <its:textAnalyticsAnnotations>
>     <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity"
>     its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>     its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>     its-tan-confidence="0.7"
>     annotatorsRef="tan|annotator-1"/>
>     <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity"
>     its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>     its-tan-class-ref="http://nerd.eurecom.fr/ontology#Person"
>     its-tan-confidence="0.4"
>     annotatorsRef="tan|annotator-1"/>
>     <its:textAnalyticsAnnotation *ref="a8"* its-tan-type="entity"
>     its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>     its-tan-class-ref="http://nerd.eurecom.fr/ontology#Organisation"
>     its-tan-confidence="0.4"
>     annotatorsRef="tan|annotator-1"/>
>     </its:textAnalyticsAnnotations>
>     Mârcis: Here “Dublin” can be all three (Place, Person,
>     Organisation) simultaneously, right?
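
The thread does not define a precedence rule for this case, so a consumer could at best detect it. A minimal sketch, assuming the standoff annotations have already been read into plain dictionaries (the key names "ref", "annotator" and "class_ref" are hypothetical), that flags targets for which one and the same annotator asserted more than one class:

```python
from collections import defaultdict

NERD = "http://nerd.eurecom.fr/ontology#"

# The three made-up annotations from the example above, as plain dicts.
ANNOTATIONS = [
    {"ref": "a8", "annotator": "tan|annotator-1",
     "class_ref": NERD + "Place", "confidence": 0.7},
    {"ref": "a8", "annotator": "tan|annotator-1",
     "class_ref": NERD + "Person", "confidence": 0.4},
    {"ref": "a8", "annotator": "tan|annotator-1",
     "class_ref": NERD + "Organisation", "confidence": 0.4},
]


def find_conflicts(annotations):
    """Map (ref, annotator) to its classes when more than one was asserted."""
    classes = defaultdict(set)
    for ann in annotations:
        classes[(ann["ref"], ann["annotator"])].add(ann["class_ref"])
    return {key: sorted(vals) for key, vals in classes.items() if len(vals) > 1}


conflicts = find_conflicts(ANNOTATIONS)
```

Whether such a result means "all three at once" or "pick the highest confidence" would still be up to the consumer; the sketch only makes the situation visible.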
>
>         In summary, this standoff tries to solve several issues:
>
>         - avoid the 16+ inline attribute monster data category
>
>         Mârcis:
>
>         Again, I did not understand why this is worse than having a
>         heavy “stand-off” mechanism.
>
>
>         - allow for multiple annotations of the same span, with
>         different tools
>         Mârcis:
>
>         In Prague Tadej and I had a discussion whether there is a use
>         case for using two tools producing contradicting mark-up and
>         we came to the conclusion that neither of us would produce
>         such data and if such a scenario exists, then the content
>         producer should fuse (disambiguate) the outputs of the two
>         separate tools prior to ITS 2.0 metadata application. I am
>         talking about the same type (for instance, two term annotation
>         tools on the same span) of annotation, not two separate types.
>
>         Then my question: does such a scenario exist? Who is
>         implementing it?
>
>
>     If both you and Tadej would agree on one data category: everybody
>     who wants to use both your tools would implement it. And this has
>     the value that people could compare the outcome of the tools.
>
>     Mârcis: So you would ask the consumers to disambiguate or choose
>     (in this way they would not use both if both would produce
>     Terminology), right? If yes, it is totally fine, I just want to
>     make sure I understand your idea.
>
>
>
>         - avoid the ITS1/2 or general inline annotation issues with
>         inheritance and overriding - as with the standoff approach
>         exemplified at
>         http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO
>         annotation information is just accumulated for a given base
>         item (in our case, element nodes with an ID).
>         Mârcis:
>
>         So ... at the end, with this method we would allow:
>
>         1) Hierarchical annotation
>
>         2) Contradicting annotation
>
>         3) (possibly also) overlapping annotation
>
>
>
>     Correct.
>
>
>
>
>         I'm not yet asking for this change, but I see it as a way
>         forward that could make the life of both annotation producers
>         (Marcis and Tadej) and consumers (Yves et al.) simpler. So I'm
>         eager to hear thoughts on this :)
>         Mârcis:
>
>         As I understand the proposal, it is the complete opposite of
>         simple (or of simplifying things compared to now, with
>         Terminology and Disambiguation kept separate). It complicates
>         things significantly from the Terminology standpoint: I do
>         not see where term=”yes” fits in, and we have to deal with
>         contradicting annotation (whether to allow or prohibit it is
>         now a question for the consumers; I as a consumer would ask
>         to prohibit it, as I do not see a use case for term=”yes” and
>         term=”no” at the same time). What is more, we have to
>         re-implement the parsers so that, instead of overriding and
>         inheritance, they work with accumulating information (and
>         this is a complete revision of the parser logic for the
>         Terminology data category).
>
>
>     I understand the burden on implementation you emphasize - but it
>     seems that one scenario - annotation using different tools even
>     for the terminology data category, see the nested "terminology"
>     annotations above - is not resolved by your proposal. You say this
>     would not be implemented before ITS2 annotation - but what if the
>     tool providers are not from the same organization?
>
>     Mârcis: Our proposal did not allow nested annotations. Nor does
>     the current ITS 2.0 version. Also – this was my question – is
>     there a necessity to produce 2 Terminology annotations or 2 named
>     entity annotations on top of each other. I see that you are saying
>     – Yes, there is.
>
>
>
>     Best,
>
>     Felix
>
>
>
>
>
>
>         Thoughts?
>
>         - Felix
>

Received on Tuesday, 29 January 2013 18:54:26 UTC