
RE: issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup

From: Mārcis Pinnis <marcis.pinnis@Tilde.lv>
Date: Tue, 29 Jan 2013 08:52:58 +0200
To: Felix Sasaki <fsasaki@w3.org>, Tadej Štajner <tadej.stajner@ijs.si>
CC: Yves Savourel <ysavourel@enlaso.com>, "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>, Artūrs Vasiļevskis <arturs.vasilevskis@Tilde.lv>
Message-ID: <AC6FD4BB9BB02540AC7322091A6C3B5472B0F00FF4@postal.Tilde.lv>
Hi Felix,

If I understood correctly, the new proposal is to slightly change the Disambiguation data category (by dropping granularity) and leave Terminology as is? If yes, then I’m OK with that if everyone else is.

Best regards,
Mārcis ;o)

From: Felix Sasaki [mailto:fsasaki@w3.org]
Sent: Monday, January 28, 2013 9:57 PM
To: Tadej Štajner
Cc: Mārcis Pinnis; Yves Savourel; public-multilingualweb-lt@w3.org; Artūrs Vasiļevskis
Subject: Re: issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup

Hi Tadej, all,

sorry for not giving detailed replies to other mails. Trying to bring together *some* loose ends here.

On 28.01.13 19:08, Tadej Štajner wrote:
Hi, all, (long e-mail ahead, you can scroll to TL;DR)
true - the current state is a local optimum that satisfies the requirements. It would need some polish, better guidance and stricter definitions, and possibly renaming disambigGranularity back to disambigType.

As an improvement, Felix's proposal makes some sense, since it makes ITS2.0 capable of proper multi-layer annotation. If these two mechanisms for inline+standoff annotation are too complex to implement, it would be an acceptable compromise to have only the stand-off and no inline (except for term="yes", maybe), but I'd vote in favor of keeping the inline part.

Also, the ref/id pointing could also be expressed the other way around, pointing from fragment to the annotation. Instead of:
<span id="dublin1">Dublin</span>
...
<its:textAnalysisAnnotation its:tanType="entity" its:tanIdentRef="http://dbpedia.org/resource/Dublin" ref="dublin1" />

I would suggest same mechanism as in LQI, so we have some symmetry:

<span its:tanRefs="tan1">Dublin</span>
<its:textAnalysisAnnotations id="tan1">
    <its:textAnalysisAnnotation its:tanType="entity" its:tanIdentRef="http://dbpedia.org/resource/Dublin"/>
</its:textAnalysisAnnotations>

In the above you use the name its:tanRefs. Does that imply that you assume references to several annotations?
At Yves, as a reply to
http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2013Jan/0206.html

"I don't see a difference between what the standoff markup of LQI/Provenance does and this standoff for Term+Disambiguation does."
I think the difference is how the external annotations are stored in my example: in separate units, each pointing to the same ID. In Tadej's example you then also have the potential to point to several units. I think that is different from the current LQI/Provenance approach: here the idea is to just add one link relation. I'm not sure yet whether that difference is significant - I have to think about it.
But while I do that, a question for the LQI/Provenance implementers: is it a feature that you point to just one external standoff unit, or is it an oversight, and could it be several?
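
To make the one-versus-several question concrete, here is a minimal Python sketch of how a consumer might resolve its:tanRefs when it is allowed to name more than one standoff unit. It is only an illustration under assumptions: the element and attribute names are the ones proposed in this thread, not agreed markup; treating its:tanRefs as a space-separated list is an assumption; and the wrapper element, the second unit "tan2" and its example.org identifier are hypothetical.

```python
import xml.etree.ElementTree as ET

ITS = "http://www.w3.org/2005/11/its"  # ITS namespace URI

# Hypothetical document: "tan2" and its identifier are invented for illustration.
DOC = """<doc xmlns:its="http://www.w3.org/2005/11/its">
  <span its:tanRefs="tan1 tan2">Dublin</span>
  <its:textAnalysisAnnotations id="tan1">
    <its:textAnalysisAnnotation its:tanType="entity"
        its:tanIdentRef="http://dbpedia.org/resource/Dublin"/>
  </its:textAnalysisAnnotations>
  <its:textAnalysisAnnotations id="tan2">
    <its:textAnalysisAnnotation its:tanType="lexical-concept"
        its:tanIdentRef="http://example.org/hypothetical-sense"/>
  </its:textAnalysisAnnotations>
</doc>"""


def resolve_tan_refs(root):
    """Map the text of each annotated element to its (type, identifier) pairs."""
    # Index every standoff unit by its id.
    units = {u.get("id"): u
             for u in root.iter(f"{{{ITS}}}textAnalysisAnnotations")}
    resolved = {}
    for el in root.iter():
        refs = el.get(f"{{{ITS}}}tanRefs")
        if refs is None:
            continue
        pairs = []
        # Assumption: several references are separated by whitespace.
        for ref in refs.split():
            unit = units.get(ref)
            if unit is None:
                continue  # dangling reference, ignored in this sketch
            for ann in unit.findall(f"{{{ITS}}}textAnalysisAnnotation"):
                pairs.append((ann.get(f"{{{ITS}}}tanType"),
                              ann.get(f"{{{ITS}}}tanIdentRef")))
        resolved[el.text] = pairs
    return resolved


result = resolve_tan_refs(ET.fromstring(DOC))
print(result["Dublin"])
```

If only a single link relation is allowed, as in the current LQI/Provenance approach, the inner loop over refs.split() reduces to one lookup.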

With regard to the below, the lowest effort would probably be "drop granularity", that is 2) below. To accommodate one part of Christian's comment at
http://lists.w3.org/Archives/Public/public-multilingualweb-lt-comments/2013Jan/0014.html

we could rename disambiguation to its-tan-*, and rewrite the disambiguation section.

If we then foresee that several annotations might happen, we could accommodate the LQI/Provenance standoff approach.

Since there have been many other mails on this, and I can't reply to these here: Mārcis, Yves, would that resolve your concerns and questions? Christian, I assume that Tadej's characterization "less-specific 'pointer to some meaning identifier' brother to Terminology." of disambiguation (or "tan") would not satisfy your concern - what would you propose?

Best,

Felix




Secondly, I'll give another alternative (and orthogonal) proposal, repeating what Pablo Mendes already hinted at in August: remember the question of supporting the distinction between different disambiguation types - entity, lexical concept, ontology concept - which is now encoded in the 'disambigGranularity' attribute (relevant discussion http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Aug/0322.html).

When trying to merge Terminology and Disambiguation, having that many disambiguation types supported in the same way implies that we end up with 16 or so attributes. After some discussion in Prague, we realized that although we've established that a distinction between those types exists and it is important, we couldn't come up with a use case where having that information would make a difference in the actual workflows.

Let me clarify:  if a consumer component cares about disambiguation, it will try to resolve the disambigIdentRef identifier. By resolving it, it is able to know what type/level/granularity of disambiguation it's dealing with. By that reasoning, having this information explicit is redundant, because the system already did its job. The question is, is there a use case that justifies keeping the 'disambigGranularity'? For instance, operating on the disambiguation values without actually resolving them? Maybe filtering?
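
As an illustration of the "filtering" case, here is a minimal Python sketch that selects annotated spans by their granularity value without resolving any identifier. It assumes the inline its-disambig-* attribute names discussed in this thread and a well-formed fragment; the wrapping <p> element and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Inline markup as discussed in this thread, wrapped in a hypothetical <p>.
FRAGMENT = """<p><span
    its-disambig-confidence="0.7"
    its-disambig-class-ref="http://nerd.eurecom.fr/ontology#Place"
    its-disambig-ident-ref="http://dbpedia.org/resource/Dublin"
    its-disambig-granularity="entity">Dublin</span>
  is the <span
    its-disambig-source="Wordnet3.0"
    its-disambig-ident="301467919"
    its-disambig-granularity="lexical-concept"
    its-disambig-confidence="0.5">capital</span> of Ireland.</p>"""


def spans_with_granularity(root, wanted):
    """Return the text of every span whose granularity equals `wanted`.

    No identifier is dereferenced: the decision uses the attribute alone.
    """
    return [span.text for span in root.iter("span")
            if span.get("its-disambig-granularity") == wanted]


root = ET.fromstring(FRAGMENT)
entities = spans_with_granularity(root, "entity")
lexical = spans_with_granularity(root, "lexical-concept")
print(entities, lexical)
```

Without the attribute, the same filter would have to resolve each identifier first, which is the trade-off in question.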

So, we'd go from:
<span
          its-disambig-confidence="0.7"
          its-disambig-class-ref="http://nerd.eurecom.fr/ontology#Place"
          its-disambig-ident-ref="http://dbpedia.org/resource/Dublin"
          its-disambig-granularity="entity">Dublin</span>
      is the <span
          its-disambig-source="Wordnet3.0"
          its-disambig-ident="301467919"
          its-disambig-granularity="lexical-concept"
          its-disambig-confidence="0.5"
          >capital</span> of Ireland.

to:
<span
          its-disambig-confidence="0.7"
          its-disambig-class-ref="http://nerd.eurecom.fr/ontology#Place"
          its-disambig-ident-ref="http://dbpedia.org/resource/Dublin">Dublin</span>
      is the <span
          its-disambig-source="Wordnet3.0"
          its-disambig-ident="301467919"
          its-disambig-confidence="0.5"
          >capital</span> of Ireland.

In this setting, ITS would just operate with references to identifiers and wouldn't care about the type of that relationship. I understand this is losing information, and it weakens the expressive power, but I'm asking this because it might simplify a couple of solutions here. Even though I proposed it initially, I wouldn't push something that hasn't got any consumers behind it (the T in ITS doesn't stand for Tadej.. :) ). It would also establish a clearer boundary between what ITS covers and what other formats should cover.
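
A minimal Python sketch of what this simplification could look like for a consumer: it drops the granularity attribute and then reads each span purely as a pointer to an identifier. This is only an illustration under assumptions: the attribute names are the ones used in this thread, while the wrapping <p>, the function names and the ("kind", value) return shape are hypothetical.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<p><span
    its-disambig-confidence="0.7"
    its-disambig-class-ref="http://nerd.eurecom.fr/ontology#Place"
    its-disambig-ident-ref="http://dbpedia.org/resource/Dublin"
    its-disambig-granularity="entity">Dublin</span>
  is the <span
    its-disambig-source="Wordnet3.0"
    its-disambig-ident="301467919"
    its-disambig-granularity="lexical-concept"
    its-disambig-confidence="0.5">capital</span> of Ireland.</p>"""

GRANULARITY = "its-disambig-granularity"


def drop_granularity(root):
    """Remove the granularity attribute everywhere; return how many were removed."""
    removed = 0
    for el in root.iter():
        if el.attrib.pop(GRANULARITY, None) is not None:
            removed += 1
    return removed


def meaning_pointer(el):
    """Return the identifier an element points to, ignoring its type."""
    ref = el.get("its-disambig-ident-ref")
    if ref is not None:
        return ("ident-ref", ref)
    source = el.get("its-disambig-source")
    ident = el.get("its-disambig-ident")
    if source is not None and ident is not None:
        return ("source+ident", (source, ident))
    return None  # no disambiguation information on this element


root = ET.fromstring(FRAGMENT)
removed = drop_granularity(root)
pointers = [meaning_pointer(span) for span in root.iter("span")]
print(removed, pointers)
```

The consumer still gets both identifiers; what it loses is the ability to tell an entity from a lexical concept without resolving them.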

TL;DR
In short, I see some scenarios that I'd be OK with:
1) If we keep 'granularity':
    1a) We keep granularity in the form of its:tanType and go with Felix's proposal, possibly inverting the addressing so it's like LQI;
    1b) We keep granularity, we keep current proposed Disambiguation data model, possibly renaming 'disambigGranularity' back to 'disambigType';
2) If we drop 'granularity', we probably wouldn't need the new its:tan* model, and it would make sense to keep the rest of the disambiguation data category as-is, and to describe the three usage scenarios only as best practices. Disambiguation would then serve as a less-specific 'pointer to some meaning identifier' brother to Terminology.

-- Tadej

On 28. 01. 2013 16:42, Mārcis Pinnis wrote:

Hi Felix, all,

I also do not have anything against leaving everything as is.
I however (as I made clear in my previous e-mail) don't think that the stand-off markup is a nice solution.

Best regards,
Mārcis ;o)

-----Original Message-----

From: Yves Savourel [mailto:ysavourel@enlaso.com]
Sent: Monday, January 28, 2013 5:31 PM
To: 'Felix Sasaki'; Mārcis Pinnis
Cc: public-multilingualweb-lt@w3.org; Artūrs Vasiļevskis
Subject: RE: issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup

Hi Felix, all,

Just a judgment from my side: I think at the moment we don't have consensus for
- leaving everything as is (Dave's proposal)

I don't have anything against leaving things as is.
There is nothing really broken.

It's just that having both data categories fused would be a bit nicer. But overall if there is no time to make that work, we can indeed just leave it as it is.

cheers,
-yves

Received on Tuesday, 29 January 2013 06:53:30 UTC
