Comment on ITS 2.0 WD-its20-20121206 - Disambiguation (and term)

From: Lieske, Christian <christian.lieske@sap.com>
Date: Thu, 10 Jan 2013 10:14:35 +0100
To: "public-multilingualweb-lt-comments@w3.org" <public-multilingualweb-lt-comments@w3.org>
Message-ID: <8EA44C66E2911C4AB21558F4720695DC60D7CA859D@DEWDFECCR01.wdf.sap.corp>

Please find below comments, observations, questions, and ideas concerning the ITS 2.0 Working Draft dated December 6, 2012 (http://www.w3.org/TR/2012/WD-its20-20121206/). Please feel free to contact me if anything is unclear.

To me, the section on the "disambiguation" data category is one of the most important in the draft. From my point of view, ITS 2.0 moves ITS closer to Natural Language Processing (NLP) than ITS 1.0 was, and "disambiguation" relates to NLP in various ways. Making "disambiguation" powerful and easy to use (e.g. via a clear distinction from other data categories, and via conceptualizations and wording that are understood beyond linguistics) therefore seems important to me.

While looking at "disambiguation" from this angle, I started to wonder whether it could benefit from additions or modifications. I apologize in advance if replying to this comment requires summarizing discussions that have presumably already taken place.

Here are my observations/questions/ideas:

a.       I sense that ITS users will have difficulty deciding when to use "term" and when to use "disambiguation" (the note in the Working Draft indicates this).

b.      Annotation of known terms, generation of so-called "term candidates", (named) entity recognition, and other automation can be subsumed under the heading "(automated) text analysis".

I am thus wondering if the following would be worth considering:

1.       Enhance the current "disambiguation" so that it also covers the current "term"

2.       Deprecate "term"

3.       Revise some of the terminology used in the spec (e.g. "disambiguation", "disambigGranularity")

An example use of a revised "disambiguation" (and deprecated "term") - partially inspired by ISOCat (see http://www.isocat.org/ ) - is the following:

Data category name: (automated) text analysis annotation (atan/tan); using "text analysis annotation" would have the advantage that even manual work (e.g. "promoting a term candidate to a term") could be covered

Data category "qualifier" (currently "disambigGranularity"): atan-type or tan-type

Values for "qualifier": lexical, term, termCandidate, ontological-class, ontological-entity; possibly even URIs such as http://www.isocat.org/datcat/DC-2275, which would allow rather fine-grained and, under certain provisions, standard-conformant annotation (ISO 12620; see http://www.ttt.org/clsframe/datcats.html)






          its-tan-type="http://www.isocat.org/datcat/DC-2275">Dublin</span>
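
To make the proposed value space for the "qualifier" more concrete, here is a minimal sketch in Python of how a processor might tell keyword values apart from URI values. It is only an illustration under assumptions: the attribute name its-tan-type and the keyword list are taken from the proposal above, while the function name classify_tan_type and the rule that a URI needs an http or https scheme are hypothetical and not part of any ITS draft.

```python
from urllib.parse import urlparse

# Keyword values proposed above for the "qualifier" (not defined in any ITS draft).
TAN_TYPE_KEYWORDS = {
    "lexical",
    "term",
    "termCandidate",
    "ontological-class",
    "ontological-entity",
}


def classify_tan_type(value: str) -> str:
    """Classify a hypothetical its-tan-type value as 'keyword', 'uri', or 'invalid'."""
    value = value.strip()  # tolerate stray whitespace around the attribute value
    if value in TAN_TYPE_KEYWORDS:
        return "keyword"
    parsed = urlparse(value)
    # Assumption: only absolute http(s) URIs count as data category references.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "uri"
    return "invalid"
```

With this sketch, a value such as "termCandidate" would be treated as a keyword, and an ISOCat reference such as http://www.isocat.org/datcat/DC-2275 as a URI; anything else would be rejected.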

