Re: some additional comments on the current LIME specification

Hi John,

Replies here below:

if we want to represent avgAmbiguity, we need to compute the ambiguity of each single entry. Now, let's think about the ambiguity of "bank": how should this be computed with respect to the available entries?

Current intended reading of our concept of ambiguity: ambiguity("bank") = 3

I think the concept of ambiguity should follow that of the definition of entry.

 

For example, let's take the Italian word 'asse', which has two meanings: the masculine 'gli assi' are wooden boards and the feminine 'le assi' are the axes of a graph. Would you consider these ambiguous? Similarly, are 'essere' and 'sei' ambiguous, e.g., 'tu sei qui alle sei'?

 

I argue that as long as there is a reasonable linguistic distinction (gender, inflected forms, part-of-speech, ...), the entries are distinct and not ambiguous. What is more unusual is that we have decided to count etymology as a criterion for distinction; thus the two forms of 'bank' are distinguishable in English and ergo not ambiguous.

[Armando Stellato] 

 

Three things:

 

1)      A trivial correction, out of scope, but you may be interested in it given your great ability to produce examples in any language :-D The example about 'assi' is reversed: the graph context calls for the masculine, and the wooden boards call for the feminine.

2)      Less trivial, but still outside the scope of our comment: it's ambiguous to define... what is ambiguous. If I typed "bank" into the search panel of a web portal for a lexical resource, without any context, I would like to see 3 elements (I agree, with a bag-of() reporting a set of 2 entries and a singleton). We were only pointing out that there is no gluing at all in the model, so this would be entirely delegated to an indexing system serving that search, moving some "knowledge" from the model to occasional processes working on it. But, as said, we just mentioned the fact (in case you agreed something was missing) and do not strongly suggest any change.

3)      Really within the scope of our comment (which was not replied to below). To summarize: in order to compute avgAmbiguity over all entries, what would be the ambiguity of "bank"? Would it be 3, or would there be two ambiguities, 2 (for bank-1) and 1 (for bank-2)?
In the latter case, isn't the ambiguity of the two separated entries just the level of polysemy, whose name we changed to avgAmbiguity precisely in order to take homonymy into account as well (homonymy which is canceled by the separation of these two entries)?
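To make the question concrete, here is a minimal sketch of the two readings, with hypothetical entry names and sense counts (assuming bank-1 carries 2 senses and bank-2 carries 1, consistent with the numbers above; none of this is from the spec):

```python
# Hypothetical lexicon: two distinct entries for the written form "bank"
# (separated by etymology), each with its own number of senses.
senses_per_entry = {
    "bank-1": 2,  # assumed: 2 senses under the first etymology
    "bank-2": 1,  # assumed: 1 sense under the second etymology
}

# Reading A: ambiguity attaches to the written form.
# All senses reachable from "bank" are pooled: ambiguity("bank") = 3.
ambiguity_by_form = sum(senses_per_entry.values())

# Reading B: ambiguity attaches to each entry separately.
# This is just per-entry polysemy: 2 and 1; homonymy is lost.
ambiguity_by_entry = list(senses_per_entry.values())

print(ambiguity_by_form)   # 3
print(ambiguity_by_entry)  # [2, 1]
```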

2)      Interpretation of avgNumOfLexicalizations: we are just returning to something which has never been discussed (only presented by us), as there were more urgent matters. Now that the model is stable, we limit ourselves to presenting again the possibility of changing it, since it wouldn't disrupt the whole model. Actually, we are not even pushing for one interpretation or the other, but thought it was worth mentioning.

According to the formula in https://www.w3.org/community/ontolex/wiki/images/9/90/Formula_avgNumOfLexicalizations-v1.png, the denominator comprises all the elements in the ontology. Since we already have a statistic about the overall covered elements (lime:percentage), we could consider applying a different version of avgNumOfLexicalizations, one which considers only the elements actually participating in at least one lexicalization. In this case, the value would be independent of the other statistic, and thus more descriptive.

E.g., I have an ontology O lexicalized by a lexicon L for only 10% of its concepts. However, for those 10% of concepts, the average number of lexicalizations is 4. This means that the lexicon covers the ontology badly, but, within its extent, it describes the covered references really well. If we computed the average over all references (including the non-lexicalized ones), I would get 0.4, which is not very informative. Getting a percentage of 10% and an avgNumOfLexicalizations of 4 represents the lexicalization much better.
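The arithmetic of the example can be sketched as follows (hypothetical counts chosen to match the 10%/4 figures above):

```python
# Hypothetical counts illustrating the example above (not from the spec).
num_references = 100      # all elements in ontology O
num_lexicalized = 10      # references with at least one lexicalization (10%)
num_lexicalizations = 40  # total lexicalizations provided by lexicon L

percentage = num_lexicalized / num_references        # 0.1

# Current formula: average over ALL references in the ontology.
avg_all = num_lexicalizations / num_references       # 0.4

# Proposed variant: average over the covered references only.
avg_covered = num_lexicalizations / num_lexicalized  # 4.0

# Either value can be recovered from the other via 'percentage':
assert abs(avg_covered * percentage - avg_all) < 1e-9
```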

Surely avgNumOfLexicalizations is more descriptive if it is allowed to include entries that have zero lexicalizations. I don't see the advantage of this change; either value can be obtained by multiplying or dividing the other by 'percentage'.


Well, this is a matter of “quality”, and quality is always difficult to measure.

You say "surely", but I don't see the rationale. OK, take this as a joke (because obviously it can be inverted), but in my view the normalized factor is a <quality_value>, independent of the other one (i.e. percentage).

It tells me how many entries the lexicon provides for each reference (limited to the portion of the ontology being covered).

So, yours is (<quality_value> * percentage), which can obviously be brought back to the quality value by computing: (<quality_value> * percentage) / percentage.

In any case, we won't push for it, but would like to hear others' opinions as well.

 

3)      Name of percentage: this has been raised by John in his summary email as well. Actually, we initially called it coverage, but it had a different structure. With the reification of the property using partitions, it was later changed to percentage. Since we have changed the structure again, maybe coverage makes much more sense. Percentage was fine in the context of a reified object expressing the coverage, where the percentage was limited to the mere number. Now it could make sense to go back to the original name.

 'coverage' or even 'ontologyCoverage' would be preferable to 'percentage'

 Ok, can we apply it or do we wait for a more general clearance?

 

4)      The formula in https://www.w3.org/community/ontolex/wiki/File:Percentage_formula.gif is a bit confusing. The predicate lexicalizes(entity,entry) is never used formally in the specification. In any case, lexicalizes(entry,entity) would probably make more sense, as usually, when a verb names a predicate, the action goes from the first argument to the second. Also, reference should be referenceDataset or ontology, as used elsewhere. If that's OK with you, we can change it.
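A sketch of the proposed reading, with lexicalizes(entry, entity) as a plain relation from entries to entities (all names here are illustrative, not from the spec; the actual formula is the one in the linked image):

```python
# Hypothetical lexicalization relation: pairs (entry, entity),
# with the action going from entry to entity, as proposed.
lexicalizes = {
    ("bank_entry_1", "ex:Bank"),
    ("bank_entry_2", "ex:RiverBank"),
    ("institution_entry", "ex:Bank"),
}
# Hypothetical reference dataset (the ontology being lexicalized).
reference_dataset = {"ex:Bank", "ex:RiverBank", "ex:Account", "ex:Loan"}

# Coverage: fraction of entities with at least one lexicalization.
covered = {entity for (_entry, entity) in lexicalizes}
percentage = len(covered & reference_dataset) / len(reference_dataset)

print(percentage)  # 0.5: 2 of the 4 entities are lexicalized
```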

 I would be in favour of introducing a clear, consistent formula for every ratio in this section. From Manuel's answer to my final list of points it seems that I could not figure out how to calculate all the values.

 

Ok, we will come up with a proposal

 

5)      Definition of avgNumOfLinks: this property indicates the average number of links to a concept for each ontology element in the reference dataset.

 

Erm... what is the issue? 

6)      we don't link "to a concept", as it seems that in play we have a single concept linked many times by the same reference. Could we restate as: “this property indicates the average number of links to lexical concepts for each ontology element in the reference dataset” ?

I don't see why not

 

 

 

 

I'm truly sorry, we made two errors in the same sentence: 5 and 6 were a single sentence, and "link" should have been "like". This is the correct version:

 

5)      Definition of avgNumOfLinks: “this property indicates the average number of links to a concept for each ontology element in the reference dataset.”
          We don't like "to a concept", as it seems that in play we have a single concept linked many times by the same reference. Could we restate as: “this property indicates the average number of links to lexical concepts for each ontology element in the reference dataset” ?
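Under the restated wording, the metric averages, over the ontology elements, the number of links each element has to lexical concepts. A minimal sketch with hypothetical data:

```python
# Hypothetical links from ontology elements to lexical concepts.
# Each element may be linked to zero or more distinct lexical concepts.
links_to_lexical_concepts = {
    "ex:Bank": ["lc:FinancialInstitution", "lc:Riverside"],
    "ex:Account": ["lc:Account"],
    "ex:Loan": [],  # an element with no links still counts in the average
}

avg_num_of_links = (
    sum(len(concepts) for concepts in links_to_lexical_concepts.values())
    / len(links_to_lexical_concepts)
)

print(avg_num_of_links)  # 1.0: (2 + 1 + 0) / 3
```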

 

So, I'm not sure whether your "I don't see why not" was clearance to change the text following our proposal, or whether it referred to our wrong "we don't link".

 

Cheers,

 

Armando

Received on Friday, 4 September 2015 10:48:40 UTC