Re: LIME proposal for the OntoLex W3C Community Group

Dear Armando, all,

sorry for getting back to this very late. The bottom line is: I agree 
with your proposal:

1) On the distinction of Lexicon and Lexicalization as subclasses of 
void:Dataset, I agree. The main distinction is that i) a Lexicon 
introduces a container with local lexical entries, while ii) a 
Lexicalization essentially reuses URIs of lexical entries defined 
elsewhere to link them to an ontology. Fair enough, I like it.
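
For concreteness, a minimal Turtle sketch of how I read the proposal 
(the lime namespace, class names and example URIs below are placeholders 
for illustration, not agreed vocabulary):

  @prefix void: <http://rdfs.org/ns/void#> .
  @prefix lime: <http://example.org/lime#> .   # placeholder namespace
  @prefix ex:   <http://example.org/data/> .

  # Scenario "everything in one dataset": the ontology dataset has the
  # Lexicon and the Lexicalization as void:subsets.
  ex:ontologyDataset a void:Dataset ;
      void:subset ex:myLexicon , ex:myLexicalization .

  # (i) a Lexicon: a container introducing local lexical entries
  ex:myLexicon a lime:Lexicon .            # lime:Lexicon, a subclass of void:Dataset

  # (ii) a Lexicalization: reuses entry URIs defined elsewhere (e.g. in
  #      WordNet) and links them, via senses, to resources of the ontology
  ex:myLexicalization a lime:Lexicalization .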

2) On the concrete properties, I agree that the absolute count of lexical 
entries is not going to do the job alone, and I am still not convinced 
that we should have ratios in the lime module, for two reasons: i) 
ratios get quickly outdated, and ii) a developer interested in knowing 
how many lexicalizations there are for a given set of concepts of 
interest would need to write a SPARQL query anyway. But I am happy to be 
convinced. Let me make a proposal which I think addresses the needs.

Two comments on this:

1) for a SKOS document, I see that counting the number of 
lexicalizations per concept is useful, and having the ratio might be 
useful as well. This is feasible because the data is under control and 
the ratios can be updated, as the lexicalizations and the (referenced) 
concepts are in the same file. Fair enough. But computing the ratios 
using Java code (over JENA) or SPARQL is really straightforward, so 
having the ratios explicit might not be that much of a benefit.
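
Just to illustrate how straightforward the computation is, here is a 
hedged SPARQL sketch for the SKOS case (the ontolex namespace and the 
sense/reference property names are assumptions based on the current 
draft model):

  PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>   # assumed namespace
  PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>

  # number of lexicalizations (entry -> sense -> reference paths) per concept
  SELECT ?concept (COUNT(DISTINCT ?sense) AS ?lexicalizations)
  WHERE {
    ?concept a skos:Concept .
    ?sense   ontolex:reference ?concept .
    ?entry   ontolex:sense     ?sense .
  }
  GROUP BY ?concept

  # or the ratio directly: average number of lexicalizations per concept
  SELECT (COUNT(DISTINCT ?sense) / COUNT(DISTINCT ?concept) AS ?avg)
  WHERE {
    ?concept a skos:Concept .
    OPTIONAL { ?sense ontolex:reference ?concept . }
  }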

2) for a distributed scenario (e.g. our Lexicalization scenario) where 
the lexicon and the ontolog(ies) exist separately, the above is going to 
be difficult. Why? Because a Lexicalization might include references to 
different ontologies. Then the question is: what does the ratio refer 
to? To the union of all ontologies? Do we include a ratio for each 
ontology that the Lexicalization dataset refers to? I think in this 
scenario it is not that easy. Assume further that I am only interested 
in the avg. number of lexicalizations for classes, or only for 
properties. We would not cover this at the metadata level, and one would 
have to write a SPARQL query anyhow.
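
For instance, in such a case one would end up with something like the 
following sketch (again, the ontolex namespace is assumed; in the 
distributed scenario the query would have to run over the union of the 
Lexicalization and the referenced ontology, e.g. via a federated SERVICE 
clause or after merging the graphs):

  PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>   # assumed namespace
  PREFIX owl:     <http://www.w3.org/2002/07/owl#>

  # average number of lexicalizations per owl:Class of the referenced ontology
  SELECT (COUNT(?sense) / COUNT(DISTINCT ?class) AS ?avgPerClass)
  WHERE {
    ?class a owl:Class .                             # from the ontology
    OPTIONAL { ?sense ontolex:reference ?class . }   # from the Lexicalization
  }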

Here is a list of metadata properties that I propose to include in lime, 
as a basis for discussion:

lime:referencedOntologies -> indicating the (distinct) URIs of the 
ontologies referenced in the dataset.

lime:numberOfLexicalEntries -> giving the number of lexical entries in a 
Lexicon/Lexicalization (void:Dataset) or by a particular Lexicon (in a 
given language)

lime:numberOfReferences -> number of distinct references used in a 
Lexicon/Lexicalization (void:Dataset) or by a particular lexicon

lime:numberOfLexicalizations -> number of distinct lexical entry -> 
sense -> reference paths; in essence, the number of senses in the dataset 
which have a reference, as every sense is supposed to be unique for a 
pair of lexical entry and reference; this corresponds to your number of 
lexicalizations

dc:language -> indicating the language(s) covered by a Lexicon / 
Lexicalization (as void:Datasets; for a particular ontolex:Lexicon there 
is only one language covered)

and then to make language-specific coverage explicit:

lime:languageCoverage

which introduces a lime:LanguageCoverage object to which all the above 
properties can be attached, specifying the above numbers per language.
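
To make the proposal more concrete, a description of a Lexicalization 
dataset using these properties might look roughly as follows (the lime 
namespace is a placeholder, and all numbers and URIs are invented for 
illustration):

  @prefix void: <http://rdfs.org/ns/void#> .
  @prefix lime: <http://example.org/lime#> .          # placeholder namespace
  @prefix dc:   <http://purl.org/dc/elements/1.1/> .
  @prefix ex:   <http://example.org/data/> .

  ex:myLexicalization a lime:Lexicalization , void:Dataset ;
      dc:language "en" , "de" ;
      lime:referencedOntologies <http://example.org/ontology> ;
      lime:numberOfLexicalEntries  12000 ;
      lime:numberOfReferences       8500 ;
      lime:numberOfLexicalizations 15000 ;
      lime:languageCoverage ex:coverage_en , ex:coverage_de .

  # per-language figures attached to a lime:LanguageCoverage object
  ex:coverage_en a lime:LanguageCoverage ;
      dc:language "en" ;
      lime:numberOfLexicalEntries  7000 ;
      lime:numberOfReferences      6000 ;
      lime:numberOfLexicalizations 9000 .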

Given this, your ratios are easy to calculate:

1) Average number of lexicalizations per concept: either 
lime:numberOfLexicalizations / lime:numberOfReferences, if you want to 
have the average number of lexicalizations per distinct ontological 
element referenced in the Lexicon / Lexicalization; or 
lime:numberOfLexicalizations / the number of concepts in the referenced 
ontologies, obtained in some way (e.g. using a SPARQL query)

2) Average number of references per lexical entry (a measure of 
ambiguity, so to speak): simply lime:numberOfLexicalizations / 
lime:numberOfLexicalEntries
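
Against a metadata description like the sketch above, such ratios reduce 
to a one-line query, e.g. (same placeholder namespace):

  PREFIX lime: <http://example.org/lime#>   # placeholder namespace

  # ratio 1: average number of lexicalizations per distinct referenced element
  # (ratio 2 works the same, with lime:numberOfLexicalEntries as denominator)
  SELECT (?lexs / ?refs AS ?avgLexicalizationsPerReference)
  WHERE {
    ?dataset lime:numberOfLexicalizations ?lexs ;
             lime:numberOfReferences      ?refs .
  }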

In addition, we would of course recommend in OntoLex to use standard 
Dublin Core properties such as creator, created, language, date, 
license, publisher, title, etc.

Comments etc. are welcome.

I will try to come up with an example lexicalization in a separate email.

Philipp.


On 26.03.14 01:02, Armando Stellato wrote:
>
> Dear all,
>
> a few more comments about the metadata properties, concerning the 
> “primary properties vs derived properties” issue. Quite long, but it is 
> better to recap the whole situation across the various email exchanges 
> and conference calls, and then go ahead.
>
> One point raised was: if we accept that we may provide the number of 
> lexical entries in a Lexicon (as I replied affirmatively to John's 
> proposal, noting it was also in the original LIME, in the section 
> about linguistic resources), and we already have from VoID the number 
> of entries (which is actually not always guaranteed), why should we 
> provide the coverage, as it is derived information?
>
> A short recap of the situation, and some news.
>
> RECAP
>
> In the last conference call, we analyzed various scenarios and 
> motivations. According to my experience and the scenarios I had in 
> mind, I observed that in any case the ratio 
> lexical-entries/num-of-resources-in-the-ontology was primary 
> information for agents, and more objective information than just 
> having the numerator.
>
> …though, if we are comparing different lexicalizations for the same 
> ontology, the denominator is always the same...
>
> …though, an agent may be interested in knowing how good the 
> linguistic coverage is, to compare it with *other* kinds of info; in 
> this case, the denominator helps in normalizing the information about 
> the amount of lexical information available wrt the given ontology.
>
> In any case, I agreed that, IFF the two data values are redundant, 
> then we may decide to drop one of them.
>
> One other aspect for me in favor of keeping the ratio over the 
> numerator was that, again, LIME was not conceived for OntoLex alone 
> (and we would really like to get a unified metadata vocabulary covering 
> ALSO OntoLex, but not only that), and in the hypothesis of a simple 
> SKOS-enriched vocabulary, I would see more interest in keeping the 
> ratio than in knowing how many skos:xxxLabels there are…
>
> …but, again, in the rush of the phone call (and with the redundancy 
> issue being raised just the day before), I didn’t have time to think 
> back about it in detail.
>
> MORE RECAP
>
> Yesterday, Manuel and I took some time to think back over the 
> original LIME, and over the updated structure that we sent after 
> Philipp's remark that we should cover all scenarios, including 
> separate Lexicons and Ontologies.
>
> We recall here what has been said about the scenarios. Given an 
> ontology O and a Lexicon L, there are three scenarios:
>
> 1) O and L are part of the same dataset. Simply, the publisher of the 
> ontology decided to use OntoLex to model linguistic data there.
>
> 2) We have dataset O, and a dataset L pointing to it. Most common case, 
> probably. So, someone wrote a lexicon L specifically for ontology O.
>
> 3) L is a dataset on its own (e.g. WordNet), developed independently of 
> O. Someone then lexicalized O with elements from L. It is almost 100% 
> certain that they will not use all of the lexical entries in L to 
> lexicalize the resources in O (this is very important!).
>
> I think, already at this point, it would be worth introducing the 
> concept of Lexicalization. It does not have to be classified in OWL in 
> the data; yet, for clarity, it is easier to recognize its existence 
> and, more generally, to consider case 1) as Ontology, Lexicon and 
> Lexicalization in the same dataset, 2) as Ontology separate from 
> Lexicon & Lexicalization, and 3) as all of them separate.
>
> Now, three LIME properties which seemed to fit Ontolex were the following:
>
> lime:languageCoverage: for each language, the percentage of RDF 
> resources, per type (classes, individuals, properties, SKOS concepts), 
> described by at least one lexicalization in that language.
>
> lime:lexicalResourceCoverage: for each specified lexical resource, the 
> percentage of RDF resources, per type (classes, individuals, 
> properties, SKOS concepts), described by at least one 
> ontolex:LexicalConcept in that lexical resource.
>
> lime:avgNumOfEntries per concept
>
> NEWS
>
> Ok, simply put, there is no redundancy.
>
> We are interested, for the avgNumOfEntries (and even more evidently 
> for the xxxCoverage properties!), in the number of attachments (that 
> is, of lexicalizations), and not in the number of lexical entries.
>
> There is a series of very good reasons for that:
>
> 1) Consider Scenario 3: the number of lexical entries in the lexicon is 
> useless for our counts if not all of them are involved in the 
> lexicalization (which will almost certainly not be the case).
>
> a. As a consequence, the number of lexical entries for the lexicon may 
> still be considered a useful piece of metadata per se (so, we do not 
> have to make a choice and we can keep both), but again, it has to be 
> local to the lexicon, and is not relevant for the onto-lexical metadata.
>
> 2) Even with a 100% participation of lexical entries in a 
> lexicalization, a lexical entry could participate in lexicalizing two 
> concepts (polysemy), and we would really prefer to state that two 
> concepts benefited from that lexical content.
>
> 3) In the specific case of the xxxCoverage properties, the real target 
> is the number of concepts being lexicalized, so it is in no way 
> related to the number of lexical entries. If we had 100 skos:Concepts 
> and 1000 lexical entries, and only one concept covered by those 1000 
> lexical entries (which happen to be synonyms), then the coverage for 
> the class skos:Concept is sadly 1%.
>
> See, in this sense, the distinction we already made in the LIME paper 
> [1] (which actually dates back to the precursor of LIME, the 
> Linguistic Watermark [2, 3]) between “lexical metadata” and 
> “onto-lexical metadata”.
>
> Now, coming back to the Lexicalization, we really feel it is a 
> determining element to be taken into consideration. We are not 
> suggesting addressing it in the core OntoLex vocabulary. After all, 
> with the exception of owl:Ontology, for the most part datasets are not 
> categorized in their own data. And the concept of Dataset is 
> introduced in VoID, which targets metadata.
>
> For this reason, I would suggest including the notions of Lexicon and 
> Lexicalization in the metadata, as subclasses of void:Dataset. The 
> property void:subset should then help to address all three 
> scenarios we foresaw.
>
> Ok, I will stop the mail here (it is already quite long :D) and wait 
> for your feedback before sending a concrete proposal to the list.
>
> Cheers,
>
> Armando and Manuel
>
> [1] http://aclweb.org/anthology/W/W13/W13-5504.pdf
>
> [2] 
> http://art.uniroma2.it/publications/docs/2008_OntoLex08_Enriching%20Ontologies%20with%20Linguistic%20Content%20an%20Evaluation%20Framework.pdf
>
> [3] http://iospress.metapress.com/content/x043167268663268/
>
> *From:*Philipp Cimiano [mailto:cimiano@cit-ec.uni-bielefeld.de]
> *Sent:* Thursday, March 13, 2014 11:18 AM
> *To:* Armando Stellato; 'John P. McCrae'
> *Cc:* 'Manuel Fiorelli'; public-ontolex@w3.org
> *Subject:* Re: LIME proposal for the OntoLex W3C Community Group
>
> Dear all,
>
>  ok, so we clarified that per se it is fine to include materialized 
> results of pre-defined SPARQL queries as new vocabulary elements.
>
> So we are a step further guys ;-)
>
> Whether or not we want to include properties related to linguistic 
> resource coverage is then the real point of discussion I think. So 
> let's focus on this point.
>
> Other than that: maybe it is not so important whether the values can 
> be computed using SPARQL or whether we need some procedural component 
> to compute them (as in the LIME Java API mentioned by Armando).
>
> My point was rather: let's define exactly what we mean by these 
> properties by giving them an exact semantics. It is fine if this 
> semantics is made explicit. But the point is: if it is not the case 
> that all creators of lexica use the properties in the same way, then 
> they become sort of useless. See our recent discussion of the 
> "confidence" property to indicate confidence in a translation: it is 
> quite useless if people adopt a completely different interpretation of 
> this value.
>
> So rather than really having SPARQL CONSTRUCT statements for most 
> metadata properties, let's give precise semantics so that anyone could 
> compute the values of the properties consistently with this semantics.
>
> Does this make sense?
>
> Talk to you all tomorrow.
>
> Philipp.
>
> On 08.03.14 20:44, Armando Stellato wrote:
>
>     Dear John,
>
>     well I’m a bit puzzled, in that this is surely worth discussing,
>     but it’s a completely orthogonal topic again. The fact that
>     Philipp mentioned the possibility of defining their semantics
>     through SPARQL does not change anything about the nature of these
>     properties; so, if you found them useless because of their
>     redundancy with the data, they were useless/redundant even before.
>
>     Maybe we should synthesize a few aspects and discuss them in a
>     page of the wiki. What do you think? The impression is that in the
>     emails we are opening new topics instead of closing the open ones,
>     so it may be worth having separate threads. Please let us know:
>     if you feel we are almost close to the end, we may even carry on
>     with emails (maybe with specific threads).
>
>     Btw, to reply to your specific question:
>
>     The point of metadata is not to optimize commonly run SPARQL
>     queries, for two primary reasons: firstly, it bulks up the model
>     and instances of the model with triples for these 'pre-compiled'
>     queries, and secondly, it is very hard to predict what queries an
>     end-user will want to run. It seems that the kind of metadata we
>     are proposing to model is nearly entirely pre-compiled queries,
>     which are of questionable practical application. That is, I ask a
>     simple question: /if we can achieve resource interoperability for
>     OntoLex already with SPARQL, why the heck do we need metadata anyway??/
>
>     Personally, as an engineer, I’m biased towards considering
>     “redundancy the evil” and keeping information to a minimum (so I
>     would tend to agree with your point). But the Engineering 101
>     manual says that you may sometimes give up orthodoxy on the above
>     principle if this greatly improves performance, scalability, etc…
>
>     Furthermore, instead of trivially giving up, you should designate
>     how, when and where the redundancy points are defined (whatever
>     system you are speaking about).
>
>     Now, narrowing down to our case, we have a clear point: the VoID
>     file, which is a surrogate of a dataset, contains its metadata and
>     is always updated following updates to its content; no danger of
>     dangling out-of-date redundant information then.
>
>     We have also a clear scenario: packs of spiders roaming around the
>     web and getting plenty of useful information from tons of
>     different datasets without stressing their SPARQL endpoints;
>     mediators examining metadata from multiple resources and taking
>     decisions very quickly etc…
>
>     But, I’m just a poor guy :) so, setting aside my personal view,
>     let me mention some notable predecessors:
>
>     Already mentioned by Manuel in his email of today, we have VOAF:
>     http://lov.okfn.org/vocab/voaf/v2.3/index.html
>
>     ..but VOAF is not a standard…
>
>     …talking about standards, ladies and gentlemen, here is VoID
>     itself and its many SPARQL deducible properties!
>
>     https://code.google.com/p/void-impl/wiki/SPARQLQueriesForStatistics
>
>     ..and to happily close my defense, well, in any case Manuel just
>     confirmed in his email that I should have thought one second more
>     about the SPARQL deducibility of LIME’s properties :-)
>
>     Some of them are in fact SPARQL deducible, but it seems the one we
>     took as an example (lime:languageCoverage
>     <http://art.uniroma2.it/ontologies/lime#languageCoverage>) is
>     exactly one of those not so trivial to write (maybe I’m not an
>     expert with CONSTRUCTs, but I would say it is not possible at all).
>
>     In the LIME module, we used an RDF API and plain Java post-processing
>     to compute them, so I did not recall which ones were simple
>     SPARQL constructs and which ones needed more processing.
>
>     Cheers,
>
>     Armando
>
>
>
>
> -- 
>   
> Prof. Dr. Philipp Cimiano
>   
> Phone: +49 521 106 12249
> Fax: +49 521 106 12412
> Mail: cimiano@cit-ec.uni-bielefeld.de
>   
> Forschungsbau Intelligente Systeme (FBIIS)
> Raum 2.307
> Universität Bielefeld
> Inspiration 1
> 33619 Bielefeld


-- 

Prof. Dr. Philipp Cimiano

Phone: +49 521 106 12249
Fax: +49 521 106 12412
Mail: cimiano@cit-ec.uni-bielefeld.de

Forschungsbau Intelligente Systeme (FBIIS)
Raum 2.307
Universität Bielefeld
Inspiration 1
33619 Bielefeld

Received on Thursday, 3 April 2014 05:30:35 UTC