Re: LIME proposal for the OntoLex W3C Community Group from Philipp Cimiano on 2014-03-07 (public-ontolex@w3.org from March 2014)

From: Philipp Cimiano <cimiano@cit-ec.uni-bielefeld.de>
Date: Fri, 07 Mar 2014 22:43:05 +0100
To: Armando Stellato <stellato@info.uniroma2.it>, 'John McCrae' <jmccrae@cit-ec.uni-bielefeld.de>
CC: 'Manuel Fiorelli' <fiorelli@info.uniroma2.it>, "public-ontolex@w3.org" <public-ontolex@w3.org>
Message-ID: <531A3D69.1040907@cit-ec.uni-bielefeld.de>
Dear Armando, all,

  this is my second email on the metadata, concerning in particular the 
aggregating properties.

Many of the properties that we are proposing in the metadata module are 
aggregating properties: number of lexical entries, average number of 
lexical entries etc.

We sort of agreed that these are computed locally for the dataset in 
question without consulting external lexica etc. right?

The values of most of these properties could, it seems, be calculated 
using SPARQL CONSTRUCT queries, in the sense that some information is 
aggregated and added as the explicit value of some lime property. Fair 
enough, it saves people the effort of running the SPARQL queries over the 
dataset themselves, making this information readily accessible.

However, in order to properly document the semantics of the lime 
properties we introduce, would it not be feasible to indicate a SPARQL 
CONSTRUCT query that computes the property value? In that sense we would 
clearly define the semantics of these metadata properties.
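
To make the idea concrete, here is a minimal sketch of what such a defining computation might look like, written in plain Python rather than SPARQL so that it can be run without a triple store. Everything in it is an assumption made for illustration: the function name resource_coverage, the representation of triples as tuples, the modelling of labels as (text, language) pairs, and in particular the choice to average over all instances of the class rather than only over those that have a label.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"


def resource_coverage(triples, cls, label_props, lang):
    """Return (percentage, avg_num_of_entries) for instances of `cls`.

    Hypothetical helper: computed locally over `triples` only, without
    consulting any external lexicon.
    """
    instances = {s for (s, p, o) in triples if p == RDF_TYPE and o == cls}
    counts = defaultdict(int)
    for s, p, o in triples:
        # Assumption: a label is modelled as a (text, language-tag) pair.
        if s in instances and p in label_props and isinstance(o, tuple) and o[1] == lang:
            counts[s] += 1
    if not instances:
        return 0.0, 0.0
    # Share of instances with at least one label in `lang`.
    percentage = len(counts) / len(instances)
    # Assumption: the average is taken over ALL instances of the class.
    avg = sum(counts.values()) / len(instances)
    return percentage, avg


triples = {
    (":c1", RDF_TYPE, "skos:Concept"),
    (":c2", RDF_TYPE, "skos:Concept"),
    (":c3", RDF_TYPE, "skos:Concept"),
    (":c4", RDF_TYPE, "skos:Concept"),
    (":c1", "skos:prefLabel", ("car", "en")),
    (":c1", "skos:altLabel", ("automobile", "en")),
    (":c2", "skos:prefLabel", ("bicycle", "en")),
    (":c3", "skos:prefLabel", ("train", "en")),
    (":c3", "skos:prefLabel", ("treno", "it")),
}

percentage, avg = resource_coverage(
    triples, "skos:Concept", {"skos:prefLabel", "skos:altLabel"}, "en"
)
print(percentage, avg)  # prints: 0.75 1.0
```

Whether the average should instead be taken only over the instances that have at least one label is exactly the kind of ambiguity that an explicit defining query would settle.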

What do you think?

Philipp.


On 06.03.14 20:17, Armando Stellato wrote:
>
> Dear Philipp and John,
>
> no need to say sorry, you are coordinating a whole community group, we 
> cannot say the same on our side, yet we are no quicker than you in 
> replying :D
>
> You raise an important point, the solution of which actually opens up 
> an interesting opportunity for other important aspects at the level 
> of the web architecture of Ontolex.
>
> Before we delve further into the details, let us ask one more question:
>
> What is the relationship between ontologies and lexica: is it 1:n 
> (an ontology may have multiple lexica) or m:n (as before, plus the 
> same lexicon may be connected to multiple ontologies)? A strictly 
> related question is: "is a lexicon built specifically for an ontology?".
>
> Having ported WordNet to Ontolex should already give the answer to 
> that (WordNet exists independently of any ontology, and thus it should 
> be one example in favor of the m:n hypothesis, though we may think of a 
> Lexicon as something importing WordNet and extending it to be a 
> lexicon for a given ontology).
>
> In case the m:n hypothesis is confirmed, we should think about some 
> form of binding, as a third object implementing the connection between 
> an independent lexicon and an ontology.
>
> I think I already asked something related to that when I had some 
> doubts about how to deal with compound terms: if a lexicon exists 
> independently, it will probably not contain some compounds needed to 
> describe resources of the ontology, so we cannot assume these will 
> always be available. (At the time of my question, I remember I was 
> told: "for things like 'red car', you should foresee a dedicated entry 
> in the lexicon, though it can then be decomposed through the 
> decomposition module", thus implying that the lexicon has to exist FOR 
> a given ontology.)
>
> Probably I'm missing something here, but I think these are fundamental 
> aspects which should be made clear in the wiki pages about the overall 
> architecture and the usage of the model.
>
> Ok, sorry for the long introduction, but as you will see, it is 
> related to our topic... though we may have managed to handle this 
> independently of the above. So, back to the topic...
>
> ...Our model relates to a void file, but this file could be, for 
> instance, not the void file of an ontology, but the void file of 
> (something similar to) a void:linkset which binds a lexicon to an 
> ontology. To cover also the need you express at the end of the email, 
> we could propose the following changes:
>
> 1) A lime:lexicon property,
>
> a. domain: lime:LanguageCoverage (the class obviously)
>
> b. range: void:Dataset (or an appropriate subclass lime:Lexicon to 
> define a dataset which contains linguistic expressions for some 
> dataset). Note that a void:Dataset containing both conceptual and 
> lexical info would be the lexicon of itself!
>
> 2) lime:lexicalModel (old linguisticModel, moved to having domain set 
> to LanguageCoverage)
>
> so we could have a structure like this:
>
> void:Dataset --lime:languageCoverage--> lime:LanguageCoverage 
>     --lime:lexicon--> void:Dataset
>
>     --lime:lexicalModel--> (rdfs:, skos:, skosxl:, ontolex: )
>
>     --lime:resourceCoverage--> (usual stat info)
>
> But then, we would have another issue...what is a lexicon? If a 
> lexicon is something independent of the "enrichment" of an ontology 
> with respect to a language, and lives on its own, then, here, in our 
> case, we are more interested in knowing the third element we were 
> mentioning above, that is, the "man in the middle" providing links 
> between a conceptual resource and a lexicon. Thus, with just a 
> terminological change, (lexicon --> lexicalization), and relying on 
> the fact that this representation delegates to the lexicalization the 
> pointers to the lexicon:
>
> void:Dataset --lime:languageCoverage--> lime:LanguageCoverage 
> --*lime:lexicalization*--> void:Dataset (this holds the bindings, 
> though it can be obviously "both the bindings and the lexicon")
>
>                                                  --lime:lexicalModel--> rdfs:resource 
> (thought to hold rdfs:, skos:, skosxl:, ontolex: )
>
>                                                  --lime:resourceCoverage--> 
> (usual stat info)
>
> So, where to hold this information? Probably in the void file of this 
> lexicalization, which mandatorily contains the triples linking (by use 
> of ontolex:reference, for instance), the lexicon to the onto resource 
> and which may contain the lexicon itself (or refer to an external 
> lexicon).
>
> We could, at this point, use a combo of inverseproperty and 
> subproperty axioms to try to model in a lighter way (by inference) the 
> simple case where a conceptual void:Dataset holds its own 
> lexicalization, but still under the umbrella of this more general 
> case, where potentially the ontoresource, the lexicon and their 
> binding are different resources.
>
> Another possible pattern, defined by means of a new class: 
> lime:Lexicalization, which is a void:Dataset proxy for the resource 
> containing the above bindings, is the following:
>
> lime:Lexicalization --lexicalizedDataset--> rdfs:resource (thought to 
> hold the conceptual dataset)
>
>                     --lime:lexicalModel--> rdfs:resource (thought to 
> hold rdfs:, skos:, skosxl:, ontolex: )
>
>                     --lime:resourceCoverage--> (usual stat info)
>
> The one above has the advantage of using the same proxy of the 
> elements we are describing (the bindings) as subject in the triples of 
> its own void file.
>
> An example of its usage:
>
> # inside the void file of a conceptual dataset
>
> :dat lime:lexicalization myItLex:myItalianLexicalizationOfDat .
>
> # inside the void file of the lexicalization
>
> myItLex:myItalianLexicalizationOfDat
>   lime:lang "it" ;
>   lime:lexicalizedDataset :dat ;
>   lime:lexicalModel ontolex: ;
>   lime:resourceCoverage [
>     lime:class owl:Class ;
>     lime:percentage 0.75 ;
>     lime:avgNumOfEntries 3.5
>   ] .
>
> This would be very easy to remap to the simpler case of a dataset 
> holding its own labels (as with simple RDFS or SKOS labels, or in most 
> cases even SKOS-XL labels). In this case, we can simply declare that:
>
> :dat lime:lexicalization :myItalianLexicalizationOfFOAF
>
> And add the following to the description of :myItalianLexicalizationOfFOAF:
>
> :myItalianLexicalizationOfFOAF void:subset :dat
>
> In the case above, both of them would be in the same void file.
>
> Obviously, for the usual ontolex case, we will have the inverse of 
> lime:lexicalization which goes from a void of a Lexicalization to the 
> void of the ontoresource.
>
> Cheers,
>
> Armando and Manuel
>
> P.S. One note about the examples in the wiki: you avoided the use of 
> bnodes, but sometimes using them actually improves readability (at 
> least, when the serialization format allows for a compact notation, 
> so, not in NTRIPLES for instance), exactly because bnodes are meant to 
> support structures that are not meant to have any meaningful name
>
> P.P.S. there are some aspects instead, not at the metadata level, but 
> at the content level, concerning the mentioned triad ontology 
> -->binding-->lexicon which I would like to discuss tomorrow.
>
> P.P.P.S. all of the above surely has to be refined. We were in a hurry 
> so we quickly assigned some names which should be refined, and also, 
> some void to void relationships should not exist. A void proxy should 
> point directly to a content element, which would then contain the 
> reference to its void. But for the moment, this simplification made 
> the examples slightly easier to understand.
>
> *From:*Philipp Cimiano [mailto:cimiano@cit-ec.uni-bielefeld.de]
> *Sent:* Tuesday, March 4, 2014 7:36 AM
> *To:* Manuel Fiorelli
> *Cc:* Armando Stellato; John McCrae
> *Subject:* Re: LIME proposal for the OntoLex W3C Community Group
>
> Dear Manuel, Armando, John,
>
>  apologies for getting back to you so late on this.
>
> I started editing the final model specification to include the lime 
> vocabulary as you propose:
>
> https://www.w3.org/community/ontolex/wiki/Final_Model_Specification#Metadata_.28lime.29
>
> Can you please confirm that this is what you intended?
>
> Other than that, I would like to clarify something.
>
> In my understanding you are proposing to use this vocabulary to attach 
> metadata to any void:Dataset that has lexicalizations according to 
> some linguistic model, so basically every ontology and every ontolex 
> instance in particular.
>
> However, you seem to assume that the lexicalizations are inline in the 
> actual model so that we can actually "count" in a closed fashion how 
> many lexicalizations per owl:Class etc. there are.
>
> However, the main foreseen application of the ontolex model is one 
> where the ontology and the lexicon are in separate files, so the 
> lexicon is external and there are possibly many external lexica. In 
> the lime vocabulary, the coverage in terms of ontolex lexical entries 
> would in the general case be 0, right?
>
> Further, in the resource coverage I cannot refer to the linguistic 
> model anymore, right? Can I see that 75% of the owl:Classes have an 
> rdfs:label in English, 50% have an ontolex lexical entry, etc.?
>
> I would like to discuss these issues per email and also Friday on our 
> telco.
>
> Thanks for your input so far.
>
> Best regards,
>
> Philipp.
>
>
> On 21.02.14 13:33, Manuel Fiorelli wrote:
>
>     Dear Philipp,
>
>     hereafter I send our proposal to the OntoLex W3C Community Group
>     for the standardization of LIME.
>
>     LIME [1] (LInguistic MEtadata) is a vocabulary of mostly
>     linguistic metadata about linguistically grounded datasets and
>     linguistic resources published as Linked Open Data.
>
>     The vocabulary has been implemented as an extension of VoID [2],
>     which provides a vocabulary of general metadata, and an extensible
>     framework for publishing additional metadata.
>
>     We decided to submit to the OntoLex community group the subset of
>     LIME concerning linguistically grounded datasets, while omitting
>     for the moment the rest of the vocabulary about linguistic
>     resources and their connection with other RDF datasets. We hope
>     that it will be easier to reach consensus on a limited proposal
>     despite the fact the working group is near its conclusion.
>     Furthermore, the omission is motivated by the lack of a
>     standardized vocabulary for expressing various kinds of linguistic
>     resources the content of which cannot be represented through LIME.
>
>     In the rest of the email, we report the vocabulary as it is now,
>     and identify potential elements of discussion.
>
>     *lime:language*
>
>     Property: lime:language
>       Range: xsd:string
>
>     lime:language holds the natural languages in which a given dataset
>     is expressed. This property is meant to be asserted multiple
>     times, once for each natural language.
>
>     In the original model lime:language is defined as a datatype
>     property, whose values are simple literals (no datatype, no
>     language tags) representing the language identifiers, as they are
>     expressed in RDF. Note that RDF 1.1 changed (in a backward
>     compatible way) the representation of language tags. Therefore, we
>     should be careful in how we express the requirements on lime:language.
>
>     Since the focus of the proposal is on RDF datasets only, we could
>     clarify that the domain of lime:language is void:Dataset.
>
>     Concerning lime:language, we should discuss whether to adopt or
>     link to dct:language, and whether to use URIs [3] for representing
>     natural languages.
>
>     *lime:linguisticModel*
>
>     Property: lime:linguisticModel
>
>     SubPropertyOf: void:vocabulary
>
>     The presence of a linguistic description does not guarantee that
>     an agent can exploit it. Indeed, the agent must know whether
>     linguistic information is available in the form of traditional
>     rdfs:label(s), SKOS labels, SKOS-XL reified labels, or whatever.
>     Most datasets are likely to use multiple linguistic models
>     simultaneously, each one for different needs (e.g. the distinction
>     between preferred and alternative labels may or may not be of
>     interest). These models are held by the property
>     lime:linguisticModel, which extends the property void:vocabulary,
>     as the former expresses a more specific association with the
>     vocabulary.
>
>     In the original model we assumed that if multiple linguistic
>     models are used, then they express the same information, within
>     the limits of their respective scopes. For example, when RDFS and
>     SKOS are both used, we assumed that each of the
>     skos:{pref,alt,hidden}Labels is materialized as an rdfs:label.
>     However, this assumption does not hold when a dataset uses SKOS
>     for expressing labels and RDFS just for comments. While in this
>     case a reasoner could materialize the rdfs:labels, we think that
>     in general the original assumption should be retracted.
>
>     Moreover, we should clarify our stance with respect to reasoning
>     and implicit information. Let's assume that a dataset has explicit
>     SKOS labels, but no explicit RDFS label. Surely, this dataset uses
>     SKOS as linguistic model. Concerning RDFS, we think that the
>     metadata should not include it (even though a reasoner could
>     support the materialization of RDFS labels).
>
>     *Statistical facts*
>
>     :dat lime:languageCoverage [
>       lime:lang "en";
>       lime:resourceCoverage [
>         lime:class skos:Concept;
>         lime:percentage 0.75;
>         lime:avgNumOfEntries 3.5
>       ]
>     ].
>
>     The previous snippet states that 75% of skos:Concepts have
>     attachments in English, and that skos:Concepts have on average 3.5
>     such attachments.
>
>     Again, we should clarify our stance with respect to inference.
>     Moreover, with respect to the section about linguistic models, we
>     should decide whether it is better to specify which kind of
>     attachments we are talking about.
>
>
>     Finally, we should decide whether a tighter connection with the
>     DataCube [4] vocabulary would be valuable.
>
>     References:
>     [1] Manuel Fiorelli, Maria Teresa Pazienza and Armando Stellato.
>     /LIME: Towards a Metadata Module for Ontolex/, 2nd Workshop on
>     Linked Data in Linguistics: Representing and Linking lexicons,
>     terminologies and other language data (LDL-2013), collocated with
>     the Conference on Generative Approaches to the Lexicon, September
>     23rd in Pisa, Italy
>     (http://art.uniroma2.it/publications/docs/2013_LDL_LIME%20Towards%20a%20Metadata%20Module%20for%20Ontolex.pdf)
>
>     [2] http://www.w3.org/TR/void/
>
>     [3]
>     http://www.semantic-web-journal.net/content/lexvoorg-language-related-information-linguistic-linked-data-cloud-0
>
>
>     [4] http://www.w3.org/TR/vocab-data-cube/
>
>     -- 
>     Manuel Fiorelli
>     PhD student in Computer and Automation Engineering
>     Dept. of Computer Science, Systems and Production
>     University of Rome "Tor Vergata"
>     Via del Politecnico 1
>     00133 Roma, Italy
>
>     tel: +39-06-7259-7334
>     skype: fiorelli.m
>
>
>
>
> -- 
>   
> Prof. Dr. Philipp Cimiano
>   
> Phone: +49 521 106 12249
> Fax: +49 521 106 12412
> Mail: cimiano@cit-ec.uni-bielefeld.de
>   
> Forschungsbau Intelligente Systeme (FBIIS)
> Raum 2.307
> Universität Bielefeld
> Inspiration 1
> 33619 Bielefeld


-- 

Prof. Dr. Philipp Cimiano

Phone: +49 521 106 12249
Fax: +49 521 106 12412
Mail: cimiano@cit-ec.uni-bielefeld.de

Forschungsbau Intelligente Systeme (FBIIS)
Raum 2.307
Universität Bielefeld
Inspiration 1
33619 Bielefeld
Received on Friday, 7 March 2014 21:43:39 UTC