RE: LIME Final Model from Armando Stellato on 2015-01-26 (public-ontolex@w3.org from January 2015)

From: Armando Stellato <stellato@info.uniroma2.it>
Date: Mon, 26 Jan 2015 20:04:20 +0100
To: "'John P. McCrae'" <jmccrae@cit-ec.uni-bielefeld.de>, "'public-ontolex'" <public-ontolex@w3.org>
Message-ID: <DUB408-EAS390BE125A0C343F4F0A1914A0350@phx.gbl>
Dear John,

 

good thing, we more or less agree with you :-)

 

Sorry in advance for the long email, but we will address a few points: why it was initially agreed (by all of us) not to model it like that, why it could now be modeled that way, and which options we propose.

 

Just as a historical note about it not being a subset. This emerged in a quite old phone call (it’s not in the minutes, as we only report agreed decisions and usually not rejections). Actually, in that call we all speculated about this possibility, and later on all of us agreed on rejecting it, as we preferred to have a different nature for this coverage. The reason is mainly that by having a clear representation for the dataset, and just an appendix entity for statistical information about the coverage (as it was at that time), there was no ambiguity about where certain information had to be asserted.

Let’s make a short example over a LexicalizationSet:

 

:EnglishLexicalizationSet
  rdf:type lime:LexicalizationSet ;
  ontolex:language "en" ;
  lime:referenceDataset <http://www.cimiano.de/ontologies/foaf-meta#VocabularyFOAF> ;
  lime:lexicalizationModel <http://www.w3.org/ns/lemon/ontolex> ;
  lime:lexiconDataset :FOAFEnglishLexicon ;
  lime:coverage [
      lime:resourceType owl:Thing ;
      lime:percentage 0.171 ;
      lime:avgNumOfLexicalizations 0.197 ;
  ] .

 

 

Clearly, all the information such as referenceDataset, lexicalizationModel and lexiconDataset are valid for the lexicalization as a whole. The coverage was limited to hold those simple statistics we were talking about.

If we consider the coverages to be subsets (whichever property points to them) of the LexicalizationSet, then one would expect to find the same info (referenceDataset, lexicalizationModel etc…) on these subsets. However, property values are not “passed” from datasets to their subsets, as they all represent different objects and need to be described as well.

 

Now, let’s come to today: wrt our original proposal, there has been much debate about the possibility of putting many other properties (not only averages and percentages, but also counts) even in the coverages, which eventually ended in these much richer coverage representations which…yes…are at this point, very similar to the LexicalizationSet itself.

In short, modeling the coverages (for LexicalLinkSets, LexicalizationSets…and maybe Conceptualizations) as objects of the same nature as their containers (actually subsets of them) is, at this point, surely better.

However, there is still the same issue we addressed before: the non-inheritability of property values to the subsets.

 

However…in the end…this is really the same issue which exists in VoID (and it seems to be not addressed that much there). For instance, in a void:Dataset and its subsets, shouldn’t the SPARQL endpoint be the same? We observed the list of properties there, and took some examples. It seems there are quite loose semantics and more “best practices of interpretation”. Just to provide two different cases: you may find void:dataDump respecified in the datasets, describing the files containing the specific triples of the subsets, while the SPARQL endpoint is generally assumed to be the one of the containing dataset. But nothing in the model clarifies this.
If we are happy with keeping the same loose semantics (and considering the larger amount of shared information between coverages and their containers wrt the original proposal), then why not? We can go for a subset approach.
 

So, if we go for the subset approach, we suggest a few modifications to your proposal, as we originally discussed in that call:

 

1)      Do not coin a dedicated class for the coverages. Just keep the container (LexicalizationSet, LexicalLinkSet…and again…maybe Conceptualization, but we’ll discuss this in a separate thread) and assume that they have the same properties (with all the semantically loose assumptions about the inheritance of prop-values).

2)      Use a property to address the partition. In this case, why not simply reuse void:classPartition?
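
As a minimal sketch of what points 1 and 2 could look like, here is the earlier EnglishLexicalizationSet example with the coverage turned into a class partition that is itself a LexicalizationSet. The prefix declarations are assumptions added only to make the snippet self-contained (in particular the lime: and ontolex: namespace URIs and the example.org default prefix); replacing lime:resourceType with void:class follows from reusing the VoID partition pattern and is not something already agreed.

```turtle
# Sketch only: namespace URIs for lime:, ontolex: and the default prefix are assumed.
@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix void:    <http://rdfs.org/ns/void#> .
@prefix lime:    <http://www.w3.org/ns/lemon/lime#> .
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix :        <http://example.org/> .

:EnglishLexicalizationSet
  a lime:LexicalizationSet ;
  ontolex:language "en" ;
  lime:referenceDataset <http://www.cimiano.de/ontologies/foaf-meta#VocabularyFOAF> ;
  lime:lexicalizationModel <http://www.w3.org/ns/lemon/ontolex> ;
  lime:lexiconDataset :FOAFEnglishLexicon ;

  # The former lime:coverage node, now a subset of the same class as its container.
  void:classPartition [
      a lime:LexicalizationSet ;
      void:class owl:Thing ;
      lime:percentage 0.171 ;
      lime:avgNumOfLexicalizations 0.197
  ] .
```

Note that the partition does not restate referenceDataset, lexicalizationModel or lexiconDataset; under the loose VoID-style reading, these are assumed to be those of the containing set.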

 

Concerning point 2, observe that VoID itself does not provide that many axioms, but if you would like to write them, we could define:

 

LexicalizationSet ⊑ ∀ void:classPartition.( LexicalizationSet ⊓ =1 void:class)

 

Analogous axioms hold for LexicalLinkSet (and again, maybe Conceptualization).
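
For reference, the axiom above could be written in OWL 2 (Turtle syntax) roughly as follows. This is a sketch under the assumption that lime: resolves to the namespace shown; it says that every class partition of a LexicalizationSet is again a LexicalizationSet with exactly one void:class value.

```turtle
# Sketch only: the lime: namespace URI is assumed.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix lime: <http://www.w3.org/ns/lemon/lime#> .

lime:LexicalizationSet rdfs:subClassOf [
    a owl:Restriction ;
    owl:onProperty void:classPartition ;
    # All values of void:classPartition must satisfy the intersection below.
    owl:allValuesFrom [
        a owl:Class ;
        owl:intersectionOf (
            lime:LexicalizationSet
            [ a owl:Restriction ;
              owl:onProperty void:class ;
              owl:cardinality "1"^^xsd:nonNegativeInteger ]
        )
    ]
] .
```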

 

Finally, property chains could be defined to make the subsets inherit the values of their supersets, though only for object properties…
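
A possible shape for such a chain, sketched for lime:referenceDataset only (the same pattern would have to be repeated for each object property to be inherited, e.g. lime:lexiconDataset; the lime: namespace URI is an assumption):

```turtle
# Sketch only: the lime: namespace URI is assumed.
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix lime: <http://www.w3.org/ns/lemon/lime#> .

# If ?set void:classPartition ?part and ?set lime:referenceDataset ?d,
# then ?part lime:referenceDataset ?d.
lime:referenceDataset owl:propertyChainAxiom (
    [ owl:inverseOf void:classPartition ]
    lime:referenceDataset
) .
```

Literal-valued properties such as the language tag cannot be covered this way, since OWL 2 property chains are defined over object properties only.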

 

Cheers,

 

Armando and Manuel

 

P.S.: on the use of void:classPartition. The description in the VoID spec (http://www.w3.org/TR/void/#class-property-partitions) is not totally clear. Property partitions certainly indicate subsets containing only triples that have a given property as their predicate. For void:classPartition, however, the definition mentions “descriptions of instances of the given class”, and we do not know whether this is meant to be interpreted as triples with those instances in the subject position, or as a wider notion covering everything related to them. The stricter interpretation has a problem: trivially, the direction of the predicates used in the LexicalizationSet, since the reference dataset resources could be the objects (rather than the subjects) of ontolex:denotes triples.

 

 

 

 

From: johnmccrae@gmail.com [mailto:johnmccrae@gmail.com] On Behalf Of John P. McCrae
Sent: Friday, January 23, 2015 4:57 PM
To: Manuel Fiorelli
Cc: public-ontolex; Armando Stellato
Subject: Re: LIME Final Model

 

OK, one more thing that I think I have not made clear yet. The motivation for this is that it makes it easier to understand that all properties that can be stated about a Lexicalization can also be stated about a LexicalizationCoverage. If one is a subset of the other this is more obvious and uses one axiom to express what otherwise requires many axioms.

For the language question, we agreed on dcterms:language:
http://www.w3.org/2014/10/17-ontolex-minutes.html

Regards,
John

 

On Fri, Jan 23, 2015 at 4:50 PM, Manuel Fiorelli <manuel.fiorelli@gmail.com> wrote:

Dear John, All

see my answers below.

 

2015-01-23 15:48 GMT+01:00 John P. McCrae <jmccrae@cit-ec.uni-bielefeld.de>:

 

 

On Fri, Jan 23, 2015 at 3:17 PM, Manuel Fiorelli <manuel.fiorelli@gmail.com> wrote:

Dear John, All

see my answer below.

 

2015-01-23 14:59 GMT+01:00 John P. McCrae <jmccrae@cit-ec.uni-bielefeld.de>:

 

On Fri, Jan 23, 2015 at 2:50 PM, Manuel Fiorelli <manuel.fiorelli@gmail.com> wrote:
7. Properties avgNumOfLexicalization, percentage, lexicalizations no longer on Lexicalization

This is something that (if I remember correctly) was still under discussion. However, in the attached document I was open to the possibility of including these properties in the LexicalizationSet.

The change you propose would dramatically change the semantics of the model. Currently, a coverage is only a container of statistics. With your change in place, a coverage would be a dataset, which contains (I presume) the lexicalization triples.

OK, I think the important thing is that properties such as lexicalizations can be added to the Lexicalization; it didn't look like that from the diagram.

As for changing the semantics, I disagree. The lexicalization is not truly a 'dataset' in most cases, as it may instead be published as part of a lexicon (or even part of an ontology). Instead it is a dataset in the sense that it is some set of triples, in this case the triples linking an ontology to a lexicon. Thus, for me, a resource coverage is also a dataset, namely the set of triples linking a lexicon to a selection of the ontology's entities by type.

 

In the model, we have the following axiom

lime:LexicalizationSet rdfs:subClassOf void:Dataset .

therefore, each lexicalizationSet is a dataset, in the sense of being a set of triples, i.e. representing the association between ontology entities and lexical entries.

As you argue, it may be a subset of another dataset. On this last point, maybe we were a bit ambiguous in previous telcos/emails. Suppose that I want to distribute an ontolex:Lexicon together with a lime:LexicalizationSet, what is the appropriate structure of the data?

a)

The lexicon also contains the triples related to the lexicalizationSet

:myLexicon a ontolex:Lexicon .
:myLexicon void:subset :myLexicalizationSet .

:myLexicalizationSet a lime:LexicalizationSet.

 

b)

 

The lexicon does not contain the triples related to the lexicalization; instead, both the lexicon and the lexicalizationSet are part of a larger dataset.


:myDataset a void:Dataset .

:myDataset void:subset :myLexicon .

:myDataset void:subset :myLexicalizationSet .

:myLexicon a ontolex:Lexicon .

:myLexicalizationSet a lime:LexicalizationSet .

 

 

I thought that we agreed on the solution b), in order to completely remove "semantic" information from the lexicon. What is your position?

I think both solutions are in principle fine but would also prefer (b)... I'm not quite sure about the relevance here. By 'true dataset' I mean a collection of triples grouped together and made available as a single download, the semantics of VoID are much weaker making parts of a single download a dataset as well (although the definition <http://vocab.deri.ie/void#Dataset>  of void:Dataset seems to be a 'true dataset')

 

I asked because you wrote "The lexicalization is not truly a 'dataset' in most cases, as it may instead be published as part of a lexicon", thus making me think you were assuming solution a)

The following example from the spec clearly allows one to define a (sub)set only for the purpose of providing metadata:

:DBpedia a void:Dataset;
    void:classPartition [
        void:class foaf:Person;
        void:entities 312000;
    ];
    void:propertyPartition [ 
        void:property foaf:name;
        void:triples 312000;
    ];
    .

 

 

For example, VoID's classPartition property, which for me is closely related to lime:coverage, is a subproperty of void:subset, and hence any class partition is a void:Dataset. By the same principle, I would say that the range of lime:coverage is also a void:Dataset, as it is also a partition of the lexicalization. We could even go further and claim lime:coverage ⊑ void:subset!

See:
http://www.w3.org/TR/void/#class-property-partitions
http://vocab.deri.ie/void#classPartition

 

 

I see your point. You are suggesting that:

LexicalizationSet is the dataset containing all the triples related to lexicalization;
then, by means of coverage, you introduce a subset that is concerned only with a specific resource type. The object could be something like ResourceConstrainedLexicalizationSet.

I am sure that this option was already considered and collectively discarded during a telco. Unfortunately, I am not sure about the motivations.

Since your proposal seems reasonable, Armando and I will discuss it on Monday, in order to accept or reject your proposal.

In the meantime, I want to highlight another aspect of the model I am not sure about. Did we agree on the use of ontolex:languageURI or dcterms:language for languages expressed as resources?

-- 

Manuel Fiorelli
Received on Monday, 26 January 2015 19:04:58 UTC