Re: Issues concerning Morphology and Part of Speech Tags from Christian Chiarcos on 2019-03-20 (public-ontolex@w3.org from March 2019)

From: Christian Chiarcos <chiarcos@informatik.uni-frankfurt.de>
Date: Wed, 20 Mar 2019 20:05:42 +0100
To: "Julia Bosque Gil" <jbosque@fi.upm.es>
Cc: peikert.katrin@web.de, public-ontolex <public-ontolex@w3.org>
Message-ID: <op.zyysnsvp89jat0@kitaba>

Dear Julia,

> Thank you for your comments, this is leading to a nice discussion indeed  
> :)

;)

> So, without knowing to which entry each form belongs, I only see three  
> options (now revisited after your e-mail):
>> (a) All lexical entries receive the forms, triplicating the list of
>> forms ("gal" in Sumerian has 198 attested forms without information
>> about the sense, so this option should probably be reconsidered...!)
> (b) The entries share the forms by interpreting "a" in the definition of
> ontolex:Form as existential quantifier. What worries me here is that I
> am not sure about a "realisation" being a realisation of more than one
> entry at the same time.

I understand your hesitation, and coming from a technical perspective, I
am possibly too pragmatic here ;) Any input from a lexicographer?

>> (c) Only one lexical entry is linked to the forms. For the other
>> lexical entries...either they inherit from the first entry with your
>> new property, or we would need to access the forms through lexicog
>> mechanisms. Regarding this new property that you suggest, it makes a
>> lot of sense to me if you know beforehand that there is an "original"
>> LexicalEntry (or one you want to treat as "original") which does occur
>> in a series of forms, and the other lexical entries are realized with
>> the same grammatical properties.

I think that would be the case here, because the EPSD is providing the  
"dictionary-entry"-level POS tag in a more prominent fashion than the  
sense-specific POS tags. In dictionaries, a similar interpretation can be  
given to the sequential order of parts of speech. I.e., if a lexicographer  
puts one particular part of speech first, he either does so because it is  
more "prototypical" or "natural" for a potential reader, because he knows  
(and assumes his reader to know) about the origin of a "zero-derived"  
form, or because he follows a general pattern that would probably  
implement the intuition that verbs and nouns are somewhat more
"fundamental" than adjectives or adverbs, or even function words. Either
way, there would be a first, i.e., most prominently represented one, and
we can just *define* it as the "original". However, this may differ from
the direction of morphological derivation (what looks like zero
derivation today may, diachronically, have been an overt morphological
process, like the derivation of German adverbs from adjectives [Middle
High German added an -e here, which was then lost by apocope]), and this
is why I'm struggling a bit with the name of the property.

> It would be a nice solution to the problem of the adjective-adverb issue
> in German we discussed in some calls on the lexicog module, as you
> mentioned. But, for the example of the Sumerian data, I might have
> missed something or got lost in the process: how do we know which forms
> are linked to the "original" LexicalEntry in the first place, if there
> is no way to know from the data which of the 198 forms of "gal" are
> connected to which lexical entry (v, n, or adj)? In other words, are
> we preventing a wrong scope of form attestations in any of the ways of
> implementing option (c)?

We can, at least: if we state that a "derived" LexicalEntry inherits
(or extends) the forms and senses of the "original" *unless they are
explicitly given*, then explicitly giving form and/or sense information
entails that these are *not* inherited by that particular derived entry.
This is a bit like overriding an inherited variable in an object-oriented
programming language.
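
To make the override analogy concrete, here is a minimal sketch in Python.
Everything in it is an assumption for illustration only: the entry ids
(gal_v, gal_n, gal_adj), the placeholder form names, and the derived_from
link, which stands in for the property whose name is still undecided. It is
not OntoLex vocabulary and not EPSD data.

```python
from typing import Dict, List, Optional

# Hypothetical data: entry id -> explicitly listed forms.
# None means "nothing stated", so the entry falls back to its "original".
explicit_forms: Dict[str, Optional[List[str]]] = {
    "gal_v": ["form_1", "form_2", "form_3"],  # treated as the "original"
    "gal_n": None,                            # no forms given: inherits
    "gal_adj": ["form_2"],                    # forms given: overrides
}

# Hypothetical link from a "derived" entry to its "original" entry.
derived_from: Dict[str, str] = {"gal_n": "gal_v", "gal_adj": "gal_v"}


def effective_forms(entry: str) -> List[str]:
    """Return the entry's own forms if given, else those of its original."""
    seen = set()
    current: Optional[str] = entry
    while current is not None and current not in seen:
        seen.add(current)  # guard against cyclic derivation links
        forms = explicit_forms.get(current)
        if forms is not None:
            return list(forms)
        current = derived_from.get(current)
    return []


print(effective_forms("gal_n"))    # falls back to the forms of gal_v
print(effective_forms("gal_adj"))  # uses only its own explicit forms
```

Note that the fallback for gal_n is only safe if "nothing stated" can be
read as "nothing exists", which is where the closed world assumption comes
in.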

However, this subtle difference can only be maintained if we adopt the
closed world assumption for OntoLex data. Otherwise, we might
accidentally just have lost the decisive ontolex:lexicalForm property
that would have helped us to decide about the scope. This would be a *BIG*
design decision to make (but one I would tentatively support -- I am
probably missing some of its consequences, though).

This problem is deeply rooted in RDF semantics (and one of the aspects
where it differs from, say, the default semantics of graph databases). If
we adopt the Open World Assumption, we *cannot* express (in any of these
modelling choices) that something is *un*ambiguous -- because it could
always be that we just lost the triple that connects the form with another
lexical entry. I just created an issue on https://github.com/w3c/EasierRDF
requesting a compact means to assert a CWA or OWA interpretation for an
RDF graph.
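
The difference between the two readings can be sketched in a few lines of
Python. This is a toy model over a plain set of triples, not an RDF
library; the entry ids and form names are made up for illustration, and
only ontolex:lexicalForm is taken from the discussion above.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical graph: only two lexicalForm triples are stated.
graph: Set[Triple] = {
    ("gal_v", "ontolex:lexicalForm", "form_1"),
    ("gal_adj", "ontolex:lexicalForm", "form_2"),
}


def has_form(entry: str, form: str, closed_world: bool) -> Optional[bool]:
    """True if stated; if absent: False under CWA, None (unknown) under OWA."""
    if (entry, "ontolex:lexicalForm", form) in graph:
        return True
    return False if closed_world else None


def is_unambiguous(form: str, entries: List[str],
                   closed_world: bool) -> Optional[bool]:
    """Does the form belong to exactly one of the given entries?"""
    answers = [has_form(e, form, closed_world) for e in entries]
    if answers.count(True) > 1:
        return False  # two stated links: ambiguous under either reading
    if None in answers:
        return None   # OWA: a missing triple might simply have been lost
    return answers.count(True) == 1


entries = ["gal_v", "gal_n", "gal_adj"]
print(is_unambiguous("form_1", entries, closed_world=True))   # True
print(is_unambiguous("form_1", entries, closed_world=False))  # None
```

Under the open-world reading the function can report that a form is
ambiguous, but never that it is unambiguous, which is the asymmetry
described above.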

Best,
Christian
-- 
Prof. Dr. Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany

office: Robert-Mayer-Str. 11-15, #107
mail: chiarcos@informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28334
Received on Wednesday, 20 March 2019 19:08:21 UTC