Re: Issues concerning Morphology and Part of Speech Tags from Christian Chiarcos on 2019-03-20 (public-ontolex@w3.org from March 2019)

From: Christian Chiarcos <chiarcos@informatik.uni-frankfurt.de>
Date: Wed, 20 Mar 2019 15:52:20 +0100
To: peikert.katrin@web.de, "Julia Bosque Gil" <jbosque@fi.upm.es>
Cc: public-ontolex <public-ontolex@w3.org>
Message-ID: <op.zyygxib989jat0@kitaba>
On .03.2019, at 13:45, Julia Bosque Gil <jbosque@fi.upm.es> wrote:

>
> Dear Katrin,
> I will try to provide some possible solutions for your second issue,
> concerning the single part-of-speech tag and the lexicog:Entry
> approach. My answers are inline ;)
>>
>> The second issue concerns senses and part of speech tags. In EPSD2 it
>> is possible for an entry to have a “general” part of speech tag, but
>> some senses of it have a different tag, e.g. “gal” (big), which is
>> characterized as a “V\i”, but it can also mean “goblet”, which is
>> tagged as “N”. But since Ontolex does not allow a LexicalEntry to
>> have more than one part of speech tag, it is unclear to me how one
>> could model this phenomenon. The lexicog solution would be to use a
>> lexicog:Entry for “gal” in general, and three LexicalEntry-s for the
>> three parts of speech.
>
> Exactly, you would have gal-v, with senses [1-5], gal-n with sense [6],  
> and gal-adj with sense [8].
>>
>> The problem is that EPSD2 stores information about the forms and their
>> frequency for “gal”, but not for gal with senses [1-5], [6] or [7]
>> separately. It is unclear which form of a word is connected to which
>> sense and how often this specific sense with a specific form is used.
>
> From what I understood, since this information is not explicitly
> provided in the dictionary, there is no way of automatically
> distinguishing this case from those in which all forms go with all
> senses unless you take into account the case difference in the strings.
> I see three possible ways of representing this, one of them easier in
> terms of querying, but overkill and leading to a high number of triples.
> The other two are more concise but would create some lexical entries
> without a form, and you would need to query the dictionary entry to get
> them.
> a) [lots of triples] Since these entries look like those
> in which all forms go with all senses, each created LexicalEntry
> receives all the forms, which would need to be triplicated. The
> disambiguation step in the future would involve an update to remove
> those Forms that are not realisations of the lexical entry at hand.
The first problem is that frequency information needs to be added to the
forms, and these frequencies refer to non-disambiguated forms. Representing
the ambiguity by double linking (one form, several lexical entries) would
be semantically less incorrect than duplicating the forms. The second
problem is that not all forms seem to go with all senses, but we cannot
tell them apart, so representing this without a hint that it is ambiguous
is just wrong.

a'): Can't we just point from several lexical entries to the same form?  
The definition is "A form represents one grammatical realization of a  
lexical entry." and this is ambiguous regarding the interpretation of "a"  
as either "one" or an existential quantifier. The latter would permit  
multiple lexical entries per form.
Note that the situation is *not* analogous with LexicalSense, where we  
cannot interpret "a" as existential quantifier, because it is further  
elaborated as "a pair of a *uniquely determined* lexical entry and a  
uniquely determined ontology entity", but there doesn't seem to be a  
comparable restriction to ontolex:Form.
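
A minimal sketch of (a'), in plain Python rather than an RDF library, with triples written as tuples. Only ontolex:lexicalForm is OntoLex vocabulary here; the ex: identifiers, the ex:frequency property and the counts are hypothetical and not taken from EPSD2. It illustrates that a shared form carries its frequency once, and that the ambiguity remains visible as a form with more than one lexical entry pointing to it.

```python
from collections import defaultdict

# Triples as (subject, predicate, object). All ex: names and the counts are
# hypothetical placeholders, not EPSD2 data.
triples = {
    ("ex:gal-v", "ontolex:lexicalForm", "ex:form1"),
    ("ex:gal-n", "ontolex:lexicalForm", "ex:form1"),
    ("ex:gal-adj", "ontolex:lexicalForm", "ex:form1"),
    ("ex:gal-v", "ontolex:lexicalForm", "ex:form2"),
    ("ex:gal-n", "ontolex:lexicalForm", "ex:form2"),
    ("ex:gal-adj", "ontolex:lexicalForm", "ex:form2"),
    ("ex:form1", "ex:frequency", 120),
    ("ex:form2", "ex:frequency", 7),
}


def entries_per_form(graph):
    """Map each form to the set of lexical entries that point to it."""
    owners = defaultdict(set)
    for s, p, o in graph:
        if p == "ontolex:lexicalForm":
            owners[o].add(s)
    return dict(owners)


def ambiguous_forms(graph):
    """Forms shared by more than one lexical entry, i.e. not yet disambiguated."""
    return {form for form, owners in entries_per_form(graph).items() if len(owners) > 1}


def total_frequency(graph):
    """Sum of form frequencies; each shared form is counted exactly once."""
    return sum(o for s, p, o in graph if p == "ex:frequency")


print(sorted(ambiguous_forms(triples)))  # ['ex:form1', 'ex:form2']
print(total_frequency(triples))          # 127
```

A later disambiguation step would then only have to remove individual lexicalForm links; the form nodes and their frequencies stay untouched.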

>
> b) [more concise] Only one LexicalEntry receives all forms (e.g. let us
> say, randomly, the one with the first sense, so gal-v), which might
> not be correct, but in this way there are no ontolex:Forms without a
> LexicalEntry. The other two LexicalEntries would not have a lexical
> form, but the lexicog:Entry would consist of LexicographicComponents
> that point to them via describes. lexicog:LexicographicComponents can
> also describe ontolex:Forms, since the range of the describes property
> is owl:Thing. If you state that the lexicog:Entry that includes
> components describing the three ontolex:LexicalEntries also has more
> components, each describing a Form, you can later on get a list of all
> the forms described in that dictionary entry. In this way, if you want
> to access the potential forms that would go with gal-adj or gal-n, you
> would need to perform a query in SPARQL: “Given that gal-n is described
> by a LexicographicComponent which is rdfs:member of a lexicog:Entry,
> give me all the ontolex:Forms that are described by
> LexicographicComponents which are also rdfs:member of that same
> lexicog:Entry”. Alternatively, “Given that gal-n is described by a
> LexicographicComponent which is rdfs:member of a lexicog:Entry, give me
> all the ontolex:Forms of other LexicalEntries that are described by
> LexicographicComponents which are also rdfs:member of that same
> lexicog:Entry”, and then you would get the forms linked to gal-v. For
> the last query you actually would not need to create
> LexicographicComponents describing Forms, because you access them via
> gal-v (unless you consider that the EPSD has indeed a section in that
> entry devoted to form description and you want to capture that).
Pretty complicated, and the scope of form attestations and their
frequencies would be just as incorrect as when duplicating all forms.
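
To make the cost of (b) concrete, the first query described there can be sketched as a traversal over triples. This is plain Python standing in for SPARQL; the ex: identifiers are hypothetical, and the direction of rdfs:member (the lexicog:Entry points to its components) is an assumption of this sketch.

```python
# Triples as (subject, predicate, object); all ex: names are hypothetical.
# Assumed direction: the lexicog:Entry points to its components via rdfs:member.
triples = {
    ("ex:gal", "rdf:type", "lexicog:Entry"),
    ("ex:gal", "rdfs:member", "ex:comp-v"),
    ("ex:gal", "rdfs:member", "ex:comp-n"),
    ("ex:gal", "rdfs:member", "ex:comp-form1"),
    ("ex:gal", "rdfs:member", "ex:comp-form2"),
    ("ex:comp-v", "lexicog:describes", "ex:gal-v"),
    ("ex:comp-n", "lexicog:describes", "ex:gal-n"),
    ("ex:comp-form1", "lexicog:describes", "ex:form1"),
    ("ex:comp-form2", "lexicog:describes", "ex:form2"),
    ("ex:form1", "rdf:type", "ontolex:Form"),
    ("ex:form2", "rdf:type", "ontolex:Form"),
    # Under (b), only gal-v carries the forms directly.
    ("ex:gal-v", "ontolex:lexicalForm", "ex:form1"),
    ("ex:gal-v", "ontolex:lexicalForm", "ex:form2"),
}


def subjects(graph, predicate, obj):
    """All subjects s such that (s, predicate, obj) is in the graph."""
    return {s for s, p, o in graph if p == predicate and o == obj}


def objects(graph, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def potential_forms(graph, lexical_entry):
    """Forms described in the same dictionary entry as `lexical_entry`."""
    forms = set()
    for component in subjects(graph, "lexicog:describes", lexical_entry):
        for entry in subjects(graph, "rdfs:member", component):
            for sibling in objects(graph, entry, "rdfs:member"):
                for thing in objects(graph, sibling, "lexicog:describes"):
                    if "ontolex:Form" in objects(graph, thing, "rdf:type"):
                        forms.add(thing)
    return forms


print(sorted(potential_forms(triples, "ex:gal-n")))  # ['ex:form1', 'ex:form2']
```

Note that gal-n has no lexicalForm of its own in this data; its forms are only reachable through four hops via the dictionary entry.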

>
> c) Just like (b), but the LexicographicComponents of the lexicog:Entry
> would not describe ontolex:LexicalEntries, but ontolex:LexicalSenses.
> This depends on how exactly you want to recreate the original structure
> that you have in the EPSD2.
d): Use a yet-to-be-determined property from the morphology module that
associates a LexicalEntry with another one from which, unless explicit
forms are specified, it inherits all lexicalForm properties. From the
current discussion, that could be a subproperty of the non-reified version
of morph:DerivationalRelation (cf.
https://www.w3.org/community/ontolex/wiki/Morphology, working examples), as
suggested (by me ;) for "zero derivation" (morphology wiki, discussion
under N11).

In Sumerian, this is probably not a derivation proper, but a  
grammaticalization or lexicalization, so morph:zeroDerivation or the like  
would be slightly misplaced, but a possible alternative name could be  
morph:reanalyzedAs (referring to the grammaticalization process),  
morph:grammaticalizedAs, or morph:cast (by analogy with type casting in  
programming languages), and the definition of this property could be "a
derivational relation between a lexical entry and another lexical entry
with the same canonical form, but different part of speech. For a
reanalyzed (grammaticalized, zero-derived) lexical entry, sense and form
information is optional; if not provided, sense and/or form information is
extended from (or identical with that of) the original lexical entry.
(Note that this constraint only holds under the closed world assumption.)
The canonical form of the target, if provided, *must* be string identical
(modulo capitalization) to the canonical form of the source."
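
The inheritance rule in this proposed definition can be sketched as follows. morph:reanalyzedAs is only one of the candidate names floated above and does not exist in any published vocabulary; the entries, forms and written representations are hypothetical. Plain Python dictionaries stand in for a graph, and the closed-world reading is made explicit: an entry with an empty form set is taken to have no forms specified, so it inherits.

```python
# Hypothetical data. Keys are lexical entries; "source" models the proposed
# (not yet existing) morph:reanalyzedAs link, read from target back to source.
entries = {
    "ex:gal-v": {"canonical": "gal", "forms": {"ex:form1", "ex:form2"}, "source": None},
    "ex:gal-n": {"canonical": "gal", "forms": set(), "source": "ex:gal-v"},
    "ex:gal-adj": {"canonical": "Gal", "forms": {"ex:form3"}, "source": "ex:gal-v"},
}


def effective_forms(lexicon, entry_id):
    """Own forms if any are specified, otherwise those of the source entry.

    Closed-world reading: an empty form set means "no forms specified".
    Assumes the source links contain no cycles.
    """
    entry = lexicon[entry_id]
    if entry["forms"] or entry["source"] is None:
        return set(entry["forms"])
    return effective_forms(lexicon, entry["source"])


def canonical_forms_match(lexicon, entry_id):
    """Check string identity of canonical forms, modulo capitalization."""
    entry = lexicon[entry_id]
    if entry["source"] is None or entry["canonical"] is None:
        return True  # nothing to compare; the target's canonical form is optional
    source = lexicon[entry["source"]]
    return entry["canonical"].casefold() == source["canonical"].casefold()


print(sorted(effective_forms(entries, "ex:gal-n")))    # inherited from ex:gal-v
print(sorted(effective_forms(entries, "ex:gal-adj")))  # own forms take precedence
print(all(canonical_forms_match(entries, e) for e in entries))
```

Under the open world assumption the first function would not be justified, since the absence of a lexicalForm statement would not mean that no form exists.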

Another application of this property would be the preposition-complementizer
ambiguity in English, the adjective-adverb "derivation" in German or the
preposition-adverb-particle ambiguity in most older West Germanic
languages, so I think there's enough lexicographic motivation.

Best,
Christian
>
>
> I hope this helps. I might be missing some other options of a solution
> involving lexicog, so, if you have any more ideas/suggestions, they are
> more than welcome!
> Best,
> Julia
> On Wed, 20 Mar 2019 at 11:39, <peikert.katrin@web.de> wrote:
>> Hello everyone,
>> I am currently trying to create an Ontolex model of the Electronic Penn
>> Sumerian Dictionary (EPSD2, http://oracc.museum.upenn.edu/epsd2/sux).
>> But several issues have arisen which are not easily solvable within the
>> current Ontolex version.
>> The first issue concerns the presentation of verbal prefixes in
>> Sumerian. While there are ways to describe different forms of the same
>> word, there does not seem to be a way to do so by describing the
>> underlying morphological process. As an example, consider the lexical
>> entry (dictionary entry) for gal:
>> http://oracc.museum.upenn.edu/epsd2/cbd/sux/sux.x0405180.html.
>> Under "verbal prefixes", it lists for example ba.i.n (i.e., ba.i.n.V,
>> which stands for the morphological gloss ba-i-n-gal, with three
>> inflectional prefixes and the verbal root). Beyond the morphological
>> segmentation, the analysis is not spelled out, but points to the
>> original attestation(s). In OntoLex, it is, however, already unclear
>> how to represent the morphological segmentation in the first place.
>> The second issue concerns senses and part of speech tags. In EPSD2 it
>> is possible for an entry to have a "general" part of speech tag, but
>> some senses of it have a different tag, e.g. "gal" (big), which is
>> characterized as a "V\i", but it can also mean "goblet", which is
>> tagged as "N". But since Ontolex does not allow a LexicalEntry to have
>> more than one part of speech tag, it is unclear to me how one could
>> model this phenomenon. The lexicog solution would be to use a
>> lexicog:Entry for "gal" in general, and three LexicalEntry-s for the
>> three parts of speech. The problem is that EPSD2 stores information
>> about the forms and their frequency for "gal", but not for gal with
>> senses [1-5], [6] or [7] separately. It is unclear which form of a word
>> is connected to which sense and how often this specific sense with a
>> specific form is used. Thus, if you try to have several LexicalEntries
>> of the same word, there is no way to preserve information about forms
>> and their frequencies, as we cannot automatically disambiguate the
>> forms. (Manually, an expert can to a certain extent: the upper-case
>> strings in the forms are determinatives, which specify certain semantic
>> types, e.g., the material an object consists of, indicating a nominal
>> or adjectival sense.)
>> It would be really great if a way could be found to solve these
>> issues.
>> Best regards,
>> Katrin Peikert
>> Goethe Universität Frankfurt am Main
>
>
> --
> Julia Bosque Gil
> PhD Student
> Ontology Engineering Group
> Departamento de Inteligencia Artificial
> Universidad Politécnica de Madrid



-- 
Prof. Dr. Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany

office: Robert-Mayer-Str. 11-15, #107
mail: chiarcos@informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28334
Received on Wednesday, 20 March 2019 14:55:00 UTC