Re: Issues concerning Morphology and Part of Speech Tags from John P. McCrae on 2019-03-20 (public-ontolex@w3.org from March 2019)

From: John P. McCrae <john.mccrae@insight-centre.org>
Date: Wed, 20 Mar 2019 13:28:17 +0000
To: Julia Bosque Gil <jbosque@fi.upm.es>
Cc: peikert.katrin@web.de, public-ontolex <public-ontolex@w3.org>
Message-ID: <CAHLDFnrhqWSva0iDUQAWDJ-bdPm+E45WNzKU6L5iB=MT_=y=jQ@mail.gmail.com>

Hi Katrin,

Thanks for your email.

I would note that there is currently a module on Morphology under
development:

https://www.w3.org/community/ontolex/wiki/Morphology

If you could contribute some of these issues to it, I think that would be
very helpful for the development of the module.

Regards,
John

On Wed, 20 Mar 2019 at 12:45, Julia Bosque Gil <jbosque@fi.upm.es> wrote:

> Dear Katrin,
>
> I will try to provide some possible solutions for your second issue,
> concerning the single part-of-speech tag and the lexicog:Entry approach.
> My answers are inline ;)
>
> The second issue concerns senses and part of speech tags. In EPSD2 it is
> possible for an entry to have a “general” part of speech tag, but some
> senses of it have a different tag, e.g. “gal” (big), which is characterized
> as a “V\i”, but it can also mean “goblet”, which is tagged as “N”. But
> since OntoLex does not allow a LexicalEntry to have more than one part of
> speech tag, it is unclear to me how one could model this phenomenon. The
> lexicog solution would be to use a lexicog:Entry for “gal” in general, and
> three LexicalEntry-s for the three parts of speech.
>
> Exactly, you would have *gal-v*, with senses [1-5], *gal-n* with sense
> [6], and *gal-adj* with sense [7].
>
> The problem is that EPSD2 stores information about the forms and their
> frequency for “gal”, but not for gal with senses [1-5], [6] or [7]
> separately. It is unclear which form of a word is connected to which sense
> and how often this specific sense with a specific form is used.
>
> From what I understood, since this information is not explicitly provided
> in the dictionary, there is no way of automatically distinguishing this
> case from those in which *all forms* go with *all senses* unless you take
> into account the case difference in the strings. I see three possible ways
> of representing this, one of them easier in terms of querying, but overkill
> and leading to a high number of triples. The other two are more concise but
> would create some lexical entries without a form, and you would need to
> query the dictionary entry to get them.
>
> a) [*lots* of triples] Since these entries look like those in which all
> forms go with all senses, each created LexicalEntry receives all the
> forms, which would need to be triplicated. The later disambiguation step
> would involve an update to remove those Forms that are not realisations
> of the lexical entry at hand.
>
> b) [more concise] Only one LexicalEntry receives *all forms* (e.g. let us
> say, arbitrarily, the one with the first sense, so gal-v), which might not
> be correct, but in this way there are no ontolex:Forms without a LexicalEntry.
> The other two LexicalEntries would not have a lexical form, but the
> lexicog:Entry would consist of LexicographicComponents that point to them
> via *describes*. lexicog:LexicographicComponents can also describe
> ontolex:Forms, since the range of the describes property is owl:Thing. If
> you state that the lexicog:Entry that includes components describing the
> three ontolex:LexicalEntries also has more components, each describing a
> Form, you can later on get a list of all the forms described in that
> dictionary entry. In this way, if you want to access the potential forms
> that would go with *gal-adj* or *gal-n*, you would need to run a SPARQL
> query along the lines of: “Given that *gal-n* is described by a
> LexicographicComponent which is rdfs:member of a lexicog:Entry, give me all
> the ontolex:Forms that are described by LexicographicComponents which are
> also rdfs:member of that same lexicog:Entry”. Alternatively: “Given that
> *gal-n* is described by a LexicographicComponent which is rdfs:member of
> a lexicog:Entry, give me all the ontolex:Forms of other LexicalEntries that
> are described by LexicographicComponents which are also rdfs:member of that
> same lexicog:Entry”, and then you would get the forms linked to *gal-v*.
> For the last query you actually would not need to create
> LexicographicComponents describing Forms, because you access them via
> *gal-v* (unless you consider that the EPSD has indeed a section in that
> entry devoted to form description and you want to capture that).
>
> c) Just like (b), but the LexicographicComponents of the lexicog:Entry
> would not describe ontolex:LexicalEntries, but ontolex:LexicalSenses. This
> depends on how exactly you want to recreate the original structure that you
> have in the EPSD2.
>
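The second query described under (b) can be sketched over a toy in-memory triple set. This is only an illustration, not part of the original message: the identifiers (`:gal`, `:comp_v`, `:form1`, and so on) and the helper names are hypothetical, the forms are attached to gal-v arbitrarily as in option (b), and a real dataset would be queried with SPARQL against an RDF store rather than with Python sets.

```python
# Toy triples for option (b). All resource names are hypothetical.
TRIPLES = {
    (":gal", "rdf:type", "lexicog:Entry"),
    (":gal", "rdfs:member", ":comp_v"),
    (":gal", "rdfs:member", ":comp_n"),
    (":gal", "rdfs:member", ":comp_adj"),
    (":comp_v", "lexicog:describes", ":gal-v"),
    (":comp_n", "lexicog:describes", ":gal-n"),
    (":comp_adj", "lexicog:describes", ":gal-adj"),
    # Only gal-v carries the forms (an arbitrary choice, as in option b).
    (":gal-v", "ontolex:lexicalForm", ":form1"),
    (":gal-v", "ontolex:lexicalForm", ":form2"),
}


def objects(subject, predicate):
    """All o such that (subject, predicate, o) is in TRIPLES."""
    return {o for (s, p, o) in TRIPLES if s == subject and p == predicate}


def subjects(predicate, obj):
    """All s such that (s, predicate, obj) is in TRIPLES."""
    return {s for (s, p, o) in TRIPLES if p == predicate and o == obj}


def sibling_forms(lexical_entry):
    """Forms of the *other* lexical entries described in the same lexicog:Entry."""
    forms = set()
    for component in subjects("lexicog:describes", lexical_entry):
        for dict_entry in subjects("rdfs:member", component):
            for sibling in objects(dict_entry, "rdfs:member"):
                for described in objects(sibling, "lexicog:describes"):
                    if described != lexical_entry:
                        forms |= objects(described, "ontolex:lexicalForm")
    return forms


print(sorted(sibling_forms(":gal-n")))
```

Asking for the sibling forms of `:gal-n` walks from its component up to the dictionary entry and back down to `:gal-v`, which is the only entry holding forms in this toy data.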
> I hope this helps. I might be missing other possible solutions
> involving *lexicog*, so if you have any more ideas or suggestions, they are
> more than welcome!
>
> Best,
>
> Julia
>
> On Wed, 20 Mar 2019 at 11:39, <peikert.katrin@web.de> wrote:
>
>> Hello everyone,
>>
>> I am currently trying to create an OntoLex model of the Electronic Penn
>> Sumerian Dictionary (EPSD2, http://oracc.museum.upenn.edu/epsd2/sux).
>> But several issues have arisen which are not easily solvable within the
>> current OntoLex version.
>>
>> The first issue concerns the representation of verbal prefixes in
>> Sumerian. While there are ways to describe different forms of the same
>> word, there does not seem to be a way to do so by describing the
>> underlying morphological process. As an example, consider the lexical
>> entry (dictionary entry) for gal:
>> http://oracc.museum.upenn.edu/epsd2/cbd/sux/sux.x0405180.html
>> Under "verbal prefixes", it lists for example ba.i.n (i.e., ba.i.n.V,
>> which stands for the morphological gloss ba-i-n-gal, with three
>> inflectional prefixes and the verbal root). Beyond the morphological
>> segmentation, the analysis is not spelled out; the entry only points to
>> the original attestation(s). In OntoLex, however, it is already unclear
>> how to represent the morphological segmentation in the first place.
>>
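To make the segmentation concrete, here is a minimal sketch of how the prefix-chain notation described above could be expanded into morph segments. It assumes, based only on the example in this message, that dots separate slots and that `V` marks the (possibly implicit) verbal root; the function name is hypothetical and nothing here is an EPSD2 or OntoLex API.

```python
from typing import List


def expand_prefix_chain(chain: str, root: str) -> List[str]:
    """Expand a chain such as 'ba.i.n' or 'ba.i.n.V' into morph segments.

    Assumption: slots are dot-separated and 'V' stands for the verbal root,
    which is appended when the chain leaves it implicit.
    """
    slots = chain.split(".")
    if slots[-1] != "V":
        slots.append("V")  # the root slot is implicit in 'ba.i.n'
    return [root if slot == "V" else slot for slot in slots]


segments = expand_prefix_chain("ba.i.n", "gal")
print("-".join(segments))
```

The resulting list keeps the order of the three prefixes and the root, which is the information a morphology module would need to attach to a form.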
>> The second issue concerns senses and part of speech tags. In EPSD2 it is
>> possible for an entry to have a "general" part of speech tag, but some
>> senses of it have a different tag, e.g. "gal" (big), which is
>> characterized as a "V\i", but it can also mean "goblet", which is tagged
>> as "N". But since OntoLex does not allow a LexicalEntry to have more than
>> one part of speech tag, it is unclear to me how one could model this
>> phenomenon. The lexicog solution would be to use a lexicog:Entry for
>> "gal" in general, and three LexicalEntry-s for the three parts of speech.
>> The problem is that EPSD2 stores information about the forms and their
>> frequency for "gal", but not for gal with senses [1-5], [6] or [7]
>> separately. It is unclear which form of a word is connected to which
>> sense and how often this specific sense with a specific form is used.
>> Thus, if you try to have several LexicalEntries of the same word, there
>> is no way to preserve information about forms and their frequencies, as
>> we cannot automatically disambiguate the forms. (An expert can do so
>> manually to a certain extent: the upper-case strings in the forms are
>> determinatives, which specify certain semantic types, e.g. the material
>> an object consists of, indicating a nominal or adjectival sense.)
>>
>> It would be really great if a way could be found to solve these
>> issues.
>>
>>
>> Best regards,
>> Katrin Peikert
>>
>>
>> *Goethe Universität*
>> *Frankfurt am Main*
>>
>
>
> --
>
> Julia Bosque Gil
> PhD Student
> Ontology Engineering Group <http://www.oeg-upm.net/>
> Departamento de Inteligencia Artificial
> Universidad Politécnica de Madrid
>
Received on Wednesday, 20 March 2019 13:28:54 UTC