Re: Issues concerning Morphology and Part of Speech Tags from Julia Bosque Gil on 2019-03-20 (public-ontolex@w3.org from March 2019)

From: Julia Bosque Gil <jbosque@fi.upm.es>
Date: Wed, 20 Mar 2019 13:45:35 +0100
To: peikert.katrin@web.de
Cc: public-ontolex <public-ontolex@w3.org>
Message-ID: <CA+B92MuHFQ8BbqyYh+vWByUTV5z=k_SHuEcuv0rcJiUah==p8g@mail.gmail.com>
Dear Katrin,

I will try to suggest some possible solutions for your second issue,
concerning the single part-of-speech tag and the lexicog:Entry approach.
My answers are inline ;)

> The second issue concerns senses and part of speech tags. In EPSD2 it is
> possible for an entry to have a “general” part of speech tag, but some
> senses of it have a different tag e.g. “gal” (big), which is characterized
> as a “V\i”, but it can also mean “goblet”, which is tagged as “N”. But
> since Ontolex does not allow a LexicalEntry to have more than one part of
> speech tag, it is unclear to me how one could model this phenomenon. The
> lexicog solution would be to use a lexicog:Entry for “gal” in general, and
> three LexicalEntry-s for the three parts of speech.

Exactly, you would have *gal-v*, with senses [1-5], *gal-n* with sense [6],
and *gal-adj* with sense [8].

> The problem is that EPSD2 stores information about the forms and their
> frequency for “gal”, but not for gal with senses [1-5], [6] or [7]
> separately. It is unclear which form of a word is connected to which sense
> and how often this specific sense with a specific form is used.

From what I understand, since this information is not explicitly provided
in the dictionary, there is no way of automatically distinguishing this
case from those in which *all forms* go with *all senses*, unless you take
into account the upper-case strings in the forms. I see three possible ways
of representing this. One of them is easier in terms of querying, but it is
overkill and leads to a high number of triples. The other two are more
concise, but they would create some lexical entries without a form, and you
would need to query the dictionary entry to get the forms.

a) [*lots* of triples] Since these entries look just like those in which
all forms go with all senses, each created LexicalEntry receives all the
forms, which would need to be triplicated. The disambiguation step in the
future would involve an update to remove those Forms that are not
realisations of the lexical entry at hand.
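
A minimal sketch of option (a), written in plain Python with triples as
tuples rather than with an RDF library. The entry names follow this message;
the written representations (rep_1, ...) and the form identifiers are
hypothetical placeholders, not EPSD2 data.

```python
# Hypothetical placeholders: the entry names follow the message; the
# written representations are NOT real EPSD2 data.
ENTRIES = ("gal-v", "gal-n", "gal-adj")
WRITTEN_REPS = ("rep_1", "rep_2", "rep_3")


def attach_all_forms(entries, written_reps):
    """Option (a): give every LexicalEntry its own copy of every form."""
    graph = set()
    for entry in entries:
        for rep in written_reps:
            form = f"{entry}/form/{rep}"  # one Form node per entry and form
            graph.add((entry, "ontolex:lexicalForm", form))
            graph.add((form, "ontolex:writtenRep", rep))
    return graph


def remove_form(graph, entry, rep):
    """Later disambiguation: drop a form that does not realise `entry`."""
    form = f"{entry}/form/{rep}"
    return {t for t in graph if form not in (t[0], t[2])}


graph = attach_all_forms(ENTRIES, WRITTEN_REPS)
pruned = remove_form(graph, "gal-n", "rep_1")
print(len(graph), len(pruned))
```

The triple count grows with (number of entries) x (number of forms), which
is the cost noted for this option; the later clean-up only ever deletes
triples.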

b) [more concise] Only one LexicalEntry receives *all forms* (e.g. let us
say, arbitrarily, the one with the first sense, so *gal-v*), which might not
be correct, but in this way there are no ontolex:Forms without a
LexicalEntry. The other two LexicalEntries would not have a lexical form,
but the lexicog:Entry would consist of LexicographicComponents that point to
them via *describes*. lexicog:LexicographicComponents can also describe
ontolex:Forms, since the range of the describes property is owl:Thing. If
you state that the lexicog:Entry that includes components describing the
three ontolex:LexicalEntries also has more components, each describing a
Form, you can later on get a list of all the forms described in that
dictionary entry.

In this way, if you want to access the potential forms that would go with
*gal-adj* or *gal-n*, you would need to perform a query in SPARQL: “Given
that *gal-n* is described by a LexicographicComponent which is rdfs:member
of a lexicog:Entry, give me all the ontolex:Forms that are described by
LexicographicComponents which are also rdfs:member of that same
lexicog:Entry”.

Alternatively: “Given that *gal-n* is described by a LexicographicComponent
which is rdfs:member of a lexicog:Entry, give me all the ontolex:Forms of
other LexicalEntries that are described by LexicographicComponents which are
also rdfs:member of that same lexicog:Entry”, and then you would get the
forms linked to *gal-v*. For this second query you would not actually need
to create LexicographicComponents describing Forms, because you access them
via *gal-v* (unless you consider that the EPSD does have a section in that
entry devoted to form description and you want to capture that).
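
The two lookups described in (b) can be sketched in plain Python over a set
of triples (a stand-in for the SPARQL queries, not a replacement for them).
The identifiers entry_gal, comp_0 ... comp_4, form_1 and form_2 are
hypothetical, and the sketch assumes the membership triples run from the
lexicog:Entry to its components.

```python
# All identifiers except the gal-* entry names are hypothetical.
LEXICAL_ENTRIES = ("gal-v", "gal-n", "gal-adj")
FORMS = ("form_1", "form_2")

G = set()
# One LexicographicComponent per described resource, all in one lexicog:Entry.
for i, target in enumerate(LEXICAL_ENTRIES + FORMS):
    G.add(("entry_gal", "rdfs:member", f"comp_{i}"))
    G.add((f"comp_{i}", "lexicog:describes", target))
# Option (b): only gal-v receives the forms.
for form in FORMS:
    G.add((form, "rdf:type", "ontolex:Form"))
    G.add(("gal-v", "ontolex:lexicalForm", form))


def subjects(pred, obj):
    return {s for (s, p, o) in G if p == pred and o == obj}


def objects(subj, pred):
    return {o for (s, p, o) in G if s == subj and p == pred}


def described_alongside(lexical_entry):
    """Resources described by components of the same dictionary entry."""
    found = set()
    for comp in subjects("lexicog:describes", lexical_entry):
        for dict_entry in subjects("rdfs:member", comp):
            for sibling in objects(dict_entry, "rdfs:member"):
                found |= objects(sibling, "lexicog:describes")
    return found


def forms_via_components(lexical_entry):
    """First query: Forms described directly by sibling components."""
    return sorted(x for x in described_alongside(lexical_entry)
                  if (x, "rdf:type", "ontolex:Form") in G)


def forms_via_other_entries(lexical_entry):
    """Second query: Forms of the other LexicalEntries in the same entry."""
    forms = set()
    for other in described_alongside(lexical_entry) - {lexical_entry}:
        forms |= objects(other, "ontolex:lexicalForm")
    return sorted(forms)


print(forms_via_components("gal-n"))
print(forms_via_other_entries("gal-n"))
```

In this toy graph both lookups return the same two forms for *gal-n*; the
second one would keep working even if the components describing Forms were
left out, since it reaches the forms through *gal-v*.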

c) Just like (b), but the LexicographicComponents of the lexicog:Entry
would not describe ontolex:LexicalEntries, but ontolex:LexicalSenses. This
depends on how exactly you want to recreate the original structure that you
have in the EPSD2.

I hope this helps. I may be missing other options for a solution involving
*lexicog*, so if you have any more ideas or suggestions, they are more than
welcome!

Best,

Julia

On Wed, 20 Mar 2019 at 11:39, <peikert.katrin@web.de> wrote:

> Hello everyone,
>
> I am currently trying to create an Ontolex model of the Electronic Penn
> Sumerian Dictionary (EPSD2, http://oracc.museum.upenn.edu/epsd2/sux).
> But several issues have arisen, which are not easily solvable within the
> current Ontolex version.
>
> The first issue concerns the presentation of verbal prefixes in Sumerian.
> While there are ways to describe different forms of the same word, there
> does not seem to be a way to do so by describing the underlying
> morphological process. As an example, consider the lexical entry
> (dictionary entry) for gal:
> <http://oracc.museum.upenn.edu/epsd2/cbd/sux/sux.x0405180.html>.
> Under "verbal prefixes", it lists for example ba.i.n (i.e., ba.i.n.V,
> which stands for the morphological gloss ba-i-n-gal, with three
> inflectional prefixes and the verbal root). Beyond the morphological
> segmentation, the analysis is not spelled out, but points to the original
> attestation(s). In OntoLex, however, it is already unclear how to
> represent the morphological segmentation in the first place.
>
> The second issue concerns senses and part of speech tags. In EPSD2 it is
> possible for an entry to have a "general" part of speech tag, but some
> senses of it have a different tag e.g. "gal" (big), which is characterized
> as a "V\i", but it can also mean "goblet", which is tagged as "N". But
> since Ontolex does not allow a LexicalEntry to have more than one part of
> speech tag, it is unclear to me how one could model this phenomenon. The
> lexicog solution would be to use a lexicog:Entry for "gal" in general, and
> three LexicalEntry-s for the three parts of speech. The problem is that
> EPSD2 stores information about the forms and their frequency for "gal",
> but not for gal with senses [1-5], [6] or [7] separately. It is unclear
> which form of a word is connected to which sense and how often this
> specific sense with a specific form is used. Thus, if you try to have
> several LexicalEntries of the same word, there is no way to preserve
> information about forms and their frequencies, as we cannot automatically
> disambiguate the forms. (Manually an expert can to a certain extent: the
> upper case strings in the forms are determinatives, which specify certain
> semantic types, e.g., the material an object consists of, indicating a
> nominal or adjectival sense.)
>
> It would be really great if a way could be found to solve these issues.
>
>
> Best regards,
> Katrin Peikert
>
>
> *Goethe Universität *
> *Frankfurt am Main*
>


-- 

Julia Bosque Gil
PhD Student
Ontology Engineering Group <http://www.oeg-upm.net/>
Departamento de Inteligencia Artificial
Universidad Politécnica de Madrid
Received on Wednesday, 20 March 2019 12:44:05 UTC