Re: frequency dictionaries? from John McCrae on 2018-07-02 (public-ontolex@w3.org from July 2018)

From: John McCrae <john.mccrae@insight-centre.org>
Date: Mon, 2 Jul 2018 12:25:34 +0100
To: Christian Chiarcos <chiarcos@informatik.uni-frankfurt.de>
Cc: public-ontolex <public-ontolex@w3.org>, Christian Chiarcos <christian.chiarcos@web.de>
Message-ID: <CAHLDFnpT08kP2eySxEc6VTx9uQwiyditPKHKY2hyYSER=2GqjA@mail.gmail.com>
On Fri 29 Jun 2018, 21:15 Christian Chiarcos
<chiarcos@informatik.uni-frankfurt.de> wrote:

> Hi John,
>
> thanks for the immediate response.
>
> lexinfo:frequency is an object property with three values commonlyUsed,
> infrequentlyUsed and rarelyUsed. ... (Also please don't abuse confidence...
> this is the lexicographer's confidence)
>
> Yes, isocat frequency was intended to model how common a word feels to the
> lexicographer. Not corpus frequency, of course. I use a special-purpose
> property, then.
>
Yes, I think this is best.

>
> Modelling frequency just by giving a single integer is fairly useless...
> knowing that 'cat' occurs 100 times is pointless if I don't know which
> corpus or at least how large the corpus is.
>
>
> Right, hence the provenance question; and also the wish to record both
> prepositional and subordinating uses within the same lexical entry (as the
> numbers within the LexicalEntry represent the total). Only absolute numbers
> allow aggregation over multiple sources. But to conform to the open world
> assumption, one should probably provide both absolute and the relative
> frequency. The relative frequency is in fact related to confidence, as this
> is (an approximation of) the unconditioned probability of a property to
> hold in a corpus, and since the Brown corpus, these (should) have been
> guiding (some) lexicographers' intuitions.
>
I don't see the conflict with the open world assumption. The OWA concerns
the validity of missing data, whereas the frequency of a word in a corpus
can be stated without any need to infer anything about missing data.

The connection between relative frequency and confidence is very weak for
human lexicographers; that is, high-frequency terms can be harder to define
than many low-frequency terms.

>
> From https://www.w3.org/2016/05/ontolex/#lexical-entries
>
> A lexical entry represents a unit of analysis of the lexicon that consists
> of a set of forms that are grammatically related and a set of base meanings
> that are associated with all of these forms. Thus, a lexical entry is a
> word, multiword expression or affix with *a single part-of-speech,*
> morphological pattern, etymology and set of senses.
>
>
> All right.
> The deeper issues here are that (a) linguists have no agreement on part of
> speech inventories, and (b) many languages feature "zero derivation", where
> words of one POS category can be used for another without difference in
> form or sense. Wrt. (a), one may claim that English does not have a
> preposition category that is distinct from subordinators, but rather that
> English preposition is a subclass of subordinating conjunction. This is a
> bit of a radical view, but it is not unheard of that languages implement
> subordination by nominalization, and this seems to be the idea behind the
> Penn tag. The solution for (a) is to use a language-specific inventory, of
> course ;)
> Wrt. (b), German adjectives can be used as adverbs, and we will probably
> think of both having distinct POS; also, every infinitive is formally
> identical to the deverbal noun. Shall we duplicate all lexical entries for
> adjectives and verbs? The problem is not so much the one-POS-per-entry
> constraint (which doesn't create much overhead), but the fact that "[t]he
> lexical sense has a single lexical entry", and then the sense definitions
> need to be duplicated, as well. Why would we need to record both the
> "zero-derived" and the "underlying" lexical entry if the senses are
> identical? Well, we might want to record attestations or frequencies. If we
> don't, the current lemon definitions (plus some user knowledge about zero
> derivation) are fine.
> An alternative ontolex extension would be to introduce a "zero derivation"
> property in a future morphology module that holds between lexical entries
> of different POS and that *requires* the derived entry to inherit the
> canonical form and the senses of the "underlying" lexical entry. In many
> ways this would be like casting a variable in programming languages, and
> indeed, "castOf" could be a better name than "zeroDerivedFrom".
>
The recommendation here is to use the lexicographical norms for your
language; that is, if most dictionaries don't list a zero derivation, then
there is no need to list it in an OntoLex model. The morphology module
should contain exactly the kinds of derivations you describe and codify
them in a generative manner.

>
> It seems that Penn for some reason conflates two distinct part-of-speech
> values. I would recommend introducing a new part-of-speech value and using
> OWL axioms to state its relation to LexInfo (perhaps there is some Ontology
> of Linguistic Annotation that could be helpful here ;)
>
>
> The OLiA Penn ontology doesn't immediately help as it implements the
> semantics of IN by means of a disjunction. As I need to express that both
> categories apply, this could be represented with intersection of OWL
> classes:
>
> _:after lexinfo:partOfSpeech [ a olia:SubordinatingConjunction,
> olia:Adposition ] .
>
> (The example really needs OLiA, as subordinatingConjunction doesn't have
> an accompanying class in lexinfo.)
>
> But this is problematic because I want to count conjunction and adposition
> usage separately. (No RDF triple to reify.)
>
You need to introduce a part of speech model that makes sense relative to
the norms for the language and makes the data work for your application. I
would advise against using blank nodes as the value of the part of speech
property.
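
As an illustration of this advice, a minimal Turtle sketch of a named
part-of-speech value is given here. The my: namespace, the value name
my:prepositionOrSubordinator and the entry IRI my:after are hypothetical,
and typing the value as lexinfo:PartOfSpeech is an assumption about how it
would be attached to LexInfo. Axioms relating it to existing LexInfo or
OLiA categories would still have to be added.

```turtle
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix lexinfo: <http://www.lexinfo.net/ontology/2.0/lexinfo#> .
@prefix my:      <http://example.org/my#> .   # hypothetical namespace

# A named (IRI-identified) part-of-speech value instead of a blank node.
my:prepositionOrSubordinator
    a lexinfo:PartOfSpeech ;   # assumed typing, see the note above
    rdfs:label "preposition or subordinating conjunction"@en ;
    rdfs:comment "Hypothetical value corresponding to the Penn tag IN."@en .

# The entry keeps exactly one part of speech, and that value has an IRI
# that other statements (counts, provenance) can refer to.
my:after
    a ontolex:LexicalEntry ;
    lexinfo:partOfSpeech my:prepositionOrSubordinator .
```

Because the value has an IRI, it can be reused across entries and datasets,
which is not possible with a blank node.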

>
> - In order to annotate lexinfo:confidence to parts of speech, I had to
>> (RDF-)reify lexinfo:partOfSpeech. Is there another way?
>>
> Yes, this is standard practice and pretty much unavoidable.
>
>
> ok.
>
> - I would like to aggregate counts from multiple corpora. What is the
>> current/recommended treatment of provenance in ontolex?
>>
> I recommend the use of the PROV-O ontology. Is there a more specific
> issue here?
>
>
> That's what I thought. prov:wasDerivedFrom would work for a reified
> lexinfo:partOfSpeech property.
>
> I think the least abusive way of applying ontolex here would be to use an
> application-specific frequency property and multiple lexinfo:partOfSpeech
> properties, with a blank node as argument and one associated OLiA class
> each. As the blank node does not exhibit a unique reference, all blank
> nodes could in theory resolve to the same URI, so formally, the
> one-POS-per-entry constraint isn't broken. But this clearly is a hack and
> I'm not sure this should be recommended.
>
> So, we have
>
> [ a rdf:Statement; rdf:subject _:after; rdf:predicate
> lexinfo:partOfSpeech; rdf:object [ a olia:SubordinatingConjunction ];
> prov:wasDerivedFrom <https://catalog.ldc.upenn.edu/ldc99t42>; my:freq
> "10"; lexinfo:confidence "0.1" ].
> [ a rdf:Statement; rdf:subject _:after; rdf:predicate
> lexinfo:partOfSpeech; rdf:object [ a olia:Adposition ];
> prov:wasDerivedFrom <https://catalog.ldc.upenn.edu/ldc99t42>;
> my:freq "90"; lexinfo:confidence "0.9" ].
>
I think you need to introduce a dedicated model, as I can't see this kind
of modelling encouraging reuse and semantic interoperability. For OntoLex,
it would be great if you could propose a model that could be introduced
into the lexicography module.
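
Purely as an illustration of what a dedicated model might look like, and
not as an agreed OntoLex vocabulary, the sketch below gives each corpus
count its own named node. Every my: term (my:FrequencyObservation,
my:observedEntry, my:observedCategory, my:absoluteFrequency) and the
example.org namespace are hypothetical. The counts and the corpus URL are
the ones quoted in this thread.

```turtle
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix olia: <http://purl.org/olia/olia.owl#> .
@prefix my:   <http://example.org/my#> .   # hypothetical namespace

# One named node per (entry, usage category, corpus) observation.
# No reification and no blank nodes are needed.
my:after_conj_freq
    a my:FrequencyObservation ;
    my:observedEntry my:after ;
    my:observedCategory olia:SubordinatingConjunction ;
    my:absoluteFrequency "10"^^xsd:nonNegativeInteger ;
    prov:wasDerivedFrom <https://catalog.ldc.upenn.edu/ldc99t42> .

my:after_adp_freq
    a my:FrequencyObservation ;
    my:observedEntry my:after ;
    my:observedCategory olia:Adposition ;
    my:absoluteFrequency "90"^^xsd:nonNegativeInteger ;
    prov:wasDerivedFrom <https://catalog.ldc.upenn.edu/ldc99t42> .
```

Keeping the counts on separate nodes leaves the part of speech of the
lexical entry itself untouched, and absolute counts from further corpora
can be added as additional observations. No corpus size is given here
because none is stated in the thread.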

Regards,
John

>
> A few triples more than in my original modelling, but it would work for me.
>
> Thanks,
> Christian
>
> PS: In fact, with the blank objects, we can have something almost
> equivalent without reification:
>
> _:after lexinfo:partOfSpeech [ a olia:SubordinatingConjunction;
> prov:wasDerivedFrom <https://catalog.ldc.upenn.edu/ldc99t42>; my:freq
> "10"; lexinfo:confidence "0.1" ].
> _:after lexinfo:partOfSpeech [ a olia:Adposition;
> prov:wasDerivedFrom <https://catalog.ldc.upenn.edu/ldc99t42>;
> my:freq "90"; lexinfo:confidence "0.9" ].
>
> This means we have multiple lexical-entry-specific POS categories. This is
> much more readable but less precise, as unifying both blank nodes just
> gives nonsense.
>
Received on Monday, 2 July 2018 11:26:09 UTC