Re: frequency dictionaries? from Christian Chiarcos on 2018-06-29 (public-ontolex@w3.org from June 2018)

From: Christian Chiarcos <chiarcos@informatik.uni-frankfurt.de>
Date: Fri, 29 Jun 2018 22:15:37 +0200
To: "John McCrae" <john.mccrae@insight-centre.org>
Cc: "public-ontolex@w3.org" <public-ontolex@w3.org>, "christian.chiarcos@web.de" <christian.chiarcos@web.de>
Message-ID: <op.zldzwbsy89jat0@kitaba.home>
Hi John,

thanks for the immediate response.

> lexinfo:frequency is an object property with three values commonlyUsed,  
> infrequentlyUsed and rarelyUsed. ... (Also please don't abuse  
> confidence... this is the lexicographer's confidence)
Yes, the ISOcat frequency was intended to model how common a word feels to
the lexicographer, not corpus frequency, of course. I'll use a
special-purpose property, then.

> Modelling frequency just by giving a single integer is fairly useless...  
> knowing that 'cat' occurs 100 times is pointless if I don't know which  
> corpus or at least how large the corpus is.

Right, hence the provenance question; and also the wish to record both  
prepositional and subordinating uses within the same lexical entry (as the  
numbers within the LexicalEntry represent the total). Only absolute  
numbers allow aggregation over multiple sources. But to conform to the
open world assumption, one should probably provide both the absolute and
the relative frequency. The relative frequency is in fact related to
confidence, as it is (an approximation of) the unconditioned probability
that a property holds in a corpus, and since the Brown corpus, such figures
(should) have been guiding (some) lexicographers' intuitions.
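
To make the aggregation point concrete, here is a minimal Python sketch. It assumes two corpora of different size: the first reuses the 10/90 split from my example further down, the second ("hypothetical_corpus") and its counts are invented purely for illustration. Summing absolute counts gives a consistent total, whereas averaging the per-corpus relative frequencies does not.

```python
from collections import Counter

# Absolute counts per corpus and POS for the entry "after".
# "ptb" reuses the 10/90 split from this thread; "hypothetical_corpus"
# and its numbers are made up for illustration only.
counts_by_corpus = {
    "ptb": Counter({"SubordinatingConjunction": 10, "Adposition": 90}),
    "hypothetical_corpus": Counter({"SubordinatingConjunction": 300, "Adposition": 100}),
}


def aggregate(counts_by_corpus):
    """Sum absolute counts over all corpora."""
    total = Counter()
    for counts in counts_by_corpus.values():
        total.update(counts)
    return total


def relative(counts):
    """Turn absolute counts into relative frequencies that sum to 1."""
    n = sum(counts.values())
    return {pos: c / n for pos, c in counts.items()}


total = aggregate(counts_by_corpus)
aggregated_rel = relative(total)

# Naive alternative: average the per-corpus relative frequencies,
# which ignores that the corpora differ in size.
per_corpus_rel = [relative(c) for c in counts_by_corpus.values()]
naive_mean = sum(r["SubordinatingConjunction"] for r in per_corpus_rel) / len(per_corpus_rel)

print(total)
print(aggregated_rel["SubordinatingConjunction"], naive_mean)
```

The two printed values for the conjunction reading differ, which is why I would publish the absolute numbers and treat the relative ones as derived.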

> From https://www.w3.org/2016/05/ontolex/#lexical-entries
>
> A lexical entry represents a unit of analysis of the lexicon that
> consists of a set of forms that are grammatically related and a set of
> base meanings that are associated with all of these forms. Thus, a
> lexical entry is a word, multiword expression or affix with a single
> part-of-speech, morphological pattern, etymology and set of senses.

All right.
The deeper issues here are that (a) linguists have no agreement on part of  
speech inventories, and (b) many languages feature "zero derivation",  
where words of one POS category can be used for another without difference  
in form or sense. Wrt. (a), one may claim that English does not have a  
preposition category that is distinct from subordinators, but rather that  
English preposition is a subclass of subordinating conjunction. This is a  
bit of a radical view, but it is not unheard of that languages implement  
subordination by nominalization, and this seems to be the idea behind the  
Penn tag. The solution for (a) is to use a language-specific inventory, of  
course ;)
Wrt. (b), German adjectives can be used as adverbs, and we would probably
think of the two as having distinct POS; also, every infinitive is formally
identical to the deverbal noun. Shall we duplicate all lexical entries for
adjectives and verbs? The problem is not so much the one-POS-per-entry  
constraint (which doesn't create much overhead), but the fact that "[t]he  
lexical sense has a single lexical entry", and then the sense definitions  
need to be duplicated, as well. Why would we need to record both the  
"zero-derived" and the "underlying" lexical entry if the senses are  
identical? Well, we might want to record attestations or frequencies. If  
we don't, the current lemon definitions (plus some user knowledge about  
zero derivation) are fine.
An alternative ontolex extension would be to introduce a "zero derivation"
property in a future morphology module that holds between lexical entries
of different POS and that *requires* the derived entry to inherit the
canonical form and the senses of the "underlying" lexical entry. In many
ways this would be like casting a variable in a programming language, and
indeed, "castOf" could be a better name than "zeroDerivedFrom".
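
As a rough illustration of the casting analogy, here is a minimal Python sketch. Everything in it is hypothetical: the LexicalEntry class, the cast_of attribute and the resolved_* helpers are names I made up, not part of ontolex or any proposed module, and "schnell" is just a convenient German adjective that also serves as an adverb.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LexicalEntry:
    pos: str
    canonical_form: Optional[str] = None
    senses: List[str] = field(default_factory=list)
    # Hypothetical "castOf" link to the underlying entry.
    cast_of: Optional["LexicalEntry"] = None

    def resolved_form(self) -> Optional[str]:
        """Canonical form, inherited from the underlying entry if cast."""
        if self.cast_of is not None:
            return self.cast_of.resolved_form()
        return self.canonical_form

    def resolved_senses(self) -> List[str]:
        """Senses, inherited from the underlying entry if cast."""
        if self.cast_of is not None:
            return self.cast_of.resolved_senses()
        return self.senses


# The adjective entry carries form and senses; the adverb entry only
# states its own POS and points back at the adjective.
schnell_adj = LexicalEntry(pos="adjective", canonical_form="schnell", senses=["quick"])
schnell_adv = LexicalEntry(pos="adverb", cast_of=schnell_adj)

print(schnell_adv.pos, schnell_adv.resolved_form(), schnell_adv.resolved_senses())
```

The cast entry has its own identity, so attestations or frequencies could be attached to it without duplicating the sense definitions.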

> It seems that Penn for some reason conflates two distinct part-of-speech
> values. I would recommend introducing a new part-of-speech value and
> using OWL axioms to state its relation to LexInfo (perhaps there is
> some Ontology of Linguistic Annotation that could be helpful here ;)

The OLiA Penn ontology doesn't immediately help as it implements the  
semantics of IN by means of a disjunction. As I need to express that both  
categories apply, this could be represented with intersection of OWL  
classes:

_:after lexinfo:partOfSpeech [ a olia:SubordinatingConjunction,  
olia:Adposition ] .

(The example really needs OLiA, as subordinatingConjunction doesn't have  
an accompanying class in lexinfo.)

But this is problematic because I want to count conjunction and adposition
usage separately. (There is only a single lexinfo:partOfSpeech triple here,
so there is no per-category triple to reify.)

>> - In order to annotate lexinfo:confidence to parts of speech, I had to  
>> (RDF-)reify lexinfo:partOfSpeech. Is there another way?
> Yes, this is standard practice and pretty much unavoidable.

ok.

>> - I would like to aggregate counts from multiple corpora. What is the  
>> current/recommended treatment of provenance in ontolex?
> I recommend the use of the PROV-O ontology. Is there are more specific  
> issue here.

That's what I thought. prov:wasDerivedFrom would work for a reified  
lexinfo:partOfSpeech property.

I think the least abusive way of applying ontolex here would be to use an  
application-specific frequency property and multiple lexinfo:partOfSpeech  
properties, with a blank node as argument and one associated OLiA class  
each. As the blank node does not exhibit a unique reference, all blank  
nodes could in theory resolve to the same URI, so formally, the  
one-POS-per-entry constraint isn't broken. But this clearly is a hack and  
I'm not sure this should be recommended.

So, we have

[ a rdf:Statement; rdf:subject _:after; rdf:predicate
lexinfo:partOfSpeech; rdf:object [ a olia:SubordinatingConjunction ];
prov:wasDerivedFrom <https://catalog.ldc.upenn.edu/ldc99t42>; my:freq
"10"; lexinfo:confidence "0.1" ].
[ a rdf:Statement; rdf:subject _:after; rdf:predicate
lexinfo:partOfSpeech; rdf:object [ a olia:Adposition ];
prov:wasDerivedFrom <https://catalog.ldc.upenn.edu/ldc99t42>; my:freq "90";
lexinfo:confidence "0.9" ].

A few triples more than in my original modelling, but it would work for me.

Thanks,
Christian

PS: In fact, with the blank objects, we can have something almost  
equivalent without reification:

_:after lexinfo:partOfSpeech [ a olia:SubordinatingConjunction;  
prov:wasDerivedFrom <https://catalog.ldc.upenn.edu/ldc99t42>; my:freq  
"10"; lexinfo:confidence "0.1" ].
_:after lexinfo:partOfSpeech [ a olia:Adposition; prov:wasDerivedFrom
<https://catalog.ldc.upenn.edu/ldc99t42>; my:freq "90"; lexinfo:confidence
"0.9" ].

This means we have multiple lexical-entry-specific POS categories. This is
much more readable but less precise, as unifying the two blank nodes would
just give nonsense: a single node carrying both classes, both counts and
both confidence values.
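
A small Python sketch of what goes wrong on unification, assuming each blank node is modelled simply as a set of (predicate, object) pairs (provenance omitted, since it is identical for both). The helper name values is my own.

```python
# Each blank node as a set of (predicate, object) pairs, taken from the
# two lexinfo:partOfSpeech objects in the PS.
node_conj = {
    ("rdf:type", "olia:SubordinatingConjunction"),
    ("my:freq", "10"),
    ("lexinfo:confidence", "0.1"),
}
node_adp = {
    ("rdf:type", "olia:Adposition"),
    ("my:freq", "90"),
    ("lexinfo:confidence", "0.9"),
}


def values(node, predicate):
    """All objects a node has for the given predicate."""
    return {o for p, o in node if p == predicate}


# Unifying the two blank nodes amounts to taking the union of their
# property sets.
merged = node_conj | node_adp

print(sorted(values(merged, "rdf:type")))
print(sorted(values(merged, "my:freq")))
```

After the merge, the node has two types and two frequencies, and nothing records which count belonged to which category. The reified variant keeps that pairing, at the cost of a few more triples.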
Received on Friday, 29 June 2018 20:16:03 UTC