Re: frequency dictionaries? from John McCrae on 2018-06-29 (public-ontolex@w3.org from June 2018)

From: John McCrae <john.mccrae@insight-centre.org>
Date: Fri, 29 Jun 2018 16:13:10 +0100
To: Christian Chiarcos <chiarcos@informatik.uni-frankfurt.de>
Cc: "public-ontolex@w3.org" <public-ontolex@w3.org>, "christian.chiarcos@web.de" <christian.chiarcos@web.de>
Message-ID: <CAHLDFnr_BiVEKfFgDCWJzKtbfqO4YKwkHhyagiTmW=C5nro0dQ@mail.gmail.com>
Hi Christian,

On 29 June 2018 at 15:25, Christian Chiarcos <
chiarcos@informatik.uni-frankfurt.de> wrote:

> Dear all,
>
> for disambiguating NLP annotations in a SPARQL-based workflow, I was
> extracting frequency lists for function words, their morphosyntactic
> characteristics and selected semantic features from annotated corpora. One
> application was disambiguation of the IN tag in PTB annotations, which is
> used for complementizers ("that"), certain adverbs ("so") and prepositions
> ("after"). The original rationale for this grouping of various features
> under one POS tag was that English prepositions can be used as
> complementizers ("after he passed the exam"), and that complementizers can
> include adverbial elements ("so that he passed the exam"). This does not
> work the other way around, of course, although many (adverbial) discourse
> markers have prepositional counterparts (thereafter - after).
>
> Using syntax annotation, these uses can be disambiguated, and I just want
> to do that and to provide the counts as a help for disambiguation. While
> this application does not require a backend ontology, such an ontology for
> preposition senses has indeed been developed and could be linked at some
> point in the future (http://demo.ark.cs.cmu.edu/PrepWiki/,
> http://www.clres.com/prepositions.html -- neither is provided as an
> ontology in the AI sense, though). To facilitate such (re-)use of my
> frequency lists, a lemon edition would be advisable. Also note that we
> actually provide multiple word senses per lexical entry, because these can
> be extrapolated from the semantic role of the phrase a preposition or
> complementizer occurs in, again, with counts.
>
> When trying to model this in Ontolex, I faced the following difficulties:
> - We don't seem to have a property for frequency counts (lexinfo:frequency
> is an object property). I (ab)used lexinfo:confidence, but with integers
> instead of real numbers.
>
lexinfo:frequency is an object property with three values: commonlyUsed,
infrequentlyUsed and rarelyUsed.

Modelling frequency just by giving a single integer is fairly useless...
knowing that 'cat' occurs 100 times is pointless if I don't know which
corpus or at least how large the corpus is.

Still a frequency count is something that should be possible to model with
OntoLex and is intended for one of the modules, so we should add this in
the future.
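
To illustrate why the corpus matters, here is a minimal Python sketch. The
FrequencyRecord type, the corpus names and the corpus sizes are all made up
for illustration; none of them come from OntoLex or LexInfo. It only shows
that the same raw count of 100 means very different things once it is
normalised by corpus size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrequencyRecord:
    """Hypothetical record: a raw count kept together with its corpus."""
    lemma: str
    count: int         # raw number of occurrences
    corpus: str        # identifier of the corpus the count was taken from
    corpus_size: int   # total number of tokens in that corpus

    def per_million(self) -> float:
        """Relative frequency in occurrences per million tokens."""
        return self.count * 1_000_000 / self.corpus_size


# Made-up corpora: identical raw counts, very different corpus sizes.
small = FrequencyRecord("cat", 100, "corpus-a", 50_000)
large = FrequencyRecord("cat", 100, "corpus-b", 100_000_000)

print(small.per_million())
print(large.per_million())
```

A module for frequency would presumably need to carry at least the same
three pieces of information: the count, the corpus, and the corpus size.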

(Also please don't abuse confidence... this is the lexicographer's
confidence)

> - I understand that there is a preference to have one part of speech per
> lexical entry, although I don't find a document where this was *ever*
> explicitly stated. Here, I would indeed need two lexinfo parts of speech
> for a lexical entry (representing a single Penn part of speech tag). Any
> reason not to do that?
>
From https://www.w3.org/2016/05/ontolex/#lexical-entries:

A lexical entry represents a unit of analysis of the lexicon that consists
of a set of forms that are grammatically related and a set of base meanings
that are associated with all of these forms. Thus, a lexical entry is a
word, multiword expression or affix with *a single part-of-speech,*
morphological pattern, etymology and set of senses.

It seems that Penn for some reason conflates two distinct part-of-speech
values. I would recommend introducing a new part-of-speech value and using
OWL axioms to state its relation to LexInfo (perhaps there is some Ontology
of Linguistic Annotation that could be helpful here ;)

> - In order to annotate lexinfo:confidence to parts of speech, I had to
> (RDF-)reify lexinfo:partOfSpeech. Is there another way?
>
Yes, this is standard practice and pretty much unavoidable.
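
For readers unfamiliar with the pattern, here is a rough Python sketch of
what standard RDF reification of a lexinfo:partOfSpeech statement produces.
Triples are plain tuples rather than objects from an RDF library. The ex:
namespace, the ex:count property and the value 42 are placeholders invented
for this example; the rdf: terms are the standard reification vocabulary,
and the lexinfo: namespace is assumed to be the LexInfo 2.0 one.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
LEXINFO = "http://www.lexinfo.net/ontology/2.0/lexinfo#"  # assumed LexInfo 2.0 namespace
EX = "http://example.org/lexicon#"  # hypothetical namespace, for illustration only


def reify(statement, subject, predicate, obj):
    """Return the four triples of standard RDF reification for one statement."""
    return [
        (statement, RDF + "type", RDF + "Statement"),
        (statement, RDF + "subject", subject),
        (statement, RDF + "predicate", predicate),
        (statement, RDF + "object", obj),
    ]


# The statement being described:
#   ex:after lexinfo:partOfSpeech lexinfo:preposition
stmt = EX + "after_pos_1"
triples = reify(stmt, EX + "after", LEXINFO + "partOfSpeech", LEXINFO + "preposition")

# Attach an annotation to the statement node itself.
# ex:count and the value 42 are placeholders, not real vocabulary or data.
triples.append((stmt, EX + "count", 42))

for s, p, o in triples:
    print(s, p, o)
```

The annotation hangs off the statement node, not off the lexical entry,
which is why it can differ per part of speech.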

> - I would like to aggregate counts from multiple corpora. What is the
> current/recommended treatment of provenance in ontolex?
>
I recommend the use of the PROV-O ontology. Is there a more specific
issue here?
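
As one possible reading of this, the Python sketch below keeps a separate
count node per corpus and links the aggregated count back to each of them
with prov:wasDerivedFrom. prov:wasDerivedFrom is a real PROV-O property;
the ex: namespace, the ex:count property, the node naming scheme and the
numbers are assumptions made up for this example.

```python
PROV = "http://www.w3.org/ns/prov#"
EX = "http://example.org/freq#"  # hypothetical namespace, for illustration only


def aggregate(lemma, counts_by_corpus):
    """Sum per-corpus counts and record where the total came from.

    Returns (total, triples), where triples are plain (s, p, o) tuples.
    """
    total_node = f"{EX}{lemma}_total"
    triples = []
    total = 0
    for corpus, count in sorted(counts_by_corpus.items()):
        node = f"{EX}{lemma}_{corpus}"
        # One count node per corpus, derived from that corpus.
        triples.append((node, EX + "count", count))
        triples.append((node, PROV + "wasDerivedFrom", EX + corpus))
        # The aggregated count is derived from every per-corpus count.
        triples.append((total_node, PROV + "wasDerivedFrom", node))
        total += count
    triples.append((total_node, EX + "count", total))
    return total, triples


# Made-up counts for two made-up corpora.
total, triples = aggregate("after", {"corpus_a": 10, "corpus_b": 5})
print(total)
for triple in triples:
    print(triple)
```

Keeping the per-corpus nodes means the total can always be recomputed or
broken down again later.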

>
> I attach a preliminary preposition dictionary bootstrapped from the Penn
> Treebank (please forgive my excessive use of blank nodes, this is
> temporary, of course). Suggestions for an alternative rendering in lemon
> would be highly welcome.
>
> Thanks,
> Christian
>
> PS: Apologies for missing the OntoLex telcos lately. Mondays just don't
> work for me.
> PPS: Some may remember me as being skeptical about having lexical senses
> alongside lexical concepts. I'm correcting myself: Word-specific frequency
> counts (or "confidence") per word sense is indeed a very good justification
> for distinguishing both.

:)

Regards,
John

>
> --
> Prof. Dr. Christian Chiarcos
> Applied Computational Linguistics
> Johann Wolfgang Goethe Universität Frankfurt a. M.
> 60054 Frankfurt am Main, Germany
>
> office: Robert-Mayer-Str. 10, #401b
> mail: chiarcos@informatik.uni-frankfurt.de
> web: http://acoli.cs.uni-frankfurt.de
> tel: +49-(0)69-798-22463
> fax: +49-(0)69-798-28931
Received on Friday, 29 June 2018 15:13:35 UTC