frequency dictionaries?

Dear all,

To disambiguate NLP annotations in a SPARQL-based workflow, I have been
extracting frequency lists for function words, their morphosyntactic
characteristics and selected semantic features from annotated corpora. One
application was disambiguation of the IN tag in PTB annotations, which is  
used for complementizers ("that"), certain adverbs ("so") and prepositions  
("after"). The original rationale for this grouping of various features  
under one POS tag was that English prepositions can be used as  
complementizers ("after he passed the exam"), and that complementizers can  
include adverbial elements ("so that he passed the exam"). This does not  
work the other way around, of course, although many (adverbial) discourse  
markers have prepositional counterparts (thereafter - after).

Using syntax annotation, these uses can be disambiguated. I want to do
exactly that and to provide the resulting counts as an aid for
disambiguation. While this application does not require a backend ontology,
inventories of preposition senses have indeed been developed and could be
linked at some point in the future (http://demo.ark.cs.cmu.edu/PrepWiki/,
http://www.clres.com/prepositions.html -- neither is provided as an
ontology in the AI sense, though). To facilitate such (re-)use of my
frequency lists, a lemon edition would be advisable. Also note that we
actually provide multiple word senses per lexical entry, because these can
be extrapolated from the semantic role of the phrase in which a preposition
or complementizer occurs, again with counts.
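
As a rough illustration of the counting step described above, here is a minimal Python sketch. It assumes each token is already available as a (word, POS tag, label of the dominating phrase) triple; that input format, the function name count_in_uses and the mapping from phrase labels to functions are assumptions made for this example, not the actual SPARQL workflow.

```python
from collections import Counter

# Hypothetical mapping from the phrase label that dominates an IN token
# to a coarse function. The labels are PTB-style; the mapping itself is
# an assumption for illustration only.
PARENT_TO_FUNCTION = {
    "PP": "preposition",
    "SBAR": "complementizer",
}


def count_in_uses(tokens):
    """Count (lowercased word, function) pairs for tokens tagged IN.

    `tokens` is an iterable of (word, pos_tag, parent_label) triples.
    Tokens with any other tag are ignored; IN tokens under an unmapped
    parent label are counted as "other".
    """
    counts = Counter()
    for word, pos, parent in tokens:
        if pos != "IN":
            continue
        function = PARENT_TO_FUNCTION.get(parent, "other")
        counts[(word.lower(), function)] += 1
    return counts


# Toy input, not taken from any corpus.
sample = [
    ("after", "IN", "PP"),    # "after the exam"
    ("After", "IN", "SBAR"),  # "after he passed the exam"
    ("that", "IN", "SBAR"),
    ("exam", "NN", "NP"),
    ("after", "IN", "PP"),
]
counts = count_in_uses(sample)
```

The resulting Counter holds one count per word and function, which is the shape of data the frequency lists need before any rendering in lemon.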

When trying to model this in OntoLex, I faced the following difficulties:
- We don't seem to have a property for frequency counts (lexinfo:frequency  
is an object property). I (ab)used lexinfo:confidence, but with integers  
instead of real numbers.
- I understand that there is a preference to have one part of speech per
lexical entry, although I cannot find a document where this was *ever*
explicitly stated. Here, I would indeed need two lexinfo parts of speech
for a lexical entry (representing a single Penn part of speech tag). Is
there any reason not to do that?
- In order to attach lexinfo:confidence to parts of speech, I had to
(RDF-)reify lexinfo:partOfSpeech. Is there another way?
- I would like to aggregate counts from multiple corpora. What is the  
current/recommended treatment of provenance in OntoLex?
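
To make the reification and aggregation questions above concrete, here is a minimal Python sketch that sums counts over several corpora and prints one reified lexinfo:partOfSpeech statement per entry and part of speech as Turtle. Several things are assumptions for illustration: the ":" namespace and all local names under it, the corpus names, the toy counts, the use of dct:source to record contributing corpora, and the lexinfo part-of-speech local names. Putting the integer count on lexinfo:confidence follows the workaround described in the list, not a recommended modelling.

```python
from collections import defaultdict

# The rdf, ontolex, lexinfo and dct namespace URIs are the published ones.
# The ":" namespace and every local name under it are hypothetical.
PREFIXES = (
    "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .\n"
    "@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .\n"
    "@prefix lexinfo: <http://www.lexinfo.net/ontology/2.0/lexinfo#> .\n"
    "@prefix dct: <http://purl.org/dc/terms/> .\n"
    "@prefix : <http://example.org/prepositions#> .\n"
)


def aggregate(per_corpus):
    """Sum counts per (entry, pos) and record which corpora contributed."""
    totals = defaultdict(int)
    sources = defaultdict(set)
    for corpus, counts in per_corpus.items():
        for key, n in counts.items():
            totals[key] += n
            sources[key].add(corpus)
    return dict(totals), dict(sources)


def to_turtle(totals, sources):
    """Render each (entry, pos) total as a reified partOfSpeech statement."""
    blocks = [PREFIXES]
    for (entry, pos), n in sorted(totals.items()):
        corpora = ", ".join(f":{c}" for c in sorted(sources[(entry, pos)]))
        blocks.append(
            f":{entry} a ontolex:LexicalEntry ;\n"
            f"    lexinfo:partOfSpeech lexinfo:{pos} .\n"
            f"[] a rdf:Statement ;\n"
            f"    rdf:subject :{entry} ;\n"
            f"    rdf:predicate lexinfo:partOfSpeech ;\n"
            f"    rdf:object lexinfo:{pos} ;\n"
            f"    lexinfo:confidence {n} ;\n"  # integer count, as in the workaround
            f"    dct:source {corpora} .\n"    # assumed provenance property
        )
    return "\n".join(blocks)


# Toy counts, not taken from any corpus.
per_corpus = {
    "ptb": {
        ("after", "preposition"): 3,
        ("after", "subordinatingConjunction"): 1,
    },
    "other_corpus": {
        ("after", "preposition"): 2,
    },
}
totals, sources = aggregate(per_corpus)
turtle = to_turtle(totals, sources)
print(turtle)
```

Keeping the per-corpus sets next to the totals means the aggregated figure can still be traced back to its sources, whichever provenance vocabulary is eventually recommended.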

I attach a preliminary preposition dictionary bootstrapped from the Penn  
Treebank (please forgive my excessive use of blank nodes, this is  
temporary, of course). Suggestions for an alternative rendering in lemon  
would be highly welcome.

Thanks,
Christian

PS: Apologies for missing the OntoLex telcos lately. Mondays just don't  
work for me.
PPS: Some may remember me as being skeptical about having lexical senses
alongside lexical concepts. I'm correcting myself: Word-specific frequency
counts (or "confidence") per word sense are indeed a very good
justification for distinguishing both.
-- 
Prof. Dr. Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany

office: Robert-Mayer-Str. 10, #401b
mail: chiarcos@informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28931

Received on Friday, 29 June 2018 14:26:11 UTC