- From: Christian Chiarcos <chiarcos@informatik.uni-frankfurt.de>
- Date: Fri, 29 Jun 2018 16:25:48 +0200
- To: "public-ontolex@w3.org" <public-ontolex@w3.org>
- Cc: "christian.chiarcos@web.de" <christian.chiarcos@web.de>
- Message-ID: <op.zldjpaf689jat0@kitaba.home>
Dear all,
for disambiguating NLP annotations in a SPARQL-based workflow, I was
extracting frequency lists for function words, their morphosyntactic
characteristics and selected semantic features from annotated corpora. One
application was disambiguation of the IN tag in PTB annotations, which is
used for complementizers ("that"), certain adverbs ("so") and prepositions
("after"). The original rationale for this grouping of various features
under one POS tag was that English prepositions can be used as
complementizers ("after he passed the exam"), and that complementizers can
include adverbial elements ("so that he passed the exam"). This does not
work the other way around, of course, although many (adverbial) discourse
markers have prepositional counterparts (thereafter - after).
Using syntax annotation, these uses can be disambiguated. I want to do just
that, and to provide the counts as an aid for disambiguation. While
this application does not require a backend ontology, such an ontology for
preposition senses has indeed been developed and could be linked at some
point in the future (http://demo.ark.cs.cmu.edu/PrepWiki/,
http://www.clres.com/prepositions.html -- neither is provided as an
ontology in the AI sense, though). To facilitate such (re-)use of my
frequency lists, a lemon edition would be advisable. Also note that we
actually provide multiple word senses per lexical entry, because these can
be extrapolated from the semantic role of the phrase a preposition or
complementizer occurs in, again, with counts.
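The syntax-based disambiguation and counting described above could be sketched roughly as follows. This is only an illustration, not the workflow actually used (which is SPARQL-based). It assumes each token arrives as a (word, POS tag, parent phrase label) triple, and that an IN token directly under PP counts as a preposition and one directly under SBAR as a complementizer. The mapping, the function names and the sample tokens are all hypothetical.

```python
from collections import Counter

# Assumed mapping from the label of the phrase directly dominating an IN
# token to a coarse category. Real PTB trees need more cases than this.
PARENT_TO_CATEGORY = {
    "PP": "preposition",
    "SBAR": "complementizer",
}


def categorize(parent_label):
    """Map a phrase label such as 'PP-TMP' to a coarse category."""
    base = parent_label.split("-")[0]  # drop function tags like -TMP
    return PARENT_TO_CATEGORY.get(base, "other")


def count_in_uses(tokens):
    """Count (lowercased word, category) pairs for all tokens tagged IN."""
    counts = Counter()
    for word, pos, parent_label in tokens:
        if pos == "IN":
            counts[(word.lower(), categorize(parent_label))] += 1
    return counts


# Invented toy input, not taken from the Penn Treebank.
sample = [
    ("after", "IN", "PP-TMP"),
    ("After", "IN", "SBAR-TMP"),
    ("that", "IN", "SBAR"),
    ("after", "IN", "PP"),
    ("exam", "NN", "NP"),
]
counts = count_in_uses(sample)
```

The resulting counter keeps one count per word and category, which is the shape of frequency list the rest of this mail tries to express in lemon.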
When trying to model this in Ontolex, I faced the following difficulties:
- We don't seem to have a property for frequency counts (lexinfo:frequency
is an object property). I (ab)used lexinfo:confidence, but with integers
instead of real numbers.
- I understand that there is a preference to have one part of speech per
lexical entry, although I don't find a document where this was *ever*
explicitly stated. Here, I would indeed need two lexinfo parts of speech
for a lexical entry (representing a single Penn part of speech tag). Any
reason not to do that?
- In order to attach lexinfo:confidence values to parts of speech, I had to
(RDF-)reify lexinfo:partOfSpeech. Is there another way?
- I would like to aggregate counts from multiple corpora. What is the
current/recommended treatment of provenance in ontolex?
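To make the questions above concrete, here is a minimal sketch of the reification workaround, written as a small Python script that emits Turtle. It is not the content of the attached dictionary. The entry URI, the corpus URIs and all counts are invented, the part-of-speech values are illustrative picks, and the use of dct:source for per-corpus provenance is purely an assumption for the sake of the example, not an OntoLex recommendation. As in the list above, lexinfo:confidence is (ab)used to carry integer counts.

```python
from collections import defaultdict

# The example.org namespace is a placeholder, not a published resource.
PREFIXES = """\
@prefix : <http://example.org/prepdict#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix lexinfo: <http://www.lexinfo.net/ontology/2.0/lexinfo#> .
"""


def aggregate(observations):
    """Sum counts per (entry, part of speech), kept separate per corpus."""
    totals = defaultdict(lambda: defaultdict(int))
    for entry, pos, corpus, count in observations:
        totals[(entry, pos)][corpus] += count
    return totals


def to_turtle(totals):
    """Emit one reified partOfSpeech statement per corpus, with its count."""
    blocks = [PREFIXES]
    for (entry, pos), per_corpus in sorted(totals.items()):
        blocks.append(
            f"{entry} a ontolex:LexicalEntry ;\n"
            f"    lexinfo:partOfSpeech {pos} .\n"
        )
        for corpus, count in sorted(per_corpus.items()):
            blocks.append(
                "[] a rdf:Statement ;\n"
                f"    rdf:subject {entry} ;\n"
                "    rdf:predicate lexinfo:partOfSpeech ;\n"
                f"    rdf:object {pos} ;\n"
                f"    lexinfo:confidence {count} ;\n"  # integer count, as above
                f"    dct:source <{corpus}> .\n"  # assumed provenance property
            )
    return "\n".join(blocks)


# Invented toy observations: (entry, part of speech, corpus, count).
observations = [
    (":after", "lexinfo:preposition", "http://example.org/corpus/A", 100),
    (":after", "lexinfo:preposition", "http://example.org/corpus/A", 20),
    (":after", "lexinfo:preposition", "http://example.org/corpus/B", 30),
    (":after", "lexinfo:subordinatingConjunction", "http://example.org/corpus/A", 40),
]
totals = aggregate(observations)
turtle = to_turtle(totals)
```

Keeping one reified statement per corpus leaves the per-corpus counts recoverable. A grand total can always be derived by summing, whereas the reverse is not possible.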
I attach a preliminary preposition dictionary bootstrapped from the Penn
Treebank (please forgive my excessive use of blank nodes, this is
temporary, of course). Suggestions for an alternative rendering in lemon
would be highly welcome.
Thanks,
Christian
PS: Apologies for missing the OntoLex telcos lately. Mondays just don't
work for me.
PPS: Some may remember me as being skeptical about having lexical senses
alongside lexical concepts. I'm correcting myself: Word-specific frequency
counts (or "confidence") per word sense are indeed a very good
justification for distinguishing both.
--
Prof. Dr. Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany
office: Robert-Mayer-Str. 10, #401b
mail: chiarcos@informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28931
Attachments
- application/x-gzip attachment: ptb-prepdict.ttl.gz
Received on Friday, 29 June 2018 14:26:11 UTC