Examples for frequency dictionaries from Christian Chiarcos on 2018-10-31 (public-ontolex@w3.org from October 2018)

From: Christian Chiarcos <chiarcos@informatik.uni-frankfurt.de>
Date: Wed, 31 Oct 2018 13:59:07 +0100
To: "public-ontolex@w3.org" <public-ontolex@w3.org>
Cc: "christian.chiarcos@web.de" <christian.chiarcos@web.de>
Message-ID: <op.zrq2ct0289jat0@kitaba>

Dear all,

we discussed the notion of frequency dictionaries before. In case there
is still time at the f2f meeting to look at actual example data, you
might want to consider the following dictionaries, which illustrate
frequency information:

Electronic Penn Sumerian Dictionary:
http://oracc.museum.upenn.edu/epsd2/sux; see a sample entry such as
"a [WATER]". A classical corpus-based lexicon, with direct links to corpus
examples. We are working on this as part of a joint NSF/DFG/SSHRC project
and in some student theses.

Wortschatz collocation dictionary:
http://corpora.uni-leipzig.de/de/res?corpusId=deu_newscrawl_2011&word=Wasser.
Below "Dornseiff-Bedeutungsgruppen:" (~ LexicalConcept), all information
provided is distributional. There was an early attempt to convert this to
LLOD by the AKSW people in 2013, but this definitely needs to be reworked.

FrameNet realization statistics: providing aggregate counts
(https://framenet2.icsi.berkeley.edu/fnReports/data/lu/lu6387.xml?mode=lexentry)
and attestations ("annotations",
https://framenet2.icsi.berkeley.edu/fnReports/data/lu/lu6387.xml?mode=annotation).
FrameNet modelling in LLOD has a long history, but it has focused on the
frame inventory rather than on representing (links with) annotations or
distributional information.

Note that all of these illustrate *corpus-driven lexicography*. More
insight into what kind of frequency-based information a lexicographer
would like to see can be obtained from the corpus statistics of tools such
as WordSmith
(https://www.lancaster.ac.uk/fss/courses/ling/corpus/blue/l03_top.htm) or
SketchEngine (https://www.sketchengine.eu/). Also note that the fact that
these tools are capable of generating such information dynamically from a
corpus does not mean that this information is beyond OntoLex, as we may
think of having OntoLex-compliant lexicographical web services at some
point. (In fact, we should, as this is a natural extension of the
lexicography use case in the context of the Web of Data.)

Aside from corpus-driven lexicography, there are other uses of
frequency/distributional information, e.g., in NLP. Stop-word lists are a
trivial example (e.g., https://www.ranks.nl/stopwords); pre-trained word
embeddings (say, https://nlp.stanford.edu/projects/glove/) are
technologically more advanced, but effectively similar in structure.
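
To make the "similar in structure" point concrete, here is a minimal
sketch in Python. It is only an illustration under stated assumptions: the
toy corpus, the two-dimensional vectors, and the helper names stop_words
and describe are all invented here, and deriving stop words as "the k most
frequent forms" is a simplification (published lists are often curated by
hand). What it shows is that a frequency list, a stop-word list, and an
embedding table can each be read as a mapping from a word form to some
distributional value.

```python
from collections import Counter

# Toy corpus; the tokens and every number below are invented for illustration.
tokens = ("the water is cold and the river is deep "
          "and the well is dry").split()

# Frequency list: word form -> absolute count in the corpus.
frequency = Counter(tokens)


def stop_words(freq, k):
    """Return the k most frequent forms as a stop-word set.

    Assumption: real stop-word lists are not necessarily built this way.
    """
    return {word for word, _ in freq.most_common(k)}


# Stop-word list: word form -> membership (a boolean in disguise).
stops = stop_words(frequency, 3)

# Embeddings: word form -> vector. These toy vectors are made up; real
# pre-trained embeddings have many more dimensions.
embeddings = {"water": [0.1, -0.3], "cold": [0.2, 0.5]}


def describe(word):
    """Collect everything the three resources say about one word form."""
    return {
        "form": word,
        "count": frequency[word],
        "is_stop_word": word in stops,
        "vector": embeddings.get(word),  # None if no vector is available
    }


print(describe("water"))
print(describe("the"))
```

In all three cases the key is a form (or, in a lexicon, an entry or sense)
and only the type of the attached value differs: a count, a flag, or a
vector.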

Unfortunately, I won't be able to attend the f2f meeting in person, so I
can't introduce and describe these resources myself. Still, discussing
these and other examples on the list and in future telcos would be a basis
for developing a principled approach to representing frequency information
in the future.

Best,
Christian
-- 
Prof. Dr. Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany

office: Robert-Mayer-Str. 10, #401b
mail: chiarcos@informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28931

Received on Wednesday, 31 October 2018 12:59:30 UTC