Re: Examples for frequency dictionaries from John P. McCrae on 2018-10-31 (public-ontolex@w3.org from October 2018)

From: John P. McCrae <john.mccrae@insight-centre.org>
Date: Wed, 31 Oct 2018 14:07:41 +0000
To: Christian Chiarcos <chiarcos@informatik.uni-frankfurt.de>
Cc: public-ontolex <public-ontolex@w3.org>, Christian Chiarcos <christian.chiarcos@web.de>
Message-ID: <CAHLDFnr3hGVdrLYw36kwLBD4zpvoeDOKLL9p4Q2GPwnQN4pOGQ@mail.gmail.com>

Hi Christian,

Thanks for this. We will use it as a basis for discussion in Leiden.

Regards,
John

On Wed 31 Oct 2018, 13:00 Christian Chiarcos <chiarcos@informatik.uni-frankfurt.de> wrote:

> Dear all,
>
> we discussed the notion of frequency dictionaries before. In case there
> is still time at the f2f meeting to look at actual example data, you
> might want to consider the following dictionaries that illustrate
> frequency information:
>
> Electronic Penn Sumerian Dictionary:
> http://oracc.museum.upenn.edu/epsd2/sux (see a sample entry such as
> "a [WATER]"). A classical corpus-based lexicon, with direct links to corpus
> examples. We work on this in the scope of a joint NSF/DFG/SSHRC project
> and in some student theses.
>
> Wortschatz collocation dictionary:
> http://corpora.uni-leipzig.de/de/res?corpusId=deu_newscrawl_2011&word=Wasser
> Below "Dornseiff-Bedeutungsgruppen:" (~ LexicalConcept), all information
> provided is distributional. There was an early attempt to convert this to
> LLOD by the AKSW people in 2013, but this definitely needs to be reworked.
>
> FrameNet realization statistics: providing aggregate counts
> (https://framenet2.icsi.berkeley.edu/fnReports/data/lu/lu6387.xml?mode=lexentry)
> and attestations ("annotations",
> https://framenet2.icsi.berkeley.edu/fnReports/data/lu/lu6387.xml?mode=annotation).
> FrameNet modelling in LLOD is a long story, but it has focused on the frame
> inventory rather than on representing (links with) annotations or
> distributional information.
>
> Note that all of these illustrate *corpus-driven lexicography*. More
> insight into what kind of frequency-based information a lexicographer
> would like to see can be obtained from the corpus statistics of tools such
> as WordSmith
> (https://www.lancaster.ac.uk/fss/courses/ling/corpus/blue/l03_top.htm) or
> SketchEngine (https://www.sketchengine.eu/). Also note that the fact that
> these tools are capable of generating such information dynamically from a
> corpus does not mean that this information is beyond OntoLex, as we may
> think of having OntoLex-compliant lexicographical web services at some
> point. (In fact, we should, as this is a natural extension of the
> lexicography use case in the context of the Web of Data.)
>
> Aside from corpus-driven lexicography, there are other uses of
> frequency/distributional information, e.g., in NLP. Stop-word lists are a
> trivial example (e.g., https://www.ranks.nl/stopwords); pre-trained word
> embeddings (say, https://nlp.stanford.edu/projects/glove/) are
> technologically more advanced, but effectively similar in structure.
>
> Unfortunately, I won't attend the f2f meeting in person, so I won't be
> able to introduce and describe these resources, but discussing these and
> other examples on the list and in future telcos would be a basis for
> developing a principled approach to represent frequency information in the
> future.
>
> Best,
> Christian
> --
> Prof. Dr. Christian Chiarcos
> Applied Computational Linguistics
> Johann Wolfgang Goethe Universität Frankfurt a. M.
> 60054 Frankfurt am Main, Germany
>
> office: Robert-Mayer-Str. 10, #401b
> mail: chiarcos@informatik.uni-frankfurt.de
> web: http://acoli.cs.uni-frankfurt.de
> tel: +49-(0)69-798-22463
> fax: +49-(0)69-798-28931
>
>

Received on Wednesday, 31 October 2018 14:08:28 UTC