- From: John P. McCrae <john.mccrae@insight-centre.org>
- Date: Wed, 31 Oct 2018 14:07:41 +0000
- To: Christian Chiarcos <chiarcos@informatik.uni-frankfurt.de>
- Cc: public-ontolex <public-ontolex@w3.org>, Christian Chiarcos <christian.chiarcos@web.de>
- Message-ID: <CAHLDFnr3hGVdrLYw36kwLBD4zpvoeDOKLL9p4Q2GPwnQN4pOGQ@mail.gmail.com>
Hi Christian,

Thanks for this, we will use this as a basis for discussion in Leiden.

Regards,
John

On Wed 31 Oct 2018, 13:00 Christian Chiarcos <chiarcos@informatik.uni-frankfurt.de> wrote:

> Dear all,
>
> we discussed the notion of frequency dictionaries before. In case there
> still is time at the f2f meeting to look at actual example data, you
> might want to consider the following dictionaries that illustrate
> frequency information:
>
> Electronic Penn Sumerian Dictionary:
> http://oracc.museum.upenn.edu/epsd2/sux, look into a sample entry such as
> "a [WATER]". A classical corpus-based lexicon, with direct links to corpus
> examples. We work on this in the scope of a joint NSF/DFG/SSHRC project
> and in some student theses.
>
> Wortschatz collocation dictionary:
> http://corpora.uni-leipzig.de/de/res?corpusId=deu_newscrawl_2011&word=Wasser
> Below "Dornseiff-Bedeutungsgruppen:" (~ LexicalConcept), all information
> provided is distributional. There was an early attempt to convert this to
> LLOD by the AKSW people in 2013, but this definitely needs to be reworked.
>
> FrameNet realization statistics: providing aggregate counts
> (https://framenet2.icsi.berkeley.edu/fnReports/data/lu/lu6387.xml?mode=lexentry)
> and attestations ("annotations",
> https://framenet2.icsi.berkeley.edu/fnReports/data/lu/lu6387.xml?mode=annotation).
> FrameNet modelling in LLOD is a long story, but focused on the frame
> inventory rather than on representing (links with) annotations or
> distributional information.
>
> Note that all of these illustrate *corpus-driven lexicography*. More
> insight about what kind of frequency-based information a lexicographer
> would like to see can be obtained from the corpus statistics of tools such
> as WordSmith
> (https://www.lancaster.ac.uk/fss/courses/ling/corpus/blue/l03_top.htm) or
> SketchEngine (https://www.sketchengine.eu/). Also note that the fact that
> these tools are capable of generating such information dynamically from a
> corpus does not mean that this information is beyond OntoLex, as we may
> think of having OntoLex-compliant lexicographical web services at some
> point. (In fact, we should, as this is a natural extension of the
> lexicography use case in the context of the Web of Data.)
>
> Aside from corpus-driven lexicography, there are other uses of
> frequency/distributional information, e.g., in NLP. Stop-word lists are a
> trivial example (e.g., https://www.ranks.nl/stopwords), pre-trained word
> embeddings (say, https://nlp.stanford.edu/projects/glove/) are
> technologically more advanced, but effectively similar in structure.
>
> Unfortunately, I won't attend the f2f meeting in person, so I won't be
> able to introduce and describe these resources, but discussing these and
> other examples on the list and in future telcos would be a basis for
> developing a principled approach to represent frequency information in the
> future.
>
> Best,
> Christian
> --
> Prof. Dr. Christian Chiarcos
> Applied Computational Linguistics
> Johann Wolfgang Goethe Universität Frankfurt a. M.
> 60054 Frankfurt am Main, Germany
>
> office: Robert-Mayer-Str. 10, #401b
> mail: chiarcos@informatik.uni-frankfurt.de
> web: http://acoli.cs.uni-frankfurt.de
> tel: +49-(0)69-798-22463
> fax: +49-(0)69-798-28931
Received on Wednesday, 31 October 2018 14:08:28 UTC