- From: Christian Chiarcos <chiarcos@informatik.uni-frankfurt.de>
- Date: Wed, 31 Oct 2018 13:59:07 +0100
- To: "public-ontolex@w3.org" <public-ontolex@w3.org>
- Cc: "christian.chiarcos@web.de" <christian.chiarcos@web.de>
Dear all,

we discussed the notion of frequency dictionaries before. In case there is still time at the f2f meeting to look at actual example data, you might want to consider the following dictionaries, which illustrate frequency information:

- Electronic Penn Sumerian Dictionary: http://oracc.museum.upenn.edu/epsd2/sux (look at a sample entry such as "a [WATER]"). A classical corpus-based lexicon, with direct links to corpus examples. We work on this in the scope of a joint NSF/DFG/SSHRC project and in some student theses.
- Wortschatz collocation dictionary: http://corpora.uni-leipzig.de/de/res?corpusId=deu_newscrawl_2011&word=Wasser. Below "Dornseiff-Bedeutungsgruppen:" (~ LexicalConcept), all information provided is distributional. There was an early attempt to convert this to LLOD by the AKSW people in 2013, but this definitely needs to be reworked.
- FrameNet realization statistics: these provide aggregate counts (https://framenet2.icsi.berkeley.edu/fnReports/data/lu/lu6387.xml?mode=lexentry) and attestations ("annotations", https://framenet2.icsi.berkeley.edu/fnReports/data/lu/lu6387.xml?mode=annotation). FrameNet modelling in LLOD is a long story, but it has focused on the frame inventory rather than on representing (links with) annotations or distributional information.

Note that all of these illustrate *corpus-driven lexicography*. More insight into what kind of frequency-based information a lexicographer would like to see can be obtained from the corpus statistics of tools such as WordSmith (https://www.lancaster.ac.uk/fss/courses/ling/corpus/blue/l03_top.htm) or SketchEngine (https://www.sketchengine.eu/).

Also note that the fact that these tools can generate such information dynamically from a corpus does not mean that this information is beyond the scope of OntoLex, as we may think of having OntoLex-compliant lexicographical web services at some point. (In fact, we should, as this is a natural extension of the lexicography use case in the context of the Web of Data.)

Aside from corpus-driven lexicography, there are other uses of frequency/distributional information, e.g., in NLP. Stop-word lists are a trivial example (e.g., https://www.ranks.nl/stopwords); pre-trained word embeddings (say, https://nlp.stanford.edu/projects/glove/) are technologically more advanced, but effectively similar in structure.

Unfortunately, I won't attend the f2f meeting in person, so I won't be able to introduce and describe these resources, but discussing these and other examples on the list and in future telcos would be a basis for developing a principled approach to representing frequency information in the future.

Best,
Christian

--
Prof. Dr. Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany

office: Robert-Mayer-Str. 10, #401b
mail: chiarcos@informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28931

Received on Wednesday, 31 October 2018 12:59:30 UTC