- From: Christian Chiarcos <chiarcos@informatik.uni-frankfurt.de>
- Date: Fri, 29 Jun 2018 22:15:37 +0200
- To: "John McCrae" <john.mccrae@insight-centre.org>
- Cc: "public-ontolex@w3.org" <public-ontolex@w3.org>, "christian.chiarcos@web.de" <christian.chiarcos@web.de>
- Message-ID: <op.zldzwbsy89jat0@kitaba.home>
Hi John,

thanks for the immediate response.

> lexinfo:frequency is an object property with three values commonlyUsed,
> infrequentlyUsed and rarelyUsed. ... (Also please don't abuse
> confidence... this is the lexicographer's confidence)

Yes, the ISOcat frequency category was intended to model how common a word feels to the lexicographer, not corpus frequency, of course. I will use a special-purpose property, then.

> Modelling frequency just by giving a single integer is fairly useless...
> knowing that 'cat' occurs 100 times is pointless if I don't know which
> corpus or at least how large the corpus is.

Right, hence the provenance question, and also the wish to record both prepositional and subordinating uses within the same lexical entry (as the numbers within the LexicalEntry represent the total). Only absolute numbers allow aggregation over multiple sources. But to conform to the open world assumption, one should probably provide both the absolute and the relative frequency. The relative frequency is in fact related to confidence, as it is (an approximation of) the unconditional probability that a property holds in a corpus, and since the Brown corpus, such figures (should) have been guiding (some) lexicographers' intuitions.

> From https://www.w3.org/2016/05/ontolex/#lexical-entries
>
> A lexical entry represents a unit of analysis of the lexicon that
> consists of a set of forms that are grammatically related and a set of
> base meanings that are associated with all of these forms. Thus, a
> lexical entry is a word, multiword expression or affix with a single
> part-of-speech, morphological pattern, etymology and set of senses.

All right. The deeper issues here are that (a) linguists have no agreement on part-of-speech inventories, and (b) many languages feature "zero derivation", where words of one POS category can be used as another without any difference in form or sense. Wrt.
(a), one may claim that English does not have a preposition category that is distinct from subordinators, but rather that English prepositions are a subclass of subordinating conjunctions. This is a bit of a radical view, but it is not unheard of that languages implement subordination by nominalization, and this seems to be the idea behind the Penn tag. The solution for (a) is to use a language-specific inventory, of course ;)

Wrt. (b), German adjectives can be used as adverbs, and we will probably think of the two as having distinct POS; also, every infinitive is formally identical to the deverbal noun. Shall we duplicate all lexical entries for adjectives and verbs? The problem is not so much the one-POS-per-entry constraint (which doesn't create much overhead), but the fact that "[t]he lexical sense has a single lexical entry", so the sense definitions need to be duplicated as well.

Why would we need to record both the "zero-derived" and the "underlying" lexical entry if the senses are identical? Well, we might want to record attestations or frequencies. If we don't, the current lemon definitions (plus some user knowledge about zero derivation) are fine.

An alternative ontolex extension would be to introduce a "zero derivation" property in a future morphology module that holds between lexical entries of different POS and that *requires* the derived entry to inherit the canonical form and the senses of the "underlying" lexical entry. In many ways this would be like casting a variable in programming languages, and indeed, "castOf" could be a better name than "zeroDerivedFrom".

> It seems that Penn for some reason conflates two distinct part-of-speech
> values. I would recommend introducing a new part-of-speech value and
> using OWL axioms to state its relation to LexInfo (perhaps there is
> some Ontology of Linguistic Annotation that could be helpful here ;)

The OLiA Penn ontology doesn't immediately help as it implements the semantics of IN by means of a disjunction.
As I need to express that both categories apply, this could be represented with an intersection of OWL classes:

    _:after lexinfo:partOfSpeech [ a olia:SubordinatingConjunction, olia:Adposition ] .

(The example really needs OLiA, as subordinatingConjunction doesn't have an accompanying class in lexinfo.) But this is problematic because I want to count conjunction and adposition usage separately, and here there is no separate RDF triple per category to reify.

>> - In order to annotate lexinfo:confidence to parts of speech, I had to
>> (RDF-)reify lexinfo:partOfSpeech. Is there another way?

> Yes, this is standard practice and pretty much unavoidable.

OK.

>> - I would like to aggregate counts from multiple corpora. What is the
>> current/recommended treatment of provenance in ontolex?

> I recommend the use of the PROV-O ontology. Is there are more specific
> issue here.

That's what I thought. prov:wasDerivedFrom would work for a reified lexinfo:partOfSpeech property.

I think the least abusive way of applying ontolex here would be to use an application-specific frequency property and multiple lexinfo:partOfSpeech properties, each with a blank node as its object and one associated OLiA class. As a blank node does not have a unique reference, all blank nodes could in theory resolve to the same URI, so formally, the one-POS-per-entry constraint isn't broken. But this clearly is a hack and I'm not sure it should be recommended.

So, we have

    [ a rdf:Statement;
      rdf:subject _:after;
      rdf:predicate lexinfo:partOfSpeech;
      rdf:object [ a olia:SubordinatingConjunction ];
      prov:wasDerivedFrom <https://catalog.ldc.upenn.edu/ldc99t42>;
      my:freq "10";
      lexinfo:confidence "0.1" ].

    [ a rdf:Statement;
      rdf:subject _:after;
      rdf:predicate lexinfo:partOfSpeech;
      rdf:object [ a olia:Adposition ];
      prov:wasDerivedFrom <https://catalog.ldc.upenn.edu/ldc99t42>;
      my:freq "90";
      lexinfo:confidence "0.9" ].

A few more triples than in my original modelling, but it would work for me.

Thanks,
Christian

PS: In fact, with the blank-node objects, we can have something almost equivalent without reification:

    _:after lexinfo:partOfSpeech [
      a olia:SubordinatingConjunction;
      prov:wasDerivedFrom <https://catalog.ldc.upenn.edu/ldc99t42>;
      my:freq "10";
      lexinfo:confidence "0.1" ].

    _:after lexinfo:partOfSpeech [
      a olia:Adposition;
      prov:wasDerivedFrom <https://catalog.ldc.upenn.edu/ldc99t42>;
      my:freq "90";
      lexinfo:confidence "0.9" ].

This means we have multiple lexical-entry-specific POS categories. This is much more readable but less precise, as unifying both blank nodes just gives nonsense.
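
The point made in the message that only absolute numbers can be aggregated over multiple sources can be illustrated with a small sketch. Everything here is an assumption for illustration: the first source reuses the 10/90 split from the examples in this message, while the second source (`urn:example:other-corpus`) and its counts are invented, and the function names `aggregate` and `relative` are hypothetical helpers, not part of ontolex, LexInfo or PROV-O.

```python
from collections import Counter

# Absolute counts of each part-of-speech reading of "after", per source.
# First source: the 10/90 split used in the message's examples.
# Second source: invented counts, purely illustrative.
counts_by_source = {
    "https://catalog.ldc.upenn.edu/ldc99t42": {
        "SubordinatingConjunction": 10,
        "Adposition": 90,
    },
    "urn:example:other-corpus": {
        "SubordinatingConjunction": 60,
        "Adposition": 240,
    },
}


def aggregate(by_source):
    """Sum absolute counts over all sources."""
    total = Counter()
    for counts in by_source.values():
        total.update(counts)
    return total


def relative(counts):
    """Turn absolute counts into relative frequencies."""
    n = sum(counts.values())
    return {pos: c / n for pos, c in counts.items()}


totals = aggregate(counts_by_source)
shares = relative(totals)

# Averaging the per-source relative frequencies ignores corpus size
# and therefore generally differs from the pooled value in `shares`.
per_source = [relative(c)["Adposition"] for c in counts_by_source.values()]
naive_average = sum(per_source) / len(per_source)
```

With sources of different sizes, the pooled share in `shares` and `naive_average` diverge, which is why keeping the absolute counts alongside any relative frequency seems the safer choice.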
Received on Friday, 29 June 2018 20:16:03 UTC