- From: Katrien Depuydt <Katrien.Depuydt@ivdnt.org>
- Date: Thu, 24 Apr 2025 14:02:08 +0000
- To: 'Sander Stolk' <ssstolk@gmail.com>, Fahad Khan <anasfkhan81@gmail.com>
- CC: "John P. McCrae" <john.mccrae@insight-centre.org>, "public-ontolex@w3.org" <public-ontolex@w3.org>
- Message-ID: <684a6dcc0c5b4b10b1c02d2baf10961c@ivdnt.org>
Dear Sander,

First of all, I am so sorry to hear that you have not been well. I hope your health continues to improve, and please do take good care of yourself.

I have a few comments on what you have written and have put them inline.

Kind regards,
Katrien

Katrien Depuydt
Senior researcher/linguist
Tel.: +31 71 527 24 79
Mob.: +31 6 53627318
Instituut voor de Nederlandse Taal / Dutch Language Institute
P.O. Box 9500, NL 2300 RA Leiden
Visiting address: Rapenburg 61, 2311 GJ Leiden

From: Sander Stolk <ssstolk@gmail.com>
Sent: Wednesday 23 April 2025 17:29
To: Fahad Khan <anasfkhan81@gmail.com>
CC: John P. McCrae <john.mccrae@insight-centre.org>; public-ontolex@w3.org
Subject: Re: OntoLex FrAC Module Public Review

Dear all,

Apologies for having remained out of touch over the last year and a half or so. I have been severely ill and am just starting to recover. It is only by chance that I noticed the call in time and am, finally, also able to summon the energy to make these observations and to write them down.

You will find my comments below, both on the model and on smaller editorial matters concerning the documentation. I hope these are helpful and, more importantly in my eyes, I hope that you are all doing well. Know that I have very much enjoyed working together with you on all aspects concerning linguistic linked data, including on FrAC, and would welcome doing so again in the future if possible. It is great to see the specification has gotten to this stage; a significant development and a very welcome ontology indeed!

All best wishes,
Sander

---

modelling:

* Consider rephrasing "frequency" to "count". In many pieces of software, including linguistic software but also in software aimed at everyday users, the term "count" is used (e.g., "word count") instead of "(absolute) frequency".
Using this term would avoid ambiguity between absolute and relative frequency in terms of what is denoted by the property and class in the ontology.

I am not sure I agree here. I would prefer to keep the term frequency and, as Fahad also mentioned, include both absolute and relative frequency in the model. An alternative would be a fraction, e.g., 2/200,000.

* Currently the ontology requires min 1 dct:description for Observations. Is that necessary? I imagine that capturing a description is not always needed for captured frequencies.

* I am unsure whether the property "total" is needed. The model already states that an Observation, such as a Frequency, will indicate wherein the observation was made (e.g., a corpus). Thus, there is already a link between Frequency and the anyURI/corpus that can be used. Moreover, any absolute frequency is a total of sorts: all nouns in a corpus, all weak verbs, etc. The property 'total' therefore can feel both underused and overused, in terms of in what contexts to apply it, and it seems more so to have been added to the specification as a way to distinguish matters for those publishing a corpus and those utilizing that corpus for observations? The descriptions in the specification do not clearly distinguish the two modelling methods, as they emphasize "elements" being counted, regardless of whether "total" is used or the other modelling pattern. Quotes below to illustrate:
  * definition of total: "The object property total assigns any potential FrAC data source [...] the total number of elements that it contains as a frac:Frequency object."
  * in introduction: "OntoLex-Lemon provides a core vocabulary to represent linguistic information associated with ontology and vocabulary elements".
  * elsewhere: "Elements [ontolex:LexicalEntry etc] that FrAC properties apply to must be observable in a corpus or another linguistic data source."
In other words, the specification itself currently does not yet do a very good job at explaining why there is a distinction between 'total' and 'other', considering they are all deemed "(linguistic) elements". Perhaps, instead of the pattern found at example 3, which is:

    <https://wordnetcode.princeton.edu/glosstag.shtml> a dct:Collection ;
        frac:total [
            a frac:Frequency ;
            rdf:value 1634691 ;
            frac:unit "tokens"
        ] .

the following could be used just as well? If so, it would be possible to discard the "total" property and do away with any ambiguity of when to use which modelling pattern for such quantifications.

    [ a frac:Frequency ;
      rdf:value 1634691 ;
      frac:unit "tokens"
    ] frac:observedIn <https://wordnetcode.princeton.edu/glosstag.shtml> .

    <https://wordnetcode.princeton.edu/glosstag.shtml> a dct:Collection .

I understand why this comment has been made. A way is needed to indicate the size of the corpus from which the frequency information was extracted. Maybe a link to a corpus is not enough. What if I decide to use a corpus but count the number of verbs in relation to the nouns in that corpus? Then it should be possible to indicate that my “subcorpus” consists of the nouns and verbs of the corpus that I have been referring to…

document/editorial:

- schema image (Figure 1) shows rdf:value property for Observation but also for subclasses Attestation and Frequency but not Collocation. Moreover, unlike rdf:value property, dct:description at Observation is not repeated for any of its subclasses.
- schema image (Figure 1) does not show property "unit"
- documentation states rdfs:range of property "unit" is frac:Frequency, although I believe it should be the domain rather than the range?
- "If a future community standard provides reference URIs for such datatypes, frac:unit should be used as a datatype property." Replace datatype with object here.
- I recommend doing away with syntax highlighting of Turtle snippets if the highlighting is not intended for Turtle. As it is, sometimes @prefix is highlighted, sometimes it is not. Only parts of URIs are highlighted. Literals are sometimes highlighted, sometimes not. And so on. If highlighting is meaningless and/or seemingly inconsistent, then it is best avoided.
- perhaps good to add something outside of ontolex, e.g., "noun", as an example of an Observable?
- "Lexicographers use (corpus) frequency and distribution information while compiling lexical entries, as a qualitative assessment of their resources." Replace qualitative with quantitative here?
- Caption for example 2 duplicates the word example ("Example 2: Example: Frequency of the Sumerian word _a_ 'water'") and seems to suggest it is about another word than the example truly contains (kal-ga instead of a). The duplication of the word "example" occurs for other examples too.
- "the existence of a certain lexical phenomena" --> "the existence of a certain lexical phenomenon"
- "In scholarly dictionaries, attestations are a representative selection from the occurrences of a headword in a textual corpus." Or a specific sense (the case for the Dictionary of Old English), or a specific form, etc. So not limited to headword.
- "The property frac:attestation associates an attestation to the frac:Observable." Please rephrase so that it indicates the direction of the property (i.e., domain and range). --> "The property frac:attestation establishes a relation between a frac:Observable and an attestation thereof."
- Example 4 does not type frac:Attestation; example 5 does. Do we have a preference for its explicit inclusion or absence?
- "frac:locus normally refers to a location identified by RFC5147 character offsets, NIF URIs, Open Annotation or Text Fragments references, whereas frac:observedIn refers to dct:Texts or dct:Collections."
Perhaps references can be made to these standards, or to their respective subsections of section 6? Additionally, section 6 changes the naming from Open Annotation to Web Annotation. It would be good to scan the document for consistency in order to avoid confusion.
- The definition of "Collocation" currently lists "SubClassOf: frac:Observation, rdfs:Container, frac:Observable". I suspect frac:Observable has to be removed here.

On Tue, 22 Apr 2025 at 14:23, Fahad Khan <anasfkhan81@gmail.com> wrote:

Dear John, All,

Here are my comments on the FrAC draft: https://docs.google.com/document/d/148Mtlag7bvl-GCpOpXRxPPQUj1fSTBa7yZvHek0rPQY/edit?usp=sharing

Cheers,
Fahad

On Mon, 21 Apr 2025 at 09:02, John P. McCrae <john.mccrae@insight-centre.org> wrote:

Hi,

I have one further comment on the public review version:

* The range of "unit" is given as a string. It would be much better if these could be replaced by elements from a standard vocabulary such as LexInfo.

Regards,
John

On Thu, 17 Apr 2025 at 10:19, John P. McCrae <john.mccrae@insight-centre.org> wrote:

Hi all,

This is a reminder that we have one week left for public review comments. If you could find some time to review and make any comments, that would be really appreciated. As there is already one comment, we will be reopening this module, but if you have any comments, now would be the best time to make them so that we can handle all comments in the 2nd public review.

Regards,
John

On Thu, 23 Jan 2025 at 11:38, John McCrae <john.mccrae@insight-centre.org> wrote:

Dear OntoLex CG Members,

The group working on the Frequency, Attestation and Corpus Information (FrAC) module has completed the first draft of the specification, which is now available for public review.
All comments on the specification must be made as a post to public-ontolex@w3.org by April 23rd in order to be considered. No changes to the specification are allowed except in response to comments on the public review. If no comments are received the specification will be published as is.

The specification files are available on GitHub at:

https://ontolex.github.io/frequency-attestation-corpus-information/
https://github.com/ontolex/frequency-attestation-corpus-information/blob/master/owl/frac.ttl

I also attach them to this email.

Regards,
John, Christian and Max
Received on Thursday, 24 April 2025 14:02:15 UTC
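
The thread discusses absolute versus relative frequency, the fraction alternative (e.g., 2/200,000), and the need to know the size of the (sub)corpus behind a count. The following is only a minimal sketch of how these relate; it is not part of FrAC. The class name `FrequencyObservation` and its fields are hypothetical, chosen for illustration, and the numbers are the ones from the fraction example in the thread.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class FrequencyObservation:
    """Hypothetical container: an absolute count plus the size of the
    (sub)corpus it was observed in. Not a FrAC class."""
    count: int            # absolute frequency ("count")
    total: int            # size of the corpus or subcorpus
    unit: str = "tokens"  # what is being counted

    def __post_init__(self) -> None:
        if self.total <= 0:
            raise ValueError("total must be positive")
        if not 0 <= self.count <= self.total:
            raise ValueError("count must lie between 0 and total")

    @property
    def relative(self) -> Fraction:
        # Exact relative frequency. Fraction reduces to lowest terms, so
        # count and total are kept as separate fields to stay recoverable.
        return Fraction(self.count, self.total)

    def per_million(self) -> float:
        # Scale the exact fraction first, then convert to float.
        return float(self.relative * 1_000_000)


obs = FrequencyObservation(count=2, total=200_000)
print(obs.relative)       # 1/100000
print(obs.per_million())  # 10.0
```

Keeping the count and the total side by side reflects the point made in the thread: a relative frequency can only be interpreted if the size of the corpus or subcorpus it was computed against is recorded as well.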