- From: Christian Chiarcos <christian.chiarcos@gmail.com>
- Date: Tue, 4 Jul 2023 11:54:30 +0200
- To: Ilan Kernerman <ilan@lexicala.com>
- Cc: Fahad Khan <fahad.khan@ilc.cnr.it>, public-ontolex <public-ontolex@w3.org>
- Message-ID: <CAC1YGdjH-wJs+tL-2HcX5OhTvNz1LbDxqMy5y9LogoG+DuPeyQ@mail.gmail.com>
Dear Ilan,

> I support keeping the one-pos-per-entry principle – which IMHO makes
> Ontolex/Lexicog more thorough, consistent, open and useful, despite such
> constraints – and seeking solutions to specific clashes, like what you
> suggest.
> Whether we like that or not, there's no real alternative right now ;)
> ... bringing us back to asking what is the ultimate purpose of Ontolex –
> to provide automated 1-to-1 replications for (often) imperfect dictionaries
> or try to design the utmost up-to-date semantic representation of lexical
> data for actual use today?

I think it should cover both, because the ultimate goal may be the latter but, for the foreseeable future, the bulk of the data is more in line with the first.

> The macro- and micro-structure of good old dictionaries has also been
> determined by real constraints, such as their specific media and space
> limits, resulting in entries with unsystematic structures (like this,
> otherwise beautiful, one). If the media is the message, it will be
> necessary to adapt.
>
> Moreover, I doubt multi-pos-per-entry would enable “more efficient
> modelling”, and how capable existing tools are for “representing such data
> without requiring human re-interpretation”, unless the goal were only to
> mirror the original entry rather than also broaden its scope.

By "more efficient", I mean less redundancy, less verbosity and more compact documentation (fewer standards/modules to look into). As for the capability of tools, I am thinking of template-based approaches that rely on the analysis of layout, abbreviations and sequential order alone (XSLT, regular expressions, CFGs, etc.). As for "only to mirror", I still think we need to permit the option to let data providers do exactly that, because that's where most data will be coming from and because it keeps the entry barrier low (which is high enough already).

> It usually requires extra time and manual work to deal with such cases
> manually, but maybe it's good that not everything can be automated, yet ;)?
> And perhaps some amount of duplication of senses is unavoidable and,
> actually, the advantages exceed the drawbacks?

I would agree. But we can actually have both mirrored and enriched resources without any risk to data quality if we provide the means to distinguish them clearly. This is why, for this particular case, I would prefer the lexicog solution over the underspecified lexical entry option that Fahad was proposing: it would allow us to spot "semistructured" entries (which this one clearly is) immediately. In an additional step, "proper" lexical entries can then be inferred, if POS information is provided and the different pieces of information are assigned appropriately. (The situation is very different for languages where POS tagging expertise is hard to come by or not provided by default, because then a lexical entry cannot be expected to have a POS in the first place.)

> One way or the other, I would consider guidelines as a means, not an end,
> which should be open for reconsideration if they can be improved.

Of course, but I'm actually more on the conservative side of things here and think we should limit ourselves to minor rewording, clarifying comments and obvious typos. Any reconsideration comes either with the requirement to update the existing data or with a risk to compatibility. This could be done with major version updates, but not along the way.

Thanks a lot,
Christian
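[Editorial note: a minimal sketch of the kind of template-based conversion Christian describes above, keyed to layout, abbreviations and sequential order alone. The entry text, headword and POS abbreviation inventory are invented for illustration; nothing here is from the thread or from any specific dictionary.]

```python
import re

# Hypothetical abbreviation inventory of an imagined source dictionary.
POS_ABBREV = {"n.": "noun", "v.": "verb", "adj.": "adjective"}

def split_entry(entry: str) -> list[tuple[str, str]]:
    """Split one flat printed entry into (pos, sense-text) blocks,
    using only the sequential order of POS abbreviations."""
    pattern = re.compile(
        r"(n\.|v\.|adj\.)\s+(.*?)(?=(?:n\.|v\.|adj\.)\s|$)", re.S
    )
    return [(POS_ABBREV[m.group(1)], m.group(2).strip())
            for m in pattern.finditer(entry)]

# Invented entry with two parts of speech under one headword.
print(split_entry("walk n. a journey on foot. v. to move on foot."))
# -> [('noun', 'a journey on foot.'), ('verb', 'to move on foot.')]
```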
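[Editorial note: and a minimal sketch of the "lexicog solution" Christian prefers, using rdflib. The source entry is mirrored as a single lexicog:Entry, while the one-pos-per-entry principle is kept at the ontolex:LexicalEntry level; the `ex:` namespace and the "walk" example are invented, whereas the ontolex, lexicog and lexinfo namespaces are the published ones.]

```python
from rdflib import Graph, Namespace, RDF

ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")
LEXICOG = Namespace("http://www.w3.org/2019/09/lexicog#")
LEXINFO = Namespace("http://www.lexinfo.net/ontology/3.0/lexinfo#")
EX = Namespace("http://example.org/dict/")  # invented

g = Graph()
g.bind("ontolex", ONTOLEX)
g.bind("lexicog", LEXICOG)
g.bind("lexinfo", LEXINFO)
g.bind("ex", EX)

# One entry mirroring the macrostructure of the source dictionary.
g.add((EX.walk_entry, RDF.type, LEXICOG.Entry))

# Two "proper" lexical entries inferred from it, each with a single POS.
for suffix, pos in (("_n", LEXINFO.noun), ("_v", LEXINFO.verb)):
    le = EX["walk" + suffix]
    g.add((le, RDF.type, ONTOLEX.LexicalEntry))
    g.add((le, LEXINFO.partOfSpeech, pos))
    # The lexicographic entry describes both lexical entries.
    g.add((EX.walk_entry, LEXICOG.describes, le))

print(g.serialize(format="turtle"))
```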
Received on Tuesday, 4 July 2023 09:54:47 UTC