- From: Frans Knibbe <frans.knibbe@geodan.nl>
- Date: Sat, 24 Nov 2018 15:02:07 +0100
- To: christian.chiarcos@web.de
- Cc: semantic-web@w3.org
- Message-ID: <CAFVDz42C0rOjk-S2NKH0xrgyo6our8ot-aTGOPTHNn9kRJH7Pg@mail.gmail.com>
On Fri, 23 Nov 2018 at 16:57, Christian Chiarcos <christian.chiarcos@web.de> wrote:

> On Fri, 23 Nov 2018 at 15:55, Christian Chiarcos <christian.chiarcos@web.de> wrote:
>
>> A much more convenient solution would be to identify the language by
>> means of a URI. This can be an ISO 639 category (see under
>> http://id.loc.gov/vocabulary/iso639-2.html and
>> http://id.loc.gov/vocabulary/iso639-1.html; for ISO 639, cf.
>> http://www.lexvo.org/), or provided by another authority (e.g.,
>> https://glottolog.org/). Other properties (e.g., xsd datatypes) could
>> also be stated about a literal. Two strings could be considered identical
>> if the values are the same and the properties of one are a proper subset of
>> the properties of the other.
>>
>> Not sure what the right data structure or representation should be. Maybe
>> a kind of container structure for literal metadata (similar to the @
>> notation and the lang() properties that we have now).
>
> Thinking about this, a downward-compatible notation is possible:
> - take @ as a short-hand for ^^xsd:string, with language identifiers
>   following
> - if the language identifier is not a URI, it must be BCP47
> - BCP47 codes can be decomposed in the background into their sub-properties
> - permit multiple language URIs/BCP47 codes (if you want to provide both a
>   BCP47 code [indicating region and script] and a URI [unambiguously
>   identifying the language])
> - let plain literals be untyped
>
> If literals can carry any number of properties, we get (something like)
> the following pairs of literals and properties:
>
> 1. "рука"@sr-RS-Cyrl
>    => [ rdf:value "рука"; a xsd:string;
>         dct:language <http://id.loc.gov/vocabulary/iso639-1/sr>;
>         dct:coverage <http://lexvo.org/id/iso3166/RS>;
>         <http://lexvo.org/ontology#usesScript> <http://lexvo.org/id/script/Cyrl> ]
>
> 2. "рука"
>    => [ rdf:value "рука" ]
>
> 3. "рука"@sr
>    => [ rdf:value "рука"; a xsd:string;
>         dct:language <http://id.loc.gov/vocabulary/iso639-1/sr> ]
>
> 4. "рука"^^xsd:str
>    => [ rdf:value "рука"; a xsd:string ]
>
> 5. "рука"@<https://glottolog.org/resource/languoid/id/serb1264>
>    => [ rdf:value "рука"; a xsd:string;
>         dct:language <https://glottolog.org/resource/languoid/id/serb1264> ]
>
> 6. "рука"@sr-Cyrs
>    => [ rdf:value "рука"; a xsd:string;
>         dct:language <http://id.loc.gov/vocabulary/iso639-1/sr>;
>         <http://lexvo.org/ontology#usesScript> <http://lexvo.org/id/script/Cyrs> ]
>    (Serbian in Cyrillic/Old Church Slavonian variant)

Thanks for those examples. Actually, I think the more elaborate
translations of the shorthand notations are much more likeable. Perhaps
there will be a need to say even more about text strings, like the
aforementioned pronunciation. I know my language has had several spelling
reforms over the years. In some cases it might be necessary to indicate
the version of spelling that is used for a string. So in theory there is a
significant stack of statements that can be made about a text string.

There is also the matter of an appropriate level for making statements
about text strings. If I have a dataset with lots of text strings in one
particular language, it would be much more efficient to declare the
language used only once, at the class and/or metadata level. Using plain
properties to indicate language enables doing that.

When I first came into contact with RDF, I liked the idea of being able to
use a single pattern to say anything about anything. But on closer
inspection, RDF seems to violate its own doctrine by having separate
systems for data types and languages of literals. I wonder if it is really
necessary to have those two different systems within the system. Or is it
just an artefact of RDF's historical development? In that case, how about
letting those features just die out, recommending not to use them any
longer? Probably I can't see all consequences of such a move, but putting
special notations for language and data types of literals on the
deprecation track seems worth considering. Given the level of deep thought
behind the semweb, it probably already has been considered...

Greetings,
Frans

> Assuming that equality checks whether values are identical and the
> properties of one string are a subset of the properties of the other, the
> strings 1-4 are equal.
> For String 5, it's more complicated, but
> https://glottolog.org/resource/languoid/id/serb1264 does also provide an
> ISO639 code. Unfortunately, not with an owl:sameAs link to the ISO639-1/2
> maintainers, but only as a string value, but this could be requested from
> the glottolog maintainers.
> String 6 would be equal to 2,3,4, but not to 1.
>
> This creates some overhead, but the nice thing about this is that we no
> longer need to cast between language-specific and plain literals, nor
> between xsd:string and plain literals. An (unintended?) side-effect would
> be that a plain literal can match against any language.
>
> [BTW: No need to model this as blank nodes, but it kind of feels natural
> here ;) ]
>
> Best,
> Christian
> --
> Prof. Dr. Christian Chiarcos
> Applied Computational Linguistics
> Johann Wolfgang Goethe Universität Frankfurt a. M.
> 60054 Frankfurt am Main, Germany
>
> office: Robert-Mayer-Str. 10, #401b
> mail: chiarcos@informatik.uni-frankfurt.de
> web: http://acoli.cs.uni-frankfurt.de
> tel: +49-(0)69-798-22463
> fax: +49-(0)69-798-28931
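
To make the proposal quoted above easier to check, here is a minimal Python sketch of the decomposition of the "@" shorthand into properties and of the subset-based equality test. It is only an illustration under stated assumptions, not an implementation of any RDF library: the function names (language_properties, literal, loosely_equal) and the "lexvo:usesScript" abbreviation are hypothetical, the primary subtag is assumed to be a two-letter ISO 639-1 code, and subtags are classified by their shape (four letters = script, two letters = region) so that the tag order used in the thread is accepted.

```python
import re

ISO639_1 = "http://id.loc.gov/vocabulary/iso639-1/"
SCRIPT = "http://lexvo.org/id/script/"
REGION = "http://lexvo.org/id/iso3166/"


def language_properties(tag):
    """Expand the part after '@' into a set of (property, value) pairs."""
    # '@' is treated as shorthand for ^^xsd:string, as proposed in the thread.
    props = {("rdf:type", "xsd:string")}
    if tag.startswith("<") and tag.endswith(">"):
        # The language is given directly as a URI.
        props.add(("dct:language", tag[1:-1]))
        return props
    subtags = tag.split("-")
    # Assumption: the primary subtag is a two-letter ISO 639-1 code.
    props.add(("dct:language", ISO639_1 + subtags[0].lower()))
    for sub in subtags[1:]:
        # Classify by shape, not position, so "sr-RS-Cyrl" is accepted.
        if re.fullmatch(r"[A-Za-z]{4}", sub):
            props.add(("lexvo:usesScript", SCRIPT + sub.title()))
        elif re.fullmatch(r"[A-Za-z]{2}", sub):
            props.add(("dct:coverage", REGION + sub.upper()))
    return props


def literal(value, lang=None, datatype=None):
    """Model a literal as (value, frozenset of properties); plain literals have none."""
    props = set()
    if datatype is not None:
        props.add(("rdf:type", datatype))
    if lang is not None:
        props |= language_properties(lang)
    return value, frozenset(props)


def loosely_equal(a, b):
    """Same value, and one property set is a subset of the other."""
    return a[0] == b[0] and (a[1] <= b[1] or b[1] <= a[1])


# The six literals from the thread, in the same order.
s1 = literal("рука", lang="sr-RS-Cyrl")
s2 = literal("рука")
s3 = literal("рука", lang="sr")
s4 = literal("рука", datatype="xsd:string")
s5 = literal("рука", lang="<https://glottolog.org/resource/languoid/id/serb1264>")
s6 = literal("рука", lang="sr-Cyrs")

print(loosely_equal(s1, s3))  # True
print(loosely_equal(s6, s3))  # True
print(loosely_equal(s6, s1))  # False
```

Under this reading, string 6 matches string 3 and string 3 matches string 1, yet string 6 does not match string 1, so the relation is not transitive. String 5 matches only strings 2 and 4 unless the Glottolog URI is linked to the ISO 639 one.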
Received on Saturday, 24 November 2018 14:02:28 UTC