- From: Christian Chiarcos <christian.chiarcos@web.de>
- Date: Fri, 23 Nov 2018 16:53:03 +0100
- To: andy@seaborne.org
- Cc: hugh@glasers.org, SW-forum <semantic-web@w3.org>, w.g.j.beek@vu.nl
- Message-ID: <CAC1YGdh_9DGjO1PeLAHP6na-ULRRBzw=X44kZNwon=hEuE6o3Q@mail.gmail.com>
On Fri, 23 Nov 2018 at 15:55, Christian Chiarcos <christian.chiarcos@web.de> wrote:

> A much more convenient solution would be to identify the language by means
> of a URI. This can be an ISO 639 category (see under
> http://id.loc.gov/vocabulary/iso639-2.html and
> http://id.loc.gov/vocabulary/iso639-1.html; for ISO 639, cf.
> http://www.lexvo.org/), or provided by another authority (e.g.,
> https://glottolog.org/). Other properties (e.g., xsd datatypes) could
> also be stated about a literal. Two strings could be considered identical
> if the values are the same and the properties of one are a proper subset of
> the properties of the other.
>
> Not sure what the right data structure or representation should be. Maybe
> a kind of container structure for literal metadata (similar to the @
> notation and the lang() properties that we have now).

Thinking about this, a backward-compatible notation is possible:

- take @ as a shorthand for ^^xsd:string, with language identifiers following
- if the language identifier is not a URI, it must be a BCP47 code
- BCP47 codes can be decomposed in the background into their sub-properties
- permit multiple language URIs/BCP47 codes (if you want to provide both a
  BCP47 code [indicating region and script] and a URI [unambiguously
  identifying the language])
- let plain literals be untyped

If literals can carry any number of properties, we get (something like) the
following pairs of literals and properties:

1. "рука"@sr-Cyrl-RS =>
   [ rdf:value "рука"; a xsd:string;
     dct:language <http://id.loc.gov/vocabulary/iso639-1/sr>;
     dct:coverage <http://lexvo.org/id/iso3166/RS>;
     <http://lexvo.org/ontology#usesScript> <http://lexvo.org/id/script/Cyrl> ]

2. "рука" =>
   [ rdf:value "рука" ]

3. "рука"@sr =>
   [ rdf:value "рука"; a xsd:string;
     dct:language <http://id.loc.gov/vocabulary/iso639-1/sr> ]

4. "рука"^^xsd:string =>
   [ rdf:value "рука"; a xsd:string ]

5. "рука"@<https://glottolog.org/resource/languoid/id/serb1264> =>
   [ rdf:value "рука"; a xsd:string;
     dct:language <https://glottolog.org/resource/languoid/id/serb1264> ]

6. "рука"@sr-Cyrs =>
   [ rdf:value "рука"; a xsd:string;
     dct:language <http://id.loc.gov/vocabulary/iso639-1/sr>;
     <http://lexvo.org/ontology#usesScript> <http://lexvo.org/id/script/Cyrs> ]
   (Serbian in the Old Church Slavonic variant of Cyrillic)

Assuming that equality checks whether the values are identical and the
properties of one string are a subset of the properties of the other,
strings 1-4 are equal. String 5 is more complicated, but
https://glottolog.org/resource/languoid/id/serb1264 does also provide an
ISO 639 code. Unfortunately, it does so only as a string value, not as an
owl:sameAs link to the URIs of the ISO 639-1/2 maintainers, but this could
be requested from the Glottolog maintainers. String 6 would be equal to 2, 3
and 4, but not to 1.

This creates some overhead, but the nice thing about it is that we no longer
need to cast between language-specific and plain literals, nor between
xsd:string and plain literals. An (unintended?) side effect would be that a
plain literal can match against any language.

[BTW: No need to model this as blank nodes, but it kind of feels natural
here ;) ]

Best,
Christian

-- 
Prof. Dr. Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany

office: Robert-Mayer-Str. 10, #401b
mail: chiarcos@informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28931
Received on Friday, 23 November 2018 15:53:37 UTC