Re: "Language-tagged strings Re: Toward easier RDF: a proposal" from Christian Chiarcos on 2018-11-23 (semantic-web@w3.org from November 2018)

From: Christian Chiarcos <christian.chiarcos@web.de>
Date: Fri, 23 Nov 2018 16:53:03 +0100
To: andy@seaborne.org
Cc: hugh@glasers.org, SW-forum <semantic-web@w3.org>, w.g.j.beek@vu.nl
Message-ID: <CAC1YGdh_9DGjO1PeLAHP6na-ULRRBzw=X44kZNwon=hEuE6o3Q@mail.gmail.com>

On Fri, 23 Nov 2018 at 15:55, Christian Chiarcos <
christian.chiarcos@web.de> wrote:

> A much more convenient solution would be to identify the language by means
> of a URI. This can be an ISO 639 category (see under
> http://id.loc.gov/vocabulary/iso639-2.html and
> http://id.loc.gov/vocabulary/iso639-1.html; for ISO 639, cf.
> http://www.lexvo.org/), or provided by another authority (e.g.,
> https://glottolog.org/). Other properties (e.g., xsd datatypes) could
> also be stated about a literal. Two strings could be considered identical
> if the values are the same and the properties of one are a proper subset of
> the properties of the other.
>
> Not sure what the right data structure or representation should be. Maybe
> a kind of container structure for literal metadata (similar to the @
> notation and the lang() properties that we have now).
>

Thinking about this, a backward-compatible notation is possible:
- take @ as a short-hand for ^^xsd:string, with language identifiers
following
- if the language identifier is not a URI, it must be BCP47
- BCP47 codes can be decomposed in the background into their sub-properties
- permit multiple language URIs/BCP47 codes (if you want to provide both a
BCP47 code [indicating region and script] and a URI [unambiguously
identifying the language])
- let plain literals be untyped
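
To make the "decomposed in the background" step a bit more concrete, here is
a minimal Python sketch. It is only an illustration under assumptions: the
function name decompose_bcp47 is made up, the URI patterns are the ones used
in the examples in this message, and only two-letter language codes plus
script and region subtags are handled (no variants, extensions or
private-use subtags). Subtags are recognised by shape alone, so their order
does not matter here.

```python
# URI prefixes as used in the examples of this message.
ISO639_1 = "http://id.loc.gov/vocabulary/iso639-1/"
REGION = "http://lexvo.org/id/iso3166/"
SCRIPT = "http://lexvo.org/id/script/"
USES_SCRIPT = "http://lexvo.org/ontology#usesScript"


def decompose_bcp47(tag):
    """Split a simple BCP47 tag into a set of (property, URI) pairs.

    Hypothetical helper: handles language, script and region subtags only.
    """
    subtags = tag.split("-")
    language = subtags[0].lower()
    if not (len(language) == 2 and language.isalpha()):
        raise ValueError("only two-letter language codes are handled: " + tag)
    props = {("dct:language", ISO639_1 + language)}
    for sub in subtags[1:]:
        if len(sub) == 4 and sub.isalpha():
            # Four letters: script subtag, e.g. Cyrl
            props.add((USES_SCRIPT, SCRIPT + sub.title()))
        elif (len(sub) == 2 and sub.isalpha()) or (len(sub) == 3 and sub.isdigit()):
            # Two letters or three digits: region subtag, e.g. RS
            props.add(("dct:coverage", REGION + sub.upper()))
        else:
            raise ValueError("unsupported subtag in this sketch: " + sub)
    return props


for prop, uri in sorted(decompose_bcp47("sr-Cyrl-RS")):
    print(prop, uri)
```

A real implementation would have to follow the full BCP47 grammar and decide
which URI authority to use for three-letter language codes.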

If literals can carry any number of properties, we get (something like) the
following pairs of literals and properties:

1. "рука"@sr-Cyrl-RS
=> [ rdf:value "рука"; a xsd:string;
     dct:language <http://id.loc.gov/vocabulary/iso639-1/sr>;
     dct:coverage <http://lexvo.org/id/iso3166/RS>;
     <http://lexvo.org/ontology#usesScript> <http://lexvo.org/id/script/Cyrl> ]

2. "рука"
=> [ rdf:value "рука" ]

3. "рука"@sr
=> [ rdf:value "рука"; a xsd:string;
     dct:language <http://id.loc.gov/vocabulary/iso639-1/sr> ]

4. "рука"^^xsd:string
=> [ rdf:value "рука"; a xsd:string ]

5. "рука"@<https://glottolog.org/resource/languoid/id/serb1264>
=> [ rdf:value "рука"; a xsd:string;
     dct:language <https://glottolog.org/resource/languoid/id/serb1264> ]

6. "рука"@sr-Cyrs
=> [ rdf:value "рука"; a xsd:string;
     dct:language <http://id.loc.gov/vocabulary/iso639-1/sr>;
     <http://lexvo.org/ontology#usesScript> <http://lexvo.org/id/script/Cyrs> ]
(Serbian in Cyrillic, Old Church Slavonic variant)
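
For illustration, the surface forms above could be read roughly as in the
following Python sketch. This is an assumption-laden toy, not a proposed
grammar: the regular expression, the function name parse_literal and the
placeholder property "bcp47:tag" are all made up here, and escapes inside the
quoted value as well as multiple language identifiers are not handled.

```python
import re

# "value", optionally followed by @<URI>, @bcp47-tag, or ^^datatype
LITERAL = re.compile(
    r'^"(?P<value>[^"]*)"'
    r'(?:@<(?P<uri>[^>]+)>|@(?P<tag>[A-Za-z0-9-]+)|\^\^(?P<dtype>\S+))?$'
)


def parse_literal(text):
    """Return (value, list of (property, object) pairs) for one literal."""
    m = LITERAL.match(text)
    if m is None:
        raise ValueError("not a literal in this toy notation: " + text)
    if m["uri"]:
        # @<URI>: short-hand for xsd:string plus a language URI
        props = [("a", "xsd:string"), ("dct:language", m["uri"])]
    elif m["tag"]:
        # @tag: xsd:string plus a BCP47 tag, to be decomposed later
        props = [("a", "xsd:string"), ("bcp47:tag", m["tag"])]
    elif m["dtype"]:
        props = [("a", m["dtype"])]
    else:
        # plain literal: untyped, no properties
        props = []
    return m["value"], props


print(parse_literal('"рука"@<https://glottolog.org/resource/languoid/id/serb1264>'))
```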

Assuming that equality checks whether the values are identical and whether
the properties of one string are a subset of the properties of the other,
strings 1-4 are equal.
For string 5, it's more complicated, but
https://glottolog.org/resource/languoid/id/serb1264 does also provide an
ISO 639 code. Unfortunately, it does so only as a string value, not as an
owl:sameAs link to the URIs of the ISO 639-1/2 maintainers, but this could
be requested from the Glottolog maintainers.
String 6 would be equal to 2, 3, and 4, but not to 1.
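
The equality check described here can be tried out with a small Python
sketch. It is a toy model under assumptions: a literal is represented as a
pair of a value and a frozenset of (property, object) pairs, the helper
names literal and equal are made up, and "subset" is read as non-strict so
that a literal also equals itself.

```python
TYPE = ("a", "xsd:string")
SR = ("dct:language", "http://id.loc.gov/vocabulary/iso639-1/sr")
RS = ("dct:coverage", "http://lexvo.org/id/iso3166/RS")
CYRL = ("lexvo:usesScript", "http://lexvo.org/id/script/Cyrl")
CYRS = ("lexvo:usesScript", "http://lexvo.org/id/script/Cyrs")
GLOTTO = ("dct:language", "https://glottolog.org/resource/languoid/id/serb1264")


def literal(value, *props):
    """Hypothetical representation: value plus a set of properties."""
    return (value, frozenset(props))


def equal(a, b):
    """Same value, and one property set is a subset of the other."""
    (value_a, props_a), (value_b, props_b) = a, b
    return value_a == value_b and (props_a <= props_b or props_b <= props_a)


s1 = literal("рука", TYPE, SR, RS, CYRL)   # language, region and script
s2 = literal("рука")                       # plain literal
s3 = literal("рука", TYPE, SR)             # language only
s4 = literal("рука", TYPE)                 # xsd:string only
s5 = literal("рука", TYPE, GLOTTO)         # language given as Glottolog URI
s6 = literal("рука", TYPE, SR, CYRS)       # language and a different script

print(equal(s1, s3), equal(s6, s3), equal(s6, s1), equal(s5, s3))
```

Note that in this sketch the relation is not transitive: 6 matches 3 and 3
matches 1, yet 6 does not match 1. String 5 matches 3 only once the
Glottolog URI is mapped to the ISO 639 one.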

This creates some overhead, but the nice thing about this is that we no
longer need to cast between language-specific and plain literals, nor
between xsd:string and plain literals. An (unintended?) side-effect would
be that a plain literal can match against any language.

[BTW: No need to model this as blank nodes, but it kind of feels natural
here ;) ]

Best,
Christian
-- 
Prof. Dr. Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany

office: Robert-Mayer-Str. 10, #401b
mail: chiarcos@informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28931


Received on Friday, 23 November 2018 15:53:37 UTC