Re: "Language-tagged strings Re: Toward easier RDF: a proposal" from Frans Knibbe on 2018-11-24 (semantic-web@w3.org from November 2018)

From: Frans Knibbe <frans.knibbe@geodan.nl>
Date: Sat, 24 Nov 2018 15:02:07 +0100
To: christian.chiarcos@web.de
Cc: semantic-web@w3.org
Message-ID: <CAFVDz42C0rOjk-S2NKH0xrgyo6our8ot-aTGOPTHNn9kRJH7Pg@mail.gmail.com>
On Fri, 23 Nov 2018 at 16:57, Christian Chiarcos <christian.chiarcos@web.de> wrote:

> On Fri, 23 Nov 2018 at 15:55, Christian Chiarcos <christian.chiarcos@web.de> wrote:
>
>> A much more convenient solution would be to identify the language by
>> means of a URI. This can be an ISO 639 category (see under
>> http://id.loc.gov/vocabulary/iso639-2.html and
>> http://id.loc.gov/vocabulary/iso639-1.html; for ISO 639, cf.
>> http://www.lexvo.org/), or provided by another authority (e.g.,
>> https://glottolog.org/). Other properties (e.g., xsd datatypes) could
>> also be stated about a literal. Two strings could be considered identical
>> if the values are the same and the properties of one are a proper subset of
>> the properties of the other.
>>
>> Not sure what the right data structure or representation should be. Maybe
>> a kind of container structure for literal metadata (similar to the @
>> notation and the lang() properties that we have now).
>>
>
> Thinking about this, a downward-compatible notation is possible:
> - take @ as a short-hand for ^^xsd:string, with language identifiers
> following
> - if the language identifier is not a URI, it must be BCP47
> - BCP47 codes can be decomposed in the background into their sub-properties
> - permit multiple language URIs/BCP47 codes (if you want to provide both a
> BCP47 code [indicating region and script] and a URI [unambiguously
> identifying the language])
> - let plain literals be untyped
>
> If literals can carry any number of properties, we get (something like)
> the following pairs of literals and properties:
>
> 1. "рука"@sr-Cyrl-RS
> => [ rdf:value "рука"; a xsd:string;
>      dct:language <http://id.loc.gov/vocabulary/iso639-1/sr>;
>      dct:coverage <http://lexvo.org/id/iso3166/RS>;
>      <http://lexvo.org/ontology#usesScript> <http://lexvo.org/id/script/Cyrl> ]
>
> 2. "рука"
> => [ rdf:value "рука" ]
>
> 3. "рука"@sr
> => [ rdf:value "рука"; a xsd:string;
>      dct:language <http://id.loc.gov/vocabulary/iso639-1/sr> ]
>
> 4. "рука"^^xsd:string
> => [ rdf:value "рука"; a xsd:string ]
>
> 5. "рука"@<https://glottolog.org/resource/languoid/id/serb1264>
> => [ rdf:value "рука"; a xsd:string;
>      dct:language <https://glottolog.org/resource/languoid/id/serb1264> ]
>
> 6. "рука"@sr-Cyrs
> => [ rdf:value "рука"; a xsd:string;
>      dct:language <http://id.loc.gov/vocabulary/iso639-1/sr>;
>      <http://lexvo.org/ontology#usesScript> <http://lexvo.org/id/script/Cyrs> ]
> (Serbian in Cyrillic, Old Church Slavonic variant)
>
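
The background decomposition of BCP47 codes into sub-properties that Christian mentions could be sketched roughly as follows. This is only an illustration in Python: the function name `decompose`, the short property labels, and the URI prefixes (copied from the examples in this thread) are assumptions, and the sketch handles only a two-letter language subtag plus optional four-letter script and two-letter region subtags, not the full BCP47 grammar.

```python
import re

# URI prefixes copied from the examples in this thread; whether these are
# the right authorities for every tag is an assumption of this sketch.
LANG_BASE = "http://id.loc.gov/vocabulary/iso639-1/"
REGION_BASE = "http://lexvo.org/id/iso3166/"
SCRIPT_BASE = "http://lexvo.org/id/script/"


def decompose(tag):
    """Split a simple language tag into property/URI pairs (hypothetical helper)."""
    subtags = tag.split("-")
    # The first subtag is taken to be the language code.
    props = {"dct:language": LANG_BASE + subtags[0].lower()}
    for sub in subtags[1:]:
        if re.fullmatch(r"[A-Za-z]{4}", sub):
            # Four letters: treated as a script code such as Cyrl or Cyrs.
            props["lexvo:usesScript"] = SCRIPT_BASE + sub.title()
        elif re.fullmatch(r"[A-Za-z]{2}", sub):
            # Two letters: treated as a region code such as RS.
            props["dct:coverage"] = REGION_BASE + sub.upper()
        # Any other subtag (variants, extensions) is ignored here.
    return props


for prop, uri in decompose("sr-Cyrs").items():
    print(prop, uri)
```

Because the helper classifies subtags by length rather than by position, it is deliberately lenient about subtag order; a real implementation would more likely validate tags against the BCP47 rules first.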


Thanks for those examples. Actually, I think the more elaborate
translations of the shorthand notations are much more likeable. Perhaps
there will be a need to say even more about text strings, like the
aforementioned pronunciation. I know my language has had several spelling
reforms over the years. In some cases it might be necessary to indicate the
version of spelling that is used for a string. So in theory there is a
significant stack of statements that can be made about a text string.

There is also the matter of an appropriate level for making statements
about text strings. If I have a dataset with lots of text strings in one
particular language, it would be much more efficient to declare the
language used only once, at the class and/or metadata level. Using plain
properties to indicate language enables doing that.
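
To make the idea concrete, here is a minimal Python sketch of such a fallback. It assumes a convention that RDF itself does not define: a language stated once on the dataset applies to every string that does not state its own. The function name `language_of`, the dictionary layout, and the sample strings are purely illustrative.

```python
def language_of(literal_props, dataset_props):
    """Return the string's own dct:language if present, else the dataset's."""
    return literal_props.get("dct:language", dataset_props.get("dct:language"))


# The language is declared once, at the dataset (metadata) level.
dataset = {"dct:language": "http://id.loc.gov/vocabulary/iso639-1/nl"}

labels = [
    # No language of its own: falls back to the dataset-level statement.
    {"rdf:value": "hand"},
    # An explicit language on the string overrides the dataset default.
    {"rdf:value": "рука",
     "dct:language": "http://id.loc.gov/vocabulary/iso639-1/sr"},
]

languages = [language_of(label, dataset) for label in labels]
print(languages)
```

The fallback lives entirely in the consuming application here; whether it should instead be expressed as an inference rule or in a shape or schema language is left open.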

When I first came into contact with RDF, I liked the idea of being able to
use a single pattern to say anything about anything. But on closer
inspection, RDF seems to violate its own doctrine by having separate
systems for data types and languages of literals. I wonder if it is really
necessary to have those two different systems within the system. Or is it
just an artefact of RDF’s historical development? In that case, how about
letting those features just die out, recommending not to use them any
longer? Probably I can’t see all consequences of such a move, but putting
special notations for language and data types of literals on the
deprecation track seems worth considering. Given the level of deep thought
behind the semweb, it probably already has been considered...


Greetings,

Frans


> Assuming that equality checks whether the values are identical and the
> properties of one string are a subset of the properties of the other,
> strings 1-4 are equal.
> For String 5, it's more complicated, but
> https://glottolog.org/resource/languoid/id/serb1264 does also provide an
> ISO 639 code. Unfortunately, it is given only as a string value, not as an
> owl:sameAs link to the ISO 639-1/2 maintainers, but such a link could be
> requested from the Glottolog maintainers.
> String 6 would be equal to 2,3,4, but not to 1.
>
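
The subset-based comparison described here can be sketched in a few lines of Python. The dictionary representation, the property keys (`type`, `language`, `region`, `script`), and the function name `matches` are assumptions made for illustration; the check uses a non-strict subset, so two strings with identical properties also match.

```python
def matches(a, b):
    """True if the values are identical and one property set contains the other."""
    if a["value"] != b["value"]:
        return False
    pa, pb = set(a["props"].items()), set(b["props"].items())
    return pa <= pb or pb <= pa


# Strings 1, 2, 3, 4 and 6 from the examples, with shortened property values.
s1 = {"value": "рука", "props": {"type": "xsd:string", "language": "sr",
                                 "region": "RS", "script": "Cyrl"}}
s2 = {"value": "рука", "props": {}}
s3 = {"value": "рука", "props": {"type": "xsd:string", "language": "sr"}}
s4 = {"value": "рука", "props": {"type": "xsd:string"}}
s6 = {"value": "рука", "props": {"type": "xsd:string", "language": "sr",
                                 "script": "Cyrs"}}

print(matches(s2, s1))  # True: the empty property set is a subset of any other
print(matches(s6, s3))  # True: the properties of s3 are a subset of those of s6
print(matches(s6, s1))  # False: the scripts differ and s6 has no region
```

One consequence worth noting: the relation is not transitive (6 matches 3 and 3 matches 1, yet 6 does not match 1), so it behaves like a compatibility test rather than equality in the strict sense.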
> This creates some overhead, but the nice thing about this is that we no
> longer need to cast between language-specific and plain literals, nor
> between xsd:string and plain literals. An (unintended?) side-effect would
> be that a plain literal can match against any language.
>
> [BTW: No need to model this as blank nodes, but it kind of feels natural
> here ;) ]
>
> Best,
> Christian
> --
> Prof. Dr. Christian Chiarcos
> Applied Computational Linguistics
> Johann Wolfgang Goethe Universität Frankfurt a. M.
> 60054 Frankfurt am Main, Germany
>
> office: Robert-Mayer-Str. 10, #401b
> mail: chiarcos@informatik.uni-frankfurt.de
> web: http://acoli.cs.uni-frankfurt.de
> tel: +49-(0)69-798-22463
> fax: +49-(0)69-798-28931
>
Received on Saturday, 24 November 2018 14:02:28 UTC