Re: "Language-tagged strings Re: Toward easier RDF: a proposal"

On Sat, 24 Nov 2018 at 15:02, Frans Knibbe <
frans.knibbe@geodan.nl> wrote:

> On Fri, 23 Nov 2018 at 16:57, Christian Chiarcos <
> christian.chiarcos@web.de> wrote:
>
>> A much more convenient solution would be to identify the language by
>>> means of a URI. (...)
>>>
>> Thinking about this, a downward-compatible notation is possible:
>> - take @ as a short-hand for ^^xsd:string, with language identifiers
>> following
>> - if the language identifier is not a URI, it must be BCP47
>> - BCP47 codes can be decomposed in the background into their
>> sub-properties
>> - permit multiple language URIs/BCP47 codes (if you want to provide both
>> a BCP47 code [indicating region and script] and a URI [unambiguously
>> identifying the language])
>> - let plain literals be untyped
>>
>> If literals can carry any number of properties, we get (something like)
>> the following pairs of literals and properties:
>>
>> 1. "рука"@sr-RS-Cyrl
>> => [ rdf:value "рука"; a xsd:string; dct:language <
>> http://id.loc.gov/vocabulary/iso639-1/sr>; dct:coverage <
>> http://lexvo.org/id/iso3166/RS>; <http://lexvo.org/ontology#usesScript> <
>> http://lexvo.org/id/script/Cyrl> ]
>>
>

> Thanks for those examples. Actually, I think the more elaborate
> translations of the shorthand notations are much more likeable.
>

Thank you. It does create some overhead, though. In any case, the shorthand
notations should definitely be preserved.
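
To make the quoted expansion a bit more concrete, here is a minimal Python
sketch of how a simple tag could be decomposed "in the background". It is
only an illustration: the function names are hypothetical, the URI prefixes
are copied from the example above and treated as templates (an assumption),
and only two-letter language codes plus optional script and region subtags
are handled. Note that BCP47 itself prescribes the order
language-script-region (sr-Cyrl-RS); the sketch classifies subtags by their
shape, so it accepts either order.

```python
import re

# URI prefixes copied from the quoted example; using them as templates
# for arbitrary codes is an assumption of this sketch.
LANG_BASE = "http://id.loc.gov/vocabulary/iso639-1/"
REGION_BASE = "http://lexvo.org/id/iso3166/"
SCRIPT_BASE = "http://lexvo.org/id/script/"


def expand_tag(tag):
    """Map a simple language tag to a list of (property, URI) pairs."""
    subtags = tag.split("-")
    language = subtags[0].lower()
    if not re.fullmatch(r"[a-z]{2}", language):
        raise ValueError("only two-letter language codes are handled: " + tag)
    props = [("dct:language", LANG_BASE + language)]
    for sub in subtags[1:]:
        if re.fullmatch(r"[A-Za-z]{4}", sub):
            # Four letters: script subtag, conventionally title-cased.
            props.append(("lexvo:usesScript", SCRIPT_BASE + sub.title()))
        elif re.fullmatch(r"[A-Za-z]{2}", sub):
            # Two letters after the language: region subtag, upper-cased.
            props.append(("dct:coverage", REGION_BASE + sub.upper()))
        else:
            raise ValueError("subtag not handled by this sketch: " + sub)
    return props


def expand_literal(value, tag):
    """Return the long form of "value"@tag as a plain dictionary."""
    return {"rdf:value": value, "rdf:type": "xsd:string",
            "properties": expand_tag(tag)}
```

Calling expand_literal("рука", "sr-RS-Cyrl") yields the value, the
xsd:string type and the three property/URI pairs from example 1, in the
order in which the subtags appear in the tag.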


> Perhaps there will be a need to say even more about text strings, like the
> aforementioned pronunciation. I know my language has had several spelling
> reforms over the years. In some cases it might be necessary to indicate the
> version of spelling that is used for a string. So in theory there is a
> significant stack of statements that can be made about a text string.
>

Absolutely. This is one of the reasons why BCP47 isn't very satisfying for
multilingual NLP and minority languages -- nor for historical data.


> There is also the matter of an appropriate level for making statements
> about text strings. If I have a dataset with lots of text strings in one
> particular language, it would be much more efficient to declare the
> language used only once, at the class and/or metadata level. Using plain
> properties to indicate language enables doing that.
>

Yes. This is already possible (using the pointers to ISO639 URIs in my
earlier mail), and it is recommended practice in OntoLex/lemon
(lexicon model for ontologies; see esp.
https://www.w3.org/2016/05/ontolex/#lexicon-and-lexicon-metadata). OntoLex
is a W3C community group report rather than a W3C recommendation, but it
would be the most suitable basis for future standardization efforts in this
direction. However, there is no formal relation between the global language
property assigned to a lexicon (i.e., an ontology lexicalization profile)
and the language tags on its literals, so string comparison between two
lemon sources currently requires casting to xsd:string to be on the safe
side. I would not generally eliminate language tags (even though resource
language can be asserted otherwise), because they're handy and widely used,
but I would like to improve the way they are treated in string comparison.
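
The comparison problem can be illustrated with a small Python sketch. It
models a literal as a pair of lexical form and optional language tag
(datatypes are left out for brevity, and the names Literal, same_term and
same_string are hypothetical). Term equality in the RDF sense also compares
the language tag, so a tagged literal from one source never equals the
untagged literal from a source that declares its language only at the
lexicon level; comparing lexical forms only corresponds to the cast
mentioned above.

```python
from typing import NamedTuple, Optional


class Literal(NamedTuple):
    lexical: str                 # the character string itself
    lang: Optional[str] = None   # language tag, or None if untagged


def same_term(a, b):
    """Term equality: same lexical form and same language tag.

    Language tags are compared case-insensitively.
    """
    def norm(tag):
        return tag.lower() if tag is not None else None
    return a.lexical == b.lexical and norm(a.lang) == norm(b.lang)


def same_string(a, b):
    """Comparison after the cast: only the lexical forms are compared."""
    return a.lexical == b.lexical
```

With this model, Literal("hand", "en") and Literal("hand") are different
terms under same_term, but the same string under same_string.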


> When I first came into contact with RDF, I liked the idea of being able to
> use a single pattern to say anything about anything. But on closer
> inspection, RDF seems to violate its own doctrine by having separate
> systems for data types and languages of literals. I wonder if it is really
> necessary to have those two different systems within the system. Or is it
> just an artefact of RDF’s historical development?
>

I keep asking myself the same question.


> In that case, how about letting those features just die out, recommending
> not to use them any longer? Probably I can’t see all consequences of such a
> move, but putting special notations for language and data types of literals
> on the deprecation track seems worth considering.
>

For reasons of backward compatibility, I would prefer not to deprecate the
notations, but rather to preserve them as syntactic sugar for (a portion,
at least, of) whatever follow-up solution emerges. (We could, however,
deprecate SPARQL STRLANG, STRDT, LANGMATCHES and LANG and replace their use
in FILTER and BIND by ordinary RDF statements.) The triplification idea
above, together with the re-conception of equality tests as same value +
subset of metadata properties, is one possibility to ease working with
language tags, but certainly not the only one. Either way, it would not
break existing technology (not at the level of RDF(S) processing, at least;
however, modelling typed strings as individuals/blank nodes means that an
owl:DatatypeProperty with range restrictions or typed values becomes a
[special] kind of ObjectProperty, and fixing this may require [minor]
adjustments to OWL2).
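
As a rough illustration of "same value + subset of metadata properties",
here is a minimal Python sketch. It assumes that a literal is represented as
a pair of its value and a frozenset of (property, object) pairs; this
representation and the function name matches are hypothetical, not part of
any existing RDF library.

```python
def matches(query, candidate):
    """True if candidate has the same value as query and carries at
    least all of the metadata properties that query carries."""
    q_value, q_props = query
    c_value, c_props = candidate
    # "<=" on frozensets is the subset test.
    return q_value == c_value and q_props <= c_props


# Example data, using the URIs from the example earlier in this thread.
SR = ("dct:language", "http://id.loc.gov/vocabulary/iso639-1/sr")
RS = ("dct:coverage", "http://lexvo.org/id/iso3166/RS")

plain = ("рука", frozenset())           # untyped, no metadata
serbian = ("рука", frozenset({SR}))     # language only
serbian_rs = ("рука", frozenset({SR, RS}))  # language and region
```

Here matches(plain, serbian_rs) and matches(serbian, serbian_rs) hold, but
matches(serbian_rs, serbian) does not. The test is therefore asymmetric: it
is a subsumption check rather than an equivalence relation, which any
follow-up solution would need to take into account.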

Best,
Christian
--
Prof. Dr. Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany

office: Robert-Mayer-Str. 10, #401b
mail: chiarcos@informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28931

Received on Saturday, 24 November 2018 15:51:14 UTC