Re: "Language-tagged strings Re: Toward easier RDF: a proposal" from Christian Chiarcos on 2018-11-24 (semantic-web@w3.org from November 2018)

From: Christian Chiarcos <christian.chiarcos@web.de>
Date: Sat, 24 Nov 2018 21:16:59 +0100
To: Andy Seaborne <andy@seaborne.org>
Cc: Hugh Glaser <hugh@glasers.org>, SW-forum <semantic-web@w3.org>
Message-ID: <CAC1YGdjKhy2i9bosPkv3SsqUurtQ7XNM0sLhDCs3yNWD2An-Kg@mail.gmail.com>

On Sat, 24 Nov 2018 at 18:42, Andy Seaborne <andy@seaborne.org> wrote:

> "chat"^^xsd:string is a string of characters.
>
> I think of language as a bit like units: 23 lb != 23 kg, and neither
> is 23.


This is an oversimplification, because we don't have subtypes of kilogram.
But we do have region and script codes that combine with language tags to
form a complex language tag that is a specialization of the original
language tag. It would be nice to recognize that "cat"@en and "cat"@en-US
are the same thing, whereas "cat"@en-US and "cat"@en-NZ are not. And it
would be nice to say that Resian (a specific variety of Slovene spoken in
Italy) is something different from sl-IT (standard Slovene that happens to
be written in Italy) -- BCP47 conflates these, but a (more easily extensible)
URI-based solution (by reference to, say,
https://glottolog.org/resource/languoid/id/resi1246) would support that.
And it would be nice if we could interpret "cat" as an underspecification
of either "cat"@en or "cat"@en-NZ and match them without having to explain
to RDF novices that a string is not a string if it has a language, and that
New Zealand English is just as different from "generic" English as Scots
and Vietnamese are from each other. Triple-based language (region, script,
etc.) identification would be more appropriate, because we could link the
language identifiers with information about the language variety intended.
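
To make the intended matching behaviour concrete, here is a minimal Python
sketch. It is only an illustration under assumptions: the names Literal,
subsumes and compatible are made up for this example, the prefix comparison
on subtag boundaries is roughly in the spirit of RFC 4647 basic filtering,
and treating an untagged string as compatible with any tagged one is the
behaviour wished for above, not what RDF currently specifies.

```python
from typing import NamedTuple, Optional


class Literal(NamedTuple):
    """Hypothetical stand-in for an RDF literal: lexical form plus optional tag."""
    lexical: str
    lang: Optional[str] = None  # None means a plain, untagged string


def subsumes(general: Optional[str], specific: Optional[str]) -> bool:
    """True if `general` equals `specific` or is a prefix of it on a subtag boundary."""
    if general is None:
        # Assumption: no tag at all is the most general case.
        return True
    if specific is None:
        return False
    g = general.lower().split("-")
    s = specific.lower().split("-")
    return s[:len(g)] == g


def compatible(a: Literal, b: Literal) -> bool:
    """Same lexical form, and one language tag specializes the other."""
    return a.lexical == b.lexical and (
        subsumes(a.lang, b.lang) or subsumes(b.lang, a.lang)
    )
```

With these definitions, compatible(Literal("cat", "en"), Literal("cat", "en-US"))
holds, as does the comparison of the untagged Literal("cat") with
Literal("cat", "en-NZ"), while "cat"@en-US and "cat"@en-NZ are kept apart
because neither tag is a prefix of the other.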


> "chat"@en and "chat"@fr are different.
>

>   "chat" rdf:lang "en" .
>   "chat" rdf:lang "fr" .
>
> makes every use of "chat" both @en and @fr.
>

I think the only way to avoid this would be if subject literals were
taken as a notational shorthand for a blank node that carries the literal
as an rdf:value. (And, in a separate step, a problem-specific bnode
skolemization routine could be provided to give it a proper URI.)
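
A rough Python sketch of that expansion, under assumptions: triples are
plain tuples, Lit and desugar are names invented for this example, and
rdf:lang is the hypothetical property from the quoted message (only
rdf:value is an existing RDF term). Each occurrence of a literal in subject
position gets its own fresh blank node, so the two statements about "chat"
no longer collapse onto one shared node.

```python
from dataclasses import dataclass
from itertools import count
from typing import Iterable, List, Tuple, Union


@dataclass(frozen=True)
class Lit:
    """Hypothetical marker type for a literal term."""
    value: str


Term = Union[str, Lit]
Triple = Tuple[Term, str, Term]


def desugar(triples: Iterable[Triple]) -> List[Triple]:
    """Replace each literal subject by a fresh blank node carrying it as rdf:value."""
    fresh = count()
    out: List[Triple] = []
    for s, p, o in triples:
        if isinstance(s, Lit):
            bnode = f"_:b{next(fresh)}"  # one new blank node per occurrence
            out.append((bnode, "rdf:value", s))
            out.append((bnode, p, o))
        else:
            out.append((s, p, o))
    return out


expanded = desugar([
    (Lit("chat"), "rdf:lang", Lit("en")),
    (Lit("chat"), "rdf:lang", Lit("fr")),
])
```

Here expanded contains four triples over two distinct blank nodes, one
tagged "en" and one tagged "fr". A later skolemization step, not shown,
could swap those blank node labels for proper URIs.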

> >> I often end up adding @en to all the strings, or removing region
> >> tags etc., just so I can do things more easily, which is surely a Bad
> >> Thing.
>
> I don't think it is bad.
>

It is, because such an extra step is very hard to justify to newcomers. You
explain to them that SPARQL is actually quite intuitive if you understand
Turtle and SQL, but in the next breath, you need to introduce an extra
construct just so they can match a value in real-world data. You basically
lose the next generation, because the very first thing they learn about
SPARQL or RDF is that it is a nice concept with an idiosyncratic
implementation -- and this is not the last idiosyncrasy they'll encounter.

Best,
Christian
-- 
Prof. Dr. Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany

office: Robert-Mayer-Str. 10, #401b
mail: chiarcos@informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28931

Received on Saturday, 24 November 2018 20:17:32 UTC