Re: "Language-tagged strings Re: Toward easier RDF: a proposal" from Christian Chiarcos on 2018-11-23 (semantic-web@w3.org from November 2018)

From: Christian Chiarcos <christian.chiarcos@web.de>
Date: Fri, 23 Nov 2018 15:55:31 +0100
To: andy@seaborne.org
Cc: hugh@glasers.org, SW-forum <semantic-web@w3.org>, w.g.j.beek@vu.nl
Message-ID: <CAC1YGdhZJLd8FeEjxqq1947r-kJBVjejB4WNuCFfy_8vnQ3jvw@mail.gmail.com>
Dear all,

language codes do matter, but they are pretty inconvenient for several reasons:
- comparability with untyped/plain strings (the most obvious problem, and
counter-intuitive to RDF novices),
- complexity (BCP47 defines (a) complex selection rules among ISO 639
language tags, and (b) complex rules for composition, e.g., with script and
region codes),
- confusability (having 2-letter codes alongside 3-letter codes for the
same language makes it easy for people used to working with 3-letter codes
to mix the two up, which can result in a failure to compare, e.g.,
"cat"@eng and "cat"@en. Not sure what should happen when
you compare "рука"@sr-Cyrl with "рука"@sr. Both are identical, the first is
just more explicit in stating that this is Cyrillic.), and
- coverage (for many applications, ISO 639 simply isn't fine-grained or
well-defined enough, and extending it is slow, bureaucratic and uncertain).
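
To make the comparison problem concrete, here is a minimal Python sketch. It assumes a literal can be modelled as a plain (lexical form, language tag) pair; `literal` and `lang_matches` are hypothetical helpers written for this mail, not the API of any RDF library. `lang_matches` only imitates basic prefix matching in the spirit of SPARQL's langMatches().

```python
# Illustrative model only: a language-tagged string as a (lexical form, tag) pair.

def literal(lexical, lang=None):
    # Language tags are case-insensitive, so normalise them to lower case.
    return (lexical, lang.lower() if lang else None)

def lang_matches(tag, lang_range):
    """True if tag equals lang_range or extends it with further subtags."""
    if tag is None:
        return False
    tag, lang_range = tag.lower(), lang_range.lower()
    return tag == lang_range or tag.startswith(lang_range + "-")

# Equality compares the whole pair, so all of these are different literals.
assert literal("cat", "eng") != literal("cat", "en")
assert literal("рука", "sr-Cyrl") != literal("рука", "sr")
assert literal("cat") != literal("cat", "en")

# Prefix matching can relate sr-Cyrl to sr, but it cannot relate eng to en.
assert lang_matches("sr-Cyrl", "sr")
assert not lang_matches("eng", "en")
```

So under this reading the script case can be rescued by range matching on the tag, while the 2-letter/3-letter mix-up cannot.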

A much more convenient solution would be to identify the language by means
of a URI. This can be an ISO 639 category (see under
http://id.loc.gov/vocabulary/iso639-2.html and
http://id.loc.gov/vocabulary/iso639-1.html; for ISO 639, cf.
http://www.lexvo.org/), or provided by another authority (e.g.,
https://glottolog.org/). Other properties (e.g., xsd datatypes) could also
be stated about a literal. Two strings could be considered identical if the
values are the same and the properties of one are a proper subset of the
properties of the other.
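
A rough sketch of that comparison rule in Python, just to show the idea. Everything here is an assumption for illustration: the `DescribedLiteral` class, the `same_string` helper and the example.org URIs are made up, and the sketch treats equal property sets as matching too (i.e., subset rather than strictly proper subset).

```python
from dataclasses import dataclass

# Made-up property and language URIs, for illustration only.
LANG = "http://example.org/prop/language"
SCRIPT = "http://example.org/prop/script"
SR = "http://example.org/lang/sr"
RU = "http://example.org/lang/ru"
CYRL = "http://example.org/script/Cyrl"

@dataclass(frozen=True)
class DescribedLiteral:
    value: str
    properties: frozenset = frozenset()  # set of (property, object) pairs

def same_string(a, b):
    """Same value, and one literal's properties are contained in the other's."""
    if a.value != b.value:
        return False
    return a.properties <= b.properties or b.properties <= a.properties

plain = DescribedLiteral("рука")
sr = DescribedLiteral("рука", frozenset({(LANG, SR)}))
sr_cyrl = DescribedLiteral("рука", frozenset({(LANG, SR), (SCRIPT, CYRL)}))
ru = DescribedLiteral("рука", frozenset({(LANG, RU)}))

assert same_string(sr, sr_cyrl)   # sr-Cyrl is just more explicit than sr
assert same_string(plain, sr)     # a plain string matches a described one
assert not same_string(sr, ru)    # conflicting language properties
```

One consequence of the rule as sketched: it is not transitive, since the plain string matches both the Serbian and the Russian literal, which do not match each other.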

Not sure what the right data structure or representation should be. Maybe a
kind of container structure for literal metadata (similar to the @ notation
and the lang() function that we have now).

Best,
Christian

On Fri, 23 Nov 2018 at 15:03, Andy Seaborne <andy@seaborne.org> wrote:

>
>
> On 23/11/2018 12:03, Hugh Glaser wrote:
> > Ah, good topic.
> >
> > So another thing I don't understand (:-)) is why we have to have
> > language tags on strings at all, and indeed datatypes.
>
> As someone who works with a product that is used by users in different
> geographies, I can say that language tags matter.
>
> And I live near "cymru"@cy.
>
> > (OK, it's because of XML heritage or something, I guess.)
> > But we have a perfectly good way of representing knowledge about things.
> > It is a real pain to create these 3 component literals and to query for
> > different languages and datatypes in SPARQL.
> > And worse still, if you want to query for strings that may or may not
> > have language tags on, you need to do some real messing about.
>
> STR(?var) in SPARQL.
>
> xsd:string("abc"@en) if you are lucky.
>
> > I often end up adding @en to all the strings, or removing region tags
> > etc., just so I can do things more easily, which is surely a Bad Thing.
> >
> > Surely languages and datatypes should simply be RDF properties of
> > Literals, which are 1 component things?
> > Much easier to explain to developers, and for them to use.
> > (If indeed they want to use raw RDF.)
>
> As in:
>
>    "chat" rdf:lang "en" .
>
> ?
>
> That would make all occurrences of "chat" @en.
>
> They really are different literals.
>
>      Andy
>
>
> >
> >> On 23 Nov 2018, at 11:48, Andy Seaborne <andy@seaborne.org> wrote:
> >>
> >> The RDF 1.1 WG did spend some time on this - both on putting the
> >> langtag into the lexical space and putting the lang tag into the datatype.
> >> Both are not so easy; in the end, rdf:langString at least meant all
> >> literals had a datatype.
> >>
> >> With the lexical form being a pair (string, lang) and squeezing that into
> >> a single string, it gets a bit unintuitive when strlen("hello@en") is 8,
> >> not 5. See also rdf:PlainLiteral.
> >>
> >> For datatypes, language tags have their own structure and hierarchy
> >> (lang-script-region-...) for their requirements, which does not really fit
> >> with datatype subtyping very well.
> >>
> >> I don't think changes would simplify.
> >>
> >> We have what we have and people have been explaining it to the wider
> >> community (i.e. it's not just people on this list who are affected). So
> >> "technically better" isn't the criterion; it should be "unlocks potential
> >> that is currently, provably blocked".
> >>
> >>     Andy
> >>
> >> On 23/11/2018 08:42, Wouter Beek wrote:
> >>> Dear David, others,
> >>> As another attempt at simplifying RDF, would it be possible to do away
> >>> with the special status of language-tagged strings?
> >>> In RDF 1.1 literals consist of 3 components: lexical form, datatype
> >>> IRI, and language tag.  The last component is only used in
> >>> language-tagged strings.  Would it be possible to define
> >>> `rdf:langString' as a regular datatype IRI and have literals consist
> >>> of 2 components instead?
> >>> RDF 1.1 Concepts and Abstract Syntax currently contains many caveats
> >>> to accommodate the idiosyncratic nature of language-tagged strings,
> >>> e.g.,:
> >>>> Language-tagged strings have the datatype IRI
> >>>> http://www.w3.org/1999/02/22-rdf-syntax-ns#langString. No datatype is
> >>>> formally defined for this IRI because the definition of datatypes does not
> >>>> accommodate language tags in the lexical space. The value space associated
> >>>> with this datatype IRI is the set of all pairs of strings and language tags.
> >>> Would it be possible to define a regular lexical space, e.g.,
> >>> containing "hello@en"^^rdf:langString, together with a value-2-lexical
> >>> and a lexical-2-value mapping?
> >>> The N3 and SPARQL notation "hello"@en will of course still be
> >>> available, and will be syntactic sugar for "hello@en"^^rdf:langString.
> >>> ---
> >>> Best regards,
> >>> Wouter Beek.
> >>> Email: w.g.j.beek@vu.nl
> >>> WWW: https://wouterbeek.org
> >>> Tel: +31647674624
> >>
> >
>
--
Prof. Dr. Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany

office: Robert-Mayer-Str. 10, #401b
mail: chiarcos@informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28931
Received on Friday, 23 November 2018 14:56:05 UTC