Rethinking how literals are defined from Markus Lanthaler on 2013-11-13 (public-rdf-wg@w3.org from November 2013)

From: Markus Lanthaler <markus.lanthaler@gmx.net>
Date: Wed, 13 Nov 2013 10:23:06 +0100
To: "'RDF WG'" <public-rdf-wg@w3.org>
Message-ID: <00bf01cee051$f707f630$e517e290$@lanthaler@gmx.net>

Hi,

I've just had a look at the section defining literals in RDF Concepts [1]
and believe it needs some love. Currently it says:

  A literal in an RDF graph consists of two or three elements:
    . a lexical form, being a Unicode [UNICODE] string, which 
      SHOULD be in Normal Form C [NFC],
    . a datatype IRI, being an IRI identifying a datatype that
      determines how the lexical form maps to a literal value.

The third element, the language tag, isn't described at all in that list.
IMO we should add it. Then Concepts goes on and says:

  A literal is a language-tagged string if and only if its datatype IRI
  is http://www.w3.org/1999/02/22-rdf-syntax-ns#langString, and only in
  this case the third element is present:
    . a non-empty language tag as defined by [BCP47]. The language tag
      MUST be well-formed according to section 2.2.9 of [BCP47]. Lexical
      representations of language tags MAY be converted to lower case. The
      value space of language tags is always in lower case.
    . A badly formed language tag MUST be treated as a syntax error.

  Implementors might wish to note that language tags conform to the regular
  expression '@' [a-zA-Z]{1,8} ('-' [a-zA-Z0-9]{1,8})* before normalizing
  to lowercase.

Not only does this contain grammatical glitches and a wrong regex (as
pointed out earlier), but it probably also confuses readers. None (!) of our
syntaxes allows serializing a literal with both a datatype *and* a
language tag. In fact, apart from JSON-LD and Turtle, none of the syntax
specs even mentions rdf:langString, which has to be fixed. Despite that,
using a datatype and a language tag together always results in a syntax
error, even if you use rdf:langString as the datatype.
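
To see what the quoted expression accepts when taken literally, here is a
small Python sketch. It reproduces the pattern exactly as quoted above,
including the leading '@', and makes no attempt to correct it, since this
message doesn't restate what was pointed out earlier. The function names are
made up for illustration.

```python
import re

# The pattern exactly as quoted from RDF Concepts, including the leading '@'.
# Reproduced as-is for illustration; this is NOT a corrected version.
QUOTED_PATTERN = re.compile(r"@[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*")


def matches_quoted_pattern(text: str) -> bool:
    """Return True if the whole string matches the quoted pattern."""
    return QUOTED_PATTERN.fullmatch(text) is not None


def normalize_tag(tag: str) -> str:
    """Lower-case a language tag, which the quoted text says MAY be done."""
    return tag.lower()


if __name__ == "__main__":
    for candidate in ["@en", "@en-US", "en-US", "@"]:
        print(candidate, matches_quoted_pattern(candidate))
    print(normalize_tag("en-US"))
```

Taken literally, the pattern only matches strings that carry the '@' sign,
so a bare tag such as "en-US" is rejected.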

The description above is made even worse by the sentence that follows it:

  Concrete syntaxes MAY support simple literals, consisting of only a 
  lexical form without any datatype IRI or language tag.

This leaves the impression that it is fine to serialize a literal without a
datatype and without a language tag, but it doesn't mention that it is also
fine to serialize it with just a language tag. The natural conclusion thus
seems to be that this isn't allowed.

I know why rdf:langString has been introduced in the first place and you
know that I'm not happy with restricting language-tagging to that type, but
I think there's very little we can do about that at this stage given our
charter. What we could do, though, is define language-tagged strings so that
the datatype is implicit, i.e., a valid language-tagged string consists of a
lexical form and a language tag and always has the implicit type
rdf:langString (which formally isn't a datatype anyway).
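
A rough Python sketch of that idea, purely illustrative and not proposed spec
text: callers supply either a language tag or a datatype IRI, and
rdf:langString is filled in implicitly whenever a language tag is given. The
names Literal and make_literal are hypothetical. Defaulting a plain literal
to xsd:string and lower-casing the tag are assumptions taken from how RDF 1.1
Concepts treats simple literals and the value space of language tags.

```python
from dataclasses import dataclass
from typing import Optional

RDF_LANG_STRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


@dataclass(frozen=True)
class Literal:
    """Hypothetical abstract-syntax literal: always has a datatype IRI."""
    lexical_form: str
    datatype: str
    language: Optional[str] = None


def make_literal(lexical_form: str,
                 datatype: Optional[str] = None,
                 language: Optional[str] = None) -> Literal:
    """Build a literal from what a concrete syntax would let you write."""
    if language is not None:
        if language == "":
            raise ValueError("language tag must be non-empty")
        if datatype is not None:
            # Mirrors the concrete syntaxes: a datatype plus a language tag
            # is an error, even if the datatype is rdf:langString.
            raise ValueError("cannot combine a datatype with a language tag")
        # The datatype is implicit for language-tagged strings.
        return Literal(lexical_form, RDF_LANG_STRING, language.lower())
    if datatype == RDF_LANG_STRING:
        # rdf:langString only makes sense together with a language tag.
        raise ValueError("rdf:langString requires a language tag")
    if datatype is None:
        # Assumption: a simple literal is treated as an xsd:string.
        datatype = XSD_STRING
    return Literal(lexical_form, datatype)


if __name__ == "__main__":
    print(make_literal("chat", language="fr"))
    print(make_literal("42", datatype="http://www.w3.org/2001/XMLSchema#integer"))
    print(make_literal("plain"))
```

With this shape, the three forms a syntax can actually serialize (plain,
language-tagged, typed) each map to exactly one abstract literal, and the
combination that no syntax allows is rejected up front.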

Perhaps it would also make sense to introduce a term like "typed value" (as
used in JSON-LD, but I would be fine with typed literal as well) to make it
easier to talk about literals which are not language-tagged strings.

Thoughts?


Cheers,
Markus


[1] http://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal


--
Markus Lanthaler
@markuslanthaler

Received on Wednesday, 13 November 2013 09:23:40 UTC