RE: Rethinking how literals are defined from Markus Lanthaler on 2013-11-18 (public-rdf-wg@w3.org from November 2013)

From: Markus Lanthaler <markus.lanthaler@gmx.net>
Date: Mon, 18 Nov 2013 15:32:34 +0100
To: "'Richard Cyganiak'" <richard@cyganiak.de>
Cc: "'RDF Working Group WG'" <public-rdf-wg@w3.org>
Message-ID: <026c01cee46b$06401660$12c04320$@lanthaler@gmx.net>

On Thursday, November 14, 2013 10:46 PM, Richard Cyganiak wrote:
> Markus,
> 
> The section in the way it's currently written is the result of long and
> protracted arguments. If you think you can improve the wording, it
> would be helpful if you could make a concrete proposal.

Fair enough. What about replacing it with:

--------------%<-----------------------
Literals are used for values such as strings, numbers, and dates.

A literal in an RDF graph consists of two or three elements:

 - a lexical form, being a Unicode [UNICODE] string, which SHOULD be in
   Normal Form C [NFC],

 - a datatype IRI, being an IRI identifying a datatype that determines
   how the lexical form maps to a literal value, and

 - if and only if the datatype IRI is rdf:langString, optionally a
   non-empty language tag as defined by [BCP47]. The language tag MUST be
   well-formed according to section 2.2.9 of [BCP47].

A literal is a language-tagged string if the third element is present.
Lexical representations of language tags MAY be converted to lower case.
The value space of language tags is always in lower case.

Please note that concrete syntaxes MAY support simple literals consisting
of only a lexical form without any datatype IRI or language tag. Simple
literals are syntactic sugar for abstract syntax literals with the datatype
IRI xsd:string. Similarly, most concrete syntaxes represent language-tagged
strings without the datatype IRI because it always equals rdf:langString.

The literal value associated with a literal is:

  1. If the literal is a language-tagged string, then the literal value is
     a pair consisting of its lexical form and its language tag, in that
     order.
  2. If the literal's datatype IRI is in the set of recognized datatype
     IRIs, let d be the referent of the datatype IRI.
     a) If the literal's lexical form is in the lexical space of d, then
        the literal value is the result of applying the lexical-to-value
        mapping of d to the lexical form.
     b) Otherwise, the literal is ill-typed and no literal value can be
        associated with the literal. Such a case produces a semantic
        inconsistency but is not syntactically ill-formed. Implementations
        MUST accept ill-typed literals and produce RDF graphs from them.
        Implementations MAY produce warnings when encountering ill-typed
        literals.
  3. If the literal's datatype IRI is not recognized by an implementation,
     then the literal value is not defined by this specification.

Literal term equality: Two literals are term-equal (the same RDF literal)
if and only if the two lexical forms, the two datatype IRIs, and the two
language tags (if any) compare equal, character by character. Thus, two
literals can have the same value without being the same RDF term. For
example: 

    "1"^^xs:integer
    "01"^^xs:integer   

denote the same value, but are not the same literal RDF terms and are not
term-equal because their lexical forms differ.

--------------%<-----------------------
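
For illustration only (not part of the proposed text), the structure and
term-equality rule above can be sketched in Python. The class name Literal,
its field names, and the constants are hypothetical; the check in
__post_init__ follows the wording of the proposal, where a language tag may
only appear together with rdf:langString.

```python
from dataclasses import dataclass
from typing import Optional

RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


@dataclass(frozen=True)
class Literal:
    """Hypothetical model of an abstract-syntax literal."""

    lexical_form: str
    datatype_iri: str
    language_tag: Optional[str] = None

    def __post_init__(self) -> None:
        # A language tag is only allowed with rdf:langString and must be non-empty.
        if self.language_tag is not None:
            if self.datatype_iri != RDF_LANGSTRING:
                raise ValueError("language tag requires rdf:langString")
            if self.language_tag == "":
                raise ValueError("language tag must be non-empty")

    @property
    def is_language_tagged_string(self) -> bool:
        return self.language_tag is not None


# The generated __eq__ compares all three fields; Python compares strings
# code point by code point, which matches "character by character".
one = Literal("1", XSD_INTEGER)
zero_one = Literal("01", XSD_INTEGER)

print(one == zero_one)                                      # term equality
print(int(one.lexical_form) == int(zero_one.lexical_form))  # value equality
```

The two literals are not term-equal, although an integer-aware consumer
would map both lexical forms to the same value.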

Hopefully this makes everything a bit easier to understand and more
consistent. I tried to change as little as possible. The only notable
changes are that I removed

  "A badly formed language tag MUST be treated as a syntax error."

as I don't believe this belongs in Concepts; it also duplicates the other
normative statement "[a] language tag MUST be well-formed".

I also removed "Multiple literals may have the same lexical form" as it
doesn't add anything and falls out naturally from the definition of literal
term equality.

I'm not sure about statement 3) above:

  If the literal's datatype IRI **is not recognized by an implementation**,
  then the literal value is not defined by this specification.

Wouldn't it be better to say "... is not in the set of recognized datatype
IRIs"? The term "recognized datatype IRIs" would then be linked to

  http://www.w3.org/TR/rdf11-concepts/#dfn-recognized-datatype-iris
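
As a rough sketch of the three cases of the literal value definition, the
following Python treats "the set of recognized datatype IRIs" as a plain
lookup table. The function name literal_value, the status strings, and the
table RECOGNIZED are hypothetical, and the xsd:integer check is a
simplification (optional sign followed by digits).

```python
import re
from typing import Any, Callable, Dict, Optional, Tuple

XSD = "http://www.w3.org/2001/XMLSchema#"
RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"


def parse_xsd_integer(lexical: str) -> int:
    # Simplified lexical space of xsd:integer: optional sign, then digits.
    if not re.fullmatch(r"[+-]?[0-9]+", lexical):
        raise ValueError("not in the lexical space of xsd:integer")
    return int(lexical)


# Hypothetical set of recognized datatype IRIs with their
# lexical-to-value mappings.
RECOGNIZED: Dict[str, Callable[[str], Any]] = {
    XSD + "integer": parse_xsd_integer,
    XSD + "string": lambda lexical: lexical,
}


def literal_value(
    lexical_form: str, datatype_iri: str, language_tag: Optional[str] = None
) -> Tuple[str, Any]:
    # Case 1: language-tagged string; the value space of tags is lower case.
    if language_tag is not None:
        return ("value", (lexical_form, language_tag.lower()))
    # Case 2: recognized datatype IRI.
    mapping = RECOGNIZED.get(datatype_iri)
    if mapping is not None:
        try:
            return ("value", mapping(lexical_form))
        except ValueError:
            # Ill-typed: the literal is still accepted, it just has no value.
            return ("ill-typed", None)
    # Case 3: datatype IRI not in the recognized set.
    return ("undefined", None)


print(literal_value("01", XSD + "integer"))
print(literal_value("abc", XSD + "integer"))
print(literal_value("x", "http://example.org/unknown-datatype"))
```

Note that the ill-typed case returns normally rather than raising, in line
with the requirement that implementations accept ill-typed literals.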


> Concrete syntaxes need to say that the datatype of a literal is
> implicitly rdf:langString if a language tag is present, and that it is
> implicitly xsd:string if neither datatype nor language string are
> present.

Right, but apart from Turtle and JSON-LD none of the syntaxes does so. This
needs to be fixed.
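
What a concrete syntax would have to specify can be sketched as a small
Python function. The name resolve_datatype and its error handling are
assumptions made for illustration; only the two defaulting rules come from
the quoted text above.

```python
from typing import Optional

RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


def resolve_datatype(
    datatype_iri: Optional[str], language_tag: Optional[str]
) -> str:
    """Return the abstract-syntax datatype IRI for a parsed literal."""
    if language_tag is not None:
        # A language tag implies rdf:langString; any other explicit
        # datatype is rejected in this sketch.
        if datatype_iri not in (None, RDF_LANGSTRING):
            raise ValueError("language tag conflicts with explicit datatype")
        return RDF_LANGSTRING
    if datatype_iri is None:
        # Neither datatype nor language tag: a simple literal.
        return XSD_STRING
    return datatype_iri


print(resolve_datatype(None, "en"))
print(resolve_datatype(None, None))
```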


> I agree that the regex is entirely counterproductive and should be
> removed.

OK, removed in the proposal above.



--
Markus Lanthaler
@markuslanthaler
Received on Monday, 18 November 2013 14:33:09 UTC