Unicode considerations for RDF

Gregg Kellogg

★ W3C TPAC 2023, Seville, 11–15 September ★

RDF Updates for Internationalization

Unicode Strings
Base Direction
BCP47 Case Insensitivity

Unicode strings

RDF 1.1 was inconsistent in its terminology for [Uu]nicode strings.

PR w3c/rdf-concepts#59 (for issue w3c/rdf-concepts#51) defines an RDF string term based on Unicode code points restricted to be Unicode scalar values.

String/IRI equality is based on two strings having the same code points, while allowing un-decoded code units to be used as well.

The lexical form of a literal is a sequence of code points restricted to scalar values, which excludes surrogates.

See also the XML 1.1 Char production, the relationship to the D80 definition of Unicode string, and DOMString.
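The scalar-value restriction described above can be illustrated with a minimal sketch. The function name is_rdf_string is hypothetical and not taken from the PR; the only fact assumed is that Unicode scalar values are all code points except the surrogate range U+D800 to U+DFFF. Python is used because its str type can hold lone surrogates, which makes the distinction visible.

```python
# Surrogate code points are the only code points that are not scalar values.
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF


def is_rdf_string(s: str) -> bool:
    """Return True if every code point in s is a Unicode scalar value.

    Hypothetical helper for illustration only.
    """
    return not any(SURROGATE_FIRST <= ord(ch) <= SURROGATE_LAST for ch in s)


# A supplementary-plane character is a single scalar value, so it passes.
print(is_rdf_string("caf\u00e9 \U0001F600"))
# A lone surrogate is a code point but not a scalar value, so it fails.
print(is_rdf_string("bad \ud83d"))
```

A check like this would apply to strings that have already been decoded to code points; how un-decoded code units are handled is outside what this sketch covers.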

Base Direction

RDF has not had the ability to simply declare the base direction of a string (paragraph).

JSON-LD 1.1 defines an @direction of a value object using informative definitions of an i18n namespace and compound object.

PR w3c/rdf-concepts#48 (for issue w3c/rdf-concepts#9) adds directional language-tagged strings by adding a base direction element to literals.

This issue has been discussed since January and hasn't reached consensus due to concerns that a simple base direction doesn't go far enough to meet the I18N requirements (see here and here).
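To make the idea of a base direction element on literals concrete, here is a rough sketch of a literal carrying an optional direction. The class and field names are hypothetical and do not come from the PR. Two points are assumptions: that the direction takes the values "ltr" or "rtl" (as JSON-LD 1.1 @direction does), and that a direction is only meaningful alongside a language tag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Literal:
    """Hypothetical literal model: lexical form, language tag, base direction."""

    lexical_form: str
    language: Optional[str] = None
    direction: Optional[str] = None  # assumed values: "ltr" or "rtl"

    def __post_init__(self) -> None:
        if self.direction is None:
            return
        if self.direction not in ("ltr", "rtl"):
            raise ValueError("direction must be 'ltr' or 'rtl'")
        # Assumption: a base direction accompanies a language tag.
        if self.language is None:
            raise ValueError("direction requires a language tag")


plain = Literal("\u05e9\u05dc\u05d5\u05dd", language="he")
directional = Literal("\u05e9\u05dc\u05d5\u05dd", language="he", direction="rtl")

# With direction as part of the term, these are distinct literals.
print(plain == directional)
```

Treating the direction as a component of the term, rather than as metadata, is what makes the two literals compare as different in this sketch.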

BCP47 Case Insensitivity

RDF has allowed implementations to convert language tags to lower case, but has not required this. It also says that the value space is always lower case. This creates an ambiguity: if two triples in an N-Triples serialization differ only in the case of the language tag, does the resulting graph have two triples or one?

<http://example.com/s> <http://example.com/p> "o"@en-US .
<http://example.com/s> <http://example.com/p> "o"@en-us .

This creates an issue for canonicalization and complicates uniform use of the recommended representation (@en-US).

Issue w3c/rdf-concepts#55 suggests changing literal equality comparisons of language tags to be case-insensitive. This should also affect considerations of triple uniqueness within a graph.

This may conflict with existing D-interpretations, where these are considered different terms with the same value.
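The two-triples-or-one question can be sketched by modelling a graph as a set of triples. The tuple layout and the normalize helper are hypothetical. Lower-casing the tag is just one way to get a case-insensitive comparison, chosen here for brevity; it does not preserve the recommended @en-US form.

```python
# Each triple is (subject, predicate, (lexical form, language tag)).
triples = [
    ("http://example.com/s", "http://example.com/p", ("o", "en-US")),
    ("http://example.com/s", "http://example.com/p", ("o", "en-us")),
]


def normalize(triple):
    """Hypothetical helper: fold the language tag to lower case."""
    subject, predicate, (lexical_form, language) = triple
    return (subject, predicate, (lexical_form, language.lower()))


# Case-sensitive term comparison keeps both triples.
case_sensitive_graph = set(triples)
# Case-insensitive comparison of language tags collapses them into one.
case_insensitive_graph = {normalize(t) for t in triples}

print(len(case_sensitive_graph))    # 2
print(len(case_insensitive_graph))  # 1
```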