Details of string operations from Andy Seaborne on 2010-12-01 (public-rdf-dawg@w3.org from October to December 2010)

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Wed, 01 Dec 2010 22:30:15 +0000
To: SPARQL Working Group <public-rdf-dawg@w3.org>
Message-ID: <4CF6CC77.9080406@epimorphics.com>
This message gives the details for carrying out the WG decisions from
http://www.w3.org/2009/sparql/meeting/2010-11-30

It would be good to have these details reviewed, but if I hear nothing,
this is the approach I'll take when I write up the content (which won't
be immediately).

Suggestion: change the name from LENGTH to STRLEN because "LENGTH" might 
imply RDF lists, or paths of Seq.

Suggestion: change the name from SUBSTRING to SUBSTR just to make it 
shorter, and 'STR' is used for strings in SPARQL elsewhere.

Details of string operations:

STRLEN(string)
SUBSTR(string, int, int)
UCASE(string)
LCASE(string)
ENDS(string, string)
STARTS(string, string)
CONTAINS(string, string)
ENCODES(string)
CONCAT(string*)

Issues to sort out are around different flavo(u)rs of string.  Unlike 
F&O we have 3 string forms: xsd:string, simple literal (the SPARQL term 
for a plain literal without a language tag) and plain literals with 
language tag ("LitLang", from now on).

Design:
1/ Operations cover simple literal, LitLang, xsd:string.

This makes it a good thing we have our own IRIs - the F&O operations 
only cover xsd:string.

2/ The return type will be the form of the principal argument.
"Principal argument" means the one the operation is acting on.

So
Operations on xsd:string yield xsd:string
Operations on LitLang yield a literal with the same @lang
   but never when different @tags are mixed
Operations on simple literal yield simple literals

3/ Strings with different language tags do not match or compare

Note that "Script" and "dialect" are parts of a language tag.


STRLEN(string) -> integer

SUBSTR(string, int) -> string
SUBSTR(string, int, int) -> string
Design-2 applies.
The first argument is the "principal argument"

Caution: F&O uses 1-based indexing with a start position and a length.
Warning to Java programmers and others: the arguments are not
    [start,end)

UCASE(string)
LCASE(string)
   Design-2 applies.
   UCASE("abc") -> "ABC"
   UCASE("abc"@de) -> "ABC"@de
   UCASE("abc"^^xsd:string) -> "ABC"^^xsd:string
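
Design-2 for the case functions can be sketched in Python as follows.
The Literal class, its field names and the functions ucase/lcase are
hypothetical modelling choices for illustration, and Python's
str.upper()/str.lower() are only assumed to be close enough to the F&O
case mappings.

```python
from dataclasses import dataclass, replace
from typing import Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


@dataclass(frozen=True)
class Literal:
    """Hypothetical model: simple literal has neither lang nor datatype."""
    lexical: str
    lang: Optional[str] = None
    datatype: Optional[str] = None


def _check_string(lit: Literal) -> None:
    # Only simple literals, LitLang and xsd:string are accepted.
    if lit.datatype not in (None, XSD_STRING):
        raise TypeError("not a string literal; cast explicitly first")


def ucase(lit: Literal) -> Literal:
    _check_string(lit)
    # Only the lexical form changes; lang tag or datatype is kept.
    return replace(lit, lexical=lit.lexical.upper())


def lcase(lit: Literal) -> Literal:
    _check_string(lit)
    return replace(lit, lexical=lit.lexical.lower())


print(ucase(Literal("abc", lang="de")))
```

The result has the same form as the argument because replace() copies
every field except the lexical form.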

ENDS(string, string)
STARTS(string, string)
CONTAINS(string, string)

    STARTS("abc", "a") -> true
    STARTS("abc"@en, "a"@en) -> true
    STARTS("abc"@en, "a"@en-UK) -> false  *** (could be error)

Must be same language tag if two language tags present (else false or error)

NB: This works:
   STARTS(str(?uri), str(prefix:))
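
A minimal Python sketch of the language-tag rule for STARTS, taking the
"false" option rather than the "error" option. The Literal class and
the function name starts are hypothetical, and comparing tags exactly
as written (no case folding) is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Literal:
    """Hypothetical model: simple literal has neither lang nor datatype."""
    lexical: str
    lang: Optional[str] = None
    datatype: Optional[str] = None


def starts(arg: Literal, prefix: Literal) -> bool:
    # If both arguments carry a language tag, the tags must be the same.
    both_tagged = arg.lang is not None and prefix.lang is not None
    if both_tagged and arg.lang != prefix.lang:
        return False  # the alternative design would raise an error here
    return arg.lexical.startswith(prefix.lexical)


print(starts(Literal("abc"), Literal("a")))                       # True
print(starts(Literal("abc", lang="en"), Literal("a", lang="en")))  # True
print(starts(Literal("abc", lang="en"), Literal("a", lang="en-UK")))  # False
```

ENDS and CONTAINS would follow the same shape with a different final
test on the lexical forms.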

ENCODES(string)
   Result is a simple literal regardless of string.
   string can be simple, or xsd:string
   Not clear to me it should apply to LitLang
     proposal: it does not (it is an error).

CONCAT(string*)

If all the strings are simple literals
    -> simple literal

If the strings are a mix of simple literals and one or more xsd:string
    -> xsd:string

If the strings are a mix of simple literals, xsd:strings
and LitLang, and the lang tags are all the same
    -> plain literal with that language tag.

If the strings are a mix of simple literals  and plain literals
and there are two or more different language tags
    -> simple literal

NB: CONCAT("abc"@en, "def"@en-UK) -> "abcdef"
because the arguments have different language tags.

If the strings are a mix of simple literals, xsd:strings and LitLang and 
there are two or more different language tags
    -> xsd:string

CONCAT("abc"@en, "def"@en-UK, "z"^^xsd:string) -> "abcdefz"^^xsd:string
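
The CONCAT cases above collapse into one decision: exactly one distinct
language tag gives a literal with that tag; otherwise the result is
xsd:string if any argument is xsd:string, else a simple literal. A
Python sketch under that reading follows. The Literal class and the
function name concat are hypothetical, tags are compared exactly as
written, and treating CONCAT() with no arguments as the empty simple
literal is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


@dataclass(frozen=True)
class Literal:
    """Hypothetical model: simple literal has neither lang nor datatype."""
    lexical: str
    lang: Optional[str] = None
    datatype: Optional[str] = None


def concat(*args: Literal) -> Literal:
    text = "".join(a.lexical for a in args)
    langs = {a.lang for a in args if a.lang is not None}
    has_xsd = any(a.datatype == XSD_STRING for a in args)
    if len(langs) == 1:
        # All language tags present agree: keep that tag.
        return Literal(text, lang=next(iter(langs)))
    if has_xsd:
        # No tags, or conflicting tags, with an xsd:string involved.
        return Literal(text, datatype=XSD_STRING)
    # No tags, or conflicting tags, and no xsd:string.
    return Literal(text)


print(concat(Literal("abc", lang="en"), Literal("def", lang="en-UK")))
```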


Other types (including IRIs) do not get cast to string.  Add STR() or 
xsd:string() as needed. This is a choice point: an implicit cast would 
have two candidates, STR() and xsd:string(), so I suggest we require 
explicit casts.

 Andy
Received on Wednesday, 1 December 2010 22:30:52 UTC