Re: Details of string operations from Steve Harris on 2010-12-02 (public-rdf-dawg@w3.org from October to December 2010)

From: Steve Harris <steve.harris@garlik.com>
Date: Thu, 2 Dec 2010 10:58:14 +0000
To: Andy Seaborne <andy.seaborne@epimorphics.com>
Cc: SPARQL Working Group <public-rdf-dawg@w3.org>
Message-Id: <E95D1625-59E0-47D1-9061-A3E98EF0189B@garlik.com>
On 2010-12-01, at 22:30, Andy Seaborne wrote:

> This message is the details of executing on the WG decisions from
> http://www.w3.org/2009/sparql/meeting/2010-11-30
> 
> It would be good if these details were reviewed, but if I hear nothing, this is the approach I'll take when I write up the content (which won't be immediately).
> 
> Suggestion: change the name from LENGTH to STRLEN because "LENGTH" might imply RDF lists, or paths of Seq.
> 
> Suggestion: change the name from SUBSTRING to SUBSTR just to make it shorter, and 'STR' is used for strings in SPARQL elsewhere.
> 
> Details of string operations:
> 
> STRLEN(string)
> SUBSTR(string, int, int)
> UCASE(string)
> LCASE(string)
> ENDS(string, string)
> STARTS(string, string)
> CONTAINS(string, string)
> ENCODES(string)

ENCODES() strikes me as a strange name; it sounds like a predicate.

There are many URI encodings people might reasonably want: "full" URI (encodeURI() in JavaScript 1.5) and URI component (encodeURIComponent() in JavaScript 1.5), plus there are other encodings such as base64.

I'd prefer names like ENCODE_URI() and ENCODE_URI_COMPONENT(), learning from JavaScript's mistake.

Would also like DECODE_* forms.
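
To make the difference between the two encodings concrete, here is a rough Python sketch. The function names are hypothetical (they are not part of any SPARQL draft), and the sets of characters left unescaped are an assumption modelled on JavaScript's encodeURI() and encodeURIComponent():

```python
from urllib.parse import quote, unquote

# Marks that encodeURIComponent() leaves unescaped, in addition to the
# letters, digits and "-_.~" that urllib.parse.quote() never escapes.
_UNRESERVED_MARKS = "!*'()"
# URI delimiters that encodeURI() additionally leaves unescaped.
_URI_DELIMITERS = ";/?:@&=+$,#"


def encode_uri_component(text: str) -> str:
    """Escape everything that is not safe inside a single URI component."""
    return quote(text, safe=_UNRESERVED_MARKS)


def encode_uri(text: str) -> str:
    """Escape a whole URI, keeping its delimiters intact."""
    return quote(text, safe=_UNRESERVED_MARKS + _URI_DELIMITERS)


def decode_uri_component(text: str) -> str:
    """Hypothetical DECODE_* counterpart: undo the percent-escapes."""
    return unquote(text)


print(encode_uri("http://example.org/a b?x=1&y=2"))
print(encode_uri_component("a b?x=1&y=2"))
```

The first call keeps "?", "=" and "&" as delimiters; the second escapes them, which is what you want when the string is a value being placed inside a URI.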

[ I'd like MD5_HEX() and SHA1_HEX() too, returning hex-encoded simple literals, very useful when minting stable identifiers, but I'm not going to fight for it ]

> CONCAT(string*)
> 
> Issues to sort out are around different flavo(u)rs of string.  Unlike F&O we have 3 string forms: xsd:string, simple literal (the SPARQL term for a plain literal without a language tag) and plain literals with language tag ("LitLang", from now on).
> 
> Design:
> 1/ Operations cover simple literal, LitLang, xsd:string.
> 
> This makes it a good thing we have our own IRIs - the F&O operations only cover xsd:string.
> 
> 2/ The return type will be the form of the principal argument.
> The principal argument is the one the operation is acting on.
> 
> So
> Operations on xsd:string yield xsd:string
> Operations on LitLang yield @lang
>  but not with mixing of @tags
> Operations on simple literal yield simple literals
> 
> 3/ Literals with different language tags do not match or compare
> 
> Note that "Script" and "dialect" are parts of a language tag.

+1

> STRLEN(string) -> integer
> 
> SUBSTR(string, int) -> string
> SUBSTR(string, int, int) -> string
> Design-2 applies.
> The first argument is the "principal argument"
> 
> Caution: F&O is 1-based indexing, + length
> Warning to Java programmers and others, it's not
>   [start,end)
> 
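
The 1-based "start plus length" convention quoted above can be sketched in Python as follows. This is only an approximation under stated assumptions: it takes integer arguments (F&O's fn:substring accepts doubles and rounds them), works on plain Python strings rather than RDF literals, and the name substr is just a stand-in:

```python
from typing import Optional


def substr(source: str, start: int, length: Optional[int] = None) -> str:
    """Select characters at 1-based positions p with start <= p < start + length."""
    first = max(start, 1)              # positions below 1 select nothing
    if length is None:
        return source[first - 1:]
    end = start + length               # exclusive bound, still 1-based
    if end <= first:
        return ""
    return source[first - 1:end - 1]


print(substr("metadata", 4, 3))        # start at the 4th character, take 3
print(substr("motor car", 6))          # from the 6th character to the end
```

For comparison, a [start,end) slice with the same numbers, "metadata"[4:3] in Python, selects nothing at all.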
> UCASE(string)
> LCASE(string)
>  Design-2 applies.
>  UCASE("abc") -> "ABC"
>  UCASE("abc"@de) -> "ABC"@de
>  UCASE("abc"^^xsd:string) -> "ABC"^^xsd:string

Just to note, supporting this for the bulk of Unicode is quite a heavy requirement. I do think we should have it though.
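
A small Python sketch of Design-2 applied to UCASE, assuming a toy Literal class (hypothetical, not any real RDF library) and relying on Python's built-in Unicode case mapping:

```python
from dataclasses import dataclass, replace
from typing import Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


@dataclass(frozen=True)
class Literal:
    """Toy model: simple literal, language-tagged literal, or xsd:string."""
    lexical: str
    lang: Optional[str] = None
    datatype: Optional[str] = None


def ucase(arg: Literal) -> Literal:
    """Upper-case the lexical form; keep the language tag or datatype as given."""
    return replace(arg, lexical=arg.lexical.upper())


print(ucase(Literal("abc")))
print(ucase(Literal("abc", lang="de")))
print(ucase(Literal("abc", datatype=XSD_STRING)))
# Full Unicode case mapping is not one-to-one: "ß" upper-cases to "SS".
print(ucase(Literal("straße", lang="de")))
```

The last call shows why this is not cheap to implement: the result can be longer than the input, so a per-character lookup table is not enough.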

> ENDS(string, string)
> STARTS(string, string)
> CONTAINS(string, string)
> 
>   STARTS("abc", "a") -> true
>   STARTS("abc"@en, "a"@en) -> true
>   STARTS("abc"@en, "a"@en-UK) -> false  *** (could be error)
> 
> Must be same language tag if two language tags present (else false or error)
> 
> NB: This works:
>  STARTS(str(?uri), str(prefix:))

Good :)
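
For reference, the language-tag rule for STARTS could be sketched like this in Python. The Literal class is a toy model, the tag comparison is assumed to be case-insensitive, and the sketch picks "false" rather than "error" for mismatched tags, which the proposal leaves open:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Literal:
    """Toy model of a string literal with an optional language tag."""
    lexical: str
    lang: Optional[str] = None


def langs_compatible(a: Literal, b: Literal) -> bool:
    """Tags only conflict when both are present and they differ."""
    if a.lang is None or b.lang is None:
        return True
    return a.lang.lower() == b.lang.lower()


def starts(string: Literal, prefix: Literal) -> bool:
    if not langs_compatible(string, prefix):
        return False  # the proposal notes this could be an error instead
    return string.lexical.startswith(prefix.lexical)


print(starts(Literal("abc"), Literal("a")))
print(starts(Literal("abc", "en"), Literal("a", "en")))
print(starts(Literal("abc", "en"), Literal("a", "en-UK")))
```

ENDS and CONTAINS would follow the same shape with a different final test.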

> ENCODES(string)
>  Result is a simple literal regardless of string.
>  string can be simple, or xsd:string
>  Not clear to me it should apply to LitLang
>    proposal: it does not (it is an error).

This could potentially cause some confusion if it's the only stringy function that will give an error when given a literal with lang tag.

I've also seen plenty of documents in the wild with <rdf:RDF xml:lang="en">, so all literals end up with language tags by default, regardless of whether that makes any sense. People might reasonably want to do:

URI(CONCAT(STR(prefix:), ENCODE_URI_COMPONENT(?code)))

and will be surprised when ?code -> "Zm9vCg=="@en causes that result to be dropped.
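
A hedged Python sketch of the two behaviours, using a toy Literal class and a hypothetical strict flag to stand in for the "language tag is an error" proposal; percent-encoding is approximated with urllib.parse.quote:

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import quote


@dataclass(frozen=True)
class Literal:
    """Toy model of a string literal with an optional language tag."""
    lexical: str
    lang: Optional[str] = None


def encode_uri_component(arg: Literal, strict: bool = False) -> Literal:
    """Return a simple literal; in strict mode reject language-tagged input."""
    if strict and arg.lang is not None:
        raise ValueError("language-tagged literal not accepted")
    return Literal(quote(arg.lexical, safe="!*'()"))


code = Literal("Zm9vCg==", "en")

# Lenient: the tag is ignored and the identifier can still be built.
print("http://example.org/" + encode_uri_component(code).lexical)

# Strict: the expression fails, so the result would be lost.
try:
    encode_uri_component(code, strict=True)
except ValueError as err:
    print("error:", err)
```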

> CONCAT(string*)
> 
> If all the strings are simple literals
>   -> simple literals
> 
> If the strings are a mix of simple literals and one or more xsd:string
>   -> xsd:string

For commonality with the rules below, this might be better returning a simple literal.

Also, in numeric operations you convert towards types with the greatest precision. xsd:string can't have lang tags, plain literals can, so plain literals maybe make more sense as the return type? They're more common in RDF data too, in my experience.

> If the strings are a mix of simple literals, xsd:strings
> and LitLang, and the lang tags are all the same
>   -> plain literal with that language tag.
> 
> If the strings are a mix of simple literals and plain literals
> and there are two or more different language tags
>   -> simple literal
> 
> NB: CONCAT("abc"@en, "def"@en-UK) -> "abcdef"
> because it has different language tags.

> If the strings are a mix of simple literals, xsd:strings and LitLang and there are two or more different language tags
>   -> xsd:string
> 
> CONCAT("abc"@en, "def"@en-UK, "z"^^xsd:string) -> "abcdefz"
> 
> 
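
The CONCAT return-type rules as quoted can be summarised in a short Python sketch. It follows the rule text as written (so mixed language tags together with an xsd:string give xsd:string, even though the last quoted example shows no datatype on its result). The Literal class is a toy model, and the handling of all-xsd:string and zero-argument calls is an assumption:

```python
from dataclasses import dataclass
from typing import Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


@dataclass(frozen=True)
class Literal:
    """Toy model: simple literal, language-tagged literal, or xsd:string."""
    lexical: str
    lang: Optional[str] = None
    datatype: Optional[str] = None


def concat(*args: Literal) -> Literal:
    text = "".join(a.lexical for a in args)
    tags = {a.lang.lower() for a in args if a.lang is not None}
    has_xsd = any(a.datatype == XSD_STRING for a in args)
    if len(tags) == 1:
        # Exactly one language tag in play: keep it.
        lang = next(a.lang for a in args if a.lang is not None)
        return Literal(text, lang=lang)
    if has_xsd:
        # No usable tag (none, or conflicting ones) and an xsd:string present.
        return Literal(text, datatype=XSD_STRING)
    # Only simple literals, or conflicting tags without any xsd:string.
    return Literal(text)


print(concat(Literal("abc", "en"), Literal("def", "en-UK")))
print(concat(Literal("abc"), Literal("def", datatype=XSD_STRING)))
print(concat(Literal("abc", "en"), Literal("def"), Literal("z", datatype=XSD_STRING)))
```

Returning a simple literal instead of xsd:string in the second branch, as suggested in the reply above, would be a one-line change.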
> Other types (including IRIs) do not get cast to string.  Add STR() or xsd:string() as needed. This is a choice point: since an implicit cast could be either STR() or xsd:string(), I suggest we require explicit casts.

Agreed.

- Steve

-- 
Steve Harris, CTO, Garlik Limited
1-3 Halford Road, Richmond, TW10 6AW, UK
+44 20 8439 8203  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
Received on Thursday, 2 December 2010 10:58:50 UTC