Re: Details of string operations from Steve Harris on 2010-12-02 (public-rdf-dawg@w3.org from October to December 2010)

From: Steve Harris <steve.harris@garlik.com>
Date: Thu, 2 Dec 2010 14:36:32 +0000
To: Andy Seaborne <andy.seaborne@epimorphics.com>
Cc: SPARQL Working Group <public-rdf-dawg@w3.org>
Message-Id: <2897B74A-E283-4D1F-AE97-BD28D8A10C57@garlik.com>
On 2010-12-02, at 11:56, Andy Seaborne wrote:

> Summary: outstanding:
> 
> Is a mix of simple literals and XSD strings on CONCAT going to return a simple literal or xsd:string?
> 
> Example:
> 
> CONCAT(?var1, " -> ", ?var2)
> 
> ?var1 and/or ?var2 are xsd:strings. Is the result a simple literal or xsd:string?
> 
> Discussion below.
> 
> 	Andy
> 
> On 02/12/10 10:58, Steve Harris wrote:
>>> ENCODES(string)
>> 
>> ENCODES() strikes me as strange naming, sounds like a predicate.
> 
> As Greg pointed out, ENCODE.
> 
>> There are many URI encodings people might reasonably want, "full" URI (encodeURI() in Javascript 1.5), URI component (encodeURIComponent() in Javascript 1.5), plus there's also base64 etc. encoding.
> 
> This is specifically a counterpart to fn:encode-for-uri i.e. %-encoding.

There are two types of %-encoding; fn:encode-for-uri is encodeURIComponent() in JS terms.
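
To make the distinction concrete, here is a rough Python sketch of the two behaviours. The function names are mine and are not proposals for SPARQL, and the "whole URI" safe set is an assumption that mirrors the characters JavaScript's encodeURI() leaves alone. Note that encodeURIComponent() also leaves ! * ' ( ) unescaped, which fn:encode-for-uri does not, so the correspondence is close rather than exact.

```python
from urllib.parse import quote

# Characters encodeURI() in JavaScript leaves untouched, besides letters
# and digits (assumed here for illustration).
WHOLE_URI_SAFE = ";/?:@&=+$,#-_.!~*'()"


def encode_component(text: str) -> str:
    """Escape everything except RFC 3986 unreserved characters.

    This is the fn:encode-for-uri style: '/', '?', '=' etc. are all escaped,
    so the result can be dropped into a single path segment or query value.
    """
    return quote(text, safe="")


def encode_whole_uri(text: str) -> str:
    """Escape only characters that may not appear anywhere in a URI.

    Reserved delimiters such as '/', '?' and '&' are kept, so an existing
    URI keeps its structure.
    """
    return quote(text, safe=WHOLE_URI_SAFE)


if __name__ == "__main__":
    sample = "http://example.org/a b?x=1&y=2"
    print(encode_component(sample))
    print(encode_whole_uri(sample))
```

The component form is the one you want when the input is an opaque value being appended to a prefix, which is the use case further down this thread.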

> The "for URI" is significant - it's not applying different rules to different parts of the string (e.g. hostnames)
> 
>> Prefer naming like ENCODE_URI(), ENCODE_URI_COMPONENT(), learning from Javascript's mistake.
> 
> I have no particular opinion but it's more like the latter (which is a tad long).
> 
> ENCODE_FOR_URI?

That's fine.

>> Would also like DECODE_* forms.
>> 
>> [ I'd like MD5_HEX() and SHA1_HEX() too, returning hex encoded simple literals, very useful when minting stable identifiers, but not going to fight for it ]
> 
> Can we make that a separate issue? I was considering the resolution of the WG from the last WG. Someone (Paul?) took an action to write to the list about it.
> 
>>> CONCAT(string*)
> 
>>> UCASE(string)
>>> LCASE(string)
>>>  Design-2 applies.
>>>  UCASE("abc") ->  "ABC"
>>>  UCASE("abc"@de) ->  "ABC"@de
>>>  UCASE("abc"^^xsd:string) ->  "ABC"^^xsd:string
>> 
>> Just to note, supporting this for the bulk of Unicode is quite a heavy requirement. I do think we should have it though.
> 
> Yes - it is amusing isn't it :-)

I've got an implementation in 4store for 20 or so languages that I picked up from a text processing library. IIRC in some cases it's dependent on having correct ISO lang tags though. I'm guessing F&O doesn't require that?
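
For reference, the "same kind of literal out as in" behaviour in the quoted UCASE examples can be sketched in a few lines of Python. The Literal class and ucase() name are hypothetical, made up for this illustration. The sketch leans on Python's str.upper(), which applies the default Unicode case mappings and never looks at the language tag, so it sidesteps the tailored-casing question rather than answering it.

```python
from dataclasses import dataclass
from typing import Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


@dataclass(frozen=True)
class Literal:
    """Hypothetical minimal RDF literal: lexical form plus optional tag or datatype."""
    lexical: str
    lang: Optional[str] = None
    datatype: Optional[str] = None


def ucase(lit: Literal) -> Literal:
    """Upper-case the lexical form and carry the lang tag or datatype through unchanged."""
    # str.upper() is language-independent: the lang tag is preserved on the
    # result but plays no part in the case mapping itself.
    return Literal(lit.lexical.upper(), lit.lang, lit.datatype)


if __name__ == "__main__":
    print(ucase(Literal("abc")))
    print(ucase(Literal("abc", lang="de")))
    print(ucase(Literal("abc", datatype=XSD_STRING)))
```

An implementation that wanted language-sensitive casing would need to branch on lit.lang at the point marked in the comment, which is where correct lang tags start to matter.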

>>> ENCODES(string)
>>>  Result is a simple literal regardless of string.
>>>  string can be simple, or xsd:string
>>>  Not clear to me it should apply to LitLang
>>>    proposal: it does not (it is an error).
>> 
>> This could potentially cause some confusion if it's the only stringy function that will give an error when given a literal with lang tag.
>> 
>> I've also seen plenty of documents in the wild with <rdf:RDF xml:lang="en">, so all literals ended up with language tags by default, regardless of whether that makes any sense. People might reasonably want to do:
>> 
>> URI(CONCAT(STR(prefix:), ENCODE_URI_COMPONENT(?code)))
> 
> URI(CONCAT(STR(prefix:), ENCODE_URI_COMPONENT(STR(?code))))

But you'd have to remember that (just for ENCODE_FOR_URI) you need to add a STR, which seems odd. No real extra work for implementors, pain for users.

>> and will be surprised when ?code ->  "Zm9vCg=="@en causes that result to be dropped.
> 
> Fair point but, in this operation, the return type is not going to be lang tagged whereas it may be elsewhere, i.e. it has different requirements anyway.

Sure, but they weren't caring about lang tags on the input, or they wouldn't have tried to %-encode it.
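
A small Python sketch of the two options, to show what the user sees in each case. Both function names are hypothetical, and the Python exception stands in for a SPARQL expression error (which is what would cause the result to be dropped).

```python
from typing import Optional
from urllib.parse import quote


def encode_for_uri_strict(lexical: str, lang: Optional[str] = None) -> str:
    """Proposal as quoted: a language-tagged literal is an error."""
    if lang is not None:
        # Stand-in for a SPARQL expression error.
        raise ValueError("language-tagged literal not accepted")
    return quote(lexical, safe="")


def encode_for_uri_lenient(lexical: str, lang: Optional[str] = None) -> str:
    """Alternative: ignore the tag and encode the lexical form, as if STR() had been applied."""
    return quote(lexical, safe="")


if __name__ == "__main__":
    # The ?code value from the example above: "Zm9vCg=="@en
    print(encode_for_uri_lenient("Zm9vCg==", lang="en"))
    try:
        encode_for_uri_strict("Zm9vCg==", lang="en")
    except ValueError as err:
        print("error:", err)
```

In both variants the output is a plain string with no language tag, so the difference is only in whether the tagged input is rejected or quietly accepted.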

>>> CONCAT(string*)
>>> 
>>> If all the strings are simple literals
>>>   ->  simple literals
>>> 
>>> If the strings are a mix of simple literals and one or more xsd:string
>>>   ->  xsd:string
>> 
>> For commonality with the rules below, this might be better returning a simple literal.
> 
> No strong opinion here but there is a reason:
> 
> My thinking was that if there is an xsd:string from the data, but the query writes a simple literal (convenience) then the result is typed.
> 
> e.g. CONCAT(?var1, " -> ", ?var2)
> 
> and ?var1 and ?var2 are xsd:strings from the data.

I see where you're coming from, but you could equally write CONCAT(?var1, " -> "^^xsd:string, ?var2), if you cared about the distinction for some reason.

I was thinking about whether the type gets "promoted" to xsd:string or to plain literal. Plain literals feel more like a promotion than a demotion. Plus it increases consistency:

CONCAT(plain, anything) -> plain
CONCAT(string, string) -> string
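
Spelled out as a Python sketch, the two candidate rules differ only in the mixed case. This covers simple literals and xsd:string only (language-tagged arguments are left out), literals are modelled as (lexical, datatype) pairs with None meaning a simple literal, and the function names are made up for illustration.

```python
from typing import Optional, Tuple

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"

# (lexical form, datatype IRI or None for a simple literal)
Lit = Tuple[str, Optional[str]]


def concat_typed_wins(*args: Lit) -> Lit:
    """Rule from the quoted proposal: any xsd:string argument makes the result xsd:string."""
    text = "".join(lexical for lexical, _ in args)
    if any(datatype == XSD_STRING for _, datatype in args):
        return (text, XSD_STRING)
    return (text, None)


def concat_plain_wins(*args: Lit) -> Lit:
    """Rule suggested above: xsd:string only if every argument is xsd:string."""
    text = "".join(lexical for lexical, _ in args)
    if args and all(datatype == XSD_STRING for _, datatype in args):
        return (text, XSD_STRING)
    return (text, None)


if __name__ == "__main__":
    # CONCAT(?var1, " -> ", ?var2) with both variables bound to xsd:strings.
    mixed = (("a", XSD_STRING), (" -> ", None), ("b", XSD_STRING))
    print(concat_typed_wins(*mixed))
    print(concat_plain_wins(*mixed))
```

For the mixed call the first rule returns an xsd:string and the second a simple literal; when the arguments are all simple or all xsd:string the two rules agree.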

> What do others think?
> 
>>> CONCAT("abc"@en, "def"@en-UK, "z"^^xsd:string) ->  "abcdefz"
> 
> Should have been "abcdefz"^^xsd:string.

I've been giving a bit of thought to what I'd expect/want CONCAT("1", "2"^^xsd:integer) to do, but haven't reached any conclusions.

- Steve

-- 
Steve Harris, CTO, Garlik Limited
1-3 Halford Road, Richmond, TW10 6AW, UK
+44 20 8439 8203  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
Received on Thursday, 2 December 2010 14:37:08 UTC