- From: Stefan Eissing <stefan.eissing@greenbytes.de>
- Date: Mon, 26 Nov 2007 08:44:51 +0100
- To: James M Snell <jasnell@gmail.com>
- Cc: Joe Gregorio <joe@bitworking.org>, URI <uri@w3.org>
-1.
I think this can get very messy real quick.
Once we start down the slippery slope of string operators, we would
soon need to define string.length the same way, and might then end up
inventing our own regular-expression dialect (as if the world needed
another one).
On 23.11.2007 at 21:46, James M Snell wrote:
>
> Well, I'm not absolutely convinced it's required either but I can
> definitely imagine scenarios where it would be useful. One possible
> approach would be to have sub work against unreserved and pct-encoded
> characters, e.g.
>
> template-char = unreserved / pct-encoded
>
> sub would operate on template-char
>
> {-sub|0-1|foo=%FF%FF%FF} == %FF
>
> {-sub|0-2|foo=f%FFf%FF} == f%FF
>
> {-sub|0-3|foo=f%FFf%FF} == f%FFf
>
> {-sub|1-2|foo=f%FFf%FF} == %FFf
>
> - James
>
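For illustration, a minimal Python sketch of the template-char idea quoted above. The names `template_chars` and `sub` are hypothetical. The sketch also assumes that the two numbers in `{-sub|a-b|...}` mean "offset" and "count", which is the one reading that reproduces all four quoted examples; the thread itself does not define the range syntax.

```python
import re

# One template-char is either a pct-encoded triplet or a single
# unreserved character (RFC 3986: ALPHA / DIGIT / "-" / "." / "_" / "~").
TEMPLATE_CHAR = re.compile(r"%[0-9A-Fa-f]{2}|[A-Za-z0-9\-._~]")


def template_chars(value):
    """Split value into template-chars, rejecting anything else."""
    tokens = TEMPLATE_CHAR.findall(value)
    if "".join(tokens) != value:
        raise ValueError("not a sequence of template-chars: %r" % value)
    return tokens


def sub(value, offset, count):
    """Return `count` template-chars starting at `offset` (assumed semantics)."""
    return "".join(template_chars(value)[offset:offset + count])


print(sub("%FF%FF%FF", 0, 1))  # %FF
print(sub("f%FFf%FF", 0, 2))   # f%FF
print(sub("f%FFf%FF", 0, 3))   # f%FFf
print(sub("f%FFf%FF", 1, 2))   # %FFf
```

Because a pct-encoded triplet is never split, the result is always a valid sequence of template-chars, but a multi-octet UTF-8 character can still be cut between its octets.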
> Joe Gregorio wrote:
>> On Nov 5, 2007 1:36 PM, James M Snell <jasnell@gmail.com> wrote:
>>> Joe Gregorio wrote:
>>>> 2. The 'sub' operator could either be defined to operate on
>>>> the octets of the variable's value, or on the Unicode code
>>>> points of the equivalent UTF-8 decoded string. Both have
>>>> their pros and cons.
>>>>
>>> I would think that Unicode codepoints would be what folks would
>>> typically expect. If we need to support both, different op codes
>>> can be used...
>>>
>>> octets = {-sub|0-1|username}
>>> codepoints = {-subc|0-1|username}
>>
>> In updating the specification and associated code and
>> examples I've come to believe that you can't do it
>> by unicode codepoint, simply because you can't be
>> certain that the source data was a unicode string.
>> That is, I was going to suggest that '-sub' work by:
>>
>> 1. percent-decode the variable's value
>> 2. convert it from UTF-8 to Unicode
>> 3. do the sub-string selection on the codepoints
>> 4. convert the selected codepoints back to UTF-8, percent-encode
>> all octets that fall outside ( unreserved / pct-encoded ),
>> and substitute the result
>>
>> That won't work because the value might be a percent-encoded binary
>> blob. Here is a concrete example; the following substring operator
>> will fail using the above algorithm:
>>
>> Vars:
>> foo := %FF%FF%FF
>> Template:
>> {-sub|0-1|foo}
>>
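The four steps and the failure case quoted above can be sketched in Python as follows. The function name `sub_by_codepoint` is hypothetical, and the range is assumed to be inclusive at both ends, matching the `{-sub|0-0|username}` example elsewhere in this thread. This is only one way to read the proposal.

```python
from urllib.parse import quote, unquote_to_bytes


def sub_by_codepoint(value, start, end):
    """Substring by Unicode codepoint; `end` is inclusive (assumption)."""
    raw = unquote_to_bytes(value)        # step 1: percent-decode to octets
    text = raw.decode("utf-8")           # step 2: fails if not valid UTF-8
    piece = text[start:end + 1]          # step 3: select codepoints
    # step 4: back to UTF-8, percent-encoding everything but unreserved
    return quote(piece.encode("utf-8"), safe="")


print(sub_by_codepoint("jcgregorio", 0, 0))     # j
print(sub_by_codepoint("%C3%A9t%C3%A9", 0, 0))  # %C3%A9

try:
    sub_by_codepoint("%FF%FF%FF", 0, 1)
except UnicodeDecodeError as exc:
    # 0xFF can never appear in well-formed UTF-8, so step 2 fails
    print("expansion failed:", exc.reason)
```

The binary value never reaches the selection step: the error is raised while decoding, before any substring is taken.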
>>
>> I see several different solutions:
>>
>>
>> 1. Keep '-sub' but only have it act on the variable
>> value w/o doing any decoding back to codepoints.
>>
>> I.e.
>> {-sub|0-1|foo=%FF%FF%FF}
>> becomes:
>> "%F"
>>
>> Of limited use.
>>
>> 2. Keep '-sub' and define the algorithm to decode
>> back to codepoints but put large warnings in the spec
>> not to design URI Templates that would apply a '-sub'
>> expansion on a non-unicode string variable.
>>
>> In this case the above expansion would fail.
>>
>> 3. Drop '-sub'.
>>
>> At this point this is probably my favorite option. I'm not sure
>> how useful '-sub' would be, or that the functionality it offers
>> can't be achieved with the other operators. For example, the
>> motivating example was:
>>
>> Vars:
>> username := jcgregorio
>> Template:
>> {-sub|0-0|username}/{username}
>> URI:
>> j/jcgregorio
>>
>> But couldn't that be defined as:
>>
>> Vars:
>> username := jcgregorio
>> firstinitial := j
>> Template:
>> {firstinitial}/{username}
>> URI:
>> j/jcgregorio
>>
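A small Python sketch of that alternative, in which the application computes `firstinitial` itself and the template needs no operator at all. The `expand` helper is hypothetical and handles only plain `{name}` expansions; it does no percent-encoding, so it assumes the values are already safe to place in a URI.

```python
import re


def expand(template, variables):
    """Hypothetical minimal expander: plain {name} only, no operators."""
    return re.sub(
        r"\{([A-Za-z0-9_.]+)\}",
        lambda match: variables[match.group(1)],
        template,
    )


username = "jcgregorio"
variables = {
    "username": username,
    # The producer derives the extra variable before expansion.
    "firstinitial": username[:1],
}

print(expand("{firstinitial}/{username}", variables))  # j/jcgregorio
```

The slicing now happens in the application, with whatever string semantics that application uses, so the template language does not have to choose between octets and codepoints.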
>>
>> Thanks,
>> -joe
>>
>
--
<green/>bytes GmbH, Hafenweg 16, D-48155 Münster, Germany
Amtsgericht Münster: HRB5782
Received on Monday, 26 November 2007 07:45:09 UTC