Re: Updated URI Template proposal from Stefan Eissing on 2007-11-26 (uri@w3.org from November 2007)

From: Stefan Eissing <stefan.eissing@greenbytes.de>
Date: Mon, 26 Nov 2007 08:44:51 +0100
To: James M Snell <jasnell@gmail.com>
Cc: Joe Gregorio <joe@bitworking.org>, URI <uri@w3.org>
Message-Id: <60883D26-9A3B-433E-97FE-37B376EBC936@greenbytes.de>
-1.

I think this can get messy very quickly.

Once we start down the slippery slope of string operators, we will soon
need to define string.length the same way, and after that possibly invent
our own regular-expression dialect (as if the world needed another one).


On 23.11.2007 at 21:46, James M Snell wrote:

>
> Well, I'm not absolutely convinced it's required either but I can
> definitely imagine scenarios where it would be useful.  One possible
> approach would be to have sub work against unreserved and pct-encoded
> characters, e.g.
>
>   template-char = unreserved / pct-encoded
>
> sub would operate on template-char
>
>   {-sub|0-1|foo=%FF%FF%FF}  == %FF
>
>   {-sub|0-2|foo=f%FFf%FF}   == f%FF
>
>   {-sub|0-3|foo=f%FFf%FF}   == f%FFf
>
>   {-sub|1-2|foo=f%FFf%FF}   == %FFf
>
> - James
>
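
The template-char idea quoted above can be sketched in a few lines of Python. This is only an illustration, not text from any draft: the names `template_chars` and `sub` are hypothetical, `unreserved` is taken from RFC 3986, and the sketch assumes that the two numbers in `{-sub|a-b|var}` are an offset and a count, which is one reading under which all four quoted examples come out as written.

```python
import re

# template-char = unreserved / pct-encoded
# (unreserved per RFC 3986: ALPHA / DIGIT / "-" / "." / "_" / "~")
TEMPLATE_CHAR = re.compile(r"%[0-9A-Fa-f]{2}|[A-Za-z0-9\-._~]")


def template_chars(value):
    """Split an already-encoded value into template-chars.

    Raises ValueError if the value contains anything that is neither
    an unreserved character nor a complete percent-encoded triplet.
    """
    chars = []
    pos = 0
    while pos < len(value):
        match = TEMPLATE_CHAR.match(value, pos)
        if match is None:
            raise ValueError("not a template-char at offset %d: %r" % (pos, value))
        chars.append(match.group())
        pos = match.end()
    return chars


def sub(offset, count, value):
    """Hypothetical '-sub': take `count` template-chars starting at `offset`.

    Assumption: "a-b" is read as offset a, count b.
    """
    return "".join(template_chars(value)[offset:offset + count])
```

Under that reading, `sub(1, 2, "f%FFf%FF")` selects the second and third template-chars, so a percent-encoded triplet is never cut in half.
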
> Joe Gregorio wrote:
>> On Nov 5, 2007 1:36 PM, James M Snell <jasnell@gmail.com> wrote:
>>> Joe Gregorio wrote:
>>>> 2. The 'sub' operator could either be defined to operate on
>>>>     the octets of the variable's value, or on the Unicode code
>>>>     points of the equivalent UTF-8 decoded string. Both have their
>>>>     pros and cons.
>>>>
>>> I would think that unicode codepoints would be what folks would
>>> typically expect.  If we need to support both, different op codes
>>> can be used...
>>>
>>>   octets     = {-sub|0-1|username}
>>>   codepoints = {-subc|0-1|username}
>>
>> In updating the specification and associated code and
>> examples I've come to believe that you can't do it
>> by unicode codepoint, simply because you can't be
>> certain that the source data was a unicode string.
>> That is, I was going to suggest that '-sub' work by:
>>
>>  1. Percent-decode the variable's value.
>>  2. Convert it from UTF-8 to Unicode.
>>  3. Do the sub-string selection on the codepoints.
>>  4. Substitute the substring of codepoints after they are
>>     converted back to UTF-8, and percent-encode all octets
>>     that fall outside ( unreserved / pct-encoded ).
>>
>> That won't work because the value might be a percent-encoded binary
>> blob.  Here is a concrete example; the following substring operator
>> will fail using the above algorithm:
>>
>>    Vars:
>>        foo := %FF%FF%FF
>>    Template:
>>        {-sub|0-1|foo}
>>
>>
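
The four steps and the failure case quoted above can be reproduced with the Python standard library. This is a rough sketch rather than anything from the specification: the function name `sub_by_codepoint` is hypothetical, and the range is assumed to be inclusive on both ends, as the `{-sub|0-0|username}` example in this thread suggests.

```python
from urllib.parse import quote, unquote_to_bytes


def sub_by_codepoint(first, last, value):
    """Hypothetical '-sub' working on Unicode code points.

    Assumption: first-last is an inclusive range of code point indexes.
    """
    raw = unquote_to_bytes(value)      # 1. percent-decode the value
    text = raw.decode("utf-8")         # 2. UTF-8 -> Unicode (may raise)
    piece = text[first:last + 1]       # 3. select the code points
    # 4. back to UTF-8, percent-encoding everything outside unreserved
    return quote(piece.encode("utf-8"), safe="")


if __name__ == "__main__":
    print(sub_by_codepoint(0, 0, "jcgregorio"))
    try:
        sub_by_codepoint(0, 1, "%FF%FF%FF")
    except UnicodeDecodeError as exc:
        # 0xFF can never appear in well-formed UTF-8, so step 2 fails.
        print("expansion failed:", exc.reason)
```

For a value that really is percent-encoded UTF-8 the round trip works; for the binary blob `%FF%FF%FF` step 2 raises `UnicodeDecodeError`, which is the failure described above.
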
>> I see several different solutions:
>>
>>
>> 1. Keep '-sub' but only have it act on the variable
>>     value w/o doing any decoding back to codepoints.
>>
>>     I.e.
>>        {-sub|0-1|foo=%FF%FF%FF}
>>     becomes:
>>        "%F"
>>
>>     Of limited use.
>>
>> 2. Keep '-sub' and define the algorithm to decode
>>     back to codepoints but put large warnings in the spec
>>     not to design URI Templates that would apply a '-sub'
>>     expansion on a non-unicode string variable.
>>
>>     In this case the above expansion would fail.
>>
>> 3. Drop '-sub'.
>>
>>    At this point this is probably my favorite option. I'm not sure
>>    how useful '-sub' would be, or that the functionality it offers
>>    can't be done using the other operators. For example, the
>>    motivating example was:
>>
>>    Vars:
>>        username := jcgregorio
>>    Template:
>>        {-sub|0-0|username}/{username}
>>    URI:
>>        j/jcgregorio
>>
>>   But couldn't that be defined as:
>>
>>    Vars:
>>        username := jcgregorio
>>        firstinitial := j
>>    Template:
>>        {firstinitial}/{username}
>>    URI:
>>        j/jcgregorio
>>
>>
>>    Thanks,
>>    -joe
>>
>
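
For comparison, options 1 and 3 from the quoted message need almost no machinery. The sketch below is illustrative only: `sub_raw` and `expand` are hypothetical names, the range is assumed to be inclusive, and `expand` handles nothing but plain `{name}` substitution of values that are already safe to place in a URI.

```python
import re


def sub_raw(first, last, value):
    """Option 1: slice the encoded value as raw characters, no decoding.

    Assumption: first-last is an inclusive range of character indexes.
    """
    return value[first:last + 1]


def expand(template, variables):
    """Option 3: no '-sub' at all, only plain {name} substitution.

    The caller supplies any derived value (such as a first initial)
    as a variable of its own.
    """
    return re.sub(r"\{(\w+)\}", lambda m: variables[m.group(1)], template)


if __name__ == "__main__":
    # Option 1 cuts a percent-encoded triplet in half.
    print(sub_raw(0, 1, "%FF%FF%FF"))
    # Option 3 moves the string handling out of the template.
    username = "jcgregorio"
    print(expand("{firstinitial}/{username}",
                 {"username": username, "firstinitial": username[0]}))
```

With option 3 the substring logic lives in whatever language the caller is already using, so the template syntax does not have to grow string operators of its own.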

--
<green/>bytes GmbH, Hafenweg 16, D-48155 Münster, Germany
Amtsgericht Münster: HRB5782
Received on Monday, 26 November 2007 07:45:09 UTC