Re: Encoding URI From/To UTF-16 Questions from Mike Brown on 2005-03-25 (uri@w3.org from March 2005)

From: Mike Brown <mike@skew.org>
Date: Fri, 25 Mar 2005 03:59:40 -0700 (MST)
To: James Cerra <jfcst24_public@yahoo.com>
CC: uri@w3.org
Message-Id: <200503251059.j2PAxe1D084346@chilled.skew.org>

James Cerra wrote:
> I'm writing a converter in Java for percent-encoding characters not-UNRESERVED
> bytes according to RFC 3986 [1].
>
> There are a few questions I have when
> encoding to/from non ASCII-like character encoding - esp. UTF-16 BE/LE and
> other encoding that use more than one byte per character.
> 

Also consider that unreserved characters can (but shouldn't) be interchanged 
with their percent-encoded equivalents in ASCII. That is, "A" can be "%41", 
and "%41" can be "A". So if you thought you had problems with UTF-16 before, 
your brain should really be fried now. In UTF-16 you need "%41" to just mean 
byte 0x41, not "A"! The problem is mainly just that you're muddling two levels 
of abstraction as you try to apply a percent-encoding algorithm too early.
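
To make the two levels concrete, here is a minimal Java sketch. The class and method names are made up for illustration, and input validation is left out. It shows that percent-decoding yields bytes, and that reading those bytes as characters in some encoding is a separate, later step.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative, not from any library.
final class PercentDecodeDemo {

    // Level 1: URI characters -> bytes. "%XX" becomes one byte; any other
    // character is assumed to be ASCII and stands for its own ASCII byte.
    // No validation: a malformed escape makes this sketch throw.
    static byte[] decodeToBytes(String uriChars) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < uriChars.length(); i++) {
            char c = uriChars.charAt(i);
            if (c == '%') {
                out.write(Integer.parseInt(uriChars.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits just consumed
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // "A" and "%41" are interchangeable: both decode to the single byte 0x41.
        System.out.println(Arrays.equals(decodeToBytes("A"), decodeToBytes("%41")));

        // Level 2: bytes -> characters, in whatever encoding the producer used.
        // Here the bytes <00 41> are read as UTF-16BE, giving the character "A".
        byte[] bytes = decodeToBytes("%00A");
        System.out.println(new String(bytes, StandardCharsets.UTF_16BE));
    }
}
```

Note that "%00A" and "%00%41" decode to the same two bytes, so a consumer that expects UTF-16BE data sees the same character either way.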

> So far, here's the algorithm that I inferred from the spec:
> 
> 1) Given an input byte stream and output byte stream.

Nope. Bear with me, here, as this is a bit of a lengthy explanation.

Think of your problem this way: given a resource (which can be anything, e.g., 
a web page, or the idea of world peace), you are going to represent it with a 
uniform identifier (a sequence of characters, not bytes).

The identifier, a URI, will be of some scheme (http, mailto, etc.) that 
provides syntactic guidelines (simplified guidelines as compared to the 
one-size-fits-all of RFC 3986, but not in conflict with RFC 3986 either), plus 
guidelines for mapping the identifying aspects of the resource (like an email 
address in the case of mailto) to a sequence of characters allowed in a URI. 
Additionally, the scheme may (or may not) imply or mandate a particular 
dereference mechanism that allows a representation of the resource to be 
obtained/acted upon (like, the fetching of a document via HTTP, or the sending 
of a message over an email network).

Your input is data that provides identifying aspects of the resource. In most 
cases, this data will arrive in the form of either bytes (concrete bit 
sequences which may or may not represent characters), or characters (abstract 
concepts representable by encoded bit sequences, scribbles on a napkin, pixels 
on a screen, bumps on a Braille printer, etc.), and will comprise a 'natural' 
identifier for the resource, just not in the format of a URI -- an OS-specific 
file path as compared to a 'file' scheme URI, for example.

Your output, the URI, is going to be a sequence of characters (those abstract 
things, independent of any encoding/representation) that conform to the URI 
syntax (or whatever subset thereof that needs to be percent-encoded). Once you 
have the URI as characters, you can do with it whatever is necessary to make 
it useful to you -- serialize it as bytes in some encoding, write it on a 
wall, speak it aloud, whatever. Of course you can shortcut this by writing 
directly to ASCII bytes, but in order to understand your role in the process, 
you need to think of URI construction in terms of bytes-or-characters-in, 
characters-out.

So first decide what kind of input you are really taking in: characters or 
bytes. Then figure out how best to map them to URI characters. This really 
depends on what URI scheme you are producing for, and/or what kind of data you 
are representing (HTML form data and the application/x-www-form-urlencoded 
media type being applicable to various schemes), and/or what the receiver 
expects (CGI applications vary in their expectations, for example), so your
attempt to make a generic API for this is going to be a general solution that
may require the caller to know what they are doing and only call your code
when it is really needed.

Further complicating matters is that the specs governing the URI schemes and 
things like HTML form data submissions and CGI *should* (in everyone's fantasy 
world) be very clear about how data gets from its native format into a URI, 
but in practice, they rarely make things very clear at all. Consequently, a 
lot of what goes on in this area in the real world is ad-hoc. There is a 
trend toward making everything UTF-8 based, but this is often just a 
recommendation going forward, at best, and does not affect deployed 
implementations and long-unupdated specs.

But let's just say for now that in your API you know you're starting with a 
set of arbitrary bytes and you're going to prepare a fragment of a URI from 
them, and this will be done in a manner that will "just work" 80% of the time, 
regardless of the requirements of specific schemes and contexts. You can do 
this.

> 2) If an input byte is in the UNRESERVED set [2] then write to the 
>    output stream.

Stop. The unreserved set is a set of characters, not bytes.

> 3) Otherwise write 0x25 [3] and then the two byte hex version of the 
>    input byte, in ASCII, to output stream.
> 4) Continue on to end of stream.  Output stream is in ASCII.

The algorithm you are misquoting, if from RFC 3986, is intended to tell you 
how to go about representing *URI characters* in a URI. That is, once you have 
already converted your input data (the resource's 'natural' identifier, or 
some fragment thereof) to URI characters and percent-encoded octets, THEN you 
decide whether percent-encoding needs to be applied to any of the URI 
characters: unreserved characters can go in directly, reserved characters can 
go in directly if being used for their reserved purpose, and any other 
reserved characters must be converted to percent-encoded octets based on ASCII 
-- that's it; no other provisions need to exist, in the generic syntax, for 
representing any other characters, because the URI is at a higher level of 
abstraction than your input data.

You are tempted to think, thanks to very poorly written things like HTML's 
definition of the application/x-www-form-urlencoded media type, as well as 
fairly well written things like RFC 2396, that you should do a one-to-one 
mapping of your input data, if it is character based, to URI characters, 
taking care to percent-encode any that are not in the unreserved set. As you 
discovered and as I tried to explain above, this only "works" if you base the 
percent-encoded octets on a character encoding that will not result in any 
ambiguity -- ASCII, UTF-8, ISO-8859-1, etc. are OK, but UTF-16 is not. There 
are indeed specifications that actually say to go about it in this way, but 
that's because they were written a long time ago for a world that was 
ASCII-based, using single-byte encodings and not differentiating between 
characters and bytes.

This actually was clearer in RFC 1738 than in RFCs 2396 and 3986, in my
opinion, but the ideal method of producing URI characters from
character-based data is to ALWAYS convert the character-based data to bytes
first, then use percent-encoded octets for any bytes that aren't safe to
replace with their corresponding ASCII characters.

So here is the algorithm I think you want:

1) Input data: characters? Convert to bytes (UTF-8, preferably). Bytes? Take as-is.
   Output: Unicode character buffer.
2) Input bytes corresponding, in ASCII, to characters in the unreserved set:
   write as characters to output buffer (0x41 -> \u0041).
   Other input bytes: write as percent-encoded octets (0x20 -> \u0025\u0032\u0030).
3) Serialize the character buffer as an ASCII-encoded string, or whatever is useful to you.
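
In Java, the three steps above might be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, UTF-8 is assumed for character input, and the unreserved set is taken from RFC 3986 (ALPHA, DIGIT, "-", ".", "_", "~").

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; names and signatures are illustrative only.
final class PercentEncoder {
    // The RFC 3986 unreserved set, as characters.
    private static final String UNRESERVED =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Step 2: each input byte becomes either one unreserved character
    // or a three-character percent-encoded octet.
    static String encodeBytes(byte[] input) {
        StringBuilder out = new StringBuilder(input.length * 3);
        for (byte b : input) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (UNRESERVED.indexOf(v) >= 0) {
                out.append((char) v);
            } else {
                out.append('%').append(HEX[v >> 4]).append(HEX[v & 0x0F]);
            }
        }
        return out.toString();
    }

    // Step 1 for character input: convert to bytes first (UTF-8 assumed here).
    static String encodeText(String text) {
        return encodeBytes(text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String fragment = encodeText("a b");
        System.out.println(fragment);
        // Step 3: serialize the characters however is useful, e.g. as ASCII bytes.
        byte[] onTheWire = fragment.getBytes(StandardCharsets.US_ASCII);
        System.out.println(onTheWire.length);
    }
}
```

A caller who already has bytes would use encodeBytes directly; encodeText is only a convenience for the case where the input really is characters and UTF-8 is an acceptable choice.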

As I said, this is very general. It is hard to make an API that will work for 
every situation. You need to take into account what your input data really is, 
what kind of URIs (if they are indeed URIs) you are producing, and what 
will be done with them.

> "%00g%00o%00o%00g%00l%00e"
> 
> Is this correct?

Well...

If your intent was to prepare the character string "google" for incorporation 
in a URI, in the absence of clear guidelines for mapping the characters 
g,o,o,g,l,e to URI characters for a particular URI scheme and application 
context, then you did not choose the ideal, recommended representation, which 
would've just been "google". (I'm hesitating to say it's "incorrect").

If your intent was to prepare the byte sequence <00 67 00 6F 00 6F 00 67 00 6C 
00 65>, which happens to be the UTF-16BE representation of the character string 
"google", for incorporation in a URI (with the caveats above), then yes, you chose 
the ideal, recommended representation for that input data. You could've also 
chosen "%00%67%00%6F%00%6F%00%67%00%6C%00%65" although that's not the preferred
form.

In any case, if the consumer of this data knows what to do with it, and it 
does not violate the generic syntax, then it is "correct". (Well, aside from 
the fact that you said it was UTF-16LE based... looks like UTF-16BE to me!)
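
For what it's worth, the byte-order point is easy to check in Java. The following is a small sketch, with a hypothetical class name and a compact byte-to-URI-character encoder assumed for the purpose. It runs the same "google" string through three encodings.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the encoder is a compact illustration only.
final class Utf16GoogleDemo {
    private static final String UNRESERVED =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";

    // Unreserved ASCII bytes pass through as characters; all others become %XX.
    static String encodeBytes(byte[] input) {
        StringBuilder out = new StringBuilder();
        for (byte b : input) {
            int v = b & 0xFF;
            if (UNRESERVED.indexOf(v) >= 0) {
                out.append((char) v);
            } else {
                out.append(String.format("%%%02X", v));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "google";
        // Big-endian: the zero byte comes first in each 16-bit unit.
        System.out.println(encodeBytes(s.getBytes(StandardCharsets.UTF_16BE)));
        // Little-endian: the zero byte comes second.
        System.out.println(encodeBytes(s.getBytes(StandardCharsets.UTF_16LE)));
        // UTF-8: every byte is an unreserved ASCII character.
        System.out.println(encodeBytes(s.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The big-endian bytes give "%00g%00o%00o%00g%00l%00e", the little-endian bytes give "g%00o%00o%00g%00l%00e%00", and the UTF-8 bytes give plain "google".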

> And how should one interpret the
> scheme component - i.e. "http://" - in a string starting from UTF-16?  Surely
> the output shouldn't be "%00h%00t%00t%00p..."!

Of course not; the syntax forbids percent-encoded octets from appearing in the 
scheme component. The RFC also tells you to be careful to only apply percent 
encoding to the components that require it, during the construction of a URI; 
don't apply it blindly to an already-assembled URI.
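
As a rough illustration of encoding per component rather than over the finished URI, here is a Java sketch. Everything in it is assumed for the example: the class and method names, the http scheme, UTF-8 as the byte encoding, and a host that is already a valid ASCII host name.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names and the "http" scheme are assumptions.
final class ComponentEncodingDemo {
    private static final String UNRESERVED =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";

    // Percent-encode one piece of data (UTF-8 bytes, unreserved set kept as-is).
    static String encodePiece(String piece) {
        StringBuilder out = new StringBuilder();
        for (byte b : piece.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (UNRESERVED.indexOf(v) >= 0) {
                out.append((char) v);
            } else {
                out.append(String.format("%%%02X", v));
            }
        }
        return out.toString();
    }

    // Assemble the URI: the scheme and delimiters are written literally,
    // and only the data inside each path segment is percent-encoded.
    static String buildHttpUri(String host, String... segments) {
        StringBuilder uri = new StringBuilder("http://").append(host);
        for (String segment : segments) {
            uri.append('/').append(encodePiece(segment));
        }
        return uri.toString();
    }

    public static void main(String[] args) {
        // A "/" that is data inside a segment is encoded; delimiters are not.
        System.out.println(buildHttpUri("example.org", "docs", "a/b c"));
        // Encoding an already-assembled URI mangles the scheme and delimiters.
        System.out.println(encodePiece("http://example.org/docs"));
    }
}
```

The first call yields "http://example.org/docs/a%2Fb%20c"; the second, blind call yields "http%3A%2F%2Fexample.org%2Fdocs", which is no longer an http URI at all.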

Lastly, I can't help but wonder if you're reinventing the wheel. RFC 3986 is 
new and does change a few aspects of RFC 2396, but RFC 2396 based percent 
encoding APIs have long been available in Java, and the differences between 
3986 and 2396 are not all that significant for the kind of work you're doing. 
I'm sure every API could use some refinement, but it may not be crucial for 
your application... people have been winging it for years and years now, with 
half-baked APIs based on half-baked specifications...
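
One such API, assuming it suits the application, is java.net.URI, which is documented as following RFC 2396 and whose multi-argument constructors quote illegal characters per component. A minimal sketch (the wrapper class and method names are hypothetical, and the http scheme is just an example):

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical wrapper; only java.net.URI itself is a real library class.
final class JavaNetUriDemo {

    // Build an http URI from a host and an unencoded path, letting the
    // URI(scheme, host, path, fragment) constructor do the quoting.
    static String httpUri(String host, String path) {
        try {
            // toASCIIString() additionally percent-encodes non-ASCII
            // characters using UTF-8.
            return new URI("http", host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(httpUri("example.org", "/a b"));
        System.out.println(httpUri("example.org", "/caf\u00e9"));
    }
}
```

Whether its RFC 2396 behaviour is close enough to RFC 3986 for a given application is something the caller would have to check.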

-Mike
Received on Friday, 25 March 2005 10:59:42 UTC