- From: Mike Brown <mike@skew.org>
- Date: Fri, 25 Mar 2005 03:59:40 -0700 (MST)
- To: James Cerra <jfcst24_public@yahoo.com>
- CC: uri@w3.org
James Cerra wrote:
> I'm writing a converter in Java for percent-encoding characters
> not-UNRESERVED bytes according to RFC 3986 [1].
>
> There are a few questions I have when
> encoding to/from non ASCII-like character encoding - esp. UTF-16 BE/LE and
> other encoding that use more than one byte per character.

Also consider that unreserved characters can (but shouldn't) be interchanged with their percent-encoded equivalents in ASCII. That is, "A" can be "%41", and "%41" can be "A". So if you thought you had problems with UTF-16 before, your brain should really be fried now. In UTF-16 you need "%41" to just mean byte 0x41, not "A"!

The problem is mainly just that you're muddling two levels of abstraction as you try to apply a percent-encoding algorithm too early.

> So far, here's the algorithm that I inferred from the spec:
>
> 1) Given an input byte stream and output byte stream.

Nope. Bear with me here, as this is a bit of a lengthy explanation.

Think of your problem this way: given a resource (which can be anything, e.g., a web page, or the idea of world peace), you are going to represent it with a uniform identifier (a sequence of characters, not bytes). The identifier, a URI, will be of some scheme (http, mailto, etc.) that provides syntactic guidelines (simplified as compared to the one-size-fits-all guidelines of RFC 3986, but not in conflict with RFC 3986 either), plus guidelines for mapping the identifying aspects of the resource (like an email address in the case of mailto) to a sequence of characters allowed in a URI. Additionally, the scheme may (or may not) imply or mandate a particular dereference mechanism that allows a representation of the resource to be obtained/acted upon (like the fetching of a document via HTTP, or the sending of a message over an email network).

Your input is data that provides identifying aspects of the resource. In most cases, this data will arrive in the form of either bytes (concrete bit sequences which may or may not represent characters) or characters (abstract concepts representable by encoded bit sequences, scribbles on a napkin, pixels on a screen, bumps on a Braille printer, etc.), and will comprise a 'natural' identifier for the resource, just not in the format of a URI -- an OS-specific file path as compared to a 'file' scheme URI, for example.

Your output, the URI, is going to be a sequence of characters (those abstract things, independent of any encoding/representation) that conform to the URI syntax (or whatever subset thereof needs to be percent-encoded). Once you have the URI as characters, you can do with it whatever is necessary to make it useful to you -- serialize it as bytes in some encoding, write it on a wall, speak it aloud, whatever. Of course you can shortcut this by writing directly to ASCII bytes, but in order to understand your role in the process, you need to think of URI construction in terms of bytes-or-characters-in, characters-out.

So first decide what kind of input you are really taking in: characters or bytes. Then figure out how best to map them to URI characters.
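In Java terms, that bytes-or-characters-in, characters-out framing might look roughly like the following minimal sketch (the interface and method names are purely illustrative, not part of any existing API):

    import java.nio.charset.Charset;

    // Hypothetical shape for a percent-encoding API: the input is either raw
    // octets, or characters plus the charset used to turn them into octets;
    // the output is always a string of URI characters.
    public interface PercentEncoder {
        String encode(byte[] octets);                  // bytes in, URI characters out
        String encode(String chars, Charset charset);  // characters -> octets -> URI characters
    }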
How best to map them to URI characters really depends on what URI scheme you are producing for, and/or what kind of data you are representing (HTML form data and the application/x-www-form-urlencoded media type being applicable to various schemes), and/or what the receiver expects (CGI applications vary in their expectations, for example). So your attempt to make a generic API for this is going to be a general solution that may require the caller to know what they are doing and to only call your code when it is really needed.

Further complicating matters is that the specs governing the URI schemes and things like HTML form data submissions and CGI *should* (in everyone's fantasy world) be very clear about how data gets from its native format into a URI, but in practice they rarely make things very clear at all. Consequently, a lot of what goes on in this area in the real world is ad hoc. There is a trend toward making everything UTF-8 based, but this is often just a recommendation going forward, at best, and does not affect deployed implementations and long-unupdated specs.

But let's just say for now that in your API you know you're starting with a set of arbitrary bytes and you're going to prepare a fragment of a URI from them, and this will be done in a manner that will "just work" 80% of the time, regardless of the requirements of specific schemes and contexts. You can do this.

> 2) If an input byte is in the UNRESERVED set [2] then write to the
> output stream.

Stop. The unreserved set is a set of characters, not bytes.

> 3) Otherwise write 0x25 [3] and then the two byte hex version of the
> input byte, in ASCII, to output stream.
> 4) Continue on to end of stream. Output stream is in ASCII.

The algorithm you are misquoting, if from RFC 3986, is intended to tell you how to go about representing *URI characters* in a URI. That is, once you have already converted your input data (the resource's 'natural' identifier, or some fragment thereof) to URI characters and percent-encoded octets, THEN you decide whether percent-encoding needs to be applied to any of the URI characters: unreserved characters can go in directly, reserved characters can go in directly if being used for their reserved purpose, and any other reserved characters must be converted to percent-encoded octets based on ASCII. That's it; no other provisions need to exist, in the generic syntax, for representing any other characters, because the URI is at a higher level of abstraction than your input data.

You are tempted to think, thanks to very poorly written things like HTML's definition of the application/x-www-form-urlencoded media type, as well as fairly well written things like RFC 2396, that you should do a one-to-one mapping of your input data, if it is character based, to URI characters, taking care to percent-encode any that are not in the unreserved set. As you discovered, and as I tried to explain above, this only "works" if you base the percent-encoded octets on a character encoding that will not result in any ambiguity -- ASCII, UTF-8, ISO-8859-1, etc. are OK, but UTF-16 is not. There are indeed specifications that actually say to go about it in this way, but that's because they were written a long time ago for a world that was ASCII-based, using single-byte encodings and not differentiating between characters and bytes.
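To see that ambiguity concretely, here is a small Java illustration (the pctDecode helper below is mine, written just for this example): it decodes "%XX" triplets to raw octets, treats bare characters as their ASCII octets, and then interprets the resulting octets under different encodings.

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;

    public class Utf16AmbiguityDemo {

        // Naive byte-level decoder: "%XX" becomes the octet 0xXX, and any
        // bare character becomes its ASCII octet. (Illustrative helper only.)
        static byte[] pctDecode(String s) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                if (c == '%') {
                    out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                    i += 2;
                } else {
                    out.write(c);
                }
            }
            return out.toByteArray();
        }

        public static void main(String[] args) {
            // With ASCII/UTF-8-based octets, "A" and "%41" are interchangeable:
            System.out.println(new String(pctDecode("%41"), StandardCharsets.UTF_8)); // A
            System.out.println(new String(pctDecode("A"), StandardCharsets.UTF_8));   // A

            // With UTF-16BE-based octets, the character A is the octet pair 00 41,
            // so only "%00%41" decodes back to "A" ...
            System.out.println(new String(pctDecode("%00%41"), StandardCharsets.UTF_16BE)); // A

            // ... while the bare characters "AB" decode to the octets 41 42,
            // which is the single character U+4142, not "AB". The equivalence
            // the generic syntax promises no longer holds.
            System.out.println(new String(pctDecode("AB"), StandardCharsets.UTF_16BE)); // U+4142
        }
    }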
This actually was clearer in RFC 1738 than in RFCs 2396 and 3986, in my opinion, but the ideal method of producing URI characters from character-based data is to ALWAYS convert the character-based data to bytes first, then use percent-encoded octets for any bytes that aren't safe to replace with their corresponding ASCII characters.

So here is the algorithm I think you want (a rough Java sketch of it appears at the end of this message):

1) Input data: characters? Convert to bytes (UTF-8, preferably). Bytes? Take as-is. Output: Unicode character buffer.

2) Input bytes corresponding, in ASCII, to characters in the unreserved set: write as characters to the output buffer (0x41 -> \u0041). Other input bytes: write as percent-encoded octets (0x20 -> \u0025\u0032\u0030).

3) Serialize the string buffer as an ASCII encoded string, or whatever is useful to you.

As I said, this is very general. It is hard to make an API that will work for every situation. You need to take into account what your input data really is, what kind of URIs (if it is indeed URIs) you are producing, and what will be done with them.

> "%00g%00o%00o%00g%00l%00e"
>
> Is this correct?

Well...

If your intent was to prepare the character string "google" for incorporation in a URI, in the absence of clear guidelines for mapping the characters g,o,o,g,l,e to URI characters for a particular URI scheme and application context, then you did not choose the ideal, recommended representation, which would've just been "google". (I'm hesitating to say it's "incorrect".)

If your intent was to prepare the byte sequence <00 67 00 6F 00 6F 00 67 00 6C 00 65>, which happens to be the UTF-16BE representation of the character string "google", for incorporation in a URI (+ caveats above), then yes, you chose the ideal, recommended representation for that input data. You could've also chosen "%00%67%00%6F%00%6F%00%67%00%6C%00%65", although that's not the preferred form. In any case, if the consumer of this data knows what to do with it, and it does not violate the generic syntax, then it is "correct". (Well, aside from the fact that you said it was UTF-16LE based... looks like UTF-16BE to me!)

> And how should one interpret the
> scheme component - i.e. "http://" - in a string starting from UTF-16? Surely
> the output shouldn't be "%00h%00t%00t%00p..."!

Of course not; the syntax forbids percent-encoded octets from appearing in the scheme component. The RFC also tells you to be careful to apply percent-encoding only to the components that require it, during the construction of a URI; don't apply it blindly to an already-assembled URI.

Lastly, I can't help but wonder if you're reinventing the wheel. RFC 3986 is new and does change a few aspects of RFC 2396, but RFC 2396-based percent-encoding APIs have long been available in Java, and the differences between 3986 and 2396 are not all that significant for the kind of work you're doing. I'm sure every API could use some refinement, but it may not be crucial for your application... people have been winging it for years and years now, with half-baked APIs based on half-baked specifications...

-Mike
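P.S. For what it's worth, a minimal Java sketch of the recipe above might look something like the following. The class and method names are mine and purely illustrative; it assumes UTF-8 for character input and the RFC 3986 unreserved set.

    import java.nio.charset.StandardCharsets;

    public final class PercentEncoding {

        // RFC 3986 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~"
        private static boolean isUnreserved(int b) {
            return (b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z')
                    || (b >= '0' && b <= '9')
                    || b == '-' || b == '.' || b == '_' || b == '~';
        }

        // Step 1: character-based input is converted to bytes (UTF-8 here);
        // byte-based input would go straight to encode(byte[]).
        public static String encode(String chars) {
            return encode(chars.getBytes(StandardCharsets.UTF_8));
        }

        // Steps 2 and 3: unreserved octets become the corresponding URI
        // characters; every other octet becomes a %XX triplet. The result is
        // a string of URI characters (all ASCII), ready to be serialized
        // however is useful.
        public static String encode(byte[] octets) {
            StringBuilder out = new StringBuilder(octets.length * 3);
            for (byte octet : octets) {
                int b = octet & 0xFF;
                if (isUnreserved(b)) {
                    out.append((char) b);                              // e.g. 0x41 -> 'A'
                } else {
                    out.append('%').append(String.format("%02X", b));  // e.g. 0x20 -> "%20"
                }
            }
            return out.toString();
        }
    }

For example, encode("caf\u00E9") yields "caf%C3%A9", and encode(new byte[]{0x00, 0x67}) yields "%00g", matching the UTF-16BE example above.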
Received on Friday, 25 March 2005 10:59:42 UTC