- From: James Cerra <jfcst24_public@yahoo.com>
- Date: Sun, 27 Mar 2005 01:45:29 -0800 (PST)
- To: Mike Brown <mike@skew.org>
- Cc: uri@w3.org
Mike,

Thanks for the explanation. It was incredibly enlightening, but I'm still a little confused.

> Then figure out how best to map them URI characters. This really depends on
> what URI scheme you are producing for, and/or what kind of data you are
> representing...
> ...so your attempt to make a generic API for this is going to be a general
> solution that may require the caller to know what they are doing and only
> call your code when it is really needed.

This is part of the project's specs. I have to extend an RDF-producing framework [1] with string processing functions. The strings produced are concatenated and eventually become a URI. The author of the URI generally will know which sections need to be percent-encoded. I'm simply writing the method to perform that encoding. So when you wrote:

] But let's just say for now that in your API you know you're starting with a
] set of arbitrary bytes and you're going to prepare a fragment of a URI from
] them, and this will be done in a manner that will "just work" 80% of the
] time, regardless of the requirements of specific schemes and contexts. You
] can do this.

That applies to my situation.

> Stop. The unreserved set is a set of characters, not bytes.

This confused me in the spec. When it said, "For consistency, percent-encoded octets in the ranges of..." I assumed that the bytes specified were the unreserved set. I _think_ that I understand my mistake now.

> 1) Input data: characters? Convert to bytes (UTF-8, preferably). Bytes? take
> as-is. Output: Unicode character buffer.
> 2) Input bytes corresponding, in ASCII, to characters in the unreserved set:
> write as characters to output buffer (0x41 -> \u0041)
> Other input bytes: write as percent-encoded octets
> (0x20 -> \u0025\u0032\u0030)
> 3) Serialize string buffer as an ASCII encoded string, or whatever is useful
> to you.

I think I understand.
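Read as code, the three quoted steps might look like the minimal Python sketch below. The function name percent_encode_bytes and the UNRESERVED constant are hypothetical names chosen for illustration, and the unreserved set is assumed to be the one listed in RFC 3986, section 2.3.

```python
# Unreserved characters as listed in RFC 3986, section 2.3 (assumed here).
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)


def percent_encode_bytes(data, encoding="utf-8"):
    """Hypothetical helper: map bytes (or text) to a string of URI characters."""
    # Step 1: characters are converted to bytes; bytes are taken as-is.
    if isinstance(data, str):
        data = data.encode(encoding)
    out = []
    for byte in data:
        ch = chr(byte)
        if ch in UNRESERVED:
            # Step 2: the byte corresponds, in ASCII, to an unreserved
            # character, so it is written as that character (0x41 -> "A").
            out.append(ch)
        else:
            # Step 2: any other byte becomes a percent-encoded octet
            # (0x20 -> "%20").
            out.append("%{:02X}".format(byte))
    # Step 3: the result is a character string; the caller can serialize
    # it as ASCII or whatever else is useful.
    return "".join(out)
```

With this sketch, percent_encode_bytes(b"A") gives "A" and percent_encode_bytes(b" ") gives "%20", matching the two examples in the quoted steps.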
So say the program got the U+4F5B HAN IDEOGRAPH character, and the user wants to use UTF-16 as the character encoding for the bytes. Then the program should:

1) Convert input character to bytes 0x4F 0x5B (step 1).
2a) Write the characters "%" "4" "F" for the first byte (step 2), or
2b) Write character "O" (letter "Oh") for first input byte (step 2).
3) Write characters "%" "5" "B" for second input byte (step 2).
4) Output string of characters (step 3).

So the function maps bytes to characters. From there, the user can apply whatever character encoding they wish to the character string. Right?

The optional (2b) step confuses me. The octet %4F is in the ALPHA range, which the spec says should be converted to its ALPHA character [2]. Unless they mean "unreserved characters NOT bytes," in which case it can't be converted to "O" and step (2b) is wrong (shouldn't be there).

In case (2b) is wrong, I presume the general encoding steps, when beginning with characters (and assuming all reserved characters must be percent-encoded in this URI fragment), are:

1) If character is unreserved, pass through. Otherwise:
2) Convert character to bytes in encoding "X".
3) Percent-encode each byte (%xx %yy ...) and pass those characters through.

So this algorithm converts characters to characters. Take the U+4F5B HAN IDEOGRAPH Unicode character again, and the transcription encoding UTF-16BE, for another example. To get a URI fragment:

1) Character is not unreserved so don't pass through.
2) Convert to bytes in UTF-16BE: 0x4F 0x5B.
3) Percent-encode bytes: %4F %5B.

So the output is the string "%4F%5B". The octet "%4F" cannot be converted to the letter "O" ("Oh") since in UTF-16BE, the letter is "%00%4F".
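To make the character-to-character algorithm concrete, here is a minimal Python sketch of it. The name percent_encode_chars is hypothetical, the unreserved set is assumed to be the one in RFC 3986, section 2.3, and encoding "X" is passed in as an ordinary Python codec name.

```python
# Unreserved characters as listed in RFC 3986, section 2.3 (assumed here).
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)


def percent_encode_chars(text, encoding):
    """Hypothetical helper: characters in, URI characters out."""
    out = []
    for ch in text:
        if ch in UNRESERVED:
            # 1) Unreserved characters pass through untouched.
            out.append(ch)
        else:
            # 2) Convert the character to bytes in encoding "X", then
            # 3) percent-encode every one of those bytes.
            out.extend("%{:02X}".format(b) for b in ch.encode(encoding))
    return "".join(out)


print(percent_encode_chars("\u4f5b", "utf-16-be"))   # %4F%5B
print(percent_encode_chars("O\u4f5b", "utf-16-be"))  # O%4F%5B
```

Because the unreserved test is applied to the character before any bytes exist, the 0x4F byte produced for U+4F5B is never compared against the letter "O"; only a literal "O" in the input comes out as "O".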
Also, since UTF-16BE encodes 2 bytes at a time, the string "%XX%00%4F%5B" (where %XX is anything) can never be normalized with an "O" in the middle, since two octets must be decoded at a time (that is, when decoding a URI, the octets must be read in according to that encoding's "finite state table"). I think the second algorithm is the one to use, but I'm not sure.

Finally, is encoding "X" what RFC 3986 (in section 2.5) calls "data format encoding"?

> Lastly, I can't help but wonder if you're reinventing the wheel. RFC 3986 is
> new and does change a few aspects of RFC 2396, but RFC 2396 based percent
> encoding APIs have long been available in Java, and the differences between
> 3986 and 2396 are not all that significant for the kind of work you're
> doing. I'm sure every API could use some refinement, but it may not be
> crucial for your application... people have been winging it for years and
> years now, with half-baked APIs based on half-baked specifications...

None of the APIs that I investigated did what I needed. They either work with only a limited set of encodings [3, 4] or encode/ignore characters that I don't want encoded/ignored [4, 5]. I searched several more references before deciding to implement this myself.

I appreciate your help. Thanks!

--
Jimmy Cerra

P.S. I rewrote this response several times as I came to understand your post. Please excuse (or point out) any incongruities.

[1] http://www.wiwiss.fu-berlin.de/suhl/bizer/d2rmap/D2Rmap.htm
[2] RFC 3986, Sections 2.3 and 2.4.
[3] http://skew.org/xml/stylesheets/url-encode/
[4] http://www.w3.org/International/O-URL-code.html
[5] java.net.URLEncoder and java.net.URLDecoder
Received on Sunday, 27 March 2005 09:46:00 UTC