- From: Ilguiz [eel ghEEz] Latypov <notifications@github.com>
- Date: Wed, 25 Nov 2015 16:49:01 -0800
- To: whatwg/url <url@noreply.github.com>
- Message-ID: <whatwg/url/issues/74@github.com>
RFC 3986 suggests relying only on the smallest set of reserved characters necessary to split the URL into its 5 components (Section 5.2.1, Pre-parse the Base URI). Assuming that the RFC implies left-to-right parsing, that would mean encoding only the terminator that the parser expects in each component. The query component has the hash mark as its terminator. The RFC goes as far as to recommend keeping as many characters raw as possible in Section 3.4, Query:

```
as [..] one frequently used [query] value is a reference to another URI, it is sometimes better for usability to avoid percent-encoding those characters.
```

On the other hand, the following part of the RFC requires encoding some special characters. (I did not find a reason for this in the RFC, and browsers appear to emit raw back-ticks et al. in their HTTP requests, as mentioned in a Mozilla bug referenced from #17, despite this requirement.)

```
pchar       = unreserved / pct-encoded / sub-delims / ":" / "@"
query       = *( pchar / "/" / "?" )
pct-encoded = "%" HEXDIG HEXDIG
unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
reserved    = gen-delims / sub-delims
gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims  = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
```

https://www.ietf.org/rfc/rfc3986.txt

Moreover, Appendix C, Delimiting a URI in Context, seems to imply that double quotes `"` (and, by extension to HTML attributes, single quotes `'`), whitespace ` `, hyphens `-` and angle brackets `<>` need encoding when URLs are embedded in the context of another parser or a human reader. I suggest remaining strict about parsers external to the URL parser and letting additional encoders protect against specific external parsers. These encoders will not have to comply with the encoding rules of the URL syntax.
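As a rough illustration of that layering, here is a minimal Python sketch in which the URL layer percent-encodes only what the URL syntax needs, and a separate HTML layer protects the finished URL inside an attribute. The helper names `query_value` and `href_attribute` are hypothetical, and the choice of which characters stay raw is an assumption made for the example, not something the RFC prescribes.

```python
import html
from urllib.parse import quote

# Assumption for this sketch: characters left raw inside a query value.
# "&", "=", "+", "#" and "%" are deliberately absent, so quote() encodes them.
RAW_IN_QUERY_VALUE = "-._~!'()*,;:@/?"


def query_value(value: str) -> str:
    """URL layer: percent-encode a value for use in the query component."""
    return quote(value, safe=RAW_IN_QUERY_VALUE)


def href_attribute(url: str) -> str:
    """HTML layer: escape an already valid URL for a double-quoted attribute.

    This encoder follows HTML rules, not URL rules: "&" becomes "&amp;".
    """
    return 'href="' + html.escape(url, quote=True) + '"'


url = "http://example.com/?a=1&b=" + query_value('x y"z')
print(href_attribute(url))
```

The point of the split is that the URL encoder never needs to know about quotes or angle brackets in the surrounding document; the HTML escaper handles those and is undone by the HTML parser before the URL parser ever sees the string.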
For example, the ampersand character may be encoded as the HTML entity `&amp;`.

The RFC suggests that queries consist of name=value pairs without even defining the delimiter between these pairs. So far I see the following algorithm for name=value pairs as satisfying the RFC's _musts_ and following its _shoulds_. I guess this should agree with https://github.com/tkem/uritools. (The RFC does not mention the vestige of the isindex HTML tag, which submits a request with words separated by plus characters: the plus character in the query part of the URL decodes to the space character.)

```
GetURLFromClient(network) -> URL
  ==> URL comes as a byte array.

GetURLFromAddressBar(browser) -> UnicodeURL
  ==> The following parsers may accept Unicode strings as long as they
      allow a mix of UTF-8 byte arrays and UTF-16 code units when parsing
      percent-encoded strings.

URLParser(URL) -> (scheme authority path query fragment):
  Split the string URL based on the structure:
    scheme ":" hier-part [ "?" query ] [ "#" fragment ]
  (The parser will split hier-part into authority and path expecting an
  optional leading double-slash and a slash indicating the beginning of
  the path).
  ==> query should hide its own ASCII hex 23#. The encoder opposite to
      QueryParser will provide that.

QueryParser(query) -> *(name value)
  (a) Expect query to comply with the spec (no reasoning except
      protecting against the fragment 23# search and prematurely
      protecting against 7.3 Back-End Transcoding).
        query = *( ALPHA / DIGIT / pct-encoded / one-of "-._~!$&'()*+,;=:@/?" )
      ==> query must hide the following characters found in
          *(name value): ASCII 00-1F, 20SP, 22", 23#, 24$, 25%, 2B+, 3C<,
          3E>, 5B[, 5C\, 5D], 5E^, 60`, 7B{, 7C|, 7D}, 7FDEL, non-ASCII.
  (b) Split the string query expecting separators "&", "=" into
      *(name value) pairs.
      ==> names and values must protect own "26&", "3D=".
  (c) Decode percent-encoded UTF-8 in query to UTF-16 code units in
      *(name value) pairs. Decode the "+" vestige to 20SP.
```

---
Reply to this email directly or view it on GitHub: https://github.com/whatwg/url/issues/74
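The QueryParser steps (a) to (c) from the message above, together with the opposite encoder, might look roughly like this in Python. This is only a sketch under stated assumptions: `parse_query` and `build_query` are hypothetical names, Python strings hold Unicode code points rather than UTF-16 code units, and the encoder follows the message's "must hide" list (so it also encodes `$` and `+`, although the grammar in step (a) accepts them raw).

```python
import re
from urllib.parse import quote, unquote

# Step (a): the query grammar quoted in the message, as a regular expression.
_QUERY_RE = re.compile(r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@/?]|%[0-9A-Fa-f]{2})*")

# Assumption: characters the encoder leaves raw in names and values.
# "&", "=", "+", "$", "#" and "%" are absent, so they get percent-encoded.
_RAW = "-._~!'()*,;:@/?"


def _decode(text: str) -> str:
    # Step (c): "+" is the isindex vestige for a space, then percent-decode UTF-8.
    return unquote(text.replace("+", " "), encoding="utf-8", errors="strict")


def parse_query(query: str) -> list[tuple[str, str]]:
    """Parse a raw query component into (name, value) pairs."""
    # Step (a): reject anything outside the grammar, e.g. a raw "#" or "`".
    if _QUERY_RE.fullmatch(query) is None:
        raise ValueError("query contains characters outside the RFC 3986 grammar")
    pairs = []
    # Step (b): split on "&", then on the first "=" of each field.
    for field in query.split("&"):
        if not field:
            continue  # skip empty fields such as in "a=1&&b=2"
        name, _, value = field.partition("=")
        pairs.append((_decode(name), _decode(value)))
    return pairs


def build_query(pairs: list[tuple[str, str]]) -> str:
    """Opposite encoder: hide every character the parser would misread."""
    return "&".join(
        f"{quote(name, safe=_RAW)}={quote(value, safe=_RAW)}"
        for name, value in pairs
    )


print(build_query([("next", "http://example.com/?x:1"), ("q", "a b&c=d#e+f")]))
print(parse_query("a=1+2&b=%C3%A9"))
```

Note how a URL passed as a value keeps its `:`, `/` and `?` raw, in the spirit of the Section 3.4 remark quoted earlier, while `&`, `=`, `#` and `+` inside a value are percent-encoded so that the splitting in step (b) and the fragment search in URLParser stay unambiguous.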
Received on Thursday, 26 November 2015 00:49:35 UTC