Re: URLencoding. from Dan Connolly on 2000-04-07 (www-html@w3.org from April 2000)

From: Dan Connolly <connolly@w3.org>
Date: Fri, 07 Apr 2000 10:42:39 -0500
To: Dave J Woolley <DJW@bts.co.uk>
CC: "'www-html@w3.org'" <www-html@w3.org>
Message-ID: <38EE01EF.D0E8573F@w3.org>

Dave J Woolley wrote:
> 
> > From: Dave Bridger [SMTP:dbridger@inlink.com]
[...]
> > Perhaps Section 17.3.4 of the HTML Spec should be clarified.

Perhaps; I haven't managed to double-check the details yet, but...

>         [DJW:]  It is not the job of the HTML spec to define the structure
>         of URLs

In fact, the URI spec just says what characters you can't put in a URI,
and gives a syntax for encoding numbers in URIs -- numbers that
conventionally refer to US-ASCII character code points, though that's
not really observable at the URI spec level.

In other words: some characters that might be used in filenames
(e.g. / on a Mac) aren't allowed, or have reserved meaning,
in URIs. The URI spec encourages
servers to map '/' in server-internal names to %2F. But only
that server is licensed to decode the %2F back to a '/'; no
other party on the net is licensed to take advantage of
that correspondence without further knowledge.
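
To illustrate the point, here is a rough sketch in Python. The
filename and the "/docs/" prefix are made up for the example, and
urllib.parse.quote is just one convenient way to do the escaping; it
is not something the URI spec itself prescribes.

```python
from urllib.parse import quote, unquote

# Hypothetical server-internal name that happens to contain a '/'.
internal_name = "reports/april"

# The server escapes the '/' so it is not mistaken for a path separator.
# safe="" tells quote() to escape '/' as well (by default it leaves it alone).
segment = quote(internal_name, safe="")

# Hypothetical URI path the server hands out.
uri_path = "/docs/" + segment

# A client that splits on '/' sees the escaped name as one opaque segment.
segments = uri_path.split("/")

# Only the server that minted the URI knows that undoing the escape
# recovers its internal name.
recovered = unquote(segment)
```

A client looking at the path sees three segments and has no business
treating the %2F as a separator.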

The HTML spec specifies a convention for server-side
resources referred to by name/value pairs, and a convention
for encoding those name/value pairs as URIs. Clients
that know that they're talking to a server that understands
this convention (because the server sent <form> markup
in a document) can solicit name/value pairs from
the user and use the x-www-form-urlencoded convention
to pass them to the server.

So it is the job of the HTML spec to define this encoding
convention.

Did that make any sense?

Now... let's see if it does so clearly... I wrote the
HTML 2.0 spec, and I was always a little fuzzy on
forms stuff; I mostly just integrated contributions
from others without really grokking; I hope that
situation didn't persist into the HTML 4.0 development,
but let's see...

Well, perhaps this could be clearer, but it does specify
the set of characters that don't get escaped:

	"Space characters are replaced by `+', and
	then reserved characters are escaped as described in [RFC1738],
	section 2.2: Non-alphanumeric characters are replaced by `%HH', a
	percent sign and two hexadecimal digits representing the ASCII code
	of the character. Line breaks are represented as "CR LF" pairs
	(i.e., `%0D%0A')."

	-- 17.13.4 Form content types
http://www.w3.org/TR/1999/REC-html401-19991224/interact/forms.html#h-17.13.4.1

That's clear enough, no?
	0. convert mac/unix/whatever linebreak conventions to internet CRLF
		if necessary
	1. replace all ' ' by +
	2. replace everything but alphanumerics [a-zA-Z0-9] by %HH
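
The three steps above could be sketched in Python roughly as follows.
This is only an illustration, not a reference implementation: the
function names are invented here, the steps are done in a single pass
so that the '+' produced for a space is not itself escaped (while a
literal '+' in the input becomes %2B), and the use of UTF-8 for
non-ASCII characters is an assumption, since the quoted text only
speaks of ASCII codes. Joining names and values with '=' and '&' follows
the rest of section 17.13.4.

```python
import re

def form_urlencode_component(text: str) -> str:
    """Encode one control name or value (hypothetical helper)."""
    # Step 0: normalize any line-break convention to CR LF.
    text = re.sub(r"\r\n|\r|\n", "\r\n", text)
    out = []
    for ch in text:
        if ch == " ":
            # Step 1: a space becomes '+'.
            out.append("+")
        elif ch.isascii() and ch.isalnum():
            # Alphanumerics [a-zA-Z0-9] pass through unchanged.
            out.append(ch)
        else:
            # Step 2: everything else becomes %HH.
            # Assumption: non-ASCII characters are escaped as UTF-8 bytes.
            out.extend("%%%02X" % b for b in ch.encode("utf-8"))
    return "".join(out)

def form_urlencode(pairs) -> str:
    """Join encoded name/value pairs with '=' and '&'."""
    return "&".join(
        form_urlencode_component(name) + "=" + form_urlencode_component(value)
        for name, value in pairs
    )

encoded = form_urlencode([("name", "Dan Connolly"), ("q", "a&b")])
```

Note that this is stricter than what many browsers actually send, since
it escapes characters such as '-' and '.' that are usually left alone.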

-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/

Received on Friday, 7 April 2000 11:43:16 UTC