Forms/CGI urls: '&' in HREF attributes

There's an unfortunate interaction between the
application/x-www-form-urlencoded syntax for form data submission and
SGML attribute value literal syntax. This came up shortly after I
started running the validation service, and I thought we had discussed
the problem, but it seems to be getting worse, not better.

An example of the problem:

Given this document:

===============================================
<!doctype html public "-//IETF//DTD HTML//EN">
<title>testing & in HREF</title>

<p>Here we go:
 <a href="http://foo.org/cgi-bin/do-something.pl?x=a&y=b">link</a>
===============================================

Trying to validate it yields:

===============================================
connolly@ulua ../connolly[1114] html-validate test.html
sgmls: SGML error at test.html, line 5 at "y":
       No declaration for entity "y"; reference ignored
===============================================

Section 7.9.3 "Attribute Value Specification" of the SGML standard
says:

	An attribute value literal is interpreted as an attribute value
	by replacing references within it, ignoring Ee and RS, and replacing
	an RE or SEPCHAR with a SPACE.

So the attribute value literal:

	"http://foo.org/cgi-bin/do-something.pl?x=a&y=b"

has an error in it: &y references an undeclared entity.
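
To make the parser's complaint concrete, here is a rough sketch in
Python (my choice for illustration; sgmls does nothing of the sort
internally) that scans an attribute value literal for '&' followed by
a name start character, which is where an entity reference gets
recognized. The function name undeclared_refs and the tiny DECLARED
set are assumptions for the example; the real HTML DTD declares more
entities than these.

```python
import re

# Assumption: a deliberately small stand-in for the entities a DTD declares.
DECLARED = {"amp", "lt", "gt", "quot"}

def undeclared_refs(value):
    """List entity names referenced in value that are not in DECLARED."""
    # '&' followed by a letter starts a general entity reference;
    # '&#' (a character reference) does not match this pattern.
    names = re.findall(r"&([A-Za-z][A-Za-z0-9.-]*)", value)
    return [name for name in names if name not in DECLARED]

print(undeclared_refs("http://foo.org/cgi-bin/do-something.pl?x=a&y=b"))
```

Run on the URL above, this reports the same name the validator
complained about, "y".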


This should definitely go in as a NOTE: or something in the HTML spec,
and perhaps it's worth mentioning in the URL spec (though that's
stretching it).


There are a couple ways to represent the string:

	http://foo.org/cgi-bin/do-something.pl?x=a&y=b

as an attribute value literal:

	"http://foo.org/cgi-bin/do-something.pl?x=a&amp;y=b"
	"http://foo.org/cgi-bin/do-something.pl?x=a&#38;y=b"

but neither of those is interpreted correctly by existing browsers.
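
For folks generating their HTML from a program, the first form can at
least be produced mechanically. A minimal sketch in Python, assuming
the standard library's html.escape is available; the helper name
href_attribute is made up for the example:

```python
from html import escape

def href_attribute(url):
    """Return an HREF attribute specification with '&' written as '&amp;'."""
    # quote=True also rewrites '"' so the value cannot end the literal early.
    return 'HREF="%s"' % escape(url, quote=True)

print(href_attribute("http://foo.org/cgi-bin/do-something.pl?x=a&y=b"))
# prints: HREF="http://foo.org/cgi-bin/do-something.pl?x=a&amp;y=b"
```

This only helps on the authoring side; whether a given browser then
follows the link correctly is a separate matter.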

In the interest of interoperability, I'd like to move toward using ';'
rather than (or in addition to) '&' to separate form name/value pairs.

That way, the URL for this query can be:

	http://foo.org/cgi-bin/do-something.pl?x=a;y=b

You can put this in an HTML document by writing:

	HREF="http://foo.org/cgi-bin/do-something.pl?x=a;y=b"

A quick check through the Mosaic 2.4 source code shows that a ';'
character in an input field _will_ be %xx-ified, so this doesn't
introduce any ambiguity.
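
Here is a small sketch of what the encoding side could look like, to
show why there is no ambiguity: every name and value is %xx-escaped
before the pairs are joined, so a literal ';' typed into a field comes
out as %3B and can't be mistaken for a separator. This is Python using
urllib.parse.quote_plus, not the Mosaic code; the function name
encode_form and its sep parameter are hypothetical.

```python
from urllib.parse import quote_plus

def encode_form(pairs, sep=";"):
    """Join name=value pairs with sep, %xx-escaping each name and value."""
    # quote_plus escapes both ';' and '&', so neither can leak into the
    # output except as the separator we add ourselves.
    return sep.join(quote_plus(name) + "=" + quote_plus(value)
                    for name, value in pairs)

print(encode_form([("x", "a"), ("y", "b")]))        # prints: x=a;y=b
print(encode_form([("x", "a;b"), ("y", "c&d")]))    # prints: x=a%3Bb;y=c%26d
```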

The way to start the transition is to enhance CGI scripts to support
separating form values by ';' as well as '&'. Then folks that want to
validate their HTML can change '&' to ';' in their HREF
attributes.
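
To show how small the change on the CGI side is, here is a sketch of a
query-string parser that accepts either separator. It is written in
Python for illustration rather than taken from any existing CGI
library, and the function name parse_query is an assumption.

```python
import re
from urllib.parse import unquote_plus

def parse_query(query):
    """Split a query string on '&' or ';' and decode each name=value pair."""
    pairs = []
    for field in re.split("[&;]", query):
        if not field:
            continue  # skip empty fields, e.g. from a trailing separator
        name, _, value = field.partition("=")
        # Decode after splitting, so an escaped %3B or %26 stays in the value.
        pairs.append((unquote_plus(name), unquote_plus(value)))
    return pairs

print(parse_query("x=a&y=b"))   # prints: [('x', 'a'), ('y', 'b')]
print(parse_query("x=a;y=b"))   # prints: [('x', 'a'), ('y', 'b')]
```

The important detail is the order: split on the raw separators first,
then undo the %xx escaping on each piece.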

But folks will continue to copy-and-paste these form query URLs into
their HTML without quoting the '&' chars. So eventually, browsers
should start using ';' in the form encoding process in the first place
(as well as supporting &#38; inside attribute values!), and then the
issue will go away.


There's something of a chicken-and-egg problem here: who will support
the first browser to use ';' rather than '&' to encode form stuff?
That won't happen until the vast majority of CGI scripts have been
enhanced to support it. And that might not happen until folks
that want to validate their HTML start complaining. But it's a really
cheap fix on the CGI side, no?


I keep seeing more and more use of '&' to separate stuff in URLs, and
while this is really just a bug in the attribute value parsing in the
browsers, it can be avoided by using ';' instead (or in addition).

Dan

Received on Thursday, 9 February 1995 14:06:24 UTC