Re: [VE][338] PHP sessions and ampersands in URLs from Jukka K. Korpela on 2005-04-20 (www-validator@w3.org from April 2005)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Wed, 20 Apr 2005 07:58:26 +0300 (EEST)
To: www-validator@w3.org
Message-ID: <Pine.GSO.4.58.0504200735490.23021@korppi.cs.tut.fi>
On Tue, 19 Apr 2005, David Dorward wrote:

> I suggest the following be appended to the first paragraph of the
> message:
>
>   Session handling code in PHP is a common perpetrator of this error
>   as explained in <a
>   href="http://dorward.me.uk/www/php-sessions/ampersand/">Ampersands,
>   PHP Sessions and Valid HTML</a>, a document which also proposes a
>   number of solutions.

As a general idea, it seems quite useful to add references to documents
that discuss such technical problems.

> I'd also welcome feedback on the document itself.

In the first paragraph, you say:

"Such characters cannot be simply typed into a document if you wish them
to display - how could the user agent tell the difference between <
(meaning start a new tag) and < (meaning a literal less than character)."

The first part is incorrect for HTML: there are many situations where
I can type "<" and have it displayed. The rhetorical question fails for the
same reason: it could be a genuine question in an exam on SGML, and the
correct answer is _not_ "you can't, ever". I would suggest something like
the following:

"Such characters cannot always be simply typed into a document if you wish
them to display. For example, if you would like to show the mathematical
expression b<a, you cannot type it as such, since browsers would take
the <a part as starting a tag."

Along the same lines, the following paragraph, too, says too much:

"Ampersand characters used as argument separators pose no problem in plain
old URLs, however in URLs encoded in HTML they still mean start of
character reference."

I would suggest saying "they might still start" instead of "they still
mean start". In HTML, the ampersand may appear as such unless followed by
a name character or by "#", so in a context like "a & b" the ampersand
does not start a character reference.
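
To make that rule concrete, here is a rough Python sketch (the language and
the function name risky_ampersands are just illustrative choices, not
anything from the document under discussion). It flags only those
ampersands that are followed by a letter or by "#". Reading "name
character" as "ASCII letter" is a simplifying assumption.

```python
import re

# Assumption: "name character" is read here as an ASCII letter. An
# ampersand followed by anything else (a space, "=", end of text) is
# treated as plain data.
RISKY_AMPERSAND = re.compile(r"&(?=[A-Za-z#])")


def risky_ampersands(text):
    """Return the offsets of ampersands that might start a reference."""
    return [match.start() for match in RISKY_AMPERSAND.finditer(text)]


print(risky_ampersands("a & b"))                  # nothing flagged
print(risky_ampersands("page.php?id=42&copy=1"))  # the "&" before "copy"
```

This is not a parser, only a way to see which ampersands would need to be
written as &amp; under the rule as stated.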

There's a common objection to "escaping" ampersands: they usually appear
in URLs resulting from a form submission with method="get" or a similar
operation, and in that case they contain things like id=42&copy=1
so that although the "&" is followed by a name character, the name
is not terminated by a semicolon. Further confusion is caused by the XML
rule that makes the semicolon mandatory. Maybe you could find a way to
address this issue in a manner that does not confuse people too much.
The point is that e.g. id=42&copy=1 is treated as id=42©=1 (with &copy
replaced by the copyright sign) by most browsers, and this is the correct
processing by HTML rules. And you might mention that there is a large
number of predefined entity names in HTML and nobody wants to remember
them by heart.
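
The effect can be reproduced outside a browser. The sketch below uses
Python's standard html module as a stand-in for a user agent. That is an
assumption, since html.unescape follows the HTML5 rules for ordinary text
and is not what any particular browser runs. The URL is a made-up example.

```python
import html

# Hypothetical URL of the kind produced by a form with method="get".
url = "page.php?id=42&copy=1"

# Decoding the raw text: "&copy" is taken as an entity reference even
# though no semicolon follows it.
decoded_raw = html.unescape(url)

# Writing the ampersand as &amp; in the markup avoids the problem:
# decoding the escaped form gives back the intended URL.
escaped = html.escape(url)
round_trip = html.unescape(escaped)

print(decoded_raw)
print(escaped)
print(round_trip == url)
```

The same escaping step protects against every predefined entity name, so
the author does not need to know which names happen to exist.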

At the end of the document you say
"You appear to be using Internet Explorer or a browser based on its
underlying engine. Please read my browser support page."
I know this is not relevant to the topic at hand, but I still comment on
it, since _you_ should know that too. It is foolish to ask a user to read
your browser support page, even if you ask politely, since it is quite
irrelevant. How does it benefit me, or someone who tried to validate a
page, checked the error message, and consulted your document, to learn -
after reading the document - that there are some completely unspecified
"Many rendering glitches" on the browser I use?

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
Received on Wednesday, 20 April 2005 04:58:29 UTC