
[Bug 6742] pre-encoded form values should be restorable as submitted

From: <bugzilla@wiggum.w3.org>
Date: Sun, 28 Jun 2009 08:03:29 +0000
To: public-html@w3.org
Message-Id: <E1MKpMH-000324-PB@wiggum.w3.org>

Nick Levinson <Nick_Levinson@yahoo.com> changed:

           What    |Removed                     |Added
             Status|RESOLVED                    |REOPENED
         Resolution|INVALID                     |

--- Comment #12 from Nick Levinson <Nick_Levinson@yahoo.com>  2009-06-28 08:03:29 ---
Consider a one-way communication. The key parts of a server are the UA and
I/O. UTF-8 is used here for convenience and in case the plan for HTML 5 is to
require UTF-8 everywhere relevant.

Say a form is used by a human to contact another human. Four parties take part:
-- the sending human;
-- the sending UA;
-- the receiving UA; and
-- the receiving human.

A communication as it should work follows:
-- Sending human types: I said "this & that."
-- Sending UA encodes to: I said %22this %26 that.%22
-- Receiving UA will decode from: I said %22this %26 that.%22
-- Receiving human reads: I said "this & that."
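
The intended round trip above can be sketched in Python. This is only an
illustration, not the behavior of any particular UA: it uses the standard
library's urllib.parse.quote and unquote, and passes safe=" " so that spaces
stay literal, matching the simplified notation used in this comment (an actual
form submission would typically encode spaces as well).

```python
from urllib.parse import quote, unquote

# What the sending human types.
typed = 'I said "this & that."'

# Sending UA: percent-encode reserved characters; spaces are kept literal
# only to match the simplified notation in the example above.
encoded = quote(typed, safe=" ")

# Receiving UA: reverse every percent-code.
decoded = unquote(encoded)

print(encoded)           # I said %22this %26 that.%22
print(decoded)           # I said "this & that."
print(decoded == typed)  # True
```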

But suppose the sender, who we'll say is a programmer, types a percent-code
into the original message, while typing quote marks as usual:
-- Sending human types: You inserted %26 on line 7. Like Pat said, "you
shouldn't have."
-- Sending UA encodes to: You inserted %26 on line 7. Like Pat said, %22you
shouldn't have.%22
-- Receiving UA will decode from: You inserted %26 on line 7. Like Pat said,
%22you shouldn't have.%22
-- Receiving UA, without further information, assumes that %26 previously
replaced an ampersand and so replaces it now with an ampersand.
-- Receiving human reads: You inserted & on line 7. Like Pat said, "you
shouldn't have."

Result: The receiving human does not receive the message that was sent, but a
different one. The receiving human could well reply, "I didn't insert &." The
sending human might send a new message, "I didn't say you did. I said you
inserted %26, and you shouldn't have. & would have been better." The receiving
human will see, "I didn't say you did. I said you inserted &, and you shouldn't
have. & would have been better.", and may reply, "What's the difference between
& and &?"
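
The failure described above can be reproduced with a small sketch. The encoder
here is hypothetical: it models a sending UA that passes a typed "%" through
unchanged, as in the scenario in this comment, by listing "%" among the safe
characters of Python's urllib.parse.quote. The safe set and the variable names
are assumptions made for illustration only.

```python
from urllib.parse import quote, unquote

typed = 'You inserted %26 on line 7. Like Pat said, "you shouldn\'t have."'

# Hypothetical sending UA: leaves '%' (plus space, apostrophe and comma)
# untouched, so the typed %26 travels exactly as typed.
sent = quote(typed, safe=" %',")

# Receiving UA: decodes every percent-code it finds.
received = unquote(sent)

print(sent)
print(received)
print(received == typed)  # False: the typed %26 has become an ampersand

# For comparison: if '%' itself is escaped (as %25), a single decoding
# pass restores the typed text.
sent_escaped = quote(typed, safe=" ',")
print(sent_escaped)
print(unquote(sent_escaped) == typed)  # True
```

The comparison at the end shows only what Python's own quote and unquote do
when "%" is not treated as safe; it is not a claim about what any given UA
does.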

This responds to sec., step 6, substep 2, subsubstep 1, and sec. 8.2
of <http://www.w3.org/TR/html5/single-page/>, Working Draft 23 April 2009, as
accessed 6-28-09. Sec. 8.2.4 appears relevant except that I couldn't find a
subsection thereof that specifically governed percent-decoding, or I missed it;
perhaps something should be added on the assumption that UA makers infer its
existence anyway.

UTF-8 is recommended but not mandatory, so a UA that does not use UTF-8 might
not be in violation. See especially section, step 2; also, e.g., ". . .
windows-1252 is recommended as a [fallback] default . . . ." (sec.,
step 7), "User agents must at a minimum support the UTF-8 and Windows-1252
encodings, but may support more." (sec. 2.8), "The [meta element's] charset
attribute specifies the character encoding used by the document. . . . If the
attribute is present in an XML document, its value must be an ASCII
case-insensitive match for the string 'UTF-8' (and the document is therefore
required to use UTF-8 as its encoding)." (sec. 4.2.5), "Authors are encouraged
to use UTF-8. Conformance checkers may advise against authors using legacy
encodings." (sec.), and secs. 2.7.2-2.7.3 & 2.7.6. Thus, UTF-8 is not
required for non-XML documents except as otherwise required.
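
Why the choice of encoding matters for percent-codes can be shown with a short
sketch. It assumes Python's urllib.parse, whose quote and unquote accept an
encoding argument; the sample word is an arbitrary choice for illustration.

```python
from urllib.parse import quote, unquote

text = "caf\u00e9"  # "cafe" with an acute accent on the final letter

# The same character yields different percent-codes per encoding.
as_utf8 = quote(text, encoding="utf-8")
as_cp1252 = quote(text, encoding="windows-1252")
print(as_utf8)    # caf%C3%A9
print(as_cp1252)  # caf%E9

# Decoding with the encoding that was used restores the text.
print(unquote(as_cp1252, encoding="windows-1252") == text)  # True

# Decoding with a different encoding does not: the lone byte 0xE9 is not
# valid UTF-8, so unquote substitutes a replacement character.
print(unquote(as_cp1252, encoding="utf-8") == text)  # False
```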

Correcting a prior error of mine: Of the options of listing and flagging, if
listing is chosen, and if some instances of a single representation are to be
reversed to recover original strings while other instances are to be left as
they are, only the less numerous group would be listed, to save on bandwidth,
as long as a true/false flag indicates whether the list is for reversing or
for leaving as is.
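
The listing option could be sketched as follows. Everything here is
hypothetical and for illustration only: the function names, the use of
0-based ordinal positions to identify each percent-code, and the restriction
to percent-codes that stand for a single ASCII character are assumptions, not
part of any draft.

```python
import re

PERCENT_CODE = re.compile(r"%[0-9A-Fa-f]{2}")


def build_list(reverse_flags: list[bool]) -> tuple[set[int], bool]:
    """Pick the shorter list and a flag saying what the list means.

    reverse_flags[i] is True if the i-th percent-code should be reversed.
    Returns (positions, list_is_for_reversing).
    """
    to_reverse = {i for i, r in enumerate(reverse_flags) if r}
    to_keep = {i for i, r in enumerate(reverse_flags) if not r}
    if len(to_reverse) <= len(to_keep):
        return to_reverse, True
    return to_keep, False


def decode_with_list(encoded: str, listed: set[int],
                     list_is_for_reversing: bool) -> str:
    """Reverse only the percent-codes selected by the list and the flag."""
    position = -1

    def replace(match: re.Match) -> str:
        nonlocal position
        position += 1
        # Reverse when list membership agrees with the meaning of the flag.
        if (position in listed) == list_is_for_reversing:
            return chr(int(match.group(0)[1:], 16))  # ASCII-only assumption
        return match.group(0)  # leave as typed

    return PERCENT_CODE.sub(replace, encoded)


encoded = ("You inserted %26 on line 7. Like Pat said, "
           "%22you shouldn't have.%22")
# Percent-code 0 was typed literally; codes 1 and 2 replaced quote marks.
listed, flag = build_list([False, True, True])
print(listed, flag)  # {0} False
print(decode_with_list(encoded, listed, flag))
```

Here one instance is to be left alone and two are to be reversed, so the list
carries the single position to leave, and the flag is false.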

The use case is not limited to online conversations between programmers. It
also applies to scholarly writing, in which storage and transmission of a
submission have to be highly accurate and paraphrasing of the "we know what was
meant" variety may not be acceptable to content authors. Even programmers who
are expert in languages having little to do with the Web, such as Cobol or
PostScript, might have conversations like the one hypothesized above, so
familiarity with HTML's percent-encoding should not be assumed even for
programmers in general, which broadens the use case.

Thank you.


Received on Sunday, 28 June 2009 08:03:39 UTC
