Re: form submission and chars. outside the reper. of ... from Ian Hickson on 2003-09-22 (www-international@w3.org from July to September 2003)

From: Ian Hickson <ian@hixie.ch>
Date: Mon, 22 Sep 2003 14:03:24 +0000 (UTC)
To: Jungshik Shin <jshin@i18nl10n.com>, John Cowan <cowan@mercury.ccil.org>, Bjoern Hoehrmann <derhoermi@gmx.net>, Martin Duerst <duerst@w3.org>
Cc: "www-international@w3.org" <www-international@w3.org>, Francois Yergeau <FYergeau@alis.com>, "kuro@sonic.net" <kuro@sonic.net>, Paul Deuter <PaulD@plumtree.com>
Message-ID: <Pine.LNX.4.58.0309221252160.15992@dhalsim.dreamhost.com>
On Fri, 19 Sep 2003, Jungshik Shin wrote:
> On Fri, 19 Sep 2003, Ian Hickson wrote:
>> will say). Personally I prefer to replace out-of-set characters with "?".
>> Some UAs, namely Mozilla (in all such cases) and IE (in a more limited set
>> of cases) currently replace unknown characters with the string "&#", the
>> decimal representation of the character's Unicode code point, and ";".
>> Now, this is not really wise, as has already been discussed in this
>> thread,
>
> I agree that it's not wise, but some server-side programs have sorta
> 'relied' on that behavior making things complicated ...

Indeed.


>>    character depending on the availability of characters in the
>>    submission character set.
>
> Were you alluding to a possible transliteration or just a different
> question-mark-like character?

I was just thinking question marks, but you're right, transliteration
would also be sensible.


> Needless to say, it'd have been still better if it had had a built-in
> mechanism for character encoding specification from the very beginning
> (even for GET).

That would be nice. Still, it wouldn't cope with this case, where the
server has stated that only one character set is acceptable.


On Fri, 19 Sep 2003, John Cowan wrote:
>>
>>    If the submission is not cancelled, the user agent MUST replace
>>    each character that is not in the submission character set with a
>>    single replacement character, either U+FFFD, "?", or some other
>>    character depending on the availability of characters in the
>>    submission character set.
>
> How does it enhance interoperability to insist on replacing all the
> untransmissible characters with a single character, and not prescribe
> the single character?

I'm more concerned with keeping UAs from turning the characters into
something that servers assume they can search for and convert back into
Unicode codepoints, as they do with &#...;, since otherwise the user can
no longer enter those strings and have them treated literally. This is
especially important, e.g., for comment forms on technical forums.
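
To illustrate the ambiguity, here is a minimal sketch in Python. It
assumes Python's built-in "xmlcharrefreplace" error handler as a stand-in
for the &#NNNN; replacement described above; it is not meant to model any
particular UA.

```python
# A user who types the seven literal characters "&#9786;" and a user who
# types U+263A (WHITE SMILING FACE) end up sending identical bytes when
# out-of-set characters are replaced with numeric character references.
typed_literally = "&#9786;"
typed_smiley = "\u263A"

sent_literal = typed_literally.encode("ascii")
sent_replaced = typed_smiley.encode("ascii", errors="xmlcharrefreplace")

# The server sees the same bytes and cannot tell which one was entered.
print(sent_literal, sent_replaced, sent_literal == sent_replaced)
```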


> As written, it would be conformant to change "die schöne Müllerin" (in
> a US-ASCII-encoded form) to "die schXne MXllerin", but changing it to
> "die schoene Muellerin" would be non-conformant.  That makes no sense
> to me.

I agree.


> Furthermore, mentioning U+FFFD in this connection is the merest futility.
> If U+FFFD is transmissible, any Unicode character will be.

The character entered might not be in Unicode.


On Fri, 19 Sep 2003, Bjoern Hoehrmann wrote:
>
> Note that GET submissions are limited to US-ASCII; by your proposal it
> would be impossible to search for "Björn" on Google while it currently
> is just undefined what happens if I try.

Then we should define that too. IMHO the character set should be a
character set chosen by the UA, from the list of acceptable character
sets, defaulting to the character set of the document in the absence of
other hints, and if the '_charset_' input is going to be sent, it should
be set to the value that was used.


New proposal (aimed at HTML4.01 section 17.13.3 as an erratum):

   If the form data set contains characters that are outside the
   acceptable submission character sets, the user agent SHOULD inform
   the user that his submission will be changed, for example using a
   dialog in the form:
      ____________________________________________________
     || Warning |||||||||||||||||||||||||||||||||||||||||||
     |                                                    |
     | This form cannot handle some of the characters you |
     | have entered. The data will be sent as "D?rst".    |
     |                                                    |
     |              (( Send anyway ))  ( Return to form ) |
     `----------------------------------------------------'

   If the submission is not cancelled, the user agent MUST replace
   each character that is not in the submission character set with one
   or more replacement characters.

   For each such missing character, UAs must either transliterate the
   character to a human-recognisable representation (for example
   transliterating U+263A to the three-character string ":-)" in
   US-ASCII, or U+2126 to the byte 0xD9 in ISO-8859-7), or, for
   characters where a dedicated transliteration is not known to the
   UA, replace the character with either U+FFFD, "?", or some other
   single character representing the same semantic as U+FFFD.

   Note that a string containing the codepoint's value itself (for
   example the six-character string "U+263A") is not considered to be
   human readable and must not be used as a transliteration. (This is
   to discourage servers from attempting to mechanically convert such
   codepoints back into Unicode characters, as there is no way to
   distinguish such characters from identical literal strings entered
   by the user.)
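
The replacement rule above could be sketched roughly as follows, in
Python. The table TRANSLITERATIONS and the function names are
hypothetical, invented for this illustration; a real UA would need a far
larger table, and the choice of "?" as the last-resort character is an
assumption.

```python
# Hypothetical transliteration table: each out-of-set character maps to
# candidate replacements, tried in order until one fits the charset.
TRANSLITERATIONS = {
    "\u263A": [":-)"],      # WHITE SMILING FACE -> ASCII smiley
    "\u2126": ["\u03A9"],   # OHM SIGN -> GREEK CAPITAL LETTER OMEGA
}


def encodable(text, charset):
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def replace_for_submission(text, charset):
    """Encode text, transliterating or replacing out-of-set characters."""
    out = []
    for ch in text:
        if encodable(ch, charset):
            out.append(ch)
            continue
        for candidate in TRANSLITERATIONS.get(ch, ()):
            if encodable(candidate, charset):
                out.append(candidate)
                break
        else:
            # No usable transliteration: U+FFFD if available, else "?".
            out.append("\uFFFD" if encodable("\uFFFD", charset) else "?")
    return "".join(out).encode(charset)


print(replace_for_submission("D\u00FCrst", "ascii"))
print(replace_for_submission("\u263A", "ascii"))
print(replace_for_submission("\u2126", "iso8859_7"))
```

Since the table has no entry for U+00FC, the first call falls back to
"?", as in the dialog above. Note that the output never contains the
codepoint's number.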

In addition, I propose changing the semantics of GET so that if the
form data set contains characters outside US-ASCII, the UA must encode
the form data using a character set chosen from the list of acceptable
encodings, defaulting to the document character set, or, in the
absence of other information, UTF-8. This would require changes to a
number of places in HTML4 chapter 17, namely all but one of the places
that mention the term "ASCII".
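
A rough sketch of that selection for GET, in Python. The function names
are hypothetical, and picking the first usable entry of accept-charset is
only one possible policy, since the proposal leaves the choice to the UA.

```python
import codecs
import re
from urllib.parse import urlencode


def choose_submission_charset(accept_charset=None, document_charset=None):
    """Pick a charset: accept-charset first, then document, then UTF-8."""
    candidates = re.split(r"[,\s]+", accept_charset or "")
    candidates.append(document_charset or "")
    for name in candidates:
        if not name:
            continue
        try:
            codecs.lookup(name)  # raises LookupError for unknown names
            return name
        except LookupError:
            continue
    return "utf-8"


def build_get_query(fields, accept_charset=None, document_charset=None):
    """Return the charset used and the percent-encoded query string."""
    charset = choose_submission_charset(accept_charset, document_charset)
    # errors="replace" substitutes "?" for characters outside the charset.
    query = urlencode(fields, encoding=charset, errors="replace")
    return charset, query


print(build_get_query({"q": "Bj\u00F6rn"}, document_charset="iso-8859-1"))
print(build_get_query({"q": "Bj\u00F6rn"}))
```

The first call percent-encodes the ISO-8859-1 byte for U+00F6, the second
falls back to UTF-8 because no other information is available.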

Finally, I propose that section 17.13.2 gain a new list item, namely:

   * Controls of type 'hidden' with the name '_charset_' whose value
     is the empty string are always successful, and have a value equal
     to the name of the character set used for submission.
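
As a small illustration of the '_charset_' item, in Python: the control
representation (a dict with "type", "name" and "value" keys) and the
function name are assumptions made for this sketch only, and the rest of
the successful-controls rules are not modelled.

```python
def apply_charset_control(controls, submission_charset):
    """Build (name, value) pairs, filling in empty '_charset_' controls."""
    pairs = []
    for control in controls:
        value = control["value"]
        if (control["type"] == "hidden"
                and control["name"] == "_charset_"
                and value == ""):
            # Report the charset actually used for this submission.
            value = submission_charset
        pairs.append((control["name"], value))
    return pairs


controls = [
    {"type": "text", "name": "q", "value": "test"},
    {"type": "hidden", "name": "_charset_", "value": ""},
]
print(apply_charset_control(controls, "iso-8859-1"))
```

A '_charset_' control that already carries a non-empty value is left
alone, matching the "whose value is the empty string" condition.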

Comments?

-- 
Ian Hickson                                      )\._.,--....,'``.    fL
U+1047E                                         /,   _.. \   _\  ;`._ ,.
http://index.hixie.ch/                         `._.-(,_..'--(,_..'`-.;.'
Received on Monday, 22 September 2003 10:03:31 UTC