- From: Ian Hickson <ian@hixie.ch>
- Date: Mon, 22 Sep 2003 14:03:24 +0000 (UTC)
- To: Jungshik Shin <jshin@i18nl10n.com>, John Cowan <cowan@mercury.ccil.org>, Bjoern Hoehrmann <derhoermi@gmx.net>, Martin Duerst <duerst@w3.org>
- Cc: "www-international@w3.org" <www-international@w3.org>, Francois Yergeau <FYergeau@alis.com>, "kuro@sonic.net" <kuro@sonic.net>, Paul Deuter <PaulD@plumtree.com>
On Fri, 19 Sep 2003, Jungshik Shin wrote:
> On Fri, 19 Sep 2003, Ian Hickson wrote:
>> will say). Personally I prefer to replace out-of-set characters with "?".
>> Some UAs, namely Mozilla (in all such cases) and IE (in a more limited set
>> of cases) currently replace unknown characters with the string "&#", the
>> decimal representation of the character's Unicode code point, and ";".
>> Now, this is not really wise, as has already been discussed in this
>> thread,
>
> I agree that it's not wise, but some server-side programs have sorta
> 'relied' on that behavior making things complicated ...

Indeed.

>> character depending on the availability of characters in the
>> submission character set.
>
> Were you alluding to a possible transliteration or just a different
> question-mark-like character?

I was just thinking question marks, but you're right, transliteration
would also be sensible.

> Needless to say, it'd have been still better if it had had a built-in
> mechanism for character encoding specification from the very beginning
> (even for GET).

That would be nice. Still, it wouldn't cope with this case, where the
server has stated that only one character set is acceptable.

On Fri, 19 Sep 2003, John Cowan wrote:
>>
>> If the submission is not cancelled, the user agent MUST replace
>> each character that is not in the submission character set with a
>> single replacement character, either U+FFFD, "?", or some other
>> character depending on the availability of characters in the
>> submission character set.
>
> How does it enhance interoperability to insist on replacing all the
> untransmissible characters with a single character, and not prescribe
> the single character?

I'm more concerned with keeping the characters from being turned into
something that servers assume they can then search for and turn into
Unicode codepoints, like they do with &#...;, since that otherwise means
the user can no longer enter those characters and have them treated
literally. This is especially important, e.g., for comment forms on
technical forums.

> As written, it would be conformant to change "die schöne Müllerin" (in
> a US-ASCII-encoded form) to "die schXne MXllerin", but changing it to
> "die schoene Muellerin" would be non-conformant. That makes no sense
> to me.

I agree.

> Furthermore, mentioning U+FFFD in this connection is the merest futility.
> If U+FFFD is transmissible, any Unicode character will be.

The character entered might not be in Unicode.

On Fri, 19 Sep 2003, Bjoern Hoehrmann wrote:
>
> Note that GET submissions are limited to US-ASCII; by your proposal it
> would be impossible to search for "Björn" on Google while it currently
> is just undefined what happens if I try.

Then we should define that too. IMHO the submission character set should
be chosen by the UA from the list of acceptable character sets,
defaulting to the character set of the document in the absence of other
hints, and if the '_charset_' input is going to be sent, it should be set
to the value that was used.

New proposal (aimed at HTML4.01 section 17.13.3 as an erratum):

   If the form data set contains characters that are outside the
   acceptable submission character sets, the user agent SHOULD inform
   the user that his submission will be changed, for example using a
   dialog in the form:

    ____________________________________________________
   || Warning |||||||||||||||||||||||||||||||||||||||||||
   |                                                    |
   | This form cannot handle some of the characters you |
   | have entered. The data will be sent as "D?rst".    |
   |                                                    |
   |       (( Send anyway ))    ( Return to form )      |
   `----------------------------------------------------'

   If the submission is not cancelled, the user agent MUST replace each
   character that is not in the submission character set with one or
   more replacement characters. For each such missing character, UAs
   must either transliterate the character to a human-recognisable
   representation (for example transliterating U+263A to the
   three-character string ":-)" in US-ASCII, or U+2126 to the byte 0xD9
   in ISO-8859-7), or, for characters where a dedicated transliteration
   is not known to the UA, replace the character with either U+FFFD,
   "?", or some other single character representing the same semantic
   as U+FFFD.

   Note that a string containing the codepoint's value itself (for
   example the six-character string "U+263A") is not considered to be
   human readable and must not be used as a transliteration. (This is to
   discourage servers from attempting to mechanically convert such
   codepoints back into Unicode characters, as there is no way to
   distinguish such characters from identical literal strings entered by
   the user.)

In addition, I propose changing the semantics of GET so that if the form
data set contains characters outside US-ASCII, the UA must encode the
form data using a character set chosen from the list of acceptable
encodings, defaulting to the document character set, or, in the absence
of other information, UTF-8. This would require changes to a number of
places in HTML4 chapter 17, namely all but one of the places that mention
the term "ASCII".

Finally, I propose that section 17.13.2 gain a new list item, namely:

   * Controls of type 'hidden' with the name '_charset_' whose value is
     the empty string are always successful, and have a value equal to
     the name of the character set used for submission.

Comments?

-- 
Ian Hickson                                      )\._.,--....,'``.    fL
U+1047E                                         /,   _.. \   _\  ;`._ ,.
http://index.hixie.ch/                         `._.-(,_..'--(,_..'`-.;.'
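
For illustration only, the replacement rule proposed above could be sketched roughly as follows in Python. This is not part of the proposed spec text. The table TRANSLITERATIONS and the function names can_encode and prepare_for_submission are hypothetical, and a real UA would need a far larger transliteration table. The sketch prefers a known transliteration, then U+FFFD, then "?", and never emits the codepoint value itself.

```python
# Hypothetical transliteration table (an assumption for this sketch).
# Each entry lists candidate replacements in order of preference.
TRANSLITERATIONS = {
    "\u263A": [":-)"],      # WHITE SMILING FACE -> ":-)"
    "\u2126": ["\u03A9"],   # OHM SIGN -> GREEK CAPITAL LETTER OMEGA
}


def can_encode(text, charset):
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def prepare_for_submission(text, charset):
    """Encode text in charset, replacing characters the charset lacks."""
    out = []
    for ch in text:
        if can_encode(ch, charset):
            out.append(ch)
            continue
        # Try each known transliteration that the charset can carry.
        for candidate in TRANSLITERATIONS.get(ch, ()):
            if can_encode(candidate, charset):
                out.append(candidate)
                break
        else:
            # No usable transliteration: U+FFFD if available, else "?".
            out.append("\uFFFD" if can_encode("\uFFFD", charset) else "?")
    return "".join(out).encode(charset)


if __name__ == "__main__":
    print(prepare_for_submission("D\u00fcrst", "us-ascii"))
    print(prepare_for_submission("\u263A", "us-ascii"))
    print(prepare_for_submission("\u2126", "iso-8859-7"))
```

A UA following the proposal would show the warning dialog before sending whenever the result differs from the original input, and would put the name of the chosen character set into any empty '_charset_' hidden control.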
Received on Monday, 22 September 2003 10:03:31 UTC