- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Sun, 22 Aug 2004 17:32:50 +0300
On Aug 17, 2004, at 16:37, Ian Hickson wrote:

> On Tue, 13 Jul 2004, Henri Sivonen wrote:
>>> 2.5. Extensions to file upload controls
>>
>>> * UAs should use the list of acceptable types in constructing a
>>> filter for a file picker, if one is provided to the user.
>>
>> That feature is not likely to be reliably implementable considering
>> that real-world systems do not have comprehensive ways of mapping
>> between file system type data and MIME types.
>
> I am told modern systems do, now.

Which modern systems?

>>> For text input controls it specifies the maximum length of the
>>> input, in terms of numbers of characters. For details on counting
>>> string lengths, see [CHARMOD].
>>
>> Should UAs use NFC for submissions?
>
> I don't know, should they?

I am inclined to think that NFC SHOULD be used in order to accommodate
transitional systems that treat Unicode as "wide ASCII". For example, a
server-side system written in PHP4 may not have Unicode normalization
facilities available to it and might send the data to Mozilla later. If
a UA had posted content in NFD to the server and the server naïvely
sent the content to the OS X version of Mozilla, text in common
European languages would break in an ugly way.

I would hesitate to make NFC a MUST, though, because I don't know
whether small devices can hold the data that is needed in order to
carry out Unicode normalization. Requiring desktop apps to normalize
shouldn't be a big deal. At least OS X and Gnome provide normalization
facilities, and ICU can be thrown in as a cross-platform solution.

In any case, robust server-side systems should not trust that the input
is in a particular normalization form and should normalize the data
themselves. The point is accommodating systems that are not robust.

>>> To prevent an attribute from being processed in this way, put a
>>> non-breaking zero-width space character () at the start of the
>>> attribute.
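To return to the normalization point above, here is a minimal sketch of
the breakage, with Python standing in for the server-side language (the
word "café" is only an illustration, not taken from the spec):

```python
import unicodedata

composed = "caf\u00e9"      # "café" in NFC: é is the single code point U+00E9
decomposed = "cafe\u0301"   # "café" in NFD: "e" + U+0301 COMBINING ACUTE ACCENT

# A "wide ASCII" system compares code points blindly, so the two
# renderings of the same word do not match:
print(composed == decomposed)   # False

# Normalizing incoming form data to NFC restores the expected behavior:
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
```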
>>
>> Isn't the use of that char as anything but the BOM deprecated or at
>> least considered harmful?
>
> Arguably, it _is_ a BOM here.
>
> I'm not overly fond of this either, but it's the only solution I could
> find that was relatively harmless (the BOM can always be dropped at the
> start of strings)

Exactly. Which is why tools used for generating the page might drop it
on the server! Actually, I am distributing one such tool myself. Is the
tool broken?
http://iki.fi/hsivonen/php-utf8/

> and yet did the job. Better suggestions are welcome though.

My immediate thought is ZWNJ, but I'm not sure whether using it is a
good idea.

>>> Note that a string containing the codepoint's value itself (for
>>> example, the six-character string "U+263A" or the seven-character
>>> string "☺") is not considered to be human readable and must not be
>>> used as a transliteration.
>>
>> Do you expect UAs that already do this to change their behavior with
>> the legacy submission types?
>
> We can hope.

FWIW, there may be CMS input form handlers that expect the prohibited
behavior. I have been involved in developing one myself. (Not that I
recommend relying on such things. Obviously, UTF-8 is the way to go.)

>>> which has a root element named "submission", with no prefix,
>>> defining a default namespace
>>> uuid:d10e4fd6-2c01-49e8-8f9d-0ab964387e32.
>>
>> I think that is an inappropriate attempt to micromanage the syntactic
>> details that are in the realm of a lower-level spec. I think the
>> submission format should either allow all the syntactic sugar that
>> comes with Namespaces in XML or be layered directly on top of XML 1.0
>> without namespace support.
>
> The reason it is micromanaged is to make it possible to use either a
> pure XML 1.0 parser _or_ an XML 1.0 with namespaces parser on the
> server side without getting into any complications.

I was able to guess that that was the rationale behind the requirement.
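That rationale can be checked against stock parsers. A quick sketch
(Python here as a stand-in for the server side; only the root element
name and the namespace come from the quoted spec text, the child
element is made up):

```python
import xml.etree.ElementTree as ET
import xml.parsers.expat

doc = ('<submission xmlns="uuid:d10e4fd6-2c01-49e8-8f9d-0ab964387e32">'
       '<field>value</field></submission>')

# A namespace-aware parser reports the expanded name:
root = ET.fromstring(doc)
print(root.tag)   # {uuid:d10e4fd6-2c01-49e8-8f9d-0ab964387e32}submission

# A namespace-unaware XML 1.0 parser sees the literal tag name and
# treats xmlns as an ordinary attribute:
seen = []
parser = xml.parsers.expat.ParserCreate()   # no namespace processing
parser.StartElementHandler = lambda name, attrs: seen.append(name)
parser.Parse(doc, True)
print(seen[0])    # submission
```

Because the spec forbids a prefix, both kinds of parser see a
predictable element name, which is presumably the point.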
But why is the ability to use a namespace-unaware XML processor a
requirement? The only reason I can come up with is that PHP4 is borked
by default but widely used. Processing namespaced XML with tools that
don't support namespaces is clueless and just plain wrong.

If tools that don't support namespaces are to be accommodated, wouldn't
the natural way be to spec that the elements are not in a namespace and
the namespace processing layer is not used? That way you wouldn't
endorse behavior that is clueless and just plain wrong.

I can see three problems with namespacelessness:

1) The current best practice for dispatching on the type of an XML
document is dispatching on the namespace. If there were no namespace,
one would have to fall back on dispatching on the content type. This is
not a real problem with this particular vocabulary, because this
vocabulary has a distinct content type from the start.

2) You couldn't mix the vocabulary with other vocabularies using
namespaces. This is a theoretical problem but probably not a real one,
because the vocabulary is limited to a specific case of client-server
interaction. Besides, the way you limit the use of namespaces in the
current spec language would also preclude creative augmentations to the
submission vocabulary.

3) You intend to submit the spec to a consortium that shall not be
named, and you know the powers that be in the consortium that shall not
be named would veto any spec that builds directly on top of XML 1.0
without the namespace layer in between.

So of the three problems, only the last one is significant, and it is a
political problem rather than a technical one. Sadly, political
problems may be more difficult to overcome than technical problems.

>>> but must include a BOM.
>>
>> I think that is not a legitimate requirement when UTF-8 is used.
>
> Why not?

It is a requirement that applies to the XML serialization, but the
requirement is not present in the XML spec.
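To make that concrete, a sketch with one off-the-shelf serializer
(Python's ElementTree is used only as an example; other conforming
serializers behave similarly):

```python
import io
import xml.etree.ElementTree as ET

# Serialize a trivial document as UTF-8 with an XML declaration:
tree = ET.ElementTree(ET.Element("submission"))
buf = io.BytesIO()
tree.write(buf, encoding="utf-8", xml_declaration=True)
out = buf.getvalue()

# A conforming serializer is free to omit the U+FEFF signature,
# and this one does:
print(out.startswith(b"\xef\xbb\xbf"))   # False
print(out.decode("utf-8"))
```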
The requirement would mean that you could not use any arbitrary but
conforming XML serializer. The use of the BOM as a UTF-8 signature is a
Microsoftism that was only allowed in XML 1.0 second edition, because
fighting Microsoft text editors would have been futile. Still, if you
pick a non-Microsoft XML serializer off the shelf, chances are it does
not emit a BOM in UTF-8 mode. Is there a good reason to limit the use
of arbitrary but conforming off-the-shelf XML serializers?

>>> UAs may use either CDATA blocks, entities, or both in escaping the
>>> contents of attributes and elements, as appropriate.
>>
>> In order not to imply that this spec could restrict the ways
>> characters are escaped, that sentence should be a note rather than
>> part of the normative prose. (Of course, only the pre-defined
>> entities are available. Then there are NCRs.)
>
> This spec _could_ restrict the ways characters are escaped. It needs to
> not be a note so that the "may" has normative value. No?

It could restrict the escaping in the same sense the HTTP spec could
restrict how you choose TCP sequence numbers.

In general, please see section 4.3 of RFC 3470.

-- 
Henri Sivonen
hsivonen at iki.fi
http://iki.fi/hsivonen/
Received on Sunday, 22 August 2004 07:32:50 UTC