- From: Ian Hickson <ian@hixie.ch>
- Date: Fri, 27 Aug 2004 09:25:46 +0000 (UTC)
On Sun, 22 Aug 2004, Henri Sivonen wrote:
> > > > 2.5. Extensions to file upload controls
> > > >
> > > > * UAs should use the list of acceptable types in constructing a
> > > >   filter for a file picker, if one is provided to the user.
> > >
> > > That feature is not likely to be reliably implementable considering
> > > that real-world systems do not have comprehensive ways of mapping
> > > between file system type data and MIME types.
> >
> > I am told modern systems do, now.
>
> Which modern systems?

Windows, Mac, Gnome, etc.

> > > > For text input controls it specifies the maximum length of the
> > > > input, in terms of numbers of characters. For details on counting
> > > > string lengths, see [CHARMOD].
> > >
> > > Should UAs use NFC for submissions?
> >
> > I don't know, should they?
>
> I am inclined to think that NFC SHOULD be used in order to accommodate
> transitional systems that treat Unicode as "wide ASCII". For example, a
> server-side system written in PHP4 may not have Unicode normalization
> facilities available to it and might send the data to Mozilla later. If
> a UA had posted content in NFD to the server and the server naïvely
> sent the content to the OS X version of Mozilla, text in common
> European languages would break in an ugly way.
>
> I would hesitate to make NFC a MUST, though, because I don't know
> whether small devices can hold the data that is needed in order to
> carry out Unicode normalization. Requiring desktop apps to normalize
> shouldn't be a big deal. At least OS X and Gnome provide normalization
> facilities and ICU can be thrown in as a cross-platform solution.
>
> In any case, robust server-side systems should not trust that the input
> is in a particular normalization form and should normalize the data
> themselves. The point is accommodating systems that are not robust.

Ok, NFC and SHOULD it is.
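
For illustration only, here is a minimal sketch of the server-side normalization recommended above, assuming Python and its standard unicodedata module. The function name normalize_submission is hypothetical and appears nowhere in the spec or in this thread.

```python
import unicodedata


def normalize_submission(value: str) -> str:
    """Return the NFC form of a submitted form value.

    Hypothetical helper: a robust server normalizes the data itself
    instead of trusting the UA to have sent a particular form.
    """
    return unicodedata.normalize("NFC", value)


# "i" followed by U+0308 COMBINING DIAERESIS is the decomposed (NFD)
# spelling of U+00EF; NFC recombines the pair into one code point.
decomposed = "nai\u0308vely"
composed = normalize_submission(decomposed)
```

Here decomposed holds eight code points and composed holds seven, so a length limit counted in characters gives different answers for the two forms of the same text.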

> > > > To prevent an attribute from being processed in this way, put a
> > > > non-breaking zero-width space character (&#xFEFF;) at the start
> > > > of the attribute.
> > >
> > > Isn't the use of that char as anything but the BOM deprecated or at
> > > least considered harmful?
> >
> > Arguably, it _is_ a BOM here.
> >
> > I'm not overly fond of this either, but it's the only solution I could
> > find that was relatively harmless (the BOM can always be dropped at
> > the start of strings)
>
> Exactly. Which is why tools used for generating the page might drop it
> on the server!

That's fine. When put at the start of the string, it should be dropped.

> Actually, I am distributing one such tool myself. Is the tool broken?
> http://iki.fi/hsivonen/php-utf8/

It depends. If it drops the BOM in the middle of the string, then yes.

I expect this to be used so that you first output the attribute with
this "BOM", then the user-derived string, then the rest of the document:

   ...
   print("<input value=\"\xFEFF");
   print(escape(data));
   print("\">");
   ...

> My immediate thought is ZWNJ, but I'm not sure if using it is a good
> idea.

I think that would be worse than the BOM.

> > > > Note that a string containing the codepoint's value itself (for
> > > > example, the six-character string "U+263A" or the seven-character
> > > > string "&#9786;") is not considered to be human readable and must
> > > > not be used as a transliteration.
> > >
> > > Do you expect UAs that already do this to change their behavior
> > > with the legacy submission types?
> >
> > We can hope.
>
> FWIW, there may be CMS input form handlers that expect the prohibited
> behavior. I have been involved in developing one myself. (Not that I
> recommend relying on such things. Obviously, UTF-8 is the way to go.)

Yeah. Google, for one.
I've also seen login forms where people typed in characters not in the
form's submission set, and thus got a username that was not the one they
thought it was, so when they switched to another UA that did things
differently, it broke. It's madness.

> > > > which has a root element named "submission", with no prefix,
> > > > defining a default namespace
> > > > uuid:d10e4fd6-2c01-49e8-8f9d-0ab964387e32.
> > >
> > > I think that is an inappropriate attempt to micromanage the
> > > syntactic details that are in the realm of a lower-level spec. I
> > > think the submission format should either allow all the syntactic
> > > sugar that comes with Namespaces in XML or be layered directly on
> > > top of XML 1.0 without namespace support.
> >
> > The reason it is micromanaged is to make it possible to use either a
> > pure XML 1.0 parser _or_ an XML 1.0 with namespaces parser on the
> > server side without getting into any complications.
>
> I was able to guess that that was the rationale behind the requirement.
> But why is the ability to use a namespace-unaware XML processor a
> requirement? The only reason I can come up with is that PHP4 is borked
> by default but widely used.

There are various people using non-namespace-aware parsers. I don't
really want to force namespace-aware parsing when in fact the document
is guaranteed to only have one namespace anyway.

> Processing namespaced XML with tools that don't support namespaces is
> clueless and just plain wrong. If tools that don't support namespaces
> are to be accommodated, wouldn't the natural way be to spec that the
> elements are not in a namespace and the namespace processing layer is
> not used? That way you wouldn't endorse behavior that is clueless and
> just plain wrong.

It's actually more the other way around.
This is a non-namespaced document, but to accommodate people who are
going to be using it in namespace-aware environments, possibly merging
it into other documents, etc, it makes sense to actually give it a
namespace.

For example, the same data format is later used for seeding forms. If on
the server you stack the data into a huge XML file containing other data
too, it would make sense to be able to just yank out that namespaced
subtree and just use it for preseeding too.

> 1) The current best practice for dispatching on the type of an XML
> document is dispatching on the namespace. If there was no namespace, one
> would have to fall back on dispatching on the content type. This is not
> a real problem with this particular vocabulary because this vocabulary
> has a distinct content type from the start.

It does during submission. But when the data is flying about after
submission, who knows.

> 2) You couldn't mix the vocabulary with other vocabularies using
> namespaces. This is a theoretical problem but probably not a real one,
> because the vocabulary is limited to a specific case of client-server
> interaction.

It's only limited _if_ it doesn't have a namespace. Also, it is later
used for preseeding forms.

> Besides, the way you limit the use of namespaces in the current spec
> language would also preclude creative augmentations to the submission
> vocabulary.

Well, extensions would be non-compliant, yes. But at least there is a
clear mechanism for experimentation.

> > > > but must include a BOM.
> > >
> > > I think that is not a legitimate requirement when UTF-8 is used.
> >
> > Why not?
>
> It is a requirement that applies to the XML serialization, but the
> requirement is not present in the XML spec. The requirement would mean
> that you could not use any arbitrary but conforming XML serializer.
>
> The use of the BOM as a UTF-8 signature is a Microsoftism that was only
> allowed in XML 1.0 second edition, because fighting Microsoft text
> editors would have been futile. Still, if you pick a non-Microsoft XML
> serializer off the shelf, chances are it does not emit a BOM in the
> UTF-8 mode.
>
> Is there a good reason to limit the use of arbitrary but conforming
> off-the-shelf XML serializers?

I guess that makes sense. And the BOM isn't really needed anyway. Ok,
I've made it optional for UTF-8.

> > > > UAs may use either CDATA blocks, entities, or both in escaping the
> > > > contents of attributes and elements, as appropriate.
> > >
> > > In order not to imply that this spec could restrict the ways
> > > characters are escaped, that sentence should be a note rather than
> > > part of the normative prose. (Of course, only the pre-defined
> > > entities are available. Then there are NCRs.)
> >
> > This spec _could_ restrict the ways characters are escaped. It needs
> > to not be a note so that the "may" has normative value. No?
>
> The spec could restrict the escaping in the same sense the HTTP spec
> could restrict how you choose TCP sequence numbers.
>
> In general, please see section 4.3 of RFC 3470.

Yes, indeed. That's why WF2 specifically _doesn't_ restrict this.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Friday, 27 August 2004 02:25:46 UTC