- From: Philip Taylor <pjt47@cam.ac.uk>
- Date: Wed, 24 Mar 2010 10:49:14 +0000
- To: Maciej Stachowiak <mjs@apple.com>
- CC: public-html@w3.org
Maciej Stachowiak wrote:
>
> On Mar 24, 2010, at 2:29 AM, Philip Taylor wrote:
>
>> Maciej Stachowiak wrote:
>>> [...]
>>> Julian & Philip, how confident are you that the full set of
>>> characters that need escaping is U+003C, U+000D, U+000A, U+0009 and
>>> U+0020? Does & need to be escaped?
>>
>> It needs these characters "as well" as the ones already mentioned in
>> the previous paragraph in the spec (quotes and &s).
>>
>> I can't think of any other characters that have particularly special
>> behaviour, but what is the purpose of this note? If it is aimed at
>> people writing software that emits XML syntax fragments given an
>> arbitrary string of Unicode codepoints, attempting to tell them
>> everything they need to know in order to serialise safely (i.e.
>> without allowing the content to break their entire page), then it
>> would probably also have to say that U+FFFE and U+FFFF and other
>> characters in U+0000..U+001F aren't allowed, and that they must be
>> encoded in the same character encoding as the rest of the document, etc.
>>
>> It seems silly to duplicate the XML spec in that much detail here - if
>> someone's correctly implementing XML then they should already have an
>> XML serialiser that deals with all these issues, and repeating the
>> information here will be a source of bugs and a waste of time.
>>
>> If it's aimed at people writing XHTML by hand, telling them about
>> common things to be careful of, it probably doesn't need to bother
>> mentioning U+0020 because (as far as I can see) that only matters in
>> obscure cases when the DTD has set srcdoc to be non-CDATA. But <iframe
>> srcdoc> is not a useful feature when writing markup by hand - the use
>> cases were things like sandboxing untrusted user comments, and the
>> whole point is that people will write software to serialise these
>> values, so it's not useful to give advice intended for hand-authoring.
>
> What spec change (if any) would you recommend on this issue? (I'm not
> sure from the above if you are arguing for a detailed note, a shorter
> but partially incomplete note, no note, or something else.)

Mostly I'm just trying to provide information, not argue :-)

I'm happy with the current spec text - it tells XML authors to be
careful, and they can use existing tools and documentation to work out
exactly what to do.

I wouldn't like to entirely remove the note about XML, because authors
may think the note about the HTML syntax applies to them too.

I wouldn't like to remove the note about the HTML syntax, because it's
simple and is sufficient for the case of writing software that safely
embeds untrusted user input, so it helps authors use this feature
correctly for its intended purpose.

I wouldn't like text about XML that only gives half the details (e.g.
doesn't say how to protect against well-formedness errors from invalid
characters), because people are likely to think it's intended as a
sufficient description when it's not.

I wouldn't like text about XML that gives the full details, because it
would be long and unwieldy and a likely source of confusion (for authors
and reviewers) - it's a complex enough topic that it seems better to
have authors look at XML documentation for a full discussion.

If the text did attempt to give the full details for embedding untrusted
input, I think it should cover the need to escape ", &, <, U+000D,
U+000A, U+0009, U+0020, and the need to delete/replace U+FFFE and U+FFFF
and other characters in U+0000..U+001F, and the need to encode it
validly in the document's character encoding. I'm not certain that's
all, but I can't think of any more right now.

--
Philip Taylor
pjt47@cam.ac.uk
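
For illustration only, the escaping steps listed in the last paragraph of the message might be sketched roughly like this in Python. The function name escape_srcdoc_for_xml is hypothetical, and replacing disallowed characters with U+FFFD (rather than deleting them) is an assumption, not something taken from the spec. The sketch covers only the characters named in the message, which may not be a complete list, and it leaves encoding the result in the document's character encoding to the caller.

```python
# Characters the message says need escaping in an XML srcdoc attribute value.
# Whitespace is written as numeric character references so that attribute-value
# normalisation does not alter it.
_ESCAPES = {
    '"': "&quot;",
    "&": "&amp;",
    "<": "&lt;",
    "\r": "&#xD;",
    "\n": "&#xA;",
    "\t": "&#x9;",
    " ": "&#x20;",
}

# Assumption: disallowed characters are replaced with U+FFFD, not deleted.
_REPLACEMENT = "\ufffd"


def escape_srcdoc_for_xml(value: str) -> str:
    """Hypothetical helper: escape a string for use inside a double-quoted
    srcdoc attribute in the XML syntax."""
    out = []
    for ch in value:
        cp = ord(ch)
        if ch in _ESCAPES:
            out.append(_ESCAPES[ch])
        elif cp < 0x20 or cp in (0xFFFE, 0xFFFF):
            # Other characters in U+0000..U+001F, plus U+FFFE and U+FFFF.
            out.append(_REPLACEMENT)
        else:
            out.append(ch)
    return "".join(out)
```

A caller would wrap the result in double quotes, e.g. 'srcdoc="' + escape_srcdoc_for_xml(comment) + '"', and then serialise the whole document in a single character encoding. In practice an existing XML serialiser is likely the safer choice, as argued above.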
Received on Wednesday, 24 March 2010 10:49:44 UTC