- From: <noah_mendelsohn@us.ibm.com>
- Date: Thu, 13 Nov 2008 11:18:53 -0500
- To: Dean Edridge <dean@dean.org.nz>
- Cc: public-html <public-html@w3.org>, www-tag@w3.org
(with some trepidation I'll follow your lead in cross-posting to public-html and www-tag; FYI, I do not regularly read the former, though I do look at the archives occasionally)

Dean Edridge writes:

> Sorry, but I don't get this "clean content" thing.

I don't want to start a long flame war here, as I felt I had a good chance to express my feelings at the F2F, but I'll be glad to clarify what I intended (speaking for myself, not the TAG).

Let's start with some things on which I think we all agree. In particular, HTML5 as drafted provides that browsers will accept quite a range of input as text/html. For example, all of the following will be parsed into DOMs, and presented to users if retrieved as text/html:

a) <!-- clearly OK -->
   <html>
     <body>
       <div>
         <p>Para</p>
       </div>
     </body>
   </html>

b) <html>
     <body>
       <div>
         <p>Para</div>  <!-- note bad nesting of tags -->
       </p>             <!-- note bad nesting of tags -->
     </body>
   </html>

c) <html>
     <body>
       <!-- quoted attr -->
       <img src="http://example.com/img.jpg">
     </body>
   </html>

d) <html>
     <body>
       <!-- unquoted attr -->
       <img src=http://example.com/img.jpg>
     </body>
   </html>

e) XXXXXX

   (Isn't obviously HTML at all, but a browser will presumably build a DOM and render XXXXXX)

The best examples I have of 'unclean' are (b), in which the close tags are in the wrong order, and (e), which has no tags at all. As far as I know, an HTML browser will accept both of these, build a DOM for them, allow scripting of that DOM, and render output on the screen per the HTML 5 Recommendation.

Perhaps all of those are therefore what we mean by legal or clean HTML 5, but I don't think so. (a) seems to me to be legal HTML in a sense that (b), for example, is not. If I wrote an HTML editor and it put out content in the form of (b), I hope you'd tell me my editor was buggy, and that the tags should be properly nested.
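As a rough illustration of the distinction between "accepted" and "clean", here is a minimal sketch in Python. It is not taken from the HTML 5 draft and is in no way a conformance checker: it uses the standard library's lenient `html.parser` tokenizer and tests only the two properties discussed above, namely that close tags arrive in the right order and that there is at least one tag. The names `NestingChecker`, `is_clean` and `VOID` are hypothetical, the `VOID` set is deliberately incomplete, and treating tag-free input like (e) as unclean is an assumption of this sketch.

```python
from html.parser import HTMLParser

# Assumption for this sketch: a small, incomplete set of elements that
# never take a close tag.
VOID = {"img", "br", "hr", "meta", "link", "input"}


class NestingChecker(HTMLParser):
    """Tracks whether each close tag matches the most recently opened tag."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open (non-void) elements
        self.saw_tag = False     # did the input contain any start tag?
        self.well_nested = True  # cleared on the first mismatched close tag

    def handle_starttag(self, tag, attrs):
        self.saw_tag = True
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return  # ignore close tags for void elements, e.g. from <img ... />
        if not self.stack or self.stack.pop() != tag:
            self.well_nested = False


def is_clean(text):
    """True if the input has tags, all properly nested and all closed."""
    checker = NestingChecker()
    checker.feed(text)
    checker.close()
    return checker.saw_tag and checker.well_nested and not checker.stack


EXAMPLES = {
    "a": "<!-- clearly OK --> <html> <body> <div> <p>Para</p> </div> </body> </html>",
    "b": "<html> <body> <div> <p>Para</div> </p> </body> </html>",
    "c": '<html> <body> <img src="http://example.com/img.jpg"> </body> </html>',
    "d": "<html> <body> <img src=http://example.com/img.jpg> </body> </html>",
    "e": "XXXXXX",
}

if __name__ == "__main__":
    for label, text in EXAMPLES.items():
        print(label, is_clean(text))
```

Under these assumptions the sketch accepts (a), (c) and (d) and rejects (b) and (e), even though the tokenizer itself, like a browser, consumes all five inputs without complaint. A real definition of the language would of course have to say far more than this.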

So, that being the case, when there's a language as important as HTML 5, I think it's a good thing for there to be a high quality specification that gives very clear answers to questions such as:

* Which documents are part of the language (or legal in the language, if you prefer) and which are not?
* What is the correct interpretation of the legal documents?

In short, this would be just a language specification, as distinct from the existing HTML 5 draft, which focusses on consuming and rendering HTML 5 as well as consuming and rendering other input.

Note that, in principle, a language specification is not just for authors. It's a specification of what the language >is<. No doubt, the most common consumers of HTML 5 will be browsers, which will be much more liberal in what they accept, but the language specification should be referenced by anyone who wants to either produce or consume clean, legal HTML (e.g. no badly nested tags). Usually, such a language specification will say nothing about documents like (b) that aren't in the language, except to make clear that they aren't.

Since the current HTML 5 draft is focussed to a significant degree on what browsers consume, it provides for processing and building DOMs from (a-e). The question I raised in Mandelieu is whether it would be beneficial to have a first class, standalone specification for just the legal HTML.

My understanding is that Ian and the group in fact plan to attempt to create such a specification, primarily by extracting sections from the existing draft. That's good. If I have a concern, it's that I think having such a specification is important, and there is reason to wonder whether extracting it semi-automatically from bits of a larger specification will produce a high quality result. As I said in Mandelieu, I think it's fine to try, and we'll see how it goes.

> If the TAG wants authors to quote their
> attributes or something like that, then they
> should just come out and say it :-)

I have no strong opinion on whether unquoted attributes (d) should be allowed in the "clean" language, or whether they should be viewed as more like (b), i.e. poorly formed content that is nonetheless processed by browsers. My guess is that, given the state of deployment of HTML, unquoted attributes should be legal in HTML. It was certainly not my intention to argue to the contrary in Mandelieu. I really had in mind examples more like (b) and (e).

As I say, I think I've had a chance to note my concern, and the HTML working group has been clear that they want to try this in part by extracting from the normative draft, and also by continuing with Lachlan's work. So, I don't particularly care to burn a lot of anyone's energy pursuing this at the moment; I hope everyone will give a bit of thought to my suggestions and move on down whatever path they deem best.

BTW: I think Lachlan's draft looks like it's off to a very good start; my concern with it, if any, is that I think it's important for a language like HTML 5 to have a normative specification, and I understand that Lachlan is aiming for a non-normative primer (also very important and useful).

Thank you.

Noah

--------------------------------------
Noah Mendelsohn
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------

Received on Thursday, 13 November 2008 16:19:58 UTC