- From: <bugzilla@jessica.w3.org>
- Date: Sun, 08 Jul 2012 23:46:14 +0000
- To: public-html-bugzilla@w3.org
https://www.w3.org/Bugs/Public/show_bug.cgi?id=15359

--- Comment #18 from theimp@iinet.net.au 2012-07-08 23:46:13 UTC ---

(In reply to comment #17)

> And for some reason your are not PRO the same thing when it comes to XML. I wonder why.

It is a compromise position. This is a specification, not a religion; compromise is good.

Properly formed XHTML, served as XHTML to a bug-free parser, would almost never need to have its encoding changed by the user. The few cases where it would be useful would probably be better handled with specialty software that does not need to parse according to the spec. So I, for one, would be prepared to compromise.

This talk about the "user experience" is silly. I would postulate the following, offering no evidence whatsoever:

1) In 99.9% of cases, users who see garbage simply close the page. They know nothing about encodings or how to override them, and care to know even less.

2) Of the other 0.1%, in 99.9% of cases the user changes the encoding of the page only to view it, not to submit any form.

3) Of the other 0.1% of 0.1%, in 99.9% of cases the submitted forms contain only AT-standard-keyboard characters with ASCII codepoints, which are identical to the codepoints of the corresponding UTF-8 text.

Agree/disagree?

> A plug-in or config that ignore/extend a spec, do ignore/extend that spec.

The point is, you can never get the assurances you seem to argue are essential.

> Specs, UAs, tools, authors etc need to stretch towards perfect unity.

Finally, something we are in total agreement over.

> When there are discrepancies, the question is who needs to change.

Of course. Unfortunately, more and more it seems to be answered with "anyone or everyone, except us!"

> It even works with the highly regarded xmllint (aka libxml2).

Great, more broken software. How is this helpful to your position? I certainly do not regard it highly now that I know it parses incorrectly.

> Validators are a special hybrid of user agents and authoring tools which of course need to be strict.

The whole point of XML is that *all* processors process it the same way! There is nothing special about a validator in that respect. It is clear that you do not care one bit about the XML spec. - it is an argument for you to use when it suits you, and to discard when it goes against you.

> Something should adjust - the XML spec or the UAs.

Either. But if the user agents are to stay the same (in respect of XML), it should be because the XML spec. changes - not because the HTML5 spec. changes unilaterally. The internet is bigger than the web, the web is bigger than HTML, and HTML is bigger than the major browsers.

Outright layer violations, marginalization of national laws, breaking of backwards compatibility; you name it, HTML5 has it. I sincerely hope that these attempts to re-architect the internet from the top down are, in time, fondly remembered for a reckless pursuit of perfection that made everything much better in the end, and not poorly remembered for a hubris that set back standards adoption more than the Browser Wars did.

> Make it an 'error' as opposed to 'fatal error', is my proposal.

The result of an 'error' is undefined. That would be VERY BAD. It would mean that there is more than one valid way to parse the same document.
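To make concrete what "more than one valid way to parse" would mean, here is a minimal sketch (Python, with contrived document bytes of my own; no real XML parser is involved): the same byte stream, carrying a UTF-8 BOM but an ISO-8859-1 encoding declaration, gives two different texts depending on which signal a processor chooses to honor.

# The same byte stream read two ways: a UTF-8 BOM followed by an ISO-8859-1
# encoding declaration. The document bytes are invented purely for illustration.
doc = (b"\xef\xbb\xbf"                                    # UTF-8 BOM
       b'<?xml version="1.0" encoding="ISO-8859-1"?>'
       b"<p>caf\xc3\xa9</p>")                             # intended as "café" in UTF-8

as_utf8   = doc[3:].decode("utf-8")      # a processor that trusts the BOM
as_latin1 = doc.decode("iso-8859-1")     # a processor that trusts the declaration

print(as_utf8)     # ...<p>café</p>
print(as_latin1)   # ï»¿...<p>cafÃ©</p>
assert as_utf8 != as_latin1   # two defensible readings of one and the same document

If the conflict were merely an 'error', a conforming processor could recover either way; only a 'fatal error' guarantees that every processor stops rather than silently picking one.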
Your previous suggestion was much better:

> <INS> or a byte order mark</INS>

I would even be prepared to support such a suggestion, if there were no indication that the current wording is deliberate/relied upon (which I suspect - note how the current reading is fundamentally the same as the CSS rules, for example. Not likely to be a coincidence).

> Your page contains encoding related errors, and your argumet in favor of user override was to help the user to (temporarily) fix errors. So it perplexes me to hear that this is a "different argument".

My goodness me, that page was meant to demonstrate exactly one point, not every argument I made in this bug. Sheesh.

Okay, here we will do a little thought experiment.

Imagine a page like my example. Now imagine that, like a large number of web pages, it is fed content from a database. I *know* that the content in the database is proper ISO-8859-1. Imagine that this text also happens, coincidentally, to consist of valid UTF-8 sequences, which is very possible (if this is too hard to imagine, let me know and I'll update the page). Or, more likely, that the page is not served as XHTML, which is even easier to imagine.

Imagine that I get complaints from my users that the page is full of garbage. I check the page, and they're right! I click "View Source" and see the garbage. I don't see the BOM, because it's invisible when the software thinks it's a BOM. I do see my encoding declaration, though, and to me it looks like it's at the top of the page (because I can't see the BOM). Why didn't that work? Or did it, and the problem is somewhere else?

I open the XHTML file in an editor, and it looks the same, and for the same reason. I don't see the database text, of course, just code. Eventually I notice that the browser says the page is UTF-8. So I try to change it to what I know the content actually is. But it doesn't work! Funny... that used to work. Must be a browser bug. Why else would the menu even be there?

** Note: If your browser lets you change the encoding, you can stop here. Otherwise, continue to the end, then start using Wireshark.

In all probability, this theoretical author acquired the XHTML file from somewhere else, complete with BOM, and just searched for "how do i make a xhtml file iso-8859-1 plz". That, or they had some overreaching editor that decided to insert a UTF-8 BOM in front of all ASCII files in the name of promoting UTF-8. Or maybe the editor used very poor terminology and used the word "Unicode" instead of the actual encoding name in the save dialog, making the author - who knows basically nothing about encodings - think that this simply implies that it will work on the web ("all browsers support Unicode!"). Or any number of possible scenarios.

Consider that, having explained what they observed above, this author would probably get advice along the lines of "try setting the http-content-type header lol!!1". Actually, they'd probably be told: it's your database. For as long as they're looking there, they'll never find the problem. Getting this far is probably already the limit of what the vast majority of authors can manage. How does this author, who just wants their damn blog to work with their old database, debug this problem further?
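To make the "coincidentally valid UTF-8" premise of this thought experiment concrete, here is a minimal sketch (Python; the sample bytes are contrived by me, not taken from any real database):

# Bytes that are proper ISO-8859-1 *and* also a valid UTF-8 sequence.
db_bytes = bytes([0xC3, 0xA9])   # in ISO-8859-1: the two characters "Ã©"

as_utf8   = db_bytes.decode("utf-8")        # succeeds: "é" (one character)
as_latin1 = db_bytes.decode("iso-8859-1")   # what is actually stored: "Ã©" (two characters)

print(repr(as_utf8), repr(as_latin1))
# The UTF-8 decode never fails, so nothing ever raises an error: the BOM quietly forces
# the first reading, the user cannot override it, and the author gets no hint of the cause.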
Exercise: If some records in my database coincidentally don't have any extended characters (i.e. are equivalent to ASCII and therefore UTF-8), meaning that some pages work and some don't, how much *more likely* am I (and anyone I ask) to think that the problem is with the static page, rather than the database?

Bonus points: Why does the CSS file, from the same source/edited with the same editor, work exactly the way they expect? i.e.:

0xEF 0xBB 0xBF @charset "ISO-8859-1";
body::before { content: "CSS treated as encoded in ISO-8859-1!" }
/* Not by all browsers; just browsers which obey the spec. */

Triple Word Score: What if the BOM was actually added by, say, the webserver (or the FTP server that got it onto the webserver):

http://publib.boulder.ibm.com/infocenter/zos/v1r12/topic/com.ibm.zos.r12.halz001/unicodef.htm

Or a transformative cache? Or a proxy?

> As an author, I sometimes want prevent users from overriding the encoding regardless of how conscious they are about violating my intent.

As a user, I sometimes want to override the encoding regardless of how conscious the author is about violating my intent. What makes your argument more valid than mine?

Hint: It's not: http://www.w3.org/TR/html-design-principles/#priority-of-constituencies

"costs or difficulties to the user should be given more weight than costs to authors"

There are more users than there are authors, and there always will be.

> There is evidence that users do override the encoding, manually.

Of course there is! Because they have to! Why do you think browsers allowed users to change the encoding in the first place? Where is the evidence that this happens for pages WITH A BOM?

> And that this causes problems, especially in forms.

Not so much problems for the user. As for authors, read on.

> To solve that problem, the Ruby community has developed their own "BOM snowman":

Notes about that article:

1) It makes no mention of such pages containing UTF-8 or UTF-16 BOMs. There is no indication that the proposed resolution of this bug would solve that problem at all, because in all likelihood those pages do not have BOMs.

2) It admits that the problem begins when authors start serving garbage to their users.

3) It cannot be fixed by the proposed resolution of this bug. Everything from legacy clients to active attackers can still send them garbage, and they will still serve it right back out to everyone. But even if you could magically stop all submission of badly-encoded data, that does not change the users' need to change the encoding for all of their pages that have already been polluted.

The *real problem* is that the authors didn't sanitize their inputs (accepting whatever garbage they receive without validation), nor did they sanitize their outputs (sending whatever garbage they have without conversion). This kind of fix would just give authors/developers an excuse to be lazy, at the expense of the users. (A rough sketch of what I mean by sanitizing follows below.)

Note also that authors *still* can't depend upon the BOM to solve ANY problem they have, because again, it might be added or stripped by editors, filesystem drivers, backup agents, FTP clients or servers, web servers, transformative caches (language translation, archival, mobile speedup), proxies, CDNs, filters, or anything else that might come between the coder and the browser user. Consider:

http://schneegans.de/
http://webcache.googleusercontent.com/search?q=cache:kKjAj-u4s6IJ:http://schneegans.de/

Notice how the Google Cache version has stripped the BOM?
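Coming back to the sanitizing point above, here is a rough sketch of what I mean (Python; the function names, and the assumption of an ISO-8859-1 database behind pages served as UTF-8, are mine for illustration only):

# Hypothetical sketch: validate what comes in, convert what goes out.

def sanitize_input(raw: bytes) -> str:
    """Validate submitted form bytes instead of trusting them."""
    try:
        return raw.decode("utf-8")   # the encoding the form was supposed to use
    except UnicodeDecodeError:
        # Garbage in: reject it (or log and repair), but never store it blindly.
        raise ValueError("form data is not valid UTF-8")

def sanitize_output(stored: bytes) -> bytes:
    """Convert stored legacy text to the page's declared encoding before serving it."""
    text = stored.decode("iso-8859-1")   # what the database actually holds
    return text.encode("utf-8")          # what the page claims to be

safe_text = sanitize_input(b"caf\xc3\xa9")                  # accepts valid UTF-8
page_body = sanitize_output("café".encode("iso-8859-1"))    # b"caf\xc3\xa9", safe to serve as UTF-8

Neither function needs a BOM to keep garbage out of the database or off the page.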
If this is just about forms, then how is this for a compromise:

"When the document contains a UTF-8 or UTF-16 BOM, all forms are to be submitted in UTF-8 or UTF-16, respectively, regardless of the current encoding of the page. However, authors must not rely on this behavior."

Would that satisfy you?
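In case it helps, a sketch of how I imagine that rule (Python-style pseudocode; the function and its inputs are invented for illustration, not taken from any spec):

# Proposed rule, sketched: a BOM pins the form-submission encoding; otherwise the form
# is submitted in whatever encoding the page is currently being displayed in
# (including a user override).
def form_submission_encoding(page_bytes: bytes, current_encoding: str) -> str:
    if page_bytes.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    if page_bytes.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "UTF-16"
    return current_encoding

# A BOM-bearing page submits in UTF-8 even if the user has switched the view to ISO-8859-1:
print(form_submission_encoding(b"\xef\xbb\xbf<!DOCTYPE html>", "ISO-8859-1"))   # UTF-8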
Received on Sunday, 8 July 2012 23:46:15 UTC