[Bug 15359] Make BOM trump HTTP

https://www.w3.org/Bugs/Public/show_bug.cgi?id=15359

--- Comment #18 from theimp@iinet.net.au 2012-07-08 23:46:13 UTC ---
(In reply to comment #17)
> And for some reason your are not PRO the same thing when it comes to XML. I wonder why.

It is a compromise position. This is a specification, not a religion;
compromise is good. Properly formed XHTML served as XHTML to a bug-free parser
would almost never have any need to have the encoding changed by the user. The
few cases where it would be useful would probably be better handled with
specialty software that does not need to be interested in parsing according to
the spec. So, I for one would be prepared to compromise.

This talk about the "user experience" is silly. I would postulate the
following, offering no evidence whatsoever:

1) In 99.9% of cases, users who see garbage simply close the page. They know
nothing about encodings or how to override them, and care to know even less.
2) Of the other 0.1%, in 99.9% of cases the user changes the encoding of the
page only to view it, not to submit any form.
3) Of the other 0.1% of 0.1%, in 99.9% of cases the submitted forms contain
only standard (AT) keyboard characters in the ASCII range, whose codepoints -
and encoded bytes - are identical to those of the corresponding UTF-8 text
(see the quick demonstration below).

Agree/disagree?
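
To illustrate point 3, here is a quick check in Python (nothing from any spec,
just the byte-identity being relied upon):

    # ASCII-range text encodes to byte-identical sequences in ASCII, ISO-8859-1
    # and UTF-8, so such a form submission looks the same either way.
    text = "Plain keyboard text, no accents: Hello, world! 123"
    assert text.encode("ascii") == text.encode("iso-8859-1") == text.encode("utf-8")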

> A plug-in or config that ignore/extend a spec, do ignore/extend that spec.

The point is, you can never get the assurances you seem to argue are essential.

> Specs, UAs, tools, authors etc need to stretch towards perfect unity.

Finally, something we are in total agreement over.

> When there are discrepancies, the question is who needs to change.

Of course. Unfortunately, more and more it seems to be answered with "anyone or
everyone, except us!"

> It even works with the highly regarded xmllint (aka libxml2).

Great, more broken software. How is this helpful to your position? I certainly
do not regard it highly now that I know it parses incorrectly.

> Validators are a special hybrid of user agents and authoring tools which of course need to be strict.

The whole point of XML is that *all* processors process it the same way! There
is nothing special about a validator in that respect.

It is clear that you do not care one bit about the XML spec. - it is an
argument for you to use when it suits you, and to discard when it goes against
you.

> Something should adjust - the XML spec or the UAs.

Either. But if the user agents are to stay the same (in respect of XML), it
should be because the XML spec. changes - not because the HTML5 spec. changes
unilaterally.

The internet is bigger than the web, and the web is bigger than HTML, and HTML
is bigger than the major browsers. Outright layer violations, marginalization
of national laws, breaking of backwards compatibility; you name it, HTML5 has
it. I sincerely hope that these attempts to re-architect the internet from the
top down are, in time, fondly remembered for their reckless pursuit of
perfection that made everything much better in the end, and not poorly
remembered for their hubris that set back standards adoption more than the
Browser Wars.

> Make it an 'error' as opposed to 'fatal error', is my proposal. The result of an 'error' is undefined.

That would be VERY BAD. It would mean that there is more than one valid way to
parse the same document.

Your previous suggestion was much better:
> <INS> or a byte order mark</INS>

I would even be prepared to support such a suggestion, if there were no
indication that the current wording is deliberate/relied upon (which I suspect
it is - note how the current reading is fundamentally the same as the CSS
rules, for example. Not likely to be a coincidence).

> Your page contains encoding related errors, and your argumet in favor of user override was to help the user to (temporarily) fix errors. So it perplexes me to hear that this is a "different argument".

My goodness me, that page was meant to demonstrate exactly one point, not every
argument I made in this bug. Sheesh.

Okay, here we will do a little thought experiment.

Imagine a page like my example.

Now imagine that, like a large number of web pages, it is fed content from a
database. I *know* that the content in the database is proper ISO-8859-1.

Imagine that this text also happens to coincidentally be valid UTF-8 sequences,
which is very possible (if this is too hard to imagine, let me know and I'll
update the page). Or more likely, that the page is not served as XHTML, which
is even easier to imagine.
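
(If the coincidence is hard to picture, here is a minimal illustration in
Python, using a contrived byte pair rather than anything from my page:)

    # Bytes that are proper ISO-8859-1 *and* happen to form valid UTF-8.
    # In ISO-8859-1 they read "Ã©" (0xC3 0xA9); the same two bytes are the
    # UTF-8 sequence for "é", so a UTF-8-forced parser decodes them without
    # error - the user just sees the wrong characters.
    raw = b"\xc3\xa9"
    print(raw.decode("iso-8859-1"))   # what the database really contains
    print(raw.decode("utf-8"))        # what the UTF-8-labelled page displays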

Imagine that I get complaints from my users, that the page is full of garbage.

I check the page, and they're right!

I click "View Source", and see the garbage. I don't see the BOM, because it's
invisible when the software thinks it's a BOM. I do see my encoding
declaration, though, and to me it looks like it's at the top of the page
(because I can't see the BOM). Why didn't that work? Or did it, and the problem
is somewhere else?

I open the XHTML file in an editor, and it looks the same, and for the same
reason. I don't see the database text, of course, just code.

Eventually I notice that the browser says the page is UTF-8. So I try to change
it to what I know the content actually is. But it doesn't work! Funny... that
used to work. Must be a browser bug. Why else would the menu even be there?

** Note: If your browser lets you change the encoding, you can stop here.
Otherwise, continue to the end, then start using Wireshark.

In all probability, this theoretical author acquired the XHTML file from
somewhere else complete with BOM and just searched for "how do i make a xhtml
file iso-8859-1 plz". That, or they had some overreaching editor that decided
that it would insert a UTF-8 BOM in front of all ASCII files in the name of
promoting UTF-8. Or maybe the editor used very poor terminology and used the
word "Unicode" instead of the actual encoding name in the save dialog, making
the author - who knows basically nothing about encodings - think that this
simply implies that it will work on the web ("all browsers support Unicode!").
Or any number of possible scenarios.

Consider that an author who explained the symptoms observed above would
probably get advice along the lines of "try setting the http-content-type
header lol!!1".

Actually, they'd probably be told - it's your database. For as long as they're
looking there, they'll never find the problem.

Getting this far is probably already the limit of what the vast majority of
authors can manage. How does this author, who just wants their damn blog to
work with their old database, debug this problem further?

Exercise: If some records in my database coincidentally don't contain any
extended characters (ie. are equivalent to ASCII and therefore to UTF-8),
meaning that some pages work and some don't, why would I (and anyone I ask) be
*more likely* to think that the problem is with the static page rather than
with the database?

Bonus points: Why does the CSS file, from the same source/edited with the same
editor, work exactly the way they expect? ie.

0xEF 0xBB 0xBF @charset "ISO-8859-1";
               body::before{content:"CSS treated as encoded in ISO-8859-1!"}
               /* Not by all browsers; just browsers which obey the spec. */

Triple Word Score: What if the BOM was actually added by, say, the webserver
(or the FTP server that got it onto the webserver):
http://publib.boulder.ibm.com/infocenter/zos/v1r12/topic/com.ibm.zos.r12.halz001/unicodef.htm
Or a transformative cache? Or a proxy?

> As an author, I sometimes want prevent users from overriding the encoding regardless of how conscious they are about violating my intent.

As a user, I sometimes want to override the encoding regardless of how
conscious the author is about violating my intent.

What makes your argument more valid than mine?

Hint: It's not:
http://www.w3.org/TR/html-design-principles/#priority-of-constituencies
"costs or difficulties to the user should be given more weight than costs to
authors"

There are more users than there are authors, and there always will be.

> There is evidence that users do override the encoding, manually.

Of course there is! Because they have to! Why do you think browsers allowed
users to change the encoding in the first place?

Where is the evidence that this happens for pages WITH A BOM?

> And that this causes problems, especially in forms.

Not so much problems for the user. As for authors, read on.

> To solve that problem, the Ruby community has developed their own "BOM snowman":

Notes about that article:

1) It makes no mention of such pages containing UTF-8 or UTF-16 BOMs. There is
no indication that the proposed resolution of this bug would solve that problem
at all, because in all likelihood those pages do not have BOMs.

2) It admits that the problem begins when authors start serving garbage to
their users.

3) It cannot be fixed by the proposed resolution of this bug. Everything from
legacy clients to active attackers can still send them garbage, and they will
still serve it right back out to everyone. But even if you could magically stop
all submission of badly-encoded data, that does not change the users' need to
change the encoding for all of their pages that have already been polluted. The
*real problem* is that the authors didn't sanitize their inputs (accepting
whatever garbage they receive without validation), nor did they sanitize their
outputs (sending whatever garbage they have without conversion). This kind of
fix would just give authors/developers an excuse to be lazy, at the expense of
the users.

Note also that authors *still* can't depend upon the BOM to solve ANY problem
they have, because again, it might be either added or stripped by editors,
filesystem drivers, backup agents, ftp clients or servers, web servers,
transformative caches (language translation, archival, mobile speedup),
proxies, CDNs, filters, or anything else that might come in between the coder
and the browser user.
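
As one small illustration of how silently that can happen (a sketch in Python,
not a claim about any particular product - its "utf-8-sig" codec consumes the
BOM on read, so any tool built on it that reads a file and re-saves it drops
the BOM without anyone asking):

    # Hypothetical round-trip through an intermediate tool: the BOM disappears.
    data = b"\xef\xbb\xbf<html>...</html>"    # file as uploaded, with a BOM
    text = data.decode("utf-8-sig")           # the BOM is silently consumed here
    rewritten = text.encode("utf-8")          # re-saved: no BOM any more
    assert not rewritten.startswith(b"\xef\xbb\xbf")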

Consider:

http://schneegans.de/
http://webcache.googleusercontent.com/search?q=cache:kKjAj-u4s6IJ:http://schneegans.de/

Notice how the Google Cache version has stripped the BOM?

If this is just about forms, then how is this for a compromise:

"When the document contains a UTF-8 or UTF-16 BOM, all forms are to be
submitted in UTF-8 or UTF-16, respectively, regardless of the current encoding
of the page. However, authors must not rely on this behavior."

Would that satisfy you?
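
(A rough sketch of what that rule would mean, with made-up names, just to pin
down the behavior being proposed - the BOM would decide only the encoding of
form submissions, never the encoding used to parse or display the page:)

    def form_submission_encoding(document_bytes, page_encoding):
        # Hypothetical sketch of the compromise wording above.
        if document_bytes.startswith(b"\xef\xbb\xbf"):
            return "utf-8"
        if document_bytes.startswith((b"\xff\xfe", b"\xfe\xff")):
            return "utf-16"
        return page_encoding    # whatever the page is currently decoded as

    print(form_submission_encoding(b"\xef\xbb\xbf<html>...", "iso-8859-1"))  # utf-8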
