[Bug 6259] Different result when valdating by direct input or by url from bugzilla@wiggum.w3.org on 2008-12-02 (www-validator-cvs@w3.org from December 2008)

From: <bugzilla@wiggum.w3.org>
Date: Tue, 02 Dec 2008 16:22:53 +0000
To: www-validator-cvs@w3.org
Message-Id: <E1L7Y1V-0006AN-0a@farnsworth.w3.org>

http://www.w3.org/Bugs/Public/show_bug.cgi?id=6259


Olivier Thereaux <ot@w3.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|                            |INVALID
             Status|NEW                         |RESOLVED
                 CC|                            |ot@w3.org




--- Comment #1 from Olivier Thereaux <ot@w3.org>  2008-12-02 16:22:52 ---
(In reply to comment #0)
> So, character encoding is not checked when using validating by direct HTML
> input. Imho the character encoding should also be checked when validating by
> direct HTML input.

When you validate by URI, the validator retrieves the resource via HTTP. What
it retrieves are bytes, which must be decoded into characters - hence the
importance of knowing the character encoding, either thanks to the "charset="
parameter in the HTTP headers sent by the server, or with the <meta>
information in HTML, or other possible sources.
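
As a rough illustration of that lookup order (a sketch only, not the
validator's actual code): the helper name pick_charset, the 1024-byte sniff
window and the utf-8 fallback are all assumptions made for the example.

```python
import re

def pick_charset(content_type, body, default="utf-8"):
    """Hypothetical helper: HTTP header first, then <meta>, then a default."""
    m = re.search(r'charset\s*=\s*"?([\w.:-]+)', content_type or "", re.I)
    if m:
        return m.group(1).lower()
    # Only sniff the start of the document; non-ASCII bytes are replaced
    # so that this step can never fail.
    head = body[:1024].decode("ascii", errors="replace")
    m = re.search(r'<meta[^>]*charset\s*=\s*["\']?([\w.:-]+)', head, re.I)
    if m:
        return m.group(1).lower()
    return default  # assumed fallback, not necessarily what the validator does

# Bytes as a server would send them for a latin-1 page ("\u00e9" is e-acute).
raw = ('<meta http-equiv="Content-Type" '
       'content="text/html; charset=iso-8859-1">'
       '<p>caf\u00e9</p>').encode("iso-8859-1")

# No charset in the header, so the <meta> declaration is used to decode.
text = raw.decode(pick_charset("text/html", raw))
```

The point is only that the bytes in "raw" mean nothing until one of these
sources tells you how to turn them into characters.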

The same goes for file upload, with some minor differences (there is no web
server, but the web browser pretty much plays that role).

When using direct input, however, what you give to the validator is not a
series of bytes: you copy and paste characters into a form on the validator's
home page. That validator page is encoded in utf-8, which means that the form
submitted to the validator will automatically be in utf-8 too, regardless of
what your original content was encoded in and regardless of whatever meta
information is present.
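
To make that concrete, here is a small sketch of what a browser does when it
submits a form from a utf-8 page. The field name "fragment" is just a
placeholder for the example, and urlencode() stands in for the browser.

```python
from urllib.parse import urlencode, parse_qs

# Characters as they sit in the clipboard; they carry no encoding of their
# own, whatever the file they were copied from was saved as.
pasted = "<p>caf\u00e9</p>"

# The form lives on a utf-8 page, so the browser serialises it as utf-8:
# the e-acute travels as the two bytes C3 A9.
body = urlencode({"fragment": pasted}, encoding="utf-8")

# On the receiving side, decoding as utf-8 gives back the same characters.
received = parse_qs(body, encoding="utf-8")["fragment"][0]
```

Nothing in this round trip ever looks at the encoding of the original file.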

Does the validator need to check encoding in "direct input" mode? No, per the
above. Should it do it? The answer here too is "no".

Imagine that your document is encoded in ISO-8859-1 (a.k.a. latin-1). It is
properly served as latin-1 by your web server, and there is a <meta> tag with
that information too. All is well. Now, imagine that you take that page source
and copy-paste it into the "direct input" form of the validator: then, as
explained above, the markup automatically becomes characters in utf-8. Should
the validator complain that it is receiving utf-8 content when the source says
"iso-8859-1"? Of course not, that would be wrong and confusing.
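
Here is that scenario as a short sketch (illustrative values only, not taken
from a real page), showing what would happen if the <meta> declaration were
trusted for direct input:

```python
source = ('<meta http-equiv="Content-Type" '
          'content="text/html; charset=iso-8859-1">'
          '<p>caf\u00e9</p>')

by_uri = source.encode("iso-8859-1")  # bytes a latin-1 server would send
direct = source.encode("utf-8")       # bytes a utf-8 form would submit

# Each route works when decoded with the encoding actually used in transit.
ok_by_uri = by_uri.decode("iso-8859-1") == source
ok_direct = direct.decode("utf-8") == source

# Trusting the <meta> declaration for the direct input bytes corrupts the
# text: the two utf-8 bytes of e-acute turn into two latin-1 characters.
mojibake = direct.decode("iso-8859-1")
```

So in direct input mode the declared encoding is not merely unnecessary,
applying it would actually damage the document.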

In other words, direct input and by-URI validation are very different
paradigms, and the difference shows most clearly in the handling of encodings.
It *is* confusing, and if you can think of any way to make it less confusing,
ideas are welcome.



Received on Tuesday, 2 December 2008 16:23:01 UTC