[Bug 10174] Bogus error reported for UTF-8 characters in larger documents

From: <bugzilla@jessica.w3.org>
Date: Sun, 30 Oct 2011 01:08:22 +0000
To: www-validator-cvs@w3.org
Message-Id: <E1RKJss-0007Ef-7b@jessica.w3.org>

--- Comment #10 from Michael[tm] Smith <mike@w3.org> 2011-10-30 01:08:21 UTC ---
(In reply to comment #9)
> The curl man page is indeed pretty confusing wrt. what exactly --data does. 
> But it does say this: "-d/--data is the same as --data-ascii." and then later
> for --data-binary "Data is posted in a similar manner as --data-ascii does,
> except that newlines are preserved and conversions are never done.".  I don't
> think it's actually a matter of binary vs text, but rather posting as-is or
> with some conversions.

Ah, OK. Yeah, after re-reading the curl man page, I understand now. That switch
really ought to be called "--data-as-is" or something instead... 

> Not sure what conversions they mean other than something related to newlines,
> but I have verified locally with wireshark that --data-binary POSTs files
> as-is as I want it to (and like the validator does)

Yeah, I verified the same thing using the curl --trace option; e.g.,

curl --data-binary @utf-8-validation.html -H "Content-Type: text/html" \
  --trace - "http://localhost:8888/?out=gnu"
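
A rough way to see locally how much the two switches can differ, without
posting anything: compare the size of the file as-is (what --data-binary
sends) with its size after stripping carriage returns and newlines (an
approximation of what --data @file does, going by the curl man page). The
helper name compare_post_sizes is made up for this sketch, and it does not try
to model any other conversion curl might apply.

```shell
# Hypothetical helper, not part of curl or the validator.
# Prints the byte count of FILE as-is and after removing CR and LF bytes.
compare_post_sizes() {
  # wc -c pads its output with spaces on some systems, so strip them
  as_is=$(wc -c < "$1" | tr -d ' ')
  stripped=$(tr -d '\r\n' < "$1" | wc -c | tr -d ' ')
  echo "as-is=$as_is stripped=$stripped"
}
```

For example, "compare_post_sizes utf-8-validation.html" prints both counts; if
they differ, any line and column numbers reported for a --data post would not
line up with the original file.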

And looking at the hex dump of that, I notice that the position at which it
reports "End of file seen" (line 1254, column 13) is byte 0x2000 (decimal 8192,
8KB) of the last chunk of data in the post. And I then notice sort of the same
thing that Adam mentions in comment #1 -- if I insert a different character
before that position, then behavior changes. But unlike Adam's case (where he
seems to be saying that he still gets an error, but just that it gets reported
for a different character), I get no error any longer at all -- instead, the
document validates as expected.

So, there's definitely something weird going on here. The fact that the
error gets reported at exactly the 8KB mark really does make it seem like it's
running into some kind of limit.
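
To check that kind of thing without reading a hex dump, something like the
following sketch could map a reported line and column to a byte offset in the
file and test whether it lands on an 8192-byte (0x2000) boundary. It assumes
that columns are 1-based and counted in bytes, which may not match how the
validator counts them, and it looks at the offset within the file itself, not
within a chunk of the curl trace. The names byte_offset and on_8k_boundary are
made up for illustration.

```shell
# Hypothetical helpers; columns are ASSUMED to be 1-based byte positions.

# byte_offset FILE LINE COL: print the 0-based byte offset of LINE:COL in FILE.
byte_offset() {
  if [ "$2" -le 1 ]; then
    before=0
  else
    # total bytes in all lines that precede LINE
    before=$(head -n "$(( $2 - 1 ))" "$1" | wc -c | tr -d ' ')
  fi
  echo $(( before + $3 - 1 ))
}

# on_8k_boundary OFFSET: succeed if OFFSET is a multiple of 8192 (0x2000).
on_8k_boundary() {
  [ $(( $1 % 8192 )) -eq 0 ]
}
```

For example, on_8k_boundary "$(byte_offset utf-8-validation.html 1254 13)"
succeeds only if that position is an exact multiple of 8 KB into the file.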

> and --data on the other
> hand at least discards newlines, probably also leading whitespace (which would
> mean problems with line and column numbers in results if validator did that).

Received on Sunday, 30 October 2011 01:08:27 UTC
