UTF-8 Errors on file upload, not by URI from Kessler,Nathan on 2014-09-09 (www-validator@w3.org from September 2014)

From: Kessler,Nathan <kesslern@oclc.org>
Date: Tue, 9 Sep 2014 19:40:45 +0000
To: "www-validator@w3.org" <www-validator@w3.org>
Message-ID: <1410291644706.96048@oclc.org>

I'm trying to validate http://worldcat.org. If I run a scan by URI or by direct input, the scan runs as expected. However, when the HTML source is saved in a file and uploaded, this error is reported on line 651:

"The error was: utf8 "\xED" does not map to Unicode" and the scan doesn't run.


The specific character in question: http://www.fileformat.info/info/unicode/char/ed/index.htm


If this character is removed, validation fails instead on the accented character in "traducción", so the problem is not limited to the character above. The page declares its encoding as UTF-8, and the file is saved as UTF-8 before being uploaded. The scan works when the encoding is set to UTF-16, but not when the validator reads UTF-8 from the HTML.
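One way to narrow this down is to check whether the bytes of the saved file are actually valid UTF-8, independently of what the page declares. The following is only a minimal sketch, not part of the validator: the function name first_invalid_utf8 is made up for illustration, and the sample strings stand in for the saved page. A lone 0xED byte followed by an ASCII character is what a Latin-1 encoded "í" looks like, so a hit here would suggest the file was written in a single-byte encoding rather than UTF-8.

```python
def first_invalid_utf8(data: bytes):
    """Return (byte_offset, line_number, byte_value) for the first byte
    sequence that is not valid UTF-8, or None if the data decodes cleanly.

    Hypothetical helper for illustration; not part of any validator.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # Line numbers are 1-based: count the newlines before the bad byte.
        line = data.count(b"\n", 0, err.start) + 1
        return err.start, line, data[err.start]
    return None


# The same word encoded two ways: as UTF-8 and as ISO-8859-1 (Latin-1).
good = "traducción".encode("utf-8")
bad = "traducción".encode("latin-1")

print(first_invalid_utf8(good))  # clean UTF-8
print(first_invalid_utf8(bad))   # reports where decoding fails
```

To run this against the uploaded file, the bytes would be read in binary mode (for example open(path, "rb").read(), with path being wherever the page was saved) so that no decoding happens before the check.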


Can anyone provide any advice here? We have an automated system that downloads web pages and runs them against our local validator via a file upload and this page won't scan due to this error. Is the encoding set improperly on the web page? Am I missing something else here?


Thanks for your time and work,

Nathan Kessler

Received on Wednesday, 10 September 2014 21:12:51 UTC