- From: Felix Sasaki <fsasaki@w3.org>
- Date: Tue, 21 Jun 2005 17:48:44 +0900
- To: w3c.amc+0+@google.com
- Cc: chris@w3.org, www-international@w3.org
Dear Adam,

This is Felix Sasaki from the i18n Activity of W3C. Chris Lilley recently noticed some strange usage of the term "XML-encoded" in the documentation of Google sitemaps. I have attached his mail below. Do you know who we should contact to change the documentation?

Thanks a lot for your help.

Best, Felix.

MAIL FROM CHRIS TO www-international@w3.org:

I noticed this curious term "XML-encoded"

http://www.google.com/webmasters/sitemaps/docs/en/protocol.html#faq_xml_encoding

on the Google sitemaps page.

The problem is that it encourages people to assume that XML requires an IRI to be escaped to a URI. It would be better if Google used the already defined terms, and made it clear that this escaping is a special requirement of their particular format and not of XML in general.

The escaping should refer to RFC 3987 and not HTML 4 (which is not even an XML format, and is not the defining instance of the escape mechanism). It should also refer to RFC 3986 and not 2396, of course.

TEXT IN THE DOCUMENTATION:

Q: How do I XML-encode a URL?

To properly encode your URLs, follow the procedure recommended by the HTML 4.0 specification, section B.2.1. Convert the string to UTF-8 and then URL-escape the result. For details about Internationalized Resource Identifiers, also see RFC2396 (sections 2.3 and 2.4) and RFC3987.

The following is an example python script for XML encoding a URL:

    $ python
    Python 2.2.2 (#1, Feb 24 2003, 19:13:11)
    >>> import xml.sax.saxutils
    >>> xml.sax.saxutils.escape("http://www.test.org/view?widget=3&count>2")

The encoded URL from the example above is:

    http://www.test.org/view?widget=3&amp;count&gt;2

Q: Does it matter which character encoding method I use to generate my Sitemap files?

Yes. Your Sitemap files must use UTF-8 encoding.
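The quoted documentation folds two separate steps into the single term "XML-encoded": converting an IRI to a URI (UTF-8, then percent-escaping), and escaping XML markup characters. The following is only a minimal sketch of how the two steps could be kept apart, written for Python 3. The function names and the example.com address are hypothetical, and the set of characters left unescaped is an assumption based on the reserved and unreserved sets of RFC 3986, not something stated in the Google documentation. Host names containing non-ASCII characters would need separate handling that is not shown here.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

# Assumption: leave RFC 3986 reserved and unreserved characters alone, and
# keep "%" so that escapes already present in the input are not doubled.
URI_SAFE = ":/?#[]@!$&'()*+,;=%-._~"


def iri_to_uri(iri: str) -> str:
    """Step 1: encode as UTF-8 and percent-escape everything outside URI_SAFE."""
    return quote(iri, safe=URI_SAFE, encoding="utf-8")


def xml_escape_uri(uri: str) -> str:
    """Step 2: escape &, < and > so the URI can appear in XML character data."""
    return escape(uri)


def sitemap_url(iri: str) -> str:
    """Apply both steps in order (hypothetical helper name)."""
    return xml_escape_uri(iri_to_uri(iri))


if __name__ == "__main__":
    print(sitemap_url("http://www.example.com/ümlat.html?a=1&b=2"))
```

Only the second step concerns XML; the first is the IRI-to-URI mapping that Chris suggests should be described with reference to RFC 3987. Note that with the safe set assumed above, a literal ">" in the input is percent-escaped in step 1 and never reaches the XML escaping.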
Received on Tuesday, 21 June 2005 08:48:54 UTC