- From: Chris Lilley <chris@w3.org>
- Date: Wed, 8 Jun 2005 13:08:28 +0200
- To: www-international@w3.org
Hello www-international,

I noticed the curious term "XML-encoded" on the Google sitemaps page:
http://www.google.com/webmasters/sitemaps/docs/en/protocol.html#faq_xml_encoding

The problem is that it encourages people to assume that XML requires an IRI to be escaped to a URI. It would be better if Google used the already defined terms, and made it clear that this escaping is a special requirement of their particular format and not of XML in general.

The escaping should refer to RFC 3987 and not HTML 4 (which is not even an XML format, and is not the defining instance of the escape mechanism). It should also refer to RFC 3986 and not RFC 2396, of course.

http://www.google.com/webmasters/sitemaps/docs/en/protocol.html

  Q: How do I XML-encode a URL?

  To properly encode your URLs, follow the procedure recommended by the HTML 4.0 specification, section B.2.1. Convert the string to UTF-8 and then URL-escape the result. For details about Internationalized Resource Identifiers, also see RFC2396 (sections 2.3 and 2.4) and RFC3987.

  The following is an example python script for XML encoding a URL:

    $ python
    Python 2.2.2 (#1, Feb 24 2003, 19:13:11)
    >>> import xml.sax.saxutils
    >>> xml.sax.saxutils.escape("http://www.test.org/view?widget=3&count>2")

  The encoded URL from the example above is:

    http://www.test.org/view?widget=3&amp;count&gt;2

  Q: Does it matter which character encoding method I use to generate my Sitemap files?

  Yes. Your Sitemap files must use UTF-8 encoding.

-- 
Chris Lilley mailto:chris@w3.org
Chair, W3C SVG Working Group
W3C Graphics Activity Lead
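
To illustrate the distinction drawn in this message, here is a minimal sketch in Python 3 that keeps the two operations apart: converting an IRI to a URI (UTF-8 octets, then percent-escaping) and escaping markup characters for inclusion in XML content. The function name iri_to_uri and the example.org address are hypothetical, chosen only for illustration. The conversion is simplified: it assumes the input is already a well-formed IRI and leaves out the handling of internationalized host names described in RFC 3987.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

# Characters left untouched: URI reserved characters, "%" (so existing
# percent-escapes are not escaped twice) and the unreserved punctuation.
URI_SAFE = "/:?#[]@!$&'()*+,;=%~-._"


def iri_to_uri(iri: str) -> str:
    """Simplified IRI-to-URI step: encode as UTF-8, then percent-escape
    every octet that is not in URI_SAFE or an ASCII letter or digit."""
    return quote(iri, safe=URI_SAFE, encoding="utf-8")


# Hypothetical IRI containing non-ASCII characters and an ampersand.
iri = "http://www.example.org/r\u00e9sum\u00e9?lang=fr&page=2"

# Step 1: IRI to URI. This is the escaping the Sitemap format asks for;
# it is not something XML itself requires.
uri = iri_to_uri(iri)

# Step 2: XML escaping of "&", "<" and ">". This is ordinary markup
# escaping and is independent of step 1.
xml_text = escape(uri)

print(uri)
print(xml_text)
```

Only the second step is XML escaping in the usual sense; the first is the IRI-to-URI conversion that the Sitemap format additionally imposes.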
Received on Wednesday, 8 June 2005 11:08:43 UTC