- From: Chris Lilley <chris@w3.org>
- Date: Wed, 8 Jun 2005 13:08:28 +0200
- To: www-international@w3.org
Hello www-international,
I noticed this curious term "XML-encoded"
http://www.google.com/webmasters/sitemaps/docs/en/protocol.html#faq_xml_encoding
on the Google sitemaps page. The problem is that it encourages people to
assume that XML requires an IRI to be escaped to a URI. It would be
better if Google used the already defined terms, and made it clear that
this escaping is a special requirement of their particular format and
not of XML in general. The description of the escaping should cite RFC 3987, not
HTML 4 (which is not even an XML format, and is not the defining
instance of the escape mechanism).
It should also refer to RFC 3986 and not 2396, of course.
http://www.google.com/webmasters/sitemaps/docs/en/protocol.html
Q: How do I XML-encode a URL?
To properly encode your URLs, follow the procedure recommended by the
HTML 4.0 specification, section B.2.1. Convert the string to UTF-8 and
then URL-escape the result. For details about Internationalized Resource
Identifiers, also see RFC2396 (sections 2.3 and 2.4) and RFC3987.
The following is an example python script for XML encoding a URL:
$ python
Python 2.2.2 (#1, Feb 24 2003, 19:13:11)
>>> import xml.sax.saxutils
>>> xml.sax.saxutils.escape("http://www.test.org/view?widget=3&count>2")
The encoded URL from the example above is:
http://www.test.org/view?widget=3&amp;count&gt;2
Q: Does it matter which character encoding method I use to generate my Sitemap files?
Yes. Your Sitemap files must use UTF-8 encoding.
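To make the distinction concrete, here is a minimal sketch (in current Python 3 rather than the 2.2 session quoted above) that keeps the two steps apart: mapping an IRI to a URI by percent-encoding its non-ASCII characters as UTF-8 octets, in the spirit of RFC 3987 section 3.1, and separately escaping markup characters so the result can sit in XML content. The helper name iri_to_uri and the example.org address are hypothetical, and the sketch treats the whole string uniformly, so it does not handle internationalized host names or validate the input.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

# ASCII characters that may legitimately appear in a URI and should be
# left untouched, including "%" so existing escapes are not re-encoded.
# (Letters, digits and "_.-~" are never quoted by urllib.parse.quote.)
URI_SAFE = "!#$%&'()*+,/:;=?@[]"


def iri_to_uri(iri: str) -> str:
    """Percent-encode the UTF-8 octets of every character outside URI_SAFE.

    Simplified assumption: the IRI is handled as one flat string; an
    internationalized host name would need separate treatment.
    """
    return quote(iri, safe=URI_SAFE, encoding="utf-8")


iri = "http://www.example.org/résumé?q=a&b=c"  # hypothetical example

# Step 1: IRI -> URI. Only needed if the format demands URIs.
uri = iri_to_uri(iri)

# Step 2: XML escaping. Needed for any text placed in XML content,
# and entirely independent of step 1.
print(uri)
print(escape(uri))

# XML itself can carry the IRI as-is, with only markup characters escaped.
print(escape(iri))
```

Only the last call is something XML itself asks for; the percent-encoding step is an extra requirement of a format that insists on URIs.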
--
Chris Lilley mailto:chris@w3.org
Chair, W3C SVG Working Group
W3C Graphics Activity Lead
Received on Wednesday, 8 June 2005 11:08:43 UTC