"XML-encoded" as a misused term in Google sitemaps

Hello www-international,

I noticed this curious term "XML-encoded"
http://www.google.com/webmasters/sitemaps/docs/en/protocol.html#faq_xml_encoding

on the Google sitemaps page. The problem is that it encourages people to
assume that XML requires an IRI to be escaped to a URI. It would be
better if Google used the already defined terms, and made it clear that
this escaping is a special requirement of their particular format and
not of XML in general. The description of the escaping should refer to
RFC 3987 and not HTML 4 (which is not even an XML format, and is not the
defining instance of the escape mechanism).

It should also refer to RFC 3986 and not 2396, of course.
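
To make the distinction concrete, here is a minimal sketch in current
Python 3 (not the Python 2.2 shown in the FAQ) that keeps the two steps
apart. The function names iri_to_uri and xml_escape and the example IRI
are my own illustrations, not anything from the Google page, and the
conversion is a simplification of RFC 3987 section 3.1: it only
percent-encodes the UTF-8 bytes of non-ASCII characters and does not
attempt any IDNA handling of the host.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

# Characters left untouched by the conversion: the RFC 3986 reserved and
# unreserved punctuation, plus "%" so existing escapes are not re-encoded.
# (ASCII letters and digits are always left alone by quote().)
SAFE = "%:/?#[]@!$&'()*+,;=-._~"


def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: percent-encode the UTF-8 bytes of non-ASCII characters."""
    return quote(iri, safe=SAFE, encoding="utf-8")


def xml_escape(text: str) -> str:
    """Escape &, < and > for use in XML content; this is not URI escaping."""
    return escape(text)


iri = "http://www.test.org/caf\u00e9?widget=3&count=2"  # illustrative IRI
uri = iri_to_uri(iri)
print(uri)              # the URI form, after step 1 only
print(xml_escape(uri))  # the same URI as it would appear in XML content
```

The first step is what a format may or may not demand of its authors;
only the second has anything to do with XML, and it applies to any
character data, not just to addresses.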

http://www.google.com/webmasters/sitemaps/docs/en/protocol.html

Q: How do I XML-encode a URL?

To properly encode your URLs, follow the procedure recommended by the
HTML 4.0 specification, section B.2.1. Convert the string to UTF-8 and
then URL-escape the result. For details about Internationalized Resource
Identifiers, also see RFC2396 (sections 2.3 and 2.4) and RFC3987.

The following is an example python script for XML encoding a URL:

    $ python
    Python 2.2.2 (#1, Feb 24 2003, 19:13:11)
    >>> import xml.sax.saxutils
    >>> xml.sax.saxutils.escape("http://www.test.org/view?widget=3&count>2")

The encoded URL from the example above is:

    http://www.test.org/view?widget=3&amp;count&gt;2

Q: Does it matter which character encoding method I use to generate my Sitemap files?

Yes. Your Sitemap files must use UTF-8 encoding.


-- 
 Chris Lilley                    mailto:chris@w3.org
 Chair, W3C SVG Working Group
 W3C Graphics Activity Lead

Received on Wednesday, 8 June 2005 11:08:43 UTC