
"XML-encoded" as a misused term in Google sitemaps

From: Chris Lilley <chris@w3.org>
Date: Wed, 8 Jun 2005 13:08:28 +0200
Message-ID: <06861456.20050608130828@w3.org>
To: www-international@w3.org

Hello www-international,

I noticed the curious term "XML-encoded" on the Google sitemaps page.
The problem is that it encourages people to assume that XML requires an
IRI to be escaped to a URI. It would be better if Google used the
already defined terms, and made it clear that this escaping is a special
requirement of their particular format and not of XML in general. The
description of the escaping should cite RFC 3987 and not HTML 4 (which
is not even an XML format, and is not the defining instance of the
escape mechanism).

It should also refer to RFC 3986 and not 2396, of course.


Q: How do I XML-encode a URL?

To properly encode your URLs, follow the procedure recommended by the
HTML 4.0 specification, section B.2.1. Convert the string to UTF-8 and
then URL-escape the result. For details about Internationalized Resource
Identifiers, also see RFC2396 (sections 2.3 and 2.4) and RFC3987.
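
The step described above (convert to UTF-8, then URL-escape) corresponds
to the IRI-to-URI mapping in RFC 3987, section 3.1. A minimal Python 3
sketch follows. The function name iri_to_uri and the example.org address
are illustrative assumptions, and the sketch escapes only non-ASCII
characters, leaving out the optional IDNA conversion of the host name.

```python
def iri_to_uri(iri: str) -> str:
    """Percent-encode the UTF-8 bytes of every non-ASCII character.

    ASCII characters, including existing %XX escapes and reserved
    characters such as '?' and '&', are passed through unchanged.
    """
    parts = []
    for ch in iri:
        if ord(ch) < 128:
            parts.append(ch)
        else:
            # One %XX triplet per UTF-8 byte of the character.
            parts.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(parts)


# Hypothetical IRI containing U+00E9 (e with acute accent).
example_iri = "http://www.example.org/r\u00e9sum\u00e9?q=1&x=2"
example_uri = iri_to_uri(example_iri)
print(example_uri)
```

Nothing in this step is specific to XML: the '&' in the query string is
left as it is, because URI escaping and XML escaping are separate
operations.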

The following is an example python script for XML encoding a URL:

    $ python
    Python 2.2.2 (#1, Feb 24 2003, 19:13:11)
    >>> import xml.sax.saxutils
    >>> xml.sax.saxutils.escape("http://www.test.org/view?widget=3&count>2")

The encoded URL from the example above is:

    http://www.test.org/view?widget=3&amp;count&gt;2


Q: Does it matter which character encoding method I use to generate my Sitemap files?

Yes. Your Sitemap files must use UTF-8 encoding.
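
The quoted script performs XML character escaping only; it does no
URI escaping at all. A Python 3 sketch of that XML step, followed by
serialization as UTF-8, might look as follows. It assumes the URL has
already been converted to a URI, and the element name "loc" and the
helper name xml_escape_url are assumptions used for illustration.

```python
from xml.sax.saxutils import escape


def xml_escape_url(url: str) -> str:
    """Replace '&', '<' and '>' with entity references for use in XML text."""
    return escape(url)


url = "http://www.test.org/view?widget=3&count>2"
escaped = xml_escape_url(url)
print(escaped)

# Wrap the escaped text in a (hypothetical) element and serialize as UTF-8.
entry = "<loc>{}</loc>".format(escaped)
data = entry.encode("utf-8")
```

The escaping here is what any XML document requires for literal '&' and
'<' in character data. It is independent of whether the value is an IRI
or a URI.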

 Chris Lilley                    mailto:chris@w3.org
 Chair, W3C SVG Working Group
 W3C Graphics Activity Lead
Received on Wednesday, 8 June 2005 11:08:43 UTC
