
The term "XML-encoded" in Google sitemaps

From: Felix Sasaki <fsasaki@w3.org>
Date: Tue, 21 Jun 2005 17:48:44 +0900
To: w3c.amc+0+@google.com
Cc: chris@w3.org, www-international@w3.org
Message-ID: <op.ssppfibqx1753t@ibm-60d333fc0ec.w3.mag.keio.ac.jp>

Dear Adam,

This is Felix Sasaki from the i18n Activity of W3C. Chris Lilley recently
noticed some strange usage of the term "XML-encoded" in the
documentation of Google sitemaps. I have attached his mail below. Do you
know whom we should contact to get the documentation changed?

Thanks a lot for your help. Best, Felix.

MAIL FROM CHRIS TO www-international@w3.org:

I noticed this curious term "XML-encoded" on the Google sitemaps page.
The problem is that it encourages people to assume that XML requires an
IRI to be escaped to a URI. It would be
better if Google used the already defined terms, and made it clear that
this escaping is a special requirement of their particular format and
not of XML in general. The escaping should refer to RFC 3987 and not
HTML 4 (which is not even an XML format, and is not the defining
instance of the escape mechanism).

It should also refer to RFC 3986 and not 2396, of course.


Q: How do I XML-encode a URL?

To properly encode your URLs, follow the procedure recommended by the
HTML 4.0 specification, section B.2.1. Convert the string to UTF-8 and
then URL-escape the result. For details about Internationalized Resource
Identifiers, also see RFC2396 (sections 2.3 and 2.4) and RFC3987.
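
The two steps that the term "XML-encoded" runs together can be kept
apart in a short sketch. This is only an illustration in Python 3, not
the script from the Google page: the function names iri_to_uri and
xml_escape_url and the example.com address are made up for this
example, and the set of characters left unescaped is an assumption that
only approximates the IRI-to-URI mapping of RFC 3987.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

# Hypothetical helper: percent-escape the UTF-8 bytes of every character
# that is not plain URI syntax. "%" is kept so existing escapes survive.
def iri_to_uri(iri: str) -> str:
    return quote(iri, safe="%/:?#[]@!$&'()*+,;=", encoding="utf-8")

# Hypothetical helper: the separate XML step, which only replaces the
# markup characters "&", "<" and ">" with entity references.
def xml_escape_url(iri: str) -> str:
    return escape(iri_to_uri(iri))

if __name__ == "__main__":
    url = "http://example.com/r\u00e9sum\u00e9?a=1&b=2"
    print(iri_to_uri(url))      # URI form (percent-escaped)
    print(xml_escape_url(url))  # URI form as it would appear in XML text
```

Only the first function has anything to do with IRIs and URIs. The
second is ordinary XML character escaping and applies to any text
content, which is why the format-specific requirement is better
described with the existing terms.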

The following is an example python script for XML encoding a URL:

     $ python
     Python 2.2.2 (#1, Feb 24 2003, 19:13:11)
     >>> import xml.sax.saxutils

The encoded URL from the example above is:


Q: Does it matter which character encoding method I use to generate my  
Sitemap files?

Yes. Your Sitemap files must use UTF-8 encoding.
Received on Tuesday, 21 June 2005 08:48:54 UTC
