The term "XML-encoded" in Google sitemaps

From: Felix Sasaki <fsasaki@w3.org>
Date: Tue, 21 Jun 2005 17:48:44 +0900
To: w3c.amc+0+@google.com
Cc: chris@w3.org, www-international@w3.org
Message-ID: <op.ssppfibqx1753t@ibm-60d333fc0ec.w3.mag.keio.ac.jp>

Dear Adam,

This is Felix Sasaki from the i18n Activity of W3C. Chris Lilley recently
noticed some strange usage of the term "XML-encoded" in the
documentation of Google sitemaps. I have attached his mail below. Do you
know who we should contact to change the documentation?

Thanks a lot for your help. Best, Felix.


MAIL FROM CHRIS TO www-international@w3.org:

I noticed this curious term "XML-encoded"
http://www.google.com/webmasters/sitemaps/docs/en/protocol.html#faq_xml_encoding

on the Google sitemaps page. The problem is that it encourages people to
assume that XML requires an IRI to be escaped to a URI. It would be
better if Google used the already defined terms, and made it clear that
this escaping is a special requirement of their particular format and
not of XML in general. The escaping should refer to RFC 3987 and not
HTML 4 (which is not even an XML format, and is not the defining
instance of the escape mechanism).

It should also refer to RFC 3986 and not 2396, of course.

TEXT IN THE DOCUMENTATION:

Q: How do I XML-encode a URL?

To properly encode your URLs, follow the procedure recommended by the
HTML 4.0 specification, section B.2.1. Convert the string to UTF-8 and
then URL-escape the result. For details about Internationalized Resource
Identifiers, also see RFC2396 (sections 2.3 and 2.4) and RFC3987.

The following is an example python script for XML encoding a URL:

     $ python
     Python 2.2.2 (#1, Feb 24 2003, 19:13:11)
     >>> import xml.sax.saxutils
     >>> xml.sax.saxutils.escape("http://www.test.org/view?widget=3&count>2")

The encoded URL from the example above is:

     http://www.test.org/view?widget=3&amp;count&gt;2
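
The answer above runs two different operations together: percent-escaping
an IRI into a URI, and escaping XML markup characters in the result. A
minimal Python 3 sketch that keeps them apart might look as follows. The
function names are hypothetical, and the first step is deliberately
simplified: it only percent-encodes non-ASCII characters as UTF-8 bytes and
does not convert internationalized host names, which RFC 3987 treats
separately.

```python
from xml.sax.saxutils import escape


def percent_encode_non_ascii(iri):
    """Simplified IRI-to-URI step (hypothetical helper): percent-encode
    each non-ASCII character as its UTF-8 bytes, leave ASCII untouched."""
    out = []
    for ch in iri:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)


def sitemap_url_text(iri):
    """Hypothetical helper: URI escaping first, then XML escaping of
    &, < and > so the URI can be placed in XML character data."""
    return escape(percent_encode_non_ascii(iri))


if __name__ == "__main__":
    print(sitemap_url_text("http://www.example.org/caf\u00e9?a=1&b=2"))
```

Only the second step has anything to do with XML. The first is a property
of the identifier itself, which is why calling the whole procedure
"XML encoding" is misleading.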

Q: Does it matter which character encoding method I use to generate my  
Sitemap files?

Yes. Your Sitemap files must use UTF-8 encoding.
Received on Tuesday, 21 June 2005 08:48:54 GMT
