RE: encoding missing in xml declaration from Martin Gudgin on 2003-05-12 (xml-dist-app@w3.org from May 2003)

From: Martin Gudgin <mgudgin@microsoft.com>
Date: Mon, 12 May 2003 14:26:41 -0700
To: "Aman Singh" <haramansingh@hotmail.com>, <noah_mendelsohn@us.ibm.com>
Cc: <xml-dist-app@w3.org>, <xml-dist-app-request@w3.org>
Message-ID: <7C083876C492EB4BAAF6B3AE0732970E0B6DFB34@red-msg-08.redmond.corp.microsoft.com>
 

> -----Original Message-----
> From: Aman Singh [mailto:haramansingh@hotmail.com] 
> Sent: 12 May 2003 20:34
> To: noah_mendelsohn@us.ibm.com; Martin Gudgin
> Cc: xml-dist-app@w3.org; xml-dist-app-request@w3.org
> Subject: RE: encoding missing in xml declaration
> 
> Thank you for replying, and many thanks for the 
> clarification; however, I am still confused.
> 
> According to the XML 1.0 W3C Recommendation in Appendix F, 
> the following is
> stated:
> 
> --------------------------------------------------------------
> F.2 Priorities in the Presence of External Encoding Information
> 
> The second possible case occurs when the XML 
> entity is accompanied by encoding information, as in some 
> file systems and some network protocols. 
> When multiple sources of information are available, their 
> relative priority and the preferred method of handling 
> conflict should be specified as part of the higher-level 
> protocol used to deliver XML. In particular, please refer to 
> [IETF RFC 2376] or its successor, which defines the text/xml 
> and application/xml MIME types and provides some useful 
> guidance. In the interests of interoperability, however, the 
> following rule is recommended.
> 
> If an XML entity is in a file, the Byte-Order Mark and 
> encoding declaration are used (if present) to determine the 
> character encoding.
> --------------------------------------------------------------
> 
> I conducted an XML experiment of my own
> 
> Using Notepad (OS: Windows XP) I created two xml files with 
> the following
> content:
> <?xml version='1.0' ?>
> <root>&#197;Å</root>
> 
> I saved the first file with ANSI encoding and the other 
> as Unicode.
> Then I opened them up in Internet Explorer 6 (msxml 4 on the OS).
> 
> (Case 1) An error was received while opening the first file.

The encoding of the first file is ANSI, which I *think* is ISO-8859-1; hence, for an XML parser to interpret it correctly, it MUST have an XML declaration with an encoding value of ISO-8859-1 (or some recapitalization thereof).

> (Case 2) The second file opened fine.

Right, because it was encoded using UTF-16, began with a BOM, and was therefore interpretable automatically by the XML parser.

> 
> Another experiment (Case 3)
> When I add the following encoding attribute to the xml 
> declaration in Case 1 and save the file as ANSI, I get a 
> positive result.
> <?xml version="1.0" encoding="ISO-8859-1" ?>
> <root>&#197;Å</root>

Right. Case 1 was neither UTF-8 nor UTF-16, therefore an encoding attribute was required.
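
The three cases can be reproduced outside Notepad and IE. The following is a minimal sketch, assuming Python's xml.etree.ElementTree (backed by expat) as a stand-in for MSXML and ISO-8859-1 as a stand-in for Notepad's "ANSI"; the variable names and the try_parse helper are made up for illustration, and other parsers may word the error differently.

```python
import codecs
import xml.etree.ElementTree as ET

BODY = "<root>&#197;Å</root>"

# Case 1: no encoding declaration, bytes written in a single-byte Latin encoding.
case1 = ("<?xml version='1.0' ?>\n" + BODY).encode("iso-8859-1")

# Case 2: no encoding declaration, but UTF-16 bytes that start with a BOM.
case2 = codecs.BOM_UTF16_LE + ("<?xml version='1.0' ?>\n" + BODY).encode("utf-16-le")

# Case 3: same bytes as Case 1, plus an explicit encoding declaration.
case3 = ('<?xml version="1.0" encoding="ISO-8859-1" ?>\n' + BODY).encode("iso-8859-1")


def try_parse(data: bytes) -> str:
    """Return the root element's text, or the parser's error message."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError as exc:
        return f"error: {exc}"


for name, data in (("case 1", case1), ("case 2", case2), ("case 3", case3)):
    print(name, "->", try_parse(data))
```

Case 1 fails because the lone byte 0xC5 is not a legal UTF-8 sequence; Cases 2 and 3 both yield the text "ÅÅ".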

> 
> What I am trying to get at is what is really used to 
> determine the character encoding for SOAP, In Case 1, it was 
> the way file was saved and not the xml declaration, 

No, the way an XML parser figures out the encoding is to use the BOM and/or encoding attribute. If the XML resource was supplied over HTTP, then the charset parameter to Content-Type could be used in the absence of a BOM and encoding attribute. In both Case 1 and Case 2 your XML declaration said 'Hey XML parser, figure out the encoding for yourself, but it's either UTF-8 or UTF-16'.
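
As a rough illustration of that lookup order, here is a minimal sketch in Python. The function name guess_encoding and its http_charset parameter are hypothetical, the regular expression only handles declarations written in an ASCII-compatible encoding, and UTF-32 and EBCDIC detection (covered in Appendix F of the XML Recommendation) are left out. Real parsers do considerably more.

```python
import codecs
import re
from typing import Optional

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the entity. Only works when the declaration is ASCII-readable.
_ENCODING_DECL = re.compile(
    rb"^<\?xml[^>]*?\bencoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)


def guess_encoding(data: bytes, http_charset: Optional[str] = None) -> str:
    """Hypothetical helper: BOM first, then declaration, then HTTP charset."""
    # 1. A Byte-Order Mark identifies UTF-8 or UTF-16 directly.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # 2. Otherwise look for an encoding declaration.
    match = _ENCODING_DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. With neither present, fall back to the transport's charset, if any.
    if http_charset:
        return http_charset.lower()
    # 4. No information at all: the parser has to assume UTF-8.
    return "utf-8"


print(guess_encoding(b"<?xml version='1.0' ?><root/>"))
print(guess_encoding(b'<?xml version="1.0" encoding="ISO-8859-1" ?><root/>'))
```

The first call has nothing to go on and so assumes UTF-8, which is why the Case 1 file was rejected: its bytes were not legal UTF-8.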

> But the 
> encoding attribute did take precedence when it was added to 
> the xml declaration (Case 3).

Yup.

> 
> However, for SOAP, will it be the transport-level information (i.e. HTTP
> headers) that determines the encoding for the document, or the 
> xml declaration?

[1] seems to indicate that HTTP headers MAY take precedence over the XML declaration.
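
For the HTTP case, a receiver that lets the transport win might look something like the sketch below. It is only a sketch: the function names are hypothetical, it uses Python's standard email.message module to parse the Content-Type value, and it assumes that a charset parameter, when present, overrides whatever the XML declaration says, which [1] permits but does not require.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type value, if any."""
    msg = Message()
    msg["Content-Type"] = header_value
    # Returns the charset in lower case, or None when the parameter is absent.
    return msg.get_content_charset()


def decode_body(body: bytes, content_type: str) -> str:
    """Hypothetical receiver: a transport charset wins; otherwise assume UTF-8."""
    charset = charset_from_content_type(content_type)
    return body.decode(charset or "utf-8")


body = "<root>Å</root>".encode("iso-8859-1")
print(decode_body(body, "text/xml; charset=ISO-8859-1"))
```

Without the charset parameter the same bytes would fail to decode as UTF-8, which matches the Case 1 behaviour.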

> 
> Is the specification so ambiguous that it is left to the XML parser?

I don't think so. Everything else in the XML world works this way AFAIK.

> 
> In what context is the xml encoding to be used according to 
> the XML 1.0 Recommendation?

If the encoding is NOT UTF-8 or UTF-16 then an XML declaration MUST be present and the encoding attribute MUST appear. The relevant text from [1] is:

"It is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16."

> 
> I am sorry that I am still confused.  Given the experiments, 
> my confusion is justified ;)

I hope the above helps.

Gudge

[1] http://www.w3.org/TR/REC-xml#charencoding
Received on Monday, 12 May 2003 17:26:51 UTC