
RE: encoding missing in xml declaration

From: Aman Singh <haramansingh@hotmail.com>
Date: Tue, 13 May 2003 09:25:26 -0400 (EDT)
To: mgudgin@microsoft.com, noah_mendelsohn@us.ibm.com
Cc: xml-dist-app@w3.org, xml-dist-app-request@w3.org
Message-ID: <Sea1-F107RFapNuT1FM0000b85a@hotmail.com>




It all makes sense to me now: it has to do with byte order marks.

>>Case 1 was neither UTF-8 nor UTF-16, therefore an encoding attribute was 
>>required.

One can easily be misled by the verbosity of an XML fragment into thinking 
that, for case 1, the default encoding of UTF-8 would apply.

<?xml version='1.0' ?>
<root>&#197;</root>

By looking at the contents of the file, I would never think that an encoding 
attribute is required.
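
A minimal Python sketch of why case 1 fails, assuming the ANSI file stores
the character as the single byte 0xC5 (as ISO-8859-1 would) and that a parser
with no encoding attribute to go on looks only for a BOM and otherwise falls
back to UTF-8. The helper names are hypothetical, and this is not a
description of how MSXML is implemented:

```python
import codecs


def sniff_without_declaration(data: bytes) -> str:
    """Guess the encoding of an XML entity that has no encoding attribute.

    Hypothetical helper: only the BOM is inspected; with no BOM the
    XML 1.0 default of UTF-8 is assumed.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    return "utf-8"


def try_decode(data: bytes):
    """Return the decoded text, or None if the bytes do not fit the guess."""
    try:
        return data.decode(sniff_without_declaration(data))
    except UnicodeDecodeError:
        return None


# The fragment from case 1, with the character written out (U+00C5).
doc = "<?xml version='1.0' ?>\n<root>\u00c5</root>"

ansi_bytes = doc.encode("iso-8859-1")   # assumption: what "ANSI" produced
unicode_bytes = doc.encode("utf-16")    # Python prepends a BOM here

print(try_decode(ansi_bytes))
print(try_decode(unicode_bytes) == doc)
```

Here `try_decode(ansi_bytes)` gives `None`, because the byte 0xC5 followed by
`<` is not a legal UTF-8 sequence, while the BOM-prefixed UTF-16 bytes decode
cleanly.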

Thanks again for your time.

Best Regards,

aman singh

>From: "Martin Gudgin" <mgudgin@microsoft.com>
>To: "Aman Singh" <haramansingh@hotmail.com>,<noah_mendelsohn@us.ibm.com>
>CC: <xml-dist-app@w3.org>,<xml-dist-app-request@w3.org>
>Subject: RE: encoding missing in xml declaration
>Date: Mon, 12 May 2003 14:26:41 -0700
>
>
>
> > -----Original Message-----
> > From: Aman Singh [mailto:haramansingh@hotmail.com]
> > Sent: 12 May 2003 20:34
> > To: noah_mendelsohn@us.ibm.com; Martin Gudgin
> > Cc: xml-dist-app@w3.org; xml-dist-app-request@w3.org
> > Subject: RE: encoding missing in xml declaration
> >
> > Thank you for replying, and many thanks for the
> > clarification; however, I am still confused.
> >
> > According to the XML 1.0 W3C Recommendation in Appendix F,
> > the following is
> > stated:
> >
> > --------------------------------------------------------------
> > F.2 Priorities in the Presence of External Encoding Information
> >
> > The second possible case occurs when the XML entity is accompanied
> > by encoding information, as in some file systems and some network
> > protocols.
> > When multiple sources of information are available, their
> > relative priority and the preferred method of handling
> > conflict should be specified as part of the higher-level
> > protocol used to deliver XML. In particular, please refer to
> > [IETF RFC 2376] or its successor, which defines the text/xml
> > and application/xml MIME types and provides some useful
> > guidance. In the interests of interoperability, however, the
> > following rule is recommended.
> >
> > If an XML entity is in a file, the Byte-Order Mark and
> > encoding declaration are used (if present) to determine the
> > character encoding.
> > --------------------------------------------------------------
> >
> > I conducted an XML experiment of my own.
> >
> > Using Notepad (OS: Windows XP) I created two xml files with
> > the following
> > content:
> > <?xml version='1.0' ?>
> > <root></root>
> >
> > I saved the first file with ANSI encoding and the other
> > as Unicode.
> > Then I opened them up in Internet Explorer 6 (msxml 4 on the OS).
> >
> > (Case 1) An error was received while opening the first file.
>
>The encoding of the first file is ANSI, which, I *think*, is ISO-8859-1; 
>hence, for an XML parser to interpret it correctly, it MUST have an XML 
>declaration with an encoding attribute whose value is ISO-8859-1 (or some 
>recapitalization thereof).
>
> > (Case 2) The second file opened fine.
>
>Right, because it was encoded using UTF-16, began with a BOM and was 
>interpretable automatically by the XML parser.
>
> >
> > Another experiment (Case 3)
> > When I add the following encoding attribute to the xml
> > declaration in Case 1 and save the file as ANSI, I get a
> > positive result.
> > <?xml version="1.0" encoding="ISO-8859-1" ?>
> > <root></root>
>
>Right. Case 1 was neither UTF-8 nor UTF-16, therefore an encoding attribute 
>was required.
>
> >
> > What I am trying to get at is what is really used to
> > determine the character encoding for SOAP. In Case 1, it was
> > the way the file was saved and not the XML declaration,
>
>No, the way an XML parser figures out the encoding is to use the BOM and/or 
>encoding attribute. If the XML resource was supplied over HTTP, then the 
>charset parameter to Content-Type could be used in the absence of a BOM and 
>encoding attribute. In both cases 1 and 2 your XML declaration said 'Hey XML 
>parser, figure out the encoding for yourself, but it's either UTF-8 or 
>UTF-16'.
>
> > But the
> > encoding attribute did take precedence when it was added to
> > the XML declaration (Case 3).
>
>Yup.
>
> >
> > However, for SOAP, will it be the transport-level information (i.e., HTTP
> > headers) that determines the encoding for the document, or the
> > XML declaration?
>
>[1] seems to indicate that HTTP headers MAY take precedence over XML 
>declaration.
>
> >
> > Is the specification ambiguous, such that it is left to the XML parser?
>
>I don't think so. Everything else in the XML world works this way AFAIK.
>
> >
> > In what context is the xml encoding to be used according to
> > the XML 1.0 Recommendation?
>
>If the encoding is NOT UTF-8 or UTF-16 then an XML declaration MUST be 
>present and the encoding attribute MUST appear. The relevant text from [1] 
>is
>
>"It is also a fatal error if an XML entity contains no encoding declaration 
>and its content is not legal UTF-8 or UTF-16."
>
> >
> > I am sorry that I am still confused.  Given the experiments,
> > my confusion is justified ;)
>
>I hope the above helps.
>
>Gudge
>
>[1] http://www.w3.org/TR/REC-xml#charencoding
>
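
The precedence discussed above could be sketched as follows in Python. This
is only an illustration under stated assumptions: choose_encoding and its
http_wins flag are hypothetical names, the declaration is scanned with a
simple regular expression that only works for ASCII-compatible encodings, and
whether the HTTP charset really overrides the document is exactly the open
question in this thread:

```python
import codecs
import re
from typing import Optional

# Matches the encoding attribute of an XML declaration at the start of the data.
_DECL = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)

_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def choose_encoding(data: bytes,
                    http_charset: Optional[str] = None,
                    http_wins: bool = True) -> str:
    """Pick an encoding name for an XML entity (hypothetical helper).

    http_wins=True models "HTTP headers MAY take precedence";
    http_wins=False models using the charset only when the document
    carries neither a BOM nor an encoding attribute.
    """
    if http_charset and http_wins:
        return http_charset.lower()
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    if http_charset:
        return http_charset.lower()
    return "utf-8"  # no BOM, no attribute, no charset: the XML 1.0 default


case1 = b"<?xml version='1.0' ?>\n<root></root>"
case3 = b'<?xml version="1.0" encoding="ISO-8859-1" ?> <root></root>'

print(choose_encoding(case1))
print(choose_encoding(case3))
print(choose_encoding(case3, http_charset="utf-8"))
```

With no HTTP charset, case 1 falls through to the UTF-8 default and case 3
picks up the value of its encoding attribute; passing a charset with
http_wins left at True overrides the attribute.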

Received on Tuesday, 13 May 2003 09:54:34 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Monday, 7 December 2009 10:59:14 GMT