Re: MIME type for XML

Terry Allen <tallen@fsc.fujitsu.com> writes:
>RFC 1874 on SGML Media Types defines both text and application
>for SGML, with some language that may or may not be relevant;
>the main idea appears to be to provide fallback to text/plain
>from text/sgml.
>
>ftp://ds.internic.net/rfc/rfc1874.txt

Got it -- thanks Terry. The fact that the RFC states a preference for
US-ASCII was kinda my point. Since the default charset for XML is not
US-ASCII, I don't think the assumption in RFC 1874 is valid or useful for
XML. And yes, I think the RFC should be changed, not XML. From the XML 1.0
spec:

    This specification depends on the international standard ISO/IEC 10646
    and the technically identical Unicode Standard, Version 2.0, which
    define the encodings and meanings of the characters which make up
    XML text data.

Relevant quotes from RFC 1874 follow. Section 2.1 describes text/sgml as
the type to use when the SGML entity is meant to be human-readable:

  2.1.  Text/SGML

         MIME type name:          Text
         MIME subtype name:       SGML
         Required parameters:     none
         Optional parameters:     charset, SGML-bctf, SGML-boot
         Encoding considerations: may be encoded
         Security considerations: see section 4 below
         Published specification: ISO 8879:1986
         Person and email address to contact for further information:
                                  E. Levinson <ELevinson@Accurate.com>

   The Text/SGML media-type can be employed when the contents of the
   SGML entity is intended to be read by a human and is in a readily
   comprehensible form.  That is the content can be easily discerned by
   someone without SGML display software.  Each record in the SGML
   entity, delimited by record start (RS) and record end (RE) codes,
   must correspond to a line in the Text/SGML body part.

   SGML entities that do not meet the above requirements should use the
   Application/SGML media-type.


A document in UCS-4 Arabic is certainly intended to be read by a human. The
problem doesn't seem to be the use of RS and RE per se; it's their
transformation into multibyte Unicode equivalents.
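
To make the RS/RE point concrete, here is a rough sketch in Python. It
assumes the SGML reference concrete syntax, where RS is LF (10) and RE is
CR (13), and uses Python's utf-32-be codec as a stand-in for big-endian
UCS-4. The function name and the sample record are made up for
illustration.

```python
# Assumption: reference concrete syntax, RS = LF (10), RE = CR (13).
RS, RE = "\n", "\r"


def encoded_record(text, codec):
    """Encode one SGML record as RS + text + RE in the given codec."""
    return (RS + text + RE).encode(codec)


# Hypothetical one-line record, encoded both ways.
ascii_bytes = encoded_record("<p>hi</p>", "ascii")
# Assumption: utf-32-be stands in for big-endian UCS-4.
ucs4_bytes = encoded_record("<p>hi</p>", "utf-32-be")

# Each delimiter is one octet in US-ASCII but four octets in UCS-4.
print(len(RS.encode("ascii")), len(RS.encode("utf-32-be")))
# The first and last four octets of the UCS-4 record are the padded RS and RE.
print(ucs4_bytes[:4], ucs4_bytes[-4:])
```

A line-oriented reader that expects single-octet delimiters would see the
NUL padding octets around every RS and RE, which is where the fallback to
text/plain stops being "readily comprehensible".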

[...describing the 'charset' parameter...]

       charset     The charset parameter for Text/SGML is defined in
                   [RFC-1521], the valid values and their meaning are
                   registered by the Internet Assigned Numbers
                   Authority (IANA) [RFC-1590].  The default charset
                   value for all Text content-types is "us-ascii"
                   [RFC-1521].

                   The charset parameter is provided to permit non-
                   SGML capable systems to provide reasonable
                   behavior when Text/SGML defaults to Text/Plain.
                   SGML capable systems will use the SGML-bctf param-
                   eter.
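
As a small illustration of that default, the sketch below uses Python's
standard email package to show what charset a MIME-only reader would
assume for a text/sgml body part. The helper name and the two sample
headers are hypothetical; the "us-ascii" fallback is the RFC 1521 default
quoted above.

```python
from email import message_from_string


def effective_charset(raw_headers):
    """Return the charset a MIME-only reader would assume for a body part."""
    # Parse a header block with an empty body.
    msg = message_from_string(raw_headers + "\n\n")
    # Fall back to the RFC 1521 default for text/* when no charset is given.
    return msg.get_content_charset(failobj="us-ascii")


# Hypothetical headers: one without a charset parameter, one with.
print(effective_charset("Content-Type: text/sgml"))
print(effective_charset("Content-Type: text/sgml; charset=ISO-10646-UCS-4"))
```

With no charset parameter, the reader falls back to us-ascii, so a UCS-4
document has to label itself explicitly to be read correctly under the
current rules.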


What needs changing is the definition of MIME 'text/*', from ISO 646 to
ISO 10646, rather than moving a UCS-4 document instance into an
'application/*' MIME type. Otherwise, MIME is inextricably bound to
US-ASCII, which seems a mistake. I'm sure someone more qualified than I
has argued this out in the MIME/SGML WGs.

XML may simply be among the first applications requiring this type of i18n
modification to what are gradually becoming outdated specs.

Murray

```````````````````````````````````````````````````````````````````````````````
    Murray Altheim, Program Manager
    Spyglass, Inc., Cambridge, Massachusetts
    email: <mailto:murray@spyglass.com>
    http:  <http://www.cm.spyglass.com/murray/murray.html>
           "Give a monkey the tools and he'll eventually build a typewriter."

Received on Monday, 2 December 1996 20:47:43 UTC