Re: application/ttml+xml charset and encoding from Glenn Adams on 2013-01-20 (public-tt@w3.org from January 2013)

From: Glenn Adams <glenn@skynav.com>
Date: Sun, 20 Jan 2013 11:50:07 -0700
To: Michael Dolan <mdolan@newtbt.com>
Cc: public-tt@w3.org
Message-ID: <CACQ=j+cxE5EMTiLJevTszewJ_yoT4q659WYonVSx66H_K1VfGQ@mail.gmail.com>
On Thu, Jan 17, 2013 at 9:28 AM, Michael Dolan <mdolan@newtbt.com> wrote:

> This is a bit off-topic, especially in light of our decision today, but…
>
> Unfortunately, in addition to encoding values, 3023 also uses the charset
> values of “iso-2022-jp” and “8859-1”; and the default value of charset (for
> text/xml) is “us-ascii”, so the semantics of the field are quite clearly
> mixed up in 3023, and it is not just that the field name should have been
> “encoding”.
>

8859-1 and us-ascii are labels for both character sets and encodings.
iso-2022-jp is a label for an encoding that supports multiple character
sets.
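
As a rough illustration of that distinction (a sketch only, using Python's
built-in codecs as a stand-in; the sample strings and the helper name
encode_with_label are made up for the example): under the iso-2022-jp label,
escape sequences switch between the ASCII and JIS X 0208 character sets,
whereas iso-8859-1 maps one fixed repertoire to single bytes.

```python
# Hypothetical helper: return the bytes a charset label produces for a string.
def encode_with_label(text: str, label: str) -> bytes:
    return text.encode(label)


# ASCII text followed by two kanji that belong to JIS X 0208.
mixed = "TTML 字幕"

jp = encode_with_label(mixed, "iso-2022-jp")
# The ASCII prefix is emitted as-is; ESC $ B then designates JIS X 0208,
# and ESC ( B designates ASCII again at the end of the string.
print(jp)

# iso-8859-1 names a single repertoire and a single-byte encoding of it.
latin = encode_with_label("café", "iso-8859-1")
print(latin)
```

The escape sequences in the first result are what "an encoding that supports
multiple character sets" looks like on the wire.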

Historically, there has never been a careful distinction between character
set (repertoire) and encoding. So I think you are attempting to ask for
something (clarity of usage) that has never been there.


> Can a decoder arrive at the right answer most of the time with such an
> overloading of this field (which I think is your point)?  Yes.
>

I'm not sure what you are asking. If charset is set to any of these three
values { 8859-1, iso-2022-jp, us-ascii }, there is no ambiguity about what
is to be decoded.

If the author or transport misspecified the charset, then that is a
different problem.
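
A minimal sketch of the two cases, again assuming Python's codec registry as
the decoder (decode_entity is a hypothetical name, and the registered name
iso-8859-1 stands in for 8859-1):

```python
def decode_entity(payload: bytes, charset: str) -> str:
    """Decode an entity body using the charset label that accompanied it."""
    return payload.decode(charset)


body = "café".encode("iso-8859-1")  # b"caf\xe9"

# Correctly labelled: the label alone determines how the bytes are decoded.
print(decode_entity(body, "iso-8859-1"))

# Mislabelled as us-ascii: byte 0xE9 is outside ASCII, so decoding fails.
# The label was wrong; the meaning of the label itself was never in doubt.
try:
    decode_entity(body, "us-ascii")
except UnicodeDecodeError as err:
    print("mislabelled entity:", err.reason)
```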


>                 Mike
>
> *From:* Glenn Adams [mailto:glenn@skynav.com]
> *Sent:* Thursday, January 17, 2013 8:04 AM
>
> *To:* Michael Dolan
> *Cc:* public-tt@w3.org
> *Subject:* Re: application/ttml+xml charset and encoding
>
> a similar equivalence is indicated in HTML5, see
>
> http://www.w3.org/TR/html5/single-page.html#attr-meta-http-equiv-content-type
>
> On Thu, Jan 17, 2013 at 7:47 AM, Glenn Adams <glenn@skynav.com> wrote:
>
> the "charset" parameter of MIME types and the "encoding" parameter for the
> XML declaration are effectively (if not identically) synonymous; in general
> (but not always) they map to both a character repertoire (a character set)
> and an on the wire encoding of strings that employ that repertoire
>
> On Wed, Jan 16, 2013 at 2:00 PM, Michael Dolan <mdolan@newtbt.com> wrote:
>
> (per my AI and for discussion in tomorrow’s meeting)
>
> TTML 1.0 defines a media type “application/ttml+xml” in Appendix C:
>
> https://dvcs.w3.org/hg/ttml/raw-file/tip/ttml10/spec/ttaf1-dfxp.html#media-type-registration
>
> We are in the process of submitting a media type registration to IANA,
> revised from what was published in TTML 1.0.
>
> Both the parameter and encoding considerations sections of the
> registration refer to “application/xml” (section 3.2) defined in RFC 3023,
> “XML Media Types”:
>
> http://www.rfc-editor.org/rfc/rfc3023.txt
>
> An optional charset parameter is defined.  The value of charset is
> entirely unconstrained.  RFC 3023 seems to mix charset (e.g. 8859-1) with
> encoding (e.g. utf-8) which adds a layer of confusion.
>
> Character encoding requirements for XML in general are in section 4.3.3 of
> the XML 1.0 spec (RFC 3023 cites XML 1.0):
>
> http://www.w3.org/TR/REC-xml/#charencoding
>
> and optionally, the algorithm defined in the informative Appendix F:
>
> http://www.w3.org/TR/REC-xml/#sec-guessing
>
> There are a variety of scenarios for which the charset/encoding cannot be
> determined. So, in the end, there is no deterministic way to deduce the
> charset/encoding from the file alone.  The media type charset parameter or
> some other external signaling means is required.  Most file systems do not
> include this metadata.  This makes file exchange problematic.
>
> RFC 3023 makes some specific comments and recommendations in this area:
>
> Although listed as an optional parameter, the use of the charset parameter
> is STRONGLY RECOMMENDED, since this information can be used by XML
> processors to determine authoritatively the charset of the XML MIME entity.
>
> "utf-8" [RFC2279] and "utf-16" [RFC2781] are the recommended values,
> representing the UTF-8 and UTF-16 charsets, respectively.  These charsets
> are preferred since they are supported by all conforming processors of
> [XML].
>
> I recommend that TTWG follow the RFC 3023 recommendation and clarify that
> the “application/ttml+xml” media type be constrained to utf-8 and utf-16
> encoding only. Given the mixing of semantics for “charset”, I recommend we
> remain silent on that optional parameter, since, with this constraint,
> explicit signaling is not required. The other encoding considerations of RFC
> 3023 still apply.
>
> Regards,
>
>                 Mike
>
> Michael A DOLAN
>
> TBT, Inc.    PO Box 190
>
> Del Mar, CA 92014
>
> (m) 858-882-7497
>
Received on Sunday, 20 January 2013 18:50:56 UTC