Re: internet media types and encoding

On Friday, April 11, 2003, 4:28:29 PM, Paul wrote:


PG> At 19:20 2003 04 11 +1000, Rick Jelliffe wrote:
>>XML can, nothing else can, we need it, it is possible, therefore XML should.
>>Have any users requested to the XML Core WG that XML should be
>>made less reliable?

PG> No, but we're not talking about making it less reliable, we're talking
PG> about leaving it as reliable (in this area) as it is currently in XML 1.0.

PG> And many users have requested backward compatibility with XML 1.0.

PG> paul

PG> p.s.  Despite my arguments, I'm still not sure what the right answer is.
PG> But personally, I'd like to hear cost/benefit analyses from folks on both
PG> sides, or this will likely be decided merely by intensity of discussion.

Well, see my comments about what that area actually contains.

http://lists.w3.org/Archives/Public/www-tag/2003Apr/0074.html

Backwards compatibility is all very well in general; but backwards
compatibility with stuff that was either

a) errors, or
b) stuff people had no business doing

does not rate very highly. It's not so much decreased backwards
compatibility as removing an inadvertent loophole.

Unlike Rick, I am not making this argument on the basis of the ease of
detecting encoding labelling or conversion errors; rather, on the
basis of those non-printing characters having no business being in a
marked up document. I mean, start of string? end of guarded area?

I think that Unicode Technical Report #20 agrees with me:

Unicode in XML and other Markup Languages
Unicode Technical Report #20
W3C Note 18 February 2002
http://www.w3.org/TR/unicode-xml/

see in particular

2.2 Overlap of Control Code and Markup Semantics
http://www.w3.org/TR/unicode-xml/#Overlap

> When markup is not available, plain text may require control
> characters. This is usually the case where plain text must contain
> some scoping or attribute information in order to be legible, i.e.
> to be able to transmit the same content between originator and
> receiver. Many of these control characters have direct equivalents
> in particular markup languages, since markup handles these concerns
> efficiently. If both characters and their markup equivalents may be
> present in the same text, the question of priority is raised.
> Therefore it is important to identify and resolve these ambiguities
> at the time markup is first applied.


PG> [1] http://lists.w3.org/Archives/Member/chairs/2002JulSep/0128

PG> [2] To quote from [1], it said:

PG> The removal of direct representation of control characters in the range
PG> #x7F-#x9F represents a change in well-formedness. That is, well-formed
PG> XML 1.0 documents which contain these characters do not become
PG> well-formed XML 1.1 documents simply by changing their version number.
PG> Occurrences of control characters must also be converted to numeric
PG> character references.

Yes. And as you say, it's an evaluation of the cost/benefit ratio. The
number of such documents is not very large; of that number, the vast
majority are erroneous, with an incorrectly labelled encoding, and will
be *helped* by being made not well-formed. They will be noticed and
fixed, and the correct codepoints used for euro and typographic quotes
and so forth, rather than some software ignoring the control codes and
some silently fixing it up in a 'we know you really meant windows code
page 1252' manner.
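
To make the mislabelling case concrete, here is a minimal sketch in
Python (my own illustration, not something from the XML 1.1 draft). It
assumes a document that was really written in windows-1252 but labelled
as ISO-8859-1; the function name c1_controls is hypothetical. Decoding
with the wrong label turns the euro and the typographic quotes into
characters in the #x7F-#x9F range, which is exactly what a checker
would then flag.

```python
def c1_controls(text):
    """Return (offset, code point) pairs for characters in U+007F..U+009F."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if 0x7F <= ord(ch) <= 0x9F
    ]


# Bytes as written by a windows-1252 editor: euro sign and curly quotes.
raw = "\u20ac5 \u201cquoted\u201d".encode("cp1252")

# Decoded under the (wrong) ISO-8859-1 label, those bytes become controls.
mislabelled = raw.decode("iso-8859-1")

# Decoded under the encoding actually used, no controls appear.
correct = raw.decode("cp1252")

print(c1_controls(mislabelled))
print(c1_controls(correct))
```

With the wrong label the three special characters show up as U+0080,
U+0093 and U+0094; with the right one the list is empty.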

The rest, a very small number, have no business using those control
codes and are a security risk in terms of setting terminals into odd
configurations. And such bogus use is still permitted, as long as
people really do want it, by escaping the odious control characters.
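
For those who really do want them, the escaping step could look roughly
like the sketch below (Python again; escape_c1 is a hypothetical helper,
not part of any XML library). It assumes the text is ordinary character
data or an attribute value that has already been decoded correctly, and
it takes the #x7F-#x9F range from the wording quoted in this thread. It
makes no attempt to handle CDATA sections, comments or names, where
character references are not recognised.

```python
import re

# Characters in the range #x7F-#x9F, as quoted in this thread.
_C1_RANGE = re.compile("[\x7f-\x9f]")


def escape_c1(text):
    """Replace each character in #x7F-#x9F with a numeric character reference."""
    return _C1_RANGE.sub(lambda m: f"&#x{ord(m.group()):X};", text)


print(escape_c1("start\x98string\x9c"))
```

Everything outside that range passes through untouched, so running the
function twice gives the same result as running it once.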


PG> As a criterion for exiting CR, the XML Core WG will collect evidence
PG> substantiating (or contradicting) our opinion that:

PG> 1) converting characters in the #x7F-#x9F range to numeric
PG>    character references while updating XML 1.0 documents to XML 1.1 does
PG>    not represent a significant obstacle to adoption of XML 1.1;

I concur with this observation.

PG> 2) there are no significant scenarios where converting characters
PG>    in the #x7F-#x9F range to numeric character references is impractical or
PG>    impossible;

Yes

PG> 3) that the benefits of this change to the proper detection of
PG>    character encoding represent a significant improvement in
PG>    interoperability.

Yes.

-- 
 Chris                            mailto:chris@w3.org

Received on Friday, 11 April 2003 12:45:54 UTC