- From: Rick Jelliffe <ricko@allette.com.au>
- Date: Mon, 9 Jul 2001 20:52:57 +0800
- To: <www-xml-blueberry-comments@w3.org>
(Extracted from a post to XML-DEV. Gist: does MIME text/* actually forbid NEL, and if it does, should XML support NEL?)

From John Cowan:

> <flame>If XML had insisted that the
> One True Representation of line-end is LF, and XML processors were
> passing through every CR in character content and coughing on every CR
> in markup, don't you think the situation would have been changed P.D.Q.?
> Justice delayed is justice denied, but better than justice denied
> forever.</flame>

But the change in XML was an explicit simplification of SGML's rules, where there are Record Start and Record End signals implied into the document, and the entity management maps them from the incoming conventions. I thought the reason this simplification was possible was because of the requirements of sending XML over HTTP, which is the single most important use-case for "SGML on the Web".

Where does HTTP fit into this?

"When in canonical form, media subtypes of the "text" type use CRLF as the text line break. HTTP relaxes this requirement and allows the transport of text media with plain CR or LF alone representing a line break when it is done consistently for an entire entity-body. HTTP applications MUST accept CRLF, bare CR, and bare LF as being representative of a line break in text media received via HTTP. In addition, if the text is represented in a character set that does not use octets 13 and 10 for CR and LF respectively, as is the case for some multi-byte character sets, HTTP allows the use of whatever octet sequences are defined by that character set to represent the equivalent of CR and LF for line breaks. This flexibility regarding line breaks applies only to text media in the entity-body; a bare CR or LF MUST NOT be substituted for CRLF within any of the HTTP control structures (such as header fields and multipart boundaries)."
http://www.ietf.org/rfc/rfc2068.txt 3.7.1 (which is referenced by MIME types in XML http://www.ietf.org/rfc/rfc2376.txt which is informatively referenced by XML 1.0 2e.)

Because XML is "SGML on the WWW", any requirements imposed by HTTP must be weighted very heavily (and, indeed, it would be a mistake for XML to do anything counter to HTTP.)

So I don't think there is any need for anyone to explode in flames. XML's rules are aimed at trying to be consonant with HTTP 1.1, which says clearly that the MIME rule for text/* is CRLF, but that HTTP allows relaxing of this. XML supports HTTP's relaxing, and so allows a multiplicity of mappings.

What seems quite clear from that passage is that, due to requirements inherited from HTTP, the responsibility for mapping from non-CRLF line breaks to CRLF line breaks (as required by MIME) lies with the sending system, not the receiving XML processor.

Section 19.4.1 is also relevant:

"19.4.1 Conversion to Canonical Form

MIME requires that an Internet mail entity be converted to canonical form prior to being transferred. Section 3.7.1 of this document describes the forms allowed for subtypes of the "text" media type when transmitted over HTTP. MIME requires that content with a type of "text" represent line breaks as CRLF and forbids the use of CR or LF outside of line break sequences. HTTP allows CRLF, bare CR, and bare LF to indicate a line break within text content when a message is transmitted over HTTP.

Where it is possible, a proxy or gateway from HTTP to a strict MIME environment SHOULD translate all line breaks within the text media types described in section 3.7.1 of this document to the MIME canonical form of CRLF. Note, however, that this may be complicated by the presence of a Content-Encoding and by the fact that HTTP allows the use of some character sets which do not use octets 13 and 10 to represent CR and LF, as is the case for some multi-byte character sets."
Note that the last sentence of section 19.4.1 refers, I believe, to cases where multi-byte encodings contain an octet 13 or 10 within them, rather than to CR and LF being at other code points.

In the MIME RFC http://www.ietf.org/rfc/rfc2046.txt we see

"4.1.1. Representation of Line Breaks

The canonical form of any MIME "text" subtype MUST always represent a line break as a CRLF sequence. Similarly, any occurrence of CRLF in MIME "text" MUST represent a line break. Use of CR and LF outside of line break sequences is also forbidden.

This rule applies regardless of format or character set or sets involved.

NOTE: The proper interpretation of line breaks when a body is displayed depends on the media type. In particular, while it is appropriate to treat a line break as a transition to a new line when displaying a "text/plain" body, this treatment is actually incorrect for other subtypes of "text" like "text/enriched" [RFC-1896]. Similarly, whether or not line breaks should be added during display operations is also a function of the media type. It should not be necessary to add any line breaks to display "text/plain" correctly, whereas proper display of "text/enriched" requires the appropriate addition of line breaks.

NOTE: Some protocols defines a maximum line length. E.g. SMTP [RFC-821] allows a maximum of 998 octets before the next CRLF sequence. To be transported by such protocols, data which includes too long segments without CRLF sequences must be encoded with a suitable content-transfer-encoding."

So the IBM character MUST NOT be used as a replacement for CRLF as a line break. If it is serving as a replacement, it MUST be mapped at the server end. If it is acting as some different character that is not a newline, why are we considering it?

Actually, I am going too far: it only means that any XML that is sent representing newlines using the IBM character rather than CR and/or LF must be sent application/*.
But it would be a bad design error to introduce a class of XML documents that can only be sent application/*, I suspect.

> > 2) state that "XML processors may, at user option, if they detect the
> > IBM newline or any other visual white-space in markup, element content
> > or in an entity/XML declaration, replace the characters with LF, as a
> > matter of entity management."
>
> That is what Blueberry does, except that the "user option" is expressed
> in the document, not by some out-of-band means. This is plausible,
> since it is the document creator who knows whether NEL, or post-2.0 name
> characters, or both, are being used.

I thought the proposal was to allow NEL as a distinct character from CRLF to also act as whitespace. This is different from replacing it with LF.

Cheers
Rick Jelliffe
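
[Illustrative note, not part of the original post.] The sender-side mapping argued for above (the sending system, not the receiving XML processor, converts line breaks to the MIME canonical CRLF form) can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions: the function names `to_mime_canonical` and `is_mime_canonical` and the `map_nel` switch are hypothetical, it works on already-decoded text rather than on octets, and it takes the "IBM newline" to be U+0085 (NEL).

```python
import re

NEL = "\u0085"  # assumed code point for the "IBM newline" (NEXT LINE)

# CRLF is listed first so that it counts as one line break, not two.
_BREAKS = re.compile("\r\n|\r|\n")
_BREAKS_WITH_NEL = re.compile("\r\n|\r|\n|" + NEL)


def to_mime_canonical(text, map_nel=False):
    """Rewrite every line break as CRLF, as MIME text/* canonical form requires.

    With map_nel=True, NEL is also treated as a line break and mapped to CRLF
    (the "mapped at the server end" case). With map_nel=False it is left
    alone, i.e. treated as some other character.
    """
    pattern = _BREAKS_WITH_NEL if map_nel else _BREAKS
    return pattern.sub("\r\n", text)


def is_mime_canonical(text):
    """Return True if the text has no CR or LF outside a CRLF sequence."""
    return re.search("\r(?!\n)|(?<!\r)\n", text) is None
```

A sender would call something like `to_mime_canonical(body, map_nel=True)` before serving the document as text/xml. Whether NEL should be mapped at all is exactly the open question in this thread, which is why the sketch makes it a switch and does not decide it.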
Received on Monday, 9 July 2001 06:47:00 UTC