Re: JTidy new line processing

"Andy Quick" <ac.quick@sympatico.ca> wrote:

>I don't see the "consistency" requirement where the line-ends
>must be consistent throughout the document.

See RFC 2616, "Hypertext Transfer Protocol -- HTTP/1.1",
<ftp://ftp.isi.edu/in-notes/rfc2616.txt>:

>3 Protocol Parameters
>3.7 Media Types
>3.7.1 Canonicalization and Text Defaults
>
>   Internet media types are registered with a canonical form. An
>   entity-body transferred via HTTP messages MUST be represented in the
>   appropriate canonical form prior to its transmission except for
>   "text" types, as defined in the next paragraph.
>
>   When in canonical form, media subtypes of the "text" type use CRLF as
>   the text line break. HTTP relaxes this requirement and allows the
>   transport of text media with plain CR or LF alone representing a line
>   break when it is done consistently for an entire entity-body. HTTP
>   applications MUST accept CRLF, bare CR, and bare LF as being
>   representative of a line break in text media received via HTTP. In
>   addition, if the text is represented in a character set that does not
>   use octets 13 and 10 for CR and LF respectively, as is the case for
>   some multi-byte character sets, HTTP allows the use of whatever octet
>   sequences are defined by that character set to represent the
>   equivalent of CR and LF for line breaks. This flexibility regarding
>   line breaks applies only to text media in the entity-body; a bare CR
>   or LF MUST NOT be substituted for CRLF within any of the HTTP control
>   structures (such as header fields and multipart boundaries).

I point out: "HTTP... allows the transport of text media with plain CR or
LF alone representing a line break WHEN IT IS DONE CONSISTENTLY FOR AN
ENTIRE ENTITY-BODY."  The entity-body is effectively the entire text/*
document being served (text/html, for example).  It excludes the HTTP
headers that precede it and any multipart boundaries, both of which must
use CRLF.  CRLF is also the canonical line break for text media; a bare CR
or a bare LF is a relaxation that HTTP merely allows.

If the pages are not going to be served via HTTP (pages distributed on a
CD-ROM, for example), that consistency may not be required.  But it is a
safe bet that an HTML document will be served via HTTP at some point, and
consistent line ends do no harm when it is not, so consistency should be
enforced.
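
To make "consistent" concrete, here is a minimal sketch of a check for it.
The class and method names (EolConsistency, isConsistent) are my own
invention and not part of JTidy.  It also assumes the document has already
been decoded into a Java String, so CR and LF are the characters U+000D and
U+000A whatever the original character set was.

```java
/** Hypothetical helper (not part of JTidy): reports mixed line ends. */
final class EolConsistency {
    private EolConsistency() {}

    /**
     * Returns true if the text uses at most one kind of line break:
     * only CRLF, only bare CR, or only bare LF.
     */
    static boolean isConsistent(String text) {
        boolean sawCrLf = false;
        boolean sawCr = false;
        boolean sawLf = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    sawCrLf = true;
                    i++; // the LF belongs to this CRLF pair
                } else {
                    sawCr = true;
                }
            } else if (c == '\n') {
                sawLf = true;
            }
        }
        int kinds = (sawCrLf ? 1 : 0) + (sawCr ? 1 : 0) + (sawLf ? 1 : 0);
        return kinds <= 1;
    }
}
```

A document with no line breaks at all counts as consistent here, which
seems the only sensible reading.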

>If that is the case,
>tidy would need some sort of pre-parsing pass to determine
>what the line-end sequence is before parsing.

It would be sufficient to treat every CR, LF, and CRLF as an end-of-line on
input, and to write each one out as whatever is appropriate for the platform
for which tidy was compiled or, failing that, as CRLF.  Defaulting to a bare
CR causes problems for systems that use LF as EOL, and defaulting to a bare
LF causes problems for systems that use CR: they see one long unbroken
line.  CRLF is the safer default because all three kinds of system find a
recognized EOL marker in it; the other character is at worst invisible or
represented by a placeholder character.  (LFCR is not anyone's EOL.)
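
A single pass is enough for this, with no pre-parsing.  The following is
only a sketch of the idea, not JTidy's actual code: EolNormalizer,
normalize and platformEol are names I made up.  Since Java has no
compile-time platform, the sketch takes the "line.separator" system
property of the running JVM as the platform EOL and falls back to CRLF if
that is missing.

```java
/** Hypothetical helper (not part of JTidy): rewrites all line ends. */
final class EolNormalizer {
    private EolNormalizer() {}

    /** Replaces every CRLF, bare CR, and bare LF in text with eol. */
    static String normalize(String text, String eol) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                // CR followed by LF is a single line break: skip the LF.
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;
                }
                out.append(eol);
            } else if (c == '\n') {
                out.append(eol);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** EOL of the running platform, or CRLF if it cannot be determined. */
    static String platformEol() {
        String eol = System.getProperty("line.separator");
        return (eol == null || eol.isEmpty()) ? "\r\n" : eol;
    }
}
```

Calling EolNormalizer.normalize(text, EolNormalizer.platformEol()) gives
the behaviour described above.  Note that an LF followed by a CR comes out
as two line breaks, since LFCR is not treated as one EOL.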

Received on Thursday, 8 June 2000 00:41:55 UTC