Re: JTidy new line processing

"Andy Quick" <ac.quick@sympatico.ca> wrote:
>Bernice Maslan <Bernice.Maslan@activeindexing.com> wrote:

>>I am running the Java version of HTML Tidy.  When the HTML input looks
>>like the one below, Tidy replaces the ^M with nothing, resulting in two
>>separate words being combined (see the Tidy output below).  I do not
>>know what product was used to create the offending HTML.  I tried
>>setting Word2000 and Clean to yes, but there was no change.  Is there
>>anything I can configure to make Tidy substitute a space for the ^M?

>I assume that by "^M" you mean the character 0x0D (i.e. '\r'), since
>Tidy processes a literal "^M" as ordinary text.  The line-end character
>for HTML/XML is 0x0A ('\n').

Actually it is 0x0D 0x0A (CRLF, "\r\n").  More precisely, CRLF is
preferred; CR or LF alone may also delimit the end of a line, but
whichever one you use, you MUST use it consistently throughout the entire
document.  (This is an HTTP requirement for the body of the requested
document.)

>Tidy strips out control characters (other than '\t' and '\n') from
>the input stream.  There is no option to treat '\r' like white
>space or line-end.

That would be an error then.  Tidy should treat CR, LF, and CRLF as
equivalent and normalize the document to whichever is appropriate for the
platform on which it was compiled or, if indeterminate, to CRLF.
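
Until Tidy does this itself, the input can be normalized before it is
handed to Tidy.  What follows is only a minimal sketch, assuming the
document can be read into a String first.  The class and method names
(LineEndings, normalize) are made up for illustration and are not part of
JTidy.  It normalizes to LF rather than CRLF, since '\n' is the line end
that Tidy is reported above to keep.

```java
final class LineEndings {
    private LineEndings() {}

    /**
     * Hypothetical helper, not a JTidy API: rewrites CRLF, lone CR,
     * and lone LF so that every line ends in a single LF.
     */
    static String normalize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                // A CR always ends a line, whether or not an LF follows.
                out.append('\n');
                // If this CR is the first half of a CRLF pair, skip the LF
                // so the pair yields one line end instead of two.
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;
                }
            } else {
                // LF and every other character pass through unchanged.
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this in place, input such as "two\rwords" reaches Tidy as
"two\nwords", so the two words stay separated by white space instead of
being run together.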

Received on Saturday, 3 June 2000 17:55:58 UTC