Re: Current Status of Discussion on RE/RS Handling

This is a little long, so here's a summary:
    On systems where documents are not split into records at logical
    line boundaries, RS/RE processing is not mandated by ISO8879:SGML.

    Neither Unix nor the Internet splits documents in that way.
    I am not sure about MS-DOS and Macintosh files.  Under Windows, it
    is optional.

    So if RS/RE processing is optional, let's decide not to do it.

    The whitespace handling proposed in Jon's summary of the ERB
    'phone call (or was it Eliot?  Sorry) seems fine to me.


> At 17:18 27/09/96 -0400, Gavin Nicol wrote:
> >This is only true if the record ends occur in the input. If a suitable
> >declaration is given (ie, one in which RS and RE never occur) then an
> >SGML parser and an XML parser should return identical parse results.

James said:
> I explained in an earlier message
> <URL:http://lists.w3.org/Archives/Public/w3c-sgml-wg/msg00243.html>
> (mainly paragraph 2) why I believe that view to be mistaken.

including the statement:

    The entity manager is supposed to transform
    whatever mechanism the OS uses for representing lines into RS/RE.

Well, Unix itself does not mandate a particular storage format within a
file, and does not itself represent lines (although connected devices often
do, of course, typically with CR on input and CR LF on output).  So on
Unix you could argue that the entity manager should go and read Annex F :-)

Annex F says:
    If the document contains record boundaries, they are processed as defined
    in this international standard.

implying that if it doesn't contain them, that's fine.

Again, B.3.3 starts with
    Not every text processing system breaks its storage entities into records.

I grant you, however, that these are not normative.  Record is defined in
4.252 as
    normally corresponding to an input line on a text entry device.
So if your SGML parser is reading input from what Unix calls a teletype,
the line boundaries (marked by either CR or LF, depending on tty modes)
are record boundaries.  If you are reading from a regular file,
this definition does not apply.
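
As a rough illustration of that distinction, a parser front end on Unix
could ask the operating system what kind of input it has been given.  This
is only a sketch: the helper name looks_record_oriented is made up for this
example, and treating "is a terminal" as "has records" is an assumption
about how one might apply the definition, not something the standard says.

```python
import os
import stat


def looks_record_oriented(fd: int) -> bool:
    """Hypothetical check: True only for terminal input.

    A regular file (or a pipe, or any other non-terminal) is treated
    as a plain stream with no record structure of its own.
    """
    mode = os.fstat(fd).st_mode
    if stat.S_ISREG(mode):
        # Regular file: Unix stores no record structure for it.
        return False
    # Terminals deliver input a line at a time in the usual tty modes.
    return os.isatty(fd)
```

Calling looks_record_oriented(0) on standard input would then give a
different answer depending on whether the parser is run interactively or
with its input redirected from a file.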

The notes to the definition do seem to indicate that a Record is
intended to be a line in the normal sense of the word.  However, the same
definition also says that the record must be bounded (on input, presumably)
by a record start and a "record end character".  It's interesting that
we have record start but record end character -- I am not sure how to
interpret that, except possibly as an editing error.

Also, see note 2 to 7.6.1:
    Although the handling of record boundaries is defined by
    SGML, there is no requirement that SGML documents must be
    organized in records.

At any rate, it seems to boil down to three questions:
[1] must the parser generate RS and RE if they are not in the file?
    7.6.1 talks about recognising RS and RE, not generating them;
    so I think it should not.

[2] must the parser always recognise ASCII CR and LF as RS and RE even
    if the SGML declaration changes RS and RE?

    The standard does not mention ASCII CR and LF in this context, but
    only talks about RS and RE.  So you should be able to change them.
    Of course, not all parsers support variant concrete syntaxes.

[3] must the parser treat the characters associated with RS and RE as
    record boundaries if the operating system or underlying storage
    system does not use records?

    Clearly not -- see Annex F quoted above, and also note 2 to 7.6.1.
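
To make the three answers concrete, here is a minimal sketch in Python of
how an entity manager might behave under this reading.  Everything in it is
hypothetical: the function name read_entity, the storage_has_records flag
and the rs / re_char parameters are invented for illustration and come from
neither ISO 8879 nor any real parser.  The defaults follow the reference
concrete syntax, which assigns RS to character 10 and RE to character 13.

```python
# Reference concrete syntax: RS is character 10 (LF), RE is character 13 (CR).
RS = "\n"
RE = "\r"


def read_entity(text: str, storage_has_records: bool,
                rs: str = RS, re_char: str = RE) -> str:
    """Hypothetical entity manager front end.

    storage_has_records -- assumption: the caller knows whether the
    underlying system really stores the entity as records.
    rs, re_char -- the characters the SGML declaration in force assigns
    to RS and RE (question [2]: these are not hard-wired).
    """
    if not storage_has_records:
        # Questions [1] and [3]: no records in the storage system, so
        # nothing is generated; the text reaches the parser unchanged.
        return text
    # Record-oriented storage: bound every record with RS ... RE.
    # str.splitlines() stands in here for "whatever mechanism the OS
    # uses for representing lines".
    return "".join(rs + record + re_char for record in text.splitlines())
```

With storage_has_records false (the Unix case) the text is passed through
untouched, so nothing is generated that was not in the file; whether a
literal LF or CR in it is then recognised as RS or RE is up to the
declaration in force.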

Since RS/RE processing is optional except on operating systems using
records (would that include MS-DOS??), let's just decide not to use it.