- From: <lee@sq.com>
- Date: Sat, 28 Sep 96 14:21:24 EDT
- To: jjc@jclark.com, w3c-sgml-wg@w3.org
This is a little long, so here's a summary:

On systems where documents are not split into records at logical line boundaries, RS/RE processing is not mandated by ISO 8879 (SGML). Neither Unix nor the Internet splits documents in that way. I am not sure about MS-DOS and Macintosh files. Under Windows, it is optional.

So if RS/RE processing is optional, let's decide not to do it. The whitespace handling proposed in Jon's summary of the ERB 'phone call (or was it Eliot? Sorry) seems fine to me.

Lee

> At 17:18 27/09/96 -0400, Gavin Nicol wrote:
> [...]
> >This is only true if the record ends occur in the input. If a suitable
> >declaration is given (ie, one in which RE and RE never occur) then an
> >SGML parser and an XML parser should return identical parse results.

James said:

> I explained in an earlier message
> <URL:http://lists.w3.org/Archives/Public/w3c-sgml-wg/msg00243.html>
> (mainly paragraph 2) why I believe that view to be mistaken.

including the statement:

    The entity manager is supposed to transform whatever mechanism
    the OS uses for representing lines into RS/RE.

Well, Unix itself does not mandate a particular storage format within a file, and does not itself represent lines (although connected devices often do, of course, typically with CR on input and CR LF on output).

So on Unix you could argue that the entity manager should go and read Annex F :-)

Annex F says:

    If the document contains record boundaries, they are processed
    as defined in this international standard.

implying that if it doesn't contain them, that's fine.

Again, B.3.3 starts with

    Not every text processing system breaks its storage entities
    into records.

I grant you, however, that these are not normative.

Record is defined in 4.252 as

    normally corresponding to an input line on a text entry device.

So if your SGML parser is reading input from what Unix calls a teletype, the line boundaries (terminated either by CR or LF depending on tty modes) are record boundaries.
If you are reading from a regular file, this definition does not apply.

The notes to the definition do seem to indicate that a Record is intended to be a line in the normal sense of the word. However, the same definition also says that the record must be bounded (on input, presumably) by a record start and a "record end character". It's interesting that we have record start but record end character -- I am not sure how to interpret that, except possibly as an editing error.

Also, see note 2 to 7.6.1:

    Although the handling of record boundaries is defined by SGML,
    there is no requirement that SGML documents must be organized
    in records.

At any rate, it seems to boil down to

[1] must the parser generate RS and RE if they are not in the file?
    7.6.1 talks about recognising RS and RE, not generating them; so I think it should not.

[2] must the parser always recognise ASCII CR and LF as RS and RE even if the SGML declaration changes RS and RE?
    The standard does not mention ASCII CR and LF in this context, but only talks about RS and RE. So you should be able to change them. Of course, not all parsers support variant concrete syntaxes.

[3] must the parser treat the characters associated with RS and RE as record boundaries if the operating system or underlying storage system does not use records?
    Clearly not -- see Annex F quoted above, and also note 2 to 7.6.1.

Since RS/RE processing is optional except on operating systems using records (would that include MS-DOS??), let's just decide not to use it.

Lee
Received on Saturday, 28 September 1996 14:21:53 UTC