Re: [oh no, Mr. Bill! not] B.3 Record-end handling? from lee@sq.com on 1996-10-16 (w3c-sgml-wg@w3.org from October 1996)

From: <lee@sq.com>
Date: Wed, 16 Oct 96 15:30:22 EDT
To: tbray@textuality.com, w3c-sgml-wg@w3.org
Message-Id: <9610161930.AA09687@sqrex.sq.com>
> 2. DTD or none, if a line contains some markup, and aside from that 
>    only white space, you lose said white space and the trailing record
>    boundary character(s).  "Markup" meaning tags, comments, and PIs.

The only trouble with this is that it relies on non-portable line breaks.

With the system-specific rules for what constitutes a line (or record,
if you prefer that terminology), a file on a Macintosh with ASCII CR
between lines copied to Unix, which recognises only LF as a line break,
causes an interoperability problem: the document is now all on one line,
so you have to read it all before you can display past the first space.
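
To make the failure concrete, here is a minimal sketch in Python.  The
function name normalise_line_breaks and the sample string are my own
inventions for illustration, not anything from a proposal; the idea is only
that a reader which treats CR LF, lone CR and lone LF alike no longer depends
on the system the file came from.

```python
import re


def normalise_line_breaks(text: str) -> str:
    """Map CR LF, a lone CR, or a lone LF to a single LF.

    Hypothetical helper for illustration only.
    """
    # "\r\n" is listed first so a CR LF pair counts as one break, not two.
    return re.sub(r"\r\n|\r|\n", "\n", text)


# A Macintosh-style file: CR between lines, no LF anywhere.
mac_file = "<P>\rx x x\r</P>\r"

# A reader that splits only on LF sees the whole document as one line.
lines_seen_by_lf_reader = mac_file.split("\n")

# After normalisation the same reader sees the intended lines
# (the final empty string is the text after the last line break).
lines_after_normalising = normalise_line_breaks(mac_file).split("\n")
```

Any rule phrased in terms of "a line" gives different answers for the two
readings above, which is the interoperability problem.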

MIME uses CR LF for a line boundary, like SMTP, but most servers don't in
fact translate properly as far as I can tell.

We've heard a lot about how XML applications will have problems if there
is whitespace in tables, but have not heard how an XML application is to
recognise that something is a table.  I have assumed so far the use of
a style sheet to determine that.

Perhaps the style sheet should tell the XML application how to display
spaces and newlines.  Then all spaces, newlines and carriage returns
are retained by the lexical analyser and/or parser and passed on, to be
handled at the discretion of the application.  I would expect multiple
blanks (of any kind) to be
coalesced and compressed into a single space.  A non-editing application
would lose comments at the same time, of course.
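
A rough sketch of what I mean, again in Python.  The name render_text and
the preserve flag are hypothetical; the flag stands in for whatever the style
sheet would say about an element, and is not taken from any existing style
sheet language.

```python
import re


def render_text(text: str, preserve: bool = False) -> str:
    """Decide in the application, not the parser, what whitespace means.

    `preserve` is an assumed stand-in for a style sheet setting.
    """
    if preserve:
        # The style sheet says whitespace is significant here: keep it all.
        return text
    # Otherwise coalesce any run of spaces, tabs, CRs and LFs to one space.
    return re.sub(r"[ \t\r\n]+", " ", text)


# The parser hands over the character data untouched...
data = "x\n   x\t\tx"

# ...and the application collapses it, or not, as the style sheet directs.
collapsed = render_text(data)
verbatim = render_text(data, preserve=True)
```

The parser needs no special cases at all under this scheme; every decision
about blanks is made in one place.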

One thing seems to be generally (if not universally) agreed: the SGML
rules for whitespace and RS/RE are too complex for XML, for a variety
of reasons.  There is an excellent paper produced by Software Exoterica
on the handling of RS/RE in their parser.  That someone could write a
paper on whitespace (and still say that there are grey areas in which they
had to make a decision, as I recall!) says that it is too complex.

The problem is made complex by two things:

(1) a desire to use whitespace to be part of the data in some places,
    and part of the markup in others, without any syntactic differentiation.
    In other words, in <boy>  </boy>, the spaces might be important, but
    in another DTD, <boy>  </boy> might be the same as <boy></boy>.
    The same applies to newlines, with the additional twist that in some
    places, people might want a newline to be different from a space, and
    even the SGML parser can't tell where.

(2) a fond imagining that the SGML parser can deal with people putting
    the markup in the wrong place.  We have been told that
    <P>
    x x x
    </P>
    is more natural to some people than
    <P>x x x</p>
    but that to others
    <P>
       x
       x
       x
    </P>
    is more natural.  SGML special-cases the first of these.  But since
    it is subjective, there is no clear reason to prefer any of these forms
    over any other.
    We have also been told that SGML can handle the case where a non-SGML-
    aware editor performs word wrap incorrectly, and attempts to divine
    the "True Data" that the user was imagining.  But this is more than
    subjective, it's absurd.  (please tell me privately if
    I am misrepresenting anyone, and I will gladly summarise to the list)

All of these points seem very strange to anyone from a computer science
background, and are in fact central to why SGML is not as widely implemented
as it could be, I believe.  These issues show how the `cultural assumptions'
of SGML are utterly alien to the CS world.  This is more important than
complexity.  If complexity were the only issue, we would see large numbers
of partial implementations.  We don't.

There is no hope in trying to change the CS community in this regard;
we have already tried that.  There are things that SGML can teach CS, but
there are also things SGML must learn from CS.  Regularity is one of them.

I do not believe there is any justification for using white-space
rules other than those used in Pascal, LISP, C, C++, Java, Algol, PostScript,
TeX, and many other languages designed since the 1960s.

I realise there are people with large amounts of legacy data in SGML,
or other non-XML formats.  But if the goal is to get wide deployment,
you need to get software developers to _like_ your language.

So you need to avoid irregularities and quirks.

Yes, you can have multiple syntactic constructs.
Most languages _do_ use a special syntax for comments, for example, so
that the parser can ignore them easily.
But try and make them fit into existing language constructs used elsewhere
and you will gain many more converts, much more quickly.

Maybe there are 10 or 20 thousand people who can write SGML DTDs today.
But there are many more people than that who can write a C program.

So I am afraid I do not think Peter's idea is the right approach.
It is a good idea if you can't take the right approach.  But we can.

Lee
Received on Wednesday, 16 October 1996 15:30:37 UTC