Re: Simple(?) comments question... from S.N.Brodie@ecs.soton.ac.uk on 1996-09-19 (www-talk@w3.org from September to October 1996)

From: <S.N.Brodie@ecs.soton.ac.uk>
Date: Thu, 19 Sep 1996 12:06:10 +0100 (BST)
To: galactus@htmlhelp.com (Arnoud "Galactus" Engelfriet)
Cc: www-talk@w3.org
Message-Id: <10579.9609191106@strachey.ecs.soton.ac.uk>

Arnoud "Galactus" Engelfriet wrote:
> 
> In article <828.9609180920@strachey.ecs.soton.ac.uk>,
> S.N.Brodie@ecs.soton.ac.uk wrote:
> > My impression is that this is not a correctly terminated comment, since
> > it does not fit the strict definition given in RFC1866.  However,
> > that's irrelevant, as all 3 browsers I've tried it on accept it. 
> 
> I suppose these browsers simply consider "<!--" the comment starting
> tag and "-->" the corresponding closing tag. My favourite way to
> demonstrate that is the following *legal* comment:
> 
>   <!-- -- --> -->
> 
> That's two comments, one of which only contains " " and one contains
> "> ".

Agreed.
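
To make the "two comments" reading concrete, here is a minimal Python sketch of one strict interpretation of the RFC 1866 wording. The function name split_comments is made up for illustration, and it assumes a leftmost match for the closing `--`, which is exactly the point in dispute for "---".

```python
def split_comments(decl):
    """Return the comment bodies inside one '<! ... >' declaration.

    Assumption: each comment starts with '--' and runs up to the next
    (leftmost) '--'; only whitespace may appear between comments.
    """
    if not decl.startswith("<!"):
        raise ValueError("not a markup declaration")
    i = 2
    bodies = []
    while True:
        if decl.startswith("--", i):
            end = decl.find("--", i + 2)
            if end == -1:
                raise ValueError("unterminated comment")
            bodies.append(decl[i + 2:end])
            i = end + 2
        elif i < len(decl) and decl[i].isspace():
            i += 1  # whitespace is allowed between comments
        elif i == len(decl) - 1 and decl[i] == ">":
            return bodies
        else:
            raise ValueError("unexpected text at offset %d" % i)


print(split_comments("<!-- -- --> -->"))  # [' ', '> ']
```

Under this leftmost-match assumption, "<!-- a --->" is rejected, because "---" is read as "--" followed by a stray "-".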

> Anyway, as far as I can see RFC 1866 does not discuss the "-" character
> as last character before "--" explicitly. It only states (section 3.2.5)
> 
>    Each comment starts with `--' and includes
>    all text up to and including the next occurrence of `--'.
> 
> I'm just confused if the sequence "---" _should_ be seen as "-"
> followed by "--" or as "--" followed by "-".

That is the problem.  My (very recently modified :-) parser accepts the
following and displays "Body text." as it should:

Body <!-- comment -- -- > shouldn't see this! --> text.

Netscape gets itself into all kinds of a mess with this.  It seems to be
applying the rule that "-->" terminates a comment and, if it doesn't find
one, going back to the first occurrence of a ">".  For example:

One <!-- hi -- -- > lo --> Two

is displayed as "One Two".  However:

One <!-- hi -- -- > lo -- > Two

is displayed as "One lo -- > Two"
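
The heuristic I am guessing at can be modelled in a few lines of Python. This is only a sketch of the guess, not Netscape's actual code; the name netscape_like_strip is hypothetical, and dropping the rest of the input when no ">" is found at all is my own assumption.

```python
def netscape_like_strip(text):
    """Strip comments using the guessed rule: '-->' ends a comment;
    failing that, fall back to the first '>' after '<!--'."""
    out = []
    i = 0
    while True:
        start = text.find("<!--", i)
        if start == -1:
            out.append(text[i:])
            break
        out.append(text[i:start])
        end = text.find("-->", start + 4)
        if end != -1:
            i = end + 3
            continue
        gt = text.find(">", start + 4)
        if gt == -1:
            break  # assumption: an unterminated comment swallows the rest
        i = gt + 1
    return "".join(out)


print(netscape_like_strip("One <!-- hi -- -- > lo --> Two"))   # One  Two
print(netscape_like_strip("One <!-- hi -- -- > lo -- > Two"))  # One  lo -- > Two
```

Once runs of spaces are collapsed for display, both results match what is shown on screen.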

Anybody got any suggestions how --- should be parsed whilst parsing a
comment structure?  My inclination is to treat it differently depending
on whether you are inside a comment or not, but it becomes a special
case that way, since behaviour will have to change depending on the
number of consecutive - characters.  Obviously you have to keep track
of whether you are "inside" a -- or not.  Having seen <!-- the parser goes
into comment parsing mode, and sets a flag "in_double_dash" to true.
Then it continues until it sees 2 or more consecutive - characters.  It
discards all of these characters and sets "in_double_dash" to false.  A
'>' is only accepted as terminator if in_double_dash is false.  Upon
seeing a -- whilst in_double_dash is false, set in_double_dash to true.
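
A rough Python sketch of that scheme follows. It is not the parser's real code: the function name strip_comments is made up, and three details are assumptions on my part: extra dashes directly after "<!--" are folded into the opening run, a lone "-" is skipped, and any text between comments is silently discarded.

```python
def strip_comments(text):
    """Remove '<!-- ... >' declarations using the in_double_dash flag."""
    out = []
    i, n = 0, len(text)
    while i < n:
        if not text.startswith("<!--", i):
            out.append(text[i])
            i += 1
            continue
        i += 4  # consume "<!--"
        while i < n and text[i] == "-":
            i += 1  # assumption: extra dashes join the opening run
        in_double_dash = True
        while i < n:
            if text[i] == "-":
                j = i
                while j < n and text[j] == "-":
                    j += 1
                if j - i >= 2:
                    # a run of 2+ dashes is one delimiter: toggle the flag
                    in_double_dash = not in_double_dash
                i = j  # the dashes themselves are discarded
            elif text[i] == ">" and not in_double_dash:
                i += 1  # '>' only terminates outside a comment
                break
            else:
                i += 1
    return "".join(out)


print(strip_comments("Body <!-- comment -- -- > shouldn't see this! --> text."))
# Body  text.
```

The two spaces in the result collapse to one when displayed.  An unterminated comment swallows the rest of the input in this sketch.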

Whilst this has a set behaviour for 2+ consecutive - symbols, is it the
desired behaviour?

One of the places this is most likely to crop up (IMHO) is when reading
inlined scripts.  I have specifically added recognition for <script>
and </script> to the parser so it can discard that stuff automatically
without relying on the script being commented out.  I discard it as I
have no intention of porting Visual Basic or writing a Javascript
interpreter.  I just don't have the time.
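
For illustration only, discarding script elements before comment parsing could look like the Python sketch below. It uses a regular expression rather than real parser states, and the name discard_scripts is hypothetical; a <script> with no closing tag is left untouched here, which is an assumption and not necessarily what a browser should do.

```python
import re

# Case-insensitive match of <script ...> through the nearest </script>.
_SCRIPT = re.compile(r"<script\b[^>]*>.*?</script\s*>",
                     re.IGNORECASE | re.DOTALL)


def discard_scripts(html):
    """Drop script elements so their contents never reach comment parsing."""
    return _SCRIPT.sub("", html)


print(discard_scripts('a<SCRIPT language="JavaScript"><!-- x --- y //--></script>b'))
# ab
```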

-- 
Stewart Brodie, Electronics & Computer Science, Southampton University.
http://www.ecs.soton.ac.uk/~snb94r/      http://delenn.ecs.soton.ac.uk/

Received on Thursday, 19 September 1996 07:07:03 UTC