Re: CheckHtmlEsis

From: Peter Flynn <pflynn@imbolc.ucc.ie>
Date: 18 Jun 1998 16:32:42 +0100
To: d.cary@ieee.org
Cc: roconnor@uwaterloo.ca, www-html@w3.org
Message-id: <199806181532.QAA18900@imbolc.ucc.ie>

David Cary writes:
   Dear "Russell Steven Shawn O'Connor" and Peter Flynn,

   The comment that some kinds of validation should be done *only* by the
   browser doesn't make sense to me. 

I don't think I ever said it should be done only by the browser.
I hope not, anyway :-)

   Here are a few things which I wish my validation tools would check:

   Once I forgot to put the terminating quote on a URI inside a <a></a>
   entity. Since ">" seems to be a valid character inside a string, ... my
   validation tools gave me error messages, but they were misleading. It took
   me a while to figure out the real problem.

Yes, you are expected to understand the error messages, having first
read the SGML standard :-)

1 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Draft//EN">
2 <html>
3   <head>
4     <title>Test</title>
5   </head>
6   <body>                           |-- here's the missing quote
7     <p><img src="foo.gif" alt="A Foo>Me</p>
8     <p><img src="bar.gif" alt="A Bar">My dog</p>
9   </body>              |-- this is where it gags (line 8, char 24)
10</html>

nsgmls -s -c/usr/local/lib/sgml/CATALOG test.html
ld.so: warning: /usr/lib/libc.so.1.8 has older revision than expected 9
nsgmls:test.html:8:24:E: an attribute value literal can occur in an
attribute specification list only after a vi delimiter

The parser is taking the whole of this:

   "A Foo>Me</p>
<p><img src="

as the value of ALT (understandably, since it starts and ends with a
quote).  It then reads bar.gif as an unquoted token, and after that it
hits a new quote. That quote opens an attribute value literal (a quoted
value) with no "=" (the vi delimiter) in front of it, so it is out of
context. Simple, isn't it? :-)

   I once had a bunch of URIs similar to <a href="www.ti.com">TI</a>,
   which the DTD would accept. My link check software kept telling me
   that this was a bad link, but the URI seemed to work fine when I
   manually typed it into my web browser ... color me confused. I wish
   I had gotten some warning that would suggest "I think you meant to
   say http://www.ti.com/ ".

The problem is that SGML does not (and cannot) provide any syntax-
checking INSIDE an attribute apart from testing if it's a valid ID,
for example, or a valid NUMBER. Once it's specified as CDATA, anything
will go in there...and it's up to the application to check it. So your
validator was perfectly correct in saying that www.ti.com was valid
character data and your link-checker was right to throw it out as
missing the scheme.
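
An application-level check of this kind can be sketched in a few lines
of Python. This is only an illustration, not what any particular
link-checker does: the function name suggest_scheme and the "starts
with www." heuristic are assumptions made for the example.

```python
from urllib.parse import urlsplit


def suggest_scheme(href):
    """Return a suggested absolute URL if href looks like a bare host name.

    Hypothetical heuristic: a reference with no scheme and no //host part
    whose first path segment starts with "www." was probably meant to be
    an http URL. Returns None when there is nothing to suggest.
    """
    parts = urlsplit(href)
    if parts.scheme or parts.netloc:
        return None  # already absolute, or a //host network-path reference
    first_segment = parts.path.split("/", 1)[0]
    if first_segment.lower().startswith("www."):
        # Add a trailing slash only when the reference is a bare host name.
        suffix = "" if "/" in parts.path else "/"
        return "http://" + href + suffix
    return None


for href in ("www.ti.com", "http://www.ti.com/", "index.html"):
    hint = suggest_scheme(href)
    if hint:
        print(f'{href!r}: I think you meant to say {hint}')
```

The validator still accepts www.ti.com as CDATA; the warning comes
from the application layered on top of it.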

   I wish my validators would warn me when "You forgot to put a 'alt'
   attribute inside this <img> tag". (same for the height and width
   attributes).

Easy to fix: edit your DTD and change the ATTLIST for IMG from

              ALT  CDATA  #IMPLIED
to
              ALT  CDATA  #REQUIRED

Oh...you're not using a DTD? 
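
Without a DTD, the same check has to be done by a separate program.
A minimal sketch using Python's standard html.parser module follows;
the class name MissingImgAttrs is made up for the example, and this
is a lint over tags, not a validation against any DTD.

```python
from html.parser import HTMLParser


class MissingImgAttrs(HTMLParser):
    """Collect <img> tags that lack alt, height or width attributes."""

    WANTED = ("alt", "height", "width")

    def __init__(self):
        super().__init__()
        self.missing = []  # list of (src, [absent attribute names])

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        present = {name for name, _ in attrs}
        absent = [name for name in self.WANTED if name not in present]
        if absent:
            self.missing.append((dict(attrs).get("src"), absent))


checker = MissingImgAttrs()
checker.feed('<p><img src="foo.gif" alt="A Foo" height="10" width="10">'
             '<img src="bar.gif" width="20"></p>')
checker.close()
for src, absent in checker.missing:
    print(f"{src}: missing {', '.join(absent)}")
```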

   Many people intend to make *every* graphic a link, so they would appreciate
   a program that listed which <img> tags were not wrapped in a <a></a> tag.

Here's a 5-line Omnimark program to do this.  Snip this into a file
called soloimg.xom and run Omnimark LE over your file with a batch
file or shell script or even commandline like this:

omle sgmlhtml.dec %1.htm -s soloimg.xom 

--------------------- soloimg.xom -------------------------
down-translate

element IMG when ancestor isnt A
	output "Image for %v(src) is not inside an <A>%n%c"

element #IMPLIED
	put #suppress "%c"
-----------------------------------------------------------

You do need to make sure your copy of the relevant HTML DTD is in the 
same directory (if you use a SYSTEM identifier in your DOCTYPE
declaration) or referenced in a catalog if you use PUBLIC.
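
For comparison, roughly the same report can be produced without an
SGML parser at all. The sketch below uses Python's standard
html.parser module; the class name SoloImages is an assumption for
the example, and unlike the Omnimark version it does no validation,
it only tracks whether an <A> is currently open.

```python
from html.parser import HTMLParser


class SoloImages(HTMLParser):
    """List the src of every <img> that is not inside an <a> element."""

    def __init__(self):
        super().__init__()
        self.open_anchors = 0  # number of <a> elements currently open
        self.solo = []         # src values of images outside any <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.open_anchors += 1
        elif tag == "img" and self.open_anchors == 0:
            self.solo.append(dict(attrs).get("src"))

    def handle_endtag(self, tag):
        if tag == "a" and self.open_anchors > 0:
            self.open_anchors -= 1


finder = SoloImages()
finder.feed('<a href="x.html"><img src="in.gif"></a>'
            '<p><img src="out.gif"></p>')
finder.close()
for src in finder.solo:
    print(f"Image for {src} is not inside an <A>")
```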

   Even though the "&lt" is apparently legal SGML, I intend to always use the
   full "&lt;" and would like some warning when I slip up.

It's not a slip, and it's not "apparent". You can use &lt with no
semicolon any time that the &lt is followed by a space or some other
character that cannot be part of a name. It's only when you follow it
with a name character such as a letter that it's an error, e.g. &ltH2&gt
will cause a complaint that entity "ltH2" is not defined -- reasonably
enough, I think.
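
If you do want a warning for the legal-but-unterminated form, a
pattern match over the text is enough. The following Python sketch
is an assumption-laden illustration: the function name
unterminated_refs is invented, it treats letters, digits, "." and "-"
as name characters, it ignores numeric references like &#60, and it
scans whatever string it is given, attribute values included.

```python
import re

# An ampersand, a name, and an optional terminating semicolon.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*)(;?)")


def unterminated_refs(text):
    """Return the names of entity references not closed with a semicolon."""
    return [m.group(1) for m in ENTITY_REF.finditer(text) if not m.group(2)]


sample = "a &lt; b, but a &lt c and &ltH2&gt"
print(unterminated_refs(sample))
```

Note that "ltH2" is reported as a single name, which is exactly how
the parser reads it.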

   I intend to wrap every URI in the source text with a link to that URI. I
   would like a validator to check that every string (outside of a tag) of the
   form "http:" or "ftp:" or "mailto:" (what others are there now ?) is not
   merely inside a <a></a> entity, but that the href attribute is actually set
   to the *same* location (rather than some other unrelated location).

It would be nice if editors could do this (actually ADEPT and
Author/Editor can if you use their scripting languages). Omnimark can
do this as a standalone program like above: something like

   translate pcdata 
	( ("http:" or "ftp:") 
		"//" 
		[ letter or "." or "-" ]+ 
		( ":" digit+)?
		[ "/" ]? 
	-- I won't go on, you get the idea, it's a pattern-match --
		) =url
	when name of element isnt A
	output "<a href=%"%x(url)%">%x(url)</a>"

You could do another one in the same file for occasions when name of
element is A, and check %v(href) is equal to %x(url). 
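
The same two checks (a URL in running text that is not linked, and a
URL inside an <A> whose HREF points somewhere else) can be sketched
in Python with the standard html.parser module. The class name
UrlTextCheck and the deliberately loose URL pattern are assumptions
for the example; like the pattern above, it is nowhere near a full
URL grammar.

```python
import re
from html.parser import HTMLParser

# Loose pattern for illustration only, not a complete URL syntax.
URL = re.compile(
    r"(?:http|ftp)://[A-Za-z0-9.-]+(?::\d+)?(?:/[^\s<>\"]*)?"
    r"|mailto:[^\s<>\"]+"
)


class UrlTextCheck(HTMLParser):
    """Report URLs in text that are unlinked or disagree with their href."""

    def __init__(self):
        super().__init__()
        self.in_anchor = False
        self.href = None      # href of the enclosing <a>, if any
        self.problems = []    # list of (kind, url)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_anchor = True
            self.href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_anchor = False
            self.href = None

    def handle_data(self, data):
        for url in URL.findall(data):
            if not self.in_anchor:
                self.problems.append(("unlinked", url))
            elif url != self.href:
                self.problems.append(("mismatch", url))


check = UrlTextCheck()
check.feed('<p>See http://www.ti.com/ and '
           '<a href="http://www.w3.org/">http://www.w3.org/</a></p>')
check.close()
print(check.problems)
```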

This is sounding like an advert for Omnimark [disclaimer: I have no
connection except as a satisfied user] but what I'm trying to say is
that all the tools to do these things already exist...but they assume
you are creating valid HTML to start with: then checking this stuff
becomes trivial.

   I don't think my tools are smart enough to check that (a) for every <a
   href="#misc">misc</a> there is one and only one <a name="misc">misc</a> in
   the document,    and (b) that for each <a name="misc">misc</a> there is at
   least one <a href="#misc">misc</a>. 

Make them ID/IDREFs without the # and any SGML parser will
automatically check them. Then flip 'em back to NAME and HREF. 

Oh...you're not using a DTD? 
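
If the ID/IDREF route is not available, checks (a) and (b) can be
approximated by counting NAMEs and "#" HREFs directly. A minimal
Python sketch follows; the class name FragmentCheck and the shape of
its report are assumptions for the example, and it only looks at
references within a single document.

```python
from collections import Counter
from html.parser import HTMLParser


class FragmentCheck(HTMLParser):
    """Cross-check <a href="#x"> references against <a name="x"> targets."""

    def __init__(self):
        super().__init__()
        self.targets = Counter()     # name -> number of <a name=...>
        self.references = Counter()  # name -> number of <a href="#...">

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        if attrs.get("name"):
            self.targets[attrs["name"]] += 1
        href = attrs.get("href") or ""
        if href.startswith("#") and len(href) > 1:
            self.references[href[1:]] += 1

    def report(self):
        return {
            # (a) every reference needs one and only one target
            "dangling": sorted(n for n in self.references
                               if self.targets[n] == 0),
            "duplicated": sorted(n for n, c in self.targets.items() if c > 1),
            # (b) every target should be referenced at least once
            "unreferenced": sorted(n for n in self.targets
                                   if self.references[n] == 0),
        }


check = FragmentCheck()
check.feed('<a href="#misc">misc</a> <a href="#gone">gone</a>'
           '<a name="misc">misc</a> <a name="extra">extra</a>')
check.close()
print(check.report())
```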

   When I add a new section to a page,
   something like (b) would remind me to add that section to the table of
   contents I keep at the top of the page.

Any decent editor macro should be able to do this. But will it delete
a ToC section when you remove a section? Or change its name?

   In my opinion, *every* web page needs to have a email address somewhere on
   it, so people viewing it can respond to any questions the author raises.

This is _content_, SGML can't do anything about that. But you could
add a compulsory <ADDRESS> to the end of the content model for <BODY>
in your DTD.

   I'm sure there are many other little things that a machine could easily
   check, but that current validators do not check.

Yep.

///Peter
Received on Thursday, 18 June 1998 11:31:30 GMT