Re: equivalent power in SGML and XML



>> (We would still have the problem of the 60% of invalid documents
> 
> More like 96% invalid from what I've seen...

A while ago I asked Tim Bray to check for me with the Open Text index...
and a little under 5% of HTML documents had a DOCTYPE line.  That doesn't
mean that they were valid, and there may be documents that are otherwise
valid but don't have DOCTYPE, but it's unlikely that the overall figure is
much higher.  So I'd say that 90% to 95% invalid is a good guess.
I'd be interested to run the same test again now that a few million more
copies of HoTMetaL have shipped...
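
For anyone who wants to repeat that sort of count on their own collection, here
is a minimal sketch in Python. It is not how the Open Text index did it. The
names has_doctype and doctype_fraction are made up for illustration, and the
sketch assumes that "has a DOCTYPE line" means a <!DOCTYPE ...> declaration
preceded only by whitespace and comments. As noted above, this measures the
presence of a declaration, not validity.

```python
import re

# Assumption: a DOCTYPE "line" is a <!DOCTYPE ...> declaration that appears
# before any element, with only whitespace and comments allowed in front of it.
_DOCTYPE = re.compile(
    r"\s*(?:<!--(?:(?!-->).)*-->\s*)*<!DOCTYPE\b",
    re.IGNORECASE | re.DOTALL,
)


def has_doctype(text):
    """Return True if the document text starts with a DOCTYPE declaration."""
    return _DOCTYPE.match(text) is not None


def doctype_fraction(documents):
    """Return the fraction of documents (strings) that carry a DOCTYPE."""
    docs = list(documents)
    if not docs:
        return 0.0
    return sum(has_doctype(d) for d in docs) / len(docs)
```

A document that passes this check may still be invalid, so the fraction is at
best an upper bound on the share of documents that even claim a DTD.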

>> , but hopefully this situation will get better once standard WP
>> tools start offering automatic conversion to HTML.)

The conversions we've seen do not generally attempt to create valid SGML.
I think there isn't enough encouragement.  This is an aspect of the demise
of the IETF HTML WG that is unfortunate, I think -- there are things that
are easier for that sort of very approachable standards body.

However that may be, I think it's reasonable to expect to have to run some
sort of transformation from arbitrary/normal/typical HTML into XML.  Most
HTML documents won't go automatically.  In addition, it's very common for
HTML documents to be different on the server than when they are delivered--

* processing instructions and significant comments are used by some servers:
  <?dvi filename> to replace the PI with a DVI image on the fly
  <!--#include filename> to do server-side inclusion (like entities, but
  without the indirection -- much more natural for a C programmer)
  <!--#exec date> -- inserts today's date, more often used for generating
  those this-page-visited-00000026-times-most-of-them-by-my-mum counters

* database servers often use their own elements (or perhaps I shouldn't
  elevate them that high, their own _tags_), e.g. stuff like
    <sql>select orderno from..... </sql>
  where the SQL query is executed before the document is shipped, and the
  content of the tags need not be stuff that would be valid within the
  HTML document.

* some servers assemble fragments (not in the SGML OPEN fragment sense) on
  the fly -- should the individual fragments be valid, or only the result?
  (rhetorical question!)
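
A transformation step would first have to notice these server-side constructs
in the stored file. The following Python sketch flags the three kinds listed
above. Everything in it is an assumption made for illustration: the function
name server_side_constructs, the regular expressions, and above all the
KNOWN_HTML set, which is a small hand-picked subset and not a real HTML DTD.
It matches the SGML-style forms shown above, where a processing instruction
or directive ends at the first ">".

```python
import re

# Illustrative subset of HTML element names only; a real tool would take
# the element list from the DTD it is checking against.
KNOWN_HTML = {
    "html", "head", "title", "body", "h1", "h2", "p", "a", "ul", "ol",
    "li", "pre", "em", "strong", "img", "br", "hr",
}

PI = re.compile(r"<\?[^>]*>")               # e.g. <?dvi filename>
DIRECTIVE = re.compile(r"<!--#\w+[^>]*>")   # e.g. <!--#include filename>
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)")


def server_side_constructs(text):
    """List (kind, detail) pairs for things a server would expand or execute."""
    found = [("pi", m.group(0)) for m in PI.finditer(text)]
    found += [("directive", m.group(0)) for m in DIRECTIVE.finditer(text)]
    found += [
        ("unknown-tag", m.group(1).lower())
        for m in START_TAG.finditer(text)
        if m.group(1).lower() not in KNOWN_HTML
    ]
    return found
```

A file for which this returns an empty list might be converted as stored;
anything else probably needs to be expanded by the server (or treated as a
fragment) before validity is even a meaningful question.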

It's interesting that HTML serves as a low-level portable document formatting
language, sort of like a new troff that's a little easier to parse.  Perhaps
if XML had been around some 7 years ago, so that you could write an XML
parser in C in a day or two, even as an undergraduate, Tim and later Marc
and Eric & friends would have used it.  If HTML 4 is based on XML, it will
get widely deployed if it is not too much harder than HTML 3 to type in
NOTEPAD or to parse.

You don't need to declare that Hey Presto! all HTML documents are XML! --
far from it, it is better not to.  If there is no need to improve HTML
documents to make them XML, what have we accomplished?

Sorry for a long message -- I think it's important to agree on this, though.

Lee

-- 
Liam Quin, SoftQuad Inc    | lq-text freely available Unix text retrieval
lee@sq.com +1 416 544-9000 | FAQs: Metafont fonts, OPEN LOOK UI, OpenWindows
SGML: http://www.sq.com/   | We've moved; new 'phone number & postal address!
The barefoot programmer    | `who is my neighbour?'