- From: <lee@sq.com>
- Date: Thu, 19 Sep 96 09:16:51 EDT
- To: jenglish@crl.com, w3c-sgml-wg@w3.org
>> (We would still have the problem of the 60% of invalid documents
>
> More like 96% invalid from what I've seen...

A while ago I asked Tim Bray to check for me with the Open Text index...
and a little under 5% of HTML documents had a DOCTYPE line. That doesn't
mean that they were valid, and there may be documents that are otherwise
valid but don't have a DOCTYPE, but it's unlikely that the overall figure
is much higher. So I'd say that 90% to 95% invalid is a good guess.

I'd be interested to run the same test again now that a few million more
copies of HoTMetaL have shipped...

>> , but hopefully this situation will get better once standard WP
>> tools start offering automatic conversion to HTML.)

The conversions we've seen do not generally attempt to create valid SGML.
I think there isn't enough encouragement. This is an aspect of the demise
of the IETF HTML WG that is unfortunate, I think -- there are things that
are easier for that sort of very approachable standards body.

However that may be, I think it's reasonable to expect to have to run some
sort of transformation from arbitrary/normal/typical HTML into XML.
Most HTML documents won't convert automatically.

In addition, it's very common for HTML documents to be different on the
server than when they are delivered --

* processing instructions and significant comments are used by some servers:
    <?dvi filename>
        to replace the PI with a DVI image on the fly
    <!--#include filename>
        to do server-side inclusion (like entities, but without the
        indirection -- much more natural for a C programmer)
    <!--#exec date>
        inserts today's date; more often used for generating those
        this-page-visited-00000026-times-most-of-them-by-my-mum counters

* database servers often use their own elements (or perhaps I shouldn't
  elevate them that high, their own _tags_), e.g. stuff like
    <sql>select orderno from.....
    </sql>
  where the SQL query is executed before the document is shipped, and the
  content of the tags need not be stuff that would be valid within the
  HTML document.

* some servers assemble fragments (not in the SGML OPEN fragment sense)
  on the fly -- should the individual fragments be valid, or only the
  result? (rhetorical question!)

It's interesting that HTML serves as a low-level portable document
formatting language, sort of like a new troff that's a little easier to
parse. Perhaps if XML had been around some 7 years ago, so that you could
write an XML parser in C in a day or two, even as an undergraduate, Tim
and later Marc and Eric & friends would have used it.

If HTML 4 is based on XML, it will get widely deployed if it is not too
much harder than HTML 3 to type in NOTEPAD or to parse.

You don't need to declare that Hey Presto! all HTML documents are XML! --
far from it, it is better not to. If there is no need to improve HTML
documents to make them XML, what have we accomplished?

Sorry for a long message -- I think it's important to agree on this, though.

Lee

--
Liam Quin, SoftQuad Inc      | lq-text freely available Unix text retrieval
lee@sq.com +1 416 544-9000   | FAQs: Metafont fonts, OPEN LOOK UI, OpenWindows
SGML: http://www.sq.com/     | We've moved; new 'phone number & postal address!
The barefoot programmer      | `who is my neighbour?'
Received on Thursday, 19 September 1996 09:17:12 UTC