- From: Bob Briscoe <rbriscoe@jungle.bt.co.uk>
- Date: Thu, 17 Oct 1996 16:36:05 +0100
- To: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
A "pipeline" for converting the Word document of the spec. into HTML is attached (I needed this for internal reasons, but I thought it might be useful generally).

This is not as easy as it sounds (unless someone knows something I don't), as Internet Assistant for Word wipes all the internal anchors on the section headings if you've got an auto-generated table of contents in there. I even had to lower myself to learning pidgin Word Basic macro language, as well as remembering my sed and awk.

Also, the internal structure of the spec has become terribly messy since it's been in Word. I spent a lot of time fixing all the internal links and anchors (there are 335 links in draft 7) and other things like bulleted and numbered lists done inconsistently. I've carefully documented the internal changes I've made below.

The pipeline isn't particularly generic and involves some manual intervention (it wouldn't do if I could be bothered to learn perl), but it works and you get clean HTML out the other end. I've passed draft 7 through it, so when the RFC comes out, someone may wish to use this pipeline for that too.

If anyone wants, I can stick these on our ftp server:
a) A clean Word file of draft 7 with all the internal linkage ready for conversion.
b) A clean HTML file of draft 7

I would add that neither a) nor b) would produce straight text that would be identical to the current plain/text draft 7. This is because:
- Many of the internal links to the references became squashed out of the spec. at various stages, but still exist internal to the Word file data structure, so I've re-revealed these.
- I tried to avoid altering the linear white space, but probably failed.

Whether this can still be called draft 7 is up for debate. This is why I haven't made it available already, in case someone objects on a change control basis.

Attached is the process.

Bob

================================================================================
Step 0: Removed auto-generated table of contents
        (unfortunately this doesn't remove the anchors it refers to)

Step 1: WINWORD: Unpicked bugs in bookmark naming and linkage as follows...

* Heading 3.11 Entity Tags
    Double bookmarked as "Entity_Tags" & "Opaque_Tags"
    Removed latter
* Heading 3.12 Range Units
    Double bookmarked as "Range_Units" & "Range_Protocol_Param"
    Removed latter
* Heading 8.1.3 Proxy Servers
    No bookmark
    Added bookmark "Persist_Proxy_Servers"
* Heading 9.1 Safe and Idempotent Methods
    No bookmark
    Added bookmark "Safe_Idem_Methods"
* Heading 12.1 Server-driven Negotiation
    No bookmark
    Added bookmark "Server_driven_Negotiation"
* Heading 12.2 Agent-driven Negotiation
    No bookmark
    Added bookmark "Agent_driven_Negotiation"
* Heading 12.3 Transparent Negotiation
    No bookmark
    Added bookmark "Transparent_Negotiation"
* Heading 13.3.2 Entity Tag Cache Validators
    was bookmarked as "Tags" & nested within it was the "Entity_Tag_Cache_Validators" bookmark
    Removed former and replaced it with latter
* Heading 13.4 Response Cachability
    Triple bookmarked as "Response_Cachability", "Caching_and_Status_C" & "Constructing_Respons"
    Removed all but first
* Heading 13.6 Caching Negotiated Responses
    Double bookmarked as "Vary_Header_Use" & "Caching_and_Varying_"
    Removed latter
* Section 14.4
    Arbitrary sentence at end was bookmarked "OLE_LINK8"
    Removed bookmark
* Section 14.5
    Whole section was bookmarked as "OLE_LINK1"
    Removed bookmark
* Heading 14.9.2 What May be Stored by Caches
    Bookmark ended a character early
    Shifted bookmark end right
* Heading 14.16 Content-MD5
    Bookmark ended after para mark
    Shifted bookmark end left
* Heading 14.25 If-Match
    Double bookmarked as "If_Match" & "If_Valid"
    Removed latter
* Heading 14.28 If-Unmodified-Since
    Double bookmarked as "If_Unmodified_Since" & "Unless_Modified_Sinc"
    Removed latter
* Heading 14.44 Via
    Double bookmarked as "Via" & "Forwarded"
    Removed latter
* Heading 15.2 Offering a Choice of Authentication Schemes
    No bookmark
    Added bookmark "Choice_of_Authentication"
* Ref [26] Improving HTTP Latency
    No bookmark
    Added bookmark "RefLatency"
* Ref [27] Analysis of HTTP Performance
    No bookmark & URL was bookmarked as "OLE_LINK2"
    Added bookmark named "RefPerformance" & Removed "OLE_LINK2"
* Ref [29] , RFC 1951
    No bookmark
    Added bookmark "Ref1951"
* Ref [30] Analysis of HTTP Performance Problems
    No bookmark
    Added bookmark "RefPerfProbs"
* Ref [31] , RFC 1950
    No bookmark
    Added bookmark "Ref1950"
* Ref [32] Work In Progress for Digest authentication
    No bookmark, but "DigestRef" bookmark spanned refs [26] & [27]
    Moved "DigestRef" bookmark to ref [32]
* Heading 18. Authors
    "Authors" bookmark spanned previous section
    Moved start of bookmark to start of heading
* Roy T. Fielding
    hyper link duplicated internal to macrobutton
    Opened up with <SHIFT>F9 & removed one of two
* Heading 19.4 Differences Between HTTP Entities and RFC 1521 Entities
    Bookmark ended a word early
    Shifted bookmark end right
* Heading 19.5.1 Changes to Simplify Multi-homed Web Servers and Conserve IP Addresses
    was bookmarked as "Changes_For_Host_Support" and "OLE_LINK4" & nested within it was the "AppHost" bookmark
    Removed last two bookmarks
* Heading 19.8 Notes to the RFC Editor and IANA
    Bookmark "Notes_to_RFC_Editor" spanned previous two sections
    Moved start of bookmark to start of heading
* Heading 19.8.1 Charset Registry
    No bookmarks
    Added bookmark "Charset_Registry"
* Heading 19.8.2 Content-coding Values
    No bookmarks
    Added bookmark "Content_coding_Values"
* Heading 19.8.3 New Media Types Registered
    No bookmarks
    Added bookmark "New_Media_Types_Registered"
* Heading 19.8.4 Possible Merge With Digest Authentication Draft
    No bookmarks
    Added bookmark "Possible_Merge_With_Digest_Authen"
* Heading 19.8.5 Media type parameters named "q"
    No bookmarks
    Added bookmark "Media_type_parameters_named_q"

Step 2: WINWORD: Fix bullet lists and numbered lists that have been done manually rather than as a style

Step 3: WINWORD: Add internal hyperlinks to refs. from all [xx] references, including ones invisible within Word datastructure (revealed with <ALT><F9>)
        Save (http117a.doc)

Step 4: WINWORD: Correct typos accepted by working group (haven't done this)
        Save ()

Step 5: WINWORD: Run my TypeBkmks Word Macro

Step 6: WINWORD: Format Heading Numbering Remove
        Save (http117b.doc)

Step 7: WINWORD: Save As http117b.htm with Internet Assistant v2.03z

Step 8: UNIX: dos2unix http117b.htm http117c.htm

Step 9: UNIX: sed -f sedfile http117c.htm > http117d.htm
================================================================================
sedfile
================================================================================
s!<A NAME="_Toc[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]">{{A NAME={{!<A NAME="!gp
s!{{A NAME={{.*}}!!gp
s!}}!">!gp
s!<A NAME="_Toc[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]">\(.*\)</A>!\1!gp
s!<FONT [^>]*>!!gp
s!</FONT>!!gp
/<H[2-8]><A NAME=/ {
N
s!<A NAME=! <A NAME=!
s!\n! !
s!</H!\
</H!gp
}
================================================================================
Step 10: TEXT EDITOR:
        Dealt with any remaining anchors split across lines containing '{{' manually.
        Removed six remaining _Toc anchors manually
        Added another </H1> at end of title to match second <H1> at start
        Removed some spurious <B><LWS></B> and altered three top <B> headings to <H2>

Step 11: UNIX: awk -f awkfile http117d.htm > http117e.htm
================================================================================
awkfile
================================================================================
BEGIN { th2 = ""; th3 = ""; th4 = ""; th5 = ""; p2 = ". "; p3 = " "; p4 = " "; p5 = " "}
/<H2> <A NAME=/ { th2 = ++h2; if (h2 > 9) p2 = "."; $1 = $1 th2 p2; h3 = 0; h4 = 0; h5 = 0; p3 = " "; p4 = " "; p5 = " "}
/<H3> <A NAME=/ { th3 = ++h3; if (h3 > 9) p3 = ""; $1 = $1 th2 "." th3 p3; h4 = 0; h5 = 0; p4 = " "; p5 = " "}
/<H4> <A NAME=/ { th4 = ++h4; if (h4 > 9) p4 = ""; $1 = $1 th2 "." th3 "." th4 p4; h5 = 0; p5 = " "}
/<H5> <A NAME=/ { th5 = ++h5; if (h5 > 9) p5 = ""; $1 = $1 th2 "." th3 "." th4 "." th5 p5}
{print}
================================================================================
Step 12: UNIX: sed -n -f sedfile2 http117e.htm > contents.htm
================================================================================
sedfile2
================================================================================
/^<H[2-4]>[0-9.]* *<A NAME=/ {
s!<A NAME="!<A HREF="#!
s!^<H2>!!
s!^<H3>! !
s!^<H4>\([0-9]\.\)! \1!
s!^<H4>\([0-9][0-9]\)! \1!
s!</H[2-4].*>!!
p
}
================================================================================
Step 13: TEXT EDITOR: Paste contents.htm into http117e.htm surrounded by <PRE></PRE>

Step 14: Learn to avoid using Microsoft Internet Assistant for anything complex

Step 15: Say phew
================================================================================

____________________________________________________________________________
From:   Bob Briscoe, BT, Distributed Systems
Post:   B54 74, BT Labs, Martlesham Heath, Ipswich, IP5 7RE, England
E-Mail: rbriscoe@jungle.bt.co.uk
Tel:    +44 1473 645196
Fax:    +44 1473 640929
WWW:    http://www.jungle.bt.co.uk/people/rbriscoe.html (BT intranet only)
Received on Thursday, 17 October 1996 08:59:34 UTC
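
The hierarchical heading renumbering that awkfile performs in step 11 can also be sketched in Python, for anyone without awk to hand. This is an illustrative sketch only and not part of the pipeline in the message. The function name number_headings and the sample anchor names are hypothetical. Unlike the awk script, it anchors the match at the start of the line and puts a single space after each number rather than reproducing the padding variables (p2 to p5).

```python
import re

# Matches the "<Hn> <A NAME=" shape that the step 9 sed script leaves on
# heading lines (assumption: the tag is at the start of the line).
HEADING = re.compile(r"^<H([2-5])> (<A NAME=.*)$")


def number_headings(lines):
    """Prefix H2-H5 heading lines with hierarchical section numbers."""
    counters = [0, 0, 0, 0]  # running counts for H2, H3, H4, H5
    out = []
    for line in lines:
        m = HEADING.match(line)
        if not m:
            out.append(line)  # non-heading lines pass through unchanged
            continue
        depth = int(m.group(1)) - 2  # 0 for H2 ... 3 for H5
        counters[depth] += 1
        for deeper in range(depth + 1, 4):
            counters[deeper] = 0  # a new heading resets all deeper levels
        number = ".".join(str(c) for c in counters[: depth + 1])
        if depth == 0:
            number += "."  # top-level sections read "1.", "2.", ...
        out.append(f"<H{m.group(1)}>{number} {m.group(2)}")
    return out


if __name__ == "__main__":
    # Hypothetical sample input; the anchor names are made up for illustration.
    sample = [
        '<H2> <A NAME="Intro">Introduction</A>',
        "</H2>",
        '<H3> <A NAME="Purpose">Purpose</A>',
        "</H3>",
    ]
    for numbered in number_headings(sample):
        print(numbered)
```

The numbered lines still match the /^<H[2-4]>[0-9.]* *<A NAME=/ address used by sedfile2 in step 12, so the contents extraction would apply to this output in the same way.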