Repost: Word-HTML "Pipeline" for HTTP/1.1 draft

[Re-posted: I missed one correction to the Word file, my awk script had a
bug, and I've now included the WordBasic macro source.]

A "pipeline" for converting the Word document of the spec. into HTML is
attached (I needed this for internal reasons, but I thought it might be
useful generally).

This is not as easy as it sounds (unless someone knows something I don't) as
Internet Assistant for Word wipes all the internal anchors on the section
headings if you've got an auto-generated table of contents in there. I even
had to lower myself to learning pidgin Word Basic macro language, as well as
remembering my sed and awk.

Also, the internal structure of the spec has become terribly messy since
it's been in Word. I spent a lot of time fixing all the internal links and
anchors (there are 335 links in draft 7) and other things like bulleted and
numbered lists done inconsistently. I've carefully documented the internal
changes I've made below.
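
With that many links, a mechanical check that every internal link has an
anchor to land on is worth having. A minimal sketch in Python, assuming
anchors look like <A NAME="x"> and internal links like <A HREF="#x"> in the
converted HTML; the function name and the sample page are made up for
illustration and are not part of the pipeline below:

```python
import re

def dangling_links(html):
    """Return sorted internal link targets that have no matching anchor."""
    names = set(re.findall(r'<A NAME="([^"]*)"', html))
    targets = set(re.findall(r'<A HREF="#([^"]*)"', html))
    return sorted(targets - names)

# Made-up sample: one link resolves, one does not.
page = '''<H2><A NAME="Entity_Tags">Entity Tags</A></H2>
<P>See <A HREF="#Entity_Tags">entity tags</A> and <A HREF="#Ref1951">[29]</A>.'''
print(dangling_links(page))
```

An empty list would mean every internal link resolves.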

The pipeline isn't particularly generic and involves some manual
intervention (it wouldn't if I could be bothered to learn Perl), but it
works and you get clean HTML out the other end.

I've passed draft 7 through it, so when the RFC comes out, someone may wish
to use this pipeline for that too.

If anyone wants, I can stick these on our ftp server:
a) A clean Word file of draft 7 with all the internal linkage ready for
conversion.
b) A clean HTML file of draft 7

I would add that neither a) nor b) would produce straight text that would be
identical to the current plain-text draft 7. This is because:
- Many of the internal links to the references were squashed out of the
visible spec. at various stages but still exist inside the Word file's data
structure, so I've revealed them again.
- I tried to avoid altering the linear white space, but probably failed.

Whether this can still be called draft 7 is up for debate. This is why I
haven't made it available already, in case someone objects on a change
control basis.

Attached is the process.

Bob

================================================================================

Step 0: Removed auto-generated table of contents
 (unfortunately this doesn't remove the anchors it refers to)

Step 1: WINWORD: Unpicked bugs in bookmark naming and linkage as follows...

* Heading 3.11 Entity Tags
Double bookmarked as "Entity_Tags" & "Opaque_Tags"
Removed latter
* Heading 3.12 Range Units
Double bookmarked as "Range_Units" & "Range_Protocol_Param"
Removed latter
* Heading 8.1.3 Proxy Servers
No bookmark
Added bookmark "Persist_Proxy_Servers"
* Heading 9.1 Safe and Idempotent Methods
No bookmark
Added bookmark "Safe_Idem_Methods"
* Heading 12.1 Server-driven Negotiation
No bookmark
Added bookmark "Server_driven_Negotiation"
* Heading 12.2 Agent-driven Negotiation
No bookmark
Added bookmark "Agent_driven_Negotiation"
* Heading 12.3 Transparent Negotiation
No bookmark
Added bookmark "Transparent_Negotiation"
* Heading 13.3.2 Entity Tag Cache Validators
was bookmarked as "Tags" & nested within it was the
"Entity_Tag_Cache_Validators" bookmark
Removed former and replaced it with latter
* Heading 13.4 Response Cachability
Triple bookmarked as "Response_Cachability", "Caching_and_Status_C" &
"Constructing_Respons"
Removed all but first
* Heading 13.6 Caching Negotiated Responses
Double bookmarked as "Vary_Header_Use" & "Caching_and_Varying_"
Removed latter
* Section 14.4
Arbitrary sentence at end was bookmarked "OLE_LINK8"
Removed bookmark
* Section 14.5
Whole section was bookmarked as "OLE_LINK1"
Removed bookmark
* Heading 14.9.2 What May be Stored by Caches
Bookmark ended a character early
Shifted bookmark end right
* Heading 14.16 Content-MD5
Bookmark ended after para mark
Shifted bookmark end left
* Heading 14.25 If-Match
Double bookmarked as "If_Match" & "If_Valid"
Removed latter
* Heading 14.28 If-Unmodified-Since
Double bookmarked as "If_Unmodified_Since" & "Unless_Modified_Sinc"
Removed latter
* Heading 14.44 Via
Double bookmarked as "Via" & "Forwarded"
Removed latter
* Heading 15.2 Offering a Choice of Authentication Schemes 
No bookmark
Added bookmark "Choice_of_Authentication"
* Ref [26] Improving HTTP Latency
No bookmark
Added bookmark "RefLatency"
* Ref [27] Analysis of HTTP Performance
No bookmark & URL was bookmarked as "OLE_LINK2"
Added bookmark named "RefPerformance" & Removed "OLE_LINK2"
* Ref [29] , RFC 1951
No bookmark
Added bookmark "Ref1951"
* Ref [30] Analysis of HTTP Performance Problems
No bookmark
Added bookmark "RefPerfProbs"
* Ref [31] , RFC 1950
No bookmark
Added bookmark "Ref1950"
* Ref [32] Work In Progress for Digest authentication
No bookmark, but "DigestRef" bookmark spanned refs [26] & [27]
Moved "DigestRef" bookmark to ref [32] 
* Heading 18. Authors
"Authors" bookmark spanned previous section
Moved start of bookmark to start of heading
* Roy T. Fielding hyper link
duplicated internal to macrobutton
Opened up with <SHIFT>F9 & removed one of two
* Heading 19.4 Differences Between HTTP Entities and RFC 1521 Entities 
Bookmark ended a word early
Shifted bookmark end right
* Heading 19.5.1 Changes to Simplify Multi-homed Web Servers and Conserve IP
Addresses
was bookmarked as "Changes_For_Host_Support" and "OLE_LINK4" & nested within
it was the "AppHost" bookmark
Removed last two bookmarks
* Heading 19.8 Notes to the RFC Editor and IANA
Bookmark "Notes_to_RFC_Editor" spanned previous two sections
Moved start of bookmark to start of heading
* Heading 19.8.1 Charset Registry
No bookmarks
Added bookmark "Charset_Registry"
* Heading 19.8.2 Content-coding Values
No bookmarks
Added bookmark "Content_coding_Values"
* Heading 19.8.3 New Media Types Registered
No bookmarks
Added bookmark "New_Media_Types_Registered"
* Heading 19.8.4 Possible Merge With Digest Authentication Draft
No bookmarks
Added bookmark "Possible_Merge_With_Digest_Authen"
* Heading 19.8.5 Media type parameters named "q"
No bookmarks
Added bookmark "Media_type_parameters_named_q"
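
The faults listed above fall into three kinds: a heading with no bookmark, a
heading with more than one, and a bookmark whose range doesn't match the
heading. A minimal sketch of that classification in Python, assuming the
heading and bookmark character ranges have somehow been exported from Word
(the audit function, the data layout and the offsets are all hypothetical;
the fixes above were made by hand in Word):

```python
# Hypothetical audit of heading bookmarks. Ranges are (start, end) character
# offsets, end exclusive; all names and offsets below are made up.

def audit(headings, bookmarks):
    """Return {heading title: list of problems} for headings with faults."""
    report = {}
    for title, (h_start, h_end) in headings.items():
        # Bookmarks that overlap this heading at all.
        hits = [(name, b_start, b_end)
                for name, (b_start, b_end) in bookmarks.items()
                if b_start < h_end and b_end > h_start]
        problems = []
        if not hits:
            problems.append("no bookmark")
        if len(hits) > 1:
            names = ", ".join(sorted(name for name, _, _ in hits))
            problems.append("multiple bookmarks: " + names)
        for name, b_start, b_end in hits:
            if (b_start, b_end) != (h_start, h_end):
                problems.append(name + " does not match heading range")
        if problems:
            report[title] = problems
    return report

headings = {
    "Entity Tags": (100, 111),
    "Proxy Servers": (200, 213),
    "Content-MD5": (300, 311),
    "Via": (400, 403),
}
bookmarks = {
    "Entity_Tags": (100, 111),
    "Opaque_Tags": (100, 111),   # second bookmark on the same heading
    "Content_MD5": (300, 312),   # runs one character past the heading
    "Via": (400, 403),
}
report = audit(headings, bookmarks)
for title, problems in report.items():
    print(title, "->", "; ".join(problems))
```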

Step 2: WINWORD: Fix bullet lists and numbered lists that
 have been done manually rather than as a style

Step 3: WINWORD: Add internal hyperlinks to refs. from all [xx] references,
 including ones hidden within the Word data structure (revealed with <ALT><F9>)

Save (http117a.doc)

Step 4: WINWORD: Correct typos accepted by working group (haven't done this)

Save ()

Step 5: WINWORD: Run my TypeBkmks Word Macro

Step 6: WINWORD: Format | Heading Numbering | Remove

Save (http117b.doc)

Step 7: WINWORD: Save As http117b.htm with Internet Assistant v2.03z

Step 8: UNIX: dos2unix http117b.htm http117c.htm
Step 9: UNIX: sed -f sedfile http117c.htm > http117d.htm

================================================================================
sedfile
================================================================================
s!<A NAME="_Toc[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]">{{A NAME={{!<A NAME="!gp
s!{{A NAME={{.*}}!!gp
s!}}!">!gp
s!<A NAME="_Toc[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]">\(.*\)</A>!\1!gp
s!<FONT [^>]*>!!gp
s!</FONT>!!gp
/<H[2-8]><A NAME=/ { N
			s!<A NAME=!  <A NAME=!
			s!\n! !
			s!</H!\
</H!gp
}
================================================================================
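
To show what the first six substitutions in sedfile are doing, here is a
rough Python equivalent applied to single lines. The sample input is only a
guess at the shape of the Internet Assistant output after the macro has typed
its {{A NAME={{...}} markers into the headings; it is not taken from the real
file, and clean_line is a made-up name:

```python
import re

# Python equivalents of the first six substitutions in sedfile. The sample
# lines are an assumed shape for Internet Assistant output, not real data.
TOC = r'<A NAME="_Toc[0-9]{9}">'

def clean_line(line):
    # Rule 1: a _Toc anchor followed by a typed marker becomes a real anchor start.
    line = re.sub(TOC + r'\{\{A NAME=\{\{', '<A NAME="', line)
    # Rule 2: drop any complete marker that was not attached to a _Toc anchor.
    line = re.sub(r'\{\{A NAME=\{\{.*\}\}', '', line)
    # Rule 3: the closing braces end the NAME attribute.
    line = line.replace('}}', '">')
    # Rule 4: unwrap _Toc anchors that had no marker, keeping their text.
    line = re.sub(TOC + r'(.*)</A>', r'\1', line)
    # Rules 5 and 6: strip FONT tags.
    line = re.sub(r'<FONT [^>]*>', '', line)
    line = line.replace('</FONT>', '')
    return line

sample = '<H3><A NAME="_Toc123456789">{{A NAME={{Entity_Tags}}Entity Tags</A></H3>'
print(clean_line(sample))
```

The heading block at the end of sedfile (joining the next line and moving
the closing heading tag onto its own line) is not covered by this sketch.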

Step 10: TEXT EDITOR:
Dealt with any remaining anchors split across lines containing '{{' manually.
Removed six remaining _Toc anchors manually
Added another </H1> at end of title to match second <H1> at start
Removed some spurious <B><LWS></B> and altered three top <B> headings to <H2>

Step 11: UNIX: awk -f awkfile http117d.htm > http117e.htm

================================================================================
awkfile
================================================================================
BEGIN			 { th2 = ""; th3 = ""; th4 = ""; th5 = "";
					p2 = ". "; p3 = " "; p4 = " "; p5 = " "}
/<H2>  <A NAME=/ { th2 = ++h2;
					if (h2 > 9) p2 = ".";
					$1 = $1 th2 p2;
					h3 = 0; h4 = 0; h5 = 0;
					p3 = " "; p4 = " "; p5 = " "}
/<H3>  <A NAME=/ { th3 = ++h3;
					if (h3 > 9) p3 = "";
					$1 = $1 th2 "." th3 p3;
					h4 = 0; h5 = 0;
					p4 = " "; p5 = " "}
/<H4>  <A NAME=/ { th4 = ++h4;
					if (h4 > 9) p4 = "";
					$1 = $1 th2 "." th3 "." th4 p4;
					h5 = 0;
					p5 = " "}
/<H5>  <A NAME=/ { th5 = ++h5;
					if (h5 > 9) p5 = "";
					$1 = $1 th2 "." th3 "." th4 "." th5 p5}
				 {print}
================================================================================
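
The counter logic in awkfile can be sketched in Python as follows. This is a
simplification under stated assumptions: it matches only at the start of the
line and uses a fixed two-space gap after the number instead of the awk
script's padding, which narrows once a number reaches two digits. The
function name and the sample headings are made up:

```python
import re

def number_headings(lines):
    """Prefix H2..H5 heading lines with hierarchical section numbers."""
    counters = [0, 0, 0, 0]          # one counter each for H2, H3, H4, H5
    out = []
    for line in lines:
        m = re.match(r'<H([2-5])>  <A NAME=', line)
        if m:
            depth = int(m.group(1)) - 2
            counters[depth] += 1
            # Starting a new section resets every deeper counter.
            for deeper in range(depth + 1, 4):
                counters[deeper] = 0
            number = ".".join(str(n) for n in counters[:depth + 1])
            if depth == 0:
                number += "."        # top-level numbers get a trailing dot
            tag = m.group(0)[:4]     # "<H2>", "<H3>", ...
            line = tag + number + "  " + line[len(tag):].lstrip()
        out.append(line)
    return out

lines = [
    '<H2>  <A NAME="Intro">Introduction</A>',
    '<H3>  <A NAME="Purpose">Purpose</A>',
    '<H3>  <A NAME="Reqs">Requirements</A>',
    '<P>text',
    '<H2>  <A NAME="Notation">Notational Conventions</A>',
    '<H3>  <A NAME="BNF">Augmented BNF</A>',
]
numbered = number_headings(lines)
print("\n".join(numbered))
```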

Step 12: UNIX: sed -n -f sedfile2 http117e.htm > contents.htm

================================================================================
sedfile2
================================================================================
/^<H[2-4]>[0-9.]*  *<A NAME=/ {s!<A NAME="!<A HREF="#!
					s!^<H2>!!
					s!^<H3>!    !
					s!^<H4>\([0-9]\.\)!         \1!
					s!^<H4>\([0-9][0-9]\)!          \1!
					s!</H[2-4].*>!!
					p
}
================================================================================
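
What sedfile2 does to each numbered heading line can be sketched in Python
like this. It is an approximation: it indents four spaces per heading level
rather than reproducing the exact indents in sedfile2, and the function name
and sample lines are made up:

```python
import re

HEADING = re.compile(r'^<H([2-4])>([0-9.]*) +<A NAME="')

def contents_line(line):
    """Return a table-of-contents entry for a numbered heading line, else None."""
    m = HEADING.match(line)
    if not m:
        return None
    level = int(m.group(1))
    entry = line[len('<H2>'):]                            # drop the heading tag
    entry = entry.replace('<A NAME="', '<A HREF="#', 1)   # anchor becomes a link to it
    entry = re.sub(r'</H[2-4].*>', '', entry)             # drop any closing heading tag
    return '    ' * (level - 2) + entry

for line in ['<H2>1.  <A NAME="Intro">Introduction</A></H2>',
             '<H3>1.1  <A NAME="Purpose">Purpose</A>',
             '<P>text']:
    entry = contents_line(line)
    if entry is not None:
        print(entry)
```

Lines that are not H2 to H4 headings produce nothing, which matches running
sed with -n and printing only inside the heading block.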

Step 13: TEXT EDITOR: Paste contents.htm into http117e.htm surrounded by
<PRE></PRE>

Step 14: Learn to avoid using Microsoft Internet Assistant for anything complex

Step 15: Say phew
================================================================================
____________________________________________________________________________
From:    Bob Briscoe,                                BT, Distributed Systems
Post:    B54 74, BT Labs,    Martlesham Heath,   Ipswich, IP5 7RE,   England
E-Mail:  rbriscoe@jungle.bt.co.uk
Tel:     +44 1473 645196                                Fax: +44 1473 640929
WWW:     http://www.jungle.bt.co.uk/people/rbriscoe.html  (BT intranet only)

Received on Sunday, 20 October 1996 14:00:22 UTC