Microsoft XML File formats for Office Suite & Binary XML

Though I normally leave it to other corporate members to speak for 
themselves, we no longer have a Microsoft employee on the TAG, and I 
thought the recent announcement regarding "Office" file formats would be 
of interest.  As I understand it, the next major revision of Microsoft 
Office will default to saving files in a new XML-based format that uses a 
zip file container for both structuring and compression. 

Traditional Office files were based on a container format known informally 
as Docfile.  Within a single Docfile were multiple logical streams; 
typically one such stream contained the "main" part of the file, while 
another might have summary properties (author, etc.), and yet others were 
used for OLE embeddings.  So, if you have a spreadsheet inside a word 
processor document, that's another stream.

In the new formats, the role of the Docfile is now taken over by a 
Zip-format container.  The "main" stream at each level is an XML document, 
and the Zip encoding provides compression.  Other streams are used for 
images (e.g. JPEGs), etc.  Embeddings are apparently handled by nesting 
Zip files within the substreams of the outer Zip.  The files are expected 
to have default extensions of .docx, .pptx, etc.  Partly because the 
original binary Docfiles were actually quite verbose, the new formats 
are claimed to be much smaller in many cases.
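
To make the container idea concrete, here is a minimal sketch in Python of 
a Zip container whose "main" part is an XML document, with an image part 
and a nested Zip standing in for an embedded spreadsheet.  The part names 
(main.xml, media/image1.jpg, embeddings/sheet1.zip) and the XML vocabulary 
are my own assumptions for illustration; they are not the names or schemas 
used by the actual Office formats.

```python
import io
import zipfile

# Hypothetical content and part names, chosen only for illustration.
MAIN_XML = "<document><p>Hello</p></document>"
SHEET_XML = "<sheet><cell>42</cell></sheet>"


def build_container(main_xml, extra_parts=None):
    """Return the bytes of a Zip container with a main XML part plus extras."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("main.xml", main_xml)
        for name, data in (extra_parts or {}).items():
            zf.writestr(name, data)
    return buf.getvalue()


def read_main(container_bytes):
    """Return the main XML part of a container as text."""
    with zipfile.ZipFile(io.BytesIO(container_bytes)) as zf:
        return zf.read("main.xml").decode("utf-8")


def read_embedded_main(container_bytes, part_name):
    """Open a nested container stored as a part and return its main XML."""
    with zipfile.ZipFile(io.BytesIO(container_bytes)) as zf:
        return read_main(zf.read(part_name))


# The embedded "spreadsheet" is itself a complete container...
inner = build_container(SHEET_XML)

# ...stored as a single part of the outer "word processor" container.
outer = build_container(
    MAIN_XML,
    {
        "media/image1.jpg": b"\xff\xd8 placeholder bytes, not a real JPEG",
        "embeddings/sheet1.zip": inner,
    },
)

print(read_main(outer))
print(read_embedded_main(outer, "embeddings/sheet1.zip"))
```

Each nesting level is read the same way as the outer file, which mirrors 
the "main stream at each level" structure described above.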

Anyway, I mention all this because it seems to bear at least indirectly on 
the Binary XML discussion.  At least one major vendor will be using zip'd 
XML to achieve compression.  FYI, an interesting video interview with one 
of the Office designers is available in the blog entry at [1]. 
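
As a rough illustration of why zipped XML is relevant to the compression 
side of that discussion, the sketch below deflates some deliberately 
repetitive XML using Python's zlib module, which implements the DEFLATE 
method that Zip files commonly use.  The markup is invented for the 
example, and the ratio it prints says nothing about real Office documents.

```python
import zlib


def make_rows(n):
    """Build a repetitive, made-up XML document with n rows, as UTF-8 bytes."""
    rows = "".join(
        f"<row><cell>{i}</cell><cell>item {i}</cell></row>" for i in range(n)
    )
    return f"<sheet>{rows}</sheet>".encode("utf-8")


raw = make_rows(1000)
packed = zlib.compress(raw)

# Repeated tag names are exactly the redundancy DEFLATE is good at removing.
print(f"raw: {len(raw)} bytes, deflated: {len(packed)} bytes")
print(f"ratio: {len(packed) / len(raw):.2f}")
```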

Also, though I am not particularly expert in OpenOffice, my impression is 
that there are significant similarities between the Microsoft approach and 
that deployed in OpenOffice [2,3,4], which also uses XML and Zip. 



Noah Mendelsohn 
IBM Corporation
One Rogers Street
Cambridge, MA 02142

Received on Monday, 27 June 2005 18:35:39 UTC