More: describing binary content in SML

So I've been asking around the team here and have two more approaches
to suggest.

First, see the generic version of MTOM, aka XOP:
http://www.w3.org/TR/2004/WD-xop10-20040608/
and the pre-CR editor's draft at
http://www.w3.org/2000/xp/Group/3/06/Attachments/XOP.html  - this is one 
way of handling binary data.

The XML Binary Characterization Working Group has a more comprehensive 
list of approaches.  Here's the public page 
http://www.w3.org/XML/Binary/ - those of you who are members of the W3C
might want to join the working group on this issue (and if you're not a 
member, the ability to move this type of work along ought to be a reason 
to join!).
.....................................................................................................
In particular, I thought this description of floating point arrays in
the energy industry might be relevant (from the use case doc linked
below), in that it requires the interchange of two types of data sets,
one "moderately large" and one "really large":

http://www.w3.org/TR/xbc-use-cases/#FPenergy

3.2 Floating Point Arrays in the Energy Industry
3.2.1 Description

The upstream segment of the energy industry is concerned with 
exploration for and production of oil and gas. XML-based techniques have 
made very little penetration into the upstream technology part of the 
energy industry. The most basic reason for this is the nature of the 
data, which does not at this time lend itself to being represented 
usefully in XML.

There are basically two core types of data in this industry: well logs 
and seismic data. Well logs are moderately large datasets while seismic
datasets are really large, typically on the order of gigabytes. Although
the Petrotechnical Open Standards Consortium (POSC) has produced an XML 
schema for well logs, it has not been adopted by the industry. At the 
time of writing, nobody has even considered defining a schema for 
seismic data.

Both seismic and well log data include control data, easily represented 
in XML, as well as large arrays of floating point numbers, not easily 
represented efficiently in XML. Although in practice an XML 
representation is not used, such data may be represented as shown in the 
following fragment (with a whole document consisting of a large number 
of these fragments):


<header>
<linename>westcam 2811</linename>
<ntrace>1207</ntrace>
<nsamp>3001</nsamp>
<tstart>0.0</tstart>
<tinc>4.0</tinc>
<shot>120</shot>
<geophone>7</geophone>
</header>
<trace>0.0, 0.0, 468.34, 3.245672E04, 6.9762345E05, ... (3001 
floats)</trace>



3.2.2 Domain

The scope within the Energy Industry as discussed above is very broad, 
encompassing a very large number of technical issues and usage scenarios 
involving, for example, integration of drilling information, processing 
of seismic and well data, integration of seismic and well data into 
interpretation systems, and so on.

3.2.3 Justification

There are a number of dominant technology vendors in this sector as well 
as a number of small companies that "work around the edges". The 
dominant technology vendors (none of which are technology giants)
provide proprietary solutions that do not interoperate easily with each 
other. Providing communications between these products within a company, 
or between companies, is a constant problem: this is the main motivator 
to develop Web service interfaces for these products. A second motivator 
for a standard is that it will open the door for smaller companies to 
provide useful add-on products. Large budgets in this sector are 
allocated to the purchase of software packages and display devices, but 
these budgets are small compared to the leverage of mass-market devices, 
so a longer-term objective is to encourage a situation where more
technologies with mass-market cost leverage can be used.

3.2.4 Analysis

Given that this scenario involves interoperability between companies 
using disparate systems, XML is a natural choice due to its ubiquity and 
tool availability.

The main shortcoming of XML for this application is the expense incurred 
while converting floating point data to and from a character 
representation, as well as the extra size of some of these 
representations. Thus, the main requirement for this use case is the 
ability to represent sequences of floating point numbers in a binary 
format (as close to the native representation as possible), in order to 
facilitate efficient binding into programmatic objects (primarily, 
floating point arrays). In the example shown above, the header 
information would still have a textual representation (useful for any 
infoset-based processing), but the trace of floating point numbers would 
appear as an opaque binary stream.
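
To make the size and conversion cost concrete, here is a rough sketch in
Python that encodes the same few trace values both ways. The sample list
is made up from the fragment above, and the choice of 32-bit big-endian
IEEE 754 floats is my assumption; the use case does not name a format.

```python
import struct

# Hypothetical trace values taken from the fragment above; a real trace
# in that example would hold 3001 samples.
samples = [0.0, 0.0, 468.34, 3.245672e04, 6.9762345e05]

# Character representation, as it would appear inside a <trace> element.
as_text = ", ".join(repr(x) for x in samples)
text_bytes = as_text.encode("ascii")

# Binary representation: 32-bit IEEE 754 floats, big-endian (assumed).
as_binary = struct.pack(f">{len(samples)}f", *samples)

# Reading the text back needs one string-to-float parse per sample.
from_text = [float(token) for token in as_text.split(",")]

# Reading the binary form back is a single unpack call.
from_binary = list(struct.unpack(f">{len(samples)}f", as_binary))

print(len(text_bytes), "bytes as text")
print(len(as_binary), "bytes as binary")
```

Note that the 32-bit form rounds each value to single precision, so the
two round trips are not bit-for-bit equivalent.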

The format chosen to represent a floating point number must be platform 
independent, with tools supporting conversions to and from the 
appropriate native format. In practice, most operations involve moving 
data between machines with the same floating point formats, so the 
solution should not impose undue overhead on the most common situation 
in order to handle the less common ones.
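
One way to keep the common case cheap is to tag the data with the
sender's byte order and swap only on a mismatch. The sketch below is
just an illustration of that idea: the one-byte tag and the function
names are hypothetical, and it assumes Python's array type code "f" maps
to a 4-byte IEEE 754 float, which holds on mainstream platforms.

```python
import sys
from array import array

# Hypothetical one-byte tags recording the sender's byte order.
LITTLE, BIG = b"\x00", b"\x01"
NATIVE = LITTLE if sys.byteorder == "little" else BIG


def encode_trace(samples):
    """Write samples in the sender's native layout, prefixed by a tag."""
    return NATIVE + array("f", samples).tobytes()


def decode_trace(data):
    """Read a tagged trace, swapping bytes only if the orders differ."""
    tag, payload = data[:1], data[1:]
    buf = array("f")
    buf.frombytes(payload)
    if tag != NATIVE:
        # Less common case: the sender used the other byte order.
        buf.byteswap()
    return buf


trace = decode_trace(encode_trace([0.0, 468.5, 32456.75]))
print(list(trace))
```

When both machines share a byte order, decoding is a plain copy with no
per-sample work.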

3.2.5 Alternatives

* Data Compression: One expert in this area has said, "For us, binary
compression is probably not that important because transmission speeds 
are constantly improving. The additional time needed to compress and 
decompress seismic data would probably slow things down. We also place a 
greater value in the message structures than the transmission 
mechanics". Or, in more picturesque words, again from an expert in the 
field when asked about compressing seismic data, "Been there, done that, 
doesn't work, not interested". Bear in mind that this epigram 
encapsulates decades of experience and highly sophisticated R&D.

* CORBA: There is, in fact, a CORBA-based integration platform currently
deployed (although perhaps not widely) in this space. Without diving 
into technical details, it is clear that some companies would prefer an 
approach based on Web services.

* XML Protocol Attachments: It is possible to represent seismic data
control information in XML and to put the floating point arrays in a 
binary attachment using XOP. This data architecture is certainly viable, 
assuming that the issues involving floating point numbers are addressed, 
as evidenced by the fact that many of the proprietary vendor data 
formats work this way. It is, however, less flexible than the 
header-trace architecture described above, which is probably one reason 
why the latter is used in industry-wide seismic data standards (e.g. 
SEGY). Nonetheless, Web services that return data using XOP are an 
attractive alternative for dealing with seismic data.
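
For the XOP alternative, a very rough sketch of what such a package could
look like, built with Python's standard email library. The <seismic>
wrapper element, the Content-ID value and the function name are made up
for illustration. The xop:Include namespace is the one from the later
XOP 1.0 Recommendation and may not match the draft linked above. The
float layout (32-bit, big-endian) is again my assumption.

```python
import struct
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart

# Namespace as given in the XOP 1.0 Recommendation (assumed here).
XOP_NS = "http://www.w3.org/2004/08/xop/include"


def build_package(samples, content_id="trace1@example.org"):
    """Put control data in XML and the float array in a binary part."""
    xml = (
        "<seismic>"
        "<header><linename>westcam 2811</linename>"
        "<nsamp>%d</nsamp></header>"
        '<trace><xop:Include xmlns:xop="%s" href="cid:%s"/></trace>'
        "</seismic>" % (len(samples), XOP_NS, content_id)
    )
    package = MIMEMultipart("related", type="application/xop+xml")
    # Root part: the XML document that refers to the attachment by cid.
    package.attach(MIMEApplication(xml.encode("utf-8"), "xop+xml"))
    # Binary part: the trace as packed 32-bit big-endian floats.
    payload = struct.pack(">%df" % len(samples), *samples)
    binary = MIMEApplication(payload, "octet-stream")
    binary.add_header("Content-ID", "<%s>" % content_id)
    package.attach(binary)
    return package


package = build_package([0.0, 0.0, 468.5])
print(package.get_content_type())
```

The email library base64-encodes the binary part by default; a real
transport would more likely carry the raw bytes, which is the point of
the exercise.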

3.2.6 References

1. [POSC]

Received on Thursday, 5 August 2004 10:27:44 UTC