- From: john wilbanks <jwilbanks@rcn.com>
- Date: Thu, 05 Aug 2004 10:27:25 -0400
- To: John Wilbanks <wilbanks@w3.org>
- Cc: Eric.Neumann@aventis.com, public-semweb-lifesci@w3.org
So I've been digging around through the team here and have two more approaches.

First, see the generic version of MTOM, aka XOP, at http://www.w3.org/TR/2004/WD-xop10-20040608/ and the pre-CR editor's draft at http://www.w3.org/2000/xp/Group/3/06/Attachments/XOP.html - this is one way of handling binary data.

Second, the XML Binary Characterization Working Group has a more comprehensive list of approaches. Here's the public page: http://www.w3.org/XML/Binary/ - those of you who are members of the W3C might want to join the working group on this issue (and if you're not a member, the ability to move this type of work along ought to be a reason to join!).

..........................................................................

In particular, I thought this description of floating point arrays in energy might be relevant (from the use case doc linked below) in that it requires the interchange of two types of data sets, one "moderately large" and one "really large":

http://www.w3.org/TR/xbc-use-cases/#FPenergy

3.2 Floating Point Arrays in the Energy Industry

3.2.1 Description

The upstream segment of the energy industry is concerned with exploration for and production of oil and gas. XML-based techniques have made very little penetration into the upstream technology part of the energy industry. The most basic reason for this is the nature of the data, which does not at this time lend itself to being represented usefully in XML.

There are basically two core types of data in this industry: well logs and seismic data. Well logs are moderately large datasets while seismic datasets are really large, typically in the order of gigabytes. Although the Petrotechnical Open Standards Consortium (POSC) has produced an XML schema for well logs, it has not been adopted by the industry. At the time of writing, nobody has even considered defining a schema for seismic data.
Both seismic and well log data include control data, easily represented in XML, as well as large arrays of floating point numbers, not easily represented efficiently in XML. Although in practice an XML representation is not used, such data may be represented as shown in the following fragment (with a whole document consisting of a large number of these fragments):

<header>
  <linename>westcam 2811</linename>
  <ntrace>1207</ntrace>
  <nsamp>3001</nsamp>
  <tstart>0.0</tstart>
  <tinc>4.0</tinc>
  <shot>120</shot>
  <geophone>7</geophone>
</header>
<trace>0.0, 0.0, 468.34, 3.245672E04, 6.9762345E05, ... (3001 floats)</trace>

3.2.2 Domain

The scope within the Energy Industry as discussed above is very broad, encompassing a very large number of technical issues and usage scenarios involving, for example, integration of drilling information, processing of seismic and well data, integration of seismic and well data into interpretation systems, and so on.

3.2.3 Justification

There are a number of dominant technology vendors in this sector as well as a number of small companies that "work around the edges". The dominant technology vendors (none of which are technology giants) provide proprietary solutions that do not interoperate easily with each other. Providing communications between these products within a company, or between companies, is a constant problem: this is the main motivator to develop Web service interfaces for these products. A second motivator for a standard is that it will open the door for smaller companies to provide useful add-on products.

Large budgets in this sector are allocated to the purchase of software packages and display devices, but these budgets are small compared to the leverage of mass-market devices, so a longer term objective is to encourage a situation where more technologies with mass-market cost leverage can be used.
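As a rough illustration of the text representation in the 3.2.1 fragment, here is a minimal Python sketch that reads one header/trace pair and converts the comma-separated trace into floats. The <record> wrapper element, the parse_record name, and the shortened five-sample trace are assumptions made for the example; they are not part of the use case document.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: header values are taken from the fragment in 3.2.1,
# the <record> wrapper is an assumption (the fragment has two top-level
# elements), and the trace is shortened to five samples.
FRAGMENT = """
<record>
  <header>
    <linename>westcam 2811</linename>
    <ntrace>1207</ntrace>
    <nsamp>3001</nsamp>
    <tstart>0.0</tstart>
    <tinc>4.0</tinc>
    <shot>120</shot>
    <geophone>7</geophone>
  </header>
  <trace>0.0, 0.0, 468.34, 3.245672E04, 6.9762345E05</trace>
</record>
"""


def parse_record(xml_text):
    """Return (header dict, list of float samples) for one record."""
    root = ET.fromstring(xml_text)
    # Control data stays textual: one dict entry per header child element.
    header = {child.tag: child.text.strip() for child in root.find("header")}
    # Every sample costs a character-to-float conversion.
    samples = [float(token) for token in root.find("trace").text.split(",")]
    return header, samples


header, samples = parse_record(FRAGMENT)
print(header["linename"], len(samples))
```

With 3001 samples per trace and 1207 traces per line, the float() conversion in parse_record is the step that gets repeated millions of times.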
3.2.4 Analysis

Given that this scenario involves interoperability between companies using disparate systems, XML is a natural choice due to its ubiquity and tool availability. The main shortcoming of XML for this application is the expense incurred while converting floating point data to and from a character representation, as well as the extra size of some of these representations.

Thus, the main requirement for this use case is the ability to represent sequences of floating point numbers in a binary format (as close to the native representation as possible), in order to facilitate efficient binding into programmatic objects (primarily, floating point arrays). In the example shown above, the header information would still have a textual representation (useful for any infoset-based processing), but the trace of floating point numbers would appear as an opaque binary stream.

The format chosen to represent a floating point number must be platform independent, with tools supporting conversions to and from the appropriate native format. In practice, most operations involve moving data between machines with the same floating point formats, so the solution should not impose undue overhead on the most common situation in order to handle the less common ones.

3.2.5 Alternatives

* Data Compression: One expert in this area has said, "For us, binary compression is probably not that important because transmission speeds are constantly improving. The additional time needed to compress and decompress seismic data would probably slow things down. We also place a greater value in the message structures than the transmission mechanics". Or, in more picturesque words, again from an expert in the field when asked about compressing seismic data, "Been there, done that, doesn't work, not interested". Bear in mind that this epigram encapsulates decades of experience and highly sophisticated R&D.
* CORBA: There is, in fact, a CORBA-based integration platform currently deployed (although perhaps not widely) in this space. Without diving into technical details, it is clear that some companies would prefer an approach based on Web services.

* XML Protocol Attachments: It is possible to represent seismic data control information in XML and to put the floating point arrays in a binary attachment using XOP. This data architecture is certainly viable, assuming that the issues involving floating point numbers are addressed, as evidenced by the fact that many of the proprietary vendor data formats work this way. It is, however, less flexible than the header-trace architecture described above, which is probably one reason why the latter is used in industry-wide seismic data standards (e.g. SEGY). Nonetheless, Web services that return data using XOP are an attractive alternative for dealing with seismic data.

3.2.6 References

1. [POSC]
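To make the requirement in 3.2.4 concrete, here is a minimal Python sketch of packing a trace as a platform-independent binary stream and reading it back. It assumes big-endian IEEE 754 single precision as the wire format, which is a choice made for the example only; neither the use case document nor XOP prescribes it. The function names are hypothetical.

```python
import struct
import sys
from array import array


def encode_trace(samples):
    """Pack floats as big-endian IEEE 754 float32 (assumed wire format)."""
    return struct.pack(f">{len(samples)}f", *samples)


def decode_trace(payload):
    """Unpack a big-endian float32 payload into a list of Python floats."""
    count = len(payload) // 4
    return list(struct.unpack(f">{count}f", payload))


def decode_trace_fast(payload):
    """Bulk-load the payload into a float array.

    The byte swap runs only on little-endian hosts, so a host whose native
    order already matches the wire order pays no conversion cost.
    """
    values = array("f")
    values.frombytes(payload)
    if sys.byteorder == "little":
        values.byteswap()
    return values


# Shortened, hypothetical trace based on the values quoted in 3.2.1.
trace = [0.0, 0.0, 468.34, 3.245672e04, 6.9762345e05]
text_form = ", ".join(repr(x) for x in trace).encode("ascii")
binary_form = encode_trace(trace)
print(len(text_form), "bytes as text,", len(binary_form), "bytes as binary")
```

In an XOP-style arrangement the header would remain ordinary XML and the bytes from encode_trace would travel as the binary attachment. Note that float32 is lossy for values that need more than about seven significant digits, so a real format would have to settle on precision as well as byte order.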
Received on Thursday, 5 August 2004 10:27:44 UTC