- From: Carl Eric Codere <carl.codere@optimasc.com>
- Date: Thu, 04 Oct 2007 23:31:46 -0400
- To: "Vogelheim, Daniel" <daniel.vogelheim@siemens.com>
- CC: public-exi@w3.org
Vogelheim, Daniel wrote:
> Hello Carl,
>
> You wrote:
>> Working with different file formats, I am the creator
>> of tools that permit identification of different resources (files). I
>> have noticed that EXI Structure only contains a 2-bit signature to
>> differentiate it from standard XML Documents. As stated in the
>> draft standard, this may be changed in the future. For me, for
>> easy identification and to avoid conflicts with other formats, this
>> would require at least a 32-bit signature/magic value. Has the issue
>> been decided on?
>>
>> Any clarification would be greatly appreciated.
>
> Thanks for sharing your concerns. The group has not yet reached a
> conclusion, so unfortunately I can't really offer you any clarifications
> yet.
>
> If you have specific concerns, arguments, or use cases and would want to
> share them with us here, I'd gladly present them to the WG during
> discussions.
>
>
> Sincerely,
> Daniel Vogelheim
>
Greetings,
Here are my arguments, based on both my personal experience,
and on existing standards:
Overview of file format identification
======================================
As we all know, file formats can be identified in several ways. Today,
major operating systems mostly use two of them.
The first one uses the file extension to determine the file format, and
then passes the file to the corresponding application. Assuming that
there is no magic data to clearly identify the file format, it will be
difficult for developers to validate the file format. They will need to
take all possible cases into consideration (EXI events should all be
well formed).
Because of this, software developed for this file format is much more
difficult to qualify, and a lot of effort will have to be put into
quality assurance just to take into account all wrongly formed binary
encodings.
On the other hand, the second way of identifying files, used by most
UNIX operating systems, relies on a magic value. This is less error
prone because, if the magic is recognized, it is practically assumed
that the rest of the file is valid, or at least generally follows the
required structure. Therefore not all error cases of wrongly formed
files need to be taken into account. This simplifies the quality
assurance phase of the applications that will process the file format.
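To illustrate the magic value approach, here is a minimal sketch in
Python. The table name KNOWN_MAGIC and the function identify() are
names I made up for this example; the signatures themselves are the
well-known ones for PNG, PDF, GIF and ZIP. It is only meant to show the
idea, not how any particular tool does it.

```python
# Illustrative table of well-known file signatures.
# KNOWN_MAGIC and identify() are hypothetical names for this sketch.
KNOWN_MAGIC = {
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"%PDF-": "PDF",
    b"GIF87a": "GIF",
    b"GIF89a": "GIF",
    b"PK\x03\x04": "ZIP",
}


def identify(header: bytes):
    """Return the format whose signature prefixes `header`, or None."""
    for magic, name in KNOWN_MAGIC.items():
        if header.startswith(magic):
            return name
    return None


def identify_file(path: str):
    """Read only the first bytes of a file and identify it."""
    longest = max(len(magic) for magic in KNOWN_MAGIC)
    with open(path, "rb") as handle:
        return identify(handle.read(longest))
```

Only the first few bytes have to be read, and a file that does not
match any signature can be rejected before any parsing is attempted.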
Rationale for size of magic value
=================================
The rationale for using a magic value of at least 32 bits is simple.
With the proliferation of file formats in existence, 2-byte
identifiers, as used in some early file formats (on UNIX systems), now
conflict with other file formats and are no longer enough to strictly
and unambiguously identify files of a specific type.
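As a rough back-of-the-envelope illustration of the size argument, the
sketch below computes the chance that an unrelated file happens to
start with a given signature. It assumes the leading bits of unrelated
files are uniformly random, which is my simplification and not true of
real files (many formats start with similar text or zero bytes), so
the real figures are likely worse. The function name is hypothetical.

```python
def false_match_probability(signature_bits: int) -> float:
    """Chance that uniformly random leading bits equal a fixed signature.

    Assumption for this sketch: unrelated files start with uniformly
    random bits, so each extra signature bit halves the chance.
    """
    return 2.0 ** -signature_bits


# Compare the current 2-bit marker, a 2-byte magic and a 32-bit magic.
for bits in (2, 16, 32):
    print(f"{bits:2d} bits: {false_match_probability(bits):.3e}")
```

Under that assumption a 2-bit marker is matched by one unrelated file
in four, while a 32-bit value is matched by about one in four billion.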
Rationale for value of magic value
===================================
As specified in the ISO/IEC 15444 (JPEG2000) standard, as well as in
ISO/IEC 15948 (PNG Specification), the magic value, or file signature,
can be used for these purposes:
- Permits immediate detection of common file-transfer problems.
- The magic value should contain a CR-LF sequence, which permits
  catching bad file transfers that alter newline sequences.
- The control-Z character stops file display under MS-DOS. The final
  line feed checks for the inverse of the CR-LF translation problem.
Therefore a good model for a signature, based on the ISO standards
above, could be (I'm not the one who decides on the signature, it's
up to you; these are merely general suggestions):
ASCII C notation:
\211 E X I \r \n \032 \n or
\211 X M L \r \n \032 \n or
something similar to that.
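To show how such a signature catches transfer problems, here is a
small Python sketch. It takes the first suggestion above as a purely
hypothetical EXI signature (octal \211 is 0x89 and \032 is 0x1A); it is
not part of any EXI draft. The names EXI_MAGIC, DAMAGED and diagnose()
are mine. The high-bit check follows the PNG rationale for a non-ASCII
first byte.

```python
# Hypothetical signature taken from the suggestion above;
# it is NOT defined by any EXI draft.
EXI_MAGIC = b"\x89EXI\r\n\x1a\n"

# What the signature would look like after common transfer damage.
DAMAGED = {
    EXI_MAGIC.replace(b"\r\n", b"\n"): "CR-LF was translated to LF",
    EXI_MAGIC.replace(b"\n", b"\r\n"): "LF was translated to CR-LF",
    bytes(b & 0x7F for b in EXI_MAGIC): "high bit was stripped",
}


def diagnose(header: bytes) -> str:
    """Classify the first bytes of a file against the signature."""
    if header.startswith(EXI_MAGIC):
        return "signature intact"
    for damaged, reason in DAMAGED.items():
        if header.startswith(damaged):
            return reason
    return "not recognized"
```

A file sent through a text-mode transfer would then be reported as
damaged right away, instead of failing somewhere inside the decoder.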
If you need further clarifications, please let me know.
Thanks for taking the time to consider my proposal,
Sincerely yours,
Carl Eric Codère
http://www.optimasc.com
Received on Friday, 5 October 2007 03:44:33 UTC