Re: base64Binary lexical/octet length

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Mon, 11 Apr 2011 10:20:04 -0600
Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, xmlschema-dev@w3.org
Message-Id: <1C39667A-A967-40CB-820A-24C9B0B16FC9@blackmesatech.com>
To: xmlplus custodians <xmlplus.custodians@gmail.com>

On Apr 9, 2011, at 2:49 PM, xmlplus custodians wrote:

> Hi
> 
> The XSD1.1 DataTypes spec in the base64Binary section gives following pseudo-code for calculating octet length of a base64Binary encoded string.
> 
> ---------------------------------------------------------------------------------
> 1) lex2   := killwhitespace(lexform)    -- remove whitespace characters
> 2) lex3   := strip_equals(lex2)         -- strip padding characters at end
> 3) length := floor (length(lex3) * 3 / 4)         -- calculate length
> ---------------------------------------------------------------------------------
> 
> 
> My understanding is that, for a base64Binary encoded string, its lexical length would be a multiple of 4 and its octet length would be a multiple of 3. 

It's been a while since I read the base64 spec, but my recollection is that base64 encodes
octet sequences of any length, not just octet sequences whose length is a multiple of three.

The lexical length (ignoring whitespace) will indeed always be a multiple of four; the 
padding characters are added at the end in order to ensure that this is so.  

> 
> As an example if we take a base64Binary encoded string, which doesn't contain whitespaces or padding
> chars(=), so that lexform is same as lex3 in above code. Now let us take a lex3 of length 10 then,
> according to above code, the octet length would be 7(not a multiple of 4).

Yes, precisely.  If the lexical form, ignoring whitespace, is twelve characters long
and the last two characters are equals signs, then what you have is two
clusters of four characters, each of which encodes three octets, followed
by a final cluster of two non-padding characters, which encodes the final
octet.
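The three steps of the quoted pseudo-code can be sketched in Python as follows. This is only an illustration, not text from the spec: the function name octet_length is hypothetical, and it assumes that "whitespace" means the four XML whitespace characters (space, tab, CR, LF) and that the input is already a valid base64Binary lexical form.

```python
def octet_length(lexform: str) -> int:
    """Octet length of a base64Binary lexical form (assumed valid)."""
    # Step 1: remove whitespace (assumed here: space, tab, CR, LF).
    lex2 = lexform.translate({ord(c): None for c in " \t\r\n"})
    # Step 2: strip the padding characters at the end.
    lex3 = lex2.rstrip("=")
    # Step 3: floor(length * 3 / 4), done in integer arithmetic.
    return (len(lex3) * 3) // 4


# Twelve characters, the last two of them padding: 3 + 3 + 1 octets.
print(octet_length("AAAA AAAA\nAA=="))  # 7
```

The ten non-padding characters give floor(10 * 3 / 4) = 7, which matches the cluster-by-cluster count above.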

> Are octet-lengths which are not multiple of 4, valid in case of base64Binary encoded string ?

Yes.

> Also, what should be the formulae for calculating lexical-length from the octet-length of a base64Binary string ?
> Should it be something like this:
> 
> lexical-length := ceil( octet-length*4/3)
> 
> If we take an example with octet-length=10, the lexical-length is not a multiple of 4.
> I am clueless here. Appreciate your help on the same. 

In base64 encoding, any input octet stream is subdivided into 24-bit
(i.e. three-octet) groups, each of which is encoded in four base64
digits.  If there are fewer than 24 bits in the final group of bits, then
padding characters are used.  So if you wish to calculate the minimum
length of the base64 encoding for an arbitrary sequence of octets (i.e.
the length of an encoding without any white space), then I think the 
formula you want will be 4 * ceil( octet-length / 3).  It is a good idea,
though, to follow the recommendations in the RFC for adding
whitespace and newlines; it makes debugging problems easier, if
nothing else.
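As a rough check of the formula 4 * ceil(octet-length / 3), here is a small Python sketch. The function names are hypothetical and chosen only for this illustration. It also appears that the formula proposed in the question, ceil(octet-length * 4 / 3), counts the characters before padding is added, which would explain why it is not always a multiple of 4.

```python
def padded_length(octet_length: int) -> int:
    """Length of the base64 encoding with padding and no whitespace."""
    # 4 * ceil(n / 3), written with integer arithmetic.
    return 4 * ((octet_length + 2) // 3)


def unpadded_length(octet_length: int) -> int:
    """Number of non-padding characters: ceil(n * 4 / 3)."""
    return (4 * octet_length + 2) // 3


# Ten octets: three full groups plus one leftover octet.
print(padded_length(10))    # 16
print(unpadded_length(10))  # 14
```

For ten octets the encoding is 16 characters long, of which 14 are base64 digits and 2 are equals signs.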

You may find it helpful to read RFC 3548, which is normatively referred
to from the XSD spec.

http://www.ietf.org/rfc/rfc3548.txt

I hope this helps.

 
-- 
****************************************************************
* C. M. Sperberg-McQueen, Black Mesa Technologies LLC
* http://www.blackmesatech.com 
* http://cmsmcq.com/mib                 
* http://balisage.net
****************************************************************
Received on Monday, 11 April 2011 16:20:30 GMT
