Re: Header Serialization Discussion from Frédéric Kayser on 2013-04-16 (ietf-http-wg@w3.org from April to June 2013)

From: Frédéric Kayser <f.kayser@free.fr>
Date: Tue, 16 Apr 2013 04:23:41 +0200
To: ietf-http-wg@w3.org
Message-Id: <24370F45-C4B4-41A2-8515-5B239766A943@free.fr>

Hello,
If text fields can effectively be encoded as UTF-8, would it be wise to use that to send IRIs (RFC 3987) directly?
Without Punycode:
http://xn--acadmie-franaise-npb1a.fr/ vs. http://académie-française.fr/
http://www.xn--cigacz-2ib.pl/ vs. http://www.ścigacz.pl/
http://xn--rlcuo9h.xn--wkc4axeaevb3oqbg.xn--xkc2al3hye2a/ vs. http://தளம்.ஆள்களமையம்.இலங்கை/
http://xn--mgbggrfi2ikdb7d.xn--mgberp4a5d4ar/ vs. http://مركزالتسجيل.السعودية/

and without percent encoding:
zdj%C4%99cia vs. zdjęcia
g%C3%B6r%C3%BCnt%C3%BC vs. görüntü
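
For readers who want to reproduce the pairs above, here is a minimal Python sketch of the two mappings an IRI currently goes through before it is sent: Punycode for the host and percent-encoding of the UTF-8 octets for the path. The helper names (host_to_ascii, host_to_unicode, path_to_ascii) are made up for this illustration, and Python's built-in "idna" codec follows the older IDNA 2003 rules, so treat it as an approximation of what a given client does.

```python
from urllib.parse import quote, unquote


def host_to_ascii(host: str) -> str:
    """Convert an internationalized host name to its Punycode (ACE) form."""
    # The built-in "idna" codec applies IDNA 2003 processing label by label.
    return host.encode("idna").decode("ascii")


def host_to_unicode(host: str) -> str:
    """Convert a Punycode (ACE) host name back to its Unicode form."""
    return host.encode("ascii").decode("idna")


def path_to_ascii(segment: str) -> str:
    """Percent-encode the UTF-8 octets of a single path segment."""
    return quote(segment, safe="")


print(host_to_ascii("académie-française.fr"))  # xn--acadmie-franaise-npb1a.fr
print(path_to_ascii("zdjęcia"))                # zdj%C4%99cia
print(unquote("g%C3%B6r%C3%BCnt%C3%BC"))       # görüntü
```

Sending the IRI as plain UTF-8 would skip both conversions on the wire.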

I wouldn't mind if HTTP/2 clearly took the bull by the horns regarding I18N.

The easiest way to (re)encode UTF-8 with a variable-length code would be to collect/define statistics only for the leading octets and store the continuation octets as fixed 6-bit values (they are restricted to the range 0x80-0xBF, i.e. 64 values).
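
A rough Python sketch of that idea, under stated assumptions: the prefix code is a plain Huffman code built from a small sample string (it is not the static table used by Delta), bits are kept as a '0'/'1' string rather than packed, and every function name here is hypothetical. Leading octets (ASCII included) go through the prefix code, while each continuation octet is written as its low 6 bits.

```python
import heapq
from collections import Counter


def build_prefix_code(sample: bytes) -> dict[int, str]:
    """Huffman code over the non-continuation octets found in `sample`."""
    freq = Counter(b for b in sample if b & 0xC0 != 0x80)
    # Heap entries are (weight, unique tie-breaker, {octet: code so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(freq.items()))]
    if not heap:
        return {}
    if len(heap) == 1:
        return {sym: "0" for sym in heap[0][2]}
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def continuation_count(lead: int) -> int:
    """Number of continuation octets announced by a UTF-8 leading octet."""
    if lead < 0x80:
        return 0
    if lead >> 5 == 0b110:
        return 1
    if lead >> 4 == 0b1110:
        return 2
    if lead >> 3 == 0b11110:
        return 3
    raise ValueError(f"not a UTF-8 leading octet: {lead:#x}")


def encode(text: str, code: dict[int, str]) -> str:
    """Prefix-code the leading octets, emit continuation octets as 6 bits."""
    bits = []
    for b in text.encode("utf-8"):
        if b & 0xC0 == 0x80:
            bits.append(format(b & 0x3F, "06b"))  # drop the fixed '10' prefix
        else:
            bits.append(code[b])
    return "".join(bits)


def decode(bits: str, code: dict[int, str]) -> str:
    """Inverse of encode(); the leading octet says how many 6-bit groups follow."""
    inverse = {v: k for k, v in code.items()}
    out = bytearray()
    pos = 0
    while pos < len(bits):
        end = pos + 1
        while bits[pos:end] not in inverse:
            end += 1
            if end > len(bits):
                raise ValueError("truncated or invalid bit string")
        lead = inverse[bits[pos:end]]
        out.append(lead)
        pos = end
        for _ in range(continuation_count(lead)):
            out.append(0x80 | int(bits[pos:pos + 6], 2))
            pos += 6
        # A real decoder would also reject a short final 6-bit group.
    return out.decode("utf-8")


sample = "zdjęcia görüntü académie-française"
code = build_prefix_code(sample.encode("utf-8"))
packed = encode("zdjęcia", code)
print(len(packed), "bits instead of", 8 * len("zdjęcia".encode("utf-8")))
print(decode(packed, code))
```

Because the leading octet already determines how many continuation octets follow, the decoder needs no extra markers, and each continuation octet costs exactly 6 bits instead of 8.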

-- 
Frédéric Kayser

James M Snell wrote :

> Text can be either UTF-8 or ISO-8859-1, indicated by a single bit flag
> following the type code. All text strings are prefixed by its length
> given as an unsigned variant length integer
> 
[snip]
> 
> For ISO-8859-1 Text, the Static Huffman Code used by Delta would be
> used for the value. If we can develop an approach to effectively
> handling Huffman coding for arbitrary UTF-8, then we can apply Huffman
> coding to that as well.

Received on Tuesday, 16 April 2013 02:24:10 UTC