Fast, small, simple, robust UTF-8 decoder in C from Bjoern Hoehrmann on 2009-04-01 (www-archive@w3.org from April 2009)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Thu, 02 Apr 2009 01:12:31 +0200
To: www-archive@w3.org
Message-ID: <23t7t4pn29jsobkvjgcb1dl30i8ovtaake@hive.bjoern.hoehrmann.de>

Hi,

  The attached code is a fast, small, simple, robust UTF-8 decoder in C;
on my antiquated system mere validation of in-memory data proceeds at a-
bout 1 GB / s. Also capturing the code points is about 25% of that. The
algorithm used is online allowing byte-level streaming decoding. It pro-
tects against all forms of illegal UTF-8. The algorithm proceeds thus:

  - Each byte is associated with a character class and a mask
  - The character class is used to advance a finite automaton
  - The mask is used to strip off leading bits from the byte
  - The remaining bits are combined into a Unicode code point
  - A code point is complete if the DFA enters the final state

The table used for byte mapping and the DFA is 288-416 bytes in size;
the smaller version requires a branch instruction in the decoder, the
larger version does not. The automaton was built using

  http://search.cpan.org/dist/Unicode-SetAutomaton/
  http://search.cpan.org/dist/Set-IntSpan-Partition/

The latter was used for the character class mapping. I've tested this
on some properly encoded files and it works as expected. I could not
find an easily usable test set with malformed documents. I also could
not find interesting UTF-8 decoders to compare this one with, most of
them are rather scary.

regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Attachments

text/plain attachment: u8decdfa.c

Received on Wednesday, 1 April 2009 23:13:16 UTC