W3C home > Mailing lists > Public > www-archive@w3.org > April 2009

Fast, small, simple, robust UTF-8 decoder in C

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Thu, 02 Apr 2009 01:12:31 +0200
To: www-archive@w3.org
Message-ID: <23t7t4pn29jsobkvjgcb1dl30i8ovtaake@hive.bjoern.hoehrmann.de>

  The attached code is a fast, small, simple, robust UTF-8 decoder in C;
on my antiquated system mere validation of in-memory data proceeds at a-
bout 1 GB / s. Also capturing the code points is about 25% of that. The
algorithm used is online allowing byte-level streaming decoding. It pro-
tects against all forms of illegal UTF-8. The algorithm proceeds thus:

  - Each byte is associated with a character class and a mask
  - The character class is used to advance a finite automaton
  - The mask is used to strip off leading bits from the byte
  - The remaining bits are combined into a Unicode code point
  - A code point is complete if the DFA enters the final state

The table used for byte mapping and the DFA is 288-416 bytes in size;
the smaller version requires a branch instruction in the decoder, the
larger version does not. The automaton was built using


The latter was used for the character class mapping. I've tested this
on some properly encoded files and it works as expected. I could not
find an easily usable test set with malformed documents. I also could
not find interesting UTF-8 decoders to compare this one with, most of
them are rather scary.

Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 

Received on Wednesday, 1 April 2009 23:13:16 UTC

This archive was generated by hypermail 2.3.1 : Wednesday, 7 January 2015 14:43:29 UTC