- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Thu, 02 Apr 2009 01:00:11 +0200
- To: www-archive@w3.org
- Message-ID: <gen7t4haff03mjt0p6fcab8iuilpceehsp@hive.bjoern.hoehrmann.de>
Hi, The attached code is a fast, small, simple, robust UTF-8 decoder in C; on my antiquated system mere validation of in-memory data proceeds at a- bout 1 GB / s. Also capturing the code points is about 25% of that. The algorithm used is online allowing byte-level streaming decoding. It pro- tects against all forms of illegal UTF-8. The algorithm proceeds thus: - Each byte is associated with a character class and a mask - The character class is used to advance a finite automaton - The mask is used to strip off leading bits from the byte - The remaining bits are combined into a Unicode code point - A code point is complete if the DFA enters the final state The table used for byte mapping and the DFA is 288-416 bytes in size; the smaller version requires a branch instruction in the decoder, the larger version does not. The automaton was built using http://search.cpan.org/dist/Unicode-SetAutomaton/ http://search.cpan.org/dist/Set-IntSpan-Partition/ The latter was used for the character class mapping. I've tested this on some properly encoded files and it works as expected. I could not find an easily usable test set with malformed documents. I also could not find interesting UTF-8 decoders to compare this one with, most of them are rather scary. regards, -- Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de 25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Attachments
- text/plain attachment: u8decdfa.c
Received on Wednesday, 1 April 2009 23:00:48 UTC