- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Tue, 10 Jan 2012 01:39:55 +0100
- To: Henry Story <henry.story@bblfish.net>
- Cc: "www-tag@w3.org List" <www-tag@w3.org>
* Henry Story wrote:
>Perhaps bringing out some of these aspects more clearly would help
>clarify why simplicity is important, in a way that appeals to education
>don't do so well.

Well, the protocol and format designers and implementers I typically run across generally do not seem to understand language theory well, and you can't really blame them, as there is pretty much no benefit in doing so. That's largely because there are no tools to work with formal languages.

Simple example: you have a specification for a URI scheme and you want to know whether the scheme's syntax is a subset of the generic syntax. Now, the latest fad is to use Turing machines as formal specifications, like you have "algorithms" instead of grammars in the "HTML5" specification, so you would probably start by reverse-engineering the "algorithm" into a grammar. Then you would have to write software that can read that grammar and the generic URI syntax grammar into a common data model, and then you would have to implement something that can tell whether the two are different, which would give you a solution if you've done everything right.

Even if you were lucky and the specific scheme was provided as an ABNF grammar that's compatible with the ABNF grammar for the generic URI syntax, there is no tool that I am aware of that would, for instance, turn all regular productions in an ABNF grammar into a regular expression, and there is no tool that I am aware of that would take two expressions and tell you whether one is a subset of the other. Nothing easily discoverable and usable that I did not write myself, anyway. Even though this is a very simple problem: you turn the grammars into NFAs, the NFAs into DFAs, toggle the acceptance of all states in one, compute a product automaton, minimize it, and look at the result. A 4th semester computer science student should be familiar with all the individual steps. But no mainstream tool for this exists as far as I am aware.
The same goes for parser generators and whatever else. Even though this is a very regularly recurring topic in the CSS Working Group, there is no simple online tool that would tell you whether some proposed syntax is possible under the generic syntax the CSS Working Group promises to never change. Even for much simpler languages like URIs there is no such service, except <http://www.websitedev.de/temp/rfc3986-check.html.gz>. It does not seem likely to me that protocol and format designers will do much to go in the direction proposed in the talk you referenced unless the tools get much, much better.

One point raised in the talk is that it is common for input validation to happen in many uncoordinated places in code, and at times code is written assuming that surely, by the time some input reaches it, it has been validated (which may be true when it was written, but just means a seemingly unrelated change that invalidates the assumption will break the whole thing). One case of that I've seen a number of times is character string input: you validate that an input string is properly formed UTF-8 when reading it the first time, and later on you don't validate it again because that would probably be inefficient. The popular ICU library, for instance, has tools that decode UTF-8 slowly while performing error checks, and tools that decode quickly but do not check for errors.

I arrived at this from a bit of a different angle: I wanted to make standalone C programs that decode UTF-8 properly, but the amount of code the usual UTF-8 decoder employed made using any of them unappealing, never mind that they tended to be so complex you couldn't fully verify that they work correctly. Worse, my general use case was reading from files, and decoders typically made it either difficult or impossible to deal with incomplete input, and that's just asking for trouble when programming in C.
So I wrote my own: http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

In writing it I applied what I knew about formal languages and automata. It's rather unusual in that the caller controls the input loop; normally the caller would collect a reasonably large chunk of input and then call the decoder, one reason being that function calls are expensive and the typical decoder is too large to "inline". That means you have to study how the decoder, for instance, deals with cases where your input buffer ends in the middle of a character, and how to call it to resume. That is not even possible with some decoders, and it's a bit hard to test, since you would have to design code and tests so you actually hit these cases. My code does not have that problem: you always feed it exactly one byte.

You also can quite easily mathematically or automatically prove the correctness of my implementation. The code is very simple and short, and it is possible to simulate all possible states and inputs and be sure that you have done so, at least if you use proper type information (the code uses uint32_t where it should use uint8_t due to compiler "bugs").

More importantly, my decoder is very fast. On the machine I developed it on, it outperformed all other decoders that fully validate their input (that I could find and benchmarked, anyway), and with a minor optimization it performed just as well as the ICU code that does not validate its input at all, meaning there is no incentive to use a less robust but much faster decoder in "inner layer" code. And performance has improved since I first published this: a whole instruction that's applied to every byte in the original is unnecessary, as explained later in the document.
And there are other benefits. ICU, for instance, has different code paths, for performance reasons, for fixed-length input strings and for null-terminated strings, and you typically have different code for decoding and for verification (verification would be much slower if you also decoded but then discarded the result without using it); for my decoder, the compiler can detect and optimize these cases on its own.

The underlying idea is a simple classic: it's a regular language, so use a DFA to recognize it. The lookup tables you would get for the 256 character alphabet were too large, so I used one level of indirection to minimize the space requirements (bytes are translated to character classes, so for instance all ASCII bytes are treated identically), and then the remaining problem was that you have to strip certain bits from starter bytes (the first byte in a UTF-8 sequence for a multi-byte character), and I found a way to use the character classes for this purpose too.

So, in this sense, if you understand formal languages well and you can apply this knowledge to practical problems, you will find solutions that are much simpler, more secure, more easily re-usable, and easier to verify. But, as before, even the first step in constructing my decoder would require some tool to discover the DFA tables for the simple verification of input. http://search.cpan.org/dist/Unicode-SetAutomaton/ is probably what I used, after writing it weeks prior. No tools.

The talk suggests parser generation from grammars. As it is, we would have to re-educate protocol and format designers, in certain circles anyway, to even value and provide grammars to begin with. They prefer the prose Turing machines that people who want a grammar have to reverse-engineer.

-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de 
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de 
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Received on Tuesday, 10 January 2012 00:40:22 UTC