- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Sun, 17 Mar 2013 01:16:00 +0100
- To: www-archive@w3.org
So, I started the other day to list criteria for recognising text that
is fixed, pre-formatted, possibly list-like, that should not be reflown.
I did not get very far, but...

* Repeating prefixes and repeating suffixes that are not shared with
  other pieces or original text. The most common prefix would be white
  space for indented code or other text, but there are also a number of
  other cases, like lists of references

    [1] ...
    [2] ...

  and other forms of code have similar patterns, like

    <example>
      <... />
      <... />
    </example>

  and it might be worth considering matching punctuation as matches,

    {
      ...
    }

  and for suffixes there are things like

    color: red;
    background-color: white;

    margin: 0;  /* ... */
    padding: 0; /* ... */

  where ";" and "*/" are repeating suffixes.

* Similarly, there is alignment, when you get the same character in
  the same column position in many lines, which might be considered
  repeating infixes,

    example => 1,
    other   => 2,
    ...

  where there is alignment based on the " => " infix, in addition to
  the repeating suffix ",".

* Similarly, the notion of repeating infixes can be extended beyond
  looking for aligned repetitions, for instance, Outlook blurbs have

    From: ...
    To: ...
    Date: ...
    Subject: ...

  with the repeating infix ": ". A simple first step would be to count
  the "words" instead of the characters and then check for alignment.
  In my "data model" I look for URIs as "words", and a typical case
  where you would not want to reflow is when lines consist of almost
  nothing but a single URI (ends-with-URI suffix).

* Similarly, a simple tolerance could be to accept white space in
  place of a sequence of non-white-space characters as a weak
  criterion; that would more easily detect wrapped list items

    1) ### ### ### ### ### ### ### ### ### ### ### ### ### ###
       ### ### ### ### ### ### ### ### ### ### ### ### ### ###
    2) ### ### ### ### ### ### ### ### ### ### ### ### ### ###
       ### ### ### ### ### ### ### ### ### ### ### ### ### ###

  which technically could be easily reflowed in many cases, but it'd
  often look a bit messy. It's also possible that this would be taken
  care of by recognizing that prefixes repeat not from line to line but
  within the block.

* A big one is line length. Lines with fixed formatting tend to be a
  good bit shorter than flowed text, and when flowed text is ordinarily
  wrapped, long lines similarly indicate fixed formatting. This is a
  bit of a chicken-and-egg problem though, so this is a very weak
  factor; it works best when you have a lot of both in a single mail.
  However, with the other indicators this is probably a very good
  verifier.

* Relatedly, white space distribution is interesting. If you know the
  typical line length of original text, you can extend other "areas" of
  text to roughly that length and look at the white space, and you will
  find that short lines have a lot of white space "to the right" and a
  lot of white space overall, especially relative to "letters". White
  space prefixes (indentation) work the same way, and other forms of
  alignment are the same. To recall an earlier example,

    example => 1,
    other   => 2,
    ...

  That has a sequence of three spaces, which would be very unusual in
  flowed text. Another example would be cases of no inter-word white
  space,

    * ABC
    * DEF
    * GHI

  And frequent and unusually short sequences of non-white-space

    f 0 _ = 0
    f _ 0 = 0
    f x y = x + y

  Or unusually long words

    example = new AbstractSingletonProxyFactoryBean();

  Or for that matter,

    document.getElementsByTagNameNS("http://www.w3.org/2000/svg", ...

* Distribution of other characters can also be interesting, which can
  be as simple as looking at the letter to non-letter ratio...

* ...

There are various false positives to be aware of:

* Some people indent all their paragraphs, and some of them also use
  overly long lines, so you get a repeating pattern but the text is
  actually reflowable.

* Some people use code-like markers to indicate the text they wrote,
  something like this for instance

    <john>
    ...
    </john>

  Similarly,

    [john] ...

    [john] ...

* ...

regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Received on Sunday, 17 March 2013 00:16:28 UTC
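
As a rough illustration of how a few of the criteria in the message above might be computed, here is a minimal Python sketch. It is not the author's implementation: the function name `fixed_format_signals`, the returned keys, and the choice to measure the suffix by the last character only are all assumptions made for this example. It covers the repeating prefix, the repeating suffix, column alignment, runs of three or more spaces, and the letter to non-letter ratio, and it leaves out line length, "word" counting, and URI detection.

```python
import re
from collections import Counter
from os.path import commonprefix  # character-wise common prefix of strings


def fixed_format_signals(lines):
    """Return weak signals that a block of lines should not be reflowed.

    Hypothetical helper: names and measures are illustrative assumptions.
    """
    rows = [ln.rstrip() for ln in lines if ln.strip()]
    if len(rows) < 2:
        return {}

    # Repeating prefix: the longest string that every line starts with.
    prefix = commonprefix(rows)

    # Repeating suffix: share of lines ending in the most common last character.
    last_char, last_count = Counter(r[-1] for r in rows).most_common(1)[0]

    # Alignment: columns past the prefix where every line has the same
    # punctuation character (this can include an aligned suffix column).
    width = min(len(r) for r in rows)
    aligned = [
        col
        for col in range(len(prefix), width)
        if len({r[col] for r in rows}) == 1
        and not rows[0][col].isalnum()
        and not rows[0][col].isspace()
    ]

    # Interior runs of three or more spaces are unusual in flowed text.
    wide_gaps = sum(1 for r in rows if re.search(r"\S {3,}\S", r))

    # Letter to non-letter distribution, here as letters over all characters.
    text = "".join(rows)
    letters = sum(ch.isalpha() for ch in text)

    return {
        "prefix": prefix,
        "suffix": last_char,
        "suffix_share": last_count / len(rows),
        "aligned_columns": aligned,
        "wide_gap_share": wide_gaps / len(rows),
        "letter_ratio": letters / len(text),
    }


signals = fixed_format_signals(["example => 1,", "other   => 2,"])
```

For the " => " example, `signals["aligned_columns"]` contains the two columns of "=>" and `signals["suffix"]` is ",". Each value is only a weak indicator, so a real classifier would presumably combine several of them and would still need to handle the false positives listed in the message, such as authors who indent every paragraph.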