- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Sat, 26 Oct 2013 01:02:22 +0200
- To: www-archive@w3.org
Greetings,

We are still trying to identify quoted and reformattable text parts in mails. It was clear from the beginning that, since some people send HTML-only mails, even as replies in list discussions, I would probably have to handle them in some form. But seeing how horrendously bad common HTML-to-text implementations are, and the various anomalies they cause, I figure the HTML parts need some more attention. That includes mailers that generate HTML and text alternatives from internal data, where interesting issues arise, like text that is not marked as quoted in the HTML while the plain text alternative has a not-manually-entered `>` in front.

Interestingly, when I started looking into this I did some cursory, manual comparisons between plain text versions and HTML versions, and to my surprise the HTML versions were often in no better shape than the text versions, so I did not put much faith into that. Even in theory it may well be that sometimes the HTML version is much better than the text one while at other times the plain text version is much better. There is also a lack of good automated HTML-to-text formatting tools, a big problem being that you would want to put a `>` in front of `<blockquote>` lines.

In any case, for these issues it would be helpful to be able to compare plain text versions and HTML alternatives. But when I last looked into HTML diff utilities, when I created <http://www.w3.org/wiki/HtmlDiff>, things did not look good, to put it mildly. So I figured if it involves diffing and HTML, I am going to have a bad time.

Then again, perhaps it's not so difficult. So one thing I made is <https://gist.github.com/7126471>. To be usable stand-alone, the script takes just an HTML document as input and calls `lynx` to get a derived plain text version. It then uses XML::LibXML's HTML parser to get a DOM representation.
It then splits the text version into tokens, splits the DOM text nodes into tokens as well (trying not to pick up ones in `<script>` and `<head>` and so on), and then uses the familiar `sdiff` function to match the two lists of tokens up with one another, so that every token in the plain text version knows the corresponding text node in the DOM of the HTML version. That way the plain text tokens can have properties like "in the HTML version this token is a descendant of a blockquote element" or "this token does not appear in the HTML version" and so on.

One thing I did with that was run it over an HTML-heavy list archive, note changed tokens with non-ASCII content, and tabulate them. The most common ones in this data set are

  3053 c ’ '
  1808 c “ "
  1799 c ” "
  1149 c – -
   363 c –
   324 c ’ ¹
   173 c … .
   161 c ” ²
   159 c “ ³
   124 c · *
    59 c — -

mostly asciification and some encoding errors; disturbingly frequent encoding errors, actually. Note that this covers only single-token changes: the ellipsis character is most likely replaced by three full stops, but only one full stop shows up as a change; the others are additions, the way I wrote the code.

On an individual message basis it can look like this:

  <63294A1959410048A33AEE161379C8023D02102C2D@SP2-EX07VS02.ds.corp...>
  c 1 1st
  - st
  c ’ '
  c ’ '
  + public-tracking@w3.org<mailto:

One problem here is that there are at least two ways to deal with e.g.

  <x>A<y>B</y></x> or <x>A</x><y>B</y>

It could be handled as "A B" or as "AB", and depending on what `<x>` and `<y>` are, and what the style sheet says, one or the other behavior may be best. For that matter, in some cases it may also be best to consider everything on a single-character basis. (Instead, my code tokenizes with roughly /[a-z]|./, so punctuation is a single token but ASCII words are possibly very long sequences; that is often the best choice, but not in all cases.) In the example above, it's a case of `1<sup>st</sup>`.
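
To make the matching step and the `1<sup>st</sup>` effect concrete, here is a rough Python sketch. It is not the Perl script from the gist: `difflib.SequenceMatcher` stands in for `sdiff`, Python's `html.parser` stands in for XML::LibXML, and the token pattern and the names `TextNodeTokens` and `align` are assumptions made up for this illustration.

```python
import re
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Assumed tokenizer: runs of ASCII letters/digits, or any single
# non-space character. The original only says "roughly /[a-z]|./".
TOKEN = re.compile(r"[A-Za-z0-9]+|\S")
SKIPPED = {"script", "style", "head"}          # text in here is ignored
VOID = {"br", "hr", "img", "meta", "link", "input"}


class TextNodeTokens(HTMLParser):
    """Collect (token, inside_blockquote) pairs from HTML text nodes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_tags = []
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if SKIPPED.intersection(self.open_tags):
            return
        quoted = "blockquote" in self.open_tags
        self.tokens += [(tok, quoted) for tok in TOKEN.findall(data)]


def align(html_tokens, text_tokens):
    """Return sdiff-like rows (op, html_token, text_token).

    op is "u" (unchanged), "c" (changed), "-" (only in the HTML)
    or "+" (only in the plain text).
    """
    rows = []
    matcher = SequenceMatcher(a=html_tokens, b=text_tokens, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        old, new = html_tokens[i1:i2], text_tokens[j1:j2]
        if op == "equal":
            rows += [("u", a, b) for a, b in zip(old, new)]
            continue
        paired = min(len(old), len(new))
        rows += [("c", a, b) for a, b in zip(old, new)]  # pair up what we can
        rows += [("-", a, None) for a in old[paired:]]   # HTML leftovers
        rows += [("+", None, b) for b in new[paired:]]   # text leftovers
    return rows


parser = TextNodeTokens()
parser.feed("<head><title>x</title></head>"
            "<blockquote><p>The 1<sup>st</sup> draft\u2026</p></blockquote>")
parser.close()
html_tokens = [tok for tok, _ in parser.tokens]
text_tokens = TOKEN.findall("> The 1st draft...")
for row in align(html_tokens, text_tokens):
    print(row)
```

In this sketch the `<sup>` boundary splits "1" and "st" into separate HTML tokens while the plain text has the single token "1st", which yields one changed row and one removed row, and the ellipsis becomes one change plus two additions. Each HTML token also carries a flag saying whether it sits below a `blockquote` element, which is the kind of property the plain text tokens can inherit through the alignment.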
If it had been a case of `1<p>st</p>` the problem would not show up. Keeping in mind that the plain text generators are sometimes excruciatingly bad at their job, there are no clear rules for this, but there is room for improvement. It would also be possible to specifically scan the `sdiff` output for cases such as this, but that is not clearly needed as yet.

The last case is a result of turning

  <a href='http://example.org'>example website</a>

into something like

  example website<http://example.org>

without considering that this is a very stupid thing to do when the input is

  <a href='http://example.org'>http://example.org</a>

as is almost always the case. There are all sorts of "funny" issues with URLs as it turns out. One popular instance of text that is added when making the plain text version is this:

  30 + <file://localhost/sip/zakim@voip.w3.org
  ...
  13 + <file:///sip/zakim@voip.w3.org

Another typical pattern is something like this:

  <CA+Z3oObD7SWU4upN+dXm5gu=k539LKREOVnyj0f94-n3mSBLrA@mail.gmail.com>
  + ********
  + *********
  + *
  + *
  + *
  + *
  + *
  + *
  + *
  + *
  + *
  + ********
  + ********
  + ********
  + ****
  + ****
  + ****
  + ****
  + ********

Right, you see, if you have a `<u>` then that's mapped to `*`, and if you have a `</u>` that's mapped to another `*`, so with literally

  <u></u><u></u>

you get `****`, so readers of the plain text version can see your underlined empty strings more clearly. These braindead techniques can also be combined; for instance,

  http://example.org/**abc-**def<http://example.org/abc-def>

is a common pattern. Here the `<u>` elements may perhaps serve as marks for wrapping opportunities, but that does not really lower the what-the-f...actor for anybody.

There are also interesting cases where the text version is abruptly cut off and lacks much of the text in the HTML variant (consisting of quotes of quotes of quotes of quotes of quoted text from email terrorists).
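
For contrast, a converter could avoid the link and underline anomalies described above with very little effort. The following Python sketch is only an illustration of that point; the function names `render_link` and `render_underline` and the exact output format are assumptions, not the behaviour of any particular mailer or of the script in the gist.

```python
def render_link(text, href):
    """Plain text for <a href=HREF>TEXT</a> without repeating the URL.

    The target is appended in angle brackets only when it adds
    information, i.e. when the link text is not already the target
    (ignoring a leading "mailto:" on the target).
    """
    text = text.strip()
    target = href.strip()
    if not target or text in (target, target.removeprefix("mailto:")):
        return text
    return f"{text} <{target}>"


def render_underline(text):
    """Plain text for <u>TEXT</u>: no markers around empty content."""
    return f"*{text}*" if text.strip() else text


print(render_link("example website", "http://example.org"))
print(render_link("http://example.org", "http://example.org"))
print(render_underline("") + render_underline(""))
```

With these rules a link whose text is its own URL stays a bare URL, and `<u></u><u></u>` produces nothing instead of `****`. Note that `str.removeprefix` requires Python 3.9 or later.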
One thing I will have to deal with to make use of such information is mails that pretend to have a plain text alternative but actually do not. Think "your mailer does not support frames^W HTML" style messages, though I am not sure that's actually ever encoded as an alternative; with spam, for instance, it's common to have big differences between HTML and text.

Anyway, the idea would probably be to give little weight to parts that are not in both the HTML and the text version when determining which parts are quoted, and of course, as the script above does, to use information like an element having a blockquote ancestor in the HTML variant as an indication that the text is in fact quoted (from the parent mail). There will have to be some research to see how unreliable that is, though. In the one test I've tried it worked well...

regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Received on Friday, 25 October 2013 23:02:48 UTC