- From: Leonard R. Kasday <kasday@acm.org>
- Date: Tue, 04 Jan 2000 17:02:15 -0500
- To: "Chris Ridpath" <chris.ridpath@utoronto.ca>
- Cc: w3c-wai-er-ig@w3.org
A more general way to find ASCII Art would be to use statistical properties of English (or whatever language is in use). For example, if you look at the frequency of letter pairs, some are relatively high like "th" and some are relatively low, like "mq". There are lots of refs on this. It's a classic topic. If you're into 50's style Experimental Psychology, You can find references in any intermediate psych textbook that deals with "information theory". Garner's a good author. For the Engineering inclined, check out elmentary infromation theory textbooks. Computer science fans can check out compression theory. Cryptography devotees can check out elementary methods for substitution cyphers. As you see, it's used all over place. And there are standard statistical tests to see if distributions match. See any intermediate stat book. So if you compare the contents of <PRE> or <XMP> with the statistics for English (or whatever language is in use) and the match is poor, it's probably ACSII Art. Unless you have someone who likes to write long strings of Acronyms. But hey, acronyms are arguably ACSII Art in a sense anyway. So you may want to set your intern loose on this approach... Actually, what you really want are statistics that take into account use of other characters like underlines, spaces, other ACSII characters. The sorts of things that showed up in the ad hoc rules in http://www.w3.org/WAI/ER/IG/ert/AsciiArt.htm. So what you really want is a program that just does those statistics, which you can turn loose on ordinary web pages, and get distributions to compare against. (of course you can also check statistics of strings of 3, 4, 5... letters but for this purpose I bet 2 is enough.) Len p.s. These statistics can also be used to check what language something is written in. At 03:38 PM 1/3/00 -0500, you wrote: >For technique 1.1.K (http://www.w3.org/WAI/ER/IG/ert/#Technique1.1.K) we >need to determine if a page contains ASCII art. Our intern had a look at a >how ASCII art is used on the web and prepared the following document: > >http://www.w3.org/WAI/ER/IG/ert/AsciiArt.htm > >Note: this does not deal with emoticons " :) " etc. > > >From this report I think we can create an algorithm that will reliably test >a page for ASCII art. I'll code something this week and test it on several >sites. > >Chris > ------- Leonard R. Kasday, Ph.D. Institute on Disabilities/UAP, and Department of Electrical Engineering Temple University 423 Ritter Annex, Philadelphia, PA 19122 kasday@acm.org http://astro.temple.edu/~kasday (215) 204-2247 (voice) (800) 750-7428 (TTY)
Received on Tuesday, 4 January 2000 17:00:27 UTC