Re: ISSUE-66: image analysis heuristics - suggest closing on 2009-09-03 from Matt May on 2009-08-21 (public-html@w3.org from August 2009)

From: Matt May <mattmay@adobe.com>
Date: Fri, 21 Aug 2009 09:39:06 -0700
To: Maciej Stachowiak <mjs@apple.com>, "public-html@w3.org WG" <public-html@w3.org>
Message-ID: <C6B437DA.55F5%mattmay@adobe.com>

I object.

On 8/21/09 12:45 AM, "Maciej Stachowiak" <mjs@apple.com> wrote:
> This issue raises an objection that is editorial in nature, and I do
> not believe it will have a material effect on normative requirements.

I disagree. The image analysis statement:

"User agents may also apply image analysis heuristics to help the user make
sense of the image when the user is unable to make direct use of the image,
e.g. due to a visual disability or because they are using a text terminal
with no graphics capabilities."

...is presented in the context of a segment on @alt which, as should be
clear to everyone by now, is highly contentious. Taken with the guidance on
missing @alt, it suggests that authors can rely on browser technology to
repair semantics they have left out for whatever reason--and therefore, @alt
is not as necessary as before. It is a dangerous juxtaposition, particularly
to an outside observer.

> Further, I believe the premise of the objection is false. The
> objection categorically says that state-of-the-art image analysis
> heuristics cannot recover useful information from an image, "not even
> close". 

I stand by that remark.

> There exist optical character recognition algorithms that
> could recover text from an image of text with high probability of
> success.

OCR is achievable, and has been for years. If the sentence read "User agents
may also apply image analysis heuristics for OCR," then I'd be in favor of
that. One tool that I know of, WebVisum, does that today.

But instead it says heuristics could be used "to make sense of the image,"
which is still a pipe dream.

> There are also image analysis algorithms that can detect
> specific features with fairly good accuracy. For references see
> <http://en.wikipedia.org/wiki/Machine_vision>

The successes in machine vision (as described in the Wikipedia article) have
been in pattern-matching analysis of highly constrained objects, to take
measurements or detect imperfections, as used for quality control in
manufacturing.

The entry also says:
"One should not confuse machine vision and computer vision. Computer vision
is more general (in the solution of visual problems), whereas machine vision
is an engineering discipline mainly concerned with industrial problems."

So let's look at the Wikipedia entry for computer vision:
http://en.wikipedia.org/wiki/Computer_vision

In the section titled "State of the art", the article itself acknowledges
the significant limitations of general-purpose image analysis:

"[T]here is no standard formulation of how computer vision problems should
be solved. Instead, there exists an abundance of methods for solving various
well-defined computer vision tasks, where the methods often are very task
specific and seldom can be generalized over a wide range of applications."

Adobe employs many, many people who have expertise in this work, and who
keep pace with (or advance) the state of the art. I have talked with a
number of them personally. None of them has any confidence that
general-purpose, in-browser image analysis heuristics with the level of
detail required for meaningful alt text could be delivered in less than
10 years of sustained R&D. Even then, the author's intent, which is the best
indicator of what alt text should be, is not a part of the equation, so this
should be considered a last-ditch repair attempt at best.

> I believe the current state of research is beyond what is described
> in that article and at the links.

HTML5 is not a research document; it's a specification. And in a
specification which relies so heavily on empirical fact and the current
state of browser technology, a pie-in-the-sky statement like this is even
more incongruous.

To date, no one, including the editor, has offered any evidence that the
approach he suggests is achievable now or in the foreseeable future, much
less applicable to users who can't see. And no browser vendor has even
hinted that they may be interested in pursuing image analysis. I would
assume that if one were working on it, they would correct me publicly. Given
that some features of the language that have actually been implemented in
one form or another have been removed due to insufficient implementation, I
have to wonder why a passage that specifies something no one has done or
plans to do (or can do with state-of-the-art technology) should stay in.

If a browser developer sees value in image analysis heuristics at any point
in the future, there is nothing that prevents them from implementing them.
Taking that paragraph out would not impact that. However, leaving it in
would lead readers to believe that this is actually a viable approach. It is
not.

-
m
Received on Friday, 21 August 2009 16:39:49 UTC