Re: Practice Describing Pictures, anyone game?

I much appreciate the valid comments given by Rich, Steve and
Bruce. Your constructive criticism and feedback are very useful,
as they help me to better understand where I should focus my
attention. I'll try to address some of your comments, and we
will see to what extent we can find further common ground.
I am aware that technology is not even half the story here. We
are rapidly approaching the point where, from a technical
perspective, we do have a powerful new accessibility option and
tool, but we still need to find out and decide for ourselves
what we can do or would want to do with it, if anything. Indeed,
there is a need to find better and more convincing answers to
Bruce's legitimate questioning of any real-world applications.

Rich wrote

> Some may argue that given enough training, this can become a viable
> way of "seeing." I think the learning process would be very painful, 
> slow, and frustrating. What do you-all think?

Yes, really learning to see with sound could be painful; it will
be very slow, and it might be frustrating at times. Moreover, we
do not yet know how good people can get at it in the end.

On the other hand, the process of learning a foreign language is 
also often painful, boring, very slow, and frustrating. I have no 
experience learning to read Braille, or learning to use the cane 
for safe travel, but I imagine that that too is by no means great 
fun until after you have mastered it to a certain degree. Dots of 
Braille do not make any sense to me, nor does the Spanish language. 
Somehow we work on mastering some of these things, because we
think it pays off one way or another. Will learning to see
with sound pay off for you? I don't know; I cannot promise that.

Depending on one's personal background, attitude, expectations or
interests, it could also be fun, especially for those who had no
prior vision: an exciting "hands-on" exploration of vision using
a cheap PC camera, experiencing the effects of visual perspective,
occlusion, parallax, visual texture, and so on. It definitely
won't be easy, though, if you want to fully master the
interpretation of arbitrary soundscapes. Would it be possible to
view and treat it more like playing a game?

The technology has only become affordable during the last two
years, through the use of standard PCs and PC cameras, and with
the worldwide availability of software and information through
the Internet. We don't have any convincing success stories from
users to tell yet. The technology is there all right: it provably
preserves a lot of visual information in the soundscapes while
meeting several key parameters known to limit human auditory
perception; it is technically reliable through the use of
mass-produced hardware components; it is affordable through cheap
$50 cameras; and it provides unprecedented access to visual
information. But all that information is very dense, and it is
presented in a way that no human being has ever had access to
before in history. We don't know whether the human brain, your
brain, my brain, can learn to cope with that, or rather to what
extent it can learn to do so, and whether it is really worth all
the trouble. Now how do we proceed - if we do? Ideas are welcome.
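
For the technically inclined, the underlying image-to-sound
mapping is easy to describe: the image is scanned from left to
right in about one second, with vertical position mapped to
pitch (higher up means higher pitch) and pixel brightness mapped
to loudness. Below is a minimal sketch of that general idea in
modern Python with NumPy; it is purely illustrative and is not
the actual implementation of The vOICe.

    import numpy as np

    def image_to_soundscape(image, duration=1.0, rate=22050,
                            f_low=500.0, f_high=5000.0):
        """image: 2D array of brightness values in [0, 1], with
        row 0 at the top. Returns a mono waveform that scans the
        image from left to right over the given duration."""
        rows, cols = image.shape
        n = int(duration * rate)
        t = np.arange(n) / rate
        # Which image column is "playing" at each output sample.
        col_at = (np.arange(n) * cols) // n
        # Top rows get the highest frequencies.
        freqs = np.geomspace(f_high, f_low, rows)
        sound = np.zeros(n)
        for r in range(rows):
            amp = image[r, col_at]     # brightness drives loudness
            sound += amp * np.sin(2 * np.pi * freqs[r] * t)
        peak = np.abs(sound).max()
        return sound / peak if peak else sound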

Steve wrote

> I think this is a very innovative idea, but I too could not make 
> sense of much of anything beyond simple straight lines. I think 
> that an SVG or XML approach still provides the best means of getting
> information such as what objects are on the screen and how they are
> connected.

You are quite right, Steve, from the perspective of accessing
structured information on the screen. This technology was actually
developed and meant for accessing arbitrary (unprepared, untagged) 
visual information, going well beyond the screen, specifically 
for gaining access to the visual information from our real-life 
local environment by using a camera. There is no XML describing 
my room, my house, my neighbourhood, or XML describing the 
architecture and art in the city of Rome. Now if (and indeed 
stressing the big "if") we can learn to interpret that arbitrary
visual information through sound, we will as a bonus also be able
to interpret whatever shows on the screen without additional 
tagging, just like sighted folks interpret images on web pages. 
For the moment, I can only demonstrate that many things are 
"audible" within soundscapes, not that they are "understandable".

How could I show that the Chinese language makes sense and can be
learnt? Unless you already know Chinese, you probably take that
for granted because many people evidently speak the language, but
without such a priori historical evidence available, how does one
go about proving things?

I hope that Kynn will have some nice photographs of buildings
or architecture, such that I can discuss how those translate
into certain rhythms and sweeps for rows of gates, pillars
and the like, plus the effect of visual perspective on that
(see the small example after this paragraph). Again, these
soundscapes of complex scenes will currently not make sense to
you without my explanation, but with an explanation the various
visual items should at least appear audible, thus illustrating
the principles while hopefully adding an element of plausibility
to the whole soundscape approach. For specific restricted
environments such as the graphical user interface of the
computer, dedicated solutions will always work better and be
easier to use, just as OCR with synthetic speech is a lot faster
and a lot less painful for printed text than trying to figure
out that text from the corresponding soundscapes of the printed
words.
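
To make the rhythm idea concrete, here is a small follow-up to
the sketch given earlier (again purely illustrative, using a
made-up test image rather than an actual photograph): a receding
row of pillars yields a quickening rhythm of tone bursts that
span an ever narrower pitch range.

    import numpy as np

    # Uses image_to_soundscape() from the sketch given earlier.
    image = np.zeros((64, 128))
    x, gap, top, bottom = 4, 24, 8, 60
    while x < 128 and bottom - top > 4:
        image[top:bottom, x:x + 2] = 1.0  # one bright pillar
        x += gap
        gap = max(4, int(gap * 0.75))     # gaps narrow: rhythm speeds up
        top += 3                          # pillars appear shorter and
        bottom -= 3                       # shorter as they recede
    sound = image_to_soundscape(image)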

So in a sense my use of screen items to illustrate the soundscape
technology may be a poor or perhaps even confusing choice: I am
not proposing to use soundscapes for that, but merely wish to
illustrate the generality of visual access offered by jumping
into anything visual, even though better solutions do exist for
a limited number of specific domains.

Bruce wrote

> I think the technology may have promise for real time use (where 
> the user is controlling the up/down left/right component), but 
> products that work that way (for navigating the real world) are 
> already available.

What products are you referring to here? GPS systems? Electronic
compasses? Talking signs?

> By the time AI is sufficiently advanced to process these sounds
> intelligibly, we would already have better automated pattern/graphic
> recognition!

Even if this became feasible (and machine vision is still far
from understanding anything but the simplest of scenes in very
restricted environments), there would remain the fundamental
problem of either letting machine censorship decide for you what
is relevant or interesting to mention, or else allowing five
minutes to hear all the items in a single scene listed. With
soundscapes, you get the raw, uncensored visual information of a
scene in one second, today, but the big burden of interpretation
is indeed on you, the user.

After all this "heavy" stuff, it may be useful to note that there
are some easy applications of the software as well. For instance, 
it can act as a cheap color identifier: pressing function key F10 
lets the software speak the color name of anything in the center 
of the view, be it a camera image, an imported image file or an
image from your TWAIN scanner. There is also a built-in accessible
graphing calculator for function plotting under function key F8. 
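
For those curious how such a color identifier might work in
principle, here is a hedged sketch (illustrative only: the
palette and the function name are made up, and this is not the
actual code of The vOICe). It simply reports the name of the
palette color nearest in RGB space to the sampled center pixel.

    # Hypothetical small palette; a real identifier would use far
    # more color names and a better color space than plain RGB.
    PALETTE = {
        "black": (0, 0, 0),       "white":  (255, 255, 255),
        "red":   (255, 0, 0),     "green":  (0, 128, 0),
        "blue":  (0, 0, 255),     "yellow": (255, 255, 0),
        "gray":  (128, 128, 128),
    }

    def nearest_color_name(r, g, b):
        """Name of the palette color closest to (r, g, b)."""
        return min(PALETTE, key=lambda name: sum(
            (p - c) ** 2 for p, c in zip(PALETTE[name], (r, g, b))))

    # e.g. nearest_color_name(250, 240, 10) -> "yellow"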

Sorry for this long-winded reply. Not everything is hard...

Best wishes,

Peter Meijer


Soundscapes from The vOICe - Seeing with your Ears!
http://ourworld.compuserve.com/homepages/Peter_Meijer/winvoice.htm
