Regarding Transcriptions/Captions (was RE: acceptable fallbacks [was: Re: Is longdesc a good solution? ...])

Justin James wrote:
> I cannot imagine doing a podcast or some other type of production on
> a regular basis and then trying to put together a transcript, without
> the benefit of some sort of "speech to text" software combined with
> some special software that would let me shuttle through the audio and
> compare it to the recognized text, basically a dictation machine in
> reverse. I am sure such software exists, I can't imagine how
> expensive it is.      


While I make it a point to comment to the lists using my private address (I
*am* aware that I can be militant at times, and so do not want my personal
views confused with my employer's), some folks also know that I work in
Higher Education, where I focus on web accessibility education and outreach.
Providing captions to video content is a huge issue for me, and something
that I have been working on for some time now.

I have done a fair bit of testing using speech to text technology, and can
state that this route is fraught with a number of issues which together add
up to an accuracy problem.  Speech to text requires a significant amount of
"training" - i.e. establishing voice profiles so that the software can
recognize the speaker's voice and "know" what words are being spoken, as well
as continued "tweaking" to improve accuracy.  Issues such as regional
dialects and international accents can seriously hinder accuracy; there are
also issues surrounding specialized vocabularies (for example medical,
legal, engineering, etc.) which have a significant impact.  Finally, multiple
voices in the same asset are hugely problematic, as often you must also
indicate "who" is saying what in your caption, something that the software
is simply incapable of determining.

However, if the "speaker" is the same person every time (and thus has an
established voice profile, and probably also the start of a customized
dictionary/vocabulary), the software now often reaches the 90%+ accuracy
range, so that the 'brute force' first pass leaves the content author with
nothing more than a need to proof-read/edit.  Tools such as Dragon
NaturallySpeaking are improving with each new release, and companies such as
DocSoft are incorporating them into larger product offerings that often
deliver what content creators require, as they streamline the
transcription/editing/captioning process.  Unfortunately, in the environment
that I work in, setting up the strict pre-conditions required for reliable
accuracy is problematic: often the media that needs to be captioned is of a
guest lecturer or other outside guest, and so establishing a voice profile
for this speaker is usually out of the question.

This leaves me with no other choice than to out-source the audio track to a
transcription service.  I am in the process of establishing a
workflow/service which will contract out to a number of different companies
that provide such services, with pricing linked to turn-around time: from a
high of $5.00/minute for same day delivery to a low of $1.00/minute for
delivery in 5 working days.  Today there are also web-service companies such
as AutomaticSynch which will similarly provide both transcription and
time-stamping services, and they have gained a significant foothold in the
higher-education market, although the cost per minute is slightly higher.

Of course, there *is* a cost associated with this, but for simple
transcription a low cost of $60.00/hour is not (IMHO) unreasonable for
hour-long lectures, and for short video pieces (e.g. promo pieces that run
less than 10 minutes), I argue that the cost is negligible - even same day
turnaround is under a hundred bucks!  Surely large corporate entities that
spend thousands (and even hundreds of thousands) of dollars on their web
presence can afford these minimal costs?  Advocacy here is important!
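
To make the arithmetic concrete, here is a rough sketch in Python.  The
per-minute rates are the ones quoted above; the tier names and the function
name are just labels invented for this illustration, not anything a
particular vendor uses.

```python
# Per-minute rates quoted in this message (US dollars).
# The tier names are hypothetical labels chosen for this sketch.
RATES_PER_MINUTE = {
    "same_day": 5.00,
    "five_working_days": 1.00,
}


def transcription_cost(minutes, tier):
    """Return the cost in dollars of transcribing `minutes` of audio."""
    return minutes * RATES_PER_MINUTE[tier]


# An hour-long lecture at the slowest (cheapest) turn-around.
print(transcription_cost(60, "five_working_days"))  # 60.0

# A 10-minute promo piece with same day delivery.
print(transcription_cost(10, "same_day"))  # 50.0
```

So the hour-long lecture comes to $60 at the 5 working day rate, and a
10-minute promo piece with same day turnaround comes to $50.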

> So while I personally push for transcripts everywhere, I recognize
> that we will never get them.

Rule #1 - never say never. 

> I also tell folks that I know that
> providing transcripts is a HUGE competitive advantage, because:  
> * Search engines now pick up all of your content
> * You can now get your content in front of people who could not
> ordinarily make use of it 
> * Even non-disabled users who are not able (or don't like to) to turn
> their speakers on can now consume your content 
> The benefits of providing a transcript are endless, but even with all
> of that, transcripts on the Web in any capacity are exceedingly rare.

Absolutely!! Selling the benefits of captioned media is the best way of
achieving buy-in, and you have identified some of the key points.  There is
also some experimentation happening that allows users to search *inside*
longer videos and, using the time-stamped captions, jump straight to the
point in the video where a key word is mentioned; in my case this improves
the pedagogical value of lectures and other long-form media assets (trust
me, in a research institution this is huge!).  It is also worth pointing out
that YouTube now allows content authors to caption their videos (with a tip
of the hat to YouTube for being so pro-active; companies such as Flickr
could learn from them [re: @alt]...), so increasingly I believe that we will
start to see more captioned media on the web.  Up until now it has been
tricky and quite manual/labour intensive, but that is all starting to change
for the better.
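
As a minimal sketch of the "search inside the video" idea, assuming the
captions are already available as a list of (start time, text) pairs: the
Cue type, the find_keyword function and the sample captions below are all
made up for illustration and do not describe any particular player or
caption format.

```python
from typing import List, NamedTuple


class Cue(NamedTuple):
    """One time-stamped caption (hypothetical structure for this sketch)."""
    start_seconds: float
    text: str


def find_keyword(cues: List[Cue], keyword: str) -> List[float]:
    """Return the start times of every caption that mentions `keyword`."""
    needle = keyword.lower()
    return [cue.start_seconds for cue in cues if needle in cue.text.lower()]


# Invented sample captions for a lecture recording.
cues = [
    Cue(0.0, "Welcome to today's lecture."),
    Cue(12.5, "We begin with web accessibility."),
    Cue(95.0, "Captions also help search engines."),
    Cue(140.0, "Accessibility benefits everyone."),
]

# Each returned time is a point the player could seek to.
print(find_keyword(cues, "accessibility"))  # [12.5, 140.0]
```

A player could then offer each returned time as a link that seeks to that
point in the video.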

If you or others are interested in more details, feel free to contact me
off-list.



Received on Wednesday, 10 September 2008 21:36:59 UTC