Building “Zombie” Voice Models:
Audio-mining the Past Using Voice Recognition for Transcription of Audio Artifacts
By Erin Donohue
Supervisors: Tanya Clement & Quinn Stewart
Spring 2014
Introduction
The Building “Zombie” Voice Models project was designed to determine the
feasibility of using DocSoft:AV transcription software to create computer-generated
transcripts of archival audio recordings. For this project, I attempted to create a working
“voice model” or speaker profile of popular monologist Spalding Gray in the DocSoft:AV
software. Gray’s personal papers are housed in UT Austin’s Harry Ransom Center for
Humanities Research, and the collection includes hundreds of audio and VHS recordings of
Gray’s performances (Stine and Cooper). This project originated when a researcher in
Chicago, Jim Sitar, was working on a project with Gray’s recordings. Gray’s audio
recordings do not have transcripts, but there were far too many audiocassettes and VHS
tapes to hand-transcribe each one. Thus, this project was created to apply contemporary
voice recognition software—DocSoft:AV—to archival audio recordings and use the
software to automatically create computer-generated transcripts of Spalding’s monologues.
First and foremost, this project aimed to make Spalding Gray’s recordings more
accessible to researchers. Secondly, this project aimed to assess the feasibility of large-scale
automated transcription of archival audio recordings in general. Archival audio is
notoriously difficult to access in archives, and increasing access to transcribed audio would
help make audio collections of all kinds easier to search and utilize. Thirdly, this project
aimed to devise a set of best practices and training materials for University of Texas
Libraries staff members who may wish to use the software.
In DocSoft:AV, I created voice models (speaker profiles customized for a particular
voice) for three voices: Spalding Gray, a podcast character named Cecil, and my own voice.
The success of these voice models was highly variable, ranging from the error-plagued
Spalding transcripts to the nearly perfect transcripts of my voice. In generating a successful
voice model, having a clear, good-quality recording is paramount. While the project was not
successful in generating a good voice model for Spalding Gray, it did provide more insight
into what is required to make good computer-generated transcripts and ultimately
generated useful software tutorial materials. Though the Spalding voice model did not
output useful transcripts, this should not suggest that DocSoft cannot ever successfully
transcribe archival recordings. Rather, this project shows that there are factors in the
recordings that may make them more or less amenable to DocSoft transcription—so
archival recordings must be considered on a case-by-case basis.
An Overview of DocSoft:AV
DocSoft:AV is a piece of speech-to-text software that captures and transcribes
spoken-word audio. It matches the sounds of audio recordings to phonemes and then uses
these phonemes to create words and sentences (DocSoft, Inc.). DocSoft is used to generate
text transcripts as well as captions, as each transcript comes with timing codes to sync the
text with its corresponding media file. DocSoft:AV is conventionally used to audio-mine
contemporary audio and video recordings, such as lectures and presentations.
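The report does not reproduce DocSoft’s export format, but its timing codes work like those in common caption formats. As an illustration (not DocSoft’s exact output), a SubRip (.srt) caption file pairs each numbered text segment with start and end time codes:

```
1
00:00:12,000 --> 00:00:15,500
First caption segment, displayed from 12.0 to 15.5 seconds.

2
00:00:15,600 --> 00:00:19,200
Second caption segment.
```

Because every segment carries its own time codes, the same transcript can drive both a searchable text file and synchronized closed captions.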
One unique DocSoft feature is the software’s ability to be trained to individual
voices, learning their nuances and foibles. This allows for transcription unique to each
speaker who has a profile on the software, rather than a one-size-fits-all approach to
transcribing any voice. DocSoft does this through the creation of speaker profiles, also
known as voice models. The software is “trained” on a particular person’s voice, and via
machine learning, improves its performance and refines its ability to successfully
transcribe that voice.
The training process entails giving DocSoft both an audio recording and an exact
transcript of that recording, which it uses to “learn” about a particular voice. DocSoft
matches the text of the transcript to the phonemes in the audio recording, and, as such,
can create customized models for particular voices. As DocSoft is given more and more
training materials, it learns more about the speaker’s voice. When given just audio
without a corresponding transcript, DocSoft will output a computer-generated transcript.
This resulting transcript can be corrected and then fed back into DocSoft as training
material, improving the overall accuracy of the voice model (see Figure 1).
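This Train, Transcribe, Edit cycle can be sketched in code. The function arguments below are hypothetical stand-ins: DocSoft:AV exposes this workflow through its web interface, not through a scripting API, so the sketch only shows the shape of the loop.

```python
# Sketch of the Train -> Transcribe -> Edit refinement cycle described above.
# The transcribe/correct/train callables are hypothetical stand-ins for
# steps performed through the DocSoft:AV web interface.

def refine_voice_model(model, recordings, transcribe, correct, train):
    """Iteratively improve a voice model: transcribe each recording,
    have a human correct the draft, then feed the corrected pair back
    in as new training material."""
    for audio in recordings:
        draft = transcribe(model, audio)    # machine-generated transcript
        fixed = correct(draft)              # human editing pass
        model = train(model, audio, fixed)  # train on audio + exact transcript
    return model

# Toy simulation: the "model" is just the list of training pairs it has seen.
model = refine_voice_model(
    model=[],
    recordings=["tape1.wav", "tape2.wav"],
    transcribe=lambda m, a: f"draft of {a}",
    correct=lambda d: d.replace("draft", "corrected"),
    train=lambda m, a, t: m + [(a, t)],
)
print(model)  # [('tape1.wav', 'corrected of tape1.wav'), ('tape2.wav', 'corrected of tape2.wav')]
```

The key point of the loop is that every human correction becomes new training material, so editing effort is (in principle) an investment in future accuracy rather than a pure cost.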
The existing documentation for DocSoft:AV and its corresponding transcript editor,
DocSoft:TE, is rather simplistic in many cases, so one goal of this project was to better
understand the intricacies of using the software and to generate educational materials to
facilitate its use by others.
Spalding
Background
To build a Spalding Gray voice model in DocSoft, I worked with thirteen recordings
of Spalding’s famous monologue, Swimming to Cambodia. Developing a voice model for
Spalding presented a variety of challenges. First and foremost, the recordings I was given
were of middling quality. While difficult, this also made Spalding an ideal test case, as many
archival recordings are not impeccable studio quality. The best quality files were the audio
tracks of Spalding’s Swimming to Cambodia film, directed by Jonathan Demme. While
Spalding’s voice is reasonably clear in these recordings, they had a lot of background music
and sound effects that made isolating Spalding’s words difficult. The other recordings were
lower quality, clearly not recorded from a soundboard but instead from out in the
audience.

Figure 1: The voice model refinement process (a repeating cycle: Train, Transcribe, Edit)

One recording in particular¹ had substantial microphone problems that rendered
parts of the recording almost unlistenable. If a human could not transcribe this recording, I
did not hold out much hope for DocSoft. But we had few recordings that had full, accurate
transcripts, so I had to use the recordings that were available.
Another issue, independent of the quality of the recordings, was the content of the
Swimming to Cambodia monologue itself. Because the piece is focused on Gray’s
experiences in Asia, it contains a high number of non-English words that would be
stumbling blocks for the software. Finally, Spalding’s voice itself, even in a crystal clear
recording consisting only of English words, would be difficult for a computer to transcribe.
Spalding speaks incredibly quickly and dramatically—and with a heavy Rhode Island
accent. Because he is emphatically telling a story in these recordings, he does not take the
time to slowly, clearly articulate every word the way someone might if they were speaking
for the express purpose of using voice recognition software. Altogether, the variable
quality of the recordings, the difficult vocabulary, and the heavy accent and quick speed of
speech would make for a very challenging voice model indeed.
Process
I trained my initial Spalding voice model on the four Swimming to Cambodia film
recordings. These pieces were the best for training because they were the highest quality
and featured the clearest speech.
After training the Spalding model on the four film audio recordings, I tested the
software by having it transcribe a different performance². The results were, in a word,
abysmal. The number of incorrect words in the resulting DocSoft-generated transcript was
so high as to be uncountable without a significant investment of time.
One thing that was particularly difficult about this project was figuring out how to define
and measure accuracy. Should I simply count the number of corrections that I had to make?
Should I factor in a degree of difficulty for particular vocabulary? Do I monitor how certain
words change over time, or do I consider only the transcript as a whole? This proved quite
tricky, and though I had initially planned to measure success based on the number of
individual corrections I had to make, the Spalding transcripts were too messy for this to be
practical. Throughout the training and retraining process, Spalding’s transcripts averaged
between 100 and 150 errors in the first five minutes alone. Counting each error in a
90-minute performance was untenable.

¹ The performance of Swimming to Cambodia for an Amnesty International benefit at UC Berkeley’s Wheeler Hall on October 17, 1987.
² The transcribed performance was the infamously poor recording of the UC Berkeley show, as it was the only other performance for which I had an existing transcript at the time.
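One standard way to quantify transcription accuracy, had the transcripts been tractable, is word error rate: the minimum number of word substitutions, insertions, and deletions needed to turn the machine transcript into a correct reference, divided by the reference length. DocSoft reports no such figure; this is a minimal independent sketch of the metric.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) needed to
    turn the hypothesis into the reference, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("swimming to cambodia", "swimming toward cambodia"))
# one substitution across three reference words
```

Even this simple metric sidesteps the ambiguities above (it weights every word equally and ignores vocabulary difficulty), which is part of why pattern-spotting proved more practical than exact counts.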
Ultimately, instead of meticulously documenting each error, I looked for
general patterns, noting words that seemed to consistently be wrong or other oddities that
popped up repeatedly in the transcripts. While the transcripts were very messy and error-
ridden in general, the foreign vocabulary in particular was a substantial problem and had
to be corrected every time it came up.
DocSoft also had problems with all the music in the recording. The software is
particularly fond of transcribing the beats of background music as “And him.” Many of my
recordings contained music, resulting in enormous strings of “And him and him and him”
throughout the transcript. I learned quickly that background music should be avoided if at
all possible.
To mitigate the music problem, I cut all the leading and ending music from the
recordings. I could not cut the music in the middle, however, as this would be incredibly
tedious and would mess up the time codes used to sync the transcript and the original
recording. Without the time codes, the transcripts cannot be used for closed captioning or
indexing, so I had to leave some music in the Spalding recordings. This got rid of some of
the pesky “and him” strings, but did not eliminate them completely.
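Trimming the ends of a recording is straightforward with any audio editor; purely as a sketch of the idea (I used ordinary editing tools, not this script), Python’s standard-library wave module can drop a fixed number of seconds from each end of a WAV file:

```python
import wave

def trim_wav(src: str, dst: str, lead_sec: float, tail_sec: float) -> None:
    """Copy src to dst, dropping lead_sec seconds from the start and
    tail_sec seconds from the end (e.g. opening and closing music)."""
    with wave.open(src, "rb") as r:
        rate = r.getframerate()
        start = int(lead_sec * rate)
        end = r.getnframes() - int(tail_sec * rate)
        r.setpos(start)                      # skip the leading music
        frames = r.readframes(end - start)   # keep only the middle
        params = r.getparams()
    with wave.open(dst, "wb") as w:
        w.setparams(params)                  # frame count is fixed up on close
        w.writeframes(frames)
```

Cutting segments out of the middle, by contrast, would shift every subsequent time code relative to the retained audio, which is why only the ends could safely be trimmed.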
To help DocSoft deal with the foreign vocabulary, I explored the process of
vocabulary training—improving DocSoft’s vocabulary by giving it text documents and
word lists related to what the speaker is talking about. For the Spalding voice model, I
added a list of all Cambodian cities, as well as other East Asian place names. I also uploaded
Swimming to Cambodia transcripts for which I had no corresponding audio recordings. This
was tremendously helpful; where DocSoft would consistently get city names wrong, once
the supplemental vocabulary was added, it correctly identified these words nearly every
time they appeared.
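Assembling such a word list from a related document is simple. As an illustrative sketch (DocSoft accepts related documents directly, so this preprocessing is optional, and the sample text is hypothetical), one can extract the unique terms from a reference text:

```python
import re
from collections import Counter

def build_word_list(text: str, min_count: int = 1) -> list[str]:
    """Collect candidate vocabulary terms from a related document,
    e.g. a page of Cambodian place names, for upload as a word list."""
    words = re.findall(r"[A-Za-z']+", text)
    counts = Counter(words)
    return sorted(w for w, n in counts.items() if n >= min_count)

# Hypothetical snippet of a related document:
doc = "Phnom Penh, Battambang, Siem Reap, Phnom Penh"
print(build_word_list(doc))  # ['Battambang', 'Penh', 'Phnom', 'Reap', 'Siem']
```

A min_count threshold above 1 can filter out one-off noise when the source document is long.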
Though I tried many different tricks to improve the Spalding voice model’s
performance, such as changing the default vocabulary when creating the model, and
changing the audio file size and duration, adding supplemental vocabulary made the
biggest difference in improving transcript accuracy.
I was also curious about how DocSoft’s voice models “learned” using the training
materials. I wanted to see if the Spalding model could generalize from being trained on a
snippet of a recording to a clean transcription of the entire recording. If this worked, it
would mean that, should a voice model perform poorly on a recording (as Spalding was
consistently doing), a human could manually transcribe a small piece of the recording to
get the software accustomed to interpreting under those particular conditions and
ultimately improve the overall accuracy of the computer-generated transcript.
To test this on Spalding, I first got the software to transcribe an entire 90-minute
recording. I then cut a five-minute clip (devoid of music and most background noise) out of
this recording and trained Spalding on the clip, providing it with both the snippet and a
correct transcript. After the training was complete, I then asked DocSoft to transcribe the
entire 90-minute performance again to see if the transcript quality improved.
Unfortunately, the transcript did not improve: while some errors were fixed, new
errors appeared in their place.
Because Spalding could not generalize from being trained on a part to improving on
the larger whole, I decided to narrow the scope of testing and see if Spalding could
generalize from being trained on a clip to successfully transcribing that same clip. I tested
Spalding by getting it to transcribe a 5-minute, music-free clip. As usual, the results were
very messy. I then trained Spalding on the exact same clip I had asked it to transcribe—
meaning I gave DocSoft the same audio it had just transcribed, along with an “answer key”
in the form of a completely correct transcript. Finally, I tested DocSoft on the same audio
clip once again, theorizing that the transcript would be nearly perfect this time, as DocSoft
had already been given a corrected transcript. And while the training did indeed make a
difference, and Spalding performed much better after the training than it had in the test
before the training, there were still a surprisingly high number of errors, and the transcript
was far from clean.
Ultimately, when my final, best-trained Spalding model was given a new, never-
before-heard audio recording, it did not perform well. Out of a sample of 480 words,
DocSoft made 101 errors (53%) and got 90 words correct (47%). Because the model is
wrong over half the time, using DocSoft to transcribe Spalding was not a fruitful exercise.
A Note on File Size
Apart from the issues inherent in transcribing Spalding’s voice, I also encountered
some technical problems with DocSoft. The only locations in which I could upload WAV
files were the iSchool and the Tarlton Law Library. At other locations, such as the Benson
Latin American Collection, the PCL, and in my apartment, DocSoft would time out before
uploading the file, resulting either in an error message or a blank screen. Having a very fast
Internet connection and not relying on Wi-Fi seems to be the only way to successfully
upload very large files to DocSoft.
As an alternative, I found that uploading smaller MP3 files instead of WAVs worked
very well. There was no difference between DocSoft’s performance with a Spalding WAV
file and the same file as an MP3. The DocSoft documentation stresses that people should
use the best quality files possible, but due to the uploading issues with very high quality
files, it seems better to use MP3s if necessary. After all, a lower-quality uploaded file is
preferable to no uploaded file at all.
Spalding Conclusions
Ultimately, the Spalding voice model was not successful; it failed to generate any
transcripts that were clean enough to justify editing them (as opposed to simply
hand-transcribing the recordings). From this experiment, several lessons are clear. First and
foremost, if the quality of the original recording is poor, no amount of fiddling with DocSoft
can improve the transcript. Because several of the recordings I used would be difficult for a
human to understand and transcribe, it was unlikely that DocSoft would be able to parse
them. Secondly, Spalding’s talking speed and accent were problematic. Based on voice
models of other speakers (explained below), the way in which the speaker talks in the
recording makes a big difference in whether or not a clean transcript can be made. Because
of Spalding’s mannerisms and accent, he was a particularly difficult case for DocSoft.
Finally, DocSoft can easily recognize foreign words or highly technical vocabulary if these
terms are added to the vocabulary as related documents or word lists. Taking this simple
step can result in dramatic transcription improvements.
Cecil
Background
As noted above, Spalding presented a number of technical and other difficulties
(including voice and recording quality and the limited number of recordings I had for
testing). Because of this, I decided to create another archival model test case, using
recordings that were not originally created with the goal of transcription. For this, I used
the popular comedic podcast Welcome to Night Vale, which primarily features monologues
delivered by a community radio host character named Cecil.
Using the Cecil recordings had several advantages over the Spalding recordings.
Most importantly, Cecil’s voice is very clear. These podcasts were professionally recorded
in a studio, and Cecil speaks more slowly than Spalding and without a New England accent.
The other advantage of using Night Vale podcasts was that there were a lot more available
recordings (over 40), all with perfect transcripts.
There were still downsides to the Cecil voice model, however. The recordings
existed only as streaming-quality MP3s, so doing tests on high-quality files was not
possible. And like the Swimming to Cambodia Jonathan Demme film audio, the Night Vale
recordings contain lots of music at the beginning, middle, and end. As with Spalding, this
podcast also uses quirky vocabulary and mentions characters and places with strange
names, increasing the likelihood of transcription error.
Process
To begin, I trained the Cecil model on the pilot episode of Welcome to Night Vale, and
then used the second episode for testing. To see how much training could improve
performance, I then trained Cecil on the third episode of the series and tested it on the
second episode again. Even after training on only the pilot episode, the transcript of
episode two was much cleaner and easier to fix than even the best Spalding results. When I
gave Cecil the second round of training on episode three, its transcription of episode two
did improve somewhat, but many of the same errors remained. For example, Cecil
consistently got “Night Vale” wrong, despite this phrase appearing dozens of times in all of
the training materials.
Because using related vocabulary worked so well with Spalding, I decided to try it
with Cecil as well. The Night Vale podcast has a large fan base that produces lots of
supplementary information and related documents, including an elaborate Wiki site. To see
if I could get DocSoft to learn “Night Vale,” I used the “Night Vale” page on the Wiki. This
page contained many people and place names commonly mentioned on the show, and used
the phrase “Night Vale” 52 times—hopefully enough to get the point across³. Once I
uploaded this Wiki page, the Cecil model correctly identified “Night Vale” and several other
odd names every single time. Cecil did still have major problems with music, as music is
prominently featured in the show, so there were lots of “and him” sentences, but apart
from cutting out large swathes of music from the beginning and end as I did with Spalding,
this problem was unavoidable.
As with Spalding, I did the “generalize from part to whole” and “generalize from part
to the same part” tests with Cecil. I got Cecil to transcribe episode 4, “PTA Meeting,” and
then trained Cecil on the first six music-free minutes of that episode. When I then asked
Cecil to transcribe the same six minute clip, a few of the errors were gone, but some still
remained. When I had Cecil transcribe the entire “PTA Meeting” episode after the training,
the errors did not visibly decrease. As with Spalding, though many of the errors were gone,
new errors had popped up in their place.
After trying this test with both Spalding and Cecil without seeing notable
improvement, I concluded that DocSoft probably does not “remember” whole text/audio
file pairings and look for matches. If this were the case, the Spalding and Cecil models
would have performed flawlessly on the audio they had already been trained on. Instead, it
seems that the software looks for general patterns extrapolated from an aggregation of the
individual training sessions. So, unfortunately, my hope that users could hand-transcribe
audio snippets for training and thus improve the quality of the overall DocSoft transcript
seems untenable, at least on a small scale.
³ See Welcome to Night Vale Wiki in the citations.
Ultimately, though, Cecil’s performance was quite good. On my best-trained version
of the Cecil model, for a brand new audio recording, out of a sample of 480 words, DocSoft
made 41 errors (9%) and got 439 (91%) of the words correct.
Cecil Conclusions
The main conclusion I drew from working with the Cecil model is that having a clear
recording of a careful speaker is much more important than having a high-quality audio
file. Cecil’s clear and deliberate voice, even delivered through low-quality
streaming MP3s, yielded much better transcripts than Spalding’s high-quality WAV files. I suspect
this is partly due to differences in the speech patterns and habits of the two speakers, and
partly due to the poor recording conditions of the Spalding monologues. The lack of
audience and other ambient noise (excluding music and sound effects) improved DocSoft’s
performance with Cecil considerably.
The Cecil model also reiterates the importance of related vocabulary. As with
Spalding, without the related documentation of the Night Vale Wiki page, Cecil’s error rate
for unusual words and names was almost 100%. With the inclusion of a single Wiki page,
the unusual and technical words were then transcribed correctly every time. Finally, the
biggest problem for Cecil was background music. When cleaning up the transcripts, I had to
delete large swathes of “And him,” which decreased the efficiency of using DocSoft and also
likely hindered the accuracy of the voice model.
Building a second Cecil voice model would be an interesting future experiment to
test the impact of less (or more) training on transcript accuracy. While Cecil did reasonably
well after training on just a couple of documents, creating a separate model with only one
training document and testing an entire series of generated transcripts could be useful in
establishing a “bare minimum” for a clear but low-quality audio file. During my work with
Spalding, I created two voice models, one which received half as much training as the other
(trained on two Swimming to Cambodia film clips, rather than four), and there was no
significant difference between the two models’ performance—namely, they were both
unimpressive. But there were many more variables to control with Spalding (the quality of
various recordings and recording equipment, audience noise, the variable nature of
Spalding’s voice, etc.). Cecil would thus be an interesting case for establishing a training
“bare minimum” because the recordings are quite consistent, recorded on the same
equipment and under the same conditions, so it may prove easier to ascertain whether
errors are due to insufficient training or another factor.
Erin
Background
After having moderate success with Cecil and no success with Spalding, I then
decided to use DocSoft for a more conventional purpose: a contemporary voice model of a
speaker who knows their recordings will be transcribed. Naturally, I decided to create a
voice model of myself. A voice model of my own voice presented a level of flexibility that
working with preexisting archival recordings did not. Not only could DocSoft adjust to my
voice, but I could also adjust my own performance based on what DocSoft identified or
missed.
Process
To train my voice model, I recorded myself reading the script of a tutorial I wrote on
how to use the DocSoft software⁴. I paired my recording with the unannotated script I read
and triple-checked to ensure 100% transcript accuracy. After training on a single file, I
tested the Erin model on a second tutorial script. Despite having been trained only on one
five-minute recording of my voice, the model did very well on the new recording: the
transcript contained almost no errors. Ironically, though, DocSoft got the word “DocSoft” wrong
every single time. Overall, the first test of the Erin voice model contained
twenty-five errors in five minutes—as opposed to the Spalding transcripts, which sometimes
contained over 150 errors in five minutes.
Using my best-trained Erin voice model on a brand new recording, out of a sample
of 480 words, DocSoft made 26 errors (5%) and correctly identified 454 words (95%). The
positive performance was likely because I recorded my audio on high-quality studio
equipment (as with the Cecil model) and spoke clearly and deliberately. Being aware of
how to talk to facilitate DocSoft transcription seems to greatly help the software’s accuracy.
⁴ The experience of training DocSoft using materials about how to train DocSoft was unsettlingly meta.
DocSoft therefore seems most likely to successfully parse recordings made for the specific
purpose of subsequent transcription.
Erin Conclusions
Working with the Erin model was a much more positive illustration of what DocSoft
is capable of. DocSoft does very well with the types of recordings it was designed to handle,
namely high-quality monologue recordings without any music or ambient noise, where the
speaker is aware of how to speak (clearly, deliberately, and naturally) and doesn’t have a
heavy accent. Under these conditions, DocSoft performed wonderfully. Archival recordings,
or those not recorded with the express goal of future automated transcription (i.e.,
Spalding and Cecil), are more of a mixed bag.
Transcript Editing with Glifos and DocSoft:TE
Throughout the project, one perpetual hurdle was editing the transcripts that
DocSoft produced. Over the course of the semester, there were many technical problems in
this regard that can hopefully be avoided in future projects.
The DocSoft:AV transcription software has an accompanying transcript editor,
DocSoft:TE, which, unlike the web-based DocSoft:AV, runs locally on the user’s
computer. DocSoft:TE allows for nearly real-time text editing
simultaneous with audio playback, as well as seamless integration with DocSoft:AV’s voice
model training features. At the start of the project, UT did not have the latest version of
DocSoft:TE, and the version I could access did not work properly.
As a result, for the first half of the project, I edited all DocSoft transcripts in Glifos, a
rich-media content management system also used by UT. Glifos is “designed to integrate
digital video, audio, text, and image documents through a process that ‘automates the
production, cataloguing, digital preservation, access, and delivery of rich-media over
diverse data transport platforms and presentation devices’” (Van Deusen Phillips). Among
other features, Glifos allows users to synchronize a video and a corresponding transcript,
so it is a useful way to make the fruits of DocSoft’s labor accessible online. However, because
it was not primarily designed as a transcript editor, Glifos proved difficult to use for the
extensive, heavy transcript editing that Spalding required. Even with the audio playback
speed slowed, the process of stopping, starting, and rewinding the recording was quite
tedious. Furthermore, the Glifos interface itself was not designed to allow for smooth,
seamless editing. Rather, editing requires users to scroll through large blocks of text with
time codes to keep up with the recording or find pieces that require editing (see Figure 2).
Figure 2: The Glifos editing interface with a Spalding transcript
While editing Spalding transcripts using Glifos, I also experienced an odd glitch that
caused me to lose several hours of editing work. If using Glifos for editing, it is of
paramount importance to save one’s progress frequently, as the program can be buggy at
times. Ultimately, I found it faster to simply transcribe the Spalding recordings by hand
than to edit them in Glifos, defeating the purpose of using automatic transcription. This was
partly because of the high error rate inherent in the Spalding transcripts and partly due to
the Glifos interface, which was not designed for the intensive editing that Spalding
required.
Midway through the project, DocSoft:TE, the DocSoft transcript editor, was updated
to the latest version, so I switched from Glifos to TE. Because it was specifically designed
for editing DocSoft transcripts, this software proved easier to use than Glifos. One useful
DocSoft:TE feature was the color-coded “confidence bar” running down the center of the
screen. Using colors ranging from green (high confidence) to red (low confidence), DocSoft
indicates how likely different portions of the transcript are to be correct. This allows users
to easily see at a glance areas that may require particularly extensive editing. I found this
feature useful, particularly for the very long transcripts I was working with, but I also noted
that the DocSoft:TE confidence bar tended to be overconfident: many areas marked bright
green (“very confident”) nonetheless contained several errors.
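The mechanics of such a bar can be sketched as a simple mapping from a per-utterance confidence score to a display color. The thresholds below are invented for illustration; DocSoft’s actual cutoffs are not documented in the materials I had.

```python
def confidence_color(score: float) -> str:
    """Map a 0-1 recognition confidence to a traffic-light color,
    mimicking DocSoft:TE's confidence bar (thresholds are invented)."""
    if score >= 0.85:
        return "green"   # likely correct, but still worth spot-checking
    if score >= 0.60:
        return "yellow"  # review recommended
    return "red"         # probably needs heavy editing

print([confidence_color(s) for s in (0.95, 0.7, 0.3)])  # ['green', 'yellow', 'red']
```

The overconfidence I observed suggests that, whatever the real thresholds are, even "green" spans deserve a human pass.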
Another useful feature was the ability to export from DocSoft:TE back to AV.
Because both the TE and AV software are made by DocSoft, they work in tandem nicely.
Once editing in TE is complete, it is easy for users to export transcripts directly back into
DocSoft:AV. This very convenient feature allows the voice model to be automatically
trained on the edited transcript, without requiring the user to log into DocSoft:AV manually
and upload the transcript for training.
Figure 3: The DocSoft:TE Editing Interface with a Cecil transcript
I edited about half of the Spalding transcripts and all of the Cecil transcripts in
DocSoft:TE. Because the error rate was so low for the Erin transcripts, I simply edited them
in a basic text editing program. TE is designed to allow users to edit a transcript in real
time, as the audio is playing, as well as to quickly jump forward and backward in the
transcript as needed without ever pausing the audio. Overall, TE had better text grouping
and formatting options than Glifos, and it was much easier to scan and edit large
transcripts. The TE interface did have its problems, however. For instance, because TE
makes it so easy to jump all around the transcript, it was often difficult to navigate without
jumping to the wrong spot in the transcript. It was also easy to unwittingly delete the
wrong word or add new words in the wrong place. Adding spaces for new groups of words
(called “utterances” in DocSoft lingo) also proved difficult, and I believe I found a bug in the
software that made entering the time codes for new utterances quite tedious.
For relatively clean transcripts, like the results of the Cecil model, DocSoft:TE
worked well enough. But for messier transcripts like Spalding’s, TE, like Glifos, was less
efficient. It took almost the same amount of time to edit the Spalding
transcripts in TE as it did to hand-transcribe them with the audio slowed down. DocSoft:TE
was definitely an improvement over Glifos, but was still a slog for very rough transcripts.
General Conclusions
The difference between the Spalding and Erin voice models over the course of the
project was remarkable. From this project, it is clear that there are certain factors that can
make particular recordings good candidates for automatic transcription, or that can help
create the best possible voice model. When creating a recording for transcription or
considering whether or not to use DocSoft for an existing recording, keep these factors in
mind.
1. Speaker and recording quality is paramount.
The best speakers are those who speak naturally, but also clearly and
carefully. DocSoft seems to perform best when speakers know (1) that their
recordings are going to be computer-transcribed and (2) how they can change their
voices (by, for instance, speaking more slowly, enunciating, or avoiding stammering)
to facilitate good performance by the software.
Recordings should be made with high-quality recording equipment, ideally in
a silent environment. Background music, audience noise, or other ambient noise is
difficult for DocSoft and should be avoided.
Past a certain threshold, the actual audio file quality does not seem to matter
very much. In this project, I did not notice any difference in accuracy between a
WAV file and an MP3 of the same recording. In fact, I had better success with
streaming MP3s (in the Cecil case) than with high-quality WAVs (Spalding) simply
because the speaker was clearer and the recording quality was better. While a high-
resolution audio file is ideal, it is not critical to a clean transcript.
2. The “ideal” number of training audio files is unclear.
While more training files are intuitively better, I could not establish a
definitive minimum number of training materials that should be used to generate a
good voice model. This differed for all three voice models I made. Cecil started off
fairly inaccurate but improved a lot after three more training recordings. Erin was
nearly perfect after just one training recording. Spalding was trained on many files
several times over and never substantially improved. Users should use as many training
materials as are available, but the “ideal” number for a good model
varies depending on the recordings and speaker in question.
3. Related documents and word lists are great for improving accuracy.
Related documents and word lists should be used to supplement the base
vocabulary whenever they are available. This feature is particularly useful when
speakers use highly technical vocabulary or lots of acronyms and initialisms. Using
supplemental vocabulary dramatically improved all of my voice models
immediately.
4. Archival audio should be considered for use with DocSoft on a case-by-case basis.
From this project, I learned that archival audio can be a mixed bag when
working with DocSoft. Collections of recordings need to be evaluated individually to
determine if they might work in DocSoft and how to tinker with the software or the
recordings to improve performance. I could not determine a set of one-size-fits-all
recommendations covering all archival audio.
Instead, it seems best to consider sets of recordings on their own merits,
taking into account the speaker, the quality and clarity of the recordings, and the
presence of other speakers, noise, or music. Some recordings will lend themselves to
easy transcription; others will not. For clear monologue recordings made under
good conditions, DocSoft can be instrumental in making these recordings more
accessible and searchable. For other recordings, however, DocSoft may be more
trouble than it’s worth, and hand-transcription should be used. While DocSoft’s
utility for archival recordings is hit or miss, for lectures and other contemporary
uses, DocSoft is quite promising indeed.
Works Cited
DocSoft, Inc. DocSoft:AV Appliance User Guide. 2nd ed. DocSoft, Inc., 2009. Web. 25 Apr. 2014.
Stine, Matt, and Stephen Cooper. “Spalding Gray: A Preliminary Inventory of His Papers at the Harry Ransom Center.” Harry Ransom Center. N.p., 2011. Web. 25 Apr. 2014.
Van Deusen Phillips, Sarah. “GLIFOS-Media: Rich Media Archiving.” The Documentalist, 2009. Web. 25 Apr. 2014.
Welcome to Night Vale Wiki. “Night Vale.” N.p., 2014. Web. 25 Apr. 2014.