
Building “Zombie” Voice Models:
Audio-mining the Past Using Voice Recognition for Transcription of Audio Artifacts

By Erin Donohue
Supervisors: Tanya Clement & Quinn Stewart
Spring 2014


Introduction

The Building “Zombie” Voice Models project was designed to determine the feasibility of using DocSoft:AV transcription software to create computer-generated transcripts of archival audio recordings. For this project, I attempted to create a working “voice model,” or speaker profile, of popular monologist Spalding Gray in the DocSoft:AV software. Gray’s personal papers are housed in UT Austin’s Harry Ransom Center for Humanities Research, and the collection includes hundreds of audio and VHS recordings of Gray’s performances (Stine and Cooper). This project originated when a researcher in Chicago, Jim Sitar, was working on a project with Gray’s recordings. Gray’s audio recordings do not have transcripts, but there were far too many audiocassettes and VHS tapes to hand-transcribe each one. Thus, this project was created to apply contemporary voice recognition software—DocSoft:AV—to archival audio recordings and use the software to automatically create computer-generated transcripts of Spalding’s monologues.

First and foremost, this project aimed to make Spalding Gray’s recordings more accessible to researchers. Secondly, it aimed to assess the feasibility of large-scale automated transcription of archival audio recordings in general. Archival audio is notoriously difficult to access in archives, and increasing access to transcribed audio would help make audio collections of all kinds easier to search and utilize. Thirdly, the project aimed to devise a set of best practices and training materials for University of Texas Libraries staff members who may wish to use the software.

In DocSoft:AV, I created voice models (speaker profiles customized for a particular voice) for three voices: Spalding Gray, a podcast character named Cecil, and my own voice. The success of these voice models was highly variable, ranging from the error-plagued Spalding transcripts to the nearly perfect transcripts of my own voice. In generating a successful voice model, having a clear, good-quality recording is paramount. While the project was not successful in generating a good voice model for Spalding Gray, it did provide more insight into what is required to make good computer-generated transcripts and ultimately generated useful software tutorial materials. Though the Spalding voice model did not output useful transcripts, this should not suggest that DocSoft can never successfully transcribe archival recordings. Rather, this project shows that there are factors in the recordings that may make them more or less amenable to DocSoft transcription—so archival recordings must be considered on a case-by-case basis.

An Overview of DocSoft:AV

DocSoft:AV is speech-to-text software that captures and transcribes spoken-word audio. It matches the sounds of audio recordings to phonemes and then uses these phonemes to create words and sentences (DocSoft, Inc.). DocSoft is used to generate text transcripts as well as captions, as each transcript comes with timing codes to sync the text with its corresponding media file. DocSoft:AV is conventionally used to audio-mine contemporary audio and video recordings, such as lectures and presentations.
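To illustrate what such timing codes look like in practice, the Python sketch below builds a caption entry in the standard SubRip (SRT) format. DocSoft’s own transcript format is not documented here, so SRT serves purely as a stand-in, and the function names and example text are hypothetical.

```python
# Illustrative sketch only: SRT is a common caption format used here as a
# stand-in for DocSoft's (undocumented) timed-transcript format.

def to_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 65.25 -> '00:01:05,250'."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_entry(index: int, start: float, end: float, text: str) -> str:
    """Build one numbered caption block: index, time range, then the text."""
    return f"{index}\n{to_timestamp(start)} --> {to_timestamp(end)}\n{text}\n"

print(srt_entry(1, 12.0, 15.5, "It was the first day of the monologue."))
```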

One unique DocSoft feature is the software’s ability to be trained on individual voices, learning their nuances and foibles. This allows for transcription tailored to each speaker who has a profile on the software, rather than a one-size-fits-all approach to transcribing any voice. DocSoft does this through the creation of speaker profiles, also known as voice models. The software is “trained” on a particular person’s voice and, via machine learning, improves its performance and refines its ability to successfully transcribe that voice.


The training process entails giving DocSoft both an audio recording and an exact transcript of that recording, which it uses to “learn” about a particular voice. DocSoft matches the text of the transcript to the phonemes in the audio recording and, as such, can create customized models for particular voices. As DocSoft is given more and more training materials, it learns more about the speaker’s voice. When given just audio without a corresponding transcript, DocSoft will output a computer-generated transcript. This resulting transcript can be corrected and then fed back into DocSoft as training material, improving the overall accuracy of the voice model (see Figure 1).

Figure 1: The voice model refinement process (Train → Transcribe → Edit)
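The Figure 1 cycle can be summarized as a loop. The sketch below is purely conceptual: train, transcribe, and hand_correct are hypothetical stand-ins for operations performed through DocSoft’s web interface, not an actual DocSoft API.

```python
# Conceptual sketch of the Figure 1 cycle; train(), transcribe(), and
# hand_correct() are hypothetical stand-ins, not DocSoft's actual API.
def refine_voice_model(model, seed_pairs, recordings, rounds=3):
    """Iteratively improve a voice model via the train -> transcribe -> edit cycle."""
    for audio, exact_transcript in seed_pairs:
        model.train(audio, exact_transcript)   # initial training on known pairs
    for _ in range(rounds):
        for audio in recordings:
            draft = model.transcribe(audio)    # computer-generated transcript
            corrected = hand_correct(draft)    # a human fixes the errors
            model.train(audio, corrected)      # corrected draft becomes new training data
    return model
```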

The existing documentation for DocSoft:AV and its corresponding transcript editor, DocSoft:TE, is rather simplistic in many cases, so one goal of this project was to better understand the intricacies of using the software and to generate educational materials to facilitate its use by others.

Spalding

Background

To build a Spalding Gray voice model in DocSoft, I worked with thirteen recordings of Spalding’s famous monologue, Swimming to Cambodia. Developing a voice model for Spalding presented a variety of challenges. First and foremost, the recordings I was given were of middling quality. While difficult, this also made Spalding an ideal test case, as many archival recordings are not impeccable studio quality. The best-quality files were the audio tracks of Spalding’s Swimming to Cambodia film, directed by Jonathan Demme. While Spalding’s voice is reasonably clear in these recordings, they had a lot of background music and sound effects that made isolating Spalding’s words difficult. The other recordings were lower quality, clearly recorded not from a soundboard but from out in the audience. One recording in particular¹ had substantial microphone problems that rendered parts of the recording almost unlistenable. If a human could not transcribe this recording, I did not hold out much hope for DocSoft. But we had few recordings with full, accurate transcripts, so I had to use the recordings that were available.

Another issue, independent of the quality of the recordings, was the content of the Swimming to Cambodia monologue itself. Because the piece is focused on Gray’s experiences in Asia, it contains a high number of non-English words that would be stumbling blocks for the software. Finally, Spalding’s voice itself, even in a crystal-clear recording consisting only of English words, would be difficult for a computer to transcribe. Spalding speaks incredibly quickly and dramatically—and with a heavy Rhode Island accent. Because he is emphatically telling a story in these recordings, he does not take the time to slowly, clearly articulate every word the way someone might if they were speaking for the express purpose of using voice recognition software. Altogether, the variable quality of the recordings, the difficult vocabulary, and the heavy accent and quick speed of speech would make for a very challenging voice model indeed.

Process

I trained my initial Spalding voice model on the four Swimming to Cambodia film recordings. These pieces were the best for training because they were the highest quality and featured the clearest speech.

After training the Spalding model on the four film audio recordings, I tested the software by having it transcribe a different performance². The result was, in a word, abysmal. The number of incorrect words in the resulting DocSoft-generated transcript was so high as to be uncountable without a significant investment of time.

One thing that was particularly difficult about this project was figuring out how to define and measure accuracy. Should I simply count the number of corrections that I had to make? Should I factor in a degree of difficulty for particular vocabulary? Do I monitor how certain words change over time, or do I consider only the transcript as a whole? This proved quite tricky, and though I had initially planned to measure success based on the number of individual corrections I had to make, the Spalding transcripts were too messy for this to be practical. Throughout the training and retraining process, Spalding’s transcripts averaged between 100 and 150 errors in the first five minutes alone. Counting each error in a 90-minute performance was untenable.

¹ The performance of Swimming to Cambodia for an Amnesty International benefit at UC Berkeley’s Wheeler Hall on October 17, 1987.
² The transcribed performance was the infamously poor recording of the UC Berkeley show, as it was the only other performance for which I had an existing transcript at the time.
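One conventional way to make such error counts tractable is word error rate (WER): the word-level edit distance between a hand-corrected reference transcript and the machine output, divided by the reference length. This is a standard metric rather than anything DocSoft itself reports; the sketch below shows the idea. At a typical speaking rate of 120 to 150 words per minute, 100 to 150 errors in five minutes would correspond to a WER of roughly 15 to 25%.

```python
# Minimal word error rate (WER) sketch: word-level Levenshtein distance
# between a reference transcript and a hypothesis, divided by reference length.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(round(word_error_rate("swimming to cambodia", "swimming to camp media"), 2))  # 0.67
```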

Ultimately, instead of meticulously documenting each error, I looked for general patterns, noting words that seemed to be consistently wrong or other oddities that popped up repeatedly in the transcripts. While the transcripts were very messy and error-ridden in general, the foreign vocabulary in particular was a substantial problem and had to be corrected every time it came up.

DocSoft also had problems with all the music in the recordings. The software is particularly fond of transcribing the beats of background music as “And him.” Many of my recordings contained music, resulting in enormous strings of “And him and him and him” throughout the transcript. I learned quickly that background music should be avoided if at all possible.

To mitigate the music problem, I cut all the leading and ending music from the recordings. I could not cut the music in the middle, however, as this would be incredibly tedious and would mess up the time codes used to sync the transcript and the original recording. Without the time codes, the transcripts cannot be used for closed captioning or indexing, so I had to leave some music in the Spalding recordings. This got rid of some of the pesky “and him” strings, but did not eliminate them completely.
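For readers who want to do similar trimming, the sketch below shows one way to cut leading and trailing music with the pydub library; the file name and offsets are hypothetical, and this is not necessarily the tool used in the project. Note that trimming only the ends preserves the relative spacing of the remaining audio, whereas cuts in the middle would shift every subsequent time code.

```python
# Hypothetical example of trimming leading/trailing music with pydub
# (pydub requires ffmpeg to be installed for non-WAV formats).
from pydub import AudioSegment

audio = AudioSegment.from_file("swimming_to_cambodia.wav")

# pydub slices by milliseconds: keep everything between the end of the
# opening music (assumed 90 s) and the start of the closing music (assumed 45 s).
speech_only = audio[90_000:len(audio) - 45_000]

speech_only.export("swimming_to_cambodia_trimmed.wav", format="wav")
```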

To help DocSoft deal with the foreign vocabulary, I explored the process of vocabulary training—improving DocSoft’s vocabulary by giving it text documents and word lists related to what the speaker is talking about. For the Spalding voice model, I added a list of all Cambodian cities, as well as other East Asian place names. I also uploaded Swimming to Cambodia transcripts for which I had no corresponding audio recordings. This was tremendously helpful: where DocSoft had consistently gotten city names wrong, once the supplemental vocabulary was added it correctly identified these words nearly every time they appeared.
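A supplemental word list of this kind can be assembled semi-automatically. The sketch below is a hypothetical example rather than the project’s actual procedure: it pulls capitalized tokens out of an existing transcript as rough candidates for place and person names, which could then be reviewed and uploaded through DocSoft’s vocabulary-training interface.

```python
# Hypothetical helper: extract capitalized tokens from a transcript as
# candidate proper nouns for a supplemental DocSoft word list.
import re

with open("swimming_to_cambodia_transcript.txt", encoding="utf-8") as f:
    text = f.read()

# Capitalized words are a rough proxy for names; a human should review the list.
candidates = sorted(set(re.findall(r"\b[A-Z][a-zA-Z-]+\b", text)))

with open("supplemental_vocabulary.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(candidates))
```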

Though I tried many different tricks to improve the Spalding voice model’s performance, such as changing the default vocabulary when creating the model and changing the audio file size and duration, adding supplemental vocabulary made the biggest difference in improving transcript accuracy.

I was also curious about how DocSoft’s voice models “learned” from the training materials. I wanted to see if the Spalding model could generalize from being trained on a snippet of a recording to a clean transcription of the entire recording. If this worked, it would mean that, should a voice model perform poorly on a recording (as Spalding was consistently doing), a human could manually transcribe a small piece of the recording to get the software accustomed to interpreting under those particular conditions and ultimately improve the overall accuracy of the computer-generated transcript.

To test this on Spalding, I first had the software transcribe an entire 90-minute recording. I then cut a five-minute clip (devoid of music and most background noise) out of this recording and trained Spalding on the clip, providing it with both the snippet and a correct transcript. After the training was complete, I asked DocSoft to transcribe the entire 90-minute performance again to see if the transcript quality improved. Unfortunately, it did not: while some errors were fixed, new errors appeared in their place.

Because Spalding could not generalize from being trained on a part to improving on the larger whole, I decided to narrow the scope of testing and see if Spalding could generalize from being trained on a clip to successfully transcribing that same clip. I tested Spalding by having it transcribe a five-minute, music-free clip. As usual, the results were very messy. I then trained Spalding on the exact same clip I had asked it to transcribe—meaning I gave DocSoft the same audio it had just transcribed, along with an “answer key” in the form of a completely correct transcript. Finally, I tested DocSoft on the same audio clip once again, theorizing that the transcript would be nearly perfect this time, as DocSoft had already been given a corrected transcript. And while the training did indeed make a difference, and Spalding performed much better after the training than it had before, there was still a surprisingly high number of errors, and the transcript was far from clean.

Ultimately, when my final, best-trained Spalding model was given a new, never-before-heard audio recording, it did not perform well. Out of a sample of 480 words, DocSoft made 101 errors (53%) and got 90 words correct (47%). Because the model was wrong over half the time, using DocSoft to transcribe Spalding was not a fruitful exercise.

A Note on File Size

Apart from the issues inherent in transcribing Spalding’s voice, I also encountered some technical problems with DocSoft. The only locations from which I could upload WAV files were the iSchool and the Tarlton Law Library. At other locations, such as the Benson Latin American Collection, the PCL, and my apartment, DocSoft would time out before uploading the file, resulting either in an error message or a blank screen. Having a very fast Internet connection and not relying on Wi-Fi seems to be the only way to successfully upload very large files to DocSoft.

As an alternative, I found that uploading smaller MP3 files instead of WAVs worked very well. There was no difference between DocSoft’s performance with a Spalding WAV file and with the same file as an MP3. The DocSoft documentation stresses that people should use the best-quality files possible, but due to the uploading issues with very high-quality files, it seems better to use MP3s if necessary. After all, a lower-quality uploaded file is preferable to no uploaded file at all.
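Converting a large WAV to a smaller MP3 before uploading is easy to script. A minimal sketch with the pydub library follows; the file names and bitrate are hypothetical, and pydub delegates the actual encoding to ffmpeg, which must be installed.

```python
# Hypothetical example: re-encode a large WAV as a smaller MP3 for upload.
from pydub import AudioSegment

AudioSegment.from_file("performance.wav").export(
    "performance.mp3", format="mp3", bitrate="128k"
)
```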

Spalding Conclusions

Ultimately, the Spalding voice model was not successful: it failed to generate any transcripts that were clean enough to justify editing them (as opposed to simply hand-transcribing the recordings). From this experiment, several lessons are clear. First and foremost, if the quality of the original recording is poor, no amount of fiddling with DocSoft can improve the transcript. Because several of the recordings I used would be difficult for a human to understand and transcribe, it was unlikely that DocSoft would be able to parse them. Secondly, Spalding’s talking speed and accent were problematic. Based on voice models of other speakers (explained below), the way the speaker talks in the recording makes a big difference in whether or not a clean transcript can be made. Because of Spalding’s mannerisms and accent, he was a particularly difficult case for DocSoft. Finally, DocSoft can easily recognize foreign words or highly technical vocabulary if these terms are added to the vocabulary as related documents or word lists. Taking this simple step can result in dramatic transcription improvements.

Cecil

Background

As noted above, Spalding presented a number of technical and other difficulties (including voice and recording quality and the limited number of recordings I had for testing). Because of this, I decided to create another archival-style test case, using recordings that were not originally created with the goal of transcription. For this, I used the popular comedic podcast Welcome to Night Vale, which primarily features monologues delivered by a community radio host character named Cecil.

Using the Cecil recordings had several advantages over the Spalding recordings. Most importantly, Cecil’s voice is very clear: the podcasts were professionally recorded in a studio, and Cecil speaks more slowly than Spalding, without a New England accent. The other advantage of using the Night Vale podcasts was that there were many more recordings available (over 40), all with perfect transcripts.

There were still downsides to the Cecil voice model, however. The recordings existed only as streaming-quality MP3s, so doing tests on high-quality files was not possible. And like the audio from Jonathan Demme’s Swimming to Cambodia film, the Night Vale recordings contain lots of music at the beginning, middle, and end. As with Spalding, this podcast also uses a quirky vocabulary and mentions characters and places with strange names, increasing the likelihood of transcription error.

Process

To begin, I trained the Cecil model on the pilot episode of Welcome to Night Vale and then used the second episode for testing. To see how much training could improve performance, I then trained Cecil on the third episode of the series and tested it on the second episode again. Even after training on only the pilot episode, the transcript of episode two was much cleaner and easier to fix than even the best Spalding results. When I gave Cecil the second round of training on episode three, its transcription of episode two did improve somewhat, but many of the same errors remained. For example, Cecil consistently got “Night Vale” wrong, despite this phrase appearing dozens of times in all of the training materials.

Because using related vocabulary worked so well with Spalding, I decided to try it with Cecil as well. The Night Vale podcast has a large fan base that produces lots of supplementary information and related documents, including an elaborate Wiki site. To see if I could get DocSoft to learn “Night Vale,” I used the “Night Vale” page on the Wiki. This page contained many people and place names commonly mentioned on the show and used the phrase “Night Vale” 52 times—hopefully enough to get the point across³. Once I uploaded this Wiki page, the Cecil model correctly identified “Night Vale” and several other odd names every single time. Cecil did still have major problems with music, as music is prominently featured in the show, so there were lots of “and him” sentences; but apart from cutting out large swathes of music from the beginning and end, as I did with Spalding, this problem was unavoidable.

As with Spalding, I ran the “generalize from part to whole” and “generalize from part to the same part” tests with Cecil. I had Cecil transcribe episode 4, “PTA Meeting,” and then trained Cecil on the first six music-free minutes of that episode. When I then asked Cecil to transcribe the same six-minute clip, a few of the errors were gone, but some still remained. When I had Cecil transcribe the entire “PTA Meeting” episode after the training, the errors did not visibly decrease. As with Spalding, though many of the errors were gone, new errors had popped up in their place.

After trying this test with both Spalding and Cecil without seeing notable improvement, I concluded that DocSoft probably does not “remember” whole text/audio file pairings and look for matches. If that were the case, the Spalding and Cecil models would have performed flawlessly on the audio they had already been trained on. Instead, it seems that the software looks for general patterns extrapolated from an aggregation of the individual training sessions. So, unfortunately, my hope that users could hand-transcribe audio snippets for training and thus improve the quality of the overall DocSoft transcript seems untenable, at least on a small scale.

³ See Welcome to Night Vale Wiki in the citations.


Ultimately, though, Cecil’s performance was quite good. With my best-trained version of the Cecil model, on a brand-new audio recording, out of a sample of 480 words, DocSoft made 41 errors (9%) and got 439 words (91%) correct.

Cecil Conclusions

The main conclusion I drew from working with the Cecil model is that having a clear recording of a careful speaker is much more important than having a high-quality audio file. The Cecil model, fed the clear and deliberate voice of Cecil through low-quality streaming MP3s, performed much better than the Spalding model did with high-quality WAV files. I suspect this is partly due to differences in the speech patterns and habits of the two speakers, and partly due to the poor recording conditions of the Spalding monologues. The lack of audience and other ambient noise (excluding music and sound effects) improved DocSoft’s performance with Cecil considerably.

The Cecil model also reiterated the importance of related vocabulary. As with Spalding, without the related documentation of the Night Vale Wiki page, Cecil’s error rate for unusual words and names was almost 100%. With the inclusion of a single Wiki page, those unusual and technical words were transcribed correctly every time. Finally, the biggest problem for Cecil was background music. When cleaning up the transcripts, I had to delete large swathes of “And Him,” which decreased the efficiency of using DocSoft and also likely hindered the accuracy of the voice model.

Building a second Cecil voice model would be an interesting future experiment to test the impact of less (or more) training on transcript accuracy. While Cecil did reasonably well after training on just a couple of documents, creating a separate model with only one training document and testing an entire series of generated transcripts could be useful in establishing a “bare minimum” of training for a clear but low-quality audio file. During my work with Spalding, I created two voice models, one of which received half as much training as the other (trained on two Swimming to Cambodia film clips rather than four), and there was no significant difference between the two models’ performance—namely, they were both unimpressive. But there were many more variables to control with Spalding (the quality of various recordings and recording equipment, audience noise, the variable nature of Spalding’s voice, etc.). Cecil would thus be an interesting case for establishing a training “bare minimum” because the recordings are quite consistent, recorded on the same equipment and under the same conditions, so it may prove easier to ascertain whether errors are due to insufficient training or another factor.

Erin

Background

After having moderate success with Cecil and no success with Spalding, I decided to use DocSoft for a more conventional purpose: a contemporary voice model of a speaker who knows their recordings will be transcribed. Naturally, I decided to create a voice model of myself. A model of my own voice offered a flexibility that working with preexisting archival recordings did not: not only could DocSoft adjust to my voice, but I could also adjust my own performance based on what DocSoft identified or missed.

Process

To train my voice model, I recorded myself reading the script of a tutorial I wrote on how to use the DocSoft software⁴. I paired my recording with the unannotated script I read, triple-checked to ensure 100% transcript accuracy. After training on a single file, I tested the Erin model on a second tutorial script. Despite having been trained on only one five-minute recording of my voice, the model did very well on the new recording: the transcript contained almost no errors. Ironically, though, DocSoft got the word “DocSoft” wrong every single time. Overall, the first test of the Erin voice model contained twenty-five errors in five minutes—as opposed to the Spalding transcripts, which sometimes contained over 150 errors in five minutes.

Using my best-trained Erin voice model on a brand-new recording, out of a sample of 480 words, DocSoft made 26 errors (5%) and correctly identified 454 words (95%). The positive performance was likely because I recorded my audio on high-quality studio equipment (as with the Cecil model) and spoke clearly and deliberately. Being aware of how to talk to facilitate DocSoft transcription seems to greatly help the software’s accuracy. DocSoft therefore seems most likely to successfully parse recordings made for the specific purpose of subsequent transcription.

⁴ The experience of training DocSoft using materials about how to train DocSoft was unsettlingly meta.

Erin Conclusions

Working with the Erin model was a much more positive illustration of what DocSoft is capable of. DocSoft does very well with the types of recordings it was designed to handle, namely high-quality monologue recordings without any music or ambient noise, where the speaker knows how to speak (clearly, deliberately, and naturally) and doesn’t have a heavy accent. Under these conditions, DocSoft performed wonderfully. Archival recordings, or those not recorded with the express goal of future automated transcription (i.e., Spalding and Cecil), are more of a mixed bag.

Transcript Editing with Glifos and DocSoft:TE

Throughout the project, one perpetual hurdle was editing the transcripts that DocSoft produced. Over the course of the semester, there were many technical problems in this regard that can hopefully be avoided in future projects.

The DocSoft:AV transcription software has an accompanying transcript editor, DocSoft:TE, which, unlike DocSoft:AV, runs locally on the user’s computer rather than being accessed via the Internet. DocSoft:TE allows for nearly real-time text editing simultaneous with audio playback, as well as seamless integration with DocSoft:AV’s voice model training features. At the start of the project, UT did not have the latest version of DocSoft:TE, and the version I could access did not work properly.

As a result, for the first half of the project, I edited all DocSoft transcripts in Glifos, a rich-media content management system also used by UT. Glifos is “designed to integrate digital video, audio, text, and image documents through a process that ‘automates the production, cataloguing, digital preservation, access, and delivery of rich-media over diverse data transport platforms and presentation devices’” (Van Deusen Phillips). Among other features, Glifos allows users to synchronize a video and a corresponding transcript, so it is a useful way to make the fruits of DocSoft’s labor accessible online. However, because it was not primarily designed as a transcript editor, Glifos proved difficult to use for the extensive, heavy transcript editing that Spalding required. Even with the audio playback speed slowed, the process of stopping, starting, and rewinding the recording was quite tedious. Furthermore, the Glifos interface itself was not designed to allow for smooth, seamless editing. Rather, editing requires users to scroll through large blocks of text with time codes to keep up with the recording or find pieces that require editing (see Figure 2).

Figure 2: The Glifos editing interface with a Spalding transcript

While editing Spalding transcripts using Glifos, I also experienced an odd glitch that caused me to lose several hours of editing work. If using Glifos for editing, it is of paramount importance to save one’s progress frequently, as the program can be buggy at times. Ultimately, I found it faster to simply transcribe the Spalding recordings by hand than to edit them in Glifos, defeating the purpose of using automatic transcription. This was partly because of the high error rate inherent in the Spalding transcripts and partly due to the Glifos interface, which was not designed for the intensive editing that Spalding required.

Midway through the project, DocSoft:TE, the DocSoft transcript editor, was updated to the latest version, so I switched from Glifos to TE. Because it was specifically designed for editing DocSoft transcripts, this software proved easier to use than Glifos. One useful DocSoft:TE feature was the color-coded “confidence bar” running down the center of the screen. Using colors ranging from green (high confidence) to red (low confidence), DocSoft indicates how likely different portions of the transcript are to be correct. This allows users to see at a glance areas that may require particularly extensive editing. I found this feature useful, particularly for the very long transcripts I was working with, but I also noted that the DocSoft:TE confidence bar tended to be overconfident; that is, many areas marked bright green for “very confident” contained several errors.

Another useful feature was the ability to export from DocSoft:TE back to AV. Because both the TE and AV software are made by DocSoft, they work in tandem nicely. Once editing in TE is complete, users can export transcripts directly back into DocSoft:AV. This very convenient feature allows the voice model to be automatically trained on the edited transcript, without requiring the user to log into DocSoft:AV manually and upload the transcript for training.

Figure 3: The DocSoft:TE Editing Interface with a Cecil transcript

I edited about half of the Spalding transcripts and all of the Cecil transcripts in DocSoft:TE. Because the error rate was so low for the Erin transcripts, I simply edited them in a basic text-editing program. TE is designed to let users edit a transcript in real time, as the audio is playing, and to quickly jump forward and backward in the transcript as needed without ever pausing the audio. Overall, TE had better text grouping and formatting options than Glifos, and it was much easier to scan and edit large transcripts. The TE interface did have its problems, however. For instance, because TE makes it so easy to jump all around the transcript, it was often difficult to navigate without jumping to the wrong spot. It was also easy to unwittingly delete the wrong word or add new words in the wrong place. Adding spaces for new groups of words (called “utterances” in DocSoft lingo) also proved difficult, and I believe I found a bug in the software that made entering the time codes for new utterances quite tedious.

For relatively clean transcripts, like the results of the Cecil model, DocSoft:TE worked well enough. But for dirtier transcripts, like Spalding’s, TE was less efficient, just as Glifos had been. It took almost the same amount of time to edit the Spalding transcripts in TE as it did to hand-transcribe them with the audio slowed down. DocSoft:TE was definitely an improvement over Glifos, but was still a slog for very rough transcripts.

General Conclusions

The difference between the Spalding and Erin voice models over the course of the project was remarkable. From this project, it is clear that there are certain factors that can make particular recordings good candidates for automatic transcription, or that can help create the best possible voice model. When creating a recording for transcription, or when considering whether or not to use DocSoft for an existing recording, keep the following factors in mind.

1. Speaker and recording quality is paramount.

The best speakers are those who speak naturally, but also clearly and carefully. DocSoft seems to perform best when speakers know (1) that their recordings are going to be computer-transcribed and (2) how they can change their voices (by, for instance, speaking more slowly, enunciating, or avoiding stammering) to facilitate good performance by the software.

Recordings should be made with high-quality recording equipment, ideally in a silent environment. Background music, audience noise, and other ambient noise are difficult for DocSoft and should be avoided.

Past a certain threshold, the actual audio file quality does not seem to matter very much. In this project, I did not notice any difference in accuracy between a WAV file and an MP3 of the same recording. In fact, I had better success with streaming MP3s (in the Cecil case) than with high-quality WAVs (Spalding), simply because the speaker was clearer and the recording conditions were better. While a high-resolution audio file is ideal, it is not critical to a clean transcript.

2. The “ideal” number of training audio files is unclear.

While more training files are intuitively better, I could not establish a definitive minimum number of training materials needed to generate a good voice model. This differed for all three voice models I made: Cecil started off fairly inaccurate but improved a lot after three more training recordings; Erin was nearly perfect after just one training recording; Spalding was trained on many files several times over and never substantially improved. Users should use as many training materials as are available, but the “ideal” number for a good model varies depending on the recordings and speaker in question.

3. Related documents and word lists are great for improving accuracy.

Related documents and word lists should be used to supplement the base vocabulary whenever they are available. This feature is particularly useful when speakers use highly technical vocabulary or lots of acronyms and initialisms. Adding supplemental vocabulary dramatically and immediately improved all of my voice models.

4. Archival audio should be considered for use with DocSoft on a case-by-case basis.

From this project, I learned that archival audio can be a mixed bag when working with DocSoft. Collections of recordings need to be evaluated individually to determine whether they might work in DocSoft and how to tinker with the software or the recordings to improve performance. I could not determine a set of one-size-fits-all recommendations for all archival audio.

Instead, it seems best to consider sets of recordings on their own merits, taking into account the speaker, the quality and clarity of the recordings, and the presence of other speakers, noise, or music. Some recordings will lend themselves to easy transcription; others will not. For clear monologue recordings made under good conditions, DocSoft can be instrumental in making those recordings more accessible and searchable. For other recordings, however, DocSoft may be more trouble than it’s worth, and hand-transcription should be used. While DocSoft’s utility for archival recordings is hit or miss, for lectures and other contemporary uses, DocSoft is quite promising indeed.


Works Cited

DocSoft, Inc. DocSoft:AV Appliance User Guide. 2nd ed. DocSoft, Inc., 2009. Web. 25 Apr. 2014.

Stine, Matt, and Stephen Cooper. “Spalding Gray: A Preliminary Inventory of His Papers at the Harry Ransom Center.” Harry Ransom Center, 2011. Web. 25 Apr. 2014.

Van Deusen Phillips, Sarah. “GLIFOS-Media: Rich Media Archiving.” The Documentalist, 2009. Web. 25 Apr. 2014.

Welcome to Night Vale Wiki. “Night Vale.” 2014. Web. 25 Apr. 2014.
