Virtual Podium with HTC Vive
CS294W Final Paper, Team Soapbox
Jesse Min (jesikmin, 05786379), JeongWoo Ha (jwha, 05833965), Min Kim (tomas76, 05860540)
Abstract
Public speaking is difficult for many people because speakers can be intimidated by addressing large groups and by speaking in unfamiliar environments. Speakers would benefit from regular feedback on their speaking, but constant feedback is not feasible for most people. Our project addresses this problem with a Virtual Reality application that offers diverse virtual presentation environments augmented with features such as custom PowerPoint slides and speaker notes. To give meaningful speech feedback, the Hound SDK and Fitbit API are incorporated to automatically report verbal tics and heart rate.
I. Introduction
To make speech practice more effective, we have created a virtual presentation application with which a speaker can practice: Virtual Podium with HTC Vive. In this VR environment, speakers can experience in advance the circumstances in which they are going to give a speech, becoming accustomed to the setting and familiar with a large audience. In addition, our VR app provides personalized speech feedback, such as reports on heart rate and verbal tics, using Natural Language Processing and third-party APIs. With the Hound SDK, we convert voice into text and search for verbal tics such as "like" and "you know." With the Fitbit API, we retrieve the user's heart rate during mock presentations.
This paper is organized as follows. First, relevant related work is reviewed. Second, we provide a high-level and then a detailed description of our application. Third, the process and results of our user study are discussed. Fourth, the conclusion and possible future work are presented. Lastly, we discuss what we learned throughout the project.
II. Related Work
There are a number of web, mobile, and standalone speech practice applications. Some mobile applications, such as "Articulation Station," compare a user's pronunciation to standard pronunciation and analyze which parts the user should correct. Most relevant apps target people with special needs, particularly patients who need speech therapy for stroke, aphasia, or autism. In addition, most existing speech practice or analysis applications give users intensive feedback, such as how long they paused during their speech and the overall pace of the speech.
In the field of virtual reality, there have been applications that give users extensive and realistic vicarious experiences; for example, there are VR applications that make people feel as if they are in a Star Wars movie or standing in the middle of the Sahara.
However, there has been no substantial effort to combine these two fields to build an app for public speakers who would like to practice, improve their ability, and receive feedback. We were motivated by the idea that by using open APIs such as the Hound SDK and Fitbit API, we could give people speech analysis and rich feedback, as existing commercial speech practice apps do. We can also render a virtual conference room or 360-video-recorded environments in virtual reality (more precisely, on the HTC Vive HMD). As a result, we decided to integrate these two state-of-the-art technologies to offer users a new paradigm of speech practice in VR.
III. High-level Project Description
Each run of Virtual Podium consists of three major stages: the entry stage, the practice stage, and the analysis stage. Figure 1 below describes the three stages in more detail.
Figure 1. Program execution stages
In the entry stage, after a user runs the program, he first needs to sign in to the app so that the program can retrieve his personalized presets and basic personal data from the web server. Then, the user can select between two practice modes: a virtual conference room mode and a real room mode rendered from 360 video. We currently have rough mockups for these and still need to complete the work in the future.
In the practice stage, if the user chooses the 360-video mode, he can pick among several options, such as a small classroom setting or a large auditorium setting. The user can then begin practicing his or her speech. On the HTC Vive head-mounted display (HMD), the user can view simple statistics, including elapsed time and heart rate, in real time. In addition, the user can provide a typed script and PowerPoint slides to the program beforehand, view the script and slides in the HMD while practicing, and move on to the next sentence or slide by pulling a trigger on one of the two HTC controllers. The user can also dismiss the text script by holding his head up and make the text reappear by bowing his head slightly.
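The head-bow toggle described above can be sketched as a small state update driven by head pitch. This is an illustrative sketch, not our exact Vizard code: the pitch value would come from the HMD pose each frame, the sign convention (negative pitch means looking up) is an assumption, and two thresholds (hysteresis) are used so the text does not flicker near a single cutoff angle.

```python
PITCH_HIDE_DEG = -10.0   # head tilted up past this angle hides the script
PITCH_SHOW_DEG = 5.0     # head bowed past this angle shows it again

def update_script_visibility(pitch_deg, currently_visible):
    """Return the new visibility state for the on-screen script.

    pitch_deg is assumed negative when the user looks up. Two separate
    thresholds keep the text from flickering when the head hovers near
    a single cutoff angle.
    """
    if currently_visible and pitch_deg < PITCH_HIDE_DEG:
        return False          # user held their head up: remove the text
    if not currently_visible and pitch_deg > PITCH_SHOW_DEG:
        return True           # user bowed slightly: bring the text back
    return currently_visible
```

In the real application this update would run once per frame, with the result driving the visibility of the rendered script node.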
During the analysis stage, as soon as the user finishes practicing, the program uses the Fitbit API to analyze the overall heart rate. It also employs the Hound SDK to report how many times, and how often, the user unconsciously exhibited verbal tics such as "like" and "you know" during the practice. After fetching the HTTP responses from those APIs, Virtual Podium displays a neatly organized report on a web dashboard personalized for each user.
IV. Detailed Project Description
Vizard was used to build the main virtual reality application. The advantage of Vizard is that many Python libraries and scripts can be incorporated into our application. To model the 3D conference room, we used predefined models in SketchUp and modified them to our needs. The main feature of the program is practicing a speech in the modeled virtual environment with 3D-rendered PowerPoint slides and a script, controlled with the two HTC Vive controllers. Vizard automatically parses the user-provided slides and script and renders them in the virtual environment. When the user clicks the right controller, the presentation moves to the next slide; similarly, the next line of the speaker notes is rendered when the left controller is clicked. In addition, we rendered a 3D timer in the virtual environment to show the user how much time has elapsed. The screenshots below illustrate what the rendered slides and speaker notes look like in the virtual environment.
Figure 2. Screenshots of custom speaker notes (top) and PowerPoint slides (bottom) in the virtual modeled environment.
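The controller-driven slide and script advance described above can be sketched as follows. This is a simplified illustration, not the actual Vizard implementation: slides and script lines are assumed to be pre-parsed into Python lists, and `on_right_click`/`on_left_click` stand in for the real HTC Vive controller event handlers.

```python
class PresentationState:
    """Tracks the current slide and speaker-note line during practice."""

    def __init__(self, slides, script_lines):
        self.slides = slides
        self.script_lines = script_lines
        self.slide_idx = 0
        self.line_idx = 0

    def on_right_click(self):
        """Right controller: advance to the next slide (clamped at the end)."""
        if self.slide_idx < len(self.slides) - 1:
            self.slide_idx += 1
        return self.slides[self.slide_idx]

    def on_left_click(self):
        """Left controller: advance to the next speaker-note line."""
        if self.line_idx < len(self.script_lines) - 1:
            self.line_idx += 1
        return self.script_lines[self.line_idx]
```

The returned slide or line would then be handed to the renderer to update the 3D text and slide textures in the HMD.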
To provide the user with useful feedback, we incorporated the Fitbit API and the Hound API. The Fitbit API provides the user with heart rates recorded during the presentation, and the Hound API is used to run NLP algorithms. On the desktop that runs the Vizard application, a set of GUI elements enables calls to the Hound and Fitbit APIs. When the user clicks the Hound button, the Hound API is activated to transcribe the user's speech into text. If the Fitbit button is clicked, the heart rate from the presentation start time to the presentation end time is retrieved. To avoid interference with the main virtual reality program, the Fitbit and Hound API calls are made on separate threads; we discovered that making these calls on a single thread freezes the program. Appropriate tokens and authorizations are supplied through REST API calls to access personal data from the server. The program exits when the analysis is done, and the analysis result is stored in a MySQL database for use by the dashboard web application.
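The multithreaded analysis calls can be sketched as below. The worker bodies are hypothetical placeholders for the real authorized Hound and Fitbit requests (OAuth token handling and the HTTP requests are omitted); the point is that the network calls run off the main render thread so the Vizard loop never blocks.

```python
import threading

# Results shared between worker threads; each worker writes a distinct key,
# so no lock is needed in this simple sketch.
results = {}

def fetch_fitbit_heart_rate():
    # Hypothetical stand-in for the authorized Fitbit REST call.
    results["heart_rate"] = [72, 80, 85]

def fetch_hound_transcript():
    # Hypothetical stand-in for the Hound speech-to-text request.
    results["transcript"] = "so like I think you know this works"

threads = [
    threading.Thread(target=fetch_fitbit_heart_rate),
    threading.Thread(target=fetch_hound_transcript),
]
for t in threads:
    t.start()
for t in threads:
    # In the real app the render loop keeps running while the workers fetch;
    # here we simply wait for both to finish before reading the results.
    t.join()
```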
The dashboard web application is built with Ruby on Rails and hosted on Heroku with a Postgres database. Users can sign up and sign in to access their personalized analysis data, and D3.js is used to display the Fitbit and Hound analysis data on the web dashboard.
We offer the user several modes with different room sizes. We filmed 360 videos of a large classroom and of small classrooms, both filled with students, for use in our virtual reality application (a CS106A review session, a CS161 class, and a CS294S class were filmed with the instructors' consent). The videos were converted to sphere format and rendered in the HTC virtual environment. A user can choose which room he or she wants to present in and start presenting in the selected environment. The limitation of this approach is that users cannot walk around in the 360-video environment, whereas the 3D-modeled virtual environment allows them to move through the virtual space.
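The sphere ("equirectangular") mapping used for 360 video can be illustrated with the standard formula that sends each normalized video coordinate to a direction on the unit sphere. Axis conventions vary by engine, so this is one common choice rather than necessarily what Vizard does internally.

```python
import math

def equirect_to_direction(u, v):
    """Map normalized equirectangular video coordinates to a unit direction.

    u in [0, 1] spans longitude (0 to 2*pi); v in [0, 1] spans latitude
    (pi/2 at the top of the frame down to -pi/2 at the bottom).
    """
    lon = u * 2.0 * math.pi
    lat = (0.5 - v) * math.pi
    x = math.cos(lat) * math.cos(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.sin(lon)
    return (x, y, z)
```

Sampling the video texture along these directions wraps the flat frame around the viewer, which is why low-resolution footage looks noticeably soft: each pixel is stretched over a much larger visual angle than in a flat display.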
Figure 3. Application multithreading execution flow
V. User Research
Our user research consisted of two parts: first, each participant was interviewed about his or her overall impression of the application and completed a survey (Appendix B); second, we analyzed every user's recorded heart rate and voice. We conducted the user study with 8 potential users: five recruits and the three of us. Each of the five recruited subjects was a Stanford undergraduate, drawn from different majors and class years. Participants were briefed on instructions such as how to calibrate the HMD's focus and how to use the HTC controllers. Each of them practiced his or her own speech in three different modes: the virtual conference room, a 360-degree small lecture room, and a 360-degree large lecture hall. We then asked the users to complete the survey form, interviewed them for general comments on the app, and analyzed the collected heart rate and verbal tic data.
Figure 4 displays each user's heart rate change across the virtual environment modes. Six of the eight participants (75%) showed increasing heart rates as the environment shifted from the virtual conference room to the 360-video small lecture room, and from the small lecture room to the 360-video large lecture hall. Except for subjects 4 and 7, each participant's heartbeat became faster, in other words, the participant felt more anxious, as the room changed from a virtual room to a more realistic setting and from a small room to a spacious lecture hall. This heart rate data, collected by Fitbit in real time, was consistent with the users' explicit remarks during interviews and surveys: the virtual room was interesting yet unrealistic, while the 360-video settings, particularly the large lecture hall, were surprisingly realistic, compelling, and effective, not only because of the size of the room but also because of the larger audience.
Figure 4. Average heart rates of subjects according to virtual environments
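The per-environment averaging behind this comparison can be sketched as follows. The sample heart-rate readings are invented for illustration; real values would come from the Fitbit heart-rate data collected during each practice session.

```python
def average_by_mode(samples):
    """samples: list of (mode, bpm) pairs -> dict mapping mode to mean bpm."""
    totals = {}
    for mode, bpm in samples:
        acc = totals.setdefault(mode, [0.0, 0])
        acc[0] += bpm   # running sum of readings for this mode
        acc[1] += 1     # number of readings for this mode
    return {mode: total / count for mode, (total, count) in totals.items()}

# Made-up readings for one subject across the three practice modes.
readings = [
    ("virtual_room", 74), ("virtual_room", 78),
    ("360_small", 82), ("360_small", 86),
    ("360_large", 90), ("360_large", 96),
]
averages = average_by_mode(readings)
```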
Besides the heartbeat, we converted testers' voice data into text in real time using the Hound SDK. The transcribed text was then linearly searched for certain verbal tics, including "like," "well," and "you know." Example statistics for subject 6 are shown below:
Figure 5. Verbal tics for subject 6
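The linear filler-word search over the Hound transcript can be sketched as below. The transcript string is a made-up example, and this naive whole-word matching cannot distinguish filler uses of "like" from ordinary verb uses.

```python
import re

FILLERS = ["you know", "like", "well", "um", "ah"]

def count_fillers(transcript):
    """Count occurrences of each filler phrase (case-insensitive, whole words).

    Multi-word tics such as "you know" are matched as phrases. Note that
    this simple search also counts non-filler uses (e.g. "I like this").
    """
    text = transcript.lower()
    counts = {}
    for phrase in FILLERS:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        counts[phrase] = len(re.findall(pattern, text))
    return counts

# Made-up transcript for illustration.
example = "Well, I like this, you know, and, um, it went well, like, really well."
tic_counts = count_fillers(example)
```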
Users rated this verbal-tic-specific feedback as one of the most useful features of the application, as most of them were not aware of occasional habits such as saying "like" while speaking. However, there was a limitation as well. Because tester recruiting was done during Week 9, one of the busiest weeks for Stanford students, we could not ask testers to prepare a 5- or 10-minute speech. Since users could give speeches of only 2 to 3 minutes at most, capturing verbal tics in such short speeches did not always yield meaningful results. Nonetheless, according to the survey, users were satisfied with the feature.
In addition, many users offered suggestions for future work in their survey responses. Some suggested employing more realistic audio alongside the current realistic visuals, for instance, letting users hear their own voice with appropriate echo and ambient sound through earphones. Others asked for higher-resolution 360 video. From our experience, 1080p video, usually considered fairly high resolution, was not sufficient for a 360 setting where sphere mapping onto the HMD is required. Our Samsung Gear 360 was acceptable for a prototype but not for a complete application.
VI. Conclusion
VR technologies have become more widely accessible to the general public, and there are many more ways they could be used beyond gaming and entertainment. There are already implementations of VR technology in the medical field, for example, to treat specific phobias. Public speaking is an important and helpful skill, yet many people have difficulty acquiring it; VR programs could help them reduce public speaking anxiety and practice before an actual speech. Leveraging various existing technologies, such as the Hound SDK and Fitbit API, we have created a virtual environment in which users can practice public speaking anywhere, at any time. Our user research showed that people are indeed interested in such a technology and feel that virtual environments can partially replace actual presentation environments for practice.
This project is just a first step toward what could become a very powerful application. What we make of it depends on how we create new demand and meet future needs. Virtual reality can change human behavior, and this powerful method could and should be used wisely to help users change for the better. As we incorporate other high-end technologies, such as haptic sensors, it will become possible to create even more realistic virtual speech-practicing environments to relieve public speaking anxiety.
VII. Future Work
In this section, we will discuss the limitations of our current technology and four possible future
improvements that can make our application more feasible and effective.
Firstly, more realistic rendering of the real-world environments where users would actually give speeches would be crucial. Although we have created virtual 3D environments using SketchUp and 360 videos to make them look real, our app still has limits. One crucial addition would be to let the user interact with the virtual audience in the HTC Vive. What makes speakers in real life truly nervous is an intimidating audience. If we could program the virtual audience to react to the user's speech, volume, gestures, and so on, users would get a much better sense of what it is like to give a pitch in front of an anticipating audience.
Secondly, another powerful addition would be more accurate integration of various feedback features. Our current integration of Hound and Fitbit is just the start of what we could provide through virtual speech practice. We could implement gesture tracking through Leap Motion to observe the user's hand gestures and walking patterns on stage; we could implement eye-gaze tracking to analyze how much time the user spends engaging with each audience member in the room; and we could record the user practicing in VR so that the user can watch himself afterward.
Thirdly, enabling multiple users to join the same virtual environment to practice public speaking would be another key addition. Multiple users in the same virtual environment could help people who are not physically together practice ahead of time in the anticipated speech setting.
Lastly, we could build a platform where people share the virtual speech environments they have created or filmed. The more options users have, the better they can experience diverse environments with dynamic audiences.
VIII. Discussion
The VR industry has made great progress in recent years, and many technologies can now be combined with VR in many ways. Powerful software tools such as Unity, Unreal Engine, and Vizard exist for creating high-quality virtual 3D environments. Furthermore, 360-video technology is now widely used in various fields. Integrating real-world videos of various conference rooms, lecture halls, and auditoriums helped our users become familiar with the actual environment much more effectively than the 3D-modeled room did. Accessing these real environments through virtual reality helped many testers feel comfortable when they later gave speeches in the real environments.
We also integrated features that give users fruitful feedback after practice. Using the Hound SDK, we detected how many times users used filler words such as "like," "um," and "ah." Showing users at which points in their speech they used filler words helped them locate where they felt more anxious or uncomfortable about the content. We also used Fitbit: analyzing the user's heart rate alongside the speech helped them identify which parts of the speech to focus on. In addition, we could potentially use the heart rate data to analyze how to help users prepare for specific speech environments.
Another key aspect of the project was user testing and user research. We wanted comprehensive feedback from people with different academic backgrounds, and we wanted to analyze the psychological effects of the application. There was much useful feedback, such as the dual controllers being too burdensome, since speakers would not be holding controllers in a real speech setting. Nevertheless, many subjects really liked being able to choose between the pre-filmed real-world 360-video mode and the virtual 3D model mode. Some said that repeated practice in the virtual environment would actually help them reduce anxiety about public speaking. However, more accurate heart rate and other scientific measurements are needed to prove that this application actually reduces public speaking anxiety.
IX. Team Member Contributions
All of our members contributed equally to the project over the entire quarter. We worked together to tackle major issues involving the Vizard IDE, sphere mapping of 360 video, integration of the Hound SDK and Fitbit API with the HTC Vive application, and so on. Still, each member took primary responsibility for particular portions of the project. Jesse Min spent most of his time sphere-mapping 360 video onto the HMD and implementing the Fitbit API and Hound SDK for heart rate measurement and speech analysis. JeongWoo Ha set up the SketchUp environment and rendered it smoothly on the HTC HMD through Vizard; he put a great deal of effort into fine-tuning visual details in the VR environment, such as the coordination and display of the text script and slides. Min Kim committed most of his time to implementing the motion-sensing part of the VR application and sending triggers to the program with the HTC controllers. He also worked a great deal on streamlining the VR application with the analysis stage (Fitbit API / Hound SDK), including the web dashboard.
X. Bibliography
1. Hound SDK Documentation: https://www.houndify.com/docs
2. Houndify Python Github Example: https://github.com/Mause/houndipy
3. Fitbit API Documentation: https://dev.fitbit.com/docs/
4. Fitbit API Blog Tutorial: https://roboticape.wordpress.com/2014/01/13/first-steps-into-the-quantified-self-getting-to-know-the-fitbit-api/
5. Vizard 5.0 Documentation: http://docs.worldviz.com/vizard/
6. Python Multi-threading: https://www.toptal.com/python/beginners-guide-to-concurrency-and-parallelism-in-python
7. Samsung GEAR 360 Camera How-to: http://www.samsung.com/global/galaxy/gear-360/how-to/get-start

XI. Appendix
A. JSON Response of Hound SDK and Fitbit API
B. User Research Survey
User Test Questionnaire
Name: ____ Gender: ____ Age: ____
● Personal Speech Experience
Q1. Do you usually feel nervous giving a speech in front of the public? Not At All 1 2 3 4 5 6 Definitely
Q2. What part of the public speaking are you most uncomfortable with?
Q3. Have you completed the Stanford PWR2 course? No / Yes
Q3-1. If yes, how did you practice your final presentation?
Q3-2. If yes, do you think the app might have helped your PWR2 presentations? Not At All 1 2 3 4 5 6 Definitely
Q3-3. If no, do you think this app will help your future PWR2 presentations? Not At All 1 2 3 4 5 6 Definitely
_____________________________________________________________________________
● Virtual Reality Experience
Q4. How real was the virtual conference room? Not At All 1 2 3 4 5 6 Definitely
Q5. How real was the small 360-video rendered room? Not At All 1 2 3 4 5 6 Definitely
Q6. How real was the large 360-video rendered room? Not At All 1 2 3 4 5 6 Definitely
Q7. Did the large 360-video rendered room make you more nervous than the small room? Not At All 1 2 3 4 5 6 Definitely
Q8. Would you use the 360-video rendered room or the 3D virtual conference room for practicing? Not At All 1 2 3 4 5 6 Definitely
_____________________________________________________________________________
● Overall Experience
Q9. Will you use this app if it is officially released in the future? Not At All 1 2 3 4 5 6 Definitely
Q10. What were some advantages of this app?
Q11. What were some drawbacks of this app?
Q12. Please briefly summarize your overall impression after using this VR speech practicing application. (Any suggestions are also welcome.)
Q13. What additional features would you like to see in this application?