Towards a Smart Wearable Tool to Enable People with SSPI to … · •A system to enable people...
Transcript of Towards a Smart Wearable Tool to Enable People with SSPI to … · •A system to enable people...
Towards a Smart Wearable Tool to Enable
People with SSPI to
Communicate by Sentence Fragments https://www.youtube.com/watch?v=qzPAY4f8Wc8
András Németh (presenter, constructor, ELTE) – on movie
Anita Verő (natural language processing, ELTE) – on movie
András Sárkány (natural language processing, ELTE) – on movie
Gyula Vörös (natural language processing, ELTE) – on movie
Brigitta Miksztay-Réthey (special need expert, ELTE)
Takumi Toyama (eye-tracking solution, DFKI)
Daniel Sonntag (head of the German group, DFKI)
András Lőrincz (project leader, ELTE)
Challenge Handicap et Technologie
2014, May 26-27, Lille, France
Motivation
• The ability to communicate with others is of paramount importance for mental
well-being.
• SSPI: Severe speech and physical impairments / communication disorders
– Cerebral palsy, stroke, ALS/motor neurone disease (MND), muscle spacticity
etc.
• People with SSPI face enormous challenges in daily life activities.
• A person who cannot speak may be forced to only communicate directly to
his closest relatives, thereby completely relying on them for interacting with
the external world.
• Understanding people using traditional alternative and augmentative
communication (AAC) methods (such as gestures and communication
boards) requires training.
• These methods restrict possible communication partners to those who are
already familiar with AAC.
– e.g., shopping, in the bakery
• Understanding people with SSPI requires
practice
– They use special communication methods
Motivation
• The ability to communicate with others is of paramount importance for mental
well-being.
• In AAC, utterances consisting of multiple symbols are often telegraphic: they
are unlike natural sentences, often missing words to allow for successful and
detailed communication. (“Water now!” instead of “I would appreciate a glass of water,
immediately.”)
– Some systems allow users to produce whole utterances or sentences
that consist of multiple words. The main task of the AAC system is to
store and retrieve such utterances.
– However, using a predefined set of sentences severely restricts the
applicability.
• Other approaches allow for a generation of utterances from an unordered,
incomplete set of words, but they use predefined rules that constrain
communication.
– e.g., shopping, in the bakery
• Understanding people with SSPI requires
practice
– They use special communication methods
4
Augmented Reality in Medicine
Video: https://dl.dropboxusercontent.com/u/48051165/ERmed_CHI2013video.mp4
System Architecture: Mobile Web and App Design
XML-RPC ERmed Proxy
Ermed Bridge
TCP
Speech-based interaction
HMD & Gaze-based interaction
Pen-based interaction
Goals of presented work
• To enable broad accessibility and
communication possibilities for people with
SSPI
• Technical Challenges:
– Overcome physical impairments (for user
input)
– Help non-trained people to understand people
with SSPI (towards generation)
7
Approach
• The most effective way for people to communicate would be spontaneous
language generation.
• Novel utterance generation: the ability to say “almost everything,” without a
strictly predefined set of possible utterances/generations/productions.
• We attempt to give people with SSPI the ability to say almost anything. For
this reason, we would like to build a general system that produces novel
utterances without predefined rules. We chose a data-driven approach in the
form of statistical language modeling.
• In some conditions (e.g., cerebral palsy), people suffer from communication
and very severe movement disorders at the same time. For them, special
peripherals are necessary. Eye tracking provides a promising alternative to
people who cannot use their hands.
8
Fourfold solution
(1) Smart glasses with gaze trackers (thereby
extending ) (gaze tracking and
interpretation/graphical symbol selection)
(2) Symbol / Utterance selection
(3) Language generation (sentence generation,
thereby using language models to propose best
hypotheses)
(4) Text-To-Speech functionality (TTS)
• Eye tracking glasses (ETG)
– Forward-looking camera
– Eye-tracking cameras
• Head mounted display (HMD)
– See-through
• Motion Processing Unit (MPU)
– Capture head gestures that indicate
recalibration need
Hardware components
10
Setup: Moverio Glass with Display
Inertial Motion Unit
WheelPhone
WheelPhone + Smart Phone + Constructor
Setup with eye-tracking
Eye Tracking Glasses by
SensoMotoric Instruments GmbH
AiRScouter Head Mounted Display
Brother Industries
MPU-9150 Motion Processing Unit
InvenSense Inc.
In the experiments
Main functions
• Symbol selection with gaze tracking
– Calibration is crucial and poses a problem
– Adapt or recalibrate?
• Utterance generation
– Based on selected symbols
– Uses natural language models
Symbol selection on HMD
• Idea: the user selects communication
symbols on the HMD, using eye tracking
• Crucial problems with calibration
– Eye tracking has to be calibrated
– Calibration may degrade over time
– Adapt or recalibrate?
• User might tolerate calibration errors (when their
are small and negligible)
• User should be able to initiate recalibration (head
gesture)
Symbol selection game on HMD
• Goal: select the numbers in correct order (1234123…)
• Each selection adds a small amount of error
• Participants indicate when error is too significant to be
tolerated
• 4 participants, 4 experiments each
• On average, errors up to 2.7° deviation from actual eye
focus are tolerable.
Real communication experiments
• The Brother Retina HMD was too small to show a full-
sized communication board (CB) (although resolution is
quite good)
• We use a projector / beamer to display the CB
– Simulate a large HMD
• Recognition Feedback is provided (to the user)
– Estimated gaze position + selection
• Speech generation
– word for the selected symbol synthesized
21
https://www.youtube.com/watch?v=C4DOmMtgqz0
Participant
• The participant of this example test is a 30 year old man
with cerebral palsy.
• He usually communicates with a headstick and an
alphabetical communication board, or with a PC-based
virtual keyboard controlled by head tracking.
• Mousense is already an improvement.
23
tea, two, sugar
(tea, kettő, cukor)
tea, lemon, sugar
(tea, citrom, cukor)
one, glass, soda
(egy, pohár, kóla)
Communication Setting
• The participant could move his head
• The head position must be accounted for (MPU)
• Fiducial markers around the board are recognized by
vision-based pattern recognition techniques
25
Usage and HCI details
• The estimated gaze position was indicated as a small red
circle on the projected board (similar to the previous test).
• A symbol was selected by keeping the red circle on it for
two seconds.
• The eye tracker sometimes needs recalibration;
– the user could initiate recalibration by raising his
head up (detected by the MPU.)
– Once the recalibration process was triggered, a
distinct sound was played, and an arrow indicated
where the participant had to look for doing RC.
Communication experiments
• Goal: communicate with a partner
• Two situations (with different boards)
– Buying food
– Discussing an appointment
Experimental results
• Verification
– To verify that communication was successful, the participant indicated misunderstandings using well-known yes-no gestures, which were quick and reliable. Moreover, a certified expert in AAC was present to indicate apparent communication problems.
– 205 symbol selections happened
– 23 of them were incorrect
• 89% accuracy
– The error rate was acceptable
• Real communication took place!
External symbols
• Idea
– Not all symbols are present in the system
– Optical character recognition can help
– (Object recognition can help)
• Example
– The user wants to buy a certain type of
sandwich
– In the store, there are labels with the names of
the sandwich types on it
• The OCR was simulated
35
Sentence fragment generation
• A word guessing game
– Good afternoon, how are
– I apologize for being late, I am very
– My favorite OS is
• A symbol guessing game
• LM needs to help
– To select words with the right sense (disambiguate homonyms: e.g., river
bank versus money bank)
– To select hypothesis where graphical symbols (words) are ‘in the
right place’
– To increase cohesion between words (agreement)
you?
sorry!
[Linux, Mac OS, Windows XP, Win CE].
Sentence fragment generation
• Understanding symbol communication requires practice
• Symbol communication is non-syntactical – Function words are rarely used
– Order of symbols may vary
• e.g., {lemon, sugar, tea} -> tea with lemon and sugar
• Idea – Generate fragments by inserting function words
– Rank them based on language models
– User should select from top 4 variants
37
Language Models
• Estimate the probabilities of n-grams (sequences of
words)
• P(tea with sugar) > P(tea the sugar)
– Use a corpus (collection of texts)
• Sparsity problem
– Long n-grams tend to be rare
– Smoothing and backoff methods are used to deal with
this problem
Stupid backoff
• Let denote a string of L tokens of a fixed vocabulary
• approximation reflects the Markov assumption that only the most recent n-1 tokens are relevant when predicting the next word.
• For any substring wij of w
1L let denote the frequency of occurrence of that
substring in the longer training data. The maximum likelihood probability estimates for the n-grams are given by their relative frequencies
•
• Problematic because (de-)nominator can be zero.
• Appropriate conditions are needed on the r.h.s.
• Noisy: estimates need smoothing
Language corpora in use
• Google Books n-gram corpus
– Collection of digitized books from Google
– Very large, freely available
– Represents written language
• OpenSubtitles corpus
– Collection of film and TV subtitles
– Moderate size, freely available
– Represents spoken language
Language modeling tools used
• Google Books n-gram corpus – Software: BerkeleyLM
• Pauls, A., Klein, D.: Faster and Smaller N-Gram Language Models. In: 49th Annual Meeting of the ACL: Human Language Technologies, Vol. 1, pp. 258--267. ACL, Stroudsburg, USA (2011)
– Method: Stupid Backoff • Brants, T., Popat, A. C., Xu, P., Och, F. J., Dean, J.: Large language models in machine translation. In:
EMNLP 2007, pp. 858--867.
• OpenSubtitles corpus – Software: KenLM
• Heafield, K.: KenLM: Faster and Smaller Language Model Queries. In: EMNLP 2011 Sixth Workshop on Statistical Machine Translation, pp. 187--197. ACL, Stroudsburg, USA (2011)
– Method: Modified Kneser-Ney smoothing • Heafield, K., Pouzyrevsky, I., Clark, J. H., Koehn, P.: Scalable Modified Kneser-Ney Language
Model Estimation. In: 51st Annual Meeting of the Association for Computational Linguistics, pp. 690--696. Curran Associates, Inc., New York, USA (2013)
Prefix tree building algorithm (1)
• Representation – Work with a prefix tree, where each path represents an n-gram
• Input – Set of named entities (e.g., tea, sugars)
– Set of function words (e.g., you, with, for)
– Desired length of sentence fragment (e.g., 3 words)
• Parameters – Score threshold (minimal score, e.g., 10-30)
– Leaf limit (maximal number of open leafs, e.g., 200)
• Output – Tree (or a list) of potential sentence fragments
• Contain all the important words
• Ordered by estimated probability (score)
Prefix tree building algorithm (2)
• Essentially a breadth-first search, with some constraints
• Start with an empty tree, root node is open
• While there is an open node:
– Extend all open leafs with all available words, if the resulting n-gram’s “score” is above a certain threshold
– Discard every open leaf node that cannot be extended (falls below threshold)
– Close every open leaf except the highest scoring ones (leaf limit)
Conclusion on NLP
• A system to enable people with SSPI to
communicate with natural language
sentences
• Demonstrated the feasibility of the
approach
Our contribution
• An interaction system to reduce communication barriers for people with
severe speech and physical impairments (SSPI) such as cerebral palsy.
• The system consists of two main components: (i) the head-mounted human-
computer interaction (HCI) part consisting of smart glasses with gaze trackers
and text-to-speech functionality (which implement a communication board
and the selection tool), and (ii) a natural language processing pipeline in the
backend in order to generate complete sentences from the symbols on the
board.
• We developed the components to provide a smooth interaction between the
user and the system thereby including gaze tracking, symbol selection,
symbol recognition, and sentence generation.
• Our results suggest that such systems can dramatically increase
communication efficiency of people with SSPI.
53
Eye Tracking Requirements
• Eye gaze is a compelling interaction
modality but requires user calibration
before interaction can commence.
• State of the art procedures require the user
to fixate on a succession of calibration
markers, a task that is often experienced
as difficult and tedious.
54
Possible improvements and outlook • 3D gaze tracker that estimates the part of 3D space observed by the user
• OCR, signs and object recognition methods to
convert “symbols” to the communication board
• new HMDs and eye tracker setups (ergonomic)
• Advanced NLP to transform series of symbols to whole domain sentences
within the context
• integrated calibration: e.g., https://www.andreas-
bulling.de/fileadmin/docs/pfeuffer13_uist.pdf
• personalisation according to patient record
• CPS integration
55
http://getnarrative.com/ The Narrative Clip is a tiny, automatic camera and app
that gives you a searchable and shareable photographic
memory.