

Experimental Test report

Document information

Project Title 6th Sense

Project Number E.02.25

Project Manager Fraunhofer Austria

Deliverable Name Verification Report “Experimental Test Report”

Deliverable ID Del 4.2

Edition 00.01.01

Template Version 03.00.00

Task contributors

Fraunhofer Austria, FREQUENTIS AG, Fraunhofer FKIE, subcontracted by Fraunhofer Austria

Abstract

The project Sixth Sense postulates that the user’s “body language” differs between “good” and “bad” decisions. The project follows the idea of using the whole body language of a user for communicating with a machine; in our case it is an Air Traffic Controller (ATCO) working with an Air Traffic Tower CWP. Specifically, we intend to analyse the correlation between changes in the behaviour of an ATCO, expressed through his or her body language, and the quality of the decisions he or she is making. For that, an experiment was set up, and data about user behaviour was collected, explored and analysed. This document is the test report of the proof of concept for the Sixth Sense prototype and its core components. Results of our work may be used as an early warning of “bad” situations about to occur or as decision aids for the ATCO.


©SESAR JOINT UNDERTAKING, 2011. Created by Fraunhofer Austria, FREQUENTIS AG for the SESAR Joint Undertaking within the frame of the SESAR Programme co-financed by the EU and EUROCONTROL. Reprint with approval of publisher and the source properly acknowledged.

Authoring & Approval

Prepared By - Authors of the document.

Name & Company Position & Title Date

Volker Settgast / Fraunhofer Austria Project Contributor 22.06.2015

Nelson Silva / Fraunhofer Austria Project Contributor 01.07.2015

Carsten Winkelholz, Jessica Schwarz / Fraunhofer FKIE Subcontracted Project Contributor 07.07.2015

Michael Poiger / Frequentis AG Project Contributor 08.07.2015

Florian Grill / Frequentis AG Project Contributor 08.07.2015

Reviewed By - Reviewers internal to the project.

Name & Company Position & Title Date

Theodor Zeh / Frequentis Technical Coordinator 15.07.2015

Eva Eggeling / Fraunhofer Austria Project Manager 15.07.2015

Approved for submission to the SJU By - Representatives of the company involved in the project.

Name & Company Position & Title Date

Theodor Zeh / Frequentis Technical Coordinator 30.07.2015

Eva Eggeling / Fraunhofer Austria Project Manager 30.07.2015

Rationale for rejection

None.

Document History

Edition Date Status Author Justification

00.00.01 20/06/2015 Draft Eva Eggeling New Document

00.00.03 07/07/2015 Update Volker Settgast merged version

00.00.04 09/07/2015 Update all merged version

00.00.08 15/07/2015 Update all merged version

00.01.00 30/07/2015 Submission Version Eva Eggeling Merged Version

00.01.01 15/09/2015 Review Version Eva Eggeling/all Resubmission

Intellectual Property Rights (foreground)

This deliverable consists of SJU foreground.


Table of Contents

TABLE OF CONTENTS .......... 3

LIST OF TABLES ................................................................................................................................................ 4 

LIST OF FIGURES .............................................................................................................................................. 4 

EXECUTIVE SUMMARY .................................................................................................................................... 6 

1.1 PURPOSE OF THE DOCUMENT .......... 7
1.2 INTENDED READERSHIP .......... 7
1.3 ACRONYMS AND TERMINOLOGY .......... 7

2  THE EXPERIMENT ..................................................................................................................................... 9 

2.1 EXPERIMENTAL SETUP .......... 9
2.2 OPERATIONAL SCENARIO .......... 10
2.3 ROLES AND RESPONSIBILITIES .......... 11
2.4 TECHNICAL SETUP OF THE EXPERIMENT .......... 12
2.5 AMQ BROKER .......... 13
2.6 THE HUMAN MACHINE INTERFACE (HMI) .......... 13

3  PERFORMING THE EXERCISES .......................................................................................................... 16 

3.1 PROFILE OF PARTICIPANTS .......... 16
3.2 DATA ANALYSIS, EXPLORATION AND VISUALIZATION .......... 17

3.2.1 Heart Rate vs Observations List .......... 18
3.2.2 Eye-Tracker and Mouse Analysis .......... 19
3.2.3 Simple Metrics and Data Exploration .......... 20

4  RESULTS ................................................................................................................................................... 23 

4.1 WORKLOAD ESTIMATES BASED ON QUESTIONNAIRES .......... 23
4.2 HINTS FOR H1 - EXPLORING THE SENSOR DATA .......... 28

4.2.1 Sixth Sense Prototype Framework for Data Exploration .......... 31
4.2.2 Categorization of Metrics regarding mental Aspects .......... 34
4.2.3 Research Questions .......... 38
4.2.4 Data Exploration and Analysis .......... 40

4.3 HINTS FOR H2 - ANALYSIS OF THE ARRIVAL AND DEPARTURE WORKFLOWS .......... 54
4.3.1 Implementation .......... 54
4.3.2 Results of the Analysis of ATC Workflow Steps .......... 55
4.3.3 Machine Learning Experiments .......... 57

4.4 EVENT TRACE ANALYSIS .......... 61
4.4.1 Variable Length Markov Models (VLMM) .......... 61
4.4.2 Scatterplot Matrix for Measures .......... 64
4.4.3 Visualization of Sequential Patterns .......... 64
4.4.4 Insights regarding interaction sequences .......... 66
4.4.5 States corresponding to outliers and around .......... 75

4.5 CONCLUSION .......... 77
4.6 FUTURE WORK .......... 79

REFERENCES ................................................................................................................................................... 80 

APPENDIX A TECHNICAL VERIFICATION DETAILS OF EXERCISE 1 AND 2 .......... 82
A.1.1 Kinect .......... 87
A.1.2 Speech Recognition .......... 92
APPENDIX B QUESTIONNAIRES .......... 94


List of tables

Table 1 - Description of the workflow steps .......... 9
Table 2 - Data collection and Quality Assessment for Different Data Sets and Sensors .......... 17
Table 3 - Resume of initial metrics to be visualized and explored .......... 21
Table 4 - Outliers in negative/positive answers .......... 28
Table 5 - Resume of most important metrics .......... 31
Table 6 - Classification of most important metrics into categories .......... 35
Table 7 - List of Main Research Questions .......... 39
Table 8 - Resume of AOI that received most interest time from each user .......... 45
Table 9 - Resume of parameters for the Kinect Head Pose .......... 48
Table 10 - Filter/Query to detect airplanes that are in the workflow step TAXI .......... 55
Table 11 - Most frequent state sequences for the eye data (top 5 for each user) .......... 67
Table 12 - Most frequent states of each user for the eye fixation sequences .......... 68
Table 13 - Most frequent state sequences for the mouse data (top 5 for each user) .......... 69
Table 14 - Most complex state sequences for the eye tracking data (top 5 for each user) .......... 70
Table 15 - Illustration of the most complex state sequences for the eye tracking data .......... 72
Table 16 - Most complex state sequences for the mouse data (top 5 for each user) .......... 73
Table 17 - Illustration of the most complex state sequences for the mouse data .......... 74
Table 18 - Examples of state sequences corresponding to outliers in the scatterplots .......... 76
Table 19 - Technical specifications of Kinect .......... 88
Table 20 - Kinect Results .......... 91

List of figures

Figure 1 - Update of the exercise plan as described in the experimental plan .......... 9
Figure 2 - Experimental Workflow .......... 10
Figure 3 - Hamburg Airport .......... 11
Figure 4 - Arrival workflow - responsibilities .......... 12
Figure 5 - Departure workflow - responsibilities .......... 12
Figure 6 - Setup working position .......... 13
Figure 7 - Components of the HMI screen .......... 14
Figure 8 - Departure Strips .......... 14
Figure 9 - Arrival Strips .......... 15
Figure 10 - Strip Bay Configuration / Button Bar .......... 15
Figure 11 - RMSSD Formula .......... 18
Figure 12 - Z-Score IBI vs negative observations through the total experiment time for user8 .......... 19
Figure 13 - Areas of interest of the ATC Simulator as defined in Ogama .......... 20
Figure 14 - Ranking of metrics and visualizations .......... 21
Figure 15 - Observation List .......... 22
Figure 16 - Mental Demand Results for all 8 users, 2 experiments .......... 24
Figure 17 - Physical Demand Results for all 8 users, 2 experiments .......... 24
Figure 18 - Temporal Demand Results for all 8 users, 2 experiments .......... 24
Figure 19 - Level of Effort Results for all 8 users, 2 experiments .......... 24
Figure 20 - Level of Frustration Results for all 8 users, 2 experiments .......... 25
Figure 21 - NASA-TLX Negative Results (not considering “Level of Performance” answers) .......... 25
Figure 22 - Level of Performance (for all users, 2 experiments) .......... 25
Figure 23 - NASA-TLX Correlation Matrix (taking all answers from all users) .......... 25
Figure 24 - SAGAT Based Questionnaire .......... 26
Figure 25 - SAGAT Correlated Answers .......... 26
Figure 26 - SASHA based Questionnaire .......... 27
Figure 27 - SASHA Questionnaires, correlated Plot .......... 27
Figure 28 - NASA-TLX and SASHA Correlation Matrix .......... 27
Figure 29 - Negative vs Positive Answers (based on all questionnaires) .......... 27
Figure 30 - Overview of the Sixth Sense Desktop Application Prototype .......... 31
Figure 31 - Screenshot of the Sixth Sense desktop application UI Action Pace Calculator .......... 33
Figure 32 - Sixth Sense desktop application UI Actions Types Monitor .......... 33


Figure 33 - Sixth Sense web based reports for supervisors data exploration, also printable .......... 34
Figure 34 - Distinction between task load and workload (Hilburn & Jorna, 2001) .......... 35
Figure 35 - Relationship between workload and performance (Veltman & Jansen, 2003) .......... 36
Figure 36 - Events from observation list with high impact on the performance of the users .......... 37
Figure 37 - Events from observation list with more impact on the performance of each user .......... 38
Figure 38 - Departures and arrivals (green area) vs number of negative observations (red) .......... 40
Figure 39 - Interdependence between arriving airplanes, departures and stress levels .......... 41
Figure 40 - Correlation between negative observations and HRV. HRV is a good indicator for periods of negative observations .......... 42
Figure 41 - Relation between mouse AOI frequencies, observation list and HRV .......... 43
Figure 42 - Mouse AOI of user7 that received most interest time during the experiment .......... 44
Figure 43 - Eye AOI of user7 that received most interest time during the experiment .......... 44
Figure 44 - Windowed standard deviation (2 minutes) and number of errors, capturing very well periods with increased user errors .......... 46
Figure 45 - Relation between eye and mouse movements (AOI visits) and occurrence of errors .......... 47
Figure 46 - Kinect Head Pose Measurements Schema .......... 48
Figure 47 - Kinect Data Representation, visualizing Detected Head Pose vs Count of Negative/Positive Observations vs Type of Observation vs User in Range (or not in range) .......... 49
Figure 48 - Kinect Data after applying filters to include only the majority of negative observations (96%) .......... 50
Figure 49 - Correlation between total number of mouse clicks and negative observations .......... 51
Figure 50 - Correlation of negative observations and difference in number of words/mouse actions .......... 52
Figure 51 - Relationship between number of words spoken and negative observations .......... 53
Figure 52 - Example of using CEP to join two different events into one .......... 54
Figure 53 - The complete process of consuming, filtering and generating events .......... 55
Figure 54 - Analysis of the Processing Time (seconds) for arrivals (orange/brown) and departures (blue) for user8 .......... 56
Figure 55 - DM/ML/AI module with automatically calculated metrics for arrival flights capturing repeated workflow steps (e.g., number of taxi commands or cross runways for all flights) .......... 57
Figure 56 - Discovery of Association Rules using the algorithm fp-growth .......... 58
Figure 57 - Relation between the discovered association rules and different variables of the model .......... 58
Figure 58 - Outliers Discovery for negative observations in the new dataset with metrics counters (captured between successive negative observations) .......... 59
Figure 59 - Decision tree to depict reasons for increasing numbers of negative occurrences for different users .......... 60
Figure 60 - Polynomial regression analysis for creating a model to predict negative occurrences based on top most metrics (number of eye events or departure flights) .......... 61
Figure 61 - Relation of Probabilistic Suffix Tree (PST) and Automation PSA .......... 62
Figure 62 - Hypothetical distribution of event durations, if after one event (left) or a sequence of two events (right) a specific event is observed .......... 62
Figure 63 - Illustration of the complexity measure of Grassberger .......... 63
Figure 64 - Screenshot of the user interface with displayed transition probabilities .......... 65
Figure 65 - Illustration of how probabilities for next events in a sequence are displayed .......... 65
Figure 66 - Illustration of the user interface combining states with displayed scatterplot matrix .......... 66
Figure 67 - Eye-Tracking .......... 82
Figure 68 - Test Setup - Eye-Tracking .......... 83
Figure 69 - Eye Tracking Data Analysis .......... 84
Figure 70 - Test Person 1 - Eye Tracking .......... 85
Figure 71 - Test Person 2 - Eye Tracking .......... 86
Figure 72 - Test Person 3 - Eye Tracking .......... 86
Figure 73 - Test Person 4 - Eye Tracking .......... 87
Figure 74 - Kinect sensor .......... 87
Figure 75 - Sensors included in the Kinect .......... 88
Figure 76 - Test Setup - Kinect .......... 89
Figure 77 - Evaluation of distances and angle .......... 90
Figure 78 - Test Setup - Speech Recognition .......... 92
Figure 79 - Callsign Recognition Rate - Speech Recognition .......... 93


Executive summary

The project Sixth Sense follows the idea of using the whole body language of a user for communicating with a machine. In our case it is an Air Traffic Controller (ATCO) working with an Air Traffic Tower CWP. Specifically, we intend to analyse the correlation between changes in the behaviour of an ATCO, expressed through her/his body language, and the quality of the decisions she/he is making. Results of our work may be used as an early warning of “bad” situations about to occur or as decision aids for the ATCO.

We used scenarios of Hamburg Airport, since its layout has sufficient complexity to bring the test personnel into the difficult situations needed to test our hypothesis. Sensors for reading the body language were:

Kinect for body movement

Eye tracking for gaze detection

Speech recognition

Mouse cursor position

Room temperature

Heartbeat of user

Expert observations

The sensors were recorded throughout each run, together with the workflow/tasks performed by the user. The workflow was retrospectively analysed by experts, who marked bad decisions and/or bad situations arising. Combinations of sensor recordings, and different visualisations derived from them, were used to detect repetitive patterns of user behaviour correlating with good or bad decisions. Several test runs were performed in two batches to gain as much test data as possible in the available time frame. Details are given in the sections below.
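As an illustration of how time-stamped sensor streams can be related to expert marks, the sketch below counts sensor events (here, hypothetical mouse clicks) in a fixed window before each marked negative observation. The data, function name and window size are invented for illustration; this is not the project's actual tooling.

```python
# Illustrative sketch only: hypothetical data, not the project's pipeline.
# Each sensor stream is a sorted list of event timestamps (seconds since
# the start of the run); expert marks are the times of "negative observations".
from bisect import bisect_left, bisect_right

def events_in_window(timestamps, mark, window):
    """Count events in the `window` seconds up to and including `mark`."""
    lo = bisect_left(timestamps, mark - window)
    hi = bisect_right(timestamps, mark)
    return hi - lo

# Hypothetical run: a burst of mouse clicks precedes the first mark.
mouse_clicks = [2.0, 5.5, 6.1, 6.4, 6.8, 30.0]
negative_marks = [7.0, 31.0]

counts = [events_in_window(mouse_clicks, m, window=5.0)
          for m in negative_marks]
print(counts)  # [5, 1]: dense clicking before the first mark, little before the second
```

A real analysis would repeat such windowed counts per sensor stream and compare the windows around good and bad decisions to look for repetitive patterns.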

Key learnings of the work performed were:

An analysis of decision quality through experts is difficult, since the intention of the test person stays hidden. Additional self-assessment will add value in future tests.

Analyses of sensor recordings offer a vast number of possible combinations, as well as visualisations derived from them. Further work on the existing data might produce even more significant findings.

Conclusion: our test setup and process proved right. The analytical tools and visualisations used are feasible, although there are numerous other possibilities which might be even better. Due to the nature of this kind of exploratory research project with restricted resources, no statistical significance can be claimed for the patterns found; the number of test persons was too low. However, the concrete patterns which have been found allow deriving early indications for good or bad decisions. There are good indications that positive results can be achieved when more test data and more time are available for sensor permutation analysis.


1.1 Purpose of the document

This document reports the results of the experiment in the Sixth Sense project.

Chapter 2 describes the setup of the experiment and explains the exercises.

The performance of the experiment is summed up in Chapter 3.

In Chapter 4 we present the results: we give a classification of our most important metrics related to task load, mental workload, attention, behaviour and performance. Then we take a deeper look into related research questions and describe the complex data analysis. Relationships between the sensor data streams are discussed, and the conclusions of the analysis are described in detail. We explain the capabilities of our software framework and outline future directions of research, for example using the current results to create predictive models, or improving the user interface to support the user in making more informed decisions.

1.2 Intended readership

This document might be of interest to:

Sixth Sense project members, including the project manager and the core team members.

Representatives of EUROCONTROL and SJU responsible for reviewing and advising the project.

Other researchers working on related research projects, particularly researchers on error avoidance, new technologies and interaction methods.

Personnel in air traffic management and other parts of the aviation sector.

1.3 Acronyms and Terminology

Term Definition

AI Artificial Intelligence

AMQ Active Message Queue

ARR Arrival

ATCO Air Traffic Control Officer

ATM Air Traffic Management

DEP Departure

DM Data Mining

IBI Inter Beat Interval

MFA Multilateral Framework Agreements

HRV heart rate variability

KPI Key Performance Indicator

ML Machine Learning

Negative error Negative situation that could not be resolved


Positive error Negative situation which could be resolved with effort by the user

SESAR Single European Sky ATM Research Programme

SESAR Programme The programme which defines the Research and Development activities and Projects for the SJU.

SJU SESAR Joint Undertaking (Agency of the European Commission)

SJU Work Programme The programme which addresses all activities of the SESAR Joint Undertaking Agency.

TWR Tower



2 The Experiment

This section provides general information on the final design of the experiment, the preparation of the exercises and their performance. In contrast to the original plan described in the experimental plan 4.1, we skipped the AI-module development due to limited resources and did not perform Exercise 3. A few tasks of Exercise 3 (first steps towards prediction) were handled by analysing the data collected in Exercise 2.

[Figure: Exercise 1 - test of tracking sensors; Exercise 2 - expert ratings, KPIs, development of the DM/ML/AI-module; Exercise 3 - expert ratings, KPIs, predictions and test of the DM/ML/AI-module]

Figure 1 - Update of the exercise plan as described in the experimental plan.

The prediction and the DM/ML/AI-module test were part of Exercise 3 and could not be carried out because of the limited amount of time, resources and data. First steps regarding predictions are described in Section 4.3.3.

2.1 Experimental Setup

The following steps were conducted to execute the experiment, in which a participant performs a simulated 60-minute ground controller shift at a simulated ground controller position.

Overall Experimental Briefing: Provision of an overall briefing, giving an overview of the system used and the operational scenario conducted during the exercise.

Start of Experiment: Reset of the operational scenario.

A_Pre-Questionnaire: Collection of information about the test person (working experience, etc.).

Recording of data: Start of the recording of data, collected into the database.

Run Exercise: Start of the operational scenario and conduct of the exercise.

B_Supervisor Observation: During the exercise the observer took notes and collected the stress level.

C_Post-Questionnaire: Collection of subjective ratings (situational awareness, workload).

D_Debriefing: Collection of the debriefing questionnaire answers of the test person.

Overall Experiment Debriefing: General debriefing to close the experiment session.

Table 1 - Description of the workflow steps


Every participant received a map of the airport (Hamburg) and was asked to take the experimental working position. The participants were informed that they could ask the air traffic controller supervisor, who was present in the room, questions about the use of the simulator user interface.

When all questions were answered, the experiment started and the air traffic information was loaded into the simulator. Every 10 minutes, the supervisor asked the participant for the currently experienced stress level and noted down his own assessment of the participant's current performance.

The experiment lasted 45 minutes, but could run for a maximum of 60 minutes, depending on the current air traffic situation.


Figure 2 - Experimental Workflow

Table 1 and Figure 2 provide an overview of the different workflow steps within the experimental scenario. The Questionnaires A-D can be found in Appendix B.

2.2 Operational Scenario

The operational scenario was based on Hamburg Airport.


Figure 3 - Hamburg Airport

The following constraints were used to prepare the scenario:

- Simulation prepared for approx. 60 min.
- Arrivals are automatically simulated until touchdown (no change of route).
- Departures are controlled until take-off.
- No runway change is foreseen within the simulation.
- Taxiway routes can be selected by the operator.

Configurations during the experiment:

- Arrival Runway: 23
- Departure Runway: 33
- Arrivals: 31 flights
- Departures: 27 flights

2.3 Roles and Responsibilities

The following roles participated in the experiment:

- Ground Controller: Participant
- Runway Controller: Manually Simulated
- Pseudo Pilots: Manually Simulated
- Observer (supervisor)
- Observer (experiment leader)


The responsibilities within the workflow are displayed in the following figures:

Figure 4 - Arrival workflow - responsibilities

Figure 5 - Departure workflow - responsibilities

2.4 Technical Setup of the Experiment

The setup is based on a single simulated controller working position. No 3D view was available in the experiment. The experiment concentrated on ground traffic management.

The following modules were used during the experiment:

- Traffic Simulator
- CWP with EFS, Support Information
- AMQ Broker
- Eye-Tracker
- Mouse
- Keyboard
- Speech Recognition


Figure 6 - Setup working position

2.5 AMQ Broker

The broker is the central distribution system for all data communication between the components. The transport protocols used are STOMP and OpenWire.

ActiveMQ (AMQ) allows a single point of data exchange between different systems, modules and functional blocks through the use of customized XML messages.

It supports a variety of cross-language clients and protocols, including Java, C, C++, C# and Python.

For detailed information please refer to: http://activemq.apache.org/
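To make the message layer concrete, the sketch below builds and parses one customized XML event message of the kind the broker distributes. The element and attribute names (`event`, `field`, `topic`) are illustrative assumptions, not the project's actual schema; a real client would publish the resulting string to ActiveMQ over STOMP (for example with a library such as stomp.py) rather than parsing it back locally.

```python
import xml.etree.ElementTree as ET

def build_event_message(topic: str, user: str, timestamp_ms: int, payload: dict) -> str:
    """Serialize one sensor event as a small XML message.

    The schema here (an <event> element with <field> children) is a
    hypothetical stand-in for the project's customized XML messages.
    """
    root = ET.Element("event", topic=topic, user=user, timestamp=str(timestamp_ms))
    for name, value in payload.items():
        ET.SubElement(root, "field", name=name).text = str(value)
    return ET.tostring(root, encoding="unicode")

# Round-trip one mouse event to show the message is well-formed XML.
msg = build_event_message("Mouse", "user8", 1321004400000,
                          {"x": 512, "y": 384, "button": "left"})
event = ET.fromstring(msg)
```

A single schema like this is what lets heterogeneous components (simulator, sensors, loggers) share one point of data exchange.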

2.6 The Human Machine Interface (HMI)

The HMI is split into several parts, which are described in more detail in the following paragraphs. In general, the right side of the screen is reserved for a representation of the flights, the smartStrips. The middle part can contain a variety of information, including an overview of the airfield as shown in the picture below. The top contains an information bar, and to the side there is a sidebar with additional information, e.g. the status of the system or wind data. In summary, the screen can be separated into:

- Sidebar
- Infobar
- Button bar (EFS)
- Strips
- Page Selection (Main Area)



Figure 7 - Components of the HMI screen

Figure 8 explains the different fields of the departure strip. Three sizes are available: DEP MICRO, DEP MEDIUM and DEP MACRO. Figure 9 explains the fields of the arrival strip, which is likewise available in three sizes: ARR MICRO, ARR MEDIUM and ARR MACRO. In both figures, the filled explanations mark fields that can be pressed on the strip.

Figure 8 - Departure Strips


Figure 9 - Arrival Strips

Figure 10 - Strip Bay Configuration / Button Bar

Figure 10 shows the information about the configured bays and explains the button bar.


3 Performing the Exercises

The first exercise of this experiment was used to assess the accuracy of the sensors integrated into the prototype, to ensure the necessary quality of the technologies used.

The following sensors were tested in Exercise 1:

‐ Eye Tracker

‐ Kinect

‐ Speech Recognition

See Appendix A, Exercise 1 of the deliverable D4.2 (Verification Plan):

Exercise ID/Title: EXE-E.02.25-VP-0001.0001 /Eye-Tracking adapter

Exercise ID/Title: EXE-E.02.25-VP-0001.0002 /Kinect AMQ-adapter

Exercise ID/Title: EXE-E.02.25-VP-0001.0003 /Leap Motion AMQ-adapter

The Leap Motion was found not to be useful in a seated, mouse-based environment. Instead, the speech recognition module was evaluated in more detail.

More technical details about exercise 1 can be found in Appendix A of this document.

In Exercise 2 the participants followed the experimental workflow of Figure 2 and performed the simulated 60-minute shift of a ground controller. During the exercise, the supervisor took observation notes and asked the participant for her/his stress level (on a scale from 1 to 5) every ten minutes. The observation notes consist of a time stamp and a short description of the observation. In the second part of Exercise 2 (users 5-8) this process was already automated and stored using the software framework.

In a later step, the recorded video capture of the exercise was reviewed by a domain expert to create the observer list. The observer list consists of selected events which are rated as positive, neutral or negative. A positive event occurs when the participant successfully resolves a negative event.

The detailed description of exercise 2 can be found in Appendix B Exercise 2 of the deliverable D4.2 (Verification Plan):

Exercise ID/Title: EXE-E.02.25-VP-0002.0001 / Collecting Sensor data, and expert reviews.

The questionnaires before, during and after the exercise can be found in Appendix.

As mentioned in Section 2, in contrast to the original plan we did not perform Exercise 3; we could only take the first steps regarding predictions in the analysis of the Exercise 2 data.

3.1 Profile of Participants

All participants work in the field of air traffic control, but at different expert levels: one as an en-route controller, two as ground controllers, and one trained as a ground controller who works only in simulation experiments.

Years of work experience: The participants had 2, 4, 14 and 20 years of professional experience respectively.

Gender: There were two male (50%) and two female (50%) participants.

Age: One participant was aged between 20 and 30, one was aged between 30 and 40 and two were aged between 40 and 50.

Language: Two of the participants had German as their mother tongue, one Romanian and the other Spanish. All communication between pilots and air traffic controllers was handled in English.


Due to the limited number of available test participants, we had to reuse participants for the experiments. Learning effects caused by this reuse cannot be ruled out, but since we measured behaviour for individual test runs, this effect can be neglected.

3.2 Data Analysis, Exploration and Visualization

After performing Exercise 2 and post-processing the data, a summary table (see Table 2) with the total number of usable events for each generated dataset (topic) was created.

Table 2 - Data Collection and Quality Assessment for Different Data Sets and Sensors

Variables | Topic (Dataset) | User1 | User2 | User3 | User4 | User5 | User6 | User7 | User8 | Total Events by Topic | Description
3 | Supervisor & Observer | 51 | 65 | 13 | 123 | 91 | 91 | 107 | 152 | 693 | Reports from supervisors and observers
3 | StressLevel | 6 | 6 | 6 | 6 | 6 | 6 | 7 | 6 | 49 | Stress level reports (from users)
42 | FlightObject | 616 | 420 | 57 | 436 | 302 | 241 | 340 | 434 | 2846 | Flight information
6 | Selections | 420 | 302 | 17 | 211 | 197 | 149 | 209 | 268 | 1773 | Strip selections
10 | Eye | 0 | 0 | 53 | 169 | 9097 | 61770 | 72534 | 68116 | 211739 | Eye tracker
4 | GlobalMouse | 0 | 72891 | 2929 | 79618 | 3844 | 12700 | 7588 | 8082 | 187652 | Mouse UI hook
7 | Mouse | 3046 | 1929 | 110 | 1290 | 916 | 1838 | 1266 | 1915 | 12310 | Mouse listener
23 | Kinect | 27351 | 0 | 0 | 0 | 7561 | 0 | 0 | 0 | 34912 | Kinect listener
12 | Voice | 1014 | 1160 | 36 | 1242 | 256 | 899 | 1126 | 1587 | 7320 | Voice listener
4 | Waspmote | 0 | 0 | 0 | 0 | 1807 | 2754 | 3041 | 3376 | 10978 | Waspmote listener
12 | Heart Rate Measurements | 0 | 2978 | 2274 | 5347 | 0 | 4512 | 0 | 5184 | 20295 | Heart rate events collected
13 | Eye AOI (Fixations, Gazes, Saccades) | 0 | 0 | 0 | 0 | 715 | 1625 | 3396 | 3767 | 8788 | Eye tracking areas of interest
13 | Mouse AOI (Fixations, Gazes, Saccades) | 0 | 2217 | 0 | 2325 | 62 | 2729 | 1002 | 1083 | 9418 | Mouse tracking areas of interest
152 | Total number of variables | 32504 | 81968 | 5495 | 90767 | 24139 | 89314 | 90616 | 93970 | 508773 | Total data collected in 2 experiments

Total number of airplanes: 58 (27 departures, 31 arrivals).

As the first entry in Table 2 we entered the notes and annotations from the supervisor and observer of the exercise (see also Section 3.2.3 "Observations"). These notes are text notes about, for example, observations of errors or suboptimal situations. When talking about observations of errors we


distinguish between positive and negative, where positive means that a negative situation could be resolved by some effort of the user. The second entry is the stress level. We acquired this information by asking the participant every ten minutes about the subjective stress level on a scale from 1 to 5. During the exercises we encountered several hardware issues with the Kinect sensor. In favour of a higher eye tracking frequency it was decided to deactivate the Kinect for users 6-8.

3.2.1 Heart Rate vs Observations List

Not all the users agreed to wear the heart rate monitor device (for different reasons: health, privacy). For user2, user3, user4, user6 and user8 we collected at least 3 baseline measurements at rest. In addition to the heart beats per minute, we also measured the heart rate variability (HRV).

The HRV indicates the fluctuations of the heart rate around an average heart rate. An average heart rate of 60 beats per minute (bpm) does not mean that the interval between successive heartbeats would be exactly 1.0 sec. Instead the interval may fluctuate/vary from 0.5 sec up to 2.0 sec. HRV is affected by aerobic fitness and HRV of a well-conditioned heart is generally large at rest. Other factors that affect HRV are age, genetics, body position, time of day, and health status. During exercise, HRV decreases as heart rate and exercise intensity increase. HRV also decreases during periods of mental stress.

The HRV is regulated by the autonomic nervous system. Parasympathetic activity decreases heart rate and increases HRV, whereas sympathetic activity increases heart rate and decreases variability. A low HRV indicates dominance of the sympathetic response, the fight-or-flight side of the nervous system associated with stress, overtraining and inflammation. HRV thus offers a glimpse into the activity of the autonomic nervous system, an aspect of our physiology that is otherwise hard to observe.

For the representation of the HRV we use the time-domain measure Root Mean Square of Successive Differences (RMSSD), shown in Figure 11:

RMSSD = sqrt( (1 / (N - 1)) * sum_{i=1}^{N-1} (RR_{i+1} - RR_i)^2 )

where HR = heart rate in beats per minute (bpm), R-R interval = inter-beat interval (IBI) in msec, and N = number of R-R interval terms.

Figure 11 - RMSSD Formula

Alternatively, we also calculated the Z-Score and Z-Score IBI measures. The inter-beat interval (IBI) is the time interval between individual beats of the heart. It is generally measured in milliseconds and is recorded automatically by a Polar heart rate sensor. In normal heart function, each IBI value varies from beat to beat; this natural variation is the HRV described above. Certain cardiac conditions, however, may cause the individual IBI values to become nearly constant, so that the HRV approaches zero.

The Z-Score (HR) is the heart rate value normalized by the mean and standard deviation of a reference distribution; here we take the average of the 3 heart rate measurements recorded while the user was at rest at the beginning of the experiment. The Z-Score IBI is normalized in the same way, using the average of all measurements. We use the Z-Score and Z-Score IBI to quantify how far the current value in time lies from the average baseline inter-beat interval. This allows us to better track changes of the users' HRV over time. Figure 12 shows an example of the Z-Score IBI plot for user8. In the upper
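The measures above follow directly from their definitions; a minimal sketch of RMSSD and the Z-score normalization (generic implementations, not the project's actual analysis code) could look like this:

```python
import math

def rmssd(ibis_ms):
    """RMSSD: root mean square of successive differences of
    inter-beat intervals (in milliseconds)."""
    diffs = [b - a for a, b in zip(ibis_ms, ibis_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

def z_score(value, baseline):
    """Normalize `value` against the mean and standard deviation of
    the baseline measurements (e.g. the resting values recorded at
    the start of the experiment)."""
    mean = sum(baseline) / len(baseline)
    sd = math.sqrt(sum((x - mean) ** 2 for x in baseline) / len(baseline))
    return (value - mean) / sd

# Example: a short, made-up series of inter-beat intervals in ms.
ibis = [812, 790, 835, 805, 820]
variability = rmssd(ibis)        # larger value -> higher HRV
deviation = z_score(760, ibis)   # negative -> shorter IBIs than the baseline
```

A negative Z-Score IBI then corresponds to the red (decreased variation) segments described for Figure 12.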


part of the graph (count of positive/negative observations) the data is filtered to show only negative observations (experts in red, observers in orange). Below this chart, we can see the Avg Z-Score IBI, where in red we have negative values (decreases in the heart rate variation) and in green we see positive values (increases in the variation). The decrease in variation can be associated with periods of stress. We can observe that before periods of time with more negative observations there are clear indications of stress (decreases in HRV) followed by moments of relaxation when the user regains control.

Figure 12 - Z-Score IBI vs negative observations through the total experiment time for user8.

3.2.2 Eye-Tracker and Mouse Analysis

Eye tracking monitors the movement of the eyes and reconstructs the gaze point on the screen; the resulting data stream contains many rapid position changes. Human visual perception needs a certain amount of time to register elements of a graphical user interface. We are therefore interested in gaze positions that are actively registered by the user. These positions are called fixations.

The freely available eye tracking analysis software Ogama [1] was used to calculate fixations. The areas of interest (AOI) were defined within Ogama (see Figure 13). The calculation of fixations then automatically takes the AOIs into account and links them to the results. For simplicity, we used the same processing pipeline for the mouse movements. Mouse fixations are positions on


which the mouse cursor rested for a certain amount of time (in our case: delta t > 66 ms and delta d < 20 pixels). All the fixation results were exported to a comma-separated value (CSV) file.
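As a minimal illustration of the thresholds quoted above (rest longer than 66 ms within a 20-pixel radius), a dispersion-style fixation detector can be sketched as follows. Ogama's actual fixation algorithm may differ in detail; this is only a sketch of the idea.

```python
import math

def detect_fixations(samples, max_dist=20.0, min_duration=66.0):
    """Group (t_ms, x, y) samples into fixations.

    A sample starts a new cluster as soon as it moves more than
    `max_dist` pixels away from the running cluster centre; a cluster
    becomes a fixation if it lasted longer than `min_duration` ms.
    Returns (start_ms, end_ms, centre_x, centre_y) tuples.
    """
    def centre(cluster):
        return (sum(p[1] for p in cluster) / len(cluster),
                sum(p[2] for p in cluster) / len(cluster))

    def flush(cluster, out):
        if cluster and cluster[-1][0] - cluster[0][0] > min_duration:
            cx, cy = centre(cluster)
            out.append((cluster[0][0], cluster[-1][0], cx, cy))

    fixations, cluster = [], []
    for t, x, y in samples:
        if cluster:
            cx, cy = centre(cluster)
            if math.hypot(x - cx, y - cy) > max_dist:
                flush(cluster, fixations)
                cluster = []
        cluster.append((t, x, y))
    flush(cluster, fixations)
    return fixations

# The cursor rests at (100, 100) for 100 ms, then jumps away briefly:
samples = [(t, 100.0, 100.0) for t in range(0, 101, 10)] + [(110, 400.0, 400.0)]
fixations = detect_fixations(samples)
```

The same routine works for gaze samples, which is why one pipeline could serve both eye and mouse data.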

To use the results in other software such as Tableau, we had to modify the CSV files. The time stamp had to be converted from seconds to a valid date-time format. For eye tracking data we added an off-screen AOI for large time slots (>500 ms) between fixations: a saccade between two fixations never takes that long, so it can be assumed that tracking had been lost and the operator had fixated something beside the screen, for example a map lying in front of him.
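That gap rule can be applied in a single pass over the exported fixation rows. The "OffScreen" label and the (start_ms, end_ms, aoi) row layout are assumptions for illustration, not the exact CSV columns used in the project.

```python
def insert_offscreen_aoi(fixations, gap_ms=500):
    """Insert an 'OffScreen' pseudo-AOI wherever the gap between two
    consecutive fixations exceeds `gap_ms`: a saccade never lasts that
    long, so tracking was presumably lost (e.g. the operator looked at
    a paper map beside the screen).

    `fixations` is a chronological list of (start_ms, end_ms, aoi) rows.
    """
    rows = []
    previous = None
    for current in fixations:
        if previous is not None and current[0] - previous[1] > gap_ms:
            rows.append((previous[1], current[0], "OffScreen"))
        rows.append(current)
        previous = current
    return rows

# A 600 ms gap between two fixations gets an OffScreen row in between.
annotated = insert_offscreen_aoi([(0, 100, "Strips"), (700, 900, "Infobar")])
```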

Figure 13 - Areas of interest of the ATC Simulator as defined in Ogama

3.2.3 Simple Metrics and Data Exploration

After pre-processing the data, we had to decide which data set (topic) and which metrics related to the topic should be visualized, explored and investigated first and in more detail. For this we created a summary table (see Table 3) listing the metrics we wanted to analyse, visualize and explore.

Topic (Dataset): Metric visualizations

- Mouse: Number of mouse clicks per time and per user
- Observer notes: Errors, help requests and annotations noted by the observer
- Selections: Number of strip updates per time and per user
- Selections: Strip selection through time
- Selections: Animated temporal evolution of the number of selections for the different users
- Selections and StressLevel: Stress level vs number of callsign interactions
- Supervisor Notes: Errors
- Voice: Number of calls per time (speech recognition)
- Workload Metrics: When there was an error report, we show the workload metrics between errors
- Eye: Number of fixations per time (eye)
- Eye: Number of saccades per time (eye)
- Eye: Number of transitions per AOI per time (eye)
- Eye and Mouse: Correlation between transitions of eye and mouse


FlightObject Number of flights in one workflow step per time and per user

GlobalMouseWatcher Left Clicks, Mouse Moves

Heart Rate Average heart beat per time

Kinect Head position, body posture

Mouse Number of drag & drops per time and per user, Mouse positions, AOIs

Voice and Selections Number of speech recognition results vs strips update

Waspmote Sensors (Temp and Light) Temperature and Light values

Table 3 - Summary of initial metrics to be visualized and explored

After the creation of the initial visualizations, the project consortium discussed the ranking and importance of the different metrics and the suggested visualization types with respect to practical applicability and meaning for the overall goal of finding patterns for different types of behaviour. ATC experts of the company partner therefore voted (on a scale from 1 to 5) to specify their level of satisfaction with the current visualizations and metrics. Figure 14 shows the template. This template also served as a reference list to discuss interesting findings or patterns found in the analysed data.

Figure 14 - Ranking of metrics and visualizations

As a starting point, eight different topics were selected independently for visualization and exploration. The selected topics were:

- Mouse
- Observations
- Selections
- Selection and Stress Level
- Voice
- Workload Metrics
- Eye & Mouse AOI
- Heart Measurements


The decision was based on data availability, completeness (data available for the total simulation time) and data quality. Kinect and Waspmote could therefore unfortunately not be considered for the data exploration: as shown in Table 2, Kinect data could be recorded in only 2 of the 8 test runs, and Waspmote data in only half of the test runs.

Observations: In order to identify specific situations in the simulation/exercise and to be able to compare similar situations within the data, different kinds of observations have been recorded and ranked by a supervisor:

- Observations during the test run
- Offline observations

These observations have been merged into a list with the following information:

- Timestamp
- Observation Type
- Message
- Positive / Neutral / Negative
- Observation Category
- User
- Synchronized Timestamp
- Ranking (-5 to 5)
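A straightforward way to hold and merge such records chronologically is sketched below; the field names mirror the list above, while the Python types and the `merge_observations` helper are illustrative assumptions rather than the project's actual data model.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    timestamp: float      # synchronized timestamp, seconds into the run
    obs_type: str         # "online" (during the test run) or "offline"
    message: str
    polarity: str         # "Positive" / "Neutral" / "Negative"
    category: str
    user: str
    ranking: int          # -5 .. 5

def merge_observations(*sources):
    """Merge online and offline observation lists into one list,
    ordered by the synchronized timestamp."""
    merged = [obs for source in sources for obs in source]
    merged.sort(key=lambda o: o.timestamp)
    return merged

# Hypothetical example entries, not taken from the experiment data:
online = [Observation(120.0, "online", "Help request", "Negative",
                      "error", "user3", -2)]
offline = [Observation(45.5, "offline", "Efficient taxi route", "Positive",
                       "performance", "user3", 3)]
merged = merge_observations(online, offline)
```

Sorting on the synchronized timestamp is what makes observations from the live run and the later video review comparable against the sensor streams.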

Figure 15 - Observation List


4 Results

Recall of hypotheses H1 and H2: As described in the Verification Plan D4.1, the performance of the final parameterized DM/ML/AI-algorithms should be tested by answering the following hypotheses:

Hypothesis H1: the DM/ML/AI-module is able to detect situations in which the operator tends to make bad decisions by analysing user-input and user-tracking data

Hypothesis H2: the DM/ML/AI-algorithms are able to identify good and bad workflow patterns

Workflow patterns in H2 are sub-sequences of actions the controller performs. These workflow patterns might vary between good and bad controllers. H1 refers to the decisions a controller makes in single steps of the workflow.

Both hypotheses refer to the evaluation of whether the developed DM/ML/AI-modules are able to assess the state of an operator, which was planned to be verified in a third exercise. Due to limited resources for further experiments and, as a consequence, limited data collection, it was not possible to perform the third exercise.

Therefore, we present the analysis of the data in more detail with respect to the question:

What kinds of patterns have been detected, and which might be useful for the development of such a module?

In the first section of this chapter, high-level results are presented which describe the general performance of the users. We start in Section 4.1 with the workload estimates that we could generate from the questionnaires. In Section 4.2 we turn our attention to the measured sensor data. We start with simple visualizations, and from these we create a list of useful and interesting metrics with different levels of complexity (combining sensor data from different sources). We classify the metrics into categories. With this background we created detailed and concrete research questions that guide our analysis, visualization and exploration of the data towards good predictors of the users' behaviour (Section 4.2.3). Combined visualizations are used to show which metrics and which patterns could be found in the collected data. In Section 4.4 we show the results of the event trace analysis. At the end of this chapter our conclusion (Section 4.5) and future work (Section 4.6) can be found.

These findings can be used in the future for the development of models to predict the user’s behaviours.

4.1 Workload Estimates based on Questionnaires

The NASA-TLX is a multi-dimensional scale designed to obtain workload estimates from one or more operators while they are performing a task or immediately afterwards. In our case the users filled out the NASA-TLX questionnaire after the experiment.
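As an aside on how such estimates are typically aggregated: the unweighted "Raw TLX" variant simply averages the six subscale ratings, whereas the full NASA-TLX additionally weights the subscales by pairwise comparisons. The report does not state which variant was used, so the snippet below is purely illustrative.

```python
TLX_SUBSCALES = ("mental", "physical", "temporal",
                 "performance", "effort", "frustration")

def raw_tlx(ratings):
    """Unweighted overall workload score ('Raw TLX'): the mean of the
    six NASA-TLX subscale ratings."""
    return sum(ratings[scale] for scale in TLX_SUBSCALES) / len(TLX_SUBSCALES)

# Hypothetical ratings for one participant (illustrative values, not real data).
ratings = {"mental": 12, "physical": 6, "temporal": 9,
           "performance": 6, "effort": 12, "frustration": 9}
overall = raw_tlx(ratings)
```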

The NASA-TLX uses several rating scale descriptions. The first describes the mental demand on the users, i.e., the amount of mental and/or perceptual activity that was required (e.g., thinking, deciding, remembering, calculating, looking, searching).

We observed that user1, user4, user6 and user8 reported a higher mental workload at the end of their experiments (Figure 16).


Figure 16 - Mental Demand Results for all 8 users, 2 experiments.

Figure 17 - Physical Demand Results for all 8 users, 2 experiments.

Next we analysed the reported physical demand. This rating refers to the amount of physical activity that was required (e.g., pushing, pulling, turning, controlling, activating). Here user5 and user8 reported a higher physical demand (Figure 17).

Temporal demand refers to the amount of pressure that the user felt due to the rate at which the task elements occurred (e.g., was the task slow and leisurely or rapid and frantic?). User1, user6 and user7 reported a higher temporal demand (Figure 18).

Figure 18 - Temporal Demand Results for all 8 users, 2 experiments.

Figure 19 - Level of Effort Results for all 8 users, 2 experiments.

We checked the level of effort (Figure 19) reported by the users, i.e., how hard the user had to work mentally and physically to accomplish the level of performance. For user1, user4, user5 and user7 there was a higher level of effort.

The level of frustration is related to how insecure, discouraged, irritated, stressed and annoyed versus secure, gratified, content, relaxed and complacent the user felt during the task. User1, user5 and user7 reported a higher level of frustration (Figure 20).



Figure 20 - Level of Frustration Results for all 8 users, 2 experiments.

Figure 21 - NASA-TLX Negative Results (not considering “Level of Performance” answers).

Considering that the maximum possible value is 50 for each user and that we consider only the 5 negatively related ratings in the NASA-TLX (mental, physical, temporal, level of effort and level of frustration), we can display a general overview of the ratings reported by our users. In Figure 21 we see that user1, user7 and user5 reported a higher workload, although the results for user4 and user6 are very similar to those of user5. In our examples we mostly use user4, user6 and user8 to better represent the results, because we had more data available for these users.
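The aggregation described above can be sketched in a few lines. This is a minimal illustration, assuming each subscale is rated 0-10 (consistent with the 50-point maximum); the dictionary keys and the sample values are hypothetical, not the study data.

```python
# Sum the five "negative" NASA-TLX subscales into one workload score per user.
# Subscale names and the 0-10 scale are assumptions for illustration.

NEGATIVE_SCALES = ("mental", "physical", "temporal", "effort", "frustration")

def tlx_workload(ratings: dict) -> int:
    """Sum the five negative NASA-TLX subscale ratings (max 5 * 10 = 50)."""
    return sum(ratings[scale] for scale in NEGATIVE_SCALES)

ratings_user = {"mental": 9, "physical": 4, "temporal": 8,
                "effort": 10, "frustration": 7, "performance": 3}
print(tlx_workload(ratings_user))  # "performance" is deliberately excluded
```

Note that the level-of-performance answer is left out, exactly as in Figure 21.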

We also show the level of performance reported by the users, i.e., how successful the users think they were in accomplishing the goals of the task set by the experimenter (Figure 22). User2, user3, user4 and user5 reported higher ratings, which is consistent with the other (negative) ratings.

Figure 22 - Level of Performance (for all users, 2 experiments).

Figure 23 - NASA-TLX Correlation Matrix (taking all answers from all users).

We also used the data analysis tool R to create correlation matrix plots that show the possible relations between the different answers of the users (Figure 23). These correlation matrices help in the interpretation of our data and results. For instance, in Figure 23 we can observe the negative correlation between Level of Performance and Temporal Demand (lower left corner) or the positive correlation between Physical Demand and Age. Please note that for Gender we use 2 (the higher value) for female and 1 for male.
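The correlation matrices were produced in R; an equivalent Pearson correlation can be sketched in plain Python as follows. The answer columns shown are illustrative values, not the study data.

```python
from statistics import mean
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Pairwise correlations for a dict of named answer columns."""
    names = list(columns)
    return {(a, b): round(pearson(columns[a], columns[b]), 2)
            for a in names for b in names}

answers = {  # illustrative ratings for 8 users, not the study data
    "temporal_demand":      [8, 6, 9, 3, 7, 5, 9, 4],
    "level_of_performance": [2, 5, 1, 8, 3, 6, 2, 7],
}
print(correlation_matrix(answers)[("temporal_demand", "level_of_performance")])
```

With these made-up numbers the pair comes out clearly negative, mirroring the Level of Performance vs Temporal Demand relation visible in Figure 23.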

Next we present the SAGAT-based questionnaire. As we can see (Figure 24), user8 had a positive report, while user7, user4 and user6 had a higher negative rating. In Figure 25 we can observe the correlation matrix plot for all answers from all users to the SAGAT-based questionnaire.

Figure 24 - SAGAT Based Questionnaire. Figure 25 - SAGAT Correlated Answers.

In the SASHA-based questionnaire (Figure 26), user8 and user2 had a higher positive rating, while user1, user7 and user6 had a more negative rating. In Figure 27 we can observe the correlation matrix plot for all answers from all users to the SASHA-based questionnaire.



Figure 26 - SASHA based Questionnaire.

Figure 27 - SASHA Questionnaires, correlated Plot.

Finally, we plotted the mixed correlation between the NASA-TLX questionnaires and the SASHA questionnaires (see Figure 28).

Figure 28 - NASA-TLX and SASHA Correlation Matrix.

Figure 29 - Negative vs Positive Answers (based on all questionnaires).

In the combination of all questionnaires and reports of the users we looked for evident signs of negativity or positivity in the users’ answers. This allows a better evaluation of the reliability of the users’ answers. In Figure 29 we present the results of this positivity and negativity analysis. User8 seems to be very positive in all answers, while user7, user4 and user6 are much more negative in general.

In Table 4 we see the outliers in negative/positive answers. For this we analysed all questionnaires for consistently negative or positive answers. It can be seen that user7 was consistently negative in his answers.

Interesting findings considering the 2 top users with a negative score or the 2 top users with a positive score:

| user1*   | user4**  | user7#   |
| user7#   | user7#   | user6**  |
| user3**  | user8*** | user1*   |
| user2*** | user1*   | user2*** |
| user8*** |          |          |

* - positive, negative, positive
** - only one time positive or only one time negative
*** - 2 times positive
# - always negative

user5 never appears in the top 2 negative or positive.

Table 4 - Outliers in negative/positive answers.

As a conclusion to the analysis of the questionnaires, the results can give hints for further data exploration. It would make sense to look at the users that reported the highest workload (here user1 and user7, see Figure 21) and check for correlations with their heart rate or number of errors, provided that these data are available for those users. But sometimes our decision for further analysis was based on other factors, like the data quantity. Including questionnaires in the data analysis is definitely a promising option. Due to limited resources and data quantity we could not exploit the maximum potential.

4.2 Hints for H1 - Exploring the Sensor Data

In the first stage after post-processing the measured data collection, we created simplified visualizations for initial discussion and exploration of the data. As a result we agreed on a reference list of metrics, charts and initial findings. Furthermore, we got an idea of more complex visualizations in which we could combine several metrics and visualization types. The initial list of ideas and questions to study more intensively was the following:

Similar to the NASA-TLX grouping, maybe we should create visualizations that group the metrics according to: workload (heart rate, fixation frequency, number of departures or arrivals), temporal aspects, performance

Make usage of different performance measures

Can we show how fast the utterances were spoken?

Can we distinguish between sentences that have the same meaning but use fewer words?

Is the user using shorter words when the stress levels are higher?

We must focus on the Observations list


We should combine stress levels, observations and heart measurements

We should combine voice, selections, observations, heart measurements

We should combine fixations frequency, stress levels, observations and heart measurements

How many departures, arrivals per minute? Is this related with stress and changes in the heart rate variability?

From this starting point, we created a summary table with the most important metrics that can be combined in order to better represent the overall status of the users at each point in time (in our case, per minute).

| ID | Metrics | Source | Name | Data Type |
| 1 | Taxi time | Arrival Workflow | Taxi time duration | time |
| 2 | Take-off time | Departure Workflow | Take-off time duration | time |
| 3 | Number of Observation Negative Errors (Experts Observations List) | Errors | negatives | count |
| 4 | Number of Deviations or errors | Experts Observation + Supervisor + Observer Errors | number of total negative errors | count |
| 5 | Standard Deviation of Eye Fixation Duration | Eye tracker | stdev eye fixation duration | stdev duration time |
| 6 | Number of distinct changes in AOI Eye Tracker | Eye tracker | number of distinct mouse AOIs | count |
| 7 | Eye Fixation Frequency (AOI Count) | Eye tracker | all eye AOIs count | count |
| 8 | Eye Fixation duration time | Eye tracker | eye fixation duration | duration time |
| 9 | Difference between Average Eye Count of AOI and current count of eye AOI (average variation in the AOI count) | Eye tracker | diff eye count avg | count |
| 10 | Difference between Average eye Fixation Time and current eye Fixation Time (average variation in the Fixation Time) | Eye tracker | diff eye duration avg | duration diff |
| 11 | Difference between Average Count of mouse AOI and current count of mouse AOI (average variation in the AOI count) | Eye tracker | diff eye count avg | count |
| 12 | How often has the ATC to interact with the Flight Plan, Manuals or Help system | FlightPlan | number of manual and help occurrences | count |
| 13 | Heart Rate Z-Score | Heart Rate | heart z-score | score |
| 14 | Heart rate RMSSD | Heart Rate | heart rmssd | measure |
| 15 | Heart Rate interbeat interval | Heart Rate | heart ibi | measure |
| 16 | Heart Rate Beats Per Minute | Heart Rate | heart bpm | measure |
| 17 | Standard Deviation of Mouse Fixation Duration | Mouse | stdev mouse fixation duration | stdev duration time |
| 18 | Number of Mouse clicks | Mouse | mouse left clicks | count |
| 19 | Number of distinct changes in AOI Mouse | Mouse | number of distinct eye AOIs | count |
| 20 | Mouse Position Frequency (AOI Count) | Mouse | all mouse AOIs count | count |
| 21 | Mouse Fixation duration time | Mouse | mouse fixation duration | duration time |
| 22 | Number of Observed error Messages (observer list) | Observer | errors | count |
| 23 | Number of Callsigns | Selections | number of callsigns | count |
| 24 | Subjective Stress Levels (every 10 minutes) | Stress | stress level | level |
| 25 | Number of Words Used | Voice | number of words | count |
| 26 | Number of communications | Voice | number of communications | count |
| 27 | Number of Utterances (phrases) | Voice | number of utterances | count |
| 28 | Reaction time to a System Warning | Workflow Task | reaction time to warning | duration time |
| 29 | Number of features recalled by the users after the session | Workflow Task | features recalled count | count |
| 30 | Time to complete a specific task (from workflow) | Workflow Task | time to complete task | duration time |
| 31 | Total Time spent per task | Workflow Task | total time spent for each workflow task | duration time |
| 32 | Time spent in recovering from errors | Workflow Task | time spent in error recovering | duration time |
| 33 | Number of tasks the user has completed in a critical amount of time | Workflow Task | number of tasks during critical time | count |
| 34 | Number of tasks the user could not complete in a critical amount of time | Workflow Task | number of tasks not completed during critical time | count |
| 35 | Number of tasks performed vs tasks never performed | Workflow Task | ratio between tasks performed and tasks never performed | ratio |
| 36 | Number of errors vs correct interactions | Workflow Task | ratio of correct vs incorrect interactions | ratio |
| 37 | How many steps to complete task | Workflow Task Steps | steps to complete task | count |
| 38 | Number of Departure Flights | Workflows | departures count | count |
| 39 | Ratio between Time passed and number of actions performed by the user | Workflows | tasks pace of the user | ratio |
| 40 | Number of Arrival Flights | Workflows | number of arrival flights | count |
| 41 | How many loops | Workflows | number of loops per task | count |
| 42 | How many different steps | Workflows | Nr. of different steps | count |
| 43 | Number of switches between handling arrival and departure flights | Workflows | Nr. of switches between handling arrival and departure flights | count |

Table 5 - Summary of the most important metrics.
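The metrics above are combined on a per-minute basis. One way such a per-minute alignment could look is sketched below; the event format, the metric names and the choice to average each bin are assumptions for illustration.

```python
from collections import defaultdict

def per_minute(events):
    """Bucket (timestamp_seconds, metric_name, value) samples into minute
    bins, so heterogeneous sensors can be compared on a common time base."""
    bins = defaultdict(lambda: defaultdict(list))
    for t, name, value in events:
        bins[int(t // 60)][name].append(value)
    # reduce each bin by averaging its samples (a sum could be used instead
    # for pure event counts)
    return {m: {name: sum(vs) / len(vs) for name, vs in row.items()}
            for m, row in bins.items()}

events = [(5, "heart_bpm", 72), (30, "heart_bpm", 78),
          (42, "mouse_clicks", 1), (65, "heart_bpm", 90)]
print(per_minute(events))
```

After this step, every metric of Table 5 becomes one value per minute, which is the granularity used in the visualizations that follow.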

4.2.1 Sixth Sense Prototype Framework for Data Exploration

We tuned one of the Fraunhofer Austria desktop application prototypes towards the visualization and data exploration needs of the Sixth Sense project.

Figure 30 - Overview of the Sixth Sense Desktop Application Prototype

The main features of the Sixth Sense desktop application prototype framework are:

Replay of all information topics (on-line data analysis and replay)

Graph database storage (e.g., storage: body posture, work-flow step, AOI, action)

Prediction engine training (e.g., experiment on giving recommendations about possible next handling steps for an air-plane)

On-line complex event processing (CEP) and dynamic change of correlation filters

Analysis and visualization of the air-planes arrival and departure work-flow tasks and times (showing any repetition loops)


Real-time plot of Eye-tracking and mouse current positions

Visualization of interaction metrics, e.g., current user pace (current effort)

Awareness dashboard with thresholds for cumulated departures or arrivals

Web observation platform using Web-sockets and D3.js

Real-time representation in a time line of the supervisors, observers and experts annotations (stress level report, negative and positive observations)

Handling of voice recognition data from communications between pilots and air traffic controllers (similar to “think aloud protocol”)

Export of datasets for analysis in other tools

We also added to the desktop application prototype the capability of analysis and plotting in real time, during the experiment or during the data replay. Therefore, we could analyse in real time the current eye and mouse focus as well as the pace, type and categorization of the current user interactions. We are also able to register and monitor all decision activities in a graph database for later analysis. In Figure 30, we see an overview of the Sixth Sense prototype application.

To detect the demand on the users, we experimented with analysing the pace of the user (in terms of mouse and eye decisions) in order to automatically calculate the current pace of the user (see Figure 30). The test formula for this calculation uses the number of mouse movements, left clicks and eye-tracker fixations.
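The report does not state the exact test formula, so the sketch below only illustrates the idea: a weighted combination of the three named inputs per minute. The weights are purely hypothetical.

```python
# Hypothetical interaction-pace formula combining the three inputs the
# report names: mouse movements, left clicks and eye-tracker fixations.
# The weights are assumptions, not the project's actual coefficients.

def interaction_pace(mouse_moves: int, left_clicks: int, fixations: int,
                     w_move: float = 0.2, w_click: float = 1.0,
                     w_fix: float = 0.5) -> float:
    """Rough per-minute interaction pace from mouse and eye activity."""
    return w_move * mouse_moves + w_click * left_clicks + w_fix * fixations

print(interaction_pace(mouse_moves=120, left_clicks=8, fixations=40))
```

In practice the weights would have to be calibrated per user, since baseline mouse and gaze activity differ between individuals.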

In the future, with more experience regarding the contribution of each metric, we can use this interaction pace calculation together with the monitoring of the ATC workflow steps to detect moments of high workload and take the necessary preventive measures (display warnings or make recommendations).


Figure 31 - Screenshot of the Sixth Sense desktop application UI Action Pace Calculator.

Figure 32 - Sixth Sense desktop application UI Actions Types Monitor.

We also implemented the capability of automatically detecting the current top-most user actions in real time. We believe this can also be used to help infer the current load on the user (e.g., is the user moving the mouse too much? Is the user moving many strips?). In Figure 32 we show an example of what this capability looks like.

Our software framework can also be used to extract fully interactive behavioural HTML5-based reports that include explorative capabilities around different metrics. We can observe the relation in time between mouse clicks, voice call sign recognition and the number of user interaction events per minute. The supervisor can easily print the report in its current explorative state.


Figure 33 - Sixth Sense web based reports for supervisors data exploration, also printable.

The prototype is still a work in progress, but a number of features could already be used for exploration and visualization in Section 4.2.3 to answer some of the research questions or to support the initial findings.

4.2.2 Categorization of Metrics regarding mental Aspects

Literature suggests that a combination of different measures assessing the same mental aspect, e.g. workload, can lead to more robust results than considering each measure on its own ([2]; [3]). We therefore grouped the most important metrics into four categories that represent certain factors known to be related to operator performance: task load, mental workload, attention and behaviour (see Table 6).

As these metrics may influence performance they can be regarded as independent variables whereas the performance measures serve as dependent variables.

The identification of relevant categories and allocation of metrics to these categories was based on literature findings. Each category is described in more detail in the following sections.


Categories

| task load | mental workload | attention | other metrics* | performance |
| nr. of arrival flights | fixation frequency / duration | nr. of changes in AOIs per time unit | nr. of mouse clicks | error messages |
| nr. of departure flights | heart rate measures | fixation duration on AOIs | nr. of callsigns / number of communications | nr. of tasks completed / not completed |
| nr. of task switches | subjective stress levels | standard deviation of fixation duration | nr. of words used | time of task completion |

*) includes behavioural metrics not assigned to certain user states due to lack of literature findings

Table 6 - Classification of most important metrics into categories.

Task load

Studies have shown that task demand characteristics have an influence on workload and performance (e.g. [4]). Often the terms task load and workload are used synonymously. However, as Rohmert [5] stated, individual characteristics (e.g. the operator’s experience or ability) determine the degree to which task demands impact workload and performance. That is why the same task load need not result in the same level of workload for each individual (see Figure 34). This definition of task load and workload is also part of the ISO norm DIN EN 10 075-1 (2000) [6].

Figure 34 - Distinction between task load and workload (Hilburn & Jorna, 2001)

According to the cognitive task load model of Neerincx [7], three dimensions of task load can be distinguished: time occupied, level of information processing and task-set switches. DeGreef & Arciszewski (2009) [8] describe that ‘time occupied’ can be reflected by the volume of information processing, which is likely to be proportional to the number of objects present. The level of information processing can be represented by the complexity of the situation, and task-set switching can be indicated by the number of different objects and tasks. Based on this classification, important metrics for task load in the experiment would be the number of arrival and departure flights that have to be handled by the operator. Task-set switching could be extracted from the number of switches between handling arrival and departure flights.
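The task-set switching metric mentioned last can be computed directly from the sequence of handled flights. A minimal sketch follows; the flight-record format and the call signs are assumptions for illustration.

```python
def count_task_switches(handled_flights):
    """Count switches between handling arrival and departure flights,
    one indicator of task-set switching in Neerincx's task load model."""
    kinds = [kind for kind, _callsign in handled_flights]
    return sum(1 for prev, cur in zip(kinds, kinds[1:]) if prev != cur)

sequence = [("arrival", "DLH123"), ("arrival", "AUA45"),
            ("departure", "SWR9"), ("arrival", "DLH77"),
            ("departure", "AFR2")]
print(count_task_switches(sequence))  # 3 switches in this sequence
```

Applied per minute, this yields metric 43 of Table 5.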

Mental Workload

Research indicates that the performance of the operator is likely to decrease if mental workload is either too high or too low (e.g., Hancock & Chignell, 1986 [9]; Veltman & Jansen, 2003 [10]). Thus, the relationship between workload and performance does not seem to be linear but rather resembles an inverted U-shape, as visualized in Figure 35.


Mental workload can be assessed empirically in several ways, including self-rating methods, physiological measures and behavioural measures. One method used in the experiment is the detection of workload by heart rate metrics, e.g. beats per minute, inter-beat interval, the RMSSD and the Z-score. Literature suggests that heart rate is sensitive to different levels of workload (e.g. Roscoe, 1992 [11]; Veltman & Gaillard, 1998 [12]; Mulder et al., 2007 [13]). There are also studies in the domain of air traffic control indicating increases in heart rate with higher task demands (e.g. Costa, 1993; Rose & Fogg, 1993). However, it can also be affected by other factors such as the emotional state (e.g. anxiety) or fatigue, which reduces its diagnostic value (Manzey, 1998 [14]). Therefore it seems reasonable to combine this measure with other indicators of workload. There are also metrics that could be extracted from the eye tracker data which can serve as indicators for workload. For example, studies of van Orden et al. (2001) [15] suggest that fixation duration and fixation frequency can be sensitive measures for (visual) workload. Besides the physiological assessment of workload, it was also assessed by a subjective rating every 10 minutes during the experiment.
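Two of the heart rate metrics named above, the RMSSD and the Z-score, can be computed from the raw series as follows. This is a sketch: the sample values, units (inter-beat intervals in milliseconds) and windowing details are assumptions.

```python
from math import sqrt
from statistics import mean, pstdev

def rmssd(ibi_ms):
    """Root mean square of successive differences of inter-beat intervals
    (a common time-domain measure of heart rate variability)."""
    diffs = [b - a for a, b in zip(ibi_ms, ibi_ms[1:])]
    return sqrt(sum(d * d for d in diffs) / len(diffs))

def z_scores(bpm_series):
    """Standardize a heart-rate series against its own mean and SD, so
    deviations from the user's baseline become comparable across users."""
    mu, sd = mean(bpm_series), pstdev(bpm_series)
    return [(x - mu) / sd for x in bpm_series]

ibi = [800, 810, 790, 805, 795]  # illustrative inter-beat intervals in ms
print(round(rmssd(ibi), 1))
```

Standardizing per user is what makes the Z-score useful here: a value of +2 means "unusually fast for this person", regardless of the individual baseline.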

Figure 35 - Relationship between workload and performance (Veltman & Jansen, 2003)

Attention

Performance can be decreased by lack of attentional resources (Young & Stanton, 2002 [16]) and also by “inattentional blindness” (Mack & Rock, 1998 [17]). Inattentional blindness means that unexpected events are not noticed because attention is engaged in another task.

Analysing what is fixated by the user has generally been considered a good way to determine attentional focus allocation. For example Just & Carpenter, 1980 [18] formulated the eye-mind hypothesis assuming that what is being fixated is also what is being processed. This assumption has been criticized, as it is also possible to voluntarily divert attention elsewhere while fixating a specific area. Nonetheless fixation analysis seems to be beneficial as users usually direct their gaze where they can find the most useful pieces of information (Bellenkes, Wickens & Kramer, 1997 [19]).

In our experiments, several Areas of Interest (AOI) were defined in order to analyse which part of the screen is fixated by the user at a specific time. Measures referring to the distribution of fixations on the AOI are for example the number of AOI fixated per time unit, the number of switches between AOI per time unit and the fixation duration on each AOI per time unit.
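These fixation-distribution measures can be derived from a per-time-unit list of fixations. A minimal sketch follows; the AOI names and the record format are hypothetical.

```python
def aoi_metrics(fixations):
    """Per-window AOI metrics from (aoi_name, duration_ms) fixations:
    distinct AOIs fixated, switches between AOIs, and dwell time per AOI."""
    names = [aoi for aoi, _ in fixations]
    switches = sum(1 for a, b in zip(names, names[1:]) if a != b)
    dwell = {}
    for aoi, dur in fixations:
        dwell[aoi] = dwell.get(aoi, 0) + dur
    return {"distinct_aois": len(set(names)),
            "aoi_switches": switches,
            "dwell_ms": dwell}

fixes = [("radar", 300), ("radar", 250), ("strips", 400), ("radar", 200)]
print(aoi_metrics(fixes))
```

Evaluated once per minute, this yields the three AOI measures named above on the same time base as the other sensor data.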

Other Metrics

This category refers to metrics of behavioural responses or actions of the user that might be related to task load and user states. It includes metrics such as the number of mouse clicks, the number of communications or the number of words spoken. As their relationship to certain mental aspects has rarely been investigated in the literature, they are not assigned to one specific category.

What the literature suggests is that deviations from the normal behaviour of an individual can indicate situations of high workload or overload. Promising results could be found for scan patterns (Tole, 1983 [20]) and also for operating procedures in air traffic control tasks (Sperandio, 1978 [21]). Although these findings refer to rather complex behavioural patterns, it seems likely that conditions of high task load are also linked to changes in simpler behavioural responses such as the number of mouse clicks or the number of communications. In order to investigate this assumption we also consider these metrics as potential user state indicators.

Performance

Performance can be assessed by measures of reaction time/time spent on task completion, accuracy or number of errors. Errors may be the most important measure of performance as they can be safety-critical and cost-intensive. In the experimental study, errors were detected by observations both during the experiment by the supervisor and the observer as well as post hoc by a domain expert watching a scenario replay.

We decided to merge all the observation lists into one unique observation list that combines the experts’ video analysis (containing negative, positive and neutral observations) with the supervisor’s and observers’ reports taken during the experiments (marked with an extra X, e.g. NegativeX). This is the most plausible way to integrate the experts’ knowledge with observational knowledge. All observations were classified (from -5 to 5) and rechecked by the ATC experts in the consortium.
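The merging step described above can be sketched as follows. The record fields are hypothetical, but the "X" suffix for live reports and the -5 to 5 impact range follow the report.

```python
def merge_observations(expert, live):
    """Merge the experts' video-analysis observations with the supervisor's
    and observer's live reports into one list sorted by time. Live entries
    get an 'X' suffix on their type (e.g. NegativeX); impact is -5..5."""
    merged = list(expert)
    merged += [{**obs, "type": obs["type"] + "X"} for obs in live]
    merged.sort(key=lambda obs: obs["time"])
    return merged

expert = [{"time": 120, "type": "Negative", "impact": -3}]
live = [{"time": 90, "type": "Positive", "impact": 2}]
print(merge_observations(expert, live))
```

Keeping the suffix preserves the provenance of each entry, so live reports can later be weighted differently from the post-hoc expert analysis if needed.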

In Figure 36 we can see the types of events from the observation lists with the greatest impact on the overall performance of the users. The stress level report was done by asking the ATCO every 10 minutes and was classified with an impact of 0. Please note that the observations list is not a direct result of errors solely made by the air traffic controller, but an inherent result of the ATCO’s interactions with the systems. In some parts of the experiment, the pilots also forced the air traffic controller into higher levels of workload or stress, for example by not complying immediately with instructions given by the air traffic controllers.

Figure 36 - Events from observation list with high impact on the performance of the users.


Figure 37 - Events from observation list with the greatest impact on the performance of each user.

4.2.3 Research Questions

With this background the consortium defined general research questions, such as:

How can the usability of the user interfaces be improved?

How can the main causes that lead to mistakes be detected (e.g., using air traffic info, eye tracker, mouse, heart rate data, body pose)?

What are the hidden data signs that we can incorporate in an automated system to detect and predict the user’s next actions, or to predict when a user is in a high-workload situation or is about to make a mistake?

What are the unknown factors that contribute to higher stress levels or to the lack of situational awareness?

Can air traffic information be combined with sensor information to improve the detection and classification?

Based on these general research questions we created a table with more detailed and more concrete main research questions that allow us to guide our analysis, visualization and exploration of the data in search of good predictors of the users' behaviours. The answers to these questions will lead us towards the aim of the Sixth Sense project. The research questions are sorted according to the categories of mental aspects (see Section 4.2.2): task load, mental workload, attention, behaviour.


ID Research Questions (RQ)

Relation between task load and workload/performance

1 We believe the higher the task load, the higher the mental workload will be. Are the departures and arrivals per minute related to stress and changes in the heart rate variability?

2 Does the number of taxi-in airplanes at a given time influence/increase the stress level?

3 Does the occurrence of errors (negative observations) increase with higher task load?

Relation between workload and performance

4 Does the occurrence of errors (negative observations) increase with higher workload?

5 Can the heart rate variability be a good indicator for user mistakes?

Relation between attention and performance/behaviour

6 Can an excessive demand be detected based on the number of (or time spent on) areas of interest?

7 When there is an increase in the number of Eye AOI fixations, is there also an increase in the number of Mouse AOI fixations, because there is a relation between eye and mouse work?

Relation between behaviour and workload / performance

8 When there is an error, does the user increase eye/mouse movements in order to scan the user interface? Are pauses in the mouse movement activity linked to high workload? Can we show this with our data?

9 When the user is about to make an error, is there an increase in the mouse AOI fixation time?

10 What are the eye and mouse scan path patterns of the users when they are about to make mistakes? Are these distinctive enough?

11 Kinect Data – Is there the possibility of error detection due to the correlation between air traffic information and the body posture?

12 Is the number of clicks, mouse movements or AOIs related to the occurrence of errors?

13 Can we show how fast the utterances were spoken? Can this metric be utilized to detect periods with more negative observations?

Additional research questions

14 Is there a possibility of error detection due to mismatches in the correlation between eye-tracking and voice (call signs) information? Is there a relation between occurrence of errors and an increase in the number of words used in the communications?

15 Can we report what the users' most preferred eye and mouse scanning sequences are?

Table 7 - List of Main Research Questions


4.2.4 Data Exploration and Analysis

In this section we answer each research question (RQ) from the summary table of research questions (see Table 7) by exploration and analysis of the data.

RQ 1: We believe the higher the task load, the higher the mental workload will be. Are the departures and arrivals per minute related to stress and changes in the heart rate variability?

To answer this question we used the summarized metrics for task load (number of departures and arrivals) and for mental workload (fixation frequencies, heart rate measures and reported stress levels), and we correlated these data with negative observations.

Regarding the number of departures and arrivals vs observations: the users appear to have more negative observations in the middle of the experiment, which may be related to the time already spent performing the experiment. The data also shows that the users sometimes have negative observations after a period of more intense handling of departures or arrivals.

Often the users only start to have a higher count of negative observations when the arrival/departure intensity decreases. We do not yet know why, but it could be because the user had a high peak of workload and then starts to make more mistakes. These peaks of higher workload (more arrivals and departures to be handled) might therefore be a good predictor for when the number of errors will increase.
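As a rough sketch of the peak-based predictor suggested above, one could bin traffic per minute and measure how often negative observations fall shortly after a traffic peak. The function name, threshold and lag are illustrative assumptions, not values from the experiment.

```python
# Hypothetical sketch: bin departures/arrivals per minute and check whether
# negative observations tend to follow a traffic peak. Timestamps are minute
# indices; the data values are illustrative, not from the experiment.

def follows_peak(traffic_per_min, neg_minutes, threshold, lag=3):
    """Fraction of negative observations occurring within `lag` minutes
    after a minute whose traffic count reaches `threshold`."""
    peak_minutes = {m for m, n in enumerate(traffic_per_min) if n >= threshold}
    hits = sum(
        1 for t in neg_minutes
        if any((t - d) in peak_minutes for d in range(1, lag + 1))
    )
    return hits / len(neg_minutes) if neg_minutes else 0.0

traffic = [1, 1, 3, 4, 1, 0, 1, 2]   # movements handled per minute
negatives = [5, 6]                   # minutes with negative observations
ratio = follows_peak(traffic, negatives, threshold=3)
```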

Figure 38 - Departures and arrivals (green area) vs number of negative observations (red)

RQ 2: Does the number of taxi-in airplanes at a given time influence/increase the stress level?

We found indications of a relation between the number of taxi-in airplanes (arrivals in the strip bays of the user interface of our ATM tower simulator) and the reporting of higher stress levels by the users during the exercise. The same was observed in the data regarding the number of airplanes waiting for departure in the strip bays.



Figure 39 - Interdependence between arriving airplanes, departures and stress levels.

The stress levels (red) in Figure 39, reported every 10 minutes by the user, are consistently higher when there is an increase in arriving or departing airplanes. This interdependence may also be caused by other factors such as time accumulation, fatigue or the number of visual objects to be handled. For a more reliable statement we would need to design a different type of experiment, but it could be an interesting and promising indicator.

RQ 3: Does the occurrence of errors (negative observations) increase with higher task load?

The same conclusions as for RQ 2 are valid here.

In our experiment arrivals occurred at a very constant rate of roughly one airplane per minute, so we focused instead on the total number of airplanes to be managed per minute (arrivals and departures).

We observed that the number of airplanes for departure influenced the attention of the users. Certainly the number of strips in each bay (and therefore the number of visual UI objects to be managed at a given time) plays an important role in splitting the users' attention.

The periods when the mouse stops also seem to coincide with negative observations. Mouse pause times could thus be used as an additional indicator.

RQ 4: Does the occurrence of errors (negative observations) increase with higher workload?

In the heart rate variability data (for user4, user6 and user8) we could observe the following: if we cross-check the Z-Score IBI heart rate variability with the negative observations in the observation list, then every time before an increase in severe negative observations there is a steep descent (lower heart rate variability) in the Z-Score IBI values.

This matches the literature, which states that heart rate variability can be a good indicator of high stress.

We think that the slope of the line plot (steep descent or steep climb) could be a good predictor for moments of high stress and for the detection of intervals in which negative observations are more prone to occur. Combined with the monitoring of negative Z-Score IBI values, this can help detect negative situations.
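A minimal sketch of the slope idea, assuming evenly spaced IBI samples: z-score the series and flag positions where the windowed slope descends steeply. The window size, slope threshold and sample values are assumptions for illustration.

```python
# Sketch of the slope-based predictor: z-score the IBI series, estimate the
# slope over a short sliding window, and flag steep descents as candidate
# high-stress moments. Window and threshold are assumptions, not values
# from the report.
from statistics import mean, stdev

def steep_descents(ibi, window=3, slope_threshold=-0.5):
    """Return indices where the z-scored IBI drops steeply."""
    mu, sigma = mean(ibi), stdev(ibi)
    z = [(x - mu) / sigma for x in ibi]
    flagged = []
    for i in range(window, len(z)):
        slope = (z[i] - z[i - window]) / window   # simple finite difference
        if slope <= slope_threshold:
            flagged.append(i)
    return flagged

ibi = [800, 810, 805, 790, 700, 650, 660, 700]  # illustrative IBI values (ms)
alerts = steep_descents(ibi)
```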



Figure 40 - Correlation between negative observations and HRV. HRV is a good indicator for periods of negative observations.

The dotted vertical reference lines in the figure above mark the moments when the heart rate variability decreases (negative changes): usually such an event is followed by an increase in negative observations.

We could not confirm from the data that the occurrence of positive observations (annotated by the experts as efficient efforts to solve a negative situation) significantly lowers the occurrence of further negative observations. By this we mean that these positive events occur in parallel with other negative observations; although there is a distinct increase in the heart rate variability, these positive peaks are normally also associated with negative events.

It would be worthwhile to analyse whether a quick change (a steep slope of the Z-Score IBI values) is still associated with the occurrence of negative observations when the user appears to be relaxed and not in a high-stress situation.

(Figure 40 plots the Avg. Z-Score IBI per minute for user6, with Negative and NegativeX observation markers and an annotated steep angle of positive change in the Z-Score IBI value.)


Figure 41 - Relation between mouse AOI frequencies, observation list and HRV.

We could use both negative and positive changes in the heart rate variability to create a model that detects periods of high stress associated with the occurrence of more negative observations.

We also analysed the mental workload in terms of fixation frequency and fixation duration, correlated with the occurrence of negative observations. We analysed this correlation for both eye and mouse frequency (in terms of AOIs visited in a certain period of time, in our case one-minute bins).

We can observe when the user is "fighting" to solve a problem, and that before an error there is a moment of reduced mouse movement activity (Figure 41).

RQ 5: Can the heart rate variability be a good indicator for user mistakes?

From the analysis and data exploration done so far, we believe that the heart rate, together with the reduction in mouse activity, the number of visual UI objects to be managed (e.g. flight strips), and the eye-tracking AOI frequency and duration, provides very good clues for anticipating moments of stress, high workload and the occurrence of negative observations.

However, the mouse data seems to be more distinctive than the eye-tracking data for detecting periods of negative observations. The eye-tracker data, on the other hand, gave us clues about the probabilities of the user's next AOI and sequence of actions.

(Figure 41 plots, for user4, user6 and user8, the count of mouse AOI visits per one-minute bin together with the Avg. Z-Score IBI and Positive/Negative observation markers.)


RQ 6: Can an excessive demand be detected based on the number of (or time spent on) areas of interest?

To answer this question we analysed, for each user, the top areas of interest that received the most attention. In Figure 42 and Figure 43 we can observe two examples (one for mouse AOIs and one for eye AOIs) showing the AOIs that received the most interest from user7.

Figure 42 - Mouse AOI of user7 that received most interest time during the experiment.

Figure 43 - Eye AOI of user7 that received most interest time during the experiment.

(Data underlying Figure 42: average interest length per mouse AOI for user7: pendingdepartures 13.794; taxiin 9.445; radarTR 9.105; taxiout 4.023; radarBR 2.807; leftpanel 2.424; onblock 2.272.)


We created a summary table to show the AOI preferences of each user throughout the experiments (see Table 8). As we can observe, not all users handle the air traffic the same way (workflow steps, communications, and preferences in order of execution and dispatch) or behave in the same way.

AOI  User2 User4 User5 User6  User7  User8

Mouse 

Handoverrunway        x        x 

Startuppushback#taxiout                 x 

taxiin#onblock                 x 

pendindepartures  x        x  x    

taxiin           x       

radarTR        x     x    

radarBR#radarBL     x             

radarTL     x             

taxiin     x           x 

taxiin#taxiout  x                

taxiout                 x 

Startuppushback                 x 

Eye 

pendingdepartures              x  x 

toppanel                 x 

Startuppushback#taxiout              x  x 

startuppushback                 x 

taxiin        x          

radarTL#radarBL                   

radarTL#radarTR           x       

pendindepartures#startuppushback          x       

HandoverRunway        x          

taxiout        x          

radarBL        x

Table 8 - Summary of the AOI that received the most interest time from each user.

We could observe some AOIs that received more attention from the users in general (e.g., the "pendingdepartures" strip bay for three users).

We come back to RQ 6 with the methods of Section 4.4.


RQ 7: When there is an increase in the number of Eye AOI fixations, is there also an increase in the number of Mouse AOI fixations, because there is a relation between eye and mouse work?

As we can see from the previous figure, most of the time it seems to be the other way around: if the number of eye AOIs increases, the number of mouse AOIs decreases. This fits well with the literature, which states that a decrease in the number of mouse movements is linked to a higher workload.

Figure 44 - Window standard deviation (2 minutes) and number of errors, capturing very well the periods with increased user errors.

However, it also seems evident that there are periods when the mouse activity follows the pattern of the eyes. This seems to be linked to the moments when the user makes more errors and tries to solve the problem. This is especially visible if we take into consideration the window standard deviation (last 2 minutes), as we can observe in Figure 44.
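The 2-minute window standard deviation can be sketched as a rolling statistic over per-minute AOI visit counts. The idea follows Figure 44, but the counts below are illustrative, not experiment data.

```python
# Sketch of the 2-minute window standard deviation used in Figure 44:
# per-minute AOI visit counts with a rolling (population) std over the
# last two samples. Pure-stdlib version; counts are illustrative.
from statistics import pstdev

def window_std(counts, window=2):
    """Rolling population standard deviation over the last `window` bins."""
    out = []
    for i in range(len(counts)):
        lo = max(0, i - window + 1)
        # Need at least two samples in the window for a std value.
        out.append(pstdev(counts[lo:i + 1]) if i - lo >= 1 else 0.0)
    return out

eye_counts = [150, 148, 40, 45, 160, 155]   # eye AOI visits per minute
spikes = window_std(eye_counts)             # large values mark abrupt changes
```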

RQ 8: When there is an error, does the user increase eye/mouse movements in order to scan the user interface? Are pauses in the mouse movement activity linked to high workload? Can we show this with our data?

We investigated the relation between the occurrence of errors and the increase of eye or mouse events. According to the literature, pauses in mouse movement are known to be linked with high-workload periods when working with user interfaces. The data indicates a possible link between reductions in mouse movement and increases in eye movement that coincide with the occurrence of negative observations (errors indicated by the experts). If this is true, it could help us in the future to create an algorithm that is able to detect or predict error periods. In Figure 45 we visualize the fixation frequency changes (represented by lines) together with what is happening in the observation data (red bar plots).

(Figure 44 plots, for user8, the eye and mouse window standard deviation (Window_STD_Dev) per one-minute bin against Negative/NegativeX observations, with annotated intervals where a decrease in mouse movements is followed by an increase in eye movements.)


Figure 45 - Relation between eye and mouse movements (AOI visits) and occurrence of errors.

RQ 9: When the user is about to make an error, is there an increase in the mouse AOI fixation time?

The same conclusions as for RQ 8 apply here when we analyse the fixation time instead. We also created visualizations in which we filtered the data to include only mouse-move events, so we could see whether the user was moving the mouse when negative observations occurred. The conclusion is that the users were never moving the mouse at these negative moments; they really stopped moving the mouse, probably to analyse the current situation.
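The mouse-pause analysis described above amounts to gap detection over mouse-move timestamps. A minimal sketch, where the 5-second threshold is an assumption:

```python
# Given timestamps of mouse-move events (seconds), find gaps longer than a
# threshold: the moments when the user "stopped moving the mouse".

def mouse_pauses(move_times, min_gap=5.0):
    """Return (start, end) intervals with no mouse movement."""
    pauses = []
    for prev, cur in zip(move_times, move_times[1:]):
        if cur - prev >= min_gap:
            pauses.append((prev, cur))
    return pauses

times = [0.0, 0.4, 0.9, 8.0, 8.3, 20.0]  # illustrative event timestamps
gaps = mouse_pauses(times)
```

These pause intervals could then be intersected with the observation-list minutes to test the coincidence reported above.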

RQ 10: What are the eye and mouse scan path patterns of the users when they are about to make mistakes? Are these distinctive enough?

To analyse the eye and mouse scan path patterns of the users, we parameterized a stochastic state model (VLMM) as described in Section 4.4.

By selecting a timeframe around an observed error, specific states could be identified. However, by analysing where else these states occur, we did not find that they occur significantly more often near errors.

We think the scanning paths need to be analysed in more detail by ATC experts to find additional measures that, combined with the state sequence, give an indication of errors.
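For readers unfamiliar with VLMMs, the core idea, counting successors for contexts of increasing length and predicting from the longest matching context, can be sketched as below. This illustrates the principle only; it is not the parameterization used in Section 4.4, and the AOI names are invented examples.

```python
# Minimal sketch of the variable-length Markov model (VLMM) idea used for
# scan-path analysis: count AOI successor frequencies for contexts of
# increasing length, then predict from the longest context seen so far.
from collections import defaultdict

def train_vlmm(sequence, max_depth=3):
    """Count successors for every context of length 1..max_depth."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(sequence)):
        for d in range(1, max_depth + 1):
            if i - d < 0:
                break
            context = tuple(sequence[i - d:i])
            counts[context][sequence[i]] += 1
    return counts

def predict(counts, history, max_depth=3):
    """Predict the next AOI from the longest matching context."""
    for d in range(min(max_depth, len(history)), 0, -1):
        context = tuple(history[-d:])
        if context in counts:
            succ = counts[context]
            return max(succ, key=succ.get)
    return None

scan = ["radar", "taxiin", "pending", "radar", "taxiin", "pending"]
model = train_vlmm(scan)
nxt = predict(model, ["radar", "taxiin"])
```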

RQ 11: Kinect Data – Is there the possibility of error detection due to the correlation between air traffic information and the body posture?

The head pose provides information about the angles of a user's head. With these values one can calculate how far away or how near a person's head is during the experiment, and with the head pose information we also know the tilt of the head at a specific point in time in the experiment.

Figure 46 - Kinect Head Pose Measurements Schema.

“Head Coordinate State” in the Kinect data
- Head Coordinate State indicates the position of the person's head (1 is max down, 9 is max up, 0 is not specified).

Head Rotation State Left/Right (0-9)
- The left/right value indicates how much a person has turned his head left or right (1 is max left, 9 is max right, 0 is not specified).

Head Rotation State Up/Down (0-9)
- The up/down value indicates how much a person has turned his head up or down (1 is max down, 9 is max up, 0 is not specified).

Sound source angle and microphone beam angle
- Sound source angle: the angle (in degrees) of the direction from which the sound is arriving (direction of a sound source).
- Beam angle: the angle (in degrees) of the direction the sensor is set for listening.

What can we infer by using the Head Coordinate (x, y and z) or the Head Pose?
- The head coordinates x, y and z give the person's 3D position in the room (values are in metres):
  Z > 0: distance of the user; Z = 0 (minimum): nearest point to the Kinect.
  X > 0: right; X < 0: left; X = 0: centre position.
  Y > 0: up; Y < 0: down; Y = 0: centre position.

Table 9 - Summary of parameters for the Kinect Head Pose
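Working with the head coordinates of Table 9 can be sketched as follows: derive the head's distance from the sensor and a coarse horizontal position label. The helper name and the 0.1 m dead zone around the centre are assumptions for illustration.

```python
# Sketch using the Kinect head coordinates from Table 9 (metres): distance
# to the sensor plus a coarse left/centre/right label from the x axis.
import math

def head_summary(x, y, z):
    """Distance to the sensor and a coarse horizontal position label."""
    distance = math.sqrt(x * x + y * y + z * z)
    if x > 0.1:
        side = "right"
    elif x < -0.1:
        side = "left"
    else:
        side = "centre"
    return round(distance, 3), side

dist, side = head_summary(0.0, 0.2, 1.5)
```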

The calibration of the Kinect sensor for the detection of head and body postures proved difficult. In our two experiments we could only collect Kinect data for user1, user3 and user5; we would therefore need more tests and more data to derive definitive conclusions.

We collected data about the Head Coordinate State, head coordinates, head pose coordinates, Head Rotation State (left, right, up, down), Microphone Beam Angle, Sound Source Angle, User in Range and User Tracked.


Figure 47 - Kinect Data Representation, visualizing Detected Head Pose vs Count of Negative/Positive Observations vs Type of Observation vs User in Range (or not in range).

However, even using only the currently available data, we have found the Head Coordinate State and User in Range variables very promising for implementing a future error-predictive system (see Figure 47).

As we can also observe, by considering only the samples with Head Coordinate State = 0, 2, 4, 5, 6, 7 or 9 (we removed states 1, 3 and 8) and Sound Source Angle between -26.6 and 36, we could account for (include in the same time interval) at least 96% of all negative observations reported by the experts, as can be observed in Figure 48. This means we can envision using our CEP filtering mechanisms to greatly reduce the amount of data that needs to be processed.

This can be achieved by filtering out everything that might not be relevant, or at least filtering out data that can be processed at a later stage with more complex and slower methods, allowing the most interesting data to be analysed immediately using simpler and faster methods.
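The filter described above can be sketched as a simple predicate plus a coverage check. The allowed states and the angle band are taken from the text; the sample records and the minute-based record layout are illustrative assumptions.

```python
# Sketch of the CEP-style pre-filter: keep only samples whose Head
# Coordinate State is in the allowed set and whose Sound Source Angle lies
# in the stated band, then measure what fraction of negative-observation
# minutes the remaining samples still cover.
ALLOWED_STATES = {0, 2, 4, 5, 6, 7, 9}   # states 1, 3 and 8 removed
ANGLE_RANGE = (-26.6, 36.0)              # sound source angle band (degrees)

def passes_filter(state, angle):
    return state in ALLOWED_STATES and ANGLE_RANGE[0] <= angle <= ANGLE_RANGE[1]

def negative_coverage(samples, negative_minutes):
    """Fraction of negative-observation minutes that survive the filter."""
    kept = {m for m, state, angle in samples if passes_filter(state, angle)}
    covered = sum(1 for m in negative_minutes if m in kept)
    return covered / len(negative_minutes) if negative_minutes else 0.0

samples = [(10, 4, 0.0), (11, 3, 0.0), (12, 6, -30.0), (13, 7, 20.0)]
coverage = negative_coverage(samples, [10, 11, 13])
```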

(Figure 47 lists, for user1, user3 and user5, counts of Negative/Positive observations per detected head pose and observation type, e.g. Arrival Queue greater 3, Taxiway Optimisation, Blocking Situation which delays other planes, After Crossing/Queuing, each marked as In range or Not in range.)


Figure 48 - Kinect Data after applying filters to include only the majority of negative observations (96%).

RQ 12: Is the number of clicks, mouse movements or AOIs related to the occurrence of errors?

By plotting (Figure 49) the total number of mouse clicks (left clicks, mouse-left pressed for drag and drop, and right clicks) together with the total number of negative observations per minute, we could observe an evident relationship between the number of clicks and the number of errors per minute.

It also seems that a small increase in the number of clicks was followed by a large increase in negative observations, while a large increase in mouse click activity was followed by a substantial decrease in the number of negative observations (perhaps meaning that the users were trying to solve difficult situations).
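The minute-to-minute differences plotted in Figure 49 can be sketched as below; the per-minute counts are illustrative, not experiment data.

```python
# Per-minute click counts and negative-observation counts, compared via
# their minute-to-minute differences (as in Figure 49).

def minute_diffs(series):
    """Minute-to-minute change of a per-minute count series."""
    return [b - a for a, b in zip(series, series[1:])]

clicks = [10, 12, 20, 8, 9]       # mouse clicks per minute (illustrative)
negatives = [0, 1, 0, 3, 1]       # negative observations per minute
click_diff = minute_diffs(clicks)
neg_diff = minute_diffs(negatives)
```

Comparing the sign of `click_diff` with `neg_diff` in the same (or next) minute is one simple way to test the lead/lag relation described above.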

(Figure 48 lists, for user1, user3 and user5, the observation types remaining after the filtering, e.g. Blocking Situation which delays other planes, Arrival Queue greater 3, After Crossing/Queuing, Communication, each marked as In range or Not in range.)


Figure 49 - Correlation between total number of mouse clicks and negative observations

There was only one case where this relation appears to be delayed and does not occur within the same minute (see the interval marked "Increase" in Figure 49). Here, an increase in the number of mouse clicks was followed in the next minute by an increase in the number of negative observations. This might be due to normal delays while the users analysed the current situation, were occupied with other tasks, or were distracted. Figure 50 is a general plot taking the values for all users, but the same results hold if we plot every single user separately.

We focused only on the overall patterns that might be used to create a future prediction system, in this case the direct relation between mouse click activity and the occurrence of negative observations.

RQ 13: Can we show how fast the utterances were spoken? Can this be used to detect periods with more negative observations?

By plotting the relationship between the number of spoken words per minute and the number of negative occurrences per minute, we could observe that the users spoke on average between 6 and 8 words per minute.

As we can observe in Figure 50, a big increase in the number of words used by the ATCO seems to point to periods with a high concentration of negative observations, but this does not always happen.
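The words-per-minute metric can be sketched by binning transcript word counts into one-minute bins. The utterances shown are invented examples, not experiment transcripts.

```python
# Bin transcribed utterances (minute, text) into words spoken per minute.
from collections import defaultdict

def words_per_minute(utterances):
    """Sum transcript word counts per one-minute bin."""
    wpm = defaultdict(int)
    for minute, text in utterances:
        wpm[minute] += len(text.split())
    return dict(wpm)

log = [(1, "AUA123 line up runway two seven"), (1, "cleared for takeoff"),
       (2, "DLH456 taxi to holding point")]
rates = words_per_minute(log)
```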



RQ 14: Is there a possibility of error detection due to mismatches in the correlation between eye-tracking and voice (call signs) information? Is there a relation between occurrence of errors and an increase in the number of words used in the communications?

It was not possible to check for mismatches between eye tracking (of callsigns) and the spoken callsigns, for two reasons: the simulator user interface (in the radar area) could not provide information about the callsign of the airplane the user was looking at, and the eye-tracker sensor could not track each small point (airplane) on the radar accurately without additional, improved selection strategies. For this reason it was not possible to cross-check whether the callsigns viewed by the user matched the callsigns spoken by the user (although the voice recognition system recognized the spoken callsigns with a very high degree of accuracy).

In the future we could improve the simulator to provide feedback information when a user is looking at an airplane in the radar area. We could achieve this by creating a selection circle (with a certain threshold area, instead of just a small eye cursor). This selection circle would capture any airplane inside its area, and the simulator would then provide information about the hovered airplanes (similar to normal mouse hovering or selection).
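The selection-circle idea can be sketched in a few lines. This is a minimal illustration, not the simulator's actual API: the airplane positions, callsigns and radius threshold below are made-up values.

```python
import math

def airplanes_in_gaze_circle(gaze, airplanes, radius=30.0):
    """Return callsigns of all airplanes whose radar position lies
    inside a selection circle centred on the current eye-gaze point."""
    gx, gy = gaze
    hits = []
    for callsign, (x, y) in airplanes.items():
        if math.hypot(x - gx, y - gy) <= radius:
            hits.append(callsign)
    return hits

# Hypothetical radar positions in screen pixels.
planes = {"DLH3TP": (410, 220), "CSA543": (432, 208), "AUA12": (700, 90)}
print(airplanes_in_gaze_circle((420, 215), planes))  # the two nearby planes
```

The returned callsigns could then be cross-checked against the callsigns recognized by the voice recognition system.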

However, we could observe a relation between the increase in the number of words used by the air traffic controllers and the occurrence of negative observations.

This seems to always follow the same pattern: often there is a clear decrease in the number of words used, followed by a significant increase in the number of words spoken by the air traffic controllers.

Judging especially by the negative observation descriptions, this seems to be correlated with the worst situations annotated by the experts (such as putting the same airplane on hold several times, resolving runway crossings, or having too many airplanes to resolve in the taxi or departure strip bays).

Figure 50 - Correlation of negative observations and difference in number of words/mouse actions

[Chart for Figure 50: per-minute series (Origin New TS, minutes 0-59) for user6 and user8 showing Count of Positive/Negative, ActionType difference (-179.7 to 143.4) and NumberWords difference (-278 to 963); annotated "less mouse activity", "increased mouse activity", "negative observation", "significantly more spoken words used", "significantly less words used" and "significant reduction of the number of words used before the negative observations".]


RQ 15: Can we report what the users' most preferred eye and mouse scanning sequences are?

We analysed most probable eye and mouse scanning sequences per user (see Section 4.4.1.1).

It seems that, e.g., user6 shows more diversity in the eye scanning patterns used, but this result needs to be analysed in more detail.

Figure 51 - Relationship between number of words spoken and negative observations.

[Chart for Figure 51: per-minute series (OneMinuteBinVoice) for user4, user5, user6 and user8 showing Count of Positive/Negative vs NumberWords, with per-minute word counts, per-user averages (roughly 4 to 13 words) and a moving average of the counts.]


4.3 Hints for H2 - Analysis of the Arrival and Departure Workflows

In contrast to H1, which refers to decisions a controller makes in a single step of the workflow, we refer in H2 to workflow patterns as sub-sequences (steps) of actions the controller performs. These workflow patterns might vary between good and bad controllers.

In order to detect good and bad workflows - and therefore to find hints for H2 - we took into consideration on-block situations, the repetition of certain workflow steps (e.g., the number of runway crossings or holds), and the total processing time of each airplane.

4.3.1 Implementation

Not all the workflow steps needed for this part of the analysis are delivered automatically by the simulation framework and the different data streams from the sensors or the ATM system. Therefore, we used our own automatic workflow detection and analysis component. Time-based, data-stream-oriented applications are used across several fields; one strategy for correlating and extracting information from such data streams is to employ Complex Event Processing (CEP) systems.

CEP combines several events to generate a composite or derived event. These events contain new, meaningful information for studying the underlying process. Furthermore, CEP allows a loose coupling between software components [22].

Figure 52 - Example of using CEP to join two different events into one.
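The event-joining idea of Figure 52 can be illustrated with a minimal sketch. The event names and fields below are illustrative, not the project's actual schema: two raw streams are correlated on a shared key, and a composite event is derived whenever both halves are present.

```python
from collections import defaultdict

def join_events(stream_a, stream_b, key="callsign"):
    """Correlate two raw event streams on a shared key and emit a
    composite (derived) event whenever both halves are present."""
    seen = defaultdict(dict)
    derived = []
    for source, stream in (("A", stream_a), ("B", stream_b)):
        for event in stream:
            k = event[key]
            seen[k][source] = event
            if len(seen[k]) == 2:  # both raw events observed -> derive
                composite = {**seen[k]["A"], **seen[k]["B"], "type": "composite"}
                derived.append(composite)
    return derived

# Hypothetical raw events sharing the callsign key.
radar = [{"callsign": "DLH3TP", "position": (410, 220)}]
voice = [{"callsign": "DLH3TP", "utterance": "taxi to holding point"}]
print(join_events(radar, voice))  # one composite event with position and utterance
```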

In our software prototype, the detection of the workflow steps for every airplane is achieved by making use of a CEP server implementation called NEsper. NEsper allows the registration of queries with the server through a tailored Event Processing Language (EPL) [23]. After the incoming events are separated from the message queue component (by replaying the experiment data automatically), the CEP server consumes the events and triggers, depending on the registered queries, new and more meaningful events.

In our example of automatic workflow detection, the events from the message queue, called FlightObject, contain ATC information about departures and arrivals. These events are accepted and processed by the CEP server.

The information is stored by the ATM systems in the form of XML messages. Beforehand, we register a specific query to automatically process and detect each workflow step. The CEP components detect the FlightObject messages and filter relevant information such as FOID, Callsign or AtcType to realize the workflow detection.

After separating each workflow step in the air traffic message, an event is triggered, coded with the callsign of the airplane. The triggered event can be immediately consumed by our software prototype, and we can visualize the current workflow step of each airplane and even show any step repetition (e.g., the same airplane put on hold more than once). An example of detecting the workflow step TAXI with an EPL query is given in Table 10. The first expression filters relevant information such as the Callsign; the second triggers an event if an airplane is in the workflow step TAXI.


//This sensor counts the number of aircraft in the workflow step TAXI
expr = "Insert into Atc \n" +
       "SELECT Identifier, \n" +
       "fligthObjectPublication.FO.id AS AtcFOID, \n" +
       "fligthObjectPublication.FO.flightPlan.flight_plan.aircraft_identification.identifier AS Callsign, \n" +
       "fligthObjectPublication.FO.departureInfo.runwayId AS AtcRunwayId, \n" +
       "fligthObjectPublication.FO.atcState.role AS AtcRole, \n" +
       "fligthObjectPublication.FO.atcState.type AS AtcType \n" +
       "FROM FlightObjectPublicationLocationSensor";
createStatement("AtcChange", expr);

expr = "create context sepWorkflowTaxi partition by AtcFOID FROM Atc";
createStatement("CEPVariable", expr);

expr = "context sepWorkflowTaxi select Identifier, " +
       "AtcFOID, AtcType, count(AtcType) as countStep, AtcRole FROM Atc \n" +
       "WHERE AtcType = 'TAXI_WITH_TAXI'";
createStatement("WorkflowstepTaxi", expr);

Table 10 – Filter/Query to detect airplanes that are in the workflow step TAXI.
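On the consuming side, the prototype counts how often each airplane enters a workflow step (the server-side counterpart of the count(AtcType) above). A rough sketch of this bookkeeping, where plain dictionaries stand in for the NEsper update-listener callbacks and the field names are illustrative:

```python
from collections import Counter

def track_workflow_steps(events):
    """Count, per callsign, how often each workflow step was entered,
    e.g. to flag an airplane that was put on hold more than once."""
    steps = Counter()
    for ev in events:
        steps[(ev["callsign"], ev["atc_type"])] += 1
    return steps

# Hypothetical derived events as they would arrive from the CEP server.
events = [
    {"callsign": "CSA543", "atc_type": "TAXI_WITH_TAXI"},
    {"callsign": "CSA543", "atc_type": "HOLD"},
    {"callsign": "CSA543", "atc_type": "HOLD"},
    {"callsign": "DLH3TP", "atc_type": "TAXI_WITH_TAXI"},
]
repeated = {k: n for k, n in track_workflow_steps(events).items() if n > 1}
print(repeated)  # {('CSA543', 'HOLD'): 2}
```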

In Figure 53, the entire process of filtering and consuming CEP events that extract and correlate information about each workflow step is represented. Furthermore, the process of data visualization and generation of new data sets is also shown.

Figure 53 - The complete process of consuming, filtering and generating events.

4.3.2 Results of the Analysis of ATC Workflow Steps

The DM/ML/AI module is now able to distinguish the different workflow steps (see Figure 5) performed by the ATCOs for each airplane landing or departure, as well as the number of times a certain airplane is in one of these workflow steps. Additionally, we can now automatically analyse the processing time (the managing time spent by the air traffic controller) of each airplane.

Next, we show examples of metrics that the DM/ML/AI module is able to extract. We start (Figure 54) by showing relations between the processing time (ProcessSeconds) spent handling arrival or departure flights and the occurrence of negative observations with high impact (classified according to the experts).


Figure 54 - Analysis of the Processing Time (seconds) for arrivals (orange/brown) and Departures (Blue) for user8.

As we have stated before, the DM/ML/AI module is also able to automatically filter, correlate and use metrics related to each step of the arrival and departure workflows (achieved through the usage of CEP filters). These metrics are taken from the standard ATCO workflow descriptions (Figure 5). They allow us to discover the best and worst sequences in terms of processing time or repeated workflow steps, and to know the exact current state of each airplane. We can use this information to predict next states and to prepare recommendations regarding the next best actions in advance.

Next, we present screen captures, taken from user8, regarding the performance in handling arrival and departure flights. We can observe the total processing time for handling all tasks related to a specific aircraft, as well as the number of repetitions of each step of the corresponding workflow (arrival or departure).

We used this information to correlate the stress level reports, expert and supervisor observations, and sensor data. The data shows a clear relation between the different options taken by the ATCO - such as the number of times a user puts a flight on hold, or the number of times an airplane has to cross runways - and the total processing time of a flight or the occurrence of negative observations.

In general, the longer a flight takes to process, the higher the negative impact will be (taken from the negative observations). However, in some situations, such as a blocking situation, there is no obvious correlation with processing times or other workflow step metrics (see the blocking situation in Figure 54).

[Chart for Figure 54: per-minute series (Timestamp, minutes 0-59) showing Count of Positive/Negative (0-10), Avg. ProcessSeconds (arrivals 516-1259 s, departures 219-684.7 s) and Avg. Impact (-5 to -3); annotations include "Message: CSA543 is blocked by DLH3TP" and point labels such as "Avg. Impact: -4.500, Count of Positive/Negative: 4, Avg. Arrival ProcessSeconds: 1259.0".]


Figure 55 - DM/ML/AI module with automatically calculated metrics for arrival flights, capturing repeated workflow steps (e.g., number of taxi commands or runway crossings for all flights).

The analysis of correlations between workflow step values and all the other available metrics (performance, sensor data, workload, etc.) is a very challenging research question. In the future we would like to use graph database capabilities to capture and semantically annotate all of a user's decisions (sensor data, steps, clicks), in order to follow the decisions step by step and to perform more exhaustive data analyses and correlations.

4.3.3 Machine Learning Experiments

As an extra step, we also tried to develop experimental models for the detection of outliers, the discovery of patterns, and the creation of prediction models related to the detection of negative observations.

We started by creating a special dataset containing the several metric counters in the intervals between errors. With this new dataset at hand, we applied algorithms that return a set of association rules from a given set of frequent item sets. To this end we focused on apriori-type algorithms, specifically the FP-Growth algorithm, which calculates all frequent item sets from the given example set using an FP-tree data structure (all attributes were converted to binomial). Frequent item sets are groups of items that often appear together in the data; the basics of market-basket analysis are helpful for understanding them.

Association rules are if/then statements that help uncover relationships between seemingly unrelated data. An example of an association rule would be: "If a customer buys eggs, he is 80% likely to also purchase milk." An association rule has two parts, an antecedent (if) and a consequent (then). An antecedent is an item (or item set) found in the data; a consequent is an item (or item set) found in combination with the antecedent. Association rules are created by analysing data for frequent if/then patterns and using the criteria of support and confidence to identify the most important relationships. Support indicates how frequently the items appear in the database; confidence indicates how often the if/then statements have been found to be true. The frequent if/then patterns are mined using operators like the FP-Growth operator; a create-association-rules operator then takes these frequent item sets and generates the association rules. The algorithm tries to find at least the specified number of item sets with the highest support, taking the 'min support' into account - in our case 0.8.
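The support and confidence definitions can be illustrated with a small brute-force sketch. FP-Growth itself builds an FP-tree for efficiency; this naive enumeration only demonstrates the definitions, on made-up transactions, and uses a lower min support (0.6) than the experiments' 0.8 so that a pair rule actually appears.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support=0.8):
    """Return all item sets whose support (fraction of transactions
    containing them) is at least min_support."""
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            support = sum(set(combo) <= t for t in transactions) / n
            if support >= min_support:
                result[combo] = support
    return result

def association_rules(itemsets, transactions):
    """Derive if/then rules with their confidence from frequent item sets."""
    n = len(transactions)
    rules = []
    for combo, support in itemsets.items():
        if len(combo) < 2:
            continue
        for i in range(1, len(combo)):
            for antecedent in combinations(combo, i):
                ante_support = sum(set(antecedent) <= t for t in transactions) / n
                consequent = tuple(sorted(set(combo) - set(antecedent)))
                rules.append((antecedent, consequent, support / ante_support))
    return rules

tx = [{"eggs", "milk"}, {"eggs", "milk"}, {"eggs", "milk", "bread"}, {"eggs"}, {"milk"}]
sets = frequent_itemsets(tx, min_support=0.6)
rules = association_rules(sets, tx)
print(sets)   # eggs and milk alone (support 0.8) plus the pair (support 0.6)
print(rules)  # e.g. eggs -> milk with confidence 0.75
```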


Figure 56 - Discovery of association rules using the FP-Growth algorithm.

Figure 56 shows a summary of the metrics used in our model (they represent occurrence counts between successive negative observations) for one user (user8). In our case these association rules could help us understand the relation between different metrics and the occurrence of errors. Next, in Figure 57, we see an example of a graph visualization of the relation between the different metrics and the discovered association rules.

Figure 57 - Relation Between the discovered association rules and different variables of the model.


We also experimented with the k-NN Global Anomaly Score algorithm, which calculates an outlier score based on a k-nearest-neighbours implementation. By default the outlier score is the average distance to the k nearest neighbours; by setting the corresponding parameter it can instead be the distance to the k-th nearest neighbour, which is similar to the algorithm proposed by Zengyou He et al. (2003) [24]. The higher the outlier score, the more anomalous the instance. The operator is also able to read and write a model containing the set of k nearest neighbours.

Typically, 99% of the execution time is used to compute the neighbours, so it is a good idea to store the model, for example, when looping over a parameter.

The operator checks whether the model and the example set fit together. The model can be used for any of the nearest-neighbour-based algorithms. The parameter k used to create the model needs to be the same as, or larger than, the parameter k specified in the operator; otherwise, the model is re-computed.
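The core of the score - the distance to the k nearest neighbours - can be sketched in a few lines. This is a naive O(n²) version on toy 2D points; the actual operator additionally caches the neighbour model described above.

```python
import math

def knn_anomaly_scores(points, k=2, use_kth=False):
    """Global anomaly score per point: mean distance to the k nearest
    neighbours, or (if use_kth) the distance to the k-th neighbour as
    in He et al. [24]. A higher score means a more anomalous instance."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        neigh = dists[:k]
        scores.append(neigh[-1] if use_kth else sum(neigh) / k)
    return scores

pts = [(0, 0), (0, 1), (1, 0), (10, 10)]   # last point is an obvious outlier
scores = knn_anomaly_scores(pts, k=2)
print(max(range(len(pts)), key=scores.__getitem__))  # index of the outlier: 3
```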

Figure 58 - Outlier discovery for negative observations in the new dataset with metric counters (captured between successive negative observations).

Next, in Figure 59, we present decision trees to analyse the correlations between the different metrics and the occurrence of errors. We used the dataset that contains the metrics between successive intervals of negative observations. For example, for user7 the increase in negative occurrences appears to be linked to the number of eye movements, or to a combination of this with the number of departures and arrivals.
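The splitting idea behind such trees can be shown with a depth-1 tree (a decision stump): a single threshold on one metric, e.g. the number of eye movements, separates intervals with many negative observations from the rest. The values below are toy data, not the experiment's measurements.

```python
def best_stump(xs, ys):
    """Find the threshold on a single metric that best splits binary
    labels (1 = many negative observations) by misclassification count."""
    best = (None, len(ys) + 1)
    for t in sorted(set(xs)):
        pred = [1 if x > t else 0 for x in xs]
        errors = sum(p != y for p, y in zip(pred, ys))
        if errors < best[1]:
            best = (t, errors)
    return best

# Hypothetical per-interval counts and labels.
eye_movements = [12, 15, 14, 40, 45, 38, 11, 50]
many_negatives = [0, 0, 0, 1, 1, 1, 0, 1]
print(best_stump(eye_movements, many_negatives))  # (15, 0): split at > 15
```

A full decision tree repeats this search recursively on each side of the split, possibly over several metrics.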


Figure 59 - Decision tree to depict reasons for increasing numbers of negative occurrences for different users.

The regression test uses the variables that explain our data the most. By utilizing the software GMDH (extended trial version) we were able to create a draft of a prediction model. It is applied by separating the dataset into a training dataset with 80% of the data and a test dataset with the remaining 20%, using a bootstrap. We used a polynomial approximation method to build the predictive model. These algorithms detect the structure of the data and create polynomial equations with weights for each of the variables, which are then used in the definition of the prediction model. In Figure 60 we can observe the model being applied first to the known 80% of the dataset (first part of the graph, in blue) and then to the unknown 20% of the dataset (in red). The grey line that runs from the beginning of the chart and then morphs into a greyed-out red area over the red part of the graph shows how well the model predicted and adapted to the data (known and unknown to the algorithms). The match was very accurate for this example - a first indication that these variables make sense for future developments regarding the creation of a prediction framework.
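The fit-then-extrapolate idea behind Figure 60 can be sketched with an ordinary least-squares polynomial. GMDH additionally selects the polynomial structure itself; here the degree is fixed and the data is synthetic, so this only illustrates the 80/20 evaluation scheme.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations
    (tiny Gaussian elimination with pivoting, fine for low degrees)."""
    m = degree + 1
    A = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            A[r] = [a - f * c for a, c in zip(A[r], A[col])]
            b[r] -= f * b[col]
    coef = [0.0] * m
    for i in reversed(range(m)):
        coef[i] = (b[i] - sum(A[i][j] * coef[j] for j in range(i + 1, m))) / A[i][i]
    return coef

def predict(coef, x):
    return sum(c * x ** i for i, c in enumerate(coef))

# Synthetic "negative occurrences" driven by one metric; 80% train, 20% test.
xs = list(range(20))
ys = [0.5 * x * x - 2 * x + 3 for x in xs]
cut = int(0.8 * len(xs))
coef = polyfit(xs[:cut], ys[:cut], degree=2)
print(predict(coef, xs[-1]) - ys[-1])  # close to 0 on the unseen 20%
```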


Figure 60 - Polynomial regression analysis for creating a model to predict negative occurrences based on the top-most metrics (number of eye events or departure flights).

4.4 Event Trace Analysis

We analysed traces o_1 … o_n over a set of possible events (like eye and mouse fixations) by parameterising a Variable Length Markov Model (VLMM). VLMMs provide an efficient method to learn a model of discrete event systems. Unlike other models, such as Hidden Markov Models, which can describe more general processes, no previous knowledge about the process is needed. States of a VLMM can easily be interpreted, since each state is labelled by a corresponding subsequence within the data. A state chart can be calculated from the VLMM and the most probable state sequences can be determined. The occurrence of a state can be associated with a timespan within the data. A state can have different attributes, e.g. complexity or the entropy of the probability distribution of next events. These measures can additionally be used to look for patterns associated with the observations. This is accomplished by a tool of the Fraunhofer FKIE and will be described in more detail in the following.

4.4.1 Variable Length Markov Models (VLMM)

The parameterization of a VLMM is very intuitive. The algorithms grow a tree of sequences, in which each node represents a subsequence: the root node is the empty sequence, and the children of a node represent sequences that extend the sequence of their parent by one previous event. Every node within the tree is labelled by a unique sequence of events; hence, the parent of a node represents a sequence that looks one step less into the past than its children. The algorithms grow the tree from the root and only include nodes corresponding to sequences that occur sufficiently often in the data and whose observation contains more information about the next event than if only a suffix of the sequence is considered (a suffix of a sequence is obtained by removing events from the beginning of the sequence; e.g., BA is a suffix of ABA). All algorithms are very similar and mainly differ in the criteria for when a node corresponding to a sequence is included in the tree; we use statistical criteria [25]. Each node is also associated with the observed empirical probability of which event follows next, so the tree can easily be used to make predictions: by looking up the leaf node that corresponds to a suffix of the observed sequence, its associated probability distribution of next events can be retrieved. The tree of sequences can be converted into a probabilistic state machine, also called a Probabilistic Suffix Automaton [26].
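The counting that underlies the tree growth can be sketched as follows: for every candidate context (sub-sequence), we collect the empirical distribution of the event that follows it. This minimal sketch omits the statistical inclusion criteria of [25] and uses a made-up two-event trace.

```python
from collections import Counter, defaultdict

def next_event_distributions(trace, max_depth=2):
    """For each sub-sequence (context) up to max_depth, count which event
    follows it. These empirical distributions are what the PST nodes store."""
    dist = defaultdict(Counter)
    for i in range(len(trace)):
        for d in range(1, max_depth + 1):
            if i >= d:
                context = tuple(trace[i - d:i])
                dist[context][trace[i]] += 1
    return dist

trace = list("ABAABABAAB")
d = next_event_distributions(trace)
print(d[("A",)])      # Counter({'B': 4, 'A': 2}): after A, either event may follow
print(d[("A", "A")])  # Counter({'B': 2}): the longer context AA is deterministic
```

When the longer context predicts the next event better than its suffix (as AA does here compared with A), the corresponding node is worth including in the tree.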


Figure 61 - Relation of Probabilistic Suffix Tree (PST) and Probabilistic Suffix Automaton (PSA).

To transform the tree (also called a Probabilistic Suffix Tree, PST) into a PSA, the leaves are used as states and additional states are added to make sure that every state has a successor. The probabilistic state machine can be used for simulations or for visualizing typical event sequences in a state chart. Furthermore, additional attributes of the states can be calculated, and the occurrence of states over time can be determined and visualized this way.

4.4.1.1 Discrimination of states by event durations

For determining the user's state it is often important how much time has passed between different events, since durations provide important information on whether, e.g., a user has read a menu entry. The idea is to check for each sequence not only whether a change in the prefix causes a change in the probability distribution, but also whether the durations of the events in the sequence contain information about the next event [27]. To decide this, the distribution of event durations within a sequence is analysed in dependence on the observed next event (Figure 62).

Figure 62 - Hypothetical distribution of event durations if, after one event (left) or a sequence of two events (right), a specific event is observed.

For a specific sequence o_1 … o_n, the possible next events are shown on the x-axis; the y-axis shows the durations of the events in the sequence when the event indicated on the x-axis has been observed. The left side of Figure 62 shows the case in which a single event has been observed. Depending on which event has been observed next, the event durations differ; the mean event duration appears to be longer for one of the next events. Therefore predictions can be improved if the durations of the previous events are considered, by including this sequence in the tree with additional information



about the event durations. The right side of Figure 62 shows an example in which two previous events are considered; here the patterns in the mean event durations contain information about the next event. To decide whether these differences are relevant, statistics can be applied, e.g. an analysis of variance (ANOVA). Because the durations are not normally distributed, we used the non-parametric Kolmogorov-Smirnov test.
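The two-sample Kolmogorov-Smirnov statistic compares the empirical cumulative distribution functions of the two duration samples. A minimal sketch (statistic only, without the p-value computation; the duration values are hypothetical):

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical CDFs. No normality assumption on the durations."""
    a, b = sorted(sample_a), sorted(sample_b)

    def cdf(s, x):
        return sum(v <= x for v in s) / len(s)

    points = sorted(set(a) | set(b))
    return max(abs(cdf(a, x) - cdf(b, x)) for x in points)

# Hypothetical fixation durations (ms) preceding two different next events.
durations_before_A = [180, 200, 210, 220, 240]
durations_before_B = [300, 320, 340, 360, 380]
print(ks_statistic(durations_before_A, durations_before_B))  # 1.0: fully separated
```

A large statistic indicates that the duration distributions differ depending on the next event, so the durations carry predictive information.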

4.4.1.2 Complexity of States

To calculate the complexity of a stochastic process, the informational dimension of the block entropy is used to assess how much information the previous events contain for the prediction of the next event, weighted by how far they lie in the past [28]. The more information about the further development of the system is contained in events far in the past, the higher the complexity of the stochastic process. As a measure of the information that the observation of a previous event sequence contains about the next event, the reduction of entropy is used. Entropy is a measure of the uncertainty with which an event occurs. When sequences s_1 … s_n of length n are considered, the entropy of the probability distribution of the next event can be calculated by

    h(n) = - Σ_{s_1…s_n} p(s_1…s_n) Σ_x p(x | s_1…s_n) log p(x | s_1…s_n)

The needed empirical probabilities can easily be obtained from the PST. For larger n, more of the past events are considered in predicting the next event, and hence the entropy (uncertainty about the next event) is reduced. If the memory of the process is limited, this value reaches a limit h = lim_{n→∞} h(n). Grassberger [29] used the area under the curve Λ(n) = h(n) - h as a measure of complexity for a stochastic process:

    C ≝ Σ_{n=0}^{∞} Λ(n)

By this definition, information gain that comes from events farther in the past is weighted more strongly than information from events in the near past. This is illustrated in Figure 63.

Figure 63 - Illustration of the complexity measure of Grassberger.

In this way complexity is defined on the overall process. But for an analysis it is important to investigate the origin of the complexity. If the process is transformed into a PSA the contribution of single states to the overall contribution can be determined [30]:

: ⋀ : | |, log : | |,

:

There are two features of a state that determines its contribution to the overall complexity. First, how much information contributes from far past events to the prediction of the next event, but also how frequently this state occurs in a process. If one wants to eliminate this correlation of state complexity with its frequency it makes sense to divide state complexity by its initial probability, since the initial probability is defined to represent the expected relative frequency of the state to occur in a sequence:

\Lambda_s^{\mathrm{norm}} = \Lambda_s / p(s)

©SESAR JOINT UNDERTAKING, 2011. Created by Fraunhofer Austria, FREQUENTIS AG for the SESAR Joint Undertaking within the frame of the SESAR Programme co-financed by the EU and EUROCONTROL. Reprint with approval of publisher and the source properly acknowledged.

It is natural to associate this complexity with workload, especially when external factors can be isolated and only internal factors are considered in the generation of the interaction sequences [28]. Since the occurrence of a state is associated with time intervals, the per-state complexity can be used to calculate a time series of this complexity measure.

4.4.2 Scatterplot Matrix for Measures

The VLMM provides a good consolidated description of the occurrence of events and provided us with additional measures for the operator's state at a given time. To link these to other time series of measures, we used a scatterplot matrix. A scatterplot matrix shows 2D scatterplots of all possible combinations of the considered measures in a table, where each column contains scatterplots with the same measure plotted on the x-axis and each row contains scatterplots with the same measure plotted on the y-axis.

Besides the complexity measure derived from the VLMM, we also calculated the entropy for the diversity of next events in a timespan and the mean number of events within a timespan. Furthermore, we included the IBI of the heartbeat data as an additional workload measure and the mean number of flights as a task load measure in the scatterplot matrix. We normalized each time series by calculating, for each considered measure, a mean value for timespans of 10 seconds. In this way each point in a scatterplot is associated with a timespan, and therefore there is a link between dots in the scatterplot and the occurrence of states within this timespan.

By selecting outliers within one of the scatterplots, the corresponding states can be identified and their occurrence on the timeline can be displayed in a histogram. By including markers for the observations in the histogram, correlations of states or possible outliers in the scatterplot with the observations can be identified. To improve this capability, we coloured red those points that correspond to time intervals shortly before the observations with the most negative impact (-5).
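The 10-second normalization can be sketched as follows (a minimal illustration; the function names and bin handling are our own, not the project's implementation):

```python
def bin_means(samples, bin_width=10.0, n_bins=None):
    """samples: (timestamp_seconds, value) pairs.
    Returns per-bin mean values; None for bins without any sample."""
    if n_bins is None:
        n_bins = int(max(t for t, _ in samples) // bin_width) + 1
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for t, v in samples:
        b = int(t // bin_width)
        sums[b] += v
        counts[b] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]

def scatter_points(xs, ys):
    """Pair up bins where both binned measures are defined; each pair is one
    dot in a scatterplot cell and stays linked to its time-bin index."""
    return [(i, x, y) for i, (x, y) in enumerate(zip(xs, ys))
            if x is not None and y is not None]
```

Keeping the bin index with every dot is what makes the linking described above possible: a selected dot identifies its 10-second interval, and thereby the states occurring in it.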

4.4.3 Visualization of Sequential Patterns

In this section we briefly explain the visualizations we used to present and analyse the sequential patterns found by the VLMM algorithms. Figure 66 shows the user interface. On the right there is the state chart. On the left, the user to be analysed, whose event data is displayed in the state chart, can be selected. The tree view for the PST is located in the centre; nodes can be expanded for exploration and subsequences can be selected for display. At the bottom, a timeline histogram shows the distribution of the occurrence of a selected state or of the corresponding dots in the scatterplots. The top centre area can show different visualizations, selectable via the tabs at the top. Figure 64 displays the probability distributions of the selected state overlaid on a screenshot of the Sixth Sense simulator user interface. The colour scheme of the nodes in the state chart can be changed to frequency, complexity, or uni. The state chart contains only links between states with a sufficiently high frequency of events, except for links that are needed to ensure that every node in the state chart has a next and a preceding link.


Figure 64 - Screenshot of the user interface with displayed transition probabilities.

Figure 65 shows in more detail how the probabilities of next events associated with a state are visualized. Green circles mark next events (eye/mouse), the blue circle marks the prefix / starting point of the sequence, and light red circles mark the middle part of the sequence. On mouse-over/click on the prefix, details of the next events are shown. Arrows indicate how the probabilities of next events change if the middle sequence is extended by the prefix. In this case the middle sequence is radar -> pushback -> radar, and the prefix considered is pushback. The arrows show that the probability to return to pushback increases from 68% to 80% and to off-screen from 4.5% to 14.9%, whereas the probability to go to taxiin decreases from 9.1% to 1.1% and to pending departures from 13% to 1.1%. The width of the arrows scales with the overall probabilities, and the proportion of red and green colour with the extent of their change.

Figure 65 - Illustration of how probabilities for next events in a sequence are displayed.
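The probability shifts that the arrows encode can be computed directly from the next-event counts of the two contexts. A sketch with hypothetical counts (the counts below are invented for illustration and merely mimic the percentages of the example):

```python
from collections import Counter

def next_event_dist(counts):
    """Empirical next-event probability distribution from raw counts."""
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()}

def prob_deltas(counts_middle, counts_extended):
    """Change in next-event probabilities when the context sequence is
    extended by one more past event (the prefix)."""
    p = next_event_dist(counts_middle)
    q = next_event_dist(counts_extended)
    return {e: q.get(e, 0.0) - p.get(e, 0.0) for e in set(p) | set(q)}

# Hypothetical counts of next events after radar -> pushback -> radar,
# without and with the additional prefix 'pushback'.
middle = Counter(pushback=68, offscreen=5, taxiin=9, pending=13, other=5)
extended = Counter(pushback=80, offscreen=15, taxiin=1, pending=1, other=3)
deltas = prob_deltas(middle, extended)
```

A positive delta corresponds to a green arrow portion (probability rises with the prefix), a negative one to a red portion.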


Figure 66 shows how the scatterplot matrix has been used. Within each scatterplot, a region can be selected with the mouse. If the frequency colouring scheme is selected, the frequency of states occurring in the time intervals corresponding to the selected points is reflected by an adapted colouring of the nodes in the state chart. The other way around, if states are selected in the state chart, all points in the scatterplot that do not co-occur are dimmed. In this way the analyst can easily switch between the different views and drill down into patterns of interest.

Figure 66 - Illustration of the user interface combining states with displayed scatterplot matrix.

4.4.4 Insights regarding interaction sequences

In the following section we present and discuss examples of eye and mouse state sequences that we have found. With respect to RQ10 and RQ15, we are interested in concise sequences that might be used as indicators and predictors of an operator's behaviour.

4.4.4.1 Most frequent States per User

In RQ15 we ask whether we are able to determine the most preferred mouse and eye sequences per user. For this we look for the most frequent state sequences in the data. In general it would not be easy to say at which length to cut such sequences, but choosing the state sequences of the VLMM is a plausible choice. Table 11 lists the top 5 most frequent eye state sequences for each user. In the tables within this section you will notice prefixes like Not(event1 || event2). This means that the state represents a sequence whose first element is none of the events listed in the parentheses of the Not-statement, and whose remaining elements match the rest of the sequence after the Not-statement.
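The Not(...) notation can be matched against an event stream as sketched below (our own illustrative implementation of the stated semantics, not the project's code):

```python
def matches_state(state, events):
    """Check whether the tail of an event sequence matches a VLMM state.
    state: list of elements; an element may be ('not', {excluded, ...}),
    meaning 'any event not in the excluded set', mirroring Not(e1 || e2)."""
    if len(events) < len(state):
        return False
    tail = events[-len(state):]
    for spec, ev in zip(state, tail):
        if isinstance(spec, tuple) and spec[0] == 'not':
            if ev in spec[1]:  # first element must NOT be one of the excluded events
                return False
        elif ev != spec:
            return False
    return True

# Example state: Not(radar), taxiin, taxiin
state = [('not', {'radar'}), 'taxiin', 'taxiin']
```

Only the tail of the event stream is inspected, matching the suffix-based definition of VLMM states.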


User  | Eye state sequence | N | complexity
user6 | Not(radar), taxiin, taxiin | 1050 | 5.13899
user6 | startuppushback, startuppushback | 511 | 5.41615
user6 | Not(taxiin), radar, radar, radar, radar, radar, radar | 525 | 10.4532
user6 | pendingdepartures, pendingdepartures | 415 | 5.27982
user6 | handoverrunway | 337 | 4.24181
user7 | Not(taxiin), radar, radar, radar, radar, radar, radar, radar, radar, radar, radar | 1406 | 10.6399
user7 | Not(startuppushback || offscreen || radar), taxiin | 646 | 3.57608
user7 | pendingdepartures | 340 | 3.76519
user7 | startuppushback, startuppushback | 266 | 5.70176
user7 | handoverrunway | 226 | 3.9879
user8 | Not(pendingdepartures), radar, radar, radar, radar | 1853 | 6.68078
user8 | taxiin, taxiin | 933 | 5.11017
user8 | pendingdepartures | 425 | 3.90836
user8 | offscreen | 317 | 3.95789
user8 | handoverrunway | 294 | 4.12408

Table 11 - Most frequent state sequences for the eye data (top 5 for each user).

The most frequent state sequence is very distinct for each user and occurs nearly twice as often as the second one in the ranking. It is all the more remarkable that these distinct top states are similar for user 7 and user 8, both containing longer sequences on radar, whereas for user 6 the top state describes focused attention on taxiin. This focused attention on the radar can also be seen in the number of fixations: user 6: 35% radar, 26% taxiin; user 7: 64% radar, 13% taxiin; user 8: 53% radar, 20% taxiin. Surprisingly, even for user 6 the overall number of fixations on radar is larger than the number of fixations on taxiin, yet the state sequence containing radar is only the third most frequent state sequence for user 6. This contradiction indicates that user 6 uses radar in a much more flexible way, combining it with different AOIs.

Table 12 illustrates some of the most frequent eye state sequences by the display of the delta in the transition probabilities.

User  | Eye state sequence | N | complexity
user6 | Not(radar), taxiin, taxiin | 1050 | 5.1
user7 | Not(taxiin), radar, radar, radar, radar, radar, radar, radar | 1400 | 10.3
user8 | radar, radar, radar, radar | 1834 | 6.68

Table 12 - Most frequent states of each user for the eye fixation sequences.

The same analysis can be applied to the mouse movements. Table 13 lists the top 5 most frequent mouse state sequences for each user.

User  | Mouse state sequence | N | complexity
user2 | radar | 537 | 3.9514
user2 | taxiin | 464 | 4.3576
user2 | startuppushback | 386 | 4.34803
user2 | Not(pendingdepartures), handoverrunway | 87 | 4.80334
user2 | nowhere, nowhere, nowhere, nowhere | 89 | 2.77347
user3 | nowhere, nowhere, nowhere | 100 | 2.00923
user3 | bottomcenter | 30 | 3.06534
user3 | radar | 30 | 3.48681
user3 | Not(nowhere), nowhere | 7 | 2.57605
user3 | toppanel | 6 | 3.99731
user4 | radar | 1291 | 2.67213
user4 | taxiin | 295 | 3.50671
user4 | handoverrunway | 86 | 3.69004
user4 | Not(startuppushback), startuppushback | 82 | 4.07647
user4 | startuppushback, startuppushback | 74 | -0.0556529
user5 | taxiin | 113 | 0.798235
user5 | startuppushback | 52 | 0.600654
user5 | handoverrunway | 35 | 0.734859
user5 | radar | 32 | 1.7067
user5 | pendingdepartures | 27 | 0.5925
user6 | radar, radar, radar, radar | 392 | 2.07238
user6 | taxiin, taxiin, taxiin, taxiin | 285 | 2.06089
user6 | startuppushback | 235 | 3.79028
user6 | taxiout | 96 | 4.26805
user6 | Not(taxiin), taxiin | 85 | 3.63707
user7 | startuppushback | 181 | 3.65007
user7 | taxiin, taxiin | 164 | 4.45134
user7 | radar, radar | 37 | 4.61829
user7 | taxiout | 32 | 4.01068
user7 | leftpanel | 23 | 2.64312
user8 | taxiin, taxiin | 229 | 1.3014
user8 | startuppushback, startuppushback | 94 | 1.35668
user8 | taxiout | 53 | 4.06747
user8 | Not(taxiin), taxiin | 50 | 3.53306
user8 | Not(startuppushback), startuppushback | 38 | 3.82724

Table 13 - Most frequent state sequences for the mouse data (top 5 for each user).

The detected mouse state sequences are significantly shorter than the eye state sequences; many state sequences contain only a single event. The reason for this is that far fewer mouse events were present, since the mouse was not moved as much as the eye: e.g., for user6 5795 eye fixations were recorded but only 1503 mouse "fixations" (user7: 529 mouse vs. 6963 eye, user8: 562 mouse vs. 6387 eye). Most mouse events were registered for user4 and user2 (1977 and 1883), and the fewest for user3 and user5 (188 and 282). In this light it is remarkable that user2 has no state sequence with radar that is longer than one event, which indicates that for mouse events associated with radar there really are no longer correlations with the past. In contrast, the most frequent state sequence of user6 contains four successive radar events. The raw data of user2 do also contain longer mouse fixation sequences with successive radar fixations. Considering only a sequence with a single radar fixation, the probability to stay on radar for user2 is only slightly lower than for user6 (80% vs. 88%). But for user6 the probability distribution of next events changes the more often radar is repeated in the sequence: the probability to stay on radar increases up to 92%, whereas for user2 it stays constant, so that there is no reason to consider longer state sequences repeating radar.
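The behaviour described here, where a longer context is only kept when the next-event distribution actually changes, is the core pruning idea of a VLMM. A hedged sketch of such a criterion (the Kullback-Leibler threshold and the distributions below are illustrative, not the algorithm actually used in the project):

```python
from math import log2

def kl_divergence(p, q):
    """D(p || q) in bits; assumes q[e] > 0 wherever p[e] > 0."""
    return sum(pi * log2(pi / q[e]) for e, pi in p.items() if pi > 0)

def extend_context(p_short, p_long, threshold=0.05):
    """Keep the longer context only if its next-event distribution differs
    enough (in bits) from that of the shorter context."""
    return kl_divergence(p_long, p_short) > threshold

# user2-like case: repeating radar barely changes the distribution -> no extension.
p_one_radar = {'radar': 0.80, 'other': 0.20}
p_two_radar = {'radar': 0.81, 'other': 0.19}
# user6-like case: probability to stay on radar grows with repetitions -> extend.
p_many_radar = {'radar': 0.92, 'other': 0.08}
```

Under this criterion, user2's constant distribution would never justify longer radar contexts, while user6's rising stay-probability would.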

4.4.4.2 Most complex states per user

Another way to look at the state sequences is to focus on the most complex ones. Table 14 shows the top 5 most complex states of the VLMM of the eye tracking data for each user.

All complex state sequences contain many successive radar fixations preceded by another fixation, mostly taxiin. These states indicate that the probability to return to an AOI is increased after successive fixations on radar. This reflects that in the cognitive workflow the operator has to complete a task associated with the preceding AOI, e.g. taxiin, by collecting the necessary information on the radar, and then returns to the AOI where information has to be entered. Besides taxiin, other AOIs preceding successive radar fixations are pendingdepartures and startuppushback. For user6 the probability that taxiin follows six successive radar fixations increases from 5% to 25% if taxiin also precedes them. For user6 the successive radar fixations are longer if taxiin precedes than if startuppushback precedes. This effect does not seem to be statistically significant, since the state sequence taxiin, radar, radar, radar (not listed in the table) occurs 95 times in the data, nearly as often as


startuppushback, radar, radar, radar (83 times). This means that user6, at least, needs more radar fixations when working on taxiin than when working on startuppushback.

User  | Eye state sequence | N | complexity
user6 | taxiin, radar, radar, radar, radar, radar | 57 | 10.5224
user6 | Not(taxiin), radar, radar, radar, radar, radar, radar | 525 | 10.4532
user6 | Not(taxiin || radar), radar, radar, radar, radar, radar | 88 | 9.22915
user6 | taxiin, radar, radar, radar, radar | 69 | 9.04634
user6 | startuppushback, radar, radar, radar | 83 | 8.17981
user7 | taxiin, radar, radar, radar, radar, radar, radar, radar, radar, radar | 51 | 11.7606
user7 | Not(taxiin), radar, radar, radar, radar, radar, radar, radar, radar, radar, radar | 1406 | 10.6399
user7 | taxiin, radar, radar, radar, radar, radar, radar, radar, radar | 61 | 10.1068
user7 | Not(taxiin || radar), radar, radar, radar, radar, radar, radar, radar, radar, radar | 109 | 9.91921
user7 | offscreen, radar, radar, radar, radar, radar, radar, radar | 57 | 9.64865
user8 | pendingdepartures, radar, radar, radar, radar | 31 | 8.02889
user8 | startuppushback, radar, radar, radar | 77 | 6.82221
user8 | pendingdepartures, radar, radar, radar | 41 | 6.70428
user8 | Not(pendingdepartures), radar, radar, radar, radar | 1853 | 6.68078
user8 | offscreen, radar, radar | 129 | 5.84961

Table 14 - Most complex state sequences for the eye tracking data (top 5 for each user)

User 7 also shows these long successive radar sequences preceded by taxiin, whereas the model for user 8 mainly contains long radar sequences preceded by pendingdepartures and startuppushback. Although the sequence taxiin, radar, radar, radar, radar is present 124 times in the data of user 8, compared to 69 times for user 6, it is not included in the model, since the probability to return to taxiin is not increased. Table 15 illustrates the most complex eye state sequences and their occurrences on the time line. The pictures show the effect discussed so far: on long radar sequences the probability to stay on radar decreases, and the probability to go back to the fixation before the long radar sequence increases. Analysing the histogram for the occurrence of these complex events does not show any correlations with the negative observations.


User  | Eye state sequence | N | complexity
user6 | taxiin, radar, radar, radar, radar, radar | 52 | 10.52
user7 | taxiin, radar, radar, radar, radar, radar, radar, radar, radar, radar | 51 | 11.75
user8 | pendingdepartures, radar, radar, radar, radar | 31 | 8.2

Table 15 - Illustration of the most complex state sequences for the eye tracking data

Looking at the most complex state sequences for the mouse data (Table 16), there are no state sequences as complex as for the eye tracking; we already discussed this above. However, for user2, user4, user6, and user7 there is at least one state sequence containing more than one event. These are shown in Table 17.

User  | Mouse state sequence | N | complexity
user2 | pendingdepartures, handoverrunway | 31 | 6.537
user2 | Not(nowhere), nowhere, nowhere, nowhere | 15 | 6.12915
user2 | pendingdepartures, pendingdepartures | 71 | 5.56096
user2 | Not(nowhere), nowhere, nowhere | 24 | 5.08845
user2 | Not(pendingdepartures), handoverrunway | 87 | 4.80334
user3 | toppanel | 6 | 3.99731
user3 | radar | 30 | 3.48681
user3 | leftpanel | 5 | 3.46379
user3 | taxiin | 2 | 3.10449
user3 | bottomcenter | 30 | 3.06534
user4 | Not(startuppushback), startuppushback | 82 | 4.07647
user4 | taxiout | 21 | 4.04965
user4 | nowhere | 17 | 3.96738
user4 | onblock | 16 | 3.79514
user4 | handoverrunway | 86 | 3.69004
user5 | startuppushback#taxiin | 1 | 2.4141
user5 | taxiout | 1 | 2.4141
user5 | taxiin#handoverrunway | 2 | 2.4141
user5 | onblock | 4 | 2.4141
user5 | bottomcenter | 1 | 2.41407
user6 | Not(taxiin), taxiin, taxiin, taxiin | 48 | 5.81504
user6 | Not(taxiin), taxiin, taxiin | 61 | 4.74621
user6 | Not(radar), radar, radar, radar | 35 | 4.67024
user6 | onblock | 12 | 4.61976
user6 | taxiout | 96 | 4.26805
user7 | startuppushback, taxiin | 30 | 4.89652
user7 | radar, radar | 37 | 4.61829
user7 | taxiin, taxiin | 164 | 4.45134
user7 | taxiout | 32 | 4.01068
user7 | nowhere | 6 | 3.82285
user8 | radar | 19 | 4.57338
user8 | startuppushback#taxiin | 4 | 4.40386
user8 | pendingdepartures | 14 | 4.38562
user8 | taxiout | 53 | 4.06747
user8 | handoverrunway | 15 | 3.96427

Table 16 - Most complex state sequences for the mouse data (top 5 for each user).

For user2 the most complex state sequence means that if the user moved the mouse from pendingdepartures to handoverrunway, he will next move it with increased probability to startuppushback and not to taxiin, as he would otherwise if not starting from pendingdepartures. For user4 the most complex state sequence tells us that two mouse fixations on startuppushback in succession increase the probability to move the mouse to taxiin. For user6 the visualization of the displayed state sequence shows that if there was a mouse fixation outside taxiin before a succession of three mouse fixations within taxiin, the next mouse fixation will most probably stay in taxiin. For user7 a startuppushback mouse fixation preceding a taxiin mouse fixation increases the probability of the next mouse fixation being on startuppushback again. By the numbers, the mouse fixations of user2 are more complex than those of the other users.

User  | Mouse state sequence | N | complexity
user2 | pendingdepartures, handoverrunway | 31 | 6.5
user4 | Not(startuppushback), startuppushback | 82 | 4.07
user6 | Not(taxiin), taxiin, taxiin, taxiin | 48 | 5.81504
user7 | startuppushback, taxiin | 30 | 4.89652

Table 17 - Illustration of the most complex state sequences for the mouse data


For the eye tracking data, no obvious correlation of these states with negative observations can be read from the histograms.

4.4.5 States corresponding to outliers

Additionally, we looked for interesting states by associating outliers in the scatterplots with co-occurring states. Table 18 illustrates some findings.

Overall, the strict application of this method was not possible, since for no user were all sensor data available: for users 6-8 no heartbeat data were available, whereas for users 3-5 heartbeat data were available but no eye movement data. For convenience, and only to demonstrate the method, we use examples from the scatterplot correlating the entropy (diversity) of mouse and eye fixations; overall, these two measures do not seem to be correlated. Looking for outliers in scatterplots where both measures are generally expected to be correlated is more promising, since the usual correlation can easily be identified and outliers stand out more. One such pair of measures would have been heartbeat data and the number of eye fixations. However, even in uncorrelated measures outliers might show some variation from usual behaviour.

User  | Most frequent eye sequence in the outlier intervals (eye entropy vs. mouse entropy)
user6 | pendingdepartures, pendingdepartures
user7 | Not(radar || taxiin || startuppushback), taxiin
user8 | offscreen

Table 18 - Examples of state sequences corresponding to outliers in the scatterplots.

The procedure for creating Table 18 was as follows. We marked a cluster of outliers in the scatterplot. The histogram below the scatterplot displays the corresponding time intervals with co-occurring states. We then looked for the most frequent state within these time intervals and selected it to display the histogram and see where else this state occurs. As can be seen in these examples, the outliers, at least for user 6 and user 7, do lie in the vicinity of a negative observation, but the corresponding state sequences are not distinctive enough and occur in many other time intervals which cannot be associated with negative observations.
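The procedure of linking selected scatterplot points back to co-occurring states can be sketched as follows (a minimal illustration; the names and the 10-second bin width are our own assumptions):

```python
from collections import Counter

def states_for_selection(selected_bins, state_occurrences, bin_width=10.0):
    """selected_bins: indices of the time bins picked in the scatterplot.
    state_occurrences: (timestamp_seconds, state_name) pairs.
    Returns the states co-occurring with the selection, most frequent first."""
    chosen = set(selected_bins)
    hits = Counter(s for t, s in state_occurrences
                   if int(t // bin_width) in chosen)
    return hits.most_common()

# Toy data: states observed at various timestamps; bins 0 and 1 are selected.
occurrences = [(5, 'offscreen'), (12, 'radar'), (15, 'offscreen'), (25, 'radar')]
ranking = states_for_selection([0, 1], occurrences)
```

The top-ranked state can then be displayed on the timeline to check whether it also occurs outside the selected intervals, as done for Table 18.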


4.5 Conclusion

In order to collect meaningful data for the Sixth Sense project, we prepared and performed two experiments with ATCOs. Intensive research and integration work was needed for that, covering the state of the art in air traffic control, data mining and machine learning, decision making, psychology and other important topics.

We designed and implemented a software framework that allows collecting data about the users' behaviour in real time. For this we had to integrate different systems, sensors and sources of information. We started by unifying all the air traffic data sources with all the sensor technologies used (e.g., heart rate and heart rate variability, Kinect and body/head posture, eye tracker and areas of interest, user interaction information, environmental sensors, air traffic controller workflow step analysis, real-time data filtering, the capability of replaying complete experiments, and other functionalities). We did not yet use all the capabilities of our framework, namely the graph database registration and graph analysis or the prediction engine component, because these components were not in the main scope of this project.

In the first ATC experiment, exercise 1, we collected enough data to test and confirm the practical utility of our data collection approaches and sensor integration. This first exercise also helped clarify our research questions, allowing us to create a common and clearer picture of our final goals.

In the second experiment, exercise 2, we collected much more data and automated the manual data collection steps. We integrated the supervisor, observer and ATC stress level reports into our software framework to allow us to automatically analyse and treat all aspects related to the "think aloud" and observational protocols. All eight users had to answer several questionnaires, from which we extracted valuable information about the different preferences and user experiences when handling air traffic. We used the questionnaire outcomes to search for answers regarding the difficulty of the experiment, usability of the system, workload, situational awareness, performance and other important measures. Including questionnaires in the data analysis is definitely a promising option, but due to limited resources and data quantity we could not exploit its full potential.

From the two exercises we collected at least 600,000 events distributed among several datasets. Handling the complexity and amount of data (many events, different datasets, multiple variables, time series, and behavioural sensor data) required multiple strategies for pre-processing, analysis, discussion sessions, exploration and visualization. The obtained results are presented in this report.

In Sixth Sense we are interested in looking for patterns or hidden signs in the data that allow us to detect moments of bad and good decisions and that could be incorporated in an automated system in order to detect and predict the users' next actions.

Based on psychological findings, the metrics obtained from the experiments were aggregated into the categories task load, mental workload, attention, behaviour, and performance. This categorization established a ground truth of possibly useful predictors to detect moments of high workload, high stress, and loss of situational awareness. Guided by these findings, 15 research questions were established and addressed during data analysis. This includes exploration of

- the number of arrivals and departures per minute in relation to errors,
- increases in eye movements during periods of high workload that relate to the occurrence of negative observations,
- the relation between mouse pauses and increases in eye fixation times,
- the number of areas of interest visited per minute,
- lower heart rate variability,
- how the voice communications (number and speed of words spoken) relate to negative observations,
- the most preferred areas of interest of the users,
- how the Kinect head pose and sound source angle variables might be used to detect problematic time periods, which might allow us to reduce the amount of data that needs to be analysed in real time.


We gathered promising evidence that the strategies employed and the predictors found will be very useful in the design of a new automated cyber-physical system that is able to detect unusual behavioural situations in the field of air traffic control.

The most promising metrics, and as a consequence the most promising hints, were found in relations between different data streams. One is the link between reductions in mouse movement and increases in eye movements, coincident with the occurrence of negative observations.

The heart rate variability, together with the reduction in mouse activity, the number of visual UI objects to be managed, and the eye tracking AOI frequency and duration, provides very good clues for anticipating moments of stress and high workload.

There are direct relations between an increase in the number of words used by the air traffic controllers and the occurrence of negative observations.

And we found a correlation between the user's head position and negative observations, which indicates promising potential for building predictive models.

The presented results show how important the incorporation of behavioural analysis is for the design of automated systems that are able to analyse, detect and predict unsafe situations, and that can even react or advise towards better and safer actions. Our results can also be applied to the improvement of existing systems and user interfaces.

Taking into consideration that we were mainly interested in behavioural indicators, for which the incorporation of new sensors and methods such as a voice recognition system or an eye tracker was essential, we believe that we identified behavioural causes that play an important role in reports of higher stress levels, high workload, or even lack of situational awareness. These behavioural causes are, for example, the number of visual objects to be handled (arrivals and departures per minute), the number of areas to be monitored, delays and problems in communication with the pilot, time pressure, and also emotional factors.


4.6 Future Work

New experiments to collect more data would be the next step. To improve data quality and quantity per experiment, the sensor output could be distributed across multiple machines in order to reduce the data load; this load was the reason for the lack of Kinect data. To answer specific questions about situational awareness or task completion times, we would need to create shorter experiments focused on smaller, specific tasks. This would also simplify measuring time or speed.

We envision the use of graph models and prediction engines applied to behaviour analysis or to the prediction of next user actions or next best suggestions. Deep learning and agent based models are also important components for building more intelligent systems especially to incorporate cognitive features that better map the users’ behaviours and the users’ decision making processes. Here the inclusion of cognitive architectures could also be beneficial.

The collection of real data is a challenge due to the extensive preparation and system integration work. A plug-and-play solution connecting our components and sensors to an existing standardized simulation sensing platform would be desirable. This would save precious time and make the results more comparable to other studies (e.g. [31], [32]).

We would like to see developments in the creation of real-time behavioural monitoring dashboards that can account for the calculation of different types of costs. Specifically in the case of Sixth Sense, we would like to improve the implementation of costs related to interaction and user behaviour (e.g., the cost of having to focus attention on less meaningful areas of interest, or the cost of unnecessary eye, mouse or body movement). This would allow us to quantify specific decisions and to analyse their impact not only in terms of time or effort but also in financial terms.

We expect that more and more gesture-, voice- and natural-language-based user interface capabilities will be used by ATCOs. Therefore, there will be a need for new, similar studies that take into account the usage of more assistive technologies in the workplace.

We also envision the incorporation of emotional costs, e.g. analysing the costs of frustration in the communications between ATCOs and pilots, or of periods of inactivity, extreme effort or dislike. The system would gain from the use of additional emotion-related sensing technologies. Regular manual reports on stress levels or on the status of the environment as perceived by the users, such as room temperature, air humidity or noise, would also be a beneficial contribution. However, what we are aiming at is the inclusion of sensors that can automatically capture most of these environmental or psychological factors. By incorporating this data we can create even more innovative automated cognitive systems.

We took the first steps in this direction by calculating complex interactivity metrics such as the number of AOIs visited, the interaction effort in terms of the number of words used, the number of visual objects to be handled, and the "UI interaction pace" measure. We also analysed the processing time of airplanes, globally and at each workflow step, taking the ATCOs' standardized workflow processes for handling airplane departures and arrivals as the basis for our analysis.
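One of the metrics named above, the number of distinct AOIs visited per minute, can be computed from a simple event stream. The event data and AOI names below are invented for illustration.

```python
# Sketch: distinct AOIs visited per minute, from (timestamp_s, aoi_name) events.
# Timestamps and AOI names are hypothetical, not recorded experiment data.
from collections import defaultdict

events = [(3, "radar"), (10, "strips"), (15, "radar"),
          (70, "runway"), (80, "runway"), (95, "strips")]

per_minute = defaultdict(set)
for ts, aoi in events:
    per_minute[ts // 60].add(aoi)       # bucket by minute, keep distinct AOIs

aoi_per_minute = {m: len(s) for m, s in sorted(per_minute.items())}
print(aoi_per_minute)
```

The other interactivity metrics (words used, objects handled) follow the same bucket-and-count pattern over their respective event streams.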

However, this definitely requires further specific and complex research. It would also require the inclusion of more sensing technologies for detecting emotions, such as electroencephalography (EEG) and camera-based approaches. It is also desirable to have better real-time body tracking capabilities to increase the awareness of the automated system with respect to the users' position and posture. However, in the scope of Sixth Sense we tried to make only minimal changes to the working environment of the ATCO.

Finally, it would be very interesting to quantify the factors mentioned above in terms of real financial impact and not only in terms of ergonomic or time constraints.


References

[1] A. Vosskühler, V. Nordmeier, L. Kuchinke and A. M. Jacobs, “OGAMA (Open Gaze and Mouse Analyzer): open-source software designed to analyze eye and mouse movements in slideshow study designs,” Behaviour Research Methods, pp. 1150-1162, 2008.

[2] A. Haarmann, W. Boucsein and F. Schaefer, “Combining electrodermal responses and cardiovascular measures for probing adaptive automation during simulated flight,” Applied ergonomics, 40(6)., 2009.

[3] K. F. Van Orden, T. P. Jung and S. Makeig, “Combined eye activity measures accurately estimate changes in sustained visual task performance,” Biological Psychology, 52, p. 221–240, 2000.

[4] P. A. Hancock, G. Williams and C. Manning, “Influence of task demand characteristics on workload and performance,” The International Journal of Aviation Psychology. Special Issue on Pilot Workload: Contemporary Issues 5(1), pp. 63-86, 1995.

[5] W. Rohmert, “Das Belastungs-Beanspruchungskonzept,” Zeitschrift für Arbeitswissenschaft, pp. 193-200, 1984.

[6] “DIN EN 10 075-1,” Ergonomische Grundlagen bezüglich psychischer Arbeitsbelastung. Teil 1: Allgemeines und Begriffe, 2000.

[7] M. A. Neerincx, “Cognitive task load design: Model, methods and examples,” Handbook of Cognitive Task Design, pp. 283-305, 2003.

[8] T. E. de Greef and H. F. R. Arciszewski, “Triggering Adaptive Automation in Naval Command and Control,” Frontiers in Adaptive Control, pp. 165-188, 2009.

[9] P. A. Hancock and M. H. Chignell, “Input information requirements for an adaptive human-machine system,” Proc. of the Tenth Department of Def. Conf. Psych., vol. 10, pp. 493-498, 1986.

[10] J. A. Veltman and C. Jansen, “Differentiation of Mental Effort measures: Consequences for Adaptive Automation,” Operator Functional State, pp. 249-259, 2003.

[11] A. H. Roscoe, “Assessing pilot workload. Why measure heart rate, HRV and respiration?,” Biological Psychology, 34, pp. 259-287, 1992.

[12] J. A. Veltman and A. W. K. Gaillard, “Physiological indices of workload in a simulated flight task,” Biological Psychology, 42(3), pp. 323-342, 1996.

[13] B. Mulder, H. Rusthoven, M. Kuperus, M. de Rivecourt and D. de Waard, “Short-term heart rate measures as indices of momentary changes in invested mental effort,” Human Factors Issues, 2007.

[14] D. Manzey, “Psychophysiologie mentaler Beanspruchung,” Ergebnisse und Anwendungen der Psychophysiologie (Enzyklopädie der Psychologie, C. Serie L. Bd5), p. 799 – 864, 1998.

[15] K. F. Van Orden, W. Limbert, S. Makeig and T. P. Jung, “Eye activity correlates of workload during a visuospatial memory task,” Human Factors, 43(1), pp. 111-121, 2001.

[16] M. S. Young and N. A. Stanton, “Attention and automation: New perspectives on mental underload and performance,” Theoretical Issues in Ergonomics Science,3, p. 178–194, 2002.

[17] A. Mack and I. Rock, “Inattentional Blindness,” Cambridge, MA: MIT Press, 1998.

[18] M. A. Just and P. A. Carpenter, “A theory of reading: From eye fixations to comprehension,” Psychological Review 87(4), pp. 329-354, 1980.

[19] A. H. Bellenkes, C. D. Wickens and A. F. Kramer, “Visual scanning and pilot expertise: The role of attentional flexibility and mental model development,” Aviation, Space, and Environmental Medicine, pp. 569-579, 1997.

[20] J. R. Tole, A. T. Stephens, M. Vivaudou, A. R. Ephrath and L. R. Young, “Visual scanning behavior and pilot workload,” NASA Contractor Report No. 3717, 1983.

[21] J. C. Sperandio, “The regulation of working methods as a function of workload among air traffic controllers,” Ergonomics, 21, pp. 195-202, 1978.

[22] S. Lehmann, R. Dörner, U. Schwanecke, N. Haubner and J. Luderschmidt, “UTIL: Complex, Post-WIMP Human Computer Interaction with Complex Event Processing Methods,” Workshop ”Virtuelle und Erweiterte Realität”, pp. 109-120, 2013.

[23] “EsperTech: Event Series Intelligence,” EsperTech Inc., 2015. [Online]. Available: http://www.espertech.com/esper/nesper.php. [Accessed 7 July 2015].

[24] Z. He, X. Xu and S. Deng, “Discovering Cluster Based Local Outliers,” Pattern Recognition Letters, pp. 9-10, 2003.

[25] C. Winkelholz and C. M. Schlick, “Statistical Variable Length Markov Chains for the parameterization of Stochastic User Models from Sparse Data,” in IEEE International Conference on Systems, Man, and Cybernetics, The Hague, 2004.

[26] D. Ron, Y. Singer and N. Tishby, “The Power of Amnesia: Learning Probabilistic Automata with Variable Length,” Machine Learning, vol. 25, no. 2/3, pp. 117-149, 1996.

[27] F. Kruger, C. Winkelholz and C. M. Schlick, “System for a model based analysis of user interaction patterns within web-applications,” in IEEE International Conference on Systems, Man, and Cybernetics, Anchorage, Alaska, 2011.

[28] C. M. Schlick, C. Winkelholz, F. Motz and H. Luczak, “Self Generated Complexity and Human Machine Interaction,” IEEE Transactions on Systems, Man, and Cybernetics, Part A, vol. 36, no. 1, pp. 220-232, 2006.

[29] P. Grassberger, “Towards a quantitative theory of self-generated complexity,” International Journal Theoretical Physics, vol. 25, no. 9, pp. 907-938, 1986.

[30] C. Winkelholz and F. Kruger, “Anwendung des EMS-Werkzeugkastens zur Analyse von Mensch-Technik-Interaktion im militärischen Kontext,” Fraunhofer FKIE, Wachtberg, 2012.

[31] A. Isaac, O. Straeter and D. Van Damme, “A Method for Predicting Human Error in ATM HERA-PREDICT,” HRS/HSP-002-REP-07. Bretigny-Sur-Orge, France: EUROCONTROL, 2004.

[32] S. Loft, S. Sanderson, A. Neal and M. Mooij, “Modeling and Predicting Mental Workload in En Route Air Traffic Control: Critical Review and Broader Implications,” Human Factors: The Journal of the Human Factors and Ergonomics Society, pp. 376-399, 6 2007.

[33] G. Costa, “Evaluation of workload in air traffic controllers,” Ergonomics, 36 (9), pp. 1111-1120, 1993.

[34] G. Cugola and A. Margara, “Processing Flows of Information: From Data Stream to Complex Event Processing,” ACM Computing Surveys 44.3, p. 15:1–15:62, 6 2012.

[35] B. Hilburn and P. G. Jorna, “Workload and air traffic control,” Stress, workload and fatigue, 2001.

[36] R. M. Rose and L. F. Fogg, “Definition of a responder. Analysis of behavioral, cardiovascular and endocrine response to varied workload in air traffic controllers,” Psychosomatic Medicine, 55, pp. 325-338, 1993.


Appendix A

A.1 Technical Verification Details of Exercise 1 and 2

During exercise 1, the accuracy of the sensor systems was evaluated to determine their suitability for use within the experiment.

Technology:

The Tobii REX Developer Edition, which we used in the Sixth Sense project, is a gaze interaction device available today to developers of interactive applications. It can be placed on a monitor or laptop screen and measures the user's gaze position on the screen in real time.

Technical Specification:

Sampling rate: 30 Hz (std. dev. approx. 3 Hz)
Freedom of head movement (width x height at 70 cm): 50 x 36 cm (20 x 14 inch)
Operating distance (eye tracker to subject): 40-90 cm
System latency: 48-67 ms
Mounting alternatives: adhesive mounting brackets for monitors, laptops and tablets; desk stands for tripods and desks
Supported operating systems: Windows 7 and 8, both 32-bit and 64-bit

Figure 67 - Eye-Tracking

Setup:

24" wide screen (Full HD)

75 cm table height

65 cm distance to eyes

18 cm screen height


Figure 68 - Test Setup - Eye-Tracking

Evaluation:

The evaluation was performed by recording and analysing the eye-tracking data of four people, based on a five-point calibration. 300 data sets were recorded for each person and each point.

Test Person 1: Female, no glasses, no contact lenses

Test Person 2: Male, no glasses, no contact lenses

Test Person 3: Male, contact lenses

Test Person 4: Male, glasses

Based on the collected data, the following information has been calculated. Additionally, a visualisation of the area in relation to the reference points has been generated. This information has been used to analyse the quality of the eye-tracking data.

Average coordinates

Average coordinates – Reference Point

Standard Deviation
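The three quantities listed above can be computed straightforwardly; the sketch below does so for one calibration point, using invented gaze samples (the real test used 300 samples per person and point).

```python
# Sketch of the reported quantities for one calibration point:
# average coordinates, average minus reference, and standard deviation.
# The gaze samples are hypothetical, for illustration only.
from statistics import mean, pstdev

reference = (100.0, 100.0)                        # e.g. calibration point 1
samples = [(118.0, 80.0), (122.0, 86.0), (121.0, 84.0), (123.0, 82.0)]

xs = [s[0] for s in samples]
ys = [s[1] for s in samples]
avg = (mean(xs), mean(ys))                        # average coordinates
offset = (avg[0] - reference[0], avg[1] - reference[1])  # average - reference
spread = (pstdev(xs), pstdev(ys))                 # standard deviation per axis

print(avg, offset)
```

Repeating this for all five points and all four test persons yields exactly the table shown in the evaluation result.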

Evaluation Result:

Based on the collected data, the availability and quality of the data have been analysed. The table below shows the measured points of each test person in relation to the reference points.


Figure 69 - Eye Tracking Data Analysis

Based on these findings, the following constraints were identified for the experiment.

Name: Description

Side Areas: The data is more accurate in the centre of the screen than in the side areas. (Note that there was a software update after the test, which should improve the quality in the side areas.)

Identification Area: The minimum area to be detected shall not be less than 80 x 80 pixels.

Outliers: Within the experiment, outliers shall be identified and filtered.

Contact Lenses: There could be a loss of quality for people with hard contact lenses. (Any problems with the test person have to be identified in the calibration phase of the experiment.)

Glasses: There could be a loss of quality for people wearing glasses. (Any problems with the test person have to be identified in the calibration phase of the experiment.)

Taking into account the points mentioned above, the eye-tracking system provides accurate information. Therefore it is recommended for use within the experiment.
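The "Outliers" constraint does not prescribe a method; one simple, commonly used option is a median absolute deviation (MAD) filter, sketched below with invented gaze samples. This is an illustrative choice, not the project's specified procedure.

```python
# Illustrative MAD-based outlier filter for gaze samples: discard values whose
# distance from the median exceeds k median absolute deviations.
from statistics import median

def mad_filter(values, k=3.0):
    med = median(values)
    mad = median(abs(v - med) for v in values) or 1.0  # guard against MAD = 0
    return [v for v in values if abs(v - med) <= k * mad]

gaze_x = [118, 121, 119, 850, 120, 122, 117]   # 850 is an obvious outlier
print(mad_filter(gaze_x))
```

A MAD filter is preferable to a plain standard-deviation cut here because the large outliers themselves inflate the standard deviation, whereas the median-based spread stays stable.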

Reference points (x; y): Point 1 (100; 100), Point 2 (1820; 100), Point 3 (100; 980), Point 4 (1820; 980), Point 5 (960; 540)

Test Person 1
Average:            (120.63; 83.12)  (1815.48; 99.85)  (142.92; 879.26)  (1790.90; 906.50)  (962.86; 505.29)
Average - Ref:      (20.63; -16.88)  (-4.52; -0.15)    (42.92; -100.74)  (-29.10; -73.50)   (2.86; -34.71)
Standard Deviation: (12.63; 43.75)   (14.12; 46.06)    (25.78; 292.32)   (16.24; 257.76)    (13.91; 180.38)

Test Person 2
Average:            (177.78; 100.13) (1804.95; 132.14) (211.22; 898.91)  (1765.94; 910.33)  (969.00; 527.95)
Average - Ref:      (77.78; 0.13)    (-15.05; 32.14)   (111.22; -81.09)  (-54.06; -69.67)   (9.00; -12.05)
Standard Deviation: (17.16; 59.12)   (15.89; 71.14)    (28.71; 232.64)   (49.56; 240.15)    (23.26; 176.42)

Test Person 3
Average:            (181.38; 61.71)  (1725.56; 105.34) (241.81; 771.84)  (1620.09; 808.41)  (978.54; 495.10)
Average - Ref:      (81.38; -38.29)  (-94.44; 5.34)    (141.81; -208.16) (-199.91; -171.59) (18.54; -44.90)
Standard Deviation: (123.44; 65.37)  (158.21; 131.73)  (68.62; 250.20)   (151.33; 244.87)   (92.41; 132.99)

Test Person 4
Average:            (76.86; 46.46)   (1448.61; 59.12)  (125.25; 774.51)  (1436.80; 707.70)  (755.38; 437.14)
Average - Ref:      (-23.14; -53.54) (-371.39; -40.88) (25.25; -205.49)  (-383.20; -272.30) (-204.62; -102.86)
Standard Deviation: (43.80; 71.43)   (50.44; 58.15)    (62.57; 217.82)   (94.94; 272.98)    (39.20; 190.69)


Visualisation of the eye-tracking information for each test person:

The following graphs visualise the data of each test person. The test persons were selected to cover different factors (female/male, contact lenses, glasses) in order to identify any issues related to these factors.

Test Person 1: Female, no glasses, no contact lenses

Figure 70 - Test Person 1 – Eye Tracking

Test Person 2: Male, no glasses, no contact lenses


Figure 71 - Test Person 2 – Eye Tracking

Test Person 3: Male, contact lenses

Figure 72 - Test Person 3 – Eye Tracking

Test Person 4: Male, glasses


Figure 73 - Test Person 4 – Eye Tracking

A.1.1 Kinect

Technology:

The Microsoft Kinect sensor (also called a Kinect) is a physical device that contains cameras, a microphone array, and an accelerometer as well as a software pipeline that processes colour, depth, and skeleton data.

Figure 74 - Kinect sensor

Inside the sensor case, a Kinect for Windows sensor contains:


Figure 75 - Sensors included in the Kinect

Kinect Array specifications

Viewing angle: 43° vertical by 57° horizontal field of view
Vertical tilt range: ±27°
Frame rate (depth and colour stream): 30 frames per second (FPS)
Audio format: 16-kHz, 24-bit mono pulse code modulation (PCM)
Audio input characteristics: a four-microphone array with 24-bit analogue-to-digital converter (ADC) and Kinect-resident signal processing including acoustic echo cancellation and noise suppression
Accelerometer characteristics: a 2G/4G/8G accelerometer configured for the 2G range, with a 1° accuracy upper limit

Table 19 - Technical specifications of Kinect

Setup:

Kinect mounted above the screen

24" wide screen (Full HD)

75 cm table height

100 cm average distance to Kinect

18 cm screen height


Figure 76 - Test Setup - Kinect

Evaluation:

The evaluation was performed by measuring the real angles and distances and comparing them to the Kinect data measured for four people. As a starting point, distances were measured to identify the point of losing head pose and the point of losing tracking, starting at 150 cm. The angles were measured at a distance of 100 cm (based on the minimum distances measured during the test).


Figure 77 - Evaluation of distances and angles

Based on the collected data, the following information can be summarised.

Deviation of the 0° head pose

Deviation of the 45° head pose

Deviation of the -45° head pose

Average Angle of all Test Persons

Standard Deviation of Test Persons

Average Angle – Reference Point

Distance of losing head pose

Distance of losing Tracking


Evaluation Result:

Based on these findings, the following constraints were identified for the experiment.

                      Losing Head Pose (cm) | Losing Tracking (cm) | 0° (°) | α = 45° (°) | β = -45° (°)
Test Person 1:        100                   | 50                   | 0.6    | 42          | 37
Test Person 2:        74                    | 62                   | 2      | 41          | 41
Test Person 3:        75                    | 47                   | 0.3    | 38          | 40
Test Person 4:        70                    | 48                   | -1     | 44          | 43
Average:              79.75                 | 51.75                | 0.97   | 41.25       | 40.25
Average - Reference:  N/A                   | N/A                  | -0.97  | 3.75        | 4.75
Standard Deviation:   13.67                 | 6.95                 | 0.91   | 2.50        | 2.50

Table 20 - Kinect Results
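The averages reported in Table 20 can be cross-checked directly from the per-person rows; the values below are taken from the table.

```python
# Cross-check of the averages in Table 20 from the per-person measurements
# (distances in cm, detected angles in degrees, for Test Persons 1-4).
rows = {
    "losing_head_pose_cm": [100, 74, 75, 70],
    "losing_tracking_cm":  [50, 62, 47, 48],
    "alpha_45_deg":        [42, 41, 38, 44],
    "beta_minus_45_deg":   [37, 41, 40, 43],
}
averages = {name: sum(vals) / len(vals) for name, vals in rows.items()}
print(averages["alpha_45_deg"], averages["beta_minus_45_deg"])
```

The α and β averages (41.25° and 40.25°) fall 3.75° and 4.75° short of the 45° reference, matching the "Average - Reference" row.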

Name: Description

Maximum angle: The angle of the head (alpha or beta) should not exceed 60°; otherwise the Kinect is not able to track the head of the person.

Closest position: The minimum distance of the user should always be at least 80 cm. Below this distance the Kinect is not able to track a person's head.

Tracking environment: For the best tracking results, additional people in the background of the tracked person should be avoided. Furthermore, the room should provide good lighting conditions.

Tilt angle: The tilt angle of the Kinect should be adjusted individually for each test person to get the best tracking results.

The Kinect component provides a standard deviation of 2.5 degrees. A minimum distance of 1 m to the equipment shall be maintained to ensure the best results. Taking into account the points mentioned above, the Kinect can be used within the experiment.


A.1.2 Speech Recognition

Technology:

In computer science, speech recognition is the translation of spoken words into text. Speech recognition only implies that the computer can take dictation, not that it understands what is being said. This process is important in the Sixth Sense context because it provides a fairly natural and intuitive way of controlling the ATMS while allowing the user's hands to remain free. The difficulty in using voice as an input method lies in the fundamental differences between human speech and the more traditional forms of computer input.

Setup:

USB Headset with Push-to-Talk (PTT)

separate Speech Recognition component

Figure 78 - Test Setup – Speech Recognition

Evaluation:

The evaluation was performed by simulating ATM commands and observing the recognised results. It was carried out with 10 users in order to cover a variety of accents, using a setup that included only the speech recognition component and the ground position. During the evaluation, only callsign identification was taken into account.

The following information has been recorded:

Callsign Recognised

Callsign Not Recognised

Callsign Wrongly Recognised.

Overall: 775 ATM commands including a callsign have been observed.


Based on the collected data, the recognition rate of the speech recognition has been determined as follows.

Evaluation Result:

Please find below the summary of the collected information.

Callsign Recog. | Callsign Not Recog. | Callsign Wrongly Recog. | TOTAL
747             | 28                  | 0                       | 775
96%             | 4%                  | 0%                      | 100%

Figure 79 - Callsign Recognition Rate – Speech Recognition

Based on these findings, the following constraints were identified for the experiment.

Name: Description

Microphone Volume: It is important that the microphone volume is correctly adjusted, as otherwise the recognition rate could be influenced.

PTT: During the exercise it is important that the test person uses the PTT button as in real operation, because it triggers the speech recognition.

Taking into account the points mentioned above and the analysed findings, the speech recognition system provides accurate information (96% recognition rate) with respect to callsign identification. Therefore it is recommended for use within the experiment.
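The quoted recognition rate follows directly from the counts in the table above:

```python
# Arithmetic behind the reported recognition rate:
# 747 recognised, 28 not recognised, 0 wrongly recognised callsigns.
recognised, not_recognised, wrongly_recognised = 747, 28, 0
total = recognised + not_recognised + wrongly_recognised
rate = recognised / total
print(total, f"{rate:.0%}")  # 775 commands, ~96% recognised
```

Note that 747 / 775 is 96.4%, which the report rounds down to 96%.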



Appendix B Questionnaires


- END OF DOCUMENT-