
Data Visualization Project Final Write Up

Team CPD (Apurvaa, Chet, Jan, Pranav)

Visualizing Summer Camp Survey Data for Chicago Park District

The Client: Chicago Park District

Financially the largest park district in the USA

❖ $450m annual budget
❖ 3200 FTE staff
❖ 580 parks

Target audience

Brendan, Director of park programs

▷ Needs high-level overview of summer camp satisfaction

▷ Wants to be able to drill down into specific parks/areas for deeper analyses and to resolve issues in collaboration with the respective park manager(s)

Janice, Manager, Lincoln Park

▷ Interested in performance of her own park over time and compared to other parks

▷ Needs good understanding of quantitative survey responses as well as insights from text comments

Summer Camp Statistics

▷ 30,000 participants: Strategic Initiative to Increase Participation

▷ 10 years of survey data with ~30 questions each: quantitative and text

▷ 150 parks across Chicago

Survey Example Questions

▷ Overall, how satisfied were you with the camp program?
    ○ Very Satisfied
    ○ Satisfied
    ○ Indifferent
    ○ Dissatisfied

▷ Will you send your child(ren) back to a Chicago Park District camp next year and recommend it to others?
    ○ Yes/No

▷ What do you suggest we change about the summer camp program?
    ○ Open-ended text comment

Currently CPD does some high-level visualizations in Excel

▷ Stand-alone and static, thus difficult to identify trends across parks or over time.

▷ Not much thought put into design.

▷ Manual and tedious process which needs to be repeated every year.

▷ No analysis of text comments.

A few key questions our visualization will help answer

How is overall participant satisfaction changing over time?

An aggregate response value across all parks is provided for each year of available data. These responses can be plotted on a single axis for easy comparison. This will help park management detect trends.
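The yearly aggregate described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the rows are made-up dummy data, and the function name `yearly_shares` is hypothetical rather than part of the dashboard code.

```python
from collections import Counter, defaultdict

# Hypothetical dummy rows: (survey year, park, answer to the overall-satisfaction question).
responses = [
    (2013, "Park A", "Very Satisfied"),
    (2013, "Park B", "Dissatisfied"),
    (2014, "Park A", "Satisfied"),
    (2014, "Park B", "Satisfied"),
    (2014, "Park B", "Indifferent"),
]

def yearly_shares(rows):
    """Return {year: {answer: share of that year's responses}} across all parks."""
    counts = defaultdict(Counter)
    for year, _park, answer in rows:
        counts[year][answer] += 1
    shares = {}
    for year, counter in sorted(counts.items()):
        total = sum(counter.values())
        shares[year] = {answer: n / total for answer, n in counter.items()}
    return shares

shares = yearly_shares(responses)
for year, distribution in shares.items():
    print(year, distribution)
```

Because every year is reduced to the same set of shares, the yearly results can be placed on one axis and compared directly.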

What is the response for Park A compared to Park B?

A user can manually select parks from a list/map and compare responses for different questions by year. This allows managers to see whether there is a significant difference in response between parks.

What aspects of the summer camp did participants like/dislike?

Text analysis provides park managers with both prevalent topics and the sentiment of comments associated with them. This allows them to get insights which cannot be obtained from the multiple-choice quantitative questions.

Dataset

▷ Survey data in CSV format, one file per year from 2009 onwards. Each row represents one respondent’s answers, and each column is a question.

▷ Each survey has on average >1500 respondents.

▷ Each survey consists of ~30 questions of the following types:
    ○ Likert scale questions
    ○ Yes/No questions
    ○ Categorical multiple-choice questions
    ○ Open-ended text comment questions

Please note: After discussion with CPD, we used the same question format mentioned above but created dummy data for this project.

Assessment of viable design space

Quantitative Questions

Data Abstractions

▷ Data Types:
    ○ The quantitative tab must represent responses to CPD survey questions. These responses can be either purely categorical, or diverging in the case of Likert scale questions.

▷ Dataset Availability:
    ○ The underlying survey responses are dynamic with respect to time. Each year, new data is incorporated into the dataset. Since the updates are so infrequent, typical use cases involve transformations performed on a fixed data set.

▷ Dataset Types:
    ○ The closest representation of the underlying data is a tabular format. Each row in the data corresponds to the entire set of responses provided by a summer camp participant.

▷ Data in Datasets:
    ○ For each row, the values of the selected responses for all questions are stored. In some cases, more than one possible answer may be chosen. In these cases, one question has multiple value-storing columns associated with it.
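To make the tabular layout concrete, here is a minimal Python sketch of one way a multiple-answer question could be stored across several columns and tallied. The column names (`activities_sports` and so on), the 0/1 encoding, and the helper `tally_multi_select` are all assumptions for illustration; the real CPD files may be laid out differently.

```python
import csv
import io
from collections import Counter

# Hypothetical CSV excerpt: one row per respondent, one column per question.
# The three "activities_*" columns together store a single multiple-answer question.
raw = """respondent_id,park,overall_satisfaction,activities_sports,activities_arts,activities_swimming
1,Park A,Very Satisfied,1,0,1
2,Park A,Satisfied,0,1,1
3,Park B,Dissatisfied,1,0,0
"""

def tally_multi_select(rows, prefix):
    """Count how often each option of a multi-column question was selected."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if column.startswith(prefix) and value == "1":
                counts[column[len(prefix):]] += 1
    return counts

rows = list(csv.DictReader(io.StringIO(raw)))
activity_counts = tally_multi_select(rows, "activities_")
print(activity_counts)
```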

Task Abstractions

▷ Who:
    ○ Managers responsible for all parks may be interested in finding general trends across time, while managers of specific parks may be interested in comparing sets of parks.
    ○ People in the marketing department may want an easy way to visualize camp participant responses after rolling out a new advertising campaign.

▷ Actions:
    ○ Search functionality is necessary for finding specific parks or question types. For example, a user may want questions related only to satisfaction, and the viz system should support this.
    ○ Users may want to share or record an interesting finding in an official report.
    ○ Users must be able to quickly summarize general trends as well as drill down and compare parks at a more granular level.

▷ Targets:
    ○ Within a park, the distribution and extremes of the responses are of key interest.
    ○ Trends between parks, as well as across time, may inform decisions as to which parks may deserve special attention.

Current quantitative visualizations are buried in Excel sheets

▷ Park managers must manually sift through dozens of Excel worksheets to see charts.

▷ Separate charts must be consulted for more detailed stats, such as the percentage of total responses.

100% stacked bar charts offer a slight improvement

▷ Stacked bar charts can be adapted for all types of questions
    ○ Purely categorical
    ○ Likert scales
    ○ Yes/No responses

▷ However, comparing multiple Likert response distributions across parks can prove challenging.

Diverging stacked bar charts more clearly emphasize the extreme response categories

▷ Aligning stacks at the median applies to both Likert and yes/no responses.

▷ For purely categorical questions, single-stack 100% bar charts also suffice.
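The alignment idea can be sketched as follows. This is a rough illustration, not the dashboard's actual transformation: it assumes the alignment point is the center of the neutral category (or the boundary between "No" and "Yes" when there is no neutral option), and the function name `diverging_segments` and the counts are made up.

```python
def diverging_segments(counts, negative, neutral, positive):
    """Return [(category, start, end)] in percent, with the alignment point at 0.

    negative: categories ordered from most to least negative
    neutral:  zero or one middle category
    positive: categories ordered from least to most positive
    """
    ordered = negative + neutral + positive
    total = sum(counts.get(category, 0) for category in ordered)
    if total == 0:
        return []

    def pct(category):
        return 100.0 * counts.get(category, 0) / total

    negative_width = sum(pct(category) for category in negative)
    neutral_width = sum(pct(category) for category in neutral)
    # Start far enough left that half of the neutral share sits on each side of 0.
    cursor = -(negative_width + neutral_width / 2.0)
    segments = []
    for category in ordered:
        width = pct(category)
        segments.append((category, cursor, cursor + width))
        cursor += width
    return segments

# Hypothetical counts for the satisfaction question of one park and year.
likert = diverging_segments(
    {"Very Satisfied": 50, "Satisfied": 30, "Indifferent": 10, "Dissatisfied": 10},
    negative=["Dissatisfied"],
    neutral=["Indifferent"],
    positive=["Satisfied", "Very Satisfied"],
)
# A yes/no question has no neutral category, so the stacks meet exactly at 0.
yes_no = diverging_segments({"Yes": 3, "No": 1}, negative=["No"], neutral=[], positive=["Yes"])
print(likert)
print(yes_no)
```

Each bar still spans 100% in total; only its horizontal offset changes, which is what lets the positive and negative extremes be compared across parks.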

Research into alternatives yielded few promising results

▷ Adding area as another visual channel makes distinguishing percentages difficult.

▷ Lack of different colors makes distinguishing ‘positive’ and ‘negative’ responses difficult.


▷ Others are far too dense and scientific to digest for those not very familiar with analytics.

Final choice is flexible and interpretable

▷ 100% diverging stacked bar charts adapt to all included survey questions.

▷ Unified chart type simplifies the development process
    ○ Only slight tweaks are needed to change the data input format

▷ Visual channels are limited to color, position, and length
    ○ Easy to understand for park managers interested in finding trends in survey data

▷ A vertical layout aligns with what we consider to be a feasible set of use cases. Additionally, this layout makes best use of available screen-space, especially given that the Chicago map is best oriented vertically.

Bar Chart Considerations

▷ Software limitations:
    ○ The chosen charting library requires a very specific data format. Because the data series is so rigid, it is somewhat difficult to add another chart type in the future.

▷ Potential Problem Spots:
    ○ Labels are not in immediate proximity to the bars, though our solution of dynamically highlighting the associated park name when hovering the mouse does reduce potential clutter.
    ○ The true data set exhibits a great deal of sparsity, and we don’t have a great way to handle missing data. The current solution plots a bar of zero length, but for certain combinations of parks the consistent spacing between years gets muddled.
    ○ For specific views, notably single-year, single-park views, the vertical thickness of the bars can make interpretation somewhat confusing.

▷ Difficult Issues:
    ○ Data transformation was very complicated; constructing data series involved lots of moving parts behind the scenes.
    ○ The survey data was not always consistent in how questions were phrased and what responses were available. One case we have not accounted for is when a question’s diverging response categories (e.g. Likert) change or expand over time. The client has expressed interest in making the tool as generalizable as possible.
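As an illustration of the sparsity issue, the sketch below shows one possible way to keep the spacing between years consistent: build a full (park, year) grid and mark missing combinations explicitly, so that "no data" stays distinguishable from a genuine zero. This is an assumption about how the problem could be approached, not a description of the current solution; `complete_grid` and the values are hypothetical.

```python
def complete_grid(values, parks, years):
    """Return one (park, year, value) row per combination.

    values: {(park, year): share of positive responses}, possibly sparse.
    Combinations with no survey data get None instead of 0.
    """
    return [(park, year, values.get((park, year)))
            for park in parks
            for year in years]

# Hypothetical sparse data: share of "Yes" answers per (park, year).
observed = {("Park A", 2013): 0.9, ("Park A", 2015): 0.8, ("Park B", 2014): 0.0}
grid = complete_grid(observed, ["Park A", "Park B"], [2013, 2014, 2015])
for park, year, share in grid:
    label = "no data" if share is None else f"{share:.0%}"
    print(park, year, label)
```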

Future Improvements

▷ In addition to general code refactoring and cleanup, the next feature we would include is the flexibility to account for a question whose response scale changes over time.

▷ Currently, the chart is limited to plotting grouped stacked bar charts, which work well for questions with diverging and qualitative responses. There is room to support other chart types depending on user preferences.

Bar Chart Design Considerations

Map: Motivation

▷ Enables users to easily select parks in the same neighborhood or compare parks in different geographic locations.

▷ Additionally, allows selection of parks close to a landmark, such as Lake Michigan or the University of Chicago.

▷ The rectangular selection feature makes it easier to select a large number of parks than manually selecting one park at a time from the menu.
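The logic behind a rectangular selection can be sketched as a simple bounding-box test. This is an illustration in Python rather than the Google Maps code used in the dashboard; the `Park` type, the function `parks_in_rectangle`, and the coordinates are made up.

```python
from typing import NamedTuple

class Park(NamedTuple):
    name: str
    lat: float
    lng: float

def parks_in_rectangle(parks, south, west, north, east):
    """Return the names of parks whose point location lies inside the rectangle."""
    return [park.name for park in parks
            if south <= park.lat <= north and west <= park.lng <= east]

# Hypothetical parks with made-up coordinates in the Chicago area.
parks = [
    Park("Park A", 41.92, -87.64),
    Park("Park B", 41.79, -87.60),
    Park("Park C", 41.88, -87.72),
]
selected = parks_in_rectangle(parks, south=41.85, west=-87.75, north=41.95, east=-87.60)
print(selected)
```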

Preliminary map design proved cluttered, with unidentified markers covering much of the space

▷ We immediately found markers useful for distinguishing the parks in our dataset as interactive points. However, the default Google Maps markers occupied much of the space in the map and obscured the points of interest themselves.

Highlighting park areas helped disambiguate markers, but clutter at low zoom was not eliminated

▷ Locations of interest can be highlighted using the Google Maps API. Fortunately, we did not need separate geographic data to render the green highlighting in the park areas.

Changing marker icons to red circles visible only at high zoom helped eliminate clutter

▷ Changing the marker icons to red circles eliminated clutter. We also noticed that the markers resize according to zoom level. Since the markers completely obscure some of the parks at low zoom, we decided to make them visible only at medium zoom and higher.

Toggling marker fill helped visualize the geographic distribution of selected parks

Adding a legend further disambiguated selected parks from unselected ones

Map Considerations

▷ Software limitations:
    ○ The Google Maps API provided an easy-to-understand and flexible map-visualization library for our purposes. We faced few limitations with the software itself. Any limitations stemmed from learning to use the API while actually implementing the map.

▷ Potential Problem Spots:
    ○ Finding a standardized list of names for the parks in the dataset proved quite challenging. Even though we developed a master list of names for the tool to use, changes to names in the future would require manual fixing by a user.
    ○ There is a slight rendering bug that can occur in the map when a user moves between tabs in the dashboard. Resizing the window fixes the issue, though we haven’t been able to pinpoint the exact cause of the problem.

▷ Difficult Issues:
    ○ The names of parks did not match up between those listed in the survey data and those in our geographic data. Accounting for matches required joining our lists based on Levenshtein distance.
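A join based on Levenshtein distance can be sketched as below. This is a minimal illustration, not the matching code we actually ran: the misspelled names, the `max_distance` threshold of 3, and the function names are assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

def match_names(survey_names, master_names, max_distance=3):
    """Map each survey name to its closest master name, or None if too far away."""
    matches = {}
    for name in survey_names:
        best = min(master_names, key=lambda m: levenshtein(name.lower(), m.lower()))
        distance = levenshtein(name.lower(), best.lower())
        matches[name] = best if distance <= max_distance else None
    return matches

# Hypothetical spellings as they might appear in survey files.
master = ["Lincoln Park", "Humboldt Park"]
matches = match_names(["Lincon Park", "Humbolt Pk", "Unknown Field"], master)
print(matches)
```

Names that stay unmatched (mapped to `None` here) would still need the manual fixing mentioned above.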

Future Improvements

▷ Having a pre-set list of options for selecting parks by geographical area could greatly increase the usability of the map. For example, a user may want to compare parks from two community areas.

▷ When hovering over the park markers, a tooltip appears that shows the park’s name. Expanding the size and responsiveness of this tooltip could speed up the process of selecting parks on the map. For example, the tooltip can be displayed for each selected park without having to hover.

Map Design Considerations

Assessment of viable design space

Text Analysis

Data Abstractions

▷ Data Types:
    ○ The text tab represents responses to CPD open-ended/text comment survey questions.

▷ Dataset Availability:
    ○ The underlying survey responses are dynamic with respect to time. Each year, new data is incorporated into the dataset. Since the updates are so infrequent, typical use cases involve transformations performed on a fixed data set.

▷ Dataset Types:
    ○ The closest representation of the underlying data is a tabular format. Each row in the data corresponds to the entire set of responses provided by a summer camp participant.

▷ Data in Datasets:
    ○ For each row, the values of the selected responses for all questions are stored.

Task Abstractions

▷ Who:
    ○ Managers responsible for all parks may be interested in the general sentiment of respondents’ feedback, while managers of specific parks may be interested in comments for their own parks.
    ○ Summer camp survey staff may want to identify the most popular feedback topics, and incorporate corresponding questions into the following year’s survey.

▷ Actions:
    ○ Search functionality is necessary for finding specific parks or question types. For example, a user may want questions related only to satisfaction, and the viz system should support this.
    ○ Sorting individual comments by sentiment is also key in helping users quickly identify the most negative feedback.

▷ Targets:
    ○ Identifying the most popular keywords with the most negative sentiment can help park managers diagnose and address issues.
    ○ Positive feedback helps validate what the summer camp is doing right.
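The sorting action can be sketched as follows. The comments, the sentiment scores (assumed here to lie between -1 and 1), and the helper `most_negative` are made up for illustration and do not come from the actual survey data.

```python
# Hypothetical comments with precomputed sentiment scores in [-1, 1].
comments = [
    {"park": "Park A", "text": "Counselors were wonderful", "sentiment": 0.8},
    {"park": "Park A", "text": "Lunch was often late", "sentiment": -0.4},
    {"park": "Park B", "text": "Pool was closed too often", "sentiment": -0.7},
    {"park": "Park B", "text": "Lunch options were fine", "sentiment": 0.1},
]

def most_negative(rows, keyword=None, limit=3):
    """Return up to `limit` comments, most negative first, optionally filtered by keyword."""
    selected = [row for row in rows
                if keyword is None or keyword.lower() in row["text"].lower()]
    return sorted(selected, key=lambda row: row["sentiment"])[:limit]

worst_overall = most_negative(comments, limit=2)
worst_lunch = most_negative(comments, keyword="lunch")
print([row["text"] for row in worst_overall])
print([row["text"] for row in worst_lunch])
```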

The visualization for analysis of text comments went through several design iterations

Pre-DataVis-class:

▷ No conscious thought put into design questions.

▷ Tried boilerplate topic-modelling code from the web.

▷ Topics don’t tie back to original user comments, thus difficult to get context.

▷ Client dissatisfied due to difficulty in understanding visualization and lack of interpretability.

For the second attempt, we put more emphasis on design

▷ Link: http://bit.ly/1KYwtS1

▷ Easier to understand, since main keywords are displayed.

▷ Avoided “overloading” with too many design elements.

▷ Managed to also incorporate sentiment analysis and tie back to user comments.

However, this still breaks some design principles…

▷ Uses a red-green color scheme, which is not color-blind safe.

▷ Two-shade color scheme makes it difficult to compare sentiment across keywords.

▷ Circles are arranged randomly, thus the most common keywords are not immediately apparent.

▷ Need to scroll down to see user comments.


We also considered bar charts

▷ Inefficient space utilization.

▷ Hard to read as number of keywords increases.

The current version is the most design-conscious and user-friendly so far

▷ Applying design principles learned in class:
    ○ Using a color-blind safe color scheme.
    ○ Sentiment represented by single hue instead of two colors.
    ○ Uses order (larger bubbles in center) as visual cue to easily identify relevant keywords.
    ○ User comments in sortable, searchable table on the right panel of the visualization, thus no need to scroll down.
    ○ Efficient use of space.
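Encoding sentiment with a single hue amounts to varying only the lightness of one color. The sketch below shows one way this could be done; the hue value, the lightness range, the direction (darker for more positive), and the function name `sentiment_to_hsl` are assumptions, not the values used in the dashboard.

```python
def sentiment_to_hsl(score, hue=210, lightest=90, darkest=30):
    """Map a sentiment score in [-1, 1] to an HSL color string of a single hue.

    Assumption: more positive scores are drawn darker; scores outside
    the range are clamped.
    """
    clamped = max(-1.0, min(1.0, score))
    fraction = (clamped + 1.0) / 2.0  # 0 for most negative, 1 for most positive
    lightness = lightest - fraction * (lightest - darkest)
    return f"hsl({hue}, 70%, {lightness:.0f}%)"

for score in (-1.0, 0.0, 1.0):
    print(score, sentiment_to_hsl(score))
```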

Bubble Chart Considerations

▷ Software limitations:
    ○ Bubble chart layout only orders keywords by count, not by similarity. Adding any other order will require significant code.

▷ Potential Problem Spots:
    ○ A keyword may not be fully visible in smaller bubbles, although the full keyword and corresponding count can be viewed by hovering over a bubble.
    ○ Since the color of the bubbles is calculated based on overall sentiment, there may not be much variation in color because users tend to give more positive comments on the whole. Thus, to find negative feedback the user has to look at the table of associated comments for each keyword.
    ○ Depending on the comments, there may be cases where there is a very large number of bubbles due to comments not being related to the same topic.

▷ Difficult Issues:
    ○ Since D3 is somewhat low-level, every element requires several lines of code.

Future Improvements

▷ Option to view only top n keywords, to avoid clutter when many keywords are present.

▷ Visualize distribution of sentiment for each keyword to get a better idea of overall sentiment, rather than just the average.

▷ Enable side-by-side views to more easily compare changes over time.
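The first two improvements could be prototyped along these lines. This is only a sketch: the keyword counts and sentiment scores are invented, scores are assumed to lie in [-1, 1], and `top_keywords` and `sentiment_histogram` are hypothetical helpers.

```python
from collections import Counter

def top_keywords(keyword_counts, n):
    """Return the n most frequent keywords as (keyword, count) pairs."""
    return Counter(keyword_counts).most_common(n)

def sentiment_histogram(scores, bins=4):
    """Count scores in equal-width bins over [-1, 1], most negative bin first."""
    counts = [0] * bins
    for score in scores:
        clamped = max(-1.0, min(1.0, score))
        index = min(int((clamped + 1.0) / 2.0 * bins), bins - 1)
        counts[index] += 1
    return counts

# Hypothetical keyword counts and per-comment sentiment scores for one keyword.
keyword_counts = {"staff": 42, "swimming": 17, "lunch": 9, "schedule": 30}
top_two = top_keywords(keyword_counts, 2)
histogram = sentiment_histogram([-0.9, -0.2, 0.1, 0.6, 0.7, 1.0])
print(top_two)
print(histogram)
```

A histogram like this shows whether an average sentiment hides a split between very positive and very negative comments.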

Bubble Chart Design Considerations

Final Design

Interactive Dashboard with Tabs for Quantitative and Text Data Visualization

Design Description

Filter by Question

Dashboard displays data for one survey question at a time.

Filter by Park

Users can choose to view data for one park or several parks at a time.

Filter by Year

Users can choose to view data for a single year or for a range of years.

Filter by Geographic Location

Dashboard includes a map of Chicago parks so that users can easily compare parks in different neighbourhoods/regions.

Text Comments Analysis

Visualizes main keywords/topics in user responses along with the level of positive/negative sentiment associated with each keyword.

Design Details - Navigation

Software

JavaScript libraries - Google Maps

Slider (for filtering responses by ordered categories - in this case, survey year)

Drop-down menus (for filtering responses by unordered categories - park and question)

Map (for filtering responses by location)

Design Details - Charts

Software

JavaScript libraries - amCharts, ZingChart, jQuery, D3.

Packed Bubble Chart - for visualizing topic model and sentiment analysis output.

Diverging Stacked Bar Chart - for visualizing Likert scale and yes/no question data.

Design Justification

▷ Dashboard provides big-picture overview as well as detailed view of survey results.

▷ Comprehensive solution for visualizing quantitative and text data.

▷ Makes data interesting to encourage exploration.

▷ Reduces manual effort, since dashboard can be easily generated for future data.

▷ Uses free, open-source software, thus easy to extend for future use cases.

Related Work

▷ Likert Scale Data Visualization

▷ Visualizing Survey Data and Visualizations

▷ Yelp Review Visualization

http://stephanieevergreen.com/aggregated-stacked-bars/

http://www.datarevelations.com/category/visualizing-survey-data-and-likert-scales





