We begin a new feature this month in which I look at some recently published data visualizations and offer suggestions on how they can be improved.
I’ll start with the MASIE Center’s Mobile Pulse Survey results. For those of you who don’t know, the MASIE Center is a Saratoga Springs, NY, think tank focused on how organizations can support learning and knowledge within the workforce. They do great work.
So, why focus on this report? There are two reasons:
The subject matter, mobile learning, is near and dear to me after my stint with the eLearning Guild.
There’s good stuff in the data, but the published visualizations make it difficult to understand what the data is trying to say.
Ah, Likert-Scale Questions
Consider this chart below that attempts to describe the results to the question “Current level of interest in providing the following learning elements on mobile devices.”
This chart is a tough read. With the exception of the fourth item, “Access to the Web,” which clearly has a really big “Strong Interest” bar, it’s very difficult to determine which of the ten items are high on respondents’ lists and which are low.
With this rendering it’s very easy to see that “Access to Corporate Databases and Intranets” is only slightly behind “Access to the Web”. It’s also trivial to sort the ten items by respondent sentiment.
Particularly surprising to me is the negative sentiment (e.g., no interest or low interest) towards accessing simulations. I would have expected there to be quite a bit more interest here. That fact was buried in the other chart.
Note: For those of you who want to see the exact values for particular items, as well as compare positive vs. negative sentiment, there is a fully interactive version of this visualization at the end of this blog post.
Yes / No Questions
Here’s another chart that attempts to show responses to which factors cause concern about Mobile Learning.
Why diamonds, and why two sets of them? Here’s an alternative that I think is easier to understand and prioritize.
There’s some great stuff in the MASIE report, but the published charts are obfuscating rather than illuminating the data.
Here’s the interactive version of the first chart. This would be WAY cooler if we could filter by industry or company size, but that data is not available to me.
Author’s note: Stephen Few is a bastion of data visualization excellence. I encourage all of my Tableau clients and students to purchase his books, read his blogs, and attend events where he is presenting.
I’ve been troubled by both the premise and the results of Stephen Few’s dashboard design competition. I think it celebrates the number of disparate elements that can be crammed onto a single dashboard rather than championing the display and positioning of the elements that really need to be present (see http://www.perceptualedge.com/blog/?p=1374).
I also submitted an entry into the competition (you can read it here).
Consider this section from the instructions that accompanied the data set:
“Even though a dashboard designed for this purpose would ordinarily provide access to additional information to supplement the overview, which could be viewed by navigating to different screens or through tooltips, this won’t be addressed in this competition. Concern yourself only with the design of the initial dashboard that this teacher could use to rapidly monitor what’s going on with her students in preparation for the day’s class.”
Now let’s consider all the elements that were in the data set:
Student Name
Days Absent This Term Count
Days Tardy This Term Count
Disciplinary Referrals This Term Count
Disciplinary Referrals Last Term Count
Detentions This Term Count
Detentions Last Term Count
Latest Standardized Math Assessment Score
9th Grade Standardized Math Assessment Score
8th Grade Standardized Math Assessment Score
7th Grade Standardized Math Assessment Score
6th Grade Standardized Math Assessment Score
Assignment 1 Score
Assignment 2 Score
Assignment 3 Score
Assignment 4 Score
Assignment 5 Score
Assignments Completed Late Count
Current Course Grade
Current Course Score
Student’s Course Grade Goal
Student’s Course Score Goal
Previous Math Course Grade
Previous Math Course Score
English Language Proficiency
Special Ed Status (Y/N)
Student Absences
Student Tardies
Absences yesterday count
Tardies yesterday count
Median and bin distribution of latest standardized exam
Median and bin distribution of latest standardized exam for other class
Median and bin distribution of latest standardized exam for entire school
Median and bin distribution of latest standardized exam for entire district
So, do all these elements warrant being on the dashboard? Given that our teacher (Susan) will be looking at this dashboard every day, does she really need to see how her class compares with others on the latest standardized math test? And just how important is it to know what score Allison Perry received on her sixth grade standardized assessment test?
As somebody who builds (and teaches others how to build) interactive data visualizations for a living I’m always asking myself two questions:
What is it in the data that is important; and,
How do I convey that importance?
Maybe there’s a good reason to include years-past test scores on a daily-viewed dashboard, but my bet is that no one performed any correlation or trend analysis to connect the dots among these disparate elements.
So, given this, just what would I highlight?
In looking at both the winning and runner-up dashboards, I see that comparing a student’s current grade with his / her grade goal is featured prominently. Here’s the applicable portion from the winning entry:
Here’s how the runner up handled the same data:
Here’s my submission:
(I confess that I subscribe to the “I want people from the back of the room to be able to grok what’s going on” approach to visualization.)
While I do think all of these will help Susan easily see how a student is performing against his/her goal, I think we’re still missing the “Big Story”.
The Big Story (and It’s Missing from the Winning Dashboard)
If I were Susan I would be hyperventilating because, according to the data set scenario, it’s May, the course is 80% complete and a whopping 43% of her students are below goal:
Both the winning and runner-up dashboards are missing this element, and I maintain it’s the most important story that is in the data.
So, just what should Susan do with this visualization, besides fret?
If I had different data I would try to show how actual vs. goal has been progressing over time. Maybe Susan should be thrilled because in March 65% of her students were below goal, but now only 43% are?
An example of how goal trends might appear
I can’t say, because the data available to me doesn’t track that information, but I would argue that tracking goal trends over time is a lot more valuable than displaying years-old test data.
I’m Guilty, Too
I included a visualization on the dashboard that shows how Susan’s class performed on the latest standardized exam with respect to her other class, the school, and the district.
Pretty neat, huh? This uses the staggered / divergent stacked bar chart that I so like to use for presenting Likert scale data.
But does this really need to be here? That is, does Susan need to see this months-old data every single day as she prepares for her class?
The answer is she doesn’t and I don’t think this oh-so-clever chart belongs on the dashboard. I do, however, suggest adding data that is not in the data set as an example of something that I think would be very useful:
Great Idea, but Poor Data Set?
As I indicated in the author’s note, I’m an enthusiastic disciple of Stephen Few’s. I also like data visualization challenges / competitions and have seen my own career propelled by participating in them.
I also think the entries Few cites are beautiful and wonderfully crafted, but I can’t help but think that this particular dashboard competition needed a better data set where all the elements truly warranted being on the same dashboard. Indeed, even in his instructions, Few acknowledges that under ordinary circumstances many elements in the data set would be presented via tooltips and secondary dashboards. How can one create a truly useful dashboard that contains all of the included elements while knowing that many of them just shouldn’t be there?
Thoughts on Standard Business Practices, User Expectations, Tableau’s Server Pricing Model, and Appealing to the Consumer’s Inner Narcissist
In my six-plus years of using Tableau I’ve created hundreds of dashboards and thousands of interactive visualizations. In observing these creations out in the field I noticed something rather disappointing – the vast majority of people for whom the systems were created were not interacting with these so-called interactive dashboards.
And as much as I love Tableau, I believe the company’s server pricing model is retarding rather than accelerating the adoption of a more enlightened approach to data exploration and sharing.
Let me explain.
Turning something dynamic into something static –
My students’ experiences
In addition to my consultancy I am a Tableau Training Partner.
One of the somewhat sobering – and disheartening – things I hear from my students is that organizations are using Tableau to embed static images in static reports. That is, while much of the curriculum is spent showing people how to create highly interactive visualizations, the vast majority of attendees report that the way they will share their visualization brilliance is with PDFs and PPTs.
You mean your company has invested in Tableau and you’re just going to copy and paste stuff into static documents that get e-mailed?
After a few seconds of sheepish nods all around we usually get into a discussion of why this is happening and what one can do to help organizations embrace the full power in Tableau. Here are some of the things we discuss:
Cost (i.e., server pricing)
Make it Irresistible by Making it Personal
A lot of organizations look at Tableau as a type of report writer, something tantamount to using a spreadsheet as a word processor. As Tableau makes inroads into organizations, it seems it first has to prove itself by duplicating whatever functionality it is supposed to replace, rather than complementing, augmenting, or leapfrogging the incumbent technology.
While Tableau Server can be cajoled into automating the creation and distribution of PDF files, I don’t think this is a great use for the product. Indeed, if you need different cuts of the data or different pre-filtered views for different personnel – but you want to distribute these as static reports – Tableau is not, in my opinion, your best bet.
So, for those of you in the avant-garde of your organizations, you need to manage the expectations of your colleagues and higher-ups and explain that Tableau is not a report writer.
I’ve brokered a couple of lively debates around visual filters and how long it will be before people just expect clicking a chart to have some impact on a dashboard. For example, if you have a dashboard that has quick filters like the ones shown below, most people will know what these things are and how to work with them.
Figure 1 -- Tableau quick filters. People know what these things are and how they work.
However, if you were to expose these same people to the dashboard snippet shown below, very few will know that clicking a portion of one chart will trigger an action that filters all the other charts in the dashboard.
Figure 2 — Visual filters. Without instruction people will not know that selecting a bar or bars will filter by the selection.
This is amazingly potent stuff, but unless you show people how it works they aren’t going to simply discover it on their own.
So, how long will it be before people “know” that charts can be filters, too, and this becomes expected functionality?
Based on nothing more than gut, I predict it will take three years, but I wonder what Tableau could do to accelerate this expectation.
And as I wonder, I find myself thinking about Tableau Reader and Tableau’s Server licensing model.
Deployment Hurdles with Tableau Reader
I’m very grateful that Tableau offers a no-cost reader. The problem is that Tableau Reader is a non-starter with many organizations and potential individual interactors for myriad reasons:
You lose a lot of people when you ask them to first download and install something.
Those that you don’t lose may not be able to download and install the reader (e.g., it is not an “approved” app in the organization).
There is no way to protect the underlying data from prying eyes or to even password protect a packaged workbook.
You cannot customize the workbook for different consumers of the data (e.g., have certain things pre-filtered based on department, job function, etc.)
You cannot easily add helpful scaffolding to the dashboards that walk people through how to use the dashboard (whereas you can do this when you embed a dashboard inside a web page).
So, what about Tableau Server?
Tableau Server Pricing
Tableau Server is a great solution, but it can be prohibitively expensive.
The last time I investigated, there were two ways to purchase Tableau Server:
An individual interactor licensing model that costs $1,000 per license with a minimum of 10 licenses; and,
An “all-you-can-eat” core licensing model that costs between $160K and $250K, depending on whether you purchase a quad-core or eight-core license.
The first license is very reasonably priced if you have a small group of people that will be interacting with Tableau frequently, and the second license is very reasonably priced if you have a large group that will be interacting frequently.
But what happens when you have a large group of people that wants to make light use of Tableau? That is, what if you have thousands of users but each of these users will only need to interact with Tableau three or four times a year, for maybe 30 minutes at each session?
I have had four different clients that have fit that very scenario – thousands of potential interactors but very few concurrent interactors – and they would love to implement a Tableau server solution but the $1,000 per user or $160K core license is totally untenable.
I understand that Tableau does not want to undervalue its technology, but why not offer a concurrent licensing model? That is, make it so that the server can only handle a certain number of users at any one time. There are so many reasons why this would make sense, both for the organizations that want to adopt the technology and for Tableau:
1) The initial cost to adopt will be low as one could purchase a five or ten-user concurrent license.
2) Many people in the organization will be able to “kick the tires”.
3) As people see the benefits of this approach, more and more people will use it more and more often.
4) This in turn will lead to organizations purchasing more concurrent licenses.
Make It Irresistible; Make it Personal
Let’s say we surmount the training and cost issues. Will people in fact dig in and click?
I’ve learned a lot by observing how people use – or do not use – the dashboards I’ve created. To get people to use the dashboards, they have to care about the data, and people will care deeply if the data is about them. That is, the dashboards that get the most traction answer questions like these:
How is our company doing vs. other companies?
How is my department doing vs. other departments?
Where do I stand vs. my peers?
Consider the example below that shows a simplified version of a CTO compensation dashboard I created earlier this year. A person can apply various filters and see salary ranges based on the filter settings.
Figure 3 -- Generalized salary visualization.
Now, compare this with the visualization below where we allow the user to enter his / her salary, and then apply the filters. We get a much more compelling story and we make it a lot easier to see how the person is doing vs. others.
Figure 4 -- Personalized salary visualization.
This type of dashboard – where we have both an interesting story and data about the person viewing it – will generate a lot of interest. Once people see how powerful and personally meaningful this is, they won’t want to go back to PDFs and PPTs.
Here’s something you are welcome to try: have a friend or colleague go to a Tableau Public URL that contains multiple tabs, then see whether he or she realizes that there is in fact more than one view / dashboard available to explore.
I’ve created hundreds of dashboards with multiple tabs, and the bottom line is that most people will not notice the tabs. This is why my dashboards have navigation links (i.e., arrows that people can click to move forward and back).
The problem is that crafting these forward / back buttons is very time-consuming; each one has to be “hand-chiseled” as it were. Indeed, I spend as much time on the cosmetics / usability component of dashboard development as I do building the actual visualizations.
Compounding my frustration with this is that I really have little or no use for the tabs along the top, but the navigation buttons won’t work unless the tabs are visible.
So, I respectfully submit a feature request to the folks at my favorite software company – please make it easy for me to generate a clickable table of contents, and please make it easier to build forward / back buttons.
… this request is a distant second to my top Tableau feature request (let’s all chime in together): Dashboard-level filters.
Note: While the information in this blog is useful, I’ve discovered some better ways to perform intra-question analysis; I just haven’t blogged about them yet. Feel free to nag me.
So, I thought I was all done with this subject (see http://www.datarevelations.com/using-tableau-to-visualize-survey-data-part-2.html) but a couple of blog readers (Matt and Tony) “busted” me and pointed out that my approach did not address a way to conduct intra-question analysis. That is, we could cut Likert scale and check-all-that-apply questions by gender, location, etc., but there was not an easy mechanism to compare responses for people that answered “yes” to one question with their Likert scale opinions on another question. To put it another way, we did not have a way to see whether or not folks that answered “no” to voting in the next election strongly agree with the statement that a candidate is good at playing jazz.
I’ll confess that I don’t perform this type of analysis that often, but when I do, in addition to drinking Dos Equis, I find that the most interesting insights usually come from a scatterplot analysis where, for example, we compare the responses for annual household income with how much respondents are willing to pay per month for Internet service.
Note: If you have not already done so, please read Part 1 and Part 2.
Further note: click here to download the sample survey data.
Final note: The interactive visualization may be found at the end of this blog post (you’ll find the download link there as well).
What it Looks Like When it’s Done
Consider the example below where I compare responses to the question “Will you vote in the upcoming election?” with respondents’ Likert scale opinions on a candidate’s ability to play jazz, grace under pressure, consistency in eating all of his/her vegetables, and intelligence.
Comparing plans to vote in next election with several Likert-scale questions
Notice that I’m using a staggered Likert scale visualization to offset negative from positive responses. Joe Mako busted my chops last time for not doing this in my survey analysis visualization.
That said, I am going to build a simplified visualization (the one below) as I want our focus to be on intra-question analysis, not the best way to fashion Likert-scale visualizations.
A simpler visualization
Getting Two Sets of Questions by Joining a Table to Itself
(Or joining a table that’s really similar to the primary table)
If you downloaded the data source workbook you will see there is a tab called “Use This One” that contains data that looks like this:
Data from the “Use This One” table
There’s another tab called “Reshaped Questions” and it contains the exact same information, but without the demographic columns:
Data from “Reshaped Questions” table.
To achieve our goals we will need to join these two sheets (or tables) using ID as a common field.
Editing the Data Connection and Joining the Tables
We need to edit the Tableau data connection, indicate that we want to use multiple tables, and then add a table (in this case the table called “Reshaped Questions”). We then need to fashion an inner join that looks like this:
Joining the two tables
Note: This approach will generate A LOT of additional rows (180K vs 9K). Once you figure out just which questions you want to include in your intra-question analysis you may want to refine your extract and exclude questions you don’t need.
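The row multiplication is easy to reason about: an inner join on ID pairs every response row for a respondent with every other response row for that same respondent, so a respondent who answered n questions contributes n × n rows. Here’s a minimal Python sketch of that self-join using made-up data (the IDs, questions, and responses below are hypothetical, not the actual survey file):

```python
# Reshaped survey data: one row per respondent per question.
rows = [
    {"ID": 1, "Question": "Q1", "Response": "Yes"},
    {"ID": 1, "Question": "Q2", "Response": "Agree"},
    {"ID": 1, "Question": "Q3", "Response": "Disagree"},
    {"ID": 2, "Question": "Q1", "Response": "No"},
    {"ID": 2, "Question": "Q2", "Response": "Neutral"},
]

# Inner join of the table to itself on ID: every question a respondent
# answered pairs with every question that same respondent answered.
joined = [
    (a["ID"], a["Question"], a["Response"], b["Question"], b["Response"])
    for a in rows
    for b in rows
    if a["ID"] == b["ID"]
]

# Respondent 1 answered 3 questions -> 9 joined rows;
# respondent 2 answered 2 questions -> 4 joined rows.
print(len(joined))  # 13
```

This is why 9K source rows can balloon to roughly 180K joined rows, and why trimming unneeded questions from the extract pays off.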
Tableau will add two new dimensions to the mix, as shown below.
New dimensions giving us two sets of data
So, Let’s See What We Got
If we place ‘Reshaped Questions$’_Question and ‘Reshaped Questions$’_Response on the Columns shelf, Question and Response on the Rows shelf, and CNT(Number of Records) on the Text shelf, we’ll see a complete cross tab of all questions and responses plotted against all questions and responses.
Cross tab of all questions
Now we can refine our view by looking at just one question from the Reshaped Questions table and a handful of questions from the “Use This One” table, as shown here.
Comparing results from one question with responses from several Likert-scale questions
Next, by moving things around a little bit and by changing the CNT formula to show a percentage of the total, we start to glean some interesting insights.
Refining the cross tab
And finally, if we apply some of the Likert scale visualization techniques we explored in several earlier blog posts, we start to see some interesting patterns (e.g., folks that don’t know if they will vote in the upcoming election offer more positive responses to the Likert-scale questions.)
Further refinement combining stacked bars and circle marks
So, Why is this Part 2 ½ and not Part 3?
There is so much more we could explore with respect to intra-question analysis, but I have a backlog of posts I need to write. I did, however, want to get people heading in the right direction.
One thing I will encourage people to explore is using parameters to allow consumers to select which questions they want to see displayed, especially if you build scatterplots.
Matt and Tony, I hope this addresses some of the missing pieces in visualizing survey responses in Tableau.
So, as anyone who has either taken a Tableau class with me or has read my blog posts knows, I go ballistic when I see anyone use a red/green divergent color palette.
So, it’s great to see people using Tableau’s color-blind palette, but we need to get something straight:
Dark Blue == Good, Lots
Dark Orange == Bad, Few
I’m seeing examples where people are applying the colors in reverse, and, well, that’s just wrong (in my not-so-humble opinion). Think about it: when was the last time you saw a danger sign flash in royal blue vs. red or orange?
While I’m on the soap box, I want to sing the praises of the simple highlight table. Consider this:
I love this type of visualization because it’s so easy to see which combinations of sub-product / region are performing well and which are doing poorly. In other words this passes the “can I grok it from the back of the conference room” test.
I really am not “out to get” Texas. Yes, I have previously published a visual study tracking STDs, HIV, and AIDS in Texas, which in turn led to some blog posts on the subject, but one of the main reasons for focusing on Texas was that the data I needed were readily available and that was not the case with many other states.
So, in my most recent visual study on Foster Care placement I looked at data from all 50 states. I really wasn’t even going to look at Texas, but something unexpected just bubbled to the top (or to be accurate, plummeted to the bottom) when I looked at one very important finding.
Let me explain.
The main focus of the study was to compare states that placed children in kinship foster care (i.e., placement with a member of the child’s extended family) with states that placed children in congregate care (i.e., placement in a group home or institution). The National Coalition for Child Protection Reform maintains that kinship care is the least harmful form of foster care while congregate care is the most harmful.
For the record, Texas ranks 13th for kinship care and 24th for congregate care (where being ranked high in kinship care is good but congregate care is bad.)
So, what’s the problem with Texas?
Perhaps more important than the issue of kinship vs. congregate care is whether or not states are needlessly placing children in any form of foster care.
For the last two years for which data is available, the number of children placed in foster care decreased in 35 of 50 states. And if we exclude Texas from our analysis, the overall number of cases for the other 49 states decreased by 3.9%.
And Texas? The overall number of cases increased by 9.6% giving Texas the bottom spot on the list.
So, whether it’s STD cases or foster care placement, the old adage seems to apply…
In my previous blog post I discussed how to reshape your survey data so that it plays nicely with Tableau. In this post we’ll explore how to visualize three types of survey questions:
Yes / No / Maybe
Check-all-that-apply
Likert scale
Note: A fully working collection of interactive visualizations may be found at the end of this blog post.
Connecting to the Data
We need to connect to the reshaped data source we created in Part 1. For those of you who are following along and downloaded the sample data, there are several tabs in the source Excel file. Make sure you connect to the one marked “Use This One” as shown below.
Connecting to the right data source
When asked how you want to connect to the data, select “Import all data”.
You’ll be happier if you import all the data and let Tableau create an extract
There are two reasons for doing this:
With larger data sets your performance will be much faster; and,
You will need Tableau’s COUNTD() function and that is not available when you connect directly to Excel (or Access, for that matter.)
Your Dimensions and Measures shelves should look like those shown here.
Default Dimensions and Measures
Note that the ID “Measure” should in fact be a Dimension, so when you get a chance, drag the field up to the Dimensions shelf.
So, just who are these people?
Before visualizing any questions I like to get a sense of who participated in the survey. I’ll start by looking at Gender.
Start by getting a feel for who answered the survey questions
Notice that I place COUNTD(ID) on the Columns shelf, indicating that I want to count the number of unique respondents. Had I used the regular COUNT() function I would count all of the rows that contain a respondent’s ID, and that is a large number given how we’ve reshaped the data.
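The COUNT vs. COUNTD distinction matters a lot with reshaped data, because every respondent now occupies many rows. A quick Python sketch (with hypothetical rows) makes the difference concrete:

```python
# Reshaped survey data: one row per respondent per question,
# so each respondent's ID appears multiple times.
rows = [
    {"ID": 1, "Gender": "Female", "Question": "Q1", "Response": "Yes"},
    {"ID": 1, "Gender": "Female", "Question": "Q2", "Response": "No"},
    {"ID": 2, "Gender": "Male",   "Question": "Q1", "Response": "No"},
    {"ID": 2, "Gender": "Male",   "Question": "Q2", "Response": "Yes"},
    {"ID": 3, "Gender": "Female", "Question": "Q1", "Response": "Yes"},
]

# COUNT(ID) counts every row that has an ID...
count = len([r for r in rows if r["ID"] is not None])

# ...while COUNTD(ID) counts each respondent exactly once.
count_distinct = len({r["ID"] for r in rows})

print(count, count_distinct)  # 5 3
```

With the real data set the gap is much larger, which is why COUNTD(ID) is the right measure for “how many people answered.”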
The “Null” value indicates that some people chose not to respond to this demographic question. I suggest creating an alias called “Did not respond” to handle these non-responses.
From here I will create views for any other demographic groups (e.g., Generation, Location, etc.).
Visualizing a Yes / No / Maybe question
Let’s start by seeing what questions are available by placing Question on the Rows shelf.
With reshaped data there’s only one field, but many questions
As you may recall from the previous blog post, our sample survey contained the question “Do you plan to vote in the upcoming election?” for which there were three possible responses (“Yes”, “No”, and “I don’t know.”)
As I’m not interested in all the questions in the survey I need to add a filter that just keeps the first question in the list.
To see preliminary results for this question I will place “Response” on the rows shelf and COUNTD(ID) on the columns shelf. While not particularly pretty, it does show us something useful.
I will next alias the Question (and Responses), sort the Responses, and place Response on the color shelf. I will also add a table calculation to the Columns shelf so that instead of showing a count we show the percentage of the total.
A simple, clear visualization
Now we’re talking!
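Under the hood, the percent-of-total table calculation is nothing more than each response’s distinct-respondent count divided by the total across all responses. A small Python sketch with hypothetical tallies (one list entry per distinct respondent):

```python
from collections import Counter

# Hypothetical responses to "Do you plan to vote in the upcoming
# election?" -- one entry per distinct respondent.
responses = ["Yes"] * 55 + ["No"] * 30 + ["I don't know"] * 15

counts = Counter(responses)
total = sum(counts.values())

# The percent-of-total table calculation: each count / grand total.
percents = {resp: n / total for resp, n in counts.items()}

print(percents["Yes"])  # 0.55
```

Tableau does this for you when you add a “Percent of Total” table calculation to CNTD(ID), but it helps to know that this is all the calculation is doing.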
The nice thing about the way we have this set up is that it’s easy to break down the results by our demographic segments simply by dragging a dimension to either the rows or column shelf, as shown below.
Results broken down by generation
Check-All-That-Apply Questions
To get a sense of what we have going on for this question type I duplicated the sheet from the previous visualization and then changed the Question filter so that I’m looking at the check-all-that-apply questions. Here are the filter settings.
Selecting questions using a filter
Do you now see the major advantage of having reshaped the data? Had I kept the one column / one question approach I would have to use Tableau’s Measure Names / Measure Values feature and I would likely have to build a separate calculated field for every question I want to analyze. With reshaped data I just put Question on the Rows shelf and filter by the questions I want to analyze.
Tip: I will usually create a set for each question grouping I want to review, sort of a “named filter”, as it were. In the accompanying workbook you will find two sets, one called “Check All That Apply” and the other called “LikertQuestions”. You can then drag either of these sets to the filter shelf.
Here’s the resulting visualization.
A bad way to show check-all-that-apply results
Well, it’s a start, but it’s not a very good visualization. For one thing, I only care about the percentage of people that selected “yes”; I don’t even need to consider the “no” folks.
I need to fashion a formula that counts how many “yes” responses there were for each question, then divides by the number of responses. This would be simple if the responses for these questions were 1 and 0, but unfortunately the responses are labels that look like this:
So, I need to perform a little string arithmetic. The calculated field that does what we want is shown here.
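The logic behind that calculated field is easy to sketch. Assuming the labels lead with the numeric value (e.g., “1: Yes” and “0: No” — hypothetical labels; check how your own survey tool encodes them), we grab the left-most character, convert it to a number, and average — roughly the equivalent of Tableau’s INT(LEFT([Response], 1)):

```python
# Hypothetical check-all-that-apply responses for one question;
# the labels lead with the numeric value (an assumption -- verify
# against your own data).
responses = ["1: Yes", "0: No", "1: Yes", "1: Yes", "0: No"]

# Grab the left-most character of each label and convert it to a
# number (the string arithmetic), then average to get percent "yes".
yes_values = [int(r[0]) for r in responses]
pct_yes = sum(yes_values) / len(yes_values)

print(pct_yes)  # 0.6
```

In Tableau the same idea is expressed as a calculated field aggregated with AVG, giving the percentage of respondents who checked the item.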
And a visualization that employs this formula is shown here.
A better way to show check-all-that-apply results
That’s much better. And of course, I can add a demographic dimension to glean additional insights. In the screen below I show a comparison between male and female respondents, using an overlapping bar chart.
Check-all-that-apply question, broken down by gender
Likert Scale Questions
Note: Before continuing, you must promise that upon reading the rest of this blog post you will read the following two blog posts:
As with the previous two question types, the first thing I need to do is place “Question” on the Filters shelf and select the group of questions we want to analyze in this visualization:
Using a filter to select the Likert scale questions
Now I’ll create a stacked bar chart by placing “Question” on the Rows shelf, CNTD(ID) on the Columns shelf, and Response on the Color shelf.
First pass at Likert Scale viz
Next I need to replace the CNTD(ID) measure with a table calculation that shows the percent of the total based on Response.
Adding percent of total to the viz
Since the questions are sorted in alphabetical order it’s hard to gauge which questions garner the greatest amount of agreement. This is where using the Likert scores (4 for “Very high degree”, 3 for “High degree”, etc.) can help add clarity.
While I go into much greater detail in the two blog posts YOU PROMISED TO READ, I’ll walk you through how to calculate the average Likert score and superimpose it atop the stacked bars.
In our example, the value I need is the first character in the Response string, so we just need to grab the left-most character and convert it into a number. I can do that using the following calculated field called LikertValue.
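The LikertValue idea translates directly: take the left-most character of each response label, convert it to a number, and average. A Python sketch using hypothetical response labels patterned after the scale described above (4 for “Very high degree,” 3 for “High degree,” and so on):

```python
# Hypothetical Likert responses to one question; each label leads
# with its numeric score, as in the sample survey data.
responses = [
    "4: Very high degree",
    "3: High degree",
    "4: Very high degree",
    "1: Low degree",
    "2: Moderate degree",
]

# Equivalent of the LikertValue calculated field -- grab the first
# character and convert to a number -- then average the scores,
# which is what Tableau's AVG(LikertValue) does per question.
likert_values = [int(r[0]) for r in responses]
avg_likert = sum(likert_values) / len(likert_values)

print(avg_likert)  # 2.8
```

Sorting the questions by this average is what turns the alphabetical jumble into a ranked, readable chart.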
Next, I’ll place the Average of this calculated field on the Columns shelf next to the existing table calculation (don’t worry that the result will not look good). I now need to indicate that we have multiple marks and make the first chart a Bar chart and the second a Circle chart, as shown here.
As is often the case, as you first make improvements you can get some wacky results
This still looks pretty dumb, but if I remove “Response” from the Color shelf, then right-click AVG(LikertValue) on the Columns shelf and select “Dual Axis”, we get something that starts to be useful.
Finally, if I show labels for the circle chart and sort the questions by Average LikertValue we get something that tells a clear story.
Not too shabby
In this and the accompanying blog post we’ve explored how to reshape survey data and visualize three different types of survey question. Granted, there are other question types (e.g., “add up to 100”, “rank these in order”, etc.) as well as intra-question analyses (e.g., a scatterplot showing respondents’ age vs. annual income) that are a bit trickier to implement, but what we’ve delved into here should cover a wide variety of needs.
Question – any interest in a one or two day course devoted to visualizing survey data in Tableau?
Note: This blog post is rather dated. While it does stress the importance of reshaping the data, it only shows how to do this using the Tableau add-in for Excel and doesn’t discuss the optimal format for survey data. The post also doesn’t address how to combine survey results downloaded in a numeric format with survey results in a text format.
In my teaching and consulting practice I’ve fielded a lot of inquiries on how to use Tableau to analyze and visualize survey data so I thought it would make sense to write a blog post or two on the subject.
The good news is that Tableau is a fantastic tool for gleaning insights from survey data. The bad news – and it really isn’t that bad – is that it may not be obvious how to set up your survey data so that you can do the cool stuff with Tableau.
In this first blog post we’ll discuss reshaping “typical” survey data so that it gets along nicely with Tableau. In the accompanying post we’ll see how to visualize different question types (e.g., check-all-that-apply, Likert scale, etc.).
The screen shot below shows several rows and columns from a CSV file that might have come from one of several different survey tools (e.g., “Survey Monkey”, “Zoomerang”, etc.)
Raw survey data in Excel
You’ll notice that in addition to an ID column there’s one column for each question and one row for each response. We’re going to need to change this in a moment, but first let’s understand some of the different types of questions we’ll be visualizing.
Columns B through D contain what I call “demographic” questions. These are the questions that usually come at the beginning of the survey and let us know something about the survey respondents. In this example we’ve just gathered gender, country of origin, and generation, but in your surveys you may see columns for ethnicity, education, marital status, income, etc.
Note: If you are using Excel and the Tableau reshaping add-in (which we will explore in a moment) then you should make sure the demographic questions are listed first in your spreadsheet, even if they were presented to survey takers at the end of the survey.
Yes / No / Maybe question
The data in column E, below…
Simple Yes/No/Maybe question
… represents the data that was gathered from this survey question.
Note that the cells in the Excel spreadsheet combine both numeric values (0, 1, or 2) and labels (“No”, “Yes”, “Don’t know”). Once we get the data into Tableau we’ll need to do some string arithmetic to address this.
Check All That Apply questions
The data in columns F through N shown below…
Check-all-that-apply question data
… represents the data that was gathered from this “check all that apply” survey question.
Notice that the only options for these columns are “1: Yes” and “0: No”. A blank indicates that the survey respondent did not answer the question.
Likert Scale Questions
Columns O through Z …
Likert scale question data
… contain Likert Scale data that was gathered from the following survey question.
Here each response receives both a label (“Very high degree”, “High degree”, etc.) and a number rating from 1 to 5. It was Rensis Likert (pronounced “Lick-ert”) who developed the idea of assigning quantitative values to qualitative measures. If you’re creating surveys, chances are you will be doing a lot of work with Likert scale questions.
What’s wrong with the way the data is configured?
There are two very compelling reasons to reshape the data before connecting to it with Tableau:
It’s way easier to explore the data and build visualizations with reshaped data (trust me); and,
You won’t be stymied by a 255 column limit when connecting to an Excel or Access data source (and with long surveys it’s very easy to exceed 255 columns worth of data).
Reshaping the Data
There are a variety of data transformation / extract-transform-load (ETL) tools available, including a free and promising system from Stanford University called Data Wrangler. As our data is in Excel, I’m going to use Tableau’s free reshaping add-in to transform the data into the format we need.
Let’s walk through the process using the sample data in Excel 2010. Note that I have already downloaded and installed the Tableau reshaping add-in.
From the Add-In menu, select Reshape. The Reshape Data dialog box will appear.
Select the cell that marks where the data should be reshaped: in this case, the first row of the first column that contains question, rather than demographic, data.
Click OK. The add-in will generate a new tab in your Excel workbook that contains data that looks like this:
Notice that there is a separate row for each question in the survey and that the ID, Gender, Country, and Generation data is repeated for each survey respondent. This is what is called “normalized” data and it is a good thing.
Change the field names for Col5 to “Question” and Col6 to “Response”.
Save your work and note the name of the tab that contains the reshaped data as this is the source we will use in Tableau.
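The steps above can be sketched in plain Python to show what the add-in is actually doing: turning one-row-per-respondent “wide” data into one-row-per-answer “tall” data, repeating the demographic columns on every row. The column names and data here are illustrative, not from the actual survey.

```python
# Wide data: one row per respondent, one column per question.
wide = [
    {"ID": 1, "Gender": "Female", "Country": "USA", "Q1": "1: Yes", "Q2": "0: No"},
    {"ID": 2, "Gender": "Male",   "Country": "UK",  "Q1": "0: No",  "Q2": "1: Yes"},
]
demographic_cols = ["ID", "Gender", "Country"]

# Tall (normalized) data: one row per respondent/question pair, with the
# demographic columns repeated, plus "Question" and "Response" columns.
tall = [
    {**{c: row[c] for c in demographic_cols}, "Question": q, "Response": row[q]}
    for row in wide
    for q in row
    if q not in demographic_cols
]
print(len(tall))  # 2 respondents x 2 questions = 4 rows
```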
So, now the data is all dressed up. In the next blog post we’ll give it somewhere to go.
Note: When I prepare survey data for Tableau I will usually employ a small variation to the approach discussed above. Specifically, I will create two separate tables, one called “Demographics” and the other called “Questions” and connect them with an inner join (I do this whether the source is Excel or a “real” database). In this case, the “Demographics” table just contains one row for each respondent and the “Questions” table contains one row for each question. I join the tables on the respondent ID.
Here’s what the Demographics table looks like:
And this is what the reshaped Questions table looks like:
Reshaped Questions table
Note that both this approach and the one we looked at earlier produce almost identical results in Tableau.
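The two-table variation can be sketched the same way: a Demographics table with one row per respondent, inner-joined on the respondent ID to a Questions table with one row per answer. This is plain Python rather than Tableau’s join dialog, and the column names are illustrative.

```python
# Demographics: one row per respondent, keyed by ID.
demographics = {
    1: {"Gender": "Female", "Country": "USA"},
    2: {"Gender": "Male",   "Country": "UK"},
}

# Questions: one row per respondent/question pair.
questions = [
    {"ID": 1, "Question": "Q1", "Response": "1: Yes"},
    {"ID": 2, "Question": "Q1", "Response": "0: No"},
]

# Inner join on ID: keep only answer rows whose ID appears in Demographics.
joined = [
    {**row, **demographics[row["ID"]]}
    for row in questions
    if row["ID"] in demographics
]
print(joined[0]["Gender"])  # Female
```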