
May 10, 2016
 

Note: Special thanks to Joe Mako who came up with a considerably more elegant and extensible solution than I had first built. I describe Joe’s approach in this post.

Overview

I’m fortunate that when I have to work with survey data that is not well-coded I can take out my trusty copy of Alteryx Designer and get my data “just so.”

But what if you don’t have a copy of Alteryx? Or what if you want to use Qualtrics’ new Web Data Connector and don’t have the ability to insert a data prep tool within your workflow?

In this blog post we’ll see how you can get Tableau to address the problem.  Specifically, we’ll see how to work with check-all-that-apply questions where the data has been coded as 1s and blanks instead of 1s and 0s.

Note: While the example shown focuses on Qualtrics data the approach will work with any tool that leaves blanks for respondents that did not select an item (e.g., Survey Gizmo).

So, What’s Wrong with Blanks?

Consider the data snippet below that shows responses to a group of check-all-that-apply questions. This is what I consider to be badly-coded data.

Figure 1 — Check all that apply questions as they might appear in a CSV file downloaded from Qualtrics

Now here’s how I would prefer the data to be coded.

Figure 2 — Correctly-coded check all that apply questions

So, why is this better?

By having 1s and 0s we can easily see if some respondents didn’t answer the question group at all. In the figure below we can see that respondents 1002, 1006, 1011, and 1016 did not make any selection for any of the check-all-that-apply questions.  More importantly, with data coded as 1s and 0s, we can simply have Tableau filter out the Nulls and we’ll get accurate results.

Figure 3 — Respondents that didn’t answer this question group are highlighted in yellow

IMPORTANT: Before going any further, make note that respondents 1002, 1006, 1011, and 1016 did not make any selections and that of the 13 people that actually responded to Q1_4,  12 selected that option.  There will be a test.

So, given our constraints, how are we going to come up with a way to distinguish between a blank that means a respondent didn’t select a particular option and a blank that, when combined with three other blanks, means the person didn’t respond to the group of questions at all?

Before we plow into this I want to make sure people realize that we will soon pivot / reshape the data so that it is tall. Specifically, instead of dealing with this…

Figure 4 — Data before reshaping (as you would get it if you downloaded a .CSV file.)

… we’re going to be working with this:

Figure 5 — Reshaped data using Tableau’s built-in Pivoting feature. If you’re using the Qualtrics web data connector you don’t need to use this.

Once we pivot the data we’ll see the following collection of dimensions and measures in Tableau.

Figure 6 — Reshaped data as it appears in Tableau

Note: For this example I’m not going to the trouble of creating a data set that has the survey responses in both numeric and text format. The technique we explore will work equally well whether you have both or just numeric values.

Grouping the related questions

To be able to come up with an elegant, extensible way to tell Tableau “don’t take into account questions where people didn’t make any selections” we’re going to need a way to group related questions together.

Note that the technique I’m about to propose is different from the ad-hoc grouping approach I describe in Using Qualtrics’ Web Data Connector and Reshaping Survey Data with Tableau 9.0.  The programmatic “group” we create here needs to be referenced in a calculated field, and you can’t do that with an ad-hoc group.

Here we’re going to take advantage of Qualtrics’ Question ID naming convention, in which related questions all look like this:

Q(N)_

That is, a group of related check-all-that-apply questions might be named like this:

Q1_1
Q1_2
…
Q1_8

And a group of related Likert scale questions might be named like this:

Q7_1
Q7_2
…
Q7_13

Given this standardization, we need a mechanism that will group everything to the left of the underscore. There are several ways we can achieve this, but for this example we’ll create a new field called “Question Prefix” and define it as follows:

Figure 7 — Calculated field that will group related questions based on whatever comes before the underscore in the Question ID field.
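A minimal sketch of such a calculation (assuming every Question ID contains an underscore) might look like this:

   // Question Prefix -- keep everything to the left of the underscore
   LEFT([Question ID], FIND([Question ID], "_") - 1)

In Tableau 9.0 and later, SPLIT([Question ID], "_", 1) would do the job just as well.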

Now let’s use this field and the Resp ID field to see who did not respond at all to a particular group of questions.

Figure 8 — This collection of fields shows us who did not respond at all to a particular group of questions.

So, have a look at the results in the image above, where we place MAX(Value) on the Text button. What’s happening is that for each respondent we’re looking at the answers to Q1_1, Q1_2, Q1_3, and Q1_4 and taking the maximum value across all those questions.  Notice that there are four blanks / Nulls for 1002, 1006, 1011, and 1016 — these are the four respondents that didn’t answer the question group at all (see, I told you there would be a test).

We do NOT want to include these people when we calculate the percentage of people that selected a particular choice, so we can place MAX(Value) on the Filters shelf and indicate that we should not include Null values, as shown below.

Figure 9 — Filter that will exclude Null values.

Applying this filter will show us responses from people that did NOT skip the question, as shown below.

Figure 10 — Respondents that selected at least one option.

Okay, this is great, but the visualization we want to create shouldn’t look anything like this. Indeed, we have too many things on the level of detail at this point.

So what are we going to do?

Level-of-Detail Calculations to the Rescue

Anytime you think “I need Tableau to consider some dimensions when making a calculation, but those dimensions can’t be in the visualization” a giant alarm bell should go off in your head — this is precisely the use case for an LoD calculation!

Here’s the LoD calc that, when placed on the Filters shelf, will do the trick:

Figure 11 — LoD calc that allows us to exclude folks that skipped answering a question group.

A translation into English would be:

Even though [Question Prefix] and [Resp ID] may not be on the viz detail, take the combination of these two fields and find the MAX([Value]). Then find the Null values from this and exclude them.

Note that we are using FIXED instead of INCLUDE as the former will automatically create a dimension and will work no matter what other filter settings may be in place.
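A sketch that matches the English translation above (field names as used in this post):

   // True when the respondent answered at least one question in the group
   NOT ISNULL({ FIXED [Question Prefix], [Resp ID] : MAX([Value]) })

Placed on the Filters shelf with only True selected, this removes the non-respondents no matter what else is on the viz.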

Putting This All Together

First we need to create a calculation that gives us the percentage of people that selected an option from the group of check-all-that-apply questions. For this instance the following calculation will work well.

Figure 12 — How to determine the percentage of people that selected an option.

A quick translation into English would be:

Add up everybody that answered “1” (meaning they selected an option) and divide by the number of people that answered the question.
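A sketch that matches this logic, given that a selected option is stored as a 1 and an unselected one as a blank / Null:

   // Percentage of respondents that selected the option
   SUM([Value]) / SUM([Number of Records])

Because [Value] is either 1 or Null, SUM([Value]) counts the selections, while the denominator counts every respondent row that survives our filters.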

Let’s see if this works. Consider the visualization shown below.

Figure 13 — First attempt at showing the percentage of respondents that selected an option. This is NOT showing the correct percentage.

Does this pass the “smell” test? Look at Q1_4. You may recall that we saw that 12 out of 13 people selected this option — but 12 out of 13 is 92%, not 71%.  Indeed, if we place SUM([Number of Records]) as a discrete measure on the Rows shelf we see that Tableau believes 17 people responded to the question.

Figure 14 — Without a filter Tableau believes 17 people responded to this group of questions

We need to put our LoD calc on the Filters shelf and set it to True.

Figure 15 — Removing “All of these are Null” from the respondent pool gives us the correct count of 13 and the correct percentage (12 out of 13).

Conclusion

How check-all-that-apply questions are coded is a problem with many survey tools, including Qualtrics and Survey Gizmo. If you have a good data preparation tool like Alteryx you can fix the problem, but if you don’t have such a tool, or don’t want to insert one into your workflow, you can get Tableau to address the problem with a Level-of-Detail calculation.

Apr 11, 2016
 

Overview

This past week I enjoyed looking at and interacting with Matt Chambers’ car color popularity bump chart.

Figure 1 — Matt Chambers’ car color popularity bump chart.  You can find the original Datagraver visualization upon which this was based here.

The key to this dashboard is interactivity, as it’s hard to parse all the car colors at once. If you hover over one at a time it’s easy to follow the trends, as shown here.

Figure 2 — Hovering over a color shows you that color’s ranking over time

Showing Rank Only

Over the past few months I’ve seen a lot of people making bump charts (myself included). As much as I like them, I fear that people are leaving some critical insights out of the discussion, as bump charts only show ordinal information and not cardinal information. That is, they show rank but not magnitude.

Consider the bump chart above.  In 2009 White was the number one color, Black was number two, and Red was a distant sixth.

Figure 3 — Red appears to be a distant sixth

But was Red in fact “distant,” or was its popularity closer than it would appear?  When you just show rank there’s no easy way to tell.

Showing Rank and Magnitude

Consider the dashboard below that shows the overall ranking and percentage popularity for car colors over the last ten years.

Figure 4 — Ranked Bar Chart dashboard with no colors selected

Right now we can see that over the last ten years White came in first place with 22% and Red came in fifth place with 11%.  Now let’s see what happens if we select Red and White, as shown below.

Figure 5 — Comparing the popularity of white and red cars over the last ten years.

Here we can see everything that the bump chart had plus so much more. Specifically, we can see that White was in first place for the past ten years and that Red was as high as fourth place in 2007 and as low as sixth place in 2008 and 2009. But we can also see that in 2009 White was only 50% larger than Red while in 2015 it was almost 150% larger!

Try it yourself

Click here to interact with the color popularity ranked bar chart.

Ranked Bars are Versatile

The ranked bar approach works well for showing rank and magnitude over time and across different categories.

Consider the dashboard below that shows the sales for the top 20 products overall and then a ranked breakdown by one of three possible categories (Customer Segment, Region, and Year).

Figure 6 — Overall sales / rank and sales / rank broken down by Customer Segment.

Here we can see not only how the Bosch Full Integrated Dishwasher is ranked overall and within the four Customer Segments, but also how much more or less the other products’ sales were.

Here’s the same dashboard showing a breakdown by Region.

Figure 7 — Overall sales / rank and sales / rank broken down by Region.

The Bosch Dishwasher is fifth overall but it isn’t even in the Top 20 in the South.  We can also see that it is second in the East, ever-so-slightly behind the first-ranked product, the Whirlpool Upright Freezer. (You can see for yourself when you interact with the dashboard at the end of the post.)

Here’s the same data but presented using a bump chart.

Figure 8 — Overall sales / rank and just rank by Region.

The bump chart looks cool, but we only get part of the story as we can only glean rank.

Conclusion

The bump chart is a great choice if you want to show “soft” rankings, such as what place a team came in over time, but if you want to show rank and magnitude, consider the ranked bar chart instead.

Note: for step-by-step instructions on how to build a dashboard like the one below, see Visual Ranking within a Category.

The Ranked Bar Dashboard — Kick The Tires

Mar 30, 2016
 

Some thoughts on functionality, beauty, crown molding, and lollipop charts

Overview

I’ve been writing a book about business dashboards with Jeffrey Shaffer and Andy Cotgreave and we’ve conducted screen-sharing sessions with dozens of people and reviewed scores of dashboards. We had a particularly enjoyable jam session with Tableau Zen Master Mark Jackson last week. When we asked him why he had done something in particular he replied with a comment that has been haunting me (in a good way) ever since:

“I look at this dashboard first thing every morning. I want to look at something beautiful.”

This really resonated with me. Mark was not tasked with making a public-facing dashboard that had to compete with USA Today infographics. He just wanted to make something that was both functional and beautiful. It made me think of waking up in a lovely room with crown molding. You don’t need crown molding, but as long as it isn’t blocking sunlight or clashing with the decor it’s certainly delightful to have.

This got me thinking about a topic I come back to often — how to make visualizations that are both functional and beautiful.

Unfortunately, this isn’t so easy and often leads to people sacrificing clarity for the sake of coolitude (see “Balancing Accuracy, Engagement, and Tone” and “It’s Your Data, not the Viz, That’s Boring” for some more thoughts on the matter).  I did, however, want to share a case study that had a delightful outcome and that employed a chart type that combines the accuracy of a bar chart with a bit of the “oooh” from packed bubbles and “ahhh” from donut charts.

Marist Poll and Views of the 2016 Presidential Election

Marist Poll is one of my clients and they are tasked with providing nationwide survey results to The Wall Street Journal and NBC News.  In November 2015 they conducted a poll asking people to describe in one word the tone of the 2016 presidential election. Here were the results.

Figure 1 — Marist Poll results in tabular form

Attempt One — Word Cloud

The results from the poll are very compelling, but as depicted in the text table they don’t exactly pop.

The client tried a word cloud as shown below.

Figure 2 — Marist Poll results using a word cloud

I’ll admit that the graphic “pops,” but it’s hard to make sense of the six terms, let alone discern that the results for “Crazy” were almost three times greater than the next most popular term.

Attempt Two — Packed Bubbles

People love circles and this chart certainly looks “cool” but what does it tell other than that the “Crazy” circle is larger than the other circles?

Figure 3 — Marist Poll results using packed bubbles

Why not use a simple bar chart?

Attempt Three — A Simple Bar Chart

Here are the same results rendered using the chart type Tableau’s “Show Me” suggests you use when working with this type of data.

Figure 4 — Marist Poll results using a bar chart

This is a big improvement over the word cloud and packed bubbles with respect to clarity — you can easily sort the responses and see how much larger “Crazy” is than the other responses.

But the chart is a bit sterile. What can we do to make the “Crazy” pop out without distorting the information?

Attempt Four — A Colored Bar Chart

The major takeaway from the poll is that 40% of the respondents characterized the election as “Crazy.” We can make that easier to glean by making that bar a bold color and all the other bars muted, as shown here.

Figure 5 — Marist Poll results using a bar chart with one bar colored differently

I’ll confess that this does the trick for me, but the client wanted to pursue some other options so we looked into a lollipop chart.

Attempt Five — Lollipop Chart

The lollipop chart is not native to Tableau;  it’s simply a dual axis chart that superimposes a circle chart on top of a bar chart that has very thin bars.

Figure 6 — Marist Poll results using a lollipop chart

This strikes me as an excellent compromise between the analytical integrity of the bar chart and the “ooh… circles” appeal of the packed bubbles.  I have no qualms about using this chart type.

But there’s still something missing if we want the chart to have some impact.

Final Attempt — Adding a Compelling Title

A concise, descriptive title can make a huge difference in garnering attention and making a chart more memorable. In the example below the client added some graphic design artistry to the typography to make the title compelling.

Figure 7 — Marist Poll results using a lollipop chart with compelling headline.  I love this.

Conclusion

My bass-playing friends will probably agree that “groove” is more important than “chops.”  That is, being able to play “in the pocket” with a rock-steady beat is more important than being able to play a great solo with a flurry of notes all over the neck.

But it sure is great to be able to do both.

The same goes for data visualization. Functionality needs to come first, then beauty.

But it sure is great to have both.

And in many cases, with a little extra effort, you can have both.

So go ahead, try putting some “crown molding” into your data visualizations and delight yourself and your stakeholders.

 

Mar 17, 2016
 

Overview

I’m a big fan of Andy Kriebel’s and Andy Cotgreave’s Makeover Monday challenge. For those of you not familiar with this, each week Kriebel and Cotgreave find an existing visualization / data set and ask the data visualization community to come up with alternative ways to present the same data.

As Cotgreave points out in one of his blog posts “It’s about using a tool to debate data. It’s about improving people’s data literacy.”

With one major exception that I’ll discuss in a moment, the challenge is meeting its goals: each week several dozen people participate, and the submissions and accompanying discussions have been enormously valuable.

But there was one week where the community failed.

Worse than that, the community did some damage that will be difficult to repair.

Bad Data Make Bad Vizzes Make Bogus Conclusions

Week four of the Makeover Monday challenge used survey data from GOBankingRates that posed the question “how much money do you have saved in your savings account?” Here are some of the baseless conclusions from people that participated in the makeover:

Figure 1 — From the source article that spawned the makeover.  Yes, the exploding donut needs a makeover, but it’s the headline “Survey finds that two-thirds of Americans don’t have enough money saved” that presents the bigger problem.

  • Americans Don’t Have Enough Money Saved (See link).
  • 71% of Americans Have Less than $1,000 in Savings. Yikes! (See link).
  • Americans Just Aren’t Saving Money (See link).
  • Most Americans Have Miniscule Savings (See link).
  • 80% of Americans Have Less than $5,000 in Savings! (See link).
  • Americans Are Not Saving Enough Money! (See link).
  • Americans Have Too Little Savings (See link).

So, what’s the problem?

It turns out the key finding from the original publication is not true — and thanks to the makeovers that spurious finding has been amplified dozens of times.

How did this happen?

Let’s dig into the data a little bit.

Is There a Relationship Between Age and Savings?

As I mentioned before I think the Monday Makeover Challenge is great and I’ve participated in a couple of them. I started to work on this one and took a stab at showing the relationship between age and savings, as shown here.

Figure 2 — Divergent stacked bar chart showing the percentage of people that have different savings amounts, sorted by age

This looked odd to me as I expected to see a correlation between age and savings; that is, I expected to see a lot more blue among Seniors and Baby Boomers.

I decided to make the demarcations less granular and just compare people with minimal savings and those with $1,000 or more in savings, as shown here.

Figure 3 — Less granular divergent stacked bar chart

This result seemed way off, so either my supposition is wildly incorrect (i.e., as people get older they save more) or there was something wrong with the data.

Note: I try to remind people that Tableau isn’t just for reporting interesting findings. It’s a remarkably useful tool for finding flaws in the data.

It turns out that while there is indeed something wrong with the data, there was a much bigger problem:

Most people didn’t bother to look at the actual question the survey asked.

What the Survey Asked

The survey asked “How much money do you have saved in your savings account?”  It did not ask “How much money do you have saved?”

The difference is titanic, as the average American savings account yields but 0.06 percent interest!  That’s infinitesimal — you might as well stick your money in a mattress!

Indeed, I am of the Baby Boomer generation and I have but $20 in my savings account — but (thankfully) more in my savings.

So, the vast majority of people that participated in the makeover didn’t bother to look at the actual question and came to — and published — a bogus conclusion.

Were there any other problems with the survey?

You betcha.

What’s Wrong with the Survey?

A visualization is only as good as its underlying data and the data in question has nothing to do with the savings habits of Americans; it only has to do with having a savings account.

But there are other shortcomings with the survey that should make us question whether the data is even useful for analyzing how much money Americans have sitting in a savings account.

Consider this excellent review of the same Makeover Monday challenge from Christophe Cariou.  He points out the following shortcomings with the survey itself:

  • In the article, we read: ‘The responses are representative of the U.S. internet population’. It is therefore not representative of the US population. See this report by Pew Research Center for age and online access.
  • We also read ‘Demographic information was not available for all respondents, and analysis of responses by demographics is based solely on responses for which the targeted demographic information was available.’ Normally, if it was demographically representative, this would be clarified. This comment adds a doubt.
  • The average savings amount in the article is the sum of the averages of the groups divided by 6. It is not weighted by the size of each group.
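To see why that last point matters, here’s a contrived example (numbers invented for illustration): suppose one group has 1,000 respondents averaging $500 in savings and another has just 10 respondents averaging $50,000.

   Mean of the group means: (500 + 50,000) / 2 = $25,250
   Weighted average: (1,000 × 500 + 10 × 50,000) / 1,010 ≈ $990

The unweighted version overstates the true average by a factor of roughly 25.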

Note: Kudos to Bridget Cogley who also saw the problems with the conclusions when the makeovers first appeared in late January 2016.

Further note: In a subsequent makeover challenge blog post Cotgreave alerted participants to questionable data.

So, Where Exactly is the Harm?

So, dozens of people have created visualizations based on bad data and came up with bogus conclusions. Given the number of articles from allegedly reliable sources reporting shortcomings in savings, what’s the harm of sounding an alarm bell?

I suppose if you are an “ends justify the means” type of person then it’s fine to publish bogus findings as long as they change behavior in a positive way.

But I know many of the people in this community and they would be aghast at using data visualization this way.

I also fear that with collective missteps like this people will question the ability of makeover participants to relay accurate information.

So What Should We Do?

Andy Cotgreave and Andy Kriebel have earned their leadership positions in the data visualization community, so I hope they will make note of this makeover mishap and encourage people that published the bogus result to modify their headlines.

I also strongly encourage anyone working in data visualization to understand the data — warts and all — before rushing to publish. Andy Kriebel is providing the data set and we shouldn’t ask him to find all the flaws in it.  Indeed, that’s part of our job.

Finally, I ask others in the community to be more diligent: only publish work that has been carefully vetted and do not tolerate unsubstantiated work from others.

While it’s true that nothing terrible will happen if more Americans open savings accounts, there may be other situations where publishing spurious conclusions will do some serious damage.

Feb 11, 2016
 

Overview

As I’ve mentioned before, the number one impediment to success with visualizing survey data in Tableau is getting the data in the right format so you can do all the dazzling things you want to do.

Up until very recently none of the survey tool vendors had done anything to make things easy, so you had to pursue one of the following options:

  • Use the Tableau add-in for Microsoft Excel (only works on a PC and can be very cumbersome if you have data that needs to be updated.)
  • Use the pivot feature in Tableau 9.x (works great but is missing some important functionality, especially for larger surveys.)
  • Use Alteryx (Best-in-class but it has its own learning curve and is pricey.)
  • Get somebody to do this for you (like that’s going to happen.)

But now there is another choice and it warrants its own blog post.

Say hello to the Qualtrics web connector

Survey vendor Qualtrics has developed a web data connector so that you can connect directly to your Qualtrics surveys from inside Tableau. While the connector does not have everything I would want (more on this in a moment), this offering represents a great step forward for Qualtrics users that want to better see and understand their data using Tableau.

In this blog post we’ll look at the following:

  • Wide vs. tall data (and why we need it to be tall)
  • How to connect to Qualtrics survey data using Tableau
  • How you can get the data “just so” using data blending
  • Dealing with null values
  • Benefits and possible drawbacks

Wide vs. Tall Data

Most survey tools present the data in a one-column-for-each-question format that looks something like this.

Figure 1 — One row per respondent, one column for each question

We will be absolutely miserable working with the data in this format so we need something that can reshape the data so that we have a separate row for each question, as shown below.

Figure 2 — Each respondent is listed many times (one row for each question)

Actually, we want to do more than just reshape the data.  Ideally we want to get something that looks like this.

Figure 3 — Ideal format for survey data

Qualtrics can get us most of the way there, and with some simple data blending it can get us all of the way there.

How to connect to Qualtrics survey data

Let’s assume you have a Qualtrics ID and at least one survey with some data.

  1. Log into Qualtrics.com and access the survey to which you will want to connect.
  2. Click the Responses tab and click the Export Data button, as shown here.
  3. When the Export Responses dialog box appears, click Tableau, then copy the URL, as shown below.
  4. Close the dialog box and open Tableau.
  5. From within Tableau indicate you want to connect to a new data source and select Web Data Connector.
  6. When the Web Data Connector dialog box appears paste the survey URL that you copied from Qualtrics.
  7. When asked, enter your Qualtrics username and password and click Log In.
    Qualtrics will then ask you which fields you want from your survey and which fields you will want to transpose.
  8. Start by indicating that you want to select all fields and transpose all the fields, as shown below.
  9. Indicate which fields you don’t need and which fields should not be transposed. In this example we will want Response ID, Q1: Gender, Q2: Region and Q3: Age to be separate columns, but all other selected fields need to be transposed (pivoted).
  10. Click Import Survey Responses and wait a little bit as Tableau imports the data and creates a Tableau Data Extract file.

So, just what do we have here?

After specifying that we want the vast majority of questions (fields) to be transposed, here’s what we’ll see from inside Tableau.

Figure 4 — Dimensions and Measures generated by the Qualtrics Web Data Connector

It may not look like much but the fact that we get both text and numeric responses to the survey questions will be a huge time saver.

Here’s what we get if we place Question ID, Question Description, Textual Answer, and Numeric Answer on the Rows shelf.  Please note that Numeric Answer has been converted to discrete so we can easily see how the text responses line up with the numeric responses.

Figure 5 — “What do we have here?” viz that shows Question ID, human-readable translation of the Question ID, all possible text responses, and all possible numeric responses.

Here I can see how each Question ID maps to a human-readable translation of the Question ID (i.e., the Question Description) as well as all possible text responses and all possible numeric responses.  Yes, there are some null values here, but we’ll deal with them in a moment.

I wish I could express just how beautiful this is! This has almost everything I could possibly want and I barely had to lift a finger.

Okay, what did I mean by “almost”?

This particular survey does not contain many questions so I could work with what I have right now without too many problems.  But if I had a survey with hundreds of questions then I would benefit greatly from having some way to group related questions together.

Fortunately I can do this easily using Tableau’s blending feature.

How you can get the data “just so” using data blending

I need to create a separate Excel file that groups the Question IDs so that I can more easily find and manipulate the data in Tableau.

Figure 6 — Example of a “Question Helper” file.

Note:  In some cases it isn’t just convenience that warrants creating this “helper” file.  NPS analysis and Importance vs. Satisfaction analysis will be very difficult without building a scaffolding file like the one shown above.
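The helper file itself can be as simple as one row per Question ID plus whatever grouping columns you need. The snippet below is invented, but the column names (Question ID, Grouping, Qtype) match the blending steps that follow:

   Question ID   Grouping              Qtype
   Q4_1          What do you measure   Check all that apply
   Q4_2          What do you measure   Check all that apply
   Q5_1          Satisfaction          Likert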

Let’s see how to create the blend.

  1. On a new worksheet drag Question ID onto rows.
  2. Click the Add New Data Source button.
  3. Connect to the Helper File and indicate whether or not you want to create an Extract (of course you do!).
    Notice that there is a link field on Question ID in what will be the secondary data source.
  4. Drag Grouping to the right of Question ID and then Qtype to the right of Grouping, as shown below.
  5. Right-click the Grouping pill and select Create Primary Group.
  6. Rename the field Question Grouping as shown below.
  7. Right-click the Qtype pill and select Create Primary Group.
  8. Rename the field Qtype and click OK.
    At this point we no longer need the secondary data source as we’ve created some very useful groups in the primary data source.
  9. Remove the Grouping and Qtype pills from the Rows shelf.
  10. Close the secondary data source.
  11. Drag Question Grouping and QType to the left of Question ID as shown below.

Putting this to use

Let’s see how to put all this to good use; specifically, let’s see how to visualize a check-all-that-apply question.

  1. In a new sheet drag Question Grouping to the Filters shelf and select What do you measure, as shown below.
  2. Drag Question Description to rows.
  3. Create a new calculated field called CheckAll_Percent and define it as follows (a sketch of the likely formula appears after these steps):
  4. Right-click the newly created measure and indicate you want to change its default numeric format to Percentage with 0 decimal places.
  5. Drag CheckAll_Percent to the Columns shelf, turn on mark labels, and sort in descending order.

    Figure 8 — Percentage of respondents that measure Pulse Rate, Adrenaline Production, and so on.
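For reference, here’s a sketch of what CheckAll_Percent might look like, using the connector’s field names and assuming a 1 means the item was checked:

   // CheckAll_Percent -- percentage of respondents that checked the item
   SUM([Numeric Answer]) / SUM([Number of Records])

Keep in mind that this only returns accurate percentages once the blank responses are filtered out, which is exactly what the next section addresses.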

Dealing with nulls

As we saw earlier we have some nulls and we need to make sure they don’t muck up the visualization.

As of this writing the Qualtrics Web Connector writes a 0 as the numeric answer and a blank as the textual answer for Nulls.  I would have expected either a “99” or an actual NULL value, as we cannot simply filter out the 0 values: sometimes a 0 means “no,” sometimes it may mean “strongly disagree,” and so on.

Fortunately, the Qualtrics Web Connector does write an unambiguous blank that we can use as our filter.

Let’s see how to remove the null values in the visualization above.

  1. Drag Textual Answer to the Filters shelf.
  2. When the Filter dialog box appears indicate that you want to see Fewer values, as shown below.
  3. Indicate that you want to Exclude blanks, as shown below.

Benefits and possible drawbacks

So, just what are the benefits and potential drawbacks of using the Qualtrics web connector?

Benefits

There are two clear benefits to using the Qualtrics web connector:

  1. You can get the data the way you need it with very little effort – there’s practically no learning curve.
  2. As you get more survey responses you just need to update the Extract. The workflow is seamless.

Possible drawbacks

  1. What happens when there are many questions? It’s unclear how well this will work when you have many questions, let alone many survey responses. Navigating the connection dialog to find the fields you want and don’t want to transpose may prove cumbersome.
  2. Weighted data. For those that need it, there needs to be some mechanism in the workflow to weight survey responses. This is something that can be automated using a tool such as Alteryx.
  3. The Qualtrics web connector is not free and the pricing will vary based on a variety of factors. Given this, there may be cases where it makes sense to sacrifice Qualtrics’ ease and seamlessness for a more powerful and flexible ETL solution.

Conclusion

If you are a Qualtrics user who wants to explore and visualize data using Tableau, you should definitely take the Qualtrics web connector for a test drive, as Qualtrics has made the task of getting survey data into Tableau easy and seamless.

If you’re using a different survey tool and want to visualize your data with Tableau, you may want to consider switching to Qualtrics.

 

 

 

 

Jan 11, 2016
 

Overview

I spend a lot of time with survey data and much of this data revolves around gauging people’s sentiments and tendencies using either a Likert Scale or a Net Promoter Score (NPS) type of thing.

Examples

Here’s an example of gauging sentiment using a 5-point Likert scale.

Indicate how satisfied you are with the following:

Here’s an example of measuring tendencies, using a 4-point Likert scale.

How often do you use the following learning modalities?

So, what’s a good way to visualize responses to these types of questions?

Over the past ten years I’ve spent thousands of hours working on the best ways to show how opinions and tendencies skew one way or another.  I have found that in most cases a divergent stacked bar chart helps me (and more importantly, my clients) best see what’s going on with the survey responses.

In this blog post we’ll

  • See an example of a divergent stacked bar chart (also called a staggered stacked bar chart)
  • Work through a data visualization improvement process
  • Show how to visualize different scales (e.g., NPS, Top 3/Bottom 3, 5-point Likert, etc.)
  • Show sentiment and tendencies over time
  • Present a dashboard that will allow you to experiment with different visualization approaches

Note: for step-by-step instructions on how to build a Likert-scale divergent stacked bar chart in Tableau, click here.

Divergent Stacked Bar vs. 100% Stacked Bar

Readers of my newsletter and folks visiting the web site may have seen my redesign of a New York Times infographic that showed the tendencies of politicians to lie or tell the truth.  Here’s the 100% stacked bar chart that appeared in The New York Times.

Figure 1 — 100% stacked bar chart.

Here’s the redesign using a divergent stacked bar chart.

Figure 2 -- Divergent stacked bar chart.

Figure 2 — Divergent stacked bar chart.

With both the 100% stacked bar chart and the divergent stacked bar chart the overall length of the bars is the same, but with the divergent approach the bars are shifted left or right to show which way a candidate leans. I, and others I’ve polled, find that shifting the bars makes the chart easier to understand.

How We Got Here — Likert Scale Improvement Process

Consider the table below that shows the results from a fictitious poll on the use of various learning modalities.

Figure 3 -- Table with survey results.

Figure 3 — Survey results in a table.

I can’t glean anything meaningful from this.

What about a bar chart?

Figure 4 — Likert scale questions using a bar chart. Yikes.

Wow, that’s really bad.

What about a 100% stacked bar chart?

Figure 5 — 100% stacked bar chart using default colors.

Okay, that’s better, but it’s still pretty bad, as Tableau’s default colors do nothing to help us see which sentiments are related. That is, “Often” and “Sometimes” should have similar colors, as should “Rarely” and “Never.”

So, let’s try using better colors…

(…and don’t even think about using red and green.)

Figure 6 — 100% stacked bar chart using a more appropriate color scheme.

This is certainly an improvement, but the modalities are listed alphabetically and not by how often they’re used. Let’s see what happens when we sort the bars.

Figure 7 — Sorted 100% stacked bar chart with good colors.

It’s taken us several tries, but it’s now easier to see which modalities are more popular.

But we can do better.

Here’s the same data rendered as a divergent stacked bar chart.

Figure 8 — Sorted divergent stacked bar chart with good colors.

Of course, we can also take a coarser view and just compare Sometimes/Often with Rarely/Never, as shown here.

Figure 9 – Divergent stacked bar chart with only two levels of sentiment.

I find that the divergent approach “speaks” to me and it resonates with my colleagues and clients.

Experiments using Different Scales

A while back Helen Lindsey was kind enough to send me some data that contained responses to some Net Promoter Score questions.  Specifically, folks were asked to rate companies/products on a 0 to 10 or 1 to 10 scale.

Figure 10 — The classic Net Promoter Score (NPS) question

We compute NPS by taking the percentage of folks that are promoters (i.e., people who responded with a 9 or a 10), subtracting the percentage of folks that are detractors (i.e., people who responded with a 0 through 6), and multiplying by 100.
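As a Tableau calc, a sketch of that computation (assuming the 0–10 response lives in a field called [Value]) might be:

   // NPS = (% promoters - % detractors) x 100
   (SUM(IF [Value] >= 9 THEN 1 ELSE 0 END) -
    SUM(IF [Value] <= 6 THEN 1 ELSE 0 END))
   / SUM([Number of Records]) * 100

So if 50% of respondents are promoters and 20% are detractors, the NPS is 30.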

But sometimes my clients have questions that are on a 10 or 11-point scale and instead want to compute the percentage of folks that responded with one of the top three boxes minus the percentage of folks that responded with the bottom three boxes.

I realized that the Lindsey data set could provide a type of “sandbox” where we could experiment with different sentiment scales including NPS, Top 3 minus Bottom 3, 5-point Likert, 3-point Likert, and 2-point Likert.

Let’s look at the results of some of these experiments.

NPS

Here are two ways we can visualize NPS data.  The first shows the percentages of people that fall into the three categories.

Figure 11 — NPS showing percentages

Here’s the same view, but with the NPS score superimposed over the divergent stacked bars.

Figure 12 — NPS with score superimposed

NPS over Time

It turns out that divergent stacked bars are great at showing NPS trends over time.  Here’s a view using percentages.

Figure 13 — Divergent stacked bar showing NPS over time with percentages

Here’s the same view but with the score superimposed.

Figure 14 — Divergent stacked bar showing NPS over time with scores

Note – for some other interesting treatments of showing sentiment over time, see Joe Mako’s visualization on banker honesty.

Net = Top 3 minus Bottom 3

Let’s take the same data but divide it into the following buckets:

  • Positive = Top 3 Boxes
  • Neutral = Middle 4 Boxes
  • Negative = Bottom 3 Boxes
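A sketch of the bucketing calculation, assuming a 1-to-10 scale so that the top three boxes are 8 through 10 and the bottom three are 1 through 3:

   // Sentiment bucket for Top 3 / Bottom 3 analysis
   IF [Value] >= 8 THEN "Positive"
   ELSEIF [Value] <= 3 THEN "Negative"
   ELSE "Neutral"
   END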

Here are the associated visualizations.

Figure 15 — Top 3/Bottom 3 showing with percentages

Figure 16 — Top 3/Bottom 3 with scores

Five, Three, and Two-Point Likert Scale Renderings

Let’s suppose that instead of asking a question on a 1 through 10 scale we asked folks to select one of the following five responses:

  • Strongly disagree
  • Disagree
  • Neutral
  • Agree
  • Strongly agree

Here’s the same NPS data but rendered using a five-point Likert scale.

Figure 17 — Divergent stacked bar chart showing all responses

And here’s the same data, but divided into positive, neutral, and negative sentiments (3-point Likert).

Figure 18 — Divergent stacked bar showing positive, neutral, and negative

Finally, here’s the same data, but only showing positive and negative sentiments (2-point Likert).

Figure 19 — Divergent stacked bar showing just positive and negative

Try it yourself

Below you will find a dashboard that allows you to explore different combinations of the 1 to 10 scale.

I strongly recommend you do NOT give your audience all these scaling options; these are here for you to experiment and see how the visualizations and rankings change based on what scales you use.  The only option I would present to your audience is the ability to toggle back and forth between percentages and scores.

Dec 9, 2015
 

Overview

In an October 2015 meeting of the Tableau Zen Masters each Zen Master was asked to name his / her favorite thing about Tableau.  Most people started by saying “well, besides the community, my favorite thing is…”

At the time I said “undo”.  Don’t laugh.  Let’s not take it for granted that with Tableau we can try things, fail, and go back to where we were before we failed… gracefully.

After the meeting I thought more about this question and how there are unheralded aspects of the tool and the ecosystem that I count on but that don’t get the attention they deserve.  I realized that there is one thing in particular that I use almost every day and without it I, and scores of others, would be at a major loss.

Tableau Public.

What is Tableau Public

Tableau Public consists of a free downloadable version of Tableau Desktop to explore and visualize data, and a free cloud platform to host, share, and embed interactive visualizations.

While I use my regular copy of Tableau Desktop to explore data and fashion visualizations, it’s the free hosted version of Tableau Server (the cloud platform) that allows me to do so much and to learn so much.

Sharing

My website consists mostly of examples and blog posts and most of those posts contain Tableau dashboards that are embedded right inside the post.  I don’t have to ask people to download a reader and then download my dashboards.  The dashboards are right there.

Learning

But it gets better. Anyone with a copy of Tableau who is curious about how the viz works can just download the workbook, open it up, and see how it’s put together.

For example, a couple of years ago I wanted to see if anyone had recreated Hans Rosling’s famous Gapminder demo in Tableau.  A quick Google search led me to this example from Jeffrey Shaffer.

Figure 1 — Jeffrey Shaffer recreates Rosling’s Gapminder.  See http://public.tableau.com/profile/jeffs8297#!/vizhome/shared/FQWYZ95DJ

I was curious to see how Jeff had gotten the year to display as big block letters in the middle of the viz, so I downloaded the workbook and “looked under the hood.”

Over the years I’ve downloaded hundreds of workbooks and have analyzed the individual brush strokes of Kelly Martin, Andy Cotgreave, Ben Jones, Anya A’Hearn, and dozens of others.  That I can do this so easily  is nothing short of amazing.  I don’t just get to interact with cool vizzes; I can download them and see how they work.

Adding to the conversation

Because the vast majority of Tableau Public authors do make their work downloadable, you’ll see people modify and repost the work they’ve downloaded. Andy Cotgreave wrote about this in late 2014, where one way of visualizing something begat other ways of visualizing the same data.  Here’s a particularly beautiful example from Michael Mixon.

Figure 2 — A stunning dashboard from Michael Mixon that you can download from here.

It Isn’t Perfect

Before you accuse me of writing a hagiography, there are several shortcomings with Tableau Public.

Row limit

You can only have 100,000 rows. No, now you can have 1,000,000 rows. Okay, as of May 2015 you can have 10 million rows of data.  Not really a shortcoming.

Locking down your data

Your workbooks and the underlying data can be downloaded and examined.  As of May of 2015 you can “lock down” your workbooks and your data.  This is huge as it allows public-serving organizations with proprietary data to publish interactive workbooks without fear that somebody will download and examine the underlying data.

That said…

Unless you indeed have proprietary data please, please, please don’t stop your workbooks from being downloaded.

Figure 3 — This setting is on by default.  Please don’t turn it off unless you have a good reason (e.g., the underlying data is proprietary).

Tableau and the community that supports it are giving you this amazing free platform for you to showcase your work.  Please allow others to benefit by making your work downloadable.

Just because something is anointed a Tableau Public “Viz of the Day” does not mean it is a good viz

Many Viz of the Day selections are examples you should emulate, but I’ve seen some really bad vizzes that made the cut because the subject matter was “discussion-worthy”.  The problem is that those who are new to data visualization won’t know that these vizzes aren’t worth emulating.  They’ll just think “hmm, that viz with the word cloud and donut charts was a ‘viz of the day.’  I guess it’s a good thing to make vizzes with word clouds and donut charts.”

As Tableau comes out with new versions they tinker and break things

I understand that Tableau Public is an evolving platform and that Tableau wants to improve that platform by adding new features. The problem is that I’ve had many dashboards that suddenly stop working. Tableau has been great at responding to notices that things are broken, but sometimes it can take hours, if not days, to fix. So…

Favor: Please add my website to your test suite. In other words, do not implement any new features until you’re sure everything on my site works perfectly.

Okay, okay, okay… but it doesn’t hurt to ask.

Things I cannot do without

Here’s a summary of the things I cannot do without in my practice.

I would not be able to do what I do without Tableau Desktop.

I would not be able to do what I do without the community that supports Tableau.

And I would not be able to do what I do without Tableau Public.

My sincere thanks to the Tableau Public team.

Nov 22, 2015
 

Overview

I received an e-mail inquiry about weighted data recently and realized that while I cover this in my survey data class I had not yet posted anything about it here.  Time to remedy that.

The good news is that it is not at all difficult to work with weighted survey data in Tableau.  And just what do I mean by weighted data? We use weighting to adjust the results of a study so that they better reflect what is known about the population. For example, if the subscribers to your magazine are 60% female but the people that take your survey are only 45% female, you should weigh the responses from females more heavily than those from males.
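A standard way to derive such a weight (your survey team may compute it differently) is to divide each group’s share of the population by its share of the sample:

   Weight = (share of population) / (share of sample)

   Female: 0.60 / 0.45 ≈ 1.33
   Male:   0.40 / 0.55 ≈ 0.73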

To do this each survey respondent should have a weighting amount associated with their respondent ID, as shown here.

Figure 1 – A snippet of survey data showing a separate column for Weight.

When pivoting / reshaping the data make sure that [Weight] does not get reshaped.  It should remain in its own column like the other demographic data.

Once this is in place we’ll need to modify the formulas for the following question types:

  • Yes / No / Maybe (single punch)
  • Check-all-that-apply (multi-punch)
  • Sentiment / Likert Scale (simple stacked bar)
  • Sentiment / Likert Scale (divergent stacked bar)

Yes / No / Maybe (single punch)

With this type of question you usually want to determine the percentage of the total.

Figure 2 — Visualization of a single-punch question

Unweighted calculation

The table calculation to determine the percentage of total using unweighted data is

   SUM([Number of Records]) / TOTAL(SUM([Number of Records]))

Weighted calculation

The table calculation to determine the percentage of total using weighted data is

   SUM([Weight]) / TOTAL(SUM([Weight]))

Check-all-that-apply (multi-punch)

With this type of question you usually want to determine the percentage of people that selected an item.  The total will almost always add up to more than 100% as you are allowing people to select multiple items.

Figure 3 — Visualization of a multi-punch question

Most surveys will code the items that are checked with a “1” and those that are not checked with a “0”.

Unweighted calculation

The calculation to determine the percentage of people selecting an item using unweighted data is

   SUM([Value]) / SUM([Number of Records])

where [Value] is the name of the measure that contains the survey responses.  If the survey responses are coded as labels instead of numbers you can use this formula instead.

   SUM(IF [Label]="Yes" then 1 ELSE 0 END) / SUM([Number of Records])

Weighted calculation

The calculation to determine the percentage of people selecting an item using weighted data is

   SUM(IF [Value]=1 then [Weight] ELSE 0 END) / SUM([Weight])

Sentiment / Likert Scale (simple stacked bar)

This is very similar to the single-punch question but instead we have several questions and compare them using a stacked bar chart.  I am not a big fan of this approach but it can be useful when you superimpose some type of score (e.g., average Likert value, percent top 2 boxes, etc.).

Figure 4 — Simple Likert Scale visualization

Figure 5 — Simple Likert Scale visualization with Percent Top 2 Boxes combo chart

Unweighted calculation – Stacked Bar

The table calculation to determine the percentage of total using unweighted data is

   SUM([Number of Records]) / TOTAL(SUM([Number of Records]))

Weighted calculation – Stacked Bar

The table calculation to determine the percentage of total using weighted data is

   SUM([Weight]) / TOTAL(SUM([Weight]))

Unweighted calculation – Percent Top 2 Boxes

Assuming a 1 through 5 Likert scale, the calculation to determine the percentage of people selecting either Very High Degree or High Degree (top 2 boxes) using unweighted data is

   SUM(IF [Value]>=4 THEN 1 ELSE 0 END) / SUM([Number of Records])

Weighted calculation – Percent Top 2 Boxes

Assuming a 1 through 5 Likert scale, the calculation to determine the percentage of people selecting either Very High Degree or High Degree (top 2 boxes) using weighted data is

   SUM(IF [Value]>=4 THEN [Weight] ELSE 0 END) / SUM([Weight])

Sentiment / Likert Scale (divergent stacked bar)

Here is what I believe is a preferable way to show how sentiment skews across different questions.

Figure 6 — A divergent stacked bar chart

I’ve covered how to build this type of chart using unweighted values here.

There are six fields we need to fashion the visualization, three of which need to be modified to make the visualization work with weighted data.

  • Count Negative
  • Gantt Percent
  • Gantt Start
  • Percentage
  • Total Count
  • Total Count Negative

Count Negative – Unweighted

Assuming a 1 – 5 Likert scale, the calculation to determine the number of negative sentiment responses using unweighted data is

   IF [Score]<3 THEN 1
   ELSEIF [Score]=3 THEN .5
   ELSE 0 END

Count Negative – Weighted

Assuming a 1 – 5 Likert scale, the calculation to determine the number of negative sentiment responses using weighted data is

   IF [Score]<3 THEN [Weight]
   ELSEIF [Score]=3 THEN .5 * [Weight]
   ELSE 0 END

Percentage – Unweighted

The calculation that determines both the size of the Gantt bar and the label for the bar using unweighted data is

   SUM([Number of Records]) / [Total Count]

Percentage – Weighted

The calculation that determines both the size of the Gantt bar and the label for the bar using weighted data is

   SUM([Weight]) / [Total Count]

Total Count – Unweighted

The calculation that determines the total number of responses for a particular question for unweighted data is

   TOTAL(SUM([Number of Records]))

Total Count – Weighted

The calculation that determines the total number of responses for a particular question for weighted data is

   TOTAL(SUM([Weight]))
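The other three fields in the list (Gantt Percent, Gantt Start, and Total Count Negative) presumably need no separate weighted versions because they are built from the fields above and inherit the weighting automatically. For example, as a sketch assuming the construction from the earlier post, Total Count Negative would be

   TOTAL(SUM([Count Negative]))

which already returns a weighted total once Count Negative returns [Weight] instead of 1.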

Summary

Here’s a summary of all the unweighted calculations and their weighted equivalents.

Percent of total (stacked bar)
   Unweighted: SUM([Number of Records]) / TOTAL(SUM([Number of Records]))
   Weighted:   SUM([Weight]) / TOTAL(SUM([Weight]))

Percent selecting an item (multi-punch)
   Unweighted: SUM([Value]) / SUM([Number of Records])
   Weighted:   SUM(IF [Value]=1 THEN [Weight] ELSE 0 END) / SUM([Weight])

Percent top 2 boxes
   Unweighted: SUM(IF [Value]>=4 THEN 1 ELSE 0 END) / SUM([Number of Records])
   Weighted:   SUM(IF [Value]>=4 THEN [Weight] ELSE 0 END) / SUM([Weight])

Count Negative
   Unweighted: IF [Score]<3 THEN 1 ELSEIF [Score]=3 THEN .5 ELSE 0 END
   Weighted:   IF [Score]<3 THEN [Weight] ELSEIF [Score]=3 THEN .5 * [Weight] ELSE 0 END

Percentage (divergent stacked bar)
   Unweighted: SUM([Number of Records]) / [Total Count]
   Weighted:   SUM([Weight]) / [Total Count]

Total Count
   Unweighted: TOTAL(SUM([Number of Records]))
   Weighted:   TOTAL(SUM([Weight]))

 

Nov 10 2015
 

Overview

Several weeks ago the data visualization community erupted in justified outrage over an inexcusably misleading dual-axis chart from Americans United for Life.  I plan to write an article about this and other “ethically wrong” visualizations in a few weeks, but in the meantime I encourage you to read these excellent posts from Alberto Cairo and Emily Schuch, as well as this discussion from Politifact.

Around the same time these posts appeared I came across a “Viz of the Day” dashboard from Emily Le Coz that accompanied a lengthy article in the Daytona Beach News-Journal.  The dashboard contained several visualizations but the one that caught my eye was this dual axis chart.

Figure 1 -- Infographic showing that as the number of firefighters has increased over the past 30 years, the number of fire-related deaths has decreased.

Figure 1 — Infographic showing that as the number of firefighters has increased over the past 30 years, the number of fire-related deaths has decreased.

I engaged in an interesting Twitter discussion about this graphic with Alberto Cairo, Jorge Camoes, and Noah Illinsky. I’ll get into that discussion in a bit (and point out some troubling problems with the visualization) but first want to discuss the use case for dual axis charts.

Why use dual axis charts

There are several reasons to use a dual axis chart (e.g., a Pareto chart that shows individual values along with the cumulative percent) but the primary use case is when you want to compare two completely different measures and see if there is any noteworthy relationship between the two measures.  Consider the example below that shows cyclical sales data for a retail store (bars) and the number of orders placed each month (line).

Figure 2 -- Dual axis chart comparing sales and orders by month.

Figure 2 — Dual axis chart comparing sales and orders by month.

The surprising result is that while November is historically the strongest month for sales ($5M from 2010 to 2013) the total number of orders placed in November is the lowest of any month. And yes, I checked to make sure that this was true of all years and not one crazy blowout year.

I think this dual axis combination chart (where we show bars and a line) makes it easy to see there is something very interesting about November. The low number of orders combined with the high sales – something that is easy to see – means that we either sold more items per order or more expensive items per order.

So, what’s wrong with the firefighter example?

Given that dual axis charts can be so useful I wondered why I had problems with the Firefighter example.  Fortunately, the author made the dashboard downloadable from Tableau public so I was able to see how it was put together.

Cutesy icons set the wrong tone for the piece

My first problem was with the firefighter hat and skull-and-crossbones icons.

Figure 3 -- Icons representing firefighters and civilian deaths.

Figure 3 — Icons representing firefighters and civilian deaths.

In my opinion (and it is just an opinion) this “cartoonified” the visualization. I would much prefer to see either a simple color legend or a label next to both lines.

The author exaggerates the changes over time

A much more troubling issue is that the author uses a fixed Y-axis that exaggerates the changes over time.  The author also fails to show the axis labels so we can’t see that the axis doesn’t start at zero.

Consider the dashboard below that shows the original visualization on the left with an accurate visualization on the right.

Figure 4 -- Comparison of fixed axis vs. automatic axis charts.  Note that the axis uses a SUM() function while the label is using AVERAGE(). The data is repeated three times in the data source which is why the author needs to use AVERAGE(). Yes, the axis should use AVERAGE() as well but the relative positioning of the elements is the same with SUM() so this causes no harm.

Figure 4 — Comparison of fixed axis vs. automatic axis charts.  Note that the axis uses a SUM() function while the label is using AVERAGE(). The data is repeated three times in the data source which is why the author needs to use AVERAGE(). Yes, the axis should use AVERAGE() as well but the relative positioning of the elements is the same with SUM() so this causes no harm.

Because the author fixed the Y-axis rather than starting it from zero, the slope of the lines is exaggerated. While this does not alter what is in fact a noteworthy observation, whenever I see this type of “rigging” it makes me question the validity of any and all parts of the story.  That is, even though I don’t think the exaggeration was an intentional attempt to dramatize the difference, seeing this in play will make me question everything that the author and the publication publish.

Am I being too hard on the author? I don’t think so, as anything that’s published as a “viz of the day” and accompanies a high-profile news article should get a lot more scrutiny than just any old Tableau Public visualization.  While I don’t feel misled by the overstated changes, I do wonder at what point a viz crosses the line into TURD territory (Truly Unfortunate Representation of Data). We’ll save that discussion for a later post.

Different approaches

Combination area and line chart

After adjusting the axis I still wondered if having two line charts was causing unnecessary confusion. In my first makeover attempt I tried combining an area graph with a line chart, as shown here.

Figure 5 -- First makeover attempt.  A dual axis chart using an area chart for firefighters and a line chart for civilian deaths.

Figure 5 — First makeover attempt.  A dual axis chart using an area chart for firefighters and a line chart for civilian deaths.

While using two different chart types made it easier to see that I was comparing two different measures, I didn’t love the chart and sought alternatives.

Connected Scatterplots

On Twitter Jorge Camoes offered this connected scatterplot.

Figure 6 -- Jorge Camoes’ connected scatterplot.  Notice that the axes do not start at zero but that the axes labels are at least visible.

Figure 6 — Jorge Camoes’ connected scatterplot.  Notice that the axes do not start at zero but that the axes labels are at least visible.

In a connected scatterplot the path the line takes represents the year.  This is why the line folds back on itself from time to time (more on this in a moment).  Camoes also “normalized” the data using an index so that both civilian deaths and number of firefighters start at a value of 100.
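If you want to build a similar index in Tableau, here is a minimal sketch, assuming a measure I’ll call [Deaths] and a view with Year on the axis (the names are mine, not from Camoes’ workbook). It is a table calculation like

   100 * SUM([Deaths]) / LOOKUP(SUM([Deaths]), FIRST())

computed along Year, so the first year in the view lands at 100 and every later year is expressed relative to it.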

I like this visualization very much but fear that many people won’t understand the index value of 100, so I tried my own connected scatterplot, shown below.

Figure 7 -- Connected scatterplot with regular vs. normalized values.  Notice that the X-axis does not start at zero but that the axes labels are visible.

Figure 7 — Connected scatterplot with regular vs. normalized values.  Notice that the X-axis does not start at zero but that the axes labels are visible.

Before anyone cries foul about the X-axis, here’s a version with the axis starting at zero.

Figure 8 -- Connected scatterplot with both axes starting at zero.  This may be why Camoes normalized the data although his chart doesn’t start at zero, either.

Figure 8 — Connected scatterplot with both axes starting at zero.  This may be why Camoes normalized the data although his chart doesn’t start at zero, either.

I think starting the X-axis at zero obscures the relationship, but that’s not what makes me question this approach.  My problem is that many people will have a hard time understanding how the line “works”, as it were.  This is because whenever we see a line chart that involves time, we expect marks on the left of the chart to show earlier dates and marks on the right to show later dates.  In other words, we expect the chart to behave like this.

Figure 9 – Since grade school we’ve been indoctrinated to expect earlier dates to the left and later dates to the right.

Figure 9 – Since grade school we’ve been indoctrinated to expect earlier dates to the left and later dates to the right.

With a connected scatterplot the X-axis is “owned” by an independent measure so we have to adjust our perception to see that sometimes a later year will appear to the left of an earlier year, as shown below.

Figure 10 -- Connected scatterplot with marks showing all years.

Figure 10 — Connected scatterplot with marks showing all years.

Notice how 1986 appears to the left of 1985 and 1989 appears to the left of 1988.  Unless you are used to this type of approach this can look very strange.

Keep it simple

After experimenting a bit more I decided to forgo the dual axis and connected scatterplots and fashioned this simpler narrative.

Figure 11 -- Two separate charts yielding a simple and easy-to-follow narrative.

Figure 11 — Two separate charts yielding a simple and easy-to-follow narrative.

If you have what you think is a better approach I would love to see it.  If you’re using Tableau you can download the packaged workbook with the original dashboard and various makeover attempts here.

Oct 13 2015
 

Overview

You can download a trial version of Alteryx here.

You can find the Excel source data here.

You can download the completed Alteryx module from here.

Understanding the workflow

Here’s the Alteryx workflow that will get the data in the format we want.

Figure 9 -- Alteryx workflow for getting the survey data in an optimal format for analysis in Tableau

Figure 9 — Alteryx workflow for getting the survey data in an optimal format for analysis in Tableau

If this is the first time you’ve seen an Alteryx workflow it may look a little complicated (shades of Rube Goldberg), but consider the flow and decision tree behind what you do when you get up in the morning and get ready for work – that’s far more complicated.

Let’s start by focusing on the three data sources that we need (labeled 1, 2, and 3 in the diagram above).

  1. Data with numeric responses (we’ll use this for our demographic data, too)
  2. Data with text responses
  3. The Question Helper file (contains our meta data).

A note about SPSS files

For this example we’re using Excel data but the process would be identical for survey data saved in an SPSS format (.SAV file).  SPSS encodes both the text and numeric data in a single file, but you need to input values from that same data source twice – once for text and once for numeric values.

When you specify an SPSS file as a data source you’ll see the following check box option.

Figure 10 -- Options that appear when you bring in data from an SPSS file.

Figure 10 — Options that appear when you bring in data from an SPSS file.

Selecting “Expand Factors” will bring in the data as labels.  Unchecking this box will bring in the data as numbers.

Demographics

Figure 11 -- Selecting the demographic components from the data source

Figure 11 — Selecting the demographic components from the data source

Clicking the Select tool reveals the following settings.

Figure 12 -- Specifying which fields we want to keep

Figure 12 — Specifying which fields we want to keep

A translation into English would be “open this Excel file and select these fields”.  Here we keep RespID, Gender, Location, Generation, and Weight.

Numeric values

Figure 13 -- Numeric values section of the Alteryx workflow with the Transpose tool selected

Figure 13 — Numeric values section of the Alteryx workflow with the Transpose tool selected

Using the same data source, selecting the Transpose tool reveals the following settings.

Figure 14 – Transpose tool settings

Figure 14 – Transpose tool settings

Here we indicate that we want to keep RespID and pivot all the other checked fields.  Notice that the demographic components are not checked.

Looking back to the workflow diagram, we next follow this pivot with a Select tool which we use to rename the pivoted fields as shown here.

Figure 15 -- The Select tool allows us to name the pivoted fields

Figure 15 — The Select tool allows us to name the pivoted fields

Here we are naming the first pivoted field Question ID and the second pivoted field Numeric value.
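If you don’t have Alteryx, here is a rough equivalent of the Transpose and Select steps in Python / pandas. This is my own sketch, not part of the workflow, and the file name is a placeholder.

   import pandas as pd

   # Placeholder file name for the numeric-response export
   numeric = pd.read_excel("survey_numeric.xlsx")

   demo_cols = ["RespID", "Gender", "Location", "Generation", "Weight"]

   # Keep RespID, pivot every non-demographic column tall, and name
   # the pivoted fields the way the Select tool does above
   tall = numeric.drop(columns=demo_cols[1:]).melt(
       id_vars="RespID",
       var_name="Question ID",
       value_name="Numeric value",
   )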

Dealing with nulls

Figure 16 -- The Filter tool allows us to eliminate null values before they hit Tableau.

Figure 16 — The Filter tool allows us to eliminate null values before they hit Tableau.

While not critical, Alteryx allows us to eliminate null values before they come into Tableau.  Here’s the setting for this tool.

Figure 17 -- Setting to eliminate nulls from the Numeric value field

Figure 17 — Setting to eliminate nulls from the Numeric value field
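In the pandas sketch the null filter is a one-liner:

   # Drop rows where the respondent left the item blank
   tall = tall.dropna(subset=["Numeric value"])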

Joining the demographic with the transposed (pivoted) numeric data

Figure 18 -- Joining the demographic and transposed numeric data

Figure 18 — Joining the demographic and transposed numeric data

We’re now ready to join the demographic data with the pivoted numeric survey data. We do this with Alteryx’s Join tool.  Here are the settings.

Figure 19 -- Joining demographics and numeric data using the common field, RespID. Notice that we don't include the redundant version of RespID that is coming from the right table.

Figure 19 — Joining demographics and numeric data using the common field, RespID. Notice that we don’t include the redundant version of RespID that is coming from the right table.
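Continuing the pandas sketch, the Join tool’s matched (“J”) output corresponds to an inner join on the common key:

   # Demographic columns plus the join key
   demographics = numeric[demo_cols]

   # Inner join on RespID; pandas keeps a single copy of the key column
   joined = demographics.merge(tall, on="RespID", how="inner")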

Checking our progress

If we add a Browse tool after the Join we’ll see the following results.

Figure 20 -- Results of the initial Join

Figure 20 — Results of the initial Join

Note that we could, in fact, get by with using just this in Tableau, but the whole point is to make things faster and easier, so we’re going to add text responses and the “Helper” data.

Text Values

Figure 21 -- Workflow for setting the text values

Figure 21 — Workflow for setting the text values

This is identical to the numeric values except that we specify a different data source (one with label responses instead of numeric) and we rename the second pivoted field differently, as shown here.

Figure 22 -- The Select tool allows us to name the pivoted fields

Figure 22 — The Select tool allows us to name the pivoted fields

Here we are naming the first pivoted field Question ID and the second Text value.
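In the pandas sketch this is the same melt with a different source and value name (again, the file name is a placeholder, and I’m assuming the label export has the same column layout as the numeric one):

   # Placeholder file name for the label-response export
   text = pd.read_excel("survey_text.xlsx")

   text_tall = text.drop(columns=demo_cols[1:]).melt(
       id_vars="RespID",
       var_name="Question ID",
       value_name="Text value",
   )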

Tying Demographic, Numeric and Text Values Together

We need to combine the demographic / numeric values with the text values.  We will again use Alteryx’s Join tool, but we’ll need to join on both RespID and Question ID.

Figure 23 -- Joining demographic, numeric, and text data together

Figure 23 — Joining demographic, numeric, and text data together

The setting for the Join is shown below.

Figure 24 -- Join settings for combining demographics / numeric with text data

Figure 24 — Join settings for combining demographics / numeric with text data
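The pandas equivalent is a merge on both keys:

   # Join on both keys so each numeric response lines up with its label
   combined = joined.merge(text_tall, on=["RespID", "Question ID"], how="inner")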

Adding the question meta data and outputting the results

We’re almost done; we just need to add the information that tells us about each Question ID so we can map the ID to its human-readable form, group related IDs together, and so on.

Figure 25 -- Joining the "Helper" file data

Figure 25 — Joining the “Helper” file data

The setting for this Join is shown below.

Figure 26 -- Marrying the question meta data to the demographic and survey data

Figure 26 — Marrying the question meta data to the demographic and survey data

The only thing left to do is output this to a new data source.  In the sample workflow I output to an Excel file but you can just as easily write directly to a Tableau Data Extract file.
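To round out the pandas sketch, the meta data join and the output step would look like this (file names are again placeholders, and I’m assuming the helper file has a Question ID column):

   # Placeholder file name for the Question Helper file
   helper = pd.read_excel("question_helper.xlsx")

   final = combined.merge(helper, on="Question ID", how="inner")

   # Write the reshaped data out for Tableau
   final.to_excel("survey_for_tableau.xlsx", index=False)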

But Wait!  There’s More!

I intentionally used data that was well behaved, but what happens if the data is “messy”?  For example, suppose there’s an extra row below the field names?  Or suppose that for a check-all-that-apply question, instead of 1s and 0s we instead have 1s and blanks (which would mess things up in Tableau)?  Alteryx handles both of these scenarios (and dozens more) easily.

Is there a Downside?

If there’s any downside, it would be the cost, as Alteryx starts at around $4,000 per year.  If all you are doing is some occasional survey data reshaping, this fee would probably be prohibitively expensive. If, however, you need to corral a lot of data – or if you need advanced data blending, geospatial analysis, or predictive modeling – then Alteryx is a huge bargain.
