
More thoughts on the Marimekko chart and in particular how to build one in Tableau.

April 4, 2017

Overview

Given my reluctance to embrace odd chart types and my conviction that I would find something better, I was surprised to find myself last month writing about — and endorsing — the Marimekko chart.

If I was surprised then I’m absolutely gobsmacked to be writing about it again.

What precipitated all this was another very good example of the chart in the wild. After admiring it I couldn’t help but “look under the hood” (hey, we are talking about Tableau Public and people sharing this stuff freely) and I thought the dashboard designer was working harder than necessary to build the visualization.

So, if people are going to use these things I thought I would share an alternative, and I think easier, technique for building them.

The Great Example from Neil Richards

Here’s the terrific Makeover Monday dashboard from Neil Richards where we see the likelihood of certain jobs being replaced by automation.

Figure 1 — Neil Richards’ Makeover Monday dashboard.

Neil does a great job highlighting some of the more interesting findings, but if you want to know more than what Neil highlights you’ll need to explore the dashboard on your own.

Notice that in both this case and in Emma Whyte’s example from last month we are dealing with only two data segments; e.g., male vs. female and at-risk vs. not-at-risk jobs. Having only two colors is one of the main reasons why the chart works well.

Okay! Uncle! I agree that under the right conditions this is a useful chart, and I can see why you may want to make one.

But is there an easier way to make one?

An Easier Way to Create a Marimekko Chart in Tableau

It turns out the same technique Joe Mako showed me six years ago for building a divergent stacked bar chart works great for fashioning a Marimekko. Let’s see how to do this using Superstore data with fields similar to what was available in both Emma’s and Neil’s dashboards.

Let’s say I want to compare the magnitude of sales with the profitability of items by region.  Figure 2 shows the overall magnitude of sales but makes comparing profitability difficult.

Figure 2 — Overall sales is easy to see but comparing profitability across regions is difficult.

Here’s another attempt using a 100% stacked bar chart.

Figure 3 — Showing profitability with a 100% stacked bar chart.

Yes, this does a much better job of allowing us to compare the profitability of each region, but there’s no easy way to glean that sales in the West are almost double sales in the South (which is easy to see in Figure 2).

So, how can we make the regions that have large sales wide and the regions that have small sales narrow?

Understanding the Fields

Before going much further let’s make sure we understand the following three fields:

  • [Percentage Profitable Sales]
  • [Percentage Unprofitable Sales]
  • [Sales Percentage of]

[Percentage Profitable Sales]

This is defined as

SUM(IF [Profit] >= 0 THEN [Sales] END) / SUM([Sales])

… and translates as “within each partition, add up the sales for the profitable items, then divide by the total sales within the partition.”

This is the field that gives us the 90%, 77%, 76%, and 72% results shown in Figure 3.
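If you want to sanity-check this logic outside of Tableau, here is a minimal pandas sketch (with invented numbers, just to mirror the math, not something from the post’s workbook):

import pandas as pd

# A tiny stand-in for Superstore's order-level data (invented numbers)
df = pd.DataFrame({
    "Region": ["West", "West", "South", "South"],
    "Sales":  [100.0,  50.0,   80.0,   20.0],
    "Profit": [ 10.0,  -5.0,    4.0,   -2.0],
})

# SUM(IF [Profit] >= 0 THEN [Sales] END) per region ...
profitable_sales = df.loc[df["Profit"] >= 0].groupby("Region")["Sales"].sum()

# ... divided by SUM([Sales]) per region
total_sales = df.groupby("Region")["Sales"].sum()

pct_profitable_sales = (profitable_sales / total_sales).fillna(0.0)
print(pct_profitable_sales)  # West 0.667, South 0.800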

[Percentage Unprofitable Sales]

This is defined as

1 - [Percentage Profitable Sales]

… and gives us the 10%, 23%, 24%, and 28% shown in Figure 3.

[Sales Percentage of]

This is defined as

SUM([Sales]) / TOTAL(SUM([Sales]))

… and we will use it to compute the percentage of sales across the four regions (i.e., show me the sales for one region divided by the sales for all the regions). Here’s how we might use it in a visualization.

Figure 4 — Using the calculation to figure out how wide each region should be.

So, in Figure 4 we can see that the West segment is a lot thicker than the South segment.
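Continuing the little pandas sketch from above (again, invented numbers rather than anything from the workbook), [Sales Percentage of] is just each region’s share of the grand total:

# SUM([Sales]) / TOTAL(SUM([Sales])): each region's share of all sales
region_sales = df.groupby("Region")["Sales"].sum()
sales_percentage_of = region_sales / region_sales.sum()
print(sales_percentage_of)  # West 0.6, South 0.4 for the invented data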

How can we apply this additional depth to what we had in Figure 3?

Make it Easy to See if the Math is Correct

At this point it will be helpful to see the interplay of the various measures and dimensions using a cross tab like the one shown in Figure 5.

Figure 5 — Cross tab showing the relationship among the different measures and dimensions.

The first four columns are easy to interpret:

“I see that sales in the West is $725,458 of which 10% is unprofitable and 90% is profitable.  That $725,458 represents 31.6% of the total sales.”

But how is the field called [Start at] defined and how are we going to use it?

Understanding [Start at]

[Start at] is defined as

PREVIOUS_VALUE(0)+ZN(LOOKUP([Sales Percentage of],-1))

This is the calculation that figures out where the mark should start while [Sales Percentage of] will later determine how thick the mark should be.  Let’s see how this all works together.

Figure 6 — How [Start at] and [Sales Percentage of] will work together.  Note that “Compute Using” for the two table calculations is set to [Region].

For the West region we want to start at 0% and have a bar that is 31.6% wide. The function

PREVIOUS_VALUE(0)

tells Tableau to look at the value of [Start at] in the row above and, if there is no row above, to use 0 (see Item 1 in Figure 6, above).

Add to this the value of [Sales Percentage of] in the previous row (Item 2, which is also not present) and you get 0 + 0 = 0 (Item 3).

For the East region we want to start wherever West left off (Item 3 plus Item 4, which gives us Item 5) and make the mark 29.5% wide (Item 6).

For the Central region we want to start wherever the previous region left off (Item 5 plus Item 6, which gives us Item 7) and make the mark 21.8% wide (Item 8).
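If the table-calculation functions feel opaque, keep in mind that [Start at] is nothing more than a running total of the widths of all the bars to the left. Here is the same arithmetic as a short Python sketch (the widths are the ones shown in Figure 6):

# Widths from Figure 6, in table order: West, East, Central, South
widths = [0.316, 0.295, 0.218, 0.171]

start_at = []
for i in range(len(widths)):
    previous_start = start_at[i - 1] if i > 0 else 0.0  # PREVIOUS_VALUE(0)
    previous_width = widths[i - 1] if i > 0 else 0.0    # ZN(LOOKUP([Sales Percentage of], -1))
    start_at.append(previous_start + previous_width)

print(start_at)  # [0.0, 0.316, 0.611, 0.829]: each bar starts where the previous one ended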

Let’s see how this all fits together into the Marimekko visualization in Figure 7.

Figure 7 — Using [Start at] and [Sales Percentage of] to make the Marimekko work.

There are three things to keep in mind.

  1. [Start at] is on Columns and determines the starting point (how far to the right) for each of the regions.
  2. [Sales Percentage of] is on Size and determines how thick the bars should be.
  3. Size is set to Fixed width, left aligned, where Fixed means the measure on the Size shelf is determining the thickness.

Figure 8 — Size must be fixed and left-aligned.

Some Interesting Findings

I built a parameter-driven version of the Marimekko (embedded at the end of this blog post) that allows the viewer to select different dimensions and different ways to sort. Here’s what happens when we look at Sub-Category sorted by Profitability.

Figure 9 — Profitability by Sub-Category.

Okay, not a big surprise here given how many visualizations we’ve all seen showing that Tables are problematic.

That said, I was in for a surprise when I broke this down by state and sorted by the magnitude of sales, as shown below.

Figure 10 — Profitability by state, sorted by Sales.

Wow, after 11 years of living with this data set I never realized that 60% of the items sold in Texas were unprofitable.  Who knew?

To be honest, I’m not convinced we need a Marimekko to see this clearly. A simple sorted bar chart will do the trick, as shown in Figure 11.

Figure 11 — Sorted bar chart.

Indeed, I think this very simple view is better than the Marimekko in many respects.

I guess it depends what you’re trying to get across.

See for Yourself

I’ve included an embedded workbook that has the Superstore example as well as versions of the visualizations Emma Whyte and Neil Richards built, but using this alternative technique.

I encourage you to think long and hard before deploying a Marimekko.  But if you do decide to build one I hope the techniques I explored here will prove useful.

 


By Steve Wexler and Jeffrey Shaffer

January 9, 2017

Please also see follow-up post.

Overview

Makeover Monday, started by Andy Kriebel in 2009 and turned into a weekly social data project by Kriebel and Andy Cotgreave in 2016, is now one of the biggest community endeavors in data visualization. By the end of 2016 there were over 3,000 submissions and 2017 began with record-breaking numbers, with over 100 makeovers in the first week. We are big fans of this project and it’s because of the project’s tremendous success and our love and respect for the two Andys (and now Eva Murray) that we feel compelled to write this post.

Unfortunately, 2017 started off with a truly grand fiasco as over 100 people published findings that cannot be substantiated. In just a few days the MM community has done a lot of damage (and if it doesn’t act quickly it will do even more).

What happened

Whoa! That’s quite an indictment. What happened, exactly?

Here’s the article that inspired the Makeover Monday assignment.

So, what’s the problem?

The claims in the article are wrong.  Really, really wrong.

And now, thanks to over 100 well-meaning people, instead of one website that got it really, really wrong there are over 100 tweets, blog posts, and web pages that got it really, really wrong.

It appears that Makeover Monday participants assumed the following about the data and the headline:

  • The data is cited by Makeover Monday so it must be good data.
  • The data comes from the Australian Government so it must be good data that is appropriate for the analysis in question.
  • The headline comes from what appears to be a reputable source, so it must be true.

Some Caveats

Before continuing we want to acknowledge that there is a wage gap in Australia; it just isn’t nearly as pronounced as this article and the makeovers suggest.

The data also looks highly reputable; it’s just not appropriate data for making a useful comparison on wages.

Also, we did not look at all 100+ makeovers, but of the 40 that we did review, all of them parroted the findings of the source article.

Some makeover examples

Here are some examples from the 100+ people that created dashboards.

Figure 2 — A beautiful viz that almost certainly makes bogus claims. Source: https://public.tableau.com/profile/publish/Australias50highestpayingjobsarepayingmensignificantlymore


Figure 3 — Another beautiful viz that almost certainly makes bogus claims. Source: https://public.tableau.com/profile/publish/MM12017/Dashboard1#!/publish-confirm


Figure 4 — A third beautiful viz that almost certainly makes bogus claims.  Source: https://public.tableau.com/profile/publish/AustraliaPayGap_0/Dashboard1#!/publish-confirm


Figure 5 — Yet another beautiful viz that almost certainly makes bogus claims.  Source: https://public.tableau.com/views/GenderDisparityinAustralia/GenderInequality?:embed=y&:display_count=yes&:showVizHome=no#1

Goodness! These dashboards (and the dozens of others that we’ve reviewed) are highlighting a horrible injustice!

[we’re being sarcastic]

Let’s hold off before joining a protest march.

Why these makeovers are wrong

Step back and think for a minute. Over 100 people created a visualization on the gender wage gap, and of the dashboards we reviewed, all of them visualized, in some form, the difference between male Ophthalmologists earning $552,947 and females earning only $217,242 (this is the largest gap in the data set).

Did any of these people ask “Can this be right?”

This should be setting off alarm bells!

There are two BIG factors that make the data we have unusable.

One — The data is based on averages, and without knowing the distributions there’s no way to determine if the data provides an accurate representation.
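A quick illustration of the problem (invented numbers, not the Australian data): two groups can share the same average while telling completely different stories.

import statistics

group_a = [95, 100, 100, 105, 100]  # tightly clustered: mean 100, median 100
group_b = [40, 45, 50, 55, 310]     # one big outlier: mean 100, median 50

for group in (group_a, group_b):
    print(statistics.mean(group), statistics.median(group))

Same mean, wildly different realities, and the average alone gives no hint of which one we are looking at.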

Here’s a tongue-in-cheek graphic that underscores why averages may not be suited for our comparison.


Figure 6 — The danger of using averages.  From Why Not to Trust Statistics.

Here’s another real-world graphic from Ben Jones that compares the salaries of Seattle Seahawks football players.


Figure 7 — Seattle Seahawks salary distributions. Source: Ben Jones.

Ben points out

The “average” Seahawks salary this year is $2.8M. If you asked the players on the team whether it’s typical for one of them to make around $3M, they’d say “Hell No!”

Two — The data doesn’t consider part time vs. full time work. The data is from tax returns and doesn’t take into account the number of hours worked.

Let’s see how these two factors work with a “for instance” from the source data.

Figure 8 -- A snippet of the source data in question.

Figure 8 — A snippet of the source data in question.

So, there are 143 female Ophthalmologists making an average of $217K and 423 male Ophthalmologists making an average of $552K.

Are the women in fact being paid way less? On average, yes, but suppose the following were the case:

Of the 143 women, 51 work only 25 hours per week.

And of those 423 men, 14 are making crazy-high wages (e.g., one of them is on retainer with the Sultan of Brunei).

Could the 51 part-time workers and the 14 insanely-paid workers exaggerate the gap?

Absolutely.

Is this scenario likely?

About the Sultan of Brunei?  Who knows, but about hours worked?

Very likely.

We did some digging and discovered that as of 2010, 17% of the male workforce in Australia was working part time while 46% of the female workforce was working part time.

This single factor could explain the gap in its entirety.
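To get a feel for how much hours alone can move an average, consider this back-of-the-envelope Python sketch. Every number in it is invented except the 17% and 46% part-time rates cited above, and it deliberately assumes men and women earn an identical hourly rate.

# Hypothetical: identical hourly pay for everyone; only part-time rates differ
HOURLY_RATE = 150       # invented, same for men and women
FULL_TIME_HOURS = 45    # invented weekly hours for full-timers
PART_TIME_HOURS = 25    # invented weekly hours for part-timers
WEEKS_PER_YEAR = 48

def average_annual_income(part_time_share):
    avg_hours = (part_time_share * PART_TIME_HOURS
                 + (1 - part_time_share) * FULL_TIME_HOURS)
    return HOURLY_RATE * avg_hours * WEEKS_PER_YEAR

men = average_annual_income(0.17)    # 17% of men worked part time (2010)
women = average_annual_income(0.46)  # 46% of women worked part time (2010)
print(men, women)           # 299520.0 257760.0
print((men - women) / men)  # roughly 0.14

A 14% “wage gap” appears out of thin air even though every hour of work is paid identically.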

Note: Not knowing the number of hours worked is only one problem. The data also doesn’t address years of experience, tenure, location, or education, all of which may contribute to the gap.

Findings from other surveys

We did some more digging…

Data from the Workplace Gender Equality Agency (an Australian Government statutory agency) shows that in the Health Care field, 85% of the part-time workers in 2016 were female. This same report shows a 15% pay gap for full-time Health Care employees and only a 1% gap for part-time employees.

Finally, a comprehensive study titled Differences in practice and personal profiles between male and female ophthalmologists was published in 2005. Key findings from this survey of 254 respondents show:

  • 41% of females worked 40 hours per week compared with 70% for males.
  • 57.5% of females worked part-time compared with 13.6% for males.
  • The average income for females was AUS$38,000 less than for males, not $335,000 less.
    (Yes, that’s still a big gap, but it’s almost 10 times smaller than what the article claims.)

Why this causes so much damage

It would keep us up at night to think that something we did would lead to somebody saying this:

“Wait!  You think the wage gap here is bad; you should see what it’s like in Australia.  Just the other day I was looking at this really cool infographic…”

So, here we are spreading misinformation. And it appears we did it over 100 times! The visualizations have now been favorited over 500 times, retweeted, and one was featured as the first Tableau Viz of the Day for 2017.

We’re supposed to be the good guys, people that cry foul when we see things like this:

Figure 9 — Notorious Fox News Misleading Graphic.

Publishing bogus findings undermines our credibility. It suggests we value style over substance, that we don’t know enough to relentlessly question our data sources, and that we don’t understand when averages work and when they don’t.

It may also make people question everything we publish from now on.

And it desensitizes us to the actual numbers.

Let us explain. There is clearly a gender wage gap in Australia. The Australian government reports the gender wage gap based on total compensation to be around 26% for all industries, 23% for full-time workers, and 15% for full-time health care workers (the gap in base pay is smaller). While we can’t calculate the exact difference for full-time or part-time ophthalmologists (because we only have survey data from 2005), it appears to be less than 15%.

Whatever the number is, it’s far less than the 150% wage gap shown on all the makeovers we reviewed.

And because we’ve reported crazy large amounts, when we see the actual amount — say 15% — instead of protesting a legitimate injustice, people will just shrug because 15% now seems so small.

How to fix this

This is not the first time in MM’s history that questionable data and a lack of proper interrogation have produced erroneous results (see here and here). The difference is that this time we have more than 100 people publishing what is in fact really, really wrong.

So, how do we, the community, fix this?

  • If you published a dashboard, you should seriously consider publishing a retraction. Many of you have lots of followers, and that’s great. Now tell these followers about this so they don’t spread the misinformation. We suggest adding a prominent disclaimer on your visualization.
  • The good folks at MM recommend that participants should spend no more than one hour working on makeovers. While this is a practical recommendation, you must realize that good work, accurate work, work you can trust, can take much more than one hour. One hour is rarely enough time to vet the data, let alone craft an accurate analysis.
  • Don’t assume that just because Andy and Eva published the data (and shared a headline that too many people mimicked without thinking), everything about the data and headline is fine and dandy. Specifically:
  • Never trust the data! You should question it ruthlessly:
    • What is the source?
    • Do you trust the source? The source probably isn’t trying to deceive you, but the data presented may not be right for the analysis you wish to conduct.
    • What does the data look like? Is it raw data or aggregations? Is it normalized?
    • If it’s survey data, or a data sample, is it representative of the population? Is the sample size large enough?
    • Does the data pass a reasonableness test?
    • Do not trust somebody else’s conclusions without analyzing their argument.

Remember, the responsibility for data integrity does not rest solely with the creator or provider of the data. The person performing the analysis needs to take great care with whatever he or she presents.

Alberto Cairo may have expressed it best:

Unfortunately, it is very easy just to get the data and visualize it. I have fallen victim of that drive myself, many times. What is the solution? Avoid designing the graphic. Think about the data first. That’s it.

We realize that the primary purpose of the Makeover Monday project is for the community to learn, and we acknowledge that this can be done without verified data. As an example, people are learning Tableau every day using the Superstore data, data that serves no real-world purpose. However, the community must realize that the MM data sets are real-world data sets, not fake data. If you build stories using incorrect data and faulty assumptions, then you contribute to the spread of misinformation.

Don’t spread misinformation.

Jeffrey A. Shaffer
Follow on Twitter @HighVizAbility

Steve Wexler
Follow on Twitter @VizBizWiz

Additional reading

Why Not to Trust Statistics. Read this to see why the wrong statistic applied the wrong way makes you just plain wrong (thank you, Troy Magennis).

Simpson’s Paradox and UC Berkeley Gender Bias

The Truthful Art by Alberto Cairo.  If everyone would just read this we wouldn’t have to issue mass retractions (you are going to publish a retraction, aren’t you?)

Avoiding Data Pitfalls by Ben Jones. Not yet available, but this looks like a “must read” when it comes out.

Sources:

1. Trend in Hours worked from Australian Labour Market Statistics, Oct 2010.

http://www.abs.gov.au/ausstats/abs@.nsf/featurearticlesbytitle/67AB5016DD143FA6CA2578680014A9D9?OpenDocument

2. Workplace Gender Equality Agency Data Explorer

http://data.wgea.gov.au/industries/1

3. Differences in practice and personal profiles between male and female ophthalmologists, Danesh-Meyer HV1, Deva NC, Ku JY, Carroll SC, Tan YW, Gamble G, 2007.

https://www.ncbi.nlm.nih.gov/pubmed/17539782?dopt=Citation

4. Gender Equity Insights 2016: Inside Australia’s Gender Pay Gap, WGEA Gender Equity Series, 2016.

http://business.curtin.edu.au/wp-content/uploads/sites/5/2016/03/bcec-wgea-gender-pay-equity-insights-report.pdf

5. Will the real gender pay gap please stand up, Rebecca Cassells, 2016.

http://theconversation.com/will-the-real-gender-pay-gap-please-stand-up-64588

March 17, 2016

Overview

I’m a big fan of Andy Kriebel’s and Andy Cotgreave’s Makeover Monday challenge. For those of you not familiar with this, each week Kriebel and Cotgreave find an existing visualization / data set and ask the data visualization community to come up with alternative ways to present the same data.

As Cotgreave points out in one of his blog posts, “It’s about using a tool to debate data. It’s about improving people’s data literacy.”

With one major exception that I’ll discuss in a moment, the challenge is meeting its goals: each week several dozen people participate, and the submissions and accompanying discussions have been enormously valuable.

But there was one week where the community failed.

Worse than that, the community did some damage that will be difficult to repair.

Bad Data Make Bad Vizzes Make Bogus Conclusions

Week four of the Makeover Monday challenge used survey data from GOBankingRates that posed the question “How much money do you have saved in your savings account?” Here are some of the baseless conclusions from people who participated in the makeover:

Figure 1 — From the source article that spawned the makeover.  Yes, the exploding donut needs a makeover, but it’s the headline “Survey finds that two-thirds of Americans don’t have enough money saved” that presents the bigger problem.

  • Americans Don’t Have Enough Money Saved (See link).
  • 71% of Americans Have Less than $1,000 in Savings. Yikes! (See link).
  • Americans Just Aren’t Saving Money (See link).
  • Most Americans Have Miniscule Savings (See link).
  • 80% of Americans Have Less than $5,000 in Savings! (See link).
  • Americans Are Not Saving Enough Money! (See link).
  • Americans Have Too Little Savings (See link).

So, what’s the problem?

It turns out the key finding from the original publication is not true — and thanks to the makeovers that spurious finding has been amplified dozens of times.

How did this happen?

Let’s dig into the data a little bit.

Is There a Relationship Between Age and Savings?

As I mentioned before, I think the Makeover Monday challenge is great and I’ve participated in a couple of them. I started to work on this one and took a stab at showing the relationship between age and savings, as shown here.

Figure 2 — Divergent stacked bar chart showing the percentage of people that have different savings amounts, sorted by age

This looked odd to me as I expected to see a correlation between age and savings; that is, I expected to see a lot more blue among Seniors and Baby Boomers.

I decided to make the demarcations less granular and just compare people with minimal savings and those with $1,000 or more in savings, as shown here.

Figure 3 — Less granular divergent stacked bar chart

This result seemed way off, so either my supposition is wildly incorrect (i.e., as people get older they save more) or there was something wrong with the data.

Note: I try to remind people that Tableau isn’t just for reporting interesting findings. It’s a remarkably useful tool for finding flaws in the data.

It turns out that while there is indeed something wrong with the data, there was a much bigger problem:

Most people didn’t bother to look at the actual question the survey asked.

What the Survey Asked

The survey asked “How much money do you have saved in your savings account?” It did not ask “How much money do you have saved?”

The difference is titanic, as the average American savings account yields but 0.06 percent interest! That’s infinitesimal (at 0.06 percent, $10,000 in a savings account earns all of $6 a year); you might as well stick your money in a mattress!

Indeed, I am of the Baby Boomer generation and I have but $20 in my savings account — but (thankfully) more in my savings.

So, the vast majority of people that participated in the makeover didn’t bother to look at the actual question and came to — and published — a bogus conclusion.

Were there any other problems with the survey?

You betcha.

What’s Wrong with the Survey?

A visualization is only as good as its underlying data, and the data in question has nothing to do with the savings habits of Americans; it only has to do with having a savings account.

But there are other shortcomings with the survey that should make us question whether the data is even useful for analyzing how much money Americans have sitting in a savings account.

Consider this excellent review of the same Makeover Monday challenge from Christophe Cariou.  He points out the following shortcomings with the survey itself:

  • In the article, we read: ‘The responses are representative of the U.S. internet population’. It is therefore not representative of the US population. See this report by Pew Research Center for age and online access.
  • We also read ‘Demographic information was not available for all respondents, and analysis of responses by demographics is based solely on responses for which the targeted demographic information was available.’ Normally, if it was demographically representative, this would be clarified. This comment adds a doubt.
  • The average savings amount in the article is the sum of the averages of the groups divided by 6. It is not weighted by the size of each group (see the illustration just after this list).
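That last point deserves a quick numerical illustration (group means and sizes invented): an unweighted average of group averages can badly misstate the true mean.

# Six hypothetical demographic groups with very different sizes
group_means = [2000, 1500, 1000, 800, 600, 500]   # average savings per group
group_sizes = [50, 100, 200, 400, 800, 1600]      # respondents per group

unweighted = sum(group_means) / len(group_means)
weighted = (sum(m * n for m, n in zip(group_means, group_sizes))
            / sum(group_sizes))
print(round(unweighted, 2))  # 1066.67, the article's method
print(round(weighted, 2))    # 650.79, the size-weighted mean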

Note: Kudos to Bridget Cogley who also saw the problems with the conclusions when the makeovers first appeared in late January 2016.

Further note: In a subsequent makeover challenge blog post Cotgreave alerted participants to questionable data.

So, Where Exactly is the Harm?

So, dozens of people have created visualizations based on bad data and come up with bogus conclusions. Given the number of articles from allegedly reliable sources reporting shortcomings in savings, what’s the harm of sounding an alarm bell?

I suppose if you are an “ends justify the means” type of person then it’s fine to publish bogus findings as long as they change behavior in a positive way.

But I know many of the people in this community and they would be aghast at using data visualization this way.

I also fear that with collective missteps like this people will question the ability of makeover participants to relay accurate information.

So What Should We Do?

Andy Cotgreave and Andy Kriebel have earned their leadership positions in the data visualization community, so I hope they will make note of this makeover mishap and encourage people that published the bogus result to modify their headlines.

I also strongly encourage anyone working in data visualization to understand the data — warts and all — before rushing to publish. Andy Kriebel is providing the data set and we shouldn’t ask him to find all the flaws in it.  Indeed, that’s part of our job.

Finally, I ask others in the community to be more diligent: only publish work that has been carefully vetted and do not tolerate unsubstantiated work from others.

While it’s true that nothing terrible will happen if more Americans open savings accounts, there may be other situations where publishing spurious conclusions will do some serious damage.