Jan 16 2014
 

Overview

One of the new features in Tableau 8.1 that Tableau Software is trumpeting quite a bit is one-click Box and Whisker Plot generation. While I appreciate the new functionality, this chart type doesn’t “sing” to me as much as jittering does. Indeed, this “jittering” capability was the BIG discovery for me in 2013.

Let’s see how a box and whisker plot compares with jittering using a simple example.

Note: Interactive dashboards that illustrate jittering techniques may be found at the end of this blog post.  Feel free to download and explore.

Salary and Age Bins – Default

Consider the following pre-Tableau 8.1 salary chart that shows how salaries are distributed across age bins.


Figure 1 — Default Salary Distribution by Age Bins

 

While we can see that the top salaries are enjoyed by people in their 50s, there’s nothing that gives us concrete percentiles or shows us where the outliers are. We also can’t tell that there are in fact thousands of dots in the visualization because so many marks are sitting on top of each other.

Salary and Age Bins – Box and Whisker Plot

To see percentiles and outliers we can use Tableau’s Show Me feature and click the Box-and-Whisker Plot button.


Figure 2 — Salary Distribution by Age Bins with Box and Whisker Overlay

 

This is definitely an improvement, but I really don’t “feel” the data as I can’t see how the dots are distributed; they are all stacked on top of each other.

Salary and Age Bins – Jitters

Here’s the original chart, but with the marks “jittered” using a modified version of Tableau’s built-in INDEX() function.


Figure 3 — Salary Distribution by Age Bins with the marks “jittered”

This gives me a much better feel for the data as I can see how the thousands of marks cluster. Of course, I can still superimpose the box plot, as shown here.


Figure 4 — Salary Distribution by Age Bins with the marks “jittered” and box plot overlay

Getting Jitters Using INDEX()

To “jitter” the marks I create a calculated field called “Index” that uses Tableau’s INDEX() function.  I put this on the Columns shelf and compute using ID, as shown here.


Figure 5 – First attempt using Tableau’s INDEX() function

It turns out that for this particular example INDEX() by itself works because there is an equal distribution of IDs across each of the age bins.  Consider the example below where we show a distribution of Superstore Sales across different customer segments.


Figure 6 – Shortcomings of using INDEX() by itself.

Notice that the strip of dots within “Corporate” is much wider than the other segments because there are more orders within “Corporate” than in the other segments.
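To see why the strips end up with different widths, here is a minimal Python sketch (not Tableau code) that mimics what INDEX() does when it restarts at 1 within each partition. The helper name index_per_partition and the mark counts are made up for illustration; only the “Corporate” segment is taken from the chart above.

```python
from collections import defaultdict

def index_per_partition(rows):
    """Emulate INDEX() restarting at 1 within each partition (segment)."""
    counters = defaultdict(int)
    indexed = []
    for segment, value in rows:
        counters[segment] += 1
        indexed.append((segment, value, counters[segment]))
    return indexed

# Hypothetical mark counts: "Corporate" has more orders than the others.
rows = (
    [("Corporate", i) for i in range(300)]
    + [("Consumer", i) for i in range(120)]
    + [("Home Office", i) for i in range(90)]
)

# The largest index in each segment is how far its strip extends on a shared axis.
widths = {}
for segment, _, idx in index_per_partition(rows):
    widths[segment] = max(widths.get(segment, 0), idx)

print(widths)
```

With INDEX() alone, the horizontal extent of each strip equals its number of marks, so on a shared axis the segment with the most marks looks widest.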

The easiest way to fix this is to edit the axis and select “Independent axis ranges for each row or column” from the Edit Axis dialog box. While this will work fine, we’ll look at a different technique that will allow us to control the degree of jittering.

Using Modulus to Control Jittering

When I first blogged about this technique last year, Alex Kerin of Data Driven suggested a simple and elegant solution to different-sized partitions using Tableau’s Mod function. For those of you who forgot your high school mathematics, we use a modulus to determine the remainder when you divide one number by another. Here’s an example:

14 ≡ 30 Mod 8

Translation: 14 is equivalent to 30 Mod 8 because you get the same remainder when you divide 14 by 8 as when you divide 30 by 8 (both remainders are equal to 6).
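If you would like to check the arithmetic yourself, here is a quick Python sketch. The function name congruent is just a label I chose for illustration; Python’s % operator returns the remainder for the positive integers used here.

```python
def congruent(a, b, modulus):
    """Return True when a and b leave the same remainder after division by modulus."""
    return a % modulus == b % modulus

# Remainders of 14 and 30 when divided by 8.
print(14 % 8, 30 % 8)

# Same remainder, so 14 and 30 are equivalent modulo 8.
print(congruent(14, 30, 8))
```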

So, how do we use this capability in our visualization? We want the dots in each segment to span the same width, so instead of using INDEX() we will use INDEX()%25.

This will create 25 “rows” of dots within each segment.

Specifically, when

INDEX()=1, INDEX()%25 will be mapped to 1
INDEX()=2, INDEX()%25 will be mapped to 2


INDEX()=26, INDEX()%25 will be mapped to 1
INDEX()=27, INDEX()%25 will be mapped to 2
etc.
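The mapping above can be sketched in Python, where % gives the same remainders for these positive integers. The function name jitter_position and the two partition sizes are hypothetical, chosen only to show that a large and a small segment end up using the same set of positions.

```python
def jitter_position(index, modulus=25):
    """Wrap a 1-based running index into the range 0..modulus-1."""
    return index % modulus

# Spot-check the values listed above, plus the wrap-around at 25.
for i in (1, 2, 24, 25, 26, 27):
    print(i, "->", jitter_position(i))

# Hypothetical partition sizes: one large segment and one small one.
for size in (300, 90):
    positions = {jitter_position(i) for i in range(1, size + 1)}
    print(size, "marks ->", len(positions), "positions, from",
          min(positions), "to", max(positions))
```

Note that INDEX()=25 wraps to 0, so the 25 positions run from 0 to 24, and both partitions fill that same range regardless of how many marks they contain.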

Note that 25 is not a magic number.  For this example anything above 15 will do the trick (and in the demo workbook I have a parameter slider that controls the MOD setting).

Conclusion

Jittering is a very simple technique and it helps overcome the problem of marks being stacked atop each other when plotting a distribution within a dimension.  It only takes up a little more screen real estate and it packs a terrific visual wallop.

 


  14 Responses to “Boxes, Whiskers, and Jitters”

  1. Hi Steve,

    while I really love this technique, and your description of it, I think with the case of Age bins it could be misinterpreted as showing age as a continuous range within each Age bin, I.e. 51 on the left, 59 on the right.

    Just a thought, what do you think?

    Peter

    • Peter,

      Hmm… I don’t see it that way. Next time I’ll use text dimensions rather than age bins so folks won’t think there is any difference between the left side of a partition and the right side.

      Steve

      • fair enough, for the record I think it’s pretty clear, but the thought crossed my mind how you’d interpret it if you’d never seen one before. Great article, thanks for posting.

  2. Another option would be to right-click on the Index axis, and select “Independent axis ranges for each row or column”. Or am I missing some other benefit that Index with Modulus provides?

    • Joe,

      Of course! No, I cannot off the top of my head think of an advantage that Mod provides over the independent axis. (I would love for somebody to tell me that you would get much better performance with Mod than with an independent axis. That would make me feel better… 😉)

      • Also, Index with an Independent axis is less complex to set up than Index+Modulus

        • Joe,

          I’ve found the modulus very useful as I can constrain the width of the jitterplot. For example, I might have a modulus of 10 but have the axis go from -3 to 13, which places some “air” around the strip of dots. This makes it much easier to see reference lines and bands.

  3. Steve love this. I’ve been doing this for about a year too but in my simple jitter plots have always been able to set the “index () by” in such a way that setting the axis to independent works just fine.

    Joe, you beat me to my question. I’ve been setting up fairly simple jitter plots. So, at the risk of asking the same question, let me twist it a bit. Steve or Joe, can either of you envision a situation where we might need to use Index ( ) + Modulus because setting the independent axis won’t work with the index ( ) function?

    What do you guys think of the way the new box and whisker plots won’t run across the full pane?

    In graphs like these, I’m not keen on the way the new box and whisker plots don’t run across the full width of each pane. So, I often use the old method of setting reference lines for the quartiles, min and max.

    • Yes. My original need for a jitter chart had an ID column that did affect the distribution of data (IDs were incremented by hire date and the chart had time as an aspect to the chart). Therefore I used modulus to deal with that.

      • Alex,
        Yes, if there is some type of implied sort when computing using IDs then the modulus would address that. Thanks.

        Steve

  4. Joe, I agree and will update the post tomorrow so folks that don’t read the comments don’t make the unnecessary step.  That said, I do like parameterizing the modulus as it allows me to get a sense of how dense the data is per segment.

  5. Steve – forgetting the technique (nice idea), my wife saw this and said “Hey – add gender to that salary chart.”… I did – now she wants to know where that data came from – data that PROVES women are paid less than men – and that it gets worse as we age! Damned insights!
    Keep up the good work, Zen Master…

    • Chuck, while this particular data set is made up, it reflects something that is true in the US — at least based on two major salary studies I’ve worked on — and that is that women make less than men purely because they lack a Y chromosome. With all factors being equal — years of experience, education, hours worked, etc. — women simply make less than men because employers get away with paying them less.

      Steve
