Overview
One of the new features in Tableau 8.1 that Tableau Software is trumpeting quite a bit is one-click Box and Whisker Plot generation. While I appreciate the new functionality, this chart type doesn’t “sing” to me as much as jittering does. Indeed, this “jittering” capability was the BIG discovery for me in 2013.
Let’s see how a box and whisker plot compares with jittering using a simple example.
Note: Interactive dashboards that illustrate jittering techniques may be found at the end of this blog post. Feel free to download and explore.
Salary and Age Bins – Default
Consider the following pre-Tableau 8.1 salary chart that shows how salaries are distributed across age bins.
Figure 1 — Default Salary Distribution by Age Bins
While we can see that the top salaries are enjoyed by people in their 50s, nothing here gives us concrete percentiles or shows us where the outliers are. We also can’t tell that there are in fact thousands of dots in the visualization, because so many marks are sitting on top of each other.
Salary and Age Bins – Box and Whisker Plot
To see percentiles and outliers we can use Tableau’s Show Me feature and click the Box-and-Whisker Plot button.
Figure 2 — Salary Distribution by Age Bins with Box and Whisker Overlay
This is definitely an improvement, but I really don’t “feel” the data as I can’t see how the dots are distributed; they are all stacked on top of each other.
Salary and Age Bins – Jitters
Here’s the original chart, but with the marks “jittered” using a modified version of Tableau’s built-in INDEX() function.
Figure 3 — Salary Distribution by Age Bins with the marks “jittered”
This gives me a much better feel for the data as I can see how the thousands of marks cluster. Of course, I can still superimpose the box plot, as shown here.
Figure 4 — Salary Distribution by Age Bins with the marks “jittered” and box plot overlay
Getting Jitters Using INDEX()
To “jitter” the marks I create a calculated field called “Index” that uses Tableau’s INDEX() function. I put this on the Columns shelf and compute using ID, as shown here.
Figure 5 – First attempt using Tableau’s INDEX() function
It turns out that for this particular example INDEX() by itself works because there is an equal distribution of IDs across each of the age bins. Consider the example below where we show a distribution of Superstore Sales across different customer segments.
Figure 6 – Shortcomings of using INDEX() by itself.
Notice that the strip of dots within “Corporate” is much wider than the other segments because there are more orders within “Corporate” than in the other segments.
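To make the cause concrete, here is a minimal Python sketch of what a running index that restarts in each partition looks like. This is only an approximation of how INDEX() behaves when computed within a segment, not Tableau's actual implementation; the function name partition_index and the row counts are hypothetical and chosen just for illustration.

```python
from collections import defaultdict

def partition_index(segments):
    """Return a 1-based running index per row that restarts in each segment.

    A rough stand-in for INDEX() computed within a partition (assumption).
    """
    counters = defaultdict(int)
    result = []
    for seg in segments:
        counters[seg] += 1
        result.append(counters[seg])
    return result

# Hypothetical row counts: Corporate has more orders than the other segments.
rows = ["Corporate"] * 6 + ["Consumer"] * 3 + ["Home Office"] * 2
indexes = partition_index(rows)

# The widest x-position used in each segment equals that segment's row count.
strip_width = defaultdict(int)
for seg, idx in zip(rows, indexes):
    strip_width[seg] = max(strip_width[seg], idx)

print(dict(strip_width))
```

In the sketch, the Corporate strip runs out to position 6 while Home Office stops at 2, which is the same uneven width seen in Figure 6.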
The easiest way to fix this is to edit the axis and select “Independent ranges for each row or column” from the Edit Axis dialog box. While this will work fine, we’ll look at a different technique that will allow us to control the degree of jittering.
Using Modulus to Control Jittering
When I first blogged about this technique last year, Alex Kerin of Data Driven suggested a simple and elegant solution to different-sized partitions using Tableau’s Mod function. For those of you who forgot your high school mathematics, the modulus gives us the remainder when you divide one number by another. Here’s an example:
14 ≡ 30 Mod 8
Translation: 14 is equivalent to 30 Mod 8 because you get the same remainder when you divide 14 by 8 as when you divide 30 by 8 (both remainders are equal to 6).
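If you want to check the arithmetic yourself, here is a quick Python sketch. Python's % operator returns the remainder for positive integers, just as in the example; the helper name congruent is my own and purely illustrative.

```python
def congruent(a, b, n):
    """Return True when a and b leave the same remainder after division by n."""
    return a % n == b % n

remainder_14 = 14 % 8  # 14 = 1 * 8 + 6
remainder_30 = 30 % 8  # 30 = 3 * 8 + 6

print(remainder_14, remainder_30, congruent(14, 30, 8))
```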
So, how do we use this capability in our visualization? We want the strip of dots in each segment to be the same width no matter how many marks the segment contains, so instead of using INDEX() we will use INDEX()%25.
This will create 25 “rows” of dots within each segment.
Specifically, when
INDEX()=1, INDEX()%25 will be mapped to 1
INDEX()=2, INDEX()%25 will be mapped to 2
…
INDEX()=26, INDEX()%25 will be mapped to 1
INDEX()=27, INDEX()%25 will be mapped to 2
etc.
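The mapping above can be sketched in a few lines of Python. This assumes a 1-based index like the one INDEX() returns; the function name jitter_position and the 60-mark sample are hypothetical.

```python
def jitter_position(index, modulus=25):
    """Wrap a 1-based running index into one of `modulus` horizontal positions."""
    return index % modulus

# Positions for the first 60 marks in a single segment (hypothetical size).
positions = [jitter_position(i) for i in range(1, 61)]

print(positions[:3])     # marks 1, 2, 3
print(positions[24:27])  # marks 25, 26, 27
```

Note that a mark whose index is an exact multiple of 25 lands on position 0, so the 25 positions run from 0 through 24. A segment with 25 or more marks fills every position, which is why the strips end up the same width.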
Note that 25 is not a magic number. For this example anything above 15 will do the trick (and in the demo workbook I have a parameter slider that controls the MOD setting).
Conclusion
Jittering is a very simple technique and it helps overcome the problem of marks being stacked atop each other when plotting a distribution within a dimension. It only takes up a little more screen real estate and it packs a terrific visual wallop.