(UPDATE April 22, 2017: A different technique for creating a viola chart, submitted by Ben Young (who is not on Twitter), has been added to the end of this post. His technique and mine differ in how they create the Sankey-esque curves. My thanks to Ben for doing the work to figure out an alternate technique that could well be more practical in many circumstances.)
One fine Makeover Monday found us working with some apparently Russian survey data. (This was fraught with its own issues, but let's leave those aside for now.) The original viz purported to show what different social strata perceived to be the reasons for a person's success.
Since there were no clear source materials, we were speculating here, but it seemed like people were asked, "Do you think that [X] is a reason for a person's success?", and the value presented in the chart was the percentage of people in each social stratum who said yes.
(We later found out, thanks to Chris Love's sleuthing on dicey Russian websites, that people were only allowed to say "yes" to two questions, which makes it a completely different story. Nonetheless, let's proceed.)
The survey data was structured like this:
Dimension: Cohort (3 different social strata: poor, middle class, rich)
Dimension: Survey Question (8 different suggested reasons for people's success)
Measure: Percentage of people per cohort/survey question who answered "yes"
With only 24 data points to work from, we were fairly limited in what we could execute as a final product.
Because we only had aggregated data, we didn't know the population of the survey sample (1500 people, it turns out; thanks @ChrisLuv).
We also didn't know the relative numbers of people in each cohort (only about 1% claimed to be "rich," it also turns out; thanks @ChrisLuv).
Finally, we didn't actually know what constituted classifying someone as "rich" or "poor" (self-reporting, as it happens; thanks @ChrisLuv).
So we knew the data was flimsy. But if we accepted the premise of the project, we were still responsible for visualizing it effectively. And from my analysis, a few things became clear--if not remotely surprising.
1. All groups placed low values on "ability/talent" and on "luck/good fortune."
2. Rich people, more than other groups, tended to identify positive personal qualities ("hard work," "entrepreneurial spirit") as reasons for success.
3. Poor people, more than other groups, tended to identify unfair external factors ("access to initial capital," "connections to the right people") as reasons for success.
Point 1, above, was easily shown with a boxplot, which I did here.
(The red dots are the rich people; blue dots, middle-class; gold dots, poor people.) But that left me with points 2 and 3.
Now, for some context: Lindsey Poulter, in the previous week's Makeover Monday, had produced a beautiful viz with barbells and whitespace galore. Many of the early entries for this week's contest--and in point of fact, my own initial design attempts--were in a similar style.
Whether we were unconsciously mimicking the style that captivated us the week before, or whether parallel thinking brought us all to the same design conclusions, I don't know. But I realized that I wasn't saying anything new or interesting with the viz I was working on; and more importantly, I wasn't conveying the analytic messages I wanted to convey strongly enough.
There are some simple ways to show the differences, some worse than others. I looked at line charts, barbells, treemaps, dot plots, area charts, and so on; I thought about showing the differences instead of the raw scores; I thought about a LOT of things, and none of them grabbed me.
They were all failing the main goal, which was to REALLY emphasize the difference between the rich people's value of internal qualities vs. the poor people's value of external qualities.
The area chart was the closest to accomplishing this, especially split out into three individual area charts instead of one stacked area. But it looked...well...
It looked boring.
So if I was going to use an area chart, what could I do to make it less boring?
I knew I could spruce it up with visual imagery, but I wasn't trying to go crazy with photos or custom-made icons or anything this week. Really, I was filled with antipathy towards the data set, and I didn't want to give it the satisfaction of taking up lots and lots of my time creating bespoke graphics. Moreover, the IronViz feeder contest was the previous week and I was pretty much burned out on custom graphics.
Enter the not-a-violin chart.
Now, a true violin chart is a pumped-up version of a boxplot. It includes all of the summary data of a boxplot--median, quartiles, etc.--but also describes the distribution of the values in the chart. (http://www.datavizcatalogue.com/methods/violin_plot.html is a good simple explanation and sample of a true violin chart.)
I didn't want to use a true violin chart, per se, because we didn't have nearly enough values. Mostly, I just wanted a cooler-looking version of an area chart, one that would emphasize where the bulges were in the distribution of values across all of the questions for each of the social strata.
Like a simple, dumbed down version of a violin chart....
A viola chart! NAILED IT!
[At this point you should imagine a .gif of the Success Kid. I'm not posting the actual image because, believe it or not, that photo is copyrighted.]
For reals, though, there's no true name for this chart. Orchestra nerds would call it a viola chart because violas always bear the brunt of orchestra jokes. Doesn't really matter. Basically it's just a gussied-up area chart with mathy curves taking us from point to point. It's based primarily on the Sankey charts that Jeffrey Shaffer, and later Olivier Catherin, provided excellent tutorials for building.
So how do we create a viola chart?
Creating a Viola Chart
1. Learn From the Masters First, Then Come Back Here
So, yes, it's true: to make these curves, we're going to be adding a whole bunch of points to our viz. When we create mathy curves to connect points in our vizzes, the standard for making smooth lines seems to be a 50x increase in the number of points plotted on the screen.
That's why we need to add an Excel tab or a CSV to our datasource; that sheet contains Jeff's 50-row data densification model. (Yes, this particular solution requires doing some data manipulation outside of Tableau.)
That model looks basically like:
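As a rough sketch of what such a model sheet could contain (an assumption, not a copy of Jeff's file): a constant "Link" key to join on, plus a [T] column of evenly spaced values from -6 to 6. The function name `build_model` and the row count of 49 are illustrative choices; with 49 rows the step is 0.25 and T = 0 lands exactly on the 25th row. Adjust `n_points` if your model has a different number of rows.

```python
import csv
import io

def build_model(n_points=49, t_min=-6.0, t_max=6.0, link="link"):
    """Return densification rows: a constant join key plus evenly spaced T values."""
    step = (t_max - t_min) / (n_points - 1)
    return [{"Link": link, "T": t_min + i * step} for i in range(n_points)]

model = build_model()

# Write the rows as CSV text (swap io.StringIO for a real file to save it).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Link", "T"])
writer.writeheader()
writer.writerows(model)
model_csv = buffer.getvalue()
```

Every row of the survey data carries the same "Link" value, so joining on it repeats each survey row once per T value.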
I encourage you to read https://www.dataplusscience.com/SankeyinTableau82.html to understand what the point of doing this is, and to see how you join it to your preferred data set. It explains the process way better than I could ever do fifteenth-hand. I'll wait.
OK, are you back? Great. Let's talk about what we'll have to do to our data set to get it ready for viola-ing.
2. Re-Shaping the Data So That Rows Refer to Curves Instead of Points
Here's what the original data looks like, with the "Link" column added. I shaped the data so that from top to bottom, the "Reason" field goes from internal to external (that's a subjective opinion, of course, but for our purposes just go with it). Within each reason, the Social Strata goes rich-middle-poor. Then I added a sequential ID number. This just makes things easier later. The "Rate" comes from the original viz.
One thing we'll have to address is that the curves in a viola chart need a place to start and a place to finish, and they need to appear continuous from mark to mark. In a Sankey diagram, we can have stacked bar charts form the interstitial layers between each single curve; in this chart, we don't have that luxury. We need to have a series of sigmoid-based curves that all interconnect.
Let's ignore the Social Strata column for now. The Rate column will eventually describe our Y axis. So, for our X axis, let's think of our Reason column as containing eight discrete, and thus evenly-spaced, anchors. With this data structure, our first curve would traverse from the first Reason anchor ("abilities/talents") to the second Reason anchor ("entrepreneurial spirit/courage") along the 50 added points from the "Model" tab. The next curve would go from there to the "hard work" anchor, and so on.
There's one problem with this. We have 8 anchors; but that will only give us 7 curves. Now, we *could* draw the chart that way, but then our first and our last anchors would appear to be getting shortchanged. They'd only get half of a curve each to call their own, while the middle six anchors would get a full curve.
For example, there are eight marks in the curved area chart here. But it's obvious that, perceptually, the first one and the last one get shortchanged. What we will do in our data is to add an anchor to the beginning of the chart and another at the end of the chart, so that it extends to zero on both sides.
I accomplished this in our data by making each row represent a curve, rather than an anchor. I changed "Rate" to "Start Rate" and "Reason" to "Start Reason". Then I added "End Rate" and "End Reason" columns that were filled with the values from the next Reason down in the chart. For the first three rows, "Start Rate" was "0%" and "Start Reason" was Null; the same went for the "End Rate" and "End Reason" of the last three rows.
We end up with 27 total rows (9 curves for each of 3 social strata).
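If you'd rather script the reshaping than do it by hand, here is a minimal sketch of the idea. The reason labels and the rates are placeholders (NOT the survey values), and `to_curve_rows` is a hypothetical helper name; the only things taken from the steps above are the column names, the rich-middle-poor ordering, and the zero-valued Null anchors at each end.

```python
STRATA = ["Rich", "Middle", "Poor"]
REASONS = [f"Reason {i}" for i in range(1, 9)]  # placeholder labels, internal -> external

# Placeholder rates, NOT the survey values: rates[stratum][i] belongs to REASONS[i].
rates = {s: [0.10 + 0.05 * i for i in range(8)] for s in STRATA}

def to_curve_rows(strata, reasons, rates, link="link"):
    """Turn anchor data into one row per curve, padding both ends with a zero anchor."""
    rows = []
    n_curves = len(reasons) + 1  # 8 anchors plus 2 padding anchors -> 9 curves
    for curve_index in range(n_curves):
        for stratum in strata:
            # Padded anchor sequence: (None, 0), the real anchors, (None, 0).
            anchors = [(None, 0.0)] + list(zip(reasons, rates[stratum])) + [(None, 0.0)]
            start_reason, start_rate = anchors[curve_index]
            end_reason, end_rate = anchors[curve_index + 1]
            rows.append({
                "ID": len(rows) + 1,  # sequential ID, used later for sorting
                "Link": link,         # constant key for joining to the model sheet
                "Social Strata": stratum,
                "Start Reason": start_reason,
                "Start Rate": start_rate,
                "End Reason": end_reason,
                "End Rate": end_rate,
            })
    return rows

curve_rows = to_curve_rows(STRATA, REASONS, rates)
```

Each row's end anchor is the next row's start anchor within the same stratum, which is what lets the curves connect without gaps.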
3. You Were NEVER Told There Would Be No Math, 'Cause There Totally Is
Now let's start building our viz. We'll put [ID] on Detail, [Social Strata] on Rows, and [Start Reason] on Columns. You'll want to sort [Start Reason] by [ID] because that will make sure that the Reasons go from Internal to External, left-to-right. It's important that we have the order pre-set so that the curves flow into each other nicely. We could also put [Social Strata] on Color at this point, because we're going to at some point, and there's no time like the present. I also manually sorted [Social Strata] to be Rich-Middle-Poor.
Notice that "Null" is the first column here. Also notice that [Start Reason] is discrete. That's what you want.
Now we're going to start adding calculations. The first one, [Sigmoid Function], is straight out of Jeff's tutorial (were you paying attention?) and describes the shape of the curve.
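For anyone who skipped the homework, here is a minimal sketch in Python, assuming [Sigmoid Function] is the standard logistic function of [T]. The name `sigmoid` is just for illustration; check Jeff's tutorial for the exact Tableau expression.

```python
import math

def sigmoid(t):
    """Standard logistic function: rises from near 0 to near 1 as t goes from -6 to 6."""
    return 1.0 / (1.0 + math.exp(-t))

# Sample the curve at the two ends and the middle of the T range.
samples = {t: sigmoid(t) for t in (-6.0, 0.0, 6.0)}
```

At T = 0 the function is exactly 0.5, and at the ends of the -6 to 6 range it sits within about a quarter of a percent of 0 and 1, which is why each curve appears to start and finish flat.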
The next one, [Curve], will end up going on our Rows shelf. This is the function that will actually draw the curve. For each [Start Reason], it starts with a Y value of [Start Rate], ends 49 points farther down the X-axis with a Y value of [End Rate], and varies the Y value according to the mathematics laid out in [Sigmoid Function].
For instance, the 25th point will have a Y value of exactly the mean of [Start Rate] and [End Rate]--just as if it were a straight line between those two points. But because of the math described in [Sigmoid Function], the other points will not fall on a straight line, but instead on a specifically defined S-curve.
The reason we're dividing our [Start Rate] and our [End Rate]-[Start Rate] in half will become obvious later---or already, if you're wicked smart.
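Here is a sketch of that calculation in Python, assuming [Curve] is half of [Start Rate] plus half of the rate difference scaled by the sigmoid, as described above. The rates 0.2 and 0.6 and the 49-value T grid are made-up inputs for illustration.

```python
import math

def sigmoid(t):
    """Standard logistic function of t."""
    return 1.0 / (1.0 + math.exp(-t))

def curve(start_rate, end_rate, t):
    """Half-height S-curve from start_rate / 2 (t = -6) toward end_rate / 2 (t = 6)."""
    return start_rate / 2 + (end_rate - start_rate) / 2 * sigmoid(t)

# Illustrative inputs: one curve from a 20% anchor to a 60% anchor.
ts = [-6 + 0.25 * i for i in range(49)]
ys = [curve(0.2, 0.6, t) for t in ts]
```

The middle value (the 25th of 49, where T = 0) comes out as half of the mean of the two rates, and the end values sit very close to half of the start and end rates.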
4. Build Your Viola
Put the [T] field on the Columns shelf and the [Curve] field on the Rows shelf. Set them both to be Dimensions, but Continuous. So they'll still be green pills, but they will not be aggregated in any way. If you have the Mark type set to Automatic, it should look something like this:
Starting to take shape! Now change the Mark type to be Area.
Now let's do some sneaky things. We're going to duplicate our one area chart (which is actually 9 area charts (which is actually 27 area charts)) and make it a mirror image of itself. Duplicate [Curve] in the Rows shelf. Right click on the new [Curve] pill and select "Edit in Shelf." Put a minus sign in the text window before "[Curve]" and hit Enter.
Then do the old Dual Axis/Synchronize Axes dance in whichever manner you've grown accustomed to. If Tableau throws [Measure Names] onto the Color shelf for no good reason, get rid of it. Nobody asked for that nonsense.
And now you know why we had to halve our [Start Rate] values in the [Curve] calculation...just in case we ever wanted to show the [Curve] axis values. Although, I really don't think we ever would.
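To see the halving pay off, here is a small sketch (same assumed [Curve] formula as described in step 3, made-up rates) showing that the distance between the mirrored curves equals the full, un-halved rate at that point.

```python
import math

def sigmoid(t):
    """Standard logistic function of t."""
    return 1.0 / (1.0 + math.exp(-t))

def curve(start_rate, end_rate, t):
    """Half-height S-curve, as in the assumed [Curve] calculation."""
    return start_rate / 2 + (end_rate - start_rate) / 2 * sigmoid(t)

def mirrored_band(start_rate, end_rate, t):
    """Return the (lower, upper) edges: -[Curve] and [Curve]."""
    upper = curve(start_rate, end_rate, t)
    return -upper, upper

# Illustrative inputs: halfway (t = 0) between a 20% anchor and a 60% anchor.
lower, upper = mirrored_band(0.2, 0.6, 0.0)
thickness = upper - lower
```

Here `thickness` works out to 0.4, the plain mean of 20% and 60%, so the total width of the viola at any point reads as the actual rate.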
5. Formatting For Sneaky Petes
Now it's time to do a whole bunch of turning off lines and hiding things, just like always:
Edit the [T] axis so that it has a fixed range of -6 to 6. Otherwise you'll get ugly stripes in your violas. The previous sentence has never been uttered in human history.
Hide every axis.
Hide every header.
From the Format card, turn off every row divider, every column divider, every row and column band, every axis rule, every gridline, every zero line, EVERYTHING.
Make sure your Color card has the opacity set to 60%. It does this by default when you pick Area as your Mark type, but it doesn't hurt to double check.
And now you're basically done with the layout. The cool thing about this layout is that it looks like there are internal gridlines, but really, those are just the borders that Tableau automatically puts around every Area mark. Usually it's incredibly annoying that it draws vertical lines on the sides of any Area mark, but here it works to our advantage in that it gives us a cool effect. If you bump the opacity up to 100%, though, they disappear. When you change the opacity of an Area mark, the borders stay at 100%, it seems.
One last bit of business: tooltips.
6. What's So Hard About the Tooltips?
Well, you want the reason and the rate that show up in the tooltip to reflect the actual anchor point that your pointer is closest to. So, since our [T] values (which are our X axis) within each curve range from -6 to +6, we will write a conditional calculation for Reason called [Reason to Show], and one for Rate called [Percentage to Show], so that whenever [T] is negative, tooltips display our [Start Reason] and [Start Rate]; otherwise they'll display the [End Reason] and [End Rate].
It's not hard to do, but it's necessary in order to make your tooltip display accurate data. Then do whatever formatting you want to do within the tooltips. (Or you can go hog-wild on conditionals before you even put anything on the Tooltips card, so that you don't have ugly things like I have in my viz -- I didn't do any checking-for-nulls -- that I just covered up on the Dashboard with text boxes so that you'd never mouse over them. Yes, sometimes there's a mess under the hood.)
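The logic of those two calculations, sketched in Python. The third function, `tooltip_text`, is a hypothetical extra showing the kind of null check I skipped; the reason label and rates in the example are placeholders.

```python
def reason_to_show(t, start_reason, end_reason):
    """Left half of a curve (t < 0) belongs to the start anchor, the rest to the end anchor."""
    return start_reason if t < 0 else end_reason

def percentage_to_show(t, start_rate, end_rate):
    """Same rule as reason_to_show, applied to the rate."""
    return start_rate if t < 0 else end_rate

def tooltip_text(t, start_reason, start_rate, end_reason, end_rate):
    """Hypothetical null-safe tooltip: blank when the nearest anchor is a padding anchor."""
    reason = reason_to_show(t, start_reason, end_reason)
    if reason is None:
        return ""
    rate = percentage_to_show(t, start_rate, end_rate)
    return f"{reason}: {rate:.0%}"

# First curve of a stratum: Null start anchor, then a placeholder first reason.
left_half = tooltip_text(-3.0, None, 0.0, "Reason 1", 0.4)
right_half = tooltip_text(3.0, None, 0.0, "Reason 1", 0.4)
```

With the blank text for Null anchors, the padded ends of the chart show nothing on hover instead of a Null reason with a 0% rate.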
7. Messing With the Math
I personally found the standard sigmoid curve to be a little too boxy for my tastes, so I created a parameter called [Smoothness of Sigmoid Curve] to mess with the curvature a little bit, and plugged that into a replacement calculation called [Modified Sigmoid Function].
From there I could mess around with the smoothness of the curve. I ended up setting it to 1.5 and was happy with the result.
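One plausible way such a parameter could work, sketched in Python: divide T by the smoothness value before applying the logistic function. This is an assumption for illustration, not necessarily the formula in the workbook, and `modified_sigmoid` is an illustrative name.

```python
import math

def modified_sigmoid(t, smoothness=1.5):
    """Logistic function with t divided by `smoothness`; values above 1 give a gentler S."""
    return 1.0 / (1.0 + math.exp(-t / smoothness))

# Compare the standard curve (smoothness 1.0) with the gentler one at a few T values.
comparison = {
    t: (modified_sigmoid(t, 1.0), modified_sigmoid(t, 1.5))
    for t in (-6.0, -3.0, 0.0, 3.0, 6.0)
}
```

One trade-off of this particular form: with a smoothness of 1.5, the values at T = -6 and T = 6 are roughly 0.018 and 0.982 rather than nearly 0 and 1, so each curve stops slightly short of its anchors. Whether that is visible depends on the size of the jump between neighboring rates.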
For the final Dashboard, there's a ton of floating and adding text fields and making separate worksheets just for header rows, but the technical elements of creating the viola chart are basically just this.
8. And So
This is really just a minor modification of lots of prior math work done by the community at large. I'm sure there are more elegant ways of creating all of these effects in Tableau itself; I'm sure there are ways to create a true violin plot; and I'm sure there are people who are resistant to the idea of having curved area plots when straight lines, or bar charts, or just a dot will do. But in the event there's a use case that you find yourself up against, and you need a chart type based on the instrument that is the butt of orchestral jokes, I hope this helps you out.
Feel free to download the workbook and mess with it to your heart's content...just don't let yourself believe that luck isn't a major component of success.
9. Revisions and Adaptations
Since I originally posted this, Ben Young has remixed and improved on the viola chart technique. His version is based on the Sankey Diagram with Data Densification technique that Chris Love developed. The main advantage of this technique over the one that I used is that it requires less data manipulation outside of Tableau: it relies instead on a Custom SQL statement to join the data to itself, and then manipulates the data within Tableau.