Designer, Analyst, Human Being: Lessons Learned from Analyzing (and Re-Analyzing (and Re-Re-Analyzin

Mike Cisneros
Jan 20, 2017

There is an award-winning weekly project among people in the Tableau data visualization community called “Makeover Monday.” Each week, the organizers, Andy Kriebel and Eva Murray, identify a data visualization that has appeared publicly and challenge participants, using the same data (and only the same data), to create a different version of the published work. Tell the same story differently, tell a better story with the same data, or just try creative techniques to accomplish similar goals. (The provided visuals are not necessarily “bad,” keep in mind. The point is to learn from one another and get better as a community, not to denigrate the work of others.)

The Challenge: Design and Technical Considerations

This week’s provided data set was a collection of Donald Trump’s tweets and retweets, based on a BuzzFeed article entitled “294 Accounts Donald Trump Retweeted During the Election.” The data was collected by the Trump Twitter Archive (@realtrumptweet on Twitter), starting from back in 2009, along with some metadata about each one (timestamp, engagement, the handle of the person he was retweeting, and so on).

I saw the original challenge as: how do I plot every individual tweet—because I want them to be clickable and seen as individual nodes on a diagram—for an entire month, or even an entire year, in a way that also tells a story on multiple levels?

I figured the two easiest dimensions to show cyclical time series data on would be “time of the day” and “day of the month,” because that would work equally well for showing a month, a year, or a decade’s worth of data—all of which were levels of detail I hoped to provide in some way.

I decided to make a scatterplot visualization using polar coordinates, which in effect is a radial chart or a variant on a radar chart. In a circular chart like this, instead of an X and a Y dimension, you have an Angle and a Radius dimension. This seemed perfect for what I wanted to achieve:

The most common everyday object we have that encodes data on an Angular dimension is a clock, and “time of the day” was one of the specific dimensions I wanted to encode.
The Radius dimension in a circular chart, particularly when using discrete, sequential numeric data, is most visually reminiscent of the rings of a tree trunk—which we instinctively associate with a tree’s age, in that inner rings are earlier in the tree’s lifespan, and outer rings are more recent. This reflexive understanding of concentric circles corresponded well to the idea of using that dimension to show “day of the month.”

A lot of the technical work that goes into creating a bespoke chart type like this in Tableau has already been done by other authors, in the form of radial bar charts (Dave Hart and Adam Crahen, for example); Charlie Hutcheson has also done a fine explainer on the technique. I simply had to tweak the existing formulas to show points at a specific radius along an angle, rather than compute a bar length with a start and end point along that same angle.

But that’s not why I brought you here today.

The technical details involved in making this dashboard aren't super important. This isn't really a story about the viz itself. It's really a story about making sure you are seeing what you think you're seeing when you start to analyze data.

Round One: Discoveries vs. "Discoveries"

Here's what I mean.

After I first settled on what type of chart to use, prepped the data, put a few worksheets together, and sketched out the overall design and layout of the dashboard, I got it to a point where I was happy with the result.

As I described above, I had encoded the polar scatterplot so that the inner ring of the circle would be the first day of the month, the outer ring of the circle would be the last day of the month, and the angle, going clockwise around the circle, would show the time of day, in a 24-hour cycle (i.e., midnight was at the very top, and noon was at the very bottom).
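For readers who want to see that encoding spelled out, here is a minimal sketch in Python. It is not the Tableau calculation used for the dashboard; the function name `polar_point` and the `inner_radius` offset (which keeps day 1 from collapsing onto the center) are assumptions made for illustration.

```python
import math
from datetime import datetime


def polar_point(ts: datetime, inner_radius: float = 5.0) -> tuple[float, float]:
    """Map a timestamp to (x, y) on a 24-hour clock face.

    Angle: time of day, clockwise, midnight at the top, noon at the bottom.
    Radius: day of the month, so day 1 is the innermost ring.
    inner_radius is a hypothetical offset, not something from the original viz.
    """
    # Fraction of the day elapsed: 0.0 at midnight, 0.5 at noon.
    day_fraction = (ts.hour + ts.minute / 60 + ts.second / 3600) / 24
    angle = 2 * math.pi * day_fraction  # measured clockwise from the top
    radius = inner_radius + ts.day
    # For a clockwise angle measured from the top, x uses sin and y uses cos.
    x = radius * math.sin(angle)
    y = radius * math.cos(angle)
    return x, y


# A tweet at 6:00 AM on the 10th lands on the right-hand side of the circle.
print(polar_point(datetime(2016, 7, 10, 6, 0)))
```

Plotting those (x, y) pairs as an ordinary scatterplot gives the rings-and-clock layout described above.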

I built the dashboard so that there were four main areas of visual data:

one big chart showing one year’s worth of tweets;
a strip of charts, a small-multiple view, for each year, so that you could see the year-to-year change in a single glance, but also select any year to zoom in on;
a second small-multiple grid showing the individual months for the selected year; and
a window where the viz would display the full text, the engagement level (likes and retweets), and the client that Trump used to send any particular tweet the user clicked on from the main chart.

I also included space towards the top of the dashboard to write out, in text, the primary story that the visualization was telling.

And boy, did I think there was a story to tell when I first put the dashboard together.

Because I had chosen to use a polar scatter chart, I was able to see something astonishing almost immediately:

There were literally zero tweets between 12 PM and 1 PM at all. None. At all. I checked the data for other years, and found the same thing. A total Twitter blackout for one hour every day, without fail.

This was amazing. Had I discovered a mysterious hole in Trump's day when he completely blocked out the Internet and all outside distractions? Was this a mysterious pattern of behavior that nobody had ever mentioned before?

Or maybe, before we get too excited about what we might have discovered, should we take a closer look at our data set and make sure that it is bulletproof?

Now to be honest, I was really hoping that this was a life-changing discovery. It certainly caught the attention of my wife and my friend, who were in the room with me when I was first working on the visualization. We enjoyed speculating as to what that hour of time was being used for every single day. (All of said speculation, as you can imagine, was extremely charitable and kind-hearted.)

But here's the thing: if there had been 10 or 15 or 23 tweets in that window of time, when every other window of time had hundreds or thousands of tweets, then I would believe that there was something fishy going on with his behavior during that hour. But since there were literally zero tweets, Occam’s razor was telling me that it was just a data problem.
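That reasoning can be turned into a quick sanity check. The sketch below, in Python, counts tweets per hour and flags any hour that is completely empty while the hours on either side of it are not. The function names and the sample timestamps are invented for illustration; they are not from the actual data set.

```python
from collections import Counter
from datetime import datetime


def tweets_per_hour(timestamps):
    """Return a list of 24 counts, one per hour of the day."""
    counts = Counter(ts.hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]


def isolated_empty_hours(per_hour):
    """Hours with zero tweets whose neighboring hours both have tweets.

    A perfectly empty bin between two busy ones is more likely a
    parsing problem than a real pattern of behavior.
    """
    flagged = []
    for hour, count in enumerate(per_hour):
        before = per_hour[(hour - 1) % 24]
        after = per_hour[(hour + 1) % 24]
        if count == 0 and before > 0 and after > 0:
            flagged.append(hour)
    return flagged


# Made-up sample: nothing at all in the noon hour, extra tweets at midnight.
sample = [datetime(2016, 1, 1, h, 15) for h in (0, 0, 0, 10, 11, 11, 13, 14)]
print(isolated_empty_hours(tweets_per_hour(sample)))
```

A flagged hour does not prove the data is broken, but it is a cue to go back and look at how the timestamps were parsed before drawing conclusions.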

Nevertheless, it was late Sunday night by this time, I was excited about what I had made, and I wanted to post it before I called it a day, even though I didn't have a perfect answer yet. So I sort of copped out. I framed the story as a question: "Is this a data problem or a mysterious behavior?" I peppered the text portions of the viz with a lot of caveats, put it up on Tableau Public, tweeted a screenshot and a link to the workbook, and went to bed.

Round Two: You Got Your Answer. Now What?

By the time I awoke Monday morning, it looked like my question had been answered. George Gorczynski had figured out that, in the Tableau Data Extract (TDE), the original dates were parsed such that the timestamp for any tweet sent between 12 PM and 1 PM was converted from PM to AM. So the TDE had far more tweets than it should have had in the bin from midnight to 1 AM, and none at all from noon to 1 PM. Once Andy had remade the TDE, I re-downloaded it and began to rebuild the viz.

While this was going on, Chris Love, my external conscience, pointed out the need for actively removing (not just caveating) work that is out in public once you know for a fact that it contains flawed data or flawed conclusions. At the time, my original version of the viz was still posted on Tableau Public, and was linked to from that Sunday night Twitter post. I could easily put up several new tweets setting the record straight—which I did—and I could add a caveat to the Tableau Public workbook—which I also did.

But that didn’t change the fact that the original, uneditable tweet, containing a screenshot of the original dashboard, was already out in the world, being shared on its own. What were the odds that everyone who saw it decided to check the reply chain for caveats, or clicked on the link to the Tableau dashboard to see the new version? Long odds, I'd say. Long at best.

So as much as I liked getting the instant crack-like feedback of seeing people like and share the original post, the only justifiable action was to delete it entirely, and then get back to work creating a new version, based on accurate data, telling a true story.

Finding the New Story

Keep in mind that the original story (I always like to have the viz tell a specific story) was entirely centered around this missing hour of tweets. That missing hour story turned out to be a story of "hh" vs. "HH" in DATEPARSE, which is not a very exciting story to tell a general audience. I had to find a completely new tale to tell.
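Python's `strptime` has a loosely analogous pitfall, shown below as an illustration only; it is not the Tableau DATEPARSE call from the workbook, and the timestamp is made up. Here `%I` plays the role of the 12-hour "hh" pattern and `%H` the role of the 24-hour "HH" pattern.

```python
from datetime import datetime

raw = "2016-07-04 12:30:00"  # made-up timestamp, written on a 24-hour clock

# 12-hour directive with no AM/PM marker: the 12 is read as 12 AM.
wrong = datetime.strptime(raw, "%Y-%m-%d %I:%M:%S")

# 24-hour directive: the 12 is read as noon.
right = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

print(wrong.hour)  # 0
print(right.hour)  # 12
```

The analogy is loose: it only shows how a 12-hour pattern can fold the noon hour into the midnight hour, which is the same symptom the extract had.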

After some more investigation, I realized that the new story was that Trump's tweeting patterns changed dramatically from the first four years he used the Twitter platform (2009-2012) to the most recent four years. It seemed like he had truly learned how to tweet.

By that, I don’t mean that he learned the mechanics of posting a tweet or using a client on his phone. I mean that he learned that he couldn't confine himself to only tweeting during an eight-hour window each day, and that he needed to retweet people to build engagement. (In his first few years of using Twitter, he didn't retweet a single person.)

By the time he launched his Presidential campaign, having already learned how to leverage Twitter for its benefits, Trump saw his engagement numbers go through the roof. This change can easily be seen in a year by year comparison of his tweets.

The new story thus became “Trump learned how to tweet, and he learned to do it at all hours of the day.”

You could see that he used to tweet within a narrow, roughly eight-hour range. By 2013 he had expanded that range to almost 16 hours.
He also learned to retweet, but the retweets tended to happen within a different range of hours. It looked like he was tweeting original thoughts from about 10 in the morning to about four in the morning the next day, while his retweets would happen between around 10 PM and about noon the next day.

This is, admittedly, kind of an odd schedule for a businessman in his late 60s, but it doesn't necessarily mean anything nefarious. It just means he has an odd schedule; or, it could mean that he has a staffer or staffers helping him re-tweet things all through the night. (Remember that I said this. It will come up again later in the article.)

With the new corrected viz posted, I was happy to put this whole project to bed, in no small part because I was tired of thinking about this guy's tweets all day long.

Round Three: Does Anybody Really Know What Time it Is?

Unfortunately, the next day I woke up to find a new discussion had been taking place around this project. The new discussion, among Steve Fenn, Matt Francis, Chris, and Andy, centered on timestamps: specifically, what time zone are these tweets actually recorded in? The answer made a big difference to my viz, because the whole premise of the new story was based on the time of day the tweets happened.

The answer, which in hindsight was obviously the most logical one, was that the tweets were timestamped in Coordinated Universal Time (UTC); in practice, this is the same as Greenwich Mean Time (GMT).

However, Trump lives in New York and Florida, both of which are in the Eastern Time Zone: either UTC-4 or UTC-5, depending on whether daylight saving time is in effect. Sigh.

Now the next step was trying to figure out how to accurately convert nine years of tweets into the right local time. This took a little bit of research, but not too much. Matt got the calculation ball rolling by computing and sharing the formula to get the correct time for all tweets since Trump announced his Presidential campaign. From there, it wasn't too much labor to get the correct times going as far back as 2009.

Of course, there were subsequent discussions about whether or not we can actually say that Eastern time is better than UTC time, because (as Andy pointed out), Trump travels often, so we can’t know what the “local” time was for any particular tweet. Andy preferred to keep the times coded at UTC 0.

I chose to use Eastern time, correcting for daylight saving time in the US, because we know that his two main residences are in the Eastern time zone, and by and large, associating the tweets with the local Eastern time would give us a more genuine baseline. (I suspect that he spends far more time in Eastern Standard Time or Eastern Daylight Time than in Greenwich Mean Time.)
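The conversion itself can be sketched in a few lines of Python. This is not the Tableau formula Matt shared; it is a stand-in that assumes every timestamp is a naive UTC value from 2009 or later, so the US rule in force since 2007 applies throughout: daylight saving time runs from 2:00 AM on the second Sunday of March to 2:00 AM on the first Sunday of November. The function names are invented for illustration.

```python
from datetime import datetime, timedelta


def nth_sunday(year: int, month: int, n: int) -> datetime:
    """Return midnight on the n-th Sunday of the given month."""
    first = datetime(year, month, 1)
    # weekday(): Monday is 0, Sunday is 6.
    days_to_sunday = (6 - first.weekday()) % 7
    return first + timedelta(days=days_to_sunday + 7 * (n - 1))


def utc_to_us_eastern(utc: datetime) -> datetime:
    """Convert a naive UTC datetime to naive US Eastern wall-clock time.

    Assumes the post-2007 US daylight saving rule (valid for 2009 onward).
    """
    # DST starts at 2:00 AM EST, which is 07:00 UTC.
    dst_start_utc = nth_sunday(utc.year, 3, 2) + timedelta(hours=7)
    # DST ends at 2:00 AM EDT, which is 06:00 UTC.
    dst_end_utc = nth_sunday(utc.year, 11, 1) + timedelta(hours=6)
    offset_hours = -4 if dst_start_utc <= utc < dst_end_utc else -5
    return utc + timedelta(hours=offset_hours)


# A tweet stamped 16:00 UTC is noon in July but 11 AM in January.
print(utc_to_us_eastern(datetime(2016, 7, 4, 16, 0)))
print(utc_to_us_eastern(datetime(2016, 1, 15, 16, 0)))
```

Hard-coding the rule keeps the sketch dependency-free; with a time zone database available, a library-based conversion would be the more robust choice.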

Making this change to the data had an effect that, in retrospect, is completely unsurprising: it made the time ranges of his tweeting seem much more reasonable. His Trump-written original tweets happened more often during East Coast business hours, and hours when most people in that time zone are awake. The unusual time-shift seen in version #2 was just an artifact of assuming that the timestamps were relative to the Eastern time zone: an assumption that should never have been made.

Just as I had done the day before, I deleted the tweet containing version #2 of the dashboard while I worked on the timestamp issue. Once it was corrected, I reposted, for the third time, the finished dashboard.

And What Have We Learned, Mr. Smart Guy?

So now to the actual point of this.

There are two basic guidelines in data storytelling that I failed to follow this week, and in doing so I could have undermined the entire process. If this were a project for a client, or for a social campaign, then the value of the design's appeal would have been completely wasted, because the conclusions would have been completely flawed and would have collapsed under even minor scrutiny.

1. When you see something super exciting in the data, make extra, extra sure that it's a genuine finding, and not a data problem, before you publicize it.

If you think you've discovered something humanity has never seen before, you probably haven't. Most of the time it's going to turn out to be Al Capone’s vaults (there’s a reference for you fellow old people out there). If the choice is between “blockbuster thing nobody has ever noticed before” or “simple data conversion error,” then you should assume the latter until your conclusions are bulletproof.

Common sense should've told me, initially, that it was much more likely that there was a simple data transformation problem, rather than that there was an inviolable one-hour daily gap in Trump's tweets that nobody had ever heard of until I discovered it while on my couch in front of an NFL game.

Don't let the excitement of "discovery" get the best of you. Stay levelheaded when you're researching a problem, and save yourself the embarrassment of being wrong farther down the road.

2. Do not overlook the details, especially if they’re the basis for the story you’re telling. I overlooked the details when I failed to consider that the encoded time zone for Trump’s tweets might not be Eastern. This is partially due to my East Coast bias; I personally live in the Eastern time zone, which in the United States is often treated as the default time zone. Knowing that Trump lives in the same time zone, I assumed that the tweets were keyed to his home location.

I should have known this was not the case. I know from using Twitter outside my own time zone that when tweets show up in your timeline, you see them in the context of YOUR local time; you don't see them stamped with the local time from the location where the people were sitting when they tweeted them out.

Logically, in order to display tweets properly for readers around the world and maintain coherence in threading, Twitter must store the time stamps in one consistent format, like a universal, coordinated time – hey, wait a minute!

Again, this was a case of getting ahead of myself, and getting too excited about what I was seeing, rather than taking the time to be precise and certain about the nature of the data. Investigate and validate your data as much as you possibly can before running around and posting public conclusions about it.

Sure, these are obvious lessons. Maybe in casual, vizzing-for-fun situations we don't always follow best practices perfectly. But even in these situations, especially if you intend to publish your work publicly, it’s imperative that your product be honest and accurate and vetted to the best of your ability.

After all, you do not always control what happens to your work once it hits the public arena. If there’s a mistake in it and it spreads outside of your control, you can’t issue a recall or rm -r it from the whole internet. And when that happens, you become the author of misleading and/or false information. This is not what we’re here for. We're supposed to be part of the solution, not part of the problem.

And you know what? I'm going to add a third guideline, which is not specific to data storytelling, but more to being a person who wants to participate in a public community of practice:

3. Your ego is not your friend.

I’ll also own up to this part of it: I really liked what I came up with. As much as it’s not always polite to admit it, I do have an ego and I like to share things I create when I’m proud of them. I like the positive feedback (and even the negative feedback, when it’s something I can correct or learn from in the future). But if something I've made is out there in public and I realize that it’s wrong, I'm mortified.

I would have been better off, in this case, making sure everything in the viz was as correct and truthful as possible before publishing it, even if it took an extra day or two. Instead, I jumped the gun, spread some visually attractive but under-vetted dashboards into the world, and had to redo the work multiple times to make the story of the viz align properly with the truth of the data.

Live and learn. I'm still happy with the result, and I learned some public lessons about my strengths and weaknesses as a designer, an analyst, and a human being. I suppose that’s a success.

#MakeoverMonday #Trump #President #tweet #Twitter #radial #polar #scatterplot

A VIZ APART