Famously, Sherlock Holmes makes reference in one of Sir Arthur Conan Doyle's mysteries ("Silver Blaze") to "the curious incident of the dog in the night-time." In the story, a racehorse is stolen from its stable in the middle of the night; Holmes makes this reference; Inspector Gregory replies, "The dog did nothing in the night-time;" Holmes, in his smugly self-satisfied manner, responds, "That was the curious incident." The fact that the dog did not bark while a horse was being stolen indicated that the thief was someone the dog would have been familiar with; as a result, the perpetrator (I believe it was the horse's trainer) was identified.
Occasionally I've heard this theme referred to as "the dog that didn't bark," which I suppose does the trick just as well. In the context of the story, we're meant to learn that observation and analysis require us to pay attention not just to the things that stand out, but also to the things that *should* stand out, but for some reason do not.
In our lives as data analysts, we encounter this silent dog all the time. Something should be barking, but isn't. And why is that? Sometimes it's because our assumptions are faulty, as they might have been in the Holmes story above. But more often, it's because the dog that should be there, barking, has gone missing. Which is to say: we don't have enough data to understand our problem in context.
Pedantic Aside (Ooh, Already? It's So Early in the Post!)
Now, an argument can be made that we never have "enough" data to put a problem in context. Especially when our mandate crosses over from "show me what this data is" to "explain to me why this data is like this," we're always going to be working at some level of abstraction. I get that. But you can't refuse to tackle an analysis just because you don't have all the data ever collected. You've got to determine what you need, at a minimum, to be honest, and truthful, and faithful to the reality your data is presenting. The perfect is the enemy of the good, yes; but the crap is the enemy of all of us.
Lost Dogs: Reward
A recent #MakeoverMonday challenge provided us data about Fortune 500 companies. Specifically, this tranche of data showed us that Apple was the most profitable Fortune 500 company by a significant amount; then, it included companies ranked 2 through 25 on that same metric from among the Fortune 500. It also included someone's characterization of the specific business sector to which that company belonged.
We have a lot of lost dogs here.
First of all, there's a huge clue right there in the name of the list. It's the Fortune 500, not the Fortune 25. We are working with data for only 25 of the 500 qualifying companies, or 5%. This...is a problem. How are we to make any real assessments of our data with such a small subset of it available to us?
Second of all, the only financial data we have is profit. Great! One year of profit. But we don't know about assets, or revenue, or employees, or cash flow; we don't know about change over time, about acquisitions...we have only one dollar figure for each of 25 companies.
Last, we have incomplete characterizations of what industry these companies belong to. For instance: Berkshire Hathaway is categorized as a "Banking/Finance" company. But it's actually a conglomerate. It has almost 80 wholly-owned subsidiaries and a financial interest in almost 100 other companies. It owns retailers, airlines, banks, food suppliers, transportation companies, insurance companies, manufacturing, services providers, media companies...it crosses into a whole range of industries. And, as we mentioned before: we only have information for 5% of the actual Fortune 500 list.
Descriptive vs. Investigative
I see two paths here.
1. Because we are not being asked a specific business question, we can choose a descriptive path: using only the data provided, find something of interest to tell an audience, and develop a means to convey that insight clearly.
There is absolutely nothing wrong with doing this. In the absence of a mandate from a manager or a customer to provide a specific response, you are fully within your rights to limit your assessments and products only to answering those questions that the existing data source can resolve. In some sense, that's its own challenge, and a harder one: because to be honest, and truthful, and faithful to the reality your data is presenting, you have a very limited range of assertions available to you. You can caveat your output with "for only the top 25 Fortune 500 companies by profit in 2016, ..."
2. We can go looking for lost dogs.
Far more often than not, this is the direction I find myself taking. Whatever the original data set provides will invariably raise questions in my mind about relationships, causality, context....
If I see outliers in the data I want to know why they are outliers.
If our source has given us a subset of an obviously larger group (like, eight European countries instead of 50), I want to know why we only see that subset.
If our source only provides one piece of financial information, but a rigorous comparative analysis would require several, I want to know why only one figure was included.
And, if we only get 25 member companies of a list that includes 500 total--and what we have been provided is not the top 25 companies on that list (which it isn't, since "profit" isn't how the Fortune 500 is ranked)--then I want to get the data for that missing 95%.
That Dog Will Hunt
Sometimes it's hard, or impossible, to get this kind of contextual data. But the internet is a wonderful tool. Sometimes, as was the case with this Fortune 500 data, it's pretty easy to find what you're looking for.
This week I just went to the fortune.com website and found the page where they list all of the Fortune 500 companies. They let you see their list with one variable at a time, and as long as you're willing to scroll and then cut and paste (with some minor text cleanup at the end of the process), you can get the complete data set.
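If you'd rather script the "minor text cleanup" than do it by hand, the job is mostly stripping currency formatting and stitching the one-variable-at-a-time lists together by company name. Here's a minimal Python sketch of that idea. The input layout, the function names (`parse_amount`, `merge_variables`, `incomplete`), and the "Company A"/"Company B" rows are all made up for illustration; I'm not claiming this matches how fortune.com formats its pages.

```python
def parse_amount(text):
    """Turn a pasted figure such as '$1,234.5' into a float."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    return float(cleaned)


def merge_variables(pasted):
    """Combine one pasted list per variable into one record per company.

    `pasted` maps a variable name to a list of (company, raw_text) pairs.
    """
    merged = {}
    for variable, rows in pasted.items():
        for name, raw in rows:
            record = merged.setdefault(name.strip(), {})
            record[variable] = parse_amount(raw)
    return merged


def incomplete(merged, variables):
    """Names of companies that are missing at least one variable."""
    return sorted(
        name
        for name, record in merged.items()
        if any(variable not in record for variable in variables)
    )


# Hypothetical pasted rows; the names and figures are placeholders.
pasted = {
    "revenue": [("Company A ", "$1,200.5"), ("Company B", "$300")],
    "profit": [("Company A", "$100.0")],
}

merged = merge_variables(pasted)
missing = incomplete(merged, ["revenue", "profit"])
print(merged)
print(missing)
```

The `incomplete` check is the useful part: it tells you which dogs are still lost after the merge, usually because a name was pasted slightly differently in one of the lists.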
With maybe 30 minutes of drudgery, I was able to increase my available data by, in this case, a factor of 60 (20x as many companies, 3x as many variables as the original data). That's a small price to pay for a much broader context into which we can place our analysis.
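For anyone who wants to check that arithmetic, here it is as a few lines of Python. The company counts (25 and 500) come from the data described above; the variable counts (1 before, 3 after) are placeholders I've picked only to reproduce the 3x ratio, and `coverage_gain` is a name invented for this sketch.

```python
def coverage_gain(rows_before, rows_after, cols_before, cols_after):
    """Return (row factor, column factor, combined factor)."""
    row_factor = rows_after / rows_before
    col_factor = cols_after / cols_before
    return row_factor, col_factor, row_factor * col_factor


# Share of the full list that the original data set covered.
original_share = 25 / 500

# Variable counts are assumed; only their 3x ratio comes from the post.
row_factor, col_factor, total_factor = coverage_gain(25, 500, 1, 3)

print(f"original coverage: {original_share:.0%}")
print(f"companies: {row_factor:.0f}x, variables: {col_factor:.0f}x, "
      f"total: {total_factor:.0f}x")
```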
Pedantic Aside (Yes, Another One)
I've said it before, but it's worth reiterating: don't underestimate the value of grunt work. Doing the pain-in-the-ass data gathering/cleaning/shaping is not fun; having a useful dataset of interest to you, that nobody has assembled before, is a lot of fun. If the data were easy to come by, someone would have done something with it already.
Anyway, once you have all of your dogs back on your property, they'll likely start barking.
There's one dog over here that won't stop yapping about some company that made $3.5 million per employee in 2016. That's CRAZY.
And then this other dog is yapping about how many more employees Walmart has than anybody else, and even though they're profitable they have really small profit margins.
Then another dog that looks like Triumph the Insult Comic Dog wants to talk about these two tobacco companies that have tons of profit and make lots of money per employee.
Another one is writing a blog to tell you that Facebook and eBay both have ridiculously high profit margins.
Suddenly you have a whole team of dogs telling you things you'd never have known. Your insight is richer, and your confidence is greater because you aren't missing so much information anymore. Is Apple still a very profitable company that makes lots of money per employee and as a share of its revenue? Yes! But it's not the unassailable titan that you might have thought it was, based only on the data provided.
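What the dogs above are barking about are derived metrics: profit per employee and profit margin (profit as a share of revenue). Neither can be computed from a single profit figure, which is the whole point of collecting more variables. Below is a rough Python sketch of how the ranking can shift once those metrics are available. The three companies and all of their numbers are invented placeholders, not Fortune 500 figures, and `Company` and `rank_by` are names made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Company:
    name: str
    revenue: float   # in $ millions (placeholder values)
    profit: float    # in $ millions (placeholder values)
    employees: int

    @property
    def profit_margin(self) -> float:
        """Profit as a share of revenue."""
        return self.profit / self.revenue

    @property
    def profit_per_employee(self) -> float:
        """Profit in $ millions per employee."""
        return self.profit / self.employees


def rank_by(companies, metric):
    """Company names ordered from highest to lowest on the given metric."""
    ordered = sorted(companies, key=lambda c: getattr(c, metric), reverse=True)
    return [c.name for c in ordered]


# Invented companies; none of these figures are real.
companies = [
    Company("Hypothetical Retailer", 400_000.0, 10_000.0, 2_000_000),
    Company("Hypothetical Platform", 20_000.0, 8_000.0, 10_000),
    Company("Hypothetical Maker", 200_000.0, 40_000.0, 100_000),
]

for metric in ("profit", "profit_margin", "profit_per_employee"):
    print(metric, rank_by(companies, metric))
```

With these made-up numbers, the company with the most raw profit is not the one with the best margin or the most profit per employee, which is the kind of thing a profit-only data set can never show you.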
"Having gathered these facts, Watson, I smoked several pipes over them, trying to separate those which were crucial from others which were merely incidental." - Sherlock Holmes, The Crooked Man
I'm not suggesting that you must ALWAYS seek out 60 times as much data as you were provided for a task. Nor am I suggesting that everything you collect will be essential. How often will we use EVERYTHING in our source materials when we create our final analysis? Probably as often as you wear everything in your closet at the same time.
What I am saying is that the data you start with should be seen as merely that: your starting point. Be CURIOUS about what is in there. Be skeptical. Be demanding. What is it telling me? What more COULD it be telling me?
Unless you are prohibited from acquiring additional information, you will end up with better analysis and better insight if you let your mind interrogate, and pester, and fully examine your data.
And if that data source can't answer your questions to your satisfaction, well, it might well be worth going out looking for a few stray dogs.