The importance of branding in data science

Christine Lloyd
Oct 6, 2020 · 5 min read

Doing science is less about the tests you run than it is about the questions you ask. For a gorgeous explanation of this, I’d encourage you to read The Ghost Map by Steven Johnson; it’s also a wonderful text on the genesis of epidemiology. For those of you who haven’t read it (yet) and want to keep reading, a quick summary of the relevant point:

The London cholera outbreak of 1854 was terrifying and devastating. Dr. Snow, the father of epidemiology, identified contaminated water as the most likely source, but Dr. Farr, one of the founders of medical statistics, was in charge of the investigation into the outbreak. Dr. Farr was convinced that miasma was the cause. Because of this, the overwhelming majority of the questions investigators were tasked with answering were about smells, prevailing wind direction, general stinkiness, ventilation in the houses where people fell sick, and so forth. Only a few had any bearing on contaminated water.

That is, many interesting and potentially germane questions were asked in great detail, and their answers were scrupulously sought by intelligent and competent men. In the process, they also almost completely missed the point.

It is perhaps only a slight overstatement to say that a well-trained ape could probably run scikit-learn. I know this does not speak well of me in relation to a well-trained ape, given the myriad difficulties I’ve run into while trying to run linear regression on my scraped housing price data, but the fact remains that with a little reading of the documentation and a few lines of Python, an advanced beginner programmer can create a regression model. (My second week of bootcamp has amply demonstrated that I can not only train but also do a fair job of optimizing a linear regression while not fully understanding the underlying math.) Anyone with a moderate amount of experience with the scikit-learn package can take a target and a list of features and turn them into a linear regression.
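To make the “few lines of Python” claim concrete, here is a minimal sketch of what such a model might look like. The file name and column names are hypothetical stand-ins for scraped housing data, not my actual project code.

```python
# Minimal sketch: fitting a linear regression with scikit-learn.
# File and column names are hypothetical stand-ins for scraped housing data.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("scraped_listings.csv")          # hypothetical scraped dataset
X = df[["square_feet", "bedrooms", "bathrooms"]]  # assumed numeric feature columns
y = df["sale_price"]                              # assumed target column

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))                # R^2 on held-out listings
```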

But is that useful?

A regression trained on sold listings that includes the length of time a listing spent on a particular website may perform admirably on the data I’ve scraped. However, in the real world I can’t use total time listed to predict the price a house sells for, or whether it will be re-listed at a lower price; that information isn’t available until the house sells or the seller decides to drop the price. I can certainly run a regression using that feature, and it may even perform well, but it leans on information that won’t exist at prediction time, so it’s not a productive way of answering the question.
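As a sketch of that point, with a hypothetical “days_on_market” column standing in for the leaky feature, the fix is simply to drop anything that isn’t knowable at prediction time before fitting:

```python
# Sketch: excluding a feature that leaks future information. "days_on_market"
# is a hypothetical column that is only known once the house sells or is
# re-listed, so a model meant to price houses up front shouldn't see it.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("scraped_listings.csv")               # hypothetical scraped dataset

leaky = ["days_on_market"]                              # known only after the outcome
X = df.drop(columns=leaky + ["sale_price"]).select_dtypes(include="number")
y = df["sale_price"]

model = LinearRegression().fit(X, y)                    # trained only on up-front features
```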

Every data scientist worth their salt is enough of a code monkey to hack together something that works using the documentation and Stack Overflow with a smidge of assistance from Googling. Many will be able to come up with a faster or more elegant solution; those who are more on the software engineer/production side of things will certainly be able to code it far more beautifully than I can. But if you’re asking the wrong questions, how assiduously you answer them isn’t really going to help you all that much.

And here we get back to branding and the importance of having a personal brand as a data scientist. LeetCode and HackerRank can check whether a data scientist has the requisite coding chops. Recommendations can speak to working style and strengths and weaknesses. Job history gives you an idea of a person’s skills and domain knowledge. But what about their ability to ask good questions? That’s something that may come out in a well-conducted interview, but otherwise it only really shows up in personal projects. Are the questions interesting? Is the approach informative? Do they understand and acknowledge the limitations of the approach?

Coding can be taught. Asking good questions? I’m not sure. It’s a skill that can certainly be cultivated, but the people who are really good at it seem to have a particularly flexible and associative mindset that I haven’t managed to instill in myself.

It may be that I’m so set on this because it’s a relative weakness of mine. One of my grad school colleagues, Larissa Singletary, always amazed me with her questions at seminars. As soon as she asked one, I would think “oh yeah, how would that affect things?” or “that could really skew the data!” Yet the question itself would never have occurred to me, because I’m so used to taking what I’m told at face value. My failure to even think of questions like hers was a regular source of annoyance. If you want a data scientist who asks truly novel questions and comes up with unexpected uses for data, it’s in both of our best interests that you don’t hire me.

However, I have a borderline absurd knack for remembering trivia, and I’ve found that spending time consuming the work of good question-askers has expanded my own ability to ask novel-seeming questions. (Maybe think of me as an AI with a relatively low creativity setting but high retention, while Larissa has a moderate creativity setting.) The more interesting questions I encounter, the better I can assess the question at hand against an ever-growing library of possibly interesting questions. I might not have thought of using house pricing data to examine the racial wealth gap or educational achievement if somebody else hadn’t thought of it first, but I’m certainly interested in examining every feature that seems reasonable-ish and is scrapable for my house price predictions.

This means that, at least at the moment, I’m not necessarily the data scientist you want when you’re trying to come up with a completely new way of using a given data set. However, I might be exactly the data scientist you’re looking for if you want to figure out the ways you can use an existing dataset and the potential pitfalls of doing so. My magpie stash of trivia also means I can sometimes come up with surprising but potentially fruitful correlations to test. These are the kinds of differences that resumes just can’t capture, and they’re absolutely critical when it comes to hiring. Hence the need for personal projects and a personal brand!

There are almost as many sorts of data scientists as there are people; resumes and tests can tell you skills but not style. So you need somebody who’s proficient in TensorFlow and has the database chops to get their own data? Great, but do you need a rugged pathfinder to venture off alone into the digital wilds, a collaborative team player to balance out your roster, or a steady and relentless scientist? And how do you tell which is which?

(As an aside, at this point in my career, I’m glad I won’t have to be on any hiring teams for a while because I do not have the chops to evaluate those yet.)

So, what is my personal brand, then? That’s a good question; I’ll let you know as I figure it out. For the moment, I’ll say that my strengths are experimental design, a wide domain knowledge base, and a solid understanding of just how tricky doing good statistics can be. But as bootcamp progresses, I’m sure to discover more things that I love, more things that I’m good at, and more things I desperately need to spend a lot of time getting up to speed on.

