How I Hire for Data Science?

As a former CTO in a data science business, I have been asked this question many times. This post outlines the exact method and procedure that I have used. I cannot promise you that preparing for this will get you hired anywhere, but this method and procedure did create one of the stronger data science teams that exist. I hope it helps…

Friso is the co-founder and former CTO of GoDataDriven, and previously held CTO roles in both product and professional services startups. He has been hiring software engineering and data science professionals for at least a decade. Today, Friso works as a self-employed interim CTO and as a technology and strategy execution advisor to startups and scale-ups. You can find him on LinkedIn or through his personal website.

The essence of data science is to numerically codify fine-grained common sense about everyday things, and subsequently use that common sense to automate predictions about those same things.

As an example, here is some common sense: people who watch one episode of a TV series will likely want to watch the next episode afterwards. Except when they were watching that one episode as part of researching a specific topic; then they will likely want to watch another program on that same topic afterwards. Good data scientists would build a recommender system that covers both cases, and many more. Not because they thought long and hard about all of those individual cases, but because they thought long and hard about the problem (you are not recommending related content, you are recommending content a viewer wants to see, which sometimes is related content).

So, what makes a person good at this? Good at asking the right question? Good at exploring the data that holds the answer? Good at selecting techniques that generalise well enough? Good at creatively cheating when the combination of modelling and available data is lacking for some cases? And on top of all that, consistent and productive while doing this?

None of the requirements above explicitly mention tools and techniques that are common to the field. Meanwhile, all the online data science talk is about heaps of Python libraries, about Keras versus PyTorch, about “traditional machine learning” versus deep learning, etc. But these debates and their supposed correct answers have never been part of my hiring procedures. Instead, I look for more fundamental aspects of the job.

Fundamental Skills of a Data Scientist

In my hiring for data scientists, I try to assess for these five areas of skill or expertise:

  • Math and Statistics
  • Machine Learning Fluency
  • Computer Programming Proficiency
  • Communication
  • Curiosity and Creativity

On Math and Statistics

A foundation in Math and Statistics means you can rigorously defend your analysis and results. It means you will never show a demo machine learning prototype without first sharing the exploratory analysis that underpins it. You want the predictive power of each input to be backed by its distribution and by the resulting significance or confidence.

As a data scientist, your skills in math and statistics allow you to rigorously defend your approach and analysis.

On Machine Learning Fluency

Some people might tell you that clustering is an easy problem, because you just apply k-Means, which is a well established technique with many excellent implementations. In a sense, they are right. It is easy to do that. But other people will tell you that good (unsupervised) clustering is really hard, because it suffers badly from the curse of dimensionality, and getting meaningful cluster boundaries is not an obvious outcome.

In practice, in a closed domain you often get reasonable results from applying k-Means to a set of representations obtained from some embedding (e.g. clustering terms based on their word2vec representation). And you are right to do this if it works, but if you cannot explain why this is fundamentally wrong, you fail the machine learning test for me (k-Means minimises Euclidean distance to the cluster means, whereas such embeddings are typically compared with other vector metrics, such as cosine similarity).
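To make the k-Means caveat concrete, here is a minimal sketch of how Euclidean distance and a direction-based metric can disagree about which vector is "closest". It assumes the embeddings are meant to be compared by cosine similarity, which is common for word2vec but is my assumption here. The vectors and all function names are made up for illustration.

```python
import math


def euclidean(a, b):
    """Straight-line distance, the notion of closeness k-Means relies on."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity: compares direction only, ignoring vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def unit(v):
    """Scale a vector to length 1."""
    length = math.hypot(*v)
    return [x / length for x in v]


def nearest(query, candidates, distance):
    """Name of the candidate vector closest to query under the given distance."""
    return min(candidates, key=lambda name: distance(query, candidates[name]))


# Made-up 2-d vectors standing in for real embeddings.
query = [1.0, 0.0]
candidates = {
    "same_direction": [5.0, 0.0],       # points the same way, but is much longer
    "different_direction": [1.0, 1.0],  # closer in space, but 45 degrees off
}

by_euclidean = nearest(query, candidates, euclidean)
by_cosine = nearest(query, candidates, cosine_distance)

# After scaling everything to unit length the two notions agree again,
# because for unit vectors ||a - b||^2 = 2 * (1 - cos(a, b)).
unit_candidates = {name: unit(v) for name, v in candidates.items()}
by_euclidean_unit = nearest(unit(query), unit_candidates, euclidean)

print(by_euclidean, by_cosine, by_euclidean_unit)
```

On these toy vectors the Euclidean neighbour and the cosine neighbour differ, and normalising to unit length removes the disagreement. That is one plausible reason why k-Means on embeddings often "just works" when the vectors happen to have similar lengths, and the kind of reasoning a candidate should be able to give.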

Machine Learning Fluency is about having a mental framework that allows you to reason about different machine learning techniques and their caveats in different situations; not about knowing the basic algorithms by name and their scikit-learn interfaces. Knowing the interface to a k-Means implementation will make you very productive when you need it, but understanding its implementation and theory will help you correctly decide if you need it. Possessing such fluency will also enable you to quickly add new techniques to your toolbelt as well as intuitively look for candidate solutions given a prediction problem.

As a data scientist, you possess working knowledge of the landscape of machine learning models and approaches, and you understand their underlying assumptions and implementation caveats.

On Computer Programming Proficiency

Effectively applying the craft of software engineering to build maintainable, scalable, performant, and production ready systems is not the same thing as writing a computer program to handle a narrowly focused task of transforming data into a model or a visualisation, yet both of these require computer programming. Where do we draw the line between one and the other? When a data scientist writes a prototype UI, I expect it to work, but I do not expect it to scale to 10 million users. Nor do I expect a production ready data pipeline underpinning the model training.

For a data scientist, Computer Programming Proficiency is the skill that unlocks all data sources, all toolkits, all analysis frameworks, and all visualisation options, without being locked into a single vendor or stack. Working exclusively with SQL and Tableau may still yield brilliant analyses, but for applied data science, limiting yourself to one tool is too much constraint.

As a data scientist, you write code to effectively automate your workflows, build reproducible analyses, and deliver working prototypes.

On Communication

To your business audience, machine learning is a mandatory buzzword on a checklist somewhere, together with “agile” and “IoT”. Not a field of research, let alone science. It is your job to help this audience understand why one approach may work where another is doomed. Educating business stakeholders in this way helps the right projects attract management buy-in.

As a data scientist, you can effectively explain the validity of your modelling approaches to both data science colleagues and laypeople.

On Curiosity and Creativity

I once used a hiring procedure with a take home assignment that provided a data set taken from the bug tracker of a software project. The assignment was to use this data set to build a REST service that would predict the due date for a bug given its details, such as description, reporter, component of the software, and more. There are a lot of different approaches to this problem, and the assessment of the assignment was not tied to any particular approach, as long as you could defend your approach and decisions. Nobody was expecting state of the art.

One candidate built a solution that was very well supported by exploratory analysis and made a very strong case for using one variable as the single best predictor for the target. Using an OLS regression for the model, the result looked good, and the residuals of the prediction looked nicely normal. The only issue: the chosen variable was the start time of working on the bug. This means the regression would do nothing more than predict the average (mean) time it would take for a bug to be fixed, once the work had started.
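A minimal sketch of why that regression is uninformative, using made-up numbers rather than the assignment's data: if the time to fix a bug is unrelated to when work started, a least squares fit of the fix date on the start date learns a slope of about 1 and an intercept equal to the mean time to fix. The helper fit_line and the variable names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data, in days since the project started.
start_day = [0.0, 10.0, 20.0, 30.0]  # when work on each bug began
days_to_fix = [3.0, 5.0, 5.0, 3.0]   # uncorrelated with start_day by construction
fixed_day = [s + d for s, d in zip(start_day, days_to_fix)]  # the target

slope, intercept = fit_line(start_day, fixed_day)
mean_days_to_fix = sum(days_to_fix) / len(days_to_fix)

# The fitted model is "fixed_day = start_day + mean_days_to_fix".
print(slope, intercept, mean_days_to_fix)
```

The fit looks excellent because the start date carries almost all of the variance in the fix date, yet the model says nothing about an individual bug beyond the average duration.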

It turns out the candidate didn’t actually know or bother to find out what a bug tracker does and how it is used in practice. This is how you fail the Curiosity and Creativity test.

As a data scientist, you naturally explore domain knowledge before jumping into solutions.

The Procedure

A challenge with defining a profile in terms of these more broadly defined, fundamental skills, is assessment. It is much easier to ask a candidate to write some code and then decide if the coding standards meet your requirements, but how worthwhile is it to reject a candidate because they have the wrong IDE settings? Or use a different clustering method than you would have?

So instead of hard criteria, we used a scale to grade candidates. Because this process is inherently subjective to an extent (even with structured, consistent interview questions), the hiring process is distributed across a hiring committee. My process has five stages:

  • Introduction
  • Interview
  • Take home assignment
  • Take home assignment review / assessment
  • Offer

The introduction is basically a screening call, which I use to make sure the candidate is a sane human being, and did not very obviously lie on their CV. It should take no more than 20 minutes, typically.

The interview has two interviewers on the employer's side. They have one simple assignment: assess the candidate on the five topics (math & statistics, machine learning, computer programming, communication, curiosity & creativity) on a four-level scale:

  1. Non-existent
  2. Mediocre
  3. Good
  4. Outstanding

For each score, the interviewers have to defend their assessment by providing the exact questions that were asked, and the parts of the answers that led them to their conclusion. Often these are structured interviews, meaning different candidates get the same questions. The essence of these levels is that there is no middle ground; there is no option to doubt. A score of 1 or 2 means no; a score of 3 or 4 means yes. Candidates need to score 3 or 4 in all five areas independently. You cannot make up for poor programming skills by being a math wizard.
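The decision rule above can be sketched in a few lines. This is only an illustration of the "3 or 4 in every area" rule; the area keys and the function name are hypothetical and not part of any actual tooling.

```python
AREAS = (
    "math_and_statistics",
    "machine_learning",
    "computer_programming",
    "communication",
    "curiosity_and_creativity",
)
PASS_THRESHOLD = 3  # 1 = non-existent, 2 = mediocre, 3 = good, 4 = outstanding


def passes_interview(scores):
    """Return True only if every area is scored 3 or 4; no averaging across areas."""
    missing = [area for area in AREAS if area not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    for area in AREAS:
        if scores[area] not in (1, 2, 3, 4):
            raise ValueError(f"{area} must be scored 1-4, got {scores[area]!r}")
    return all(scores[area] >= PASS_THRESHOLD for area in AREAS)


# A math wizard with poor programming skills still gets a "no".
math_wizard = dict.fromkeys(AREAS, 4)
math_wizard["computer_programming"] = 2
print(passes_interview(math_wizard))
```

Using `all` rather than a sum or mean is the point: a high score in one area never compensates for a low score in another.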

Because the interview is just a conversation, and only one data point describing a candidate in a not necessarily comfortable situation, the next step is a take home assignment. These are not universally popular (especially amongst software engineering candidates), and I can understand why. Many companies will ask candidates to do an assignment, and then write a one line rejection message stating it was not good enough. Not only does that feel arbitrary, it is also rude. If a candidate is going to put in several hours of work, I will always have the courtesy of spending a couple of hours with them talking about it, regardless of what I think of the solution when it comes in. A take home assignment is an excellent way to create the playing field for a conversation where the candidate can be well prepared. I have some rules when it comes to take home assignments:

  • The problem given in the assignment cannot be a problem that the interviewers have already solved themselves. It must be new, otherwise there is bias towards a known good solution.
  • The documentation of the assignment must be very explicit about what the interviewers are looking to assess (the five areas of skill). This prevents interviewers from cherry picking in order to bend the outcome to their favour.
  • Make it very explicit that the result is meant as a piece of work to have a conversation about, in which a candidate can elaborate on their approach and decision process. The delivered solution itself is not standalone.

I would advise candidates to stay away from interviewing procedures that have assignments which do not follow these rules.

The outcome of the assessment is again a scoring on the same scale. And again, all areas have to be 3 or 4. With a good hiring team, I have always committed to providing feedback to the candidate no later than 15 minutes after the assessment session. Yes means you will receive an offer, no means this is where it ends.

Closing Notes

You may see many vacancies with requirements like TensorFlow, Python, scikit-learn, etc. It is important to appreciate that you can easily achieve professional-looking results using TensorFlow glue code and off-the-shelf, pre-trained models. But that does not prove a deeper understanding of machine learning. Candidates who show a strong foundation can learn the techniques on the job. If you are looking to be hired not only today, but also when the next TensorFlow comes around, focus on the foundation.