
The missing data science guide

By: Giri Tatavarty, VP, Data Science & Patrick Halpin, VP, Data Science, Data Scientists

As leaders of a mid-size data science company, with years of analytic and data science experience, we find there are simple themes or principles that distinguish a successful project from one that has gone astray, running over time and over budget and leaving an unhappy client. These principles are often not taught in school and, unfortunately, are sometimes learned the hard way.

We’ve compiled a list of themes below that can be applied to a wide variety of data science projects, so those starting a new project can take advantage of key learnings gained from years of mistakes and iteration.

“Love the problem” not the method for a successful outcome.

The primary goal of data science is solving business problems to reach a successful outcome. Usually, the techniques and methods are a means to an end and not the end itself. For this reason, focusing on the client problem is paramount. Sometimes the problem is not a data science problem at all, but a process or people problem. In these cases, no amount of data science will solve the original business challenge.

Success is based on model accuracy and business impact. Solely measuring the fit of a model solves only half the equation. Always measure accuracy metrics, as well as business metrics. Your company may ask, “what value does science bring to the table?” Having business metrics paired with model metrics ensures you have a complete picture. Does driving an AUC value from 0.78 to 0.79 impact the business value? Measuring both is the key to success.

Ask these questions:

  • Are you focused on a method or the question?

  • What does this science bring to the table?

  • How can I measure business impact of the science?

  • Can we scale the method given the business requirements?
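As a rough illustration of pairing a model metric with a business metric, here is a minimal Python sketch. It is not a method from this article: the labels, the scores, the `value_per_hit` and `cost_per_contact` figures and both function names are hypothetical, chosen only to show that a better AUC does not automatically mean a better business result.

```python
from itertools import product

def auc(labels, scores):
    """Chance that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

def campaign_profit(labels, scores, threshold, value_per_hit, cost_per_contact):
    """Hypothetical business metric: profit from contacting everyone
    scored at or above the threshold."""
    contacted = [y for y, s in zip(labels, scores) if s >= threshold]
    return sum(contacted) * value_per_hit - len(contacted) * cost_per_contact

# Made-up data: model_b ranks one negative lower than model_a does.
labels  = [1, 0, 1, 0, 0, 1, 0, 0]
model_a = [0.9, 0.4, 0.7, 0.6, 0.2, 0.55, 0.3, 0.1]
model_b = [0.9, 0.4, 0.7, 0.5, 0.2, 0.55, 0.3, 0.1]

for name, scores in (("model_a", model_a), ("model_b", model_b)):
    print(name,
          "AUC:", round(auc(labels, scores), 3),
          "profit:", campaign_profit(labels, scores, 0.5, 50, 5))
```

In this toy data, model_b has the higher AUC, yet both models contact the same four clients at a 0.5 threshold and so produce the same profit. Reporting both numbers side by side is what exposes that.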

Always have a baseline and measure throughout the process.

Before starting any modeling or data science project, identify key metrics and a baseline value that needs to be surpassed. Ask yourself the naïve question, “what would I do if I didn’t know machine learning or data science?” Start by implementing that method. During model building, add features or complexity incrementally until the model surpasses the baseline. The baseline will tell you if the juice is worth the further squeeze.

Ask these questions:

  • What is the simplest solution I can start with?

  • What are my key measures for success?

  • How can I compare different science approaches?
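One way to keep a baseline in view is to score every candidate against it with the same metric. The sketch below assumes a forecasting task, mean absolute error as the key measure, and a naive "predict the historical mean" baseline. The data, the `min_gain` threshold and the function names are all hypothetical.

```python
from statistics import mean

def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))

def beats_baseline(actual, baseline_pred, model_pred, min_gain=0.05):
    """True if the model cuts the baseline error by at least min_gain (relative)."""
    base_err = mae(actual, baseline_pred)
    model_err = mae(actual, model_pred)
    if base_err == 0:
        return False  # the baseline is already perfect on this data
    return (base_err - model_err) / base_err >= min_gain

# Made-up numbers for illustration only.
history = [100, 120, 110, 130]
actual = [125, 135, 128]

baseline_pred = [mean(history)] * len(actual)  # naive: always predict the mean
candidate_pred = [122, 138, 130]               # output of some fitted model
marginal_pred = [115, 115, 116]                # barely different from baseline

print("baseline MAE:", round(mae(actual, baseline_pred), 2))
print("candidate worth it:", beats_baseline(actual, baseline_pred, candidate_pred))
print("marginal worth it:", beats_baseline(actual, baseline_pred, marginal_pred))
```

Requiring a minimum relative gain, rather than any gain at all, is one simple way to decide whether extra model complexity is worth the further squeeze.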

Bring stakeholders along on the journey. Being good at science is no guarantee of success.

Client requirements are always changing. Keeping communication channels active is critical to ensure the original assumptions are valid and the client is still interested and aligned with the work. As much as we all want to impress the client with great work, setting the right expectations and sharing progress updates will earn you trust and support. It also helps to get support from the client for contingencies, such as data issues, software and platform delays or other personnel-related matters, which we all know can happen.

As you get closer to the solution, increase the client communication. Remember to plan for the end state of the production model. If the end goal changes, adapt and pivot your work. In addition, gather all details early to avoid last-minute surprises. You don’t want to find roadblocks in the last mile of a project. If we can't deploy into production, we can't get the maximum value from the model or science.

Ask these questions:

  • Will we need millisecond response times?

  • Will the model scale to score 50 million clients?

  • What happens if the data feeds are updated?
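A back-of-the-envelope calculation can answer the scaling question early. The sketch below assumes a hypothetical 2 ms scoring time per record and perfectly parallel workers; real systems have I/O and coordination overhead, so treat the result as a lower bound. The function names and all figures other than the 50 million clients are assumptions.

```python
import math

def batch_scoring_hours(n_records, ms_per_record, workers):
    """Wall-clock hours to score n_records, assuming ideal parallelism."""
    total_seconds = n_records * ms_per_record / 1000 / workers
    return total_seconds / 3600

def workers_needed(n_records, ms_per_record, window_hours):
    """Smallest worker count that fits the batch into the time window."""
    total_seconds = n_records * ms_per_record / 1000
    return math.ceil(total_seconds / (window_hours * 3600))

# 50 million clients at an assumed 2 ms per record.
print(round(batch_scoring_hours(50_000_000, 2, workers=8), 2), "hours on 8 workers")
print(workers_needed(50_000_000, 2, window_hours=1), "workers for a 1-hour window")
```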

Deployment is just the beginning. Prepare for support, monitoring and maintenance.

Data scientists often consider the problem solved once their model is deployed. But that’s just the beginning of a long journey. Make sure to account for support and maintenance time as part of the work plan.

Ask these questions:

  • How often do you update your models?

  • Are you refactoring and rebuilding automatically based on performance?

  • Will there be data drift that will cause changes to your model?

  • Are you monitoring the accuracy of the model?

  • If the model breaks, what is the support plan? What are the SLAs? Will you revisit the approach?
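Data drift can be checked with a simple distribution comparison between training data and live data. The sketch below uses the Population Stability Index (PSI), which is one common choice and is not named in this article. The bin edges, the sample data and the `eps` floor for empty bins are assumptions, and the 0.25 cutoff mentioned afterwards is only a commonly cited rule of thumb.

```python
import math

def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples over shared bin edges."""
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            idx = sum(v >= e for e in edges)  # index of the bin v falls into
            counts[idx] += 1
        # Floor each share at eps so empty bins do not cause log(0).
        return [max(c / len(values), eps) for c in counts]

    exp_shares, act_shares = shares(expected), shares(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_shares, act_shares))

# Made-up feature values: training data spread over [0, 1),
# live data concentrated in the upper half.
training = [i / 100 for i in range(100)]
live = [0.5 + i / 200 for i in range(100)]
edges = [0.25, 0.5, 0.75]

print("PSI, training vs itself:", psi(training, training, edges))
print("PSI, training vs live:", round(psi(training, live, edges), 2))
```

A PSI above roughly 0.25 is often read as a significant shift that warrants a closer look or a retrain. The right threshold depends on the feature and the business, so agree on it with stakeholders as part of the support plan.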

Continuous learning, experimentation & research are key to keeping a growth mindset.

Data science tools and techniques become obsolete faster than fashion trends. The shelf life of packages, tools and platforms is short. Over the past few years, there has been an explosion of data science modeling methods. You will need time to research new developments, maintain your skills and stay relevant. Architect your solution so that you can iterate and replace the model with a better one. Typically, this will require some sort of experimentation platform to enable A/B and/or multi-armed bandit testing.

Data scientists can be biased and always have an opinion about their own process. Make sure you have an objective peer review or science review to assess your data science process. Have a panel of experts or peers review key aspects of design and implementation. Always get impartial feedback and expert help.

Ask these questions:

  • How can you enable experimentation with your models?

  • Do you have an experimentation platform or setup to replace models or run a champion challenger?

  • What kind of disruptions are needed for the business to replace a model? How can you minimize them?

  • Do I have science reviews or peer reviews to get expert help and feedback?
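To show what a champion/challenger setup might look like, here is a minimal simulation of an epsilon-greedy multi-armed bandit, one of the simplest bandit strategies. The conversion rates, the arm names, the `epsilon` value and the function name are all hypothetical. A production experimentation platform would also need logging, guardrails and proper statistical stopping rules.

```python
import random

def epsilon_greedy(true_rates, rounds=5000, epsilon=0.1, seed=0):
    """Simulate routing traffic between models with an epsilon-greedy policy.

    true_rates maps each arm (model) to its simulated success probability.
    Returns how often each arm was used and how often it succeeded.
    """
    rng = random.Random(seed)
    arms = list(true_rates)
    pulls = {arm: 0 for arm in arms}
    wins = {arm: 0 for arm in arms}

    def observed_rate(arm):
        return wins[arm] / pulls[arm] if pulls[arm] else 0.0

    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.choice(arms)             # explore: pick any model
        else:
            arm = max(arms, key=observed_rate)  # exploit: best model so far
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:      # simulated business outcome
            wins[arm] += 1
    return pulls, wins

# Assumed rates: the challenger is truly better than the champion.
pulls, wins = epsilon_greedy({"champion": 0.05, "challenger": 0.30})
print("traffic per model:", pulls)
```

Compared with a fixed 50/50 A/B split, a bandit shifts traffic toward the better model while the test is still running, at the cost of a less clean comparison. Which trade-off fits depends on how disruptive a model swap is for the business.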

Data strategy is often as important as science. Nurture & curate data.

We all know that data is key to the performance of any model. Without data, models are of no use. Curation of data features often leads to much superior models. So, have a holistic strategy for curation and data sharing. Invest in a good data platform. Create feature marts, if necessary, for reusing features.

Ask these questions:

  • Does my data platform enable me to create and reuse features effortlessly?

  • Do I have visibility into data pipelines? Are they automated?

  • Do I have statistical process control checks to detect quality issues? Am I collecting data about my models?

  • How is this data being used to improve the product/business?
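A statistical process control check can be as simple as control limits around a data quality metric. The sketch below assumes a daily null rate for one input column and the conventional three-sigma limits. The metric, the sample values and the function names are hypothetical.

```python
from statistics import mean, stdev

def control_limits(history, sigmas=3.0):
    """Lower and upper control limits: mean plus or minus `sigmas` standard deviations."""
    center = mean(history)
    spread = stdev(history)
    return center - sigmas * spread, center + sigmas * spread

def out_of_control(history, new_value, sigmas=3.0):
    """True if the new observation falls outside the control limits."""
    low, high = control_limits(history, sigmas)
    return not (low <= new_value <= high)

# Made-up daily null rates for one feed over the past week.
daily_null_rate = [0.020, 0.022, 0.019, 0.021, 0.020, 0.023, 0.018]

print("limits:", tuple(round(x, 4) for x in control_limits(daily_null_rate)))
print("0.021 flagged:", out_of_control(daily_null_rate, 0.021))
print("0.060 flagged:", out_of_control(daily_null_rate, 0.060))
```

The same check can run on model-side data, such as the daily mean prediction or the share of records scored, which is one way to start collecting data about your models.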

Let the data tell the right story, not the easy one.

Be honest and let the data speak for itself. This is a critical principle for a scientific process. There is always business pressure. Statistics are often used to justify an executive decision, rather than to help inform the decision. Honesty is indeed the best policy. This will enable you to fail fast and create something truly impactful.

Ask these questions:

  • Are you torturing the data until it proves what you intended to prove? Slicing and dicing a hundred cuts and cherry-picking a few to tell the story is a common symptom of this.

  • Are you communicating the model uncertainty along with the accuracy? Are you also communicating significance (p-values)? Are you rescaling charts or shifting axes to show "large impacts"?

  • Are you communicating the baselines and expected variations of data alongside model impacts?
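To report uncertainty and significance alongside an impact number, a two-proportion z-test with a confidence interval is one standard option. The sketch below assumes a control group and a treated group with simple conversion counts. The counts and the function names are made up, and the normal approximation it relies on only holds for reasonably large samples.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def compare_rates(success_a, n_a, success_b, n_b, z_crit=1.96):
    """Difference in rates (b minus a), its ~95% interval and a two-sided p-value."""
    p_a, p_b = success_a / n_a, success_b / n_b
    diff = p_b - p_a

    # Pooled standard error for the test of "no difference".
    pooled = (success_a + success_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled if se_pooled > 0 else 0.0
    p_value = 2 * (1 - normal_cdf(abs(z)))

    # Unpooled standard error for the interval around the difference.
    se_diff = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    interval = (diff - z_crit * se_diff, diff + z_crit * se_diff)
    return diff, interval, p_value

# Made-up counts: 100 of 1,000 convert without the model, 150 of 1,000 with it.
diff, (low, high), p_value = compare_rates(100, 1000, 150, 1000)
print(f"uplift {diff:.3f}, interval ({low:.3f}, {high:.3f}), p-value {p_value:.4f}")
```

Showing the interval next to the point estimate makes the expected variation visible, so a stakeholder can see whether a "large impact" is distinguishable from noise.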

As with many disciplines, the success and impact of data science projects depend on soft skills, wider context and best practices, which are not always taught in school. Being aware of them sets you up for success and spares you from learning the hard way.

Interested in seeing how we put these practices to work? Check out our Data Science Knowledge Hub.
