
Better, Faster, Standardized: Pushing the envelope with deployment of tools and resources to enable data scientists

October 10, 2022

Welcome to the third 84.51° Data University blog, a series of quarterly insights for prospective and current data science professionals.

When data scientists are tasked with building a solution to prescribe an action (e.g., who to target for a mailer or other promotions), the goal is to efficiently develop a solution that’s accurate, reliable, and scalable with the least friction possible.

But because the data science process is complex, a data scientist often does not get to spend enough time on the modeling or algorithmic portion of the work required to develop the optimal solution. This challenge is highlighted in a paper published by Google researchers, “Hidden Technical Debt in Machine Learning Systems.”

Getting down to the science

The data science process involves several core steps:

  • Data wrangling
  • Data prep
  • Feature building
  • Model building and testing
  • Deploying and scoring
  • Evaluation of the model’s performance and maintenance

At 84.51°, we wanted to enable our data scientists to spend more time on the actual model building and science creation. We also wanted to establish and drive adherence to best practices.

To achieve these goals, we created standard, reusable components that simplify the end-to-end process and make it easier for a data scientist to go from ideation to production. These components, which we often compare to Lego bricks, embed best practices and let data scientists efficiently solve portions of their solution. They also make it easier to use advanced methods and techniques from the ever-growing landscape of open source software packages. In short, they help a data scientist create a better solution, faster, while using leading-edge techniques.

For example, for data wrangling and prep, we have components that reduce extracting and aggregating data to just a few lines of code, instead of custom code each time. This saves many hours of writing and QA testing code. We have also created components that perform diagnostics on the data and create relevant KPIs, which can be used when the model is created or to help QA and monitor model health.
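As a rough illustration of what a component like this could look like, here is a minimal Python sketch. The function name `aggregate`, the record fields, and the sample data are all hypothetical and are not 84.51°'s actual code; the sketch only shows the idea of wrapping aggregation and a basic data diagnostic behind a single call.

```python
from collections import defaultdict


def aggregate(records, group_key, value_key):
    """Sum and count `value_key` per `group_key`.

    Hypothetical helper for illustration only. Rows with a missing
    group or value are skipped and reported in the diagnostics.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    skipped = 0
    for rec in records:
        group = rec.get(group_key)
        value = rec.get(value_key)
        if group is None or value is None:
            skipped += 1
            continue
        totals[group] += value
        counts[group] += 1
    summary = {g: {"total": totals[g], "count": counts[g]} for g in totals}
    diagnostics = {"rows_in": len(records), "rows_skipped": skipped}
    return summary, diagnostics


# Made-up sample data: one row has a missing spend value.
sample_records = [
    {"household": "A", "spend": 10.0},
    {"household": "A", "spend": 5.0},
    {"household": "B", "spend": 2.5},
    {"household": "B", "spend": None},
]

# The "few lines of code" a data scientist would write.
summary, diagnostics = aggregate(sample_records, "household", "spend")
print(summary)
print(diagnostics)
```

Returning the diagnostics alongside the aggregate means the same call that prepares the data also yields a simple KPI (rows skipped) that can be checked during QA.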

Other components allow for broader use of data and streamline the feature generation step. We make it easy to locate and extract relevant approved data, model scores or segmentations created throughout the company for use as inputs to the modeling process. It’s here that data scientists get to spend more time on what they are best suited to do – creatively solving complex business problems through model building and testing.

As for the modeling process, we provide a variety of tools and empower the data scientist to select the right tool for the job based on their expertise. Our data scientists can easily take the components used during code creation to feed open-source modeling techniques, or use automated machine learning tools like DataRobot to streamline the modeling process. This allows them to utilize cutting edge techniques to develop and test hundreds of models in hours instead of weeks.

Once the models are built, we have components to help provide diagnostics on the model, like variable importance, statistical checks, responsible AI checks and summaries. Once again, this is done with a few lines of streamlined code versus hundreds of custom lines of code that would need to be QA checked and performance tuned without these certified components.
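Variable importance can be estimated in several ways. The sketch below uses permutation importance, chosen here purely for illustration and not necessarily the method behind 84.51°'s components: shuffle one feature at a time and measure how much accuracy drops. The function name, the toy model, and the data are hypothetical.

```python
import random


def permutation_importance(predict, rows, labels, seed=0):
    """Return the accuracy drop caused by shuffling each feature column.

    `predict` is any callable mapping one row (a list of feature
    values) to a predicted label. Illustrative helper only.
    """
    rng = random.Random(seed)

    def accuracy(data):
        hits = sum(1 for row, y in zip(data, labels) if predict(row) == y)
        return hits / len(labels)

    baseline = accuracy(rows)
    importances = []
    for j in range(len(rows[0])):
        # Shuffle column j while leaving every other column untouched.
        column = [row[j] for row in rows]
        rng.shuffle(column)
        permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
        importances.append(baseline - accuracy(permuted))
    return importances


# Toy data: the label depends only on feature 0; feature 1 is irrelevant.
rows = [[i / 100, (i * 7 % 100) / 100] for i in range(100)]
labels = [int(row[0] > 0.5) for row in rows]


def model(row):
    # Stand-in for a trained model.
    return int(row[0] > 0.5)


importances = permutation_importance(model, rows, labels)
print(importances)
```

Because the toy model ignores feature 1, shuffling that column cannot change any prediction, so its importance is zero, while shuffling feature 0 lowers accuracy.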

A critical step in the data science process is our data science reviews where we evaluate our models to make sure we are using the best techniques and data available to solve the business problem at hand. Using our components allows those reviews to go faster and to focus on application of the data and science more than the code and process itself.

We don’t stop there: We want to make sure we properly monitor our models in production, so we have developed components for that as well. Our Machine Learning Operations (MLOps) components make it easy to track model health including performance against actuals (when available), feature stability (i.e., feature drift), data drift, and service health (e.g., latency, service errors, etc.).
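To make feature drift concrete, here is a minimal sketch of one widely used drift measure, the population stability index (PSI). The post does not say which metric the MLOps components use, so PSI, the function name `psi`, the bin count, and the 0.25 alert threshold (a common rule of thumb) are all assumptions made for illustration.

```python
import math
from bisect import bisect_right


def psi(baseline, current, n_bins=10, eps=1e-6):
    """Population stability index of `current` relative to `baseline`.

    Bins are equal-width over the baseline's range; values outside that
    range fall into the first or last bin. Illustrative helper only.
    """
    lo, hi = min(baseline), max(baseline)
    if hi == lo:
        raise ValueError("baseline has no spread; cannot build bins")
    width = (hi - lo) / n_bins
    # Interior bin edges only, so bisect yields an index in 0..n_bins-1.
    edges = [lo + i * width for i in range(1, n_bins)]

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor each share at eps so the logarithm is always defined.
        return [max(c / len(values), eps) for c in counts]

    base_shares = shares(baseline)
    curr_shares = shares(current)
    return sum(
        (c - b) * math.log(c / b) for b, c in zip(base_shares, curr_shares)
    )


# Made-up data: training-time values versus shifted production values.
training_values = list(range(100))
production_values = [v + 50 for v in training_values]

score = psi(training_values, production_values)
if score > 0.25:  # rule-of-thumb threshold, assumed for this sketch
    print("feature drift alert")
```

A monitoring job could compute a score like this per feature on a schedule and raise an alert when it crosses the chosen threshold.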

Standardization promotes freedom, flexibility, and accuracy

While some may think of the concept of standardization as being bureaucratic or confining, in this case the opposite is true. The standards and frameworks we have developed don’t take away from the creativity of solving a problem. Rather, by providing components that standardize aspects of the data science process to ensure quality, we reduce the burden on our data scientists by making it easier for them to use cutting-edge techniques. This allows them the time to apply flexibility and creativity in the more valuable and challenging aspects of their solution paths.

Our component solutions address some of the more difficult, mundane and engineering-centric tasks that are common across many of our use cases. Imagine an architect tasked with developing a building: The building requires electricity, plumbing and HVAC, but it would be foolish to task the architect with specifying the details of those systems (e.g., what size wires, what pipes and valves, what brand of boilers, etc.). Abstractions and patterns (i.e., component solutions) free the architect to provide only the required information about those systems, so that they can spend the bulk of their energy on the details they were specifically trained to address, such as the functionality of the space, aesthetics, etc.

Similarly, our component solutions free our data scientists to spend more time applying their expertise in framing and solving problems using advanced mathematics. We have used learnings from our modeling and standards process to take what has worked and build those steps into components. Our data scientists can take the pieces, build upon our best practices, and know the code is streamlined, performant and QA’d, so they can focus on modeling, which is what data scientists like to do best.

Our components are all developed using best practices and are managed in the code hosting and collaboration platform GitHub. We encourage active contributions from our users in the form of pull requests or issue reports. Because they are certified as data science assets, our components are rigorously reviewed for accuracy and reliability. The result is quality, innovative solutions that give businesses a better understanding of their customers and how to target them.
