Ad-hoc vs. Productized Data Science


In graduate school, I was developing an algorithm that would analyze vehicular movement data at a traffic intersection and produce some useful information about it. One day, during a review meeting with my advisor, we decided that the algorithm would need to be more generic and apply to more than one intersection. This would require a major overhaul of the code, or so I thought.

When I went back to change the code, I noticed that I had written it in a way that would make the genericization dead simple. I couldn’t believe that my past self had given my future self this gift!

The moral of the story is probably obvious to most software specialists: make your code a “product” rather than something “ad-hoc”, and it will pay off later. In other words, productize your code.

However, I have had to learn this lesson the hard way, over and over again. I have also heard many complaints from software specialists that colleagues who write software without being software specialists, especially data scientists, do not produce productized code. In this article, I therefore outline three rules that lead to productized work: work that is broadly applicable, user-friendly, and reusable. Although these rules grew out of my programming experience, I believe they apply to more than just software: any activity that creates consumable content should benefit from them.

The first rule is to make your work generic. Software often solves a very specific problem. However, taking the time to think about the generic class of problem being solved can help produce more elegant, more flexible, and more reusable code. And if the generic problem turns out to be a common one that many people need solved, the chances of finding existing code that you can leverage increase significantly.

Personal example:

Our team needed to monitor incoming time-series data from industrial assets. This required defining rules and triggering alarms when the rules were met. We recognized this as an instance of a generic alerting problem and found that Prometheus, a generic, configurable, open-source tool, was something we could leverage. Using Prometheus probably cut our time to delivery by several months and let our team focus on our specialised domain instead of rebuilding the solution to a common problem. The lesson is that because Prometheus is built to solve the generic use case of alerting, we could adapt it to a new domain with minimal effort. If Prometheus catered only to the specific problem for which it was originally built, it would not be such a widely used open-source project.
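To make the idea of "generic" concrete, here is a minimal sketch of alerting rules expressed as data rather than hard-coded checks. It is not how Prometheus works internally and it is not our team's code; the names `Rule` and `evaluate`, the thresholds, and the sample values are all made up for illustration.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# A sample is a (timestamp, value) pair. This layout is an assumption
# made for the sketch, not a format taken from any real system.
Sample = Tuple[float, float]


@dataclass(frozen=True)
class Rule:
    """A hypothetical alerting rule described purely by data."""
    name: str
    compare: Callable[[float, float], bool]  # e.g. operator.gt or operator.lt
    threshold: float
    min_consecutive: int = 1  # samples in a row that must match before firing


def evaluate(rule: Rule, samples: Iterable[Sample]) -> List[float]:
    """Return the timestamps at which the rule fires (once per episode)."""
    fired: List[float] = []
    streak = 0
    for timestamp, value in samples:
        if rule.compare(value, rule.threshold):
            streak += 1
            # Fire only at the moment the streak reaches the required length.
            if streak == rule.min_consecutive:
                fired.append(timestamp)
        else:
            streak = 0
    return fired


# The same function serves two unrelated domains; only the rule changes.
overheating = Rule("overheating", operator.gt, 90.0, min_consecutive=2)
low_pressure = Rule("low_pressure", operator.lt, 1.0)

temperatures = [(0, 85.0), (1, 92.0), (2, 95.0), (3, 97.0), (4, 80.0), (5, 91.0)]
pressures = [(0, 1.5), (1, 0.8), (2, 0.7)]

print(evaluate(overheating, temperatures))
print(evaluate(low_pressure, pressures))
```

Supporting a new asset type here means writing a new `Rule`, not a new function. That is the property that made an off-the-shelf alerting tool a good fit for our problem.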

The second rule is to make your work usable. As mentioned before, software is written to solve problems. However, the solutions implemented in software are always subject to testing and change. In this environment, software that is configurable and user-friendly is invaluable. The data science domain in particular often requires trial and error when exploring datasets and developing approaches to a problem. Being able to change a configuration file instead of changing code, for example, saves a lot of time in the long run. But the main advantage of user-friendly code is that more people can be put on the job with minimal training. The ability to scale the team quickly reduces the risk of failure and makes the team more agile. While agile software frameworks generally prefer “just-in-time” approaches to documentation, I prefer documenting code as it is written (or before), so that this risk is minimized right from the get-go.
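As a rough illustration of changing a configuration file instead of changing code, the sketch below loads the settings of an imaginary anomaly detector from JSON. The class `DetectorConfig`, its fields, and their default values are assumptions invented for this example; only the Python standard library is used.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DetectorConfig:
    """Hypothetical settings for an anomaly detector, with documented defaults."""
    input_column: str = "value"  # name of the column to analyze
    window_size: int = 24        # number of samples per rolling window
    threshold: float = 3.0       # how far from normal counts as anomalous


def load_config(text: str) -> DetectorConfig:
    """Build a config from JSON text, rejecting keys the code does not know."""
    raw = json.loads(text)
    known = {f.name for f in fields(DetectorConfig)}
    unknown = set(raw) - known
    if unknown:
        # Failing loudly on typos is part of being user-friendly.
        raise ValueError(f"Unknown config keys: {sorted(unknown)}")
    return DetectorConfig(**raw)


# Only the settings that differ from the defaults need to be written down.
config = load_config('{"window_size": 48}')
print(config)
```

In practice the JSON text would come from a file kept next to the code, so that trying a different window size or input column is an edit to that file rather than to the algorithm.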


Personal example:

Our data science team once had a tight deadline on a customer deliverable for a pilot project. We had to prototype a data-based system that would detect certain types of anomalies that occur in an ammonia production plant. The algorithm was built, results were obtained for a test dataset, and the prototype was shipped off to the client. They liked the work and came back to us a few months later asking for a productized solution to the same problem. In the meantime, the data scientist who had worked on the initial problem had left our company, so the productization had to be done by others who had no exposure to the original work. When we dug into the prototype code, we found that it was not configurable and had no documentation. It was going to be difficult to change the code to use a differently structured dataset, run a slightly modified algorithm, and produce more comprehensive output. In short, we had the classic problem of ad-hoc code. We ended up redoing the initial work from scratch so that we would not land in the same situation again; having to do so was a big risk to take on and a huge waste of time.

The third rule, closely related to the previous one, is to expose your work to others as much and as early as is reasonable. The idea is that the assumptions behind the work, and its overall utility, get tested earlier in the process. Following this rule has other benefits as well:

  1. Making your work public early forces you to make incremental progress of high quality.
  2. It allows you to get feedback multiple times about how well you followed rules 1 and 2.
  3. It reduces overall risk because more people will have knowledge of the work. This means that they will be able to contribute to it from a non-zero starting point in the future.

Personal example:

A small team in our company built a module that combines many different anomaly detection methods and produces output that is better than that of any single method. This tool was made public (to the rest of the company) very early in its development. Just a day or two after we first demonstrated the concept and prototype to the broader team, we got feedback that an open-source project on GitHub already solved one particular aspect of our solution. This discovery reduced risk and cut delivery time by at least a few weeks.

Therefore, make it generic, usable, and public to avoid re-work, save time, and reduce risk. You can remember it using the mnemonic G-U-P or the much cuter PUG!
