If you're a data scientist or a machine learning engineer, you might recognize some of the following problems that I've seen quite often in machine learning code:
- It's written like a one-off script, yet the code structure is quite complex. That's a problem, because this is code you need to change often, for example when you want to try different models or parameters, or attach tracking tools or data export mechanisms.
- Data and configuration settings are all over the place, in different files and classes.
- Abstractions are not used enough to separate different parts of the code.
- The way data flows through the application is ad hoc and not logical at all.
I believe part of the reason this is happening is that most Data Science studies and most Data Science content online focus on the techniques and the tools: you learn about statistics, machine learning libraries like TensorFlow, data cleaning techniques, and so on. Python is then viewed purely as a tool to apply those techniques and use the libraries. It makes sense that most studies can’t spend more time on the Python language and software engineering in general, because the Data Science field is already complex enough without it.
However, writing good software is hard, and learning to design and organize your code well is crucial to being a great software developer. This also applies to data scientists and machine learning engineers. Even though you may run your Python code on a variety of platforms and in different contexts, it's important that the code is easy to understand and easy to reuse in those different contexts.
The Information Expert Principle
There's one particularly important design principle to be aware of for data science, and that's the Information Expert Principle. It's part of a larger set of design principles called “General Responsibility Assignment Software Principles”, or GRASP – proposed by Craig Larman in 1997 in his book Applying UML and Patterns.
According to this principle, you should assign responsibility to the information expert - the part of the code that has the information necessary to fulfil the responsibility. In other words, the design of the software follows the structure of the data and how it's being used. What does this mean?
An important idea in software design is to separate responsibilities. In order for your code to be readable and easy to maintain, you need to make sure that functions, methods, classes, and modules don’t try to do too many things at the same time. Take a look at this function:
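# generate_vehicle_id and generate_vehicle_license are assumed to be defined elsewhere.
def register_vehicle(brand: str) -> None:
    # Generate the identifiers.
    vehicle_id = generate_vehicle_id(12)
    license_plate = generate_vehicle_license(vehicle_id)

    # Look up the catalogue price for the brand.
    catalogue_price = 0
    if brand == "Tesla Model 3":
        catalogue_price = 6000000
    elif brand == "Volkswagen ID3":
        catalogue_price = 3500000

    # Determine the tax rate (lower for the two electric models) and compute the tax.
    tax_rate = 0.05
    if brand == "Tesla Model 3" or brand == "Volkswagen ID3":
        tax_rate = 0.02
    payable_tax = int(tax_rate * catalogue_price)

    # Print the registration information.
    print("Registration complete. Vehicle information:")
    print(f"Id: {vehicle_id}")
    print(f"License plate: {license_plate}")
    print(f"Payable tax: {payable_tax}")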
This function is responsible for generating a vehicle ID and license plate, for determining the catalogue price of a particular brand and model, for determining the tax percentage for different types of vehicles, for computing the tax, and for printing out the registration information. Because everything is in a single function, it’s hard to change. For example, we can’t reuse the catalogue price information in another part of the program because it’s embedded directly in this function. If we simply want to know what tax to pay, this function computes it, but we can’t do anything with the value since it’s only printed to the screen.
You can also set this up very differently, like so:
from dataclasses import dataclass, field


# generate_vehicle_id and generate_vehicle_license are assumed to be defined elsewhere.
@dataclass
class Vehicle:
    brand: str
    catalogue_price: int
    electric: bool
    id: str = field(init=False)
    license_plate: str = field(init=False)

    def __post_init__(self):
        # Derive the identifiers once the other fields are set.
        self.id = generate_vehicle_id(12)
        self.license_plate = generate_vehicle_license(self.id)

    @property
    def tax(self) -> int:
        # Electric vehicles are taxed at a lower rate.
        tax_rate = 0.02 if self.electric else 0.05
        return int(tax_rate * self.catalogue_price)


def main():
    tesla = Vehicle("Tesla Model 3", 6000000, True)
    volkswagen = Vehicle("Volkswagen ID3", 3500000, True)
    print(volkswagen)
    print(volkswagen.tax)
The main difference between this version and the single function is that we've now grouped data and behavior according to the Information Expert principle. Everything that belongs to a vehicle is now located in the Vehicle class, including the tax computation. This means we have much more freedom in how we use vehicles in our program. We've separated the responsibility of managing information about vehicles from the part of the program that creates vehicles.
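To make that extra freedom a bit more concrete, here's a minimal sketch of reusing the tax value elsewhere in a program. The two generator functions are hypothetical stand-ins (their real implementations aren't shown in this article), and the total_tax function and the diesel van are made up purely for illustration.

```python
import random
import string
from dataclasses import dataclass, field


# Hypothetical stand-ins for the helpers used in the article.
def generate_vehicle_id(length: int) -> str:
    return "".join(random.choices(string.ascii_uppercase, k=length))


def generate_vehicle_license(vehicle_id: str) -> str:
    digits = "".join(random.choices(string.digits, k=4))
    return f"{vehicle_id[:2]}-{digits}"


@dataclass
class Vehicle:
    brand: str
    catalogue_price: int
    electric: bool
    id: str = field(init=False)
    license_plate: str = field(init=False)

    def __post_init__(self) -> None:
        self.id = generate_vehicle_id(12)
        self.license_plate = generate_vehicle_license(self.id)

    @property
    def tax(self) -> int:
        tax_rate = 0.02 if self.electric else 0.05
        return int(tax_rate * self.catalogue_price)


def total_tax(vehicles: list[Vehicle]) -> int:
    """Sum the tax over a fleet; possible because tax is a value, not a printout."""
    return sum(vehicle.tax for vehicle in vehicles)


fleet = [
    Vehicle("Tesla Model 3", 6000000, True),
    Vehicle("Volkswagen ID3", 3500000, True),
    Vehicle("Hypothetical Diesel Van", 4000000, False),  # invented example entry
]
print(total_tax(fleet))
```

Because the Vehicle class owns the data needed to compute the tax, any other part of the program can ask for it without knowing about tax rates.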
Applying The Information Expert Principle To Data Science
If you apply the Information Expert Principle to Data Science code, then the way the data flows informs the design. Look at your current code, and ask yourself the following questions.
Do your functions have all the data they need, or do they need to call lots of other methods/functions to access the data? If they call a lot of other functions, you should look at whether the data flow can be improved by providing extra parameters or structuring the data differently.
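As a rough illustration of this first question, here's a small sketch in which a scaling function receives everything it needs as parameters instead of fetching bounds or settings from elsewhere. The function name and the sample data are hypothetical.

```python
def min_max_scale(values: list[float], lower: float, upper: float) -> list[float]:
    """Scale values to the range [0, 1] using bounds supplied by the caller."""
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    span = upper - lower
    return [(value - lower) / span for value in values]


def main() -> None:
    temperatures = [12.0, 18.0, 24.0]
    # The caller decides where the bounds come from; the function only computes.
    scaled = min_max_scale(
        temperatures, lower=min(temperatures), upper=max(temperatures)
    )
    print(scaled)


if __name__ == "__main__":
    main()
```

The function is easy to test and reuse because its inputs are explicit: nothing is pulled from a global configuration or loaded behind the scenes.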
Do you use classes to structure data that belongs together, and do you define methods in the classes such that they are closest to the data they need? Or do you have functions that access global variables everywhere? If the latter, perhaps you can provide more structure to your data by using classes and use methods to easily change that data or compute things about it.
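Here's one possible way to sketch this, assuming a simple binary classification dataset. The LabeledDataset class and its members are hypothetical names chosen for the example; the point is only that the data and the operations on it live together instead of in global variables.

```python
from dataclasses import dataclass


@dataclass
class LabeledDataset:
    features: list[list[float]]
    labels: list[int]

    def __post_init__(self) -> None:
        # Features and labels belong together, so keep them consistent.
        if len(self.features) != len(self.labels):
            raise ValueError("features and labels must have the same length")

    def __len__(self) -> int:
        return len(self.labels)

    @property
    def positive_rate(self) -> float:
        """Fraction of samples whose label is 1."""
        if not self.labels:
            return 0.0
        return sum(1 for label in self.labels if label == 1) / len(self.labels)

    def drop_sample(self, index: int) -> None:
        """Remove one sample, keeping features and labels aligned."""
        del self.features[index]
        del self.labels[index]


dataset = LabeledDataset(
    features=[[0.1], [0.4], [0.9], [0.7]],
    labels=[0, 0, 1, 1],
)
print(len(dataset), dataset.positive_rate)
```

Since the class is the expert on its own data, it's also the natural place to guard invariants such as features and labels having the same length.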
Do your methods/functions return tuples containing lots of different things? If so, this also points to your data needing more structure.
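For example, instead of returning a tuple like (accuracy, precision, recall), you could return a small named structure. The sketch below assumes binary labels (0 or 1); the EvaluationResult class and the evaluate function are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluationResult:
    accuracy: float
    precision: float
    recall: float


def evaluate(labels: list[int], predictions: list[int]) -> EvaluationResult:
    """Compute basic binary classification metrics (labels are 0 or 1)."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot evaluate an empty set of predictions")

    pairs = list(zip(labels, predictions))
    true_pos = sum(1 for y, p in pairs if y == 1 and p == 1)
    false_pos = sum(1 for y, p in pairs if y == 0 and p == 1)
    false_neg = sum(1 for y, p in pairs if y == 1 and p == 0)
    correct = sum(1 for y, p in pairs if y == p)

    # Fall back to 0.0 when a metric is undefined (no positives predicted/present).
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return EvaluationResult(
        accuracy=correct / len(pairs), precision=precision, recall=recall
    )


result = evaluate(labels=[1, 0, 1, 1], predictions=[1, 0, 0, 1])
print(result.accuracy, result.precision, result.recall)
```

Callers now write result.recall instead of remembering that recall happens to be the third item in a tuple, and adding a new metric later doesn't break existing code.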
Final Thoughts
Overall, in data science projects, the data is central. There are two ways to think about this. The first is to design your code around the data and use the Information Expert principle to assign each responsibility to the part of the code that holds the data it needs. The second is to consider how objects or modules communicate with each other. You generally want to design the system so that the number of different pieces of data flowing between them is kept to a minimum. If two pieces of code are sending lots of different data to each other, you should probably rethink your design and simplify how they communicate.
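As a sketch of that second point, here's one way to keep the interface between two parts of a program narrow. All names are hypothetical: a training run keeps its detailed loss history to itself and hands the reporting code only a small summary.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RunSummary:
    """The only data the reporting code needs to know about."""
    name: str
    best_loss: float
    epochs: int


@dataclass
class TrainingRun:
    name: str
    losses: list[float] = field(default_factory=list)

    def record(self, loss: float) -> None:
        self.losses.append(loss)

    def summary(self) -> RunSummary:
        # The run owns the loss history, so it is the one that summarizes it.
        if not self.losses:
            raise ValueError("no losses recorded yet")
        return RunSummary(self.name, min(self.losses), len(self.losses))


def format_report(summary: RunSummary) -> str:
    return (
        f"{summary.name}: best loss {summary.best_loss:.3f} "
        f"after {summary.epochs} epochs"
    )


run = TrainingRun("baseline")
for loss in (0.9, 0.5, 0.6):
    run.record(loss)
print(format_report(run.summary()))
```

The reporting function never sees the raw loss list, so you can change how the run stores its history without touching the report.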
I’ve recently published a series of videos where I dissect a data science project from a software designer's point of view and completely restructure the code, using Python's latest features and applying design principles such as Information Expert. Here are the links to the videos:
- Part 1: https://youtu.be/ka70COItN40
- Part 2: https://youtu.be/Tx4AxbQNv3U