Ever wondered how a Data Scientist works? The data science process (i.e., building data-driven products such as recommendation systems, fraud detection systems, chatbots, etc.) is, in some sense, similar to what a chef like Paul Bocuse does in a restaurant when preparing a new menu. The chef must create something new that appeals to the customers and can then be reproduced throughout the season. Even if a recipe exists, getting the right taste, consistency, or look usually requires iterating and relying on experiments, i.e., learning from trial and error, using ingredients (data), some preexisting intuition (domain and technical/professional expertise), and the right tools.
Would you rather read this article in German? You can find it here: »Was hat der Drei-Sterne-Koch Paul Bocuse mit Data Science zu tun?«
Business and Data Understanding
Why do we even need to cook? For whom? For what occasion? What kind of dish are we talking about? And how do we know when we have succeeded? All these questions about the business goal, the potential added value, the stakeholders, and the expected outcome are the starting point of any data science project (usually referred to as the business understanding or problem-framing phase). Obviously, cooking spaghetti al ragù is not the same as cooking a sauce-drenched timpano. Yet a number of ingredients (meatballs, pasta, and tomato sauce), steps, and tools are common to both dishes. What differs is the combination and the proportions of the ingredients, the tools used and their settings, and the sequence and timing of the preparation steps.
The business requirements usually narrow the pool of suitable dishes to cook. This also restricts the set of ingredients to use (i.e., the data). For example, chocolate will most likely be excluded when cooking a variation of penne all'arrabbiata (unless one is very innovative). At that point, new questions emerge: Are all ingredients available? In what quantity? And in what quality? If some ingredients are missing, where can we get them, and how long will it take? Planting tomatoes on the balcony may not be an effective long-term supply solution. Asking the neighbors for some may contribute to good neighborly relations (as a one-time solution). Going to the supermarket or (better) directly to the producer is probably more efficient.
During the data understanding phase, the goal is to choose a dish that meets three conditions: it potentially fits the business objective, the appropriate ingredients (data) are available for it, and it can actually be cooked (meaning that the time, skills, tools, etc. are present).
A dish is by no means a mere juxtaposition of ingredients. Ingredients must be prepared, usually in a certain order, and treating the same ingredients differently can have massive effects on the outcome. For example, a dessert such as a floating island requires separating the egg whites from the yolks first and then whisking the egg whites, whereas an omelet requires whisking the whole eggs directly. The same is true of data.
Let’s get the ingredients (i.e., the data) first. Ingredients and data alike may come from different sources (supermarkets, wholesalers, producers, etc., or data warehouses, cloud storage, APIs, etc.) in different forms and packaging (data formats). The data ingestion process aims at gathering all ingredients and making them available in a usable form on the worktop so that cooking can start.
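To make the ingestion step concrete, here is a minimal sketch in Python. It assumes two invented sources, one delivering CSV and one delivering JSON, each with its own field names. The sample data, the field names, and the helper functions `from_csv` and `from_json` are all hypothetical and serve only to illustrate the idea of bringing everything onto one worktop in one common form.

```python
import csv
import io
import json

# Hypothetical deliveries from two suppliers, each in its own "packaging".
CSV_SOURCE = "name,quantity\ntomato,12\nonion,4\n"
JSON_SOURCE = '[{"item": "pasta", "amount": 2}, {"item": "basil", "amount": 1}]'

def from_csv(text):
    """Parse the CSV delivery into the common record form."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"name": row["name"], "quantity": int(row["quantity"])} for row in reader]

def from_json(text):
    """Parse the JSON delivery and rename its fields to the common form."""
    return [{"name": row["item"], "quantity": int(row["amount"])} for row in json.loads(text)]

# Everything ends up on one worktop with one schema: name and quantity.
worktop = from_csv(CSV_SOURCE) + from_json(JSON_SOURCE)
print(worktop)
```

In a real project the sources would be files, databases, or APIs rather than inline strings, but the principle of mapping each source to one shared schema stays the same.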
Like data, ingredients may differ in quality. There may, for example, be problems comparable to data formatting issues: vegetables or fruits might not always be graded to a uniform size, some might be riper than others, etc. There might also be missing values, like an egg carton containing 6 instead of 12 eggs. The data may be imbalanced: too much pasta and not enough sauce, etc. (see more examples in the “Bad Data Handbook” (McCallum 2013)). A chef will always check the quality of their ingredients, sort them, and even change suppliers if necessary. This is what the data cleaning process aims for.
Preparing intermediate products corresponds to the feature engineering process. This can be straightforward, like chopping onions or blending spices, or more complex, like marinating a piece of meat or preparing a stuffing.
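The quality checks described above can be sketched in a few lines of Python. The records, the field names, and the three helper functions are invented for illustration; the sketch only shows how one might count missing values, inspect class imbalance, and sort out incomplete records.

```python
from collections import Counter

# Hypothetical ingredient records; None marks a missing value.
records = [
    {"ingredient": "egg", "weight_g": 58.0, "label": "fresh"},
    {"ingredient": "egg", "weight_g": None, "label": "fresh"},
    {"ingredient": "tomato", "weight_g": 120.0, "label": "fresh"},
    {"ingredient": "tomato", "weight_g": 95.0, "label": "spoiled"},
]

def missing_counts(rows):
    """Count missing (None) values per field."""
    counts = Counter()
    for row in rows:
        for key, value in row.items():
            if value is None:
                counts[key] += 1
    return dict(counts)

def class_shares(rows, field):
    """Return the relative frequency of each value of `field`."""
    totals = Counter(row[field] for row in rows)
    return {value: count / len(rows) for value, count in totals.items()}

def drop_incomplete(rows):
    """Keep only records in which every field has a value."""
    return [row for row in rows if all(v is not None for v in row.values())]

print(missing_counts(records))         # which fields have gaps
print(class_shares(records, "label"))  # how imbalanced the labels are
print(len(drop_incomplete(records)))   # how many records survive cleaning
```

Dropping incomplete records is only one possible treatment; depending on the project, filling in missing values or going back to the supplier may be the better choice.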
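As a rough illustration of feature engineering, the following Python sketch turns one raw restaurant order into a handful of derived features. The order fields (`timestamp`, `amount`, `items`, `guests`) and the chosen features are assumptions made up for this example, not part of any real dataset.

```python
import math
from datetime import datetime

def engineer_features(order):
    """Derive model-ready features from one raw (hypothetical) order record."""
    ts = datetime.fromisoformat(order["timestamp"])
    return {
        "hour": ts.hour,                                      # time-of-day effect
        "is_weekend": ts.weekday() >= 5,                      # Saturday or Sunday
        "log_amount": math.log1p(order["amount"]),            # dampen large bills
        "items_per_guest": order["items"] / order["guests"],  # simple ratio
    }

raw_order = {"timestamp": "2024-03-02T19:30:00", "amount": 84.0, "items": 6, "guests": 3}
features = engineer_features(raw_order)
print(features)
```

The simple features (hour, weekend flag) are the chopped onions of this example; a ratio or a transformed amount is closer to a marinade, because it encodes an assumption about what matters for the dish.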
Modeling, Evaluation & Deployment
Let’s cook! While the type of dish already constrains the kind of cookware (i.e., the model) to use (pans, oven, kitchen utensils, etc.), there is still plenty of room for experimentation (temperature, cooking time, stirring or not, etc.). Similar to a chef, who will try many alternatives before the expected consistency, taste, or look is achieved, Data Scientists try different model versions, each with slight variations ((hyper-)parameters), in order to find the best combination of ingredients (data), intermediate products (features), and cookware (model). This corresponds to the modeling phase.
Taste is subjective, and what a cook likes might not always reflect what the customers want or are ready to order. The art of a chef is to understand the customers’ tastes and adapt the dish when necessary. The same is true for data-driven products and the work of a Data Scientist. A dish and a data-driven product alike may perform well in a controlled environment (e.g., at home with some friends), but poorly in a production environment facing all different types of customers (e.g., in a restaurant or a restaurant chain). The goal of the evaluation process is to get feedback on the performance, and to adapt or change the dish (product) if necessary. This may be done with a specific set of customers (e.g., the regular ones) or on a special occasion. The idea here is to avoid losing too much time and to evaluate the product as soon as possible (using, for example, an MVP in an A/B test setting).
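Trying out model versions with slightly different (hyper-)parameters can be sketched as a small grid search. The example below uses a toy one-dimensional k-nearest-neighbors classifier written from scratch; the data points, the parameter grid, and all function names are invented for illustration, and a real project would typically rely on an established library instead.

```python
from itertools import product

# Toy 1-D data as (value, label); one training point (5.6, 0) is deliberately noisy.
train = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0), (5.6, 0),
         (6.0, 1), (6.5, 1), (7.0, 1), (7.5, 1)]
valid = [(1.2, 0), (2.2, 0), (3.0, 0), (5.5, 1), (6.8, 1), (7.2, 1)]

def knn_predict(x, data, k, weighted):
    """Predict a label by (optionally distance-weighted) vote of the k nearest points."""
    neighbours = sorted(data, key=lambda p: abs(p[0] - x))[:k]
    votes = {}
    for xi, yi in neighbours:
        weight = 1.0 / (abs(xi - x) + 1e-9) if weighted else 1.0
        votes[yi] = votes.get(yi, 0.0) + weight
    return max(votes, key=votes.get)

def accuracy(data, k, weighted):
    """Share of points in `data` that the model trained on `train` labels correctly."""
    hits = sum(knn_predict(x, train, k, weighted) == y for x, y in data)
    return hits / len(data)

# The "recipe variations": every combination of the hyperparameters is tried.
grid = {"k": [1, 3, 5], "weighted": [False, True]}
results = []
for k, weighted in product(grid["k"], grid["weighted"]):
    results.append(({"k": k, "weighted": weighted}, accuracy(valid, k, weighted)))

best_params, best_score = max(results, key=lambda r: r[1])
print(best_params, best_score)
```

Each configuration is scored on validation data that was not used for fitting, which is the modeling counterpart of tasting the dish before it goes on the menu.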
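One common way to evaluate an A/B test is a two-proportion z-test, sketched below with the Python standard library only. The choice of this particular test, the order counts, and the function name are assumptions for illustration; the article itself does not prescribe a specific evaluation method.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare two conversion rates; return the z statistic and two-sided p-value."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical numbers: variant A (old dish) ordered by 120 of 1000 guests,
# variant B (new dish) ordered by 150 of 1000 guests.
z, p_value = two_proportion_z_test(120, 1000, 150, 1000)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

A small p-value suggests that the difference in order rates is unlikely to be due to chance alone, but whether the new dish is worth keeping remains a business decision.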
Getting a new dish from the restaurant’s kitchen to the dining room implies several things. Obviously, the menu must be adapted so that the customers can find, understand, and order the new dish (i.e., incorporating the new data-driven product into the current portfolio may require new UX decisions). A price has to be chosen. The waiters should know how to describe and sell the dish to the customers. The kitchen team must be able to cook the dish within a given amount of time, even when the chef is not there. The restaurant must ensure that feedback, either from the customers directly or from the waiters, is obtained continuously, etc. This corresponds to the deployment process.
Conclusion: Comparing the Work of a Chef and a Data Scientist
Just as in the kitchen, the different phases or processes involved in Data Science are not independent of each other. Many iterations usually take place. It may be that a phase fails (e.g., not enough ingredients; not the right ones for the dish; the restaurant’s customers do not order the new dish, etc.) and that adaptation is required (ordering new ingredients, changing the dish, organizing the menu, etc.). Furthermore, a kitchen has to be well organized to keep up during rush hours, to avoid waste, and to ensure high standards of quality and hygiene. The recipes must be written down and updated when needed to ensure that customers get the same outcome when they order the same dish. Finally, while being technically good is a prerequisite, understanding the business problem and the customers’ needs is essential.
Are you interested in becoming a Data Scientist? Fraunhofer IESE and the Fraunhofer Big Data Alliance are jointly offering a three-level certificate program for Data Scientists.
For further information about Data Science, the author recommends this website: https://towardsdatascience.com/