It works on my notebook! Theory versus practice in Data Science

When you start on the road to becoming a data-driven company, you begin to learn how to use the tools the market currently offers, such as artificial intelligence or the Internet of Things. You hire data scientists to solve your business problems without first asking yourself WHAT you need in order to do artificial intelligence and HOW you plan to do it.

The answer to the first question is simple: we need data. The situation gets complicated when we try to answer the second question.

Everything seems perfect, but where is this data stored? What format does it take? Is it easy to access, and how often can I access it? Is it complete, free of errors and null records? How long have I had this data? And assuming all this is solved, how easy is it to develop a Machine Learning model and implement it?

There are many questions, but there are also many answers depending on the problem to be solved.

When we talk about Data Science, we are not talking about a single tool, skill or method, but about a scientific approach that uses statistical theory, applied mathematics and computing tools to process large amounts of data. It is a detailed process that mainly involves preprocessing, analysis, visualization and prediction.

We all know that Data Science is a very powerful scientific approach, with all kinds of interesting applications. However, it is also well known that in Data Science there is a big gap between theory and practice: we may know the theory well and still not know how to apply it in real life.

For this reason it is important to prioritize when working with data. The list of steps below may change depending on the company, but most companies agree on many of these points.

Step 1 – Define the business problem

This first step is fundamental. It depends far more on the human factor (understanding the problem to be solved and agreeing on the objectives, scope and timeframe) than on the system that will be used as a means to reach them.

The data scientist surely has many ways to solve a problem, but the course of the solution must be set by the person who knows the business. Interaction and teamwork are essential.

Step 2 – Data acquisition

This step is perhaps where we find the biggest difference between theory and practice. In theory, when we want to build a machine learning model, all we need to do is download a dataset from sites such as Kaggle or GitHub and we will have clean, tidy and well-described data. In practice, the sources can be:

  1. Very varied: combining them requires prior ETL work, modeling, etc.
  2. Poorly described or undescribed: without a clear description of each variable, we do not know what we have or whether it can help us solve our problem.
  3. Full of erroneous data or null records: as is often said in Data Science, garbage in, garbage out.
  4. Unknown: in a siloed company where data stays within the area that works with it, the opportunity to combine it or use it for other business purposes may be lost.
  5. Restricted: depending on the company's data security standards, accessing data often becomes a titanic task that involves a bureaucratic process whose duration is hard to estimate.

These and many other problems with data sources can be solved with proper data governance and, fundamentally, an organization that communicates very well.
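
As a rough illustration of the third point, a quick profile of missing values per field can show how far a real source is from a tidy public dataset. This is only a sketch: the function name profile_missing, the list of markers treated as "missing" and the sample rows are all assumptions made up for the example.

```python
from collections import Counter

def profile_missing(records, missing_markers=(None, "", "NA", "null")):
    """Return the share of missing values per field across a list of dict records."""
    # Collect every field that appears in at least one record.
    fields = sorted({key for row in records for key in row})
    missing = Counter()
    for row in records:
        for field in fields:
            value = row.get(field)  # an absent key counts as missing
            if isinstance(value, str):
                value = value.strip()
            if value in missing_markers:
                missing[field] += 1
    total = len(records)
    return {field: missing[field] / total for field in fields} if total else {}

# Hypothetical sample rows, invented for illustration only.
sample = [
    {"customer_id": "1", "age": "34", "city": "Lima"},
    {"customer_id": "2", "age": "", "city": "NA"},
    {"customer_id": "3", "city": "Bogota"},
    {"customer_id": "4", "age": "41", "city": None},
]
report = profile_missing(sample)
```

Which values count as missing is itself a business decision, so the marker list would normally be agreed with the people who own the data.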

Step 3 – Data preparation

This step involves data cleansing and data transformation. Data cleansing is the most time-consuming part, as it involves handling many complex scenarios such as inconsistent data types, misspelled attributes, missing values and duplicates. Then, in data transformation, we modify the data based on the defined mapping rules.
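
The cleansing scenarios mentioned above can be sketched in a few lines of plain Python. This is a minimal example under stated assumptions, not a complete pipeline: the function clean_records, the rename_map of misspelled attribute names and the sample rows are hypothetical, and a real project would more likely rely on a dataframe library.

```python
def clean_records(records, rename_map, numeric_fields, default=None):
    """Rename misspelled fields, coerce numeric types, and drop exact duplicates."""
    cleaned, seen = [], set()
    for row in records:
        fixed = {}
        for key, value in row.items():
            # Misspelled attributes: normalize case and apply the rename rules.
            name = rename_map.get(key.strip().lower(), key.strip().lower())
            if isinstance(value, str):
                value = value.strip()
            if value in ("", None):
                value = default            # missing value: keep an explicit marker
            elif name in numeric_fields:
                try:
                    value = float(value)   # inconsistent types: "34" and 34 both become 34.0
                except (TypeError, ValueError):
                    value = default        # an unparseable number is treated as missing
            fixed[name] = value
        fingerprint = tuple(sorted(fixed.items(), key=lambda item: item[0]))
        if fingerprint not in seen:        # duplicates: keep only the first occurrence
            seen.add(fingerprint)
            cleaned.append(fixed)
    return cleaned

# Hypothetical raw rows, invented for illustration only.
raw = [
    {"Customer_ID": "1", "Agee": "34", "city": " Lima "},
    {"customer_id": "1", "age": 34, "city": "Lima"},
    {"customer_id": "2", "age": "unknown", "city": ""},
]
result = clean_records(raw, rename_map={"agee": "age"}, numeric_fields={"age"})
```

Here the first two rows turn out to be the same record once names and types are normalized, which is why deduplication runs after the other fixes rather than before.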

Step 4 – Exploratory Data Analysis

With the help of Exploratory Data Analysis we define and refine the selection of variables to be used for the development of our model. It is important to always keep in mind the solution we want to target.
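
One simple way to start refining the selection of variables is to rank candidates by their correlation with the target. The sketch below assumes numeric columns and uses Pearson correlation, which only captures linear relationships; the column names and values are invented for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)

def rank_features(columns, target):
    """Sort candidate variables by absolute correlation with the target."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical columns and target, invented for illustration only.
columns = {
    "monthly_visits": [1, 2, 3, 4, 5],
    "store_id": [7, 7, 7, 7, 7],
    "discount": [5, 3, 4, 1, 2],
}
target = [10, 20, 30, 40, 50]
ranking = rank_features(columns, target)
```

A ranking like this is a starting point for discussion with the business, not a final selection: a variable with low linear correlation can still matter in combination with others.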

Step 5 – Data modeling

Data modeling is the core activity of a data science project. In this step, we repeatedly apply different types of machine learning techniques, such as KNN, decision trees, Naive Bayes, etc., to the data so that we can identify the model that best fits the business requirement. We train each candidate on the training dataset and test it to select the best-performing model.
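
The train, test and select loop can be sketched as follows. To keep the example self-contained it implements small versions of KNN and Gaussian Naive Bayes by hand and leaves decision trees out; in practice a library such as scikit-learn would normally be used instead. The data is synthetic and the 80/20 split, k=3 and accuracy as the selection metric are all assumptions for illustration.

```python
import math
import random
from collections import Counter, defaultdict

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def fit_gaussian_nb(train_X, train_y):
    """Estimate the prior, feature means and feature variances of each class."""
    by_class = defaultdict(list)
    for features, label in zip(train_X, train_y):
        by_class[label].append(features)
    model = {}
    for label, rows in by_class.items():
        means = [sum(col) / len(col) for col in zip(*rows)]
        # A tiny constant keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / len(col) + 1e-9
                     for col, m in zip(zip(*rows), means)]
        model[label] = (len(rows) / len(train_X), means, variances)
    return model

def nb_predict(model, x):
    """Pick the class with the highest log posterior under the Gaussian assumption."""
    def log_posterior(prior, means, variances):
        return math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
            for value, mean, var in zip(x, means, variances))
    return max(model, key=lambda label: log_posterior(*model[label]))

def accuracy(predict, test_X, test_y):
    """Share of test points whose predicted label matches the true label."""
    return sum(predict(x) == y for x, y in zip(test_X, test_y)) / len(test_y)

# Synthetic two-class data, generated only for illustration.
rng = random.Random(0)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)] + \
    [(rng.gauss(4, 1), rng.gauss(4, 1)) for _ in range(100)]
y = [0] * 100 + [1] * 100

# Shuffle, then hold out 20% of the rows for testing.
order = list(range(len(X)))
rng.shuffle(order)
split = int(0.8 * len(order))
train, test = order[:split], order[split:]
train_X, train_y = [X[i] for i in train], [y[i] for i in train]
test_X, test_y = [X[i] for i in test], [y[i] for i in test]

nb_model = fit_gaussian_nb(train_X, train_y)
scores = {
    "knn": accuracy(lambda x: knn_predict(train_X, train_y, x), test_X, test_y),
    "naive_bayes": accuracy(lambda x: nb_predict(nb_model, x), test_X, test_y),
}
best_model = max(scores, key=scores.get)
```

Accuracy is used here only because it is easy to read. Which metric defines the "best" model should follow from the business problem agreed in Step 1.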

Step 6 – Visualization and communication

This point is perhaps the most relevant of all: we can have the best data extraction and transformation process and the best-trained Machine Learning model, but if we do not know how to visualize it, explain it, communicate it and deliver value to the business, all the previous work will not matter much. It is essential to strengthen soft skills at this point in order to reach stakeholders.

Step 7 – Implementation and maintenance

Finally, in this step the data scientist implements and maintains the model. Best practice is to test the selected model in a pre-production environment before deploying it to production. After deployment, we need real-time analytics to monitor and maintain the performance of the project.
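
A very small example of what monitoring can mean in practice is tracking accuracy over the most recent predictions and raising a flag when it drops. The class below is a hypothetical sketch: the name AccuracyMonitor, the window size and the threshold are assumptions, and it presumes that the true outcomes eventually become available to compare against.

```python
from collections import deque

class AccuracyMonitor:
    """Track accuracy over the most recent predictions and flag degradation."""

    def __init__(self, window=100, threshold=0.8):
        self.outcomes = deque(maxlen=window)  # 1 for a correct prediction, 0 otherwise
        self.threshold = threshold

    def record(self, predicted, actual):
        self.outcomes.append(int(predicted == actual))

    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def needs_attention(self):
        """True once the window is full and accuracy has dropped below the threshold."""
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy() < self.threshold

# Hypothetical stream of (predicted, actual) pairs, invented for illustration only.
monitor = AccuracyMonitor(window=5, threshold=0.8)
for predicted, actual in [(1, 1), (0, 0), (1, 0), (1, 1), (0, 1)]:
    monitor.record(predicted, actual)
```

Waiting for a full window before raising a flag avoids alarms triggered by the first few predictions after deployment.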

As you can see, there is a huge difference between what we study (theory), which practically starts and ends in a local notebook, and what is needed to carry out the whole process in real life (practice). For this reason, trying to work with data and generate results is often overwhelming and sometimes frustrating.

That is why Macrotest #DataLab helps you all the way with our end-to-end solution so that you have a complete understanding of the tools, methodologies and processes.