Data preprocessing is the step in which raw data is cleaned and encoded into a numerical form that a machine learning model can read. Data preprocessing techniques are part of data mining: they turn raw data into an end product that is standardized or normalized, contains no null values, and more.
Data preprocessing is essential for machine learning and deep learning tasks, from algorithm development to computer vision. In this article, you will be introduced to common data preprocessing techniques in Python and learn how to implement them on your own. This article assumes you have already imported a dataset and set it up for manipulation.
Replacing Null Values
Replacing null values is usually one of the most common data preprocessing techniques because it gives us a full dataset of values to work with. To replace null values as part of data preprocessing, I suggest using Google Colab or opening a Jupyter notebook. For simplicity's sake, I will be using Google Colab. Your first step will be to import SimpleImputer, which is part of the sklearn library. The SimpleImputer class provides basic strategies for imputing, or filling in, missing values.
import numpy as np
from sklearn.impute import SimpleImputer
Next, you're going to want to specify which missing values to replace and what to replace them with. We will be replacing each missing value with the mean of its column (that is, its feature) in the dataset, which we can do by setting the strategy parameter equal to 'mean'.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
The imputer fills missing values with a statistic (e.g. mean, median, …) of the data. To avoid data leakage during cross-validation, it computes the statistic on the training data during fit and stores it. It then applies that stored statistic to the test data during transform.
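As a rough sketch of that fit-then-transform split, the snippet below fits an imputer on a small training array and reuses the stored means on a separate test array. The arrays X_train and X_test are made up for illustration and are not part of the article's dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: two numeric features, with gaps marked as np.nan.
X_train = np.array([[1.0, 10.0],
                    [3.0, np.nan],
                    [np.nan, 30.0]])
X_test = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X_train)  # learns one mean per column, from the training data only

train_means = imputer.statistics_          # the stored per-column means
X_test_filled = imputer.transform(X_test)  # test gaps are filled with the training means

print(train_means)
print(X_test_filled)
```

Here the column means learned from the training rows are 2 and 20, and both gaps in the test row are filled with those values, so nothing about the test data influences the statistic.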
imputer.fit(X[:, 1:3])  # all rows, columns 1 and 2
X[:, 1:3] = imputer.transform(X[:, 1:3])
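To make the column slice concrete, here is a minimal, self-contained sketch. The three-row array X, with a text column followed by two numeric columns, is a hypothetical stand-in for whatever dataset you have loaded, and the explicit cast to float is a precaution added for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical dataset: column 0 is text, columns 1 and 2 are numeric with gaps.
X = np.array([['France', 44.0, 72000.0],
              ['Spain', np.nan, 48000.0],
              ['Germany', 30.0, np.nan]], dtype=object)

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Select every row but only the numeric columns 1 and 2 (the slice end is exclusive).
numeric = X[:, 1:3].astype(float)  # precautionary cast for this sketch

imputer.fit(numeric)
X[:, 1:3] = imputer.transform(numeric)  # write the filled values back in place

print(X)
```

The missing value in column 1 becomes 37 (the mean of 44 and 30), the missing value in column 2 becomes 60000 (the mean of 72000 and 48000), and the text column is left untouched because it was never passed to the imputer.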
Feature Scaling
Feature scaling is a data preprocessing technique used to bring the values of different features onto a comparable scale. We use it because a feature with a much larger numeric range can dominate the others, to the point that the machine learning model effectively disregards the features with smaller values.
The next bit of code we'll be using will scale our features using standardization, which subtracts the mean of a feature from each of its values and then divides that difference by the feature's standard deviation. The result has a mean of 0 and a standard deviation of 1, and for roughly bell-shaped data most values land within about three units of zero. Standardization is a commonly used data preprocessing technique. Another feature scaling function we could have used is normalization, which subtracts the feature's minimum from each value and then divides by the difference between the maximum and minimum. Normalization puts all values between 0 and 1. However, because it depends only on the minimum and maximum, a single outlier can squeeze the remaining values into a narrow band, so it works best when features have a bounded range without extreme values, which may not always be the case. Since standardization holds up in a wider range of cases, we'll be using it here.
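Before the scikit-learn version, the two formulas described above can be checked by hand. The following is a minimal sketch in plain NumPy; the feature x and its five values are made up for illustration.

```python
import numpy as np

# Hypothetical single feature with five values.
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Standardization: subtract the mean, then divide by the standard deviation.
# np.std defaults to the population standard deviation (ddof=0).
standardized = (x - x.mean()) / x.std()

# Normalization (min-max): subtract the minimum, then divide by the range.
normalized = (x - x.min()) / (x.max() - x.min())

print(standardized)
print(normalized)
```

For this feature, the standardized values have mean 0 and standard deviation 1, while the normalized values come out as 0, 0.25, 0.5, 0.75 and 1.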
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()  # sc = standard scaler object
x_train = sc.fit_transform(x_train)  # only apply feature scaling to numerical values
x_test = sc.transform(x_test)
Above, we begin by importing the StandardScaler class from the sklearn preprocessing module. After this, we create an object of the class in the variable sc. Since we'll be applying it to all our values, we don't need to pass any parameters. Then, we take the training set x_train and fit our standard scaler object only on the columns containing independent values, transforming them in the same step with fit_transform. The testing values have to go through the same scaling, using the mean and standard deviation learned from the training set, which is why we have sc.transform(x_test) rather than a second fit.
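To see why the test set goes through transform instead of fit_transform, here is a small sketch with made-up arrays. The values in x_train and x_test are hypothetical; mean_ and scale_ are the attributes where a fitted StandardScaler stores the per-feature mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test sets with two numeric features.
x_train = np.array([[1.0, 100.0],
                    [2.0, 200.0],
                    [3.0, 300.0]])
x_test = np.array([[2.0, 400.0]])

sc = StandardScaler()
x_train_scaled = sc.fit_transform(x_train)  # learn mean and std from the training set, then scale it
x_test_scaled = sc.transform(x_test)        # reuse the training mean and std on the test set

print(sc.mean_)   # per-feature means learned from x_train
print(sc.scale_)  # per-feature standard deviations learned from x_train
print(x_test_scaled)
```

The learned means are 2 and 200, both taken from the training rows. The test value 400 therefore scales to a positive number well above zero, which correctly signals that it sits far above the training average; refitting on the test set would hide that.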
Data preprocessing techniques are important for turning raw data sets into a final product that a model can use. The two covered above, replacing null values and feature scaling, are common steps in data preprocessing. For more information about the workings behind machine learning, I suggest you check out my recent article, Understanding Artificial Neural Networks: What's Behind the Network.