Data Preprocessing Techniques for Machine Learning with Python


Data preprocessing is the step in which raw data is encoded into a numerical form that a machine can easily read. Data preprocessing techniques are part of data mining; they turn raw data into an end product that is standardized/normalized, contains no null values, and more.

Data preprocessing is essential for machine and deep learning tasks, for anything from algorithm development to computer vision. In this article, you will be introduced to common data preprocessing techniques in Python and learn how to implement them on your own. This article assumes you have imported and set up a dataset for manipulation.
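The snippets that follow operate on a NumPy array X whose second and third columns are numeric. If you want to follow along without your own data, a minimal toy array (purely an illustrative assumption, not part of any real dataset) could look like this:

import numpy as np

X = np.array([
    ['France', 44.0, 72000.0],
    ['Spain', 27.0, np.nan],
    ['Germany', np.nan, 54000.0],
    ['Spain', 38.0, 61000.0],
], dtype=object)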

Replacing Null Values

Replacing null values is probably the most common data preprocessing technique, because it gives us a complete dataset of values to work with. To follow along, I suggest using Google Colab or opening a Jupyter notebook; for simplicity's sake, I will be using Google Colab. Your first step is to import SimpleImputer, which is part of the sklearn library. The SimpleImputer class provides basic strategies for imputing, or filling in, missing values.

import numpy as np
from sklearn.impute import SimpleImputer

Next, you're going to want to specify which missing values to replace. We will replace each missing value with the mean of its column (feature), which we do by setting the strategy parameter to 'mean'.

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

The imputer fills missing values with a statistic (e.g. mean, median, ...) of the data. To avoid data leakage during cross-validation, it computes the statistic on the training data during fit and stores it; it then applies that stored statistic to the test portion during transform.

imputer.fit(X[:, 1:3])  # learn the mean of each selected column
X[:, 1:3] = imputer.transform(X[:, 1:3])  # replace the missing values
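As a quick sanity check (using the toy X from the introduction), you can print the imputed columns:

print(X[:, 1:3])  # the nan entries are gone
# e.g. the missing age is now (44 + 27 + 38) / 3 ≈ 36.33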

Feature Scaling

Feature scaling is a data preprocessing technique used to normalize our set of data values. The reason we use feature scaling is that some features might dominate others in such a way that the machine learning model effectively disregards the dominated features. Each set of values in this case represents a separate feature.

The next bit of code will scale our data features using standardization, which subtracts the mean of the feature from each value and then divides that difference by the standard deviation of the feature. After this, most values fall roughly between -3 and +3, i.e. within three standard deviations of the mean. Standardization is a commonly used data preprocessing technique.

Another feature scaling function we could have used is normalization (min-max scaling), which subtracts the feature's minimum from each value and then divides that by the difference between the maximum and the minimum. Normalization puts all values between 0 and 1. However, normalization is usually recommended when most of your features exhibit a normal distribution – which may not always be the case. Since standardization works for both cases, we'll be using it here.
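To make the two formulas concrete, here is a minimal NumPy sketch of both, using a small hypothetical feature:

import numpy as np

feature = np.array([44.0, 27.0, 30.0, 38.0])

# Standardization: (x - mean) / standard_deviation
standardized = (feature - feature.mean()) / feature.std()

# Normalization (min-max scaling): (x - min) / (max - min)
normalized = (feature - feature.min()) / (feature.max() - feature.min())

print(standardized)  # centered on 0, mostly within [-3, 3]
print(normalized)    # squeezed into [0, 1]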

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()  # standard scaler object
x_train = sc.fit_transform(x_train)  # only apply feature scaling to numerical values
x_test = sc.transform(x_test)  # scale the test set with the training set's statistics

Above, we begin by importing the StandardScaler class from sklearn's preprocessing module. After this, we create an object of the class in the variable sc; since we'll be applying it to all our numerical values with the default behavior, we don't need to pass any parameters. Then, we fit the standard scaler on the training set x_train and scale it in a single step with fit_transform. For the test set we call only transform, so the test values are scaled using the mean and standard deviation learned from the training data.
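If normalization suits your data better, the pattern is identical with scikit-learn's MinMaxScaler (a sketch, assuming the same x_train/x_test split as above):

from sklearn.preprocessing import MinMaxScaler

mm = MinMaxScaler()  # scales each feature to [0, 1] by default
x_train = mm.fit_transform(x_train)  # learn min/max on the training set only
x_test = mm.transform(x_test)  # reuse the training min/max to avoid leakage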

What’s Next?

Data preprocessing techniques are important for creating a final product out of raw datasets. The two methods above – replacing null values through imputation and feature scaling – are among the most common data preprocessing steps in Python.
