Data Preprocessing Techniques for Machine Learning with Python

Build, deploy, operate computer vision at scale

One platform for all use cases
Connect all your cameras
Flexible for your needs

Data preprocessing methods are of great importance to transform raw collected data into high-quality datasets that can be used for algorithms, machine learning, and data analysis.

In this article, you will be introduced to common data preprocessing techniques in Python and learn how to implement them on your own. The article contains some technical parts that require basic knowledge and assumes you have imported and set up a dataset for manipulation.

What Is Data Preprocessing?

Data preprocessing describes the steps needed to encode data with the purpose of transforming it into a numerical state that machines can read. Data preprocessing techniques are part of data mining, which creates end products out of raw data which is standardized/normalized, contains no null values, and more.

Data preprocessing is essential for machine and deep learning tasks, and it is the basis of algorithm development, ML model training, and computer vision or image processing systems.

Data preprocessing for machine learning — Data preprocessing is essential to increase the AI model quality in machine learning for computer vision

Why Is Data Preprocessing Important?

The massive growth in the scale of data in recent years is the basis of Big Data and IoT systems. sensors and applications generate vast amounts of raw data. That data cannot be directly treated by humans or manual methods to obtain organized knowledge and useful insights. Big data’s high volume, variety, and velocity require advanced, high-performance processing to easily understand and automatically extract information from data.

The emergence of big data has led to the development of data science or data mining. This popular discipline is rapidly gaining importance in the current world of data. The implementation of emerging technologies and services, like Cloud and Edge computing or machine learning, as well as the reduction in hardware costs, accelerate the adoption rate of intelligent AI applications.

However, it is well-known that low-quality data will lead to low-quality information. Thus, data processing is a major and critical stage to obtain final data sets that can be used for further data analysis or machine learning algorithms. This is why data preprocessing is an integral part of AIoT applications and systems for image recognition with AI vision.

Python Data Preprocessing Techniques

Replacing Null Values

Replacing null values is usually the most common of data preprocessing techniques because it allows us to have a full dataset of values to work with. To execute replacing null values as part of data preprocessing, I suggest using Google Colab or opening a Jupyter notebook. For simplicity’s sake, I will be using Google Colab. Your first step will be to import SimpleImputer which is part of the sklearn library. The SimpleImputer class provides basic strategies for imputing, or representing missing values.

from sklearn.impute import SimpleImputer

Next, you’re going to want to specify which missing values to replace. We will be replacing those missing values with the mean of that row of the dataset, which we can do by setting the strategy variable equal to the mean.

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

The imputer fills missing values with some statistics (e.g., mean, median) of the data. To avoid data leakage during cross-validation, it computes the statistic on the train data during the fit and then stores it. It then uses that data on the test portion, done during the transform.

imputer.fit(X[:, 1:3]) #looks @ rows and columns X[:, 1:3] = imputer.transform(X[:, 1:3])

Feature Scaling

Feature scaling is a data preprocessing technique used to normalize our set of data values. The reason we use feature scaling is that some sets of data might be overtaken by others in such a way that the machine learning model disregards the overtaken data. The sets of data, in this case, represent separate features.

The next bit of code we’ll be using will scale our data features using a Python function called standardization. The function operates by subtracting each value of the feature by the mean of all the values of that feature, and then dividing that difference by the standard deviation of the feature.

Doing this will allow all the values to be within three numbers, or values, of each other. Standardization is a commonly used data preprocessing technique. Another feature scaling function we could have used is normalization, which works by subtracting each feature value by the minimum and then dividing that by the difference of the maximum and minimum.

Normalization puts all values between 0 and 1. However, normalization is a recommended data preprocessing technique when most of your features exhibit a normal distribution – which may not always be the case. Since standardization would work for both cases, we’ll be using it here.

from sklearn.preprocessing import StandardScaler sc = StandardScaler() #sc = standard scaler variable x_train = sc.fit_transform(x_train)#only apply feature scaling to numerical values x_test = sc.transform(x_test)

Above, we begin by importing the class StandardScaler from sklearn preprocessing module. After this, we create an object of the class in the variable sc. Since we’ll be applying it to all our values, we don’t need to pass any parameters.

Then, we take the training set X and fit our standard scalar object only on the columns containing independent values. We have to do this same process for the testing values, which is why we have x_test.

What’s Next?

Data preprocessing techniques are important to create a final product out of data sets. The above were two common steps or methods of data preprocessing with Python.

For more information about the workings behind machine learning and data analysis, we recommend the following articles:

Cookie	Duration	Description
cookielawinfo-checkbox-advertisement	1 year	Set by the GDPR Cookie Consent plugin, this cookie is used to record the user consent for the cookies in the "Advertisement" category .
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
elementor	never	This cookie is used by the website's WordPress theme. It allows the website owner to implement or change the website's content in real-time.
JSESSIONID	session	The JSESSIONID cookie is used by New Relic to store a session identifier so that New Relic can monitor session counts for an application.
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.
ZCAMPAIGN_CSRF_TOKEN	session	This cookie is used to distinguish between humans and bots.
zfccn	session	Zoho sets this cookie for website security when a request is sent to campaigns.

Cookie	Duration	Description
_ga	2 years	The _ga cookie, installed by Google Analytics, calculates visitor, session and campaign data and also keeps track of site usage for the site's analytics report. The cookie stores information anonymously and assigns a randomly generated number to recognize unique visitors.
_gat_gtag_UA_177371481_2	1 minute	Set by Google to distinguish users.
_gid	1 day	Installed by Google Analytics, _gid cookie stores information on how visitors use a website, while also creating an analytics report of the website's performance. Some of the data that are collected include the number of visitors, their source, and the pages they visit anonymously.
CONSENT	2 years	YouTube sets this cookie via embedded youtube-videos and registers anonymous statistical data.
zabUserId	1 year	This cookie is set by Zoho and identifies whether users are returning or visiting the website for the first time
zabVisitId	one year	Used for identifying returning visits of users to the webpage.
zft-sdc	24hours	It records data about the user's navigation and behavior on the website. This is used to compile statistical reports and heat maps to improve the website experience.
zps-tgr-dts	1 year	These cookies are used to measure and analyze the traffic of this website and expire in 1 year.

Cookie	Duration	Description
VISITOR_INFO1_LIVE	5 months 27 days	A cookie set by YouTube to measure bandwidth that determines whether the user gets the new or old player interface.
YSC	session	YSC cookie is set by Youtube and is used to track the views of embedded videos on Youtube pages.
yt-remote-connected-devices	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt-remote-device-id	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.

Cookie	Duration	Description
2d719b1dd3	session	This cookie has not yet been given a description. Our team is working to provide more information.
4662279173	session	This cookie is used by Zoho Page Sense to improve the user experience.
ad2d102645	session	This cookie has not yet been given a description. Our team is working to provide more information.
zc_consent	1 year	No description available.
zc_show	1 year	No description available.
zsc2feeae1d12f14395b6d5128904ae3746	1 minute	This cookie has not yet been given a description. Our team is working to provide more information.