Machine Learning Challenge: Day 7
Data Pre-processing Techniques for Machine Learning: Standardization, Scaling, Encoding, and Feature Engineering
- Pre-processing Data: Data pre-processing is an essential step in machine learning: it is the process of cleaning, transforming, and preparing raw data so that a model can learn from it.
- Standardize: Standardization transforms the data so that each variable has a mean of zero and a standard deviation of one (z = (x - mean) / std). It brings all variables onto the same scale so that no single variable dominates the others.
- Scale to Range: Scaling to a range transforms the data so that every value falls within a fixed interval, typically [0, 1], using x' = (x - min) / (max - min).
- Dummy Variables: Dummy variables handle categorical variables by converting each category into its own binary (0/1) column that a model can use.
- Label Encoder: Label encoding converts a categorical variable into a numerical one by assigning a unique integer to each category.
- Frequency Encoding: Frequency encoding handles categorical variables by replacing each category with how often it appears in the data.
- Pulling Categories from Strings: This technique extracts categorical values from strings, turning free-text variables into categorical ones.
- Other Categorical Encoding: Several other categorical encoding techniques exist, such as one-hot encoding and ordinal encoding (a short ordinal-encoding sketch follows this list).
- Date Feature Engineering: Date feature engineering extracts features from date variables, such as day of the week, month, year, and more.
- Add col _na Feature: Adding a boolean _na column that flags rows with a missing value captures the missingness itself as a feature (a sketch appears after the data.info() output below).
- Manual Feature Engineering: Manual feature engineering is the process of creating new features from the existing data by applying domain knowledge.
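One-hot encoding is demonstrated below with pd.get_dummies, but ordinal encoding is not shown in this notebook, so here is a minimal sketch using scikit-learn's OrdinalEncoder. The size categories and their ordering are hypothetical, purely for illustration:

# minimal ordinal-encoding sketch (not a cell from this notebook);
# useful when categories have a natural order, e.g. small < medium < large
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})
# pass the explicit order so 'small' < 'medium' < 'large'
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
print(encoder.fit_transform(sizes[['size']]))  # rows encode to 0, 2, 1, 0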
In [1]:
import pandas as pd
#read the data
data = pd.read_csv('./California_housing_price_data.csv')
data.head()
Out[1]:
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
In [2]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
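Note that total_bedrooms has only 20433 non-null values, so it contains missing entries. As promised in the list above, a _na indicator column can capture this missingness as a feature of its own. A minimal sketch, done on a copy so the rest of the notebook is unaffected (the median imputation at the end is an assumption for illustration, not something this notebook performs):

# minimal "_na" feature sketch (not a cell from this notebook)
data_with_flags = data.copy()
# boolean flag: True where total_bedrooms was missing
data_with_flags['total_bedrooms_na'] = data_with_flags['total_bedrooms'].isna()
# the flag preserves the missingness signal even after imputation
data_with_flags['total_bedrooms'] = data_with_flags['total_bedrooms'].fillna(
    data_with_flags['total_bedrooms'].median())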
In [3]:
# separate the categorical ocean_proximity column, then drop it so the remaining data is all numeric
categorical_data = data['ocean_proximity']
data = data.drop('ocean_proximity', axis=1)
data.head()
Out[3]:
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 |
In [4]:
#standardization of data
from sklearn.preprocessing import StandardScaler
print("Data Before Standardization:")
print(data.head())
scaler = StandardScaler()
data_std = scaler.fit_transform(data)
print("\n\n Data After Standardization:")
print(data_std)
Data Before Standardization:
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0
1    -122.22     37.86                21.0       7099.0          1106.0
2    -122.24     37.85                52.0       1467.0           190.0
3    -122.25     37.85                52.0       1274.0           235.0
4    -122.25     37.85                52.0       1627.0           280.0

   population  households  median_income  median_house_value
0       322.0       126.0         8.3252            452600.0
1      2401.0      1138.0         8.3014            358500.0
2       496.0       177.0         7.2574            352100.0
3       558.0       219.0         5.6431            341300.0
4       565.0       259.0         3.8462            342200.0


 Data After Standardization:
[[-1.32783522  1.05254828  0.98214266 ... -0.97703285  2.34476576  2.12963148]
 [-1.32284391  1.04318455 -0.60701891 ...  1.66996103  2.33223796  1.31415614]
 [-1.33282653  1.03850269  1.85618152 ... -0.84363692  1.7826994   1.25869341]
 ...
 [-0.8237132   1.77823747 -0.92485123 ... -0.17404163 -1.14259331 -0.99274649]
 [-0.87362627  1.77823747 -0.84539315 ... -0.39375258 -1.05458292 -1.05860847]
 [-0.83369581  1.75014627 -1.00430931 ...  0.07967221 -0.78012947 -1.01787803]]
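One caveat worth adding: in a real pipeline the scaler should be fit on the training split only and then applied to the test split, so that test-set statistics do not leak into training. A minimal sketch, with an assumed 80/20 split chosen only for illustration:

# fit the scaler on training data only to avoid data leakage (sketch)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

train, test = train_test_split(data, test_size=0.2, random_state=42)
scaler = StandardScaler()
train_std = scaler.fit_transform(train)  # learn mean/std from train only
test_std = scaler.transform(test)        # reuse the train statistics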
In [5]:
# MinMaxScaler for normalization
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
data_scaled = scaler.fit_transform(data)
print("Data After Min Max Scaling:")
print(data_scaled)
Data After Min Max Scaling:
[[0.21115538 0.5674814  0.78431373 ... 0.02055583 0.53966842 0.90226638]
 [0.21215139 0.565356   0.39215686 ... 0.18697583 0.53802706 0.70824656]
 [0.21015936 0.5642933  1.         ... 0.02894261 0.46602805 0.69505074]
 ...
 [0.31175299 0.73219979 0.31372549 ... 0.07104095 0.08276438 0.15938285]
 [0.30179283 0.73219979 0.33333333 ... 0.05722743 0.09429525 0.14371281]
 [0.30976096 0.72582359 0.29411765 ... 0.08699227 0.13025338 0.15340349]]
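Min-max scaling maps each value with x' = (x - min) / (max - min), and the fitted scaler can undo it, which is handy for reporting predictions on the original scale:

# recover the original values from the scaled array
data_original = scaler.inverse_transform(data_scaled)
print(data_original[0])  # first row, back on the original scale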
In [6]:
# using pandas for dummy variables
print("Data Before Dummy Variables:")
print(categorical_data.head())
dummy_data = pd.get_dummies(categorical_data)
print("\n\n Data After Dummy Variables:")
print(dummy_data.head())
Data Before Dummy Variables:
0    NEAR BAY
1    NEAR BAY
2    NEAR BAY
3    NEAR BAY
4    NEAR BAY
Name: ocean_proximity, dtype: object


 Data After Dummy Variables:
   <1H OCEAN  INLAND  ISLAND  NEAR BAY  NEAR OCEAN
0          0       0       0         1           0
1          0       0       0         1           0
2          0       0       0         1           0
3          0       0       0         1           0
4          0       0       0         1           0
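Because the dummy columns always sum to one, one of them is redundant for linear models (the "dummy variable trap"); pd.get_dummies can drop the first category to avoid this:

# drop the first category to avoid perfectly collinear dummy columns
dummy_data_reduced = pd.get_dummies(categorical_data, drop_first=True)
print(dummy_data_reduced.head())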
In [7]:
# using sklearn for LabelEncoder
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
new_categorical_data = encoder.fit_transform(categorical_data)
print("Data After Label Encoder:")
print(new_categorical_data)
Data After Label Encoder:
[3 3 3 ... 1 1 1]
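The fitted encoder stores the category-to-integer mapping, so you can inspect it or reverse the encoding:

# inspect the learned mapping and decode back to category names
print(encoder.classes_)  # position in this array = encoded integer
print(encoder.inverse_transform([3, 0, 1]))  # decode example codes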
In [8]:
# you can see here how the label encoder works and choose the encoding that best fits your data
import matplotlib.pyplot as plt
plt.scatter(categorical_data, new_categorical_data, color='red')
plt.show()
In [9]:
# Frequency Encoding : used when the categories are not ordinal and have no inherent order
cat_column = categorical_data.map(categorical_data.value_counts())
cat_column
Out[9]:
0        2290
1        2290
2        2290
3        2290
4        2290
         ...
20635    6551
20636    6551
20637    6551
20638    6551
20639    6551
Name: ocean_proximity, Length: 20640, dtype: int64
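A common variant uses the relative frequency instead of the raw count, which keeps the encoded feature on a 0-1 scale; value_counts(normalize=True) does this directly:

# relative-frequency variant of the same encoding
cat_column_rel = categorical_data.map(categorical_data.value_counts(normalize=True))
print(cat_column_rel.head())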
In [10]:
# pulling the categories from the data
import re
print('categories before pulling:', categorical_data[0])
# take the first word of each string (a raw string avoids the invalid-escape warning)
pull_categorical = categorical_data.apply(lambda x: re.findall(r"\w+", x)[0])
print('categories after pulling:', pull_categorical[0])
categories before pulling: NEAR BAY
categories after pulling: NEAR
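For a simple first-word extraction like this, pandas' vectorized string methods are an alternative that avoids both the regex and the Python-level apply:

# equivalent extraction with vectorized string methods
pull_categorical = categorical_data.str.split().str[0]
print(pull_categorical[0])  # NEAR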
In [11]:
data
Out[11]:
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | -121.09 | 39.48 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.5603 | 78100.0 |
| 20636 | -121.21 | 39.49 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.5568 | 77100.0 |
| 20637 | -121.22 | 39.43 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7000 | 92300.0 |
| 20638 | -121.32 | 39.43 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 | 84700.0 |
| 20639 | -121.24 | 39.37 | 16.0 | 2785.0 | 616.0 | 1387.0 | 530.0 | 2.3886 | 89400.0 |

20640 rows × 9 columns
In [12]:
# Date Time Features : we can extract date-time features from the Date column
stock_data = pd.read_csv('./AMBUJACEM.csv')
print(stock_data.Date.head())
new_data = pd.DataFrame()
# parse the dates once, then extract the date-time features
dates = pd.DatetimeIndex(stock_data['Date'])
new_data['year'] = dates.year
new_data['month'] = dates.month
new_data['day'] = dates.day
new_data['dayofweek'] = dates.dayofweek
new_data['dayofyear'] = dates.dayofyear
new_data['weekofyear'] = dates.weekofyear
new_data['quarter'] = dates.quarter
new_data['is_month_start'] = dates.is_month_start
new_data['is_month_end'] = dates.is_month_end
new_data['is_quarter_start'] = dates.is_quarter_start
new_data['is_quarter_end'] = dates.is_quarter_end
new_data['is_year_start'] = dates.is_year_start
new_data['is_year_end'] = dates.is_year_end
new_data['is_leap_year'] = dates.is_leap_year
new_data['days_in_month'] = dates.days_in_month
new_data['is_weekend'] = dates.dayofweek.isin([5, 6])
new_data.head()
0    2023-01-13
1    2023-01-12
2    2023-01-11
3    2023-01-10
4    2023-01-09
Name: Date, dtype: object
C:\Users\ASUS\AppData\Local\Temp\ipykernel_15200\3193144317.py:12: FutureWarning: weekofyear and week have been deprecated, please use DatetimeIndex.isocalendar().week instead, which returns a Series. To exactly reproduce the behavior of week and weekofyear and return an Index, you may call pd.Int64Index(idx.isocalendar().week)
  new_data['weekofyear'] = dates.weekofyear
Out[12]:
| | year | month | day | dayofweek | dayofyear | weekofyear | quarter | is_month_start | is_month_end | is_quarter_start | is_quarter_end | is_year_start | is_year_end | is_leap_year | days_in_month | is_weekend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | 1 | 13 | 4 | 13 | 2 | 1 | False | False | False | False | False | False | False | 31 | False |
| 1 | 2023 | 1 | 12 | 3 | 12 | 2 | 1 | False | False | False | False | False | False | False | 31 | False |
| 2 | 2023 | 1 | 11 | 2 | 11 | 2 | 1 | False | False | False | False | False | False | False | 31 | False |
| 3 | 2023 | 1 | 10 | 1 | 10 | 2 | 1 | False | False | False | False | False | False | False | 31 | False |
| 4 | 2023 | 1 | 9 | 0 | 9 | 2 | 1 | False | False | False | False | False | False | False | 31 | False |
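As the FutureWarning above suggests, weekofyear is deprecated in newer pandas and isocalendar().week is the supported replacement; going through a Series keeps the rows aligned with new_data:

# deprecation-safe replacement for the weekofyear column
new_data['weekofyear'] = pd.to_datetime(stock_data['Date']).dt.isocalendar().week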
In [13]:
# Manual Feature Engineering : the process of creating new features from the existing ones
# here we create a new feature: the ratio of total_bedrooms to total_rooms
data['new_feature'] = data['total_bedrooms'] / data['total_rooms']
data['new_feature'].head()
Out[13]:
0    0.146591
1    0.155797
2    0.129516
3    0.184458
4    0.172096
Name: new_feature, dtype: float64
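The ratio deserves a descriptive name, and the same idea extends to other ratios commonly derived from this dataset, such as rooms per household and population per household:

# rename the ratio and add two more manually engineered ratio features
data = data.rename(columns={'new_feature': 'bedrooms_per_room'})
data['rooms_per_household'] = data['total_rooms'] / data['households']
data['population_per_household'] = data['population'] / data['households']
print(data[['bedrooms_per_room', 'rooms_per_household', 'population_per_household']].head())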