Machine Learning Challenge: Day 8
Feature Engineering in Machine Learning
Feature Selection
Feature selection is the
process of identifying which features in your data matter most for the problem
at hand. If you can remove unimportant features, you can often improve the
performance of your machine-learning model and make it easier to interpret.
Feature selection can be done via many different techniques; in this post, we
will cover some of them and explain why they work well:
Collinear Columns
Collinear columns are two
columns that are highly correlated with each other. If one of your input
variables is strongly related to another, it can be hard to tell which one is
actually driving the effect. For example, if two features A and B are highly
correlated, the model cannot reliably decide how much credit to give A versus
B, and the estimated coefficients become unstable; dropping one of the pair
usually costs little predictive power.
An everyday analogy: when we
look at a scene, color and contrast both carry information about the shapes in
front of us, and much of that information overlaps. When two cues tell you the
same thing, one of them is largely redundant. Collinear columns duplicate
information in the same way, which is why removing one of them rarely hurts
the model.
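As a quick sketch of the idea (on synthetic data, not the housing dataset used later in this post; the column names A, B, and C are made up for the example), a pandas correlation matrix makes this kind of redundancy visible:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df_toy = pd.DataFrame({
    "A": a,
    "B": a * 2 + rng.normal(scale=0.1, size=200),  # almost a rescaled copy of A
    "C": rng.normal(size=200),                     # independent of A and B
})

# Pairwise correlations: A and B come out nearly perfectly correlated,
# so either one of them could be dropped with little loss of information.
corr = df_toy.corr()
print(corr.round(2))
```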
Lasso Regression
Lasso regression is a type of
linear regression that finds the coefficients of a linear model while adding
an L1 penalty on their magnitudes. A regularization parameter (lambda, called
alpha in scikit-learn) controls the strength of the penalty; it can be tuned
by cross-validation or by criteria such as AIC or BIC to minimize error while
not overfitting.
Because the L1 penalty can
shrink coefficients all the way to exactly zero, the lasso performs feature
selection as a side effect: features whose coefficients are driven to zero are
effectively removed from the model.
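A minimal sketch of this zeroing-out behavior on synthetic data (the feature count, coefficients, and alpha value here are arbitrary choices for the example, not part of the walkthrough below):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 5))
# Only the first two features matter; the other three are pure noise.
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1)
lasso.fit(X_syn, y_syn)
# The coefficients of the three noise features collapse to (essentially) zero,
# while the two informative ones stay close to their true values.
print(lasso.coef_)
```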
Recursive Feature Elimination
Recursive feature elimination
(RFE) is a greedy algorithm that iteratively searches for a good subset of
features to use in a model. It is a form of backward elimination: you start
with all of your features and repeatedly remove the ones that contribute
least, until only those that matter for performance remain.
The process works like this:
first, train the model on the current set of features; then rank the features
by importance (for example, by the magnitude of their coefficients); remove
the least important feature or features; repeat until you are left with the
desired number of features.
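The loop described above can be sketched by hand. This is a simplified version of what scikit-learn's RFE automates, on synthetic data, assuming coefficient magnitude as the ranking criterion:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 4))
# Only features 0 and 1 actually influence the target.
y_syn = 5.0 * X_syn[:, 0] + 2.0 * X_syn[:, 1] + rng.normal(scale=0.1, size=200)

remaining = list(range(X_syn.shape[1]))
while len(remaining) > 2:  # stop once 2 features are left
    model = LinearRegression().fit(X_syn[:, remaining], y_syn)
    # Drop the feature whose coefficient has the smallest magnitude, then refit
    weakest = remaining[int(np.argmin(np.abs(model.coef_)))]
    remaining.remove(weakest)

print(sorted(remaining))  # the two informative features survive
```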
Mutual Information
Mutual information is a
measure of the dependence between two random variables. For discrete
variables X and Y it is defined as:
- I(X; Y) = Σ_x Σ_y p(x, y) log( p(x, y) / (p(x) p(y)) )
Mutual information measures
how much knowing the value of one variable reduces your uncertainty about the
other. In other words, it tells us how much more likely the two variables are
to vary together than they would be if they were independent. Unlike a simple
correlation, it also captures nonlinear dependence, and it is zero exactly
when the two variables are independent. For feature selection, features with
high mutual information with the target carry more information about it and
are better candidates to keep.
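The definition above can be checked on a tiny made-up joint distribution of two binary variables:

```python
import numpy as np

# Joint distribution of two binary variables X and Y (rows = x, columns = y);
# they take the same value 80% of the time, so they are clearly dependent.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)  # marginal of X
p_y = p_xy.sum(axis=0)  # marginal of Y

# I(X; Y) = sum over x, y of p(x, y) * log2( p(x, y) / (p(x) p(y)) )
mi = sum(
    p_xy[i, j] * np.log2(p_xy[i, j] / (p_x[i] * p_y[j]))
    for i in range(2) for j in range(2)
)
print(round(float(mi), 3))  # about 0.278 bits; 0 would mean independence
```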
Principal Component Analysis
Principal Component Analysis
(PCA) is a dimensionality reduction technique that reduces the number of
features in a dataset while losing as little information as possible. PCA
finds the directions (principal components) that maximize the variance of
your data set.
By keeping only the leading
components, PCA can compress the inputs used to predict an outcome, like how
many people will buy your product or service. Note that the components are
combinations of the original features rather than a selection of them.
The idea behind PCA is that
you can reduce the dimensionality of your data by projecting it onto a new
set of axes that are linear combinations of the original features. The new
axes are orthogonal to one another, so the projected coordinates are
uncorrelated and each axis captures a different aspect of your data set,
allowing you to visualize these aspects separately.
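A minimal sketch of this idea with NumPy alone, assuming a small synthetic 2-D dataset: the principal axes come out of the eigen-decomposition of the covariance matrix, and they are orthogonal by construction:

```python
import numpy as np

rng = np.random.default_rng(0)
# 2-D points stretched along one direction, so one axis dominates the variance
pts = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [1.0, 1.0]])

centered = pts - pts.mean(axis=0)          # PCA works on centered data
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)     # columns of eigvecs = principal axes

order = np.argsort(eigvals)[::-1]          # sort components by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
print(eigvals / eigvals.sum())             # fraction of variance per component
```

Here the first component alone accounts for most of the variance, which is exactly the situation where dropping the remaining components loses little information.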
Feature Importance
Feature importance is a term
used to describe the relative importance of each feature for predicting the
outcome. For example, if your model relies almost entirely on 1 or 2 of its
10 features, the remaining features contribute little to predicting your
target variable, and you could use a smaller number of more relevant
variables instead.
You can also think of it as an indicator of which columns should be included in your model when using machine-learning techniques such as linear regression or neural networks (which we'll talk about later). If several columns carry similar information, you might leave out the ones that add little beyond what the others already provide when making predictions on new data.
Conclusion
In this article, I explained how to use feature
selection in machine learning. Feature selection is a powerful
tool that can help you improve your model's performance by removing
unimportant features. By examining the data with different tools and finding
out which features are important for your problem, you end up with a better
model built on more relevant information. Try these methods out on your own
problems before using them in real-world applications!
import pandas as pd
#read the data
df = pd.read_csv('./California_housing_price_data.csv')
df.head()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
#check for nan values
df.isna().sum()
longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64
#replace nan values with mean (numeric columns only, so non-numeric columns
#like ocean_proximity don't trigger a FutureWarning)
df.fillna(df.mean(numeric_only=True), inplace=True)
1) Feature Selection:
from sklearn.feature_selection import SelectKBest, chi2
# Define X and y variables
X = df[['housing_median_age', 'total_rooms', 'total_bedrooms','households', 'median_income']]
y = df[['median_house_value']]
# Select the top 2 features using the chi2 test
# (chi2 is intended for non-negative features and a categorical target;
# it runs here, but a regression-aware score is usually a better fit)
selector = SelectKBest(chi2, k=2)
X_new = selector.fit_transform(X, y)
# Get the selected feature names
selected_features = X.columns[selector.get_support()]
print(selected_features)
Index(['total_rooms', 'total_bedrooms'], dtype='object')
2) Collinear Columns:
import numpy as np
# Define X variables
X = df[['housing_median_age', 'total_rooms', 'total_bedrooms','households', 'median_income', 'median_house_value','latitude', 'longitude']]
# Calculate correlation matrix
corr_matrix = np.corrcoef(X.T)
# Identify columns with correlation greater than 0.8
collinear_columns = [X.columns[i] for i in range(corr_matrix.shape[0]) if (corr_matrix[i,:] > 0.8).sum() > 1]
print(collinear_columns)
['total_rooms', 'total_bedrooms', 'households']
3) Lasso Regression:
from sklearn.linear_model import Lasso
# Define X and y variables
X = df[['housing_median_age', 'total_rooms', 'total_bedrooms','households', 'median_income', 'latitude', 'longitude']]
y = df[['median_house_value']]
# Initialize Lasso model with alpha value of 0.1
lasso = Lasso(alpha=0.1)
# Fit the model to the data
lasso.fit(X, y)
# Print the coefficients of the features
print(lasso.coef_)
#plot the coefficients
import matplotlib.pyplot as plt
plt.plot(lasso.coef_)
[ 1.19926487e+03 -1.41761511e+01 1.34261848e+02 -4.40893808e+01 4.19298045e+04 -4.11791107e+04 -4.26151151e+04]
4) Recursive Feature Elimination:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
# Define X and y variables
X = df[['housing_median_age', 'total_rooms', 'total_bedrooms','households', 'median_income', 'latitude', 'longitude']]
y = df[['median_house_value']]
# Initialize Linear Regression model
lr = LinearRegression()
# Create RFE object with 2 features to be selected
rfe = RFE(lr, n_features_to_select=2)
# Fit the RFE model to the data
rfe.fit(X, y)
# Get the selected feature names
selected_features = X.columns[rfe.get_support()]
print(selected_features)
Index(['latitude', 'longitude'], dtype='object')
5) Mutual Information:
from sklearn.feature_selection import SelectKBest, mutual_info_regression
# Define X and y variables
X = df[['housing_median_age', 'total_rooms', 'total_bedrooms','households', 'median_income', 'latitude', 'longitude']]
y = df['median_house_value']  # 1-d target, as mutual_info_regression expects
# Select the top 2 features using mutual information test
selector = SelectKBest(mutual_info_regression, k=2)
X_new = selector.fit_transform(X, y)
# Get the selected feature names
selected_features = X.columns[selector.get_support()]
print(selected_features)
Index(['median_income', 'longitude'], dtype='object')
6) Principal Component Analysis:
from sklearn.decomposition import PCA
X = df[['housing_median_age', 'total_rooms', 'total_bedrooms','households', 'median_income', 'latitude', 'longitude']]
pca = PCA(n_components=2)
X_transformed = pca.fit_transform(X)
print(X_transformed)
#plot the transformed data
plt.scatter(X_transformed[:,0], X_transformed[:,1])
plt.show()
[[-1836.81197904 -124.91993175] [ 4537.80814312 -222.25634601] [-1247.62924293 -185.89662395] ... [ -390.7469054 7.93750154] [ -800.20757763 -9.16819524] [ 163.52162585 40.12447885]]
7) Feature Importance
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
# Define X and y variables
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_classes=2)
print(X.shape)
print(y.shape)
(1000, 10)
(1000,)
# Initialize Random Forest Classifier
rfc = RandomForestClassifier()
# Fit the model to the data
rfc.fit(X, y)
# Print the feature importance scores
print(rfc.feature_importances_)
[0.03233567 0.10300892 0.17113546 0.03582479 0.10209856 0.16777026 0.11737581 0.03440321 0.13959609 0.09645123]
#plot the feature importance using bar plot
plt.bar([i for i in range(len(rfc.feature_importances_))], rfc.feature_importances_)
plt.show()