Machine Learning Challenge: Day 8
Feature Engineering in Machine Learning
Feature Selection
Feature selection is the
process of identifying which features in your data matter most for the problem
at hand. If you can remove unimportant features, you can often improve the
performance of your machine-learning model and make it easier to interpret.
Feature selection can be done via many different techniques; in this post, we
will cover some of them and explain why they work well:
Collinear Columns
Collinear columns are two
columns that are highly correlated with each other. If one of your input
variables is strongly related to another, it can be hard to tell which one is
actually driving the effect. For example, if two features A and B are highly
correlated, the model cannot reliably decide how much credit to give A versus
B, and the estimated coefficients become unstable; dropping one of the pair
usually costs little predictive power.
An everyday analogy: when we
look at a scene, color and contrast both carry information about the shapes in
front of us, and much of that information overlaps. When two cues tell you the
same thing, one of them is largely redundant. Collinear columns duplicate
information in the same way, which is why removing one of them rarely hurts
the model.
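As a quick sketch of the idea (on synthetic data, not the housing dataset used later in this post; the column names A, B, and C are made up for the example), a pandas correlation matrix makes this kind of redundancy visible:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df_toy = pd.DataFrame({
    "A": a,
    "B": a * 2 + rng.normal(scale=0.1, size=200),  # almost a rescaled copy of A
    "C": rng.normal(size=200),                     # independent of A and B
})

# Pairwise correlations: A and B come out nearly perfectly correlated,
# so either one of them could be dropped with little loss of information.
corr = df_toy.corr()
print(corr.round(2))
```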
Lasso Regression
Lasso regression is a type of
linear regression that finds the coefficients of a linear model while adding
an L1 penalty on their magnitudes. A regularization parameter (lambda, called
alpha in scikit-learn) controls the strength of the penalty; it can be tuned
by cross-validation or by criteria such as AIC or BIC to minimize error while
not overfitting.
Because the L1 penalty can
shrink coefficients all the way to exactly zero, the lasso performs feature
selection as a side effect: features whose coefficients are driven to zero are
effectively removed from the model.
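A minimal sketch of this zeroing-out behavior on synthetic data (the feature count, coefficients, and alpha value here are arbitrary choices for the example, not part of the walkthrough below):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 5))
# Only the first two features matter; the other three are pure noise.
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1)
lasso.fit(X_syn, y_syn)
# The coefficients of the three noise features collapse to (essentially) zero,
# while the two informative ones stay close to their true values.
print(lasso.coef_)
```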
Recursive Feature Elimination
Recursive feature elimination
(RFE) is a greedy algorithm that iteratively searches for a good subset of
features to use in a model. It is a form of backward elimination: you start
with all of your features and repeatedly remove the ones that contribute
least, until only those that matter for performance remain.
The process works like this:
first, train the model on the current set of features; then rank the features
by importance (for example, by the magnitude of their coefficients); remove
the least important feature or features; repeat until you are left with the
desired number of features.
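The loop described above can be sketched by hand. This is a simplified version of what scikit-learn's RFE automates, on synthetic data, assuming coefficient magnitude as the ranking criterion:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 4))
# Only features 0 and 1 actually influence the target.
y_syn = 5.0 * X_syn[:, 0] + 2.0 * X_syn[:, 1] + rng.normal(scale=0.1, size=200)

remaining = list(range(X_syn.shape[1]))
while len(remaining) > 2:  # stop once 2 features are left
    model = LinearRegression().fit(X_syn[:, remaining], y_syn)
    # Drop the feature whose coefficient has the smallest magnitude, then refit
    weakest = remaining[int(np.argmin(np.abs(model.coef_)))]
    remaining.remove(weakest)

print(sorted(remaining))  # the two informative features survive
```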
Mutual Information
Mutual information is a
measure of the dependence between two random variables. For discrete
variables X and Y it is defined as:
- I(X; Y) = Σ_x Σ_y p(x, y) log( p(x, y) / (p(x) p(y)) )
Mutual information measures
how much knowing the value of one variable reduces your uncertainty about the
other. In other words, it tells us how much more likely the two variables are
to vary together than they would be if they were independent. Unlike a simple
correlation, it also captures nonlinear dependence, and it is zero exactly
when the two variables are independent. For feature selection, features with
high mutual information with the target carry more information about it and
are better candidates to keep.
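The definition above can be checked on a tiny made-up joint distribution of two binary variables:

```python
import numpy as np

# Joint distribution of two binary variables X and Y (rows = x, columns = y);
# they take the same value 80% of the time, so they are clearly dependent.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)  # marginal of X
p_y = p_xy.sum(axis=0)  # marginal of Y

# I(X; Y) = sum over x, y of p(x, y) * log2( p(x, y) / (p(x) p(y)) )
mi = sum(
    p_xy[i, j] * np.log2(p_xy[i, j] / (p_x[i] * p_y[j]))
    for i in range(2) for j in range(2)
)
print(round(float(mi), 3))  # about 0.278 bits; 0 would mean independence
```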
Principal Component Analysis
Principal Component Analysis
(PCA) is a dimensionality reduction technique that reduces the number of
features in a dataset while losing as little information as possible. PCA
finds the directions (principal components) that maximize the variance of
your data set.
By keeping only the leading
components, PCA can compress the inputs used to predict an outcome, like how
many people will buy your product or service. Note that the components are
combinations of the original features rather than a selection of them.
The idea behind PCA is that
you can reduce the dimensionality of your data by projecting it onto a new
set of axes that are linear combinations of the original features. The new
axes are orthogonal to one another, so the projected coordinates are
uncorrelated and each axis captures a different aspect of your data set,
allowing you to visualize these aspects separately.
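A minimal sketch of this idea with NumPy alone, assuming a small synthetic 2-D dataset: the principal axes come out of the eigen-decomposition of the covariance matrix, and they are orthogonal by construction:

```python
import numpy as np

rng = np.random.default_rng(0)
# 2-D points stretched along one direction, so one axis dominates the variance
pts = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [1.0, 1.0]])

centered = pts - pts.mean(axis=0)          # PCA works on centered data
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)     # columns of eigvecs = principal axes

order = np.argsort(eigvals)[::-1]          # sort components by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
print(eigvals / eigvals.sum())             # fraction of variance per component
```

Here the first component alone accounts for most of the variance, which is exactly the situation where dropping the remaining components loses little information.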
Feature Importance
Feature importance is a term
used to describe the relative importance of each feature for predicting the
outcome. For example, if your model relies almost entirely on 1 or 2 of its
10 features, the remaining features contribute little to predicting your
target variable, and you could use a smaller number of more relevant
variables instead.
You can also think of it as an indicator of which columns should be included in your model when using machine-learning techniques such as linear regression or neural networks (which we'll talk about later). If several columns carry similar information, you might leave out the ones that add little beyond what the others already provide when making predictions on new data.
Conclusion
In this article, I explained how to use feature
selection in machine learning. Feature selection is a powerful
tool that can help you improve your model's performance by removing
unimportant features. By examining the data with different tools and finding
out which features are important for your problem, you end up with a better
model built on more relevant information. Try these methods out on your own
problems before using them in real-world applications!
import pandas as pd
#read the data
df = pd.read_csv('./California_housing_price_data.csv')
df.head()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
#check for nan values
df.isna().sum()
longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64
#replace nan values with mean (numeric columns only, so non-numeric columns
#like ocean_proximity don't trigger a FutureWarning)
df.fillna(df.mean(numeric_only=True), inplace=True)
1) Feature Selection:
from sklearn.feature_selection import SelectKBest, chi2
# Define X and y variables
X = df[['housing_median_age', 'total_rooms', 'total_bedrooms','households', 'median_income']]
y = df[['median_house_value']]
# Select the top 2 features using the chi2 test
# (chi2 is intended for non-negative features and a categorical target;
# it runs here, but a regression-aware score is usually a better fit)
selector = SelectKBest(chi2, k=2)
X_new = selector.fit_transform(X, y)
# Get the selected feature names
selected_features = X.columns[selector.get_support()]
print(selected_features)
Index(['total_rooms', 'total_bedrooms'], dtype='object')
2) Collinear Columns:
import numpy as np
# Define X variables
X = df[['housing_median_age', 'total_rooms', 'total_bedrooms','households', 'median_income', 'median_house_value','latitude', 'longitude']]
# Calculate correlation matrix
corr_matrix = np.corrcoef(X.T)
# Identify columns with correlation greater than 0.8
collinear_columns = [X.columns[i] for i in range(corr_matrix.shape[0]) if (corr_matrix[i,:] > 0.8).sum() > 1]
print(collinear_columns)
['total_rooms', 'total_bedrooms', 'households']
3) Lasso Regression:
from sklearn.linear_model import Lasso
# Define X and y variables
X = df[['housing_median_age', 'total_rooms', 'total_bedrooms','households', 'median_income', 'latitude', 'longitude']]
y = df[['median_house_value']]
# Initialize Lasso model with alpha value of 0.1
lasso = Lasso(alpha=0.1)
# Fit the model to the data
lasso.fit(X, y)
# Print the coefficients of the features
print(lasso.coef_)
#plot the coefficients
import matplotlib.pyplot as plt
plt.plot(lasso.coef_)
[ 1.19926487e+03 -1.41761511e+01 1.34261848e+02 -4.40893808e+01 4.19298045e+04 -4.11791107e+04 -4.26151151e+04]
4) Recursive Feature Elimination:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
# Define X and y variables
X = df[['housing_median_age', 'total_rooms', 'total_bedrooms','households', 'median_income', 'latitude', 'longitude']]
y = df[['median_house_value']]
# Initialize Linear Regression model
lr = LinearRegression()
# Create RFE object with 2 features to be selected
rfe = RFE(lr, n_features_to_select=2)
# Fit the RFE model to the data
rfe.fit(X, y)
# Get the selected feature names
selected_features = X.columns[rfe.get_support()]
print(selected_features)
Index(['latitude', 'longitude'], dtype='object')
5) Mutual Information:
from sklearn.feature_selection import SelectKBest, mutual_info_regression
# Define X and y variables
X = df[['housing_median_age', 'total_rooms', 'total_bedrooms','households', 'median_income', 'latitude', 'longitude']]
y = df['median_house_value']  # 1-d target, as mutual_info_regression expects
# Select the top 2 features using mutual information test
selector = SelectKBest(mutual_info_regression, k=2)
X_new = selector.fit_transform(X, y)
# Get the selected feature names
selected_features = X.columns[selector.get_support()]
print(selected_features)
Index(['median_income', 'longitude'], dtype='object')
6) Principal Component Analysis:
from sklearn.decomposition import PCA
X = df[['housing_median_age', 'total_rooms', 'total_bedrooms','households', 'median_income', 'latitude', 'longitude']]
pca = PCA(n_components=2)
X_transformed = pca.fit_transform(X)
print(X_transformed)
#plot the transformed data
plt.scatter(X_transformed[:,0], X_transformed[:,1])
plt.show()
[[-1836.81197904 -124.91993175] [ 4537.80814312 -222.25634601] [-1247.62924293 -185.89662395] ... [ -390.7469054 7.93750154] [ -800.20757763 -9.16819524] [ 163.52162585 40.12447885]]
7) Feature Importance
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
# Define X and y variables
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_classes=2)
print(X.shape)
print(y.shape)
(1000, 10)
(1000,)
# Initialize Random Forest Classifier
rfc = RandomForestClassifier()
# Fit the model to the data
rfc.fit(X, y)
# Print the feature importance scores
print(rfc.feature_importances_)
[0.03233567 0.10300892 0.17113546 0.03582479 0.10209856 0.16777026 0.11737581 0.03440321 0.13959609 0.09645123]
#plot the feature importance using bar plot
plt.bar([i for i in range(len(rfc.feature_importances_))], rfc.feature_importances_)
plt.show()