Machine Learning Challenge: Day 6
Exploring and Understanding Your Data: Techniques and Tools for Machine Learning
Machine learning is a powerful tool that allows us to make predictions and decisions based on data. One of the first steps in any machine learning project is data exploration, which involves analyzing and understanding the data that will be used to train and evaluate the model.
import pandas as pd
# read in data
data = pd.read_csv('housing_data.csv')
# explore data
data.head()
| | suburb | rooms | type | price | method | seller_g | date | distance | postcode | bedroom2 | ... | car | landsize | building_area | year_built | council_area | latitude | longitude | region_name | property_count | yr_qtr |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Abbotsford | 2 | h | NaN | SS | Jellis | 2016-09-03 | 2.5 | 3067.0 | 2.0 | ... | 1.0 | 126.0 | NaN | NaN | Yarra City Council | -37.8014 | 144.9958 | Northern Metropolitan | 4019.0 | 2016.3 |
1 | Abbotsford | 2 | h | 1480000.0 | S | Biggin | 2016-12-03 | 2.5 | 3067.0 | 2.0 | ... | 1.0 | 202.0 | NaN | NaN | Yarra City Council | -37.7996 | 144.9984 | Northern Metropolitan | 4019.0 | 2016.4 |
2 | Abbotsford | 2 | h | 1035000.0 | S | Biggin | 2016-02-04 | 2.5 | 3067.0 | 2.0 | ... | 0.0 | 156.0 | 79.0 | 1900.0 | Yarra City Council | -37.8079 | 144.9934 | Northern Metropolitan | 4019.0 | 2016.1 |
3 | Abbotsford | 3 | u | NaN | VB | Rounds | 2016-02-04 | 2.5 | 3067.0 | 3.0 | ... | 1.0 | 0.0 | NaN | NaN | Yarra City Council | -37.8114 | 145.0116 | Northern Metropolitan | 4019.0 | 2016.1 |
4 | Abbotsford | 3 | h | 1465000.0 | SP | Biggin | 2017-03-04 | 2.5 | 3067.0 | 3.0 | ... | 0.0 | 134.0 | 150.0 | 1900.0 | Yarra City Council | -37.8093 | 144.9944 | Northern Metropolitan | 4019.0 | 2017.1 |
5 rows × 21 columns
# get column names, non-null counts and dtypes
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34857 entries, 0 to 34856
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   suburb          34857 non-null  object
 1   rooms           34857 non-null  int64
 2   type            34857 non-null  object
 3   price           27247 non-null  float64
 4   method          34857 non-null  object
 5   seller_g        34857 non-null  object
 6   date            34857 non-null  object
 7   distance        34856 non-null  float64
 8   postcode        34856 non-null  float64
 9   bedroom2        26640 non-null  float64
 10  bathroom        26631 non-null  float64
 11  car             26129 non-null  float64
 12  landsize        23047 non-null  float64
 13  building_area   13742 non-null  float64
 14  year_built      15551 non-null  float64
 15  council_area    34854 non-null  object
 16  latitude        26881 non-null  float64
 17  longitude       26881 non-null  float64
 18  region_name     34854 non-null  object
 19  property_count  34854 non-null  float64
 20  yr_qtr          34857 non-null  float64
dtypes: float64(13), int64(1), object(7)
memory usage: 5.6+ MB
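The `info()` output shows that several columns are far from complete: `building_area`, for example, has 13,742 non-null values out of 34,857 rows. A minimal sketch of how the share of missing values per column could be tabulated is given here. The small `sample` DataFrame is invented stand-in data, not rows from `housing_data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing data (values are invented).
sample = pd.DataFrame({
    "rooms": [2, 3, 4, 3],
    "price": [1480000.0, np.nan, 1465000.0, 850000.0],
    "building_area": [np.nan, np.nan, 150.0, np.nan],
})

# isna() marks missing cells; the column mean of those booleans is the
# fraction of missing values. Sort so the sparsest column comes first.
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data the same expression would be `data.isna().mean()`. Columns with a large missing share may need imputation or may be better dropped, depending on the model.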
Data Size & Summary Stats:
Data size is an important consideration in data exploration, as the amount of data available can impact the performance and accuracy of a machine learning model. Summary statistics such as mean, median and standard deviation can provide a quick overview of the data and help identify any outliers or anomalies.
data.shape
(34857, 21)
data.describe()
| | rooms | price | distance | postcode | bedroom2 | bathroom | car | landsize | building_area | year_built | latitude | longitude | property_count | yr_qtr |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 34857.000000 | 2.724700e+04 | 34856.000000 | 34856.000000 | 26640.000000 | 26631.000000 | 26129.000000 | 23047.000000 | 13742.00000 | 15551.000000 | 26881.000000 | 26881.000000 | 34854.000000 | 34857.000000 |
mean | 3.031012 | 1.050173e+06 | 11.184929 | 3116.062859 | 3.084647 | 1.624798 | 1.728845 | 593.598993 | 160.25640 | 1965.289885 | -37.810634 | 145.001851 | 7572.888306 | 2017.108357 |
std | 0.969933 | 6.414671e+05 | 6.788892 | 109.023903 | 0.980690 | 0.724212 | 1.010771 | 3398.841946 | 401.26706 | 37.328178 | 0.090279 | 0.120169 | 4428.090313 | 0.592372 |
min | 1.000000 | 8.500000e+04 | 0.000000 | 3000.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 1196.000000 | -38.190430 | 144.423790 | 83.000000 | 2016.100000 |
25% | 2.000000 | 6.350000e+05 | 6.400000 | 3051.000000 | 2.000000 | 1.000000 | 1.000000 | 224.000000 | 102.00000 | 1940.000000 | -37.862950 | 144.933500 | 4385.000000 | 2016.400000 |
50% | 3.000000 | 8.700000e+05 | 10.300000 | 3103.000000 | 3.000000 | 2.000000 | 2.000000 | 521.000000 | 136.00000 | 1970.000000 | -37.807600 | 145.007800 | 6763.000000 | 2017.300000 |
75% | 4.000000 | 1.295000e+06 | 14.000000 | 3156.000000 | 4.000000 | 2.000000 | 2.000000 | 670.000000 | 188.00000 | 2000.000000 | -37.754100 | 145.071900 | 10412.000000 | 2017.400000 |
max | 16.000000 | 1.120000e+07 | 48.100000 | 3978.000000 | 30.000000 | 12.000000 | 26.000000 | 433014.000000 | 44515.00000 | 2106.000000 | -37.390200 | 145.526350 | 21650.000000 | 2018.100000 |
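The summary table already hints at suspicious values: `landsize` has a 75th percentile of 670 but a maximum of 433,014, and `year_built` ranges from 1196 to 2106. One common heuristic for flagging such values is Tukey's 1.5 × IQR rule. The sketch below applies it to a handful of invented land sizes; the rule and its 1.5 multiplier are an illustrative choice, not something the original analysis used.

```python
import pandas as pd

# Hypothetical land sizes (invented values), including one extreme entry.
landsize = pd.Series([0.0, 224.0, 521.0, 670.0, 433014.0], name="landsize")

# Interquartile range: the spread of the middle 50% of the values.
q1, q3 = landsize.quantile(0.25), landsize.quantile(0.75)
iqr = q3 - q1

# Anything further than 1.5 * IQR outside the quartiles is flagged.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = landsize[(landsize < lower) | (landsize > upper)]
print(outliers)
```

Flagged values are candidates for inspection rather than automatic removal; a very large block of land can be genuine, whereas a build year of 2106 cannot.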
Histogram
Histograms are a useful tool for visualizing the distribution of a dataset. They provide a graphical representation of the frequency of different values in a dataset, and can help identify patterns or trends.
import matplotlib.pyplot as plt
data.hist(column='price', bins=10)
plt.show()
# create a grid of histograms, one per numeric column
data.hist(bins=10, figsize=(20,15))
plt.show()
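With only 10 bins, a strongly right-skewed column such as `price` (mean about 1.05 million, median 870,000, maximum 11.2 million) tends to pile up in the first few bars. A log transform is one common way to spread such a distribution out before plotting. The following sketch uses invented prices and compares the skewness before and after the transform.

```python
import numpy as np
import pandas as pd

# Hypothetical prices (invented values) with one very expensive property.
price = pd.Series([635000.0, 870000.0, 1035000.0,
                   1295000.0, 1480000.0, 11200000.0])

# log10 compresses the long right tail of the distribution.
log_price = np.log10(price)

# Series.skew() returns the sample skewness; values near 0 mean symmetric.
print("skew before:", price.skew())
print("skew after: ", log_price.skew())
```

On the real data, `np.log10(data['price'].dropna()).hist(bins=30)` would draw the histogram of the transformed prices.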
Scatter Plot & Joint Plot
Scatter plots allow us to visualize the relationship between two variables, and can be useful for identifying patterns or outliers in the data. Joint plots are similar to scatter plots, but also include histograms of the individual variables.
import seaborn as sns
sns.scatterplot(x='price', y='property_count', data=data)
plt.show()
# create a joint plot
sns.jointplot(x='price', y='property_count', data=data)
plt.show()
Pair Grid & Box and Violin Plots
Pair grids and box and violin plots are also useful tools for visualizing the relationship between variables. Pair grids allow us to plot multiple variables at once, while box and violin plots provide a more detailed view of the distribution of a variable.
# create a pair plot
sns.pairplot(data)
plt.show()
# create a box plot of price for each property type
sns.boxplot(x='type', y='price', data=data)
plt.show()
# create a violin plot of the same comparison on its own figure
sns.violinplot(x='type', y='price', data=data)
plt.show()
Comparing Two Ordinal Values
Two ordinal or categorical variables, such as the number of rooms and the property type, can be compared using count plots, grouped bar plots, or stacked bar plots.
data.columns
Index(['suburb', 'rooms', 'type', 'price', 'method', 'seller_g', 'date', 'distance', 'postcode', 'bedroom2', 'bathroom', 'car', 'landsize', 'building_area', 'year_built', 'council_area', 'latitude', 'longitude', 'region_name', 'property_count', 'yr_qtr'], dtype='object')
sns.countplot(x='rooms', hue='type', data=data)
plt.show()
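The numbers behind a count plot like this one can also be tabulated directly. The sketch below uses `pd.crosstab` on a few invented listings to count each combination of `rooms` and `type`, and then converts the counts to within-row shares.

```python
import pandas as pd

# Hypothetical listings (invented values); 'h' and 'u' mimic the type codes
# visible in the data preview.
sample = pd.DataFrame({
    "rooms": [2, 2, 3, 3, 3, 4],
    "type": ["h", "u", "h", "h", "u", "h"],
})

# Counts of each (rooms, type) combination.
counts = pd.crosstab(sample["rooms"], sample["type"])

# Row-normalised shares: within each room count, the fraction of each type.
shares = pd.crosstab(sample["rooms"], sample["type"], normalize="index")

print(counts)
print(shares)
```

Calling `counts.plot(kind="bar", stacked=True)` on such a table draws the stacked bar plot mentioned in the text.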
Correlation, RadViz and Parallel Coordinates
Correlation measures the strength and direction of the linear relationship between two variables, and can be useful for spotting redundant or informative features. RadViz is a visualization tool that is particularly useful for identifying patterns in high-dimensional data: it places each feature on the circumference of a circle and plots every observation inside it. Parallel coordinates plots also visualize high-dimensional data, by plotting each variable on a separate axis and connecting each observation's values with a line. They help to identify relationships between variables and to spot outliers.
# numeric_only=True skips text columns such as suburb (required on pandas 2.0+)
corr = data.corr(numeric_only=True)
print(corr)
                   rooms     price  distance  postcode  bedroom2  bathroom  \
rooms           1.000000  0.465238  0.271511  0.085890  0.946755  0.611826
price           0.465238  1.000000 -0.211384  0.044950  0.430275  0.429878
distance        0.271511 -0.211384  1.000000  0.481566  0.269524  0.126201
postcode        0.085890  0.044950  0.481566  1.000000  0.089292  0.120080
bedroom2        0.946755  0.430275  0.269524  0.089292  1.000000  0.614892
bathroom        0.611826  0.429878  0.126201  0.120080  0.614892  1.000000
car             0.393878  0.201803  0.241835  0.067886  0.388491  0.307518
landsize        0.037402  0.032748  0.060862  0.040664  0.037019  0.036333
building_area   0.156229  0.100754  0.076301  0.042437  0.154157  0.147558
year_built     -0.012749 -0.333306  0.323059  0.089805 -0.002022  0.167955
latitude        0.004872 -0.215607 -0.100417 -0.231027  0.003447 -0.059183
longitude       0.103235  0.197874  0.200946  0.362895  0.106164  0.106531
property_count -0.071677 -0.059017 -0.018140  0.017108 -0.053451 -0.032887
yr_qtr          0.091880 -0.020871  0.252297  0.109255  0.202242  0.102500

                     car  landsize  building_area  year_built  latitude  \
rooms           0.393878  0.037402       0.156229   -0.012749  0.004872
price           0.201803  0.032748       0.100754   -0.333306 -0.215607
distance        0.241835  0.060862       0.076301    0.323059 -0.100417
postcode        0.067886  0.040664       0.042437    0.089805 -0.231027
bedroom2        0.388491  0.037019       0.154157   -0.002022  0.003447
bathroom        0.307518  0.036333       0.147558    0.167955 -0.059183
car             1.000000  0.037829       0.104373    0.128702 -0.009020
landsize        0.037829  1.000000       0.354530    0.044474  0.025318
building_area   0.104373  0.354530       1.000000    0.067811  0.017155
year_built      0.128702  0.044474       0.067811    1.000000  0.091592
latitude       -0.009020  0.025318       0.017155    0.091592  1.000000
longitude       0.047213 -0.002582      -0.002143   -0.022175 -0.345589
property_count -0.009617 -0.018195      -0.024523    0.022420  0.011112
yr_qtr          0.155759  0.030151       0.025467    0.101301  0.023835

                longitude  property_count    yr_qtr
rooms            0.103235       -0.071677  0.091880
price            0.197874       -0.059017 -0.020871
distance         0.200946       -0.018140  0.252297
postcode         0.362895        0.017108  0.109255
bedroom2         0.106164       -0.053451  0.202242
bathroom         0.106531       -0.032887  0.102500
car              0.047213       -0.009617  0.155759
landsize        -0.002582       -0.018195  0.030151
building_area   -0.002143       -0.024523  0.025467
year_built      -0.022175        0.022420  0.101301
latitude        -0.345589        0.011112  0.023835
longitude        1.000000        0.016326  0.050221
property_count   0.016326        1.000000  0.013224
yr_qtr           0.050221        0.013224  1.000000
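A full correlation matrix is hard to scan by eye. When one column is the prediction target, as `price` would be here, it is often enough to look at that column alone, ordered by absolute strength. In the matrix above, for instance, `rooms` (0.465) and `year_built` (-0.333) stand out against `price`. The sketch below shows one way to produce such a ranking; the `sample` values are invented for illustration.

```python
import pandas as pd

# Hypothetical numeric sample (invented values, not rows from the dataset).
sample = pd.DataFrame({
    "price": [500000.0, 750000.0, 900000.0, 1200000.0, 1500000.0],
    "rooms": [1, 2, 3, 3, 5],
    "distance": [20.0, 15.0, 12.0, 8.0, 3.0],
})

# Pearson correlation matrix (all columns here are numeric).
corr = sample.corr()

# Correlation of every other column with price.
with_price = corr["price"].drop("price")

# Reorder by absolute value so strong negative correlations are not buried.
order = with_price.abs().sort_values(ascending=False).index
ranked = with_price.reindex(order)
print(ranked)
```

Sorting by absolute value matters because a strongly negative correlation is as informative as a strongly positive one.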
# for RadViz we use Yellowbrick's occupancy dataset (temperature, humidity and other sensor readings) instead of the housing data
from yellowbrick.datasets import load_occupancy
from yellowbrick.features import RadViz
# Load the classification dataset
X, y = load_occupancy()
# Specify the target classes
classes = ["unoccupied", "occupied"]
# Instantiate the visualizer
visualizer = RadViz(classes=classes)
visualizer.fit(X, y) # Fit the data to the visualizer
visualizer.transform(X) # Transform the data
visualizer.show() # Finalize and render the figure
<AxesSubplot:title={'center':'RadViz for 5 Features'}>
import plotly.express as px
fig = px.parallel_coordinates(data, color="rooms",
                              dimensions=['price', 'property_count', 'landsize', 'building_area'],
                              color_continuous_scale=px.colors.diverging.Tealrose,
                              color_continuous_midpoint=2)
fig.show()
The End