Machine Learning Challenge: Day 6
Exploring and Understanding Your Data: Techniques and Tools for Machine Learning
Machine learning is a powerful tool that allows us to make predictions and decisions based on data. One of the first steps in any machine learning project is data exploration, which involves analyzing and understanding the data that will be used to train and evaluate the model.
import pandas as pd
# read in data
data = pd.read_csv('housing_data.csv')
# explore data
data.head()
| | suburb | rooms | type | price | method | seller_g | date | distance | postcode | bedroom2 | ... | car | landsize | building_area | year_built | council_area | latitude | longitude | region_name | property_count | yr_qtr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbotsford | 2 | h | NaN | SS | Jellis | 2016-09-03 | 2.5 | 3067.0 | 2.0 | ... | 1.0 | 126.0 | NaN | NaN | Yarra City Council | -37.8014 | 144.9958 | Northern Metropolitan | 4019.0 | 2016.3 |
| 1 | Abbotsford | 2 | h | 1480000.0 | S | Biggin | 2016-12-03 | 2.5 | 3067.0 | 2.0 | ... | 1.0 | 202.0 | NaN | NaN | Yarra City Council | -37.7996 | 144.9984 | Northern Metropolitan | 4019.0 | 2016.4 |
| 2 | Abbotsford | 2 | h | 1035000.0 | S | Biggin | 2016-02-04 | 2.5 | 3067.0 | 2.0 | ... | 0.0 | 156.0 | 79.0 | 1900.0 | Yarra City Council | -37.8079 | 144.9934 | Northern Metropolitan | 4019.0 | 2016.1 |
| 3 | Abbotsford | 3 | u | NaN | VB | Rounds | 2016-02-04 | 2.5 | 3067.0 | 3.0 | ... | 1.0 | 0.0 | NaN | NaN | Yarra City Council | -37.8114 | 145.0116 | Northern Metropolitan | 4019.0 | 2016.1 |
| 4 | Abbotsford | 3 | h | 1465000.0 | SP | Biggin | 2017-03-04 | 2.5 | 3067.0 | 3.0 | ... | 0.0 | 134.0 | 150.0 | 1900.0 | Yarra City Council | -37.8093 | 144.9944 | Northern Metropolitan | 4019.0 | 2017.1 |
5 rows × 21 columns
# show each column's dtype and non-null count
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34857 entries, 0 to 34856
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   suburb          34857 non-null  object
 1   rooms           34857 non-null  int64
 2   type            34857 non-null  object
 3   price           27247 non-null  float64
 4   method          34857 non-null  object
 5   seller_g        34857 non-null  object
 6   date            34857 non-null  object
 7   distance        34856 non-null  float64
 8   postcode        34856 non-null  float64
 9   bedroom2        26640 non-null  float64
 10  bathroom        26631 non-null  float64
 11  car             26129 non-null  float64
 12  landsize        23047 non-null  float64
 13  building_area   13742 non-null  float64
 14  year_built      15551 non-null  float64
 15  council_area    34854 non-null  object
 16  latitude        26881 non-null  float64
 17  longitude       26881 non-null  float64
 18  region_name     34854 non-null  object
 19  property_count  34854 non-null  float64
 20  yr_qtr          34857 non-null  float64
dtypes: float64(13), int64(1), object(7)
memory usage: 5.6+ MB
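The non-null counts above show that several columns are far from complete: building_area has 13742 values out of 34857 rows, so roughly 60% of it is missing. A minimal sketch of how to rank columns by their share of missing values is shown here. It runs on a small made-up frame (`toy`, a hypothetical stand-in for the housing data) so that it works without the CSV file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; the values are made up for illustration.
toy = pd.DataFrame({
    "rooms": [2, 3, 4, 3],
    "price": [1480000.0, np.nan, 1465000.0, np.nan],
    "building_area": [np.nan, np.nan, 150.0, np.nan],
})

# isna() marks missing cells; the column mean of those booleans
# is the fraction of missing entries per column.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data, the same expression with `data` in place of `toy` should give the missing share for all 21 columns.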
Data Size & Summary Stats:
Data size matters because the amount of data available affects how well a model can be trained and evaluated. Summary statistics such as the mean, median, and standard deviation give a quick overview of each numeric column and help flag outliers or anomalies. In the table below, for example, year_built ranges from 1196 to 2106 and bedroom2 reaches 30, which most likely reflects data-entry errors rather than real properties.
data.shape
(34857, 21)
data.describe()
| | rooms | price | distance | postcode | bedroom2 | bathroom | car | landsize | building_area | year_built | latitude | longitude | property_count | yr_qtr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 34857.000000 | 2.724700e+04 | 34856.000000 | 34856.000000 | 26640.000000 | 26631.000000 | 26129.000000 | 23047.000000 | 13742.00000 | 15551.000000 | 26881.000000 | 26881.000000 | 34854.000000 | 34857.000000 |
| mean | 3.031012 | 1.050173e+06 | 11.184929 | 3116.062859 | 3.084647 | 1.624798 | 1.728845 | 593.598993 | 160.25640 | 1965.289885 | -37.810634 | 145.001851 | 7572.888306 | 2017.108357 |
| std | 0.969933 | 6.414671e+05 | 6.788892 | 109.023903 | 0.980690 | 0.724212 | 1.010771 | 3398.841946 | 401.26706 | 37.328178 | 0.090279 | 0.120169 | 4428.090313 | 0.592372 |
| min | 1.000000 | 8.500000e+04 | 0.000000 | 3000.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 1196.000000 | -38.190430 | 144.423790 | 83.000000 | 2016.100000 |
| 25% | 2.000000 | 6.350000e+05 | 6.400000 | 3051.000000 | 2.000000 | 1.000000 | 1.000000 | 224.000000 | 102.00000 | 1940.000000 | -37.862950 | 144.933500 | 4385.000000 | 2016.400000 |
| 50% | 3.000000 | 8.700000e+05 | 10.300000 | 3103.000000 | 3.000000 | 2.000000 | 2.000000 | 521.000000 | 136.00000 | 1970.000000 | -37.807600 | 145.007800 | 6763.000000 | 2017.300000 |
| 75% | 4.000000 | 1.295000e+06 | 14.000000 | 3156.000000 | 4.000000 | 2.000000 | 2.000000 | 670.000000 | 188.00000 | 2000.000000 | -37.754100 | 145.071900 | 10412.000000 | 2017.400000 |
| max | 16.000000 | 1.120000e+07 | 48.100000 | 3978.000000 | 30.000000 | 12.000000 | 26.000000 | 433014.000000 | 44515.00000 | 2106.000000 | -37.390200 | 145.526350 | 21650.000000 | 2018.100000 |
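One common way to turn the quartiles in this table into an outlier check is the 1.5 × IQR rule. The sketch below applies it to a short made-up series; the values 1196 and 2106 are borrowed from the min and max of year_built above, and everything else is invented for illustration. It is one possible rule of thumb, not the only way to define an outlier.

```python
import pandas as pd

# Made-up construction years plus the two implausible extremes
# reported by describe() for year_built.
year_built = pd.Series(
    [1950, 1960, 1965, 1970, 1975, 1980, 1196, 2106], name="year_built"
)

# Interquartile range and the conventional 1.5 * IQR fences.
q1, q3 = year_built.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the values that fall outside the fences.
outliers = year_built[(year_built < lower) | (year_built > upper)]
print(outliers.tolist())
```

For a column like year_built, a plain plausibility check (for instance, no year later than the last sale date) may be just as useful as a statistical fence.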
Histogram
Histograms are a useful tool for visualizing the distribution of a dataset. They provide a graphical representation of the frequency of different values in a dataset, and can help identify patterns or trends.
import matplotlib.pyplot as plt
data.hist(column='price', bins=10)
plt.show()
# create a grid of histograms, one per numeric column
data.hist(bins=10, figsize=(20,15))
plt.show()
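With `bins=10`, the value range is cut into ten equal-width intervals and the observations in each are counted. The sketch below reproduces that counting with `numpy.histogram` on made-up, right-skewed prices (not taken from the dataset, apart from the 11.2 million maximum). It also shows one possible remedy for skew, binning the base-10 logarithm instead.

```python
import numpy as np

# Made-up prices, right-skewed like the price column.
prices = np.array([400_000, 550_000, 635_000, 700_000, 870_000,
                   900_000, 1_295_000, 1_500_000, 2_400_000, 11_200_000])

# Ten equal-width bins: the single extreme value stretches the range,
# so most observations pile up in the first bin.
counts, edges = np.histogram(prices, bins=10)
print(counts)

# Equal-width bins on log10(price) spread the bulk of the data out.
log_counts, log_edges = np.histogram(np.log10(prices), bins=10)
print(log_counts)
```

If the price histogram looks like a single tall bar, plotting `np.log10(data['price'])` or increasing `bins` is a reasonable thing to try.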
Scatter Plot & Joint Plot
Scatter plots allow us to visualize the relationship between two variables, and can be useful for identifying patterns or outliers in the data. Joint plots are similar to scatter plots, but also include histograms of the individual variables.
import seaborn as sns
sns.scatterplot(x='price', y='property_count', data=data)
plt.show()
# create a joint plot
sns.jointplot(x='price', y='property_count', data=data)
plt.show()
Pair Grid & Box and Violin Plots
Pair grids, box plots, and violin plots are further tools for visualizing relationships between variables. A pair grid draws a scatter plot for every pair of numeric variables at once. A box plot summarizes a variable's distribution through its median, quartiles, and outliers, and a violin plot adds a density estimate, so both are most informative when drawn per category.
# create a pair plot
sns.pairplot(data)
plt.show()
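# create a box plot of price for each property type
# (x should be categorical; a continuous x draws one box per distinct value)
sns.boxplot(x='type', y='price', data=data)
plt.show()
# create a violin plot of the same grouping
sns.violinplot(x='type', y='price', data=data)
plt.show()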
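The numbers a box plot draws can also be computed directly, which helps when reading the figure. The sketch below uses a made-up frame (`toy`, hypothetical values) and `groupby(...).describe()` to get the minimum, quartiles, and maximum of price per property type. Note that seaborn's whiskers usually stop at 1.5 × IQR rather than at the min and max, so this is an approximation of what is drawn.

```python
import pandas as pd

# Made-up prices for two property types ("h" and "u" as in the type column).
toy = pd.DataFrame({
    "type": ["h", "h", "h", "h", "u", "u", "u", "u"],
    "price": [900_000, 1_035_000, 1_465_000, 1_480_000,
              400_000, 520_000, 610_000, 750_000],
})

# describe() per group returns count, mean, std, min, quartiles, and max;
# keep only the five values a box plot is built around.
summary = toy.groupby("type")["price"].describe()[["min", "25%", "50%", "75%", "max"]]
print(summary)
```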
Comparing Two Ordinal Values
Two ordinal or categorical variables can be compared with grouped or stacked bar plots. The count plot below shows how many listings have each number of rooms, split by property type.
data.columns
Index(['suburb', 'rooms', 'type', 'price', 'method', 'seller_g', 'date',
'distance', 'postcode', 'bedroom2', 'bathroom', 'car', 'landsize',
'building_area', 'year_built', 'council_area', 'latitude', 'longitude',
'region_name', 'property_count', 'yr_qtr'],
dtype='object')
sns.countplot(x='rooms', hue='type', data=data)
plt.show()
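The counts behind such a plot can be tabulated with `pandas.crosstab`. The sketch below uses a tiny made-up frame (`toy`, hypothetical values) and also computes row shares, which are easier to compare when the groups differ in size.

```python
import pandas as pd

# Made-up listings: number of rooms and property type.
toy = pd.DataFrame({
    "rooms": [2, 2, 3, 3, 3, 4],
    "type": ["h", "u", "h", "h", "u", "h"],
})

# Absolute counts for every (rooms, type) combination.
counts = pd.crosstab(toy["rooms"], toy["type"])
print(counts)

# Share of each type within a given number of rooms (rows sum to 1).
row_share = pd.crosstab(toy["rooms"], toy["type"], normalize="index")
print(row_share)
```

Calling `counts.plot(kind="bar", stacked=True)` on the result is one way to get the stacked bar plot mentioned above.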
Correlation, RadViz and Parallel Coordinates
Correlation measures the strength of the linear relationship between two variables on a scale from -1 to 1. In the matrix below, for instance, rooms and bedroom2 correlate at about 0.95, so the two columns carry nearly the same information. RadViz is a visualization for high-dimensional data: each feature becomes an anchor on a circle, and every observation is placed inside the circle according to its feature values. Parallel coordinates plots also show high-dimensional data, drawing each variable on its own vertical axis and each observation as a line connecting its values. Both help reveal relationships between variables and make outliers stand out.
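# numeric_only=True skips the text columns; pandas 2.0+ raises an error without it
corr = data.corr(numeric_only=True)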
print(corr)
rooms price distance postcode bedroom2 bathroom \
rooms 1.000000 0.465238 0.271511 0.085890 0.946755 0.611826
price 0.465238 1.000000 -0.211384 0.044950 0.430275 0.429878
distance 0.271511 -0.211384 1.000000 0.481566 0.269524 0.126201
postcode 0.085890 0.044950 0.481566 1.000000 0.089292 0.120080
bedroom2 0.946755 0.430275 0.269524 0.089292 1.000000 0.614892
bathroom 0.611826 0.429878 0.126201 0.120080 0.614892 1.000000
car 0.393878 0.201803 0.241835 0.067886 0.388491 0.307518
landsize 0.037402 0.032748 0.060862 0.040664 0.037019 0.036333
building_area 0.156229 0.100754 0.076301 0.042437 0.154157 0.147558
year_built -0.012749 -0.333306 0.323059 0.089805 -0.002022 0.167955
latitude 0.004872 -0.215607 -0.100417 -0.231027 0.003447 -0.059183
longitude 0.103235 0.197874 0.200946 0.362895 0.106164 0.106531
property_count -0.071677 -0.059017 -0.018140 0.017108 -0.053451 -0.032887
yr_qtr 0.091880 -0.020871 0.252297 0.109255 0.202242 0.102500
car landsize building_area year_built latitude \
rooms 0.393878 0.037402 0.156229 -0.012749 0.004872
price 0.201803 0.032748 0.100754 -0.333306 -0.215607
distance 0.241835 0.060862 0.076301 0.323059 -0.100417
postcode 0.067886 0.040664 0.042437 0.089805 -0.231027
bedroom2 0.388491 0.037019 0.154157 -0.002022 0.003447
bathroom 0.307518 0.036333 0.147558 0.167955 -0.059183
car 1.000000 0.037829 0.104373 0.128702 -0.009020
landsize 0.037829 1.000000 0.354530 0.044474 0.025318
building_area 0.104373 0.354530 1.000000 0.067811 0.017155
year_built 0.128702 0.044474 0.067811 1.000000 0.091592
latitude -0.009020 0.025318 0.017155 0.091592 1.000000
longitude 0.047213 -0.002582 -0.002143 -0.022175 -0.345589
property_count -0.009617 -0.018195 -0.024523 0.022420 0.011112
yr_qtr 0.155759 0.030151 0.025467 0.101301 0.023835
longitude property_count yr_qtr
rooms 0.103235 -0.071677 0.091880
price 0.197874 -0.059017 -0.020871
distance 0.200946 -0.018140 0.252297
postcode 0.362895 0.017108 0.109255
bedroom2 0.106164 -0.053451 0.202242
bathroom 0.106531 -0.032887 0.102500
car 0.047213 -0.009617 0.155759
landsize -0.002582 -0.018195 0.030151
building_area -0.002143 -0.024523 0.025467
year_built -0.022175 0.022420 0.101301
latitude -0.345589 0.011112 0.023835
longitude 1.000000 0.016326 0.050221
property_count 0.016326 1.000000 0.013224
yr_qtr 0.050221 0.013224 1.000000
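A full correlation matrix is hard to scan, so a common next step is to pull out a single column and sort it. The sketch below does this on synthetic data (`toy`, generated with a fixed random seed; the coefficients are invented and not estimated from the housing data), ranking features by the absolute value of their correlation with price.

```python
import numpy as np
import pandas as pd

# Synthetic data: price depends on rooms and distance, bedroom2 nearly copies rooms.
rng = np.random.default_rng(0)
n = 200
rooms = rng.integers(1, 6, size=n)
toy = pd.DataFrame({
    "rooms": rooms,
    "bedroom2": rooms + rng.integers(0, 2, size=n),
    "distance": rng.uniform(0, 40, size=n),
    "suburb": ["Abbotsford"] * n,  # text column, excluded by numeric_only
})
toy["price"] = (300_000 * toy["rooms"] - 10_000 * toy["distance"]
                + rng.normal(0, 50_000, size=n))

corr = toy.corr(numeric_only=True)

# Order the features by |correlation with price| but keep the sign in the output.
order = corr["price"].drop("price").abs().sort_values(ascending=False).index
by_price = corr.loc[order, "price"]
print(by_price)
```

Pairs such as rooms and bedroom2 that correlate strongly with each other are candidates for dropping one of the two before modelling.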
# for RadViz we switch to Yellowbrick's occupancy dataset, whose five features include temperature and humidity
from yellowbrick.datasets import load_occupancy
from yellowbrick.features import RadViz
# Load the classification dataset
X, y = load_occupancy()
# Specify the target classes
classes = ["unoccupied", "occupied"]
# Instantiate the visualizer
visualizer = RadViz(classes=classes)
visualizer.fit(X, y) # Fit the data to the visualizer
visualizer.transform(X) # Transform the data
visualizer.show() # Finalize and render the figure
<AxesSubplot:title={'center':'RadViz for 5 Features'}>
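To make the figure easier to interpret, here is a minimal NumPy sketch of the usual RadViz projection: each feature is scaled to [0, 1], given an anchor on the unit circle, and each observation is drawn at the weighted average of the anchors. The function name `radviz_project` is hypothetical, and this illustrates the general idea rather than Yellowbrick's actual implementation.

```python
import numpy as np

def radviz_project(X):
    """Project rows of X (n_samples, n_features) to 2-D RadViz coordinates."""
    X = np.asarray(X, dtype=float)
    # Min-max scale each feature to [0, 1]; constant columns map to 0.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    scaled = (X - X.min(axis=0)) / span
    # One anchor per feature, evenly spaced on the unit circle.
    angles = 2 * np.pi * np.arange(X.shape[1]) / X.shape[1]
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])
    # Each point is the average of the anchors, weighted by its scaled features.
    weights = scaled.sum(axis=1, keepdims=True)
    weights[weights == 0] = 1.0  # all-zero rows stay at the origin
    return scaled @ anchors / weights

# Three made-up observations with three features.
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
points = radviz_project(X)
print(points)
```

An observation dominated by one feature lands near that feature's anchor, while one with balanced features lands near the centre, which is why well-separated classes form distinct clusters in the plot.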
import plotly.express as px
fig = px.parallel_coordinates(data, color="rooms",
dimensions=['price', 'property_count', 'landsize','building_area'],
color_continuous_scale=px.colors.diverging.Tealrose,
color_continuous_midpoint=2)
fig.show()
The End