
Exploring and Understanding Your Data: Techniques and Tools for Machine Learning

Machine learning is a powerful tool that allows us to make predictions and decisions based on data. One of the first steps in any machine learning project is data exploration, which involves analyzing and understanding the data that will be used to train and evaluate the model.

In [1]:

import pandas as pd
# read in data
data = pd.read_csv('housing_data.csv')
# explore data
data.head()

Out[1]:

	suburb	rooms	type	price	method	seller_g	date	distance	postcode	bedroom2	...	car	landsize	building_area	year_built	council_area	latitude	longitude	region_name	property_count	yr_qtr
0	Abbotsford	2	h	NaN	SS	Jellis	2016-09-03	2.5	3067.0	2.0	...	1.0	126.0	NaN	NaN	Yarra City Council	-37.8014	144.9958	Northern Metropolitan	4019.0	2016.3
1	Abbotsford	2	h	1480000.0	S	Biggin	2016-12-03	2.5	3067.0	2.0	...	1.0	202.0	NaN	NaN	Yarra City Council	-37.7996	144.9984	Northern Metropolitan	4019.0	2016.4
2	Abbotsford	2	h	1035000.0	S	Biggin	2016-02-04	2.5	3067.0	2.0	...	0.0	156.0	79.0	1900.0	Yarra City Council	-37.8079	144.9934	Northern Metropolitan	4019.0	2016.1
3	Abbotsford	3	u	NaN	VB	Rounds	2016-02-04	2.5	3067.0	3.0	...	1.0	0.0	NaN	NaN	Yarra City Council	-37.8114	145.0116	Northern Metropolitan	4019.0	2016.1
4	Abbotsford	3	h	1465000.0	SP	Biggin	2017-03-04	2.5	3067.0	3.0	...	0.0	134.0	150.0	1900.0	Yarra City Council	-37.8093	144.9944	Northern Metropolitan	4019.0	2017.1

5 rows × 21 columns

In [2]:

# show column dtypes and non-null counts
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34857 entries, 0 to 34856
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   suburb          34857 non-null  object 
 1   rooms           34857 non-null  int64  
 2   type            34857 non-null  object 
 3   price           27247 non-null  float64
 4   method          34857 non-null  object 
 5   seller_g        34857 non-null  object 
 6   date            34857 non-null  object 
 7   distance        34856 non-null  float64
 8   postcode        34856 non-null  float64
 9   bedroom2        26640 non-null  float64
 10  bathroom        26631 non-null  float64
 11  car             26129 non-null  float64
 12  landsize        23047 non-null  float64
 13  building_area   13742 non-null  float64
 14  year_built      15551 non-null  float64
 15  council_area    34854 non-null  object 
 16  latitude        26881 non-null  float64
 17  longitude       26881 non-null  float64
 18  region_name     34854 non-null  object 
 19  property_count  34854 non-null  float64
 20  yr_qtr          34857 non-null  float64
dtypes: float64(13), int64(1), object(7)
memory usage: 5.6+ MB
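
The info() output shows that several columns, building_area and year_built in particular, have far fewer non-null entries than the 34857 rows. A minimal sketch of how one might rank columns by their share of missing values follows. The small toy frame is made up for illustration and only borrows column names from the housing data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing data (values are made up)
toy = pd.DataFrame({
    "rooms": [2, 2, 3, 3],
    "price": [np.nan, 1480000.0, 1035000.0, np.nan],
    "building_area": [np.nan, np.nan, 79.0, np.nan],
})

# isna() marks missing cells; the column mean of that mask is the missing share
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Running the same two calls on the real data frame should give the missing share for each of the 21 columns.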

Data Size & Summary Stats

The size of a dataset matters because the amount of data available affects how well a machine learning model can perform. Summary statistics such as the mean, median, and standard deviation give a quick overview of each numeric column and help reveal outliers or anomalies. In the describe() output below, for example, year_built ranges from 1196 to 2106, which points to data-entry errors.

In [3]:

data.shape

Out[3]:

(34857, 21)

In [4]:

data.describe()

Out[4]:

	rooms	price	distance	postcode	bedroom2	bathroom	car	landsize	building_area	year_built	latitude	longitude	property_count	yr_qtr
count	34857.000000	2.724700e+04	34856.000000	34856.000000	26640.000000	26631.000000	26129.000000	23047.000000	13742.00000	15551.000000	26881.000000	26881.000000	34854.000000	34857.000000
mean	3.031012	1.050173e+06	11.184929	3116.062859	3.084647	1.624798	1.728845	593.598993	160.25640	1965.289885	-37.810634	145.001851	7572.888306	2017.108357
std	0.969933	6.414671e+05	6.788892	109.023903	0.980690	0.724212	1.010771	3398.841946	401.26706	37.328178	0.090279	0.120169	4428.090313	0.592372
min	1.000000	8.500000e+04	0.000000	3000.000000	0.000000	0.000000	0.000000	0.000000	0.00000	1196.000000	-38.190430	144.423790	83.000000	2016.100000
25%	2.000000	6.350000e+05	6.400000	3051.000000	2.000000	1.000000	1.000000	224.000000	102.00000	1940.000000	-37.862950	144.933500	4385.000000	2016.400000
50%	3.000000	8.700000e+05	10.300000	3103.000000	3.000000	2.000000	2.000000	521.000000	136.00000	1970.000000	-37.807600	145.007800	6763.000000	2017.300000
75%	4.000000	1.295000e+06	14.000000	3156.000000	4.000000	2.000000	2.000000	670.000000	188.00000	2000.000000	-37.754100	145.071900	10412.000000	2017.400000
max	16.000000	1.120000e+07	48.100000	3978.000000	30.000000	12.000000	26.000000	433014.000000	44515.00000	2106.000000	-37.390200	145.526350	21650.000000	2018.100000
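
The landsize column above has a maximum of 433014 while its 75th percentile is only 670, which suggests extreme values. One common way to flag such values is the 1.5 × IQR rule. The sketch below applies it to a handful of made-up land sizes; the rule itself is a convention chosen here for illustration, not something the original notebook uses.

```python
import pandas as pd

# Hypothetical land sizes (made up); one value is far from the rest
landsize = pd.Series([126.0, 202.0, 156.0, 134.0, 521.0, 670.0, 433014.0])

q1, q3 = landsize.quantile(0.25), landsize.quantile(0.75)
iqr = q3 - q1

# Tukey's rule: flag values more than 1.5 * IQR outside the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = landsize[(landsize < lower) | (landsize > upper)]
print(outliers)
```

Whether a flagged value is an error or a genuinely large property still needs a manual check.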

Histogram

A histogram visualizes the distribution of a variable by grouping its values into bins and drawing a bar whose height is the number of values in each bin. It can help identify skew, gaps, and unusual peaks in the data.

In [ ]:

import matplotlib.pyplot as plt

data.hist(column='price', bins=10)
plt.show()

In [6]:

# create a histogram for every numeric column
data.hist(bins=10, figsize=(20,15))
plt.show()
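
To see what a histogram computes before it is drawn, the binning can be reproduced with NumPy. This is a small sketch on made-up prices, assuming equal-width bins as in the hist() calls above.

```python
import numpy as np

# Hypothetical sale prices (made up for illustration)
prices = np.array([635000, 870000, 1035000, 1295000, 1465000, 1480000])

# Split the value range into 3 equal-width bins and count the values in each
counts, edges = np.histogram(prices, bins=3)
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>12,.0f} to {right:>12,.0f}: {n}")
```

Each printed count corresponds to the height of one bar in the plotted histogram.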

Scatter Plot & Joint Plot

Scatter plots allow us to visualize the relationship between two variables, and can be useful for identifying patterns or outliers in the data. Joint plots are similar to scatter plots, but also include histograms of the individual variables.

In [ ]:

import seaborn as sns
sns.scatterplot(x='price', y='property_count', data=data)
plt.show()

In [8]:

# create a joint plot
sns.jointplot(x='price', y='property_count', data=data)
plt.show()

Pair Grid & Box and Violin Plots

A pair grid plots every pair of numeric variables against each other in a single figure. Box plots and violin plots summarize the distribution of a numeric variable, usually split by a category: a box plot shows the median, quartiles, and outliers, while a violin plot adds a smoothed estimate of the density.

In [9]:

# create a pair plot
sns.pairplot(data)
plt.show()

In [11]:

# create a box plot of price for each property type
sns.boxplot(x='type', y='price', data=data)
plt.show()

# create a violin plot of the same grouping on its own figure
sns.violinplot(x='type', y='price', data=data)
plt.show()

Comparing Two Ordinal Values

Two categorical or ordinal variables, such as the number of rooms and the property type, can be compared using count plots, bar plots, or stacked bar plots.

In [7]:

data.columns

Out[7]:

Index(['suburb', 'rooms', 'type', 'price', 'method', 'seller_g', 'date',
       'distance', 'postcode', 'bedroom2', 'bathroom', 'car', 'landsize',
       'building_area', 'year_built', 'council_area', 'latitude', 'longitude',
       'region_name', 'property_count', 'yr_qtr'],
      dtype='object')

In [14]:

sns.countplot(x='rooms', hue='type', data=data)
plt.show()
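
The counts behind such a plot can also be inspected as a table. A minimal sketch using pandas.crosstab on made-up rows follows; the 'h' and 'u' codes mirror the type values visible in the data preview.

```python
import pandas as pd

# Hypothetical listings (made up); 'h' and 'u' are type codes seen in the data
toy = pd.DataFrame({
    "rooms": [2, 2, 3, 3, 3, 2],
    "type":  ["h", "u", "h", "h", "u", "h"],
})

# Rows are room counts, columns are property types, cells are frequencies
table = pd.crosstab(toy["rooms"], toy["type"])
print(table)
```

Calling table.plot(kind='bar', stacked=True) on a table like this would draw the stacked bar plot mentioned above.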

Correlation, RadViz and Parallel Coordinates

Correlation measures the strength of the linear relationship between two variables: values near 1 or -1 indicate a strong relationship, and values near 0 a weak one. RadViz is a visualization that places each feature on the edge of a circle and plots every observation inside it, which makes it useful for spotting patterns in high-dimensional data. Finally, parallel coordinates plots visualize high-dimensional data by drawing each variable on its own axis and connecting each observation's values with a line. They help identify relationships between variables as well as outliers.

In [15]:

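# numeric_only=True skips text columns such as suburb; pandas 2.0+ raises an error without it
corr = data.corr(numeric_only=True)
print(corr)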

                   rooms     price  distance  postcode  bedroom2  bathroom  \
rooms           1.000000  0.465238  0.271511  0.085890  0.946755  0.611826   
price           0.465238  1.000000 -0.211384  0.044950  0.430275  0.429878   
distance        0.271511 -0.211384  1.000000  0.481566  0.269524  0.126201   
postcode        0.085890  0.044950  0.481566  1.000000  0.089292  0.120080   
bedroom2        0.946755  0.430275  0.269524  0.089292  1.000000  0.614892   
bathroom        0.611826  0.429878  0.126201  0.120080  0.614892  1.000000   
car             0.393878  0.201803  0.241835  0.067886  0.388491  0.307518   
landsize        0.037402  0.032748  0.060862  0.040664  0.037019  0.036333   
building_area   0.156229  0.100754  0.076301  0.042437  0.154157  0.147558   
year_built     -0.012749 -0.333306  0.323059  0.089805 -0.002022  0.167955   
latitude        0.004872 -0.215607 -0.100417 -0.231027  0.003447 -0.059183   
longitude       0.103235  0.197874  0.200946  0.362895  0.106164  0.106531   
property_count -0.071677 -0.059017 -0.018140  0.017108 -0.053451 -0.032887   
yr_qtr          0.091880 -0.020871  0.252297  0.109255  0.202242  0.102500   

                     car  landsize  building_area  year_built  latitude  \
rooms           0.393878  0.037402       0.156229   -0.012749  0.004872   
price           0.201803  0.032748       0.100754   -0.333306 -0.215607   
distance        0.241835  0.060862       0.076301    0.323059 -0.100417   
postcode        0.067886  0.040664       0.042437    0.089805 -0.231027   
bedroom2        0.388491  0.037019       0.154157   -0.002022  0.003447   
bathroom        0.307518  0.036333       0.147558    0.167955 -0.059183   
car             1.000000  0.037829       0.104373    0.128702 -0.009020   
landsize        0.037829  1.000000       0.354530    0.044474  0.025318   
building_area   0.104373  0.354530       1.000000    0.067811  0.017155   
year_built      0.128702  0.044474       0.067811    1.000000  0.091592   
latitude       -0.009020  0.025318       0.017155    0.091592  1.000000   
longitude       0.047213 -0.002582      -0.002143   -0.022175 -0.345589   
property_count -0.009617 -0.018195      -0.024523    0.022420  0.011112   
yr_qtr          0.155759  0.030151       0.025467    0.101301  0.023835   

                longitude  property_count    yr_qtr  
rooms            0.103235       -0.071677  0.091880  
price            0.197874       -0.059017 -0.020871  
distance         0.200946       -0.018140  0.252297  
postcode         0.362895        0.017108  0.109255  
bedroom2         0.106164       -0.053451  0.202242  
bathroom         0.106531       -0.032887  0.102500  
car              0.047213       -0.009617  0.155759  
landsize        -0.002582       -0.018195  0.030151  
building_area   -0.002143       -0.024523  0.025467  
year_built      -0.022175        0.022420  0.101301  
latitude        -0.345589        0.011112  0.023835  
longitude        1.000000        0.016326  0.050221  
property_count   0.016326        1.000000  0.013224  
yr_qtr           0.050221        0.013224  1.000000
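
Each entry in the matrix above is a Pearson correlation coefficient, which is what corr() computes by default. As a sketch of what that number means, the snippet below computes it by hand for two made-up columns and compares the result with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical values (made up); not taken from the housing data
toy = pd.DataFrame({
    "rooms": [1, 2, 3, 4, 5],
    "price": [400000.0, 650000.0, 800000.0, 1200000.0, 1250000.0],
})

x, y = toy["rooms"], toy["price"]
dx, dy = x - x.mean(), y - y.mean()

# Pearson r: sum of co-deviations over the product of root sums of squares
r_manual = (dx * dy).sum() / (np.sqrt((dx ** 2).sum()) * np.sqrt((dy ** 2).sum()))

# pandas computes the same quantity
r_pandas = x.corr(y)
print(round(r_manual, 4), round(r_pandas, 4))
```

The two printed values should agree up to floating-point rounding.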

In [57]:

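# RadViz needs class labels, so this example uses Yellowbrick's occupancy dataset (temperature, humidity and other sensor features) instead of the housing data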
from yellowbrick.datasets import load_occupancy
from yellowbrick.features import RadViz

# Load the classification dataset
X, y = load_occupancy()

# Specify the target classes
classes = ["unoccupied", "occupied"]

# Instantiate the visualizer
visualizer = RadViz(classes=classes)

visualizer.fit(X, y)           # Fit the data to the visualizer
visualizer.transform(X)        # Transform the data
visualizer.show()              # Finalize and render the figure

Out[57]:

<AxesSubplot:title={'center':'RadViz for 5 Features'}>
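
To make the projection less of a black box, here is a rough sketch of the general RadViz idea in NumPy: scale each feature to [0, 1], place one anchor per feature on the unit circle, and plot each observation at the weighted average of the anchors. The helper name radviz_points and the toy data are hypothetical, and this is not Yellowbrick's implementation.

```python
import numpy as np

def radviz_points(X):
    """Project rows of X (n_samples, n_features) to 2D RadViz coordinates."""
    X = np.asarray(X, dtype=float)
    # Scale every feature to [0, 1] so that each pulls with comparable strength
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0  # avoid dividing by zero for constant features
    W = (X - X.min(axis=0)) / span
    # Place one anchor per feature evenly around the unit circle
    angles = 2 * np.pi * np.arange(X.shape[1]) / X.shape[1]
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])
    # Each point is the weighted average of the anchors
    totals = W.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # rows with all-zero weights stay at the origin
    return (W @ anchors) / totals

# Hypothetical 3-feature data (made up)
X_toy = [[0, 0, 1], [1, 0, 0], [1, 1, 1], [0, 1, 0]]
points = radviz_points(X_toy)
print(points.round(3))
```

An observation dominated by one feature lands near that feature's anchor, while one with balanced features lands near the center of the circle.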

In [2]:

import plotly.express as px
fig = px.parallel_coordinates(data, color="rooms",
                              dimensions=['price', 'property_count', 'landsize','building_area'],
                              color_continuous_scale=px.colors.diverging.Tealrose,
                              color_continuous_midpoint=2)
fig.show()

The End

Machine Learning 30 Days Challange

Machine Learning Challenge: Day 6
