一、使用pandas+seaborn+pyplot进行数据展示和预处理操作
Pandas基础教学:这是前两天在Datawhale看到的一篇关于Pandas的介绍,写得非常好,记录备用
https://mp.weixin.qq.com/s/A232A6OLxrlsZUx0VGv3Ow
1. 读取与查看数据
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython import display
display.set_matplotlib_formats("svg")
读取数据
data = pd.read_feather("./data/house_sales.ftr")
data.shape
数据基本信息
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 164944 entries, 0 to 164943
Columns: 1789 entries, Id to Zip
dtypes: object(1789)
memory usage: 2.2+ GB
查看数据
data.head()
Id | Address | Sold Price | Sold On | Summary | Type | Year built | Heating | Cooling | Parking | ... | Well Disclosure | remodeled | DOH2 | SerialX | Full Baths | Tax Legal Lot Number | Tax Legal Block Number | Tax Legal Tract Number | Building Name | Zip | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2080183300 | 11205 Monterey, | $2,000,000 | 01/31/20 | 11205 Monterey, San Martin, CA 95046 is a sing... | SingleFamily | No Data | No Data | No Data | 0 spaces | ... | None | None | None | None | None | None | None | None | None | 95046 |
1 | 20926300 | 5281 Castle Rd, | $2,100,000 | 02/25/21 | Spectacular Mountain and incredible L.A. City ... | SingleFamily | 1951 | Central | Central Air, Dual | Driveway, Driveway - Brick | ... | None | None | None | None | None | None | None | None | None | 91011 |
2 | 19595300 | 3581 Butcher Dr, | $1,125,000 | 11/06/19 | Eichler Style home! with Santa Clara High! in ... | SingleFamily | 1954 | Central Forced Air - Gas | Central AC | Garage, Garage - Attached, Covered | ... | None | None | None | None | None | None | None | None | None | 95051 |
3 | 300472200 | 2021 N Milpitas Blvd, | $36,250,000 | 10/02/20 | 2021 N Milpitas Blvd, Milpitas, CA 95035 is a ... | Apartment | 1989 | Other | No Data | Mixed, Covered | ... | None | None | None | None | None | None | None | None | None | 95035 |
4 | 2074492000 | LOT 4 Tool Box Spring Rd, | $140,000 | 10/19/20 | Beautiful level lot dotted with pine trees ro... | VacantLand | No Data | No Data | No Data | 0 spaces | ... | None | None | None | None | None | None | None | None | None | 92561 |
5 rows × 1789 columns
2. 数据预处理: Drop + Filter + Replace
We drop columns that at least 30% values are null to simplify our EDA.(将缺失数据大于30%的列舍弃(舍弃这些feature))
null_sum = data.isnull().sum() # .sum默认对列进行求和
data.columns[null_sum<len(data)*0.3]
Index(['Id', 'Address', 'Sold Price', 'Sold On', 'Summary', 'Type',
'Year built', 'Heating', 'Cooling', 'Parking', 'Bedrooms', 'Bathrooms',
'Total interior livable area', 'Total spaces', 'Garage spaces',
'Home type', 'Region', 'Elementary School', 'Elementary School Score',
'Elementary School Distance', 'High School', 'High School Score',
'High School Distance', 'Heating features', 'Parking features',
'Lot size', 'Parcel number', 'Tax assessed value', 'Annual tax amount',
'Listed On', 'Listed Price', 'Zip'],
dtype='object')
data.drop(columns=data.columns[null_sum>len(data)*0.3],inplace=True) # 将缺失大于30%的列去掉,并且直接替换
数据格式
data.dtypes
Id object
Address object
Sold Price object
Sold On object
Summary object
Type object
Year built object
Heating object
Cooling object
Parking object
Bedrooms object
Bathrooms object
Total interior livable area object
Total spaces object
Garage spaces object
Home type object
Region object
Elementary School object
Elementary School Score object
Elementary School Distance object
High School object
High School Score object
High School Distance object
Heating features object
Parking features object
Lot size object
Parcel number object
Tax assessed value object
Annual tax amount object
Listed On object
Listed Price object
Zip object
dtype: object
Convert currency from string format such as $1,000,000 to float. (将string类型的数据转换为float)
currency = ['Sold Price', 'Listed Price', 'Tax assessed value', 'Annual tax amount']
for c in currency:
data[c] = data[c].replace(
r'[$,-]', '', regex=True).replace(
r'^\s*$', np.nan, regex=True).astype(float)
Also convert areas from string format such as 1000 sqft and 1 Acres to float as well.
areas = ['Total interior livable area', 'Lot size']
for c in areas:
acres = data[c].str.contains('Acres') == True
col = data[c].replace(r'\b sqft\b|\b Acres\b|\b,\b','', regex=True).astype(float)
col[acres] *= 43560
data[c] = col
3. 数据统计
Now we can check values of the numerical columns. You could see the min and max values for several columns do not make sense. (查看数据统计信息)
data.describe()
Sold Price | Total interior livable area | Lot size | Tax assessed value | Annual tax amount | Listed Price | |
---|---|---|---|---|---|---|
count | 1.648590e+05 | 1.465450e+05 | 1.358450e+05 | 1.450650e+05 | 1.433500e+05 | 1.250060e+05 |
mean | 1.194842e+06 | 3.182221e+03 | 9.525061e+05 | 8.898781e+05 | 1.123415e+04 | 1.197671e+06 |
std | 3.336365e+06 | 4.609881e+05 | 1.357197e+08 | 3.126888e+06 | 3.859389e+04 | 2.874721e+06 |
min | 1.000000e+00 | 1.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 |
25% | 4.350000e+05 | 1.170000e+03 | 4.800000e+03 | 2.550000e+05 | 3.434250e+03 | 4.990000e+05 |
50% | 8.050000e+05 | 1.558000e+03 | 6.603000e+03 | 5.635010e+05 | 7.372000e+03 | 8.490000e+05 |
75% | 1.370000e+06 | 2.144000e+03 | 1.209000e+04 | 1.033832e+06 | 1.321300e+04 | 1.395000e+06 |
max | 8.660000e+08 | 1.764164e+08 | 4.856770e+10 | 8.256328e+08 | 9.977342e+06 | 6.250000e+08 |
We filter out houses whose living areas are too small or too hard to simplify the visualization later. (将面积太大和太小的过虑出来)
abnormal = (data[areas[1]] < 10) | (data[areas[1]] > 1e4)
data = data[~abnormal]
sum(abnormal)
41000
Let’s check the histogram of the ‘Sold Price’, which is the target we want to predict.
ax = sns.histplot(np.log10(data['Sold Price']))
ax.set_xlim([3, 8])
ax.set_xticks(range(3, 9))
ax.set_xticklabels(['%.0e'%a for a in 10**ax.get_xticks()]);
A house has different types. Here are the top types:
data['Type'].value_counts()[0:20]
SingleFamily 74318
Condo 18749
MultiFamily 6586
VacantLand 6199
Townhouse 5846
Unknown 5390
MobileManufactured 2588
Apartment 1416
Cooperative 161
Residential Lot 75
Single Family 69
Single Family Lot 56
Acreage 48
2 Story 39
3 Story 25
Hi-Rise (9+), Luxury 21
RESIDENTIAL 19
Condominium 19
Duplex 19
Mid-Rise (4-8) 17
Name: Type, dtype: int64
Price density for different house types.
types = data['Type'].isin(['SingleFamily', 'Condo', 'MultiFamily', 'Townhouse'])
sns.displot(pd.DataFrame({
'Sold Price':np.log10(data[types]['Sold Price']),
'Type':data[types]['Type']}),
x='Sold Price', hue='Type', kind='kde');
Another important measurement is the sale price per living sqft. Let’s check the differences between different house types.
4. 数据可视化
data['Price per living sqft'] = data['Sold Price'] / data['Total interior livable area']
ax = sns.boxplot(x='Type', y='Price per living sqft', data=data[types], fliersize=0)
ax.set_ylim([0, 2000]);
We know the location affect the price. Let’s check the price for the top 20 zip codes.
d = data[data['Zip'].isin(data['Zip'].value_counts()[:20].keys())]
ax = sns.boxplot(x='Zip', y='Price per living sqft', data=d, fliersize=0)
ax.set_ylim([0, 2000])
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);
Last, we visualize the correlation matrix of several columns.
_, ax = plt.subplots(figsize=(6,6))
columns = ['Sold Price', 'Listed Price', 'Annual tax amount', 'Price per living sqft', 'Elementary School Score', 'High School Score']
sns.heatmap(data[columns].corr(),annot=True,cmap='RdYlGn', ax=ax);
二、总结
This notebook demonstrates the basic technologies for EDA, including
- Understanding column data types, values, and distributions
- Understanding the interactions between columns
We only explored a small aspect of the data. You are welcome to dive deep into more details.
转载:https://blog.csdn.net/jerry_liufeng/article/details/123428011