实用机器学习笔记（三）：数据预处理_小言_互联网的博客

实用机器学习笔记（三）：数据预处理

2022-03-12 17:41 608人阅读评论(0)

文章目录

一、使用pandas+seaborn+pyplot进行数据展示和预处理操作

Pandas基础教学：这是前两天在Datawhale看到的一篇关于Pandas的介绍，写得非常好，记录备用

https://mp.weixin.qq.com/s/A232A6OLxrlsZUx0VGv3Ow

1. 读取与查看数据

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns
from IPython import display
display.set_matplotlib_formats("svg")

读取数据

data = pd.read_feather("./data/house_sales.ftr")
data.shape

数据基本信息

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 164944 entries, 0 to 164943
Columns: 1789 entries, Id to Zip
dtypes: object(1789)
memory usage: 2.2+ GB

查看数据

data.head()

	Id	Address	Sold Price	Sold On	Summary	Type	Year built	Heating	Cooling	Parking	...	Well Disclosure	remodeled	DOH2	SerialX	Full Baths	Tax Legal Lot Number	Tax Legal Block Number	Tax Legal Tract Number	Building Name	Zip
0	2080183300	11205 Monterey,	$2,000,000	01/31/20	11205 Monterey, San Martin, CA 95046 is a sing...	SingleFamily	No Data	No Data	No Data	0 spaces	...	None	None	None	None	None	None	None	None	None	95046
1	20926300	5281 Castle Rd,	$2,100,000	02/25/21	Spectacular Mountain and incredible L.A. City ...	SingleFamily	1951	Central	Central Air, Dual	Driveway, Driveway - Brick	...	None	None	None	None	None	None	None	None	None	91011
2	19595300	3581 Butcher Dr,	$1,125,000	11/06/19	Eichler Style home! with Santa Clara High! in ...	SingleFamily	1954	Central Forced Air - Gas	Central AC	Garage, Garage - Attached, Covered	...	None	None	None	None	None	None	None	None	None	95051
3	300472200	2021 N Milpitas Blvd,	$36,250,000	10/02/20	2021 N Milpitas Blvd, Milpitas, CA 95035 is a ...	Apartment	1989	Other	No Data	Mixed, Covered	...	None	None	None	None	None	None	None	None	None	95035
4	2074492000	LOT 4 Tool Box Spring Rd,	$140,000	10/19/20	Beautiful level lot dotted with pine trees ro...	VacantLand	No Data	No Data	No Data	0 spaces	...	None	None	None	None	None	None	None	None	None	92561

5 rows × 1789 columns

2. 数据预处理： Drop + Filter + Replace

We drop columns that at least 30% values are null to simplify our EDA.(将缺失数据大于30%的列舍弃（舍弃这些feature）)

null_sum = data.isnull().sum() # .sum默认对列进行求和
data.columns[null_sum<len(data)*0.3]

Index(['Id', 'Address', 'Sold Price', 'Sold On', 'Summary', 'Type',
       'Year built', 'Heating', 'Cooling', 'Parking', 'Bedrooms', 'Bathrooms',
       'Total interior livable area', 'Total spaces', 'Garage spaces',
       'Home type', 'Region', 'Elementary School', 'Elementary School Score',
       'Elementary School Distance', 'High School', 'High School Score',
       'High School Distance', 'Heating features', 'Parking features',
       'Lot size', 'Parcel number', 'Tax assessed value', 'Annual tax amount',
       'Listed On', 'Listed Price', 'Zip'],
      dtype='object')

data.drop(columns=data.columns[null_sum>len(data)*0.3],inplace=True) # 将缺失大于30%的列去掉，并且直接替换

数据格式

data.dtypes

Id                             object
Address                        object
Sold Price                     object
Sold On                        object
Summary                        object
Type                           object
Year built                     object
Heating                        object
Cooling                        object
Parking                        object
Bedrooms                       object
Bathrooms                      object
Total interior livable area    object
Total spaces                   object
Garage spaces                  object
Home type                      object
Region                         object
Elementary School              object
Elementary School Score        object
Elementary School Distance     object
High School                    object
High School Score              object
High School Distance           object
Heating features               object
Parking features               object
Lot size                       object
Parcel number                  object
Tax assessed value             object
Annual tax amount              object
Listed On                      object
Listed Price                   object
Zip                            object
dtype: object

Convert currency from string format such as $1,000,000 to float. （将string类型的数据转换为float）

currency = ['Sold Price', 'Listed Price', 'Tax assessed value', 'Annual tax amount']
for c in currency:
    data[c] = data[c].replace(
        r'[$,-]', '', regex=True).replace(
        r'^\s*$', np.nan, regex=True).astype(float)

Also convert areas from string format such as 1000 sqft and 1 Acres to float as well.

areas = ['Total interior livable area', 'Lot size']
for c in areas:
    acres = data[c].str.contains('Acres') == True
    col = data[c].replace(r'\b sqft\b|\b Acres\b|\b,\b','', regex=True).astype(float)
    col[acres] *= 43560
    data[c] = col

3. 数据统计

Now we can check values of the numerical columns. You could see the min and max values for several columns do not make sense. （查看数据统计信息）

data.describe()

	Sold Price	Total interior livable area	Lot size	Tax assessed value	Annual tax amount	Listed Price
count	1.648590e+05	1.465450e+05	1.358450e+05	1.450650e+05	1.433500e+05	1.250060e+05
mean	1.194842e+06	3.182221e+03	9.525061e+05	8.898781e+05	1.123415e+04	1.197671e+06
std	3.336365e+06	4.609881e+05	1.357197e+08	3.126888e+06	3.859389e+04	2.874721e+06
min	1.000000e+00	1.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	1.000000e+00
25%	4.350000e+05	1.170000e+03	4.800000e+03	2.550000e+05	3.434250e+03	4.990000e+05
50%	8.050000e+05	1.558000e+03	6.603000e+03	5.635010e+05	7.372000e+03	8.490000e+05
75%	1.370000e+06	2.144000e+03	1.209000e+04	1.033832e+06	1.321300e+04	1.395000e+06
max	8.660000e+08	1.764164e+08	4.856770e+10	8.256328e+08	9.977342e+06	6.250000e+08

We filter out houses whose living areas are too small or too hard to simplify the visualization later. （将面积太大和太小的过虑出来）

abnormal = (data[areas[1]] < 10) | (data[areas[1]] > 1e4)
data = data[~abnormal]
sum(abnormal)

Let’s check the histogram of the ‘Sold Price’, which is the target we want to predict.

ax = sns.histplot(np.log10(data['Sold Price']))
ax.set_xlim([3, 8])
ax.set_xticks(range(3, 9))
ax.set_xticklabels(['%.0e'%a for a in 10**ax.get_xticks()]);

A house has different types. Here are the top types:

data['Type'].value_counts()[0:20]

SingleFamily            74318
Condo                   18749
MultiFamily              6586
VacantLand               6199
Townhouse                5846
Unknown                  5390
MobileManufactured       2588
Apartment                1416
Cooperative               161
Residential Lot            75
Single Family              69
Single Family Lot          56
Acreage                    48
2 Story                    39
3 Story                    25
Hi-Rise (9+), Luxury       21
RESIDENTIAL                19
Condominium                19
Duplex                     19
Mid-Rise (4-8)             17
Name: Type, dtype: int64

Price density for different house types.

types = data['Type'].isin(['SingleFamily', 'Condo', 'MultiFamily', 'Townhouse'])
sns.displot(pd.DataFrame({
   'Sold Price':np.log10(data[types]['Sold Price']),
                          'Type':data[types]['Type']}),
            x='Sold Price', hue='Type', kind='kde');

Another important measurement is the sale price per living sqft. Let’s check the differences between different house types.

4. 数据可视化

data['Price per living sqft'] = data['Sold Price'] / data['Total interior livable area']
ax = sns.boxplot(x='Type', y='Price per living sqft', data=data[types], fliersize=0)
ax.set_ylim([0, 2000]);

We know the location affect the price. Let’s check the price for the top 20 zip codes.

d = data[data['Zip'].isin(data['Zip'].value_counts()[:20].keys())]
ax = sns.boxplot(x='Zip', y='Price per living sqft', data=d, fliersize=0)
ax.set_ylim([0, 2000])
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);

Last, we visualize the correlation matrix of several columns.

_, ax = plt.subplots(figsize=(6,6))
columns = ['Sold Price', 'Listed Price', 'Annual tax amount', 'Price per living sqft', 'Elementary School Score', 'High School Score']
sns.heatmap(data[columns].corr(),annot=True,cmap='RdYlGn', ax=ax);

二、总结

This notebook demonstrates the basic technologies for EDA, including

Understanding column data types, values, and distributions
Understanding the interactions between columns

We only explored a small aspect of the data. You are welcome to dive deep into more details.

转载：https://blog.csdn.net/jerry_liufeng/article/details/123428011

查看评论

小言_互联网的博客

小言_互联网的博客

个人资料

文章分类

文章存档

阅读排行

评论排行

推荐文章