小言_互联网的博客

实用机器学习笔记(三):数据预处理

511人阅读  评论(0)

一、使用pandas+seaborn+pyplot进行数据展示和预处理操作

Pandas基础教学:这是前两天在Datawhale看到的一篇关于Pandas的介绍,写得非常好,记录备用

https://mp.weixin.qq.com/s/A232A6OLxrlsZUx0VGv3Ow

1. 读取与查看数据

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns
from IPython import display
display.set_matplotlib_formats("svg")

读取数据

data = pd.read_feather("./data/house_sales.ftr")
data.shape

数据基本信息

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 164944 entries, 0 to 164943
Columns: 1789 entries, Id to Zip
dtypes: object(1789)
memory usage: 2.2+ GB

查看数据

data.head()
Id Address Sold Price Sold On Summary Type Year built Heating Cooling Parking ... Well Disclosure remodeled DOH2 SerialX Full Baths Tax Legal Lot Number Tax Legal Block Number Tax Legal Tract Number Building Name Zip
0 2080183300 11205 Monterey, $2,000,000 01/31/20 11205 Monterey, San Martin, CA 95046 is a sing... SingleFamily No Data No Data No Data 0 spaces ... None None None None None None None None None 95046
1 20926300 5281 Castle Rd, $2,100,000 02/25/21 Spectacular Mountain and incredible L.A. City ... SingleFamily 1951 Central Central Air, Dual Driveway, Driveway - Brick ... None None None None None None None None None 91011
2 19595300 3581 Butcher Dr, $1,125,000 11/06/19 Eichler Style home! with Santa Clara High! in ... SingleFamily 1954 Central Forced Air - Gas Central AC Garage, Garage - Attached, Covered ... None None None None None None None None None 95051
3 300472200 2021 N Milpitas Blvd, $36,250,000 10/02/20 2021 N Milpitas Blvd, Milpitas, CA 95035 is a ... Apartment 1989 Other No Data Mixed, Covered ... None None None None None None None None None 95035
4 2074492000 LOT 4 Tool Box Spring Rd, $140,000 10/19/20 Beautiful level lot dotted with pine trees ro... VacantLand No Data No Data No Data 0 spaces ... None None None None None None None None None 92561

5 rows × 1789 columns

2. 数据预处理: Drop + Filter + Replace

We drop columns that at least 30% values are null to simplify our EDA.(将缺失数据大于30%的列舍弃(舍弃这些feature))

null_sum = data.isnull().sum() # .sum默认对列进行求和
data.columns[null_sum<len(data)*0.3]
Index(['Id', 'Address', 'Sold Price', 'Sold On', 'Summary', 'Type',
       'Year built', 'Heating', 'Cooling', 'Parking', 'Bedrooms', 'Bathrooms',
       'Total interior livable area', 'Total spaces', 'Garage spaces',
       'Home type', 'Region', 'Elementary School', 'Elementary School Score',
       'Elementary School Distance', 'High School', 'High School Score',
       'High School Distance', 'Heating features', 'Parking features',
       'Lot size', 'Parcel number', 'Tax assessed value', 'Annual tax amount',
       'Listed On', 'Listed Price', 'Zip'],
      dtype='object')
data.drop(columns=data.columns[null_sum>len(data)*0.3],inplace=True) # 将缺失大于30%的列去掉,并且直接替换

数据格式

data.dtypes
Id                             object
Address                        object
Sold Price                     object
Sold On                        object
Summary                        object
Type                           object
Year built                     object
Heating                        object
Cooling                        object
Parking                        object
Bedrooms                       object
Bathrooms                      object
Total interior livable area    object
Total spaces                   object
Garage spaces                  object
Home type                      object
Region                         object
Elementary School              object
Elementary School Score        object
Elementary School Distance     object
High School                    object
High School Score              object
High School Distance           object
Heating features               object
Parking features               object
Lot size                       object
Parcel number                  object
Tax assessed value             object
Annual tax amount              object
Listed On                      object
Listed Price                   object
Zip                            object
dtype: object

Convert currency from string format such as $1,000,000 to float. (将string类型的数据转换为float)

currency = ['Sold Price', 'Listed Price', 'Tax assessed value', 'Annual tax amount']
for c in currency:
    data[c] = data[c].replace(
        r'[$,-]', '', regex=True).replace(
        r'^\s*$', np.nan, regex=True).astype(float)

Also convert areas from string format such as 1000 sqft and 1 Acres to float as well.

areas = ['Total interior livable area', 'Lot size']
for c in areas:
    acres = data[c].str.contains('Acres') == True
    col = data[c].replace(r'\b sqft\b|\b Acres\b|\b,\b','', regex=True).astype(float)
    col[acres] *= 43560
    data[c] = col

3. 数据统计

Now we can check values of the numerical columns. You could see the min and max values for several columns do not make sense. (查看数据统计信息)

data.describe()
Sold Price Total interior livable area Lot size Tax assessed value Annual tax amount Listed Price
count 1.648590e+05 1.465450e+05 1.358450e+05 1.450650e+05 1.433500e+05 1.250060e+05
mean 1.194842e+06 3.182221e+03 9.525061e+05 8.898781e+05 1.123415e+04 1.197671e+06
std 3.336365e+06 4.609881e+05 1.357197e+08 3.126888e+06 3.859389e+04 2.874721e+06
min 1.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00
25% 4.350000e+05 1.170000e+03 4.800000e+03 2.550000e+05 3.434250e+03 4.990000e+05
50% 8.050000e+05 1.558000e+03 6.603000e+03 5.635010e+05 7.372000e+03 8.490000e+05
75% 1.370000e+06 2.144000e+03 1.209000e+04 1.033832e+06 1.321300e+04 1.395000e+06
max 8.660000e+08 1.764164e+08 4.856770e+10 8.256328e+08 9.977342e+06 6.250000e+08

We filter out houses whose living areas are too small or too hard to simplify the visualization later. (将面积太大和太小的过虑出来)

abnormal = (data[areas[1]] < 10) | (data[areas[1]] > 1e4)
data = data[~abnormal]
sum(abnormal)
41000

Let’s check the histogram of the ‘Sold Price’, which is the target we want to predict.

ax = sns.histplot(np.log10(data['Sold Price']))
ax.set_xlim([3, 8])
ax.set_xticks(range(3, 9))
ax.set_xticklabels(['%.0e'%a for a in 10**ax.get_xticks()]);

A house has different types. Here are the top types:

data['Type'].value_counts()[0:20]
SingleFamily            74318
Condo                   18749
MultiFamily              6586
VacantLand               6199
Townhouse                5846
Unknown                  5390
MobileManufactured       2588
Apartment                1416
Cooperative               161
Residential Lot            75
Single Family              69
Single Family Lot          56
Acreage                    48
2 Story                    39
3 Story                    25
Hi-Rise (9+), Luxury       21
RESIDENTIAL                19
Condominium                19
Duplex                     19
Mid-Rise (4-8)             17
Name: Type, dtype: int64

Price density for different house types.

types = data['Type'].isin(['SingleFamily', 'Condo', 'MultiFamily', 'Townhouse'])
sns.displot(pd.DataFrame({
   'Sold Price':np.log10(data[types]['Sold Price']),
                          'Type':data[types]['Type']}),
            x='Sold Price', hue='Type', kind='kde');

Another important measurement is the sale price per living sqft. Let’s check the differences between different house types.

4. 数据可视化

data['Price per living sqft'] = data['Sold Price'] / data['Total interior livable area']
ax = sns.boxplot(x='Type', y='Price per living sqft', data=data[types], fliersize=0)
ax.set_ylim([0, 2000]);

We know the location affect the price. Let’s check the price for the top 20 zip codes.

d = data[data['Zip'].isin(data['Zip'].value_counts()[:20].keys())]
ax = sns.boxplot(x='Zip', y='Price per living sqft', data=d, fliersize=0)
ax.set_ylim([0, 2000])
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);

Last, we visualize the correlation matrix of several columns.

_, ax = plt.subplots(figsize=(6,6))
columns = ['Sold Price', 'Listed Price', 'Annual tax amount', 'Price per living sqft', 'Elementary School Score', 'High School Score']
sns.heatmap(data[columns].corr(),annot=True,cmap='RdYlGn', ax=ax);

二、总结

This notebook demonstrates the basic technologies for EDA, including

  • Understanding column data types, values, and distributions
  • Understanding the interactions between columns

We only explored a small aspect of the data. You are welcome to dive deep into more details.


转载:https://blog.csdn.net/jerry_liufeng/article/details/123428011
查看评论
* 以上用户言论只代表其个人观点,不代表本网站的观点或立场