Hebei University Invitational: Used Car Price Prediction - Task 2, Data Analysis
EDA: Exploratory Data Analysis
Competition: Data Mining for Beginners - Used Car Transaction Price Prediction
Contestants are asked to build a model on the given dataset and predict the transaction price of used cars.
I. Goals of EDA
- The value of EDA lies mainly in getting familiar with the dataset and validating it, so as to confirm that it can be used in the machine learning or deep learning work that follows.
- Understand the relationships among the variables, and between each variable and the target.
- Guide the subsequent data processing and feature engineering, so that the structure and feature set of the data make the downstream prediction problem more reliable.
II. Contents
- Load the data science and visualization libraries:
  - data science: pandas, numpy, scipy;
  - visualization: matplotlib, seaborn;
  - others.
- Load the data:
  - load the training and test sets;
  - take a quick look at the data (head() + shape).
- Data overview:
  - use describe() to get familiar with the summary statistics;
  - use info() to get familiar with the data types.
- Check for missing and abnormal values:
  - inspect the NaN count of each column;
  - outlier detection.
- Understand the distribution of the target:
  - overall distribution (Johnson SU distribution, etc.);
  - check skewness and kurtosis;
  - check the exact frequency counts of the target.
- Split the features into categorical and numeric, and inspect the unique-value distribution of the categorical ones.
- Numeric feature analysis:
  - correlation analysis;
  - skewness and kurtosis of selected features;
  - distribution plot of each numeric feature;
  - pairwise relationships between numeric features;
  - multivariate regression relationships.
- Categorical feature analysis:
  - unique-value distribution;
  - box plots;
  - violin plots;
  - bar plots;
  - frequency counts per category (count plot).
- Generate a data report with pandas_profiling.
III. Code Examples
1. Load the data science and visualization libraries
#coding:utf-8
# Import the warnings package and use a filter to suppress warning messages.
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
2. Load the data
1) Load the training and test sets
## 1) Load the training and test sets
Train_data = pd.read_csv('./data/car_train_0110.csv', sep=' ')
Test_data = pd.read_csv('./data/car_testA_0110.csv', sep=' ')
All features have been anonymized.
- name - vehicle code
- regDate - vehicle registration date
- model - model code
- brand - brand
- bodyType - body type
- fuelType - fuel type
- gearbox - gearbox
- power - engine power
- kilometer - kilometers driven
- notRepairedDamage - whether the car has unrepaired damage
- regionCode - viewing-region code
- seller - seller type
- offerType - offer type
- creatDate - ad creation date
- price - price of the car
- 24 anonymous features v_0 through v_23 (embedding vectors derived from reviews, tags, and other information about the car) [hand-crafted anonymous features]
2) Take a quick look at the data (head() + shape)
## 2) Take a quick look at the data (head() + shape)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
pd.concat([Train_data.head(), Train_data.tail()])
Train_data.shape
(250000, 40)
pd.concat([Test_data.head(), Test_data.tail()])
Test_data.shape
(50000, 39)
Make a habit of checking head() and shape of a dataset; if you are unsure about a pandas operation, run it one step at a time and inspect the result.
3. Data overview
- describe() gives the summary statistics of each column (count, mean, std, min, the 25% / 50% (median) / 75% quantiles, and max). It quickly conveys the rough range of the data and helps spot anomalies; for example, values such as 999, 9999, or -1 may be another way of encoding NaN and deserve attention.
- info() shows the dtype of each column, which helps detect special placeholder values other than NaN.
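Sentinel values like those can be converted to real NaNs before counting missingness. A minimal sketch on a toy column; the column name and sentinel list here are illustrative, not taken from the competition data:

```python
import numpy as np
import pandas as pd

# Toy column standing in for Train_data['power']; -1 and 9999 play the role of disguised NaNs
df = pd.DataFrame({'power': [75, -1, 110, 9999, 60]})

SENTINELS = [-1, 9999]  # values describe() suggested might encode "missing" (illustrative)
df['power'] = df['power'].replace(SENTINELS, np.nan)

print(df['power'].isnull().sum())  # 2 disguised NaNs uncovered
```

After this step, the usual isnull() checks below see the full picture of missingness.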
1) Use describe() to get familiar with the summary statistics
## 1) Use describe() to get familiar with the summary statistics
Train_data.describe()
Test_data.describe()
2) Use info() to get familiar with the data types
## 2) Use info() to get familiar with the data types
Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 40 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SaleID 250000 non-null int64
1 name 250000 non-null int64
2 regDate 250000 non-null int64
3 model 250000 non-null float64
4 brand 250000 non-null int64
5 bodyType 224620 non-null float64
6 fuelType 227510 non-null float64
7 gearbox 236487 non-null float64
8 power 250000 non-null int64
9 kilometer 250000 non-null float64
10 notRepairedDamage 201464 non-null float64
11 regionCode 250000 non-null int64
12 seller 250000 non-null int64
13 offerType 250000 non-null int64
14 creatDate 250000 non-null int64
15 price 250000 non-null int64
16 v_0 250000 non-null float64
17 v_1 250000 non-null float64
18 v_2 250000 non-null float64
19 v_3 250000 non-null float64
20 v_4 250000 non-null float64
21 v_5 250000 non-null float64
22 v_6 250000 non-null float64
23 v_7 250000 non-null float64
24 v_8 250000 non-null float64
25 v_9 250000 non-null float64
26 v_10 250000 non-null float64
27 v_11 250000 non-null float64
28 v_12 250000 non-null float64
29 v_13 250000 non-null float64
30 v_14 250000 non-null float64
31 v_15 250000 non-null float64
32 v_16 250000 non-null float64
33 v_17 250000 non-null float64
34 v_18 250000 non-null float64
35 v_19 250000 non-null float64
36 v_20 250000 non-null float64
37 v_21 250000 non-null float64
38 v_22 250000 non-null float64
39 v_23 250000 non-null float64
dtypes: float64(30), int64(10)
memory usage: 76.3 MB
Test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 39 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SaleID 50000 non-null int64
1 name 50000 non-null int64
2 regDate 50000 non-null int64
3 model 50000 non-null float64
4 brand 50000 non-null int64
5 bodyType 44890 non-null float64
6 fuelType 45598 non-null float64
7 gearbox 47287 non-null float64
8 power 50000 non-null int64
9 kilometer 50000 non-null float64
10 notRepairedDamage 40372 non-null float64
11 regionCode 50000 non-null int64
12 seller 50000 non-null int64
13 offerType 50000 non-null int64
14 creatDate 50000 non-null int64
15 v_0 50000 non-null float64
16 v_1 50000 non-null float64
17 v_2 50000 non-null float64
18 v_3 50000 non-null float64
19 v_4 50000 non-null float64
20 v_5 50000 non-null float64
21 v_6 50000 non-null float64
22 v_7 50000 non-null float64
23 v_8 50000 non-null float64
24 v_9 50000 non-null float64
25 v_10 50000 non-null float64
26 v_11 50000 non-null float64
27 v_12 50000 non-null float64
28 v_13 50000 non-null float64
29 v_14 50000 non-null float64
30 v_15 50000 non-null float64
31 v_16 50000 non-null float64
32 v_17 50000 non-null float64
33 v_18 50000 non-null float64
34 v_19 50000 non-null float64
35 v_20 50000 non-null float64
36 v_21 50000 non-null float64
37 v_22 50000 non-null float64
38 v_23 50000 non-null float64
dtypes: float64(30), int64(9)
memory usage: 14.9 MB
4. Check for missing and abnormal values
1) Inspect the NaN count of each column
## 1) Inspect the NaN count of each column
Train_data.isnull().sum()
SaleID 0
name 0
regDate 0
model 0
brand 0
bodyType 25380
fuelType 22490
gearbox 13513
power 0
kilometer 0
notRepairedDamage 48536
regionCode 0
seller 0
offerType 0
creatDate 0
price 0
v_0 0
v_1 0
v_2 0
v_3 0
v_4 0
v_5 0
v_6 0
v_7 0
v_8 0
v_9 0
v_10 0
v_11 0
v_12 0
v_13 0
v_14 0
v_15 0
v_16 0
v_17 0
v_18 0
v_19 0
v_20 0
v_21 0
v_22 0
v_23 0
dtype: int64
Test_data.isnull().sum()
SaleID 0
name 0
regDate 0
model 0
brand 0
bodyType 5110
fuelType 4402
gearbox 2713
power 0
kilometer 0
notRepairedDamage 9628
regionCode 0
seller 0
offerType 0
creatDate 0
v_0 0
v_1 0
v_2 0
v_3 0
v_4 0
v_5 0
v_6 0
v_7 0
v_8 0
v_9 0
v_10 0
v_11 0
v_12 0
v_13 0
v_14 0
v_15 0
v_16 0
v_17 0
v_18 0
v_19 0
v_20 0
v_21 0
v_22 0
v_23 0
dtype: int64
Visualizing which columns contain NaN shows how many missing values there are: if only a few, filling them is usually fine; with tree models such as LightGBM you can leave them empty and let the tree optimize around them; if a column is mostly NaN, consider dropping it.
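The fill-or-drop strategy above can be sketched as follows; the 30% drop threshold and the mode-based fill are illustrative choices, not rules from the original:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for Train_data
df = pd.DataFrame({
    'bodyType':   [1.0, np.nan, 2.0, 1.0],        # few NaNs: fill
    'mostly_nan': [np.nan, np.nan, np.nan, 5.0],  # mostly NaN: drop (toy column)
})

nan_ratio = df.isnull().mean()

# Drop columns whose missing ratio exceeds an (illustrative) 30% threshold
df = df.drop(columns=nan_ratio[nan_ratio > 0.3].index)

# Fill the remaining sparse gaps, e.g. with the mode
df['bodyType'] = df['bodyType'].fillna(df['bodyType'].mode()[0])
print(df.columns.tolist())  # ['bodyType']
```

For tree models, the fill step can simply be skipped.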
# Visualize the NaN counts
missing = Train_data.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()
# Visualize the missing values
msno.matrix(Train_data.sample(250))
msno.bar(Train_data.sample(1000))
# Visualize the missing values
msno.matrix(Test_data.sample(250))
msno.bar(Test_data.sample(1000))
The missingness of the test set is similar to that of the training set; the visualization shows four columns with missing values, with notRepairedDamage missing the most.
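The train/test comparison can also be done numerically by putting the missing ratios side by side; toy frames stand in for the real data here:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for Train_data / Test_data
tr = pd.DataFrame({'bodyType': [1.0, np.nan, 2.0, 1.0]})
te = pd.DataFrame({'bodyType': [np.nan, 2.0, 1.0, 3.0]})

ratios = pd.DataFrame({'train': tr.isnull().mean(), 'test': te.isnull().mean()})
print(ratios)  # similar missing shares hint that the two sets share a generating process
```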
2) Outlier detection
The following two categorical features are severely skewed; they will generally not help prediction, so they can be dropped.
Train_data["seller"].value_counts()
1 249999
0 1
Name: seller, dtype: int64
Train_data["offerType"].value_counts()
0 249991
1 9
Name: offerType, dtype: int64
del Train_data["seller"]
del Train_data["offerType"]
del Test_data["seller"]
del Test_data["offerType"]
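Rather than inspecting seller and offerType by hand, a small helper can flag any column dominated by a single value; the 99% cutoff is an assumption for illustration, not from the original:

```python
import pandas as pd

def dominated_columns(df, cols, threshold=0.99):
    """Return the columns where a single value covers more than `threshold` of rows."""
    return [c for c in cols
            if df[c].value_counts(normalize=True).iloc[0] > threshold]

toy = pd.DataFrame({'seller': [1] * 999 + [0],   # one value covers 99.9% of rows
                    'gearbox': [0, 1] * 500})    # balanced
print(dominated_columns(toy, ['seller', 'gearbox']))  # ['seller']
```

Columns it returns are candidates for deletion, exactly as done with seller and offerType above.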
5. Understand the distribution of the target
Train_data['price']
0 520
1 5500
2 1100
3 1200
4 3300
...
249995 1200
249996 1200
249997 16500
249998 31950
249999 1990
Name: price, Length: 250000, dtype: int64
Train_data['price'].value_counts()
0 7312
500 3815
1500 3587
1000 3149
1200 3071
...
22985 1
10695 1
20680 1
2243 1
46888 1
Name: price, Length: 4585, dtype: int64
1) Overall distribution (Johnson SU distribution, etc.)
## 1) Overall distribution (Johnson SU distribution, etc.)
import scipy.stats as st
y = Train_data['price']
plt.figure(1); plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)
The price does not follow a normal distribution, so it must be transformed before regression. Both the log transform and the unbounded Johnson SU transform fit well.
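The effect of the log transform can also be quantified rather than eyeballed: skewness should collapse toward zero. A sketch on a synthetic right-skewed series (a stand-in for price, not the competition data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices" drawn from a log-normal distribution
price_like = pd.Series(np.expm1(rng.normal(8, 1, 10000)))

raw_skew = price_like.skew()
log_skew = np.log1p(price_like).skew()
print(raw_skew, log_skew)  # the raw series is heavily skewed; log1p removes most of it
```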
2) Check skewness and kurtosis
## 2) Check skewness and kurtosis
sns.distplot(Train_data['price']);
print("Skewness: %f" % Train_data['price'].skew())
print("Kurtosis: %f" % Train_data['price'].kurt())
Skewness: 3.535346
Kurtosis: 21.230678
Train_data.skew()
SaleID 0.001712
name 0.513079
regDate -1.540844
model 1.499765
brand 1.314846
bodyType -0.070459
fuelType 0.701802
gearbox -1.357379
power 58.590829
kilometer -1.557472
notRepairedDamage -2.312519
regionCode 0.690405
creatDate -95.428563
price 3.535346
v_0 -1.504738
v_1 1.582428
v_2 1.198679
v_3 1.352193
v_4 0.217941
v_5 2.052749
v_6 0.090718
v_7 0.823610
v_8 -1.532964
v_9 1.529931
v_10 -2.584452
v_11 -0.906428
v_12 -2.842834
v_13 -3.869655
v_14 0.491706
v_15 1.308716
v_16 1.662893
v_17 0.233318
v_18 0.814453
v_19 0.100073
v_20 2.001253
v_21 0.180020
v_22 0.819133
v_23 1.357847
dtype: float64
Train_data.kurt()
SaleID -1.201476
name -1.084474
regDate 11.041006
model 1.741896
brand 1.814245
bodyType -1.070358
fuelType -1.495782
gearbox -0.157525
power 4473.885260
kilometer 1.250933
notRepairedDamage 3.347777
regionCode -0.352973
creatDate 11376.694263
price 21.230678
v_0 2.901641
v_1 1.098703
v_2 3.749872
v_3 4.294578
v_4 6.953348
v_5 6.489791
v_6 -0.564878
v_7 -0.729838
v_8 0.370812
v_9 0.377943
v_10 4.796855
v_11 1.547812
v_12 6.136342
v_13 13.199575
v_14 -1.597532
v_15 -0.029594
v_16 2.240928
v_17 2.569341
v_18 2.967738
v_19 6.923953
v_20 6.852809
v_21 -0.759948
v_22 -0.741708
v_23 0.143713
dtype: float64
sns.distplot(Train_data.skew(), color='blue', axlabel='Skewness')
sns.distplot(Train_data.kurt(), color='orange', axlabel='Kurtosis')
3) Check the exact frequency counts of the target
## 3) Check the exact frequency counts of the target
plt.hist(Train_data['price'], orientation = 'vertical',histtype = 'bar', color ='red')
plt.show()
Looking at the counts, values above 20000 are very rare; they can be treated as outliers and filled in or removed.
# After a log transform the distribution is much more even, so the target can be log-transformed for prediction; this is a common trick in regression problems
plt.hist(np.log1p(Train_data['price']), orientation = 'vertical',histtype = 'bar', color ='red')
plt.show()
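If a model is trained on log1p(price), its predictions must be mapped back with expm1; a sketch of the round trip (the price values are just the first few from the training set head):

```python
import numpy as np

price = np.array([520.0, 5500.0, 1100.0])

y = np.log1p(price)   # target actually used for training
pred = np.expm1(y)    # inverse transform applied to the model's output

print(np.allclose(pred, price))  # True: expm1 exactly undoes log1p
```

log1p/expm1 are preferred over log/exp because they stay numerically stable near zero, and the dataset does contain zero prices.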
6. Split the features into categorical and numeric, and inspect the unique-value distribution of the categorical ones
# Separate out the label, i.e. the target
Y_train = Train_data['price']
numeric_features = ['price', 'power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13','v_14',
'v_15', 'v_16', 'v_17', 'v_18', 'v_19', 'v_20', 'v_21', 'v_22', 'v_23']
categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'regionCode']
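The lists are written out by hand because dtype alone cannot make this split: encoded categories such as brand are stored as numbers. select_dtypes gives only a rough first pass, as this toy frame shows:

```python
import pandas as pd

toy = pd.DataFrame({'power': [75, 110],
                    'v_0': [43.3, 45.1],
                    'brand': [6, 1]})  # numeric dtype but semantically categorical

numeric_guess = toy.select_dtypes(include='number').columns.tolist()
print(numeric_guess)  # ['power', 'v_0', 'brand'] -- brand is wrongly swept in
```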
# nunique distribution of each feature
for cat_fea in categorical_features:
    print(cat_fea + " distribution:")
    print("{} has {} distinct values".format(cat_fea, Train_data[cat_fea].nunique()))
    print(Train_data[cat_fea].value_counts())
# nunique distribution of each feature
for cat_fea in categorical_features:
    print(cat_fea + " distribution:")
    print("{} has {} distinct values".format(cat_fea, Test_data[cat_fea].nunique()))
    print(Test_data[cat_fea].value_counts())
7. Numeric feature analysis
numeric_features
['price',
'power',
'kilometer',
'v_0',
'v_1',
'v_2',
'v_3',
'v_4',
'v_5',
'v_6',
'v_7',
'v_8',
'v_9',
'v_10',
'v_11',
'v_12',
'v_13',
'v_14',
'v_15',
'v_16',
'v_17',
'v_18',
'v_19',
'v_20',
'v_21',
'v_22',
'v_23']
1) Correlation analysis
## 1) Correlation analysis
price_numeric = Train_data[numeric_features]
correlation = price_numeric.corr()
print(correlation['price'].sort_values(ascending = False),'\n')
price 1.000000
v_0 0.514477
v_11 0.481618
v_15 0.421987
power 0.189456
v_8 0.183505
v_10 0.163891
v_12 0.129570
v_13 0.114883
v_22 0.092171
v_7 0.090440
v_14 0.075673
v_19 0.033689
v_20 0.026216
v_4 0.004413
v_23 -0.018156
v_2 -0.018823
v_6 -0.036826
v_5 -0.039637
v_21 -0.052654
v_17 -0.118299
v_9 -0.165831
v_1 -0.207255
v_16 -0.345840
kilometer -0.404961
v_18 -0.556144
v_3 -0.595468
Name: price, dtype: float64
f , ax = plt.subplots(figsize = (7, 7))
plt.title('Correlation of Numeric Features with Price',y=1,size=16)
sns.heatmap(correlation,square = True, vmax=0.8)
del price_numeric['price']
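A follow-up use of the correlation matrix, not done in the original, is pruning near-duplicate features; a sketch with an illustrative 0.95 threshold on synthetic columns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
a = rng.normal(size=1000)
df = pd.DataFrame({'v_a': a,
                   'v_b': 2 * a + 0.01 * rng.normal(size=1000),  # near-duplicate of v_a
                   'v_c': rng.normal(size=1000)})                # independent

corr = df.corr().abs()
# Keep only the upper triangle so each pair is examined once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
print(to_drop)  # ['v_b']
```

Whether to actually drop such columns is a modeling decision; tree ensembles are fairly robust to redundancy, linear models less so.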
2) Skewness and kurtosis of selected features
## 2) Skewness and kurtosis of selected features
for col in numeric_features:
    print('{:15}'.format(col),
          'Skewness: {:05.2f}'.format(Train_data[col].skew()),
          '   ',
          'Kurtosis: {:06.2f}'.format(Train_data[col].kurt())
          )
price Skewness: 03.54 Kurtosis: 021.23
power Skewness: 58.59 Kurtosis: 4473.89
kilometer Skewness: -1.56 Kurtosis: 001.25
v_0 Skewness: -1.50 Kurtosis: 002.90
v_1 Skewness: 01.58 Kurtosis: 001.10
v_2 Skewness: 01.20 Kurtosis: 003.75
v_3 Skewness: 01.35 Kurtosis: 004.29
v_4 Skewness: 00.22 Kurtosis: 006.95
v_5 Skewness: 02.05 Kurtosis: 006.49
v_6 Skewness: 00.09 Kurtosis: -00.56
v_7 Skewness: 00.82 Kurtosis: -00.73
v_8 Skewness: -1.53 Kurtosis: 000.37
v_9 Skewness: 01.53 Kurtosis: 000.38
v_10 Skewness: -2.58 Kurtosis: 004.80
v_11 Skewness: -0.91 Kurtosis: 001.55
v_12 Skewness: -2.84 Kurtosis: 006.14
v_13 Skewness: -3.87 Kurtosis: 013.20
v_14 Skewness: 00.49 Kurtosis: -01.60
v_15 Skewness: 01.31 Kurtosis: -00.03
v_16 Skewness: 01.66 Kurtosis: 002.24
v_17 Skewness: 00.23 Kurtosis: 002.57
v_18 Skewness: 00.81 Kurtosis: 002.97
v_19 Skewness: 00.10 Kurtosis: 006.92
v_20 Skewness: 02.00 Kurtosis: 006.85
v_21 Skewness: 00.18 Kurtosis: -00.76
v_22 Skewness: 00.82 Kurtosis: -00.74
v_23 Skewness: 01.36 Kurtosis: 000.14
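power's skewness of 58 and kurtosis of 4473 point to extreme outliers. One common treatment (a choice not made in the original, named here plainly) is winsorizing at a high quantile; a sketch on a toy column:

```python
import pandas as pd

# Toy stand-in for Train_data['power']; the last value plays the outlier
power = pd.Series([75.0, 110.0, 150.0, 60.0, 19312.0])

cap = power.quantile(0.99)       # illustrative cap; the quantile is a judgment call
clipped = power.clip(upper=cap)
print(clipped.max() < power.max())  # True: the extreme value has been winsorized
```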
3) Distribution plot of each numeric feature
## 3) Distribution plot of each numeric feature
f = pd.melt(Train_data, value_vars=numeric_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")
4) Pairwise relationships between numeric features
## 4) Pairwise relationships between numeric features
sns.set()
columns = ['price', 'v_0', 'v_11' , 'v_15', 'power', 'v_8', 'v_10', 'v_12', 'v_13', 'v_22', 'v_7', 'v_14', 'v_19', 'v_20']
sns.pairplot(Train_data[columns], height=2, kind='scatter', diag_kind='kde')  # `size` was renamed `height` in seaborn 0.9
plt.show()
5) Multivariate regression relationships
## 5) Multivariate regression relationships
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8), (ax9, ax10)) = plt.subplots(nrows=5, ncols=2, figsize=(24, 20))
# ['v_12', 'v_8' , 'v_0', 'power', 'v_5', 'v_2', 'v_6', 'v_1', 'v_14']
v_12_scatter_plot = pd.concat([Y_train,Train_data['v_12']],axis = 1)
sns.regplot(x='v_12',y = 'price', data = v_12_scatter_plot,scatter= True, fit_reg=True, ax=ax1)
v_8_scatter_plot = pd.concat([Y_train,Train_data['v_8']],axis = 1)
sns.regplot(x='v_8',y = 'price',data = v_8_scatter_plot,scatter= True, fit_reg=True, ax=ax2)
v_0_scatter_plot = pd.concat([Y_train,Train_data['v_0']],axis = 1)
sns.regplot(x='v_0',y = 'price',data = v_0_scatter_plot,scatter= True, fit_reg=True, ax=ax3)
power_scatter_plot = pd.concat([Y_train,Train_data['power']],axis = 1)
sns.regplot(x='power',y = 'price',data = power_scatter_plot,scatter= True, fit_reg=True, ax=ax4)
v_5_scatter_plot = pd.concat([Y_train,Train_data['v_5']],axis = 1)
sns.regplot(x='v_5',y = 'price',data = v_5_scatter_plot,scatter= True, fit_reg=True, ax=ax5)
v_2_scatter_plot = pd.concat([Y_train,Train_data['v_2']],axis = 1)
sns.regplot(x='v_2',y = 'price',data = v_2_scatter_plot,scatter= True, fit_reg=True, ax=ax6)
v_6_scatter_plot = pd.concat([Y_train,Train_data['v_6']],axis = 1)
sns.regplot(x='v_6',y = 'price',data = v_6_scatter_plot,scatter= True, fit_reg=True, ax=ax7)
v_1_scatter_plot = pd.concat([Y_train,Train_data['v_1']],axis = 1)
sns.regplot(x='v_1',y = 'price',data = v_1_scatter_plot,scatter= True, fit_reg=True, ax=ax8)
v_14_scatter_plot = pd.concat([Y_train,Train_data['v_14']],axis = 1)
sns.regplot(x='v_14',y = 'price',data = v_14_scatter_plot,scatter= True, fit_reg=True, ax=ax9)
v_13_scatter_plot = pd.concat([Y_train,Train_data['v_13']],axis = 1)
sns.regplot(x='v_13',y = 'price',data = v_13_scatter_plot,scatter= True, fit_reg=True, ax=ax10)
8. Categorical feature analysis
categorical_features
['name',
'model',
'brand',
'bodyType',
'fuelType',
'gearbox',
'notRepairedDamage',
'regionCode']
1) Unique-value distribution
## 1) Unique-value distribution
for fea in categorical_features:
print(Train_data[fea].nunique())
164312
251
40
8
7
2
2
8081
2) Box plots of categorical features
## 2) Box plots of categorical features
# name and regionCode have far too many sparse categories, so here we only plot the denser features
categorical_features = ['model',
'brand',
'bodyType',
'fuelType',
'gearbox',
'notRepairedDamage']
for c in categorical_features:
    Train_data[c] = Train_data[c].astype('category')
    if Train_data[c].isnull().any():
        Train_data[c] = Train_data[c].cat.add_categories(['MISSING'])
        Train_data[c] = Train_data[c].fillna('MISSING')

def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)
    x = plt.xticks(rotation=90)

f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(boxplot, "value", "price")
3) Violin plots of categorical features
## 3) Violin plots of categorical features
catg_list = categorical_features
target = 'price'
for catg in catg_list:
    sns.violinplot(x=catg, y=target, data=Train_data)
    plt.show()
4) Bar plots of categorical features
## 4) Bar plots of categorical features
def bar_plot(x, y, **kwargs):
    sns.barplot(x=x, y=y)
    x = plt.xticks(rotation=90)

f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(bar_plot, "value", "price")
5) Frequency counts per category of each categorical feature (count plot)
## 5) Frequency counts per category (count plot)
def count_plot(x, **kwargs):
    sns.countplot(x=x)
    x = plt.xticks(rotation=90)

f = pd.melt(Train_data, value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(count_plot, "value")
9. Generate a data report with pandas_profiling
pandas_profiling produces a fairly comprehensive visual data report with little effort; open the generated HTML file to view it.
import pandas_profiling
pfr = pandas_profiling.ProfileReport(Train_data)
pfr.to_file("./example.html")
Reprinted from: https://blog.csdn.net/X0420C/article/details/115770557