
Hebei College Invitational — Used Car Transaction Price Prediction, Task 2: Data Analysis


EDA: Exploratory Data Analysis

Competition: Introductory Data Mining — Used Car Transaction Price Prediction
Contestants are asked to build a model from the given dataset and predict the transaction price of used cars.

I. Goals of EDA

  1. The value of EDA lies mainly in getting familiar with the dataset, understanding it, and validating it, to confirm that it can be used for the subsequent machine learning or deep learning work.
  2. Understand the relationships among the variables and between the variables and the target.
  3. Guide data-science practitioners through the data-processing and feature-engineering steps, so that the dataset's structure and feature set make the prediction problem more reliable.

II. Contents

  1. Load the data science and visualization libraries
    Data science libraries: pandas, numpy, scipy;
    Visualization libraries: matplotlib, seaborn;
    Others;
  2. Load the data
    Load the training and test sets;
    Take a quick look at the data (head() + shape);
  3. Data overview
    Get familiar with the summary statistics via describe();
    Get familiar with the data types via info();
  4. Check for missing and abnormal data
    Check the NaN situation of each column;
    Outlier detection;
  5. Understand the distribution of the target
    Overall distribution (unbounded Johnson distribution, etc.);
    Check skewness and kurtosis;
    Check the frequency counts of the target;
  6. Split the features into categorical and numeric, and check the unique-value distribution of the categorical features
  7. Numeric feature analysis
    Correlation analysis;
    Check the skewness and kurtosis of several features;
    Visualize the distribution of each numeric feature;
    Visualize the pairwise relationships among numeric features;
    Visualize multivariate regression relationships;
  8. Categorical feature analysis
    Unique-value distribution;
    Box plots of the categorical features;
    Violin plots of the categorical features;
    Bar plots of the categorical features;
    Frequency counts of each category of the categorical features (count_plot);
  9. Generate a data report with pandas_profiling

III. Code Examples

1. Load the data science and visualization libraries

#coding:utf-8
# Import the warnings package and use a filter to suppress warning messages.
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

2. Load the data

1) Load the training and test sets

## 1) Load the training and test sets
Train_data = pd.read_csv('./data/car_train_0110.csv', sep=' ')
Test_data = pd.read_csv('./data/car_testA_0110.csv', sep=' ')

All features have been anonymized.

  1. name - vehicle code
  2. regDate - vehicle registration date
  3. model - model code
  4. brand - brand
  5. bodyType - body type
  6. fuelType - fuel type
  7. gearbox - gearbox
  8. power - engine power
  9. kilometer - kilometers driven
  10. notRepairedDamage - whether the vehicle has unrepaired damage
  11. regionCode - viewing-region code
  12. seller - seller type
  13. offerType - offer type
  14. creatDate - ad publication date
  15. price - vehicle price
  16. 24 anonymous features, v_0 through v_23 (embedding vectors derived from large amounts of information such as reviews and tags) [manually constructed anonymous features]

2) Take a quick look at the data (head() + shape)

## 2) Take a quick look at the data (head() + shape)
# Note: DataFrame.append was removed in pandas 2.0; pd.concat is the modern equivalent.
pd.concat([Train_data.head(), Train_data.tail()])
Train_data.shape
(250000, 40)
pd.concat([Test_data.head(), Test_data.tail()])
Test_data.shape
(50000, 39)

In the steps that follow, make a habit of checking the head() and shape of your data; if you are unsure about a pandas operation, run it one step at a time and inspect the result.
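As a further sanity check, here is a small sketch (it assumes price is the only column unique to the training set) that confirms the two sets share the same feature columns:

# Train and test should differ only by the target column.
extra_cols   = set(Train_data.columns) - set(Test_data.columns)
missing_cols = set(Test_data.columns) - set(Train_data.columns)
print(extra_cols)    # expected: {'price'}
print(missing_cols)  # expected: set()
assert extra_cols == {'price'} and not missing_cols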

3. Data overview

  1. describe() gives the summary statistics of every column (count, mean, standard deviation std, min, the 25%/50%/75% quantiles, and max). It lets you quickly grasp the rough range of the data and helps with spotting anomalies; for example, values such as 999, 9999, or -1 may be another way of encoding NaN and deserve attention (see the sketch below).
  2. info() shows the type of each column, which helps reveal special symbols other than NaN.
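To make the sentinel idea concrete, here is a minimal sketch (the candidate values -1, 999, 9999 are illustrative assumptions, not values confirmed for this dataset) that counts how often each suspected stand-in for NaN appears in the numeric columns:

# Count suspected NaN stand-ins (sentinel values) in each numeric column.
SENTINELS = [-1, 999, 9999]  # hypothetical candidates; adjust per dataset

def sentinel_report(df):
    rows = []
    for col in df.select_dtypes(include=[np.number]).columns:
        for s in SENTINELS:
            n = int((df[col] == s).sum())
            if n > 0:
                rows.append((col, s, n))
    return pd.DataFrame(rows, columns=['column', 'sentinel', 'count'])

# Usage: sentinel_report(Train_data)
# Columns where a sentinel is implausibly frequent are candidates for
# Train_data[col] = Train_data[col].replace(s, np.nan)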

1) Use describe() to get familiar with the summary statistics

## 1) Use describe() to get familiar with the summary statistics
Train_data.describe()
Test_data.describe()

2) Use info() to get familiar with the data types

## 2) Use info() to get familiar with the data types
Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 40 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   SaleID             250000 non-null  int64  
 1   name               250000 non-null  int64  
 2   regDate            250000 non-null  int64  
 3   model              250000 non-null  float64
 4   brand              250000 non-null  int64  
 5   bodyType           224620 non-null  float64
 6   fuelType           227510 non-null  float64
 7   gearbox            236487 non-null  float64
 8   power              250000 non-null  int64  
 9   kilometer          250000 non-null  float64
 10  notRepairedDamage  201464 non-null  float64
 11  regionCode         250000 non-null  int64  
 12  seller             250000 non-null  int64  
 13  offerType          250000 non-null  int64  
 14  creatDate          250000 non-null  int64  
 15  price              250000 non-null  int64  
 16  v_0                250000 non-null  float64
 17  v_1                250000 non-null  float64
 18  v_2                250000 non-null  float64
 19  v_3                250000 non-null  float64
 20  v_4                250000 non-null  float64
 21  v_5                250000 non-null  float64
 22  v_6                250000 non-null  float64
 23  v_7                250000 non-null  float64
 24  v_8                250000 non-null  float64
 25  v_9                250000 non-null  float64
 26  v_10               250000 non-null  float64
 27  v_11               250000 non-null  float64
 28  v_12               250000 non-null  float64
 29  v_13               250000 non-null  float64
 30  v_14               250000 non-null  float64
 31  v_15               250000 non-null  float64
 32  v_16               250000 non-null  float64
 33  v_17               250000 non-null  float64
 34  v_18               250000 non-null  float64
 35  v_19               250000 non-null  float64
 36  v_20               250000 non-null  float64
 37  v_21               250000 non-null  float64
 38  v_22               250000 non-null  float64
 39  v_23               250000 non-null  float64
dtypes: float64(30), int64(10)
memory usage: 76.3 MB
Test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 39 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   SaleID             50000 non-null  int64  
 1   name               50000 non-null  int64  
 2   regDate            50000 non-null  int64  
 3   model              50000 non-null  float64
 4   brand              50000 non-null  int64  
 5   bodyType           44890 non-null  float64
 6   fuelType           45598 non-null  float64
 7   gearbox            47287 non-null  float64
 8   power              50000 non-null  int64  
 9   kilometer          50000 non-null  float64
 10  notRepairedDamage  40372 non-null  float64
 11  regionCode         50000 non-null  int64  
 12  seller             50000 non-null  int64  
 13  offerType          50000 non-null  int64  
 14  creatDate          50000 non-null  int64  
 15  v_0                50000 non-null  float64
 16  v_1                50000 non-null  float64
 17  v_2                50000 non-null  float64
 18  v_3                50000 non-null  float64
 19  v_4                50000 non-null  float64
 20  v_5                50000 non-null  float64
 21  v_6                50000 non-null  float64
 22  v_7                50000 non-null  float64
 23  v_8                50000 non-null  float64
 24  v_9                50000 non-null  float64
 25  v_10               50000 non-null  float64
 26  v_11               50000 non-null  float64
 27  v_12               50000 non-null  float64
 28  v_13               50000 non-null  float64
 29  v_14               50000 non-null  float64
 30  v_15               50000 non-null  float64
 31  v_16               50000 non-null  float64
 32  v_17               50000 non-null  float64
 33  v_18               50000 non-null  float64
 34  v_19               50000 non-null  float64
 35  v_20               50000 non-null  float64
 36  v_21               50000 non-null  float64
 37  v_22               50000 non-null  float64
 38  v_23               50000 non-null  float64
dtypes: float64(30), int64(9)
memory usage: 14.9 MB

4. Check for missing and abnormal data

1) Check the NaN situation of each column

## 1) Check the NaN situation of each column
Train_data.isnull().sum()
SaleID                   0
name                     0
regDate                  0
model                    0
brand                    0
bodyType             25380
fuelType             22490
gearbox              13513
power                    0
kilometer                0
notRepairedDamage    48536
regionCode               0
seller                   0
offerType                0
creatDate                0
price                    0
v_0                      0
v_1                      0
v_2                      0
v_3                      0
v_4                      0
v_5                      0
v_6                      0
v_7                      0
v_8                      0
v_9                      0
v_10                     0
v_11                     0
v_12                     0
v_13                     0
v_14                     0
v_15                     0
v_16                     0
v_17                     0
v_18                     0
v_19                     0
v_20                     0
v_21                     0
v_22                     0
v_23                     0
dtype: int64
Test_data.isnull().sum()
SaleID                  0
name                    0
regDate                 0
model                   0
brand                   0
bodyType             5110
fuelType             4402
gearbox              2713
power                   0
kilometer               0
notRepairedDamage    9628
regionCode              0
seller                  0
offerType               0
creatDate               0
v_0                     0
v_1                     0
v_2                     0
v_3                     0
v_4                     0
v_5                     0
v_6                     0
v_7                     0
v_8                     0
v_9                     0
v_10                    0
v_11                    0
v_12                    0
v_13                    0
v_14                    0
v_15                    0
v_16                    0
v_17                    0
v_18                    0
v_19                    0
v_20                    0
v_21                    0
v_22                    0
v_23                    0
dtype: int64

Visualize which columns contain NaN to get a direct view of how many missing values there are: if the count is small, filling is usually fine; if you use a tree model such as LightGBM, you can simply leave the NaNs and let the tree handle them; but if a column has too many NaNs, consider dropping it.
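A minimal sketch of turning that rule of thumb into code (the 40% and 5% thresholds are illustrative assumptions, not rules from the competition):

# Bucket columns by missing ratio into rough handling strategies.
missing_ratio = Train_data.isnull().mean().sort_values(ascending=False)
missing_ratio = missing_ratio[missing_ratio > 0]
for col, ratio in missing_ratio.items():
    if ratio > 0.40:      # very sparse: consider dropping the column
        plan = 'consider dropping'
    elif ratio > 0.05:    # moderate: leave NaN for tree models such as LightGBM
        plan = 'leave NaN for tree models'
    else:                 # tiny: a simple fill (mode/median) is usually safe
        plan = 'fill with mode/median'
    print(f'{col:20s} {ratio:6.2%} -> {plan}')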

# Visualize NaN counts
missing = Train_data.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()

# Visualize the missing values
msno.matrix(Train_data.sample(250))

msno.bar(Train_data.sample(1000))

# Visualize the missing values
msno.matrix(Test_data.sample(250))

msno.bar(Test_data.sample(1000))


The missing values in the test set look much like those in the training set: the visualization shows four columns with missing values, and notRepairedDamage has the most.
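To check that quantitatively, a small sketch aligning the missing ratios of both sets in one table (price exists only in the training set, so it drops out of the comparison):

# Compare missing-value ratios between train and test in one table.
na_compare = pd.DataFrame({
    'train': Train_data.isnull().mean(),
    'test':  Test_data.isnull().mean(),
})
print(na_compare[na_compare.sum(axis=1) > 0].sort_values('train', ascending=False))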

2) Outlier detection

The following two categorical features are severely skewed; they generally will not help the prediction, so they can be dropped.

Train_data["seller"].value_counts()
1    249999
0         1
Name: seller, dtype: int64
Train_data["offerType"].value_counts()
0    249991
1         9
Name: offerType, dtype: int64
del Train_data["seller"]
del Train_data["offerType"]
del Test_data["seller"]
del Test_data["offerType"]

5. Understand the distribution of the target

Train_data['price']
0           520
1          5500
2          1100
3          1200
4          3300
          ...  
249995     1200
249996     1200
249997    16500
249998    31950
249999     1990
Name: price, Length: 250000, dtype: int64
Train_data['price'].value_counts()
0        7312
500      3815
1500     3587
1000     3149
1200     3071
         ... 
22985       1
10695       1
20680       1
2243        1
46888       1
Name: price, Length: 4585, dtype: int64

1) Overall distribution (unbounded Johnson distribution, etc.)

## 1) Overall distribution (unbounded Johnson distribution, etc.)
import scipy.stats as st
y = Train_data['price']
plt.figure(1); plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)




Price does not follow a normal distribution, so it must be transformed before regression. Both the log transform and the unbounded Johnson transform fit well.
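As a quick numeric check on the log transform, a short sketch comparing the skewness of price before and after np.log1p:

# Skewness of the raw vs. log1p-transformed target.
print('raw skewness:   %.3f' % Train_data['price'].skew())            # ~3.54, heavy right tail
print('log1p skewness: %.3f' % np.log1p(Train_data['price']).skew())  # much closer to symmetric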

2) Check skewness and kurtosis

## 2) Check skewness and kurtosis
sns.distplot(Train_data['price']);
print("Skewness: %f" % Train_data['price'].skew())
print("Kurtosis: %f" % Train_data['price'].kurt())
Skewness: 3.535346
Kurtosis: 21.230678

Train_data.skew()
SaleID                0.001712
name                  0.513079
regDate              -1.540844
model                 1.499765
brand                 1.314846
bodyType             -0.070459
fuelType              0.701802
gearbox              -1.357379
power                58.590829
kilometer            -1.557472
notRepairedDamage    -2.312519
regionCode            0.690405
creatDate           -95.428563
price                 3.535346
v_0                  -1.504738
v_1                   1.582428
v_2                   1.198679
v_3                   1.352193
v_4                   0.217941
v_5                   2.052749
v_6                   0.090718
v_7                   0.823610
v_8                  -1.532964
v_9                   1.529931
v_10                 -2.584452
v_11                 -0.906428
v_12                 -2.842834
v_13                 -3.869655
v_14                  0.491706
v_15                  1.308716
v_16                  1.662893
v_17                  0.233318
v_18                  0.814453
v_19                  0.100073
v_20                  2.001253
v_21                  0.180020
v_22                  0.819133
v_23                  1.357847
dtype: float64
Train_data.kurt()
SaleID                  -1.201476
name                    -1.084474
regDate                 11.041006
model                    1.741896
brand                    1.814245
bodyType                -1.070358
fuelType                -1.495782
gearbox                 -0.157525
power                 4473.885260
kilometer                1.250933
notRepairedDamage        3.347777
regionCode              -0.352973
creatDate            11376.694263
price                   21.230678
v_0                      2.901641
v_1                      1.098703
v_2                      3.749872
v_3                      4.294578
v_4                      6.953348
v_5                      6.489791
v_6                     -0.564878
v_7                     -0.729838
v_8                      0.370812
v_9                      0.377943
v_10                     4.796855
v_11                     1.547812
v_12                     6.136342
v_13                    13.199575
v_14                    -1.597532
v_15                    -0.029594
v_16                     2.240928
v_17                     2.569341
v_18                     2.967738
v_19                     6.923953
v_20                     6.852809
v_21                    -0.759948
v_22                    -0.741708
v_23                     0.143713
dtype: float64
sns.distplot(Train_data.skew(),color='blue',axlabel ='Skewness')

sns.distplot(Train_data.kurt(),color='orange',axlabel ='Kurtosis')

3) Check the frequency counts of the target

## 3) Check the frequency counts of the target
plt.hist(Train_data['price'], orientation = 'vertical',histtype = 'bar', color ='red')
plt.show()


Looking at the frequency counts, values greater than 20000 are extremely rare; these can be treated as outliers and either filled in or dropped.
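A sketch of quantifying and treating that tail (the 20000 cutoff is read off the histogram above; clipping vs. dropping is a modeling choice, not a rule from the competition):

# How rare are prices above the cutoff, and two ways to treat them.
cutoff = 20000
n_out = (Train_data['price'] > cutoff).sum()
print('%d rows (%.2f%%) above %d' % (n_out, 100 * n_out / len(Train_data), cutoff))

# Option 1: cap extreme prices at the cutoff.
price_clipped = Train_data['price'].clip(upper=cutoff)
# Option 2: drop those rows entirely.
# Train_data = Train_data[Train_data['price'] <= cutoff]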

# The distribution after a log transform is much more even; log-transforming the target is a common trick for prediction problems
plt.hist(np.log1p(Train_data['price']), orientation = 'vertical',histtype = 'bar', color ='red') 
plt.show()

6. Split the features into categorical and numeric, and check the unique-value distribution of the categorical features

# Separate the label (the target)
Y_train = Train_data['price']
numeric_features = ['price', 'power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13','v_14',
'v_15', 'v_16', 'v_17', 'v_18', 'v_19', 'v_20', 'v_21', 'v_22', 'v_23']

categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'regionCode']
# nunique distribution of each categorical feature (training set)
for cat_fea in categorical_features:
    print("Distribution of {}:".format(cat_fea))
    print("{} has {} distinct values".format(cat_fea, Train_data[cat_fea].nunique()))
    print(Train_data[cat_fea].value_counts())
# nunique distribution of each categorical feature (test set)
for cat_fea in categorical_features:
    print("Distribution of {}:".format(cat_fea))
    print("{} has {} distinct values".format(cat_fea, Test_data[cat_fea].nunique()))
    print(Test_data[cat_fea].value_counts())

7. Numeric feature analysis

numeric_features
['price',
 'power',
 'kilometer',
 'v_0',
 'v_1',
 'v_2',
 'v_3',
 'v_4',
 'v_5',
 'v_6',
 'v_7',
 'v_8',
 'v_9',
 'v_10',
 'v_11',
 'v_12',
 'v_13',
 'v_14',
 'v_15',
 'v_16',
 'v_17',
 'v_18',
 'v_19',
 'v_20',
 'v_21',
 'v_22',
 'v_23']

1) Correlation analysis

## 1) Correlation analysis
price_numeric = Train_data[numeric_features]
correlation = price_numeric.corr()
print(correlation['price'].sort_values(ascending = False),'\n')
price        1.000000
v_0          0.514477
v_11         0.481618
v_15         0.421987
power        0.189456
v_8          0.183505
v_10         0.163891
v_12         0.129570
v_13         0.114883
v_22         0.092171
v_7          0.090440
v_14         0.075673
v_19         0.033689
v_20         0.026216
v_4          0.004413
v_23        -0.018156
v_2         -0.018823
v_6         -0.036826
v_5         -0.039637
v_21        -0.052654
v_17        -0.118299
v_9         -0.165831
v_1         -0.207255
v_16        -0.345840
kilometer   -0.404961
v_18        -0.556144
v_3         -0.595468
Name: price, dtype: float64 
f , ax = plt.subplots(figsize = (7, 7))

plt.title('Correlation of Numeric Features with Price',y=1,size=16)

sns.heatmap(correlation,square = True,  vmax=0.8)

del price_numeric['price']

2) Check the skewness and kurtosis of several features

## 2) Check the skewness and kurtosis of several features
for col in numeric_features:
    print('{:15}'.format(col), 
          'Skewness: {:05.2f}'.format(Train_data[col].skew()) , 
          '   ' ,
          'Kurtosis: {:06.2f}'.format(Train_data[col].kurt())  
         )
price           Skewness: 03.54     Kurtosis: 021.23
power           Skewness: 58.59     Kurtosis: 4473.89
kilometer       Skewness: -1.56     Kurtosis: 001.25
v_0             Skewness: -1.50     Kurtosis: 002.90
v_1             Skewness: 01.58     Kurtosis: 001.10
v_2             Skewness: 01.20     Kurtosis: 003.75
v_3             Skewness: 01.35     Kurtosis: 004.29
v_4             Skewness: 00.22     Kurtosis: 006.95
v_5             Skewness: 02.05     Kurtosis: 006.49
v_6             Skewness: 00.09     Kurtosis: -00.56
v_7             Skewness: 00.82     Kurtosis: -00.73
v_8             Skewness: -1.53     Kurtosis: 000.37
v_9             Skewness: 01.53     Kurtosis: 000.38
v_10            Skewness: -2.58     Kurtosis: 004.80
v_11            Skewness: -0.91     Kurtosis: 001.55
v_12            Skewness: -2.84     Kurtosis: 006.14
v_13            Skewness: -3.87     Kurtosis: 013.20
v_14            Skewness: 00.49     Kurtosis: -01.60
v_15            Skewness: 01.31     Kurtosis: -00.03
v_16            Skewness: 01.66     Kurtosis: 002.24
v_17            Skewness: 00.23     Kurtosis: 002.57
v_18            Skewness: 00.81     Kurtosis: 002.97
v_19            Skewness: 00.10     Kurtosis: 006.92
v_20            Skewness: 02.00     Kurtosis: 006.85
v_21            Skewness: 00.18     Kurtosis: -00.76
v_22            Skewness: 00.82     Kurtosis: -00.74
v_23            Skewness: 01.36     Kurtosis: 000.14

3) Visualize the distribution of each numeric feature

## 3) Visualize the distribution of each numeric feature
f = pd.melt(Train_data, value_vars=numeric_features)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")

4) Visualize the pairwise relationships among numeric features

## 4) Visualize the pairwise relationships among numeric features
sns.set()
columns = ['price', 'v_0', 'v_11' , 'v_15', 'power', 'v_8',  'v_10', 'v_12', 'v_13', 'v_22', 'v_7', 'v_14', 'v_19', 'v_20']
sns.pairplot(Train_data[columns], height=2, kind='scatter', diag_kind='kde')  # `size=` was renamed `height=` in newer seaborn
plt.show()

5) Visualize multivariate regression relationships

## 5) Visualize multivariate regression relationships
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8), (ax9, ax10)) = plt.subplots(nrows=5, ncols=2, figsize=(24, 20))
# ['v_12', 'v_8' , 'v_0', 'power', 'v_5',  'v_2', 'v_6', 'v_1', 'v_14']
v_12_scatter_plot = pd.concat([Y_train,Train_data['v_12']],axis = 1)
sns.regplot(x='v_12',y = 'price', data = v_12_scatter_plot,scatter= True, fit_reg=True, ax=ax1)

v_8_scatter_plot = pd.concat([Y_train,Train_data['v_8']],axis = 1)
sns.regplot(x='v_8',y = 'price',data = v_8_scatter_plot,scatter= True, fit_reg=True, ax=ax2)

v_0_scatter_plot = pd.concat([Y_train,Train_data['v_0']],axis = 1)
sns.regplot(x='v_0',y = 'price',data = v_0_scatter_plot,scatter= True, fit_reg=True, ax=ax3)

power_scatter_plot = pd.concat([Y_train,Train_data['power']],axis = 1)
sns.regplot(x='power',y = 'price',data = power_scatter_plot,scatter= True, fit_reg=True, ax=ax4)

v_5_scatter_plot = pd.concat([Y_train,Train_data['v_5']],axis = 1)
sns.regplot(x='v_5',y = 'price',data = v_5_scatter_plot,scatter= True, fit_reg=True, ax=ax5)

v_2_scatter_plot = pd.concat([Y_train,Train_data['v_2']],axis = 1)
sns.regplot(x='v_2',y = 'price',data = v_2_scatter_plot,scatter= True, fit_reg=True, ax=ax6)

v_6_scatter_plot = pd.concat([Y_train,Train_data['v_6']],axis = 1)
sns.regplot(x='v_6',y = 'price',data = v_6_scatter_plot,scatter= True, fit_reg=True, ax=ax7)

v_1_scatter_plot = pd.concat([Y_train,Train_data['v_1']],axis = 1)
sns.regplot(x='v_1',y = 'price',data = v_1_scatter_plot,scatter= True, fit_reg=True, ax=ax8)

v_14_scatter_plot = pd.concat([Y_train,Train_data['v_14']],axis = 1)
sns.regplot(x='v_14',y = 'price',data = v_14_scatter_plot,scatter= True, fit_reg=True, ax=ax9)

v_13_scatter_plot = pd.concat([Y_train,Train_data['v_13']],axis = 1)
sns.regplot(x='v_13',y = 'price',data = v_13_scatter_plot,scatter= True, fit_reg=True, ax=ax10)

8. Categorical feature analysis

categorical_features
['name',
 'model',
 'brand',
 'bodyType',
 'fuelType',
 'gearbox',
 'notRepairedDamage',
 'regionCode']

1) Unique-value distribution

## 1) Unique-value distribution
for fea in categorical_features:
    print(Train_data[fea].nunique())
164312
251
40
8
7
2
2
8081

2) Box plots of the categorical features

## 2) Box plots of the categorical features

# name and regionCode have far too many sparse categories, so here we only plot the non-sparse ones
categorical_features = ['model',
 'brand',
 'bodyType',
 'fuelType',
 'gearbox',
 'notRepairedDamage']
for c in categorical_features:
    Train_data[c] = Train_data[c].astype('category')
    if Train_data[c].isnull().any():
        Train_data[c] = Train_data[c].cat.add_categories(['MISSING'])
        Train_data[c] = Train_data[c].fillna('MISSING')

def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)
    x=plt.xticks(rotation=90)

f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, size=5)
g = g.map(boxplot, "value", "price")

3) Violin plots of the categorical features

## 3) Violin plots of the categorical features
catg_list = categorical_features
target = 'price'
for catg in catg_list :
    sns.violinplot(x=catg, y=target, data=Train_data)
    plt.show()






4) Bar plots of the categorical features

## 4) Bar plots of the categorical features
def bar_plot(x, y, **kwargs):
    sns.barplot(x=x, y=y)
    x=plt.xticks(rotation=90)

f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(bar_plot, "value", "price")

5) Frequency counts of each category of the categorical features (count_plot)

##  5) Frequency counts of each category of the categorical features (count_plot)
def count_plot(x,  **kwargs):
    sns.countplot(x=x)
    x=plt.xticks(rotation=90)

f = pd.melt(Train_data,  value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(count_plot, "value")

9. Generate a data report with pandas_profiling

Use pandas_profiling to generate a fairly comprehensive visual data report (simple and convenient); when it finishes, just open the generated HTML file.

import pandas_profiling

pfr = pandas_profiling.ProfileReport(Train_data)
pfr.to_file("./example.html")

Reposted from: https://blog.csdn.net/X0420C/article/details/115770557