【python数据分析入门到高级】:数据清洗

数据清洗

在上一篇文章中,介绍了如何使用python导入数据,导入数据后的第二步往往就是数据清洗,下面我们来看看如何使用pandas进行数据清洗工作

导入相关库

import pandas as pd  dataframe = pd.read_csv(r'C:/Users/DELL/data-science-learning/python数据分析笔记/探索性数据分析/train.csv') dataframe.head(5)
.dataframe tbody tr th {     vertical-align: top; }  .dataframe thead th {     text-align: right; }

</style>

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

</div>

1.总览数据

  • 查看数据纬度

dataframe.shape
(891, 12)
  • 描述性统计分析

dataframe.describe()
.dataframe tbody tr th {     vertical-align: top; }  .dataframe thead th {     text-align: right; }

</style>

PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

</div>

2.筛选数据

过滤所有女性和年龄大于60岁的乘客

dataframe[(dataframe['Sex'] == 'female') & (dataframe['Age']>=60)]
.dataframe tbody tr th {     vertical-align: top; }  .dataframe thead th {     text-align: right; }

</style>

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
275 276 1 1 Andrews, Miss. Kornelia Theodosia female 63.0 1 0 13502 77.9583 D7 S
366 367 1 1 Warren, Mrs. Frank Manley (Anna Sophia Atkinson) female 60.0 1 0 110813 75.2500 D37 C
483 484 1 3 Turkula, Mrs. (Hedwig) female 63.0 0 0 4134 9.5875 NaN S
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0000 B28 NaN

</div>

可以看出,一共有四名年龄大于60岁的女性乘客

3.替换数据

# 将female换成woman,将male换成man
dataframe['Sex'].replace(['female','male'],['woman','man']).head(5)
0      man
1    woman
2    woman
3    woman
4      man
Name: Sex, dtype: object

4.更改列

dataframe.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
# 重命名列
dataframe.rename(columns={'PClass': 'Passenger Class'}).head()
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

</div>

dataframe.rename(columns={'PClass':'Passenger Class','Sex':'Gender'}).head()
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

PassengerId Survived Pclass Name Gender Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

</div>

dataframe.head()
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

</div>

5.查找唯一值

pandas中,我们可以使用unique()查找唯一值

# 查找唯一值
dataframe['Sex'].unique()
array(['male', 'female'], dtype=object)
# 显示唯一值出现的个数
dataframe['Sex'].value_counts()
male      577
female    314
Name: Sex, dtype: int64
# 查找类型票的数量
dataframe['Pclass'].value_counts()
3    491
1    216
2    184
Name: Pclass, dtype: int64
# 查找唯一值的种类
dataframe['Pclass'].nunique()
3

6.查找缺失值

# 查找空数据
dataframe[dataframe['Age'].isnull()].head()
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
17 18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C
28 29 1 3 O'Dwyer, Miss. Ellen "Nellie" female NaN 0 0 330959 7.8792 NaN Q

</div>

pandas没有NaN 如果想要处理的话必须导入numpy

import numpy as np
dataframe['Sex'].replace('male',np.nan).head()
0       NaN
1    female
2    female
3    female
4       NaN
Name: Sex, dtype: object

7.删除列或行

# 删除一列,采用drop方法,并传入参数axis
dataframe.drop('Age',axis=1).head()
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

PassengerId Survived Pclass Name Sex SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 0 0 373450 8.0500 NaN S

</div>

#删除行
dataframe.drop(1)
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

Name PClass Age Sex Survived SexCode
0 Allen, Miss Elisabeth Walton 1st 29.00 female 1 1
2 Allison, Mr Hudson Joshua Creighton 1st 30.00 male 0 0
3 Allison, Mrs Hudson JC (Bessie Waldo Daniels) 1st 25.00 female 0 1
4 Allison, Master Hudson Trevor 1st 0.92 male 1 0
5 Anderson, Mr Harry 1st 47.00 male 1 0
6 Andrews, Miss Kornelia Theodosia 1st 63.00 female 1 1
7 Andrews, Mr Thomas, jr 1st 39.00 male 0 0
8 Appleton, Mrs Edward Dale (Charlotte Lamson) 1st 58.00 female 1 1
9 Artagaveytia, Mr Ramon 1st 71.00 male 0 0
10 Astor, Colonel John Jacob 1st 47.00 male 0 0
11 Astor, Mrs John Jacob (Madeleine Talmadge Force) 1st 19.00 female 1 1
12 Aubert, Mrs Leontine Pauline 1st NaN female 1 1
13 Barkworth, Mr Algernon H 1st NaN male 1 0
14 Baumann, Mr John D 1st NaN male 0 0
15 Baxter, Mrs James (Helene DeLaudeniere Chaput) 1st 50.00 female 1 1
16 Baxter, Mr Quigg Edmond 1st 24.00 male 0 0
17 Beattie, Mr Thomson 1st 36.00 male 0 0
18 Beckwith, Mr Richard Leonard 1st 37.00 male 1 0
19 Beckwith, Mrs Richard Leonard (Sallie Monypeny) 1st 47.00 female 1 1
20 Behr, Mr Karl Howell 1st 26.00 male 1 0
21 Birnbaum, Mr Jakob 1st 25.00 male 0 0
22 Bishop, Mr Dickinson H 1st 25.00 male 1 0
23 Bishop, Mrs Dickinson H (Helen Walton) 1st 19.00 female 1 1
24 Bjornstrm-Steffansson, Mr Mauritz Hakan 1st 28.00 male 1 0
25 Blackwell, Mr Stephen Weart 1st 45.00 male 0 0
26 Blank, Mr Henry 1st 39.00 male 1 0
27 Bonnell, Miss Caroline 1st 30.00 female 1 1
28 Bonnell, Miss Elizabeth 1st 58.00 female 1 1
29 Borebank, Mr John James 1st NaN male 0 0
30 Bowen, Miss Grace Scott 1st 45.00 female 1 1
... ... ... ... ... ... ...
1283 Vestrom, Miss Hulda Amanda Adolfina 3rd 14.00 female 0 1
1284 Vonk, Mr Jenko 3rd 22.00 male 0 0
1285 Ware, Mr Frederick 3rd NaN male 0 0
1286 Warren, Mr Charles William 3rd NaN male 0 0
1287 Wazli, Mr Yousif 3rd NaN male 0 0
1288 Webber, Mr James 3rd NaN male 0 0
1289 Wennerstrom, Mr August Edvard 3rd NaN male 1 0
1290 Wenzel, Mr Linhart 3rd NaN male 0 0
1291 Widegren, Mr Charles Peter 3rd 51.00 male 0 0
1292 Wiklund, Mr Jacob Alfred 3rd 18.00 male 0 0
1293 Wilkes, Mrs Ellen 3rd 45.00 female 1 1
1294 Willer, Mr Aaron 3rd NaN male 0 0
1295 Willey, Mr Edward 3rd NaN male 0 0
1296 Williams, Mr Howard Hugh 3rd NaN male 0 0
1297 Williams, Mr Leslie 3rd 28.00 male 0 0
1298 Windelov, Mr Einar 3rd 21.00 male 0 0
1299 Wirz, Mr Albert 3rd 27.00 male 0 0
1300 Wiseman, Mr Phillippe 3rd NaN male 0 0
1301 Wittevrongel, Mr Camiel 3rd 36.00 male 0 0
1302 Yalsevac, Mr Ivan 3rd NaN male 1 0
1303 Yasbeck, Mr Antoni 3rd 27.00 male 0 0
1304 Yasbeck, Mrs Antoni 3rd 15.00 female 1 1
1305 Youssef, Mr Gerios 3rd NaN male 0 0
1306 Zabour, Miss Hileni 3rd NaN female 0 1
1307 Zabour, Miss Tamini 3rd NaN female 0 1
1308 Zakarian, Mr Artun 3rd 27.00 male 0 0
1309 Zakarian, Mr Maprieder 3rd 26.00 male 0 0
1310 Zenni, Mr Philip 3rd 22.00 male 0 0
1311 Lievens, Mr Rene 3rd 24.00 male 0 0
1312 Zimmerman, Leo 3rd 29.00 male 0 0

1312 rows × 6 columns

</div>

dataframe.head()
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

Name PClass Age Sex Survived SexCode
0 Allen, Miss Elisabeth Walton 1st 29.00 female 1 1
1 Allison, Miss Helen Loraine 1st 2.00 female 0 1
2 Allison, Mr Hudson Joshua Creighton 1st 30.00 male 0 0
3 Allison, Mrs Hudson JC (Bessie Waldo Daniels) 1st 25.00 female 0 1
4 Allison, Master Hudson Trevor 1st 0.92 male 1 0

</div>

# 删除重复行 使用subset参数指明要删除的列
dataframe.drop_duplicates(subset='Sex').head()
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

Name PClass Age Sex Survived SexCode
0 Allen, Miss Elisabeth Walton 1st 29.0 female 1 1
2 Allison, Mr Hudson Joshua Creighton 1st 30.0 male 0 0

</div>

8. groupby分组

计算男性和女性的平均值 思路一,将所有男性和女性的条件进行选取分别计算

man = dataframe[dataframe['Sex']=='male']
woman = dataframe[dataframe['Sex']=='female']
print(man.mean())
print(woman.mean())
Age         31.014338
Survived     0.166863
SexCode      0.000000
dtype: float64
Age         29.396424
Survived     0.666667
SexCode      1.000000
dtype: float64
# 思路二,用groupby方法简化
dataframe.groupby('Sex').mean()
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

Age Survived SexCode
Sex
female 29.396424 0.666667 1.0
male 31.014338 0.166863 0.0

</div>

# 按行分组,计算行数
dataframe.groupby('Sex')['Name'].count()
Sex
female    462
male      851
Name: Name, dtype: int64
dataframe.groupby(['Sex','Survived']).mean()
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

PassengerId Pclass Age SibSp Parch Fare
Sex Survived
female 0 434.851852 2.851852 25.046875 1.209877 1.037037 23.024385
1 429.699571 1.918455 28.847716 0.515021 0.515021 51.938573
male 0 449.121795 2.476496 31.618056 0.440171 0.207265 21.960993
1 475.724771 2.018349 27.276022 0.385321 0.357798 40.821484

</div>

9.按照时间段来进行分组

  • 使用resample参数来进行

# 创建时期范围
time_index = pd.date_range('06/06/2017', periods=100000, freq='30S') # periods表示有多少数据,freq表示步长
dataframe = pd.DataFrame(index=time_index)
# 创建一个随机变量
dataframe['Sale_Amout'] = np.random.randint(1, 10, 100000)
# resample 参数,按周对行分组,计算每一周的总和
dataframe.resample('W').sum()
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

Sale_Amout
2017-06-11 86292
2017-06-18 100359
2017-06-25 100907
2017-07-02 100868
2017-07-09 100522
2017-07-16 10478

</div>

# 使用resample可以按一组时间间隔来进行分组,然后计算每一个时间组的某个统计量
dataframe.resample('2W').mean()
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

Sale_Amout
2017-06-11 4.993750
2017-06-25 4.991716
2017-07-09 4.994792
2017-07-23 5.037500

</div>

dataframe.resample('M').count()
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

Sale_Amout
2017-06-30 72000
2017-07-31 28000

</div>

# resample默认是以最后一个数据作 使用label参数可以进行调整
dataframe.resample('M', label='left').count()
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

Sale_Amout
2017-05-31 72000
2017-06-30 28000

</div>

10.遍历一个列的数据

dataframe = pd.read_csv(url)
# 以大写的形势打印前两行的名字
for name in dataframe['Name'][0:2]:
    print(name.upper())
ALLEN, MISS ELISABETH WALTON
ALLISON, MISS HELEN LORAINE

11.对一列的所有元素应用某个函数

def uppercase(x):
    return x.upper()
dataframe['Name'].apply(uppercase)[0:2]
0    ALLEN, MISS ELISABETH WALTON
1     ALLISON, MISS HELEN LORAINE
Name: Name, dtype: object

12. pandas高级函数

dataframe.groupby('Sex').apply(lambda x:x.count())
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

Name PClass Age Sex Survived SexCode
Sex
female 462 462 288 462 462 462
male 851 851 468 851 851 851

</div>

通过联合使用groupby  apply,我们就能计算自定义的统计量 例如上式我们发现age、cabin具有大量的缺失值

13. 连接多个Dataframe

data_a = {'id':['1', '2', '3'],           'first': ['Alex', 'Amy', 'Allen'],           'last': ['Anderson', 'Ackerman', 'Ali']} dataframe_a = pd.DataFrame(data_a, columns=['id','first', 'last'])
data_b = {'id':['4', '5', '6'],           'first': ['Billy', 'Brian', 'Bran'],           'last': ['Bonder', 'Black', 'Balwner']} dataframe_b = pd.DataFrame(data_b, columns=['id','first', 'last'])
 pd.concat([dataframe_a, dataframe_b], axis=0)#在行的方向进行
.dataframe tbody tr th {     vertical-align: top; }  .dataframe thead th {     text-align: right; }

</style>

id first last
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
0 4 Billy Bonder
1 5 Brian Black
2 6 Bran Balwner

</div>

pd.concat([dataframe_a, dataframe_b], axis=1)#在列的方向进行
.dataframe tbody tr th {     vertical-align: top; }  .dataframe thead th {     text-align: right; }

</style>

id first last id first last
0 1 Alex Anderson 4 Billy Bonder
1 2 Amy Ackerman 5 Brian Black
2 3 Allen Ali 6 Bran Balwner

</div>

# 也可以用append方法进行添加 c = pd.Series([10, 'Chris', 'Chillon'], index=['id','first','last'])
dataframe.append(c, ignore_index=True)#如果c原来有名字忽略
.dataframe tbody tr th {     vertical-align: top; }  .dataframe thead th {     text-align: right; }

</style>

id first last
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
3 10 Chris Chillon

</div>

python数据分析入门到高级 文章被收录于专栏

分析python数据分析知识

全部评论
感谢楼主分享
点赞 回复 分享
发布于 2022-08-12 17:21

相关推荐

投递京东等公司9个岗位
点赞 评论 收藏
分享
评论
点赞
8
分享

创作者周榜

更多
牛客网
牛客网在线编程
牛客网题解
牛客企业服务