DataFrame Operations (Extended)

Published: 2025-06-24

Last updated: 2025-06-30

1. Merging

merge combines two DataFrame objects on one or more keys, similar to a JOIN in SQL.

`pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)`

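Parameters:

left: the left DataFrame.
right: the right DataFrame.
how: the type of merge: 'inner', 'outer', 'left', or 'right'. Defaults to 'inner'.
- 'inner': inner join; keeps only the rows whose keys appear in both DataFrames.
- 'outer': outer join; keeps the rows for all keys from both DataFrames.
- 'left': left join; keeps every row of the left DataFrame, plus the matching data from the right DataFrame.
- 'right': right join; keeps every row of the right DataFrame, plus the matching data from the left DataFrame.
on: the column name(s) to join on. If no key is specified, the columns that have the same name in both DataFrames are used.
left_on and right_on: the join column names in the left and right DataFrame respectively, for cases where the key columns are named differently.
left_index and right_index: booleans; whether to use the index of the left or right DataFrame as the join key.
sort: boolean; whether to sort the result by the join keys.
suffixes: a tuple of suffixes appended to overlapping column names from the left and right DataFrame, in that order.
copy: boolean; if False, pandas avoids copying data where it can. merge returns a new DataFrame either way and does not modify the inputs.
indicator: if True, adds a column named _merge to the result that records where each row came from (left_only, right_only, or both).
validate: checks that the merge is of a given type, such as 'one_to_one', 'one_to_many', 'many_to_one', or 'many_to_many'.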

import pandas as pd
 
# Create two example DataFrames
left = pd.DataFrame({
    'key': ['K0', 'K1', 'K2', 'K3'],
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3']
})
 
right = pd.DataFrame({
    'key': ['K0', 'K1', 'K2', 'K4'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']
})
 
# Inner join: keep only the rows whose key appears in both tables; all other rows are dropped
result = pd.merge(left, right, on='key')
print(result)

Output:

  key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2

# Outer join: keep the rows for every key from either table; fill the gaps with missing values (NaN)
result1 = pd.merge(left, right, how='outer', on='key')
print(result1)

Output:

  key    A    B    C    D
0  K0   A0   B0   C0   D0
1  K1   A1   B1   C1   D1
2  K2   A2   B2   C2   D2
3  K3   A3   B3  NaN  NaN
4  K4  NaN  NaN   C3   D3
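
The outer join is where the indicator parameter tends to be most useful, because the result mixes matched and unmatched rows. The following is a minimal sketch with trimmed-down versions of the two example frames (only columns A and C are kept, which is a simplification made here); it shows how the _merge column can be used to find keys that exist on one side only.

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'], 'A': ['A0', 'A1', 'A2', 'A3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K4'], 'C': ['C0', 'C1', 'C2', 'C3']})

# indicator=True adds a categorical column named "_merge" with the values
# 'left_only', 'right_only' or 'both'
merged = pd.merge(left, right, how='outer', on='key', indicator=True)
print(merged)

# Keep only the rows whose key appears in the left frame but not in the right one
left_only = merged[merged['_merge'] == 'left_only']
print(left_only['key'].tolist())
```

Filtering on _merge in this way is a common pattern for an "anti-join", which pandas does not offer as a separate how option.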

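# Right join: keep every row of the right table and drop left-table rows with no match;
# left-table columns are filled with NaN where the right key has no match
result2 = pd.merge(left, right, how='right', on='key')
print(result2)

Output:

  key    A    B   C   D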
0  K0   A0   B0  C0  D0
1  K1   A1   B1  C1  D1
2  K2   A2   B2  C2  D2
3  K4  NaN  NaN  C3  D3

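# Left join: keep every row of the left table and drop right-table rows with no match;
# right-table columns are filled with NaN where the left key has no match
result3 = pd.merge(left, right, how='left', on='key')
print(result3)

Output:

  key   A   B    C    D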
0  K0  A0  B0   C0   D0
1  K1  A1  B1   C1   D1
2  K2  A2  B2   C2   D2
3  K3  A3  B3  NaN  NaN
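
The examples so far all join on a column with the same name in both tables. The sketch below covers the remaining parameters from the list: left_on/right_on for differently named keys, suffixes for overlapping column names, and validate. The orders and customers frames and their column names are hypothetical data invented for this illustration.

```python
import pandas as pd

# Hypothetical data: the key column is named differently in each frame,
# and both frames have a column called "value".
orders = pd.DataFrame({'order_id': [1, 2, 3],
                       'cust': ['c1', 'c2', 'c1'],
                       'value': [10, 20, 30]})
customers = pd.DataFrame({'customer_id': ['c1', 'c2'],
                          'value': ['gold', 'silver']})

result = pd.merge(
    orders, customers,
    how='left',
    left_on='cust', right_on='customer_id',   # differently named key columns
    suffixes=('_order', '_customer'),         # applied to the clashing "value" columns
    validate='many_to_one',                   # each customer_id must be unique on the right
)
print(result)

# validate raises MergeError when the keys do not fit the declared relationship:
# "cust" contains c1 twice, so this is not a one-to-one merge.
try:
    pd.merge(orders, customers, left_on='cust', right_on='customer_id',
             validate='one_to_one')
except pd.errors.MergeError as exc:
    print('validation failed:', exc)
```

When the key columns have different names, both of them are kept in the result, so one of the two is usually dropped afterwards.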

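2. Random Sampling

`DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)`

Parameters:

n: the number of rows to draw (or columns, when axis=1). It cannot be combined with frac.
frac: the fraction to draw; for example, frac=0.5 draws 50% of the data.
replace: boolean; whether to sample with replacement. Defaults to False, meaning an item that has been drawn is not put back.
weights: optional; the weight of each sample, given as a column name (string) or an array-like. By default every sample is equally likely.
random_state: optional; the seed for the random number generator. With the default None, each call generally returns a different sample; passing an integer such as 1 makes the sample reproducible, so the same call returns the same data every time.
axis: the axis to sample along (axis=0 samples rows, axis=1 samples columns).

It is enough to understand the basic idea here; sampling is not used very often.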

import pandas as pd
 
# Create an example DataFrame
data = {
        'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
        'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
        'C': [1, 2, 3, 4, 5, 6, 7, 8],
}
 
df = pd.DataFrame(data)
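# Draw one random row
h = df.sample(n=1, axis=0)
print(h)

# Draw a random 50% of the rows
ff = df.sample(frac=0.5, axis=0)
print(ff)

# Draw one random column
hh = df.sample(n=1, axis=1)
print(hh)

Output (varies from run to run, since the sample is random):

     A      B  C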
7  foo  three  8
     A      B  C
6  foo    one  7
3  bar  three  4
2  foo    two  3
4  foo    two  5
       B
0    one
1    one
2    two
3  three
4    two
5    two
6    one
7  three
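
The parameters that the example above leaves out (random_state, replace, weights) can be sketched as follows. The single-column frame and the weight values are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'C': [1, 2, 3, 4, 5, 6, 7, 8]})

# The same seed gives the same rows on every call
s1 = df.sample(n=3, random_state=42)
s2 = df.sample(n=3, random_state=42)
print(s1.equals(s2))  # True

# With replace=True, more rows can be drawn than the frame contains,
# because a row may be picked more than once
boot = df.sample(n=12, replace=True, random_state=0)
print(len(boot))

# weights: rows with weight 0 are never selected, so only C >= 5 can appear
w = [0, 0, 0, 0, 1, 1, 1, 1]
heavy = df.sample(n=2, weights=w, random_state=0)
print(heavy)
```

Fixing random_state is mainly useful when a result has to be reproduced later, for example in a test or a tutorial.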

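3. Handling Missing Values

3.1 Detecting Missing Values

isnull() detects missing values in a DataFrame or Series and returns a boolean DataFrame or Series of the same shape.

notnull() detects non-missing values in a DataFrame or Series and returns a boolean DataFrame or Series of the same shape.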

import pandas as pd
 
# Create an example DataFrame
data = {
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': [1, 2, 3, 4, 5, 6, None, 8],
}

df = pd.DataFrame(data)

# Same shape as the input; True marks a missing value
print(df.isnull())

# Same shape as the input; True marks a non-missing value
print(df.notnull())

Output:

       A      B      C
0  False  False  False
1  False  False  False
2  False  False  False
3  False  False  False
4  False  False  False
5  False  False  False
6  False  False   True
7  False  False  False
      A     B      C
0  True  True   True
1  True  True   True
2  True  True   True
3  True  True   True
4  True  True   True
5  True  True   True
6  True  True  False
7  True  True   True
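
A boolean mask is rarely the end goal; it is usually reduced to a count or used as a filter. The sketch below shows both on a small made-up frame.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', None, 'foo'],
    'C': [1, 2, None],
})

# True counts as 1 when summed, so this gives the number of
# missing values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# any(axis=1) is True for a row with at least one missing value;
# using it as a mask selects exactly those rows
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)
```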

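3.2 Filling Missing Values

fillna() replaces missing values with a specified value. Missing values can also be filled from neighbouring rows (forward fill).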

import pandas as pd
 
# Create an example DataFrame
data = {
    'A': ['foo', 'bar', 'foo', 'bar', None, 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', None, 'two', 'two', 'one', 'three'],
    'C': [1, 2, 3, 4, 5, 6, None, 8],
}

df = pd.DataFrame(data)

# Fill with a fixed value
h = df.fillna(0)
print(h)

# Forward fill: each missing value takes the last valid value above it.
# ffill() replaces fillna(method='ffill'), which is deprecated since pandas 2.1.
h1 = df.ffill()
print(h1)

Output:

     A      B    C
0  foo    one  1.0
1  bar    one  2.0
2  foo    two  3.0
3  bar      0  4.0
4    0    two  5.0
5  bar    two  6.0
6  foo    one  0.0
7  foo  three  8.0
     A      B    C
0  foo    one  1.0
1  bar    one  2.0
2  foo    two  3.0
3  bar    two  4.0
4  bar    two  5.0
5  bar    two  6.0
6  foo    one  6.0
7  foo  three  8.0
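
Filling every column with the same value, as fillna(0) does above, puts a 0 into the text columns as well. fillna() also accepts a dict that maps column names to fill values, which allows a different value per column. A minimal sketch on made-up data, where the placeholder 'unknown' and the use of the column mean are choices made for this illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', None, 'foo'],
    'C': [1.0, None, 3.0],
})

# Text column: fill with a placeholder string.
# Numeric column: fill with the mean of the values that are present.
filled = df.fillna({'A': 'unknown', 'C': df['C'].mean()})
print(filled)
```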

3.3 Dropping Missing Values

dropna() removes the rows or columns that contain missing values.

import pandas as pd
 
# Create an example DataFrame
data = {
    'A': ['foo', 'bar', 'foo', 'bar', None, 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', None, 'two', 'two', 'one', 'three'],
    'C': [1, 2, 3, 4, 5, 6, None, 8],
}

df = pd.DataFrame(data)

# Drop the rows that contain missing values (axis=0 is the default)
h = df.dropna(axis=0)
print(h)

# Drop the columns that contain missing values;
# here every column has one, so no columns remain
v = df.dropna(axis=1)
print(v)

Output:

     A      B    C
0  foo    one  1.0
1  bar    one  2.0
2  foo    two  3.0
5  bar    two  6.0
7  foo  three  8.0
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7]
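
Dropping every row or column that has any missing value can discard a lot of data, as the empty result above shows. dropna() has three parameters that narrow down what is removed: how, subset and thresh. The sketch below uses a small made-up frame.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', None, None],
    'B': ['one', None, None, None],
    'C': [1.0, 2.0, 3.0, None],
})

# how='all': drop a row only if every value in it is missing (row 3 here)
only_all_missing = df.dropna(how='all')

# subset: look for missing values only in the listed columns
by_subset = df.dropna(subset=['B'])

# thresh: keep the rows that have at least this many non-missing values
at_least_two = df.dropna(thresh=2)

print(only_all_missing.index.tolist())
print(by_subset.index.tolist())
print(at_least_two.index.tolist())
```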

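4. Reading and Writing CSV Files

A CSV file sits between a .txt file and an Excel file: a .txt file is plain text, and an Excel file is a spreadsheet.

A CSV file can be opened in a text editor such as Notepad, or in Excel, where it is displayed as a table. Viewed as text, the columns are separated by commas; opened in Excel, the same data appears split into separate columns.

4.1 Writing a CSV File

to_csv() saves a DataFrame as a CSV file.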

import pandas as pd
 
# Create an example DataFrame
data = {
        'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
        'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
        'C': [1, 2, 3, 4, 5, 6, 7, 8],
}
 
df = pd.DataFrame(data,index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
 
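# Export the DataFrame to CSV without the row index, which makes the file easier to read
df.to_csv('example.csv', index=False)

Output (contents of example.csv):

A,B,C
foo,one,1
bar,one,2
foo,two,3
bar,three,4
foo,two,5
bar,two,6
foo,one,7
foo,three,8

Why is the file harder to read with the index included? Let's take a look:

# Export the DataFrame to CSV, including the row index
df.to_csv('example1.csv')

Output (contents of example1.csv):

,A,B,C
a,foo,one,1
b,bar,one,2
c,foo,two,3
d,bar,three,4
e,foo,two,5
f,bar,two,6
g,foo,one,7
h,foo,three,8

The column labels are on the first line, which begins with an extra comma because the index column has no name. Every following line then starts with its row label. When the row labels are just the default integers, they can safely be left out: the default index starts from 0 anyway, and writing it only adds a column of redundant numbers. When the row labels are meaningful, such as the letters here, they have to be written, otherwise they are lost.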
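
When the index has been written to the file, it also has to be read back as an index, otherwise it comes back as an ordinary column. The following sketch shows the round trip on a shortened version of the example frame. It keeps the CSV text in memory with io.StringIO instead of writing a file, which is a simplification chosen for this illustration.

```python
import io
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar'], 'C': [1, 2]}, index=['a', 'b'])

# Without a path, to_csv() returns the CSV text as a string.
# The header line starts with a comma: the index column has no name.
text = df.to_csv()
print(text)

# index_col=0 tells read_csv to use the first column as the row index
restored = pd.read_csv(io.StringIO(text), index_col=0)
print(restored)
```

Without index_col=0, the labels a and b would come back as a regular column, and the DataFrame would get a fresh default integer index.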

4.2 Reading Data

read_csv() reads a CSV file into a DataFrame.

import pandas as pd
 
# Read the CSV file
new = pd.read_csv('example.csv')
print(new)

Output:

     A      B  C
0  foo    one  1
1  bar    one  2
2  foo    two  3
3  bar  three  4
4  foo    two  5
5  bar    two  6
6  foo    one  7
7  foo  three  8
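
Real files are not always comma-separated or complete. As a rough sketch, the example below reads semicolon-separated text with an empty field and loads only two of its columns. The raw string is hypothetical data standing in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated content; the second data row has no value for C
raw = "A;B;C\nfoo;one;1\nbar;two;\nfoo;three;3\n"

# sep sets the delimiter, usecols limits which columns are loaded
subset = pd.read_csv(io.StringIO(raw), sep=';', usecols=['A', 'C'])
print(subset)

# The empty field is read as a missing value (NaN)
print(subset['C'].isnull().sum())
```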

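5. Plotting

pandas provides a plot() interface built on top of the Matplotlib plotting package; common chart types can be drawn by calling it.

Calling plot() alone runs without error, but in a plain script no figure is displayed. To display it, import matplotlib and call plt.show().

Parameters:

kind: the chart type. Defaults to line. Options include:
- line: line chart
- bar: bar chart
- hist: histogram
- pie: pie chart
- scatter: scatter plot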
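
plot() returns the Matplotlib Axes it drew on, so the chart can be labelled and saved to a file without a window being opened. The sketch below is one way to do that; the 'Agg' backend, the file name bar_example.png and the temporary directory are assumptions made for this illustration, not part of the original examples.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # non-interactive backend: render without opening a window
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'C': [1, 2, 3, 4]}, index=['a', 'b', 'c', 'd'])

# plot() returns the Axes object that holds the chart
ax = df.plot(kind='bar', title='Values of C')
ax.set_xlabel('row label')
ax.set_ylabel('C')

# Save the figure to a file instead of calling plt.show()
out_path = os.path.join(tempfile.gettempdir(), 'bar_example.png')
ax.figure.savefig(out_path)
plt.close(ax.figure)
print(out_path)
```

Saving to a file is handy when the script runs on a machine without a display, where plt.show() has nothing to draw on.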

import pandas as pd
import matplotlib.pyplot as plt
 
# Create an example DataFrame
data = {
        'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
        'C': [1, 2, 3, 4, 5, 6, 7, 8],
}
 
df = pd.DataFrame(data,index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
 
# Bar chart
df.plot(kind='bar')
plt.show()

Output: [figure: bar chart]

# Histogram
df.plot(kind='hist')
plt.show()

Output: [figure: histogram]

# Scatter plot
df.plot(kind='scatter', x='C', y='B')
plt.show()

Output: [figure: scatter plot]

# Create an example Series
data = {
    'A': 10,
    'B': 20,
    'C': 30,
    'D': 40
}
series = pd.Series(data)
# Draw a pie chart
series.plot(kind='pie', autopct='%1.1f%%')
# Display the chart
plt.show()

Output: [figure: pie chart]