pd.merge( )
Definition :
merge(left: 'DataFrame | Series', right: 'DataFrame | Series',
how: 'str'='inner',
on: 'IndexLabel | None'=None,
left_on: 'IndexLabel | None'=None,
right_on: 'IndexLabel | None'=None,
left_index: 'bool'=False,
right_index: 'bool'=False,
sort: 'bool'=False,
suffixes: 'Suffixes'=('_x', '_y'),
copy: 'bool'=True,
indicator: 'bool'=False,
validate: 'str | None'=None) -> 'DataFrame'
参数含义
left:待链接的左侧数据集。
right:待链接的右侧数据集。
how:左右数据集的连接方式。可选‘left’、‘right’、‘outer’、‘inner’,'cross',默认为inner。
on:左右两个待链接数据集有共同列名,且按该列链接两个数据集合时使用该参数。
left_on:链接两个数据集时,左数据集对应连接关键字(可为列表)。
right_on:链接两个数据集时,右数据集对应连接关键字(可为列表)。
left_index:若为True,则按左数据集的索引连接两个数据集。
right_index:若为True,则按右数据集的索引连接两个数据集。
sort:对结果数据集进行排序。
suffixes:为左右数据集中重复列名定义后缀。默认加('_x','_y')。
Merge method
Merge method | SQL Join Name | Description |
left | LEFT OUTER JOIN | Use keys from left frame only |
right | RIGHT OUTER JOIN | Use keys from right frame only |
outer | FULL OUTER JOIN | Use union of keys from both frames |
inner | INNER JOIN | Use intersection of keys from both frames |
Examples
import pandas as pd
left = pd.DataFrame(
{
"key": ["K0", "K1", "K2", "K3"],
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
}
)
right = pd.DataFrame(
{
"key": ["K0", "K1", "K2", "K3"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
}
)
result = pd.merge(left, right, on="key" ,how= 'inner')
left = pd.DataFrame(
{
"key1": ["K0", "K0", "K1", "K2"],
"key2": ["K0", "K1", "K0", "K1"],
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
}
)
right = pd.DataFrame(
{
"key1": ["K0", "K1", "K1", "K2"],
"key2": ["K0", "K0", "K0", "K0"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
}
)
result = pd.merge(left, right, how="left", on=["key1", "key2"])
result = pd.merge(left, right, how="right", on=["key1", "key2"])
result = pd.merge(left, right, how="outer", on=["key1", "key2"])
result = pd.merge(left, right, how="inner", on=["key1", "key2"])
DataFrame与Series链接
df = pd.DataFrame({"Let": ["A", "B", "C"], "Num": [1, 2, 3]})
ser = pd.Series(
["a", "b", "c", "d", "e", "f"],
index=pd.MultiIndex.from_arrays(
[["A", "B", "C"] * 2, [1, 2, 3, 4, 5, 6]], names=["Let", "Num"]
),
)
pd.merge(df, ser.reset_index(), on=["Let", "Num"])
检查重复关键字
left = pd.DataFrame({"A": [1, 2], "B": [2, 2]})
right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})
result = pd.merge(left, right, on="B", how="outer")
left = pd.DataFrame({"A": [1, 2], "B": [1, 2]})
right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})
result = pd.merge(left, right, on="B", how="outer", validate="one_to_one")
pd.merge(left, right, on="B", how="outer", validate="one_to_many")
合并指示器
df1 = pd.DataFrame({"col1": [0, 1], "col_left": ["a", "b"]})
df2 = pd.DataFrame({"col1": [1, 2, 2], "col_right": [2, 2, 2]})
pd.merge(df1, df2, on="col1", how="outer", indicator=True)
pd.merge(df1, df2, on="col1", how="outer", indicator="indicator_column")
总结
根据不同的使用场景,设置不同的参数值,达到预期的数据集链接目的。此处仅仅是个人的学习笔记,还有很多灵活的用法没有列举出来,感兴趣的可以深入操练。同时,强烈建议您要多读pandas官方文档,很多用法均来自官方文档,毕竟“您要学会打猎”。
关于数据集链接,我之前的文章也有介绍,建议您系统性的浏览: