
What is the pandas Index used for?

2023-03-12 / 2019-10-10, by crazyant

Data stored in an ordinary column can also be used for lookups, so what is the benefit of using an index?

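In summary, an index offers:

more convenient data queries;
better query performance;
automatic data alignment;
support for more, and more powerful, data structures.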

import pandas as pd

df = pd.read_csv("./datas/ml-latest-small/ratings.csv")

df.head()

	userId	movieId	rating	timestamp
0	1	1	4.0	964982703
1	1	3	4.0	964981247
2	1	6	4.0	964982224
3	1	47	5.0	964983815
4	1	50	5.0	964982931

df.count()

userId       100836
movieId      100836
rating       100836
timestamp    100836
dtype: int64

1. Querying data with the index

# drop=False keeps userId as a regular column as well as the index
df.set_index("userId", inplace=True, drop=False)

df.head()

	userId	movieId	rating	timestamp
userId
1	1	1	4.0	964982703
1	1	3	4.0	964981247
1	1	6	4.0	964982224
1	1	47	5.0	964983815
1	1	50	5.0	964982931

df.index

Int64Index([  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
            ...
            610, 610, 610, 610, 610, 610, 610, 610, 610, 610],
           dtype='int64', name='userId', length=100836)

# Query by index label
df.loc[500].head(5)

	userId	movieId	rating	timestamp
userId
500	500	1	4.0	1005527755
500	500	11	1.0	1005528017
500	500	39	1.0	1005527926
500	500	101	1.0	1005527980
500	500	104	4.0	1005528065

# Query by a condition on the column
df.loc[df["userId"] == 500].head()

	userId	movieId	rating	timestamp
userId
500	500	1	4.0	1005527755
500	500	11	1.0	1005528017
500	500	39	1.0	1005527926
500	500	101	1.0	1005527980
500	500	104	4.0	1005528065
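The two query styles above can be compared on a small, self-contained frame. This is only a sketch: the `toy` data is made up for illustration and is not the MovieLens file.

```python
import pandas as pd

# Hypothetical toy data standing in for ratings.csv
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 30, 40],
    "rating": [4.0, 5.0, 3.0, 2.5, 4.5],
})
toy = toy.set_index("userId", drop=False)

by_index = toy.loc[3]                     # label lookup on the index
by_column = toy.loc[toy["userId"] == 3]   # boolean mask on the column

# Both approaches select the same rows
print(by_index.equals(by_column))
```

One thing to watch for: when a label matches exactly one row, as `toy.loc[2]` does here, `loc` returns a Series rather than a DataFrame. Passing a list, as in `toy.loc[[2]]`, always returns a DataFrame.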

2. Using the index can improve query performance

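If the index is unique, pandas uses a hash table and a lookup is O(1);
If the index is not unique but is sorted, pandas uses binary search and a lookup is O(log N);
If the index is neither unique nor sorted, every lookup has to scan all of it and a lookup is O(N).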
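Which of the three cases applies can be checked with `is_unique` and `is_monotonic_increasing`. As a rough illustration, the return type of `Index.get_loc` also differs by case. The three small indexes below are made-up examples, not data from this post.

```python
import pandas as pd

unique_idx = pd.Index([3, 1, 2])          # unique labels
sorted_dup_idx = pd.Index([1, 1, 2, 3])   # duplicates, but sorted
random_dup_idx = pd.Index([2, 1, 3, 1])   # duplicates, unsorted

for name, idx in [
    ("unique", unique_idx),
    ("sorted, non-unique", sorted_dup_idx),
    ("unsorted, non-unique", random_dup_idx),
]:
    print(name, idx.is_unique, idx.is_monotonic_increasing)

# get_loc returns a different kind of result in each case
loc_unique = unique_idx.get_loc(1)       # a single integer position
loc_sorted = sorted_dup_idx.get_loc(1)   # a slice over a contiguous run
loc_random = random_dup_idx.get_loc(1)   # a boolean mask over every element
print(loc_unique, loc_sorted, loc_random)
```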

Experiment 1: lookup on a fully shuffled index

# Shuffle the rows randomly
from sklearn.utils import shuffle
df_shuffle = shuffle(df)

df_shuffle.head()

	userId	movieId	rating	timestamp
userId
160	160	2340	1.0	985383314
129	129	1136	3.5	1167375403
167	167	44191	4.5	1154718915
536	536	276	3.0	832839990
67	67	5952	2.0	1501274082

# Is the index monotonically increasing?
df_shuffle.index.is_monotonic_increasing

False

df_shuffle.index.is_unique

False

# Time the lookup of userId == 500
%timeit df_shuffle.loc[500]

376 µs ± 52.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Experiment 2: lookup after sorting the index

df_sorted = df_shuffle.sort_index()

df_sorted.head()

	userId	movieId	rating	timestamp
userId
1	1	2985	4.0	964983034
1	1	2617	2.0	964982588
1	1	3639	4.0	964982271
1	1	6	4.0	964982224
1	1	733	4.0	964982400

# Is the index monotonically increasing?
df_sorted.index.is_monotonic_increasing

True

df_sorted.index.is_unique

False

%timeit df_sorted.loc[500]

203 µs ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
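The two experiments can be approximated outside a notebook, without the CSV file or scikit-learn. The sketch below makes several assumptions: the data is synthetic (100,000 rows with 600 hypothetical user ids), rows are shuffled with `DataFrame.sample(frac=1)` instead of `sklearn.utils.shuffle`, and the standard-library `timeit` module replaces the `%timeit` magic. Timings depend on the machine and pandas version, so no expected numbers are given.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for ratings.csv (hypothetical values)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "userId": rng.integers(1, 601, size=100_000),
    "rating": rng.integers(1, 6, size=100_000).astype(float),
}).set_index("userId", drop=False)

# sample(frac=1) returns all rows in random order
demo_shuffled = demo.sample(frac=1, random_state=0)
demo_sorted = demo_shuffled.sort_index()

runs = 200
for label, frame in [("shuffled", demo_shuffled), ("sorted", demo_sorted)]:
    seconds = timeit.timeit(lambda: frame.loc[500], number=runs)
    print(f"{label}: {seconds / runs * 1e6:.0f} µs per lookup")
```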

3. The index aligns data automatically

This applies to both Series and DataFrame.

s1 = pd.Series([1,2,3], index=list("abc"))

s1

a    1
b    2
c    3
dtype: int64

s2 = pd.Series([2,3,4], index=list("bcd"))

s2

b    2
c    3
d    4
dtype: int64

s1+s2

a    NaN
b    4.0
c    6.0
d    NaN
dtype: float64
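Labels present on only one side produce NaN. If a missing label should instead count as zero, one option is `Series.add` with `fill_value`. The sketch below also shows the same alignment on two small DataFrames; `df1` and `df2` are made-up examples.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=list("abc"))
s2 = pd.Series([2, 3, 4], index=list("bcd"))

# Treat a label missing from one side as 0 instead of producing NaN
total = s1.add(s2, fill_value=0)
print(total)
# a    1.0
# b    4.0
# c    6.0
# d    4.0
# dtype: float64

# DataFrames align on the row index (and on column names) in the same way
df1 = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])
df2 = pd.DataFrame({"x": [10, 20]}, index=["b", "c"])
frame_sum = df1 + df2   # only row "b" exists on both sides
print(frame_sum)
```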

4. The index supports more, and more powerful, data structures

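pandas provides several powerful index types:

CategoricalIndex, an index based on categorical data, which can improve performance;
MultiIndex, a multi-level index, used for example for the result of a groupby over several keys;
DatetimeIndex, a time-based index with rich support for date and time operations.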
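A brief sketch of each index type in use. All of the data below (`sales`, `daily`, `grades`) is invented for illustration.

```python
import pandas as pd

# MultiIndex: grouping by two keys yields a two-level index
sales = pd.DataFrame({
    "city": ["A", "A", "B", "B"],
    "year": [2020, 2021, 2020, 2021],
    "amount": [10, 20, 30, 40],
})
by_city_year = sales.groupby(["city", "year"])["amount"].sum()
print(by_city_year.loc[("B", 2021)])   # look up one cell with a tuple key

# DatetimeIndex: select a whole month with a partial date string
daily = pd.Series(
    range(60),
    index=pd.date_range("2023-01-01", periods=60, freq="D"),
)
january = daily.loc["2023-01"]
print(len(january))

# CategoricalIndex: labels drawn from a fixed set of categories
grades = pd.Series([1, 2, 3], index=pd.CategoricalIndex(["low", "high", "low"]))
print(grades.loc["low"])
```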
