PySpark RDD: retrieving values with collect, take, top, and first

Date: 2024-07-21 19:46:57

1. PySpark version

Version 2.3.0

2. collect()

collect()

Return a list that contains all of the elements in this RDD.

Note: This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.

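Example:

    from pyspark import SparkContext, SparkConf

    # Run Spark locally under the application name "quzhi"
    conf = SparkConf().setMaster("local").setAppName("quzhi")
    sc = SparkContext(conf=conf)

    # Build an RDD from the integers 0..9 and pull every element back to the driver
    lines1 = sc.parallelize(list(range(10)))
    print('lines1= ', lines1.collect())
    # Output: lines1=  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]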
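
The memory note for collect() can be illustrated with a plain-Python model in which a list of lists stands in for an RDD's partitions. This is only a sketch and not Spark code; the helper names `collect_sketch` and `iterate_sketch` are hypothetical. The first helper builds one list holding every element, the second hands elements out one partition at a time.

```python
def collect_sketch(partitions):
    """Materialize every element of every partition in a single list."""
    return [x for part in partitions for x in part]


def iterate_sketch(partitions):
    """Yield elements partition by partition without building one big list."""
    for part in partitions:
        yield from part


# Three hypothetical partitions standing in for a distributed dataset
parts = [[0, 1, 2], [3, 4], [5]]
everything = collect_sketch(parts)   # the whole dataset is held at once
first = next(iterate_sketch(parts))  # only the first partition is touched
```

In PySpark itself, `RDD.toLocalIterator()` offers the second behaviour: according to its documentation, the iterator consumes as much memory as the largest partition rather than the whole RDD.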

3. take()

take(num)

Take the first num elements of the RDD.

It works by first scanning one partition, and use the results from that partition to estimate the number of additional partitions needed to satisfy the limit.

Translated from the Scala implementation in RDD#take().
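
The scan-then-estimate strategy described above can be sketched in plain Python, again with a list of lists standing in for partitions. This is a simplified model under stated assumptions, not the library's code: the function name `take_sketch` is hypothetical, and the growth factor of 4 and the 1.5x overestimate are assumptions loosely modelled on Spark's approach.

```python
from itertools import islice


def take_sketch(partitions, num):
    """Return up to `num` leading elements from a list of partitions."""
    items = []
    total_parts = len(partitions)
    parts_scanned = 0
    while len(items) < num and parts_scanned < total_parts:
        parts_to_try = 1  # first pass: scan a single partition
        if parts_scanned > 0:
            if not items:
                # Nothing found so far: widen the scan aggressively.
                parts_to_try = parts_scanned * 4
            else:
                # Extrapolate from the hit rate so far, overestimating by 50%.
                estimate = int(1.5 * num * parts_scanned / len(items)) - parts_scanned
                parts_to_try = min(max(estimate, 1), parts_scanned * 4)
        left = num - len(items)
        stop = min(parts_scanned + parts_to_try, total_parts)
        for part in partitions[parts_scanned:stop]:
            # Each scanned partition contributes at most `left` elements.
            items.extend(islice(part, left))
        parts_scanned = stop
    return items[:num]


# 100 single-element partitions, keeping only values greater than 90
filtered = [[x] if x > 90 else [] for x in range(100)]
result = take_sketch(filtered, 3)
```

On this input the sketch scans 1, then 4, then 20, then the remaining 75 partitions, because the early partitions are all empty after filtering.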

Note: This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.

    >>> sc.parallelize([2, 3, 4, 5, 6]).cache().take(2)
    [2, 3]
    >>> sc.parallelize([2, 3, 4, 5, 6]).take(10)
    [2, 3, 4, 5, 6]
    >>> sc.parallelize(range(100), 100).filter(lambda x: x > 90).take(3)
    [91, 92, 93]

Example:

    # take(n): return the first n elements of the RDD
    print('take(2)= ', lines1.take(2))
    # Output: take(2)=  [0, 1]

4. top()

top(num, key=None)

Get the top N elements from an RDD.

Note: This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.

Note: It returns the list sorted in descending order.

    >>> sc.parallelize([10, 4, 2, 12, 3]).top(1)
    [12]
    >>> sc.parallelize([2, 3, 4, 5, 6], 2).top(2)
    [6, 5]
    >>> sc.parallelize([10, 4, 2, 12, 3]).top(3, key=str)
    [4, 3, 2]
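
One way to picture how a top-N can be computed over partitions, and why `key=str` yields [4, 3, 2], is the plain-Python sketch below. It is a model under assumptions rather than Spark's implementation: the name `top_sketch` is hypothetical, and partitions are represented as ordinary lists. Each partition keeps its own N largest elements, and the candidate lists are then merged pairwise.

```python
import heapq
from functools import reduce


def top_sketch(partitions, num, key=None):
    """Model of a distributed top-N: local top-N per partition, then merge."""
    # Step 1: each partition keeps only its own `num` largest elements.
    local_tops = [heapq.nlargest(num, part, key=key) for part in partitions]

    # Step 2: merge two candidate lists, again keeping the `num` largest.
    def merge(a, b):
        return heapq.nlargest(num, a + b, key=key)

    return reduce(merge, local_tops, [])


largest = top_sketch([[2, 3], [4, 5, 6]], 2)             # numeric comparison
by_string = top_sketch([[10, 4, 2, 12, 3]], 3, key=str)  # string comparison
```

With `key=str` the elements are compared as strings, so "4" > "3" > "2" > "12" > "10". That ordering is why 12 and 10 are left out of the result even though they are numerically the largest values.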

Example:

    # top(2) returns the two largest elements, regardless of their position in the RDD
    lines2 = sc.parallelize(list(range(0, 10))[::-1])
    print('lines2= ', lines2.collect())
    print('lines1.top(2)= ', lines1.top(2))
    print('lines2.top(2)= ', lines2.top(2))
    # Output:
    # lines2=  [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
    # lines1.top(2)=  [9, 8]
    # lines2.top(2)=  [9, 8]