Common RDD Transformations and Actions¶
Image is stored here https://www.linkedin.com/pulse/resilient-distributed-datasets-rdds-arabinda-mohapatra-5phef/
Transformations¶
- map(): Applies a function to each element in the RDD.
- filter(): Filters elements based on a predicate.
- flatMap(): Similar to map(), but each input item can be mapped to 0 or more output items.
In [1]:
from pyspark import SparkContext
# Initialize SparkContext
sc = SparkContext("local", "RDD Transformations and Actions")
24/05/29 22:01:54 WARN Utils: Your hostname, AlienDVD resolves to a loopback address: 127.0.1.1; using 172.25.214.222 instead (on interface eth0) 24/05/29 22:01:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 24/05/29 22:01:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 24/05/29 22:01:57 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
In [2]:
# map() example
rdd = sc.parallelize([1, 2, 3, 4])
squared_rdd = rdd.map(lambda x: x * x)
print(squared_rdd.collect())
# filter() example
filtered_rdd = rdd.filter(lambda x: x % 2 == 0)
print(filtered_rdd.collect())
# flatMap() example
rdd = sc.parallelize(["hello world", "foo bar"])
flat_mapped_rdd = rdd.flatMap(lambda x: x.split(" "))
print(flat_mapped_rdd.collect())
[1, 4, 9, 16] [2, 4] ['hello', 'world', 'foo', 'bar']
Actions¶
- collect(): Returns all elements of the RDD.
- take(n): Returns the first n elements of the RDD.
- count(): Returns the number of elements in the RDD.
In [3]:
# collect() example
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.collect())
# take(n) example
print(rdd.take(2))
# count() example
print(rdd.count())
[1, 2, 3, 4] [1, 2] 4
Summary¶
- Parallelizing a Collection: Use sc.parallelize().
- From an External Dataset: Use sc.textFile().
- From Existing RDDs: Use transformations like map, filter, and flatMap.
- Actions: Use actions like collect, take, and count.
In [ ]: