Common RDD Transformations and Actions¶

image.png

Image is stored here https://www.linkedin.com/pulse/resilient-distributed-datasets-rdds-arabinda-mohapatra-5phef/

Transformations¶

  • map(): Applies a function to each element in the RDD.
  • filter(): Filters elements based on a predicate.
  • flatMap(): Similar to map(), but each input item can be mapped to 0 or more output items.
In [1]:
from pyspark import SparkContext

# Initialize SparkContext
sc = SparkContext("local", "RDD Transformations and Actions")
24/05/29 22:01:54 WARN Utils: Your hostname, AlienDVD resolves to a loopback address: 127.0.1.1; using 172.25.214.222 instead (on interface eth0)
24/05/29 22:01:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/05/29 22:01:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/05/29 22:01:57 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
In [2]:
# map() example
rdd = sc.parallelize([1, 2, 3, 4])
squared_rdd = rdd.map(lambda x: x * x)
print(squared_rdd.collect())

# filter() example
filtered_rdd = rdd.filter(lambda x: x % 2 == 0)
print(filtered_rdd.collect())

# flatMap() example
rdd = sc.parallelize(["hello world", "foo bar"])
flat_mapped_rdd = rdd.flatMap(lambda x: x.split(" "))
print(flat_mapped_rdd.collect())
                                                                                
[1, 4, 9, 16]
[2, 4]
['hello', 'world', 'foo', 'bar']

Actions¶

  • collect(): Returns all elements of the RDD.
  • take(n): Returns the first n elements of the RDD.
  • count(): Returns the number of elements in the RDD.
In [3]:
# collect() example
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.collect())

# take(n) example
print(rdd.take(2))

# count() example
print(rdd.count())
[1, 2, 3, 4]
[1, 2]
4

Summary¶

  • Parallelizing a Collection: Use sc.parallelize().
  • From an External Dataset: Use sc.textFile().
  • From Existing RDDs: Use transformations like map, filter, and flatMap.
  • Actions: Use actions like collect, take, and count.
In [ ]: