There are several ways to create RDDs in PySpark:

  • Parallelizing a Collection
  • From an External Dataset
  • From Existing RDDs

1. Parallelizing a Collection

You can create an RDD from an existing collection (like a list) using the parallelize method.

In [1]:
from pyspark import SparkContext

# Initialize SparkContext
sc = SparkContext("local", "RDD Example")
24/08/18 10:08:58 WARN Utils: Your hostname, AlienDVD resolves to a loopback address: 127.0.1.1; using 10.255.255.254 instead (on interface lo)
24/08/18 10:08:58 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/08/18 10:08:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
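If the startup warnings are distracting, the log level can be raised once the context exists, as the message above suggests. A minimal snippet:

In [ ]:
# Suppress INFO/WARN messages from Spark after the context is up
sc.setLogLevel("ERROR")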
In [2]:
# Create an RDD from a Python list
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

print(type(rdd))

# Collect the RDD to see its contents
print(rdd.collect())
<class 'pyspark.rdd.RDD'>
[1, 2, 3, 4, 5]
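As a side note, parallelize also accepts an optional number of partitions (slices) as a second argument; the value 4 below is just an illustrative choice. A minimal sketch:

In [ ]:
# Explicitly request 4 partitions for the same list (illustrative value)
rdd4 = sc.parallelize(data, 4)
print(rdd4.getNumPartitions())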

2. From an External Dataset

You can create an RDD from an external data source such as HDFS, S3, or the local file system using the textFile method.

In [3]:
# Create an RDD from a text file
rdd = sc.textFile("resources/people.txt")

print(type(rdd))
<class 'pyspark.rdd.RDD'>
In [4]:
# Collect the RDD
print(rdd.collect())
['Mohsen 49', 'Jan 35', 'Jos 30', 'Arash 25', 'Eva 28', 'Frank 40', 'Helma 33', 'Dean 31', 'Alican 22', 'Zhanna 27']
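textFile also takes an optional minPartitions hint, and the path argument can be a directory, a glob pattern, or a fully qualified URI (e.g. hdfs:// or s3a://). The sketch below reuses the same resources/people.txt file; the commented-out paths are hypothetical examples only.

In [ ]:
# Ask for at least 4 partitions when reading the file (illustrative value)
people_rdd = sc.textFile("resources/people.txt", minPartitions=4)
print(people_rdd.getNumPartitions())

# The path may also be a directory, a glob, or a URI (hypothetical examples):
# sc.textFile("resources/")
# sc.textFile("resources/*.txt")
# sc.textFile("hdfs://namenode:9000/data/people.txt")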

3. From Existing RDDs

New RDDs can be created by transforming existing RDDs with operations such as map, filter, and flatMap. The example below uses map; filter and flatMap are sketched after it.

In [5]:
# Create an RDD from a list
data = ["apple", "banana", "cherry", "date"]
rdd = sc.parallelize(data)

# Transform the RDD using map to create a new RDD
upper_rdd = rdd.map(lambda x: x.upper())

# Collect the transformed RDD
print(upper_rdd.collect())
['APPLE', 'BANANA', 'CHERRY', 'DATE']
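filter and flatMap work the same way as map, each producing a new RDD from an existing one. A minimal sketch using the same fruit list:

In [ ]:
# Keep only the fruits whose name is longer than five characters
long_rdd = rdd.filter(lambda x: len(x) > 5)
print(long_rdd.collect())

# Split every fruit name into individual characters, flattening the results
chars_rdd = rdd.flatMap(lambda x: list(x))
print(chars_rdd.collect())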