{ "cells": [ { "cell_type": "markdown", "id": "caf63193-8444-404c-92c6-470a94bd3004", "metadata": {}, "source": [ "There are several ways to create RDDs in PySpark:\n", "\n", "- Parallelizing a Collection\n", "- From an External Dataset\n", "- From existing RDDs\n", "\n", "### 1. Parallelizing a Collection\n", "You can create an RDD from an existing collection (like a list) using the parallelize method." ] }, { "cell_type": "code", "execution_count": 1, "id": "fa7bbc9d-e733-4756-bc7c-1c845e57eeec", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "24/08/18 10:08:58 WARN Utils: Your hostname, AlienDVD resolves to a loopback address: 127.0.1.1; using 10.255.255.254 instead (on interface lo)\n", "24/08/18 10:08:58 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address\n", "Setting default log level to \"WARN\".\n", "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n", "24/08/18 10:08:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n" ] } ], "source": [ "from pyspark import SparkContext\n", "\n", "# Initialize SparkContext\n", "sc = SparkContext(\"local\", \"RDD Example\")" ] }, { "cell_type": "code", "execution_count": 2, "id": "bce45414-5139-4abf-a4db-cb4179289d90", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "[1, 2, 3, 4, 5]\n" ] } ], "source": [ "# Create an RDD from a Python list\n", "data = [1, 2, 3, 4, 5]\n", "rdd = sc.parallelize(data)\n", "\n", "print(type(rdd))\n", "\n", "# Collect the RDD to see its contents\n", "print(rdd.collect())\n" ] }, { "cell_type": "markdown", "id": "9966d555-0c86-4a6d-98cf-e0dcaf08453e", "metadata": {}, "source": [ "### 2. From an External Dataset\n", "You can create an RDD from an external data source such as HDFS, S3, or local file systems using the textFile method." ] }, { "cell_type": "code", "execution_count": 3, "id": "2097d9de-96c4-4ab6-959b-503ad3a628aa", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "# Create an RDD from a text file\n", "rdd = sc.textFile(\"resources/people.txt\")\n", "\n", "print(type(rdd))" ] }, { "cell_type": "code", "execution_count": 4, "id": "655587bb-ef0d-44df-9bb8-8f6c1147bc23", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Mohsen 49', 'Jan 35', 'Jos 30', 'Arash 25', 'Eva 28', 'Frank 40', 'Helma 33', 'Dean 31', 'Alican 22', 'Zhanna 27']\n" ] } ], "source": [ "# Collect the RDD\n", "print(rdd.collect())" ] }, { "cell_type": "markdown", "id": "49f0cf98-a68b-44c2-b780-15a69b339cb4", "metadata": {}, "source": [ "### 3. From Existing RDDs\n", "RDDs can be created by transforming existing RDDs using operations like map, filter, flatMap, etc." ] }, { "cell_type": "code", "execution_count": 5, "id": "ff4bce8b-9cd9-468b-9208-5f7b62182eaa", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['APPLE', 'BANANA', 'CHERRY', 'DATE']\n" ] } ], "source": [ "# Create an RDD from a list\n", "data = [\"apple\", \"banana\", \"cherry\", \"date\"]\n", "rdd = sc.parallelize(data)\n", "\n", "# Transform the RDD using map to create a new RDD\n", "upper_rdd = rdd.map(lambda x: x.upper())\n", "\n", "# Collect the transformed RDD\n", "print(upper_rdd.collect())\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 5 }