{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "caf63193-8444-404c-92c6-470a94bd3004",
   "metadata": {},
   "source": [
    "There are three common ways to create RDDs in PySpark:\n",
    "\n",
    "- Parallelizing a Collection\n",
    "- From an External Dataset\n",
    "- From Existing RDDs\n",
    "\n",
    "### 1. Parallelizing a Collection\n",
    "You can create an RDD from an existing in-memory collection, such as a Python list, using the `parallelize` method of `SparkContext`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "fa7bbc9d-e733-4756-bc7c-1c845e57eeec",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "24/08/18 10:08:58 WARN Utils: Your hostname, AlienDVD resolves to a loopback address: 127.0.1.1; using 10.255.255.254 instead (on interface lo)\n",
      "24/08/18 10:08:58 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address\n",
      "Setting default log level to \"WARN\".\n",
      "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n",
      "24/08/18 10:08:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n"
     ]
    }
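   ],
   "source": [
    "from pyspark import SparkConf, SparkContext\n",
    "\n",
    "# Configure a local Spark application\n",
    "conf = SparkConf().setMaster(\"local\").setAppName(\"RDD Example\")\n",
    "\n",
    "# Reuse the active SparkContext if there is one, so re-running this cell does not raise an error\n",
    "sc = SparkContext.getOrCreate(conf)"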
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "bce45414-5139-4abf-a4db-cb4179289d90",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pyspark.rdd.RDD'>\n",
      "[1, 2, 3, 4, 5]\n"
     ]
    }
   ],
   "source": [
    "# Create an RDD from a Python list\n",
    "data = [1, 2, 3, 4, 5]\n",
    "rdd = sc.parallelize(data)\n",
    "\n",
    "print(type(rdd))\n",
    "\n",
    "# collect() returns every element to the driver, so use it only on small RDDs\n",
    "print(rdd.collect())"
   ]
  },
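  {
   "cell_type": "markdown",
   "id": "rdd-parallelize-partitions-note",
   "metadata": {},
   "source": [
    "As a minimal sketch of how `parallelize` distributes a collection, the next cell passes the optional `numSlices` argument to request a number of partitions, then inspects the result with `getNumPartitions` and `glom`. The value `2` and the name `partitioned_rdd` are illustrative assumptions, not part of the original example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "rdd-parallelize-partitions-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark import SparkContext\n",
    "\n",
    "# Reuse the active SparkContext, or create one if none exists\n",
    "sc = SparkContext.getOrCreate()\n",
    "\n",
    "# Request 2 partitions explicitly (illustrative value)\n",
    "partitioned_rdd = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)\n",
    "\n",
    "# Number of partitions the RDD was split into\n",
    "print(partitioned_rdd.getNumPartitions())\n",
    "\n",
    "# glom() gathers the elements of each partition into a list,\n",
    "# which shows how the data was divided\n",
    "print(partitioned_rdd.glom().collect())"
   ]
  },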
  {
   "cell_type": "markdown",
   "id": "9966d555-0c86-4a6d-98cf-e0dcaf08453e",
   "metadata": {},
   "source": [
    "### 2. From an External Dataset\n",
    "You can create an RDD from an external data source, such as HDFS, S3, or the local file system, using the `textFile` method. Each line of the file becomes one string element of the RDD."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "2097d9de-96c4-4ab6-959b-503ad3a628aa",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pyspark.rdd.RDD'>\n"
     ]
    }
   ],
   "source": [
    "# Create an RDD from a text file\n",
    "rdd = sc.textFile(\"resources/people.txt\")\n",
    "\n",
    "print(type(rdd))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "655587bb-ef0d-44df-9bb8-8f6c1147bc23",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Mohsen 49', 'Jan 35', 'Jos 30', 'Arash 25', 'Eva 28', 'Frank 40', 'Helma 33', 'Dean 31', 'Alican 22', 'Zhanna 27']\n"
     ]
    }
   ],
   "source": [
    "# Collect the RDD\n",
    "print(rdd.collect())"
   ]
  },
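  {
   "cell_type": "markdown",
   "id": "rdd-textfile-actions-note",
   "metadata": {},
   "source": [
    "Because `collect` returns the whole dataset to the driver, it can be safer to inspect a file-backed RDD with actions that return only a small result. The following sketch assumes the same `resources/people.txt` file used above. The `minPartitions=2` hint and the name `lines` are illustrative assumptions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "rdd-textfile-actions-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark import SparkContext\n",
    "\n",
    "# Reuse the active SparkContext, or create one if none exists\n",
    "sc = SparkContext.getOrCreate()\n",
    "\n",
    "# minPartitions is a hint for the minimum number of partitions (illustrative value)\n",
    "lines = sc.textFile(\"resources/people.txt\", minPartitions=2)\n",
    "\n",
    "# count() returns the number of elements, here the number of lines\n",
    "print(lines.count())\n",
    "\n",
    "# first() returns only the first element\n",
    "print(lines.first())\n",
    "\n",
    "# take(n) returns at most n elements instead of the whole RDD\n",
    "print(lines.take(3))"
   ]
  },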
  {
   "cell_type": "markdown",
   "id": "49f0cf98-a68b-44c2-b780-15a69b339cb4",
   "metadata": {},
   "source": [
    "### 3. From Existing RDDs\n",
    "You can create a new RDD by applying a transformation such as `map`, `filter`, or `flatMap` to an existing RDD. RDDs are immutable, so a transformation returns a new RDD and leaves the original unchanged."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "ff4bce8b-9cd9-468b-9208-5f7b62182eaa",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['APPLE', 'BANANA', 'CHERRY', 'DATE']\n"
     ]
    }
   ],
   "source": [
    "# Create an RDD from a list\n",
    "data = [\"apple\", \"banana\", \"cherry\", \"date\"]\n",
    "rdd = sc.parallelize(data)\n",
    "\n",
    "# Transform the RDD using map to create a new RDD\n",
    "upper_rdd = rdd.map(lambda x: x.upper())\n",
    "\n",
    "# Collect the transformed RDD\n",
    "print(upper_rdd.collect())\n"
   ]
  }
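  ,
  {
   "cell_type": "markdown",
   "id": "rdd-filter-flatmap-note",
   "metadata": {},
   "source": [
    "The section above names `filter` and `flatMap` alongside `map`, so here is a hedged sketch of those two transformations on the same kind of data. The names `fruits`, `long_names`, and `letters`, and the length threshold of 5, are illustrative assumptions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "rdd-filter-flatmap-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark import SparkContext\n",
    "\n",
    "# Reuse the active SparkContext, or create one if none exists\n",
    "sc = SparkContext.getOrCreate()\n",
    "\n",
    "fruits = sc.parallelize([\"apple\", \"banana\", \"cherry\", \"date\"])\n",
    "\n",
    "# filter keeps only the elements for which the function returns True\n",
    "long_names = fruits.filter(lambda name: len(name) > 5)\n",
    "\n",
    "# flatMap maps each element to zero or more elements and flattens the results\n",
    "letters = fruits.flatMap(lambda name: list(name))\n",
    "\n",
    "# Transformations are lazy: nothing is computed until an action such as collect() or take() runs\n",
    "print(long_names.collect())\n",
    "print(letters.take(5))"
   ]
  }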
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}