Running DeepSeek R1 Locally with vLLM and Ray Dashboard¶
To serve LLMs and run inference locally, we will use vLLM, a fast and efficient library for LLM inference and serving, originally developed at the Sky Computing Lab at UC Berkeley. Over time, vLLM has evolved into a community-driven project with contributions from both academia and industry.
vLLM seamlessly supports most popular open-source models on Hugging Face, including:
- Transformer-based LLMs (e.g., LLaMA, Mistral, GPT)
- Mixture-of-Experts LLMs (e.g., Mixtral, DeepSeek-V2 and V3)
- Embedding Models (e.g., E5-Mistral, BGE)
- Multi-modal LLMs (e.g., LLaVA)
To learn more about vLLM, refer to the paper:
📄 Efficient Memory Management for Large Language Model Serving with PagedAttention
Authors: Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica
Presented at ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
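At its core, PagedAttention manages the KV cache the way an operating system manages virtual memory: each sequence's cache lives in fixed-size blocks tracked by a per-sequence block table, so memory is allocated on demand instead of reserved as one large contiguous buffer. The sketch below is purely conceptual and is not vLLM's actual implementation; `BLOCK_SIZE`, `free_blocks`, and `append_token` are illustrative names:

```python
# Conceptual sketch of paged KV-cache bookkeeping (not vLLM internals).
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

free_blocks = list(range(1024))           # pool of free physical block ids
block_tables: dict[int, list[int]] = {}   # sequence id -> its physical blocks

def append_token(seq_id: int, position: int) -> int:
    """Return the physical block holding this token's KV entries,
    allocating a new block only when a block boundary is crossed."""
    table = block_tables.setdefault(seq_id, [])
    if position % BLOCK_SIZE == 0:        # starting a new logical block
        table.append(free_blocks.pop())   # grab any free physical block
    return table[position // BLOCK_SIZE]
```

Because blocks are allocated lazily and can live anywhere in GPU memory, sequences of very different lengths can share the same pool without fragmentation, which is where vLLM's throughput gains come from.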
vLLM vs Hugging Face Transformers: Which One to Use for Running Models Locally?¶
When to Use vLLM¶
✅ Best for high-performance inference
✅ Optimized for large models (LLaMA, GPT, Mistral, etc.)
✅ Lower memory usage with PagedAttention
✅ High throughput (efficient batching and fast generation)
When to Use Hugging Face Transformers¶
✅ Best for training and fine-tuning models
✅ More flexibility (supports multiple architectures)
✅ Ideal for experimentation and model modifications
✅ Extensive support for custom tokenization, datasets, and adapters
Summary¶
| Use Case | Best Choice |
|---|---|
| Fast, efficient inference | ✅ vLLM |
| Training & fine-tuning | ✅ Hugging Face Transformers |
| Experimentation & prototyping | ✅ Hugging Face Transformers |
| Running large models efficiently | ✅ vLLM |
For our use case, where inference speed and efficiency matter most, we will use vLLM.
Setup Process¶
To get started, follow these steps:
1. Create and Activate a Virtual Environment¶
```bash
python -m venv vllm_env
source vllm_env/bin/activate   # For Linux/macOS
# or
vllm_env\Scripts\activate      # For Windows
```
2. Install Dependencies¶
```bash
pip install -r requirements.txt
```
3. Start Jupyter Notebook¶
```bash
jupyter notebook
```
Once inside Jupyter, you can proceed with loading and running DeepSeek R1 using vLLM.
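For a first smoke test inside the notebook, you can run vLLM in-process before setting up any server. A minimal sketch using vLLM's offline `LLM`/`SamplingParams` API; the prompt is just an example:

```python
from vllm import LLM, SamplingParams

# Load the model in-process (downloads from Hugging Face on first run)
llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
sampling = SamplingParams(temperature=0.6, max_tokens=256)

outputs = llm.generate(["Explain PagedAttention in one sentence."], sampling)
print(outputs[0].outputs[0].text)
```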
Next Steps¶
✅ Load DeepSeek R1 and test its inference speed.
✅ Integrate with API endpoints for easier querying and interaction with crewAI.
To serve DeepSeek R1 Distill Qwen-1.5B locally using vLLM while exposing the Ray instance and accessing the Ray Dashboard, run the following command:
```bash
# Start Ray with the dashboard enabled
ray start --head --dashboard-port=8265 --include-dashboard=True

# Run the vLLM OpenAI-compatible API server
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --port 9000 \
    --tensor-parallel-size 1

# For CPU (if you don't have a GPU):
python -m vllm.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --device cpu --port 9000
```
In the CPU case, if you receive warnings about platform detection and memory usage, consider running on a GPU (CUDA) instead.
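Once the server is up, a quick way to confirm it is serving the model is to query the OpenAI-compatible `/v1/models` endpoint. A minimal check, assuming the port 9000 setup above:

```python
import requests

# Lists the models served by the local vLLM OpenAI-compatible server
print(requests.get("http://localhost:9000/v1/models").json())
```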
Accessing the Ray Dashboard¶
Once started, open the Ray Dashboard in your browser:
🔗 URL: http://127.0.0.1:8265
This provides real-time monitoring of:
- ✅ GPU & CPU usage
- ✅ Memory allocation
- ✅ Task execution and scheduling
- ✅ Overall system performance
With this setup, you can efficiently run and monitor DeepSeek R1 locally using vLLM and Ray. 🚀
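Besides the browser dashboard, you can also inspect the running cluster from Python. A small sketch, assuming Ray was started with `ray start --head` as above:

```python
import ray

# Attach to the cluster started by `ray start --head`
ray.init(address="auto")
print(ray.cluster_resources())  # e.g., CPU/GPU counts and memory
```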
Note: For a better Ray Dashboard experience you need to set up Prometheus and Grafana, steps that we skip in this tutorial.
```python
# Show the result of every expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
```
```python
import requests

# Send a completion request to the local vLLM OpenAI-compatible endpoint
url = "http://localhost:9000/v1/completions"
headers = {"Content-Type": "application/json"}
data = {
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    "prompt": "Write a Python code that receives an int and outputs the sum of 0 to int. "
              "I do not need to see your line of thinking! Just show me the output!",
    "max_tokens": 2000,
}
response = requests.post(url, headers=headers, json=data)
print(response.json()["choices"][0]["text"])
```
For example, if the integer is 3, the Output... How? Ok, the integer is 3, Output is 6. Perform loop. Yes, I need to implement this in Python. So, I know if I write a for loop from 0 to int, then sum all the numbers. But for the for loop, how to make the range go through 0,1,2,3. So, for that range, which parameters. Wait, in Python, range(start, stop), and elif start is included, stop is excluded. So, if someone wanted to sum 0 to 3 in Python. We should perhaps set the range as range(0,4), but wait, if we do 0,1,2,3. Wait, but the for loop will automatically go through each number. So the for [var] from 0 to int+1. So, perhaps for i in range(0, int + 1): So, in your code, the output is the sum of 0 to integer. So, the meaning is: calculate 0 +1 +2 +3+...+n. So, in Python, for integers from 0 to n, sum include all. Thus, in this case, the example provided was, if the integer is 3, sum is 0+1+2+3=6. Yes. So, in code, as follows. Problem is when the int is passed as input. So, in Python, we have to read the integer from standard input. Then, compute the sum. So, first, read the integer. One approach is to read a string from input, convert to integer. Then, loop from 0 to that integer, sum. Wait, but according to the question, after receiving the int, output the sum. So, that's steps. Read integer 'n' from input. sum = 0 for i in range(n+1): sum += i print sum. Yes. In the example, when n is 3, sum is 6. Yes. Therefore, the code. Wait, the user provides for Python. So, in pseudocode. n = int(input()) total = 0 for i in range(n + 1): total += i print(total) So, that's the code. Wait, but in Python, when you do for i in ... it's inclusive. Wait, no, because the range goes up to stop as exclusive. Hence, for the range(n+1) needs to generate 0,1,2,...,n. Because, for example, range(3) would give 0,1,2: that's because starting from 0, then 0 to stop -1. So, to get 0 to 3 included, the range should be 0 to 4. Because: range(0,4) is [0,1,2,3]. Yes. So that's the correct approach. So then, the code is: n = int(input()) total = 0 for i in range(n + 1): total += i print(total) So, that should work. Testing by input(3) gives 6. Testing input(0) gives 0. Testing input(1) gives 0 +1= 1. Wait, no, 0 +1 is 1? Or is the sum including 0 and 1. Yes, sum from 0 to 1 is 0+1=1. Wait, so, for int=1, the code will calculate (0+1)=1, which is correct. So in this code, the sum is correct. Thus, that's the solution. So, putting it all together. The code will read an integer, sum from 0 to n, then print the sum. Yes. Hence, the Python code is written as: n = int(input()) total = 0 for i in range(n + 1): total += i print(total) </think> To solve this problem, we need to write a Python code that reads an integer from the standard input, calculates the sum of all integers from 0 up to and including that integer, and prints the result. ### Approach The approach is straightforward and involves the following steps: 1. Read an integer input from the user. 2. Initialize a variable to keep track of the sum. 3. Use a loop to iterate from 0 to the given integer, inclusive, and add each number to the sum. 4. Print the computed sum. This solution efficiently computes the sum using a for loop within the range defined from 0 to the given integer, ensuring we include all required values. ### Solution Code ```python n = int(input()) total = 0 for i in range(n + 1): total += i print(total) ``` ### Explanation 1. **Reading Input**: The input is read using `input()` and converted to an integer using `int()`. 2. 
**Initializing Sum**: A variable `total` is initialized to 0 to accumulate the sum. 3. **Loop Through Range**: Using a for loop from 0 to `n` (inclusive), each number is added to `total`. 4. **Printing Result**: After the loop completes, `total` contains the sum of all integers from 0 to `n`, which is then printed. This method ensures that we efficiently compute the required sum using a linear approach, making it both time and space efficient.
```python
import litellm

messages = [{"role": "user", "content": "Hey, how's it going"}]
response = litellm.completion(
    model="hosted_vllm/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",  # pass the vllm model name
    messages=messages,
    api_base="http://localhost:9000/v1",
    temperature=0.1,
    max_tokens=800,
)
print(response)
```
ModelResponse(id='chatcmpl-ab52d2df07df45d09e19039af09d9ee9', created=1740673222, model='hosted_vllm/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Okay, so I just saw this message: "Hey, how\'s it going?" and I\'m trying to figure out what the user is asking for. Let me break it down.\n\nFirst, the greeting is pretty straightforward—Hello. But then, the user is asking, "how\'s it going?" That\'s a common question, but it\'s a bit vague. I need to understand what they\'re really looking for. Maybe they\'re just checking in, or perhaps they\'re seeking some information.\n\nI should consider different angles. Maybe they\'re asking about their day, their work, or something else. Since the user didn\'t specify, I can\'t be sure. But I can think of a few possibilities.\n\n1. **General Well-being**: They might be asking how they\'re feeling in general. That could be for a friend, family member, or someone they\'re meeting online.\n\n2. **Work or Personal Life**: If they\'re talking about work, they might be asking about their current tasks, progress, or any issues they\'re facing. If it\'s personal, they could be discussing hobbies, goals, or stress.\n\n3. **Specific Topics**: They might be asking about a particular subject, like their career, health, or something else. Without more context, it\'s hard to pin down.\n\n4. **Social Interaction**: They could be asking about their online presence, whether they\'re active, or if they\'re looking for advice.\n\nI should also think about the tone. The user used "Hey, how\'s it going?" which is casual and friendly. So, they might be looking for a light-hearted response or just a simple confirmation.\n\nAnother angle is that they might be testing if I\'m paying attention or if I\'m just responding to a generic message. But since they\'re asking a question, it\'s more likely they\'re seeking information.\n\nI wonder if they\'re planning to ask something else in the future. Maybe they\'re just curious and want to know more. Or perhaps they\'re looking for a prompt to start a conversation.\n\nTo sum up, without more context, I can\'t be certain, but I can offer a few possibilities. I should ask them to clarify or provide more details so I can assist them better.\n</think>\n\nIt seems like you\'re asking how you\'re doing, which is a common way to check in. To help you better, could you please clarify what you need assistance with? Whether it\'s for a specific topic, a question, or something else, I\'m here to help!', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=510, prompt_tokens=11, total_tokens=521, completion_tokens_details=None, prompt_tokens_details=None), service_tier=None, prompt_logprobs=None)
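litellm can also stream tokens as they are generated, which is handy for long R1 reasoning traces. A minimal sketch with `stream=True`, against the same endpoint and model as above:

```python
import litellm

# Stream the response chunk by chunk instead of waiting for the full completion
stream = litellm.completion(
    model="hosted_vllm/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    messages=[{"role": "user", "content": "Count from 1 to 5."}],
    api_base="http://localhost:9000/v1",
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # final chunks may carry no content
        print(delta, end="", flush=True)
```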
```python
from crewai import LLM

# Point crewAI at the local vLLM server (no real API key is needed)
llm = LLM(
    model="hosted_vllm/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    temperature=0.2,
    base_url="http://localhost:9000/v1",
    api_key="NA"
)
```
```python
from crewai import Agent, Task, Crew

# Create a CrewAI agent
agent = Agent(
    role="Researcher",
    goal="Analyze the latest AI trends",
    backstory="An AI expert who gathers and summarizes key insights",
    llm=llm
)

# Define a task with expected_output
task = Task(
    description="Find the top 3 AI trends of 2025.",
    agent=agent,
    expected_output="A list of three key AI trends for 2025 with a brief summary of each."
)

crew = Crew(agents=[agent], tasks=[task])
crew.kickoff()
```
CrewOutput(raw="1. Generative AI: This trend involves AI systems that can create new content, such as images, text, and music. These systems, often referred to as GANs (Generative Adversarial Networks), are gaining sophistication and are expected to remain relevant in 2025, offering a wide range of applications from art to entertainment.\n\n2. Medical AI: The integration of AI with healthcare is a significant trend. AI is being utilized for diagnostics, drug discovery, and personalized treatment plans. This shift aims to improve healthcare outcomes and efficiency, making AI a cornerstone of modern medical practice.\n\n3. AI Integration with Other Technologies: This trend focuses on enhancing AI's capabilities by integrating it with other technologies, such as robotics and autonomous vehicles. By improving AI's efficiency and adaptability, it can support advancements in manufacturing and other industries, driving innovation and efficiency.\n\nThese trends highlight the transformative potential of AI across various sectors, shaping the future of technology and society.", pydantic=None, json_dict=None, tasks_output=[TaskOutput(description='Find the top 3 AI trends of 2025.', name=None, expected_output='A list of three key AI trends for 2025 with a brief summary of each.', summary='Find the top 3 AI trends of 2025....', raw="1. Generative AI: This trend involves AI systems that can create new content, such as images, text, and music. These systems, often referred to as GANs (Generative Adversarial Networks), are gaining sophistication and are expected to remain relevant in 2025, offering a wide range of applications from art to entertainment.\n\n2. Medical AI: The integration of AI with healthcare is a significant trend. AI is being utilized for diagnostics, drug discovery, and personalized treatment plans. This shift aims to improve healthcare outcomes and efficiency, making AI a cornerstone of modern medical practice.\n\n3. AI Integration with Other Technologies: This trend focuses on enhancing AI's capabilities by integrating it with other technologies, such as robotics and autonomous vehicles. By improving AI's efficiency and adaptability, it can support advancements in manufacturing and other industries, driving innovation and efficiency.\n\nThese trends highlight the transformative potential of AI across various sectors, shaping the future of technology and society.", pydantic=None, json_dict=None, agent='Researcher', output_format=<OutputFormat.RAW: 'raw'>)], token_usage=UsageMetrics(total_tokens=1964, prompt_tokens=469, cached_prompt_tokens=0, completion_tokens=1495, successful_requests=2))
Ngrok auth¶
1️⃣ What is ngrok?¶
Ngrok is a tool that allows you to expose a local server running on your machine to the public internet using a secure tunnel. It creates a publicly accessible URL that can be used to access services running locally, like a web application, API, or Jupyter Notebook.
2️⃣ Why Should You Use ngrok?¶
Exposing Localhost to the Internet¶
- Run a local web app (e.g., Flask, Django, FastAPI), and ngrok provides a public URL like `https://abcd.ngrok.io` to access it.
Testing Webhooks¶
- Services like Stripe, GitHub, Twilio, or Telegram bots require public URLs for webhooks.
- Ngrok lets you expose your local environment without deploying your app.
Remote Access to Jupyter Notebooks¶
- If you're working in WSL, a cloud server, or a remote machine, ngrok helps you access your Jupyter Notebook from anywhere.
Collaboration & Demos¶
- Need to share a local project? Just run ngrok and get a public link instantly.
Bypassing Firewalls & NAT¶
- Ngrok helps when working behind corporate networks where port forwarding isn’t an option.
3️⃣ How to Install and Use ngrok¶
Install ngrok (for WSL/Ubuntu) if not installed¶
On WSL (Ubuntu), run the following commands in your WSL terminal:
```bash
curl -s https://ngrok-agent.s3.amazonaws.com/ngrok.asc | sudo tee /etc/apt/trusted.gpg.d/ngrok.asc >/dev/null && \
echo "deb https://ngrok-agent.s3.amazonaws.com buster main" | sudo tee /etc/apt/sources.list.d/ngrok.list && \
sudo apt update && sudo apt install ngrok
```
This will take around 10-20 seconds to complete.
Authenticate ngrok¶
Get your auth token from the ngrok dashboard and run:
```bash
ngrok config add-authtoken <your_token>
```
Replace `<your_token>` with your actual authentication token.
Expose a Local Server¶
If you are running your app on port 5000, use:
```bash
ngrok http 5000
```
Ngrok will generate a public URL, e.g. `https://abcd.ngrok.io`.
Now you can access your local app from anywhere!
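As a quick sanity check, you can request the public URL from Python; `https://abcd.ngrok.io` is the placeholder URL from the example above, and yours will differ:

```python
import requests

# Replace with the URL ngrok printed for your tunnel
resp = requests.get("https://abcd.ngrok.io")
print(resp.status_code)
```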
4️⃣ Security Reminder¶
- Do not share your ngrok authentication token publicly.
- If you accidentally exposed it, go to the ngrok dashboard and reset your token.
5️⃣ Should You Use ngrok?¶
✔ Yes, if you need to expose a local server to the internet quickly.
✔ Yes, if you're testing webhooks or need remote access.
❌ No, if you’re deploying a production app—use a proper hosting service instead.
```python
!ngrok config add-authtoken <your_token>
```
Authtoken saved to configuration file: /home/mohsen/.config/ngrok/ngrok.yml
```python
from pyngrok import ngrok

port = 9000

# Open an ngrok tunnel to the local vLLM server
public_url = ngrok.connect(port).public_url
print(f" * ngrok tunnel \"{public_url}\" -> \"http://127.0.0.1:{port}\"")
```
* ngrok tunnel "https://8fb3-84-104-50-150.ngrok-free.app" -> "http://127.0.0.1:9000"
To test the public endpoint, use any tool that can send POST requests with headers and a JSON body, such as Postman or curl.
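For example, the same completion request from earlier can be sent through the tunnel. A sketch assuming the tunnel URL printed above; yours will differ:

```python
import requests

url = "https://8fb3-84-104-50-150.ngrok-free.app/v1/completions"  # your tunnel URL will differ
headers = {"Content-Type": "application/json"}
data = {
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    "prompt": "Say hello in one short sentence.",
    "max_tokens": 50,
}
print(requests.post(url, headers=headers, json=data).json())
```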
```python
llm = LLM(
    model="hosted_vllm/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    temperature=0.6,
    base_url="https://8fb3-84-104-50-150.ngrok-free.app/v1",  # the public ngrok URL from above
    api_key="NA"
)
```
```python
from crewai import Agent, Task, Crew

# Create a CrewAI agent
agent = Agent(
    role="Researcher",
    goal="Analyze the latest AI trends",
    backstory="An AI expert who gathers and summarizes key insights",
    llm=llm
)

# Define a task with expected_output
task = Task(
    description="Find the top 3 AI trends of 2025.",
    agent=agent,
    expected_output="A list of three key AI trends for 2025 with a brief summary of each."
)

crew = Crew(agents=[agent], tasks=[task])
crew.kickoff()
```
Overriding of current TracerProvider is not allowed
CrewOutput(raw='The top three AI trends for 2025 are: \n1. Generative AI: advancing in alignment with human intent to aid decision-making and understanding, enhancing image generation beyond mere creation. \n2. AI in Sustainability: contributing to climate modeling, resource management, and healthcare through smarter grids and supply chain optimization, with a focus on ethical AI. \n3. Ethical AI: driving responsible applications in education, mental health, and criminal justice, with a trend towards better ethical AI and accountability.', pydantic=None, json_dict=None, tasks_output=[TaskOutput(description='Find the top 3 AI trends of 2025.', name=None, expected_output='A list of three key AI trends for 2025 with a brief summary of each.', summary='Find the top 3 AI trends of 2025....', raw='The top three AI trends for 2025 are: \n1. Generative AI: advancing in alignment with human intent to aid decision-making and understanding, enhancing image generation beyond mere creation. \n2. AI in Sustainability: contributing to climate modeling, resource management, and healthcare through smarter grids and supply chain optimization, with a focus on ethical AI. \n3. Ethical AI: driving responsible applications in education, mental health, and criminal justice, with a trend towards better ethical AI and accountability.', pydantic=None, json_dict=None, agent='Researcher', output_format=<OutputFormat.RAW: 'raw'>)], token_usage=UsageMetrics(total_tokens=2013, prompt_tokens=469, cached_prompt_tokens=0, completion_tokens=1544, successful_requests=2))