Running DeepSeek R1 Locally with vLLM & Ray Dashboard

Feb 26, 2025 · Mohsen Davarynejad · 2 min read

Summary

Large Language Models (LLMs) have revolutionized AI applications, but running them efficiently on local hardware remains a challenge. In this post, we explore how to set up DeepSeek R1 Distill Qwen-1.5B using vLLM, an optimized inference engine, along with Ray Dashboard for real-time resource monitoring. By the end of this guide, you'll have a fast, API-accessible LLM running on your local machine, ready for further agentic integrations.

Check out the Jupyter notebook 01-DeepSeek-R1-Local-Setup-vLLM-API-Ray-Dashboard-CrewAI-Agent-Test here, with an HTML version hosted here.


Introduction

Deploying open-source LLMs locally is now easier than ever, thanks to vLLM, a high-performance inference framework designed for efficiency and scalability. Unlike a vanilla Hugging Face Transformers pipeline, vLLM optimizes memory usage and throughput (most notably through PagedAttention for KV-cache management), making it well suited to running large-scale AI models on consumer GPUs.
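
For orientation, and separate from the API-server mode this post focuses on, vLLM also offers a direct Python API. Here is a minimal offline-inference sketch; the model ID and sampling values below are illustrative assumptions, not values taken from the notebook:

```python
# Minimal sketch (not from the notebook): vLLM's direct Python API for
# offline inference. Model ID and sampling values are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Explain what vLLM does in one sentence."], params)
print(outputs[0].outputs[0].text)
```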

In this post, we will:

  • Set up DeepSeek R1 Distill Qwen-1.5B on a local machine.
  • Expose it via an OpenAI-compatible API (a minimal client sketch follows this list).
  • Use Ray Dashboard to monitor GPU and system resource usage.
  • Optionally, expose the API to the web using ngrok.

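To make these steps concrete, here is a minimal sketch, not taken verbatim from the notebook, of querying the locally served model through its OpenAI-compatible endpoint. It assumes the server was launched with vLLM's OpenAI entry point (for example `python -m vllm.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`) and is listening on vLLM's default port 8000; the prompt is illustrative.

```python
# Minimal sketch: querying the local vLLM server through its
# OpenAI-compatible API. Assumes the server was started with, e.g.:
#   python -m vllm.entrypoints.openai.api_server \
#       --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
# Port 8000 is vLLM's default; to expose the API publicly (optional),
# a tunnel such as `ngrok http 8000` can be used.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local vLLM endpoint
    api_key="EMPTY",                      # vLLM ignores the key by default
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    messages=[{"role": "user", "content": "Summarize why local LLM serving is useful."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Because the endpoint speaks the OpenAI wire protocol, the same client code works against any OpenAI-compatible server, which is also what makes plugging it into agent frameworks such as CrewAI straightforward later on.
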
Why Run DeepSeek R1 Locally?

LLMs are powerful, but running them in the cloud comes with limitations:

  • Cost: Continuous API calls to hosted LLM services can be expensive.
  • Latency: Local execution can be faster for real-time applications.
  • Privacy & Control: Keep your AI workloads entirely on-premise.

By leveraging vLLM and Ray, the local LLM setup is not only efficient but also scalable: it can serve multiple concurrent requests while keeping the GPU well utilized.
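
On the monitoring side, here is a minimal sketch of enabling the Ray Dashboard, assuming a default local Ray installation (8265 is Ray's default dashboard port):

```python
# Minimal sketch: start Ray locally with its dashboard enabled so CPU,
# GPU, and memory usage can be watched while the model serves requests.
import ray

context = ray.init(include_dashboard=True)
# The dashboard typically serves at http://127.0.0.1:8265 by default.
print(f"Ray dashboard: http://{context.dashboard_url}")
```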

For a detailed explanation and line-by-line implementation, check out the full notebook here.

Conclusion

By following this setup, you now have a fully operational LLM API running locally, optimized for efficiency with vLLM and Ray. Whether you’re developing chatbots, research tools, or agentic workflows, this setup provides the foundation for powerful local AI applications.

But running a model is only the beginning! Next, we'll dive into orchestrating AI agents using CrewAI, allowing multiple agents to collaborate, perform tasks, and make decisions dynamically.

🔍 Explore More Topics!

Check out the TAGS list on my website to find interesting topics that match your curiosity.

Continue to the next post: Orchestrating AI Agents with CrewAI and Local DeepSeek API.