Running DeepSeek R1 Locally with vLLM & Ray Dashboard


Summary
Large Language Models (LLMs) have revolutionized AI applications, but running them efficiently on local hardware remains a challenge. In this post, we explore how to set up DeepSeek R1 Distill Qwen-1.5B using vLLM, an optimized inference engine, along with Ray Dashboard for real-time resource monitoring. By the end of this guide, you'll have a fast, API-accessible LLM running on your local machine, ready for further agentic integrations.
Check out the Jupyter notebook 01-DeepSeek-R1-Local-Setup-vLLM-API-Ray-Dashboard-CrewAI-Agent-Test here, with an HTML version hosted here.
Introduction
Deploying open-source LLMs locally is now easier than ever, thanks to vLLM, a high-performance inference framework designed for efficiency and scalability. Unlike a standard Hugging Face Transformers pipeline, vLLM optimizes memory usage and throughput, making it practical to serve large models on consumer GPUs.
In this post, we will:
- Set up DeepSeek R1 Distill Qwen-1.5B on a local machine.
- Expose it via an OpenAI-compatible API.
- Use Ray Dashboard to monitor GPU and system resource usage.
- Optionally, expose the API to the web using ngrok (minimal sketches of these steps follow below).
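As a quick preview of the serving step, here is a minimal sketch of the general shape of the setup. The exact flags may differ from what the notebook uses, and the port (vLLM's default, 8000) and the placeholder API key are assumptions; treat this as an outline, not the definitive commands.

```python
# Minimal sketch: launch vLLM's OpenAI-compatible server, then query it.
# Assumptions: vLLM and the `openai` client are installed, the model fits
# on your GPU, and the server listens on localhost:8000 (vLLM's default).
#
# Start the server in a terminal (or a notebook cell prefixed with `!`):
#   python -m vllm.entrypoints.openai.api_server \
#       --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

from openai import OpenAI

# vLLM does not check the API key by default, so any placeholder works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    messages=[{"role": "user", "content": "Explain vLLM in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

For the optional ngrok step, one convenient route is the pyngrok wrapper; this is my assumption of an easy way to tunnel the port, not necessarily the method the notebook uses.

```python
# Sketch: tunnel the local vLLM port to a public URL with pyngrok.
# Assumptions: `pip install pyngrok` and an ngrok auth token configured.
from pyngrok import ngrok

tunnel = ngrok.connect(8000)  # forward the vLLM server port
print("Public OpenAI-compatible base URL:", f"{tunnel.public_url}/v1")
```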
Why Run DeepSeek R1 Locally?
LLMs are powerful, but running them in the cloud comes with limitations:
- Cost: Continuous API calls to hosted LLM services can be expensive.
- Latency: Local execution avoids network round-trips, which can make real-time applications more responsive.
- Privacy & Control: Keep your AI workloads entirely on-premise.
By leveraging vLLM and Ray, we ensure that our local LLM setup is not only efficient but also scalable, able to serve concurrent requests while keeping GPU utilization high (see the sketch below).
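To get the monitoring side running, one common pattern is to start Ray yourself before launching vLLM so the dashboard is already up. The sketch below assumes a default local Ray installation with the dashboard extras and the dashboard's default port, 8265.

```python
# Sketch: start a local Ray instance with its dashboard enabled so GPU and
# system utilization can be watched while vLLM serves requests.
# Assumption: Ray is installed with dashboard support (`pip install "ray[default]"`).
import ray

context = ray.init(include_dashboard=True)
print("Ray dashboard:", context.dashboard_url)  # typically 127.0.0.1:8265
```

Open the printed URL in a browser to watch CPU, memory, and GPU usage in real time while you send requests to the vLLM server.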
For a detailed explanation and line-by-line implementation, check out the full notebook here.
Conclusion
By following this setup, you now have a fully operational LLM API running locally, optimized for efficiency with vLLM and Ray. Whether you’re developing chatbots, research tools, or agentic workflows, this setup provides the foundation for powerful local AI applications.
But running a model is only the beginning! Next, we'll dive into orchestrating AI agents using CrewAI, allowing multiple agents to collaborate, perform tasks, and make decisions dynamically.
Explore More Topics!
Check out the TAGS list on my website to find interesting topics that match your curiosity.
Continue to the next post: Orchestrating AI Agents with CrewAI and Local DeepSeek API.