Deploying DeepSeek-R1 Locally with vLLM on Ubuntu


With the rapid advancements in AI, running large language models locally has become an attractive alternative to cloud-based solutions. DeepSeek-R1, a state-of-the-art open-source model, enables advanced reasoning and problem-solving capabilities. Deploying it locally on Ubuntu within WSL (Windows Subsystem for Linux) offers enhanced privacy, reduced latency, and greater control over performance.
This guide will walk you through setting up DeepSeek-R1 on Ubuntu (WSL) using vLLM, a highly efficient inference engine. Whether you’re an AI enthusiast, a developer, or a researcher, this step-by-step tutorial will help you harness the full power of DeepSeek-R1 on your local machine.
Understanding DeepSeek-R1
DeepSeek-R1 is a state-of-the-art open-source reasoning model built on a Mixture of Experts (MoE) architecture, featuring 671 billion parameters. Uniquely, it activates only 37 billion parameters during each forward pass, balancing performance and efficiency. This design enables advanced reasoning capabilities, making it suitable for complex tasks in mathematics, coding, and logical analysis.
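To make the "only a fraction of the parameters is active per token" idea concrete, here is a toy sketch of Mixture-of-Experts routing. It is a simplified illustration, not DeepSeek-R1's actual implementation: a gating network scores every expert for each token, but only the top-k experts are actually executed.
# Toy Mixture-of-Experts layer: only the top-k experts run per token,
# so only a fraction of the total parameters is used in each forward pass.
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    def __init__(self, dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):                          # x: (tokens, dim)
        scores = self.gate(x)                      # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.size(0)):                 # route each token to its top-k experts
            for w, e in zip(weights[t], idx[t]):
                out[t] += w * self.experts[int(e)](x[t])
        return out

layer = ToyMoELayer()
print(layer(torch.randn(4, 64)).shape)  # torch.Size([4, 64])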
Advantages of Local Deployment
Running DeepSeek-R1 locally on Ubuntu within WSL offers several benefits:
- Data Privacy & Security: Your data remains on-premises, ensuring confidentiality without third-party access.
- Offline Functionality: Operate without an internet connection, ensuring availability in remote or secure environments.
- Customization & Flexibility: Tailor the model to specific needs, integrate with local applications, and fine-tune parameters.
- Performance & Speed: Experience reduced latency and faster response times by leveraging local hardware resources.
- Cost Efficiency: Eliminate recurring cloud service fees and manage resources based on actual requirements.
Prerequisites
Before proceeding, ensure your system meets the following requirements:
Hardware:
- A GPU with CUDA support (NVIDIA GPUs recommended for optimal performance).
- Minimum 16GB RAM (32GB or more recommended for large-scale tasks).
- SSD storage for faster model loading and inference.
Software:
- Ubuntu (running on WSL2 with GPU support enabled).
- Python 3.8 or higher.
- CUDA 11.6 or higher (for GPU acceleration).
- PyTorch (the backend used by vLLM and the Transformers workflows in this guide).
- Git and virtual environment tools for dependency management.
Setting Up WSL for DeepSeek-R1
1. Install WSL2 and Ubuntu
Before starting, ensure that WSL2 (Windows Subsystem for Linux) is installed on your system. If it is not, open PowerShell as Administrator and run:
wsl --install -d Ubuntu
- Restart your system if prompted.
- Once the installation is complete, open Ubuntu from the Start menu and set up your user credentials.
Alternative: Using a Cloud-based GPU Instance
Students who prefer to run their workloads on a GPU-enabled instance can opt for a cloud provider instead. A great option is Vultr (vultr.com), which offers GPU-enabled instances and free sign-up credits for new users.
Other cloud providers, such as AWS, GCP, and Azure, also offer GPU instances and often provide free trial credits, an extra advantage for those looking for a cost-effective option.
2. Enable GPU Support in WSL2
Ensure the NVIDIA driver is installed on Windows (WSL2 uses the host's GPU driver), then install the CUDA toolkit inside Ubuntu:
sudo apt update && sudo apt upgrade -y
sudo apt install -y nvidia-cuda-toolkit
Confirm CUDA installation with:
nvcc --version
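Once PyTorch is installed (it is pulled in by the vLLM or Transformers steps below), you can also verify from Python that the GPU is visible inside WSL2:
# Confirm that PyTorch can see the GPU from inside WSL2.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("CUDA version used by PyTorch:", torch.version.cuda)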
3. Install Dependencies
sudo apt install -y python3-pip git
pip install --upgrade pip
Installation Methods
1. Using Ollama
Ollama simplifies the management of AI models locally.
Step 1: Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
Step 2: Install DeepSeek-R1 Model
ollama pull deepseek-r1
Step 3: Run DeepSeek-R1
ollama run deepseek-r1
Step 4: Test the Installation
Interact with the model by entering prompts in your terminal.
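Beyond the interactive prompt, Ollama also exposes a local REST API (port 11434 by default) while the model is running, so you can call DeepSeek-R1 from your own scripts. A minimal sketch using only the Python standard library; the prompt is illustrative:
# Query the local Ollama API for a single (non-streaming) completion.
import json
import urllib.request

payload = json.dumps({
    "model": "deepseek-r1",
    "prompt": "What is DeepSeek-R1?",
    "stream": False,          # return one JSON object instead of a token stream
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])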
2. Using vLLM
vLLM is optimized for efficient inference with transformer-based models. It enables high-throughput and low-latency inference, making it ideal for deploying LLMs (Large Language Models) in production environments.
Note: For a detailed step-by-step guide on installing and running vLLM with DeepSeek locally, students can follow the instructional blog post Running DeepSeek with vLLM Locally, which includes a Jupyter Notebook for following along and experimenting with vLLM interactively.
Step 1: Install vLLM
Ensure you have pip installed, then run:
pip install vllm
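Before wiring up the API server, you can sanity-check the installation with vLLM's offline Python API. The sketch below uses a distilled R1 checkpoint from Hugging Face as an assumption for single-GPU hardware (the full 671B-parameter model will not fit on a consumer GPU); vLLM downloads the weights automatically on first use:
# Offline (serverless) inference with vLLM's Python API.
from vllm import LLM, SamplingParams

# A distilled R1 variant keeps the example within a single GPU's memory.
llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=256)

outputs = llm.generate(["What is DeepSeek-R1?"], params)
print(outputs[0].outputs[0].text)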
Step 2: Download DeepSeek-R1 Model
Clone the DeepSeek-R1 repository for reference and navigate into it (the model weights themselves are hosted on Hugging Face and are downloaded automatically the first time they are loaded):
git clone https://github.com/deepseek-ai/deepseek-r1.git
cd deepseek-r1
Step 3: Load the Model with vLLM
Initialize the model by starting the vLLM API server; the --model flag accepts a Hugging Face model ID or a local path to the weights:
python -m vllm.entrypoints.api_server --model deepseek-ai/DeepSeek-R1
Step 4: API Access
Interact with DeepSeek-R1 via API calls for various applications.
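As a concrete example, if you start the OpenAI-compatible server instead (python -m vllm.entrypoints.openai.api_server --model <model>, or vllm serve <model> in recent releases), it listens on port 8000 and accepts standard /v1/completions requests. A minimal client sketch using the Python standard library; the model name must match whatever the server was started with:
# Send a completion request to a locally running vLLM OpenAI-compatible server.
import json
import urllib.request

payload = json.dumps({
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",  # must match the served model
    "prompt": "What is DeepSeek-R1?",
    "max_tokens": 256,
    "temperature": 0.6,
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["text"])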
3. Using Transformers (Hugging Face)
The Transformers library allows for flexible model loading and customization.
Step 1: Install Dependencies
pip install transformers torch accelerate
Step 2: Load the Model
In a Python script or interactive session:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Download (or load from the local cache) the tokenizer and weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-r1")
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-r1")
Step 3: Generate Text
Test the model’s functionality:
# Tokenize the prompt, generate a continuation, and decode it back to text.
input_text = "What is DeepSeek-R1?"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0]))
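If a CUDA device is available and the checkpoint fits in GPU memory (use a distilled checkpoint if it does not), you can continue the snippet above by moving the model to the GPU and controlling the generation length and sampling:
# Optional: run generation on the GPU and control output length/sampling.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

prompt = "Explain the Mixture of Experts idea in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.6, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))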
Step 4: Fine-Tuning (Optional)
For advanced users, fine-tuning DeepSeek-R1 on custom datasets can be achieved using the Hugging Face Trainer API.
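As a rough sketch of what that looks like, the example below wires a toy two-sentence dataset into the Trainer API. It assumes the datasets package is installed (pip install datasets) and uses a small distilled checkpoint, since fine-tuning the full 671B-parameter model is far beyond a single machine; treat it as a starting point, not a complete recipe.
# Minimal causal-LM fine-tuning sketch with the Hugging Face Trainer API.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # small distilled checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for padding during collation

# Toy dataset: replace with your own texts.
texts = ["DeepSeek-R1 is a reasoning-focused language model.",
         "It uses a Mixture of Experts architecture."]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="r1-finetune",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()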
Troubleshooting Common Issues
- CUDA Not Recognized: Ensure CUDA is installed and accessible by verifying with nvcc --version.
- Memory Errors: Reduce batch size or switch to CPU mode (see the half-precision sketch after this list for another option):
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-r1", device_map="cpu")
- Slow Performance: Optimize by using TorchScript or ONNX.
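For memory errors, another common middle ground between full precision and CPU mode is to load the weights in half precision and let Accelerate place layers automatically. A minimal sketch, assuming the accelerate package from the earlier install step; swap in a smaller distilled checkpoint if the full model still does not fit:
# Load in float16 and let accelerate decide device placement (GPU first, then CPU offload).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1",        # or a smaller distilled checkpoint
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,           # DeepSeek model repos ship custom modeling code
)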
By following this guide, you can efficiently set up and run DeepSeek-R1 on Ubuntu within WSL2, leveraging local resources for enhanced performance and privacy.