[Figure: vLLM on an Ubuntu 24.04 server with an NVIDIA GPU serving a high-performance, OpenAI-compatible API to internal users across a secure, on-premises network.]

How to Install vLLM on Ubuntu 24.04 (CUDA 12.x) and Serve an OpenAI-Compatible API

Published: October 21, 2025

Executive Summary

Deploying large language models (LLMs) in a production environment presents significant performance and cost challenges. High latency and low throughput can render applications unusable, while reliance on third-party cloud APIs introduces spiraling costs and critical data sovereignty risks. vLLM, a high-throughput LLM serving engine, solves these problems with PagedAttention, a memory-management technique that dramatically increases throughput and concurrency.

This comprehensive guide provides a full, production-focused walkthrough for installing vLLM on Ubuntu 24.04 with NVIDIA CUDA 12.x support. We move beyond a simple install, showing you how to configure a robust, OpenAI-compatible API endpoint, test it, and daemonize the service with systemd for reliable, enterprise-grade operation. This empowers your organization to build a secure, cost-effective, and high-performance private AI infrastructure.

The Business Case for Self-Hosted, High-Performance Inference

The generative AI revolution has moved from experimentation to production. In a 2024 Gartner survey, 55% of organizations reported moving their AI pilots into production, seeking tangible business value [Gartner, 2024]. However, this transition exposes the critical bottleneck of LLM inference—the process of generating responses. Standard serving methods, such as those from Hugging Face Transformers, are notoriously inefficient, leading to high latency and low throughput.

This inefficiency forces businesses into a false choice: either invest in a massive, expensive fleet of GPUs to handle concurrent users or outsource processing to cloud APIs like OpenAI, sacrificing data control and facing unpredictable, usage-based billing.

What is vLLM and Why is it a Game-Changer?

vLLM is an open-source library, developed by researchers at UC Berkeley, that is designed specifically for high-throughput and memory-efficient LLM inference. Its core innovation is PagedAttention, an attention algorithm inspired by virtual memory and paging in operating systems.

In simple terms, traditional serving engines waste vast amounts of GPU VRAM (video memory) by pre-allocating a large, contiguous block of KV-cache memory for every single request, sized for the longest possible output even when the request turns out to be short. PagedAttention, by contrast, allocates that memory in small, non-contiguous blocks on demand. This allows for dramatically better memory utilization, near-zero waste, and the ability to handle far more concurrent users on the same GPU.
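
As a rough sense of scale (a back-of-envelope sketch, assuming Llama-3.1-8B's published architecture: 32 layers, 8 KV heads, head dimension 128, FP16 cache), the KV cache costs roughly 128 KiB per token:

# KV cache per token = 2 (K and V) x layers x KV heads x head dim x 2 bytes (FP16)
python3 -c 'print(2 * 32 * 8 * 128 * 2, "bytes per token")'   # 131072 bytes, ~128 KiB
# Reserving a full 4096-token context up front would pin ~512 MiB per request, mostly
# wasted on short requests; PagedAttention allocates these blocks only as they are needed.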

The results are staggering. According to the vLLM project's own benchmarks, their engine can achieve up to 24x higher throughput compared to standard Hugging Face (HF) Transformers and 3.5x higher throughput than Hugging Face Text Generation Inference (TGI) [vLLM Blog]. This isn't a minor improvement; it's a fundamental shift in performance.

Drastic Cost Reduction

Serving 100 million tokens per day on a powerful cloud API like OpenAI's GPT-4o could cost over $1,500 per day (assuming a 50/50 input/output mix) [OpenAI Pricing, 2024]. Self-hosting with vLLM on a single enterprise GPU (like an NVIDIA L40S) amortizes the hardware cost within months, then drops operational cost to just power and maintenance, transforming a variable OpEx nightmare into a predictable CapEx investment.

Massive Performance Gains

High throughput isn't just a number; it's the difference between a real-time chatbot and a frustrating "loading..." spinner. vLLM's efficiency means you can serve dozens of concurrent users on a single GPU that would have previously struggled with just one or two, enabling truly interactive AI applications for your entire organization.

Total Data Sovereignty

When you send a prompt to a cloud API, your sensitive data (financial reports, patient information, legal contracts, proprietary code) is processed on a third-party server. By self-hosting with vLLM, your prompts and the model's responses never leave your secure network perimeter. For organizations subject to HIPAA, GDPR, or PCI-DSS, this is not just a preference; it can be a compliance requirement.

Unmatched Customization

You are no longer restricted to a curated list of cloud models. With vLLM, you can download and serve any of the thousands of open-source models available, including industry-specific models (like for law or medicine) or models you have fine-tuned on your own company data. This allows you to build a true competitive advantage with a bespoke AI.

Step 1: Preparing Your Ubuntu 24.04 Server

Before we can install vLLM, we must build a stable, production-ready foundation. This involves selecting the correct hardware and, most importantly, correctly configuring the NVIDIA driver stack on Ubuntu 24.04 LTS (Noble Numbat).

Hardware Requirements

  • GPU: A CUDA-compatible NVIDIA GPU is non-negotiable. For production, this means data center cards like the NVIDIA A100, H100, or L40S. For development or smaller-scale production, a high-VRAM consumer GPU like the RTX 3090 (24GB) or RTX 4090 (24GB) is suitable. VRAM is the primary constraint (a rough sizing sketch follows this list).
  • System RAM: 64GB at a minimum. You need enough RAM to load model weights before they are transferred to VRAM.
  • Storage: A fast NVMe SSD. LLM model files are massive (a 70B parameter model can be 140GB+), and loading them from a slow disk will create a significant boot-up bottleneck.
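
As a rough sizing sketch (assumptions: FP16/BF16 weights at 2 bytes per parameter, with extra headroom needed for the KV cache, activations, and CUDA overhead), you can estimate whether a model's weights will even fit in VRAM before buying hardware:

# Weights alone: parameter count x 2 bytes (FP16/BF16)
python3 -c 'print(8e9 * 2 / 1e9, "GB")'    # Llama-3.1-8B  -> ~16 GB, fits a 24 GB card
python3 -c 'print(70e9 * 2 / 1e9, "GB")'   # Llama-3.1-70B -> ~140 GB, needs multiple GPUs or quantization
# Budget roughly 20-40% on top of the weights for the KV cache and runtime overhead.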

Installing NVIDIA Drivers (The "Great Filter")

This is the single most common failure point in any AI infrastructure deployment. Driver conflicts, kernel module mismatches, and Secure Boot issues can stop a project before it starts. Ubuntu 24.04 has excellent hardware support, and we recommend using the built-in tools for maximum stability and compatibility, especially with Secure Boot.

1. Update Your System and Check for Drivers:

sudo apt update && sudo apt upgrade -y
sudo ubuntu-drivers devices

This will list your hardware and show the recommended proprietary driver. The output will look something like this:

== /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 ==
modalias : pci:v000010DEd00002204sv000010DEsd00001380bc03sc00i00
vendor   : NVIDIA Corporation
model    : GA102 [GeForce RTX 3090]
driver   : nvidia-driver-550-open - distro non-free recommended
driver   : nvidia-driver-550 - distro non-free

2. Install the Recommended Driver:

sudo ubuntu-drivers install

This command automatically installs the recommended driver (e.g., `nvidia-driver-550`). If Secure Boot is enabled, it also configures a MOK (Machine Owner Key) so the signed kernel module can load; you will be prompted to create a password during installation, and you must remember it for the next step.

3. Reboot and Enroll the MOK Key:

sudo reboot

This reboot is mandatory. If Secure Boot is enabled, the system will boot into a blue "MOK management" screen on restart. Select "Enroll MOK," then "Continue," and enter the password you created during installation. This enrolls the key that signed the NVIDIA kernel module, allowing it to load with Secure Boot enabled. (If Secure Boot is disabled, the system boots straight back into Ubuntu and you can skip this step.)

4. Verify Driver Installation:

After rebooting back into Ubuntu, run the "hello, world" of GPU infrastructure:

nvidia-smi

If successful, you will see a detailed report of your GPU, the driver version, and the CUDA Version (e.g., 12.4). This confirms your system is ready.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07            Driver Version: 550.90.07      CUDA Version: 12.4    |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf           Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090         Off| 00000000:01:00.0 Off |                  N/A |
| 30%   42C    P8             17W / 450W  |      0MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
...
+---------------------------------------------------------------------------------------+

Critical Stop Point

If `nvidia-smi` fails, DO NOT PROCEED. This command must succeed. Driver management is complex, involving kernel module signing and potential conflicts. This is a common area where ITECS's Linux IT support and AI consulting services save organizations weeks of troubleshooting.

Step 2: Installing vLLM in a Virtual Environment

With the GPU drivers operational, we can now install vLLM. It is enterprise best practice to never use the system's root Python environment for applications. We will use Python's built-in `venv` module to create an isolated, self-contained environment.

1. Install Python 3 and venv:

sudo apt install python3.12-venv -y

2. Create and Activate the Virtual Environment:

We will create a directory for our vLLM service in `/opt`, a standard location for optional third-party software.

sudo mkdir -p /opt/vllm
sudo chown $USER:$USER /opt/vllm
python3 -m venv /opt/vllm/env
source /opt/vllm/env/bin/activate

Your terminal prompt will now be prefixed with `(env)`, indicating you are in the isolated environment.
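
You can verify the isolation before installing anything; both commands should resolve inside /opt/vllm/env:

which python3   # expect /opt/vllm/env/bin/python3
which pip3      # expect /opt/vllm/env/bin/pip3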

3. Install PyTorch with CUDA 12.1 Support:

vLLM is built on PyTorch. The PyTorch build you install must target a CUDA runtime your driver can support; our `nvidia-smi` output reported CUDA 12.4, so any CUDA 12.x wheel will work, and we will use the official CUDA 12.1 build from PyTorch. (Note that `pip install vllm` in the next step may replace this with the specific torch version vLLM was built against; that is expected.)

pip3 install torch --index-url https://download.pytorch.org/whl/cu121
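
Before continuing, it is worth confirming that this PyTorch build can actually see the GPU (a quick sanity check; the exact version strings will differ on your system):

python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
# Expected output resembles: 2.x.x+cu121 12.1 True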

4. Install vLLM and Flash Attention:

Finally, we install vLLM itself. Optionally, you can also install `flash-attn` for an additional performance boost on compatible GPUs (Ampere architecture and newer); be aware that it often compiles from source, which can take a long time.

pip3 install vllm
pip3 install flash-attn --no-build-isolation
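
A quick import check confirms the installation succeeded (the version number will vary):

python3 -c "import vllm; print(vllm.__version__)"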

Step 3: Serving an OpenAI-Compatible API

One of vLLM's most powerful features is its ability to launch a web server that mimics the official OpenAI API. This makes it a drop-in replacement for any application that is already built on the `openai` Python library, such as Open WebUI or custom scripts.

We will launch the server to serve `meta-llama/Llama-3.1-8B-Instruct`, a powerful and popular model. Note that this is a gated model on the Hugging Face Hub: you must accept Meta's license on the model page and authenticate before vLLM can download the weights.
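
One way to authenticate (the token below is a hypothetical placeholder; generate your own under Settings > Access Tokens on huggingface.co):

# Interactive login stores the token in your Hugging Face cache
huggingface-cli login
# Or export it for the current shell session
export HF_TOKEN="hf_xxxxxxxxxxxxxxxx"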

python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1

Understanding the Command Flags:

  • --model: Specifies the model to serve from the Hugging Face Hub. On first run, vLLM will download this model, which may take several minutes.
  • --host 0.0.0.0: Binds the server to all network interfaces, not just localhost. This is critical for production access.
  • --port 8000: The port to listen on.
  • --tensor-parallel-size 1: This is the number of GPUs to use. For a single-GPU server, this is 1. If you had 4 GPUs, you would set this to 4 to split the model across them for even higher performance.
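
Beyond these basics, a few additional flags are frequently useful in production. The sketch below is illustrative; the flags exist in current vLLM releases, but verify the exact behavior against `--help` for your installed version:

# --gpu-memory-utilization : fraction of VRAM vLLM may claim for weights + KV cache (default 0.9)
# --max-model-len          : cap the context window so the KV cache fits on smaller GPUs
# --api-key                : require this bearer token on every request
python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --api-key "change-me"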

You will see vLLM start up, load the model, and confirm it is "running on http://0.0.0.0:8000".

Security Note: Configure Your Firewall

Binding to 0.0.0.0 makes your API available to the network. You must configure your firewall to control access. For a simple setup, only allow access from trusted IPs. For a public setup, you must put this behind a reverse proxy like Nginx with TLS encryption.

# Example: allow SSH first so you don't lock yourself out, then restrict the API to your internal network
sudo ufw allow OpenSSH
sudo ufw allow from 192.168.1.0/24 to any port 8000 proto tcp
sudo ufw enable

Step 4: Verifying and Using the API

Your API is running. Let's test it from another terminal.

Test 1: List Available Models with curl

This mimics the OpenAI /v1/models endpoint.

curl http://localhost:8000/v1/models

You will get a JSON response showing the model you are serving:

{
  "object": "list",
  "data": [
    {
      "id": "meta-llama/Llama-3.1-8B-Instruct",
      "object": "model",
      "created": 1716800000,
      "owned_by": "vllm",
      "root": "meta-llama/Llama-3.1-8B-Instruct",
      ...
    }
  ]
}

Test 2: Send a Chat Completion with curl

This mimics the /v1/chat/completions endpoint.

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of Texas?"}
    ],
    "max_tokens": 50
  }'

You will receive a standard OpenAI-formatted response, and you'll notice it generates extremely quickly.
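
The endpoint also supports the OpenAI streaming protocol. Adding "stream": true returns the reply incrementally as server-sent events, which is what chat UIs use for the "typing" effect (a quick check, assuming the same server as above):

curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Count to five."}],
    "max_tokens": 50,
    "stream": true
  }'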

Test 3: The "Drop-in Replacement" (Python openai Client)

This is the most powerful test. Install the openai library (pip install openai) and run the following Python script. Notice the only change from a standard OpenAI script is the base_url and a dummy api_key.

from openai import OpenAI

# Point the client to your vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="NOT_USED" # vLLM doesn't require an API key
)

completion = client.chat.completions.create(
  model="meta-llama/Llama-3.1-8B-Instruct",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain PagedAttention in one sentence."}
  ],
  max_tokens=100
)

print(completion.choices[0].message.content)

This confirms your server is fully compatible. Any application designed for OpenAI can now use your secure, on-premises, high-speed LLM server with a simple configuration change.

Step 5: Productionizing vLLM with Systemd

Running a critical application in a terminal window is not a production strategy. If you log out or the server reboots, the service dies. We must "daemonize" the vLLM server using systemd, Ubuntu's native service manager. This will ensure it starts automatically on boot and restarts if it crashes.

1. Create a systemd Unit File:

sudo nano /etc/systemd/system/vllm.service

2. Paste the following configuration:

This file tells systemd what to run and how to run it.

[Unit]
Description=vLLM OpenAI-Compatible API Service
Wants=network-online.target
After=network-online.target

[Service]
# Best practice: run as a dedicated, non-privileged user
# (creation commands are shown in the note below this file)
User=vllm-user
Group=vllm-user

# Command to execute
# Note the full path to the python binary in the venv
ExecStart=/opt/vllm/env/bin/python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1

# Set the Hugging Face home directory for this user
Environment="HF_HOME=/home/vllm-user/.cache/huggingface"
# For gated models, also provide a token (placeholder value shown):
# Environment="HF_TOKEN=hf_xxxxxxxxxxxxxxxx"

# Restart policy
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target

Note: Before this will work, you must create the vllm-user and give it ownership of the /opt/vllm directory and a home directory for model caching.

sudo useradd -r -s /bin/false -m -d /home/vllm-user vllm-user
sudo chown -R vllm-user:vllm-user /opt/vllm
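
Because the service account cannot log in interactively, it can help to pre-download the model into that user's cache before the first start, so systemd does not sit through a multi-gigabyte download. A sketch; it assumes `huggingface-cli` was installed into the venv as a vLLM dependency, and the HF_TOKEN placeholder is only needed for gated models:

sudo -u vllm-user env HF_HOME=/home/vllm-user/.cache/huggingface HF_TOKEN="hf_xxxxxxxxxxxxxxxx" \
  /opt/vllm/env/bin/huggingface-cli download meta-llama/Llama-3.1-8B-Instruct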

3. Enable and Start the Service:

sudo systemctl daemon-reload
sudo systemctl enable vllm.service
sudo systemctl start vllm.service

4. Check the Status:

sudo systemctl status vllm.service

You should see an active (running) status in green. Your vLLM API server is now a robust service that will start on boot and run 24/7.
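
If the status is anything other than active, the service's logs are the first place to look; model download and loading progress also appears here:

# Follow the vLLM service logs in real time
journalctl -u vllm.service -f
# Or review everything from the current boot
journalctl -u vllm.service -b --no-pager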

From Installation to Transformation: The ITECS Managed Intelligence Advantage

You have now successfully deployed a state-of-the-art, high-performance LLM serving engine. This is a significant technical achievement that puts your organization light-years ahead in terms of AI capability, cost-efficiency, and data security.

However, the journey from a running service to a transformed business operation is complex. The technical setup is just the beginning. Production AI infrastructure introduces new operational burdens that can quickly overwhelm an internal IT team:

  • Security & Hardening: How do you expose this service securely? It needs a reverse proxy, SSL termination, rate-limiting, and authentication. How does it integrate with your existing Active Directory or SSO?
  • Monitoring & Observability: What happens when performance degrades? You need 24/7 monitoring of GPU utilization, VRAM usage, temperatures, API latency, and token throughput to preempt failures (a starting point is sketched after this list).
  • Lifecycle Management: NVIDIA releases new drivers. vLLM and PyTorch ship updates frequently. Who manages this patching and validation cadence to ensure new versions don't break your production workloads?
  • Integration & Strategy: Now that you have the API, how do you integrate it into your business applications? How do you build RAG (Retrieval-Augmented Generation) pipelines to feed it your company data?
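
As a starting point for the monitoring item above (a minimal sketch, assuming the vLLM server's default Prometheus-style /metrics endpoint is enabled):

# GPU telemetry every 5 seconds: utilization, VRAM, temperature, power draw
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw \
  --format=csv -l 5
# vLLM also publishes request latency, token throughput, and KV-cache usage here,
# ready to be scraped by a Prometheus/Grafana stack
curl http://localhost:8000/metrics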

This is where technical capability must be paired with strategic management. ITECS empowers businesses to bridge this exact gap. Our AI Consulting & Managed Intelligence services are built on top of our flagship MSP ELITE package to provide a holistic solution. We don't just help you build it; we manage its entire lifecycle.

ITECS handles the complex infrastructure—from the initial hardware procurement and driver management to the 24/7/365 security monitoring and performance tuning. We transform your raw AI capability into a secure, reliable, and powerful business tool, allowing your team to focus on leveraging AI, not managing it.

Conclusion: Take Control of Your AI Future

By deploying vLLM on Ubuntu 24.04, you have replaced a costly, insecure, and slow cloud dependency with an incredibly fast, secure, and cost-effective on-premises asset. You now have a "drop-in" replacement for the OpenAI API that gives you full control over your data, your models, and your costs.

This guide provides the technical foundation. The next step is to integrate this power into your business. Whether you are building internal chatbots, enhancing your applications, or analyzing proprietary data, the path is now open.

Transform Your AI Infrastructure with ITECS

Building your own AI server is a powerful first step. Ensuring it's secure, optimized, and managed 24/7 is the key to business transformation. ITECS's AI Consulting services provide the strategic roadmap and expert management to turn your AI ambitions into reality.

Schedule Your AI Infrastructure Consultation Today
