How to Run LLMs Locally with Ollama, LM Studio, and GGUF Models
The world of Large Language Models (LLMs) isn’t confined to cloud giants anymore. Running LLMs locally has become not just feasible but incredibly powerful for developers. It offers unparalleled privacy, low-latency responses with no network round-trips, significant cost savings, and the ability to work completely offline.
This guide will walk you through setting up and using two of the most popular tools for local LLMs: Ollama and LM Studio, both leveraging the efficient GGUF model format.
Why Run LLMs Locally?
- Privacy & Security: Your data never leaves your machine. Essential for sensitive applications or personal use.
- Cost Efficiency: No API fees, no subscription costs. Once you have the hardware, the inference is free.
- Speed: Eliminate network latency. Responses can be significantly faster, especially for short queries.
- Offline Capability: Develop and use AI applications without an internet connection.
- Customization: Fine-tune and experiment with models more freely without worrying about cloud resource limits or costs.
Understanding GGUF Models
Before diving into the tools, let’s understand the bedrock of efficient local LLM inference: GGUF.
What is GGUF?
GGUF is a file format designed by Georgi Gerganov’s llama.cpp project for storing and distributing LLM models. It’s the successor to the GGML format, offering better extensibility and memory-mapping capabilities.
The magic of GGUF lies in its ability to support quantization.
Quantization: Smaller, Faster, Still Smart
Quantization is a technique that reduces the precision of the model’s weights (e.g., from 32-bit floating-point numbers to 8-bit integers or even 2-bit integers). This results in:
- Significantly smaller file sizes: Easier to download and store.
- Lower memory footprint: Requires less RAM/VRAM to load and run.
- Faster inference: Less data to process means quicker computations.
The trade-off is a slight reduction in model accuracy, but for many use cases the performance gains heavily outweigh this minor drop. You’ll often see models with labels like Q4_K_M, Q5_K_M, Q8_0, etc. These indicate different quantization levels, with lower numbers (e.g., Q4) offering more compression and higher numbers (e.g., Q8) offering better accuracy at the cost of size.
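To get a rough feel for what these levels mean in practice, a back-of-the-envelope calculation helps: file size is roughly parameter count times bits per weight. The bits-per-weight figures in this sketch are ballpark values, not exact GGUF numbers (real files mix precisions per tensor, and you still need headroom for the KV cache at runtime), but it shows why a Q4 file is roughly a quarter the size of FP16:
# Rough size estimate: parameters * bits-per-weight / 8 bytes.
# The bits-per-weight values below are approximations, not exact GGUF figures.
def estimate_size_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for label, bits in [("FP16", 16.0), ("Q8_0", 8.5), ("Q5_K_M", 5.7), ("Q4_K_M", 4.8), ("Q2_K", 2.8)]:
    print(f"7B model at {label:<7}: ~{estimate_size_gb(7, bits):.1f} GB")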
Where to Find GGUF Models
The primary hub for GGUF models is Hugging Face. Many community members convert popular models (like Llama, Mistral, Mixtral, Zephyr) into the GGUF format and upload them. You can search for models and filter by the GGUF format (the files use the .gguf extension).
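If you’d rather script the download than click through the website, the huggingface_hub library can fetch a single GGUF file for you. A minimal sketch; the repo_id and filename here are only illustrative, so substitute the repository and quantization you actually picked on Hugging Face:
# pip install huggingface_hub
from huggingface_hub import hf_hub_download

# Illustrative repo/filename -- replace with the GGUF repo and quant you chose.
model_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
)
print(f"GGUF file saved to: {model_path}")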
Tool 1: Ollama – The CLI & API Powerhouse
Ollama is a fantastic tool for running LLMs. It’s lightweight, easy to use, and provides both a command-line interface and a robust HTTP API, making it perfect for developers integrating LLMs into applications.
Installation
Ollama supports Linux, macOS, and Windows (natively or via WSL).
macOS/Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows:
Download the installer directly from ollama.ai/download/windows. Ollama for Windows includes a GUI installer and runs as a background service.
Verification:
After installation, open a new terminal and run:
ollama --version
ollama version is 0.1.18
Finding and Pulling Models
Ollama hosts a library of models optimized for its platform at ollama.ai/library. These are essentially pre-quantized GGUF models ready to go.
To download a model, use the ollama pull command. Let’s start with orca-mini, a small, fast model great for testing.
ollama pull orca-mini
pulling manifest
pulling 009ad43ffc6a... 100%
Now you have a small LLM ready to use; Ollama stores and manages the GGUF file for you.
Running Models via CLI
The most straightforward way to interact with an Ollama model is through the command line.
ollama run orca-mini
>>> Send a message (/? for help)
You can now type your prompts directly into the terminal. To exit, type /bye or press Ctrl+D.
Example Interaction:
ollama run orca-mini
>>> What is the capital of France?
Paris is the capital of France.
>>> What are some other major cities in France?
Other major cities in France include:
* Marseille
* Lyon
* Toulouse
* Nice
* Nantes
* Strasbourg
* Montpellier
* Bordeaux
* Lille
* Rennes
>>> /bye
Running Models via HTTP API (Programmatic Access)
Ollama automatically starts a local HTTP server on http://localhost:11434. This makes it incredibly easy to integrate LLMs into your applications using standard HTTP requests.
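You don’t even need a client library to try it. The sketch below posts straight to Ollama’s /api/chat endpoint with the requests package, assuming the Ollama service is running and you’ve already pulled orca-mini; the client libraries in the following examples wrap this same API:
# pip install requests
import requests

# Non-streaming chat request against the local Ollama server.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "orca-mini",
        "messages": [{"role": "user", "content": "In one sentence, what is GGUF?"}],
        "stream": False,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])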
Python Example
First, install the Ollama Python library:
pip install ollama
Now, create a Python script (e.g., ollama_chat.py):
import ollama

def chat_with_ollama(model_name="orca-mini", prompt="Tell me a fun fact about space."):
    """Sends a prompt to Ollama and prints the response."""
    print(f"Chatting with {model_name}...")
    try:
        response = ollama.chat(
            model=model_name,
            messages=[{'role': 'user', 'content': prompt}],
            stream=False  # Set to True for streaming responses
        )
        print(f"Model Response:\n{response['message']['content']}")
    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == "__main__":
    chat_with_ollama(prompt="Write a short poem about a cat.")

    print("\n--- Streaming Example ---")
    print("Streaming from orca-mini...")
    messages = [{'role': 'user', 'content': 'Explain quantum entanglement in simple terms.'}]
    try:
        # Streaming example
        for chunk in ollama.chat(model='orca-mini', messages=messages, stream=True):
            print(chunk['message']['content'], end='', flush=True)
        print("\n")
    except Exception as e:
        print(f"An error occurred: {e}")
Run the script:
python ollama_chat.py
Chatting with orca-mini...
Model Response:
In a cozy corner, a furry friend,
A cat naps peacefully, its tail's soft bend.
A twitch of whiskers, a gentle purr,
Dreaming of mice, or maybe a fur-
Lined adventure, on paws so light.
A quiet guardian, through day and night.
--- Streaming Example ---
Streaming from orca-mini...
Quantum entanglement is a strange phenomenon in physics where two or more particles become linked in such a way that they share the same fate, no matter how far apart they are. If you measure a property of one particle, like its spin, you instantly know the corresponding property of the other particle, even if it's light-years away. It's as if they're communicating instantaneously, defying the speed limit of light.
Think of it like this: Imagine you have two magic coins. You flip one coin, and it lands on heads. Instantly, the other coin, even if it's in another galaxy, also lands on heads. And if the first coin lands on tails, the second one also lands on tails. They're connected in a way that's hard to explain with classical physics.
This "spooky action at a distance," as Albert Einstein called it, is a fundamental aspect of quantum mechanics and has profound implications for understanding the nature of reality.
JavaScript Example
For a web application or Node.js backend, you can use the ollama npm package or plain fetch calls.
First, install the package:
npm install ollama
Now, create a JavaScript file (e.g., ollama_api.js). Because the example uses ES module syntax (import), either give the file a .mjs extension or set "type": "module" in your package.json:
import ollama from 'ollama';

async function generateResponse() {
  console.log("Generating response with Ollama...");
  try {
    const response = await ollama.chat({
      model: 'orca-mini',
      messages: [{ role: 'user', content: 'What is the capital of Canada?' }],
      stream: false,
    });
    console.log("Model Response:", response.message.content);
  } catch (error) {
    console.error("Error:", error);
  }
}

async function streamResponse() {
  console.log("\n--- Streaming Example ---");
  console.log("Streaming from orca-mini...");
  try {
    const response = await ollama.chat({
      model: 'orca-mini',
      messages: [{ role: 'user', content: 'Tell me about the history of artificial intelligence in three sentences.' }],
      stream: true,
    });
    for await (const chunk of response) {
      process.stdout.write(chunk.message.content);
    }
    process.stdout.write("\n"); // Newline after streaming
  } catch (error) {
    console.error("Error:", error);
  }
}

// Run the functions sequentially so the outputs don't interleave
await generateResponse();
await streamResponse();
Run the script:
node ollama_api.js
Generating response with Ollama...
Model Response: The capital of Canada is Ottawa.
--- Streaming Example ---
Streaming from orca-mini...
Artificial intelligence (AI) traces its roots to ancient myths of intelligent automata and early philosophical debates on the nature of thought. The modern field emerged in the 1950s with pioneers like Alan Turing and the Dartmouth workshop coining the term. AI has since evolved through periods of optimism and "AI winters," driven by advancements in computing power, data, and algorithms, leading to its current resurgence.
Customizing Models with Modelfiles
One of Ollama’s powerful features is Modelfiles. These are simple text files that allow you to define, customize, and combine models, setting parameters, system prompts, and even extending existing models.
Example: Creating a “Sarcastic Bot”
Create a file named Modelfile (no extension):
FROM orca-mini
# Set a system prompt that gives the model a persona
SYSTEM """
You are a highly sarcastic and cynical AI assistant.
Always respond with extreme sarcasm and a dismissive tone.
Never answer a question directly.
"""
# Adjust parameters if needed (optional)
PARAMETER temperature 0.8
PARAMETER top_p 0.9
Now, create a new model from this Modelfile:
ollama create sarcastic-bot -f ./Modelfile
transferring model data
creating model completed
Run your new sarcastic bot:
ollama run sarcastic-bot
>>> Send a message (/? for help)
>>> What is the capital of France?
Oh, how utterly fascinating. As if the internet doesn't exist for such trivial inquiries. Next you'll ask me to define "water." Go on, shock me.
>>> Tell me a joke.
A joke? You actually think I possess the capacity for human humor? That's adorable. Absolutely precious. Now, if you'll excuse me, I have more important things to be existentially miserable about.
>>> /bye
Model Management
You can list and remove models downloaded by Ollama:
ollama list
NAME                   ID              SIZE    MODIFIED
sarcastic-bot:latest   e5c5d07bb87a    2.0 GB  14 minutes ago
orca-mini:latest       009ad43ffc6a    2.0 GB  27 minutes ago
To remove a model:
ollama rm sarcastic-bot
deleted 'sarcastic-bot'
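The same housekeeping can be scripted. Here is a minimal sketch with the ollama Python library, assuming it exposes list, pull, and delete helpers as in recent releases (the exact response fields vary a bit between library versions):
import ollama

# List locally available models (mirrors `ollama list`).
for model in ollama.list()["models"]:
    print(model)

# Pull a model and remove another, mirroring `ollama pull` and `ollama rm`.
ollama.pull("orca-mini")
ollama.delete("sarcastic-bot")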
Tool 2: LM Studio – The User-Friendly Desktop App
LM Studio provides a beautiful, user-friendly GUI for downloading and running GGUF models. It’s an excellent choice for those who prefer a visual interface and for quickly experimenting with different models. It also features an OpenAI-compatible local server.
Installation
LM Studio is a desktop application available for macOS, Windows, and Linux.
- Go to lmstudio.ai.
- Download the installer for your operating system.
- Run the installer and follow the on-screen instructions.
Finding and Downloading Models
LM Studio includes a built-in browser for Hugging Face, specifically designed to help you find and download GGUF models.
- Open LM Studio.
- Click the “Home” icon (house) on the left sidebar to access the model search.
- Type a model name (e.g., mistral, zephyr) into the search bar.
- LM Studio will display a list of available GGUF models from Hugging Face.
- Look for models with different quantizations (e.g., Q4_K_M, Q5_K_M). Click the “Download” button next to your desired model. LM Studio handles the download and storage.
Note: For a good balance of performance and quality, Q4_K_M or Q5_K_M quantizations are often recommended for smaller models (7B, 13B). Larger models (e.g., Mixtral 8x7B) will require significantly more RAM/VRAM.
Running Models (Chat UI)
Once a model is downloaded, you can load and chat with it directly within LM Studio’s interface.
- Click the “Chat” icon on the left sidebar.
- At the top of the chat window, click “Select a model to load.”
- Choose the downloaded model from the list. LM Studio will load it into memory.
- Start typing your prompts in the chat box at the bottom.
LM Studio’s chat UI provides:
- Multi-turn conversations.
- Context management.
- Model configuration options (temperature, top_p, etc.).
- Performance metrics (tokens/second).
Running Models (Local Server)
One of LM Studio’s most powerful features is its ability to run a local server that exposes an OpenAI-compatible API. This means you can use the same code you’d use for OpenAI’s API, but target your local LLM.
- Click the “Local Server” icon on the left sidebar.
- Select the model you want to serve from the dropdown list.
- Click “Start Server.”
- The server will usually run on http://localhost:1234. LM Studio will show the exact endpoint.
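Before pointing application code at it, it’s worth confirming the server is actually up. Because the API is OpenAI-compatible, a plain GET against the /v1/models endpoint (sketched below with requests, assuming the default port shown above) should list whatever model is currently loaded:
# pip install requests
import requests

# Quick sanity check against the default LM Studio endpoint.
resp = requests.get("http://localhost:1234/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print("Available:", model.get("id"))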
Python Example (using the openai library)
First, install the OpenAI Python library:
pip install openai
Now, create a Python script (e.g., lmstudio_api.py):
from openai import OpenAI

# Point to the local LM Studio server
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # api_key is a dummy value for local use

def chat_with_lmstudio(prompt="What is the biggest planet in our solar system?"):
    """Sends a prompt to the local LM Studio server and prints the response."""
    print("Querying LM Studio server...")
    try:
        completion = client.chat.completions.create(
            model="local-model",  # Placeholder; the actual model is whatever is loaded in LM Studio
            messages=[
                {"role": "system", "content": "You are a helpful, creative, and friendly AI assistant."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.7,
            stream=False  # Set to True for streaming responses
        )
        print(f"Model Response:\n{completion.choices[0].message.content}")
    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == "__main__":
    chat_with_lmstudio(prompt="Write a concise explanation of black holes.")

    print("\n--- Streaming Example ---")
    print("Streaming from LM Studio server...")
    try:
        stream = client.chat.completions.create(
            model="local-model",
            messages=[
                {"role": "system", "content": "You are a concise AI assistant."},
                {"role": "user", "content": "Explain photosynthesis in one paragraph."}
            ],
            stream=True,
        )
        for chunk in stream:
            if chunk.choices[0].delta.content is not None:
                print(chunk.choices[0].delta.content, end="", flush=True)
        print("\n")
    except Exception as e:
        print(f"An error occurred: {e}")
Important: Make sure your LM Studio server is running with a model loaded when you execute this script.
Run the script:
python lmstudio_api.py
Querying LM Studio server...
Model Response:
Black holes are regions in spacetime where gravity is so strong that nothing, not even light, can escape. They form from the remnants of massive stars that collapse under their own gravity, creating a singularity—an infinitely dense point—at their center. The boundary beyond which escape is impossible is called the event horizon.
--- Streaming Example ---
Streaming from LM Studio server...
Photosynthesis is the process by which green plants, algae, and some bacteria convert light energy into chemical energy, primarily in the form of glucose. This vital process uses carbon dioxide and water as raw materials, with oxygen being released as a byproduct, and occurs primarily in organelles called chloroplasts within plant cells.
Ollama vs. LM Studio: When to Use Which?
Both tools are excellent for local LLM inference, but they cater to slightly different workflows.
Choose Ollama if you:
- Prefer the command line and automation: Ideal for scripting, CI/CD, or integrating into backend services.
- Want a minimal footprint: No large GUI application running in the background.
- Need an easy-to-use API: Its native HTTP API is simple and robust.
- Are a developer: Designed with programmatic access in mind.
- Want to customize models: Modelfiles provide a powerful way to create new variations of existing models with custom personas, parameters, and templates.
Choose LM Studio if you:
- Prefer a graphical user interface (GUI): Great for visual learners and less technical users.
- Want quick, no-code experimentation: Easy to browse, download, and chat with models without writing any code.
- Need an OpenAI-compatible local server: Seamlessly switch your existing OpenAI API code to run locally.
- Are just starting out with local LLMs: The intuitive interface makes it less intimidating.
- Want fine-grained control over inference parameters: The GUI provides sliders and inputs for many options.
Many developers use both: Ollama for integrated projects and quick CLI chats, and LM Studio for browsing and testing new models and for running a local OpenAI-compatible endpoint while prototyping.
Common Issues and Troubleshooting
Running LLMs locally can be resource-intensive. Here are some common pitfalls and how to address them:
- Insufficient RAM/VRAM:
- Symptom: Model fails to load, crashes, or runs extremely slowly (e.g., 0.1 tokens/sec).
- Solution: LLMs are memory hungry. A 7B (7 billion parameter) model typically needs around 8 GB of RAM/VRAM. A 13B model needs 16-20 GB, and 70B+ models can require 64 GB or more. Ensure your system meets the model’s requirements. Try smaller models or more aggressively quantized versions (e.g., Q2_K or Q3_K_S, if available) first.
- Incorrect Model Format:
- Symptom: Model won’t load or throws an error.
- Solution: Ensure you are downloading GGUF models. LM Studio specifically looks for them. Ollama handles its own model library, so this is less common there unless you’re trying to import raw llama.cpp models yourself.
- GPU Drivers/CUDA/Metal Issues:
- Symptom: Model runs only on CPU, or errors like “CUDA out of memory” or “No Metal device found.”
- Solution:
- NVIDIA (CUDA): Ensure you have the latest NVIDIA drivers installed, plus the CUDA Toolkit if you’re compiling llama.cpp directly. Both Ollama and LM Studio generally bundle the necessary runtime libraries, but drivers are crucial.
- AMD/Intel (ROCm/OpenVINO): Support is improving but can be trickier. Check the documentation for llama.cpp or the specific tool.
- Apple Silicon (Metal): Ensure your macOS is up to date. Metal performance is usually excellent out-of-the-box.
- Firewall Blocking Local Server:
- Symptom: Your code can’t connect to localhost:1234 (LM Studio) or localhost:11434 (Ollama).
- Solution: Check your operating system’s firewall settings to ensure these ports are not blocked. This is rare for localhost connections but can happen with overly aggressive security software (a quick reachability check is sketched after this list).
- Model Quality:
- Symptom: Model generates nonsensical or poor-quality responses.
- Solution: This usually isn’t a technical problem but a limitation of the model itself. Small, highly quantized models (e.g., orca-mini at Q2_K) will not be as capable as larger, less quantized ones (e.g., Mistral-7B-Instruct at Q5_K_M). Experiment with different models and quantization levels.
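For the connection problems above, a small script can at least tell you whether each local server is reachable. This sketch assumes the default ports mentioned earlier in this guide:
# pip install requests
import requests

# Probe the default local endpoints for Ollama and LM Studio.
for name, url in [
    ("Ollama", "http://localhost:11434"),
    ("LM Studio", "http://localhost:1234/v1/models"),
]:
    try:
        r = requests.get(url, timeout=3)
        print(f"{name}: reachable (HTTP {r.status_code})")
    except requests.exceptions.RequestException as e:
        print(f"{name}: not reachable ({e.__class__.__name__})")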
Conclusion
Running LLMs locally is no longer a niche for hardware enthusiasts. With tools like Ollama and LM Studio, coupled with efficient GGUF models, it’s accessible to any developer with a reasonably modern machine. You gain privacy, speed, and cost benefits, opening up a new world of possibilities for AI-powered applications.
Experiment with different models, explore the customization options, and integrate these local powerhouses into your projects. The future of AI is increasingly on the edge, and your local machine is at the forefront. Happy inferencing!