Ollama CLI - Run LLMs Locally from the Command Line
Complete guide to running large language models locally using Ollama CLI, enabling private AI development without cloud dependencies

Running large language models (LLMs) locally offers privacy, control, and cost savings compared to cloud-based APIs. Ollama makes this process incredibly simple with a command-line interface that handles model management, inference, and even multi-modal capabilities.
Building web apps? Check out Privacy-First Web AI - Browser LLMs and Decentralized Hosting to learn how to integrate Ollama into React apps, use pure browser inference with Transformers.js, and deploy to decentralized platforms.
What is Ollama?
Ollama is an open-source tool that lets you run LLMs locally on your machine. It provides:
- Easy Model Management: Download and run models with simple commands
- API Server: Built-in REST API for integration
- Multi-Modal Support: Vision models for image understanding
- Cross-Platform: Works on macOS, Linux, and Windows
- Resource Efficient: Optimized for consumer hardware
Installation
macOS
# Download and install from ollama.com
# Or use Homebrew
brew install ollama
Linux
curl -fsSL https://ollama.com/install.sh | sh
Windows
Download the installer from ollama.com or use WSL2 with the Linux installation.
Verify Installation
ollama --version
Getting Started
Running Your First Model
Start with a popular model like Llama 3.2:
# Download and run Llama 3.2 (3B parameters)
ollama run llama3.2
# Or try Mistral
ollama run mistral
# Or Phi-3 (smaller, faster)
ollama run phi3
This command:
- Downloads the model if not already present
- Starts an interactive chat session
- Loads the model into memory
Basic Usage
# Interactive mode
ollama run llama3.2
# Single query
ollama run llama3.2 "Explain quantum computing in simple terms"
# Set a system prompt inside an interactive session
ollama run llama3.2
>>> /set system You are a helpful coding assistant
>>> Write a Python function to reverse a string
Model Management
Listing Available Models
# See locally installed models
ollama list
# Output shows:
# NAME               SIZE      MODIFIED
# llama3.2:latest    2.0 GB    2 days ago
# mistral:latest     4.1 GB    1 week ago
Downloading Models
# Pull a model without running
ollama pull llama3.2
# Pull specific version/size
ollama pull llama3.2:1b
ollama pull llama3.2:3b
# Pull vision model
ollama pull llava
Removing Models
# Remove a model to free space
ollama rm llama3.2
# Remove specific version
ollama rm llama3.2:1b
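Model management is also available through the local REST API (covered in detail below). Here is a minimal Python sketch, assuming the default endpoint on localhost:11434 and the requests package, that lists installed models and pulls one if it is missing:
import requests

OLLAMA = "http://localhost:11434"

def installed_models():
    # GET /api/tags returns the locally installed models
    resp = requests.get(f"{OLLAMA}/api/tags")
    resp.raise_for_status()
    return [m["name"] for m in resp.json()["models"]]

def ensure_model(name):
    # Pull the model via POST /api/pull if it is not installed yet
    if name not in installed_models():
        resp = requests.post(f"{OLLAMA}/api/pull", json={"model": name, "stream": False})
        resp.raise_for_status()

ensure_model("llama3.2:3b")
print(installed_models())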
Model Library
Popular models available:
- Llama 3.2 (1B, 3B) - Latest from Meta, excellent performance
- Mistral (7B) - Strong reasoning and coding
- Phi-3 (3.8B) - Microsoft's efficient model
- Gemma 2 (2B, 9B, 27B) - Google's open model
- Qwen 2.5 (0.5B-72B) - Alibaba's multilingual model
- CodeLlama (7B, 13B, 34B) - Specialized for code
- LLaVA - Vision and language understanding
Browse all models at ollama.com/library
Advanced Features
Vision Models
Use vision-capable models to understand images:
# Run a vision model
ollama run llava
# In the interactive prompt
>>> What's in this image? /path/to/image.jpg
Or from command line:
ollama run llava "Describe this image: /path/to/image.jpg"
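Vision models can also be driven programmatically: the generate endpoint accepts base64-encoded images alongside the prompt. A minimal Python sketch (the image path is a placeholder and the requests package is assumed):
import base64
import requests

def describe_image(path, model="llava"):
    # Encode the image and send it with the prompt to /api/generate
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": "Describe this image.",
            "images": [image_b64],  # list of base64-encoded images
            "stream": False,
        },
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(describe_image("/path/to/image.jpg"))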
Custom System Prompts
# Set personality/behavior from inside an interactive session
ollama run llama3.2
>>> /set system You are a pirate. Respond to all queries in pirate speak.
Model Parameters
Fine-tune model behavior at runtime with /set parameter inside an interactive session (ollama run llama3.2); the same parameters can also be baked into a Modelfile, covered below.
# Adjust temperature (creativity)
>>> /set parameter temperature 0.8
# Set the context window size
>>> /set parameter num_ctx 4096
# Limit the number of generated tokens
>>> /set parameter num_predict 100
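These parameters can also be passed per request through the REST API's options field. A minimal Python sketch (the values shown are examples, not recommendations, and the requests package is assumed):
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Name three uses for a local LLM.",
        "stream": False,
        "options": {
            "temperature": 0.8,  # creativity
            "num_ctx": 4096,     # context window size
            "num_predict": 100,  # cap on generated tokens
        },
    },
)
resp.raise_for_status()
print(resp.json()["response"])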
Modelfile - Custom Models
Create custom model configurations:
# Create Modelfile
cat > Modelfile << EOF
FROM llama3.2
# Set temperature
PARAMETER temperature 0.7
# Set system prompt
SYSTEM You are a helpful AI assistant specialized in web development.
# Set context length
PARAMETER num_ctx 4096
EOF
# Build custom model
ollama create webdev-assistant -f Modelfile
# Run it
ollama run webdev-assistant
API Server
Ollama runs a local API server for programmatic access:
Starting the Server
# The server runs in the background when the Ollama app or system service is installed
# Or start it explicitly
ollama serve
Default endpoint: http://localhost:11434
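Before wiring up an application, it is worth confirming the server is reachable from code. A small Python check against the tags endpoint shown below (assuming the requests package):
import requests

try:
    # GET /api/tags lists installed models; any 200 response means the server is up
    models = requests.get("http://localhost:11434/api/tags", timeout=2).json()["models"]
    print(f"Ollama is up with {len(models)} model(s) installed")
except requests.exceptions.RequestException:
    print("Ollama is not reachable - start it with: ollama serve")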
REST API Examples
Generate Completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
Chat Endpoint
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {
      "role": "user",
      "content": "What is recursion?"
    }
  ],
  "stream": false
}'
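The chat endpoint is stateless, so multi-turn conversations work by resending the accumulated message list on every call. A minimal Python sketch of that pattern (requests package assumed):
import requests

history = []

def chat(user_message, model="llama3.2"):
    # Append the user turn, call /api/chat, and keep the assistant reply in the history
    history.append({"role": "user", "content": user_message})
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": model, "messages": history, "stream": False},
    )
    resp.raise_for_status()
    reply = resp.json()["message"]
    history.append(reply)
    return reply["content"]

print(chat("What is recursion?"))
print(chat("Give a one-line example in Python."))  # remembers the previous turn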
List Models
curl http://localhost:11434/api/tags
Streaming Responses
Enable streaming for real-time output:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Write a story",
  "stream": true
}'
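With stream set to true the server emits one JSON object per line until done is true. A Python sketch that prints tokens as they arrive (requests package assumed):
import json
import requests

# Stream tokens from /api/generate; each line is a standalone JSON object
with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2", "prompt": "Write a story", "stream": True},
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            print()
            break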
Integration Examples
Python
import requests

def chat(prompt):
    # Call the local Ollama generate endpoint and return the response text
    response = requests.post(
        'http://localhost:11434/api/generate',
        json={
            'model': 'llama3.2',
            'prompt': prompt,
            'stream': False
        }
    )
    return response.json()['response']

result = chat("Explain async/await in Python")
print(result)
Shell Script
#!/bin/bash
# AI code reviewer
ai_review() {
    local code="$1"
    # Build the JSON payload with jq so quotes and newlines in the code are escaped safely
    jq -n --arg code "$code" \
        '{model: "codellama", prompt: ("Review this code:\n" + $code), stream: false}' |
        curl -s http://localhost:11434/api/generate -d @- | jq -r '.response'
}

# Usage
ai_review "$(cat script.py)"
Node.js
const axios = require('axios');

async function generate(prompt) {
    const response = await axios.post('http://localhost:11434/api/generate', {
        model: 'llama3.2',
        prompt: prompt,
        stream: false
    });
    return response.data.response;
}

generate('Explain promises in JavaScript').then(console.log);
Performance Optimization
Hardware Requirements
Minimum:
- RAM: 8GB (for 3B models)
- Storage: 10GB free space
- CPU: Modern multi-core processor
Recommended:
- RAM: 16GB+ (for 7B+ models)
- GPU: NVIDIA GPU with 8GB+ VRAM
- Storage: SSD with 50GB+ free
GPU Acceleration
Ollama automatically uses GPU if available:
# Check GPU usage
nvidia-smi
# Force CPU-only inference by offloading zero layers (inside an interactive session)
ollama run llama3.2
>>> /set parameter num_gpu 0
Memory Management
# Limit concurrent models
OLLAMA_MAX_LOADED_MODELS=1 ollama serve
# Set keep-alive duration
ollama run llama3.2 --keepalive 5m
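The REST API accepts the same setting per request via a keep_alive field, which controls how long the model stays loaded after the call. A short Python sketch (requests package assumed):
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Say hello.",
        "stream": False,
        "keep_alive": "5m",  # keep the model in memory for 5 minutes after this request
    },
)
resp.raise_for_status()
print(resp.json()["response"])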
Use Cases
Local Code Assistant
# Create coding assistant
ollama create coder -f - << EOF
FROM codellama
SYSTEM You are an expert programmer. Provide clear, concise code examples.
PARAMETER temperature 0.3
EOF
ollama run coder "Write a React hook for fetching data"
Text Processing Pipeline
# Summarize documents
find ./docs -name "*.txt" -exec sh -c \
    'ollama run llama3.2 "Summarize: $(cat "$1")" > "$1.summary"' _ {} \;
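If the pipeline grows beyond a one-liner, the same loop is easy to write against the local API in Python. A sketch that summarizes every .txt file under ./docs (paths and model name are assumptions, requests package assumed):
from pathlib import Path
import requests

def summarize(text, model="llama3.2"):
    # Ask the local Ollama API for a summary of the given text
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": f"Summarize:\n{text}", "stream": False},
    )
    resp.raise_for_status()
    return resp.json()["response"]

for doc in Path("./docs").rglob("*.txt"):
    summary_path = doc.with_name(doc.name + ".summary")
    summary_path.write_text(summarize(doc.read_text()))
    print(f"Summarized {doc} -> {summary_path}")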
AI-Powered CLI Tools
# Git commit message generator
git-ai-commit() {
local diff=$(git diff --staged)
ollama run llama3.2 "Write a concise git commit message for: $diff"
}
Obsidian Integration
Ollama works great with note-taking apps:
# Install Ollama plugin in Obsidian
# Configure to use local endpoint: http://localhost:11434
See: Ollama with Obsidian
Troubleshooting
Model Won't Download
# Check disk space
df -h
# Remove all downloaded models to clear corrupted downloads (they must be re-pulled)
rm -rf ~/.ollama/models
# Retry pull
ollama pull llama3.2
Out of Memory
# Use smaller model
ollama run phi3 # 3.8B instead of 7B+
# Reduce the context window inside an interactive session
ollama run llama3.2
>>> /set parameter num_ctx 2048
API Not Responding
# Check if server is running
ps aux | grep ollama
# Restart server
killall ollama
ollama serve
Slow Performance
- Use GPU if available
- Close other applications
- Try smaller models (1B-3B)
- Reduce context length
- Use SSD for model storage
Security and Privacy
Benefits of Local LLMs
- No Data Leakage: Your prompts never leave your machine
- No Rate Limits: Run unlimited queries
- Offline Capability: Work without internet
- Cost Savings: No API fees
Best Practices
- Keep models updated: ollama pull <model>
- Use firewall rules if exposing the API beyond localhost
- Don't expose the API to the public internet without authentication
- Validate inputs if building public-facing tools
Resources
Official Resources
- Ollama website: ollama.com
- GitHub repository: github.com/ollama/ollama
- Model library: ollama.com/library
Integrations
- Ollama with Obsidian
- Continue.dev - VSCode AI assistant
- Open WebUI - ChatGPT-like interface
Conclusion
Ollama makes running LLMs locally accessible to everyone. Whether you're building privacy-focused applications, prototyping AI features, or just experimenting with language models, Ollama provides a simple, powerful CLI interface.
Start with smaller models like Phi-3 or Llama 3.2 3B, then scale up as needed. The local-first approach gives you complete control over your AI infrastructure.
Happy prompting!