Ollama CLI - Run LLMs Locally from the Command Line
Complete guide to running large language models locally using Ollama CLI, enabling private AI development without cloud dependencies

Running large language models (LLMs) locally offers privacy, control, and cost savings compared to cloud-based APIs. Ollama makes this process incredibly simple with a command-line interface that handles model management, inference, and even multi-modal capabilities.
Building web apps? Check out Privacy-First Web AI - Browser LLMs and Decentralized Hosting to learn how to integrate Ollama into React apps, use pure browser inference with Transformers.js, and deploy to decentralized platforms.
What is Ollama?
Ollama is an open-source tool that lets you run LLMs locally on your machine. It provides:
- Easy Model Management: Download and run models with simple commands
- API Server: Built-in REST API for integration
- Multi-Modal Support: Vision models for image understanding
- Cross-Platform: Works on macOS, Linux, and Windows
- Resource Efficient: Optimized for consumer hardware
Installation
macOS
# Download and install from ollama.com
# Or use Homebrew
brew install ollama
Linux
curl -fsSL https://ollama.com/install.sh | sh
Windows
Download the installer from ollama.com or use WSL2 with the Linux installation.
Verify Installation
ollama --version
Getting Started
Running Your First Model
Start with a popular model like Llama 3.2:
# Download and run Llama 3.2 (3B parameters)
ollama run llama3.2
# Or try Mistral
ollama run mistral
# Or Phi-3 (smaller, faster)
ollama run phi3
This command:
- Downloads the model if not already present
- Starts an interactive chat session
- Loads the model into memory
Basic Usage
# Interactive mode
ollama run llama3.2
# Single query
ollama run llama3.2 "Explain quantum computing in simple terms"
# Set a system prompt inside an interactive session
ollama run llama3.2
>>> /set system You are a helpful coding assistant
>>> Write a Python function to reverse a string
Model Management
Listing Available Models
# See locally installed models
ollama list
# Output shows:
# NAME               SIZE      MODIFIED
# llama3.2:latest    2.0 GB    2 days ago
# mistral:latest     4.1 GB    1 week ago
Downloading Models
# Pull a model without running
ollama pull llama3.2
# Pull specific version/size
ollama pull llama3.2:1b
ollama pull llama3.2:3b
# Pull vision model
ollama pull llava
Removing Models
# Remove a model to free space
ollama rm llama3.2
# Remove specific version
ollama rm llama3.2:1b
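Model management is also available through the local REST API (covered in detail below). Here is a minimal Python sketch, assuming the default endpoint on localhost:11434 and the requests package, that lists installed models and pulls one if it is missing:
import requests

OLLAMA = "http://localhost:11434"

def installed_models():
    # GET /api/tags returns the locally installed models
    resp = requests.get(f"{OLLAMA}/api/tags")
    resp.raise_for_status()
    return [m["name"] for m in resp.json()["models"]]

def ensure_model(name):
    # Pull the model via POST /api/pull if it is not installed yet
    if name not in installed_models():
        resp = requests.post(f"{OLLAMA}/api/pull", json={"model": name, "stream": False})
        resp.raise_for_status()

ensure_model("llama3.2:3b")
print(installed_models())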
Model Library
Popular models available:
- Llama 3.2 (1B, 3B) - Latest from Meta, excellent performance
- Mistral (7B) - Strong reasoning and coding
- Phi-3 (3.8B) - Microsoft's efficient model
- Gemma 2 (2B, 9B, 27B) - Google's open model
- Qwen 2.5 (0.5B-72B) - Alibaba's multilingual model
- CodeLlama (7B, 13B, 34B) - Specialized for code
- LLaVA - Vision and language understanding
Browse all models at ollama.com/library
Advanced Features
Vision Models
Use vision-capable models to understand images:
# Run a vision model
ollama run llava
# In the interactive prompt
>>> What's in this image? /path/to/image.jpg
Or from command line:
ollama run llava "Describe this image: /path/to/image.jpg"
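Vision models can also be driven programmatically: the generate endpoint accepts base64-encoded images alongside the prompt. A minimal Python sketch (the image path is a placeholder and the requests package is assumed):
import base64
import requests

def describe_image(path, model="llava"):
    # Encode the image and send it with the prompt to /api/generate
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": "Describe this image.",
            "images": [image_b64],  # list of base64-encoded images
            "stream": False,
        },
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(describe_image("/path/to/image.jpg"))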
Custom System Prompts
# Set personality/behavior from inside an interactive session
ollama run llama3.2
>>> /set system You are a pirate. Respond to all queries in pirate speak.
Model Parameters
Fine-tune model behavior at runtime with /set parameter inside an interactive session (ollama run llama3.2); the same parameters can also be baked into a Modelfile, covered below.
# Adjust temperature (creativity)
>>> /set parameter temperature 0.8
# Set the context window size
>>> /set parameter num_ctx 4096
# Limit the number of generated tokens
>>> /set parameter num_predict 100
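These parameters can also be passed per request through the REST API's options field. A minimal Python sketch (the values shown are examples, not recommendations, and the requests package is assumed):
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Name three uses for a local LLM.",
        "stream": False,
        "options": {
            "temperature": 0.8,  # creativity
            "num_ctx": 4096,     # context window size
            "num_predict": 100,  # cap on generated tokens
        },
    },
)
resp.raise_for_status()
print(resp.json()["response"])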
Modelfile - Custom Models
Create custom model configurations:
# Create Modelfile
cat > Modelfile << EOF
FROM llama3.2
# Set temperature
PARAMETER temperature 0.7
# Set system prompt
SYSTEM You are a helpful AI assistant specialized in web development.
# Set context length
PARAMETER num_ctx 4096
EOF
# Build custom model
ollama create webdev-assistant -f Modelfile
# Run it
ollama run webdev-assistant
API Server
Ollama runs a local API server for programmatic access:
Starting the Server
# The server runs in the background when the Ollama app or system service is installed
# Or start it explicitly
ollama serve
Default endpoint: http://localhost:11434
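Before wiring up an application, it is worth confirming the server is reachable from code. A small Python check against the tags endpoint shown below (assuming the requests package):
import requests

try:
    # GET /api/tags lists installed models; any 200 response means the server is up
    models = requests.get("http://localhost:11434/api/tags", timeout=2).json()["models"]
    print(f"Ollama is up with {len(models)} model(s) installed")
except requests.exceptions.RequestException:
    print("Ollama is not reachable - start it with: ollama serve")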
REST API Examples
Generate Completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
Chat Endpoint
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {
      "role": "user",
      "content": "What is recursion?"
    }
  ],
  "stream": false
}'
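The chat endpoint is stateless, so multi-turn conversations work by resending the accumulated message list on every call. A minimal Python sketch of that pattern (requests package assumed):
import requests

history = []

def chat(user_message, model="llama3.2"):
    # Append the user turn, call /api/chat, and keep the assistant reply in the history
    history.append({"role": "user", "content": user_message})
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": model, "messages": history, "stream": False},
    )
    resp.raise_for_status()
    reply = resp.json()["message"]
    history.append(reply)
    return reply["content"]

print(chat("What is recursion?"))
print(chat("Give a one-line example in Python."))  # remembers the previous turn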
List Models
curl http://localhost:11434/api/tags
Streaming Responses
Enable streaming for real-time output:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Write a story",
  "stream": true
}'
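With stream set to true the server emits one JSON object per line until done is true. A Python sketch that prints tokens as they arrive (requests package assumed):
import json
import requests

# Stream tokens from /api/generate; each line is a standalone JSON object
with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2", "prompt": "Write a story", "stream": True},
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            print()
            break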
Integration Examples
Python
import requests

def chat(prompt):
    # Call the local Ollama generate endpoint and return the response text
    response = requests.post(
        'http://localhost:11434/api/generate',
        json={
            'model': 'llama3.2',
            'prompt': prompt,
            'stream': False
        }
    )
    return response.json()['response']

result = chat("Explain async/await in Python")
print(result)
Shell Script
#!/bin/bash
# AI code reviewer
ai_review() {
    local code="$1"
    # Build the JSON payload with jq so quotes and newlines in the code are escaped safely
    jq -n --arg code "$code" \
        '{model: "codellama", prompt: ("Review this code:\n" + $code), stream: false}' |
        curl -s http://localhost:11434/api/generate -d @- | jq -r '.response'
}

# Usage
ai_review "$(cat script.py)"
Node.js
const axios = require('axios');

async function generate(prompt) {
    const response = await axios.post('http://localhost:11434/api/generate', {
        model: 'llama3.2',
        prompt: prompt,
        stream: false
    });
    return response.data.response;
}

generate('Explain promises in JavaScript').then(console.log);
Performance Optimization
Hardware Requirements
Minimum:
- RAM: 8GB (for 3B models)
- Storage: 10GB free space
- CPU: Modern multi-core processor
Recommended:
- RAM: 16GB+ (for 7B+ models)
- GPU: NVIDIA GPU with 8GB+ VRAM
- Storage: SSD with 50GB+ free
GPU Acceleration
Ollama automatically uses GPU if available:
# Check GPU usage
nvidia-smi
# Force CPU-only inference by offloading zero layers (inside an interactive session)
ollama run llama3.2
>>> /set parameter num_gpu 0
Memory Management
# Limit concurrent models
OLLAMA_MAX_LOADED_MODELS=1 ollama serve
# Set keep-alive duration
ollama run llama3.2 --keepalive 5m
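The REST API accepts the same setting per request via a keep_alive field, which controls how long the model stays loaded after the call. A short Python sketch (requests package assumed):
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Say hello.",
        "stream": False,
        "keep_alive": "5m",  # keep the model in memory for 5 minutes after this request
    },
)
resp.raise_for_status()
print(resp.json()["response"])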
Use Cases
Local Code Assistant
# Create coding assistant
ollama create coder -f - << EOF
FROM codellama
SYSTEM You are an expert programmer. Provide clear, concise code examples.
PARAMETER temperature 0.3
EOF
ollama run coder "Write a React hook for fetching data"
Text Processing Pipeline
# Summarize documents
find ./docs -name "*.txt" -exec sh -c \
    'ollama run llama3.2 "Summarize: $(cat "$1")" > "$1.summary"' _ {} \;
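If the pipeline grows beyond a one-liner, the same loop is easy to write against the local API in Python. A sketch that summarizes every .txt file under ./docs (paths and model name are assumptions, requests package assumed):
from pathlib import Path
import requests

def summarize(text, model="llama3.2"):
    # Ask the local Ollama API for a summary of the given text
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": f"Summarize:\n{text}", "stream": False},
    )
    resp.raise_for_status()
    return resp.json()["response"]

for doc in Path("./docs").rglob("*.txt"):
    summary_path = doc.with_name(doc.name + ".summary")
    summary_path.write_text(summarize(doc.read_text()))
    print(f"Summarized {doc} -> {summary_path}")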
AI-Powered CLI Tools
# Git commit message generator
git-ai-commit() {
local diff=$(git diff --staged)
ollama run llama3.2 "Write a concise git commit message for: $diff"
}
Obsidian Integration
Ollama works great with note-taking apps:
# Install Ollama plugin in Obsidian
# Configure to use local endpoint: http://localhost:11434
See: Ollama with Obsidian
Troubleshooting
Model Won't Download
# Check disk space
df -h
# Remove all downloaded models to clear corrupted downloads (they must be re-pulled)
rm -rf ~/.ollama/models
# Retry pull
ollama pull llama3.2
Out of Memory
# Use smaller model
ollama run phi3 # 3.8B instead of 7B+
# Reduce the context window inside an interactive session
ollama run llama3.2
>>> /set parameter num_ctx 2048
API Not Responding
# Check if server is running
ps aux | grep ollama
# Restart server
killall ollama
ollama serve
Slow Performance
- Use GPU if available
- Close other applications
- Try smaller models (1B-3B)
- Reduce context length
- Use SSD for model storage
Security and Privacy
Benefits of Local LLMs
- No Data Leakage: Your prompts never leave your machine
- No Rate Limits: Run unlimited queries
- Offline Capability: Work without internet
- Cost Savings: No API fees
Best Practices
- Keep models updated: ollama pull <model>
- Use firewall rules if exposing the API beyond localhost
- Don't expose the API to the public internet without authentication
- Validate inputs if building public-facing tools
Resources
Official Resources
- Ollama website: ollama.com
- GitHub repository: github.com/ollama/ollama
- Model library: ollama.com/library
Integrations
- Ollama with Obsidian
- Continue.dev - VSCode AI assistant
- Open WebUI - ChatGPT-like interface
Conclusion
Ollama makes running LLMs locally accessible to everyone. Whether you're building privacy-focused applications, prototyping AI features, or just experimenting with language models, Ollama provides a simple, powerful CLI interface.
Start with smaller models like Phi-3 or Llama 3.2 3B, then scale up as needed. The local-first approach gives you complete control over your AI infrastructure.
Happy prompting!