Henry Model Catalog: Complete Guide to Available LLMs
Last updated: June 26, 2026
This document describes all models with YAML definitions in Henry, including their capabilities, ideal use cases, and the hardware trade-offs for running each. Models are organized by family and sorted by parameter count.
π Quick Comparison Matrix
| Model | Size | RAM | Context | Best For | Quantization |
|---|---|---|---|---|---|
| Qwen 3.5 2B | ~1.8 GB | 2GB+ | 32K | Lightweight chat, edge | Q8_0 |
| Phi-4 Mini | ~2.2 GB | 2GB+ | 128K | Structured reasoning, code | Q4_K_M |
| Phi-3.5 Mini | ~2.2 GB | 2GB+ | 128K | Multilingual reasoning, code | Q4_K_M |
| SmolLM3 3B | ~2.0 GB | 2GB+ | 128K | Hybrid reasoning, multilingual | Q4_K_M |
| Apertus 4B | ~2.3 GB | 2GB+ | 32K | Multilingual, tool calling | Q4_K_M |
| Granite 4.1 3B | ~2.4 GB | 2GB+ | 128K | Coding, RAG, assistants | Q4_K_M |
| Llama 3.2 3B | ~2.0-3.4 GB | 3-4GB+ | 128K | Multilingual, general use | Q4_K_M-Q8_0 |
| Granite 3.3 8B | ~4.9 GB | 6GB+ | 8K | Enterprise, function calling | Q4_K_M |
| Gemma 4 E4B | ~2.5 GB | 3GB+ | 128K | Technical docs, code review | Q4_K_M |
| Qwen 3.5 4B | ~2.5 GB | 3GB+ | 32K | General reasoning | Q5_K_S |
| Qwen 3 4B | ~2.7 GB | 3GB+ | 32K | General reasoning | Q5_K_S |
| Apertus 8B | ~4.9 GB | 6GB+ | 32K | Heavy multilingual | Q4_K_M |
| Qwen 2.5 Coder 7B | ~4.5 GB | 5GB+ | 32K | Code generation | Q4_K_M |
| Llama 3 8B | ~4.9-8.5 GB | 5-9GB+ | 8K | General chat, code | Q4_K_M-Q8_0 |
| Granite 4.1 8B | ~4.9 GB | 6GB+ | 128K | Enterprise assistants | Q4_K_M |
| Granite 3.3 8B (Source) | ~4.9 GB | 6GB+ | 128K | Custom tool workflows | Q4_K_M |
π¦ Meta Llama Family
Llama 3.2 Series β Multilingual with Extended Context
The Llama 3.2 models feature 128K token context windows and native multilingual support (English, German, French, Italian, Portuguese, Hindi, Spanish, Thai). They are instruction-tuned and support tool calling.
| Model | Quant | Size | RAM | Use Case | Trade-off |
|---|---|---|---|---|---|
llama-3.2-3b-instruct |
Q4_K_M | ~2.0 GB | 3GB+ | Entry-level multilingual β Good for basic chat, translation, and simple tool workflows on constrained hardware | Smallest file, moderate quality |
llama-3.2-3b-instruct-q5_k_m |
Q5_K_M | ~2.3 GB | 3GB+ | Balanced multilingual β Best price/performance for general use with 128K context | 15% larger than Q4, better accuracy |
llama-3.2-3b-instruct-q6_k_m |
Q6_K_M | ~2.6 GB | 3GB+ | High-quality multilingual β Uses Q8_0 for critical weights, near-perfect output | 30% larger than Q4, excellent quality |
llama-3.2-3b-instruct-q8_0 |
Q8_0 | ~3.4 GB | 4GB+ | Maximum quality β Best for production where accuracy is paramount | Largest file, highest RAM |
Best for: Multilingual applications, long-document processing (128K context), chatbots serving international audiences, tool-enabled assistants on moderate hardware.
Llama 3 Series β Proven General-Purpose Models
The original Llama 3 models with 8K token context. Strong at instruction following, conversation, and code generation.
| Model | Quant | Size | RAM | Use Case | Trade-off |
|---|---|---|---|---|---|
llama-3-8b-instruct |
Q4_K_M | ~4.9 GB | 5GB+ | Default choice β Good balance for most users, strong general performance | Baseline quality, smallest footprint |
llama-3-8b-instruct-q5_k_m |
Q5_K_M | ~5.7 GB | 6GB+ | Improved quality β Better than Q4 for complex tasks, still reasonable size | 16% larger, better accuracy |
llama-3-8b-instruct-q6_k |
Q6_K | ~6.6 GB | 7GB+ | High accuracy β Near-perfect quality for most applications | 35% larger than Q4 |
llama-3-8b-instruct-q8_0 |
Q8_0 | ~8.5 GB | 9GB+ | Maximum fidelity β Best for professional use where quality matters most | 73% larger than Q4, highest RAM |
Best for: General conversational AI, code assistance, API servers, applications where 8K context is sufficient.
Note: Llama 3 models lack the extended context and multilingual capabilities of Llama 3.2 but have been more thoroughly tested in production.
πͺ¨ IBM Granite Family
IBMβs enterprise-oriented models with strong tool calling,
function calling, embedding, and tagged content support. All
Granite models support structured reasoning with
<think> and <response> tags.
Granite 4.1 Series β Next-Generation
| Model | Size | RAM | Context | Use Case | Trade-off |
|---|---|---|---|---|---|
granite-4.1-3b-source |
~2.4 GB | 2GB+ | 128K | Lightweight enterprise β Coding, RAG, multilingual workflows on limited hardware | Small size, full 128K context |
granite-4.1-8b-source |
~4.9 GB | 6GB+ | 128K | Full enterprise β Complex tool workflows, large documents, production assistants | Higher RAM, more capable |
Best for: Enterprise applications, RAG systems, AI assistants with complex tool integration, coding workflows.
Granite 3.3 Series β Stable Release
| Model | Size | RAM | Context | Use Case | Trade-off |
|---|---|---|---|---|---|
granite-3.3-8b |
~4.9 GB | 6GB+ | 8K | Quick deployment β Pre-built GGUF, fast setup, function calling ready | Limited context (8K) |
granite-3.3-8b-source |
~4.9 GB | 6GB+ | 128K | Custom tool workflows β Built from source with extended context and full feature set | Requires conversion from source |
Best for: Business applications requiring function calling, structured outputs, and compliance-ready workflows.
Granite features across all models:
tool_calling, function_calling,
embedding, tagged
β°οΈ Swiss-AI Apertus Family
Open models from Swiss-AI with EU AI Act compliance and 1811 language support.
| Model | Size | RAM | Context | Use Case | Trade-off |
|---|---|---|---|---|---|
apertus-4b |
~2.3 GB | 2GB+ | 32K | Lightweight multilingual β 1811 languages, tool calling, embedding on minimal hardware | Smaller, distilled from 8B |
apertus-8b |
~4.9 GB | 6GB+ | 32K | Full multilingual β Source model for 4B, more capable for complex multilingual tasks | 2x the RAM of 4B |
Features: tool_calling,
embedding, tagged
Best for: Multilingual applications, European deployments requiring compliance, tool-enabled chatbots serving diverse language communities.
Note: Both use custom chat templates supporting
<tool_call> tags.
π§ HuggingFaceTB SmolLM3 Family
Hugging Faceβs compact reasoning models with hybrid reasoning
mode (extended thinking with /think and
/no_think flags) and excellent multilingual
support.
| Model | Size | RAM | Context | Use Case | Trade-off |
|---|---|---|---|---|---|
smollm3-3b |
~2.0 GB | 2GB+ | 128K | Compact reasoner β Hybrid reasoning, multilingual (6+ languages), tool calling on minimal hardware | 3B parameters, excellent reasoning/parameter ratio |
Features: tool_calling
Best for: Applications requiring reasoning capabilities in a very compact package, multilingual deployments, hybrid reasoning workflows, tool-enabled assistants on resource-constrained hardware.
Note: Supports extended thinking mode via
/think and /no_think flags in system prompt or
enable_thinking parameter.
πΈ Google Gemma Family
Googleβs efficient models optimized for on-device deployment.
| Model | Size | RAM | Context | Use Case | Trade-off |
|---|---|---|---|---|---|
gemma4-e4b |
~2.5 GB | 3GB+ | 128K | Technical documentation β Code review, editorial tasks, technical writing with massive context | Efficient architecture, 128K context |
Features: tool_calling,
embedding
Best for: Technical content generation, code review, documentation assistance, deployments on resource-constrained devices (runs on M1 Mac Mini with 8GB, Raspberry Pi 500 at ~1-2 tok/s).
Note: Uses the βEfficientβ (E) series architecture, which is compact but highly capable.
π Microsoft Phi Family
Microsoftβs reasoning-optimized models with exceptional reasoning-per-parameter ratios.
| Model | Size | RAM | Context | Use Case | Trade-off |
|---|---|---|---|---|---|
phi4-mini |
~2.2 GB | 2GB+ | 128K | Structured reasoning β Code review, technical documentation, complex problem-solving | 3.8B parameters, best reasoning/parameter ratio |
phi-3.5-mini-instruct |
~2.2 GB | 2GB+ | 128K | Multilingual reasoning β Strong multilingual (20+ languages), code, math, logic | 3.8B parameters, competitive reasoning |
Features: tool_calling
Best for: Applications requiring strong reasoning capabilities in a compact package, technical decision-making, code analysis, multilingual deployments.
Note: Phi-4 Mini can use either the embedded chat
template from tokenizer_config.json or a Phi-3.5-mini
fallback template.
π Alibaba Qwen Family
Alibabaβs Qwen models, with specialized coding variants.
Qwen 3.5 Series
| Model | Quant | Size | RAM | Context | Use Case | Trade-off |
|---|---|---|---|---|---|---|
qwen3.5-2b |
Q8_0 | ~1.8 GB | 2GB+ | 32K | Ultra-lightweight β Minimal hardware, basic tool calling | Smallest Qwen, limited capability |
qwen3.5-4b |
Q5_K_S | ~2.5 GB | 3GB+ | 32K | General purpose β Good balance for most Qwen use cases | Mid-range size and capability |
Best for: Budget deployments, edge devices, simple chat applications.
Qwen 3 Series
| Model | Quant | Size | RAM | Context | Use Case | Trade-off |
|---|---|---|---|---|---|---|
qwen3-4b-instruct-2507 |
Q5_K_S | ~2.7 GB | 3GB+ | 32K | Latest checkpoint β July 2025 model with improved capabilities over 3.5 series | Newer model, better general reasoning |
Best for: Users wanting the latest Qwen improvements for general reasoning tasks.
Qwen 2.5 Coder Series β Coding Specialists
| Model | Quant | Size | RAM | Context | Use Case | Trade-off |
|---|---|---|---|---|---|---|
qwen2.5-coder-7b |
Q4_K_M | ~4.5 GB | 5GB+ | 32K | Code generation β Strong on Go, TypeScript, and many other languages | Specialized for coding, less general knowledge |
Features: tool_calling
Best for: Code completion, code explanation, code generation across multiple programming languages, IDE integrations.
Note: File naming may vary in the GGUF repo (check Hugging Face for exact casing).
π― Resource Requirement Tiers
Tier 1: Ultra-Lightweight (2GB RAM)
Can run on: Raspberry Pi 5 (8GB), M1 Mac Mini (8GB), budget laptops
| Model | Use Case |
|---|---|
| Qwen 3.5 2B | Basic chat, simple tool calling |
| Phi-4 Mini | Structured reasoning, technical tasks |
| Phi-3.5 Mini | Multilingual reasoning, code, math, logic |
| SmolLM3 3B | Hybrid reasoning, multilingual, tool calling |
| Apertus 4B | Multilingual chat, 1811 languages |
| Granite 4.1 3B (Source) | Coding, RAG, enterprise workflows |
| Llama 3.2 3B (Q4_K_M) | Multilingual, 128K context |
Best choice: Phi-4 Mini for reasoning, Llama 3.2 3B for multilingual, Phi-3.5 Mini for multilingual reasoning, SmolLM3 3B for hybrid reasoning, Apertus 4B for language coverage.
Tier 2: Lightweight (4GB RAM)
Can run on: Most modern laptops, M1 MacBook Air, mid-range desktops
| Model | Use Case |
|---|---|
| Llama 3.2 3B (Q5_K_M-Q8_0) | High-quality multilingual |
| Qwen 3.5 4B | General reasoning |
| Qwen 3 4B Instruct (2507) | Latest Qwen improvements |
| Granite 4.1 3B (Source) | Enterprise workflows |
| Gemma 4 E4B | Technical documentation |
| Phi-3.5 Mini | Multilingual reasoning, code, math |
| SmolLM3 3B | Hybrid reasoning, multilingual |
Best choice: Llama 3.2 3B Q5_K_M for balanced multilingual performance, Phi-3.5 Mini for multilingual reasoning.
Tier 3: Standard (6-8GB RAM)
Can run on: M1 MacBook Pro, gaming laptops, workstations
| Model | Use Case |
|---|---|
| Llama 3 8B (Q4_K_M-Q6_K) | General purpose, code, chat |
| Granite 3.3 8B / 4.1 8B | Enterprise, function calling |
| Apertus 8B | Heavy multilingual |
| Qwen 2.5 Coder 7B | Code generation |
Best choice: Llama 3 8B Q5_K_M for best balance, Granite 4.1 8B for enterprise tool workflows.
Tier 4: High-End (8GB+ RAM)
Can run on: M2 MacBook Pro, workstations, servers
| Model | Use Case |
|---|---|
| Llama 3 8B Q8_0 | Maximum quality general use |
| All other models at higher quantizations | Best quality variants |
Best choice: Llama 3 8B Q8_0 for maximum quality, or any model at its highest quantization.
π‘ Use Case Recommendations
π§ Coding & Development
| Priority | Model | Why |
|---|---|---|
| 1 | Qwen 2.5 Coder 7B | Specialized for code, strong on Go/TypeScript |
| 2 | Granite 4.1 8B (Source) | 128K context, tool calling for IDE integration |
| 3 | Llama 3 8B Q5_K_M | Strong general code assistance |
| 4 | Phi-4 Mini | Excellent reasoning for code review |
| 5 | Phi-3.5 Mini | Multilingual code reasoning |
| 5 | SmolLM3 3B | Hybrid reasoning for code tasks |
π Multilingual Applications
| Priority | Model | Why |
|---|---|---|
| 1 | Apertus 8B | 1811 languages, EU compliant |
| 2 | Phi-3.5 Mini | 20+ languages, 128K context, strong reasoning |
| 3 | Llama 3.2 3B Q5_K_M | 9 languages, 128K context |
| 4 | SmolLM3 3B | 6+ languages, 128K context, hybrid reasoning |
| 5 | Apertus 4B | 1811 languages, minimal hardware |
| 6 | Granite 4.1 3B/8B | Multilingual support, enterprise features |
π Long Document Processing (128K Context)
| Priority | Model | Why |
|---|---|---|
| 1 | Granite 4.1 8B (Source) | 128K context, enterprise features |
| 2 | Llama 3.2 3B Q5_K_M | 128K context, multilingual, balanced |
| 3 | Gemma 4 E4B | 128K context, efficient architecture |
| 4 | Phi-4 Mini | 128K context, strong reasoning |
| 5 | Phi-3.5 Mini | 128K context, multilingual reasoning |
| 6 | SmolLM3 3B | 128K context, hybrid reasoning |
| 7 | Granite 4.1 3B (Source) | 128K context, lightweight |
π― General Chat & Assistants
| Priority | Model | Why |
|---|---|---|
| 1 | Llama 3 8B Q5_K_M | Proven, high quality, tool calling |
| 2 | Llama 3.2 3B Q5_K_M | 128K context, multilingual |
| 3 | Granite 4.1 8B (Source) | Enterprise features, 128K context |
| 4 | Phi-4 Mini | Strong reasoning in compact package |
| 5 | Phi-3.5 Mini | Multilingual reasoning, 128K context |
| 6 | SmolLM3 3B | Hybrid reasoning, multilingual, 128K context |
π Tool Calling & Function Calling
| Priority | Model | Features |
|---|---|---|
| 1 | Granite 4.1 8B (Source) | tool_calling, function_calling,
embedding, tagged |
| 2 | Granite 3.3 8B (Source) | Same as above, 128K context |
| 3 | Apertus 8B | tool_calling, embedding,
tagged, 1811 languages |
| 4 | Phi-3.5 Mini | tool_calling, 128K context, 20+ languages |
| 5 | SmolLM3 3B | tool_calling, 128K context, hybrid reasoning |
| 6 | Llama 3.2 3B Q6_K_M | tool_calling, 128K context, multilingual |
π° Budget Deployments
| Priority | Model | Hardware | Capability |
|---|---|---|---|
| 1 | Qwen 3.5 2B | 2GB RAM | Basic tool calling, 32K context |
| 2 | Phi-4 Mini | 2GB RAM | Strong reasoning, 128K context |
| 3 | Phi-3.5 Mini | 2GB RAM | Multilingual reasoning, 128K context |
| 4 | SmolLM3 3B | 2GB RAM | Hybrid reasoning, multilingual, tool calling |
| 5 | Apertus 4B | 2GB RAM | 1811 languages, tool calling |
| 6 | Llama 3.2 3B Q4_K_M | 3GB RAM | Multilingual, 128K context |
βοΈ Trade-off Summary
Quantization Trade-offs
| Quantization | File Size | Quality | RAM | Speed | Best For |
|---|---|---|---|---|---|
| Q4_K_M | Smallest (75% of Q8) | Good | Lower | Fastest | Budget deployments, testing |
| Q5_K_M | Moderate (+15-20%) | High | +1GB | Slightly slower | Best balance for most users |
| Q5_K_S | Moderate (+15-20%) | Very High | +1GB | Slightly slower | High-quality variants |
| Q6_K | Larger (+30-35%) | Very High | +1-2GB | Slower | Near-perfect quality |
| Q6_K_M | Larger (+30-35%) | Very High | +1-2GB | Slower | Uses Q8_0 for critical weights |
| Q8_0 | Largest (baseline) | Maximum | +2-3GB | Slowest | Production, maximum accuracy |
Model Size Trade-offs
| Size | Pros | Cons |
|---|---|---|
| 2-3B | Low RAM, fast, 128K context available | Less capable, limited complexity |
| 4-5B | Good balance, 32-128K context | Moderate RAM, reasonable capability |
| 7-8B | High capability, production-ready | Higher RAM (6-9GB), larger files |
Context Length Trade-offs
| Context | Pros | Cons |
|---|---|---|
| 8K | Lower RAM, faster | Canβt process long documents |
| 32K | Good for most documents | Moderate RAM increase |
| 128K | Full books, long conversations | Higher RAM, slower |
π Top Picks by Scenario
| Scenario | Best Model | Runner-Up | Budget Option |
|---|---|---|---|
| Best overall | Llama 3 8B Q5_K_M | Llama 3.2 3B Q5_K_M | Llama 3.2 3B Q4_K_M |
| Best for coding | Qwen 2.5 Coder 7B | Granite 4.1 8B (Source) | Phi-3.5 Mini |
| Best multilingual | Apertus 8B | Phi-3.5 Mini | Llama 3.2 3B Q5_K_M |
| Best for long docs | Granite 4.1 8B (Source) | Llama 3.2 3B Q5_K_M | Phi-3.5 Mini |
| Best reasoning | Phi-4 Mini | Phi-3.5 Mini | Llama 3 8B Q6_K |
| Best enterprise | Granite 4.1 8B (Source) | Granite 3.3 8B (Source) | Apertus 8B |
| Best for edge | Phi-4 Mini | Phi-3.5 Mini | SmolLM3 3B |
| Lowest RAM | Qwen 3.5 2B (2GB) | Phi-4 Mini (2GB) | Phi-3.5 Mini (2GB) |
π¦ Model Type Key
| Field | Description |
|---|---|
convert: true |
Model must be converted from source (safetensors) to GGUF |
source_type: safetensors |
Model is in PyTorch safetensors format, not pre-converted GGUF |
chat_template |
Custom template used for conversation formatting |
features |
Capabilities baked into the GGUF: tool_calling,
function_calling, embedding,
tagged |
kind: instruct |
Instruction-tuned model (follows prompts better) |
ram_gb |
Recommended minimum RAM for comfortable operation |
context_length |
Maximum token context window |