skip to main content

Henry Model Catalog: Complete Guide to Available LLMs

Last updated: June 26, 2026

This document describes all models with YAML definitions in Henry, including their capabilities, ideal use cases, and the hardware trade-offs for running each. Models are organized by family and sorted by parameter count.

πŸ“Š Quick Comparison Matrix

Model Size RAM Context Best For Quantization
Qwen 3.5 2B ~1.8 GB 2GB+ 32K Lightweight chat, edge Q8_0
Phi-4 Mini ~2.2 GB 2GB+ 128K Structured reasoning, code Q4_K_M
Phi-3.5 Mini ~2.2 GB 2GB+ 128K Multilingual reasoning, code Q4_K_M
SmolLM3 3B ~2.0 GB 2GB+ 128K Hybrid reasoning, multilingual Q4_K_M
Apertus 4B ~2.3 GB 2GB+ 32K Multilingual, tool calling Q4_K_M
Granite 4.1 3B ~2.4 GB 2GB+ 128K Coding, RAG, assistants Q4_K_M
Llama 3.2 3B ~2.0-3.4 GB 3-4GB+ 128K Multilingual, general use Q4_K_M-Q8_0
Granite 3.3 8B ~4.9 GB 6GB+ 8K Enterprise, function calling Q4_K_M
Gemma 4 E4B ~2.5 GB 3GB+ 128K Technical docs, code review Q4_K_M
Qwen 3.5 4B ~2.5 GB 3GB+ 32K General reasoning Q5_K_S
Qwen 3 4B ~2.7 GB 3GB+ 32K General reasoning Q5_K_S
Apertus 8B ~4.9 GB 6GB+ 32K Heavy multilingual Q4_K_M
Qwen 2.5 Coder 7B ~4.5 GB 5GB+ 32K Code generation Q4_K_M
Llama 3 8B ~4.9-8.5 GB 5-9GB+ 8K General chat, code Q4_K_M-Q8_0
Granite 4.1 8B ~4.9 GB 6GB+ 128K Enterprise assistants Q4_K_M
Granite 3.3 8B (Source) ~4.9 GB 6GB+ 128K Custom tool workflows Q4_K_M

πŸ¦™ Meta Llama Family

Llama 3.2 Series β€” Multilingual with Extended Context

The Llama 3.2 models feature 128K token context windows and native multilingual support (English, German, French, Italian, Portuguese, Hindi, Spanish, Thai). They are instruction-tuned and support tool calling.

Model Quant Size RAM Use Case Trade-off
llama-3.2-3b-instruct Q4_K_M ~2.0 GB 3GB+ Entry-level multilingual β€” Good for basic chat, translation, and simple tool workflows on constrained hardware Smallest file, moderate quality
llama-3.2-3b-instruct-q5_k_m Q5_K_M ~2.3 GB 3GB+ Balanced multilingual β€” Best price/performance for general use with 128K context 15% larger than Q4, better accuracy
llama-3.2-3b-instruct-q6_k_m Q6_K_M ~2.6 GB 3GB+ High-quality multilingual β€” Uses Q8_0 for critical weights, near-perfect output 30% larger than Q4, excellent quality
llama-3.2-3b-instruct-q8_0 Q8_0 ~3.4 GB 4GB+ Maximum quality β€” Best for production where accuracy is paramount Largest file, highest RAM

Best for: Multilingual applications, long-document processing (128K context), chatbots serving international audiences, tool-enabled assistants on moderate hardware.

Llama 3 Series β€” Proven General-Purpose Models

The original Llama 3 models with 8K token context. Strong at instruction following, conversation, and code generation.

Model Quant Size RAM Use Case Trade-off
llama-3-8b-instruct Q4_K_M ~4.9 GB 5GB+ Default choice β€” Good balance for most users, strong general performance Baseline quality, smallest footprint
llama-3-8b-instruct-q5_k_m Q5_K_M ~5.7 GB 6GB+ Improved quality β€” Better than Q4 for complex tasks, still reasonable size 16% larger, better accuracy
llama-3-8b-instruct-q6_k Q6_K ~6.6 GB 7GB+ High accuracy β€” Near-perfect quality for most applications 35% larger than Q4
llama-3-8b-instruct-q8_0 Q8_0 ~8.5 GB 9GB+ Maximum fidelity β€” Best for professional use where quality matters most 73% larger than Q4, highest RAM

Best for: General conversational AI, code assistance, API servers, applications where 8K context is sufficient.

Note: Llama 3 models lack the extended context and multilingual capabilities of Llama 3.2 but have been more thoroughly tested in production.

πŸͺ¨ IBM Granite Family

IBM’s enterprise-oriented models with strong tool calling, function calling, embedding, and tagged content support. All Granite models support structured reasoning with <think> and <response> tags.

Granite 4.1 Series β€” Next-Generation

Model Size RAM Context Use Case Trade-off
granite-4.1-3b-source ~2.4 GB 2GB+ 128K Lightweight enterprise β€” Coding, RAG, multilingual workflows on limited hardware Small size, full 128K context
granite-4.1-8b-source ~4.9 GB 6GB+ 128K Full enterprise β€” Complex tool workflows, large documents, production assistants Higher RAM, more capable

Best for: Enterprise applications, RAG systems, AI assistants with complex tool integration, coding workflows.

Granite 3.3 Series β€” Stable Release

Model Size RAM Context Use Case Trade-off
granite-3.3-8b ~4.9 GB 6GB+ 8K Quick deployment β€” Pre-built GGUF, fast setup, function calling ready Limited context (8K)
granite-3.3-8b-source ~4.9 GB 6GB+ 128K Custom tool workflows β€” Built from source with extended context and full feature set Requires conversion from source

Best for: Business applications requiring function calling, structured outputs, and compliance-ready workflows.

Granite features across all models: tool_calling, function_calling, embedding, tagged

⛰️ Swiss-AI Apertus Family

Open models from Swiss-AI with EU AI Act compliance and 1811 language support.

Model Size RAM Context Use Case Trade-off
apertus-4b ~2.3 GB 2GB+ 32K Lightweight multilingual β€” 1811 languages, tool calling, embedding on minimal hardware Smaller, distilled from 8B
apertus-8b ~4.9 GB 6GB+ 32K Full multilingual β€” Source model for 4B, more capable for complex multilingual tasks 2x the RAM of 4B

Features: tool_calling, embedding, tagged

Best for: Multilingual applications, European deployments requiring compliance, tool-enabled chatbots serving diverse language communities.

Note: Both use custom chat templates supporting <tool_call> tags.

🧠 HuggingFaceTB SmolLM3 Family

Hugging Face’s compact reasoning models with hybrid reasoning mode (extended thinking with /think and /no_think flags) and excellent multilingual support.

Model Size RAM Context Use Case Trade-off
smollm3-3b ~2.0 GB 2GB+ 128K Compact reasoner β€” Hybrid reasoning, multilingual (6+ languages), tool calling on minimal hardware 3B parameters, excellent reasoning/parameter ratio

Features: tool_calling

Best for: Applications requiring reasoning capabilities in a very compact package, multilingual deployments, hybrid reasoning workflows, tool-enabled assistants on resource-constrained hardware.

Note: Supports extended thinking mode via /think and /no_think flags in system prompt or enable_thinking parameter.

🌸 Google Gemma Family

Google’s efficient models optimized for on-device deployment.

Model Size RAM Context Use Case Trade-off
gemma4-e4b ~2.5 GB 3GB+ 128K Technical documentation β€” Code review, editorial tasks, technical writing with massive context Efficient architecture, 128K context

Features: tool_calling, embedding

Best for: Technical content generation, code review, documentation assistance, deployments on resource-constrained devices (runs on M1 Mac Mini with 8GB, Raspberry Pi 500 at ~1-2 tok/s).

Note: Uses the β€œEfficient” (E) series architecture, which is compact but highly capable.

πŸ” Microsoft Phi Family

Microsoft’s reasoning-optimized models with exceptional reasoning-per-parameter ratios.

Model Size RAM Context Use Case Trade-off
phi4-mini ~2.2 GB 2GB+ 128K Structured reasoning β€” Code review, technical documentation, complex problem-solving 3.8B parameters, best reasoning/parameter ratio
phi-3.5-mini-instruct ~2.2 GB 2GB+ 128K Multilingual reasoning β€” Strong multilingual (20+ languages), code, math, logic 3.8B parameters, competitive reasoning

Features: tool_calling

Best for: Applications requiring strong reasoning capabilities in a compact package, technical decision-making, code analysis, multilingual deployments.

Note: Phi-4 Mini can use either the embedded chat template from tokenizer_config.json or a Phi-3.5-mini fallback template.

πŸ‰ Alibaba Qwen Family

Alibaba’s Qwen models, with specialized coding variants.

Qwen 3.5 Series

Model Quant Size RAM Context Use Case Trade-off
qwen3.5-2b Q8_0 ~1.8 GB 2GB+ 32K Ultra-lightweight β€” Minimal hardware, basic tool calling Smallest Qwen, limited capability
qwen3.5-4b Q5_K_S ~2.5 GB 3GB+ 32K General purpose β€” Good balance for most Qwen use cases Mid-range size and capability

Best for: Budget deployments, edge devices, simple chat applications.

Qwen 3 Series

Model Quant Size RAM Context Use Case Trade-off
qwen3-4b-instruct-2507 Q5_K_S ~2.7 GB 3GB+ 32K Latest checkpoint β€” July 2025 model with improved capabilities over 3.5 series Newer model, better general reasoning

Best for: Users wanting the latest Qwen improvements for general reasoning tasks.

Qwen 2.5 Coder Series β€” Coding Specialists

Model Quant Size RAM Context Use Case Trade-off
qwen2.5-coder-7b Q4_K_M ~4.5 GB 5GB+ 32K Code generation β€” Strong on Go, TypeScript, and many other languages Specialized for coding, less general knowledge

Features: tool_calling

Best for: Code completion, code explanation, code generation across multiple programming languages, IDE integrations.

Note: File naming may vary in the GGUF repo (check Hugging Face for exact casing).

🎯 Resource Requirement Tiers

Tier 1: Ultra-Lightweight (2GB RAM)

Can run on: Raspberry Pi 5 (8GB), M1 Mac Mini (8GB), budget laptops

Model Use Case
Qwen 3.5 2B Basic chat, simple tool calling
Phi-4 Mini Structured reasoning, technical tasks
Phi-3.5 Mini Multilingual reasoning, code, math, logic
SmolLM3 3B Hybrid reasoning, multilingual, tool calling
Apertus 4B Multilingual chat, 1811 languages
Granite 4.1 3B (Source) Coding, RAG, enterprise workflows
Llama 3.2 3B (Q4_K_M) Multilingual, 128K context

Best choice: Phi-4 Mini for reasoning, Llama 3.2 3B for multilingual, Phi-3.5 Mini for multilingual reasoning, SmolLM3 3B for hybrid reasoning, Apertus 4B for language coverage.

Tier 2: Lightweight (4GB RAM)

Can run on: Most modern laptops, M1 MacBook Air, mid-range desktops

Model Use Case
Llama 3.2 3B (Q5_K_M-Q8_0) High-quality multilingual
Qwen 3.5 4B General reasoning
Qwen 3 4B Instruct (2507) Latest Qwen improvements
Granite 4.1 3B (Source) Enterprise workflows
Gemma 4 E4B Technical documentation
Phi-3.5 Mini Multilingual reasoning, code, math
SmolLM3 3B Hybrid reasoning, multilingual

Best choice: Llama 3.2 3B Q5_K_M for balanced multilingual performance, Phi-3.5 Mini for multilingual reasoning.

Tier 3: Standard (6-8GB RAM)

Can run on: M1 MacBook Pro, gaming laptops, workstations

Model Use Case
Llama 3 8B (Q4_K_M-Q6_K) General purpose, code, chat
Granite 3.3 8B / 4.1 8B Enterprise, function calling
Apertus 8B Heavy multilingual
Qwen 2.5 Coder 7B Code generation

Best choice: Llama 3 8B Q5_K_M for best balance, Granite 4.1 8B for enterprise tool workflows.

Tier 4: High-End (8GB+ RAM)

Can run on: M2 MacBook Pro, workstations, servers

Model Use Case
Llama 3 8B Q8_0 Maximum quality general use
All other models at higher quantizations Best quality variants

Best choice: Llama 3 8B Q8_0 for maximum quality, or any model at its highest quantization.

πŸ’‘ Use Case Recommendations

πŸ”§ Coding & Development

Priority Model Why
1 Qwen 2.5 Coder 7B Specialized for code, strong on Go/TypeScript
2 Granite 4.1 8B (Source) 128K context, tool calling for IDE integration
3 Llama 3 8B Q5_K_M Strong general code assistance
4 Phi-4 Mini Excellent reasoning for code review
5 Phi-3.5 Mini Multilingual code reasoning
5 SmolLM3 3B Hybrid reasoning for code tasks

🌍 Multilingual Applications

Priority Model Why
1 Apertus 8B 1811 languages, EU compliant
2 Phi-3.5 Mini 20+ languages, 128K context, strong reasoning
3 Llama 3.2 3B Q5_K_M 9 languages, 128K context
4 SmolLM3 3B 6+ languages, 128K context, hybrid reasoning
5 Apertus 4B 1811 languages, minimal hardware
6 Granite 4.1 3B/8B Multilingual support, enterprise features

πŸ“„ Long Document Processing (128K Context)

Priority Model Why
1 Granite 4.1 8B (Source) 128K context, enterprise features
2 Llama 3.2 3B Q5_K_M 128K context, multilingual, balanced
3 Gemma 4 E4B 128K context, efficient architecture
4 Phi-4 Mini 128K context, strong reasoning
5 Phi-3.5 Mini 128K context, multilingual reasoning
6 SmolLM3 3B 128K context, hybrid reasoning
7 Granite 4.1 3B (Source) 128K context, lightweight

🎯 General Chat & Assistants

Priority Model Why
1 Llama 3 8B Q5_K_M Proven, high quality, tool calling
2 Llama 3.2 3B Q5_K_M 128K context, multilingual
3 Granite 4.1 8B (Source) Enterprise features, 128K context
4 Phi-4 Mini Strong reasoning in compact package
5 Phi-3.5 Mini Multilingual reasoning, 128K context
6 SmolLM3 3B Hybrid reasoning, multilingual, 128K context

πŸ”Œ Tool Calling & Function Calling

Priority Model Features
1 Granite 4.1 8B (Source) tool_calling, function_calling, embedding, tagged
2 Granite 3.3 8B (Source) Same as above, 128K context
3 Apertus 8B tool_calling, embedding, tagged, 1811 languages
4 Phi-3.5 Mini tool_calling, 128K context, 20+ languages
5 SmolLM3 3B tool_calling, 128K context, hybrid reasoning
6 Llama 3.2 3B Q6_K_M tool_calling, 128K context, multilingual

πŸ’° Budget Deployments

Priority Model Hardware Capability
1 Qwen 3.5 2B 2GB RAM Basic tool calling, 32K context
2 Phi-4 Mini 2GB RAM Strong reasoning, 128K context
3 Phi-3.5 Mini 2GB RAM Multilingual reasoning, 128K context
4 SmolLM3 3B 2GB RAM Hybrid reasoning, multilingual, tool calling
5 Apertus 4B 2GB RAM 1811 languages, tool calling
6 Llama 3.2 3B Q4_K_M 3GB RAM Multilingual, 128K context

βš–οΈ Trade-off Summary

Quantization Trade-offs

Quantization File Size Quality RAM Speed Best For
Q4_K_M Smallest (75% of Q8) Good Lower Fastest Budget deployments, testing
Q5_K_M Moderate (+15-20%) High +1GB Slightly slower Best balance for most users
Q5_K_S Moderate (+15-20%) Very High +1GB Slightly slower High-quality variants
Q6_K Larger (+30-35%) Very High +1-2GB Slower Near-perfect quality
Q6_K_M Larger (+30-35%) Very High +1-2GB Slower Uses Q8_0 for critical weights
Q8_0 Largest (baseline) Maximum +2-3GB Slowest Production, maximum accuracy

Model Size Trade-offs

Size Pros Cons
2-3B Low RAM, fast, 128K context available Less capable, limited complexity
4-5B Good balance, 32-128K context Moderate RAM, reasonable capability
7-8B High capability, production-ready Higher RAM (6-9GB), larger files

Context Length Trade-offs

Context Pros Cons
8K Lower RAM, faster Can’t process long documents
32K Good for most documents Moderate RAM increase
128K Full books, long conversations Higher RAM, slower

πŸ† Top Picks by Scenario

Scenario Best Model Runner-Up Budget Option
Best overall Llama 3 8B Q5_K_M Llama 3.2 3B Q5_K_M Llama 3.2 3B Q4_K_M
Best for coding Qwen 2.5 Coder 7B Granite 4.1 8B (Source) Phi-3.5 Mini
Best multilingual Apertus 8B Phi-3.5 Mini Llama 3.2 3B Q5_K_M
Best for long docs Granite 4.1 8B (Source) Llama 3.2 3B Q5_K_M Phi-3.5 Mini
Best reasoning Phi-4 Mini Phi-3.5 Mini Llama 3 8B Q6_K
Best enterprise Granite 4.1 8B (Source) Granite 3.3 8B (Source) Apertus 8B
Best for edge Phi-4 Mini Phi-3.5 Mini SmolLM3 3B
Lowest RAM Qwen 3.5 2B (2GB) Phi-4 Mini (2GB) Phi-3.5 Mini (2GB)

πŸ“¦ Model Type Key

Field Description
convert: true Model must be converted from source (safetensors) to GGUF
source_type: safetensors Model is in PyTorch safetensors format, not pre-converted GGUF
chat_template Custom template used for conversation formatting
features Capabilities baked into the GGUF: tool_calling, function_calling, embedding, tagged
kind: instruct Instruction-tuned model (follows prompts better)
ram_gb Recommended minimum RAM for comfortable operation
context_length Maximum token context window