Henry: Llamafile Factory Guide
Overview
Henry is a factory for building Llamafiles (and GGUF files) from open models hosted on Hugging Face (huggingface.co). It automates downloading, converting, quantizing, and packaging models into self-contained, executable Llamafiles that run without additional dependencies.
What Henry Can Do
- Download models from Hugging Face Hub
- Convert models from Hugging Face format (safetensors/pytorch) to GGUF format
- Apply custom chat templates for proper model interaction
- Quantize models to reduce file size and memory requirements
- Package GGUF files into executable Llamafiles
- Test the generated Llamafiles
Supported Outputs
- Llamafiles: Self-contained executable model files
- GGUF files: Standalone model weights in GGUF format
Software Requirements
System Tools
| Tool | Purpose | Download/Install |
|---|---|---|
| uv | Python package manager | https://docs.astral.sh/uv/ |
| git | Version control | Package manager (apt, brew, etc.) |
| make | Build automation | Package manager (apt, brew, etc.) |
| curl | File downloads | Package manager (apt, brew, etc.) |
| hf | Hugging Face CLI | uv tool install huggingface_hub[cli] |
Python Dependencies
| Package | Purpose | Notes |
|---|---|---|
| torch | PyTorch deep learning framework | Install with --extra-index-url https://download.pytorch.org/whl/cpu |
| transformers | Hugging Face transformers | |
| safetensors | Safe tensor serialization | |
| sentencepiece | Tokenizer support | |
| accelerate | Training/Inference acceleration | |
| huggingface_hub | Hugging Face Hub access | |
| tqdm | Progress bars | |
| gguf | GGUF format support | |
Optional Tools
| Tool | Purpose |
|---|---|
| cmake | Build system for llama.cpp quantize tool |
| g++/clang | C++ compiler for building quantize tool |
Quick Start: Generate Granite 4.1 8B Model
This step-by-step guide walks you through generating a Granite 4.1 8B Llamafile using Henry.
Step 1: Clone Henry Repository
git clone https://github.com/rsdoiel/henry.git
cd henry
Step 2: Install System Dependencies
macOS (Homebrew)
# Install system tools
brew install git make curl cmake
# Install uv (Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
macOS (MacPorts)
# Install system tools
sudo port install git make curl cmake
# Install uv (Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
Linux (Ubuntu/Debian)
# Install system tools
sudo apt update
sudo apt install -y git make curl cmake g++
# Install uv (Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
Windows (WSL2 recommended)
# Install system tools via Chocolatey
choco install git make curl cmake
# Install uv
Invoke-WebRequest -Uri https://astral.sh/uv/install.ps1 -UseBasicParsing | Invoke-Expression
Step 3: Install Python Dependencies
Henry will automatically create a Python virtual environment and install required packages:
# This will create .venv/ and install all Python dependencies
make python-deps
Manual installation (if needed):
# Create virtual environment
uv venv .venv
# Activate it
source .venv/bin/activate # Linux/macOS
# or .\.venv\Scripts\activate on Windows
# Install PyTorch (CPU version)
uv pip install torch --extra-index-url https://download.pytorch.org/whl/cpu
# Install other dependencies
uv pip install transformers safetensors sentencepiece accelerate huggingface_hub tqdm gguf
# Install hf CLI
uv tool install "huggingface_hub[cli]"
Step 4: Download Tools
Download the llamafile binary and zipalign tool:
make tools
This downloads:
- llamafile: The Mozilla AI Llamafile launcher
- zipalign: Tool for bundling the GGUF into the llamafile
Both are downloaded to the tools/ directory.
Step 5: Check Dependencies
Verify all dependencies are installed:
make deps
This checks for uv, git, make, curl, and the hf CLI.
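If `make deps` reports a problem, the same kind of check can be reproduced by hand. The sketch below is an illustration only: `missing_tools` is a hypothetical helper, and the tool list is taken from this guide rather than read from Henry's Makefile.

```shell
# Sketch: report which of the given tools cannot be found on PATH.
# missing_tools is a hypothetical helper, not part of Henry.
missing_tools() (
    missing=""
    for tool in "$@"; do
        # command -v succeeds only when the shell can resolve the name
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    # Print the missing names (if any) without the leading space
    echo "${missing# }"
    # Succeed only when nothing is missing
    [ -z "$missing" ]
)

# Example use:
# missing_tools uv git make curl hf || echo "install the tools listed above"
```

The function prints the names it could not resolve and returns a non-zero status when any are missing, so it can gate a later step in a script.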
Step 6: Generate Granite 4.1 8B Llamafile
Henry provides a pre-configured model definition for Granite 4.1 8B.
The model configuration is in
models/granite-4.1-8b-source.yaml.
Option A: Full Build from Source (Recommended)
This builds the GGUF from the original model weights with custom chat template:
# Build everything: tools, dependencies, download, convert, quantize, package
make all MODEL=granite-4.1-8b-source
This runs the following steps:
1. Downloads the Granite 4.1 8B model from ibm-granite/granite-4.1-8b on Hugging Face
2. Converts it to FP16 GGUF format
3. Applies the IBM Granite chat template (llama.cpp/models/templates/ibm-granite-granite-4.1.jinja)
4. Sets context length to 131,072 tokens
5. Quantizes to Q4_K_M
6. Packages into a self-contained llamafile
Option B: Individual Steps
For more control, run each step individually:
# 1. Check dependencies
make deps
# 2. Install Python dependencies
make python-deps
# 3. Download tools
make tools
# 4. Download and convert model
make download MODEL=granite-4.1-8b-source
# 5. Quantize (already included in download for convert=true models)
make quantize MODEL=granite-4.1-8b-source
# 6. Package into llamafile
make package MODEL=granite-4.1-8b-source
Step 7: Test the Llamafile
Once the build completes, test your new Llamafile:
make test MODEL=granite-4.1-8b-source
Or run it directly:
# List available llamafiles
ls -lh llamafiles/
# Run the Granite 4.1 8B model
./llamafiles/granite-4.1-8b-source-Q4_K_M.llamafile
The Llamafile is self-contained and can be copied to any system with the same architecture to run.
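A copied file can lose its executable bit depending on how it was transferred. As a small sketch, assuming nothing about Henry itself, the hypothetical helper `ensure_runnable` below checks that the file exists and restores the bit if needed before you run it.

```shell
# Sketch: make sure a copied llamafile exists and is executable.
# ensure_runnable is a hypothetical helper, not part of Henry.
ensure_runnable() (
    file="$1"
    if [ ! -f "$file" ]; then
        echo "not found: $file" >&2
        exit 1
    fi
    # Restore the executable bit only when it is missing
    [ -x "$file" ] || chmod +x "$file"
)

# Example use:
# ensure_runnable ./llamafiles/granite-4.1-8b-source-Q4_K_M.llamafile \
#   && ./llamafiles/granite-4.1-8b-source-Q4_K_M.llamafile
```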
Step 8: Clean Up (Optional)
To remove built llamafiles (keeps downloaded tools and cached models):
make clean
Available Models
To see all available models and their status:
make list
This shows a table with:
- Model name
- Quantization method
- RAM requirements
- Build status
- Display name
Model Configuration
Each model is defined by a YAML file in the models/
directory. The Granite 4.1 8B model
(models/granite-4.1-8b-source.yaml) includes:
name: granite-4.1-8b-source
display_name: "IBM Granite 4.1 8B Instruct (Source)"
hf_repo: ibm-granite/granite-4.1-8b
source_type: safetensors
convert: true
quantization: Q4_K_M
context_length: 131072
chat_template: llama.cpp/models/templates/ibm-granite-granite-4.1.jinja
gguf_file: granite-4.1-8b-instruct-Q4_K_M.gguf
output: granite-4.1-8b-source-Q4_K_M.llamafile
Configuration Fields
| Field | Description |
|---|---|
| name | Internal model identifier |
| display_name | Human-readable model name |
| hf_repo | Hugging Face repository (e.g., ibm-granite/granite-4.1-8b) |
| source_type | Model format (safetensors, pytorch) |
| convert | Whether to convert from source (true) or download pre-built GGUF (false) |
| quantization | Quantization method (e.g., Q4_K_M, Q5_K_M, Q6_K) |
| context_length | Maximum context length in tokens |
| chat_template | Path to chat template file |
| gguf_file | Output GGUF filename |
| output | Final llamafile filename |
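Because the configuration files shown here are flat `key: value` pairs, a single field can be inspected from the shell without a YAML library. The following is a minimal sketch, not how Henry itself reads the files: `yaml_field` is a hypothetical helper that handles only simple top-level scalars and ignores nesting, comments, and multi-line values.

```shell
# Sketch: print the value of one top-level "key: value" field.
# yaml_field is a hypothetical helper; it is not a YAML parser.
yaml_field() (
    key="$1"
    file="$2"
    # Keep only lines that start with "key:", drop the key and leading spaces,
    # then strip one pair of surrounding double quotes if present.
    sed -n "s/^${key}:[[:space:]]*//p" "$file" \
        | sed 's/^"\(.*\)"$/\1/' \
        | head -n 1
)

# Example use:
# yaml_field quantization models/granite-4.1-8b-source.yaml
```

Anchoring the match at the start of the line keeps `name` from also matching `display_name`.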
Troubleshooting
Common Issues
Missing dependencies:
ERROR: python3 is required but not found
Install Python 3.x from your package manager or python.org.
PyTorch installation fails:
ERROR: Failed to install torch
Ensure you’re using the correct index URL:
uv pip install torch --extra-index-url https://download.pytorch.org/whl/cpu
Out of memory during conversion: The FP16 conversion requires significant memory. For large models like Granite 4.1 8B, ensure you have at least 16 GB of RAM available.
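The 16 GB figure lines up with simple arithmetic: FP16 stores each weight in 16 bits (2 bytes), so the weights of a model with about 8 billion parameters occupy roughly 16 GB. The sketch below works this out. The parameter count is an assumption based on the model's name, `model_size_gb` is a hypothetical helper, and the estimate ignores file metadata and any tensors kept at a different precision.

```shell
# Sketch: estimate weight storage in decimal gigabytes.
# model_size_gb is a hypothetical helper, not part of Henry.
model_size_gb() (
    params="$1"   # number of parameters (assumed, e.g. 8000000000)
    bits="$2"     # bits per weight (16 for FP16)
    # bits -> bytes (divide by 8), bytes -> GB (divide by 10^9)
    echo $(( $params * $bits / 8 / 1000000000 ))
)

model_size_gb 8000000000 16   # prints 16
```

The same function gives a rough lower bound for a quantized file when passed a smaller bits-per-weight value.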
llama.cpp clone fails:
ERROR: Failed to clone llama.cpp
Check your git connectivity and try again. You can also manually clone it:
git clone https://github.com/ggml-org/llama.cpp.git --depth 1
Architecture-Specific Notes
aarch64 (Apple Silicon / ARM64):
- aarch64 Linux requires registering the APE binary format once per boot:
sudo sh -c "echo ':APE:M::MZqFpD::/bin/sh:' > /proc/sys/fs/binfmt_misc/register"
- This is automatic on macOS
Windows:
- Use WSL2 (Windows Subsystem for Linux) for best compatibility
- Native Windows support is experimental
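On aarch64 Linux, the registration has to be repeated after each boot, so it can help to check whether it is still in place. Registering an entry named `APE` creates a file of that name under the binfmt_misc mount point. The sketch below relies on that. `ape_registered` is a hypothetical helper, and its directory argument exists only to make it easy to test.

```shell
# Sketch: check whether an APE entry is registered with binfmt_misc.
# ape_registered is a hypothetical helper, not part of Henry or llamafile.
ape_registered() (
    # Default to the standard binfmt_misc mount point on Linux
    dir="${1:-/proc/sys/fs/binfmt_misc}"
    # A registered ':APE:...' rule appears as a file named APE
    [ -e "$dir/APE" ]
)

# Example use:
# ape_registered || echo "APE is not registered; run the registration command"
```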
Understanding the Process
What Happens During Build
- Download: Model weights are downloaded from Hugging Face Hub to models-cache/
- Convert: Model is converted from safetensors to FP16 GGUF format using llama.cpp
- Template: Custom chat template is applied for proper model interaction
- Context: Context length is set (131,072 tokens for Granite 4.1 8B)
- Quantize: FP16 GGUF is quantized to the target method (Q4_K_M by default)
- Package: Quantized GGUF is bundled with the llamafile launcher using zipalign
- Test: The final llamafile is verified to work correctly
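These stages correspond to the make targets listed under Option B. As a rough sketch of running them in order and stopping at the first failure, the hypothetical helper below loops over those targets. `MAKE_CMD` is an assumed override introduced here for dry runs; it is not a Henry setting.

```shell
# Sketch: run the build targets in order, stopping at the first failure.
# run_stages and MAKE_CMD are hypothetical, not part of Henry.
run_stages() (
    model="$1"
    make_cmd="${MAKE_CMD:-make}"
    for target in deps python-deps tools download quantize package test; do
        echo ">> $target"
        # $make_cmd is left unquoted on purpose so "echo make" splits into two words
        $make_cmd "$target" "MODEL=$model" || exit 1
    done
)

# Dry run that only prints the commands:
# MAKE_CMD="echo make" run_stages granite-4.1-8b-source
```

In practice `make all MODEL=...` already covers the full sequence. A loop like this is mainly useful when you want to log or time each stage separately.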
Output Files
| Directory | Purpose |
|---|---|
| tools/ | llamafile and zipalign binaries |
| models-cache/ | Downloaded model weights and intermediate files |
| llamafiles/ | Final generated Llamafiles |
| .venv/ | Python virtual environment |
| llama.cpp/ | Cloned llama.cpp repository (auto-downloaded) |
Granite 4.1 8B Specifics
The IBM Granite 4.1 8B model built by Henry includes:
- Extended context: 131,072 tokens (128K)
- Features: Tool calling, embedding, tagged content, function calling
- Multilingual: Supports multiple languages
- Use cases: Coding, RAG, AI assistant workflows
- Quantization: Q4_K_M (4-bit k-quant, medium variant)
- Memory: ~6GB RAM required for inference
The custom chat template (ibm-granite-granite-4.1.jinja) enables:
- Proper instruction following
- Tool calling support
- Structured output formatting
- Multi-turn conversation handling
Customization
Create a Custom Model Configuration
To add a new model, create a YAML file in models/:
name: my-custom-model
display_name: "My Custom Model"
hf_repo: my-org/my-model
source_type: safetensors
convert: true
quantization: Q5_K_M
context_length: 32768
chat_template: templates/my-template.jinja
gguf_file: my-model-Q5_K_M.gguf
output: my-model-Q5_K_M.llamafile
Then build it:
make all MODEL=my-custom-model
Change Quantization Method
Edit the model YAML file and change the quantization field. Available methods include:
- Q2_K: 2-bit (smallest, lowest quality)
- Q3_K_M: 3-bit
- Q4_0: 4-bit original
- Q4_K_M: 4-bit k-quant, medium variant (recommended)
- Q5_K_M: 5-bit (better quality)
- Q6_K: 6-bit (higher quality)
- Q8_0: 8-bit (highest quality, largest)
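In the example configurations, the gguf_file and output names also embed the quantization method, so it is probably sensible to keep them in sync when changing the field. The sketch below does that with sed. `set_quantization` is a hypothetical helper, and whether Henry requires the filenames to match is an assumption, not something this guide states.

```shell
# Sketch: change the quantization field and the filenames that embed it.
# set_quantization is a hypothetical helper, not part of Henry.
set_quantization() (
    file="$1"
    new="$2"
    # Read the current method from the quantization field
    old=$(sed -n 's/^quantization:[[:space:]]*//p' "$file" | head -n 1)
    [ -n "$old" ] || { echo "no quantization field in $file" >&2; exit 1; }
    tmp=$(mktemp)
    # Rewrite via a temp file to avoid sed -i differences between GNU and BSD sed
    sed -e "s/^quantization:.*/quantization: $new/" \
        -e "/^gguf_file:/s/$old/$new/" \
        -e "/^output:/s/$old/$new/" "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    exit "$status"
)

# Example use:
# set_quantization models/granite-4.1-8b-source.yaml Q5_K_M
```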
Then rebuild:
make clean MODEL=granite-4.1-8b-source
make quantize MODEL=granite-4.1-8b-source
make package MODEL=granite-4.1-8b-source
Resources
- Henry Repository: https://github.com/rsdoiel/henry
- Mozilla Llamafile: https://github.com/Mozilla-Ocho/llamafile
- llama.cpp: https://github.com/ggml-org/llama.cpp
- IBM Granite 4.1 8B: https://huggingface.co/ibm-granite/granite-4.1-8b
- Hugging Face Hub: https://huggingface.co/
License
Henry is licensed under the AGPL-3.0 license. See LICENSE for details.