Programming Language Support Implementation Plan

Overview

This document outlines the phased implementation plan for comprehensive programming language support in Harvey, based on the design specified in programming_language_support_design.md.

Status: Active
Created: 2026-06-09
Related Documents:
  • programming_language_support_design.md
  • DECISIONS.md
  • Using_RAGs_with_Harvey.md


Quick Fix Applied

Date: 2026-06-09
File: commands.go
Function: looksLikePath (lines 3463-3472)

Change: Added missing programming language extensions to the knownExts slice:
  • .c, .cpp, .h, .hpp (C/C++)
  • .pas (Pascal)
  • .Mod, .obn (Oberon)
  • .lisp (Lisp)
  • .bas (Basic)

Impact: Tagged code blocks like c:program.c or pascal:module.pas are now correctly recognized as file paths, enabling auto-write functionality for these languages.
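
The check described above can be sketched as follows. This is a minimal illustration, not the actual code in commands.go: the function name looksLikePathSketch is hypothetical, the slice lists only the extensions named in this quick fix, and case-insensitive matching is an assumption.

```go
package main

import (
	"path/filepath"
	"strings"
)

// knownExts here lists only the extensions named in the quick fix.
// The real slice in commands.go holds more entries that are not shown.
var knownExts = []string{
	".c", ".cpp", ".h", ".hpp",
	".pas",
	".Mod", ".obn",
	".lisp",
	".bas",
}

// looksLikePathSketch reports whether the part after "lang:" in a tagged
// code block ends in a known extension. Case-insensitive comparison is an
// assumption of this sketch.
func looksLikePathSketch(tag string) bool {
	ext := filepath.Ext(tag)
	if ext == "" {
		return false
	}
	for _, known := range knownExts {
		if strings.EqualFold(ext, known) {
			return true
		}
	}
	return false
}
```

With this sketch, looksLikePathSketch("program.c") and looksLikePathSketch("module.Mod") return true, while a bare word such as "notes" returns false.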


Implementation Phases

The implementation is divided into 6 phases with clear milestones, deliverables, and success criteria. Each phase builds on the previous one and includes testing and documentation.


Phase 1: Language Registry & Core Infrastructure

Duration: 1-2 weeks
Priority: High
Dependencies: None (foundational)

Objectives

Tasks

Task 1.1: Create Core Types and Interfaces

File: language_registry.go (new)

Task 1.2: Implement Language Registry

File: language_registry.go

Task 1.3: Define Language Metadata

File: language_registry.go

Task 1.4: Initialize Registry at Startup

File: language_registry.go

Task 1.5: Update looksLikePath Function

File: commands.go (modified)

Task 1.6: Add Unit Tests for Registry

File: language_registry_test.go (new)

Deliverables

  1. language_registry.go: core registry, all types and interfaces
  2. language_registry_test.go: 35 tests, 0 failures
  3. ✅ Updated commands.go: looksLikePath uses registry
  4. harvey.go: no change needed (registry uses init())
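
As a rough illustration of why harvey.go needs no change, the sketch below registers languages from a package-level init(). The Language and Registry types, their fields, and the ByPath method are hypothetical names chosen for this example; the real types in language_registry.go are not shown in this plan.

```go
package main

import (
	"path/filepath"
	"strings"
	"sync"
)

// Language is a hypothetical metadata record for one language.
type Language struct {
	Name       string
	Extensions []string
}

// Registry maps lower-cased file extensions to languages.
type Registry struct {
	mu    sync.RWMutex
	byExt map[string]*Language
}

func NewRegistry() *Registry {
	return &Registry{byExt: make(map[string]*Language)}
}

// Register indexes the language under each of its extensions.
func (r *Registry) Register(lang *Language) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for _, ext := range lang.Extensions {
		r.byExt[strings.ToLower(ext)] = lang
	}
}

// ByPath looks a language up by the extension of the given path.
func (r *Registry) ByPath(path string) (*Language, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	lang, ok := r.byExt[strings.ToLower(filepath.Ext(path))]
	return lang, ok
}

var defaultRegistry = NewRegistry()

// init runs before main, so no explicit startup call is required.
func init() {
	defaultRegistry.Register(&Language{Name: "c", Extensions: []string{".c", ".h"}})
	defaultRegistry.Register(&Language{Name: "pascal", Extensions: []string{".pas"}})
	defaultRegistry.Register(&Language{Name: "oberon", Extensions: []string{".Mod", ".obn"}})
}
```

A caller could then write defaultRegistry.ByPath("src/module.Mod") and receive the Oberon entry without any setup code in harvey.go.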

Success Criteria


Phase 2: Basic Language Detection

Duration: 1 week
Priority: High
Dependencies: Phase 1 complete

Objectives

Tasks

Task 2.1: Implement LanguageDetector Interface

File: language_detector.go (new)

Task 2.2: Create Language-Specific Detectors

File: language_detector.go

Task 2.3: Add Shebang Detection

File: language_detector.go

Task 2.4: Add Magic Number Detection

File: language_detector.go

Task 2.5: Add Unit Tests

File: language_detector_test.go (new)

Deliverables

  1. language_detector.go: ExtensionDetector, ContentDetector, CombinedDetector, detectShebang, detectKeywords, isTextContent
  2. language_detector_test.go: 45 tests, 0 failures
  3. ✅ Updated language_registry.go: DetectLanguage method; all 21 languages wired with CombinedDetector
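
One plausible way to combine extension and shebang detection is sketched below. The function names, the two lookup tables, and the rule that the extension wins over content are all assumptions made for illustration; the actual detectors in language_detector.go may behave differently.

```go
package main

import (
	"path"
	"path/filepath"
	"strings"
)

// Tiny illustrative tables, not Harvey's real data.
var extToLang = map[string]string{
	".c": "c", ".pas": "pascal", ".lisp": "lisp", ".py": "python",
}

var interpToLang = map[string]string{
	"python": "python", "python3": "python", "sbcl": "lisp",
	"sh": "shell", "bash": "shell",
}

// detectShebangSketch maps a leading "#!" line to a language name,
// or returns "" when there is no recognised interpreter.
func detectShebangSketch(content string) string {
	if !strings.HasPrefix(content, "#!") {
		return ""
	}
	line := content[2:]
	if i := strings.IndexByte(line, '\n'); i >= 0 {
		line = line[:i]
	}
	fields := strings.Fields(line)
	if len(fields) == 0 {
		return ""
	}
	interp := path.Base(fields[0])
	// "#!/usr/bin/env python3": the interpreter is the next field.
	if interp == "env" && len(fields) > 1 {
		interp = fields[1]
	}
	return interpToLang[interp]
}

// detectLanguageSketch tries the extension first, then the shebang.
func detectLanguageSketch(filename, content string) string {
	if lang, ok := extToLang[strings.ToLower(filepath.Ext(filename))]; ok {
		return lang
	}
	return detectShebangSketch(content)
}
```

In this sketch a script named "tool" that starts with "#!/usr/bin/env python3" is detected as python even though it has no extension.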

Success Criteria


Phase 3: Code-Aware Chunking for RAG

Duration: 2-3 weeks
Priority: High
Dependencies: Phases 1-2 complete

Objectives

Tasks

Task 3.1: Implement CodeChunker Interface

File: code_chunkers.go (new)

Task 3.2: Implement C/C++ Chunker

File: code_chunkers.go

Task 3.3: Implement Pascal Chunker

File: code_chunkers.go

Task 3.4: Implement Oberon Chunker

File: code_chunkers.go

Task 3.5: Implement Lisp Chunker

File: code_chunkers.go

Task 3.6: Implement Basic Chunker

File: code_chunkers.go

Task 3.7: Integrate with RAG Ingestion

File: commands.go (modified)

Task 3.8: Update RagStore for Enriched Chunks (Required)

File: rag_support.go (modified)

Task 3.9: Add Unit Tests

File: code_chunkers_test.go (new)

Deliverables

  1. code_chunkers.go: CChunker, PascalChunker, OberonChunker, LispChunker, BasicChunker; helpers findLineCol, makeChunk; SetChunker/initChunkers
  2. code_chunkers_test.go: 42 tests, 0 failures
  3. ✅ Updated commands.go: ragIngestFile uses code-aware chunking + binary skip
  4. ✅ Updated rag_support.go: IngestEnriched + lazy schema migration
  5. ✅ Updated language_registry.go: init() calls initChunkers
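
To illustrate what code-aware chunking means in practice, here is a small sketch that splits Lisp source into top-level forms by tracking parenthesis depth. The Chunk type and the function name are hypothetical, and the real LispChunker may use a different strategy. The sketch skips ';' line comments and double-quoted strings but does not handle character literals such as #\( or #| |# block comments.

```go
package main

// Chunk is a hypothetical shape for one code-aware chunk.
type Chunk struct {
	Text      string
	StartLine int // 1-based line where the form begins
}

// chunkLispSketch returns one chunk per top-level parenthesised form.
func chunkLispSketch(src string) []Chunk {
	var chunks []Chunk
	depth, start, line, startLine := 0, -1, 1, 0
	inString, inComment := false, false
	for i := 0; i < len(src); i++ {
		c := src[i]
		switch {
		case c == '\n':
			line++
			inComment = false
		case inComment:
			// Ignore everything until the end of the line.
		case inString:
			if c == '\\' && i+1 < len(src) {
				i++ // skip the escaped character
				if src[i] == '\n' {
					line++
				}
			} else if c == '"' {
				inString = false
			}
		case c == ';':
			inComment = true
		case c == '"':
			inString = true
		case c == '(':
			if depth == 0 {
				start, startLine = i, line
			}
			depth++
		case c == ')':
			if depth > 0 {
				depth--
				if depth == 0 {
					chunks = append(chunks, Chunk{Text: src[start : i+1], StartLine: startLine})
				}
			}
		}
	}
	return chunks
}
```

Because each chunk is a complete form, a defun is never split in the middle, which is the property the chunk-quality targets in this plan are meant to measure.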

Success Criteria


Phase 4: Documentation Extraction

Duration: 1-2 weeks
Priority: Medium
Dependencies: Phases 1-3 complete

Objectives

Tasks

Task 4.1: Implement DocExtractor Interface

File: doc_extractors.go (new)

Task 4.2: Implement C/C++ DocExtractor

File: doc_extractors.go

Task 4.3: Implement Pascal DocExtractor

File: doc_extractors.go

Task 4.4: Implement Oberon DocExtractor

File: doc_extractors.go

Task 4.5: Implement Lisp DocExtractor

File: doc_extractors.go

Task 4.6: Implement Basic DocExtractor

File: doc_extractors.go

Task 4.7: Integrate with Chunkers

File: commands.go (modified)

Task 4.8: Add Unit Tests

File: doc_extractors_test.go (new)

Deliverables

  1. doc_extractors.go: CDocExtractor, PascalDocExtractor, OberonDocExtractor, LispDocExtractor, BasicDocExtractor; docsToSymbolMap; lispStringContent; SetExtractor/initExtractors
  2. doc_extractors_test.go: 34 tests, 0 failures
  3. ✅ Updated commands.go: ragIngestFile uses doc extractors to populate Docs field
  4. ✅ Updated language_registry.go: init() calls initExtractors
  5. ✅ Updated code_chunkers.go: fixed flushCurrent nil-access bug; fixed extractPascalSymbol/extractOberonSymbol leading-whitespace handling; fixed classifyC pointer return type handling
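
A minimal sketch of documentation extraction for C-style line comments follows. It assumes the documentation for a symbol is the contiguous run of "//" lines directly above its declaration; the function name is hypothetical, block comments are not handled, and the real CDocExtractor may apply different rules.

```go
package main

import "strings"

// docAboveSketch returns the contiguous run of "//" comment lines directly
// above declLine (0-based index into src's lines), with markers stripped.
// A blank line between the comment and the declaration ends the run.
func docAboveSketch(src string, declLine int) string {
	lines := strings.Split(src, "\n")
	if declLine > len(lines) {
		declLine = len(lines)
	}
	var doc []string
	for i := declLine - 1; i >= 0; i-- {
		trimmed := strings.TrimSpace(lines[i])
		if !strings.HasPrefix(trimmed, "//") {
			break
		}
		text := strings.TrimSpace(strings.TrimPrefix(trimmed, "//"))
		// Prepend so the result keeps source order.
		doc = append([]string{text}, doc...)
	}
	return strings.Join(doc, "\n")
}
```

The returned text is the kind of value that could populate a chunk's Docs field during ingestion.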

Success Criteria


Phase 5: Syntax Highlighting

Duration: 1 week
Priority: Medium
Dependencies: Phases 1-2 complete

Objectives

Tasks

Task 5.1: Implement SyntaxHighlighter Interface

File: syntax_highlighters.go (new)

Task 5.2: Create Base Highlighter

File: syntax_highlighters.go

Task 5.3: Implement Language-Specific Highlighters

File: syntax_highlighters.go

Task 5.4: Integrate with Terminal Output

Files: terminal.go (modified), syntax_highlighters.go (new)

Task 5.5: Add Configuration

File: config.go (modified)

Task 5.6: Add Unit Tests

File: syntax_highlighters_test.go (new)

Deliverables

  1. syntax_highlighters.go: TerminalHighlighter, 13 language specs, highlightCodeBlocks, initHighlighters
  2. syntax_highlighters_test.go: 30 tests; all pass
  3. terminal.go: highlightCodeBlocks applied before display
  4. config.go: SyntaxHighlight bool with YAML load/save
  5. language_registry.go: init() calls initHighlighters
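
The core of a terminal highlighter can be sketched as a whole-word keyword match wrapped in ANSI escape codes. Everything here is illustrative: the function names, the keyword list, and the colour are assumptions, and the sketch makes no attempt to skip keywords inside strings or comments, which a real language spec would need to do.

```go
package main

import (
	"regexp"
	"strings"
)

const (
	ansiKeyword = "\x1b[1;34m" // bold blue; the colour choice is arbitrary
	ansiReset   = "\x1b[0m"
)

// keywordPattern builds a whole-word matcher for the given keywords.
func keywordPattern(keywords []string) *regexp.Regexp {
	quoted := make([]string, len(keywords))
	for i, kw := range keywords {
		quoted[i] = regexp.QuoteMeta(kw)
	}
	return regexp.MustCompile(`\b(?:` + strings.Join(quoted, "|") + `)\b`)
}

// cKeywords covers only a handful of C keywords for the example.
var cKeywords = keywordPattern([]string{
	"int", "return", "if", "else", "while", "for", "struct", "void",
})

// highlightKeywordsSketch colours C keywords when enabled is true and
// returns the code unchanged otherwise, mirroring an on/off config flag.
func highlightKeywordsSketch(code string, enabled bool) string {
	if !enabled {
		return code
	}
	return cKeywords.ReplaceAllString(code, ansiKeyword+"${0}"+ansiReset)
}
```

The word-boundary anchors keep identifiers such as printf from being coloured just because they contain "int".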

Success Criteria


Phase 6: Code Formatting & Final Integration

Duration: 1-2 weeks
Priority: Medium
Dependencies: Phases 1-2 complete

Objectives

Tasks

Task 6.1: Implement CodeFormatter Interface

File: code_formatters.go (new)

Task 6.2a: Implement Pipe-Mode External Formatters

File: code_formatters.go

Task 6.2b: Implement File-Mode External Formatters

File: code_formatters.go

Task 6.3: Implement Built-in Formatters

File: code_formatters.go

Task 6.4: Integrate with write_file Tool

File: builtin_tools.go (modified)

Task 6.5: Add Configuration

Files: config.go (modified), commands.go (modified)

Task 6.6: Add Unit Tests

File: code_formatters_test.go (new)

Deliverables

  1. code_formatters.go: PipeExternalFormatter, FileExternalFormatter, BuiltinFormatter, normaliseText, SetFormatter, initFormatters
  2. code_formatters_test.go: 33 tests; all pass
  3. builtin_tools.go: applyAutoFormat wired into write_file
  4. config.go: AutoFormat bool with YAML load/save
  5. commands.go: /format FILE [FILE...] command
  6. language_registry.go: init() calls initFormatters
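
The two formatter styles can be sketched as below. Both functions are hypothetical stand-ins: the built-in one assumes that normalising means trimming trailing whitespace and ending the file with a single newline, which this plan does not confirm, and the pipe-mode one assumes that a failing external tool should leave the source untouched.

```go
package main

import (
	"bytes"
	"os/exec"
	"strings"
)

// normaliseTextSketch converts CRLF to LF, trims trailing spaces and tabs
// on each line, and ends non-empty text with exactly one newline.
func normaliseTextSketch(src string) string {
	lines := strings.Split(strings.ReplaceAll(src, "\r\n", "\n"), "\n")
	for i, line := range lines {
		lines[i] = strings.TrimRight(line, " \t")
	}
	body := strings.TrimRight(strings.Join(lines, "\n"), "\n")
	if body == "" {
		return ""
	}
	return body + "\n"
}

// formatViaPipeSketch feeds src to an external command on stdin and returns
// its stdout. On any error the original source is returned with the error,
// so a missing or failing tool never corrupts the file being written.
func formatViaPipeSketch(src, command string, args ...string) (string, error) {
	cmd := exec.Command(command, args...)
	cmd.Stdin = strings.NewReader(src)
	var out bytes.Buffer
	cmd.Stdout = &out
	if err := cmd.Run(); err != nil {
		return src, err
	}
	return out.String(), nil
}
```

For example, formatViaPipeSketch(code, "clang-format", "-") would pass the code through clang-format when it is installed and fall back to the unformatted code when it is not.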

Success Criteria


Cross-Cutting Tasks

Testing Infrastructure

Task T.1: Create Test Data

Directory: testdata/language_support/

Task T.2: Integration Tests

File: language_integration_test.go (new)

Task T.3: Benchmark Tests

File: language_benchmark_test.go (new)

Documentation

Task D.1: Update User Documentation

File: Using_RAGs_with_Harvey.md (modified)

Task D.2: Create Language-Specific Guides

File: RAG_Language_Support.md (new)

Task D.3: Update Help Text

File: helptext.go (modified)

Task D.4: Update Architecture Documentation

File: ARCHITECTURE.md (modified)


Milestone Summary

Milestone 1: Foundation Complete (Week 2-3)

Includes: Phases 1-2
Deliverables:
  • Language registry with all 17 languages
  • Language detection by extension and content
  • Updated looksLikePath using registry
  • Comprehensive unit tests

Success Criteria:
  • All languages detected correctly
  • Registry functional and tested
  • No regressions in existing functionality


Milestone 2: Code-Aware RAG (Week 4-6)

Includes: Phase 3
Deliverables:
  • Code-aware chunkers for all programming languages
  • Integration with RAG ingestion
  • Improved retrieval quality for code

Success Criteria:
  • Code structures preserved in chunks
  • Retrieval quality improved (measurable)
  • All existing RAG functionality preserved


Milestone 3: Enhanced Experience (Week 7-9)

Includes: Phases 4-6
Deliverables:
  • Documentation extraction
  • Syntax highlighting
  • Auto-formatting
  • Full integration

Success Criteria:
  • Documentation extracted and associated with code
  • Code blocks colorized in terminal
  • Auto-formatting works when enabled
  • All features configurable


Milestone 4: Testing & Polish (Week 10)

Includes: Cross-cutting tasks
Deliverables:
  • Comprehensive test suite
  • Updated documentation
  • Performance benchmarks
  • Bug fixes and polish

Success Criteria:
  • All tests passing
  • Documentation complete
  • Performance acceptable
  • Ready for release


Resource Requirements

Human Resources

Time Estimates

| Phase | Duration | Person-Days |
|---|---|---|
| Phase 1 | 1-2 weeks | 10-20 |
| Phase 2 | 1 week | 5-10 |
| Phase 3 | 2-3 weeks | 15-30 |
| Phase 4 | 1-2 weeks | 10-20 |
| Phase 5 | 1 week | 5-10 |
| Phase 6 | 1-2 weeks | 10-20 |
| Cross-cutting | 2 weeks | 20-30 |
| Total | 10-14 weeks | 75-150 |

External Dependencies

| Dependency | Purpose | License | Notes |
|---|---|---|---|
| clang-format | C/C++ formatting | Apache 2.0 | Optional |
| black | Python formatting | MIT | Optional |
| prettier | JS/TS formatting | MIT | Optional |
| rustfmt | Rust formatting | Apache 2.0/MIT | Optional |
| sly | Lisp formatting | MIT | Optional |

Risk Management

Technical Risks

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Chunker bugs break code across chunks | Medium | High | Extensive testing, fallback to generic chunking |
| Performance regression | Low | Medium | Benchmark before/after, optimize if needed |
| Embedding model limitations | Medium | Medium | Test with multiple models, document limitations |
| Memory usage increase | Medium | Medium | Profile memory, optimize data structures |
| Backward compatibility issues | Low | High | Maintain generic chunking as fallback, migration guide |
| External formatter dependencies | Low | Medium | Use built-in fallbacks, document requirements |

Schedule Risks

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Scope creep | Medium | Medium | Strict phase definitions, defer nice-to-haves |
| Resource availability | Medium | High | Prioritize critical features, defer optional |
| Testing complexity | Medium | Medium | Automate testing, create good test data |
| Integration issues | Medium | Medium | Early integration testing, continuous integration |

Quality Risks

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Bugs in production | Medium | High | Comprehensive testing, code reviews |
| Poor user experience | Medium | Medium | User testing, iterate on feedback |
| Incomplete documentation | Medium | Medium | Documentation as part of each task |
| Performance issues | Low | Medium | Performance testing, profiling |

Monitoring & Metrics

Implementation Metrics

  1. Code Coverage: Target > 80% for new code
  2. Test Pass Rate: Target 100% for all tests
  3. Build Success: All builds must pass
  4. Performance: No >10% regression in critical paths

Quality Metrics

  1. RAG Retrieval Quality:
    • Measure precision/recall for code queries
    • Compare code-aware vs. generic chunking
    • Target: 20%+ improvement over generic
  2. Chunk Quality:
    • % of functions not split across chunks: Target > 95%
    • % of documentation extracted: Target > 90%
  3. User Satisfaction:
    • Feedback on syntax highlighting
    • Feedback on auto-formatting
    • Bug reports and feature requests
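
The retrieval-quality comparison above could be computed along these lines. This is a sketch under stated assumptions: chunks are identified by string IDs, the relevant set for each query is labelled by hand, and the 20% target is read as a relative improvement over the generic-chunking baseline. The function names are hypothetical.

```go
package main

// precisionRecallSketch compares retrieved chunk IDs against the set of
// IDs judged relevant for a query. Duplicate retrieved IDs count once.
func precisionRecallSketch(retrieved, relevant []string) (precision, recall float64) {
	rel := make(map[string]bool, len(relevant))
	for _, id := range relevant {
		rel[id] = true
	}
	seen := make(map[string]bool, len(retrieved))
	hits := 0
	for _, id := range retrieved {
		if seen[id] {
			continue
		}
		seen[id] = true
		if rel[id] {
			hits++
		}
	}
	if len(seen) > 0 {
		precision = float64(hits) / float64(len(seen))
	}
	if len(rel) > 0 {
		recall = float64(hits) / float64(len(rel))
	}
	return precision, recall
}

// relativeImprovementSketch returns (candidate-baseline)/baseline,
// or 0 when the baseline is 0 and the ratio is undefined.
func relativeImprovementSketch(baseline, candidate float64) float64 {
	if baseline == 0 {
		return 0
	}
	return (candidate - baseline) / baseline
}
```

Under this reading, a generic-chunking precision of 0.5 and a code-aware precision of 0.6 would be a relative improvement of about 0.2, which sits right at the target.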

Performance Metrics

  1. Startup Time: < 100ms increase
  2. Chunking Time: < 100ms per file for typical sizes
  3. Formatting Time: < 500ms per file for typical sizes
  4. Memory Usage: < 10MB additional memory

Communication Plan

Stakeholders

Communication Channels

  1. GitHub Issues: Feature tracking, bug reports
  2. Discussions: Design discussions, feedback
  3. Changelog: Release notes, new features
  4. Documentation: Updated docs with new features

Key Messages

  1. Phase 1 Complete: “Harvey now has a language registry supporting all file types”
  2. Phase 3 Complete: “Code-aware RAG chunking improves code search quality”
  3. Phase 5 Complete: “Syntax highlighting and auto-formatting now available”
  4. Final Release: “Comprehensive programming language support in Harvey”

Contingency Plans

If Behind Schedule

  1. Prioritize: Focus on Phases 1-3 (registry, detection, chunking)
  2. Defer: Syntax highlighting and formatting can be deferred
  3. Simplify: Reduce scope (fewer languages, simpler chunkers)
  4. Parallelize: More developers on independent tasks

If Quality Issues

  1. Stop: Halt development, fix critical issues
  2. Isolate: Identify problematic components
  3. Rollback: Revert to last known good state if necessary
  4. Test: Add more tests to prevent regression

If Resource Constraints

  1. Reduce Scope: Implement for fewer languages initially
  2. Simplify: Use simpler implementations (regex-based instead of parser)
  3. Externalize: Defer some formatters to external tools
  4. Community: Seek community contributions

Approval & Sign-off

Reviewers

Approval Checklist


Appendix A: File Changes Summary

New Files

| File | Phase | Size (est.) | Purpose |
|---|---|---|---|
| language_registry.go | 1 | ~500 lines | Language registry and metadata |
| language_registry_test.go | 1 | ~300 lines | Registry tests |
| language_detector.go | 2 | ~400 lines | Language detection |
| language_detector_test.go | 2 | ~250 lines | Detection tests |
| code_chunkers.go | 3 | ~800 lines | Code-aware chunkers |
| code_chunkers_test.go | 3 | ~500 lines | Chunker tests |
| doc_extractors.go | 4 | ~600 lines | Documentation extractors |
| doc_extractors_test.go | 4 | ~400 lines | Extractor tests |
| syntax_highlighters.go | 5 | ~600 lines | Syntax highlighting |
| syntax_highlighters_test.go | 5 | ~400 lines | Highlighter tests |
| code_formatters.go | 6 | ~500 lines | Code formatters |
| code_formatters_test.go | 6 | ~300 lines | Formatter tests |
| language_integration_test.go | T | ~400 lines | Integration tests |
| language_benchmark_test.go | T | ~200 lines | Benchmark tests |
| RAG_Language_Support.md | D | ~500 lines | User documentation |
| Total | | ~7,000 lines | |

Modified Files

| File | Phase | Changes | Impact |
|---|---|---|---|
| commands.go | 1, 3, 5 | Add registry usage, update chunking, add formatting | Core |
| config.go | 5, 6 | Add language settings, formatter config | Core |
| builtin_tools.go | 6 | Add auto-formatting to write_file | Core |
| terminal.go | 5 | Add syntax highlighting | UI |
| codeblock.go | 3 | Extend for language metadata | Core |
| harvey.go | 1 | Initialize registry | Core |
| Using_RAGs_with_Harvey.md | D | Update with new features | Docs |
| ARCHITECTURE.md | D | Update with new components | Docs |
| helptext.go | D | Update help text | UI |

Appendix B: Test Data Requirements

Sample Files Needed

testdata/language_support/
├── c/
│   ├── functions.c
│   ├── structures.c
│   ├── preprocessor.c
│   └── complex.c
├── cpp/
│   ├── classes.cpp
│   ├── templates.cpp
│   └── inheritance.cpp
├── pascal/
│   ├── procedures.pas
│   ├── types.pas
│   └── units.pas
├── oberon/
│   ├── module.Mod
│   └── procedures.Mod
├── lisp/
│   ├── functions.lisp
│   ├── macros.lisp
│   └── classes.lisp
├── basic/
│   ├── subroutines.bas
│   └── functions.bas
└── expected/
    ├── c_chunks.json
    ├── pascal_chunks.json
    └── ...

Test File Sizes


Appendix C: Configuration Examples

agents/harvey.yaml

# Language support configuration
language:
  # Enable auto-formatting on file write
  auto_format: true
  
  # Enable syntax highlighting in terminal
  syntax_highlight: true
  
  # Per-language settings
  languages:
    c:
      enabled: true
      formatter: "clang-format"
      formatter_args: ["-"]             # stdin mode; clang-format - reads from stdin
      formatter_mode: pipe              # pipe (default) or file
      chunking: "function"
      
    cpp:
      enabled: true
      formatter: "clang-format"
      formatter_args: ["-style=google", "-"]
      formatter_mode: pipe
      
    pascal:
      enabled: true
      formatter: "builtin"             # built-in Go formatter, no subprocess
      formatter_mode: pipe             # built-in always uses pipe mode
      
    oberon:
      enabled: true
      formatter: "builtin"
      formatter_mode: pipe
      # Example of a hypothetical file-mode-only formatter:
      # formatter: "oberon-format"
      # formatter_mode: file           # requires safe_mode: false in harvey.yaml
      
    lisp:
      enabled: true
      formatter: "builtin"  # or "sly" if installed
      
    basic:
      enabled: true
      formatter: "builtin"
      
    # Existing languages
    go:
      enabled: true
      formatter: "gofmt"
      
    python:
      enabled: true
      formatter: "black"
      
    javascript:
      enabled: true
      formatter: "prettier"
      formatter_args: ["--tab-width=2", "--single-quote"]
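
The per-language keys above could map onto Go types roughly as follows. The struct and function names are hypothetical; only the yaml tags are taken from the configuration. The sketch encodes two rules stated in the comments of the example: formatter_mode defaults to pipe, and file mode requires safe_mode to be false. How Harvey actually enforces these rules is not shown in this plan.

```go
package main

// LangSettings mirrors a few per-language keys from the YAML example.
// The struct and field names are hypothetical; the tags match the keys.
type LangSettings struct {
	Enabled       bool     `yaml:"enabled"`
	Formatter     string   `yaml:"formatter"`
	FormatterArgs []string `yaml:"formatter_args"`
	FormatterMode string   `yaml:"formatter_mode"`
}

// effectiveMode applies the documented default: pipe when no mode is set.
func effectiveMode(s LangSettings) string {
	if s.FormatterMode == "" {
		return "pipe"
	}
	return s.FormatterMode
}

// formatterAllowed reports whether the configured formatter may run.
// File mode is refused while safe mode is on.
func formatterAllowed(s LangSettings, safeMode bool) bool {
	if !s.Enabled {
		return false
	}
	switch effectiveMode(s) {
	case "pipe":
		return true
	case "file":
		return !safeMode
	default:
		return false // unknown mode
	}
}
```

With these rules, the lisp entry, which sets no formatter_mode, would run in pipe mode, and the commented-out oberon-format example would be refused until safe mode is turned off.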

Appendix D: Command Examples

New Commands

# Enable/disable auto-formatting
harvey> /config set language.auto_format true
harvey> /config set language.auto_format false

# Set formatter for a language
harvey> /config set language.c.formatter clang-format
harvey> /config set language.c.formatter_args "-style=llvm"

# Enable/disable syntax highlighting
harvey> /config set language.syntax_highlight true

# Manually format a file
harvey> /format path/to/file.c

# Show supported languages
harvey> /languages list

# Show language info
harvey> /languages info c

# Test highlighting
harvey> /highlight c path/to/file.c
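
As an illustration of the /format FILE [FILE...] form, the sketch below splits a command line into its file arguments. The function name is hypothetical, the input is assumed to be the text after the harvey> prompt, and paths containing spaces are not supported because the line is split on whitespace.

```go
package main

import "strings"

// parseFormatCommandSketch returns the file arguments of a /format command.
// ok is false when the line is not a /format command or names no files.
func parseFormatCommandSketch(line string) (files []string, ok bool) {
	fields := strings.Fields(line)
	if len(fields) < 2 || fields[0] != "/format" {
		return nil, false
	}
	return fields[1:], true
}
```

Each returned path could then be passed to whichever formatter is registered for its language.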

This plan is a living document. It will be updated as implementation progresses and as new requirements or constraints emerge.