# doc-harvester — Build offline doc libraries for coding agents
Just published doc-harvester, an open-source tool for building complete offline programming documentation libraries.
## The Problem
Naive scraping of modern doc sites silently loses ~25% of content:
- Auto-generated API refs (Rust std, Javadoc, godoc) return overview pages only
- SPA sites return bootstrap HTML for every URL
- Cloudflare-protected sites return challenge pages
- Portal pages return link hubs with no real content
## The Solution
doc-harvester classifies every documentation source into one of 7 acquisition buckets and fetches each with the right tool:
| Bucket | Example | Tool |
|---|---|---|
| Auto-gen API | Rust std, Javadoc | Local doc generator |
| SPA with repo | TypeScript Handbook, Swift Book | git clone --depth 1 |
| SPA with bundle | PHP manual, Python docs | Publisher's offline archive |
| Bot-protected | cppreference (Cloudflare) | Official offline archive |
| Portal pages | isocpp.org, kotlinlang.org/docs | Skip — resolve to real source |
| Static HTML | PEPs, Effective Go, style guides | wget --mirror |
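In the manifest, each source carries its bucket so the fetcher can dispatch it to the right tool. A sketch of what two entries might look like — the field names here are illustrative assumptions, not necessarily doc-harvester's actual schema (only the `2-spa-repo` bucket ID is taken from the Quick Start below):

```yaml
# Hypothetical sources.yaml entries; field names are assumptions.
- name: typescript-handbook
  language: typescript
  bucket: 2-spa-repo            # SPA site, but the source lives in a git repo
  repo: https://github.com/microsoft/TypeScript-Website.git

- name: python-docs
  language: python
  bucket: spa-bundle            # assumed ID: publisher ships an offline archive
  archive: https://docs.python.org/3/archives/python-3.12-docs-html.zip
```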
## What's Included
`sources.yaml` ships with 121 pre-configured sources across ~40 languages:
- Systems: Rust, C, C++, Zig, Nim
- Web: JavaScript, TypeScript, HTML/CSS, Svelte, Dart, Flutter
- Backend: Python, Java, Go, C#, PHP, Ruby, Elixir
- JVM/FP: Kotlin, Scala, Clojure, OCaml, Haskell, Julia
- Mobile: Swift, Kotlin, Dart/Flutter
- Shell: Bash, Zsh, Fish, PowerShell
- Data: SQL (PostgreSQL, MySQL, SQLite, T-SQL), R
- Infra: Docker, Terraform, CMake, Nix
- Specs: GraphQL, Protobuf, OpenAPI, JSON Schema, TOML, YAML
- Cross-cutting: MDN, Google Style Guides, Awesome Lists
Just uncomment the sources you want and run the fetcher.
## Quick Start
```shell
# Clone the repo
git clone https://github.com/tuxclaw/doc-harvester.git
cd doc-harvester

# Validate the manifest
python3 scripts/fetch.py --manifest sources.yaml --validate-only

# Dry run (see what would be fetched)
python3 scripts/fetch.py --manifest sources.yaml --out ./docs --dry-run

# Fetch everything
python3 scripts/fetch.py --manifest sources.yaml --out ./docs

# Fetch one bucket (e.g. all git repos)
python3 scripts/fetch.py --manifest sources.yaml --out ./docs --bucket 2-spa-repo

# Fetch one language
python3 scripts/fetch.py --manifest sources.yaml --out ./docs --language rust

# Verify results
python3 scripts/verify.py --out ./docs
```
## Output Structure
```
docs/
├── _status.json       # per-source fetch results
├── tier1-git/         # cloned markdown repos (refresh with git pull)
├── tier2-bundles/     # downloaded archives (refresh on release)
└── tier3-generated/   # locally generated from toolchain
```
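The per-source results in `_status.json` make it easy to script follow-ups, such as listing which sources need a retry. A minimal sketch, assuming each entry maps a source name to a record with a `status` field — the actual schema written by `fetch.py` may differ:

```python
import json
from pathlib import Path

def failed_sources(status_path):
    """Return sorted names of sources whose fetch did not succeed.

    Assumes _status.json maps source name -> {"status": ..., ...};
    this schema is an assumption, not doc-harvester's documented format.
    """
    records = json.loads(Path(status_path).read_text())
    return sorted(name for name, rec in records.items()
                  if rec.get("status") != "ok")
```

You could then re-run the fetcher for just the languages or buckets those sources belong to.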
## Why This Matters for Coding Agents
Agents with relevant documentation in their context produce better code. But most doc libraries are either hand-curated (incomplete) or scraped (broken). doc-harvester avoids both failure modes by matching each source type with the right acquisition method: git repos for SPAs, offline bundles for bot-protected sites, local generators for API refs.
Built as an OpenClaw skill but runs anywhere with Python 3.8+, git, curl, and wget.
## Links
- GitHub: https://github.com/tuxclaw/doc-harvester
- License: MIT
Would love feedback on the bucket classification system — especially if you've found doc sources that don't fit neatly into one of the 7 buckets.