An LLM-maintained registry for agentic skills — where skills live as markdown, dedup is automatic, and every merge is auditable. The single most important idea: the registry page and the deployed skill are two different artifacts, kept separate by design.
Walk through the central design decision, the operating loop, the scoring formulas, and merge anatomy.
Skills start life as ordinary markdown files. As the store grows, you want rich metadata — provenance, overlap scores, status, relationships — to maintain them. But you also want the deployed artifact to stay lean: just a name and a description the runtime can read. The solution is to keep two artifacts and a build step in between.
The registry page is the maintainer LLM's working artifact: thick with tabs, tags, log entries, and provenance. The deployed skill is the shipped artifact: three runtime fields and nothing else. Both refer to the same skill; only one is shipped.
Claude Code / Codex-style agents read skill descriptions at runtime
to decide which to invoke — there is no embedding classifier doing the routing.
The Agent Skills spec has no triggers field.
Structured triggers live in the registry for overlap analysis only, and get
distilled into the description at build time. Don't ship a
triggers: block into a deployed skill — it's non-standard
and won't be read.
The platform listing trims descriptions hard. The skill catalog your agent reads at runtime is tight on budget. That's exactly why dedup matters: two skills with near-identical descriptions create genuine ambiguity at routing time.
Without the split, either you pollute deployed skills with maintenance metadata (the runtime ignores it but the file gets cluttered), or you starve the maintainer LLM of the rich signal it needs to deduplicate. The two-artifact design gives each consumer exactly what it consumes.
The store splits into three physical zones. Each zone has a single editor and a single set of rules. Keeping them separate is what makes the whole system auditable.
source.yaml: origin, commit, license, fetch date
original/: unchanged SKILL.md + scripts + resources
hashes.txt: content hashes for idempotency
Invariant: append-only. The LLM never edits anything in here.
A changed source creates a new source-id, never overwrites the old one.
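The append-only rule and the role of hashes.txt can be sketched as follows. This is a minimal illustration, not the skillstore implementation: the function names, the in-memory `known` map standing in for hashes.txt, and the `<name>-<hash prefix>` id scheme are all assumptions.

```python
import hashlib


def content_hash(files: dict[str, bytes]) -> str:
    """Hash a source's files in sorted path order so the digest is stable."""
    digest = hashlib.sha256()
    for path in sorted(files):
        data = files[path]
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")                        # separate path from content
        digest.update(len(data).to_bytes(8, "big"))  # length-prefix the content
        digest.update(data)
    return digest.hexdigest()


def ingest_source(known: dict[str, str], name: str,
                  files: dict[str, bytes]) -> tuple[str, bool]:
    """Return (source_id, created). `known` maps content hash -> source-id."""
    h = content_hash(files)
    if h in known:
        return known[h], False        # same bytes: re-ingest is a no-op
    source_id = f"{name}-{h[:12]}"    # hypothetical id scheme
    known[h] = source_id              # append only; old entries never change
    return source_id, True
```

Re-ingesting identical bytes returns the existing source-id, while any change to the content yields a new id and leaves the earlier entry untouched.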
One canonical page per skill, plus comparisons/, merges/, and deprecated/.
Pages validate against schema/skill.schema.json and follow a fixed body template.
Invariant: this is the only place the maintainer LLM edits content.
Every page must link back to its source-id(s) in ## Provenance.
Spec-compliant SKILL.md with only the runtime fields the agent consumes:
name, description, optional allowed-tools. Plus scripts/ and resources/.
Invariant: rebuilt by skillstore build.
Never hand-edited. Stripped at build: slug, version, status, domains, tags, triggers, anti_triggers,
overlap, provenance, supersedes, superseded_by, created, updated.
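A minimal sketch of the stripping step, assuming the registry frontmatter has already been parsed into a dict. The function name `build_frontmatter` is hypothetical; the field names come from the lists above.

```python
RUNTIME_FIELDS = ("name", "description", "allowed-tools")


def build_frontmatter(registry_fm: dict) -> dict:
    """Keep only the runtime fields; everything else is registry-only."""
    missing = [k for k in ("name", "description") if not registry_fm.get(k)]
    if missing:
        raise ValueError(f"missing required runtime fields: {missing}")
    # Allow-list rather than block-list: a registry field added later
    # cannot leak into dist/ by accident.
    return {k: registry_fm[k] for k in RUNTIME_FIELDS if k in registry_fm}
```

Using an allow-list instead of enumerating the stripped fields is a design choice of this sketch: the deployed file stays lean even if the registry schema grows.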
The skillstore CLI: ingest, compare, merge, lint, build, search, graph. Plus the schema and templates under schema/.
AGENTS.md is the configuration that turns a generic chatbot
into a disciplined skill librarian. The rules are short and non-negotiable.
Most of them are about what the LLM must not do.
raw/: Immutable provenance. A changed source gets a new source-id.
dist/: Always regenerated by skillstore build.
Inspect first; scan for secrets and destructive commands; sandbox before any run.
log.md: One dated line per operation. Greppable by prefix: INGEST, COMPARE, MERGE, LINT, BUILD.
skill.schema.json: On validation failure, keep the page as a draft and emit lint errors; never discard.
Every skill links back to its source-id(s) in a ## Provenance section.
Merges are reversible. Originals move to registry/deprecated/ with a stub.
The selection-critical field. Must state:
# good
Fill, flatten, and validate AcroForm/XFA PDF forms. Use when the user asks to populate, complete, or auto-fill PDF forms.
# bad
Helps with PDFs.
Triggers exist for overlap analysis, not for the deployed SKILL.md.
"coding" or "help user"
triggers:
  - id: fill-form
    intent: "populate a PDF form with data"
    keywords: [fill, complete, populate, autofill, form]
anti_triggers:
  - "extracting text from a PDF without filling fields"
overlap ≥ 0.80 → propose merge
0.55–0.79 AND trigger_sim ≥ 0.80 → propose merge
(routing collision)
< 0.55 → keep separate
conflict_score ≥ 0.45 → BLOCK auto-merge
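The threshold rules above can be read as a small decision function. This is a sketch under stated assumptions: the rules do not say what happens when overlap falls in the middle band but trigger_sim is below 0.80, so the `needs-review` verdict is invented here, as are the `propose-merge` label and the `Decision` type. The middle band is also treated as continuous up to 0.80.

```python
from typing import NamedTuple


class Decision(NamedTuple):
    verdict: str               # "propose-merge", "needs-review", "keep-separate"
    auto_merge_allowed: bool   # False whenever conflict blocks an automatic merge


def decide(overlap: float, trigger_sim: float, conflict: float) -> Decision:
    if overlap >= 0.80:
        verdict = "propose-merge"
    elif overlap >= 0.55:
        # Middle band: only a routing collision justifies a merge proposal.
        # The fallback verdict is an assumption of this sketch.
        verdict = "propose-merge" if trigger_sim >= 0.80 else "needs-review"
    else:
        verdict = "keep-separate"
    # conflict_score >= 0.45 blocks auto-merge; the proposal itself can stand.
    auto = verdict == "propose-merge" and conflict < 0.45
    return Decision(verdict, auto)
```

A pair scoring overlap 0.60 with trigger_sim 0.85 and conflict 0.50 is still proposed for merge, but the merge cannot proceed automatically.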
Any new domain, tag, or trigger verb
must be added to schema/taxonomy.yaml before it can appear in a registry page.
This prevents uncontrolled vocabulary drift as the store grows.
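A minimal sketch of the vocabulary check, assuming schema/taxonomy.yaml has been loaded into a dict of sets keyed by field name. The function name and the dict layout are hypothetical; trigger verbs could be checked the same way once their location in the page is fixed.

```python
def unknown_terms(page: dict, taxonomy: dict[str, set[str]]) -> dict[str, list[str]]:
    """Return the domains and tags used by a page that the taxonomy lacks."""
    problems: dict[str, list[str]] = {}
    for field in ("domains", "tags"):
        used = set(page.get(field, []))
        extra = sorted(used - taxonomy.get(field, set()))
        if extra:
            problems[field] = extra
    return problems
```

An empty result means the page uses only controlled vocabulary; anything else names the terms that must be added to the taxonomy first.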
The whole system is one loop. Each station has a clear input, output, and a single responsibility.
Ingest: Capture raw, normalize, schema-validate.
Compare: Score overlap & conflict. Cluster.
Merge: Consolidate overlapping skills explicitly.
Lint: Schema, descriptions, dead refs, conflicts.
Build: Strip registry-only fields → dist/.
Ingest
Source: a repo, a URL, a directory, or pasted prose (-).
hashes.txt for idempotency
raw/sources/<source-id>/ (immutable)
registry/skills/<slug>.md (normalized)
Compare
Pairwise or whole-registry via similarity graph + community detection.
--cluster)
registry/comparisons/<id>.md per cluster
overlap_score, conflict_score, verdict
Merge
Base + conditional branches; explicit conflict resolution; no silent picks.
registry/skills/<slug>.md (bumped major version)
registry/merges/<id>.md with conflict resolutions
registry/deprecated/
Lint
Proposes diffs with --fix; never silently auto-fixes.
log.md
Build
Generate lean SKILL.md per runtime (default: spec-compliant).
status: active only
--runtime codex)
dist/skills/<slug>/SKILL.md per active skill
Every skill that lands in the store travels the same physical path. Raw crates arrive at Ingest, get sorted at Compare, sometimes consolidate at Merge, get inspected at Lint, and exit Build as a finished deployable package.
The five-station structure is also the CLI: skillstore ingest,
compare, merge, lint, build.
Each command is idempotent and emits one log.md entry.
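A sketch of what one log.md entry per command could look like. The operation prefixes come from the rules above; the field order, the ISO date, and the function names are assumptions.

```python
from datetime import date

OPERATIONS = ("INGEST", "COMPARE", "MERGE", "LINT", "BUILD")


def log_line(op: str, detail: str, day: date) -> str:
    """Format one log.md entry; the field order is an assumption."""
    if op not in OPERATIONS:
        raise ValueError(f"unknown operation: {op}")
    if "\n" in detail:
        raise ValueError("a log entry must fit on one line")
    return f"{op} {day.isoformat()} {detail}"


def entries_for(log_text: str, op: str) -> list[str]:
    """Equivalent of grepping log.md for lines that start with a prefix."""
    return [line for line in log_text.splitlines() if line.startswith(op + " ")]
```

Putting the operation first keeps the log greppable by prefix, which is the property the rules ask for.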
pdf-xfa-helper enters the pipeline
A real skill travels through five stations. Each station may produce
multiple registry/ artifacts (compare pages, merge pages, lint reports).
Overlap is computed from six independent signals, each weighted by how much it correlates with actual routing ambiguity. Cached by content hash so re-runs are cheap.
overlap_score = 0.30·desc_sim         # embedding cosine
              + 0.30·trigger_sim      # routing-critical
              + 0.15·instr_sim
              + 0.10·tool_sim         # Jaccard, required > optional
              + 0.10·domain_tag_sim   # domains ∪ tags
              + 0.05·output_sim
Trigger similarity is the routing-critical signal: for each trigger in A, take max similarity to any trigger in B (intent embedding + keyword Jaccard), then average both directions.
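The weighted sum and the two-directional trigger matching can be sketched as below. The weights are taken from the formula; the equal 0.5/0.5 blend of intent similarity and keyword Jaccard is an assumption, since the text does not say how the two are combined, and `intent_sim` is a caller-supplied stand-in for the embedding model.

```python
from typing import Callable

WEIGHTS = {
    "desc_sim": 0.30, "trigger_sim": 0.30, "instr_sim": 0.15,
    "tool_sim": 0.10, "domain_tag_sim": 0.10, "output_sim": 0.05,
}

IntentSim = Callable[[str, str], float]


def jaccard(a: set, b: set) -> float:
    """Set overlap; two empty sets count as no evidence of similarity."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pair_sim(ta: dict, tb: dict, intent_sim: IntentSim) -> float:
    """Blend intent similarity with keyword Jaccard (equal blend is assumed)."""
    kw = jaccard(set(ta["keywords"]), set(tb["keywords"]))
    return 0.5 * intent_sim(ta["intent"], tb["intent"]) + 0.5 * kw


def directed(src: list, dst: list, intent_sim: IntentSim) -> float:
    """For each trigger in src, take its best match in dst, then average."""
    if not src or not dst:
        return 0.0
    best = [max(pair_sim(t, u, intent_sim) for u in dst) for t in src]
    return sum(best) / len(best)


def trigger_sim(a: list, b: list, intent_sim: IntentSim) -> float:
    """Average the A-to-B and B-to-A directions."""
    return 0.5 * (directed(a, b, intent_sim) + directed(b, a, intent_sim))


def overlap_score(signals: dict[str, float]) -> float:
    """Weighted sum of the six signals, each expected in [0, 1]."""
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)
```

As a worked case, signals of 0.9, 0.8, 0.5, 1.0, 1.0, and 0.0 in the order of the formula give 0.27 + 0.24 + 0.075 + 0.10 + 0.10 + 0 = 0.785, which lands in the 0.55–0.79 band.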
Two skills can have high overlap and high conflict — meaning "keep separate or resolve carefully," not "merge freely." Detect conflicts by extracting imperative clauses and flagging opposing pairs.
| Skill A | Skill B | Conflict |
|---|---|---|
| "Always run full suite first" | "Run focused test first" | sequencing |
| "Never edit tests" | "Update tests when behavior changes" | policy |
| "Use grep" | "Use ripgrep only" | toolchain |
The naive way to merge two skills is to paste them together. That creates a vague mega-skill whose description can no longer cleanly signal when to use it — degrading routing for every prompt. The right pattern is a shared default plus conditional branches, with every conflict resolved explicitly.
## Instructions
### Default workflow
(common steps that always run)
### If AcroForm
(branch: handles AcroForm-specific path)
### If XFA
(branch: handles dynamic XFA path)
### Conflict resolutions
(explicit chosen behavior for each opposing imperative we found)
The superset's common path. Every merged skill has one of these.
Each former skill's distinct logic gets its own subsection, gated by an if-condition.
Every opposing imperative is documented here with the chosen behavior + rationale.
The LLM must never silently pick. Every conflict gets one of these, recorded with rationale on the merge page.
One is clearly safer / better, e.g. "always validate".
Both right under different conditions, e.g. "if X, do A; else B".
Turn into an option with a stated default, e.g. flatten=true (default).
The conflict means they shouldn't merge → keep separate.
Anything in a source skill not represented in the merged body must
appear in the merge page's "deliberately dropped" list.
Originals move to registry/deprecated/ with a stub page —
never deleted, so the merge is reversible.
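The bookkeeping that keeps a merge reversible can be sketched over parsed frontmatter. The field names (supersedes, superseded_by, status, provenance.sources) are the registry's own; the function name and the in-memory page map are assumptions, and moving files into registry/deprecated/ is left out.

```python
import copy


def apply_merge(pages: dict[str, dict], merged_slug: str,
                source_slugs: list[str]) -> dict[str, dict]:
    """Return a new page map: merged skill added, originals marked superseded."""
    result = copy.deepcopy(pages)   # never mutate the caller's pages
    sources: list[str] = []
    for slug in source_slugs:
        for source_id in result[slug]["provenance"]["sources"]:
            if source_id not in sources:   # union, preserving first-seen order
                sources.append(source_id)
    result[merged_slug] = {
        "status": "active",
        "supersedes": list(source_slugs),
        "provenance": {"sources": sources},
    }
    for slug in source_slugs:
        # Originals are kept, not deleted, so the merge can be undone.
        result[slug]["status"] = "superseded"
        result[slug]["superseded_by"] = merged_slug
    return result
```

Because the originals survive with a pointer to their successor, undoing the merge only requires dropping the merged page and clearing those two fields.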
The merged skill:
supersedes: [pdf-fill-acroform, pdf-xfa-helper]
provenance.sources from all inputs
The original skills:
status: superseded
superseded_by: pdf-form-filling
registry/deprecated/ with a stub body
Every skill moves through five states.
registry/skills/<slug>.md (not exported to dist/)
skillstore lint actually checks
Lint never silently auto-fixes — it proposes diffs with --fix.
Severity ladder: error blocks CI, warn
should be addressed before merge, info is advisory.
No name or description, or frontmatter fails skill.schema.json.
resources[].path listed but file missing on disk.
Graph traversal over dependencies: finds a cycle or dangling node.
superseded_by points to a skill that no longer exists.
Length > 1024 chars, or "no when-to-use" heuristic triggered.
trigger_sim ≥ 0.9 between two active skills.
Open comparison with verdict ≠ keep-separate.
File in skill dir not listed in resources:.
Opposing imperatives detected in one body.
Destructive command pattern with no human-approval gate nearby.
Tag or domain not in taxonomy.yaml.
No inbound/outbound links and no overlap peers — possible miscategorization.
Listed in frontmatter but not referenced in body or scripts.
last_compared older than N days — re-run compare.
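Three of the checks above, sketched over parsed frontmatter to show how findings and the severity ladder might fit together. The severity assigned to each check here is an assumption, as are the function names and the finding format; the text does not say which check maps to which level.

```python
def lint(pages: dict[str, dict]) -> list[dict]:
    """Run a few illustrative checks; severities are assumed, not specified."""
    findings = []
    for slug, page in sorted(pages.items()):
        if not page.get("name") or not page.get("description"):
            findings.append({"slug": slug, "severity": "error",
                             "check": "missing-required"})
        elif len(page["description"]) > 1024:
            findings.append({"slug": slug, "severity": "warn",
                             "check": "description-too-long"})
        target = page.get("superseded_by")
        if target is not None and target not in pages:
            findings.append({"slug": slug, "severity": "error",
                             "check": "dangling-superseded-by"})
    return findings


def blocks_ci(findings: list[dict]) -> bool:
    """Only error-level findings block CI; warn and info are reported."""
    return any(f["severity"] == "error" for f in findings)
```

The function only reports; consistent with the rule that lint never silently auto-fixes, any repair would be proposed as a diff for review.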
The skill store gives you discipline and auditability — it doesn't give you correctness guarantees. Four limits are worth carrying into any implementation.
Offline overlap metrics approximate runtime routing behavior. Calibrate thresholds against an actual false-merge rate rather than trusting the defaults, and gate autonomous merges behind human approval.
Provenance is preserved, but behavior is assumed unchanged.
Add per-skill smoke tests in tests/ and an eval.md
per skill; run them after every merge.
Not just scripts — untrusted skill instructions can carry
prompt-injection. Treat imported text as untrusted until reviewed.
Scan with shellcheck, ruff, gitleaks.
Agent Skills are designed to work across Codex, Gemini CLI, Cursor, etc. Heavy per-runtime down-conversion is often unnecessary — keep export adapters thin, only diverge where a runtime genuinely demands it.
ingest (immutable raw + normalized registry, auto-compared on entry) →
compare (cluster, flag redundancy/conflict) →
merge (consolidate, resolve explicitly, deprecate originals) →
lint (hygiene) →
build (regenerate lean dist/) — with
index.md and log.md keeping the whole thing navigable and auditable.
raw/ append-only · registry/ LLM-edited · dist/ generated.