Architect Explainer

Building a Skill Store.

An LLM-maintained registry for agentic skills — where skills live as markdown, dedup is automatic, and every merge is auditable. The single most important idea: the registry page and the deployed skill are two different artifacts, kept separate by design.

Walk through the central design decision, the operating loop, the scoring formulas, and merge anatomy.

01 · The central design decision

Registry ≠ Deployed skill

Skills start life as ordinary markdown files. As the store grows, you want rich metadata — provenance, overlap scores, status, relationships — to maintain them. But you also want the deployed artifact to stay lean: just a name and a description the runtime can read. The solution is to keep two artifacts and a build step in between.

📇 registry/skills/<slug>.md   (rich, internal)

Runtime fields (exported):
  name: pdf-form-filling
  description: Fill, flatten, validate…
  allowed-tools: [Bash, Read, Write]

Registry-only fields (stripped):
  slug: pdf-form-filling
  version: 2.0.0
  status: active
  domains: [documents, pdf]
  triggers: [4 structured triggers…]
  overlap: {score: 0.41, …}
  provenance: {sources: [...], …}
  supersedes: [pdf-fill-acroform]

🚀 dist/skills/<slug>/SKILL.md   (lean, runtime)

Runtime fields (exported):
  name: pdf-form-filling
  description: Fill, flatten, validate…
  allowed-tools: [Bash, Read, Write]

Everything else is stripped at build time: slug, version, status, domains, triggers, overlap, provenance, supersedes, superseded_by.

What the split looks like in practice

On the left: the maintainer LLM's working artifact — thick with tabs, tags, log entries, and provenance. On the right: the deployed artifact — three runtime fields and nothing else. Both refer to the same skill; only one is shipped.

[Figure: a thick Registry binder full of tabs and logs next to a slim Deployed system overview card]

Critical constraint

The description is the only routing signal.

Claude Code / Codex-style agents read skill descriptions at runtime to decide which to invoke — there is no embedding classifier doing the routing. The Agent Skills spec has no triggers field. Structured triggers live in the registry for overlap analysis only, and get distilled into the description at build time. Don't ship a triggers: block into a deployed skill — it's non-standard and won't be read.

Why descriptions cap at ~1024 chars

The platform listing trims descriptions hard. The skill catalog your agent reads at runtime is tight on budget. That's exactly why dedup matters: two skills with near-identical descriptions create genuine ambiguity at routing time.

Why this matters at scale

Without the split, either you pollute deployed skills with maintenance metadata (the runtime ignores it but the file gets cluttered), or you starve the maintainer LLM of the rich signal it needs to deduplicate. The two-artifact design gives each consumer exactly what it consumes.

02 · Directory structure

Three zones with strict invariants

The store splits into three physical zones. Each zone has a single editor and a single set of rules. Keeping them separate is what makes the whole system auditable.

raw/
raw/sources/<source-id>/
byte-for-byte provenance
What lives here & who edits
  • source.yaml — origin, commit, license, fetch date
  • original/ — unchanged SKILL.md + scripts + resources
  • hashes.txt — content hashes for idempotency

Invariant: append-only. The LLM never edits anything in here. A changed source creates a new source-id, never overwrites the old one.
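
The idempotency check behind hashes.txt can be sketched as a content hash over a source's files. This is a minimal sketch under assumptions: the function names, the use of SHA-256, and the idea of comparing against a set of known digests are all illustrative, since the explainer does not specify the hashes.txt format.

```python
import hashlib


def content_hash(files: dict[str, bytes]) -> str:
    """Return a stable SHA-256 digest over (relative path, bytes) pairs.

    Paths are sorted so the digest does not depend on traversal order.
    """
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")  # separator so path and content cannot blur together
        digest.update(hashlib.sha256(files[path]).digest())
    return digest.hexdigest()


def is_new_source(files: dict[str, bytes], known_hashes: set[str]) -> bool:
    """True when this exact content has not been ingested before (assumed check)."""
    return content_hash(files) not in known_hashes
```

Under this sketch, a changed file yields a different digest, which is the point at which a new source-id would be created instead of overwriting the old one.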

registry/
registry/skills/<slug>.md
the wiki layer (LLM-edited)
What lives here & who edits

One canonical page per skill, plus comparisons/, merges/, and deprecated/. Pages validate against schema/skill.schema.json and follow a fixed body template.

Invariant: this is the only place the maintainer LLM edits content. Every page must link back to its source-id(s) in ## Provenance.

dist/
dist/skills/<slug>/SKILL.md
generated, deployable
What lives here & who edits

Spec-compliant SKILL.md with only the runtime fields the agent consumes: name, description, optional allowed-tools. Plus scripts/ and resources/.

Invariant: rebuilt by skillstore build. Never hand-edited. Stripped at build: slug, version, status, domains, tags, triggers, anti_triggers, overlap, provenance, supersedes, superseded_by, created, updated.
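
The stripping step can be sketched as an allow-list over the frontmatter. This is an illustrative sketch, not the skillstore implementation: the function name is hypothetical, and the frontmatter is assumed to be already parsed into a dict.

```python
# Fields the runtime consumes, per the dist/ description above.
RUNTIME_FIELDS = ("name", "description", "allowed-tools")


def to_runtime_frontmatter(registry_meta: dict) -> dict:
    """Keep only runtime fields; everything else is dropped (allow-list)."""
    missing = [key for key in ("name", "description") if not registry_meta.get(key)]
    if missing:
        raise ValueError(f"missing required runtime field(s): {missing}")
    # allowed-tools is optional, so it is copied only when present.
    return {key: registry_meta[key] for key in RUNTIME_FIELDS if key in registry_meta}
```

An allow-list is the safer direction here: a registry field added later is stripped by default rather than leaking into dist/ until someone remembers to block it.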

tooling/
tooling/skillstore
CLI entrypoint
What lives here

The skillstore CLI: ingest, compare, merge, lint, build, search, graph. Plus the schema and templates under schema/.

03 · Operating contract

AGENTS.md: turning a chatbot into a disciplined librarian

AGENTS.md is the configuration that turns a generic chatbot into a disciplined skill librarian. The rules are short and non-negotiable. Most of them are about what the LLM must not do.

Golden rules

1. Don't edit raw/

Immutable provenance. A changed source gets a new source-id.

2. Don't edit dist/

Always regenerated by skillstore build.

3. Don't execute imported scripts at ingest

Inspect first; scan for secrets and destructive commands; sandbox before any run.

4. Every state change appends to log.md

One dated line per operation. Greppable by prefix: INGEST, COMPARE, MERGE, LINT, BUILD.

5. Every page validates against skill.schema.json

On failure: keep as draft and emit lint errors; never discard.

6. Preserve provenance

Every skill links back to its source-id(s) in a ## Provenance section.

7. Prefer deprecate over delete

Merges are reversible. Originals move to registry/deprecated/ with a stub.
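
Rule 4 can be sketched as a small formatter for log.md lines. The exact line layout is an assumption (prefix, then ISO date, then free text); the explainer only requires one dated line per operation that is greppable by prefix.

```python
from datetime import date

# Prefixes named in rule 4.
LOG_PREFIXES = {"INGEST", "COMPARE", "MERGE", "LINT", "BUILD"}


def format_log_line(operation: str, detail: str, when: date) -> str:
    """Build a single log.md line; the field order is a hypothetical choice."""
    if operation not in LOG_PREFIXES:
        raise ValueError(f"unknown operation prefix: {operation}")
    if "\n" in detail:
        raise ValueError("a log entry must fit on one line")
    return f"{operation} {when.isoformat()} {detail}"
```

With the prefix first, a command such as grep '^MERGE' log.md would list every merge in order.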

Description field policy

The selection-critical field. Must state:

  • WHEN to use it (the routing signal)
  • WHAT it does
  • Distinctive keywords the model can latch onto
# good
Fill, flatten, and validate AcroForm/XFA PDF forms.
Use when the user asks to populate, complete, or
auto-fill PDF forms.

# bad
Helps with PDFs.
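
A rough version of the description policy can be checked mechanically. The sketch below is a heuristic under assumptions: the regular expression for the "when to use" signal and the message strings are invented for illustration, and only the 1024-character limit comes from this document.

```python
import re

MAX_DESCRIPTION_CHARS = 1024

# Hypothetical heuristic: look for an explicit "when to use" phrase.
WHEN_TO_USE = re.compile(r"\b(use (this )?when|use for|when the user)\b", re.IGNORECASE)


def check_description(text: str) -> list[str]:
    """Return a list of policy problems; an empty list means the check passed."""
    problems = []
    if len(text) > MAX_DESCRIPTION_CHARS:
        problems.append("description exceeds 1024 characters")
    if not WHEN_TO_USE.search(text):
        problems.append("no when-to-use signal found")
    return problems
```

Applied to the two examples above, the good description passes and "Helps with PDFs." is flagged for lacking a when-to-use signal. A phrase match is only a proxy, so a flagged description still deserves a human or LLM read.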

Trigger policy (registry-only)

Triggers exist for overlap analysis, not for the deployed SKILL.md.

  • Prefer specific intent / phrase / file-pattern triggers
  • Include anti-triggers (what this skill is not for)
  • Avoid broad triggers like "coding" or "help user"
triggers:
  - id: fill-form
    intent: "populate a PDF form with data"
    keywords: [fill, complete, populate, autofill, form]

anti_triggers:
  - "extracting text from a PDF without filling fields"

Merge decision policy

overlap ≥ 0.80                             → propose merge
overlap 0.55–0.79 AND trigger_sim ≥ 0.80   → propose merge (routing collision)
overlap < 0.55                             → keep separate
conflict_score ≥ 0.45                      → BLOCK auto-merge
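
The policy table translates almost directly into a function. One case is not covered by the table (overlap between 0.55 and 0.79 with trigger_sim below 0.80); the sketch returns "review" there, which is an assumption rather than part of the stated policy. The function name and return shape are also illustrative.

```python
def merge_verdict(overlap: float, trigger_sim: float, conflict: float) -> tuple[str, bool]:
    """Return (proposal, auto_merge_blocked) for one pair of skills."""
    if overlap >= 0.80:
        proposal = "propose-merge"
    elif overlap >= 0.55 and trigger_sim >= 0.80:
        proposal = "propose-merge"  # routing collision
    elif overlap < 0.55:
        proposal = "keep-separate"
    else:
        # Assumption: the policy table does not name this case.
        proposal = "review"
    # A high conflict score blocks auto-merge regardless of the proposal.
    auto_merge_blocked = conflict >= 0.45
    return proposal, auto_merge_blocked
```

Keeping the conflict gate separate from the proposal reflects the policy's shape: a pair can be a merge candidate and still require a human to resolve it.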

Taxonomy hygiene

Any new domain, tag, or trigger verb must be added to schema/taxonomy.yaml before it can appear in a registry page. This prevents uncontrolled vocabulary drift as the store grows.

04 · The operating loop

ingest → compare → merge → lint → build

The whole system is one loop. Each station has a clear input, a clear output, and a single responsibility. The I/O contract for each station follows the overview.

📥
Station 1

Ingest

Capture raw, normalize, schema-validate.

⚖️
Station 2

Compare

Score overlap & conflict. Cluster.

🧬
Station 3

Merge

Consolidate overlapping skills explicitly.

🔍
Station 4

Lint

Schema, descriptions, dead refs, conflicts.

🚀
Station 5

Build

Strip registry-only fields → dist/.

📥 Ingest — capture & normalize

Source: a repo, a URL, a directory, or pasted prose (-).

Inputs
  • Path / URL / repo / stdin prose
  • Existing hashes.txt for idempotency
Outputs
  • raw/sources/<source-id>/ (immutable)
  • registry/skills/<slug>.md (normalized)
  • Auto-compare against nearest neighbor

⚖️ Compare — score & cluster

Pairwise or whole-registry via similarity graph + community detection.

Inputs
  • Two skills (pairwise) or whole registry (--cluster)
  • Cached embeddings by content hash
Outputs
  • One registry/comparisons/<id>.md per cluster
  • overlap_score, conflict_score, verdict
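
Whole-registry clustering can be approximated without a graph library. The sketch below swaps in connected components over a thresholded similarity graph (via union-find) as a simple stand-in for community detection; it is not the method the store necessarily uses. The function name is hypothetical, and the 0.55 default is taken from the merge-candidate threshold used elsewhere in this explainer.

```python
def cluster_by_overlap(pair_scores: dict[tuple[str, str], float],
                       threshold: float = 0.55) -> list[set[str]]:
    """Group skills linked by any chain of pairs scoring at or above threshold."""
    parent: dict[str, str] = {}

    def find(node: str) -> str:
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for (a, b), score in pair_scores.items():
        root_a, root_b = find(a), find(b)
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b

    groups: dict[str, set[str]] = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    # Singletons are not clusters, so they produce no comparison page.
    return [members for members in groups.values() if len(members) > 1]
```

Connected components can chain loosely related skills into one large cluster, which is one reason a real implementation might prefer proper community detection.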

🧬 Merge — consolidate

Base + conditional branches; explicit conflict resolution; no silent picks.

Inputs
  • Source skills (frozen with version + hash)
  • Comparison verdict (must be merge-recommended)
Outputs
  • New merged registry/skills/<slug>.md (bumped major version)
  • registry/merges/<id>.md with conflict resolutions
  • Originals → registry/deprecated/

🔍 Lint — surface problems

Proposes diffs with --fix; never silently auto-fixes.

Inputs
  • Registry pages, dist artifacts, scripts
Outputs
  • Errors / warnings / info, severity-coded
  • Dated report appended to log.md

🚀 Build — export to runtime

Generate lean SKILL.md per runtime (default: spec-compliant).

Inputs
  • Registry pages with status: active only
  • Target runtime flag (e.g. --runtime codex)
Outputs
  • dist/skills/<slug>/SKILL.md per active skill
  • Deprecated skills removed from dist
[Figure: an assembly line with five labeled stations (Ingest, Compare, Merge, Lint, Build) moving artifacts from input to output]

One pipeline, five stations

Every skill that lands in the store travels the same physical path. Raw crates arrive at Ingest, get sorted at Compare, sometimes consolidate at Merge, get inspected at Lint, and exit Build as a finished deployable package.

The five-station structure is also the CLI: skillstore ingest, compare, merge, lint, build. Each command is idempotent and emits one log.md entry.

Sample run: the skill pdf-xfa-helper enters the pipeline.

  • Ingest: raw/ + registry/ (capture & normalize)
  • Compare: overlap · conflict (score & cluster)
  • Merge: if verdict = merge (consolidate, deprecate)
  • Lint: errors · warnings (hygiene check)
  • Build: dist/skills/<slug>/ (deployable artifact)

A real skill travels through five stations. Each station may produce multiple registry/ artifacts (compare pages, merge pages, lint reports).

05 · How overlap is measured

A weighted composite, not a single number

Overlap is computed from six independent signals, each weighted by how much it correlates with actual routing ambiguity. Scores are cached by content hash, so re-runs are cheap.

The formula

overlap_score = 0.30·desc_sim       # embedding cosine
              + 0.30·trigger_sim    # routing-critical
              + 0.15·instr_sim
              + 0.10·tool_sim        # Jaccard, required > optional
              + 0.10·domain_tag_sim # domains ∪ tags
              + 0.05·output_sim

  • desc_sim (30%): embedding cosine of description
  • trigger_sim (30%): symmetric best-match over triggers
  • instr_sim (15%): normalized instructions body
  • tool_sim (10%): Jaccard of tool sets
  • domain_tag_sim (10%): domains ∪ tags
  • output_sim (5%): declared outputs
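
The weighted sum can be sketched as a small function. The weights come from the formula above; the function name, the dict-based input, and the requirement that each signal lie in [0, 1] are assumptions made for illustration.

```python
# Weights copied from the formula; they sum to 1.0.
WEIGHTS = {
    "desc_sim": 0.30,
    "trigger_sim": 0.30,
    "instr_sim": 0.15,
    "tool_sim": 0.10,
    "domain_tag_sim": 0.10,
    "output_sim": 0.05,
}


def overlap_score(signals: dict[str, float]) -> float:
    """Combine the six similarity signals into one composite score."""
    missing = WEIGHTS.keys() - signals.keys()
    if missing:
        raise ValueError(f"missing signals: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= signals[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    return sum(weight * signals[name] for name, weight in WEIGHTS.items())
```

As a worked example with made-up inputs, signals of 0.9, 0.8, 0.6, 0.5, 1.0, and 0.4 (in the order listed) give 0.27 + 0.24 + 0.09 + 0.05 + 0.10 + 0.02 = 0.77.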

Trigger similarity is the routing-critical signal: for each trigger in A, take max similarity to any trigger in B (intent embedding + keyword Jaccard), then average both directions.
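
The symmetric best-match can be sketched as follows. To stay self-contained, the pairwise similarity here uses keyword Jaccard only; the intent-embedding half, and how the two halves are combined, are not specified in this explainer and are left as a pluggable function. All names are illustrative.

```python
from typing import Callable

Trigger = dict  # assumed shape: {"keywords": [...], ...}


def keyword_jaccard(a: Trigger, b: Trigger) -> float:
    """Jaccard similarity of two triggers' keyword sets."""
    left, right = set(a["keywords"]), set(b["keywords"])
    union = left | right
    return len(left & right) / len(union) if union else 0.0


def _directed(src: list[Trigger], dst: list[Trigger],
              sim: Callable[[Trigger, Trigger], float]) -> float:
    """Average, over src, of each trigger's best match in dst."""
    if not src or not dst:
        return 0.0
    return sum(max(sim(s, d) for d in dst) for s in src) / len(src)


def trigger_sim(a: list[Trigger], b: list[Trigger],
                sim: Callable[[Trigger, Trigger], float] = keyword_jaccard) -> float:
    """Symmetric best-match: average the A-to-B and B-to-A directions."""
    return 0.5 * (_directed(a, b, sim) + _directed(b, a, sim))
```

Averaging both directions matters when one skill is a strict subset of the other: the narrow skill matches perfectly in one direction, but the broad skill's extra triggers pull the score down in the other.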

Threshold explorer

The explorer splits overlap scores into four bands: < 0.55, 0.55–0.69, 0.70–0.84, and ≥ 0.85.

Example reading: 0.72 → verdict: substantial overlap (0.70–0.84) → review, likely merge

Overlap × conflict are orthogonal

Two skills can have high overlap and high conflict — meaning "keep separate or resolve carefully," not "merge freely." Detect conflicts by extracting imperative clauses and flagging opposing pairs.

[Chart: overlap_score (threshold 0.55) plotted against conflict_score (threshold 0.45)]

Quadrants:
  • related, compatible → link, refine triggers
  • related, conflict! → keep separate / resolve
  • unrelated → no action
  • unrelated, conflict! → rare; review

Plotted skills (chart label: merge candidates):
  • pdf-form-filling: overlap 0.41 · conflict 0.18
  • pdf-fill-acroform: overlap 0.84 · conflict 0.32
  • pdf-xfa-helper: overlap 0.79 · conflict 0.51

Conflict extraction example

Skill A                        | Skill B                               | Conflict
"Always run full suite first"  | "Run focused test first"              | sequencing
"Never edit tests"             | "Update tests when behavior changes"  | policy
"Use grep"                     | "Use ripgrep only"                    | toolchain

06 · What a good merge looks like

Section-merge, don't concatenate

The naive way to merge two skills is to paste them together. That creates a vague mega-skill whose description can no longer cleanly signal when to use it — degrading routing for every prompt. The right pattern is a shared default plus conditional branches, with every conflict resolved explicitly.

Instructions structure after merge

## Instructions

### Default workflow
  (common steps that always run)

### If AcroForm
  (branch: handles AcroForm-specific path)

### If XFA
  (branch: handles dynamic XFA path)

### Conflict resolutions
  (explicit chosen behavior for each
   opposing imperative we found)

Shared default

The superset's common path. Every merged skill has one of these.

Conditional branches

Each former skill's distinct logic gets its own subsection, gated by an if-condition.

Conflict resolutions

Every opposing imperative is documented here with the chosen behavior + rationale.

Four conflict resolution strategies

The LLM must never silently pick. Every conflict gets one of these, recorded with rationale on the merge page.

Precedence

One is clearly safer / better.

e.g. "always validate"

Conditionalize

Both right under different conditions.

e.g. "if X, do A; else B"

Parameterize

Turn into an option with a stated default.

e.g. flatten=true (default)

Split

The conflict means they shouldn't merge.

→ keep separate
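
The merged-body structure and the "never silently pick" rule can be sketched together as a renderer that refuses unrecorded resolutions. This is an illustrative sketch: the function name, the dict keys (conflict, chosen, strategy, rationale), and the bullet layout are assumptions, while the section headings follow the template above.

```python
# Strategies that can appear inside a merged body; "split" means do not merge.
MERGE_STRATEGIES = {"precedence", "conditionalize", "parameterize"}


def render_merged_instructions(default_steps: list[str],
                               branches: dict[str, list[str]],
                               resolutions: list[dict]) -> str:
    """Render shared default, conditional branches, and explicit resolutions."""
    lines = ["## Instructions", "", "### Default workflow"]
    lines += [f"{n}. {step}" for n, step in enumerate(default_steps, 1)]
    for condition, steps in branches.items():
        lines += ["", f"### If {condition}"]
        lines += [f"{n}. {step}" for n, step in enumerate(steps, 1)]
    lines += ["", "### Conflict resolutions"]
    for item in resolutions:
        if item["strategy"] == "split":
            raise ValueError("split means the skills should stay separate")
        if item["strategy"] not in MERGE_STRATEGIES:
            raise ValueError(f"unknown strategy: {item['strategy']}")
        if not item["rationale"].strip():
            raise ValueError("every resolution needs a recorded rationale")
        lines.append(f"- {item['conflict']}: {item['chosen']} "
                     f"[{item['strategy']}] {item['rationale']}")
    return "\n".join(lines)
```

Failing loudly on a missing rationale or a split verdict keeps the merge page honest: a merge either records why each choice was made or does not happen.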
Provenance rule

Nothing vanishes silently.

Anything in a source skill not represented in the merged body must appear in the merge page's "deliberately dropped" list. Originals move to registry/deprecated/ with a stub page — never deleted, so the merge is reversible.


What happens to provenance

The merged skill:

  • Bumps major version (2.0.0)
  • Sets supersedes: [pdf-fill-acroform, pdf-xfa-helper]
  • Unions provenance.sources from all inputs

The original skills:

  • Get status: superseded
  • Get superseded_by: pdf-form-filling
  • Move to registry/deprecated/ with a stub body
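
The frontmatter bookkeeping for a merge can be sketched as a pure function over parsed metadata. Two points are assumptions: the new major version is taken as one above the highest major among the inputs, and the function name and return shape are invented for illustration. Moving files into registry/deprecated/ is outside this sketch.

```python
def merge_metadata(merged_slug: str, originals: list[dict]) -> tuple[dict, list[dict]]:
    """Return (merged frontmatter fields, updated copies of the originals)."""
    # Assumption: bump past the highest major version among the inputs.
    next_major = max(int(o["version"].split(".")[0]) for o in originals) + 1
    # Union of provenance sources, de-duplicated and sorted for stable output.
    sources = sorted({s for o in originals for s in o["provenance"]["sources"]})
    merged = {
        "slug": merged_slug,
        "version": f"{next_major}.0.0",
        "supersedes": [o["slug"] for o in originals],
        "provenance": {"sources": sources},
    }
    # Copies, not in-place edits, so the inputs stay untouched.
    retired = [{**o, "status": "superseded", "superseded_by": merged_slug}
               for o in originals]
    return merged, retired
```

Because supersedes and superseded_by point at each other, a lint pass can later verify that both ends of every link still exist.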

Cluster view (graph mode)

[Graph: nodes pdf-form-filling, pdf-extract, pdf-merge, pdf-fill-acroform, pdf-xfa-helper; edge weights shown: 0.41, 0.84, 0.79. Legend: < 0.55 (related), ≥ 0.55 (merge candidate)]

07 · Skill lifecycle

Five states, one direction

Every skill moves through five states. Each state defines what the skill's status means and what the LLM is allowed to do with it; the draft state is detailed below.

draft
active
deprecated
superseded
archived

draft

  • When: Synthesized from prose or another source with no clean frontmatter. Set automatically by ingest.
  • Where: registry/skills/<slug>.md (not exported to dist/)
  • Allowed operations: edit, lint, compare, validate. Not allowed: build → dist.
  • Promoted to active when: schema validates, description policy satisfied, lint clean.

State summary:

  • draft: synthesized from prose
  • active: runtime-consumed
  • deprecated: no successor yet
  • superseded: moved to deprecated/
  • archived: read-only

Transitions: promote, retire, supersede, archive.

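The lifecycle can be sketched as a small transition table. The edges below are an assumption: they read the four transition names as linking the five states in the order listed (draft → active → deprecated → superseded → archived). The merge section sets originals to superseded directly, so a real store may need additional edges. The function names are illustrative.

```python
# Assumed reading of the lifecycle: one transition between each adjacent state.
TRANSITIONS = {
    ("draft", "promote"): "active",
    ("active", "retire"): "deprecated",
    ("deprecated", "supersede"): "superseded",
    ("superseded", "archive"): "archived",
}


def advance(state: str, action: str) -> str:
    """Return the next state, or raise if the move is not in the table."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action} a skill in state {state!r}") from None


def can_build(state: str) -> bool:
    """Only active skills are exported to dist/."""
    return state == "active"
```

Encoding the lifecycle as data makes illegal moves, such as promoting an archived skill, fail explicitly rather than silently rewriting status.
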
08 · Lint surface

What skillstore lint actually checks

Lint never silently auto-fixes — it proposes diffs with --fix. Severity ladder: error blocks CI, warn should be addressed before merge, info is advisory.

ERROR

  • Schema invalid / missing required field: No name or description, or frontmatter fails skill.schema.json.
  • Dead script reference: resources[].path listed but file missing on disk.
  • Dependency cycle / missing dep: Graph traversal over dependencies finds a cycle or dangling node.
  • Dangling supersedes: superseded_by points to a skill that no longer exists.

WARN

  • Description too long / vague: Length > 1024 chars, or "no when-to-use" heuristic triggered.
  • Duplicate triggers across active skills: trigger_sim ≥ 0.9 between two active skills.
  • Unresolved trigger collision: Open comparison with verdict ≠ keep-separate.
  • Orphaned resource: File in skill dir not listed in resources:.
  • Contradictory instructions: Opposing imperatives detected in one body.
  • Unsafe instruction without approval gate: Destructive command pattern with no human-approval gate nearby.
  • Vocab drift: Tag or domain not in taxonomy.yaml.

INFO

  • Orphan skill: No inbound/outbound links and no overlap peers; possible miscategorization.
  • Tool declared but unused: Listed in frontmatter but not referenced in body or scripts.
  • Stale overlap: last_compared older than N days; re-run compare.

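Two of these checks can be sketched over parsed registry pages: the dangling superseded_by error and the vocab-drift warning. This is a minimal sketch under assumptions: pages are a dict keyed by slug, the taxonomy is a flat set of allowed terms, and findings are (severity, slug, message) tuples. None of these shapes are specified by the explainer.

```python
Finding = tuple[str, str, str]  # (severity, slug, message)


def lint_registry(pages: dict[str, dict], taxonomy: set[str]) -> list[Finding]:
    """Run two illustrative checks and return findings without changing anything."""
    findings: list[Finding] = []
    for slug, meta in sorted(pages.items()):
        target = meta.get("superseded_by")
        if target is not None and target not in pages:
            findings.append(("error", slug, f"dangling superseded_by: {target}"))
        for term in meta.get("domains", []) + meta.get("tags", []):
            if term not in taxonomy:
                findings.append(("warn", slug, f"vocab drift: {term}"))
    return findings


def blocks_ci(findings: list[Finding]) -> bool:
    """Per the severity ladder, only errors block CI."""
    return any(severity == "error" for severity, _, _ in findings)
```

The function only reports; consistent with the rule that lint never silently auto-fixes, any repair would be proposed as a diff in a separate step.
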
09 · Caveats worth carrying in

Honest limits of the design

The skill store gives you discipline and auditability — it doesn't give you correctness guarantees. Four limits are worth carrying into any implementation.

1. Scores are proxies, not ground truth

Offline overlap metrics approximate runtime routing behavior. Calibrate thresholds against an actual false-merge rate rather than trusting the defaults, and gate autonomous merges behind human approval.

2. Merges don't guarantee correctness

Provenance is preserved, but behavior is assumed unchanged. Add per-skill smoke tests in tests/ and an eval.md per skill; run them after every merge.

3. Ingest is a supply-chain risk

Not just scripts — untrusted skill instructions can carry prompt-injection. Treat imported text as untrusted until reviewed. Scan with shellcheck, ruff, gitleaks.

4. The spec is portable

Agent Skills are designed to work across Codex, Gemini CLI, Cursor, etc. Heavy per-runtime down-conversion is often unnecessary: keep export adapters thin and diverge only where a runtime genuinely demands it.

In summary

The operating loop in one line

ingest (immutable raw + normalized registry, auto-compared on entry) → compare (cluster, flag redundancy/conflict) → merge (consolidate, resolve explicitly, deprecate originals) → lint (hygiene) → build (regenerate lean dist/) — with index.md and log.md keeping the whole thing navigable and auditable.

  • Two artifacts. The registry page is rich; the deployed skill is lean. Build is the bridge.
  • Three zones. raw/ append-only · registry/ LLM-edited · dist/ generated.
  • Description is the only routing signal. Keep it crisp; distill triggers into it at build time.
  • Overlap and conflict are orthogonal. Cluster, but check for opposing imperatives.
  • Nothing vanishes silently. Deprecated ≠ deleted. Deliberately-dropped lists are mandatory.