Building a Skill Store — A Visual Explainer

01 · The central design decision

Registry ≠ Deployed skill

Skills start life as ordinary markdown files. As the store grows, you want rich metadata — provenance, overlap scores, status, relationships — to maintain them. But you also want the deployed artifact to stay lean: just a name and a description the runtime can read. The solution is to keep two artifacts and a build step in between.

📇 registry/skills/<slug>.md (rich, internal)

name: pdf-form-filling
description: Fill, flatten, validate…
allowed-tools: [Bash, Read, Write]
slug: pdf-form-filling
version: 2.0.0
status: active
domains: [documents, pdf]
triggers: [4 structured triggers…]
overlap: {score: 0.41, …}
provenance: {sources: [...], …}
supersedes: [pdf-fill-acroform]

🚀 dist/skills/<slug>/SKILL.md (lean, runtime)

name: pdf-form-filling
description: Fill, flatten, validate…
allowed-tools: [Bash, Read, Write]

— everything else stripped at build time —

slug / version / status / domains /
triggers / overlap / provenance /
supersedes / superseded_by

Legend: runtime fields (exported) · registry-only fields (stripped)

What the split looks like in practice

On the left: the maintainer LLM's working artifact — thick with tabs, tags, log entries, and provenance. On the right: the deployed artifact — three runtime fields and nothing else. Both refer to the same skill; only one is shipped.

A thick Registry binder full of tabs and logs next to a slim Deployed system overview card

Critical constraint

The description is the only routing signal.

Claude Code / Codex-style agents read skill descriptions at runtime to decide which to invoke — there is no embedding classifier doing the routing. The Agent Skills spec has no triggers field. Structured triggers live in the registry for overlap analysis only, and get distilled into the description at build time. Don't ship a triggers: block into a deployed skill — it's non-standard and won't be read.

Why descriptions cap at ~1024 chars

The platform listing truncates long descriptions, and the skill catalog your agent reads at runtime has a tight budget. That is exactly why dedup matters: with so little room to differentiate, two skills with near-identical descriptions create genuine ambiguity at routing time.

Why this matters at scale

Without the split, either you pollute deployed skills with maintenance metadata (the runtime ignores it but the file gets cluttered), or you starve the maintainer LLM of the rich signal it needs to deduplicate. The two-artifact design gives each consumer exactly what it consumes.

02 · Directory structure

Three zones with strict invariants

The store splits into three physical zones. Each zone has a single editor and a single set of rules. Keeping them separate is what makes the whole system auditable.

raw/

raw/sources/<source-id>/

byte-for-byte provenance

What lives here & who edits

source.yaml — origin, commit, license, fetch date
original/ — unchanged SKILL.md + scripts + resources
hashes.txt — content hashes for idempotency

Invariant: append-only. The LLM never edits anything in here. A changed source creates a new source-id, never overwrites the old one.
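One way to get the append-only behavior is to derive the source-id from the content itself, so a changed source can never collide with an old one. The sketch below assumes a SHA-256 digest over sorted relative paths and file bytes. The names source_id and is_new_source, the src- prefix, and the 12-character truncation are illustrative choices, not part of the skillstore CLI.

```python
import hashlib


def source_id(files: dict[str, bytes], prefix: str = "src-") -> str:
    """Derive a stable id from file paths and contents (hypothetical scheme)."""
    digest = hashlib.sha256()
    for path in sorted(files):  # sorted so dict order cannot change the id
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")  # separator between path and content hash
        digest.update(hashlib.sha256(files[path]).digest())
    return prefix + digest.hexdigest()[:12]


def is_new_source(files: dict[str, bytes], known_ids: set[str]) -> bool:
    """True when this exact content has not been ingested before."""
    return source_id(files) not in known_ids
```

Re-ingesting identical bytes yields the same id and can be skipped. Any edit upstream yields a new id, which leaves the old directory untouched.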

registry/

registry/skills/<slug>.md

the wiki layer (LLM-edited)

What lives here & who edits

One canonical page per skill, plus comparisons/, merges/, and deprecated/. Pages validate against schema/skill.schema.json and follow a fixed body template.

Invariant: this is the only place the maintainer LLM edits content. Every page must link back to its source-id(s) in ## Provenance.

dist/

dist/skills/<slug>/SKILL.md

generated, deployable

What lives here & who edits

Spec-compliant SKILL.md with only the runtime fields the agent consumes: name, description, optional allowed-tools. Plus scripts/ and resources/.

Invariant: rebuilt by skillstore build. Never hand-edited. Stripped at build: slug, version, status, domains, tags, triggers, anti_triggers, overlap, provenance, supersedes, superseded_by, created, updated.
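The strip step can be sketched as an allowlist over the frontmatter. This is a minimal illustration, assuming the frontmatter has already been parsed into a dict; to_runtime_frontmatter is a hypothetical helper name, not the actual build code.

```python
RUNTIME_FIELDS = ("name", "description", "allowed-tools")


def to_runtime_frontmatter(registry_meta: dict) -> dict:
    """Keep only the fields the runtime reads; everything else is dropped."""
    missing = [f for f in ("name", "description") if not registry_meta.get(f)]
    if missing:
        raise ValueError(f"missing required runtime field(s): {missing}")
    # Allowlist, not denylist: a newly added registry field is stripped by default.
    return {k: registry_meta[k] for k in RUNTIME_FIELDS if k in registry_meta}
```

An allowlist is the safer direction here: a denylist would leak any registry field someone forgot to add to it.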

tooling/

tooling/skillstore

CLI entrypoint

What lives here

The skillstore CLI: ingest, compare, merge, lint, build, search, graph. Plus the schema and templates under schema/.

03 · Operating contract

AGENTS.md: turning a chatbot into a disciplined librarian

AGENTS.md is the configuration that turns a generic chatbot into a disciplined skill librarian. The rules are short and non-negotiable. Most of them are about what the LLM must not do.

Golden rules

Don't edit `raw/`

Immutable provenance. A changed source gets a new source-id.

Don't edit `dist/`

Always regenerated by skillstore build.

Don't execute imported scripts at ingest

Inspect first; scan for secrets and destructive commands; sandbox before any run.

Every state change appends to `log.md`

One dated line per operation. Greppable by prefix: INGEST, COMPARE, MERGE, LINT, BUILD.

Every page validates against `skill.schema.json`

On failure: keep as draft and emit lint errors — never discard.

Preserve provenance

Every skill links back to its source-id(s) in a ## Provenance section.

Prefer deprecate over delete

Merges are reversible. Originals move to registry/deprecated/ with a stub.
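The log.md rule above is easy to enforce mechanically. The sketch below assumes a "PREFIX date detail" line layout. The source only requires one dated line per operation that is greppable by prefix, so the exact layout and the helper names are assumptions.

```python
from datetime import date

LOG_PREFIXES = {"INGEST", "COMPARE", "MERGE", "LINT", "BUILD"}


def format_log_line(op: str, detail: str, when: date) -> str:
    """Return one dated, prefix-first line for log.md (layout is assumed)."""
    if op not in LOG_PREFIXES:
        raise ValueError(f"unknown operation prefix: {op}")
    if "\n" in detail:
        raise ValueError("detail must fit on a single line")
    return f"{op} {when.isoformat()} {detail}"


def append_log(path: str, line: str) -> None:
    """Append-only write; existing entries are never rewritten."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")
```

With the prefix first, grep '^MERGE' log.md lists every merge in order.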

Description field policy

The selection-critical field. Must state:

WHEN to use it (the routing signal)
WHAT it does
Distinctive keywords the model can latch onto

# good
Fill, flatten, and validate AcroForm/XFA PDF forms.
Use when the user asks to populate, complete, or
auto-fill PDF forms.

# bad
Helps with PDFs.
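A rough check for this policy might look like the sketch below. The 1024-character limit comes from the store's own lint rule. The cue phrases used to detect a when-to-use statement are an assumption and would need tuning against real descriptions.

```python
import re

MAX_DESCRIPTION_CHARS = 1024
# Assumed cue phrases for the "when to use" heuristic; tune for your store.
WHEN_CUES = re.compile(
    r"\buse (this )?(skill )?when\b|\bwhen the user\b", re.IGNORECASE
)


def check_description(description: str) -> list[str]:
    """Return a list of problem codes; an empty list means the check passed."""
    problems = []
    if len(description) > MAX_DESCRIPTION_CHARS:
        problems.append("too-long")
    if not WHEN_CUES.search(description):
        problems.append("no-when-to-use")
    return problems
```

Run against the two examples above, the good description passes and "Helps with PDFs." is flagged for having no when-to-use statement.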

Trigger policy (registry-only)

Triggers exist for overlap analysis, not for the deployed SKILL.md.

Prefer specific intent / phrase / file-pattern triggers
Prefer anti-triggers (what this skill is not for)
Avoid broad triggers like "coding" or "help user"

triggers:
  - id: fill-form
    intent: "populate a PDF form with data"
    keywords: [fill, complete, populate, autofill, form]

anti_triggers:
  - "extracting text from a PDF without filling fields"

Merge decision policy

overlap ≥ 0.80                → propose merge
0.55–0.79 AND trigger_sim ≥ 0.80 → propose merge
                                      (routing collision)
< 0.55                          → keep separate
conflict_score ≥ 0.45         → BLOCK auto-merge
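The policy above can be read as a small decision function. This is a sketch, not the CLI's implementation. The policy does not say what happens in the 0.55–0.79 band when trigger_sim is below 0.80, so the "review" label for that case is an assumption, as are the Decision type and its field names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    verdict: str         # "propose-merge", "review", or "keep-separate"
    auto_merge_ok: bool  # False when conflict blocks an automatic merge


def decide(overlap: float, trigger_sim: float, conflict: float) -> Decision:
    """Apply the merge decision thresholds to one pair of skills."""
    if overlap >= 0.80:
        verdict = "propose-merge"
    elif overlap >= 0.55:
        # Mid band: merge only on a routing collision. The fallback label
        # "review" is an assumption; the policy does not name this case.
        verdict = "propose-merge" if trigger_sim >= 0.80 else "review"
    else:
        verdict = "keep-separate"
    # A high conflict score never changes the verdict, it only blocks automation.
    auto_ok = verdict == "propose-merge" and conflict < 0.45
    return Decision(verdict, auto_ok)
```

Keeping the conflict gate separate from the verdict mirrors the policy: a blocked pair is still a merge candidate, it just needs a human.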

Taxonomy hygiene

Any new domain, tag, or trigger verb must be added to schema/taxonomy.yaml before it can appear in a registry page. This prevents uncontrolled vocabulary drift as the store grows.
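As a sketch, the vocabulary check reduces to a set difference per field. The layout of schema/taxonomy.yaml is not shown here, so the example assumes it has been loaded into a dict of allowed sets keyed by field name. Trigger verbs are left out for brevity, and unknown_terms is a hypothetical helper.

```python
def unknown_terms(page: dict, taxonomy: dict[str, set[str]]) -> dict[str, list[str]]:
    """Return terms used by a registry page that the taxonomy does not list."""
    drift = {}
    for field in ("domains", "tags"):
        allowed = taxonomy.get(field, set())
        extra = sorted(set(page.get(field, [])) - allowed)
        if extra:
            drift[field] = extra
    return drift
```

An empty result means the page uses only known vocabulary; anything else names the exact terms to add to the taxonomy first.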

04 · The operating loop

ingest → compare → merge → lint → build

The whole system is one loop. Each station has a clear input, output, and a single responsibility.

📥

Station 1

Ingest

Capture raw, normalize, schema-validate.

⚖️

Station 2

Compare

Score overlap & conflict. Cluster.

🧬

Station 3

Merge

Consolidate overlapping skills explicitly.

🔍

Station 4

Lint

Schema, descriptions, dead refs, conflicts.

🚀

Station 5

Build

Strip registry-only fields → dist/.

📥 Ingest — capture & normalize

Source: a repo, a URL, a directory, or pasted prose (-).

Inputs

Path / URL / repo / stdin prose
Existing hashes.txt for idempotency

Outputs

raw/sources/<source-id>/ (immutable)
registry/skills/<slug>.md (normalized)
Auto-compare against nearest neighbor

⚖️ Compare — score & cluster

Pairwise or whole-registry via similarity graph + community detection.

Inputs

Two skills (pairwise) or whole registry (--cluster)
Cached embeddings by content hash

Outputs

One registry/comparisons/<id>.md per cluster
overlap_score, conflict_score, verdict

🧬 Merge — consolidate

Base + conditional branches; explicit conflict resolution; no silent picks.

Inputs

Source skills (frozen with version + hash)
Comparison verdict (must be merge-recommended)

Outputs

New merged registry/skills/<slug>.md (bumped major version)
registry/merges/<id>.md with conflict resolutions
Originals → registry/deprecated/

🔍 Lint — surface problems

Proposes diffs with --fix; never silently auto-fixes.

Inputs

Registry pages, dist artifacts, scripts

Outputs

Errors / warnings / info, severity-coded
Dated report appended to log.md

🚀 Build — export to runtime

Generate lean SKILL.md per runtime (default: spec-compliant).

Inputs

Registry pages with status: active only
Target runtime flag (e.g. --runtime codex)

Outputs

dist/skills/<slug>/SKILL.md per active skill
Deprecated skills removed from dist
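The selection half of the build contract can be sketched as a pure planning step: which slugs get written and which stale directories get removed. The function name and the shape of its inputs (a slug-to-status map and the set of slugs currently in dist/) are assumptions for illustration.

```python
def plan_build(statuses: dict[str, str], existing_dist: set[str]) -> tuple[list[str], list[str]]:
    """Return (slugs to write, slugs to remove from dist/)."""
    active = {slug for slug, status in statuses.items() if status == "active"}
    to_write = sorted(active)
    # Anything in dist/ that is no longer active is stale and gets removed.
    to_remove = sorted(existing_dist - active)
    return to_write, to_remove
```

Computing the plan before touching the filesystem makes the build easy to dry-run and to log as a single BUILD entry.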

An assembly line with five labeled stations — Ingest, Compare, Merge, Lint, Build — moving artifacts from input to output

One pipeline, five stations

Every skill that lands in the store travels the same physical path. Raw crates arrive at Ingest, get sorted at Compare, sometimes consolidate at Merge, get inspected at Lint, and exit Build as a finished deployable package.

The five-station structure is also the CLI: skillstore ingest, compare, merge, lint, build. Each command is idempotent and emits one log.md entry.

A real skill travels through five stations. Each station may produce multiple registry/ artifacts (compare pages, merge pages, lint reports).

05 · How overlap is measured

A weighted composite, not a single number

Overlap is computed from six independent signals, each weighted by how much it correlates with actual routing ambiguity. Scores are cached by content hash, so re-runs are cheap.

The formula

overlap_score = 0.30·desc_sim       # embedding cosine
              + 0.30·trigger_sim    # routing-critical
              + 0.15·instr_sim
              + 0.10·tool_sim        # Jaccard, required > optional
              + 0.10·domain_tag_sim # domains ∪ tags
              + 0.05·output_sim

desc_sim (30%): embedding cosine of description
trigger_sim (30%): symmetric best-match over triggers
instr_sim (15%): normalized instructions body
tool_sim (10%): Jaccard of tool sets
domain_tag_sim (10%): domains ∪ tags
output_sim (5%): declared outputs

Trigger similarity is the routing-critical signal: for each trigger in A, take max similarity to any trigger in B (intent embedding + keyword Jaccard), then average both directions.
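The composite and the symmetric best-match can be sketched together. To stay self-contained, each trigger is reduced to its keyword set and pairs are scored with Jaccard only. The intent-embedding half of the pairwise score is deliberately left out and would be supplied through the pair_sim parameter. All function names here are illustrative.

```python
from typing import Callable

WEIGHTS = {
    "desc_sim": 0.30, "trigger_sim": 0.30, "instr_sim": 0.15,
    "tool_sim": 0.10, "domain_tag_sim": 0.10, "output_sim": 0.05,
}


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection over size of the union; 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def trigger_sim(a: list[set[str]], b: list[set[str]],
                pair_sim: Callable[[set[str], set[str]], float] = jaccard) -> float:
    """Symmetric best-match: average of the A->B and B->A best-match means."""
    if not a or not b:
        return 0.0
    a_to_b = sum(max(pair_sim(x, y) for y in b) for x in a) / len(a)
    b_to_a = sum(max(pair_sim(y, x) for x in a) for y in b) / len(b)
    return (a_to_b + b_to_a) / 2


def overlap_score(signals: dict[str, float]) -> float:
    """Weighted sum of the six signals, each expected in [0, 1]."""
    return sum(weight * signals[name] for name, weight in WEIGHTS.items())
```

Averaging both directions matters when one skill has many triggers and the other has few: a single shared trigger cannot make a narrow skill look identical to a broad one.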

Threshold explorer

An overlap score falls into one of four bands: < 0.55, 0.55 – 0.69, 0.70 – 0.84, and ≥ 0.85.

Example: overlap_score 0.72 → verdict: substantial overlap (0.70–0.84) → review, likely merge

Overlap and conflict are orthogonal

Two skills can have high overlap and high conflict — meaning "keep separate or resolve carefully," not "merge freely." Detect conflicts by extracting imperative clauses and flagging opposing pairs.

Conflict extraction example

Skill A	Skill B	Conflict
"Always run full suite first"	"Run focused test first"	sequencing
"Never edit tests"	"Update tests when behavior changes"	policy
"Use grep"	"Use ripgrep only"	toolchain

06 · What a good merge looks like

Section-merge, don't concatenate

The naive way to merge two skills is to paste them together. That creates a vague mega-skill whose description can no longer cleanly signal when to use it — degrading routing for every prompt. The right pattern is a shared default plus conditional branches, with every conflict resolved explicitly.

Instructions structure after merge

## Instructions

### Default workflow
  (common steps that always run)

### If AcroForm
  (branch: handles AcroForm-specific path)

### If XFA
  (branch: handles dynamic XFA path)

### Conflict resolutions
  (explicit chosen behavior for each
   opposing imperative we found)

Shared default

The superset's common path. Every merged skill has one of these.

Conditional branches

Each former skill's distinct logic gets its own subsection, gated by an if-condition.

Conflict resolutions

Every opposing imperative is documented here with the chosen behavior + rationale.

Four conflict resolution strategies

The LLM must never silently pick. Every conflict gets one of these, recorded with rationale on the merge page.

Precedence

One is clearly safer / better.

e.g. "always validate"

Conditionalize

Both right under different conditions.

e.g. "if X, do A; else B"

Parameterize

Turn into an option with a stated default.

e.g. flatten=true (default)

Split

The conflict means they shouldn't merge.

→ keep separate

Provenance rule

Nothing vanishes silently.

Anything in a source skill not represented in the merged body must appear in the merge page's "deliberately dropped" list. Originals move to registry/deprecated/ with a stub page — never deleted, so the merge is reversible.

Two streams of light converging into a single unified stream surrounded by documents and code symbols

What happens to provenance

The merged skill:

Bumps major version (2.0.0)
Sets supersedes: [pdf-fill-acroform, pdf-xfa-helper]
Unions provenance.sources from all inputs

The original skills:

Get status: superseded
Get superseded_by: pdf-form-filling
Move to registry/deprecated/ with a stub body
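The bookkeeping above can be sketched as one function that produces the merged frontmatter and the updated originals. Two assumptions are labeled in the code: the new major version is bumped from the highest major among the inputs, and provenance.sources entries are plain strings. merge_metadata and bump_major are hypothetical names.

```python
def bump_major(version: str) -> str:
    """'1.4.2' -> '2.0.0'."""
    major = int(version.split(".")[0])
    return f"{major + 1}.0.0"


def merge_metadata(slug: str, sources: list[dict]) -> tuple[dict, list[dict]]:
    """Return (merged frontmatter, updated stubs for the originals)."""
    # Assumption: bump from the highest major version among the inputs.
    base = max(sources, key=lambda s: int(s["version"].split(".")[0]))["version"]
    seen, union = set(), []
    for src in sources:
        for ref in src["provenance"]["sources"]:  # assumed to be strings
            if ref not in seen:  # order-preserving de-duplication
                seen.add(ref)
                union.append(ref)
    merged = {
        "slug": slug,
        "version": bump_major(base),
        "supersedes": [s["slug"] for s in sources],
        "provenance": {"sources": union},
    }
    # Copies, not in-place edits, so the frozen inputs stay untouched.
    stubs = [{**s, "status": "superseded", "superseded_by": slug} for s in sources]
    return merged, stubs
```

Returning new dicts instead of mutating the inputs keeps the pre-merge state available, which is what makes the merge reversible.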

07 · Skill lifecycle

Five states, one direction

Every skill moves through five states. Each state defines where the page lives and what the LLM is allowed to do with it.

draft

active

deprecated

superseded

archived

draft

When: Synthesized from prose or other source with no clean frontmatter. Set automatically by ingest.
Where: registry/skills/<slug>.md (not exported to dist/)
Allowed operations: edit, lint, compare, validate. Not allowed: build → dist.
Promoted to active when: Schema validates, description policy satisfied, lint clean.

active

When: Default state for a registry page that has been validated and approved.
Where: registry/skills/<slug>.md, exported to dist/skills/<slug>/ on next build.
Allowed operations: edit (→ version bump), merge-into, lint, compare, build.
Notes: This is the only state the runtime consumes.

deprecated

When: Marked as no longer recommended for new use, but no successor exists yet.
Where: Stays in registry/skills/ but excluded from dist/ on next build.
Allowed operations: revive (→ active), supersede (→ superseded), archive (→ archived).
Notes: Kept for provenance. Never deleted.

superseded

When: Replaced by a merged successor skill via an applied merge.
Where: Moved to registry/deprecated/<slug>.md with a stub body + pointer to successor.
Required fields: superseded_by: <slug> set on the stub.
Notes: Merges are reversible by restoring the page to active.

archived

When: Removed from active use long-term; kept only for historical reference.
Where: registry/deprecated/, read-only.
Allowed operations: none, by default. Manual override only.
Notes: Reversibility is no longer guaranteed by tooling.
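The state descriptions above imply a small transition table. The sketch below encodes it. The active → deprecated edge is an assumption, since the text describes the deprecated state but does not list that operation under active, and the manual_override flag is a stand-in for whatever override mechanism the tooling provides.

```python
ALLOWED = {
    "draft": {"active"},                             # promoted once lint is clean
    "active": {"deprecated", "superseded"},          # "deprecated" edge is assumed
    "deprecated": {"active", "superseded", "archived"},
    "superseded": {"active"},                        # merges are reversible
    "archived": set(),                               # no operations by default
}


def transition(current: str, target: str, manual_override: bool = False) -> str:
    """Return the new state, or raise if the move is not allowed."""
    if current not in ALLOWED or target not in ALLOWED:
        raise ValueError(f"unknown state: {current!r} or {target!r}")
    if target in ALLOWED[current]:
        return target
    if current == "archived" and manual_override:
        return target  # archived pages move only by explicit manual override
    raise ValueError(f"{current} -> {target} is not allowed")
```

Encoding the table as data rather than scattered if-statements lets lint and the CLI share one source of truth.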

08 · Lint surface

What `skillstore lint` actually checks

Lint never silently auto-fixes — it proposes diffs with --fix. Severity ladder: error blocks CI, warn should be addressed before merge, info is advisory.

Schema invalid / missing required field

No name or description, or frontmatter fails skill.schema.json.

ERROR

Dead script reference

resources[].path listed but file missing on disk.

ERROR

Dependency cycle / missing dep

Graph traversal over dependencies: finds a cycle or dangling node.

ERROR

Dangling supersedes

superseded_by points to a skill that no longer exists.

ERROR

Description too long / vague

Length > 1024 chars, or "no when-to-use" heuristic triggered.

WARN

Duplicate triggers across active skills

trigger_sim ≥ 0.9 between two active skills.

WARN

Unresolved trigger collision

Open comparison with verdict ≠ keep-separate.

WARN

Orphaned resource

File in skill dir not listed in resources:.

WARN

Contradictory instructions

Opposing imperatives detected in one body.

WARN

Unsafe instruction without approval gate

Destructive command pattern with no human-approval gate nearby.

WARN

Vocab drift

Tag or domain not in taxonomy.yaml.

WARN

Orphan skill

No inbound/outbound links and no overlap peers — possible miscategorization.

INFO

Tool declared but unused

Listed in frontmatter but not referenced in body or scripts.

INFO

Stale overlap

last_compared older than N days — re-run compare.

INFO
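Of the checks above, the dependency one is the most algorithmic. A minimal sketch, assuming dependencies have been collected into a slug-to-list map, is a depth-first traversal that reports missing targets and back edges. The message format is illustrative, not the actual lint output.

```python
def check_dependencies(deps: dict[str, list[str]]) -> list[str]:
    """Return lint errors for missing dependencies and dependency cycles."""
    errors = []
    for skill, needs in deps.items():
        for dep in needs:
            if dep not in deps:
                errors.append(f"ERROR missing-dep: {skill} -> {dep}")

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {skill: WHITE for skill in deps}

    def visit(node: str, path: list[str]) -> None:
        color[node] = GREY
        path.append(node)
        for dep in deps[node]:
            if dep not in deps:
                continue  # already reported as missing
            if color[dep] == GREY:  # back edge: dep is on the current path
                cycle = path[path.index(dep):] + [dep]
                errors.append("ERROR cycle: " + " -> ".join(cycle))
            elif color[dep] == WHITE:
                visit(dep, path)
        path.pop()
        color[node] = BLACK

    for skill in deps:
        if color[skill] == WHITE:
            visit(skill, [])
    return errors
```

Reporting the full cycle path, not just "cycle found", tells the maintainer which edge to cut.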

09 · Caveats worth carrying in

Honest limits of the design

The skill store gives you discipline and auditability — it doesn't give you correctness guarantees. Four limits are worth carrying into any implementation.

1. Scores are proxies, not ground truth

Offline overlap metrics approximate runtime routing behavior. Calibrate thresholds against an actual false-merge rate rather than trusting the defaults, and gate autonomous merges behind human approval.

2. Merges don't guarantee correctness

Provenance is preserved, but behavior is assumed unchanged. Add per-skill smoke tests in tests/ and an eval.md per skill; run them after every merge.

3. Ingest is a supply-chain risk

Not just scripts — untrusted skill instructions can carry prompt-injection. Treat imported text as untrusted until reviewed. Scan with shellcheck, ruff, gitleaks.

4. The spec is portable

Agent Skills are designed to work across Codex, Gemini CLI, Cursor, etc. Heavy per-runtime down-conversion is often unnecessary — keep export adapters thin, only diverge where a runtime genuinely demands it.
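For the first caveat, calibration needs a number to calibrate against. One simple option, sketched here under the assumption that a reviewer has labeled a sample of compared pairs as "should merge" or not, is the share of proposed merges the reviewer rejected at a given threshold. The function name and input shape are hypothetical.

```python
def false_merge_rate(pairs: list[tuple[float, bool]], threshold: float) -> float:
    """Share of pairs at or above threshold that a reviewer said not to merge.

    Each pair is (overlap_score, should_merge) from a human-labeled sample.
    """
    proposed = [should_merge for score, should_merge in pairs if score >= threshold]
    if not proposed:
        return 0.0  # nothing proposed, so nothing falsely merged
    return sum(1 for ok in proposed if not ok) / len(proposed)
```

Sweeping the threshold over a labeled sample shows where the rate becomes acceptable, which is a firmer basis than the default cutoffs.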

In summary

The operating loop in one line

ingest (immutable raw + normalized registry, auto-compared on entry) → compare (cluster, flag redundancy/conflict) → merge (consolidate, resolve explicitly, deprecate originals) → lint (hygiene) → build (regenerate lean dist/) — with index.md and log.md keeping the whole thing navigable and auditable.

Two artifacts. The registry page is rich; the deployed skill is lean. Build is the bridge.
Three zones. raw/ append-only · registry/ LLM-edited · dist/ generated.
Description is the only routing signal. Keep it crisp; distill triggers into it at build time.
Overlap and conflict are orthogonal. Cluster, but check for opposing imperatives.
Nothing vanishes silently. Deprecated ≠ deleted. Deliberately-dropped lists are mandatory.
