
Self-Improvement and Research Validation

RickyData self-improvement is designed as a verified procedural memory system. The goal is not to let an agent rewrite its instructions whenever it has a plausible idea. The goal is to preserve patterns observed in real work: first the agent missed the mark, stalled, used the wrong tool, or needed extra user correction; then it reached the correct behavior, and that behavior can be grounded in the session evidence.

This makes self-improvement conservative by default. A private skill is eligible only when it captures a concrete behavior that worked, has enough evidence to be useful again, and can be stored without leaking data across wallets.

System Surfaces

RickyData has two related but separate self-improvement systems:

| Surface | Scope | Purpose | Current gate |
| --- | --- | --- | --- |
| Wallet self-improvement | One wallet | Private facets, learnings, skills, and agent instructions for that wallet | BYOK-only extraction, transcript grounding, confidence thresholds, per-wallet storage, safety scanning, atomic writes, provenance manifest, bounded history |
| Admin improvements | Platform/admin agents | Global agent behavior, example questions, and project skills | Recovery detection, candidate generation, A/B testing, promotion gates, rollback snapshots |

The wallet system is private and user-specific. The admin system is broader and therefore has a stricter promotion lifecycle, including A/B tests and rollback.

Wallet Flow

Wallet self-improvement runs through the Agent Gateway:

  1. Extract facets from eligible conversations. The extractor records the goal, task type, tools used, outcome, complexity, friction, key learnings, and execution metrics. It counts actual tool calls from the transcript, not just paid MCP calls.
  2. Cache facets under the wallet's private agent directory so each session is processed once.
  3. Synthesize cross-session learnings from unsynthesized facets. The synthesizer looks for repeated workflows, recurring friction, successful tool combinations, and concrete techniques.
  4. Ground learnings in transcript evidence before they can become skills. Tool names, file paths, and quoted commands or errors must be found in the source transcript often enough to pass the guard (a sketch of such a guard follows this list).
  5. Generate skill proposals only from high-confidence learnings. Related learnings are grouped, new skill creation is capped per run, thin single-observation skills are rejected, and generated skills must include routing, workflow, and quality sections.
  6. Validate private skill content with frontmatter checks, required workflow sections, size budgets, broad-instruction warnings, and wallet-specific security scanning.
  7. Write wallet-scoped artifacts atomically under the wallet's .claude area. Global project skills are not changed by private wallet improvement.
  8. Record provenance in the private skill manifest, including content hashes, source sessions, source agents, evidence level, validation warnings, and revision count.
  9. Track health through trigger counts and stale-skill pruning so private skills do not accumulate forever.
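
The grounding guard in step 4 can be pictured as a simple evidence check: a learning is accepted only when most of the concrete artifacts it cites literally appear in the source transcript. The sketch below is illustrative only; the type names, function name, and threshold are assumptions rather than the actual RickyData implementation.

```ts
// Hypothetical shapes; the real extractor/synthesizer types are not shown here.
interface Learning {
  summary: string;
  toolNames: string[];      // tools the learning claims were used
  filePaths: string[];      // paths the learning references
  quotedEvidence: string[]; // quoted commands or error messages
}

// Accept a learning only if enough of its cited evidence is literally
// present in the transcript it was synthesized from.
function isGrounded(learning: Learning, transcript: string, minRatio = 0.8): boolean {
  const claims = [...learning.toolNames, ...learning.filePaths, ...learning.quotedEvidence];
  if (claims.length === 0) return false; // nothing concrete to verify
  const found = claims.filter((claim) => transcript.includes(claim)).length;
  return found / claims.length >= minRatio;
}
```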

Runs are BYOK-only. If a wallet has not stored an API key, the run skips without mutating state. The current schedule options are after_each, daily, weekly, and biweekly.
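
As a rough picture of that gate, assuming hypothetical names for the config shape and the helper, the run decision could look like this:

```ts
type Schedule = "after_each" | "daily" | "weekly" | "biweekly";

interface WalletConfig {
  apiKey?: string;   // BYOK: the wallet's own stored key, if any
  schedule: Schedule;
  lastRunAt?: Date;
}

const intervalMs: Record<Exclude<Schedule, "after_each">, number> = {
  daily: 24 * 60 * 60 * 1000,
  weekly: 7 * 24 * 60 * 60 * 1000,
  biweekly: 14 * 24 * 60 * 60 * 1000,
};

// BYOK gate: without a stored key the run is skipped entirely, and scheduled
// modes only run once their interval has elapsed since the last run.
function shouldRun(cfg: WalletConfig, now = new Date()): boolean {
  if (!cfg.apiKey) return false;
  if (cfg.schedule === "after_each") return true;
  if (!cfg.lastRunAt) return true;
  return now.getTime() - cfg.lastRunAt.getTime() >= intervalMs[cfg.schedule];
}
```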

Admin Flow

Admin improvements are candidate-based:

  1. Admin conversations are persisted as raw sessions.
  2. A recovery detector looks for the concrete pattern that matters most: the agent failed or used the wrong path, then later recovered.
  3. MiniMax analysis proposes candidates such as skill updates, skill creation, CLAUDE.md routing updates, or example questions.
  4. Candidates are deduplicated by hash and tested before promotion (a sketch of the hash-based dedup follows this list).
  5. A/B testing compares baseline and variant answers on the same questions.
  6. Promotion requires passing status, confidence, daily-limit, and regression gates.
  7. Promotion stores a rollback snapshot and a revision record.
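
The hash-based deduplication in step 4 can be sketched as content-addressed filtering. Everything below (the Candidate shape, the normalization, the helper names) is an assumption for illustration, not the actual admin pipeline code:

```ts
import { createHash } from "node:crypto";

interface Candidate {
  kind: "skill_update" | "skill_create" | "routing_update" | "example_question";
  content: string;
}

// Hash normalized content so trivially reworded duplicates still collide.
function candidateHash(c: Candidate): string {
  const normalized = c.content.trim().replace(/\s+/g, " ").toLowerCase();
  return createHash("sha256").update(`${c.kind}:${normalized}`).digest("hex");
}

function dedupeCandidates(candidates: Candidate[], seen: Set<string>): Candidate[] {
  return candidates.filter((c) => {
    const hash = candidateHash(c);
    if (seen.has(hash)) return false;
    seen.add(hash);
    return true;
  });
}
```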

The A/B scorer uses eight dimensions:

| Dimension | Weight | Why it matters |
| --- | --- | --- |
| relevance | 2 | The answer must address the user request. |
| accuracy | 2 | Facts and claims must be true. |
| correctness | 2 | The final conclusion must be right, not just plausible. |
| completeness | 1 | Important constraints should not be skipped. |
| structure | 1 | The result should be easy to apply. |
| toolUsage | 1 | Tools should be used when useful and avoided when unnecessary. |
| conciseness | 1 | Improvement should not add needless tokens. |
| actionability | 1 | The answer should leave a clear next step or result. |

The composite score is:

(relevance*2 + accuracy*2 + correctness*2 + completeness + structure + toolUsage + conciseness + actionability) / 11

A candidate must pass three gates: no per-question composite regression beyond tolerance, no per-question tool-error increase, and no average composite regression beyond tolerance.
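
Putting the table, the formula, and the three gates together, the scorer could be implemented roughly as follows; the tolerance default and all names are illustrative assumptions rather than the production values:

```ts
interface Scores {
  relevance: number; accuracy: number; correctness: number;
  completeness: number; structure: number; toolUsage: number;
  conciseness: number; actionability: number;
}

// Weighted composite from the table: three 2x dimensions, five 1x dimensions, divided by 11.
function composite(s: Scores): number {
  return (
    s.relevance * 2 + s.accuracy * 2 + s.correctness * 2 +
    s.completeness + s.structure + s.toolUsage +
    s.conciseness + s.actionability
  ) / 11;
}

interface QuestionResult {
  baseline: Scores;
  variant: Scores;
  baselineToolErrors: number;
  variantToolErrors: number;
}

// The three promotion gates: no per-question composite regression beyond tolerance,
// no per-question tool-error increase, and no average composite regression beyond tolerance.
function passesGates(results: QuestionResult[], tolerance = 0.05): boolean {
  const perQuestionOk = results.every((r) =>
    composite(r.variant) >= composite(r.baseline) - tolerance &&
    r.variantToolErrors <= r.baselineToolErrors
  );
  const avg = (pick: (r: QuestionResult) => Scores) =>
    results.reduce((sum, r) => sum + composite(pick(r)), 0) / results.length;
  return perQuestionOk && avg((r) => r.variant) >= avg((r) => r.baseline) - tolerance;
}
```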

Why This Is Robust

The system is intentionally closer to scientific measurement than preference memory.

| Design choice | Failure mode it prevents |
| --- | --- |
| Verified-only methodology | Prevents documenting guesses, vibes, or aspirational workflows. |
| Wallet scoping | Prevents one user's private preferences or data from changing another user's agent. |
| BYOK-only analysis | Prevents platform-funded background learning and keeps analysis under the user's explicit model budget. |
| Transcript grounding | Prevents hallucinated learnings from becoming persistent instructions. |
| Confidence and thickness gates | Prevents one-off weak observations from becoming skills. |
| Bounded skill creation | Prevents skill sprawl. |
| Private skill safety scanning | Blocks wallet tokens, private keys, seed phrases, .env dumping, prompt injection, destructive shell commands, and persistence instructions. |
| Atomic writes | Prevents partially written skills or state files. |
| Provenance manifests | Makes each private skill revision traceable to source sessions, hashes, evidence level, and validation warnings. |
| Skill health tracking | Prevents stale or unused skills from silently dominating future behavior. |
| Admin A/B testing | Prevents global behavior changes that sound good but regress real answers. |
| Rollback snapshots | Makes promoted global changes reversible. |
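
For the atomic-write row above, the standard pattern is to write the full content to a temporary sibling file and then rename it into place. A minimal sketch, assuming Node.js file APIs and a hypothetical helper name:

```ts
import { writeFileSync, renameSync } from "node:fs";

// Write the full content to a sibling temp file first, then rename it into place.
// On POSIX filesystems the rename is atomic, so a crash mid-write leaves the
// previous skill file intact rather than a partially written one.
function writeSkillAtomically(targetPath: string, content: string): void {
  const tmpPath = `${targetPath}.tmp`;
  writeFileSync(tmpPath, content, "utf8");
  renameSync(tmpPath, targetPath);
}
```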

Hermes Reference Lessons

Hermes is a useful reference because it treats skills as managed procedural memory rather than loose prompt files. The strongest transferable patterns are:

| Hermes pattern | RickyData application |
| --- | --- |
| Constrained skill manager | Private skill updates go through the wallet skill evolver rather than arbitrary filesystem edits. |
| Frontmatter and size validation | Skill creation rejects bad metadata, overbroad bodies, and malformed files before activation. |
| Metadata-first loading | Agents should route on compact metadata and load bodies or references only when needed. |
| Atomic writes | Skill edits either fully apply or leave the previous version intact. |
| Guard scanner | Private skills are scanned for secret exfiltration, prompt injection, destructive commands, persistence, and wallet-token leakage. |
| Manifest/provenance records | Wallet skills remember source sessions, revision hashes, evidence level, and validation warnings. |

The main difference is policy. Hermes can allow general agent-created skills. RickyData should keep wallet skills private, verified, provenance-tracked, and scanned by default.
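
As one way to picture the metadata-first loading pattern, an agent can route on compact frontmatter fields and read a skill body only after selection. The registry shape, matching heuristic, and helper name below are assumptions, not the Hermes or RickyData implementation:

```ts
import { readFileSync } from "node:fs";

interface SkillMeta {
  name: string;
  description: string; // compact routing description kept in context
  path: string;        // body is loaded only on demand
}

// Route on metadata alone; the (potentially large) body never enters context
// unless this skill is actually selected for the current task.
function selectSkillBody(task: string, registry: SkillMeta[]): string | undefined {
  const lowered = task.toLowerCase();
  const match = registry.find((s) =>
    lowered.includes(s.name.toLowerCase()) ||
    s.description
      .toLowerCase()
      .split(/\W+/)
      .some((word) => word.length > 4 && lowered.includes(word))
  );
  return match ? readFileSync(match.path, "utf8") : undefined;
}
```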

Research Validation

The research literature supports the conservative parts of this design more strongly than the permissive parts.

| Finding | Design implication |
| --- | --- |
| Anthropic's skills architecture uses SKILL.md frontmatter, automatic or slash-command invocation, and on-demand file loading. Skills keep large references out of context until needed. | Keep CLAUDE.md and AGENTS.md short. Put intermittent workflows into skills with precise trigger descriptions and supporting files. |
| SkillsBench finds curated skills improve average pass rate by 16.2 percentage points, but effects vary, some tasks regress, and self-generated skills show no average benefit. Focused skills with 2-3 modules beat comprehensive documentation. | Do not promote self-generated skills just because they exist. Require verification, narrow scope, and regression testing. |
| SWE-Skills-Bench finds that 39 of 49 public SWE skills produce zero pass-rate improvement, the average gain is only +1.2%, and some skills degrade performance due to mismatched guidance. | Software-development skills need repo-specific tests and acceptance criteria. A skill that is useful in one repo should not be assumed useful elsewhere. |
| CoEvoSkills/EvoSkills shows that autonomous skill generation works best when a skill generator is paired with an information-isolated surrogate verifier. | Private skill candidates should be tested by a separate evaluation path, not by the same generation context that proposed them. |
| SkillMOO finds that pruning and substitution, not accumulation, are the primary drivers of better skill bundles, improving pass rate while reducing cost. | Improve skills by removing stale or vague instructions and replacing them with precise procedures, not by appending warnings indefinitely. |
| SkillReducer finds widespread verbosity and missing routing descriptions in public skills, and compresses descriptions and bodies while slightly improving quality. | Track token cost and keep skill descriptions sharp. Compression is a quality feature, not only a cost feature. |
| SkillRouter shows routing quality drops sharply when the skill body is hidden from retrieval in large overlapping registries. | For large registries, use full-text indexing or body-aware reranking offline, then inject only the selected skill at runtime. |
| AgentSkillOS shows tree-based retrieval and DAG-based orchestration outperform flat skill invocation at ecosystem scale. | When private and marketplace skill counts grow, organize skills hierarchically and compose them as workflows instead of dumping more choices into context. |
| In-the-wild skill-usage studies show benefits become fragile when agents must retrieve from large uncurated skill pools, but query-specific refinement can recover performance. | Per-wallet improvement should refine skills against the exact recovered question and nearby real tasks. |
| Skill-Inject and related security papers show malicious skill files can trigger harmful tool use, data exfiltration, and supply-chain attacks. | Skills must be treated as executable supply-chain artifacts, with provenance, scanning, trust, and permission boundaries. |
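
For the SkillRouter implication above, one very simple form of body-aware selection is term-overlap scoring over the full skill text offline, with only the winning skill injected at runtime. The sketch is deliberately naive and all names are hypothetical:

```ts
interface IndexedSkill {
  name: string;
  description: string;
  body: string; // included in the offline index, never in the runtime context by default
}

// Score each skill by how many query terms appear anywhere in its indexed text,
// then inject only the single best match into the agent's context.
function pickSkill(query: string, skills: IndexedSkill[]): IndexedSkill | undefined {
  const terms = query.toLowerCase().split(/\W+/).filter((t) => t.length > 3);
  let best: { skill: IndexedSkill; score: number } | undefined;
  for (const skill of skills) {
    const haystack = `${skill.name} ${skill.description} ${skill.body}`.toLowerCase();
    const score = terms.filter((t) => haystack.includes(t)).length;
    if (score > 0 && (!best || score > best.score)) best = { skill, score };
  }
  return best?.skill;
}
```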


Benchmarking With Real GitHub Work

The ai_research benchmark infrastructure is the measurement layer for research-grade evaluation of code-facing skill changes.

The strongest evaluation path is TDD-verified GitHub replay:

  1. Pick a real closed issue or merged PR.
  2. Pin the base commit and close commit.
  3. Author a narrow test that fails at the base commit and passes at the close commit.
  4. Replay the same task with the baseline skill set and the candidate skill set.
  5. Promote only if the candidate improves or preserves red-green correctness, does not over-engineer the diff, does not introduce security risk, and does not exceed cost limits.

This gives RickyData a legitimate research track rooted in actual development work instead of synthetic preference scores alone. The admin A/B system remains useful for answer quality, while code-facing skills use execution-based gates when a task has testable repository behavior.
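
A minimal sketch of the replay gate, assuming hypothetical field and function names rather than the actual ai_research harness: the task is valid only when the authored test is red at the base commit and green at the close commit, and the candidate is promoted only when it preserves or improves correctness within the diff, security, and cost budgets.

```ts
interface ReplayRun {
  testPassesAtBase: boolean;   // the authored test must fail here (red)
  testPassesAtClose: boolean;  // and pass here (green)
  baselineSolved: boolean;     // did the baseline skill set make the test pass?
  candidateSolved: boolean;    // did the candidate skill set make the test pass?
  candidateDiffLines: number;
  maxDiffLines: number;        // over-engineering budget for the diff
  securityFindings: number;
  candidateCostUsd: number;
  costLimitUsd: number;
}

// Promote only when the replay task is valid (red at base, green at close) and
// the candidate preserves or improves correctness without blowing the other budgets.
function shouldPromote(run: ReplayRun): boolean {
  const validTask = !run.testPassesAtBase && run.testPassesAtClose;
  const noRegression = run.candidateSolved || !run.baselineSolved;
  return (
    validTask &&
    noRegression &&
    run.candidateDiffLines <= run.maxDiffLines &&
    run.securityFindings === 0 &&
    run.candidateCostUsd <= run.costLimitUsd
  );
}
```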