Documentation

DefendableDatasets docs

DefendableDatasets is a static-first dataset store, registry, graph browser, selector, verifier, and export system for AI builders. Current corpora were curated on a sovereign bare-metal RTX 6000 fleet and RTX 3090 systems.

Registry Schema

Registry JSON lives under /data/registry. Datasets include identity, license, formats, tasks, source summary, validation, files, hashes, receipts, examples, citation, and model compatibility.
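
As a rough illustration of the entry shape described above, the sketch below checks that a registry entry carries each listed section. The key names (such as source_summary and model_compatibility) are assumptions derived from the prose, not the confirmed schema.

```python
import json

# Hypothetical key names derived from the field list above; the real
# registry schema may spell or nest these differently.
REQUIRED_FIELDS = (
    "id", "license", "formats", "tasks", "source_summary", "validation",
    "files", "hashes", "receipts", "examples", "citation",
    "model_compatibility",
)


def missing_fields(entry: dict) -> list[str]:
    """Return the required fields that are absent from a registry entry."""
    return [name for name in REQUIRED_FIELDS if name not in entry]


# An intentionally incomplete, made-up entry.
entry = json.loads(
    '{"id": "example-dataset", "license": "cc0-1.0", "formats": ["jsonl"]}'
)
print(missing_fields(entry))
```

A check like this only confirms presence; validating the contents of each field would need the actual schema.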

Graph Schema

The graph contains DOMAIN, CATEGORY, DATASET, VERSION, FILE, LICENSE, FORMAT, TASK, and RECEIPT nodes connected by typed edges such as CONTAINS, HAS_FILE, LICENSED_AS, AVAILABLE_AS, SUPPORTS_TASK, and VERIFIED_BY.
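
The node and edge types above can be modeled as a small typed graph. This is a minimal sketch, not the project's actual data model: the Node and Edge classes and validate_graph are hypothetical, and because the edge list above is introduced with "such as", the real graph may define more edge types than the six shown.

```python
from dataclasses import dataclass

NODE_TYPES = {
    "DOMAIN", "CATEGORY", "DATASET", "VERSION", "FILE",
    "LICENSE", "FORMAT", "TASK", "RECEIPT",
}
# Only the edge types named in the text; the full set may be larger.
EDGE_TYPES = {
    "CONTAINS", "HAS_FILE", "LICENSED_AS",
    "AVAILABLE_AS", "SUPPORTS_TASK", "VERIFIED_BY",
}


@dataclass(frozen=True)
class Node:
    id: str
    type: str


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    type: str


def validate_graph(nodes: list[Node], edges: list[Edge]) -> list[str]:
    """Report unknown node/edge types and edges that reference missing nodes."""
    problems = []
    ids = {node.id for node in nodes}
    for node in nodes:
        if node.type not in NODE_TYPES:
            problems.append(f"unknown node type: {node.type}")
    for edge in edges:
        if edge.type not in EDGE_TYPES:
            problems.append(f"unknown edge type: {edge.type}")
        if edge.source not in ids or edge.target not in ids:
            problems.append(f"dangling edge: {edge.source} -> {edge.target}")
    return problems


nodes = [Node("ds1", "DATASET"), Node("lic1", "LICENSE")]
edges = [Edge("ds1", "lic1", "LICENSED_AS")]
print(validate_graph(nodes, edges))
```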

How to Add a Dataset

Add a registry entry, create a dataset folder under /datasets/[domain]/[dataset_id], and include manifest.json, dataset.card.md, samples, receipts, and split files where licensing allows.
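
A quick layout check before opening a pull request might look like the sketch below. The two file names come from the text above; the samples and receipts directory names, the finance domain, and the example_dataset id are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# File names taken from the text above.
REQUIRED_FILES = ("manifest.json", "dataset.card.md")
# Assumed directory names; the real layout may differ.
ASSUMED_DIRS = ("samples", "receipts")


def missing_items(dataset_dir: Path) -> list[str]:
    """List required files and assumed directories that are not present."""
    missing = [name for name in REQUIRED_FILES if not (dataset_dir / name).is_file()]
    missing += [f"{name}/" for name in ASSUMED_DIRS if not (dataset_dir / name).is_dir()]
    return missing


# Demonstrate against a throwaway folder with a hypothetical domain and id.
with tempfile.TemporaryDirectory() as root:
    dataset_dir = Path(root) / "datasets" / "finance" / "example_dataset"
    dataset_dir.mkdir(parents=True)
    (dataset_dir / "manifest.json").write_text("{}", encoding="utf-8")
    print(missing_items(dataset_dir))
```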

How to Export a Pack

Use the graph, registry, or detail page to add datasets to the pack. The pack page exports pack.manifest.json, hf_dataset_card.md, fine_tune_manifest.json, sha256_manifest.json, and README snippets.

How Receipts Work

Receipts are proof objects that describe hashes, validation runs, license checks, provenance summaries, or future Merkle proofs. Verified entries require receipt records and file-level SHA256 hashes.
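
To make the file-level hashing concrete, here is a minimal sketch that computes a SHA256 digest with Python's standard hashlib and wraps it in a receipt-like record. The receipt field names (type, file, algorithm, digest) are illustrative assumptions, not the project's receipt format.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large dataset files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_receipt(path: Path) -> dict:
    """Build a hash receipt; the field names here are hypothetical."""
    return {
        "type": "hash",
        "file": path.name,
        "algorithm": "sha256",
        "digest": sha256_file(path),
    }


def verify(path: Path, receipt: dict) -> bool:
    """Re-hash the file and compare against the recorded digest."""
    return receipt["digest"] == sha256_file(path)


with tempfile.TemporaryDirectory() as root:
    sample = Path(root) / "train.jsonl"
    sample.write_bytes(b'{"text": "hello"}\n')
    receipt = hash_receipt(sample)
    print(receipt["digest"], verify(sample, receipt))
```

Any change to the file after the receipt is written makes verify return False, which is the property a verified entry relies on.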

License Policy

Every dataset must declare a license and whether commercial use is allowed. Packs warn when gated research or attribution licenses are mixed into exports.

Download Quotas

Public metadata remains open. Production file delivery should use the Cloudflare Worker download gate, which allows up to 500 successful file downloads per email per rolling 30-day window.

Quality Foundry

The defdata Python CLI turns raw JSONL into schema-valid, deduped, graded, split, hashed, manifest-backed packages with stage receipts. Tiers are royal_jelly, honey, jelly, and propolis.
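
Two of the stages named above, deduplication and splitting, can be illustrated with a short sketch. This is not the defdata implementation: it assumes exact-duplicate removal by hashing canonical JSON and a deterministic hash-based split, and the function names and the 10 percent eval share are hypothetical.

```python
import hashlib
import json


def record_key(record: dict) -> str:
    """Hash a canonical JSON form so key order does not affect identity."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe(lines: list[str]) -> list[dict]:
    """Parse JSONL lines and keep the first occurrence of each record."""
    seen = set()
    unique = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        key = record_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def assign_split(record: dict, eval_percent: int = 10) -> str:
    """Assign a split from the record hash so reruns give the same answer."""
    bucket = int(record_key(record)[:8], 16) % 100
    return "eval" if bucket < eval_percent else "train"


raw = [
    '{"prompt": "a", "answer": "1"}',
    '{"answer": "1", "prompt": "a"}',  # same record, different key order
    '{"prompt": "b", "answer": "2"}',
]
records = dedupe(raw)
print(len(records), [assign_split(r) for r in records])
```

Hash-based assignment keeps a record in the same split across runs, which helps when the split files are themselves hashed into receipts.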

Hack Edge Reviewer

Hack is registered as node_hack_orin / worker_hack with model_lfm2_5_8b_a1b for edge-volume review. Use defdata grade --reviewer hack with the finance rubric for WACC (weighted average cost of capital) and related referee passes.

CLI

Use defendable-datasets validate, defendable-datasets hash, and defendable-datasets pack to check registry integrity, generate SHA256 receipts, and create pack manifests before opening pull requests.

License Compatibility Matrix

cc0-1.0: compatible. Keep provenance and dataset card when possible.
apache-2.0: compatible. Preserve notices where applicable.
mit: compatible. Preserve license notice.
cc-by-4.0: review. Attribution required. Keep source and license notices with exports.
defendable-community-research: gated review. Research and evaluation access is gated. Commercial training or redistribution requires explicit written approval.
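
The matrix lends itself to a simple lookup when assembling a pack. The sketch below is one possible way to surface warnings, not the pack page's actual logic; pack_warnings is a hypothetical helper, and flagging licenses that are absent from the matrix is an assumption.

```python
# Status values copied from the matrix above.
LICENSE_MATRIX = {
    "cc0-1.0": "compatible",
    "apache-2.0": "compatible",
    "mit": "compatible",
    "cc-by-4.0": "review",
    "defendable-community-research": "gated review",
}


def pack_warnings(licenses: list[str]) -> list[str]:
    """Return one warning per distinct license that is not plainly compatible."""
    warnings = []
    for license_id in sorted(set(licenses)):
        status = LICENSE_MATRIX.get(license_id)
        if status is None:
            # Assumption: unknown licenses are flagged rather than allowed.
            warnings.append(f"{license_id}: not in matrix, needs manual review")
        elif status != "compatible":
            warnings.append(f"{license_id}: {status}")
    return warnings


print(pack_warnings(["mit", "cc-by-4.0", "defendable-community-research"]))
```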

Roadmap

Object-storage dataset delivery
Hugging Face sync
S3/object storage backend
DefendableCloud member access
Dataset signing
Merkle proofs
Dataset quality evaluator
CLI: defendable-datasets validate
CLI: defendable-datasets pack
API access
Fine-tune job handoff
Model compatibility scoring
Dataset lineage graph
Dataset license compatibility checker