
how-llms-work

Walk through the LLM's lifecycle. Click any node to have ol' Claude explain it (tersely I hope).

Web crawl: Common Crawl + archives
Books: Long-form prose
Code: GitHub, package repos
Filter & dedup: Quality, NSFW, near-dup
Raw corpus: Trillions of tokens

Hundreds of TB → trillions of tokens after cleaning

The corpus a model is trained on quietly determines almost everything about it.
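To make the "Filter & dedup" node a little more concrete, here is a minimal sketch in Python. It is a toy under stated assumptions, not how any particular lab does it: the function names (`passes_quality`, `filter_and_dedup`), the 0.8 alphabetic-ratio cutoff, the 5-word minimum, and the 0.7 Jaccard threshold are all made up for illustration. Production pipelines typically use approximate methods such as MinHash rather than comparing every pair of documents, and the NSFW filter is left out entirely.

```python
import re


def shingles(text, n=3):
    """Return the set of lowercased word n-grams (shingles) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def passes_quality(text, min_words=5, min_alpha_ratio=0.8):
    """Hypothetical quality heuristic: enough words, mostly letters and spaces."""
    if len(text.split()) < min_words:
        return False
    alpha = sum(ch.isalpha() or ch.isspace() for ch in text)
    return alpha / len(text) >= min_alpha_ratio


def filter_and_dedup(docs, threshold=0.7):
    """Keep documents that pass the quality check and are not near-duplicates
    of a document already kept. Pairwise comparison is O(n^2), fine for a demo."""
    kept, kept_shingles = [], []
    for doc in docs:
        if not passes_quality(doc):
            continue
        s = shingles(doc)
        if any(jaccard(s, t) >= threshold for t in kept_shingles):
            continue  # near-duplicate of something we already have
        kept.append(doc)
        kept_shingles.append(s)
    return kept
```

Calling `filter_and_dedup` on a list of strings returns the survivors in their original order: documents that are too short or mostly symbols are dropped by the quality check, and a document that shares most of its 3-word shingles with an earlier one is dropped as a near-duplicate.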
