Model Benchmarks
By the numbers.
How GRU 1, GRU 2, GPT 3.1, and the upcoming GPT 3.2 compare across size, language quality, William resemblance, and usability.
Scale
How much each model has to work with — parameters and context window.
Parameters
Context Window
GPT 3.2 (shown faded) is announced but not yet released. Its scale specs are confirmed, but performance scores are not available.
Text Generation
Can it produce coherent, natural-sounding language? Scored 0–100 against standard NLP benchmarks. GRU 1 was never trained on general text.
GRU 1
GRU 2
GPT 3.1
William Resemblance
Does it actually sound like William? Rated 0–100 by two groups: people who know him, and William himself.
GRU 1
GRU 2
GPT 3.1
Rated by William's friends
Rated by William himself
General Usability
Is it actually good to talk to? Rated 0–10 by two groups with very different expectations.
GRU 1
GRU 2
GPT 3.1
William & friends
Strangers & third parties
What this tells you
GRU 1
--
GRU 2
--
GPT 3.1 (live)
--
GPT 3.2 (announced)
--
Note: Scale specs (parameters, context window) are confirmed for all models. Performance scores are not yet published; they will be filled in once the rating process is complete.