Model Benchmarks

By the numbers.

How GRU 1, GRU 2, GPT 3.1, and the upcoming GPT 3.2 compare across size, language quality, William resemblance, and usability.

GRU 1 · Legacy
GRU 2 · Stable
GPT 3.1 · Live now
GPT 3.2 · Announced

Scale

How much each model has to work with — parameters and context window.

Parameters

GRU 1: 15M
GRU 2: ~40M
GPT 3.1: 163M
GPT 3.2: 163M

Context Window

GRU 1: 256 tokens
GRU 2: 512 tokens
GPT 3.1: 2048 tokens
GPT 3.2: 2048 tokens

GPT 3.2 is announced but not yet released: its scale specs are confirmed, but performance scores are not available.
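To put the scale figures above in relative terms, here is a minimal Python sketch that expresses each model's parameter count and context window as a multiple of GRU 1. The `SPECS` dictionary and the `scale_ratio` helper are hypothetical names introduced for illustration only. GRU 2's parameter count is listed as approximate (~40M), so its parameter ratio is approximate as well.

```python
# Scale specs copied from the figures above. GRU 2's parameter count is
# approximate ("~40M") in the source, so its parameter ratio is approximate.
SPECS = {
    "GRU 1": {"params_m": 15, "context_tokens": 256},
    "GRU 2": {"params_m": 40, "context_tokens": 512},
    "GPT 3.1": {"params_m": 163, "context_tokens": 2048},
    "GPT 3.2": {"params_m": 163, "context_tokens": 2048},
}


def scale_ratio(model, baseline="GRU 1"):
    """Return (parameter ratio, context ratio) of `model` relative to `baseline`."""
    m, b = SPECS[model], SPECS[baseline]
    return (
        m["params_m"] / b["params_m"],
        m["context_tokens"] / b["context_tokens"],
    )


if __name__ == "__main__":
    for name in SPECS:
        params, context = scale_ratio(name)
        print(f"{name}: {params:.1f}x parameters, {context:.0f}x context vs GRU 1")
```

These ratios describe size only. They say nothing about the performance scores, which are not yet published.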

Text Generation

Can it produce coherent, natural-sounding language? Scored 0–100 against standard NLP benchmarks. GRU 1 was never trained on general text.

GRU 1: --/100
GRU 2: --/100
GPT 3.1: --/100

William Resemblance

Does it actually sound like William? Rated 0–100 by two groups: people who know him, and William himself.

Rated by William's friends

GRU 1: --/100
GRU 2: --/100
GPT 3.1: --/100

Rated by William himself

GRU 1: --/100
GRU 2: --/100
GPT 3.1: --/100

General Usability

Is it actually good to talk to? Rated 0–10 by two groups with very different expectations.

William & friends

GRU 1: --/10
GRU 2: --/10
GPT 3.1: --/10

Strangers & third parties

GRU 1: --/10
GRU 2: --/10
GPT 3.1: --/10

What this tells you

GRU 1

--

GRU 2

--

GPT 3.1 · Live now

--

GPT 3.2 · Announced

--

Note: Scale specs (parameters, context window) are confirmed for all models. Performance scores are not yet published — they will be filled in once the rating process is complete.