Model Benchmarks

By the numbers.

How GRU 1, GRU 2, GPT 3.1, and the upcoming GPT 3.2 compare across size, language quality, William resemblance, and usability.

GRU 1 · Released

GRU 2 · Released

GPT 3.1 · Released

GPT 3.2 · Training

Scale

How much each model has to work with — parameters and context window.

Parameters

GRU 1 · 15M
GRU 2 · ~40M
GPT 3.1 · 163M
GPT 3.2 · 163M

Context Window

GRU 1 · 256 tokens
GRU 2 · 512 tokens
GPT 3.1 · 2048 tokens
GPT 3.2 · 2048 tokens

GPT 3.2 is in training and not yet released. Its scale specs are confirmed, but performance scores are not yet available.

Text Generation

Can it produce coherent, natural-sounding language? Scored 0–100 on the G.L.U.E. benchmark suite.

GRU 1 · 0/100
GRU 2 · 0/100
GPT 3.1 · 0/100

William Resemblance

Does it actually sound like William? Rated 0–100 by two groups: people who know him, and William himself.

Rated by William's friends

GRU 1 · [score missing]/100
GRU 2 · 52.89/100
GPT 3.1 · 13.95/100

Rated by William himself

GRU 1 · 40.79/100
GRU 2 · 41.25/100
GPT 3.1 · 7.17/100

General Usability

Is it actually good to talk to? Rated 0–10 by two groups with very different expectations.

William & friends

GRU 1 · 2.25/10
GRU 2 · 3.61/10
GPT 3.1 · 2.21/10

Strangers & third parties

GRU 1 · 3.45/10
GRU 2 · 2.13/10
GPT 3.1 · 2.04/10

Fun

Is it enjoyable to interact with? Rated 0–10. Fun tends to track novelty and unpredictability — smaller models score higher here.

GRU 1 · 3.15/10
GRU 2 · 2.89/10
GPT 3.1 · 1.57/10

Multi-metric overview

Resemblance (avg), usability (avg), and fun side-by-side. All values scaled 0–100.

GRU 1

Resemblance 43.4 · Usability 28.5 · Fun 31.5

GRU 2

Resemblance 47.1 · Usability 28.7 · Fun 28.9

GPT 3.1

Resemblance 10.6 · Usability 21.3 · Fun 15.7

GRU 1 and GRU 2 score close to each other, and both beat GPT 3.1 on every averaged user-rated metric, by the widest margin on resemblance. Resemblance and usability are averaged across rater groups.
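The overview figures can be reproduced from the per-group scores. The following is a minimal sketch, assuming the averages are unweighted means of the two rater groups and that the 0–10 ratings are brought to 0–100 by multiplying by 10. Neither rule is stated on this page, and the names used (`resemblance`, `usability`, `fun`, `overview`) are illustrative only.

```python
# Per-group ratings copied from the sections above.
# None marks a score that is not listed on this page.

# Resemblance, 0-100 scale: (William's friends, William himself)
resemblance = {
    "GRU 1": (None, 40.79),
    "GRU 2": (52.89, 41.25),
    "GPT 3.1": (13.95, 7.17),
}

# Usability, 0-10 scale: (William & friends, strangers & third parties)
usability = {
    "GRU 1": (2.25, 3.45),
    "GRU 2": (3.61, 2.13),
    "GPT 3.1": (2.21, 2.04),
}

# Fun, 0-10 scale, single rating per model
fun = {"GRU 1": 3.15, "GRU 2": 2.89, "GPT 3.1": 1.57}


def mean_or_none(values):
    """Unweighted mean (an assumption), or None if any value is missing."""
    if any(v is None for v in values):
        return None
    return sum(values) / len(values)


def overview(model):
    """Return the three overview metrics for one model on a 0-100 scale."""
    r = mean_or_none(resemblance[model])
    u = mean_or_none(usability[model])
    return {
        "resemblance": r,                          # already 0-100
        "usability": None if u is None else u * 10,  # 0-10 -> 0-100 (assumed)
        "fun": fun[model] * 10,                    # 0-10 -> 0-100 (assumed)
    }


if __name__ == "__main__":
    for model in ("GRU 1", "GRU 2", "GPT 3.1"):
        print(model, overview(model))
```

Under these assumptions, the GRU 2 and GPT 3.1 rows match the overview after rounding to one decimal place, as do GRU 1's usability and fun values. GRU 1's resemblance average cannot be recomputed this way because its friends' rating is not listed.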

Note: Resemblance, usability, and fun scores are real ratings collected from people involved with the project. Text generation (G.L.U.E.) returned 0 across all models — likely a measurement error; included for transparency. GPT 3.2 is in training and unreleased — scale specs are confirmed, performance scores are pending.
