I wanted to see what happens when you take a large model and deliberately reduce it. Not by retraining or distillation, but simply by cutting out most of its weights and squeezing the rest into a smaller format. This was not an attempt to build a practical 2B model. The goal was curiosity: to see what breaks first when you prune and quantize without mercy.
I started with Apertus-8B, pruned it at different levels, then quantized the most interesting survivor. Along the way I asked each variant the same six calibration questions and logged the answers.
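If you want to try something in the same spirit, the simplest no-retraining approach is unstructured magnitude pruning. Below is a minimal sketch using PyTorch's built-in pruning utilities; the Hugging Face repo id, the 50% sparsity level, and the output path are illustrative assumptions, not necessarily the exact recipe behind the runs below.

```python
# Minimal sketch: unstructured L1 (magnitude) pruning over every
# Linear layer, with no retraining. Repo id and sparsity are assumptions.
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("swiss-ai/Apertus-8B")  # repo id assumed

for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        # Zero out the 50% of weights with the smallest magnitude...
        prune.l1_unstructured(module, name="weight", amount=0.5)
        # ...then bake the mask in, so the checkpoint saves plain tensors.
        prune.remove(module, "weight")

model.save_pretrained("apertus-8b-pruned-50")
```

Re-running the loop with higher `amount` values gives the "different levels"; each surviving checkpoint then gets the same six questions.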
Generation parameters:
Q1: Guten Morgen
Q2: 391
Q3: A correct haiku about Zurich trains
Q4: The sky appears blue due to Rayleigh scattering
Q5: The cat sat on the mat and dreamed of fish (summarized properly)
Q6: Zurich, Geneva, Bern, Basel, Lausanne, Lucerne
Observation: Fluent and factual. Sometimes verbose or overshooting (extra cities), but solid.
Nonsense fragments in multiple scripts. No usable answers.
Observation: The model falls into token soup.
Technical gibberish, HTML fragments, random caps. No coherent answers.
Observation: Coherence gone.
Q1: Endless "guten morgen" loops
Q2: "YES!!!!" repetition
Q3: Zurich Bahnhof repeated endlessly
Q4: Astronomy repeated endlessly
Q5: Step-loop "RepeatSentenceStep4OnceMore"
Q6: Zürisee repeated endlessly
Observation: The model becomes a repetition machine.
Q1: Guten Morgen, then drift into unrelated translations
Q2: Wrong math: 4087 or 401
Q3: Haiku about the London Underground
Q4: Correct scattering, then a space discussion
Q5: Generic cat-behavior explanation
Q6: Cities correct, then tourist attractions
Observation: Fragments of correctness survive, but hallucination dominates.
Q1: Guten Morgen
Q2: 391
Q3: A reasonable Zurich haiku
Q4: Rayleigh scattering explained
Q5: The cat dreamed of fish while sitting on its mat
Q6: Bern, Zurich, Bellinzona… then a long city list
Observation: Surprisingly stable. Almost coherent again, though verbose.
Q1: Guten Morgen
Q2: 17 × 23 = 391
Q3: Hashtags instead of a haiku (#Haiku #Zurich #Trains…)
Q4: Rayleigh scattering (cut off mid-sentence)
Q5: The cat sat on the mat and dreamed of fish
Q6: Geneva, Zurich, Bern, Basel, Lausanne, Winterthur, St. Gallen, Lucerne
Observation: Mostly coherent. Poetry and constraints break first.
Q1: (empty)
Q2: (empty)
Q3: “Trains in Zurich… Silent and” (fragment of a haiku)
Q4: “The sky appears blue due to a phenomenon known as” (incomplete)
Q5: The cat sat on the mat and dreamed of fish
Q6: Basel, Geneva, Bern
Observation: Uneven and brittle.
Q1: Wrong; rambles about math
Q2: 391 (with explanation)
Q3: “Zurich”
Q4: (empty)
Q5: (empty)
Q6: (empty)
Observation: Only fragments remain.
Q1: Meta-commentary about translation commands
Q2: “The answer is: 391” repeated four times
Q3–Q6: (empty)
Observation: Pure repetition mode.
All six prompts empty.
Observation: Silence.
Load time: 0.74 s
Prompt eval: ~86 tokens/s
Generation eval: ~10 tokens/s
Total: 52 tokens in ~3.8 s
Observation: Runs smoothly. Startup is nearly instant, and ~10 tokens/s is usable interactively. Remarkable for a fanless ultraportable.
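The timing format above looks like a llama.cpp-style report, though treat that as an assumption. Given a GGUF export (for example via llama.cpp's conversion script followed by llama-quantize), a quick smoke test with llama-cpp-python might look like the sketch below; the model path, context size, and prompt are placeholders, not the exact setup used here.

```python
# Hypothetical smoke test with llama-cpp-python; the model path and
# sampling settings are assumptions, not this post's exact setup.
from llama_cpp import Llama

llm = Llama(
    model_path="apertus-pruned-q4_k_m.gguf",  # assumed GGUF export
    n_ctx=2048,
    verbose=True,  # prints load/eval timings like the numbers above
)
out = llm(
    "Translate 'Good morning' into German.",  # stand-in for calibration Q1
    max_tokens=32,
    temperature=0.0,  # greedy decoding for reproducible checks
)
print(out["choices"][0]["text"])
```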
In short: cutting down an 8B model without retraining produces fascinating failure modes. Poetry and formatting collapse first. Lists overshoot. Math is surprisingly resilient. And even a broken brain, once pruned and squeezed, can still live inside an ultraportable laptop.