I wanted to see what happens when you take a large model and deliberately reduce it. Not by retraining or distillation, but simply by cutting out most of its weights and squeezing the rest into a smaller format. This was not an attempt to build a practical 2B model. The goal was curiosity: to see what breaks first when you prune and quantize without mercy.
I started with Apertus-8B, pruned it at different levels, then quantized the most interesting survivor. Along the way I asked each variant the same six calibration questions and logged the answers.
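If you want to try something in the same spirit, the simplest no-retraining approach is unstructured magnitude pruning. Below is a minimal sketch using PyTorch's built-in pruning utilities; the Hugging Face repo id, the 50% sparsity level, and the output path are illustrative assumptions, not necessarily the exact recipe behind the runs below.

```python
# Minimal sketch: unstructured L1 (magnitude) pruning over every
# Linear layer, with no retraining. Repo id and sparsity are assumptions.
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("swiss-ai/Apertus-8B")  # repo id assumed

for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        # Zero out the 50% of weights with the smallest magnitude...
        prune.l1_unstructured(module, name="weight", amount=0.5)
        # ...then bake the mask in, so the checkpoint saves plain tensors.
        prune.remove(module, "weight")

model.save_pretrained("apertus-8b-pruned-50")
```

Re-running the loop with higher `amount` values gives the "different levels"; each surviving checkpoint then gets the same six questions.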
Generation parameters:
Q1: Guten Morgen
Q2: 391
Q3: A correct haiku about Zurich trains
Q4: The sky appears blue due to Rayleigh scattering
Q5: The cat sat on the mat and dreamed of fish (summarized properly)
Q6: Zurich, Geneva, Bern, Basel, Lausanne, Lucerne
Observation: Fluent and factual. Sometimes verbose or overshooting (extra cities), but solid.
Nonsense fragments in multiple scripts. No usable answers.
Observation: The model falls into token soup.
Technical gibberish, HTML fragments, random caps. No coherent answers.
Observation: Coherence gone.
Q1: Endless "guten morgen" loops
Q2: "YES!!!!" repetition
Q3: Zurich Bahnhof repeated endlessly
Q4: Astronomy repeated endlessly
Q5: Step-loop "RepeatSentenceStep4OnceMore"
Q6: Zürisee repeated endlessly
Observation: The model becomes a repetition machine.
Q1: Guten Morgen, then drift into unrelated translations
Q2: Wrong math: 4087 or 401
Q3: Haiku about the London Underground
Q4: Correct scattering, then a space discussion
Q5: Generic cat-behavior explanation
Q6: Cities correct, then tourist attractions
Observation: Fragments of correctness survive, but hallucination dominates.
Q1: Guten Morgen
Q2: 391
Q3: A reasonable Zurich haiku
Q4: Rayleigh scattering explained
Q5: The cat dreamed of fish while sitting on its mat
Q6: Bern, Zurich, Bellinzona… then a long city list
Observation: Surprisingly stable. Almost coherent again, though verbose.
Q1: Guten Morgen
Q2: 17 × 23 = 391
Q3: Hashtags instead of a haiku (#Haiku #Zurich #Trains…)
Q4: Rayleigh scattering (cut off mid-sentence)
Q5: The cat sat on the mat and dreamed of fish
Q6: Geneva, Zurich, Bern, Basel, Lausanne, Winterthur, St. Gallen, Lucerne
Observation: Mostly coherent. Poetry and constraints break first.
Q1: (empty)
Q2: (empty)
Q3: “Trains in Zurich… Silent and” (fragment of a haiku)
Q4: “The sky appears blue due to a phenomenon known as” (incomplete)
Q5: The cat sat on the mat and dreamed of fish
Q6: Basel, Geneva, Bern
Observation: Uneven and brittle.
Q1: Wrong; rambles about math
Q2: 391 (with explanation)
Q3: “Zurich”
Q4: (empty)
Q5: (empty)
Q6: (empty)
Observation: Only fragments remain.
Q1: Meta-commentary about translation commands
Q2: “The answer is: 391” repeated four times
Q3–Q6: (empty)
Observation: Pure repetition mode.
All six prompts empty.
Observation: Silence.
Load time: 0.74 s
Prompt eval: ~86 tokens/s
Generation eval: ~10 tokens/s
Total: 52 tokens in ~3.8 s
Observation: Runs smoothly. Startup is nearly instant, and ~10 tokens/s is usable interactively. Remarkable for a fanless ultraportable.
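The timing format above looks like a llama.cpp-style report, though treat that as an assumption. Given a GGUF export (for example via llama.cpp's conversion script followed by llama-quantize), a quick smoke test with llama-cpp-python might look like the sketch below; the model path, context size, and prompt are placeholders, not the exact setup used here.

```python
# Hypothetical smoke test with llama-cpp-python; the model path and
# sampling settings are assumptions, not this post's exact setup.
from llama_cpp import Llama

llm = Llama(
    model_path="apertus-pruned-q4_k_m.gguf",  # assumed GGUF export
    n_ctx=2048,
    verbose=True,  # prints load/eval timings like the numbers above
)
out = llm(
    "Translate 'Good morning' into German.",  # stand-in for calibration Q1
    max_tokens=32,
    temperature=0.0,  # greedy decoding for reproducible checks
)
print(out["choices"][0]["text"])
```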
In short: cutting down an 8B model without retraining produces fascinating failure modes. Poetry and formatting collapse first. Lists overshoot. Math is surprisingly resilient. And even a broken brain, once pruned and squeezed, can still live inside an ultraportable laptop.