The Middle Children of History [knowledge distillation]
This is the age of intelligence. In Fight Club, Tyler Durden says, “We are the middle children of history, with no purpose or place. No Great War. No Great Depression. Our Great War is a spiritual war… our Great Depression is our lives”. I do not think that is true anymore. We are —
Not the middle children of history. Not anymore. We are the youngest.
Have we achieved AGI? Will we achieve ASI? As I write this, those are not the questions that interest me. We have achieved Intelligence. Some level of it. Let us not debate that. What I am arguing is that once you have intelligence at some level, you can use it to automate tasks below that level; and even at the same level, you can build specialized, faster, smaller models.
So you compress. Distill knowledge, quantize weights, train a student against a teacher, attach LoRA adapters. The specifics are up to you. The resulting models will not do everything. They will do one thing very well, very cheaply, very quickly.
So we compress.
I have been experimenting with GPT to create synthetic training data: give it a task, let it label examples, and train smaller models on those examples. This works for arbitrary classification tasks. Find some models I fine-tuned for climate labelling here if you want: https://huggingface.co/echoboi/. These are based on full fine-tuning. I tried LoRA on the same tasks and it gave diminished returns, which is a known limitation: LoRA adapts only low-rank projections of selected weight matrices, which keeps parameter counts low but can limit representational capacity, especially when the target modules are restricted. I also tried starting LoRA from a singular value decomposition (SVD) initialization, which aligns the low-rank adapters with the most important subspace of the pretrained weights before training. That improved things slightly, but full fine-tuning still performed better in my tests.
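For the curious, here is a minimal sketch of that pipeline: a large model labels raw text, and the labels become training data for a small classifier. The model names, label set, prompt, and toy corpus below are placeholders for illustration, not what I actually used for the climate models.

```python
# Sketch: synthetic labels from a large model, then full fine-tuning of a small student.
from openai import OpenAI
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["climate_related", "not_climate_related"]  # hypothetical label set
client = OpenAI()

def teacher_label(text: str) -> int:
    """Ask the large model for a label and treat its answer as ground truth."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder teacher model
        messages=[{"role": "user",
                   "content": f"Label the text as one of {LABELS}. "
                              f"Reply with the label only.\n\n{text}"}],
    )
    answer = resp.choices[0].message.content.strip()
    return LABELS.index(answer) if answer in LABELS else 0  # fall back if the reply is off-list

raw_texts = ["Sea levels rose again this year.", "The match ended 2-1."]  # toy corpus
dataset = Dataset.from_dict({
    "text": raw_texts,
    "label": [teacher_label(t) for t in raw_texts],
})

# Student: a small encoder, fully fine-tuned on the synthetic labels.
student_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(student_name)
model = AutoModelForSequenceClassification.from_pretrained(student_name,
                                                           num_labels=len(LABELS))

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="student", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=dataset.map(tokenize, batched=True),
)
trainer.train()
```

And the LoRA variants I compared look roughly like this with the peft library, continuing from the sketch above and reusing `model`. The rank, alpha, and target modules are illustrative defaults rather than my exact configuration; in recent versions of peft the SVD-based initialization is exposed as the "pissa" option.

```python
# Sketch: LoRA on the same student, optionally with SVD (PiSSA) initialization.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                # rank of the low-rank update
    lora_alpha=16,
    target_modules=["q_lin", "v_lin"],  # attention projections in DistilBERT
    task_type="SEQ_CLS",
    # init_lora_weights="pissa",        # uncomment for the SVD-based initialization
)
lora_model = get_peft_model(model, lora_config)
lora_model.print_trainable_parameters()  # shows how few weights are actually trained
```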
Anyway, this is still learning only through examples generated by a higher intelligence. But what about direct weight distillation in a teacher–student framework? That is often more efficient than generating examples. We already have evidence that prompts act as transient weight perturbations in large language models. If that process could be made explicit, modularized, and saved, the model could give you a compact, specialized module that performs the task as well as the full system, but at a fraction of the size and cost. Reproducible. Faster. Cheaper. That might be the future.
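To make the contrast concrete, here is a minimal sketch of logit-level teacher–student distillation in PyTorch: instead of training on generated labels, the student matches the teacher's softened output distribution (Hinton-style knowledge distillation). The teacher, student, and batch are assumed to exist; the temperature and mixing weight are illustrative defaults, not tuned values.

```python
# Sketch: Hinton-style distillation loss on logits.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # KL divergence between softened teacher and student distributions, scaled by T^2.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Optional hard-label term, e.g. using labels from the synthetic pipeline above.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Usage inside a training step (teacher frozen, student trainable):
# with torch.no_grad():
#     teacher_logits = teacher(**batch).logits
# student_logits = student(**batch).logits
# loss = distillation_loss(student_logits, teacher_logits, batch["labels"])
# loss.backward()
```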