Dead Neurons

Ten Papers That Built the AI We Have Today

In chronological order, this post lists ten research papers that had an outsized influence on the development of modern artificial intelligence, emphasising their cumulative contributions to current AI architectures and capabilities.

Every time you use ChatGPT, Claude, or Gemini, you’re running inference on an architecture that can be traced back to about a dozen key ideas, most published in the last decade. Strip away the hype and the corporate press releases, and modern AI is really just a stack of clever papers, each one building on the last, plus a bunch of compute. Some introduced new primitives. Others figured out how to scale existing ones. A few changed how we train models entirely.

What follows are the ten papers that, more than any others, bent the trajectory of the field. They’re ordered chronologically, because the story makes more sense that way: each breakthrough depended on what came before, and the chain of inheritance explains why modern AI looks the way it does.

2012: The Proof of Concept

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. NeurIPS.

Before 2012, neural networks were largely considered a dead end. The dominant view in machine learning was that hand-engineered features combined with algorithms like support vector machines were the path forward. Then Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered a deep convolutional neural network into the ImageNet competition and won by a margin so large it looked like a data entry error. Their network achieved a top-5 error rate of 15.3%, compared to 26.2% for the second-place entry.

AlexNet, as it came to be known, wasn’t doing anything theoretically new. Convolutional networks had existed since the 1980s. What was new was the scale: 60 million parameters trained on 1.2 million images using two GPUs. The paper demonstrated that neural networks, given sufficient data and compute, actually worked. This single result convinced funding agencies, companies, and sceptical researchers to take deep learning seriously, triggering an avalanche of investment and talent that continues today.

2013: Words as Vectors

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. NeurIPS.

Once vision had its breakthrough, natural language processing needed one too. The problem was representation: how do you turn words into something a neural network can process? Tomas Mikolov and colleagues at Google proposed a surprisingly simple solution called Word2Vec. They trained a shallow neural network to predict words from their context (or vice versa) on billions of words of text, then extracted the learned internal representations.

The result was that each word became a dense vector of a few hundred numbers, and these vectors captured semantic relationships in a way that felt almost magical. The famous example: if you took the vector for “king,” subtracted “man,” and added “woman,” you got something very close to “queen.” Suddenly anyone could download pre-trained word embeddings and use them as features for downstream tasks. Word2Vec democratised NLP in a way that hadn’t been possible before.
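
To make the arithmetic concrete, here's a minimal sketch in Python with tiny made-up embeddings. Real Word2Vec vectors have a few hundred dimensions and are learned from billions of words, but the analogy computation is exactly this:

```python
import numpy as np

# Toy, hand-picked embeddings purely for illustration; real Word2Vec vectors are
# learned from text and have a few hundred dimensions.
vectors = {
    "king":  np.array([0.80, 0.65, 0.10]),
    "queen": np.array([0.78, 0.20, 0.12]),
    "man":   np.array([0.30, 0.70, 0.05]),
    "woman": np.array([0.28, 0.25, 0.07]),
    "apple": np.array([0.05, 0.10, 0.90]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land closest to queen.
target = vectors["king"] - vectors["man"] + vectors["woman"]
candidates = [w for w in vectors if w not in {"king", "man", "woman"}]
print(max(candidates, key=lambda w: cosine(target, vectors[w])))  # queen
```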

2015: Learning to Pay Attention

Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. ICLR.

Neural machine translation in 2014 worked by encoding an entire source sentence into a single fixed-length vector, then decoding that vector into the target language. This was a brutal bottleneck: all the information in a 50-word sentence had to squeeze through a vector of perhaps 1000 numbers. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio proposed a simple fix. Instead of compressing everything into one vector, let the decoder “attend” to different parts of the encoder’s output at each step of generation.

This attention mechanism allowed the model to dynamically focus on relevant source words when producing each target word. Translation quality improved substantially, but more importantly, the paper introduced a primitive that would prove far more general than its authors likely anticipated. Attention turned out to be useful for essentially everything.
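
As a rough sketch of the mechanism (the matrices and dimensions below are illustrative, not the paper's exact parameterisation): score each encoder position against the current decoder state, normalise the scores with a softmax, and take the weighted sum of encoder states as the context vector.

```python
import numpy as np

def additive_attention(decoder_state, encoder_states, W_s, W_h, v):
    # One relevance score per source position, from a small learned scoring network.
    scores = np.tanh(decoder_state @ W_s + encoder_states @ W_h) @ v
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax: attention weights sum to 1
    context = weights @ encoder_states     # weighted sum of the encoder's outputs
    return context, weights

rng = np.random.default_rng(0)
src_len, d = 6, 8                          # toy sizes for illustration
encoder_states = rng.normal(size=(src_len, d))
decoder_state = rng.normal(size=(d,))
W_s, W_h, v = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(d,))
context, weights = additive_attention(decoder_state, encoder_states, W_s, W_h, v)
print(weights.round(2), context.shape)     # which source positions the decoder is focusing on
```

At each decoding step the weights are recomputed, so the "focus" shifts as the translation proceeds.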

2016: The Identity Shortcut

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR.

Training very deep neural networks was surprisingly hard. You’d expect that a 50-layer network could at least match a 20-layer network (after all, it could just learn to make the extra 30 layers do nothing), yet in practice deeper networks often performed worse. Kaiming He and colleagues at Microsoft Research diagnosed this as an optimisation problem, which they called degradation, and proposed a disarmingly simple solution. Instead of having each layer compute y = F(x), have it compute y = x + F(x).

This residual connection creates a “shortcut” that allows gradients to flow directly through the network without passing through every layer’s transformations. With this change, networks of over 100 layers suddenly became trainable, and a 152-layer ResNet won the ImageNet competition with a 3.57% error rate. The residual connection pattern, output = input + F(input), now appears in every modern transformer and has proven remarkably durable; recent work on “hyper-connections” and manifold-constrained variants still builds directly on this 2016 insight.
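
A minimal sketch of a residual block, with a toy two-layer transformation standing in for F:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, b1, W2, b2):
    f_x = relu(x @ W1 + b1) @ W2 + b2  # F(x): the transformation the block has to learn
    return x + f_x                      # the identity shortcut added back in

rng = np.random.default_rng(0)
d = 8                                   # illustrative width
x = rng.normal(size=(d,))
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
b1, b2 = np.zeros(d), np.zeros(d)
print(residual_block(x, W1, b1, W2, b2).shape)  # same shape as the input, as x + F(x) requires
```

If the extra layers have nothing useful to add, the block only needs to drive F(x) towards zero, which is far easier than learning an identity mapping from scratch.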

2017: Attention Is All You Need

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS.

By 2017, the standard architecture for sequence tasks combined recurrent neural networks with attention. Ashish Vaswani and colleagues at Google proposed eliminating the recurrence entirely. Their transformer architecture used only attention mechanisms (applied in parallel across all positions) plus feedforward networks. The provocative title, “Attention Is All You Need,” turned out to be more or less correct.

The key innovation was self-attention: each position in the sequence attends to every other position, computing relevance weights and aggregating information accordingly. This made the architecture embarrassingly parallel (unlike sequential RNNs), which meant it could actually use modern GPU hardware efficiently. The transformer achieved state-of-the-art results on machine translation, though at the time nobody quite realised they’d invented the architecture that would underpin the next decade of AI progress. GPT, BERT, Claude, Gemini, Llama: essentially every frontier model today is a transformer.
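
Here is a bare-bones sketch of scaled dot-product self-attention in NumPy, with toy sizes. Real transformers add multiple heads, positional information, masking, and per-layer learned projections, but the core computation is this:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # queries, keys, values for every position
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise relevance between positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the key dimension
    return weights @ V                               # each position aggregates from all others

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8                  # toy dimensions for illustration
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)        # (5, 8): one updated vector per position
```

Every position is processed in the same matrix multiplications, which is why the architecture parallelises so well on GPUs.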

2018: Pre-training Pays Off

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL.

If you have a neural network architecture that works, the question becomes: how do you train it? Jacob Devlin and colleagues at Google introduced BERT, which established what became the dominant paradigm for several years. The recipe was conceptually simple: pre-train a transformer on massive amounts of unlabelled text using a self-supervised objective (predict masked words), then fine-tune on specific downstream tasks with labelled data.

BERT used a clever training trick called masked language modelling: randomly hide 15% of the words in a sentence and train the model to predict them from context. This forced the model to learn deep bidirectional representations, since it couldn’t predict a masked word without understanding both what came before and after. BERT crushed existing benchmarks across a wide range of NLP tasks, and the pre-train-then-fine-tune paradigm became standard practice almost overnight.
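
A simplified sketch of the masking step (real BERT also replaces some of the selected tokens with random words or leaves them unchanged, which is omitted here):

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)  # hide the token from the model
            targets.append(tok)        # the training target is the original word
        else:
            masked.append(tok)
            targets.append(None)       # no loss is computed at unmasked positions
    return masked, targets

print(mask_tokens("the cat sat on the mat".split(), seed=0))
```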

2020: Scale Changes Everything

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., et al. (2020). Language Models are Few-Shot Learners. NeurIPS.

GPT-2 had shown that larger language models exhibited surprising capabilities, though at 1.5 billion parameters it was still modest by later standards. OpenAI’s GPT-3 scaled to 175 billion parameters, and something qualitatively different emerged. The model could perform tasks it had never been explicitly trained for, simply by being shown a few examples in the prompt. This “in-context learning” meant you could get the model to translate, summarise, or answer questions without any gradient updates at all.

Perhaps more importantly, GPT-3 demonstrated that new capabilities kept appearing as scale increased. A 1.3 billion parameter model couldn’t do arithmetic reliably; a 175 billion parameter model could. This suggested a clear path forward: make models bigger, train on more data, and new capabilities would appear. The paper shaped the strategic bets of every major AI lab for the next several years.
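
The "training examples" live entirely in the prompt, so in-context learning needs no code at all beyond assembling a string. A hypothetical few-shot prompt in the style of the paper's translation examples:

```python
# Hypothetical few-shot prompt: the task is specified purely by examples,
# and the model's weights are never updated.
prompt = """Translate English to French.

English: cheese
French: fromage

English: sea otter
French: loutre de mer

English: good morning
French:"""

# Sampling a completion from a sufficiently large model would typically
# yield "bonjour"; no fine-tuning, no gradient updates.
```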

2022: Making Models Actually Useful

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS.

Raw language models are trained to predict the next word, which isn’t quite the same as being helpful or harmless. GPT-3 would happily generate toxic content or confidently state falsehoods, because that’s what appears in training data. Long Ouyang and colleagues at OpenAI introduced a method called reinforcement learning from human feedback (RLHF) to bridge this gap.

The recipe involves three steps: first fine-tune on human demonstrations of good behaviour, then train a separate “reward model” to predict which outputs humans prefer, and finally optimise the language model against this reward model using reinforcement learning. The resulting InstructGPT was rated substantially higher than the base GPT-3 by human evaluators, despite being 100 times smaller. This paper made ChatGPT possible. The raw capabilities were there in GPT-3, but RLHF was what made the model actually pleasant and safe to interact with.
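
The middle step is the easiest to make concrete. The reward model sees pairs of outputs ranked by a human and is trained with a pairwise loss that pushes the preferred output's score above the rejected one. A minimal sketch, with scalar rewards standing in for the model's outputs:

```python
import numpy as np

def preference_loss(r_chosen, r_rejected):
    # -log sigmoid(r_chosen - r_rejected): small when the reward model already
    # ranks the human-preferred output higher, large when it gets the pair wrong.
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

print(preference_loss(r_chosen=1.2, r_rejected=-0.3))  # ~0.20, ranking already correct
print(preference_loss(r_chosen=-0.5, r_rejected=0.8))  # ~1.54, ranking is wrong
```

Once trained, the reward model's scalar output becomes the reward signal for the final reinforcement-learning step.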

2022: The Scaling Laws Correction

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., et al. (2022). Training Compute-Optimal Large Language Models. NeurIPS.

By 2022, the conventional wisdom was that bigger models were better and you should make them as large as your compute budget allowed. Jordan Hoffmann and colleagues at DeepMind challenged this with careful empirical analysis. They found that most models were dramatically undertrained: given a fixed compute budget, you should balance model size and training data more equally than common practice suggested.

Their 70 billion parameter Chinchilla model, trained on 1.4 trillion tokens (about 4x more data than comparable models), outperformed Gopher, a 280 billion parameter model trained on the typical amount of data. The paper included scaling laws that precisely characterised the optimal allocation of compute between model size and data. Every major training run since has been influenced by these findings; the Llama models, for instance, explicitly cite Chinchilla when justifying their training decisions.
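
A common rule-of-thumb reading of the result is roughly twenty training tokens per parameter at a fixed compute budget. Taken at face value (the paper's own fitted laws are more precise), that reproduces Chinchilla's configuration and shows how undertrained Gopher was:

```python
def compute_optimal_tokens(n_params, tokens_per_param=20):
    # Rule-of-thumb reading of the Chinchilla result: scale data in
    # proportion to model size, at roughly 20 tokens per parameter.
    return n_params * tokens_per_param

print(f"{compute_optimal_tokens(70e9) / 1e12:.1f}T tokens")   # 1.4T: Chinchilla's own setup
print(f"{compute_optimal_tokens(280e9) / 1e12:.1f}T tokens")  # 5.6T: what Gopher would have needed by this rule
```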

2025: Reasoning from Reinforcement

DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948.

The most recent entry on this list is DeepSeek-R1, which demonstrated something the field had suspected but not proven. The reasoning capabilities that made models like OpenAI’s o1 impressive could be incentivised through pure reinforcement learning, without requiring human-annotated reasoning traces. DeepSeek trained their model using only a binary reward signal (is the final answer correct?) and watched as sophisticated reasoning behaviours emerged spontaneously.

The model learned to pause and reflect, verify its own work, and try alternative approaches when stuck. These behaviours emerged from the RL process itself, not from imitating human demonstrations. DeepSeek also showed that these reasoning patterns could be distilled from large models to smaller ones, achieving better results than training small models with RL directly. The paper was released with open weights under an MIT licence, which accelerated research across the field.
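
The reward signal really is that minimal. A sketch of an outcome-only reward (exact string matching is a simplification; the paper uses rule-based checks suited to each task, such as verifying maths answers or running code tests):

```python
def outcome_reward(model_answer: str, reference_answer: str) -> float:
    # Reward depends only on whether the final answer is correct; the
    # intermediate reasoning trace earns no credit of its own.
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

print(outcome_reward("  42", "42"))  # 1.0: correct final answer
print(outcome_reward("43", "42"))    # 0.0: no partial credit for the reasoning
```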

The Pattern

Looking at these ten papers together, a few themes emerge. Architectural innovations (attention, transformers, residual connections) provide new building blocks. Scaling insights (GPT-3, Chinchilla) tell us how to use compute effectively. Training paradigms (BERT’s pre-training, RLHF, RL for reasoning) determine what capabilities we can elicit from those architectures at those scales.

What Comes Next

If the pattern holds, the near-term future is probably less about bolts from the blue and more about combining and refining existing ideas. The papers from 2025 alone introduced native sparse attention, manifold-constrained residual connections, and pure RL for reasoning, but these techniques have mostly been studied in isolation. The obvious next step is integration: what happens when you combine sparse attention with expanded residual streams with MoE feedforward layers? Do the benefits compound or conflict? Expect a wave of papers exploring these interactions.

A deeper theme is the shift toward learned, input-dependent computation. Sparse attention learns which tokens to attend to. Mixture of Experts learns which parameters to activate. Residual stream routing could learn which information pathways matter for each input. The fixed architectural choices of early transformers are gradually becoming dynamic decisions made by the model itself.
