Vector Embeddings: AI Doesn’t Understand Words. It Understands Math.


This article takes a deep dive into vector embeddings, explaining how AI understands concepts by translating human language into mathematical representations, bridging the semantic gap between qualitative human communication and quantitative machine computation.


Intelligence Unbound


AIdeas is a chronicle of deep dives into AI concepts, ML theories, and practical applications. Join this growing community of practitioners and enthusiasts as we dissect complex topics, share insights, and collectively deepen our understanding of the technology shaping our future.

A deep dive into Vector Embeddings: AI Doesn’t Understand Words. It Understands Math.

LLMs don’t read text; they read lists of floating-point numbers. Here is a visual guide to the evolution, linear algebra, and code behind how AI captures human concepts.

Introduction: The Fundamental Problem

Computers are calculators. They understand numbers perfectly but are oblivious to nuance, irony, or synonyms. Humans, on the other hand, communicate almost entirely in nuance.

To build a system like ChatGPT, or even a simple “Chat with PDF” (RAG) tool, you must solve the Semantic Gap: How do you translate the fuzzy, qualitative world of human language into the precise, quantitative world of machine computation?

The answer is Vector Embeddings. An embedding is a translation layer that maps a discrete concept (like a word) to a continuous point in a high-dimensional mathematical space.

The Evolution of “Meaning”

To appreciate how modern Transformers work, we must first understand the problems with older methods.

1. The Old Way: The “Bag of Words” (Counting)

Before deep learning, computers treated language as a “Bag of Words.” To represent a vocabulary of 10,000 words, you created a vector of 10,000 numbers, almost all of which were zero.

The word “Apple” was just a 1 in a specific column.

The Problem: This is incredibly inefficient. Worse, it contains zero meaning. The mathematical distance between “Apple” and “Orange” is the exact same as the distance between “Apple” and “Carburetor.” The computer only knows they are different indices.
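A quick sketch makes this concrete. Using a hypothetical five-word vocabulary instead of 10,000, every word is a one-hot vector, and every distinct pair of words ends up exactly the same distance apart:

```python
import numpy as np

# Hypothetical tiny vocabulary; each word becomes a one-hot row.
vocab = ["apple", "orange", "carburetor", "king", "queen"]
one_hot = np.eye(len(vocab))

def dist(w1, w2):
    """Euclidean distance between two one-hot word vectors."""
    return np.linalg.norm(one_hot[vocab.index(w1)] - one_hot[vocab.index(w2)])

# Every distinct pair is sqrt(2) apart -- the geometry encodes no meaning.
print(dist("apple", "orange"))      # ~1.414
print(dist("apple", "carburetor"))  # ~1.414
```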

2. The Breakthrough: Word2Vec (Context)

Around 2013, researchers found a way to create dense vectors. Instead of 10,000 zeros, “King” became a compact list of maybe 300 numbers:

[0.12, -0.45, 0.88, ...]

How? By training a neural network on the idea that “You shall know a word by the company it keeps.”

The model looks at a sliding window of text. If “King” and “Queen” frequently appear surrounded by similar context words like “throne,” “crown,” and “ruled,” the network learns to place their mathematical representations close together.
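The sliding window itself is simple to sketch. The snippet below (pure Python, with an illustrative sentence and window size) shows the (target, context) pairs a Word2Vec-style model trains on:

```python
def context_pairs(tokens, window=2):
    """Yield (target, context) training pairs from a sliding window."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the king sat on the throne".split()
for target, ctx in context_pairs(sentence):
    print(target, "->", ctx)
```

Words that repeatedly share context words across millions of such pairs ("king" and "queen" both co-occurring with "throne") are nudged toward nearby vectors during training.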

3. The Modern Era: Transformers (Nuance)

Word2Vec had a flaw: the word “bank” had one static vector, regardless of whether you meant a “river bank” or a “financial bank.”

Modern Transformer models (like BERT and GPT) generate contextual embeddings. They look at the entire sentence at once. In a modern LLM, the vector for the word “bank” changes dynamically depending on the other words in the sentence.
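A crude numerical illustration of the idea (this is *not* real attention, just a mixing of a word's static vector with the mean of its context, over hypothetical dimensions [nature, money, place]):

```python
import numpy as np

# Toy static vectors along hypothetical axes [nature, money, place].
static = {
    "river": np.array([1.0, 0.0, 0.2]),
    "money": np.array([0.0, 1.0, 0.1]),
    "bank":  np.array([0.3, 0.3, 1.0]),  # ambiguous on its own
}

def contextual(word, sentence):
    """Blend a word's static vector with the mean of its context vectors."""
    ctx = [static[w] for w in sentence if w != word and w in static]
    return 0.5 * static[word] + 0.5 * np.mean(ctx, axis=0)

river_bank = contextual("bank", ["river", "bank"])
money_bank = contextual("bank", ["money", "bank"])
print(river_bank)  # pulled toward the "nature" dimension
print(money_bank)  # pulled toward the "money" dimension
```

Real Transformers do this mixing with learned attention weights over hundreds of dimensions, but the effect is the same: one word, different vectors in different sentences.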

The Linear Algebra of Meaning

Because these concepts are now just lists of numbers, we can perform math on them. This is where the true magic of embeddings is revealed. It leads to the most famous example in all of Natural Language Processing:

King − Man + Woman ≈ Queen

This isn’t a metaphor; it’s literal arithmetic. If you take the vector coordinate for “King,” subtract the coordinate for “Man” (effectively removing the “masculinity” component), and add the coordinate for “Woman” (adding the “femininity” component), you land almost exactly on the coordinate for “Queen.”
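You can verify the arithmetic with hand-crafted toy vectors. These 2-D coordinates along hypothetical [royalty, gender] axes are illustrative, not real Word2Vec output (which has roughly 300 learned dimensions):

```python
import numpy as np

# Toy vectors along hypothetical axes [royalty, gender].
king  = np.array([0.95,  0.90])
queen = np.array([0.95, -0.90])
man   = np.array([0.05,  0.90])
woman = np.array([0.05, -0.90])

result = king - man + woman
print(result)                      # [0.95, -0.9]
print(np.allclose(result, queen))  # True: we landed on "queen"
```

Subtracting "man" cancels the gender component of "king" while leaving royalty intact; adding "woman" restores the opposite gender, and the result coincides with "queen".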

The Ruler (Cosine Similarity)

Now, how do we use this for something practical like search? If a user queries for “Puppy,” how does the AI know to return a document about “Dog Health”?

It needs a ruler to measure distance in this multi-dimensional space. We don’t use standard “as-the-crow-flies” distance (Euclidean distance) because the length of a document’s vector can vary based on how many words it has.

Why Euclidean Distance Fails

Your first instinct might be high-school geometry distance (Euclidean distance).

The problem is that in high-dimensional spaces, magnitude (the length of the arrow) can be misleading. A long document about apples might have a very long vector, while a short sentence about apples has a short vector. They point in the same direction (same meaning), but the Euclidean distance between their tips is huge.
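To see this numerically, take two vectors pointing in the same direction but with different magnitudes (a hypothetical "long document" modeled as a scaled-up "short sentence"):

```python
import numpy as np

short_doc = np.array([0.2, 0.5, 0.1])   # short sentence about apples
long_doc  = 10 * short_doc              # long document, same topic

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(np.linalg.norm(long_doc - short_doc))    # Euclidean distance: large
print(cosine_similarity(short_doc, long_doc))  # 1.0: identical meaning
```

Euclidean distance declares these two texts far apart even though they mean the same thing; the angle between them is zero.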

Enter Cosine Similarity

Instead, we measure the angle between two vectors. This metric is called Cosine Similarity.

This is the core mechanism of semantic search. The AI calculates the cosine similarity between your query’s vector and the vectors of every document in its database, then returns the ones with the highest scores.

The Code (From toy example to reality)

Let’s solidify this with Python. We will start with a manual, low-dimensional example to see the math work, and then look at what real-world embedding data looks like.

A Toy 3D Search Engine

We will use numpy to build a semantic search engine in just a few lines of code. We will define a hypothetical 3D concept space:

[Is_Animal, Is_Domesticated, Has_Wheels]
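The article's original snippet is not reproduced here, but a minimal numpy sketch of that engine looks like the following. The specific coordinate values are illustrative assumptions; "Puppy" is the query:

```python
import numpy as np

# Hypothetical coordinates in [Is_Animal, Is_Domesticated, Has_Wheels].
docs = {
    "Dog":   np.array([1.0, 1.0, 0.0]),
    "Wolf":  np.array([1.0, 0.1, 0.0]),   # animal, but not domesticated
    "Truck": np.array([0.0, 0.0, 1.0]),
}
query = np.array([0.9, 1.0, 0.0])  # "Puppy": animal, domesticated

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Rank every document by the angle between it and the query.
ranked = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]),
                reverse=True)
for name in ranked:
    print(f"{name}: {cosine_similarity(query, docs[name]):.3f}")
```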

The math successfully captured our intuition. “Dog” is the closest match. “Wolf” is related (it’s an animal), but the angle is wider because the “domesticated” dimension doesn’t align. “Truck” is mathematically irrelevant.

Real-World Vectors (1536 Dimensions)

What do these vectors look like in production? They aren’t nice, readable numbers like 0.9. They are a dense block of abstract floats.

Here is a snippet of Python code using langchain and OpenAI to fetch a real vector for the sentence "Hello world".

Note: You need an OPENAI_API_KEY set in your environment to run this.
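The original snippet is not shown here; the sketch below is a reconstruction that assumes the `langchain-openai` package and OpenAI's `text-embedding-3-small` model, which returns 1,536-dimensional vectors:

```python
# Reconstruction sketch; requires `pip install langchain-openai`
# and an OPENAI_API_KEY set in the environment.
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector = embeddings.embed_query("Hello world")

print(len(vector))  # 1536 dimensions
print(vector[:5])   # the first few abstract floats
```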

That list of 1,536 numbers is how the AI “understands” the concept of “Hello world”. Every single piece of text you send to a RAG system is converted into one of these lists before any searching happens.

Conclusion: The Foundation of Applied AI

Vector embeddings are the fundamental data structure of modern AI. They are the bridge that connects the fuzzy, qualitative world of human language to the precise, quantitative world of machines.

Here is a summary of what we’ve covered:

Bag of Words: sparse one-hot vectors that encode word identity but zero meaning.

Word2Vec: dense vectors learned from sliding context windows, placing related words close together.

Transformers: contextual embeddings that change dynamically with the surrounding sentence.

Vector arithmetic: relationships like King − Man + Woman ≈ Queen are literal math on coordinates.

Cosine similarity: measuring the angle, not the distance, between vectors to power semantic search.

Understanding these core concepts is the difference between just using an AI library and truly understanding how to build and debug intelligent applications.
