The Foundation of Large Language Models: Understanding Embeddings in 5 Minutes

I recently revisited Chapter 5 of the NLP “bible”—Jurafsky & Martin’s Speech and Language Processing. This chapter focuses on Static Embeddings, such as Word2vec and GloVe.

You might ask: Doesn’t everyone use BERT and GPT now? Why bother learning “old stuff”?

Because static models are the foundation. Without a solid foundation, you can’t understand the skyscraper built on top.


1. Core Intuition: The Distributional Hypothesis

“You shall know a word by the company it keeps.” (J. R. Firth, 1957)

Here’s an example from the book: If an unfamiliar word Ongchoi frequently appears near “garlic,” “delicious,” and “sautéed,” the AI can infer it’s a vegetable (water spinach)—without ever consulting a dictionary.

Figure 5.1: In this vector space, words with similar meanings automatically cluster together. Notice how “sweet” is surrounded by “honey”, “candy”, and “chocolate”.
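To make the intuition concrete, here is a tiny sketch of the idea: count which words keep the unknown word company. The toy corpus and the spelling “ongchoi” are my own illustration, not data from the book.

```python
# A minimal sketch of the distributional hypothesis: count the words that
# appear within a small window of a target word. Toy corpus, illustrative only.
from collections import Counter

corpus = [
    "sauteed ongchoi with garlic is delicious",
    "a bowl of ongchoi leaves with salty sauce",
    "spinach sauteed with garlic over rice",
]

def context_counts(sentences, target, window=2):
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                counts.update(t for t in tokens[lo:hi] if t != target)
    return counts

print(context_counts(corpus, "ongchoi"))
# "sauteed" and "garlic" show up right next to the unknown word,
# which is exactly the signal that lets a model guess it is a food.
```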


2. Static vs Dynamic: The One-Line Distinction

Static (Word2vec): One word, one vector. The word “apple” always has the same vector, whether it refers to the fruit or the company.

Dynamic, i.e. contextual (BERT): One word, multiple vectors. The “apple” in “apple pie” and “Apple stock” gets a different vector depending on its context.

Although static models can’t handle polysemy, they’re fast, cheap, and interpretable—many RAG systems and recommendation engines still use them today.
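To see what “one word, one vector” means in practice, here is a minimal sketch using a pretrained static model. It assumes gensim and its downloadable “glove-wiki-gigaword-50” vectors, which are my tooling choice, not the book’s.

```python
# A sketch of the static/contextual distinction, assuming gensim is installed
# and can download the small GloVe model (~66 MB) on first run.
import gensim.downloader as api

static = api.load("glove-wiki-gigaword-50")  # word -> one fixed vector

# Static: "apple" maps to the same vector no matter the sentence.
v1 = static["apple"]   # ... as in "apple pie"
v2 = static["apple"]   # ... as in "Apple stock"
print((v1 == v2).all())  # True: one word, one vector

# A contextual model (e.g., BERT) would instead run each whole sentence
# through the network and emit a different vector for "apple" every time.
```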


3. Why Do We Need Embeddings?

Early sparse vectors had dimensions equal to vocabulary size (tens of thousands), with most values being zero.

The biggest problem was “orthogonality”: car and automobile appeared completely unrelated to the model because they occupied different dimensions.

Embeddings (dense vectors) solve this: each word is compressed into a few hundred dimensions, and related words such as car and automobile end up close together in the same space.
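A quick numpy sketch of the contrast (the dense vectors below are made-up toy numbers, purely for illustration):

```python
# Sparse one-hot vectors vs. dense embeddings, in a few lines of numpy.
import numpy as np

# One-hot (sparse) vectors: every word gets its own dimension.
car        = np.array([1, 0, 0])
automobile = np.array([0, 1, 0])
print(np.dot(car, automobile))  # 0 -- no overlap, so the model sees no relation

# Dense embeddings place related words on overlapping dimensions instead,
# so their dot product (and cosine similarity) is high.
car_dense        = np.array([0.8, 0.1, 0.3])
automobile_dense = np.array([0.7, 0.2, 0.3])
print(np.dot(car_dense, automobile_dense))  # ~0.67 -- clearly related
```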


4. How Do We Measure “Closeness”? Cosine Similarity

Why not just use the raw dot product or Euclidean distance? Because high-frequency words naturally have longer vectors (more co-occurrences), and that length would dominate the score regardless of meaning.

Cosine Similarity = normalized dot product. It eliminates the influence of vector length (word frequency), letting us focus purely on semantic direction.

The core formula:

cosine(v, w) = (v · w) / (|v| × |w|)
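Here is a direct numpy translation of that formula. The three vectors are named after the words in Figure 5.5, but their values are made-up toy numbers.

```python
# Cosine similarity as a normalized dot product, matching the formula above.
import numpy as np

def cosine_similarity(v, w):
    # dot product divided by the product of the lengths:
    # frequency (length) cancels out, only direction remains
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

digital     = np.array([1.0, 4.0, 5.0])
information = np.array([2.0, 8.0, 9.0])
cherry      = np.array([9.0, 1.0, 0.5])

print(cosine_similarity(digital, information))  # close to 1: small angle
print(cosine_similarity(digital, cherry))       # much smaller: different direction
```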

Figure 5.5: The angle between “Digital” and “Information” is small, indicating semantic similarity; “Cherry” points in a completely different direction.


5. The Art of Training Word2vec

Skip-gram is essentially a “fake task”: it trains a classifier to predict whether one word is likely to appear near another.

What we actually want isn’t the classifier’s predictions but its trained weights: those weights are the embeddings.

Figure 5.7: The skip-gram model pulls “apricot” and “jam” closer while pushing noise words like “Tolstoy” and “matrix” away.
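If you want to try this yourself, here is a minimal sketch using gensim’s Word2Vec class (a tooling choice on my part; the book describes the algorithm, not this library). The three-sentence corpus is obviously far too small for real training.

```python
# A minimal skip-gram training sketch with gensim. Toy corpus, illustrative only.
from gensim.models import Word2Vec

sentences = [
    ["apricot", "jam", "on", "toast"],
    ["a", "tablespoon", "of", "apricot", "jam"],
    ["apricot", "preserves", "and", "jam"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the embeddings
    window=2,         # context window size
    sg=1,             # 1 = skip-gram (0 would be CBOW)
    negative=5,       # negative sampling: 5 "noise" words per positive pair
    min_count=1,      # keep every word, since the corpus is tiny
)

# What we actually keep are the learned weights, i.e. the embeddings:
print(model.wv["apricot"][:5])                 # first few dimensions of one vector
print(model.wv.similarity("apricot", "jam"))   # similarity between two trained words
```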

Pro tip: Word2vec can’t handle words it hasn’t seen during training (the OOV problem). FastText solves this by breaking words into n-gram subwords—even new words can be represented by combining subword vectors.
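Here is a simplified sketch of that subword idea, just to show what character n-grams look like; it is not gensim’s or the original FastText implementation.

```python
# FastText's core trick, simplified: represent a word by its character n-grams,
# so an unseen word can still be assembled from subword vectors it shares
# with known words.
def char_ngrams(word, n_min=3, n_max=5):
    padded = f"<{word}>"                      # boundary markers, as FastText uses
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', ...]
# An OOV word such as "whereish" shares many of these n-grams with "where",
# so its vector can be built by summing the subword vectors it contains.
```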


6. The Other Side of the Coin: AI Bias

Embeddings don’t just learn language; they also absorb human stereotypes from the training text. A classic finding is that analogy queries can complete “man : computer programmer” with “woman : homemaker”.

Worse still, bias isn’t just reflected, it gets amplified by the model. This is the kind of “allocational harm” (unfairly allocating resources or opportunities across groups) that AI deployment must guard against.

Figure 5.9: The parallelogram model showing analogy relationships.
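For the curious, here is a sketch of the parallelogram model in action, again assuming gensim and its downloadable GloVe vectors; the same vector arithmetic that solves analogies is what researchers use to probe embeddings for bias.

```python
# The parallelogram (analogy) model via vector arithmetic, using pretrained
# GloVe vectors through gensim's downloader (a tooling assumption).
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")

# king - man + woman ~= ?  (the classic analogy; typically returns "queen")
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Bias audits run the same kind of query over occupation words and inspect
# whether the completions follow gender stereotypes.
```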


Who Needs This Knowledge Most?


From Word2vec to BERT, the underlying logic remains consistent: represent semantics with vectors, and measure similarity geometrically (for example, with cosine similarity).

Build a solid foundation, then construct the skyscraper.


Source: Jurafsky & Martin, Speech and Language Processing, Chapter 5