Negative Sampling for Efficient Word Embedding Training

Explore how negative sampling makes training Word2Vec dramatically more efficient

Context Windows and Negative Sampling

[Interactive demo: a single training step over the sentence "the quick brown fox jumps", with a context window of size 2 and 5 negative samples. Green marks context words (positive samples); red marks random words (negative samples). The demo reports a vocabulary size of 10,000, roughly 0.05% of the computation of a full softmax, and a 20x speed increase.]
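
To make the demo concrete, here is a minimal Python sketch of how the sentence above could be turned into (center word, context word, negatives) training triples. The toy vocabulary and the uniform negative draw are illustrative assumptions; real Word2Vec samples negatives from a unigram distribution raised to the 3/4 power.

```python
import random

random.seed(0)

sentence = ["the", "quick", "brown", "fox", "jumps"]
# Toy vocabulary: the sentence plus a few unrelated words to draw negatives from.
vocab = sentence + ["table", "blue", "cat", "sky", "run"]
window_size = 2
num_negatives = 5

def training_triples(tokens, window, k):
    """Yield (center, positive context, k negative words) for one sentence."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            context = tokens[j]
            # Uniform draw for simplicity; real implementations sample from a
            # unigram^0.75 distribution and typically resample collisions with
            # the true context word.
            negatives = random.choices(vocab, k=k)
            yield center, context, negatives

for center, context, negatives in training_triples(sentence, window_size, num_negatives):
    print(f"{center:>6} -> {context:<6} negatives: {negatives}")
```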

Negative Sampling Benefits

  • Dramatically reduces computation by sampling only a few negative examples
  • Makes training on billions of words practical with limited resources
  • Produces high-quality embeddings despite simplified objective
  • Scales efficiently with vocabulary size (critical for language models)
  • Preserves semantic relationships in the resulting word vectors

Traditional Softmax Approach

For each center word, calculate a probability for every word in the vocabulary:

$P(c \mid w) = \frac{e^{v_c \cdot v_w}}{\sum_{c' \in V} e^{v_{c'} \cdot v_w}}$

This requires:

  • A dot product with every word vector in the vocabulary
  • A sum over the entire vocabulary (often 100,000+ words)
  • Computational complexity of O(V), where V is the vocabulary size

Extremely computationally expensive for large vocabularies!
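
To see where the O(V) cost comes from, here is a minimal NumPy sketch of the full-softmax probability above. The matrix names, sizes, and random initialization are illustrative assumptions, not from the original.

```python
import numpy as np

V, d = 10_000, 100                                 # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
in_vectors = rng.normal(scale=0.1, size=(V, d))    # v_w: center-word embeddings
out_vectors = rng.normal(scale=0.1, size=(V, d))   # v_c: context-word embeddings

def full_softmax_prob(center_id, context_id):
    """P(c | w): one dot product per vocabulary word, so O(V * d) per example."""
    scores = out_vectors @ in_vectors[center_id]   # shape (V,): dot with every word
    scores -= scores.max()                         # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[context_id] / exp_scores.sum()

print(full_softmax_prob(center_id=3, context_id=42))
```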

Negative Sampling Approach

For each center word, train on:

  • Actual context words (positive samples)
  • A few random words (negative samples)

The training objective for each (center word, context word) pair is to maximize:

$\log \sigma(v_c \cdot v_w) + \sum_{i=1}^{k} \mathbb{E}_{c_i \sim P_n(w)} \left[ \log \sigma(-v_{c_i} \cdot v_w) \right]$

where $\sigma$ is the sigmoid function and $P_n(w)$ is the noise distribution from which the k negative samples are drawn.

This requires:

  • Dot products with the true context word and only k negative samples (typically 5-20)
  • No summation over the entire vocabulary
  • Computational complexity of O(k), where k is the number of negative samples

Makes training on billions of words practical!
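
A minimal sketch of the per-pair objective above, under the same illustrative assumptions as the previous snippet; negatives are drawn uniformly here for brevity rather than from the unigram^0.75 noise distribution used by Word2Vec.

```python
import numpy as np

V, d, k = 10_000, 100, 5                           # vocabulary size, dimension, negatives
rng = np.random.default_rng(0)
in_vectors = rng.normal(scale=0.1, size=(V, d))    # v_w: center-word embeddings
out_vectors = rng.normal(scale=0.1, size=(V, d))   # v_c: context-word embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_objective(center_id, context_id):
    """log sigma(v_c . v_w) + sum_i log sigma(-v_{c_i} . v_w) over k sampled negatives."""
    v_w = in_vectors[center_id]
    positive = np.log(sigmoid(out_vectors[context_id] @ v_w))
    # Uniform negative draw stands in for the unigram^0.75 distribution P_n(w).
    neg_ids = rng.integers(0, V, size=k)
    negative = np.log(sigmoid(-(out_vectors[neg_ids] @ v_w))).sum()
    return positive + negative                     # only k + 1 dot products, not V

print(negative_sampling_objective(center_id=3, context_id=42))
```

Compared with the full-softmax snippet, the score computation shrinks from a dot product with all V output vectors to just k + 1 of them, which is where the speedup comes from.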

How Negative Sampling Works