Negative Sampling for Efficient Word Embedding Training

Explore how negative sampling makes training Word2Vec dramatically more efficient

Context Windows and Negative Sampling

[Interactive demo: a single training step over the sentence "the quick brown fox jumps", with a context window of size 2 and 5 negative samples. Green marks context words (positive samples); red marks random words (negative samples). The demo reports a vocabulary size of 10,000, roughly 0.05% of the computation of a full softmax, and a 20x speed increase.]
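
To make the demo concrete, here is a minimal Python sketch of how the sentence above could be turned into (center word, context word, negatives) training triples. The toy vocabulary and the uniform negative draw are illustrative assumptions; real Word2Vec samples negatives from a unigram distribution raised to the 3/4 power.

```python
import random

random.seed(0)

sentence = ["the", "quick", "brown", "fox", "jumps"]
# Toy vocabulary: the sentence plus a few unrelated words to draw negatives from.
vocab = sentence + ["table", "blue", "cat", "sky", "run"]
window_size = 2
num_negatives = 5

def training_triples(tokens, window, k):
    """Yield (center, positive context, k negative words) for one sentence."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            context = tokens[j]
            # Uniform draw for simplicity; real implementations sample from a
            # unigram^0.75 distribution and typically resample collisions with
            # the true context word.
            negatives = random.choices(vocab, k=k)
            yield center, context, negatives

for center, context, negatives in training_triples(sentence, window_size, num_negatives):
    print(f"{center:>6} -> {context:<6} negatives: {negatives}")
```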

Negative Sampling Benefits

  • Dramatically reduces computation by sampling only a few negative examples
  • Makes training on billions of words practical with limited resources
  • Produces high-quality embeddings despite simplified objective
  • Scales efficiently with vocabulary size (critical for language models)
  • Preserves semantic relationships in the resulting word vectors

Traditional Softmax Approach

For each center word, calculate a probability for every word in the vocabulary:

$P(c \mid w) = \frac{e^{v_c \cdot v_w}}{\sum_{c' \in V} e^{v_{c'} \cdot v_w}}$

This requires:

  • A dot product with every word vector in the vocabulary
  • A sum over the entire vocabulary (often 100,000+ words)
  • Computational complexity of O(V), where V is the vocabulary size

Extremely computationally expensive for large vocabularies!
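
To see where the O(V) cost comes from, here is a minimal NumPy sketch of the full-softmax probability above. The matrix names, sizes, and random initialization are illustrative assumptions, not from the original.

```python
import numpy as np

V, d = 10_000, 100                                 # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
in_vectors = rng.normal(scale=0.1, size=(V, d))    # v_w: center-word embeddings
out_vectors = rng.normal(scale=0.1, size=(V, d))   # v_c: context-word embeddings

def full_softmax_prob(center_id, context_id):
    """P(c | w): one dot product per vocabulary word, so O(V * d) per example."""
    scores = out_vectors @ in_vectors[center_id]   # shape (V,): dot with every word
    scores -= scores.max()                         # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[context_id] / exp_scores.sum()

print(full_softmax_prob(center_id=3, context_id=42))
```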

Negative Sampling Approach

For each center word, train on:

  • Actual context words (positive samples)
  • A few random words (negative samples)

The training objective for each (center word, context word) pair is to maximize:

$\log \sigma(v_c \cdot v_w) + \sum_{i=1}^{k} \mathbb{E}_{c_i \sim P_n(w)} \left[ \log \sigma(-v_{c_i} \cdot v_w) \right]$

where $\sigma$ is the sigmoid function and $P_n(w)$ is the noise distribution from which the k negative samples are drawn.

This requires:

  • Dot products with the true context word and only k negative samples (typically 5-20)
  • No summation over the entire vocabulary
  • Computational complexity of O(k), where k is the number of negative samples

Makes training on billions of words practical!
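
A minimal sketch of the per-pair objective above, under the same illustrative assumptions as the previous snippet; negatives are drawn uniformly here for brevity rather than from the unigram^0.75 noise distribution used by Word2Vec.

```python
import numpy as np

V, d, k = 10_000, 100, 5                           # vocabulary size, dimension, negatives
rng = np.random.default_rng(0)
in_vectors = rng.normal(scale=0.1, size=(V, d))    # v_w: center-word embeddings
out_vectors = rng.normal(scale=0.1, size=(V, d))   # v_c: context-word embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_objective(center_id, context_id):
    """log sigma(v_c . v_w) + sum_i log sigma(-v_{c_i} . v_w) over k sampled negatives."""
    v_w = in_vectors[center_id]
    positive = np.log(sigmoid(out_vectors[context_id] @ v_w))
    # Uniform negative draw stands in for the unigram^0.75 distribution P_n(w).
    neg_ids = rng.integers(0, V, size=k)
    negative = np.log(sigmoid(-(out_vectors[neg_ids] @ v_w))).sum()
    return positive + negative                     # only k + 1 dot products, not V

print(negative_sampling_objective(center_id=3, context_id=42))
```

Compared with the full-softmax snippet, the score computation shrinks from a dot product with all V output vectors to just k + 1 of them, which is where the speedup comes from.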

How Negative Sampling Works