Module 2: Transformer Architecture

From static embeddings to dynamic attention - understanding the architecture that powers modern LLMs

Interactive Transformer Explorations

This module takes you inside the transformer architecture through hands-on visualizations. See how attention mechanisms enable dynamic focus, how position embeddings solve order problems, and how the complete transformer block processes language.

13+ interactive demos · 4 core sessions · 8+ key concepts · 0 prerequisites

2.0 The Big Picture: Generative Search Engines

Conceptual overview of transformers as generative search systems. 3 visualizations:

- Generative Search Engine Demo: see how transformers differ from traditional search engines (Tags: Concepts, Search, Generation)
- Architecture Evolution: trace the evolution from Bengio's model to modern transformers and understand the scaling advantages (Tags: Evolution, Bengio to Transformer, Scaling)
- Attention vs Knowledge Storage: an interactive comparison of the attention mechanism and FFN knowledge storage (Tags: Architecture, Concepts)

2.1 From Text to Transformer Inputs

Tokenization, knowledge storage, and the selection problem. 3 visualizations:

- Tokenization Explorer: discover how subword tokenization handles the long tail of language (Tags: Tokenization, BPE, Power Laws); a minimal BPE sketch follows this list
- FFN Knowledge Storage: see how feed-forward networks store knowledge through their expand-contract architecture (Tags: FFN, Knowledge, Architecture); a toy FFN sketch also follows this list
- Selection Problem Demo: why Bengio's fixed concatenation fails for dynamic language understanding (Tags: Selection, Context, Problems)
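
The Tokenization Explorer above is built around subword tokenization. As a rough companion, here is a minimal byte-pair-encoding (BPE) sketch, assuming the standard merge-the-most-frequent-pair procedure; the toy corpus, the `</w>` end-of-word marker, and the number of merges are illustrative choices, not taken from the demo itself.

```python
# Toy BPE sketch: repeatedly merge the most frequent adjacent symbol pair.
# Corpus and parameters are illustrative assumptions.
from collections import Counter

def bpe_merges(words, num_merges=10):
    # Start from character-level tokens; "</w>" marks a word boundary.
    corpus = [list(w) + ["</w>"] for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in corpus:
            for a, b in zip(toks, toks[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        merged = "".join(best)
        for toks in corpus:                # apply the merge in place
            i = 0
            while i < len(toks) - 1:
                if (toks[i], toks[i + 1]) == best:
                    toks[i:i + 2] = [merged]
                else:
                    i += 1
    return merges

print(bpe_merges(["low", "lower", "lowest", "newest", "widest"], num_merges=5))
```

Frequent character sequences become single tokens after a few merges, which is how subword vocabularies cover the long tail of rare words without storing every word form.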
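
The FFN Knowledge Storage demo describes an expand-contract sublayer. Below is a minimal NumPy sketch of that shape, assuming the common conventions of a 4x expansion and a GELU nonlinearity; all dimensions and weights here are illustrative.

```python
# Minimal sketch of a transformer feed-forward (FFN) sublayer: expand the
# hidden dimension (commonly ~4x), apply a nonlinearity, then contract back.
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    # x: (seq_len, d_model); W1: (d_model, 4*d_model); W2: (4*d_model, d_model)
    return gelu(x @ W1 + b1) @ W2 + b2

d_model = 8
rng = np.random.default_rng(0)
x = rng.normal(size=(3, d_model))
W1 = rng.normal(size=(d_model, 4 * d_model)) * 0.1
W2 = rng.normal(size=(4 * d_model, d_model)) * 0.1
out = ffn(x, W1, np.zeros(4 * d_model), W2, np.zeros(d_model))
print(out.shape)  # (3, 8): same shape in and out
```

The output shape matches the input, so these sublayers can be stacked; whatever the sublayer "stores" lives entirely in the weight matrices W1 and W2.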

2.2 Attention and the Transformer Block

Understanding attention mechanisms and the complete transformer block architecture. 4 visualizations:

- Attention Visualizer: an interactive exploration of attention weights and dynamic selection (Tags: Attention, Weights, Selection); a single-head attention sketch follows this list
- Position Embeddings: why attention needs position information to understand word order (Tags: Position, Order, Embeddings); a sinusoidal-embedding sketch follows this list
- Multi-Head Attention: how different heads specialize in different types of relationships (Tags: Multi-Head, Specialization, Parallel); a head-splitting sketch follows this list
- Transformer Block Builder: build and understand the complete transformer block architecture (Tags: Architecture, Builder, Complete)
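
For the Attention Visualizer, here is a minimal single-head scaled dot-product attention sketch: each query scores every key, softmax turns the scores into weights, and the output is a weighted sum of the values. The shapes and random inputs are illustrative only.

```python
# Minimal scaled dot-product attention over one head.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq, seq) pairwise relevance
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted mix of values

rng = np.random.default_rng(0)
seq, d_k = 4, 8
Q, K, V = (rng.normal(size=(seq, d_k)) for _ in range(3))
out, w = attention(Q, K, V)
print(w.round(2))  # rows are the "dynamic selection" weights per position
```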
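
For the Position Embeddings demo, here is a short sketch of the sinusoidal scheme from the original transformer paper: each position gets a distinct pattern of sines and cosines, which is added to the token embeddings so attention, otherwise order-blind, can see word order. Dimensions are illustrative.

```python
# Sinusoidal position embeddings (the original transformer recipe).
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]            # (seq, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dims: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dims: cosine
    return pe

pe = sinusoidal_positions(seq_len=6, d_model=8)
print(pe.shape)  # (6, 8): added to token embeddings before attention
```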
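
For Multi-Head Attention, the sketch below shows only the head-splitting mechanics: the model dimension is split into several smaller heads that attend independently and are then concatenated. The learned projections (W_Q, W_K, W_V, W_O) are omitted to keep it short, so this is the reshape-and-attend skeleton rather than a full implementation, and random inputs will not show the specialization the demo explores.

```python
# Multi-head attention as a reshape: split d_model into n_heads smaller
# heads, attend per head, then concatenate the head outputs.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head(Q, K, V, n_heads):
    seq, d_model = Q.shape
    d_head = d_model // n_heads
    def split(M):  # (seq, d_model) -> (n_heads, seq, d_head)
        return M.reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(Q), split(K), split(V)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # per-head scores
    out = softmax(scores) @ v                            # (n_heads, seq, d_head)
    return out.transpose(1, 0, 2).reshape(seq, d_model)  # concatenate heads

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 16)) for _ in range(3))
print(multi_head(Q, K, V, n_heads=4).shape)  # (6, 16)
```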

2.3 Training and Scaling Modern LLMs

Scaling laws and the supervised fine-tuning transformation. 3 visualizations:

- Scaling Laws Explorer: discover how AI performance improves predictably with scale (Tags: Scaling, Power Laws, Performance); a toy power-law sketch follows this list
- SFT Transformation: see how supervised fine-tuning transforms text predictors into assistants (Tags: SFT, Training, Assistant); a loss-masking sketch also follows this list
- Pre-training vs Fine-tuning: an interactive comparison of the two training phases and their effects (Tags: Pre-training, Fine-tuning, Comparison)
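
For the Scaling Laws Explorer, here is a toy power-law curve of the general form scaling-law studies fit: loss falls as a power of model size, which looks like a straight line on log-log axes. The constants below are made up for illustration and are not fitted values from any paper.

```python
# Toy scaling-law curve: L(N) = a + c / N**b, with an irreducible loss floor
# a and scaling exponent b. All constants are illustrative, not fitted.
def power_law_loss(n_params, a=1.7, b=0.3, c=400.0):
    return a + c / n_params**b

for n in [1e7, 1e8, 1e9, 1e10, 1e11]:
    print(f"{n:>10.0e} params -> loss {power_law_loss(n):.3f}")
```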
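
For the SFT Transformation demo, the sketch below shows one mechanical detail commonly used in supervised fine-tuning: the next-token objective is kept, but the loss is masked on the prompt so the model is trained only to produce the assistant response. The token strings and masking convention here are illustrative assumptions, not the course's actual training setup.

```python
# SFT data layout sketch: same next-token objective as pre-training, but loss
# is typically weighted 0 on prompt tokens and 1 on response tokens.
prompt_tokens = ["<user>", "What", "is", "2+2", "?", "<assistant>"]
response_tokens = ["4", "<eos>"]

tokens = prompt_tokens + response_tokens
loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)

for tok, m in zip(tokens, loss_mask):
    print(f"{tok:12s} loss_weight={m}")
```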