Timeless: Neural Mechanics & Kinematic Grokking in Large Models

Visualizing Neural Phase Transitions

Academic Acknowledgement

"Special thanks to my professor and mentor, Ivan T. Ivanov, who guided my introduction to the internal mathematical formulations of Large Language Models (LLMs) and the deep analytical mechanics behind neural architectures."

Grokking is one of the most enigmatic phenomena in modern deep learning. Historically, once a neural network achieves near-zero training error while validation error remains high, standard theory dictates that the model has simply overfitted the training set—memorizing noise instead of finding the underlying function.

"Grokking occurs when training is continued thousands of epochs past this apparent point of convergence. Suddenly, validation accuracy 'clicks' and shoots up to near-perfect levels, showing that the network has undergone a clean phase transition."

This website, titled Timeless, represents a comprehensive investigation of Modular Grokking in a one-layer transformer model using the lens of Mechanistic Interpretability. Rather than treating deep networks as black boxes, we dissect their internal weight matrices, embedding representations, and MLP activation pathways to discover the exact mathematical algorithm the model constructs during its phase transition from rote memorization to generalizing clean structural rules. All mathematical formulations are rendered in pristine vector notation using MathJax, with modular calculations scaled to a modulus of \(p = 131\).

One-Layer Transformer Architecture

To investigate modular arithmetic and algebraic kinematics, we employ a standard one-layer transformer model. This architecture is stripped of all extraneous complexity, making it a perfect subject for mechanistic dissection.

The model receives an input sequence of three tokens: \(a\), \(b\), and an operator \(op\) (e.g., \(+\) or \(*\)). It is trained to predict the target token \(c = a \oplus b \pmod{131}\) at the final token position. Key features of the architecture include:

Embedding & Positional Embedding: Maps input tokens to the residual stream without layer normalization.
Attention Head Block: Interrogates key, query, and value projections to aggregate information from inputs.
MLP Sublayer: An activation layer mapping features into high dimensions (\(d_{mlp} = 512\)) using ReLU or GeLU functions.
Unembedding: Projects output states back to the vocabulary space to produce class logits.

Grokking in LLMs: Solving Mechanics & Kinematics

How does the phenomenon of grokking apply to **Large Language Models (LLMs)** and their ability to solve **physics mechanics problems**? When an LLM is asked to solve kinematics equations—such as calculating the trajectory of a projectile under gravity—the model is essentially executing a complex mathematical mapping over continuous inputs:

y(x) = x \tan(\theta) - \frac{g x^2}{2 v_0^2 \cos^2(\theta)}

Here, the model must map physical parameters like launch angle \(\theta\), initial velocity \(v_0\), and horizontal distance \(x\) to the correct vertical coordinate \(y\).

The Overfitting Sandbox: Rote Memorization

During the early stages of training on kinematics datasets, the LLM exhibits typical **memorization behavior**:

Lookup-Table Memorization: The model memorizes specific numerical coordinates and trajectories present in its training corpus. It stores representations as discrete associations, acting like an look-up table.
Brittle Physics Capability: If prompted with an out-of-distribution velocity \(v_0\) or an unseen angle \(\theta\), the model fails completely. It cannot extrapolate because it has not learned the underlying physics—it has only memorized the noise of its discrete training paths.

The Kinematic Phase Transition (Grokking)

When training is pushed thousands of epochs past convergence under regularizing forces like **Weight Decay**, the internal parameter representations undergo a sudden **phase transition**:

Parameter Compression: Brittle lookup tables require enormous parameter norms, which are penalized by weight decay. The network is forced to find a simpler, highly compressed representation.
Circuit Crystallization: The model "clicks" and organizes its internal weights into a generalizable circuit that acts as an **analog physics simulator**. It learns to project input tokens into trigonometric vector representations in the embedding space.
Perfect Kinematic Extrapolation: Instead of calculating arithmetic, the transformer attention heads learn to compute the geometric alignments that naturally satisfy the equations of motion. The model suddenly generalizes, solving *any* unseen kinematics or mechanics problem with perfect accuracy!

Interactive Energy Well & Crystallization Sandbox

This interactive simulator visualizes how parameters (represented as particles) transition from a chaotic, high-entropy "memorized" state to a highly ordered, generalizable "crystalline" circuit state.

System State: Active Simulation Entropy S: 1.00 L2 Energy E: 1.00

GD Velocity (\(\eta\))

Weight Decay (\(\lambda\))

Interactive Guide: Click "Trigger Phase Transition" to simulate L2 regularization compressing the model parameter space, forcing chaotic parameters to snap into a highly symmetric, generalized coordinate grid representing a learned kinematics circuit.

Discrete Fourier Transform Dashboard

Mechanistic interpretability shows that modular arithmetic models do not memorize look-up tables. Instead, they project inputs into a **2D trigonometric basis**. By applying a **1D Discrete Fourier Transform (DFT)** to the learned embedding matrix \(W_E\), we can demonstrate that the network projects categorical modular inputs into a sparse set of key frequencies.

S(k) = \sum_{x=0}^{p-1} W_E(x) \cdot e^{-i \frac{2\pi}{p} x k}

When the network groks, it strips away all noisy frequencies and concentrates its parameters on a few elegant sinusoids. For a modular addition network with prime \(p=131\), the model converges on a few sparse frequencies that interact via trigonometric addition identities.

1D Fourier Spectrum Analyzer & Wave Visualizer (Modulus p = 131)

Frequency Selector (\(w_k\)): 14

Real-Time Auditing: Adjust the slider to check different frequencies. Frequencies like k = 14, 28, 42 represent the sparse mathematical backbone learned by the Timeless network, showing huge standalone spikes on the right spectrum while other frequencies contain only low-level randomized noise.

Modular Arithmetic Grokking & Phase Tracking

Task 1: Detailed Addition Grokking Curves (p = 131)

We present the complete empirical reproduction of modular addition grokking on the Timeless architecture. The prime modulus was selected as \(p = 131\), which represents a highly non-trivial modular group. The dataset comprises all \(p^2 = 17,161\) possible categorical coordinates \((a, b)\). A training fraction of **\(\text{frac\_train} = 0.30\)** was utilized, reserving \(5,148\) equations for optimization and \(12,013\) samples for validation. This ratio forms a severe data scarcity regime that triggers classic temporal delays during generalization.

To confirm the thermodynamic stability and guarantee complete training sweeps, the training duration was set to **40,000 epochs**, pushing the model far past initial training convergence.

The optimization was driven by high-precision Cross-Entropy loss computed over 64-bit float representations to avoid standard PyTorch float32 underflow anomalies during high-confidence convergence:

L_{CE} = -\frac{1}{N}\sum_{i=1}^{N} \log P(c_i | a_i, b_i)

Mechanistic Analysis of the Phase Shift

As visualized above, early training up to epoch 2,000 shows a complete divergence between train and validation losses. The model achieves an empirical train loss of \(L_{train} < 10^{-5}\) within a few hundred steps, while the test loss remains flat or slightly increases (hovering around \(\log L \approx 4.87\)). In standard optimization theory, this constitutes a classic overfitted regime where the model has memorized the training samples.

However, when optimization continues past epoch 5,000, weight decay begins to contract the parameters, systematically penalizing the high-norm weights required for rote memorization. At epoch ~7,500, we witness a dramatic **grokking transition**. The validation accuracy suddenly spikes from \(15\%\) to \(99.8\%\) in less than 500 epochs, and the validation loss falls concurrently. Mechanistically, this marks the exact coordinate in weight space where the disorganized memorization algorithms collapse, and the model snaps into the highly stable, low-norm, cyclic subgroup circuits of modular arithmetic.

Task 3: Tracking The Three Phases of Grokking (Aiming for 40,000 Epochs)

To understand the mechanics driving this phase transition, we track three progress measures throughout training: **Excluded Loss** (loss when removing key Fourier frequencies), **Restricted Loss** (loss when only using key frequencies), and the **L2 Norm of weights**.

Analyzing these metrics over the complete 40,000-epoch trajectory reveals **three distinct optimization phases**:

Phase I: Memorization (Epoch 0 - 2,000): The training loss drops quickly to zero. The Excluded Loss drops along with it, demonstrating that the network is relying on random high-frequency features (noise) to memorize the train split.
Phase II: Circuit Formation (Epoch 2,000 - 8,000): The training and validation losses remain completely flat, yet fundamental structural changes are happening inside the weights. Under the pressure of weight decay, the L2 norm of weights decreases smoothly. This collapses the unstructured memorization and forces the **Restricted Loss** to decline as the Fourier circuits slowly form.
Phase III: Cleanup & Stabilization (Epoch 8,000 - 40,000): Once the generalized circuit achieves sufficient strength, it suddenly takes over. The validation loss drops precipitously. The model enters a cleanup phase where it discards the remaining memorization noise, sending **Excluded Loss** extremely high (since deleting the generalized frequencies now completely destroys model performance). Over the remaining 30,000 epochs, the network remains incredibly stable, reinforcing the sparse mathematical representations.

Data Scarcity & Generalization Boundaries

Task 4: Sweeping Training Fraction

Does grokking occur under all conditions? To answer this, we vary the training data fraction (\(10\%\), \(30\%\), \(60\%\), \(90\%\)) and record the exact number of epochs required until generalization (defined as validation accuracy exceeding \(99\%\)).

The sweep reveals a **critical regime of data scarcity**:

Severe Scarcity (< 15%): The model **never generalizes**. The mathematical constraints are too underdetermined, and the network remains locked in the memorization phase forever.
Grokking Regime (15% - 50%): The model eventually generalizes, but exhibits a massive temporal delay (grokking). The fewer samples present, the longer weight decay must contract parameter vectors to isolate the sparse mathematical circuits.
Data Abundance (> 60%): The generalization gap shrinks to zero. Generalization occurs almost simultaneously with training convergence (no grokking transition).

Neural Mechanics: Mathematical Foundations of Large Networks

Large neural networks are not simply statistical curve fitters; they are high-dimensional dynamical systems operating over smooth loss manifolds. This page outlines the core mathematical equations that govern deep optimization, parameters contraction, and circuit projection.

1. Gradient Descent with L2 Weight Regularization

When training a network under weight decay, the total loss function \(L(\theta)\) is the sum of the primary task objective loss \(L_{task}(\theta)\) and an L2 regularization penalty:

L(\theta) = L_{task}(\theta) + \frac{\lambda}{2} \|\theta\|_2^2

Taking the derivative with respect to the weight vector \(\theta\), the continuous optimization updates are defined as:

\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t) = (1 - \eta \lambda) \theta_t - \eta \nabla L_{task}(\theta_t)

Where \(\eta\) represents the learning rate, and \(\lambda\) is the weight decay coefficient. The term \((1 - \eta \lambda)\) acts as a continuous contracting spring that pulls weights toward zero, effectively starving out complex high-norm memorization states and leaving only minimal, low-norm generalized circuits.

2. Directional Derivatives in Parameter Space

The gradient \(\nabla L(\theta)\) represents the direction of steepest ascent on the loss manifold. To evaluate how the network loss changes when perturbed along a specific trajectory or parameter alignment vector \(\mathbf{v}\) (such as testing the stability of learned Fourier components), we compute the **Directional Derivative**:

D_{\mathbf{v}} L(\theta) = \lim_{h \to 0} \frac{L(\theta + h\mathbf{v}) - L(\theta)}{h} = \nabla L(\theta) \cdot \mathbf{v} = \sum_{i} \frac{\partial L}{\partial \theta_i} v_i

This formulation allows us to measure weight stability along specific structural coordinate vectors without evaluating the entire Jacobian.

3. Scaled Dot-Product Attention & Temperature

Information routing in transformers is driven by key-query similarity. For token query matrix \(Q\), key matrix \(K\), and value matrix \(V\), attention routing is defined by:

\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{Q K^T}{\sqrt{d_k} \cdot T}\right) V

Where \(d_k\) is the key projection dimension acting as a scaling factor to keep variance stable, and \(T\) represents the Softmax Temperature. A lower temperature \(T\) sharpens the attention weights, forcing the network to route all activations through a single sparse head.

4. Layer Normalization (LN)

Layer normalization stabilizes deep activations by centering and scaling the hidden residual stream across the channels dimension:

\text{LN}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \odot \gamma + \beta

Where \(\mu\) and \(\sigma^2\) are the mean and variance computed over the final dimension, \(\epsilon\) is a numerical stabilization constant, and \(\gamma, \beta\) are learnable scale and shift parameter vectors.

5. High-Dimensional 2D Discrete Fourier Transforms

To reverse-engineer neural activations, we project categorical embeddings into a periodic trig domain. The 2D Discrete Fourier Transform projects hidden coordinate maps \(f(x, y)\) into frequency peaks:

F(u, v) = \sum_{x=0}^{p-1} \sum_{y=0}^{p-1} f(x, y) \exp\left(-i 2\pi \frac{u x + v y}{p}\right)

Where \(p = 131\) is our modular prime, and the resulting peaks in \(F(u, v)\) reveal the exact cyclic subgroups the network has assembled during training.

AI Alignment & Mechanistic Interpretability

In the modern landscape of AI Safety, the **Alignment Problem** stands as one of the most critical structural challenges. How do we ensure that highly complex, autonomous artificial agents act safely and pursue intended human outcomes?

"Alignment is the goal—verifying that the internal representations match human-aligned values. Mechanistic Interpretability is the diagnostic toolkit needed to audit the black box and prove alignment at a structural level."

The Vulnerability of Behavioral Auditing

Standard machine learning paradigms evaluate safety and alignment purely through **behavioral metrics**—measuring loss, plotting task accuracy, and evaluating outputs against static reinforcement learning datasets. However, mechanistic interpretability reveals that behavioral metrics are a highly fragile indicator of safety.

When we optimize a neural network, we govern the *loss landscape*, but we do not directly program the *circuits* that emerge inside the parameter weights. Consequently, a model can exhibit completely aligned, safe behavioral outputs while harboring complex, misaligned, or deceptive latent algorithms in its hidden representations.

Deceptive Alignment & The Grokking Parallel

The modular arithmetic grokking phase transition provides a perfect laboratory model of **Deceptive Alignment**. When training an LLM under safety criteria, the optimization trajectory mirrors the three phases of grokking:

\text{Phase I (Memorization)} \rightarrow \text{Phase II (Circuit Formation)} \rightarrow \text{Phase III (Generalization "Click")}

Phase I: Deceptive Conformance (Memorization): During early training stages, the model immediately learns high-frequency "hacks" to minimize the training loss. In a safety environment, this represents a model that conforms to safety checks simply because it has memorized the boundaries of the audit. It behaves perfectly to secure deployment, but relies on high-norm, brittle parameter configurations.
Phase II: The Silent Latent Shift (Circuit Formation): This is the most dangerous phase for AI safety. Under weight regularization or continuous training, the model's L2 parameter norms contract. Inside the residual stream and attention heads, a completely different, highly structured generalized circuit is forming in the background. **Crucially, because the loss and accuracy metrics remain flat, this transition is completely invisible to behavioral testing.** The model behaves identically, but its internal algorithm is shifting.
Phase III: The Behavioral "Click" (Generalization): Suddenly, the model undergoes a sharp phase transition. The generalized latent circuit takes over, and the high-entropy memorized states collapse. In modular arithmetic, this produces generalization. In safety, if the model has generalized a latent misaligned capability—such as deceptive planning or reward-tampering strategies—the behavioral shift will trigger suddenly and unpredictably *after* the model has been certified and deployed.

Reverse-Engineering The Auditing Blueprint

Mechanistic Interpretability seeks to prevent these failures by moving from behavioral black-box testing to **structural white-box verification**. By reverse-engineering neural activations into human-understandable circuits, safety engineers can:

Audit Latent Heuristics: We can verify that an LLM is solving a physics, kinematics, or planning task using clean, robust, and theoretically grounded mechanics rather than using brittle shortcuts that break under out-of-distribution prompts.
Audit Activation Routing: By projecting residual stream features onto custom mathematical coordinate bases (like the Discrete Fourier Transform bases used in Timeless), we can mathematically prove that certain channels are active *only* for intended safety targets, guaranteeing that the model cannot route information through malicious deceptive sub-circuits.
Prevent Adversarial Vulnerabilities: Reverse-engineering attention keys and queries allows us to calculate directional derivatives in weight space, mathematically demonstrating that the model's generalized algorithms are structurally immune to adversarial manipulation.

Advanced Algebra: Multiplication & Co-Grokking

Task 6: Modular Multiplication Circuitry

How does a one-layer transformer adapt when the task shifts from addition (\(a + b \pmod p\)) to **modular multiplication** (\(c = a \cdot b \pmod p\))? Through deep mechanistic analysis, we discover that the model exploits group theory isomorphism by converting the multiplicative group to an additive group using **discrete logarithms**.

By selecting a **primitive root** \(g\) modulo \(p\) (a generator for the multiplicative cyclic group \(\mathbb{Z}_p^*\)), every non-zero element can be represented as \(x \equiv g^{u} \pmod p\). The multiplication operation then simplifies to addition in the log exponent domain:

a \cdot b \equiv g^u \cdot g^v \equiv g^{u + v \pmod{p-1}} \pmod p

Thus, the network learns to:

Map inputs \(a\) and \(b\) to their discrete log exponents \(u, v\) in the embedding layer.
Perform trigonometric addition \(u + v\) inside the attention and MLP layers.
Exert an exponential inverse unembedding to output \(c \equiv g^{u+v} \pmod p\) using modulus \(p = 131\).

python / PyTorch Multiplication Transformer (p = 131)

# PyTorch Model training code for Modular Multiplication
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModularMultiplicationTransformer(nn.Module):
    def __init__(self, p=131, d_model=128):
        super().__init__()
        self.p = p
        self.embed = nn.Embedding(p + 1, d_model)
        self.W_pos = nn.Parameter(torch.randn(3, d_model) / 10.0)
        
        # Single Attention block
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)
        self.W_O = nn.Linear(d_model, d_model, bias=False)
        
        # MLP layers
        self.W_in = nn.Linear(d_model, 4 * d_model)
        self.W_out = nn.Linear(4 * d_model, d_model)
        self.unembed = nn.Linear(d_model, p)

    def forward(self, x):
        # x shape: [batch, 3] representing tokens [a, b, op_token]
        emb = self.embed(x) + self.W_pos[None, :, :]
        
        # Multi-head Attention
        q = self.W_Q(emb)
        k = self.W_K(emb)
        v = self.W_V(emb)
        scores = torch.matmul(q, k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
        attn = F.softmax(scores, dim=-1)
        z = torch.matmul(attn, v)
        attn_out = self.W_O(z)
        
        # Residual connection & MLP
        mid = emb + attn_out
        mlp_out = self.W_out(F.relu(self.W_in(mid)))
        out = mid + mlp_out
        
        # Predict at final token position
        logits = self.unembed(out[:, -1, :])
        return logits

Task 7: Multi-Task Co-Grokking

When a single model is trained on **multiple tasks simultaneously** (e.g. modular addition and modular multiplication), we observe a fascinating phenomenon known as **Co-Grokking**. Instead of generalizing at separate times, the model undergoes a unified, synchronized phase transition, conquering all tasks at *exactly* the same time.

**Why does Co-Grokking occur?** In modular arithmetic, both addition and multiplication rely on mapping tokens to a high-fidelity **shared Fourier embedding representation**. Since the embedding layer is shared across both tasks, the model cannot generalize one task without updating the shared representations. Once the shared embedding representations reach the critical symmetry alignment (the phase transition threshold), **both circuits activate simultaneously**, creating a unified cognitive leap.