This paper proposes a framework called the "Inductive Bias Probe" to evaluate whether foundation models, particularly Transformer-based sequence prediction models, genuinely learn a "world model". The approach uses synthetic datasets generated according to specific physical laws or logical structures and assesses whether the model has internalized the structure those laws imply.
The most representative experiment feeds a Transformer time-series data of planets orbiting a sun and evaluates whether it has truly learned Newtonian mechanics. The model achieves high prediction accuracy, with an R² > 0.9999, but this merely demonstrates strong sequence prediction. Crucially, it fails at fine-tuning tasks that require inferring force vectors consistent with the law of gravitation. Furthermore, the force law recovered through symbolic regression bears no resemblance to Newton's law of universal gravitation. In other words, despite high prediction accuracy, the model has not internalized a world model; it appears to have simply overfit to task-specific heuristics within the data.
The framework is extended to other domains with clear state spaces, such as lattice problems and the game of Othello. A common finding across all experiments is that while prediction accuracy is high, the inductive bias toward the underlying fundamental structure is weak.
While the analysis presented in the paper is interesting, its experimental design and interpretation exhibit several significant limitations.
Newton's law of universal gravitation, a universal law of physics, operates consistently across all conditions, not just specific situations or environments. It applies not only to planetary orbits like those in the paper's solar-system data but also to scenarios such as free fall, binary systems, gravitational wells, and escape trajectories.
That is, the law of universal gravitation is not a localized rule specific to particular data or conditions; it possesses a general and deterministic structure that operates consistently across diverse scales and conditions.
However, the Transformer model used in this paper was trained solely on data from a single, restricted simulation environment: planets orbiting a sun. This environment is physically simplified, with constraints such as a fixed central mass, the absence of any forces other than gravity, and a two-dimensional coordinate system. Within this specific environment, the model could therefore predict very well, achieving R² > 0.9999.
Naturally, such performance is likely not due to the model internalizing physical laws, but rather to its overfitting to specific patterns operating within that simulation environment. In other words, the model merely learned task-specific heuristics that work well only for the given data, and it did not possess the ability to generalize universal physical structures. (This is precisely the point the paper raises as a problem.)
However, is it really valid to judge whether "LLMs learn world models" based on results derived from a model configured in this way? A model trained solely on one type of simulation (and synthetic data, not observational data) inherently induces overfitting, making it difficult to derive general laws. This suggests that the model might not have been suitable for measuring inductive bias in the first place.
The paper mentions the following about the LLM's architecture:
When processing input data, a Transformer model projects input tokens into Query (Q), Key (K), and Value (V) vectors using three main linear transformation matrices: W_q, W_k, and W_v. These projected vectors are crucial in determining the relationships between tokens, specifically which information attends to what. The core operation here is the dot product of Q and K, which quantifies how much one token "attends" to another.
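As a minimal single-head sketch in PyTorch (the dimensions and random weights here are illustrative placeholders, not the paper's actual model):

```python
import torch
import torch.nn.functional as F

d_model, d_head = 64, 16          # hypothetical sizes, not the paper's configuration
W_q = torch.randn(d_model, d_head)
W_k = torch.randn(d_model, d_head)
W_v = torch.randn(d_model, d_head)

def single_head_attention(x):
    # x: (seq_len, d_model) sequence of token embeddings
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    # The QK^T dot product quantifies how strongly each token attends to every other token
    scores = Q @ K.T / d_head ** 0.5
    weights = F.softmax(scores, dim=-1)
    # The output is an attention-weighted mixture of value vectors
    return weights @ V

x = torch.randn(10, d_model)          # a toy sequence of 10 tokens
out = single_head_attention(x)        # (10, d_head)
```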
This structure can be seen as forming a semantic space within the model, beyond mere similarity calculations between data points. Therefore, if this model truly understood and internalized the physical world, this semantic space should exhibit a certain alignment with the directions (axes) of actual physical quantities like velocity, distance, mass, and time. In other words, if specific attention heads or the principal directions (e.g., eigenvectors) of the QK space align well with particular physical quantities, it would serve as evidence that the model has learned the intrinsic structure corresponding to those quantities.
However, this paper does not analyze such internal structures (e.g., whether specific attention heads show specialized responses to elements like gravitational distance r or mass m, or whether the projection space of QK dot product is structured in physically meaningful directions). Concluding that the model lacks a world model solely based on fine-tuning failures and symbolic regression results, without a linear algebraic analysis of how the model's internal representations reflect physical concepts, requires careful consideration.
Furthermore, Transformer models employ multiple attention heads (this model uses 12), each of which can learn to attend to different types of information. This means the model forms a distributed semantic space, so it is inappropriate to evaluate the entire model's semantic structure on the basis of just one or two fine-tuning results.
In conclusion, asserting the absence of a world model solely based on the model's output without analyzing how the Transformer's internal linear transformation structure aligns with the axes of actual physical quantities is premature and necessitates more precise structural analysis. This is a crucial issue, especially for understanding how the internal representations of high-dimensional models like LLMs correspond to our physical concepts, particularly from the perspective of interpretability.
Physical laws intrinsically possess a deterministic structure. This means that when identical input conditions (e.g., mass, position, velocity) are provided, they consistently produce identical outputs (e.g., force, energy). For instance, Newton's law of universal gravitation can be expressed as a single function where, given only the masses and distance between two objects, the gravitational force between them can always be precisely calculated. Thus, laws of the physical world are composed of well-defined functional relationships that presuppose consistency and reproducibility.
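In symbols, with masses m_1 and m_2, separation r, and gravitational constant G:

$$F = G\,\frac{m_1 m_2}{r^{2}}$$

The same inputs always yield exactly the same force.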
In contrast, the Transformer-based model used in this paper was trained using a next-token prediction method. The goal of this method is to probabilistically predict the next token given a specific input sequence. For example, given previously observed positions, velocities, etc., it predicts what the next position might be. This training method fundamentally learns statistical patterns at the token level and does not necessarily require the model to output exactly one correct answer for a given state. In fact, it is a structure that allows for multiple possible outcomes for the same state, meaning the model operates probabilistically by nature.
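Concretely, the training objective is plain token-level cross-entropy; a minimal sketch (assuming a generic autoregressive model that returns next-token logits, not the paper's exact setup):

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens):
    # tokens: (batch, seq_len) discretized states (positions, velocities, ...)
    logits = model(tokens[:, :-1])    # distribution over the next token at each step
    targets = tokens[:, 1:]
    # The loss only rewards matching the empirical token distribution;
    # nothing forces a single, physically consistent answer per state.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```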
This leads to the following problems: for a given physical state, the model is never required to commit to a single deterministic output, and the token-level statistical objective exerts no pressure to recover the underlying functional relationship between inputs and outputs.
For these reasons, even if the Transformer exhibits high prediction accuracy (e.g., R² > 0.9999), it is difficult to claim that the model "understood" physical laws. This is because the objective function was not designed to internalize a world model in the first place. For a model to truly internalize well-defined structures like physical laws, such structural constraints must be explicitly imposed during the learning process, which is challenging with the conventional next-token prediction method alone.
In conclusion, the observed result in this paper—"despite high prediction accuracy, physical laws were not learned"—is more accurately seen as a structural limitation of the learning objective function rather than a limitation of the model itself.
To overcome these limitations and verify whether Transformer-based prediction models can truly internalize universal physical laws, fundamental improvements in experimental design and model structure are necessary. Here are some personal suggestions for improvements:
(1) Ensure Generalizability with Diverse Physical Scenarios: The current experiment includes only one type of simulation: the sun-planet orbit. However, universal gravitation applies consistently in various situations, such as free fall, two-body interactions (binary systems), gravitational wells, and escape velocity. To truly verify generalizability as a world model, it is necessary to evaluate whether the model can infer a consistent force law across these diverse conditions.
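A sketch of what such a cross-scenario evaluation could look like (the scenario names, data loader, and force-probing head are all hypothetical):

```python
from sklearn.metrics import r2_score

# Hypothetical cross-scenario probe: the same fine-tuned force head should
# recover forces consistently across qualitatively different settings.
SCENARIOS = ["solar_orbit", "free_fall", "binary_system", "escape_trajectory"]

def evaluate_force_probe(force_probe, load_scenario):
    # force_probe: callable mapping state trajectories to predicted force vectors
    # load_scenario: callable returning (states, true_forces) for a scenario name
    scores = {}
    for name in SCENARIOS:
        states, true_forces = load_scenario(name)
        pred_forces = force_probe(states)
        scores[name] = r2_score(true_forces, pred_forces)
    return scores  # a genuine world model should score uniformly well here
```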
(2) Analyze Layer-wise QK Subspace and Physical Quantity Basis Alignment: The Transformer's Q/K/V projection matrices form meaningful subspaces. It is necessary to verify how well these spaces align with actual physical quantity axes—such as mass, distance, and velocity—through eigenvector analysis, PCA, or weight probing techniques. If a particular attention head consistently shows sensitivity to gravitational distance r, mass m, or velocity v, it can be interpreted that this head carries meaningful physical implications.
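One possible form of this analysis (a sketch that assumes access to a head's trained W_q/W_k matrices and a hypothetical matrix phys_axes whose unit-norm columns encode physical-quantity directions in the embedding space):

```python
import numpy as np

def qk_alignment(W_q, W_k, phys_axes, top_k=5):
    # W_q, W_k: (d_model, d_head) projection matrices of one attention head
    # phys_axes: (d_model, n_quantities) unit vectors for mass, distance, velocity, ...
    M = W_q @ W_k.T                                   # bilinear form applied by the QK dot product
    # Principal directions of the (symmetrized) QK map
    eigvals, eigvecs = np.linalg.eigh((M + M.T) / 2)
    top = eigvecs[:, np.argsort(-np.abs(eigvals))[:top_k]]   # (d_model, top_k)
    # Cosine alignment between each physical axis and each principal direction
    return np.abs(phys_axes.T @ top)                  # values near 1 indicate strong alignment
```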
(3) Induce Physical Consistency through PINN-based Loss Functions: Conventional fine-tuning uses MSE or cross-entropy losses. To guide the model toward compliance with actual physical laws, a PINN (Physics-Informed Neural Network)-style objective is needed, in which the residual of the governing differential equations is added as a loss term. For example, a physics-based loss term like the following could be used:
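One plausible form, assuming the paper's simplified setting (a fixed central mass M at the origin, gravity as the only force, and accelerations estimated from the predicted trajectory, e.g., by finite differences), is

$$\mathcal{L} = \mathcal{L}_{\text{pred}} + \lambda\,\mathcal{L}_{\text{physics}}, \qquad \mathcal{L}_{\text{physics}} = \frac{1}{N}\sum_{i=1}^{N}\left\lVert \ddot{\mathbf{x}}_i + \frac{GM}{\lVert \mathbf{x}_i \rVert^{3}}\,\mathbf{x}_i \right\rVert^{2},$$

where λ controls the weight of the physics residual relative to the ordinary prediction loss.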
Such a structure can induce the model to internalize physically consistent representations in addition to prediction accuracy.
This paper is closer to answering the question, "Can general physical laws be discovered from data prediction ability alone?" Accordingly, no knowledge of general physical laws was built into the Transformer's design. If we truly want to construct a "world model", what kind of configuration would be necessary? As one answer, I propose a structural inductive bias design based on a Hybrid Basis.
The Hybrid Basis approach aims to design the Transformer's internal representation space to simultaneously reflect prior knowledge of physical laws (known physics) and newly derived structures from data (emergent structure). The core idea is to fix (freeze) parts of the Transformer's main linear transformation matrices—especially W_q, W_k, and W_v—as pre-defined physical quantity-based bases, while allowing the remaining parts to be freely learned.
For example, consider physically important quantities such as position, velocity, mass, inter-body distance, and time.
These physical quantities can be considered as orthonormal bases in a high-dimensional vector space, and their corresponding projection directions can be explicitly assigned to specific columns of W_q, W_k, and W_v. For instance, the first column of W_q could be fixed to correspond to the "velocity" physical quantity axis, and the second column to "position." Such fixation can be maintained not only during initialization but also throughout the training process by freezing these directions.
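A minimal sketch of how such column freezing could be implemented (the basis below is a random orthonormal placeholder; in practice it would be constructed from the chosen physical-quantity directions):

```python
import torch
import torch.nn as nn

d_model, d_head, n_fixed = 64, 16, 4    # hypothetical sizes; n_fixed physics columns

W_q = nn.Parameter(torch.randn(d_model, d_head) * 0.02)
with torch.no_grad():
    # phys_basis: (d_model, n_fixed) orthonormal directions for velocity, position, ...
    phys_basis = torch.linalg.qr(torch.randn(d_model, n_fixed)).Q   # placeholder basis
    W_q[:, :n_fixed] = phys_basis

# Zero out gradients on the physics columns so they stay frozen throughout training
mask = torch.ones_like(W_q)
mask[:, :n_fixed] = 0.0
W_q.register_hook(lambda grad: grad * mask)
```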
This approach would give the Transformer a hybrid structure: a frozen, physics-grounded subspace that encodes known quantities, alongside freely trained directions that can capture emergent structure from the data.
This hybrid structure offers the following advantages: the fixed axes make the model's internal representations directly interpretable in physical terms; the physics-based bases act as an explicit inductive bias toward known laws; and the remaining learnable directions preserve data-driven flexibility.
Therefore, the Hybrid Basis approach structurally embeds semantic axes into the model's representation space. It offers the potential to connect the model's inductive bias with human-understandable physical laws while also retaining data-driven flexibility.
This paper clearly demonstrates that "good prediction" does not necessarily imply a "good world model". However, the experiments presented are overly restrictive, and structural interpretation of the model's representation space is lacking. Concluding the absence of a world model solely based on fine-tuning failures or symbolic regression failures is premature; linear algebraic structural analysis must be conducted in parallel. Defining physics-based bases and quantitatively analyzing how attention subspaces align with these axes is also essential. Ultimately, the core of future research should be to design models that can internalize how observed data is governed by the world's laws, and to make these internalized structures interpretable.