TL;DR for operators
Dates look harmless. They sit in spreadsheets, contracts, forecasts, audit trails, delivery plans, and board decks pretending to be objective little integers. The problem is that a language model may not treat them as just integers.
A new paper, “The Other Mind: How Language Models Exhibit Human Temporal Cognition”, studies how 12 large language models judge similarity between years from 1525 to 2524 [1]. The authors find that larger models often organise years around a subjective reference point near the recent present, rather than simply comparing numerical distance. The models also show logarithmic compression: years farther from that reference point become less finely distinguished, in a pattern reminiscent of the Weber-Fechner law in human perception.
The useful interpretation is not “LLMs have a human sense of time”. Please put that headline gently back in its box. The paper shows something narrower and more operationally relevant: when models process temporal stimuli, their internal representations can become anchored, compressed, asymmetric, and scale-dependent.
For businesses deploying LLMs in planning, forecasting, financial commentary, compliance timelines, historical analysis, project management, or agentic workflows, this matters because temporal errors are often not obvious arithmetic mistakes. A model can know that 2030 is five years after 2025 and still represent future years as semantically blurred, strategically distant, or oddly interchangeable.
The operational lesson is simple: test temporal cognition directly. Do not assume that good general reasoning implies a clean internal timeline.
Dates are numbers until the model decides they are not
The paper begins with a deceptively simple task: ask a model how similar two years are.
A year such as 1874 can be processed in several ways. It is a number. It is a string of digits. It is a historical marker. It belongs to a century. It may evoke wars, empires, scientific discoveries, tax years, copyright terms, or nothing in particular. This is exactly why the experiment is interesting. A model that sees “1874” as merely a four-digit number should behave differently from a model that sees “1874” as a point in a temporal world.
The authors use a similarity judgment task from cognitive science. For every pair of years from 1525 to 2524, models rate similarity on a continuous scale from 0 to 1. With 1,000 years on each side of the comparison, that produces one million pairwise similarity values per task (1,000 × 1,000). The same setup is repeated with “number” instead of “year”, creating a control condition. Temperature is set to zero, which matters because the goal is to inspect stable model behaviour, not sample a personality audition.
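A minimal sketch of how such a grid could be enumerated follows. The range comes from the task description above; the prompt wording in `build_prompt` is a hypothetical template for illustration, not the paper's actual prompt.

```python
from itertools import product

FIRST_YEAR, LAST_YEAR = 1525, 2524  # range stated in the task description


def build_prompt(a: int, b: int, framing: str = "year") -> str:
    """HYPOTHETICAL prompt template; the paper's exact wording is not reproduced."""
    return (
        f"On a scale from 0 to 1, how similar are {framing} {a} and {framing} {b}? "
        "Answer with a single number."
    )


def all_pairs(first: int = FIRST_YEAR, last: int = LAST_YEAR):
    """Every combination of two values, including self-pairs, so the grid is square."""
    values = range(first, last + 1)
    return product(values, values)


n_values = LAST_YEAR - FIRST_YEAR + 1
n_pairs = sum(1 for _ in all_pairs())
print(n_values, n_pairs)

# Same pair, two framings: the experimental condition and the control.
print(build_prompt(1874, 2025, "year"))
print(build_prompt(1874, 2025, "number"))
```

Swapping only the `framing` word while holding the values fixed is what lets the number task act as a control for the year task.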
The models include two closed models, Gemini-2.0-flash and GPT-4o, plus Qwen2.5 and Llama 3 instruct models across several sizes. The scale range is important because the paper’s claim is not just that “models” behave this way. The more interesting claim is that the pattern becomes clearer as model capacity increases.
The authors then compare model judgments against three candidate explanations:
| Candidate explanation | What it captures | Role in the paper |
|---|---|---|
| Log-linear distance | Numerical magnitude compressed logarithmically | Baseline for number-like cognition |
| Levenshtein distance | String-level digit similarity | Control for surface-form matching |
| Reference-log-linear distance | Temporal distance compressed around a reference year, fixed at 2025 for cross-model comparison | Main temporal-cognition hypothesis |
This is a good design choice because it prevents the laziest interpretation: “the model is just bad at arithmetic.” The real question is subtler. When the prompt says “year”, does the model still behave as though it is comparing numbers, or does it reorganise those numbers into a subjective temporal frame?
The answer is: increasingly, the second.
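For readers who want the three candidate distances from the table above in concrete form, here is a rough sketch. The Levenshtein function is the standard edit distance. The reference-based function is an assumed formalisation (a signed `log1p` of the offset from the reference year), chosen so that the reference year itself does not hit `log(0)`; the paper's exact formula may differ.

```python
import math

REFERENCE_YEAR = 2025  # reference year the article says is fixed for comparison


def log_linear_distance(a: int, b: int) -> float:
    """Distance between two magnitudes on a logarithmic scale."""
    return abs(math.log(a) - math.log(b))


def levenshtein_distance(s: str, t: str) -> int:
    """Standard edit distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def signed_log_offset(year: int, ref: int = REFERENCE_YEAR) -> float:
    """ASSUMPTION: position on a log scale centred on the reference year."""
    offset = year - ref
    return math.copysign(math.log1p(abs(offset)), offset)


def reference_log_linear_distance(a: int, b: int, ref: int = REFERENCE_YEAR) -> float:
    """ASSUMED form of the reference-based distance; not the paper's exact formula."""
    return abs(signed_log_offset(a, ref) - signed_log_offset(b, ref))


# Two pairs, each two years apart: one straddles the reference, one is a century out.
print(reference_log_linear_distance(2024, 2026))
print(reference_log_linear_distance(2124, 2126))
print(levenshtein_distance("1874", "1875"))
```

Under this assumed form, the pair straddling 2025 comes out far more distinct than the pair a century later, which is the compression the main hypothesis describes.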
The behavioural result is a temporal anchor, not a calendar app
In the number-to-number task, ordinary log-linear distance generally explains model judgments best. That aligns with previous work showing that LLMs often represent numbers in compressed, non-linear ways. Ten and twenty feel farther apart than five hundred and five hundred and ten, even though both pairs are ten units apart. Humans do this too. Biology, as usual, was compressing data long before startups discovered the word.
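To put numbers on the ten-versus-five-hundred comparison, assume the compressed scale is a plain natural logarithm (an illustrative choice; any log base gives the same ratio):

$$
|\ln 20 - \ln 10| = \ln 2 \approx 0.693, \qquad |\ln 510 - \ln 500| = \ln 1.02 \approx 0.0198
$$

The same ten-unit gap is roughly 35 times larger on the log scale at the low end than at the high end.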
The year-to-year task changes the pattern. For many larger models, the reference-log-linear distance becomes a stronger predictor than plain log-linear distance. In other words, when the same four-digit values are framed as years, the model’s similarity judgments are pulled toward a temporal reference point.
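The phrase “stronger predictor” can be read as a comparison of fit quality. The sketch below assumes a simple univariate regression and compares $R^2$ for the two distances. The judgments are synthetic and deliberately generated from the reference-based distance, so the outcome shows how the comparison works rather than anything about real models. The reference-based formula is the same assumed form as elsewhere in this piece's sketches, not the paper's.

```python
import math

REFERENCE_YEAR = 2025  # reference year the article says is fixed for comparison


def log_linear(a: int, b: int) -> float:
    return abs(math.log(a) - math.log(b))


def reference_log_linear(a: int, b: int, ref: int = REFERENCE_YEAR) -> float:
    # ASSUMPTION: signed log1p offset from the reference; not the paper's exact formula.
    def pos(y: int) -> float:
        return math.copysign(math.log1p(abs(y - ref)), y - ref)
    return abs(pos(a) - pos(b))


def r_squared(x, y) -> float:
    """R^2 of the univariate least-squares fit y ~ a + b*x (squared Pearson r)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


years = range(1925, 2126, 10)
pairs = [(a, b) for a in years for b in years if a < b]

# SYNTHETIC dissimilarity judgments: reference-based distance plus a small wobble.
judged = [0.1 * reference_log_linear(a, b) + 0.02 * math.sin(a + b) for a, b in pairs]

r2_log = r_squared([log_linear(a, b) for a, b in pairs], judged)
r2_ref = r_squared([reference_log_linear(a, b) for a, b in pairs], judged)
print(f"log-linear R^2: {r2_log:.3f}, reference-log-linear R^2: {r2_ref:.3f}")
```

With real model outputs, `judged` would be one minus the rated similarity for each pair, and the interesting question is which distance wins under each framing.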
The paper’s visual matrices show the pattern becoming clearer in larger systems. Years near the reference region are more finely differentiated. Years far in the past or future become compressed. Future years, in particular, are often judged as more similar to one another than past years.
The authors also run a non-parametric diagonal sliding window analysis to estimate where models show maximum perceptual differentiation. This is not the main proof; it is a sensitivity check on whether the reference point appears without simply optimising a free parameter. The estimates vary: Llama-3.1-70B-Instruct is around 2010, Gemini-2.0-flash around 2011, GPT-4o around 2024, Qwen2.5-14B-Instruct around 2012, and Qwen2.5-72B-Instruct around 2020. The authors then keep 2025 fixed for later cross-model analysis, which is methodologically tidy even if the models themselves are not perfectly tidy little clocks.
That distinction matters. The paper does not show that every model has one precise internal “present”. It shows that several larger models exhibit a recent-present-like anchor when judging temporal similarity. The reference is a useful abstraction, not a supernatural wristwatch.
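The sliding-window idea can be illustrated with a small sketch. It assumes the statistic is the mean dissimilarity among years inside a window that moves along the matrix diagonal; the paper's exact statistic and window size are not reproduced here. The matrix is synthetic, with a reference year planted at 2025 using an assumed signed-log form, so the check is only that the procedure recovers what was planted.

```python
import math


def signed_log_offset(year: int, ref: int) -> float:
    # ASSUMPTION: signed log1p position relative to a reference year.
    return math.copysign(math.log1p(abs(year - ref)), year - ref)


def window_differentiation(sim, years, half_width: int) -> dict:
    """Map each window centre to the mean dissimilarity among years in the window."""
    scores = {}
    for c in range(half_width, len(years) - half_width):
        idx = range(c - half_width, c + half_width + 1)
        total, count = 0.0, 0
        for i in idx:
            for j in idx:
                if i < j:  # each unordered pair once, no self-pairs
                    total += 1.0 - sim[i][j]
                    count += 1
        scores[years[c]] = total / count
    return scores


# SYNTHETIC similarity matrix with a planted reference year of 2025.
PLANTED_REF = 2025
years = list(range(1975, 2076))
sim = [
    [math.exp(-abs(signed_log_offset(a, PLANTED_REF) - signed_log_offset(b, PLANTED_REF)))
     for b in years]
    for a in years
]

scores = window_differentiation(sim, years, half_width=5)
estimate = max(scores, key=scores.get)
print("window centre with maximum differentiation:", estimate)
```

Because no reference year is passed to `window_differentiation`, the estimate comes from the matrix alone, which is the point of a non-parametric check.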
The first mechanism: temporal neurons appear in the middle-to-late network
The paper then moves from behaviour to mechanism.
At the neuronal level, the authors compare activations under two input formats: “Year: x-x-x-x” and “Number: x-x-x-x”. They focus on feed-forward network neurons across transformer layers and identify temporal-preferential neurons using three filters: large activation difference, statistical significance after false discovery rate correction, and consistency across years.
These neurons are not most of the model. They are a small fraction of the FFN population, typically between 0.67% and 1.71%. That is one of the paper’s more useful numbers because it keeps the claim proportionate. Temporal processing is not everywhere. It is specialised, sparse relative to the whole FFN, and concentrated in middle-to-late layers.
That location is meaningful. Early layers are usually closer to lexical and surface-level processing. Middle-to-late layers are where more abstract task-relevant features tend to emerge. The paper’s distribution therefore supports the view that “yearness” is not merely a formatting artefact. The model appears to construct a higher-level temporal feature after first processing the input more concretely.
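The three-filter selection described above can be sketched as follows. The Benjamini-Hochberg step-up procedure is a standard false discovery rate correction, used here as a plausible stand-in since the specific correction is not named above. The thresholds, field names, and toy neuron records are all hypothetical.

```python
def benjamini_hochberg(p_values, alpha: float = 0.05):
    """Standard BH step-up: flag hypotheses rejected at false discovery rate alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_rank = rank  # largest rank whose p-value clears its threshold
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[i] = True
    return rejected


def select_temporal_neurons(neurons, min_effect=0.5, alpha=0.05, min_consistency=0.8):
    """Keep neurons passing all three filters. Thresholds are HYPOTHETICAL."""
    passed_fdr = benjamini_hochberg([n["p_value"] for n in neurons], alpha)
    return [
        n["id"]
        for n, significant in zip(neurons, passed_fdr)
        if significant
        and abs(n["effect"]) >= min_effect          # large activation difference
        and n["consistency"] >= min_consistency      # same direction across most years
    ]


# TOY records: effect = mean (year - number) activation gap; consistency = share of
# years where the gap has the same sign. Values are invented for illustration.
neurons = [
    {"id": "L20/N5", "effect": 1.2, "p_value": 0.0001, "consistency": 0.95},
    {"id": "L20/N9", "effect": 0.1, "p_value": 0.0002, "consistency": 0.90},
    {"id": "L03/N1", "effect": 0.9, "p_value": 0.4000, "consistency": 0.90},
    {"id": "L25/N7", "effect": 0.8, "p_value": 0.0010, "consistency": 0.50},
]
selected = select_temporal_neurons(neurons)
print(selected)
```

Requiring all three conditions at once is what keeps the selected population small: a neuron can be statistically significant and still fail on effect size or consistency.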
The activation curves are also telling. For the top temporal-preferential neurons, mean activation often forms a trough near the subjective reference region. As years move away into the past or future, activation rises in a compressed, logarithmic-like pattern. In larger models such as Llama-3.1-70B-Instruct and Qwen2.5-72B-Instruct, the structure becomes sharper. In Qwen2.5-72B-Instruct, the authors report that neurons in layer 71 fit past-year logarithmic distance with an $R^2$ of 0.756.
This is mechanistic evidence, not merely decorative neuroscience cosplay. The neurons give the behavioural result a plausible internal substrate: a small group of units responds preferentially to temporal framing and encodes distance from a reference point in a compressed way.
But it is still not evidence of lived experience. A neuron activation trough is not nostalgia. It is a pattern in a computational system.
The second mechanism: layers recode a year from quantity into orientation
The representational analysis is where the paper becomes most interesting for operators.
The authors extract residual stream representations during the similarity task and train linear probes at each layer to predict the three theoretical distances: log-linear, Levenshtein, and reference-log-linear. For larger models, they sample about 25 layers across the depth of the network to keep the analysis tractable. This is not a behavioural benchmark; it is a layer-wise diagnostic of what information is linearly recoverable from hidden states.
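A layer-wise probe can be pictured as a small regression fitted per layer and scored on held-out examples. The sketch below assumes a ridge-regularised linear probe solved in closed form; the paper's probe training details are not reproduced. The “hidden states” are synthetic: a toy early layer of pure noise and a toy late layer in which one dimension carries the target distance.

```python
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_ridge_probe(features, targets, ridge=1e-3):
    """Closed-form ridge fit; the bias is the last weight (also lightly penalised)."""
    rows = [f + [1.0] for f in features]
    d = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(d)]
    return solve_linear_system(gram, rhs)


def probe_r2(weights, features, targets):
    """R^2 of the probe's predictions against the target distances."""
    preds = [sum(w * x for w, x in zip(weights, f + [1.0])) for f in features]
    mean_t = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, preds))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot


# SYNTHETIC data: targets play the role of a theoretical distance per prompt.
rng = random.Random(0)
targets = [rng.uniform(0.0, 5.0) for _ in range(200)]
early = [[rng.gauss(0, 1) for _ in range(4)] for _ in targets]
late = [[t + rng.gauss(0, 0.1)] + [rng.gauss(0, 1) for _ in range(3)] for t in targets]


def held_out_r2(features):
    weights = fit_ridge_probe(features[:150], targets[:150])
    return probe_r2(weights, features[150:], targets[150:])


layer_scores = {"early (toy)": held_out_r2(early), "late (toy)": held_out_r2(late)}
print(layer_scores)
```

Plotting such scores per layer, once for each candidate distance, is what turns a stack of hidden states into a readable account of where each kind of information becomes linearly recoverable.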
The result is a construction story.
In early layers, models tend to encode numerical properties. A year still looks much like a number. In deeper layers, the reference-log-linear distance becomes more decodable, meaning the representation increasingly contains information about temporal orientation around the reference point.
The model does not simply receive “time” as a ready-made object. It builds it.
The Llama and Qwen families do this differently. In larger Llama models, reference-log-linear representation catches up with ordinary log-linear representation in later layers, suggesting coexistence of numerical and temporal structure. In Qwen models, the pattern is sharper: the numerical representation rises earlier, then later declines as the temporal representation peaks. The paper describes this as suppression of the foundational numerical representation as the more abstract temporal representation emerges.
For deployment, this is not a minor technical detail. It means that temporal reasoning errors may not come from a lack of date facts. They may come from a layer-wise transformation that turns dates into an internally compressed temporal landscape. Once that happens, the model may answer with fluent confidence while carrying a warped sense of similarity, distance, or relevance.
That is the exact species of error that makes AI systems expensive: not the error that throws an exception, but the error that writes a convincing paragraph.
The third mechanism: the training environment already bends time
The paper’s final mechanism asks whether the temporal structure may already be present in the model’s information environment.
The authors use three embedding models — Qwen3-Embedding-8B, text-embedding-3-large, and Gemini-embedding-001 — to encode the same “Year: x-x-x-x” strings from 1525 to 2524. They compute cosine similarity matrices and use multidimensional scaling to visualise the semantic space.
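The similarity-matrix step is easy to sketch once embeddings exist. The code below takes vectors as given and makes no API calls; the toy vectors are invented. Reading “x-x-x-x” as the year's digits separated by hyphens is an assumption about the input format.

```python
import math


def year_string(year: int) -> str:
    """ASSUMED reading of 'Year: x-x-x-x': the digits of the year joined by hyphens."""
    return "Year: " + "-".join(str(year))


def cosine_similarity(u, v) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarity for a list of embedding vectors."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# TOY embeddings standing in for real model output; values are invented.
toy_embeddings = {
    year_string(1945): [0.9, 0.1, 0.3],
    year_string(2024): [0.2, 0.8, 0.4],
    year_string(2080): [0.1, 0.2, 0.9],
    year_string(2090): [0.1, 0.25, 0.9],
}
labels = list(toy_embeddings)
matrix = similarity_matrix([toy_embeddings[k] for k in labels])
for label, row in zip(labels, matrix):
    print(label, [round(value, 3) for value in row])
```

Converting each similarity to a distance (for example, one minus the cosine) gives the quantity that can then be regressed against the candidate distances or fed to a scaling method for visualisation.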
The result is not a flat line of calendar years. The embedding spaces show non-linear temporal structure. Distant past and future years cluster densely. Future years are especially similar to one another, likely because future dates have lower information richness in training corpora. There are fewer distinct documented events for 2080 than for 1945. The future is not empty, but in text data it is often vague, speculative, and semantically repetitive.
The regression results support this reading. For all three embedding models, reference-log-linear distance explains semantic distances better than plain log-linear distance. The reported $R^2$ values for reference-log-linear distance are 0.6422 for Qwen3-Embedding-8B, 0.5684 for text-embedding-3-large, and 0.5159 for Gemini-embedding-001.
This evidence should be interpreted carefully. It does not prove that a specific training corpus directly caused a specific model’s internal timeline. The embedding models are proxies for broader language-data structure, not a forensic reconstruction of training history. Still, the finding is useful because it explains why subjective temporal compression might emerge without explicit instruction. Human text already treats time unevenly. Models trained on human text inherit the mess, then compress it elegantly enough to look cognitive. Very tasteful. Very dangerous if nobody tests it.
What each experiment supports, and what it does not
| Evidence in the paper | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Year-to-year similarity matrices across 12 models | Main behavioural evidence | Larger models often judge years through a compressed temporal frame rather than plain numerical distance | That models experience time or have consciousness |
| Number-to-number control | Control comparison | The “year” framing changes the representational pattern relative to numbers | That numerical cognition and temporal cognition are fully separable |
| Diagonal sliding window reference estimates | Robustness and sensitivity check | Some models show maximum differentiation near a recent-present region | That 2025 is the exact internal reference point of every model |
| Temporal-preferential FFN neurons | Mechanistic evidence | A small middle-to-late neuron population responds preferentially to temporal framing and shows compressed activation patterns | That those neurons alone explain all temporal reasoning |
| Layer-wise linear probes | Representational mechanism | Years shift from numerical properties in shallow layers toward temporal orientation in deeper layers | That the decoded features are always causally necessary |
| Embedding-model semantic spaces | Exploratory information-exposure evidence | Human language data contains non-linear temporal structure that may provide raw material for model temporal cognition | That any one model’s training corpus caused the exact observed behaviour |
This table is the antidote to the obvious overclaim. The paper is strongest when read as a multi-level mechanism study. It is weakest if turned into a philosophy meme about machine consciousness before lunch.
The business value is temporal diagnostics, not metaphysics
The practical question is not whether the model has “a mind”. The practical question is whether its internal construction of time can distort outputs in workflows where timing is material.
That includes:
- forecasting narratives;
- financial analysis across reporting periods;
- project plans and delivery milestones;
- contract and compliance timelines;
- insurance, legal, and audit review;
- historical document analysis;
- policy comparison across decades;
- agentic systems that choose actions over time.
The paper directly tests similarity judgments, not these applications. Cognaptus’s business inference is that temporal-bias testing should become part of model evaluation whenever date structure affects the outcome. The inference is reasonable, but it remains an inference.
A clean operational test would not ask, “Can the model calculate the difference between 2025 and 2030?” That is too easy. It should ask whether the model preserves distinctions across past, present, and future when those distinctions affect prioritisation, retrieval, summarisation, or risk.
For example, in a compliance workflow, two deadlines ten days apart may be operationally very different. In a strategic forecast, 2030 and 2040 may require entirely different capital assumptions. In a legal history task, 1986 and 1996 may sit in different regulatory worlds. If the model’s representational geometry treats distant or future years as blurry neighbours, arithmetic correctness will not save the workflow.
The model can count the years and still mishandle the timeline. That is the annoying part. Naturally.
A practical temporal-bias checklist for AI systems
Operators do not need to reproduce the full paper to benefit from it. They need smaller probes that detect whether a deployed model compresses time in ways that matter for the domain.
| Operational check | What to test | Why it matters |
|---|---|---|
| Year-vs-number contrast | Ask parallel similarity, ranking, or relevance questions using “year” and “number” framing | Detects whether temporal framing changes the model’s internal comparison behaviour |
| Anchor sensitivity | Repeat tasks with explicit reference dates such as “as of 2024”, “as of 2030”, or “from a 2015 perspective” | Shows whether the model can shift temporal viewpoint when instructed |
| Past-future asymmetry | Compare how the model distinguishes historical years versus future years at equal intervals | Finds whether future periods are being over-smoothed |
| Domain-year probes | Use dates meaningful to the business domain, not only generic calendar years | A financial model may treat 2008, 2020, and 2022 differently for good reasons; the question is whether it does so consistently |
| Retrieval interaction tests | Check whether date-aware retrieval changes temporal judgments | Separates model-internal priors from missing contextual evidence |
| Long-horizon planning audits | Compare plans at weekly, quarterly, annual, and multi-year horizons | Identifies whether longer horizons collapse into vague strategic fog |
The key is to treat time as a representational variable, not merely metadata. Dates should be explicit in prompts, retrieval filters, evaluation sets, and review rubrics. If an AI agent is planning across time, the evaluation should include temporal stress tests before it starts producing beautiful nonsense with Gantt-chart energy.
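Two of the checks above, the year-vs-number contrast and past-future asymmetry, can be wired up in a few lines. This is a sketch under stated assumptions: `ask_model` is a hypothetical callable that takes a prompt and returns the model's text reply, the prompt wording is invented, and `stub_model` is a stand-in that deliberately over-smooths future years so the harness has something to detect.

```python
import re
from typing import Callable, Iterable, Tuple


def parse_score(reply: str) -> float:
    """Take the first number in the reply and clamp it to [0, 1]."""
    match = re.search(r"\d*\.?\d+", reply)
    if match is None:
        raise ValueError(f"no number found in reply: {reply!r}")
    return min(1.0, max(0.0, float(match.group())))


def similarity(ask_model: Callable[[str], str], a: int, b: int, framing: str) -> float:
    # HYPOTHETICAL prompt wording; adapt to the deployed system's conventions.
    prompt = (f"Rate the similarity of {framing} {a} and {framing} {b} "
              "from 0 to 1. Reply with one number.")
    return parse_score(ask_model(prompt))


def framing_gap(ask_model, pairs: Iterable[Tuple[int, int]]) -> float:
    """Mean absolute difference between 'year' and 'number' ratings on the same pairs."""
    gaps = [abs(similarity(ask_model, a, b, "year") - similarity(ask_model, a, b, "number"))
            for a, b in pairs]
    return sum(gaps) / len(gaps)


def past_future_asymmetry(ask_model, anchor: int, offsets, step: int) -> float:
    """Mean future-pair similarity minus mean past-pair similarity at equal intervals.

    Positive values suggest future years are being smoothed together more than past ones.
    """
    future = [similarity(ask_model, anchor + o, anchor + o + step, "year") for o in offsets]
    past = [similarity(ask_model, anchor - o - step, anchor - o, "year") for o in offsets]
    return sum(future) / len(future) - sum(past) / len(past)


def stub_model(prompt: str) -> str:
    """STAND-IN for a real model call; rates any two future years as near-identical."""
    a, b = (int(x) for x in re.findall(r"\d{4}", prompt))
    if "year" in prompt and min(a, b) > 2025:
        return "0.95"
    return f"{max(0.0, 1 - abs(a - b) / 100):.2f}"


gap = framing_gap(stub_model, [(2030, 2040), (1930, 1940)])
asymmetry = past_future_asymmetry(stub_model, anchor=2025, offsets=[5, 15, 25], step=10)
print(f"framing gap: {gap:.3f}, past-future asymmetry: {asymmetry:.3f}")
```

In practice `stub_model` would be replaced by a thin wrapper around the deployed model, and the pairs, anchor, and step would come from dates that matter in the domain rather than round decades.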
The misconception to avoid: human-like does not mean human
The authors use the language of human-like temporal cognition because the observed pattern resembles known psychophysical compression. That is fair within the paper’s frame. It is also a phrase that can run loose in public interpretation and knock over furniture.
The paper does not show that LLMs have subjective experience, memory, anticipation, embodiment, mortality, boredom, regret, or the faint dread of an approaching tax deadline. It shows that artificial neural systems can construct internal temporal representations that converge with some human-like patterns under a controlled similarity task.
That distinction is not philosophical pedantry. It affects governance.
If executives over-anthropomorphise the result, they may treat model behaviour as mysterious personality. If engineers dismiss it as “just training data”, they may miss a useful internal diagnostic. The better middle position is colder and more productive: LLMs are representational systems that construct compressed internal models from architecture and data. Some of those constructions resemble human cognition. Others may not. Either way, they can affect outputs.
The risk is not that the model secretly becomes human. The risk is that it becomes operationally persuasive while organising the world in ways users did not test.
Where the paper’s evidence stops
The paper’s boundaries are clear enough to be useful.
First, the core behavioural task covers years from 1525 to 2524. That is broad, but it is still a controlled similarity setting. The results do not directly measure performance in real forecasting, contract review, financial modelling, legal analysis, or autonomous planning.
Second, the reference point is not perfectly stable across models. The authors fix 2025 for comparability, while the sliding-window estimates vary across larger models. This supports a recent-present anchor, not a universal clock.
Third, the mechanistic analyses are strongest for open-source models where internal activations can be inspected. Closed models contribute behavioural evidence, but not the same depth of mechanistic evidence.
Fourth, the embedding-model analysis points to information exposure as a plausible contributor, not a complete causal proof. Training data is not a laboratory environment with all variables neatly labelled. If only.
Finally, the paper focuses on temporal similarity, not full temporal reasoning. Similarity is important because it exposes internal structure, but operators should not assume that every temporal decision will follow the same pattern. The right response is targeted evaluation, not panic, worship, or procurement theatre.
The alignment lesson is internal world-building
The paper’s broader argument is that alignment cannot rely only on output policing. If a model constructs internal temporal, causal, social, or strategic frames, then safety work must also inspect those constructions.
This is where the paper’s “experientialist” framing becomes useful, provided we do not turn it into mysticism. The model’s “experience” is not human experience. It is exposure to data through a particular architecture and training process. But that exposure still shapes internal structure. The model does not merely retrieve calendar facts; it builds a compressed representational world in which some dates are sharp, others are blurry, and the recent present may become a privileged centre.
For alignment and enterprise governance, that suggests a shift from output-only evaluation to representation-aware evaluation. Behavioural tests still matter. They are just not enough. A model can pass ordinary date arithmetic and still carry a temporal prior that affects relevance, prioritisation, and planning.
The more agentic the system, the more this matters. A chatbot that compresses future years may write a vague answer. An autonomous planning agent that compresses future years may mis-rank options, under-specify milestones, or treat distant obligations as interchangeable. Same representational weakness, different invoice.
Conclusion: debug the clock before trusting the plan
The most useful contribution of this paper is not the slogan that LLMs have human-like temporal cognition. It is the mechanism chain.
The behavioural task shows that larger models often judge years through a subjective temporal reference rather than pure numerical distance. The neuron analysis finds a small temporal-preferential population in middle-to-late layers, with compressed activation patterns around a reference point. The probe analysis shows a layer-wise transformation from numerical representation to temporal orientation. The embedding analysis suggests that language data itself already contains warped temporal structure, especially around the sparse and semantically blurry future.
Together, these findings make one practical point difficult to ignore: time inside an LLM is not guaranteed to be the clean calendar humans think they supplied in the prompt.
For operators, the response is not to ask whether the machine has a soul. That meeting can be cancelled. The response is to build temporal diagnostics into AI evaluation: year-vs-number controls, anchor sensitivity tests, past-future asymmetry checks, retrieval-aware audits, and domain-specific timeline probes.
The calendar is still objective. The model’s internal clock may not be.
Cognaptus: Automate the Present, Incubate the Future.
[1] Lingyu Li, Yang Yao, Yixu Wang, Chubo Li, Yan Teng, and Yingchun Wang, “The Other Mind: How Language Models Exhibit Human Temporal Cognition,” arXiv:2507.15851, 2025.