00 / project statement

Words Over Time

is a semantic-frequency research project,
design research, and infographic art.

Words Over Time is made by Dai Pan / 潘岱, a Chinese artist, designer, and design researcher. It treats language as visual material: a field of memory, evidence, attention, and public pressure that can be studied through semantic change, word frequency, search statistics, and source-led interpretation.

This is not a dictionary. It does not define words. It does not claim that frequency reflects importance, that Gutenberg texts represent all historical usage, or that modern news snippets are comparable to historical corpora. What it does claim is that the available evidence is worth making visible, with its sources, limits, and gaps stated alongside the data.

Dai Pan / visual art, photography, printmaking, design research

Dai Pan / writing, image-text worlds, poetic research

Intended audience: researchers, writers, educators, designers, artists, and anyone curious about how language carries history.

01 / methodology

methodology

A selected-word research system for historical frequency, semantic grouping, search statistics, scanned evidence, and interpretation.

Not a general search engine. Each word is selected, structured, and annotated before it becomes a public entry.

design research

Grid as argument

The visual structure draws on the Swiss International Typographic Style as developed between the 1950s and 1980s at the Basel School of Design and the HfG Ulm.

The six-column grid is not an aesthetic choice. It is a claim that the six categories of semantic evidence are commensurate and comparable. When a category is missing, the column is empty. The gap is not hidden.

the visual programme / six evidence columns

Signal

source-specific

Attestation

lexical proof

Variant

form policy

Context

snippet evidence

Boundary

claim limit

Rights

attribution

Faded columns = evidence not yet implemented. The absence is structural, not hidden.

Josef Müller-Brockmann

1914-1996

Grid Systems in Graphic Design

1981

The six-column grid that structures every word entry is a direct application of Müller-Brockmann's modular grid principle: a visible, auditable structure that makes the absence of data as legible as its presence.

Karl Gerstner

1930-2017

Designing Programmes

1964

The color token system (ink, wheat, blaze, fire, sun, nice, curious, wine, sail) is a programme in Gerstner's sense: each token is a rule, not a feeling. Orange marks emphasis and interactivity. Blue marks sources and data layers. Green marks confidence.

Emil Ruder

1914-1970

Typographie

1967

Helvetica Neue is used throughout as an information-neutral carrier. Ruder's argument that type must serve communication rather than express the typographer applies here: the typeface does not perform its own historicity.

HfG Ulm

1953-1968

Hochschule für Gestaltung

The Ulm model treated design as an epistemological practice: structure makes claims, not just appearances. The page modules and evidence diagrams are design research artifacts; they argue through their structure that evidence should be shown as a relational system.

This is not an application of Swiss design as a historical style. It is an application of the underlying principle: that the structure of a design makes claims, and those claims should be as auditable as the data they present.

layered evidence

Evidence flow

Two information layers share one structure. The source layer stays readable at rest; hovering gently brings the output layer forward.

01 / source

long-run book / corpus frequency data

output / Corpus frequency

normalized usage trace

Frequency is read as a long-run signal, with corpus boundaries kept visible.

Corpus size, genre mix, and source breaks stay attached to the line.

02 / source

dictionary / lexical evidence

output / Lexical attestation

earliest attested usage

Dictionary evidence anchors claims about first known use without pretending it is corpus frequency.

Attestation can sit outside the archive corpus and still matter.

03 / source

verified scanned-book page

output / Scanned evidence

earliest scanned-book occurrence

A page image or public-domain snippet gives inspectable context, still with uncertainty attached.

Scan quality, OCR noise, and public-domain limits remain visible.

04 / source

interpretive notes

output / Annotation

variant and uncertainty policy

Notes separate spelling variants, semantic drift, licensing limits, and confidence.

Interpretive decisions are recorded instead of hidden behind the chart.

data sources

Source ledger

Source	Role	Coverage	License
Google Books Ngram Viewer	Frequency time series	English corpora through 2022; queried at smoothing 0 before local transforms	Google Books Ngram terms; attribution required
Project Gutenberg	Public-domain context text	Selected public-domain books, mainly eighteenth to early twentieth century	Project Gutenberg License; public-domain status varies outside the US
Library of Congress / Chronicling America	Historical newspaper evidence	Digitized US newspapers, chiefly 1770s-1960s depending on collection availability	Library of Congress rights statements; item-level rights vary
Wikimedia / Wikinews / MediaWiki APIs	Modern context and attention signals	Contemporary page, article, and context metadata where relevant	CC BY / CC BY-SA family; project-specific terms apply
Lexical references	Attestation and sense-history checks	OED candidate checks, Online Etymology Dictionary, Wiktionary, Merriam-Webster, Cambridge	Publisher-specific; entries are not reproduced
Policy, clinical, and technical references	Domain context anchors	EU AI Act, GDPR/ICO, FTC/NIST/OECD, PubMed/MeSH, WHO/NIMH, APA/DSM-history pointers, Stanford HAI and related pages	Source-specific; used as citation targets and metadata, not republished text
Public law and human-rights repositories	Legal and rights anchors	Wikisource, CourtListener, Justia/Oyez, Cornell Wex, NY Senate, UN, ECHR, EUR-Lex, eCFR, govinfo, DOJ/HHS, FTC, OECD, CPPA and related public pages	Source-specific; legal text, summaries, and court opinions are cited or paraphrased, not redistributed as a corpus
Geographic and demographic context sources	Aggregate context signals	OpenAlex, GDELT, World Bank indicators, Our World in Data fallback values, Open-Elevation, and Google Trends availability checks	Source-specific; used as aggregate metrics, metadata, or unavailable-source audits only

calculation methods

How values are made visible

Source capture

Scripts fetch or ingest source-specific data into generated JSON files. Ngram queries are pulled as yearly series; Gutenberg and LOC material are stored as text or metadata extracts; policy, dictionary, clinical, and technical references are stored as source pointers or curated records when full-text reuse is restricted.
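The three storage shapes described above could be typed roughly as follows. This is a minimal sketch, not the repository's actual schema: every type name, field name, and helper here (`CapturedRecord`, `storesContent`, `toGeneratedJson`) is hypothetical and chosen only for illustration.

```typescript
// Hypothetical record shapes; none of these names are taken from the project's code.
type YearlySeries = {
  kind: "yearly-series";
  source: string; // e.g. "Google Books Ngram Viewer"
  term: string;
  points: { year: number; value: number }[];
};

type TextExtract = {
  kind: "text-extract";
  source: string; // e.g. "Project Gutenberg"
  title: string;
  excerpt: string;
};

type SourcePointer = {
  kind: "source-pointer";
  source: string; // e.g. a policy or dictionary reference
  url: string;
  note: string; // curated summary; no restricted full text is stored
};

type CapturedRecord = YearlySeries | TextExtract | SourcePointer;

// True when the record stores values or text, false when it only points at a source.
function storesContent(record: CapturedRecord): boolean {
  return record.kind !== "source-pointer";
}

// Serialize records in the way a generated JSON file might hold them.
function toGeneratedJson(records: CapturedRecord[]): string {
  return JSON.stringify(records, null, 2);
}
```

Keeping the `kind` field on every record means a chart component can tell at a glance whether it is drawing from stored values or from a citation target only.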

Frequency normalization

Google Ngram values, which arrive as relative frequencies, are scaled to a comparable per-million visibility value where needed. Most charts keep smoothing at 0, then apply local period aggregation, rank lookup, or max-normalization so that a chart compares terms within the same source family rather than across incompatible corpora.
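A minimal sketch of these steps, assuming each yearly value is a relative frequency (a fraction of that year's tokens) and that period aggregation is a plain mean over fixed-width periods. The function names and the choice of mean are assumptions for illustration, not the project's documented implementation.

```typescript
type YearValue = { year: number; value: number };

// Scale a relative frequency to occurrences per million tokens.
function toPerMillion(series: YearValue[]): YearValue[] {
  return series.map(({ year, value }) => ({ year, value: value * 1_000_000 }));
}

// Average values inside fixed-width periods (width = 10 gives decades).
// Each output point is keyed by the first year of its period.
function aggregateByPeriod(series: YearValue[], width: number): YearValue[] {
  const buckets = new Map<number, { sum: number; count: number }>();
  for (const { year, value } of series) {
    const start = Math.floor(year / width) * width;
    const bucket = buckets.get(start) ?? { sum: 0, count: 0 };
    bucket.sum += value;
    bucket.count += 1;
    buckets.set(start, bucket);
  }
  return Array.from(buckets.entries())
    .sort((a, b) => a[0] - b[0])
    .map(([year, { sum, count }]) => ({ year, value: sum / count }));
}

// Divide by the series maximum so the peak becomes 1.
// An all-zero or empty series stays at zero instead of dividing by zero.
function maxNormalize(series: YearValue[]): YearValue[] {
  const max = Math.max(0, ...series.map((p) => p.value));
  return series.map((p) => ({ year: p.year, value: max === 0 ? 0 : p.value / max }));
}
```

Because `maxNormalize` rescales each series against its own peak, the result is only meaningful for comparison inside one source family, which matches the boundary stated above.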

Display transformation

Visual scales may use square-root, max-normalized, indexed, or ranked transforms. These transforms are display devices only: the interface labels them as visual intensity, visibility index, or relative signal rather than raw counts.
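The four transforms named above might look like this in code. This is a sketch under assumptions: the pairing of each transform with a label in `DISPLAY_LABELS` is hypothetical (the text names the labels but not which transform each belongs to), "indexed" is taken to mean "first value = 100", and "ranked" gives rank 1 to the largest value.

```typescript
type DisplayTransform = "sqrt" | "max-normalized" | "indexed" | "ranked";

// Assumed pairing of transform and interface label; only the label wording comes from the text.
const DISPLAY_LABELS: Record<DisplayTransform, string> = {
  "sqrt": "visual intensity",
  "max-normalized": "relative signal",
  "indexed": "visibility index",
  "ranked": "relative signal",
};

function applyTransform(values: number[], transform: DisplayTransform): number[] {
  if (transform === "sqrt") {
    // Compress large values; negative inputs are clamped to zero.
    return values.map((v) => Math.sqrt(Math.max(0, v)));
  }
  if (transform === "max-normalized") {
    const max = Math.max(0, ...values);
    return values.map((v) => (max === 0 ? 0 : v / max));
  }
  if (transform === "indexed") {
    // Express every value relative to the first one, which becomes 100.
    const base = values[0] ?? 0;
    return values.map((v) => (base === 0 ? 0 : (v / base) * 100));
  }
  // "ranked": 1 for the largest value; ties share a rank.
  return values.map((v) => 1 + values.filter((other) => other > v).length);
}

// Pair transformed values with the label the interface would show.
function toDisplay(values: number[], transform: DisplayTransform) {
  return { label: DISPLAY_LABELS[transform], values: applyTransform(values, transform) };
}
```

Returning the label together with the values makes it harder for a chart to present transformed numbers as raw counts.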

Phrase and variant policy

Each word page declares which forms belong together and which remain separate. Examples include spelling variants, compounds, X + word phrases, word + X phrases, singular/plural grammar, and domain phrases. Variant aggregation is treated as an editorial decision, not a default.
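One way to make that editorial decision explicit is to store it as data next to the word. The sketch below is illustrative only: `VariantPolicy`, `applyVariantPolicy`, and the sample forms are hypothetical, and the counts in any usage are placeholders rather than project data.

```typescript
type VariantPolicy = {
  word: string;
  merged: string[]; // forms counted together with the headword
  separate: string[]; // forms that are charted on their own
};

type FormCount = { form: string; count: number };

// Apply a declared policy to raw counts. Forms the policy does not mention
// are reported back instead of being silently merged or dropped.
function applyVariantPolicy(policy: VariantPolicy, counts: FormCount[]) {
  const mergedForms = new Set<string>([policy.word, ...policy.merged]);
  let mergedTotal = 0;
  const separate: FormCount[] = [];
  const undeclared: string[] = [];
  for (const { form, count } of counts) {
    if (mergedForms.has(form)) {
      mergedTotal += count;
    } else if (policy.separate.indexOf(form) !== -1) {
      separate.push({ form, count });
    } else {
      undeclared.push(form);
    }
  }
  return { mergedTotal, separate, undeclared };
}

// Hypothetical policy: plural merged, compound kept separate.
const examplePolicy: VariantPolicy = {
  word: "hub",
  merged: ["hubs"],
  separate: ["hub-and-spoke"],
};
```

The `undeclared` list is the point of the design: aggregation never happens by default, so an unlisted form surfaces as a pending editorial decision.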

Semantic grouping

Semantic layers are built from curated phrase sets, keyword/collocate overlaps, source annotations, and domain-specific evidence records. They are not presented as machine-learned sense disambiguation; they are interpretive maps backed by visible source categories and caution language.
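As a rough illustration of grouping by curated phrase sets, the sketch below assigns a snippet to every layer whose phrases it contains. It assumes plain case-insensitive substring matching, which is far simpler than keyword or collocate overlap; the layer ids and phrases are invented examples, not the archive's curated sets.

```typescript
type SemanticLayer = { id: string; phrases: string[] };

// Return the ids of all layers with at least one curated phrase in the snippet.
// A snippet may match several layers: groups are not mutually exclusive.
function matchLayers(snippet: string, layers: SemanticLayer[]): string[] {
  const text = snippet.toLowerCase();
  return layers
    .filter((layer) =>
      layer.phrases.some((phrase) => text.indexOf(phrase.toLowerCase()) !== -1)
    )
    .map((layer) => layer.id);
}

// Hypothetical layers for illustration only.
const exampleLayers: SemanticLayer[] = [
  { id: "transport", phrases: ["transit hub", "airport hub"] },
  { id: "network", phrases: ["network hub", "usb hub"] },
];
```

An empty result is left as it is: no match means the curated sets do not cover the snippet, not that the snippet has no sense.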

Branch and dependency scoring

For word pages such as "hub", form groups and dependency tiers are computed from curated examples: counts, object-type spread, phrase form, and modifier dependence are converted into visual branch maps. The score indicates how much the attached word specifies the object; it does not indicate popularity or legal meaning.

Confidence and boundary labels

A claim can be source-supported, corpus-visible, manually attested, derived, pending, or cautionary. The page text must name that status instead of collapsing everything into proof. Absence, sparse data, OCR noise, rights limits, and genre bias remain attached to the claim.
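The six statuses lend themselves to a closed type, so that a claim cannot be rendered without one. The following is a sketch, assuming a claim is stored with its status and caveats as separate fields; `Claim`, `isClaimStatus`, and `formatClaim` are hypothetical names, and the output format is an illustration rather than the site's actual wording.

```typescript
type ClaimStatus =
  | "source-supported"
  | "corpus-visible"
  | "manually attested"
  | "derived"
  | "pending"
  | "cautionary";

const CLAIM_STATUSES: ClaimStatus[] = [
  "source-supported",
  "corpus-visible",
  "manually attested",
  "derived",
  "pending",
  "cautionary",
];

type Claim = { text: string; status: ClaimStatus; caveats: string[] };

// Guard for values read from JSON, where the status arrives as a plain string.
function isClaimStatus(value: string): value is ClaimStatus {
  return CLAIM_STATUSES.some((status) => status === value);
}

// Render a claim with its status named first and its caveats still attached.
function formatClaim(claim: Claim): string {
  const limits =
    claim.caveats.length > 0 ? ` Limits: ${claim.caveats.join("; ")}.` : "";
  return `[${claim.status}] ${claim.text}${limits}`;
}
```

Under these assumptions, a claim with status `corpus-visible` and caveats `OCR noise` and `genre bias` would render with the status as a prefix and both limits appended, so neither can be dropped by the layout.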

epistemological position

Claim boundaries

The archive makes bounded claims. It can show that a selected form is visible in a named corpus, that a cited source supports an attestation, or that a curated semantic grouping organizes the evidence. It does not turn those signals into universal claims about all English usage, all communities, or all meanings of a word.

claims allowed

+ Corpus frequency can show that selected forms become more or less visible within a named source boundary.
+ Lexical and scanned sources can support attestation claims when source type and uncertainty are named.
+ Semantic charts can show curated interpretive structure when the grouping rule is disclosed.
+ Data collection, transformation, and visualization choices should be visible enough to audit.

claims refused

- Frequency does not equal cultural importance, lived experience, literary value, or legal meaning.
- A selected corpus is not all English usage, all genres, or all communities.
- A semantic group is not an automatic definition and is not mutually exclusive by default.
- A first detected corpus point is not the first historical use of a word.
- A missing result is not evidence that a usage did not exist.

open method

Infographic Editorial Design Skill

The design method behind this archive is also published as an open Codex skill for source-led infographic and editorial design.

The skill is a portable method refined from this design experience: define the claim, expose the evidence contract, design the reading path, and keep uncertainty visible.

github / open skill

infographic-editorial-design-skill

MIT package / Codex skill / research-led infographic method

claim

The artifact starts by naming what the evidence can and cannot support.

contract

Source types, transforms, caveats, rights, and curated decisions stay visible.

review

The skill includes a rubric for overclaiming, hierarchy, accessibility, and publication readiness.

The skill generalizes the method, not the finished identity. The MIT license covers the skill package itself; this archive's research writing, curated datasets, page compositions, visual identity, authorship marks, and third-party source material remain outside that grant.

public repository

Code & data

The data pipeline and visualization components are public for inspection, reproducibility, and citation review.

github / dpan538

Words-Over-Time

Public repository / Data pipeline / All components

What is inspectable

+ Data pipeline scripts (TypeScript, Node)
+ Generated JSON datasets
+ All visualization and UI components
+ Calculation methods and stopword lists

What is curated (not generic)

. Word selection and editorial decisions
. Category definitions and heuristics
. Interpretive annotations and pressure anchors
. Visual design and typographic decisions

citation note

Cite the archive and the upstream sources separately. A Words Over Time chart is an editorial synthesis of source retrieval, cleaning, transformation, semantic grouping, and visual design; it is not a replacement citation for Google Books Ngram, Project Gutenberg, Library of Congress, Wikimedia, dictionary publishers, policy pages, or clinical/technical references.

suggested website citation style

Pan, Dai. "[Word page title]." Words Over Time, 2026, [page URL]. DOI: 10.5281/zenodo.20437678. Accessed [day month year].

Example: Pan, Dai. "Hub." Words Over Time, 2026, https://wordsovertime.com/words/hub. DOI: 10.5281/zenodo.20437678. Accessed 27 May 2026.

Project DOI: https://doi.org/10.5281/zenodo.20437678
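For anyone generating citations programmatically, the suggested style can be filled in with a small helper. This is a convenience sketch, not a tool shipped with the archive: `formatCitation` is a hypothetical name, the year and DOI are copied from the template above, and the access date is read in UTC.

```typescript
const MONTHS = [
  "January", "February", "March", "April", "May", "June",
  "July", "August", "September", "October", "November", "December",
];

// Values taken from the suggested citation style above.
const PROJECT_YEAR = "2026";
const PROJECT_DOI = "10.5281/zenodo.20437678";

// Build a website citation for one word page in the suggested style.
function formatCitation(pageTitle: string, pageUrl: string, accessed: Date): string {
  const day = accessed.getUTCDate();
  const month = MONTHS[accessed.getUTCMonth()] ?? "";
  const year = accessed.getUTCFullYear();
  return (
    `Pan, Dai. "${pageTitle}." Words Over Time, ${PROJECT_YEAR}, ${pageUrl}. ` +
    `DOI: ${PROJECT_DOI}. Accessed ${day} ${month} ${year}.`
  );
}

// Reproduces the example entry for the "Hub" page.
const hubCitation = formatCitation(
  "Hub",
  "https://wordsovertime.com/words/hub",
  new Date(Date.UTC(2026, 4, 27)) // month index 4 = May
);
```

The helper covers only the archive citation; upstream sources still need their own entries, as the citation note requires.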

Rights note: this page is a research and design archive, not legal advice. Public launch should keep source URLs visible, avoid full third-party reproductions, and remove or paraphrase any pending, restricted, or subscription-only excerpt.