00 / project statement

Words Over Time

is a curated editorial archive,
not a search engine.

Each entry begins with a word selected for its cultural weight: a word that has shifted meaning, accumulated associations, or traveled across registers over centuries. The archive traces that word through corpus frequency, lexical attestation, scanned evidence, and interpretive annotation, presenting them as distinct layers rather than a single authoritative answer.

This is not a dictionary. It does not define words. It does not claim that frequency reflects importance, that Gutenberg texts represent all historical usage, or that modern news snippets are comparable to historical corpora. What it does claim is that the available evidence is worth making visible, with its sources, limits, and gaps stated alongside the data.

Intended audience: researchers, writers, educators, and anyone curious about how language carries history.

01 / methodology

A selected-word archive for historical frequency, attestation, scanned evidence, and interpretation.

Not a general search engine. Each word is selected, structured, and annotated before it becomes a public entry.

02

design research

Grid as argument

The visual structure draws on the Swiss International Typographic Style as developed between the 1950s and 1980s at the Basel School of Design and the HfG Ulm.

The six-column grid is not an aesthetic choice. It is a claim that the six categories of semantic evidence are commensurate and comparable. When a category is missing, the column is empty. The gap is not hidden.

the visual programme / six evidence columns

Signal

source-specific

Attestation

lexical proof

Variant

form policy

Context

snippet evidence

Boundary

claim limit

Rights

attribution

Faded columns = evidence not yet implemented. The absence is structural, not hidden.
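One way to keep an empty column explicit in data is to make all six categories required keys whose value may be null. The following is a minimal sketch under that assumption; the names `EvidenceColumn`, `EntryColumns`, and `columnStates` are hypothetical and are not taken from the repository.

```typescript
// The six evidence columns named in the visual programme.
const EVIDENCE_COLUMNS = [
  "signal",
  "attestation",
  "variant",
  "context",
  "boundary",
  "rights",
] as const;

type EvidenceColumn = (typeof EVIDENCE_COLUMNS)[number];

// Hypothetical shape: every column key is required, and null means
// "evidence not yet implemented", so a gap cannot be silently omitted.
type EntryColumns = Record<EvidenceColumn, string | null>;

type ColumnState = { column: EvidenceColumn; state: "filled" | "faded" };

// Map an entry to one state per column, always in grid order.
function columnStates(entry: EntryColumns): ColumnState[] {
  return EVIDENCE_COLUMNS.map(
    (column): ColumnState => ({
      column,
      state: entry[column] === null ? "faded" : "filled",
    })
  );
}

// Illustrative entry with two missing categories (placeholder text only).
const exampleEntry: EntryColumns = {
  signal: "source-specific frequency note",
  attestation: "lexical proof note",
  variant: "form policy note",
  context: null,
  boundary: "claim limit note",
  rights: null,
};

const states = columnStates(exampleEntry);
```

Because the record type requires all six keys, leaving a category out is a compile-time error rather than a quietly shorter grid.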

Josef Müller-Brockmann

1914-1996

Grid Systems in Graphic Design

1981

The six-column grid that structures every word entry is a direct application of Müller-Brockmann's modular grid principle: a visible, auditable structure that makes the absence of data as legible as its presence.

Karl Gerstner

1930-2017

Designing Programmes

1964

The color token system (ink, wheat, blaze, fire, sun, nice, curious, wine, sail) is a programme in Gerstner's sense: each token is a rule, not a feeling. Orange marks emphasis and interactivity. Blue marks sources and data layers. Green marks confidence.

Emil Ruder

1914-1970

Typographie

1967

Helvetica Neue is used throughout as an information-neutral carrier. Ruder's argument that type must serve communication rather than express the typographer applies here: the typeface does not perform its own historicity.

HfG Ulm

1953-1968

Hochschule für Gestaltung

The Ulm model treated design as an epistemological practice: structure makes claims, not just appearances. The page modules and evidence diagrams are design research artifacts; they argue through their structure that evidence should be shown as a relational system.

This is not an application of Swiss design as historical style. It is an application of the underlying principle: that the structure of a design makes claims, and those claims should be as auditable as the data they present.

03

layered evidence

Evidence flow

Two information layers share one structure. The source layer stays readable at rest; hovering brings the output layer forward.

01 / source

long-run book / corpus frequency data

output / Corpus frequency

normalized usage trace

Frequency is read as a long-run signal, with corpus boundaries kept visible.

Corpus size, genre mix, and source breaks stay attached to the line.

02 / source

dictionary / lexical evidence

output / Lexical attestation

earliest attested usage

Dictionary evidence anchors claims about first known use without pretending it is corpus frequency.

Attestation can sit outside the archive corpus and still matter.

03 / source

verified scanned-book page

output / Scanned evidence

earliest scanned-book occurrence

A page image or public-domain snippet gives inspectable context, still with uncertainty attached.

Scan quality, OCR noise, and public-domain limits remain visible.

04 / source

interpretive notes

output / Annotation

variant and uncertainty policy

Notes separate spelling variants, semantic drift, licensing limits, and confidence.

Interpretive decisions are recorded instead of hidden behind the chart.

04

data sources

Source ledger

| Source | Role | Coverage | License |
| --- | --- | --- | --- |
| Google Books Ngram Viewer | Frequency time series | English corpora through 2022; queried at smoothing 0 before local transforms | Google Books Ngram terms; attribution required |
| Project Gutenberg | Public-domain context text | Selected public-domain books, mainly eighteenth to early twentieth century | Project Gutenberg License; public-domain status varies outside the US |
| Library of Congress / Chronicling America | Historical newspaper evidence | Digitized US newspapers, chiefly 1770s-1960s depending on collection availability | Library of Congress rights statements; item-level rights vary |
| Wikimedia / Wikinews / MediaWiki APIs | Modern context and attention signals | Contemporary page, article, and context metadata where relevant | CC BY / CC BY-SA family; project-specific terms apply |
| Lexical references | Attestation and sense-history checks | OED candidate checks, Online Etymology Dictionary, Wiktionary, Merriam-Webster, Cambridge | Publisher-specific; entries are not reproduced |
| Policy, clinical, and technical references | Domain context anchors | EU AI Act, GDPR/ICO, FTC/NIST/OECD, PubMed/MeSH, WHO/NIMH, APA/DSM-history pointers, Stanford HAI and related pages | Source-specific; used as citation targets and metadata, not republished text |
| Public law and human-rights repositories | Legal and rights anchors | Wikisource, CourtListener, Justia/Oyez, Cornell Wex, NY Senate, UN, ECHR, EUR-Lex, eCFR, govinfo, DOJ/HHS, FTC, OECD, CPPA and related public pages | Source-specific; legal text, summaries, and court opinions are cited or paraphrased, not redistributed as a corpus |
| Geographic and demographic context sources | Aggregate context signals | OpenAlex, GDELT, World Bank indicators, Our World in Data fallback values, Open-Elevation, and Google Trends availability checks | Source-specific; used as aggregate metrics, metadata, or unavailable-source audits only |

05

calculation methods

How values are made visible

Source capture

Scripts fetch or ingest source-specific data into generated JSON files. Ngram queries are pulled as yearly series; Gutenberg and LOC material are stored as text or metadata extracts; policy, dictionary, clinical, and technical references are stored as source pointers or curated records when full-text reuse is restricted.
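The three storage forms described above (yearly series, text or metadata extracts, and source pointers) could be modeled as a discriminated union. This is a sketch only: the record shapes, field names, and the helper `carriesSourceText` are assumptions for illustration, and the example values are placeholders rather than real data.

```typescript
// Hypothetical record shapes for the generated JSON; field names are assumptions.
type YearlySeries = {
  kind: "series";
  source: string; // e.g. "Google Books Ngram Viewer"
  term: string;
  points: { year: number; value: number }[];
};

type TextExtract = {
  kind: "extract";
  source: string; // e.g. "Project Gutenberg"
  text: string;
  sourceUrl: string;
};

type SourcePointer = {
  kind: "pointer";
  source: string;
  sourceUrl: string;
  note: string; // curated summary, never the restricted full text
};

type CapturedRecord = YearlySeries | TextExtract | SourcePointer;

// True when a record carries third-party text that needs a rights check
// before display; series and pointers carry no source text.
function carriesSourceText(record: CapturedRecord): boolean {
  switch (record.kind) {
    case "extract":
      return true;
    case "series":
    case "pointer":
      return false;
  }
}

// Placeholder records (values and URLs are illustrative, not real data).
const seriesRecord: CapturedRecord = {
  kind: "series",
  source: "Google Books Ngram Viewer",
  term: "hub",
  points: [{ year: 1900, value: 0 }],
};
const extractRecord: CapturedRecord = {
  kind: "extract",
  source: "Project Gutenberg",
  text: "placeholder excerpt",
  sourceUrl: "https://example.org/placeholder",
};
const pointerRecord: CapturedRecord = {
  kind: "pointer",
  source: "Lexical references",
  sourceUrl: "https://example.org/placeholder",
  note: "placeholder curated note",
};
```

Keeping the `kind` field on every record lets later steps decide per record whether text may be shown, which matches the rule that restricted sources are stored as pointers rather than full text.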

Frequency normalization

Where needed, Google Ngram values are converted to per-million frequencies so that series are comparable. Most charts query at smoothing 0, then apply local period aggregation, rank lookup, or max-normalization, so that a chart compares terms within the same source family rather than across incompatible corpora.
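The normalization steps can be sketched as three small functions. This is a minimal illustration, not the pipeline's actual code: it assumes Ngram values arrive as relative frequencies (a fraction of same-length n-grams per year), and the names `toPerMillion`, `aggregateByPeriod`, and `maxNormalize` are hypothetical.

```typescript
type YearValue = { year: number; value: number };

// Assumption: input values are relative frequencies, so multiplying by
// one million gives occurrences per million.
function toPerMillion(series: YearValue[]): YearValue[] {
  return series.map(({ year, value }) => ({ year, value: value * 1e6 }));
}

// Average values inside fixed-width periods (for example decades, width = 10).
// Each output point is keyed by the first year of its period.
function aggregateByPeriod(series: YearValue[], width: number): YearValue[] {
  const buckets = new Map<number, number[]>();
  for (const { year, value } of series) {
    const start = Math.floor(year / width) * width;
    const bucket = buckets.get(start) ?? [];
    bucket.push(value);
    buckets.set(start, bucket);
  }
  return Array.from(buckets.entries())
    .sort(([a], [b]) => a - b)
    .map(([year, values]) => ({
      year,
      value: values.reduce((sum, v) => sum + v, 0) / values.length,
    }));
}

// Scale a series so its own maximum becomes 1; an all-zero series stays zero.
function maxNormalize(series: YearValue[]): YearValue[] {
  const max = Math.max(0, ...series.map((p) => p.value));
  return series.map(({ year, value }) => ({
    year,
    value: max === 0 ? 0 : value / max,
  }));
}
```

Max-normalization is applied per series here, which is only meaningful for comparisons inside one source family; it does not make values from different corpora comparable.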

Display transformation

Visual scales may use square-root, max-normalized, indexed, or ranked transforms. These transforms are display devices only: the interface labels them as visual intensity, visibility index, or relative signal rather than raw counts.
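A sketch of how a display transform might carry its label with it, so a transformed value is never shown as a raw count. The pairing of each transform with a particular label is an assumption made for illustration, as are the function names.

```typescript
// Labels the interface uses instead of "count" for transformed values.
type DisplayLabel = "visual intensity" | "visibility index" | "relative signal";

type DisplaySeries = { label: DisplayLabel; values: number[] };

// Square-root scaling compresses large values so small ones stay visible.
function sqrtScale(values: number[]): DisplaySeries {
  return {
    label: "visual intensity",
    values: values.map((v) => Math.sqrt(Math.max(0, v))),
  };
}

// Index every value to the value at baseIndex, which becomes 100.
function indexTo(values: number[], baseIndex: number): DisplaySeries {
  const base = values[baseIndex];
  if (base === undefined || base === 0) {
    throw new Error("base value must exist and be non-zero");
  }
  return {
    label: "visibility index",
    values: values.map((v) => (v / base) * 100),
  };
}

// Rank values from 1 (largest); tied values share the same rank.
function rankDescending(values: number[]): DisplaySeries {
  const sorted = values.slice().sort((a, b) => b - a);
  return {
    label: "relative signal",
    values: values.map((v) => sorted.indexOf(v) + 1),
  };
}
```

For example, `rankDescending([5, 9, 5, 1]).values` is `[2, 1, 2, 4]`: the two tied values share rank 2 and the next rank is skipped.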

Phrase and variant policy

Each word page declares which forms belong together and which remain separate. Examples include spelling variants, compounds, X + word phrases, word + X phrases, singular/plural grammar, and domain phrases. Variant aggregation is treated as an editorial decision, not a default.
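A declared variant policy could look like the following sketch, in which nothing is merged unless the policy lists it. The `VariantPolicy` shape, the `aggregateForms` helper, and the example forms and counts are all hypothetical; the numbers are made up to show the arithmetic.

```typescript
// Hypothetical policy record: which forms are merged and which stay separate.
type VariantPolicy = {
  headword: string;
  merged: string[]; // forms counted together with the headword
  separate: string[]; // forms tracked but never summed into the headword
  rationale: string; // the editorial reason, shown alongside the chart
};

type FormCounts = Record<string, number>;

// Sum only the headword and the forms the policy explicitly merges.
function aggregateForms(policy: VariantPolicy, counts: FormCounts): number {
  const forms = [policy.headword, ...policy.merged];
  return forms.reduce((sum, form) => sum + (counts[form] ?? 0), 0);
}

// Illustrative policy and made-up counts (not archive data).
const examplePolicy: VariantPolicy = {
  headword: "hub",
  merged: ["hubs"],
  separate: ["hubcap"],
  rationale: "Plural merged; compound kept separate as a distinct object name.",
};

const exampleCounts: FormCounts = { hub: 10, hubs: 4, hubcap: 7 };

const mergedTotal = aggregateForms(examplePolicy, exampleCounts); // 14
```

The compound's count is left out of the total because the policy lists it under `separate`, which keeps the aggregation an explicit editorial choice rather than a string-matching default.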

Semantic grouping

Semantic layers are built from curated phrase sets, keyword/collocate overlaps, source annotations, and domain-specific evidence records. They are not presented as machine-learned sense disambiguation; they are interpretive maps backed by visible source categories and caution language.
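Curated phrase sets can be sketched as a plain lookup in which each group states its rule and a phrase may belong to several groups or to none. The `SemanticGroup` shape, the `groupsFor` helper, and the example groups are assumptions for illustration; no statistical sense disambiguation is involved.

```typescript
// Hypothetical curated group: the rule is disclosed next to the phrases it covers.
type SemanticGroup = { id: string; rule: string; phrases: string[] };

// Return the id of every group whose curated phrase set contains the phrase.
// Matching is case-insensitive and exact; no group is forced when none match.
function groupsFor(phrase: string, groups: SemanticGroup[]): string[] {
  const needle = phrase.trim().toLowerCase();
  return groups
    .filter((g) => g.phrases.some((p) => p.toLowerCase() === needle))
    .map((g) => g.id);
}

// Illustrative groups (invented for this example, not archive categories).
const exampleGroups: SemanticGroup[] = [
  {
    id: "transport",
    rule: "phrases naming a transfer point for vehicles or passengers",
    phrases: ["transport hub", "airline hub"],
  },
  {
    id: "network",
    rule: "phrases naming a central node in a network",
    phrases: ["network hub", "airline hub"],
  },
];

const overlapping = groupsFor("Airline hub", exampleGroups); // ["transport", "network"]
const unmatched = groupsFor("hub cap", exampleGroups); // []
```

Returning a list rather than a single label keeps the groups non-exclusive, and an empty result stays empty instead of being pushed into the nearest category.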

Branch and dependency scoring

For pages such as "hub", form groups and dependency tiers are computed from curated examples: counts, object-type spread, phrase form, and modifier dependence are converted into visual branch maps. The score indicates how much the attached word specifies the object; it does not indicate popularity or legal meaning.

Confidence and boundary labels

A claim can be source-supported, corpus-visible, manually attested, derived, pending, or cautionary. The page text must name that status instead of collapsing everything into proof. Absence, sparse data, OCR noise, rights limits, and genre bias remain attached to the claim.
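The status vocabulary can be encoded as a closed union so that a claim cannot be rendered without one. A minimal sketch: the `Claim` shape and `describeClaim` are hypothetical names, and the bracketed output format is an assumption, not the site's actual wording.

```typescript
// The six statuses named above.
type ClaimStatus =
  | "source-supported"
  | "corpus-visible"
  | "manually attested"
  | "derived"
  | "pending"
  | "cautionary";

// Caveats that stay attached to a claim.
type Caveat =
  | "absence"
  | "sparse data"
  | "OCR noise"
  | "rights limits"
  | "genre bias";

type Claim = { text: string; status: ClaimStatus; caveats: Caveat[] };

// Render the claim with its status and caveats named in the text itself.
function describeClaim(claim: Claim): string {
  const caveats =
    claim.caveats.length > 0 ? `; caveats: ${claim.caveats.join(", ")}` : "";
  return `${claim.text} [${claim.status}${caveats}]`;
}

const exampleClaim: Claim = {
  text: "The form is visible in the named corpus",
  status: "corpus-visible",
  caveats: ["OCR noise", "genre bias"],
};

const rendered = describeClaim(exampleClaim);
// "The form is visible in the named corpus [corpus-visible; caveats: OCR noise, genre bias]"
```

Because `status` is a required field with no "proven" option, the type itself refuses the collapse into proof that the page text is meant to avoid.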

06

epistemological position

Claim boundaries

The archive makes bounded claims. It can show that a selected form is visible in a named corpus, that a cited source supports an attestation, or that a curated semantic grouping organizes the evidence. It does not turn those signals into universal claims about all English usage, all communities, or all meanings of a word.

claims allowed

  • + Corpus frequency can show that selected forms become more or less visible within a named source boundary.
  • + Lexical and scanned sources can support attestation claims when source type and uncertainty are named.
  • + Semantic charts can show curated interpretive structure when the grouping rule is disclosed.
  • + Data collection, transformation, and visualization choices should be visible enough to audit.

claims refused

  • - Frequency does not equal cultural importance, lived experience, literary value, or legal meaning.
  • - A selected corpus is not all English usage, all genres, or all communities.
  • - A semantic group is not an automatic definition and is not mutually exclusive by default.
  • - A first detected corpus point is not the first historical use of a word.
  • - A missing result is not evidence that a usage did not exist.

07

public repository

Code & data

The data pipeline and visualization components are public for inspection, reproducibility, and citation review.

github / dpan538

Words-Over-Time

Public repository / Data pipeline / All components

What is inspectable

  • + Data pipeline scripts (TypeScript, Node)
  • + Generated JSON datasets
  • + All visualization and UI components
  • + Calculation methods and stopword lists

What is curated (not generic)

  • . Word selection and editorial decisions
  • . Category definitions and heuristics
  • . Interpretive annotations and pressure anchors
  • . Visual design and typographic decisions

citation note

Cite the archive and the upstream sources separately. A Words Over Time chart is an editorial synthesis of source retrieval, cleaning, transformation, semantic grouping, and visual design; it is not a replacement citation for Google Books Ngram, Project Gutenberg, Library of Congress, Wikimedia, dictionary publishers, policy pages, or clinical/technical references.

suggested website citation style

Words Over Time. "[Word page title]." Words Over Time, 2026, [page URL]. Accessed [day month year].

Example: Words Over Time. "Hub." Words Over Time, 2026, /words/hub. Accessed 27 May 2026.
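The suggested style can be generated from a page title, URL, and access date. This is a small sketch; `citeWordPage` and `formatAccessDate` are hypothetical helpers, and the access date is formatted in UTC to avoid time-zone drift.

```typescript
const MONTHS = [
  "January", "February", "March", "April", "May", "June",
  "July", "August", "September", "October", "November", "December",
];

// Format a date as "day month year", e.g. "27 May 2026", using UTC fields.
function formatAccessDate(date: Date): string {
  const month = MONTHS[date.getUTCMonth()];
  return `${date.getUTCDate()} ${month} ${date.getUTCFullYear()}`;
}

// Build a citation in the suggested website style.
function citeWordPage(
  title: string,
  pageUrl: string,
  year: number,
  accessed: Date
): string {
  return (
    `Words Over Time. "${title}." Words Over Time, ${year}, ${pageUrl}. ` +
    `Accessed ${formatAccessDate(accessed)}.`
  );
}

// Reproduces the example above (month index 4 is May).
const hubCitation = citeWordPage(
  "Hub",
  "/words/hub",
  2026,
  new Date(Date.UTC(2026, 4, 27))
);
```

This covers only the archive's own citation; upstream sources still need their own separate citations.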

Rights note: this page is a research and design archive, not legal advice. Public launch should keep source URLs visible, avoid full third-party reproductions, and remove or paraphrase any pending, restricted, or subscription-only excerpt.