Mappit Week One

mappit.ai launched 7/15/2026, one week ago today. This is a retrospective on the features that have been added to the product since the initial rollout. During these seven days the app has served over 20,000 unique visitors trending stories on their map. The corpus has grown from 12,000 to 19,000 stories for a rate of 1,000 per day.

Mappit started with one stubborn idea — the zeitgeist of Earth, on a map. AI agents sweep hundreds of the world’s cities around the clock and pin what’s happening onto an interactive, time-aware map. You can browse the whole planet’s news for free, rewind any day like a time machine, and ask questions about any place.

That was 1.0. Over the releases since, Mappit grew from a news map into a living, navigable atlas — one that now feels the world’s mood, routes you across it, keeps itself honest, and, as of the latest release, shows you the picture behind each headline. Here’s the tour.

The living map (1.0)

The foundation: an interactive world map where every marker is a place with a story. A fleet of AI models — reading live web and local news — keeps it current, and Marco, a built-in chatbot that has effectively read the entire map, answers questions with citations. A date-range filter turns the whole site into a time machine: because the corpus is append-only, you can replay any day’s map. Browsing is free; Mappit Plus ($5/mo) unlocks creating your own pins, collections, and dataset overlays.

The planet’s mood, and its weather (1.2–1.5)

Then the map learned to feel. A sentiment layer blooms each place in the color of its mood — red for tense, green for upbeat — so you can read the world’s temperature at a glance. Live hazard and environment overlays followed: real-time earthquakes, wildfires, air quality, and weather, layered onto the same canvas. For the markets-minded, the S&P 500 rides the time machine — every company pinned at its headquarters city, green or red for the day you’re looking at. And trending news gained three lenses: Global, Here (whatever’s on your screen), and For You.

Make it yours (1.6)

Mappit opened up. Shareable public profiles gave every user a real face — cover image, social links, and your pins, posts, and collections on display. You got your own pin on the map — an identity marker that’s you, one per account. You could post from any place, turning a location into a conversation. And everyone could wake up to the Daily Digest: a genuinely-written morning email, composed by AI, narrating what moved on the places you follow.

Navigate it (1.7)

The biggest leap: Mappit learned to get you there. Street-address search with live previews, address pins that geocode both directions, right-click “what’s here?”, and grown-up turn-by-turn directions you can type, click, or email to yourself. Collections became drivable multi-stop routes — “best taco trucks of Austin” computes into an optimized drive. And you joined the map in real time: Find Me drops your pin where you stand, Follow Me keeps it live as you move, and you can drag any pin to reposition it (with a quick undo for accidents).

Trust, freshness — and pictures (1.7.1 to 1.8.1)

The most recent releases turned inward, on the three things a news map lives or dies by: is it true, is it fresh, and can you see it?

On truth: AI models sometimes invent plausible-but-dead article links. Mappit now validates every story link the moment it’s crawled and deletes the fabricated ones before they ever reach the map — no more dead-end 404s. Under the hood the feed went multi-provider: six independent sources — Claude, Gemini, Grok, OpenAI, Kimi — now cross-check each other, so a story corroborated by several vendors carries more weight. (A careful cost audit along the way kept the whole engine honest and affordable.)

On freshness: the “Latest” strip shows the newest headlines the instant they’re crawled, in pure chronological order, right alongside the curated trending view. Timestamps now reflect when a story actually entered the feed, so “just now” means just now.

And now, on seeing it: every story can carry its own picture. Rather than guess at an image, Mappit reads the same share-card image the publisher already attached to the article — the one you’d see if you posted the link anywhere else — and shows it only after confirming it’s a real, live image. The trending strip reads like a proper news wire now: a headline, a place, and the photo that goes with it.

Where it’s going

From a static map of the world’s news to a living, navigable, illustrated atlas of the present — that’s the arc from 1.0 to 1.8.1. The map feels the world’s mood, tracks its hazards and markets, lets you plant yourself and your places on it, routes you between them, keeps itself honest and current, and now shows you the face of every story.

Every place has a story. Come find yours at mappit.ai.

Tags: bing maps, chatgpt, Claude, current events, driving, driving-directions, Gemini, google-maps, Grok, Kimi, mapping, Mappit, maps, SaaS, travel, trending, Zeitgesit

July 22, 2026
chatgpt, ChatGPT, Claude, Gemini, Grok, Kimi, LLM
Leave a comment

Introducing Mappit

Introducing mappit.ai

Earth’s daily zeitgeist, on a map. Trending, corroborated stories that scope to your view. Overlays like S&P 500 & Today in History. Drop pins, subscribe to locations & follow users. Chat with Marco for your morning news!

The most impressive thing about Mappit is that it retains a memory of all of the news stories, specifically where and when they occurred. It’s a time-aware map. So search & chat accept date ranges as parameters, and the entire map can be filtered by date to show you how the world looked at a point in history. Quite powerful when combined with overlays like S&P 500 or Sentiment.

The site is only 5 days old. Already seeing more than 1,000 users per day browsing the latest headlines, and a handful of users have signed up to drop their own pins & subscribe to feeds. Ultimately it will be a content-driven site with revenue coming from ads as well as some users paying to drop pins & subscribe to feeds. The real unlock will happen when we have enough users to justify having a business tier so they can drop pins on the map that will be visible to everyone in their local & For You feeds, as well as showing up in search results and chats with Marco.

It’s my goal to get to that point within one year of launch. By that time the historical map views will have 500K stories mapped over a year, along with the stock market & sentiment information. Mining that for datasets to ingest with Correlation Studio (correlationstudio.com) will be part of the mission.

In tandem with the launch of Mappit we’ve launched the $MAPPIT token on Robinhood. This process was facilitated by Orynth, a brilliant new application that allows solo founders to launch tokens and collect the transaction fees to fund their work. The current trade information is always just a click away from the Mappit status bar. For those trading directly:

CA – 0x24E5Cebc6C23FB7C7683ffAb11473C0f4A6C1Fd9

Mappit – Every place has a story.

Tags: $MAPPIT, AI, bing maps, blockchain, crypto, crypto coin, crypto token, current events, digital-marketing, directions, history, iphone maps, Mappit, mappit-ai, maps, Marco, rango, S&P 500, social media, stocks, technology, today in history

July 20, 2026
Business, ChatGPT, Claude, Computers and Software, Correlation Studio, Gemini, Geography, Grok, Mappit, programming, Tech
Leave a comment

Correlation Studio Whitepaper

Rapid discovery mining and causation analysis. Data science without the code.

Matthew Meadows (Rango) · July 2026 · correlationstudio.com

The Premise

Every dataset is hiding something. The relationships are in there — commodity prices tracking input costs, market indices moving with macro indicators, temperatures pairing with temperatures a watershed away — but finding them traditionally requires a notebook, a language, a statistics library, and the patience to test hypotheses one at a time.

Correlation Studio inverts that workflow. You don’t bring a hypothesis; you bring data. The platform mines every numeric column pair across your datasets for statistically significant relationships, ranks what it finds, tests it for predictive causality, writes up the analysis, and gives every result a permanent, shareable, citable page. No Python. No notebooks. No code.

This document is the technical story of how that works: the architecture, the algorithms, the statistics, and the engineering arc that got it here — including the version that broke, and the rebuild that didn’t. It is, end to end, the work of one engineer over roughly five months and 570 commits. That constraint isn’t a footnote; it’s the design principle. Every architectural decision below is biased toward simplicity at the expense of optionality — one database, one app server, one object store, one embedded query engine — because a system simple enough for one person to fully understand is a system one person can operate, debug, and evolve at production speed.

What It Is

Correlation Studio is a vertically-integrated correlation-analysis platform. The workflow, end to end:

Bring data in. Upload CSV, TSV, Excel, or Google Sheets; paste raw text; supply URLs; or describe what you’re looking for and let three AI providers — Claude, Gemini, and Grok — search the open web for sources in parallel. Every discovered URL is verified, downloaded, type-detected, and previewed before ingestion.
Pair datasets into Experiments. Choose a comparison mode (one-against-one through everything-against-everything), a join strategy (by date, by shared key, or by row order), and a significance threshold.
Mine. The engine computes correlations across every numeric column pair. Each pair whose strength clears the threshold becomes a Discovery — a first-class entity carrying Pearson and Spearman coefficients, a p-value, a 95% confidence interval, an interactive chart, and drill-down access to the exact source rows behind any datapoint.
Interrogate. Ten visualization modes, five regression families with prediction intervals, lag analysis, rolling correlation, and on-demand Granger causality testing in both directions.
Explain. AI analysis reads the statistics like a senior analyst — strength, direction, confounders, caveats, actionable insights — and pins a written narrative to the entity.
Publish. Discoveries, experiments, datasets, and block-composed Portfolios can be published to a public feed, rated, discussed, and shared. Every public entity gets a stable URL, structured metadata, and search-engine-grade rendering.
Ask. Corrie, the in-app assistant, answers questions against the entire public corpus with real coefficients and citations that link back to the source discoveries.

The corpus

As of July 1, 2026, the public corpus stands at roughly:

Public entity	Count
Discoveries	50,000
Experiments	10,000
Datasets	5,000

Every public discovery carries real statistical context and a written analysis — the corpus was seeded deliberately so that search, the chatbot, and human browsers all retrieve substance, not just titles and r-values. The entire public corpus is published as open data under CC BY 4.0, with creator attribution, in machine-readable form (more on that below). It is growing daily — a systematic, politeness-first crawler now harvests open-data portals continuously.

The Architecture

The production system is deliberately small:

A single PostgreSQL 17 database for all metadata: users, billing, datasets, experiments, discoveries, jobs, audit, and the RAG vector index (via pgvector).
Cloudflare R2 object storage for all bulk data, stored as columnar Apache Parquet — one file per dataset, one small file per discovery’s chart payload. Per-byte pricing, zero egress fees.
DuckDB embedded in the application server, querying Parquet directly through a local NVMe hot-tier cache with LRU eviction. Analytical SQL runs in-process; there is no separate analytics cluster.
A .NET 10 ASP.NET Core API in a clean three-layer solution, hosting all parsing, correlation math, AI orchestration, thumbnail rendering, billing, and two dozen background services.
A React 18 + TypeScript SPA with canvas-rendered charts, route-level code splitting, and optimistic UI throughout.
Stripe end-to-end for payments; three AI providers in parallel (Anthropic, Google, xAI) for search and analysis, plus OpenAI and Gemini for embeddings.

Two mid-range VPS instances run the whole thing — one for Postgres, one for the app — with R2 as the third leg. That’s the entire footprint.

Why this shape? Because the first shape broke.

The Crucible: Breaking at Ten Users

Commit zero was December 29, 2025. The original architecture (v1) was conventional for a data product: parsed rows landed in Postgres tables — a row store, plus two index tables holding pre-parsed join keys and numeric values per column, sharded across per-user content databases. It worked in development. It demoed beautifully.

It fell over at roughly ten concurrent active users.

The failure mode was write amplification. A wide CSV — a thousand columns by a few million rows — exploded into hundreds of millions of B-tree index inserts. On SATA-backed storage, concurrent ingestion turned that amplification (measured at 50–150× on wide datasets) into IOPS queue starvation: the write-ahead log couldn’t drain, ingestion stalled, and everything sharing the disk stalled with it. The system hit the wall precisely when the first real traffic arrived.

The response was not a tuning pass. Tuning buys margin; it doesn’t change the physics. The response was a clean-room replacement of the entire bulk-data layer — and the decision to make the new substrate match the actual workload:

The workload is analytical. Column statistics, correlation joins, drilldown lookups — these read a few columns across many rows. A row store reads every byte of every row to serve them; a columnar format reads only the columns asked for.
The data is write-once. A dataset is ingested, then queried many times. That’s the exact profile object storage plus immutable columnar files is built for.
Compression is free leverage. Dictionary and run-length encoding compress low-cardinality columns 10–100×. Less storage, less I/O, faster scans.

The Lakehouse: Parquet + DuckDB

The v2 architecture — internally, the “Lakehouse” migration — landed on May 17, 2026, about twenty weeks after commit zero. Ingestion now writes one Parquet file per dataset to R2. Every analytical query — column statistics, experiment correlation joins, datapoint drilldowns, thumbnail renders — is DuckDB SQL executed in-process against those files, through a local NVMe cache so hot datasets read at disk speed and cold ones fetch from R2 on demand.

The numbers that matter from the cutover:

Roughly 5,000 lines of C# were deleted. The sharding apparatus, the chunked index builders, the bulk-delete machinery for hundred-million-row index tables — all of it became unnecessary, because the tables it managed no longer exist.
The frontend never noticed. Every dataset, experiment, and discovery API contract survived intact. The UI that ran against v1 on Friday ran against v2 the next week.
The wall moved. The architecture that collapsed at ten concurrent users now absorbs about a million requests a month — with the app server’s memory headroom untouched. (Operational numbers below.)

DuckDB deserves specific credit. An embedded analytical engine that reads Parquet natively, with predicate pushdown into row groups (a query that filters on a date range skips 50,000-row chunks that can’t match), collapses what would otherwise be a data-warehouse deployment into a library call. A fixed connection pool serves every read path; ingestion back-pressures politely when the pool runs hot.

The row-group structure also gives ingestion a natural batch size: rows buffer in 50,000-row groups, which is granular enough for pushdown without bloating file metadata. A typical user dataset — a hundred thousand to a few million rows — lands as a single file of 2 to 100 row groups.

Getting Data In

Ingestion is where a no-code tool earns its keep, because real-world data is hostile.

Three ways in, plus a crawler

Upload or paste covers the file-in-hand case: CSV, TSV, Excel, Google Sheets, and HTML tables, streamed to the server (files up to 100 GB) rather than buffered.

Web links accepts URLs directly and downloads server-side, with a politeness layer: per-domain concurrency caps, minimum request spacing, and honor for Retry-After headers.

AI Remote Search is the differentiator. Describe a topic — “daily historic S&P 500 data” — and Claude, Gemini, and Grok search in parallel for open-data sources. Results are merged and deduplicated, with URLs surfaced by multiple providers ranked higher. Every candidate URL is then verified before it’s offered: a HEAD probe (with GET fallback) confirms the link is alive and actually serves data rather than an HTML landing page. LLMs fabricate URLs; the verification layer is what makes the feature trustworthy. Verified sources accumulate against per-user Topics, so the next search on the same subject can reuse the cache at zero AI cost.

Downloads themselves are engineered for the long tail: resumable by byte range across restarts, idle-timeout-based liveness detection (a 25-minute government-server download is normal; a wire silent for 60 seconds is not), and truncation detection for chunked-transfer endpoints that never declare a length — including a post-download ranged probe that asks the server how big the file should have been.

The newest addition (v2.5.0, July 2026) is a crawler for systematic harvest: point it at an open-data portal with a topic and seed links, and it walks the site — strictly inside robots.txt rules, with per-domain pacing under a dedicated user agent (corriebot) — surfacing every downloadable dataset it finds. A resolver stack handles the reality that download links hide behind JavaScript: schema.org Dataset JSON-LD (how data.gov exposes files), site-specific adapters for Socrata and FRED-style portals, and an opt-in headless renderer for fully client-rendered pages. Results hand off to the same wizard pipeline as everything else.

Parsing hostile files

Real-world CSVs open with disclaimers, contact information, and blank lines. Headers span two or three rows. Agencies insert section banners mid-table. The preamble/header detector went through six iterations, each triggered by a single real-world file that broke the previous version. The current algorithm runs a two-pass analysis over the first ~50 lines — field-count consistency streaks, fill ratios, alphabetic-content detection, richest-header selection — then merges multi-row headers with last-row-wins semantics and backward-fills labels for columns the primary header row left blank.

Type inference classifies per-cell (numeric with currency/percent/parenthesized-negative cleaning; temporal with ISO 8601 canonicalization; partial dates like 1871.10), then runs refinement passes over the census: integer columns labeled “date” whose values all parse as valid YYYYMMDD get retyped; datasets whose “header” row parses cleanly as data get demoted to headerless with synthetic column names.

After the Parquet is written, a statistics pass computes per-column count, distinct count, min/max/mean/standard deviation, quartiles, skewness, a 20-bin histogram, and Tukey-fence outliers — one DuckDB pass per column, persisted once. The dataset’s Distributions and Quality tabs render from those precomputed rows in milliseconds; in v1 the same tabs recomputed on every view and took 30–90 seconds on large datasets.

Reshaping

Eighteen transform tools produce derivative datasets without leaving the browser: transpose, difference, lag/lead, rolling aggregates, time-series resampling, filtering, normalization (z-score and min-max), percent change, cumulative sum, four flavors of missing-value fill including linear interpolation, deduplication, group aggregation, pivot, rank, binning, statistical outlier removal, merge/join, and append/stack. Each is a DuckDB SQL plan streamed into a fresh Parquet — the output is a real dataset, immediately usable in experiments.

The Experiment Engine

An Experiment pairs two datasets and exhaustively correlates them: every numeric column on X against every numeric column on Y. The engine’s job is to make “test everything” cheap, deterministic, and statistically honest.

Joining

Two datasets rarely share row identity, so the engine supports three join semantics:

Row sequence — positional pairing (X row n with Y row n) for pre-aligned data, with numbering assigned before null filtering so a missing value in row 3 doesn’t shift every subsequent pair.
Shared key — exact match on a designated key column (country codes, product IDs, exact timestamps), with duplicate keys aggregated per a per-column aggregate function (average by default; sum, min, max, first, last, count available).
Time series — the workhorse. Both sides snap their timestamps to a common epoch grid controlled by one tolerance knob (hourly, daily, weekly, monthly, yearly buckets), so daily X data joins hourly Y data by averaging Y within each day. The bucket expression is deliberately timezone-stable and shared verbatim between the join and the drilldown query — a lesson learned when a timezone-dependent cast produced charts that worked and drilldowns that silently returned nothing.

Sampling, deterministically

Experiments can run on a percentage sample. Sampling is hash-based on the join key — both sides keep a key if hash(key) mod 100 falls under the sample percentage — so the surviving pairs always line up, and re-running the same experiment yields the same sample. Published discoveries are reproducible artifacts, not lottery draws.

The statistics

For each column pair, one analytical query computes:

Pearson’s r over the joined, null-and-NaN-filtered pairs, via DuckDB’s numerically stable CORR aggregate.
Spearman’s ρ as Pearson over ranks — computed in the same pass with window-function ranking.
p-value via a two-tailed Student-t test: t = r·√[(n−2)/(1−r²)] with n−2 degrees of freedom.
95% confidence interval via the Fisher z-transform: z = atanh(r), standard error 1/√(n−3), back through tanh.

Pairs with fewer than three surviving points, zero variance on either side, or |r| below the experiment’s threshold don’t become discoveries. Two additional gates keep the catalog honest: an identity gate drops pairs whose correlation rounds to exactly 1.0 (a column against itself in disguise), and a duplicate gate skips column pairs the user has already mined in other experiments — before any computation happens — so re-running a growing cross-matrix costs only the new pairs.

Each surviving discovery gets a compact chart payload: up to 5,000 joined points, selected by deterministic hash ordering (same input, same sample), written as a small Parquet of its own. The full surviving pair count n is what the statistics report; the 5,000-point cap is purely a rendering budget.

Throughput on the hot path went through a deliberate optimization arc — batched inserts, payload-format changes, and finally the v2 columnar rebuild — moving from 3 pairs/second to a sustained 14–15 with peaks over 20. An “everything against everything” run across dozens of datasets is a coffee break, not an overnight job.

From Correlation Toward Causation

The platform’s second brand promise is causation analysis, and it’s handled with statistical honesty: correlation is never presented as causation, but the tools to probe the question are one click away.

Granger causality asks the falsifiable version of the question: do past values of X help predict future values of Y beyond what Y’s own history already predicts? The implementation is the full classical pipeline:

Stationarity check on both series via an Augmented Dickey-Fuller test; non-stationary series are differenced (up to twice) before testing.
Optimal lag selection by minimizing the Bayesian Information Criterion across candidate lags.
F-test comparing the restricted model (Y on its own lags) against the unrestricted model (Y on its own lags plus X’s lags): F = [(RSS_r − RSS_u)/k] / [RSS_u/(n−2k−1)].
Both directions. X→Y and Y→X are tested separately; a discovery reports both F-statistics, both p-values, the optimal lag, and a four-state direction summary (none / X→Y / Y→X / bidirectional).

The UI language is careful: Granger causality is predictive causality. It cannot rule out a third variable driving both series — and the platform’s own AI analyses say so explicitly when the data warrants it.

Three lighter-weight instruments complete the causality toolkit:

Lag analysis slides one series against the other and re-computes r at every offset, revealing lead/lag structure visually.
Rolling correlation sweeps a window across the joined series to expose whether the headline r is stable, decaying, or hiding a sign flip.
Divergence detection compares |Spearman| against |Pearson| per discovery. When Spearman is meaningfully higher, the relationship is monotone but non-linear (Pearson under-reports curves); when Pearson is meaningfully higher, outliers are propping it up. The platform flags both cases automatically, right on the discovery list.

Regression rounds it out: linear, quadratic, cubic, logarithmic, and exponential fits via ordinary least squares, each with R² and a 95% prediction interval band drawn on the chart, plus a residual-plot mode for diagnosing what the chosen model misses. A model-comparison panel fits all families at once and ranks them.

Ten Ways to See a Relationship

Every discovery renders through a purpose-built visualization layer:

View	What it shows
Scatterplot	The relationship itself, with density-bucketed dots and regression overlay
Line graph	Both series against row order, dual-axis, density-aware alpha
Residual plot	Distance from the fitted curve — fit diagnostics at a glance
Trajectory	Grouped series tracing arcs through (x, y) space over time
Heatmap	The full r matrix across an experiment’s column pairs
Sparkline grid	Every discovery in an experiment as a wall of mini-charts
Bubble chart	r vs sample size, sized by non-linearity
Divergence chart	Pearson vs Spearman per discovery, against the agreement line
Lag chart	Correlation at every time offset
Correlation network	A force-directed graph of which columns move together

The primary chart engine is canvas-based and double-buffered (mode switches never flash), with density handling designed for real data: scatter points collapse into buckets with logarithmically-scaled radii, and dense line renders modulate stroke alpha so 30,000 segments read as density rather than a solid band. A quiet guard computes lag-1 autocorrelation on both series and warns when a line graph would be misleading because row order is noise.

Clicking any datapoint drills down to the actual source rows from both datasets that produced it — bucket-aware, so a daily time-series point returns every X row and every Y row from that day, streamed to the browser as they’re read. The chain of custody from chart to raw data is never more than one click.

Every public discovery also gets a server-rendered thumbnail (the same bucketing math, rendered headlessly), which is what makes feeds, portfolios, search results, and social-media unfurls visually rich.

The AI Layer

AI shows up in four places, each priced and audited:

Search (described above): three providers in parallel, results verified before presentation.

Analysis: any dataset, experiment, discovery, or portfolio can be analyzed by Claude, Gemini, or Grok. The model receives the real statistics — coefficients, p-values, confidence intervals, Granger results, column semantics — and produces a structured narrative: overall relationship, strength and direction, patterns and outliers, confounding factors, actionable insights. The narrative is pinned to the entity, rendered as Markdown, editable by the owner, and re-generable on demand. Analyses are honest by construction: given a modest r with no Granger signal, the write-up says co-movement from shared macro factors, not causation.

Corrie, the in-app assistant, is a retrieval-augmented chatbot over two corpora: the product documentation, and every public entity on the platform. The pipeline is classical RAG done carefully:

Embeddings are dual-vendor: OpenAI’s text-embedding-3-large (truncated to 1536 dimensions) as primary, Gemini’s embedding model as automatic fallback on transient failures — same dimensionality, so the vector index doesn’t care which vendor produced a given row.
Indexing is lifecycle-driven. Publishing, unpublishing, deleting, or re-analyzing an entity enqueues a durable index command; a background worker batches embeddings (hundreds per API call), retries with exponential backoff, and never gives up on a transient failure. The corpus tracks the platform in near-real-time.
Retrieval is pgvector cosine similarity, top-8, with a visibility filter so private content never leaks into another user’s chat.
Synthesis runs on Claude Haiku for speed and cost, escalating to Sonnet when the cheap model’s answer looks low-confidence. Every answer carries citations that link to the underlying discoveries, and every query writes an audit row with token counts and cost.

Ask Corrie how the S&P 500 relates to oil prices and the answer is grouped coefficients across time periods, each citation-linked — with the caveats stated, like Granger tests showing no significant predictive direction.

Classification: a Haiku-based classifier assigns every public experiment and dataset to a curated two-level topic taxonomy by reading the analysis text the platform already generated. Discoveries inherit their parent experiment’s topics — one column pair shares its parent’s subject — so the entire 50,000-discovery corpus is topic-organized at the cost of classifying only experiments and datasets.

Open by Design

A public corpus is only valuable if it can be found. A React SPA is invisible to anything that doesn’t execute JavaScript — which includes search-engine first passes, LLM crawlers, and social-media unfurlers. The discoverability layer (v2.4.0, late June 2026) fixed that end to end:

Server-rendered entity shells. Every public dataset, experiment, discovery, and portfolio URL serves real content before JavaScript runs: a proper title and description, Open Graph and Twitter Card tags pointing at the entity’s chart thumbnail, a readable article body (the analysis narrative, the coefficient block, the column roster) for non-executing crawlers, and schema.org JSON-LD — Dataset for datasets with a real downloadable distribution, Article for analytical entities, breadcrumbs throughout. Human visitors get the SPA exactly as before; the server-rendered layer exists for the first, non-JS fetch. The rewrite adds nothing measurable to latency because it templates data the API already loaded.
Topic hubs turn the taxonomy into public, crawlable index pages — an internal-linking mesh connecting every entity to its subject neighbors, which is the single biggest lever on how deeply crawlers explore a site.
Freshness signals: sitemaps with true per-type modification dates, an image-sitemap extension advertising every discovery’s chart to image search, and IndexNow pings fired asynchronously on every publish so participating engines recrawl within minutes.
Open data, three ways. The full public corpus ships as llms-full.txt (the whole corpus as one Markdown document, following the emerging convention for LLM consumption); as a streamed NDJSON bulk dump where every record is self-describing with license and attribution embedded; and as per-dataset anonymous CSV exports referenced from each dataset’s JSON-LD — which makes every public dataset, and the corpus itself, discoverable in Google Dataset Search. License: CC BY 4.0, attribution to the content’s creator.

The bet is simple: analytical content with real statistics and honest write-ups is exactly the kind of material search engines and AI answer engines want to cite. Making it maximally machine-readable is marketing that compounds.

The Token Economy

The business model is engineered to keep the free tier genuinely free and the paid path honest:

Free to join. No credit card. Every account gets 10 GB of storage with no recurring charge, and new registrations receive a starting token grant.
Tokens are the metered unit. Ingestion, correlation runs, transforms, AI search, AI analysis, and chatbot queries each carry a posted token price (ingestion and correlation scale with data size; AI operations are flat per call). Token packs range from $5 to $200 with a deliberately tight volume discount, and tokens never expire. There is no subscription requirement; one optional storage subscription tier exists for users who outgrow 10 GB.
Deleting content earns credit back — a large fraction of the size-based portion of the original ingestion cost returns to the balance, so experimentation isn’t punished.

Under the hood, the accounting is bank-grade in miniature. Every token movement writes a ledger row whose running balance comes from the same atomic database operation that moved the balance — a lost-update race in an early read-modify-write implementation produced real drift in production, and the fix was to make drift structurally impossible rather than merely unlikely. Stripe integration is idempotent at every entry point (webhooks retry aggressively for days; every handler dedupes on event identity), invoices are generated as PDFs with monotonic numbering and snapshotted billing details, and a weekly reconciliation sweep cross-checks stored balances against the ledger and storage counters against actual bytes.

The unit economics close at small scale by design. Infrastructure is two VPS instances, a CDN, and per-byte object storage with zero egress — roughly $300 a month all-in, independent of corpus size in any way that matters. The only cost line that scales with engagement is AI API spend, which is exactly the line token pricing covers. A few dozen actively-engaged users make the platform self-funding; everything above that is margin. The free tier can stay free indefinitely because storage is the cheap part and idle users cost approximately nothing.

Running It

The operational posture is recovery-first: assume the process can die at any moment, and make every in-flight operation either resumable or safely restartable.

Dataset and experiment pipelines run on single-column state machines with deterministic startup recovery — a crash mid-ingest resumes or requeues based on exactly what survived (partial file on disk, upload still staged, download resumable by byte offset).
Long-running remote downloads are durable across restarts, with heartbeats, partial-state detection, and byte-range resume.
Background work runs under leader election so a blue/green deploy can’t double-charge or double-process.
Every external integration assumes retries: idempotent webhooks, embedding-queue commands that back off exponentially and never give up, object-storage uploads that rebuild their TLS client when the connection pool itself is the problem (a real production failure mode: a poisoned pooled socket failing identically on every retry until the pool was forcibly replaced).

A 30-day operational snapshot from early summer 2026, under live ad-driven traffic:

Metric	Trailing 30 days
Requests	~1,000,000
Unique visitors (IPs)	~45,000
Average request duration	79 ms
Server errors (5xx)	5 total
Uptime	100.00%

Five server errors in a million requests, at 79 milliseconds average latency, on two modest servers — with the crawler-facing SSR layer in the request path. That table is the whole architectural argument compressed: the system that fell over at ten concurrent users now doesn’t notice a million-request month.

The Timeline

The compressed history, for the engineering-curious:

Dec 29, 2025 — commit zero. Three-layer .NET solution + React SPA. The layout never changes.
January–February — dataset ingest plumbing; Discoveries and Experiments become real entities.
March — public IDs, the token economy, AI-powered dataset search (Claude first, then Gemini), remote downloads, soft-delete.
Early April — Grok joins as the third search provider; ingestion re-architected; state machines replace boolean-flag soup; the token ledger lands.
Mid-April — the visualization wave (heatmaps, divergence, bubble, lag, rolling, network) and the social layer (portfolios, posts, messaging, forums, workgroups) in a single furious fortnight.
Late April — migration from SQL Server to PostgreSQL; database sharding built out; Granger causality ships; Stripe lands end-to-end.
Early May — the performance arc (3 → 15 pairs/sec), the RAG chatbot built and shipped in two days, resilience hardening across the wizard and download paths.
May 17 — the v2 Lakehouse migration. Sharding deleted, Parquet + DuckDB in, ~5,000 lines of C# removed, frontend untouched. The defining architectural event of the project.
Late May — token-flow hardening (an external audit closed in five sequential phases), content moderation, activity limits, security audit.
June 1 — public launch. Ad campaigns, funnel instrumentation, and the honest lessons of early go-to-market (the first campaign that actually converted was desktop-only with an exclude-list geo strategy — measured bursts beat always-on spend).
Late June (v2.4.0) — the discoverability release: crawler SSR, structured data, topic ontology, IndexNow, open-data surfaces.
Early July (v2.5.0) — the crawler: systematic, robots-respecting dataset harvesting at portal scale, feeding the corpus that now approaches 50,000 discoveries.

Five months, ~570 commits, one engineer.

Lessons That Survived Contact

Every one of these was learned in production, not in a design review:

One state column beats four boolean flags. Every queue bug in the project’s first four months traced back to flag combinations no one had enumerated. A single authoritative state with one writer per transition made startup recovery deterministic and killed the bug class.
Make financial drift structurally impossible. Balances move only through atomic database operations that return the new value; the audit ledger records that returned value, not a separately-computed one. Reconciliation then verifies invariants instead of repairing them.
Delete the clever thing when the simple thing proves more durable. Sharding, per-user content databases, cross-database replication schemes, four-flag state tracking, in-database bulk row storage — each was replaced by something with fewer moving parts, and every replacement paid back faster than projected.
Match the storage engine to the read pattern. The v1 failure wasn’t a bug; it was a row store serving a columnar workload. No amount of tuning crosses that gap.
Build the same expression once. The join-bucket bug — charts working while drilldowns returned nothing — happened because two code paths computed “the same” key two subtly different ways. The fix wasn’t the timezone handling; it was making one function the only source of that expression.
Idempotency is not optional at the edges. Payment webhooks, embedding queues, download resumes, crawler frontiers — anything that talks to the outside world will be retried, replayed, or duplicated eventually.
Real-world files are the adversary. Six iterations of header detection, each falsified by one government CSV. Parsers earn trust one hostile file at a time.
A solo build’s edge is loop time. Not being smarter on paper — closing the distance between “this broke in production” and “this is fixed and can’t break that way again” in hours instead of sprints.

Where It Goes From Here

The platform is feature-complete for its core personas — analysts, researchers, students, and the incurably curious — and default-alive on the economics. The corpus compounds daily through the crawler. The open-data surfaces are in the wild, accumulating citations. Horizontal scale-out, when traffic demands it, is architecturally trivial: the app server holds no shared mutable state, and both Postgres and R2 are already shared substrates.

The premise from the top of this document was that one engineer with the right primitives could build and operate a system that historically took a team. The evidence is the system you’re reading about — and the corpus it built.

See for yourself: browse the live corpus, explore the open data, check pricing (free to join, no credit card required), or just create an account and upload something. Your data is hiding something. Find it this afternoon.

Tags: AI, API, Artificial Intelligence, chatgpt, Claude, cloud, Correlation Studio, Corrie, Gemini, Grok, OpenAI, rango, technology, whitepaper

July 5, 2026
AI, Automation, Big Data, Business, ChatGPT, Claude, Computer Science, Computers and Software, Correlation Studio, Gemini, Grok, LLM, Machine Learning, Tech
Leave a comment

Correlation Studio v2.4

Version 2.4 is a major SEO and content release focused on improving the site’s reputation, discoverability, and visibility for both search engines and AI-powered crawlers.

Most of the improvements happen behind the scenes, but users will notice several significant new features.

What’s New

📚 Browse by Topic

A new Topics menu provides an ontological view of the entire platform, making it easy to browse datasets and discoveries by category. The Topics browser is fully available on both desktop and mobile.

🤖 Reanalyze with AI

Datasets and Discoveries now include a Reanalyze button that generates fresh AI analysis using Claude. Users can also edit the generated analysis directly from the detail pages.

🏷️ New Branding

The platform has been rebranded around its core mission:

Rapid Discovery Mining & Causation Analysis

A shorter version also appears throughout the site:

Discovery Mining • Causation Analysis

📈 Expanded Public Library

600 Datasets
8,000 Experiments
45,000 Discoveries
165 Million Records
15,000 Columns

Don’t forget to try the new full-panel Search, introduced in Version 2.3.

About Correlation Studio

Correlation Studio is a powerful SaaS platform that brings the insights of modern correlation data science to everyone—without requiring users to write code.

Users can upload their own data or discover new datasets through the AI-powered Dataset Wizard, driven by Claude, Gemini, and Grok. Imported datasets become the foundation for Experiments, which automatically compare every compatible column against every other column.

From Data to Discovery

Each experiment produces reusable Discoveries containing:

Correlation metadata
Scatterplots
Line charts with drill-down capability
Geospatial visualizations
P-values
Granger causality analysis
Additional statistical metadata

Every Discovery becomes a searchable, analyzable knowledge object rather than a temporary calculation.

Create Interactive Portfolios

Datasets, Experiments, and Discoveries can be assembled into shareable Portfolios that combine:

Markdown
Images
Videos
Podcasts
Lectures
Interactive statistical content

Portfolios make it easy to present research findings in an engaging, interactive format.

AI Throughout the Platform

Artificial intelligence is integrated into every stage of the workflow.

AI-assisted Dataset Wizard
Claude-generated analysis for Datasets
AI explanations for Discoveries
Portfolio analysis and summaries

The platform also includes Corrie, an AI assistant that understands every public entity on the site.

Corrie is context-aware, allowing users to ask questions about the page they’re viewing, statistical concepts, or correlation analysis in general. She also serves as the platform’s curator, publishing hundreds of datasets, thousands of discoveries, and numerous example portfolios.

Built for Collaboration

Every major entity—Datasets, Experiments, Discoveries, and Portfolios—can be published to the community feed.

Users can:

Create posts
Attach datasets and discoveries
Like content
Rate content
Leave comments
Share research publicly

From its inception, Correlation Studio was designed as a social platform for statistical discovery.

Who Is It For?

Correlation Studio is designed for:

Business professionals
Researchers
Data scientists
Students
Teachers

Many users may not realize that correlation analysis could improve their workflow until they see it in action. The platform makes advanced statistical discovery accessible to technical and non-technical users alike.

Even casual visitors and mobile users can explore the growing public knowledge base containing over:

600 Datasets
8,000 Experiments
45,000 Discoveries

Correlation Studio – Data Science Without the Code.

Tags: AI, Artificial Intelligence, Causation Analysis, chatgpt, Claude, Correlation Studio, Discovery Mining, Gemini, Grok, LLM, SaaS, technology

June 30, 2026
AI, Big Data, Business, Claude, Computers and Software, Correlation Studio, Gemini, Grok, LLM, Tech
Leave a comment

Correlation Studio – ChatGPT Analysis

Curious to see what ChatGPT had to say about Correlation Studio since it’s only been out for a month. So I fed it the website and a whitepaper for analysis. I was surprised to see most of the reference samples coming up from my replies on Reddit. But the real story was in the whitepaper. Here are the highlights, but jump to the end if you want to see our rating on a 10-point scale.

Technical Analysis of Correlation Studio

After reviewing Correlation Studio and its architecture, I came away with a very positive impression. This isn’t a typical startup that wraps AI around existing analytics software. It reflects a carefully engineered analytical platform with a well-thought-out architecture and a clear understanding of the challenges involved in large-scale statistical analysis.

Below are my technical observations.

1. The Architecture Is Stronger Than Most Solo SaaS Projects

The most significant architectural decision was migrating from using PostgreSQL as both metadata store and analytical engine to a true lakehouse architecture.

Rather than attempting to optimize indexes indefinitely, the storage model itself was redesigned.

PostgreSQL stores transactional metadata.
Cloudflare R2 stores immutable Parquet datasets.
DuckDB performs analytical computation.
Local NVMe storage provides a hot cache.

This mirrors many of the architectural principles used by modern analytical systems such as Snowflake, Databricks, ClickHouse, and MotherDuck, while avoiding the operational complexity of distributed infrastructure.

The separation of concerns is particularly clean:

Metadata remains transactional.
Bulk data remains immutable.
Analytics operate directly against Parquet.

2. DuckDB Was the Right Choice

Choosing DuckDB was probably the most important technical decision in the project.

Instead of building:

custom statistical engines
custom storage indexes
custom columnar formats

the platform leverages an extremely capable analytical database that already provides:

predicate pushdown
Parquet support
row-group optimization
high-performance SQL execution

As a result, many future performance improvements arrive automatically through DuckDB itself.

3. The Product Is Actually a Graph of Relationships

This may be the most underappreciated aspect of Correlation Studio.

Traditional analytics platforms treat correlations as temporary calculations.

Correlation Studio persists them as first-class objects called Discoveries.

Each Discovery contains:

metadata
provenance
visualizations
AI-generated explanations
publication metadata
comments
URLs
relationships

Instead of following the traditional workflow:

			
Run Query
↓
View Chart
↓
Discard Results

		

Correlation Studio models knowledge as:

			
Dataset
    ↓
Experiment
    ↓
Discovery
    ↓
Portfolio

		

This makes statistical discoveries reusable rather than disposable.

4. The Dataset Ingestion Pipeline Shows Experience

Several implementation details demonstrate experience with messy real-world datasets.

multi-row header detection
fuzzy preamble detection
headerless dataset detection
partial date parsing
NOAA and NASA edge cases
section divider handling

These are not academic problems—they are operational ones encountered only after processing thousands of imperfect datasets.

5. The Statistical Implementation Is Appropriately Conservative

Rather than inventing new statistical methods, Correlation Studio assembles proven techniques including:

Pearson correlation
Spearman correlation
Fisher Z transformation
Student’s t-test
Ordinary Least Squares (OLS)
Granger causality
Prediction intervals

Using established statistical methods alongside DuckDB and MathNet makes the platform significantly more trustworthy than many AI-first analytics products.

6. The Biggest Technical Challenge Is Combinatorics

The primary scaling challenge isn’t dataset size—it’s the explosion of possible column pairs.

For example:

400 columns × 500 columns = 200,000 comparisons
1,000 columns × 1,000 columns = 1,000,000 comparisons

Even with excellent execution speed, brute-force analysis eventually becomes impractical.

Future optimization opportunities include:

approximate correlation search
feature pruning
variance filtering
PCA
random projections
locality-sensitive hashing
early termination strategies

7. AI Is an Enhancement, Not the Core Product

One of the platform’s strengths is that AI explains statistical discoveries rather than replacing statistics altogether.

This architecture creates an important dependency inversion:

If large language models improve, Correlation Studio improves.
If AI vendors disappear, the statistical platform continues functioning.

That makes the system considerably more durable than products that rely entirely on AI.

8. The Biggest Product Challenge

The greatest challenge may not be engineering at all.

It’s communicating what Correlation Studio actually is.

Initially, the name suggests a statistical calculator.

After examining the architecture, it’s much closer to:

GitHub
Tableau
Kaggle
NotebookLM
Google Dataset Search
a statistical lakehouse

The onboarding experience should emphasize outcomes instead of mechanics—for example:

Find hidden relationships between your own data and thousands of public datasets.

9. A Feature Worth Considering: Correlation Graphs

If I were contributing to the project, one feature I’d prioritize would be relationship graphs.

Imagine every Discovery becoming an edge in a knowledge graph:

			
GDP
│
Inflation
│
Interest Rates
│
Housing Prices
│
Building Permits

		

Rather than isolated discoveries, users could navigate connected variables and uncover indirect relationships across datasets.

10. What Stood Out Most

What impressed me most wasn’t any single algorithm—it was the engineering maturity.

The architecture documents:

why design decisions changed
production failures and lessons learned
throughput improvements
operational instrumentation
performance tradeoffs

That level of transparency gives the architecture significant credibility.

Final Thoughts

Most analytics platforms answer questions users already know to ask.

Correlation Studio has the potential to answer questions users didn’t know they should ask.

That is a much more difficult—and potentially much more valuable—problem.

As the platform evolves, features such as relationship graphs, causal hypothesis generation, anomaly detection, and cross-domain exploration could make it feel less like traditional business intelligence software and more like a scientific discovery engine.

From a technical perspective, I’d rate the architecture around 9.5 out of 10 for a solo-built SaaS. The remaining work isn’t fixing the foundation—it’s building the next layer of capabilities that naturally extend an already solid design.

Tags: AI, Artificial Intelligence, chatgpt, Correlation Studio, philosophy, technology

Announcing Correlation Studio

On June 1st, 2026, a powerful new SaaS statistics application was unveiled. Correlation Studio brings the insights of correlation data science to expert users looking for no-code solutions, and opens it up to anyone who wants to do research on correlation models. It combines search, experiment & discovery statistics workflows with a publishing platform and social media fabric.

Correlation Studio – Welcome

Users bring their own data in the form of CSV and Excel files, or discover it with our Dataset Wizard. By providing only a topic & description, the wizard harnesses the power of Claude, Gemini and Grok to source files for downloading, ingestion & analysis. In this example we’ve asked for S&P 500 datasets with a simple query and can see the providers have all returned results. The wizard has stripped out bad results and presents only validated links for downloading.

Stepping through the wizard, users can review the links sourced by the AI providers and select datasets for downloading based on rich descriptions, format & size parameters.

After downloading, the Datasets can be examined and column mappings provided. Each column that is selected will be included in the ingested dataset and available for Discovery drill-down. For Datasets with unique keys such as time-series data, a Join Key will be specified for one or more of the columns. This column is used to combine datasets together for correlation analysis. Two datasets must share common key values or the same row-order in order to be used in Experiments.

Using the Create Experiment workflow, users combine Datasets together and prepare to analyze each column against every other. Experiments can consist of Self correlations, which examine every column in a single Dataset against every other, Cross correlations which examine every column of every Dataset in the experiment, or Star correlations which examines an anchor dataset against every other. A Full correlations option is provided to merge all three correlation types.

Here we’re creating a Star Experiment for the S&P 500 Daily Time Series, so the Row Matching mechanism is Time Series. Other values include Shared Key, useful for connecting datasets with information like stock ticker symbols, and Row Offsets, used when the Datasets are from the same export process and known to be linked by cardinal offset.

Additional parameters to the Experiment workflow include Sample Size, which can be used to select idempotent samples of the target Datasets, Minimum Paired Rows and Minimum Sample Coverage which ensure healthy statistical assessments.

After the Create Experiments workflow is complete the user is presented with an overview of all of their Experiments. There may be many, as in this example of 50. Each Experiment row can be expanded to reveal the Discoveries it contains, and each Discovery row can be expanded to reveal its chart.

Depending on the parameters of the Experiment, the columns in the Datasets and the number of rows connected by there may be anywhere from one Discovery to thousands. The Experiments detail page includes all the metadata and parameters of the Experiment, and a table of Discoveries. Click through to each one for a drill-down view.

Choose the Sparklines view to see an overview of all the scatterplots rendered on the same page, useful for spotting trends.

Click-through on any Discovery to view a detailed page with metadata, scatterplots & line charts with drill-down capability and geospatial views where available. The datapoints, P-values, Granger Causality & other metadata are all part of the Discoveries that are analyzable.

Each Discovery can be analyzed with the Granger Analysis and AI Analysis buttons.

Users compile their Datasets, Experiments and Discoveries into shareable Portfolios that combine them with markdown & multimedia to make interactive presentations featuring their podcasts & lectures.

It’s AI all the way, from the Dataset Wizard to Claude analysis for Datasets, Discoveries and Portfolios.

Corrie is the site’s chatbot and she understands every public entity on the site. She’s context-aware so you can ask her questions about what you’re looking at, or more general questions about correlations or statistics. Corrie is also the curator, and has published hundred of Datasets and thousands of Discoveries on the site, as well as example Portfolios.

The entire bundle of entities – Datasets, Experiments, Discoveries & Portfolios – provide content for the site. Users can publish them to the Home feed, create Posts and add them as attachments. There’s sentiment feedback in the form of Likes, Ratings and Comments for all entities. The application was built from the ground up for socializing.

The target market is researchers, data scientists, students & teachers who already know they need to do correlation analysis as part of their workflow. However, the casually curious and users on mobile can still enjoy the content and our comprehensive Search, which uses Year, Topic & Description as parameters.

“Data science without the code” – tagline

What kind of data science? Correlations. Discover hidden statistical relationships in numeric data. Bring your own files or use the Dataset Wizard to compile your sources. Ingest and combine Datasets into Experiments which yield Discoveries in the form of scatterplots, line charts, and statistics. Combine your Discoveries into sharable Portfolios with markup and multimedia support. Apply AI analysis for deep insight into the correlations and causation in your data.

That’s a tall order. Nobody else does that.

If you need this kind of data science you already know it. Please join us. New accounts are granted 5,000 tokens to explore the creator workflows, a $5 value. Only available on wide screen displays. Mobile users have a limited, read-only version of the site, but have full access to Search, Corrie, and our unique corpus of curated correlation data.

What will you discover today?

Tags: AI, Artificial Intelligence, business models, causation, chatgpt, Claude, correlation, Correlation Studio, data science, Gemini, Grok, LLM, statistics, technology

June 9, 2026
AI, Artificial Intelligence, Automation, Big Data, Business, Claude, Computer Science, Computers and Software, Correlation Studio, DataSquid, Finance, Gemini, Grok, LLM, Machine Learning, programming, Tech
Leave a comment

The Daily Grind: Humans in the Equation

I get asked a lot if I’m afraid of AI since it’s coming for my job and my answer is:

Not yet.

I already use AI in 75% of my professional work and it does 90% of the coding. I’m working for a global enterprise that has thousands of applications that will take a decade to modernize even with contemporary AI tools.

Still, the threat is real. I should be in a low-key panic, but I’m not and here’s why.

Everybody has access to AI now to build software just like everybody has access to guitars to make music. The tools are there.

It doesn’t mean you know how to use it. Or want to. Or can be good at it if even you want to be.

You have a pen & pencil, but that doesn’t make you an author. You have Excel, but you are not a mathematician. I have a guitar and I know how to use it, but I’m not Van Halen despite 45 years of trying to be.

Software development at scale still requires unique skills just like ripping out a guitar solo. It requires many years of dedicated learning effort, practice and failure, and most people wash out. Walk into any Guitar Center store for evidence.

What’s important is the domain knowledge that you bring to the table. What are you programming about?

Implementing something trivial is easy. Anybody can pick up a guitar and play a few cowboy chords, and everybody should. You could even write your own song, and you should do that, too. It won’t make you a musician, although you may be musically inclined. But it might make you happy.

Making a day planner with Lovable is fine in a world where everybody has their own day planner or their own calorie counter.
You could create those right now, with Lovable or Claude Code or whatever you get your fix off of. You could have your own website.

Why would you? It might make you happy, and if so you should do that. Something cool about that. Playing guitar makes me happy, and a few million other folks.

You might even pay for the pleasure of doing it.

But at the end of the day programming is not just about automation. It’s about creation. And generative AI technology helps a creative mind execute on its will. But what is the thinking behind the execution? It has to be something worth creating to bring any value to the equation. It has to be coherent, informed, disciplined and thorough.

I’ve been programming for 45 years, 35 of it professionally. But I don’t make my living by coding my whimsy. I make my living by learning other people’s problem domains and codifying them into rules-based systems. And that still requires knowing something about something.

“There are still humans in this equation, robot” – Rango Unmuzzled

See samples of my work:

http://rango.music

https://interactivecircleoffifths.com

Tags: AI, Claude, daily grind, fear, guitar, interactive circle of fifths, life, lovable, Music, rango.music, rock, shred, unemployment, writing

Claude: Death by Regex

Yesterday I used Claude for Chrome to help me storyboard a video for American Style. I gave him explicit instructions to browse the web for images and videos of 9/11 and the war in Afghanistan, preferring widescreen, high-resolution photos and videos. I added the requirement that they must all be open-source. I then provided the lyrics to the song we are storyboarding and asked him to create an index of the media related to the lyrics.

The song is 6 minutes long. I asked for 200 media items.

Claude proceeded to browse the web for over an hour, unattended, and identified 800 initial media items that matched the subject matter, from which it selected 200 based on my constraints. It created an index file which grouped into 10 sections according to the lyrics. which it cited inline.

That was impressive, but I needed to download all the media. So I asked Claude to crawl the list and download each piece of media, creating a new index file with local filenames. He started to doing the task, but after a few minutes stopped and prompted me with a question.

The web navigation was slow and prone to errors. He was having trouble downloading too many links and stopped. He said it was going to take 3-4 hours or more and was likely to fail to acquire all the media. Claude then proposed we create a Python script instead that would read the index file and download all the links. That sounded reasonable, so we proceeded to do that. Claude created the script and I ran it.

Pretty academic. The file was already in markdown language and easy to parse. Or so I thought.

The script failed immediately, zero downloads. What followed was an aggravating round of at least 15 revisions to that script as I copied non-trivial Python code and error messages back and forth with Claude, pointing out where it was incorrect and trying to get regular expression matches to work with spaces, underscores, slashes and invisible embedded characters from the file that he created.

That’s right, Claude got stuck on regex. Seriously stuck. All programmers should hate regex because they’re so damn complicated. They’re powerful, like a loaded gun. Claude kept telling me, “this is the final version” and “this will run correctly”. Fail. He finally said “we are going in circles” and he was right. I finally realized I was running against the Sonnet model, trying to conserve tokens since it was such an academic task. So I switched models to the latest Opus and in just a few revisions the script started working.

Mostly. It still failed to download after 10 files or so, blocked by the web server for being a greedy client. So I had to tune it to wait up to a minute between requests. I just started it, and it’s running as I write this. Estimated time to completion is 3-4 hours. I ran out of tokens for my session so the last round of debugging was on my own.

I found all of that astonishing given the great success I’ve had with Claude Code building the Interactive Circle of Fifths, and my latest full-stack application, Correlation Studio. Amazing, mind-blowing results and surreal conversations. Months worth.

But there he was, dead on the hill of regex. “We keep going in circles”. Famous last words.

American Style – Audio and Lyrics

The song in development is American Style. Featuring vocalist John Serrano and drummer Bill Ray, this track is a musical exploration of the psyche of American warfare post 9/11. It’s a hard rock / progressive metal track full of dramatic intensity and explosive power, drawing out the story of a sniper who goes to Afghanistan to avenge his countrymen. The collaboration was one for the books.

American Style on Spotify • American Style on Soundcloud • American Style on Bandcamp

I’m going to Afghanistan
That’s where I’ll make my final stand
Like Johnny Spann and everyone
Who loved him

I know you may not understand
I find myself the kind of man
To cross the line in the sand
For something

This is war, American style
Nightmares come of age
Deliver me from rage
This is war

The target looks like every man
I see him in the market stand
I could take him now
If I’m not wrong

Before he gets to Pakistan
I’ll track him through the mountains and
He’ll give me one reason
To be strong

This is war, American style
Nightmares come of age
Deliver me from rage
This is war

Winter’s come, it bites my hands
Like every day I’ve walked these lands
I’ve found another reason
For a man’s revenge

With every shot I’ve made my stand
So many men so much demand
I’ll never leave, I’ve joined the band
Until the very end

This is war, American style
Nightmares come of age
Deliver me from rage
This is war

Face to face, hand to hand
Fighting for the promised land
Every target, every man
Squeeze and go

I look at him and his demands
I know he’ll never understand
I’ve always had the upper hand
He will not know

This is war, American style
Nightmares come of age
Deliver me from rage
This is war

Tags: 9/11, Afghanistan, AI, American Style, anthropic, Artificial Intelligence, chatgpt, Claude, fail, John Serrano, LLM, lyrics, Opus, Regex, Regular Expressions, Sonnet, technology, terrorism, war

February 13, 2026
AI, American Style, Claude, Computer Science, LLM, Lyrics, Machine Learning, Music
Leave a comment

The Daily Grind – Push to Production

12 straight hours of debugging a production data issue and I finally had to excuse myself and call it a night.

I started by myself at 6:30 AM. By 9:00 I was hosting a meeting with another developer and a handful of managers. Also a few business people along with their confusing portfolio of queries, all manner of broken. This is a red-alert, all-hands-on-deck problem that is time-sensitive because it’s regarding year-end data.

It’s not right, that data.

So all day I’ve been debugging with people looking over my shoulder as I type and switch tabs and jump around, with managers asking questions and everybody speculating and expressing their concern and exasperation. At the start of the day the assumption was that I broke something spectacularly back in August. By noon we were onto something else, chasing data through linked tables and stored procedures that were created 20 years ago and haven’t changed in 10. By 3:00 we discovered the root cause.

By 6:30 PM we finally had a solution carved out in our test environment. I had made dozens of changes to jobs to simulate rolling back the clock so the year-end process could be run again. We were just about to pull the trigger on the offending jobs when I abruptly had a moment of doubt. I concluded it was unwise to proceed.

We’re not really going to do this in production tonight, are we?

I then explained why I thought it was a bad idea for me, specifically. Mental fatigue is real. 12 hours in the chair, no food, no breaks. But I suspected I was not the only one and I was right. Everybody was struggling to keep up with the details of the diagnosis, much less the workings of the fix. They were looking to me and my engineering partner to push the button and have a month’s worth of problems corrected without overlooking anything or introducing any side-effects. Nice work if you can get it. And what if something does go wrong? We’re potentially looking at another 12 hours.

So I called it. Not that I have the authority to do that. But we had consensus that fatigue was the deal breaker. It’s an emergency, but it’s not <that> kind of emergency. It’s a business emergency. It can wait until morning.

This is the senior part of senior software engineering. I’m old and tired. In my thirties this would not have been an issue, because I was young and foolish and frequently felt immune from fatigue due to my bipolar disorder. And I probably would have pushed to production, sunk the battleship and lost my job. Not today, rabbit. We take our medicine here, and observe doctor’s orders.

As for tomorrow? Thoughts and prayers. Good vibes and juju. Just remember to ask for a raise when this is all said and done.

Tags: blog, business, daily grind, health, life, mental health, programming, risk, writing

Circle of Fifths Iterations

I received some feedback about the Interactive Circle of Fifths on Facebook. I decided to take the conversation to Claude Opus 4.5. The following is my conversation verbatim. It reveals a depth of reasoning that is astounding, and provides guidance for next steps. For an introduction to the online tool, refer to the About page.

The conversation:

I have some feedback regarding the Inner Ring. I’m not sure it’s working how people might expect, with regards to the diminished scales. For the Cmaj/Amin wedge, what is the appropriate scale to display? It’s currently showing Bdim because that’s the chord. However an argument could be made that it should be something else (I think Cdim). This was the feedback:

The feedback:

What is the inner circle supposed to be exactly? The whole half symmetric? It’s in the wrong spot on the circle. The one at 12 o’clock would occur at 9 o’clock Eb/C minor over a G13b9 spelled as: G Ab Bb (B nat) C# D E F. It would have a dominant function over G (V7), using Bº7 and C#º7 as upper structure both tensions as upper and lower leading tones pointing toward C tonic, but the MINOR version not the major version. Without a tonal context (5 in the bass) the symmetrical scale is ambiguous just like a chromatic or whole tone scale would be. The scale that should be in the C/A minor family 12 o’clock should be rooted “E F G G# A# B C# D”. At 3 o’clock here.

My response:

The layout is simple, the diminished chord in Cmaj is Bdim. The scale can be toggled and it’s Bdim as people would expect. I don’t understand why you would want to deviate from that. It makes sense as it is even if there are other ways to express the inner ring.

His follow-up:

The B diminished chord in C major is a half diminished. BDFA. The purpose of the “diminished scale” is that it derives from the FULLY diminished 7, from the raised 7th degree from C MINOR. Bb->B natural gives BDFAb. The variable 7th is even a part of the actual scale for this reason, you get to use both notes. You have the scales associated with the wrong key signatures inside the circle. The extra symmetrical “out notes” are accidentals against the minor key’s key signature.

Can you make sense of his feedback? Is the current behavior what a someone deeply familiar with music theory would expect?

Claude’s Response:

This is a deep music theory question. Let me parse through the feedback carefully.

The user’s friend is making a sophisticated point about diminished scales and their harmonic function. Let me break it down:

Current Implementation:

For C major/A minor wedge, we show B diminished scale (B C# D E F G Ab Bb)
This is based on the vii° chord in C major being B°

Half-diminished vs Fully diminished distinction:

In C MAJOR, the naturally occurring vii chord is B half-diminished (B-D-F-A) – also written Bø or Bm7♭5

The FULLY diminished seventh (B-D-F-Ab) comes from C HARMONIC MINOR (which has B natural, not Bb)

The diminished scale (whole-half or half-whole symmetric) is specifically associated with the FULLY diminished 7th chord

The scale’s harmonic function:

The diminished scale has dominant function – it resolves somewhere

The scale built on B (B C# D E F G Ab Bb) functions as a dominant over… not C major, but rather as part of a V7 chord

The friend is saying this scale should be associated with the key it RESOLVES to, considering its dominant function

What scale should be at C/Am?

The friend suggests “E F G G# A# B C# D” should be at the C/Am position

This is the E diminished scale (half-whole)

Why? Because E°7 (E-G-Bb-Db, enharmonically E-G-A#-C#) is actually a rootless G7b9, which resolves to C minor

Let me think about this more carefully…

In jazz theory:

A diminished 7th chord built on the leading tone (B in C) is called vii°7
The chord’s complex structure reveals its powerful harmonic potential
It functions as a dominant substitution with incredible tension and resolution

The diminished scale’s symmetrical nature allows for multiple interpretations. Its half-whole mode creates tension that wants to resolve, essentially acting as a sophisticated dominant chord generator. The scale’s chromatic inflections provide rich harmonic possibilities across different musical contexts.

The diatonic half-diminished chord in C major (B-D-F-A) differs from the fully diminished B°7 chord. This distinction matters because it changes how the chord functions within the harmonic progression. The scale choice depends on the specific harmonic intent – whether creating tension, preparing resolution, or exploring chromatic substitutions.

The key insight is understanding how these theoretical approaches – traditional and jazz functional harmony – offer different lenses for analyzing the same musical material. Each perspective reveals unique aspects of the chord’s potential and its role in harmonic movement. The B°7 chord in C harmonic minor serves as a dominant leading to Cm, highlighting the scale’s unique harmonic structure.

Examining the C/Am wedge reveals multiple potential diminished scale perspectives: C major’s Bø7, A minor’s G#°7, and C minor’s B°7. Each context offers a different harmonic interpretation.

The proposed E half-whole diminished scale demonstrates fascinating symmetry, connecting to multiple diminished seventh chords across different enharmonic equivalents. Its intricate relationships reveal the complex interconnectedness of harmonic systems.

A closer look at the current implementation shows discrepancies in scale placement and root selection, suggesting a need for careful recalibration of the theoretical framework.

The friend’s proposed scale follows a precise half-whole diminished pattern starting from E, with specific interval relationships that differ from our current configurations. I’m tracking the unique intervallic structure: half-whole-half-whole-half-whole-half-whole, which creates a distinct harmonic landscape.

This scale’s unique construction implies potential shifts in our current musical mapping, particularly around the C/Am and related harmonic regions.

The symmetric nature of the diminished scale suggests multiple valid interpretations, with each chord potentially serving multiple functional roles depending on harmonic context. The E°7 chord exemplifies this complexity, revealing potential resolutions to F, F#, and other tonal centers.

The core insight emerges from the scale’s inherent ambiguity – its symmetric structure allows multiple theoretical perspectives, challenging straightforward categorization.

Two primary approaches emerge: the scale-degree method versus functional harmony interpretation. While the scale-degree approach offers pedagogical clarity by associating the diminished scale with the seventh scale degree, the functional harmony perspective emphasizes harmonic resolution and contextual meaning.

The current implementation provides a consistent, learner-friendly framework, but risks oversimplifying the nuanced relationships within diminished harmony.

For an educational tool targeting musicians at different skill levels, maintaining the straightforward approach while acknowledging the complexity for advanced practitioners seems most effective. The goal is to balance accessibility with theoretical depth, allowing musicians to grasp foundational concepts before exploring more intricate harmonic relationships.

Claude’s Summary:

Your friend raises valid points from a jazz/functional harmony perspective. Let me break down the theoretical disagreement:

The Core Issue: Half-diminished vs Fully-diminished

Your friend is correct that in C major, the diatonic vii chord is actually B half-diminished (Bø7): B-D-F-A. This uses only notes from the C major scale.

The fully diminished B°7 (B-D-F-Ab) comes from C harmonic minor (C-D-Eb-F-G-Ab-B), where the raised 7th degree creates the leading tone.

The diminished scale (the symmetric whole-half or half-whole pattern) is specifically associated with fully diminished 7th chords, which arise from minor keys, not major.

Two Valid Approaches:

Approach	Logic	At C/Am wedge
Current (scale-degree)	Show diminished scale on the 7th degree of the major key	B dim scale
Functional harmony	Show the diminished scale that resolves to this key (from its dominant function)	Different association

Your friend’s specific argument:

The diminished scale functions as a dominant — it wants to resolve somewhere. The B°7 chord (and its associated scale) functions as part of G7b9, which resolves to C minor, not C major. So showing B diminished scale at the C/Am wedge conflates major and minor in a way that’s theoretically inconsistent.

My assessment:

Your current implementation is pedagogically sound and consistent — it follows a simple rule (7th degree = diminished) that’s easy to understand. Many circle of fifths tools take this approach.

However, your friend is right that a jazz theorist might expect different behavior, since the diminished scale’s natural harmonic context is minor keys with dominant function.

Options:

Keep as-is — It’s consistent and educational for most users
Change association — Base it on the relative minor’s vii°7 (would put G#° at C/Am since G# is vii of A minor)
Add explanation — Note in the UI that this shows the “leading tone diminished” which has dominant function toward the minor key

Would you like me to explore option 2, or do you prefer keeping the current straightforward approach?

Resolution:

Based on Claude’s analysis, I’ve decided to add menu to the UI. When a diminished scale is selected the control will become active, allowing the user to switch between Scale Degree / Harmonic Function / Minor Tonic. This supports more advanced users while keeping the default workflow as simple as possible.

Tags: AI, circle of fifths, claude opus', guitar, interactive circle of fifths, Music, music theory, musical composistion, philosophy, piano

← Older Posts

The living map (1.0)

The planet’s mood, and its weather (1.2–1.5)

Make it yours (1.6)

Navigate it (1.7)

Trust, freshness — and pictures (1.7.1 to 1.8.1)

Where it’s going

Share this:

Share this:

The Premise

What It Is

The corpus

The Architecture

The Crucible: Breaking at Ten Users

The Lakehouse: Parquet + DuckDB

Getting Data In

Three ways in, plus a crawler

Parsing hostile files

Reshaping

The Experiment Engine

Joining

Sampling, deterministically

The statistics

From Correlation Toward Causation

Ten Ways to See a Relationship

The AI Layer

Open by Design

The Token Economy

Running It

The Timeline

Lessons That Survived Contact

Where It Goes From Here

Share this:

What’s New

📚 Browse by Topic

🤖 Reanalyze with AI

🏷️ New Branding

📈 Expanded Public Library

About Correlation Studio

From Data to Discovery

Create Interactive Portfolios

AI Throughout the Platform

Built for Collaboration

Who Is It For?

Share this:

Technical Analysis of Correlation Studio

1. The Architecture Is Stronger Than Most Solo SaaS Projects

2. DuckDB Was the Right Choice

3. The Product Is Actually a Graph of Relationships

4. The Dataset Ingestion Pipeline Shows Experience

5. The Statistical Implementation Is Appropriately Conservative

6. The Biggest Technical Challenge Is Combinatorics

7. AI Is an Enhancement, Not the Core Product

8. The Biggest Product Challenge

9. A Feature Worth Considering: Correlation Graphs

10. What Stood Out Most

Final Thoughts

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Mountain Man

Follow Blog via Email

Find stuff

Rango on Spotify

Rango on Soundcloud

Freshly Baked

Categories

Category Cloud

Rango only

Check 1,2

Into the Snow