Correlation Studio Whitepaper

Building a statistical discovery engine with DuckDB, Parquet, Cloudflare R2, PostgreSQL, and AI-assisted causation analysis.

Correlation Studio is a vertically integrated correlation-analysis SaaS built to mine relationships across datasets, surface statistically meaningful findings as persistent knowledge objects, and make advanced data science usable without code.

Executive Summary

Most analytics software answers questions users already know to ask. Correlation Studio was built for a different problem: discovering relationships users did not know were there.

The premise is simple but technically demanding. A user uploads or discovers tabular datasets, pairs them into experiments, and the system exhaustively compares numeric columns against one another. When a relationship clears the configured threshold, Correlation Studio promotes it into a first-class object called a Discovery.

A Discovery is not a temporary chart. It is a persistent, publishable, analyzable artifact containing statistical metadata, visualizations, regression overlays, confidence intervals, Granger-causality testing, source-row drilldowns, and AI-generated interpretation.

The Core Bet

A one-person team armed with the right primitives for discovery mining and causation analysis can deliver useful analytical insight faster than a larger team using less integrated tools.

That belief shaped every architectural decision. Correlation Studio favors simplicity over optionality: one database, one app server, one object storage account, one embedded analytical engine, and one consistent set of product abstractions.

Whenever the system had to choose between a flexible-but-complex pattern and a rigid-but-simple one, the simpler design won. Per-user database shards were collapsed into a single PostgreSQL database. Bulk analytical rows were moved out of PostgreSQL and into Parquet. In-row binary payloads were replaced with per-discovery columnar files. Multiple state flags were consolidated into simpler state machines.

The result is a product that is technically ambitious but operationally small enough to run as an independent SaaS.

Default-Alive by Design

Correlation Studio was designed around a severe constraint: it needed to be sustainable before it became large.

The free tier is genuinely free. New users receive 10 GB of storage and can browse the public corpus without a recurring charge. Revenue comes from token packs that fund ingestion, correlation work, AI search, AI analysis, and chatbot usage.

At roughly $5 per month of token spend per actively engaged user, about 60 engaged users cover the platform’s operating run rate: approximately $300 per month for two VPS instances, Cloudflare, R2 object storage, and the domain.

At larger scale, the infrastructure cost profile stays intentionally flat. Object storage is metered by byte with zero egress through Cloudflare R2. Compute grows in fixed steps. AI API spend is the primary engagement-scaled cost, and the token margin exists specifically to cover it.

Capital Efficiency as Architecture

The system is not only optimized for query speed. It is optimized so the business can remain alive while the corpus, audience, and go-to-market motion are still forming.

The Crucible: When v1 Broke

Correlation Studio was built solo over roughly five months. Commit zero landed on December 29, 2025. Around 570 commits later, the project had become a production SaaS.

It did not get there in a straight line.

The original v1 architecture used a central PostgreSQL database plus per-user content shards. Parsed source data lived in relational tables named Rows, KeyIndex, and ValueIndex.

That design worked until realistic ingestion traffic arrived: stress tests designed to break it.

Wide datasets generated enormous write amplification. A CSV with thousands of columns and millions of rows could produce hundreds of millions of B-tree inserts. On SATA-backed shards, that turned into IOPS queue starvation. Under realistic ingestion load, the system collapsed at about 10 concurrent active jobs.

The fix was not a tuning pass.

It was a clean-room replacement of the analytical substrate.

The v2 Lakehouse migration, completed on May 17, 2026, deleted all content sharding, retired the bulk-data PostgreSQL tables, moved source data into columnar Parquet files on Cloudflare R2, and replaced row-range index scans with DuckDB queries against Parquet.

Roughly 5,000 lines of C# disappeared in the cutover. More importantly, the frontend never noticed. Dataset, experiment, and discovery API contracts survived the migration intact.

v1: PostgreSQL as data lake
Upload
Rows / KeyIndex / ValueIndex
B-tree write amplification
IOPS starvation under ingestion load

The original architecture failed because row-oriented storage was doing analytical storage’s job.

v2: Lakehouse substrate
Upload
Parquet on Cloudflare R2
Local SSD hot cache
DuckDB analytical queries
PostgreSQL metadata only

The v2 architecture separates metadata, bulk storage, and analytical execution.

The Current Architecture

Correlation Studio’s current architecture is deliberately compact. It uses a single PostgreSQL database for metadata, Cloudflare R2 for immutable bulk data, DuckDB for analytical execution, and a .NET application server to coordinate ingestion, statistics, AI workflows, billing, and the public web app.

PostgreSQL 17

Stores users, billing, datasets, experiments, discoveries, jobs, audit records, and RAG vectors through pgvector.

Cloudflare R2

Stores bulk row data as columnar Parquet files: one Parquet file per dataset and one per discovery datapoint payload.

DuckDB

Runs embedded inside the .NET app server and queries Parquet through a local SSD hot cache or R2’s S3-compatible interface.

.NET 10 API

Provides the ASP.NET Core backend in a three-project solution: Core, Infrastructure, and Api.

React + TypeScript

Powers the SPA with Vite, canvas-rendered charts, route-level code splitting, React Query, and Zustand.

AI Providers

Claude, Gemini, and Grok run in parallel for dataset search, AI analysis, and Corrie, the in-app RAG chatbot.

The architecture is not cloud-native in the fashionable sense. It is not a swarm of microservices. It is not built around Kubernetes. It is a vertically integrated system with a small number of moving parts, each chosen because it eliminates an entire class of complexity.

Product Overview

Correlation Studio turns datasets into reusable statistical knowledge. The user workflow is intentionally linear: create data, combine data, discover relationships, explain relationships, and publish the results.

  1. Sign In Users can sign in with email and password, Google OAuth, or Microsoft OAuth. New accounts receive 10 GB of storage and zero tokens. The platform remains free until the user wants to ingest data, run AI search, or perform AI analysis.
  2. Create a Dataset Datasets can be created by uploading local files, pasting URLs to remote files, or using the AI-powered Remote Search workflow. Remote Search asks Claude, Gemini, and Grok for relevant open-data URLs. Every dataset is assigned a Topic, receives inferred column types, and produces both a metadata row in PostgreSQL and a Parquet file in R2.
  3. Create an Experiment An experiment pairs two datasets, chooses a join strategy, and sets an absolute correlation threshold. The experiment ingester evaluates every numeric-column pair and emits one Discovery for each pair whose absolute correlation exceeds the threshold.
  4. Browse Discoveries Discoveries can be explored through scatterplots, line graphs, residual plots, heatmaps, Pearson-vs-Spearman divergence charts, bubble charts, sparklines, lag analysis, rolling correlation, regression overlays, and correlation networks.
  5. Drill Into Source Rows Every chart datapoint can be traced back to the source rows on both sides of the join. For time-series joins, the drilldown is bucket-aware: a single day can collect all X and Y rows whose timestamps fall inside that bucket.
  6. Analyze and Reanalyze Datasets, experiments, discoveries, and portfolios can be analyzed with Claude, Gemini, or Grok. The resulting Markdown narrative is pinned to the entity. Once analysis exists, the action becomes Reanalyze, replacing the prior model output with a fresh one. Datasets and Discoveries also expose an edit path so owners, and in some cases moderators, can refine the analysis text manually.
  7. Publish Datasets, experiments, discoveries, portfolios, posts, and forum questions can be published to the Home feed and made searchable.
  8. Compose Portfolios Portfolios are block-based pages that combine Markdown, images, embedded datasets, experiments, discoveries, analysis output, audio, and video into interactive presentations.
  9. Transform Data The Tools page supports derivative dataset operations such as pivot, melt, normalize, resample, rank, bin, outlier removal, missing-value filling, merge, join, and append workflows.
  10. Ask Corrie Corrie, the in-app RAG chatbot, can answer product questions over the user guide and content questions over public or user-owned entities. A user can ask how to use the product or ask Corrie to summarize a set of published experiments.

The Product Is a Relationship Graph

Most analytics tools treat correlations as temporary calculations. A user runs a query, sees a chart, and the result disappears unless someone manually saves it.

Correlation Studio uses a different abstraction. It treats each meaningful relationship as an entity.

Dataset
Experiment
Discovery
Portfolio

The core entity chain turns raw data into reusable, publishable knowledge.

That model matters because it allows discoveries to accumulate. A Discovery can be visualized, analyzed, commented on, published, embedded, rated, searched, indexed for retrieval, and assembled into a portfolio.

The result is not just an analytics app. It is a corpus of statistical relationships.

The Token Economy

Correlation Studio prices expensive operations through tokens rather than forcing all users into recurring subscriptions.

OperationToken Pricing
Dataset ingestion25 tokens + 1 token per MB
Correlation experiments25 tokens + 1 token per MB of joined data
Tool transformsSame rate as ingestion
AI search125–250 tokens per call, depending on provider
AI analysis125 tokens per call
AI chatbot25 tokens per call

Token packs range from $5 to $200 and do not expire. Storage subscriptions are optional and exist only to raise the default 10 GB storage cap.

This keeps the free tier useful while ensuring the costly parts of the system—large ingestion jobs and AI calls—pay for themselves.

System Architecture

Correlation Studio is intentionally built as a small, vertically integrated system. The architecture avoids distributed complexity wherever possible, but it is not simplistic. Each layer has a narrow responsibility.

PostgreSQL owns transactional truth. R2 owns immutable bulk storage. DuckDB owns analytical execution. The .NET app server orchestrates the workflows that turn user input into datasets, experiments, discoveries, AI analysis, thumbnails, billing records, and searchable public content.

Design Principle

Keep the number of operational primitives small, but make each primitive do the job it is naturally good at.

Solution Layout

The backend is split into three .NET projects with a React/Vite frontend beside it. The split is conventional by design: domain contracts at the center, infrastructure around them, and API orchestration at the edge.

CorrelationStudio.Core/
Entities, DTOs, Enums, Interfaces, Utilities, Exceptions
No dependencies. Pure POCOs and contracts.
CorrelationStudio.Infrastructure/
EF Core data access, ingestion services, R2 / Parquet /
DuckDB plumbing, RAG subsystem.
Depends on Core.
CorrelationStudio.Api/
Controllers, business services, middleware, background
hosted services, logging.
Depends on Core + Infrastructure.
correlationstudio.client/
React 18 + TypeScript SPA.
Vite dev server proxies to the API and produces a static SPA bundle for production.

This layout keeps the domain model independent of persistence and hosting concerns. It also keeps the solo-developer ergonomics sane: most feature work touches the same predictable layers rather than scattering through a service mesh or multiple deployable applications.

The v2 Lakehouse Topology

The production topology consists of two VPS instances and one Cloudflare R2 account. One server runs PostgreSQL. One server runs the application, static frontend, ingestion workers, DuckDB queries, AI orchestration, thumbnail rendering, Stripe webhooks, and background services.

RolePlanCoresRAMDiskTransfer
Central DBSSD VPS1632 GB1.2 TB SSD10 TB
App serverSSD VPS1664 GB1.2 TB SSD10 TB
Cloudflare R2Object storageMetered$0 egress

The split is intentionally boring. The database server holds PostgreSQL only. The app server handles product logic and analytical computation. R2 stores bulk row data and media assets: dataset Parquet files, per-discovery datapoint Parquets, profile images, thumbnails, and portfolio attachments.

Browser
React + Vite SPA
ASP.NET Core API
├── PostgreSQL 17 metadata, billing, jobs, RAG vectors, audit
├── DuckDB embedded analytical execution
├── SSD hot cache local Parquet working set
└── Cloudflare R2 durable Parquet + media storage

The production topology keeps transactional metadata, analytical compute, and durable bulk storage separate without introducing distributed orchestration.

The app server’s local SSD disk also acts as an LRU hot tier for Parquet. Frequently accessed dataset Parquets and per-discovery datapoint Parquets live locally for sub-millisecond DuckDB reads. Cold datasets remain in R2 and are fetched on demand.

The cache is bounded to 80% of disk, and its hit-rate degradation is surfaced in the Admin system metrics. That matters because the cache is not just a performance optimization; it is what allows the system to feel interactive while retaining cheap, durable object storage as the source of truth.

Why DuckDB on Parquet

The central architectural lesson from v1 was that PostgreSQL should not be asked to act as an analytical file format.

The v1 design materialized every parsed row into relational tables. For narrow, modest datasets, that was acceptable. For wide datasets, it was fatal. A CSV with more than a thousand columns and millions of rows produced hundreds of millions of B-tree inserts across row and index tables. Under concurrent ingestion, write amplification consumed the shard’s disk I/O budget and collapsed throughput.

DuckDB-on-Parquet inverted the workload.

Ingestion now writes one Parquet file per dataset. Analytical queries then issue DuckDB SQL directly against that Parquet. Column statistics, experiment correlations, drilldown lookups, thumbnails, and chart payloads all operate against columnar files rather than row-oriented database tables.

Column Pruning

Scanning one numeric column reads bytes proportional to that column, not the whole row.

Compression

Dictionary and RLE encoding compress low-cardinality columns aggressively, often by 10–100×.

Predicate Pushdown

Row groups allow DuckDB to skip entire chunks that do not satisfy a filter.

Vectorized SQL

The app retains SQL expressiveness without implementing a custom analytical engine.

The DuckDB pool is sized to approximately 16 connections. Every read path that needs analytical data checks out a connection: drilldown, thumbnails, sparklines, chart loading, experiment ingestion, and RAG corpus building.

Under load, connection pressure is observable. Checkout wait time appears in per-request logs as duckdbPoolWaitEma, and ingestion uses that signal to back off through a square-root yield curve. Instead of blindly flooding the local analytical engine, the ingest loop lets pool pressure slow it down.

The Important Shift

The v2 migration did not merely make the old design faster. It moved the analytical workload onto the substrate it wanted all along: columnar files queried by an embedded analytical database.

Connection Resilience and Observability

PostgreSQL access uses Npgsql with Entity Framework Core and a custom resilient retry strategy. Every DbContext resolved through dependency injection carries the same transient-error policy.

Each process also stamps an application name onto its PostgreSQL connections. On restart, ServerLifecycleService can identify and terminate orphaned backends left behind by crashed prior instances. That small operational detail keeps restarts from accumulating stale database connections.

Slow-query observability is handled by a singleton command interceptor. Any database command slower than the configured threshold, defaulting to 1000 ms, is logged with a snapshot of in-flight commands at the time of slowness.

Those logs surface live in the Admin Runtime tab. For a solo-operated system, that visibility is essential. The goal is not perfect distributed tracing. The goal is immediate operational attribution when a request, background job, or query path starts to misbehave.

Large Uploads and Long-Lived Requests

Correlation Studio accepts large datasets. That requirement shows up in places that are easy to overlook.

Kestrel’s maximum request body size is raised to the configured maximum file size, which defaults to 100 GB. Keep-alive timeout is set to four hours so legacy synchronous remote-download requests are not reaped while the server is still fetching a large file. Request header timeout is extended to two minutes.

Development needed the same treatment. The Vite dev proxy forwards API calls with four-hour timeouts, and a custom large-upload plugin disables Node’s default request, header, socket, and keep-alive timeouts on the browser-to-Vite side.

Operational Lesson

Supporting multi-gigabyte user files is not only a storage problem. It affects request limits, proxy defaults, development tooling, cancellation semantics, and background-job contracts.

Data Model

Correlation Studio’s schema is broad because the product is not only an analytical engine. It is also a publishing system, a social graph, a billing system, a job queue, an audit trail, and a RAG corpus.

The full PostgreSQL DDL is roughly 3,200 lines and is organized into predictable domains:

  1. Lookup tables for enum values materialized as rows.
  2. Identity tables for users, external logins, workgroups, and registrations.
  3. Communication tables for email templates, password changes, notifications, and blacklists.
  4. Configuration tables for runtime settings, prompt templates, and AI-provider metadata.
  5. Jobs tables for queues, history, and durable active downloads.
  6. Geocode caches for resolved geographic column values.
  7. Billing tables for invoices, subscriptions, token packs, ledgers, refunds, and operation histories.
  8. Topics and references for dataset discovery and reusable source URLs.
  9. Content tables for datasets, columns, stats, experiments, discoveries, and datapoints.
  10. Social tables for portfolios, posts, forum content, messages, and the Home feed.
  11. Cross-cutting feedback tables for tags, comments, ratings, and sentiment.
  12. Audit tables for events, connection logs, server sessions, and system metrics.
  13. RAG tables for chunks, embeddings, chat conversations, messages, query audits, and embed audits.
  14. System tables for terms, privacy policy, and other platform records.

The schema reflects the product strategy: every entity that can become part of the public knowledge base needs metadata, ownership, visibility, social feedback, searchable text, and deletion semantics.

The PublicId Pattern

Public-facing entities use short, opaque identifiers instead of exposing integer primary keys.

Users, datasets, experiments, discoveries, portfolios, topics, references, jobs, and invoices carry a PublicId: a unique, indexed, 10-character base62 string. API routes and frontend URLs use the public ID. Internal foreign-key relationships remain ordinary integers.

This provides two benefits at once. Public URLs are compact and shareable, and the system does not leak growth rate through sequential IDs. Internally, joins remain efficient and simple.

Public route:
/discoveries/a7Xk92LmQp
Internal tables:
Discoveries.DiscoveryId = 18427
DiscoveryDatasets.DiscoveryId = 18427
Datapoints.DiscoveryId = 18427

Public IDs protect external surfaces without complicating internal relational joins.

States, Jobs, and Unit of Work

Dataset and experiment states are represented as lookup-backed enum values. Some v1 states remain defined for compatibility but are no longer used by the v2 ingestion flow. For example, v2 has no separate indexing phase because indexing bulk rows into PostgreSQL no longer exists.

DatasetStateType:
Pending=1
Queued=2
Reading=3
Read=4
Waiting=5
Indexing=6
Ready=7
Failed=8
Cleaning=9
Downloading=10
ExperimentStateType:
Pending=1
Queued=2
Running=3
Ready=4
Failed=5
Cleaning=6

Repository access is centralized through IUnitOfWork, which exposes one repository property per entity type. The pre-v2 split between central and per-shard units of work was removed when sharding disappeared.

The unit of work and its CorrelationStudioDbContext are scoped per request. Background services that need scoped dependencies create fresh service scopes through IServiceScopeFactory. That pattern is used by deletion sweeps, cleanup runners, RAG workers, ingestion dispatchers, and background download tasks.

This is not glamorous architecture, but it is the kind of boring correctness that keeps long-running jobs from holding onto request-lifetime dependencies after the request is gone.

Storage Layer

The storage layer is where the v2 architecture earns most of its leverage. PostgreSQL stores metadata and transactional state. Cloudflare R2 stores durable bulk data. Local SSD provides a hot working set. DuckDB bridges the two by querying Parquet files as analytical tables.

That separation lets Correlation Studio keep the operational model small while avoiding the failure mode that broke v1: using a row-oriented transactional database as a high-volume analytical substrate.

Cloudflare R2 as Durable Bulk Storage

Correlation Studio stores bulk data in Cloudflare R2. The abstraction is IR2Storage, backed by an AmazonS3Client configured against R2’s S3-compatible endpoint.

R2 needs a few specific settings:

  • ForcePathStyle = true, because R2 does not use virtual-host style addressing.
  • AuthenticationRegion = "auto".
  • UseChunkEncoding = false on PUT requests.
  • DisablePayloadSigning = true for uploads, while SigV4 still signs headers and HTTPS encrypts the body.

Upload and deletion concurrency are controlled separately with two SemaphoreSlim gates. That prevents a deletion sweep from competing with active ingestion uploads for the same R2 per-prefix throttle budget.

Why R2 Works Here

The workload wants cheap durable storage, high read availability, and zero-egress economics more than it wants a full managed data warehouse. R2 fits that shape well.

SChannel-Aware Retry

Large R2 uploads exposed a real production edge case.

Dataset Parquets can be gigabytes in size. Uploads are single PUT requests and may run for several minutes. Under burst load on Windows, pooled HTTPS connections could occasionally fall into a bad SChannel state. The symptom was an IOException saying the decryption operation failed, wrapping Win32Exception 0x8009030F SEC_E_MESSAGE_ALTERED from SslStream.ReadAsyncInternal.

The problem was not that one upload failed. The problem was that the pooled connection could remain poisoned. Retrying through the same client risked receiving the same broken socket again.

The upload path now runs three attempts beyond the SDK’s own retry budget. If the exception chain contains a Win32Exception, the system atomically swaps the AmazonS3Client instance using Interlocked.Exchange. The replacement client carries a fresh HttpClient, handler, connection pool, TLS handshake, and socket.

The displaced client is disposed after a 30-second delay so any concurrent upload that already captured it can finish cleanly.

Production Scar

A 1.39 GB Array of Things Chicago dataset hit the same SChannel failure on every retry because the pool kept returning the poisoned socket. The client-rebuild path exists because that happened in production.

The Parquet Writer

Dataset ingestion writes Parquet through IParquetWriter.WriteAsync, implemented with Parquet.Net.

The writer builds a schema from ParquetColumnDefinition records: one Parquet field per inferred dataset column. Type mapping is intentionally conservative.

Correlation Studio TypeParquet Type
IntegerInt64
DecimalDouble
DateString using ISO 8601 canonical form
TimeString using ISO 8601 canonical form
DateTimeString using ISO 8601 canonical form
StringString

Row groups are buffered in per-column staging arrays and flushed at a default 50,000 rows per group. That size gives DuckDB useful predicate pushdown granularity without creating an oversized footer index.

Typical user datasets in the 100,000 to 5 million row range become 2 to 100 row groups.

Values are coerced at write time through typed conversion helpers. Non-nullable columns throw on null. Nullable columns use nullable staging arrays. The writer is always disposed before returning so the Parquet footer is flushed, and output size is measured from stream position.

The Local Hot-Tier Cache

R2 is the durable store, but local SSD is the interactive working set.

IParquetCache implements an LRU disk cache for dataset Parquets and per-discovery datapoint Parquets. Its public surface is small:

  • EnsureLocalAsync(r2Key, ct) returns a local file path, downloading from R2 on cache miss.
  • SeedAsync(r2Key, sourceFile, ct) moves a freshly written ingest temp file into the cache so the first query is already local.
  • Pin(r2Key) and Unpin(r2Key) reference-count files during active scans, ingests, drilldowns, and thumbnail renders.

Pinned files are never evicted. Eviction runs in the background and uses last access time. The cache is bounded by configuration, defaulting to 80% of disk free space at startup.

This turns the app server’s SSD disk into a practical lakehouse hot tier. Cold data stays cheap in R2. Hot data behaves like local analytical storage.

The DuckDB Connection Pool

DuckDB sessions are not thread-safe, so Correlation Studio gates analytical work through a fixed-size connection pool. The default size is 16.

The IDuckDbQueryService surface is deliberately small:

  • ParquetTable(localPath) returns a safe read_parquet('...') table-valued expression.
  • QueryAsync<T>(sql, projector, ct) runs a query and materializes rows through a projector.
  • StreamAsync<T>(sql, projector, ct) yields rows as an IAsyncEnumerable<T> without materializing the full result.
  • RecentCheckoutWaitMs exposes an exponential moving average for pool checkout wait time.

That last metric is especially useful. When a request emits a slow-load warning, the logs include pool wait time so the operator can distinguish slow SQL from pool contention.

The DuckDB AWS extension is loaded at startup so the engine can read from s3:// URIs directly, but in practice almost all hot paths go through the local cache.

Image Bytes Through Cloudflare-Fronted R2

Images live in the same R2 bucket as Parquet data, under an images/ prefix. That includes profile avatars, banners, thumbnails, discovery scatterplot PNGs, portfolio thumbnails, post attachments, message images, and forum images.

User-facing image URLs exist in two forms.

Custom-Domain Path

In production, image URLs use:

https://images.correlationstudio.com/images/<key>

Bytes flow from R2 to Cloudflare’s edge cache to the browser. The app server never touches them. Egress through Cloudflare’s CDN is free, and the custom domain is attached at the R2 bucket level.

Proxy Path

The legacy and development fallback path is:

/api/storage/images/<key>

In that mode, bytes flow from R2 through the ASP.NET storage controller to the browser. The controller consults image safety verdicts, serves bytes through a bounded memory cache, and applies a hard five-second timeout to R2 reads so a bad object-store hop returns a fast 504 instead of riding the AWS SDK retry budget.

The proxy path remains alive for legacy URLs embedded in social previews, RSS feeds, email screenshots, bookmarks, and external caches.

Both URL shapes are accepted during the transition. A storage URL migration rewrote persisted proxy URLs to the custom-domain form across user images, discovery thumbnails, portfolio thumbnails, portfolio block images, and image storage paths.

Why This Matters

Serving public images through Cloudflare lets the product keep app-server bandwidth focused on application work while still retaining a safe fallback for moderation and legacy URLs.

Dataset Ingestion

Dataset ingestion is the pipeline that turns messy user input into typed, queryable, columnar data.

It is also where many analytics products fail. Real-world datasets are rarely clean. They contain preambles, footers, multi-row headers, sparse section labels, partial dates, currency-formatted numbers, fake headers, long sentence-like labels, and server-side download failures.

Correlation Studio’s ingestion path is designed around that reality.

Pipeline Overview

DatasetIngestionService.ProcessQueuedIngestionAsync orchestrates four phases:

  1. Source Resolution Prefer the local uploaded file from the wizard. If it is no longer available, fall back to the raw upload archived in R2 and download it to a temp file.
  2. Parse and Write Parquet Select a parser through IRowParser, stream rows from CSV, Excel, or HTML sources, buffer them through IParquetWriter, and emit a temp Parquet file.
  3. Upload and Seed Cache Upload the Parquet file to R2 at datasets/{userPublicId}/{datasetPublicId}/data.parquet. On success, move the temp file into the local hot cache so the first DuckDB query is local.
  4. Compute Stats and Debit Tokens Run DuckDB queries to populate column statistics. Then flip the dataset to Ready and debit tokens in the same EF Core transaction.

The v2 state machine is simple:

Pending → Queued → Reading → Ready / Failed

The older v1 states such as Read, Waiting, and Indexing remain defined but are not set by the v2 pipeline. There is no separate indexing phase because the row-index tables no longer exist.

Per-dataset mutual exclusion is enforced with an in-memory ConcurrentDictionary<int, Task>. Only one ingest for a dataset ID may run at a time. A losing dispatcher logs the conflict, closes the orphan job, and exits. If the process crashes, the dictionary disappears and the next dispatcher tick handles recovery.

CSV Parsing and Preamble Detection

CSV ingestion uses a raw StreamReader and a custom line parser rather than CsvHelper. That choice was made for speed on wide files.

The harder problem is not splitting lines. It is finding where the table actually starts.

Many public CSV files begin with metadata, comments, blank rows, source descriptions, license text, or sparse heading blocks. CsvPreambleDetector.DetectAsync buffers the first roughly 50 lines and runs a two-pass heuristic.

  1. First Pass Parse each line’s field count and fill ratio. Mark rows containing alphabetic content, comment-only rows, and blank rows.
  2. Richest Header Scan In the first eight lines, choose the row with the most non-empty fields that also contains alphabetic content and at least two populated columns.
  3. Consecutive Field-Count Streak Find the first point where two successive rows share compatible field counts and sufficient fill ratio. One blank or comment row may break the streak, which handles NOAA and NASA CSVs that put blank lines between headers and data.
  4. Decision If the streak begins at or before the richest candidate, use the streak. Otherwise use the richest candidate. If neither works, fall back to the first line with at least two fields.

This detector has been revised repeatedly against real dataset failures. Each iteration represents one ugly public CSV that broke the previous heuristic.

The Lesson

Ingestion quality is not determined by how well the parser handles perfect CSV. It is determined by how gracefully it handles the thousand strange almost-tables people upload from the real world.

Excel Parsing

Excel files have their own failure modes. A workbook can contain multiple sheets, decorative title rows, sparse metadata, merged-looking header regions, and footers that look enough like data to confuse a naïve parser.

Correlation Studio handles Excel ingestion with ExcelDataReader in two passes.

  1. Find the Data Region ExcelPreambleDetector.FindDataStart scans sheets and chooses the one with the widest tabular region.
  2. Reopen and Stream ExcelPreambleDetector.SeekTo reopens the workbook at the selected sheet and header row, then streams rows forward.

Empty rows are dropped. Parsing stops when the detector sees a blank-followed-by-sparse data pattern that indicates a footer rather than more table content.

The Excel preamble detector runs both exact and fuzzy passes. The exact pass looks for rigid column-count blocks. The fuzzy pass allows ±1 column of slop, which helps with exports that carry inconsistent trailing empty cells. If both passes find a predecessor, the stronger match wins. Otherwise the denser region wins, with ties broken toward the earliest plausible row.

Multi-Row Header Merge

A surprising number of real-world CSV and Excel exports have two or three header rows. Some use year / quarter / metric stacks. Some repeat section names. Some carry patterns such as Total | Total | 2024.

Treating only one row as the header loses useful labels. Concatenating every row can create unreadable column names. Correlation Studio uses a more selective merge.

  1. Forward Merge After identifying the primary header row, the parser scans forward for rows with compatible field counts and no corrupting numerics. Numeric cells are considered corrupting if they appear in columns that already have labels; numeric year-style labels are allowed in columns that are still blank.
  2. Section Divider Skip A row with a single short cell in a column that already has a label is treated as a stratification banner, such as Male or Female, not as a header row.
  3. Last Row Wins When a merge row is accepted, non-empty labels replace prior labels rather than being concatenated. Multi-row metadata stacks collapse to the most useful bottom label.
  4. Backward Fill For columns whose labels remain blank, the parser walks backward from the selected header row and fills missing labels from alphabetic rows with compatible field counts. Existing non-empty labels are never overwritten.

Why “Last Row Wins” Works

In many public datasets, the bottom header row contains the actual variable name, while rows above it contain broad grouping labels. Replacing rather than concatenating produces shorter, more useful column names.

Datatype Inference

Once rows and headers are identified, the platform has to infer column types. The classifier works cell by cell first, then refines the result after sampling the column.

ParseValueHelper.ParseValue follows a fixed order of operations.

1. Numeric Detection

The numeric cleaner strips currency symbols, percent signs, commas, and accounting-style parentheses for negatives.

($1,234.56) → -1234.56

If double.TryParse succeeds and the result is finite, the value is classified as either Integer or Decimal.

2. Temporal Detection

Temporal parsing runs on the uncleaned value. If DateTime.TryParse succeeds, the result is classified as Time, DateTime, or Date depending on the shape of the parsed value.

  • Time-only strings become Time.
  • Values with a non-zero time-of-day become DateTime.
  • Plain calendar values become Date.
  • All are canonicalized to ISO 8601.

3. Partial Dates

Public data often contains partial dates such as 1992.12, 2020-3, 1999/06, or simply 2023. The partial date parser recognizes these shapes and canonicalizes them as dates.

4. Fallback

Anything that does not parse as numeric or temporal becomes String.

Refinement Passes

After the sampled-row type census, Correlation Studio runs additional passes to catch cases the cell-level parser cannot safely resolve alone.

  • YYYYMMDD integers become dates when the column label looks like a date label and every sampled value parses as a valid calendar date.
  • Partial-date numerics become dates for year-month values such as 1871.10 in date-labeled columns.
  • Headerless tables are detected when multiple numeric or date columns have type-consistent “headers.” In that case the supposed header row is returned to the data and columns are renamed Col1, Col2, and so on.
  • Long headers are demoted when average column-label length exceeds 80 characters, because URLs and sentences are usually data rather than real column names.

Column Statistics Precomputation

After a dataset is written to Parquet, Correlation Studio immediately computes column statistics with DuckDB and stores the results in PostgreSQL.

For numeric columns, the core query computes:

  • count
  • distinct count
  • minimum
  • maximum
  • average
  • sample standard deviation
  • quartiles
  • skewness

Additional fan-out queries compute 20-bin histograms and top outliers by distance from the mean. For non-numeric columns, the system stores counts and top categorical values.

Results are upserted into ColumnStats, one row per column.

This replaced an expensive v1 behavior where opening a dataset’s Distributions or Quality tab could trigger 30–90 seconds of recomputation on a 10-million-row dataset. In v2, those tabs render from precomputed rows in under 100 ms.

Precompute What Users Reopen

Column statistics are not a rare analytical side path. Users repeatedly inspect quality, distributions, and summaries. Precomputing them turns a recurring interaction from a scan into a metadata read.

Token Accounting

Token accounting has to be correct under concurrency. A billing system that is usually right is not good enough.

Ingestion debits use an atomic PostgreSQL UPDATE … RETURNING statement against Users.TokenCount.

UPDATE "Users"
SET "TokenCount" = GREATEST(0, "TokenCount" - @amount),
"ModifiedDate" = NOW()
WHERE "UserId" = @userId
RETURNING "TokenCount"

PostgreSQL takes a row lock for the duration of the transaction. Concurrent debits queue behind the lock and each receives a consistent post-debit balance via RETURNING.

The ledger row uses that returned balance. The operation history row commits in the same save. Token count, ledger, operation history, dataset state, and any prior-ingest refund all move atomically.

This replaced a read-modify-write implementation that could lose updates under concurrency. In production, the old path created drift where the ledger balance could exceed the stored user token count.

The ingestion cost formula is:

BaseTokenCost + ceil(fileSizeMB) × TokenCost

As of the v2 migration period, ingestion costs 25 base tokens plus 1 token per MB. That rate was reduced after a real 45-file COVID ingestion test proved the previous per-MB rate could make useful workflows too expensive.

Storage Quota

Every storage-consuming endpoint is gated by IStorageQuotaService: uploads, paste flows, ingestion, remote downloads, and dataset transforms.

On every successful Parquet upload, User.StorageUsedBytes is incremented by the Parquet size. The default cap is 10 GB and can be raised by an active storage subscription. Over-quota requests return HTTP 402, which the frontend renders as an over-quota banner.

A weekly reconciliation job compares recorded storage usage against the sum of dataset and experiment sizes, then writes drift metrics into system metrics.

Remote Dataset Search and Download

One of Correlation Studio’s core product bets is that users should not need to bring every dataset themselves. They should be able to ask for data, let AI providers suggest public sources, verify those sources, and ingest them into the same pipeline as uploaded files.

That workflow has two halves: AI-powered URL search and robust remote download.

AI-Powered URL Search

Remote Search runs Claude, Gemini, and Grok in parallel. Each provider receives a prompt assembled from database-backed prompt templates with the user’s topic and description interpolated.

Providers are asked for structured JSON containing:

  • url
  • title
  • description
  • format
  • estimatedSizeMB

Each provider has its own integration path: Anthropic SDK for Claude, raw HttpClient against Gemini’s REST endpoint, and raw HttpClient against xAI’s Responses API.

Search progress is tracked per user and provider. The frontend polls a progress endpoint every two seconds and updates provider status chips as each one moves through searching, verifying, done, or error.

Results are merged and deduplicated by URL. A URL returned by more than one provider is boosted because independent agreement is a useful quality signal.

Token costs are tracked per provider and debited when results are produced.

URL Verification

LLMs can produce plausible but unusable links. Every returned URL is therefore verified before it is offered as an ingest candidate.

Verification starts with a HEAD request and falls back to GET for servers that reject HEAD with 405 or 501 responses. A domain throttler controls politeness and prevents one search from hammering a public data host.

  • Per-domain concurrency defaults to two simultaneous requests.
  • Requests maintain at least one second between hits to the same domain.
  • HTTP 429 and 503 responses honor Retry-After.
  • Bad URL verdicts are cached so fabricated links are not rechecked repeatedly.
  • Content-Length values from HEAD are cached for later truncation detection during GET.

Rejected URLs are still shown in a details panel with the reason: HTML returned instead of CSV, HTTP error, dead link, and so on. The user is still charged for the AI search call because the provider performed the work; sourcing failure is a separate outcome from model usage.

Reference Reuse

Topics accumulate verified references over time. Those references can be reused instead of asking the AI providers to rediscover the same sources.

Remote Search supports three modes:

ModeBehavior
ExistingOnlySkip AI search and return only verified references already associated with the topic.
NewRun a fresh AI search without reference reuse.
BothRun a fresh AI search and merge existing references into the result set.

Link verification is triggered on user login rather than running as a giant timed background sweep. The original 24-hour batch model did not scale: at 10,000 users and 100 references each, one full cycle would take years. Login-triggered checks keep the verification work close to active users.

References follow a three-strike failure rule. A URL that repeatedly fails is marked dead. If it later comes back online, the dead flag is cleared.

The Remote Download Pipeline

After a URL is selected, Correlation Studio downloads the file server-side. That path needs to be resilient because public data servers are inconsistent. Some throttle. Some disconnect mid-stream. Some omit content lengths. Some support range requests. Some ignore them.

The download pipeline uses a named HttpClient configured to look like a browser, with automatic decompression, HTTP/1.1, and manual timeout management.

A domain throttler wraps each download so multiple users or tabs do not overwhelm the same host. The cancel button in the UI propagates through the throttler and can release a waiting download before it even starts.

Timeouts

The old single total timeout was replaced with three separate timers:

  • Headers timeout, defaulting to 30 seconds, covers DNS, TCP, TLS, and waiting for response headers.
  • Idle timeout, defaulting to 60 seconds, is rearmed after every successful chunk. This is the real “is the wire alive?” guard.
  • Maximum total seconds is optional and defaults to disabled.

Resume by URL

If a previous attempt for the same user and URL left bytes on disk, the next attempt reads the number of bytes already present and sends a range request:

Range: bytes={N}-

This replaced an earlier stubbed resume-upload path with a simpler and more durable invariant: a partially downloaded remote file can resume by URL and byte count.

Duplicate In-Flight Protection

The server also guards against duplicate downloads for the same user and URL. If another browser tab or retry path is already downloading the same source, the second request is refused instead of starting a duplicate transfer.

Truncation Detection for Chunked Transfers

Some of the hardest remote-download bugs came from chunked-transfer endpoints. Socrata-backed sources such as data.cdc.gov can stream CSV data with Transfer-Encoding: chunked. In that mode, the GET response may not expose a content length.

That creates a nasty ambiguity. At the read layer, a mid-stream disconnect and a clean end-of-file can both look like ReadAsync returning zero bytes.

Correlation Studio defends against this with three layers.

  1. HEAD-Cached Size During URL verification, HEAD responses often include Content-Length. That value is cached and reused later. If the GET response arrives without a content length, the HEAD-cached size becomes the expected total.
  2. Post-Download Range Probe When neither HEAD nor GET reveals a total, the downloader issues a Range: bytes=0-0 request with Accept-Encoding: identity. Servers that omit content length on ordinary requests frequently return Content-Range for ranged requests.
  3. Unknown-Size Resume Probe If the total is still unknown, the downloader performs up to eight resume probes using the number of bytes already read. A 416 confirms completion. A 206 with a declared total hands off to known-size resume. A 206 without a declared total appends data and probes again.

The Edge Case

A clean EOF and a severed chunked stream can look identical unless the system establishes the truth from another signal: HEAD, Content-Range, or a resume probe.

The Async Download Contract

The original remote-download endpoint was synchronous. The client posted a URL, the server downloaded the file, and the server returned an UploadDto when the download finished.

That worked for small files. It failed for long downloads.

On a 25-minute server-side fetch, the browser-to-server connection had to stay open the entire time. If the proxy or Kestrel reaped the connection, the client saw a network error, the server-side bytes became orphaned, and the wizard had no durable way to recover.

The replacement contract starts the download asynchronously.

  • POST /api/datasets/download-remote/start returns HTTP 202 immediately with a download key.
  • The server runs the download in a background task created against a fresh dependency injection scope.
  • GET /api/datasets/download-remote/status?url=... polls live state.
  • Terminal states include completed, partial, failed, cancelled, and not_found.

The frontend wraps the start-and-poll sequence behind the same conceptual surface: await a completed upload or receive an error. Internally, it tolerates transient poll failures and threads cancellation through to the server.

Cross-Restart Durability

Active downloads are stored durably in the ActiveDownloads table, one row per user and URL. The row records filename, path, bytes on disk, status, heartbeat, and other recovery state.

During the chunk loop, database writes are throttled to once every 10 seconds or 50 MB. On startup, a reconciliation service waits for a short grace period, then finds stale non-terminal downloads. If bytes exist on disk, the row becomes partial. If the file is missing, the row becomes failed. Old terminal rows are evicted.

The result is a download model that survives client disconnects, proxy timeouts, process restarts, and ordinary public-data-server unreliability.

The Experiment Engine

The experiment engine is where Correlation Studio turns datasets into discoveries. It takes two datasets, selects a join strategy, enumerates numeric-column pairs, and computes correlation statistics for every compatible pair.

The central challenge is combinatorics. If dataset X has 400 numeric columns and dataset Y has 500, the experiment has 200,000 pairs. At 1,000 by 1,000, it has one million.

The engine therefore has to be fast, resumable, deterministic, and selective about what becomes a persisted Discovery.

Experiment Pipeline

ExperimentIngestionService.ProcessQueuedExperimentAsync orchestrates the experiment flow.

  1. Resolve Inputs Load datasets X and Y, selected columns, join type, tolerance, sample percent, and absolute correlation threshold.
  2. Enumerate Column Pairs Build the cross product of numeric X columns and numeric Y columns. TotalColumnPairs = |X_numeric| × |Y_numeric|
  3. Compute Each Pair For every pair, build a DuckDB query that joins both sides on the bucketed key, computes Pearson, Spearman, p-value, confidence interval, and paired row count.
  4. Emit Discoveries If the pair clears the threshold and all filter gates, insert a Discovery and write a per-discovery Datapoint Parquet.
  5. Batch Database Writes Discoveries are inserted in batches of 50 to amortize EF Core overhead.
  6. Run Bounded Parallel Work Inner branches run concurrently, defaulting to four, while respecting DuckDB pool pressure.

The performance arc is instructive. Throughput improved from roughly 3 pairs per second to 14–15 pairs per second, with peaks around 21, through a sequence of targeted changes:

  • batching Discovery inserts instead of writing one at a time,
  • finding a better flush-window size,
  • moving datapoint payloads away from PostgreSQL JSONB / bytea hot paths,
  • deferring autovacuum pressure during ingest-heavy operations,
  • and finally moving per-discovery datapoints into Parquet on R2.

In v2, the Datapoints row stores metadata and a Parquet path. The chart payload itself lives beside the dataset Parquets in object storage.

Join Types

Every correlation requires paired X and Y values. Correlation Studio supports three join strategies: row sequence, shared key, and time series.

RowSequence

RowSequence is a positional zip. The engine assigns a synthetic row number to both datasets and joins on that position.

x_agg AS (
SELECT CAST(rn AS VARCHAR) AS k,
CAST("xval" AS DOUBLE) AS v
FROM (
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn, "xval"
FROM read_parquet('x.parquet')
)
)

Numbering happens before null filtering. That preserves row alignment: X row 5 still pairs with Y row 5 even if row 3 has a missing value.

SharedKey

SharedKey joins both sides on an exact string key. Values are aggregated per key before correlation.

x_agg AS (
SELECT CAST("xkey" AS VARCHAR) AS k,
AVG(CAST("xval" AS DOUBLE)) AS v
FROM read_parquet('x.parquet')
WHERE "xval" IS NOT NULL
GROUP BY 1
)

Date and DateTime columns are stored in canonical ISO 8601 form during ingestion, so equivalent dates join cleanly as strings.

TimeSeries

TimeSeries joins snap timestamps to a tolerance bucket, such as hourly, daily, weekly, monthly, or yearly.

x_agg AS (
SELECT strftime(
to_timestamp(
FLOOR(EPOCH(TRY_CAST("xkey" AS TIMESTAMP)) / 86400) * 86400
) AT TIME ZONE 'UTC',
'%Y-%m-%dT%H:%M:%S'
) AS k,
AVG(CAST("xval" AS DOUBLE)) AS v
FROM read_parquet('x.parquet')
WHERE TRY_CAST("xkey" AS TIMESTAMP) IS NOT NULL
AND "xval" IS NOT NULL
GROUP BY 1
)

Both sides snap to the same epoch grid. A single tolerance knob controls the bucket size.

The Timezone Bug

A prior implementation cast to_timestamp(...) directly to a string. DuckDB returns TIMESTAMPTZ, and string output included the local timezone offset. Drilldown queried plain timestamps and produced a different string shape for the same logical day. The UTC strftime expression fixed the mismatch across machines and drilldown paths.

Deterministic Sampling

Sampling must preserve pairs. Independent random sampling on X and Y would break the join by selecting different key sets. Correlation Studio instead samples deterministically on the bucketed join key.

WHERE hash({bucketedKey}) % 100 < {samplePercent}

Both sides compute hash() on identical key values. The surviving key sets therefore line up exactly.

Re-running an experiment on the same datasets with the same sample percent produces the same sample. That stability matters because published Discoveries should not change simply because an experiment was reprocessed.

Aggregation

When multiple rows fall into the same key or time bucket, each selected column’s aggregate function determines the per-bucket value.

AggregateExpression
AvgAVG(CAST(col AS DOUBLE))
SumSUM(CAST(col AS DOUBLE))
MinMIN(CAST(col AS DOUBLE))
MaxMAX(CAST(col AS DOUBLE))
FirstFIRST(col)
LastLAST(col)
CountCAST(COUNT(col) AS DOUBLE)

Average is the default because it fits most exploratory datasets. Everything is cast into a numeric shape that DuckDB’s CORR() can consume.

Correlation Computation

After both sides have been joined and aggregated into matching key-value pairs, DuckDB computes Pearson and Spearman correlation from the joined set.

joined AS (
SELECT x.k AS k, x.v AS xv, y.v AS yv
FROM x_agg x INNER JOIN y_agg y USING (k)
WHERE x.v IS NOT NULL AND y.v IS NOT NULL
AND NOT isnan(x.v) AND NOT isnan(y.v)
AND NOT isinf(x.v) AND NOT isinf(y.v)
),
ranked AS (
SELECT RANK() OVER (ORDER BY xv) AS xr,
RANK() OVER (ORDER BY yv) AS yr
FROM joined
)
SELECT CORR(xv, yv) AS pearson,
(SELECT CORR(xr, yr) FROM ranked) AS spearman,
COUNT(*) AS n
FROM joined

Pearson correlation uses DuckDB’s CORR() aggregate. Spearman correlation is implemented as Pearson correlation over ranked values. Pairs with fewer than three joined observations are discarded. Pairs with zero variance on either side produce no Discovery.

The p-value is computed in C# using a two-tailed Student-t test:

t = r · √[(n − 2) / (1 − r²)]
p = 2 · (1 − CDF_t(|t|; df = n − 2))

Confidence intervals use Fisher’s z-transform:

z = atanh(r) = 0.5 · ln((1 + r) / (1 − r))
SE = 1 / √(n − 3)
z_lo = z − 1.96 · SE
z_hi = z + 1.96 · SE
CI = [tanh(z_lo), tanh(z_hi)]

Perfect correlation is handled explicitly. When |r| = 1, the p-value path avoids divide-by-zero and the confidence interval degenerates to [r, r].

Per-Discovery Datapoint Parquet

When a pair clears the threshold, the system writes a small Parquet file containing the joined datapoints used for charting.

COPY (
SELECT k, xv, yv FROM joined
ORDER BY hash(k)
LIMIT 5000
) TO '{tempFile}' (FORMAT PARQUET)

The deterministic ORDER BY hash(k) keeps chart sampling stable across reruns. The hard cap of 5,000 rows ensures the frontend can render a discovery without an additional downsampling pass.

The file is uploaded to R2 under the discovery’s path and seeded into the local cache. The database row stores the Parquet path, counts, and data size. The payload itself no longer lives inside PostgreSQL.

Granger Causality

Granger causality is available on demand from the Discovery detail page. It is not computed during bulk experiment ingestion because doing so added a 30–60% throughput penalty to the hot path.

The on-demand algorithm has four stages.

  1. Stationarity A simplified ADF-style test fits Δy_t = α + δ·y_{t-1} + ε. The t-statistic is δ / SE(δ), using a 5% critical value of −2.86 for larger samples. Non-stationary series are differenced up to twice.
  2. Lag Selection Candidate lags are scored with BIC: k · log(n) + n · log(RSS / n) The search considers lags from 1 to min(10, n/4) and chooses the best score.
  3. F-Test Restricted and unrestricted models are compared with: F = [(RSS_restricted − RSS_unrestricted) / k] / [RSS_unrestricted / (n − 2k − 1)] p = 1 − CDF_F(F; k, n − 2k − 1)
  4. Bidirectional Evaluation The service tests X→Y and Y→X separately. The resulting direction can be none, X causes Y, Y causes X, or bidirectional.

OLS is implemented through MathNet matrix inversion. Singular systems fall back to full residual sum of squares and report no causality.

Why It Is On Demand

Granger causality is valuable, but it is not cheap enough to run for every candidate pair during brute-force experiment ingestion. Deferring it keeps discovery mining fast while still making causal testing available when a user cares about a specific relationship.

Discovery Filter Gates

Not every computed pair becomes a Discovery. The ingestion loop applies a series of gates, each designed to avoid storing obvious noise or redundant relationships.

  1. Duplicate Gate Before computing a pair, the engine checks whether the same user already has a Discovery for that exact X-column and Y-column pair in another experiment. If so, the pair is skipped without running DuckDB, writing Parquet, or uploading to R2. This makes the common “rerun the same cross-matrix as the corpus grows” workflow cost-proportional to new pairs only.
  2. Identity Gate After computation, near-perfect identities are skipped. If the rounded displayed correlation is 1.0000, the pair is treated as column-vs-self, duplicate column, or perfect linear transform.
  3. Minimum Evidence Gates Pairs are skipped if they have too few joined rows, fall below the configured absolute correlation threshold, or fail the minimum paired-row percentage.

Older duplicate history created before the gate can be cleaned by an admin utility. The cleanup keeps the oldest Discovery in each duplicate group so established permalinks, thumbnails, search records, and published states are preserved.

Statistical Algorithms

Correlation Studio intentionally uses standard statistical methods. The goal is not to invent new statistics. The goal is to make proven statistical workflows available at corpus scale and wrap them in useful product primitives.

Pearson Correlation

r = Σ[(xᵢ − x̄)(yᵢ − ȳ)] /
√[Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²]

Pearson appears in three places: DuckDB during experiment ingestion, MathNet for one-off server-side computations, and a TypeScript implementation for client-side rolling windows.

Spearman Rank Correlation

Spearman is Pearson applied to ranks:

ρ = Pearson(rank(x), rank(y))

DuckDB computes ranks with window functions and then correlates them. MathNet handles one-off server paths. Ties are handled by the respective rank implementations.

P-Value

t = r · √[(n − 2) / (1 − r²)]
df = n − 2
p = 2 · (1 − CDF_t(|t|; df))

The p-value path uses MathNet’s Student-t distribution. Correlations with fewer than three observations return no p-value.

Confidence Interval

z = atanh(r)
SE = 1 / √(n − 3)
z_lo = z − 1.96 · SE
z_hi = z + 1.96 · SE
CI = [tanh(z_lo), tanh(z_hi)]

Confidence intervals use Fisher z without Olkin-Pratt small-sample bias correction. The interval is reported as-is.

Regression Models

The frontend regression utility implements four model families: linear, polynomial, logarithmic, and exponential.

Linear

slope = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²)
intercept = (Σy − slope·Σx) / n

Degenerate X variance invalidates the model. R² is computed as 1 − SS_resid / SS_total.

Polynomial

Degree-2 and degree-3 polynomial fits build normal-equation matrices from power sums and solve them with Gaussian elimination and partial pivoting. Singular pivots invalidate the model.

Logarithmic

The logarithmic model fits y = a + b·ln(x). Values where x ≤ 0 are filtered out.

Exponential

The exponential model fits y = a·e^(bx) by fitting a linear model over (x, ln(y)), then exponentiating the intercept. Values where y ≤ 0 are filtered out.

Every model returns a prediction function, equation string, coefficients, and R². The equation formatter uses adaptive precision so extremely large, normal-sized, small, and tiny coefficients remain readable.

Prediction Intervals

Regression overlays include a 95% prediction interval. For each x value on the regression curve, the margin is:

margin = 1.96 · SE · √[1 + 1/n + (x − x̄)² / SS_x]
SE = √(SS_resid / (n − p))

The chart draws the band as a filled polygon above and below the curve. For nonlinear models, the prediction function is sampled across the x-range.

Pearson vs. Spearman Divergence

Correlation Studio uses the difference between Spearman and Pearson strength to flag relationships that may be nonlinear or outlier-influenced.

Δ = |Spearman| − |Pearson|
"non-linear" if Δ > 0.15
"outlier-influenced" if Δ < −0.15
null otherwise

A large positive delta suggests a monotonic but nonlinear relationship. Spearman preserves the rank order while Pearson underreports the curve. A large negative delta suggests outliers may be propping up Pearson more than the rank relationship supports.

This drives the small divergence indicators in discovery lists and powers the dedicated divergence visualization.

NaN-Safe Rolling Correlation

Rolling correlation runs client-side for interactive exploration. The implementation skips non-finite values, returns zero for windows with fewer than three valid pairs, and guards zero-variance denominators.

function pearson(xs: number[], ys: number[], start: number, end: number): number {
let sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0, sumY2 = 0, n = 0;
for (let i = start; i < end; i++) {
if (!Number.isFinite(xs[i]) || !Number.isFinite(ys[i])) continue;
sumX += xs[i];
sumY += ys[i];
sumXY += xs[i] * ys[i];
sumX2 += xs[i] * xs[i];
sumY2 += ys[i] * ys[i];
n++;
}
if (n < 3) return 0;
const denom = Math.sqrt(
(n * sumX2 - sumX * sumX) *
(n * sumY2 - sumY * sumY)
);
if (denom === 0) return 0;
const r = (n * sumXY - sumX * sumY) / denom;
return Number.isFinite(r) ? r : 0;
}

Users can adjust the window size. The chart overlays strength bands and reports mean and volatility below the line.

Downsampling and Density

Correlation Studio does not implement Largest-Triangle-Three-Buckets. Instead, it uses three practical strategies tailored to the product.

  • Per-discovery cap: every Datapoint Parquet is capped at 5,000 rows with deterministic hash ordering.
  • Scatter bucket deduplication: the chart computes a 1000×600 data-value bucket grid. Multiple points in the same bucket render as a single dot with radius scaled by log2(count).
  • Density-aware line rendering: dense line graphs lower alpha as segments-per-column increases, letting opacity accumulate where many segments overlap.

The goal is not to preserve every rendered primitive. The goal is to preserve the visual meaning of dense data without making the browser draw thousands of redundant pixels.

Visualizations

The primary chart engine is DiscoveryChart.tsx: a canvas-based, double-buffered renderer that handles scatterplots, line graphs, residual plots, trajectory mode, and regression overlays.

Device-pixel-ratio scaling keeps output crisp on high-DPI displays. Double buffering keeps redraws clean: a new frame renders into an offscreen canvas, then atomically replaces the visible canvas.

ComponentTechnologyPurpose
MiniChartSVGSparkline previews in feed cards and grids
CorrelationHeatmapSVG gridPairwise correlation matrix across columns
DivergenceChartSVG scatterPearson vs. Spearman per discovery
BubbleChartSVGCorrelation vs. sample size with bubble size as divergence
LagAnalysisChartSVG polylineCorrelation at positive and negative lags
RollingCorrelationChartSVG polylineWindowed correlation across a series
CorrelationNetworkD3 force graphNode-link graph of inter-column correlations
DatasetMapViewLeafletGeographic scatter for geo-typed datasets
ColumnHistogramSVGPer-column distribution histogram
ColumnBoxPlotSVGQuartiles and outliers

DiscoveryChart Render Pipeline

The main chart pipeline starts with memoized data preparation. It recomputes when the selected datapoint payload or visualization mode changes.

Scatterplot and residual modes use a 1000×600 bucket grid. Each point is mapped into a bucket by its normalized x and y position. Duplicate bucket hits collapse into a single rendered dot, with radius scaled by density.

Line graph mode uses arrays indexed by row order. Trajectory mode parses compound join keys into group and time pairs, then renders grouped alpha-blended lines.

Residual mode transforms the plotted y value before bucketing:

plotY = y − (slope · x + intercept)

Autocorrelation Gate

Row order is sometimes meaningful and sometimes noise. Before recommending ordered visualizations, the chart computes lag-1 autocorrelation for both series. If both series have weak autocorrelation, the UI warns that row order may not carry signal and suggests scatterplot or trajectory mode instead.

Double-Buffered Rendering

Rendering happens on an offscreen canvas. When the frame is complete, the offscreen buffer is copied into the visible canvas using globalCompositeOperation = 'copy'. This avoids the white flash that can occur when resizing or clearing the visible canvas directly.

Resize Throttling

Resize events are handled with a leading-edge throttle and a short cooldown. The first resize applies immediately. Subsequent resize events coalesce into a pending size and apply when the cooldown expires.

Performance Instrumentation

The chart accumulates stage timings for data preparation, trajectory grouping, and regression. If a render exceeds 100 ms, it logs a warning with stage-level detail.

Regression Overlay Rendering

In scatterplot mode, users can choose among linear, polynomial, logarithmic, and exponential regression overlays. The selected model is fitted client-side and rendered directly over the discovery datapoints.

The overlay draws four elements:

  • a 95% prediction interval as a filled polygon band,
  • the regression curve as a dashed line,
  • the equation in a corner badge,
  • and R² alongside the discovery’s correlation coefficient.

Nonlinear curves are sampled at 80 points across the visible x-range. Linear fits use only the two endpoints.

Server-Side Thumbnails

Discovery thumbnails are rendered server-side with SkiaSharp. The server renderer mirrors the client chart logic closely enough that thumbnails and interactive charts have the same visual language.

The thumbnail pipeline is:

  1. Read Discovery Data Resolve the Discovery, load its Datapoint Parquet, and query it through DuckDB while filtering null, NaN, and non-finite values.
  2. Render PNG Draw a 640×360 PNG with SkiaSharp using the same scatter-bucket logic as the client.
  3. Upload to R2 Store the PNG under the user’s discovery thumbnail path. URLs are versioned with the modified date for CDN cache-busting.
  4. Trigger Refresh Experiment ingestion enqueues a thumbnail refresh job after each discovery’s datapoint Parquet is uploaded. The experiment detail page can also trigger server-side renders for visible discoveries that are missing thumbnails.

Portfolio thumbnails use a similar renderer, composing mini previews for dataset, experiment, and discovery blocks.

Datapoint Drilldown

One of the most important product affordances is the ability to click a chart point and inspect the source rows behind it.

The client sends selected keys to:

POST /api/discoveries/{id}/datapoint-rows

Multi-select is supported through lasso and Shift-click. The endpoint streams NDJSON: one JSON object per line. The client parses rows lazily instead of waiting for a giant JSON array.

Streaming uses:

  • Content-Type: application/x-ndjson; charset=utf-8
  • X-Accel-Buffering: no
  • Response.Body.DisableBuffering()
  • explicit FlushAsync calls

X and Y sides are scanned concurrently. Each side queries its source dataset Parquet through DuckDB:

SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rid, c1, c2, ...
FROM read_parquet('dataset.parquet')
WHERE {bucketed-key-expression} IN ('key1', 'key2', ...)
ORDER BY {bucketed-key-expression}
LIMIT {DrilldownPerSideRowCap + 1}

The bucketed key expression is generated by the same helper used during experiment ingestion. That consistency is what makes time-series drilldown correct: a daily bucket returns every X row and every Y row whose timestamp falls in that day.

The per-side row cap is 5,000. The query fetches one extra row so truncation can be detected without issuing a separate count query.

Dataset Tools

Dataset tools let users produce derivative datasets without leaving the platform. They are implemented by DatasetToolService and share the same storage model as ordinary ingestion: DuckDB transforms source Parquet, the result is written to a new Parquet file, and the new dataset becomes a normal first-class dataset.

The transformation pipeline mirrors ingestion:

  1. Resolve source dataset or datasets and materialize their Parquets to the local cache.
  2. Build a transformation plan from tool type and options.
  3. Save output columns to acquire fresh column IDs.
  4. Run DuckDB SQL and stream transformed rows through type coercion into the Parquet writer.
  5. Upload the output Parquet to R2 and seed the cache.
  6. Flip the new dataset to Ready and debit tokens in one transaction.

Output Parquet column names use the same c{ColumnId} convention as ordinary ingested datasets.

Derivative Datasets Are Real Datasets

A transformed dataset is not a temporary view. It has columns, stats, storage usage, tokens, join-key detection, publication state, and can participate in future experiments.

Auto Join-Key Heuristic

If no output column is explicitly marked as a join key, the tool pipeline scans for the first Date or DateTime column. If none exists, it falls back to the first String column. This mirrors the dataset wizard’s detection behavior so downstream experiments can time-series join derivative datasets with minimal user intervention.

The 18 Dataset Tools

The tool catalog covers common reshaping, time-series, normalization, filtering, and joining workflows.

Transpose

Converts wide data to long form using an unpivot-style operation. Value columns are cast to double before unnesting.

Difference

Computes row-over-row deltas using LAG() over deterministic row order.

LagLead

Shifts a column forward or backward with LAG or LEAD.

RollingAggregate

Computes rolling SUM, MIN, MAX, COUNT, FIRST, LAST, or AVG over a row window.

ResampleTimeSeries

Groups timestamps by period using DATE_TRUNC and aggregates numeric columns per period.

FilterRows

Builds a WHERE clause from user conditions such as equality, range, contains, starts with, ends with, null, and not-null.

Normalize

Supports z-score and min-max normalization with zero-variance guards.

PercentChange

Computes percent change from the prior row and returns null when the prior value is zero.

CumulativeSum

Uses a windowed cumulative sum from the beginning of the row order to the current row.

FillMissing

Supports forward-fill, backward-fill, numeric interpolation, and constant fill.

Deduplicate

Ranks rows by key group and keeps the first or last row per group.

GroupAggregate

Groups by one or more key columns and applies an aggregate function.

Pivot

Discovers distinct pivot values and builds one output column per value.

Rank

Computes rank over a selected column with nulls placed last.

Bin

Computes numeric buckets from precomputed min and max values.

OutlierRemoval

Filters rows whose target values exceed a configured standard-deviation threshold.

MergeJoin

Performs inner, left, right, or full outer joins on string-cast key columns.

AppendStack

Unions two datasets by discovering a case-insensitive union schema and projecting missing columns as null.

Shared Tool Infrastructure

Tools that depend on row order inject a ROW_NUMBER() OVER () AS rowid CTE. Window frames use row-based frames rather than value-based ranges.

The streaming transform path enforces DuckDB reader output into the target Parquet type. Integer outputs become Int64. Decimal outputs become Double. Temporal and string outputs become canonical strings.

Missing values are handled explicitly. Forward-fill and backward-fill use IGNORE NULLS. Interpolation tracks null positions and neighboring non-null values. Rank uses NULLS LAST. Aggregates coalesce to safe defaults where needed to avoid division-by-zero and invalid numeric output.

Corrie: The RAG Chatbot

Corrie is Correlation Studio’s in-app RAG chatbot. The name is user-facing; the historical frontend component remains StudioChat for compatibility, but the UI, empty states, button labels, error banners, and system prompt all identify the assistant as Corrie.

Corrie answers two kinds of questions:

  • product and “how do I…” questions over the documentation corpus,
  • and content questions over public or user-visible datasets, experiments, discoveries, portfolios, topics, and forum questions.

The chatbot is not bolted on as a generic assistant. It is wired into the same entity model and publication lifecycle as the rest of the platform.

RAG Architecture

The RAG system has three core database entities.

  • RagChunk: one retrievable unit with corpus type, source type, source entity ID, title, URL, chunk text, embedding, and creation date.
  • ChatConversation and ChatMessage: persistent chat history per user.
  • ChatQueryAudit: per-query audit records containing token counts, cost estimates, retrieved chunks, and escalation flags.

Chunks belong to several corpus types: documentation, entities, and pre-seeded content. Source types cover document sections, discoveries, experiments, portfolios, topics, datasets, forum questions, and seed content.

This gives the chatbot a uniform retrieval interface across static docs and user-generated analytical content.

Embedding and Retrieval

Embedding is dual-vendor. OpenAI’s text-embedding-3-large is the primary path, reduced to 1536 dimensions. Gemini text-embedding-004 is the fallback.

The 1536-dimensional shape was retained because the existing pgvector column already used 1536 dimensions. That made the cutover to OpenAI primary a service change rather than a schema migration.

Each embedding service owns its own retry loop with jittered backoff and Retry-After support. The composite fallback service tries OpenAI first, then Gemini on transient failure. Vendor and cost data are captured separately so admin reporting can distinguish normal and fallback spend.

Documentation chunks are built from the user guide by splitting Markdown sections on H2 and H3 headings. The docs corpus is hashed at startup and re-chunked only when the hash changes.

Entity chunks are built by source-specific chunk builders. Publishing, unpublishing, deletion, and analysis changes enqueue RAG upsert or remove commands.

The RAG Index Queue

Index commands are durable rows, not in-memory events. The worker polls due rows, coalesces multiple commands for the same source, embeds upserts in vendor batches, and writes precomputed embeddings into RagChunks.

Failed rows are retried with exponential backoff and are never silently dropped.

Retrieval

Retrieval uses pgvector cosine similarity:

SELECT * FROM "RagChunks"
WHERE "CorpusType" = 1 OR "SourceUserId" = @userId
ORDER BY "Embedding" <=> @queryEmbedding
LIMIT @k

The default retrieval count is eight chunks. Visibility filtering ensures private entities are not retrieved for other users.

Chat Synthesis

ChatService orchestrates the full query lifecycle:

  1. Embed the user’s query with the fallback embedding service.
  2. Retrieve top-k chunks from RagChunks.
  3. Build a system prompt containing retrieved chunks and provenance.
  4. Call Claude Haiku for synthesis.
  5. Escalate to Claude Sonnet if response confidence appears low.
  6. Stream the answer to the client.
  7. Write a ChatQueryAudit row with estimated cost.

Haiku is the default because it is fast and inexpensive. Sonnet is reserved for ambiguous or low-confidence cases where a stronger model noticeably improves answer quality.

Citations are rendered inline by the Corrie panel. Provenance links are resolved from stored entity URLs at service startup.

Chat Billing and Rate Limits

Corrie uses a flat token price per query. Each chat query debits 25 tokens regardless of whether the answer used Haiku alone or escalated to Sonnet.

That pricing model lets cheaper Haiku queries subsidize occasional Sonnet escalations, while the daily per-user query cap bounds worst-case spend.

Exceeding the query cap returns HTTP 429. Insufficient token balance returns HTTP 402. The chat panel renders both as inline banners.

Token Economy and Billing

Correlation Studio’s billing model is built around non-expiring token packs. Tokens fund usage-based operations: ingestion, correlation, transforms, AI search, AI analysis, and Corrie.

The design goal is to keep browsing and lightweight exploration free while making expensive compute and AI work pay for itself.

Token Packs

Users can buy token packs in four tiers. Packs never expire and stack into one user balance.

PackPriceTokensRate
Small$55,000$1.00 / 1K tokens
Medium$2025,000$0.80 / 1K tokens
Large$5065,000$0.77 / 1K tokens
Bulk$200300,000$0.67 / 1K tokens

The volume discount is intentionally modest. A bulk-pack buyer should receive a better rate, but not enough to destroy margin on expensive AI-backed operations.

Operation Costs

Token pricing is tied to the kind of work being performed.

OperationBasePer UnitUnitApproximate Retail Cost
Ingest Dataset251per MB$0.025 + $0.001/MB
Run Correlation251per MB$0.025 + $0.001/MB
Tool Transform251per MB outputShares ingest rate
AI Analysis0125per call$0.125
AI Search: Claude0250per call$0.25
AI Search: Gemini0125per call$0.125
AI Search: Grok0125per call$0.125
AI Chatbot025per call$0.025

The per-MB compute rate was reduced after real ingestion testing showed that the old price could make legitimate multi-file workflows feel punitive. The per-MB charge is compute markup, not storage rent; R2 storage is cheap enough that the ingest fee covers many years of storage at retail rates.

AI Analysis was also repriced after bulk-analysis workflows exposed a power-user failure mode. One click could analyze hundreds of child discoveries. Reducing analysis cost kept that path viable without damaging margin.

Atomic Debits and the Ledger

All token-spending call sites route through the same accounting service. Dataset search, AI analysis, Corrie, ingestion, experiment ingestion, and dataset transforms all use the atomic debit path.

Every token movement writes a ledger row: purchases, grants, debits, refunds, and reclamation credits. The ledger row stores a running balance, and that balance comes from the same RETURNING TokenCount value produced by the atomic PostgreSQL update.

By construction, the ledger balance matches the stored user balance.

A reconciliation path can recompute token balances from the ledger. It finds the most recent token-reset sentinel row per user, then sums all later ledger movements. Older users without a sentinel fall back to the latest ledger balance.

Deletion Reclaim

Deleting a dataset, experiment, or discovery can return part of the original per-MB token cost to the user.

The reclaim rule credits back 70% of the per-MB portion. The 25-token base fee is retained, and the remaining 30% acts as a reclamation surcharge. Very small entities whose reclaim amount would round below one token receive no credit.

Credits appear in the ledger as token reclamation entries tied to the deleted entity title. The deletion service writes the ledger entry in the same transaction as the soft-delete flag.

Stripe Integration

Stripe handles saved cards, subscriptions, token-pack purchases, and refunds.

  • Saved cards use Stripe Checkout setup sessions. The application stores the payment method ID and card brand for display.
  • Subscriptions fund storage tiers. Upgrades and downgrades prorate against the current period.
  • Token packs use Stripe Checkout in payment mode. Webhooks are idempotent on payment intent.
  • Refunds flow from user request to admin approval to Stripe refund and ledger reversal.

The Stripe webhook controller is the canonical entry point for billing state changes. It verifies webhook signatures, deduplicates events by Stripe event ID, and delegates side effects to transactional billing-service methods.

Invoices are rendered as PDFs with monotonic invoice numbers. Each invoice stores an HTML snapshot at issue time so future re-rendering cannot drift from what the user originally received.

Authentication and Authorization

Correlation Studio supports email/password login, Google OAuth, Microsoft OAuth, and JWT session tokens.

  • Email/password accounts use BCrypt hashing and email verification codes.
  • Password reset uses an email-token flow.
  • Google OAuth uses client-side Google auth and server-side token verification.
  • Microsoft OAuth follows Microsoft Identity patterns.
  • JWT tokens are HS256-signed and refreshed on authenticated requests inside the refresh window.

External Login Linking

External identities map provider and provider user ID to the local user account. If a mapping exists, OAuth login signs in that user. If not, the flow either creates a new user or links the external login to the currently authenticated account.

Roles and Visibility

The role hierarchy is strict:

Anonymous=1
Registered=2
Moderator=3
Curator=4
Administrator=5
  • Anonymous users can view public content and search.
  • Registered users can create, edit, publish, comment, rate, tag, and react.
  • Moderators can moderate content through soft delete and restore workflows.
  • Curators can feature content and pin forum questions.
  • Administrators get the full admin console.

Publishable entities have three visibility states:

VisibilityWho Can See It
PrivateThe owner only
PublicVisible on Home feed, search, profiles, crawlers, and public entity pages
WorkgroupMembers of the assigned workgroup

Security Hardening

Several straightforward hardening layers protect accounts and public surfaces.

  • IP blacklist checks run on authenticated requests.
  • Email blacklist rules prevent signup from banned domains.
  • Suspended users receive a 403 and a suspension-reason page.
  • Password history prevents reuse of recent passwords.
  • Connection logging records method, path, status, user agent, user ID, duration, and IP address.

Connection logs are retained for 30 days. They are useful not only for security but also for diagnosing slow routes and operational anomalies.

Background Services and Lifecycle

Every non-request-scoped task runs as an ASP.NET hosted service. The system relies on background workers for ingestion, cleanup, expiration, subscriptions, link verification, metrics, RAG indexing, thumbnails, active downloads, and more.

ServicePurposeCadence
IngestionQueueServiceDispatch dataset, experiment, search, index, and transform jobsEvery 2 seconds with per-lane caps
DatasetCleanupServiceHard-delete soft-deleted datasets after grace periodRunner cycle with bounded time budget
ExpirationServiceExpire datasets based on configured expiration datesEvery 15 minutes
SecurityCleanupServiceExpire registration and password-change tokensDaily
UploadCleanupServiceDelete old staging uploadsEvery 15 minutes
SubscriptionRenewalServiceRenew subscriptions and send noticesDaily
LinkVerificationServiceHEAD-check referencesTriggered on login with cooldown
SystemMetricsCollectorServiceSample CPU, RAM, disk, connections, and slow queriesEvery 30 seconds
RagIndexWorkerDrain the RAG index queue and embed chunksContinuous loop
ActiveDownloadReconcileServiceRecover stale active-download rowsStartup plus periodic sweep

Ingestion Queue

IngestionQueueService is the dispatcher for long-running analytical work. It has separate lanes for dataset jobs, experiment jobs, and tool jobs.

  • The dataset lane handles ingest, index, and transform jobs.
  • The experiment lane handles experiment ingestion.
  • The tool lane shares capacity with dataset work.

Each lane has its own concurrency semaphore. The dispatcher selects the oldest pending job up to the lane cap, flips it to running, and invokes the appropriate service.

An admin-controlled pause flag stops new dispatches without killing running jobs. This is important for maintenance: the operator can let active work drain while preventing new long-running jobs from starting.

A duplicate guard prevents concurrent jobs against the same entity. Without that, a stale redispatch could create two pipelines writing the same Parquet path.

A stale-job watchdog inspects long-running jobs. Jobs that are still reporting progress are left alone. Truly hung jobs are canceled.

Orphan Recovery on Startup

Long-running jobs can be interrupted by process crashes, server restarts, or deploys. Startup recovery hooks repair common partial states.

  • Reading-state datasets: if partial Parquet exists, queue indexing; if upload file exists, queue fresh ingest; if a reference exists, restart download; otherwise reset to pending.
  • Downloading-state rows: stale active downloads become partial if bytes exist or failed if the file is missing.
  • Cleaning-state entities: cleanup resumes chunked deletion from the persisted cursor.
  • Zombie PostgreSQL backends: old process connections are found by application-name suffix and terminated.

Crash Recovery Philosophy

Long-running work should not require perfect uptime. Every state that can be interrupted should have a deterministic next step on startup.

Notifications

In-app notifications are stored as rows and surfaced through the bell UI. Transactional emails use database-backed templates for registration verification, password reset, login alerts, renewal notices, invoices, refunds, and related account events.

Development sends through Gmail SMTP with quota limits. A production migration to SES or SendGrid is planned if outbound email volume grows beyond lightweight usage.

Frontend Architecture

The frontend is a React 18 and TypeScript SPA built with Vite. It is designed around rich client-side interaction: dataset creation, remote download progress, chart rendering, drag-and-drop portfolio editing, maps, feeds, search, and AI chat.

  • React 18 with strict mode.
  • TypeScript end to end, with frontend types mirroring C# DTOs.
  • Vite for development and production builds.
  • Tailwind CSS for utility-first styling.
  • React Query for server state.
  • Zustand for persisted auth state.
  • React Router v6 with route-level code splitting.
  • Recharts, D3, Leaflet, and canvas for specialized visualizations.

The frontend is not a thin CRUD shell. Several product-defining features happen in the browser: chart interaction, client-side thumbnail capture, wizard state persistence, optimistic feed updates, and search-as-feed navigation.

State Management

Server state lives in React Query. Query keys follow predictable shapes such as ['my-datasets'], ['experiment', id], and ['available-datasets-my']. Mutations invalidate the relevant keys on success.

One recurring frontend lesson is that similar pages may use different query keys for good reasons. Dataset creation, deletion, and publish actions therefore need to invalidate all surfaces that can show the entity, not just the page where the mutation happened.

Client state is intentionally narrow. Zustand stores the authenticated user, token, and authentication status in localStorage. A global confirmation-dialog provider replaces native browser confirms and alerts with themed, promise-based dialogs.

API Client

The frontend API layer exports named service objects such as datasetsApi, experimentsApi, discoveriesApi, and portfoliosApi. Each wraps an Axios instance with JWT handling.

A sliding-refresh interceptor watches authenticated responses. If the JWT is inside the refresh window, the client triggers a background refresh and updates the auth store. A failed refresh redirects to the login flow.

Wizard Optimization Patterns

The dataset-create wizard is one of the most performance-sensitive parts of the frontend. It can host dozens of in-flight remote downloads, each polling progress every few seconds, while preserving enough state to survive navigation and refreshes.

Several optimizations were required after large remote-search runs exposed rendering and storage bottlenecks.

Debounced Autosaves

Wizard state is persisted to sessionStorage, but progress ticks can be frequent. Without debouncing, every tick triggered a full parse, mutate, stringify, and write cycle against a large state object. Autosaves now debounce on a short timer.

Memoized Download Rows

Downloaded file rows are isolated into a memoized component. Stable callback identities prevent every row from rerendering whenever one file receives a progress update.

Stall Detection

A parent-level timer recalculates stalled status every 10 seconds. That allows “stalled at N MB” labels to update without forcing all rows to rerender on every poll.

Flex-Chain Layout

The file list uses a nested flex layout with min-h-0 and overflow-y-auto, so the list scrolls inside the wizard instead of pushing the footer below the viewport.

Frontend Scar

A 42-file remote-search run with multi-gigabyte CDC files exposed three problems at once: autosave churn, row rerender storms, and layout overflow. Fixing all three turned the wizard from a stress case into a usable workflow.

Chart Capture for Thumbnails

Discovery thumbnails are generated both client-side and server-side.

The client-side path gives users immediate visual feedback when publishing or saving a discovery:

  1. Generate a mini-chart SVG string from the datapoint payload.
  2. Load the SVG into an image object through a data URL.
  3. Draw the image to a canvas.
  4. Convert the canvas to PNG.
  5. Upload the PNG thumbnail to the API, which writes it to R2.

The server-side renderer uses SkiaSharp and the same scatter-bucket logic. Keeping the two paths visually synchronized requires discipline: changes to bucket math need to land in both renderers.

Search as Home Feed

Search results reuse the same rich feed cards as the normal Home feed. Pressing Enter in the header search box navigates to a Home URL with query parameters rather than a separate search-results page.

/?q=<term>
/?q=<term>&entityTypeId=<n>

The backend unifies both paths by extracting feed DTO construction into a shared pipeline. The normal feed reads rows from HomeFeed. Search produces synthetic feed references from ranked entity results. Both hydrate through the same DTO-building path and receive the same cards, thumbnails, sentiment state, and mini-chart payloads.

Search pagination treats results as one deterministic stream clustered by type: discoveries, experiments, datasets, portfolios, and posts. Each type ranks within itself, then fixed type-order concatenation produces stable infinite-scroll pages.

Search rows receive synthetic feed IDs so existing React keys and optimistic sentiment updates continue working.

Multi-Term Matching and Relevance Ranking

Search is term-based rather than exact-phrase only. A shared ranking helper converts the raw query into search terms, a match predicate, and a relevance score.

  • Whitespace-split words become individual terms.
  • Quoted spans are preserved as exact literal terms.
  • An entity matches if any term appears in title or description.
  • Title hits score higher than description hits.
  • All-terms-present gets a bonus.
  • The full query appearing as a contiguous phrase gets the strongest bonus.

Matching and scoring run server-side as expression trees over EF.Functions.ILike. The database evaluates the score through SQL CASE expressions rather than materializing every candidate into memory.

Because ILIKE '%term%' cannot use ordinary B-tree indexes, production search relies on pg_trgm GIN indexes over titles, descriptions, and post text. That turns multi-term OR searches into bitmap index scans instead of sequential scans across the corpus.

Search Lesson

Rich search over user-facing text is not only a ranking problem. Without trigram indexes, a friendly multi-term query becomes an expensive full-table scan.

Discoverability: Crawler SSR, Structured Data, and Open Data

Version 2.4 focused on making Correlation Studio’s public corpus visible to search engines, AI crawlers, social-media unfurlers, and dataset discovery tools.

The corpus is large: public datasets, experiments, discoveries, and portfolios form a body of structured, citable analytical content. But before 2.4, that corpus lived behind a client-rendered React SPA. Non-JavaScript crawlers fetched entity URLs and received the same generic shell: a root div and script tag, without entity-specific title, body, or structured data.

Version 2.4 adds crawler-grade HTML without changing runtime behavior for human users.

Crawler SSR with EntityShellMiddleware

EntityShellMiddleware sits ahead of the SPA static-file fallback. When a request path matches a public entity route, the middleware intercepts the shell and rewrites it before bytes leave the server.

The middleware injects three kinds of crawler-visible content.

Per-Route Metadata

The middleware resolves the public ID, loads the entity, and injects a real <title>, meta description, Open Graph tags, and Twitter Card tags. Entity thumbnails become social preview images.

Server-Rendered Noscript Body

A readable entity article is emitted inside <noscript>. Human users with JavaScript hydrate the SPA as before. Non-executing crawlers receive an indexable page body containing titles, analysis text, coefficients, dataset columns, and other entity-specific content.

Schema.org JSON-LD

Each entity receives structured JSON-LD in the document head. That gives crawlers a machine-readable description of the page, its entity type, breadcrumbs, authorship, dates, and distributions where applicable.

Visibility rules are enforced in the middleware. Private and workgroup-only entities do not leak metadata or noscript content to anonymous crawlers.

Per-Entity Structured Data

Each public entity maps to an appropriate schema.org payload.

  • Datasets emit Dataset with name, description, creator, modified date, keywords, license, and distribution.
  • Experiments and Discoveries emit Article with the AI analysis as article body and source datasets linked through about.
  • Portfolios emit Article or CollectionPage.
  • All entities emit BreadcrumbList so crawlers can understand and traverse the site hierarchy.

Structured data is generated through a shared builder, keeping entity-to-schema mapping centralized.

Topic Ontology

The original product had per-user topics: useful for organization, but not a global taxonomy. Version 2.4 adds a two-level controlled vocabulary that organizes content across users.

  • OntologyTopics defines curated parent and child topics, each with a stable slug, display name, and description.
  • EntityTopics maps datasets, experiments, and discoveries to ontology topics through an entity-type plus entity-ID pattern.

Assignment is automatic. A Claude Haiku classifier reads existing AI analysis text and returns allowed topic slugs from the fixed vocabulary. Out-of-vocabulary results are discarded.

This is intentionally cheap. It reuses analysis text the platform already paid to generate, and experiment topics inherit to public discoveries. That means classifying experiments and datasets can cover the discovery corpus without one classifier call per discovery.

Topic Hub Pages

Topic pages turn the public corpus into a navigable subject graph.

The /topics index and each /topics/{slug} hub are public, crawler-grade pages rendered through the same shell middleware used for entity pages. Each hub exposes a noscript body listing mapped entities, plus CollectionPage and BreadcrumbList JSON-LD.

Topic hubs also receive their own sitemap shard so crawlers can discover them directly.

Functionally, the hubs act as an internal-linking layer. They connect otherwise isolated entity pages by subject:

Topic index
Topic hub
Dataset / Experiment / Discovery
Breadcrumb back to topic

Topic hubs give crawlers a dense, navigable graph instead of a flat list of orphan URLs.

That matters because internal linking is one of the strongest levers for encouraging crawlers to explore a large site deeply.

IndexNow and Sitemap Improvements

Version 2.4 adds several mechanisms to shorten the lag between publishing content and having search engines revisit it.

IndexNow Pings

When a public entity is published, unpublished, or deleted, the platform sends an IndexNow ping for the changed URL. The ping runs asynchronously from the same background lifecycle path that already handles RAG indexing, so publishing remains fast for the user.

Domain ownership is proven through the standard key-file-at-root mechanism.

True lastmod Values

The sitemap index now emits real MAX(ModifiedDate) values per entity type and shard instead of static build-time timestamps. Crawlers can therefore use conditional requests correctly and distinguish changed sections from unchanged ones.

Image Sitemap Extension

Discovery scatterplot thumbnails and portfolio composite thumbnails are advertised through image-sitemap extensions. That exposes the generated visual corpus to image search, not just ordinary web search.

Open Data

The public corpus is published as open data under CC BY 4.0, with attribution to the content creator.

It is exposed through three surfaces.

/llms-full.txt

The full public corpus rendered as one Markdown document containing titles, analysis, and coefficients. It is designed as a single-fetch corpus surface for language models.

/open-data/correlation-studio-public.ndjson

A streamed bulk dump with one self-describing JSON object per public entity. Streaming avoids a memory ceiling as the corpus grows.

/data

A human-facing landing page documenting the open-data surfaces, license, and NDJSON schema.

The open-data layer is also represented in structured data. Dataset JSON-LD includes a real license and downloadable distribution. The entire public corpus dump is itself typed as a schema.org Dataset, with distribution pointing to the NDJSON file.

That makes the corpus discoverable both as individual analytical pages and as a downloadable dataset in its own right.

Operational Snapshot

The discoverability work runs against live production traffic. A 30-day operational snapshot through June 29, 2026 showed:

MetricValue
Requests~975,000
Unique IPs~45,000
Average request duration79 ms
Uptime100.00%
Error rate2.38%
Response mix2xx 93.2% / 3xx 4.4% / 4xx 2.4% / 5xx ~0%

The average request duration held with SSR middleware in the request path because the shell rewrite is lightweight. It uses data the API already loads and string-templated HTML insertion, not a second round trip through an external rendering service.

Operational Patterns and Hard-Won Lessons

The most valuable architectural lessons in Correlation Studio were not anticipated in design documents. They came from production failures, load tests, and real workflows that broke assumptions.

The advantage of a solo build is not that one person can predict every failure. It is that the loop between “this broke” and “this is fixed permanently” can be short.

The Operating Principle

Delete cleverness the moment the simpler design proves more durable.

Lesson 1: Replace Boolean State Soup

Before the state-machine refactor, dataset ingestion used multiple flags: ingestion status, indexing status, queued flags, and related booleans. Every dispatcher race created another corner case.

The fix was a single DatasetStateId column backed by a DatasetStates lookup table. Experiments received the same treatment.

One column. One writer per transition. Deterministic startup recovery.

It was one of the highest-leverage structural cleanups in the project.

Lesson 2: Make the Ledger the Source of Truth

Token balances originally lived as a column on Users and were debited through read-modify-write. Under concurrency, lost updates produced silent drift.

The ledger forced correctness. Every debit now atomically updates the user balance and writes a ledger row whose balance field is the returned post-update value.

Drift recovery becomes deterministic: find the latest reset sentinel and sum every ledger movement after it.

Lesson 3: PostgreSQL Is Not a Data Lake

The v1 storage design used row tables and index tables for bulk dataset content. It failed when write amplification overwhelmed disk I/O under concurrent ingestion.

The v2 migration replaced that entire path with one columnar Parquet file per dataset, stored in R2 and queried by DuckDB. Sharding disappeared. Thousands of lines of code were deleted. The frontend stayed unchanged.

The lesson was not “optimize PostgreSQL harder.” The lesson was “stop asking PostgreSQL to do the wrong job.”

Lesson 4: Canonicalize Time Once

A time-series bug exposed a subtle mismatch. The experiment join bucketed timestamps through one DuckDB expression. Drilldown queried the raw column through a different expression. The chart showed correlations because both sides of the join collided on the same bucket string, but drilldown returned no rows because the stored key shape did not match.

The fix was to centralize bucket-key generation. Both ingest and drilldown now use the same helper, and time-series buckets are formatted in UTC ISO 8601 form.

The lesson: if a derived key is used in more than one path, it deserves one canonical builder.

Lesson 5: Retry Is Not Enough if the Pool Is Poisoned

R2 upload retries initially failed to recover from a poisoned SChannel socket because the same client pool kept returning the same bad connection.

The fix was not more backoff. The fix was to replace the client and its underlying connection pool when the failure chain indicated a Windows TLS failure.

Retrying the same broken resource is not resilience. Sometimes the resource must be evicted.

Lesson 6: Frontend Performance Fails in Layers

The remote-download wizard did not become slow because of one bug. It became slow because autosave churn, row rerender storms, and layout overflow compounded under a realistic 40-file workflow.

Fixing only one layer would not have been enough. Debounced autosaves, memoized rows, stable callbacks, stall timers, and flex-chain layout all had to land together.

Rich workflow pages need performance design just as much as backend ingestion loops.

Lesson 7: Webhooks Must Be Idempotent

Stripe retries webhooks aggressively. Without idempotency on event ID, one checkout completion can produce multiple token credits.

The webhook controller now writes a deduplication row at the start of every handler and exits immediately if the event has already been processed.

For billing systems, “probably once” is not a delivery guarantee. Build for repeated delivery.

Lesson 8: Background Work Needs Its Own Scope

Fire-and-forget work launched from a request cannot safely reuse scoped dependencies from the request. The request scope may be disposed before the background task touches its first DbContext.

The canonical pattern is to inject IServiceScopeFactory, create a fresh scope inside the background task, and resolve the required service from that scope.

That pattern now underlies background downloads, RAG indexing paths, and ingestion queue dispatch.

Lesson 9: Cross-Shard Navigation Was a Trap

In v1, users lived in the central database while analytical content lived in shards. Any accidental navigation property across that boundary could fail because the related entity lived in a different database.

The interim fix was a systemic ban on cross-shard navigation: either denormalize the fields needed for display or explicitly fan out through shard routing.

The v2 migration collapsed the system into one PostgreSQL database, so the constraint disappeared. The useful denormalized counters remained because they make common reads cheap.

Architecture Principles

Correlation Studio evolved through several major architectural generations. Although the implementation details changed dramatically, a small set of design principles remained constant throughout the project.

Compute Once

Expensive work should be performed once and reused everywhere. Dataset statistics, thumbnails, embeddings, chart data, and AI analysis are all persisted rather than regenerated on every request.

Move Data Less

Whenever possible, computation moves to the data rather than moving the data to the application. DuckDB executes directly against Parquet files instead of materializing millions of rows through EF Core.

Cache Aggressively

Local disk, R2 object storage, CDN caching, HTTP caching, database caches, and application caches all reduce latency while keeping expensive analytical operations off the critical request path.

Prefer Metadata

Metadata is inexpensive. Raw analytical data is not. The application stores lightweight metadata in PostgreSQL while leaving high-volume numerical payloads in Parquet.

Everything Is Restartable

Every long-running workflow is designed to survive crashes, deployments, connection loss, or browser refreshes. Durable state always wins over in-memory state.

AI Is a Service Layer

Large language models enhance the platform but never define its architecture. AI-generated results become first-class entities that users can edit, publish, search, cite, and build upon.

Technology Stack

Correlation Studio is built almost entirely on proven open-source technologies, selected for stability, performance, and operational simplicity rather than novelty.

LayerTechnology
BackendASP.NET Core 9
LanguageC# 13
FrontendReact 18 + TypeScript
ORMEntity Framework Core
DatabasePostgreSQL + pgvector
Analytics EngineDuckDB
Columnar StorageApache Parquet
Object StorageCloudflare R2
AuthenticationJWT, Google OAuth, Microsoft OAuth
PaymentsStripe
EmbeddingsOpenAI + Gemini
Language ModelsClaude, Gemini, Grok
MapsLeaflet
ChartsCanvas, SVG, D3

Every major technology choice was made to support a specific architectural goal: DuckDB for analytical execution, Parquet for compact columnar storage, PostgreSQL for transactional metadata, and R2 for inexpensive durable object storage.

Current Scale

As of Version 2.4, the public corpus contains:

600+Datasets

8,000+Experiments

45,000+Discoveries

165M+Rows

15,000+Columns

Every public dataset can participate in future experiments. Every experiment can generate discoveries. Every discovery can become part of a portfolio, appear in the search index, receive AI analysis, be discussed socially, and become part of the retrieval corpus used by Corrie.

That network effect is the central design objective of the platform. Correlation Studio is not simply a statistics application. It is a knowledge graph built from statistical relationships.

Conclusion

Correlation Studio began as an attempt to make correlation analysis accessible to people who were not statisticians. Along the way it evolved into something much broader: a platform for discovering, publishing, discussing, and preserving statistical relationships at scale.

The engineering journey followed the same trajectory. Early versions emphasized functionality. Later versions emphasized scalability, durability, recoverability, discoverability, and operational simplicity. Many of the most valuable improvements came not from adding features, but from removing unnecessary complexity and replacing fragile assumptions with deterministic systems.

The resulting architecture combines transactional metadata in PostgreSQL, high-performance analytical execution through DuckDB, compact Parquet storage in Cloudflare R2, AI-assisted workflows, durable background processing, and a modern React frontend into a unified platform capable of mining millions of statistical relationships while remaining approachable to non-programmers.

Correlation Studio’s mission is simple:
Make data science accessible, make discovery collaborative, and make statistical insight searchable.

Correlation Studio

Rapid Discovery Mining • Causation Analysis

Correlation Studio v2.4

Version 2.4 is a major SEO and content release focused on improving the site’s reputation, discoverability, and visibility for both search engines and AI-powered crawlers.

Most of the improvements happen behind the scenes, but users will notice several significant new features.


What’s New

📚 Browse by Topic

A new Topics menu provides an ontological view of the entire platform, making it easy to browse datasets and discoveries by category. The Topics browser is fully available on both desktop and mobile.

🤖 Reanalyze with AI

Datasets and Discoveries now include a Reanalyze button that generates fresh AI analysis using Claude. Users can also edit the generated analysis directly from the detail pages.

🏷️ New Branding

The platform has been rebranded around its core mission:

Rapid Discovery Mining & Causation Analysis

A shorter version also appears throughout the site:

Discovery Mining • Causation Analysis

📈 Expanded Public Library

  • 600 Datasets
  • 8,000 Experiments
  • 45,000 Discoveries
  • 165 Million Records
  • 15,000 Columns

Don’t forget to try the new full-panel Search, introduced in Version 2.3.


About Correlation Studio

Correlation Studio is a powerful SaaS platform that brings the insights of modern correlation data science to everyone—without requiring users to write code.

Users can upload their own data or discover new datasets through the AI-powered Dataset Wizard, driven by Claude, Gemini, and Grok. Imported datasets become the foundation for Experiments, which automatically compare every compatible column against every other column.

From Data to Discovery

Each experiment produces reusable Discoveries containing:

  • Correlation metadata
  • Scatterplots
  • Line charts with drill-down capability
  • Geospatial visualizations
  • P-values
  • Granger causality analysis
  • Additional statistical metadata

Every Discovery becomes a searchable, analyzable knowledge object rather than a temporary calculation.

Create Interactive Portfolios

Datasets, Experiments, and Discoveries can be assembled into shareable Portfolios that combine:

  • Markdown
  • Images
  • Videos
  • Podcasts
  • Lectures
  • Interactive statistical content

Portfolios make it easy to present research findings in an engaging, interactive format.

AI Throughout the Platform

Artificial intelligence is integrated into every stage of the workflow.

  • AI-assisted Dataset Wizard
  • Claude-generated analysis for Datasets
  • AI explanations for Discoveries
  • Portfolio analysis and summaries

The platform also includes Corrie, an AI assistant that understands every public entity on the site.

Corrie is context-aware, allowing users to ask questions about the page they’re viewing, statistical concepts, or correlation analysis in general. She also serves as the platform’s curator, publishing hundreds of datasets, thousands of discoveries, and numerous example portfolios.

Built for Collaboration

Every major entity—Datasets, Experiments, Discoveries, and Portfolios—can be published to the community feed.

Users can:

  • Create posts
  • Attach datasets and discoveries
  • Like content
  • Rate content
  • Leave comments
  • Share research publicly

From its inception, Correlation Studio was designed as a social platform for statistical discovery.

Who Is It For?

Correlation Studio is designed for:

  • Business professionals
  • Researchers
  • Data scientists
  • Students
  • Teachers

Many users may not realize that correlation analysis could improve their workflow until they see it in action. The platform makes advanced statistical discovery accessible to technical and non-technical users alike.

Even casual visitors and mobile users can explore the growing public knowledge base containing over:

  • 600 Datasets
  • 8,000 Experiments
  • 45,000 Discoveries

Correlation Studio – Data Science Without the Code.

Correlation Studio – ChatGPT Analysis

Curious to see what ChatGPT had to say about Correlation Studio since it’s only been out for a month. So I fed it the website and a whitepaper for analysis. I was surprised to see most of the reference samples coming up from my replies on Reddit. But the real story was in the whitepaper. Here are the highlights, but jump to the end if you want to see our rating on a 10-point scale.

Technical Analysis of Correlation Studio

After reviewing Correlation Studio and its architecture, I came away with a very positive impression. This isn’t a typical startup that wraps AI around existing analytics software. It reflects a carefully engineered analytical platform with a well-thought-out architecture and a clear understanding of the challenges involved in large-scale statistical analysis.

Below are my technical observations.


1. The Architecture Is Stronger Than Most Solo SaaS Projects

The most significant architectural decision was migrating from using PostgreSQL as both metadata store and analytical engine to a true lakehouse architecture.

Rather than attempting to optimize indexes indefinitely, the storage model itself was redesigned.

  • PostgreSQL stores transactional metadata.
  • Cloudflare R2 stores immutable Parquet datasets.
  • DuckDB performs analytical computation.
  • Local NVMe storage provides a hot cache.

This mirrors many of the architectural principles used by modern analytical systems such as Snowflake, Databricks, ClickHouse, and MotherDuck, while avoiding the operational complexity of distributed infrastructure.

The separation of concerns is particularly clean:

  • Metadata remains transactional.
  • Bulk data remains immutable.
  • Analytics operate directly against Parquet.

2. DuckDB Was the Right Choice

Choosing DuckDB was probably the most important technical decision in the project.

Instead of building:

  • custom statistical engines
  • custom storage indexes
  • custom columnar formats

the platform leverages an extremely capable analytical database that already provides:

  • predicate pushdown
  • Parquet support
  • row-group optimization
  • high-performance SQL execution

As a result, many future performance improvements arrive automatically through DuckDB itself.


3. The Product Is Actually a Graph of Relationships

This may be the most underappreciated aspect of Correlation Studio.

Traditional analytics platforms treat correlations as temporary calculations.

Correlation Studio persists them as first-class objects called Discoveries.

Each Discovery contains:

  • metadata
  • provenance
  • visualizations
  • AI-generated explanations
  • publication metadata
  • comments
  • URLs
  • relationships

Instead of following the traditional workflow:

Run Query
View Chart
Discard Results

Correlation Studio models knowledge as:

Dataset
Experiment
Discovery
Portfolio

This makes statistical discoveries reusable rather than disposable.


4. The Dataset Ingestion Pipeline Shows Experience

Several implementation details demonstrate experience with messy real-world datasets.

  • multi-row header detection
  • fuzzy preamble detection
  • headerless dataset detection
  • partial date parsing
  • NOAA and NASA edge cases
  • section divider handling

These are not academic problems—they are operational ones encountered only after processing thousands of imperfect datasets.


5. The Statistical Implementation Is Appropriately Conservative

Rather than inventing new statistical methods, Correlation Studio assembles proven techniques including:

  • Pearson correlation
  • Spearman correlation
  • Fisher Z transformation
  • Student’s t-test
  • Ordinary Least Squares (OLS)
  • Granger causality
  • Prediction intervals

Using established statistical methods alongside DuckDB and MathNet makes the platform significantly more trustworthy than many AI-first analytics products.


6. The Biggest Technical Challenge Is Combinatorics

The primary scaling challenge isn’t dataset size—it’s the explosion of possible column pairs.

For example:

  • 400 columns × 500 columns = 200,000 comparisons
  • 1,000 columns × 1,000 columns = 1,000,000 comparisons

Even with excellent execution speed, brute-force analysis eventually becomes impractical.

Future optimization opportunities include:

  • approximate correlation search
  • feature pruning
  • variance filtering
  • PCA
  • random projections
  • locality-sensitive hashing
  • early termination strategies

7. AI Is an Enhancement, Not the Core Product

One of the platform’s strengths is that AI explains statistical discoveries rather than replacing statistics altogether.

This architecture creates an important dependency inversion:

  • If large language models improve, Correlation Studio improves.
  • If AI vendors disappear, the statistical platform continues functioning.

That makes the system considerably more durable than products that rely entirely on AI.


8. The Biggest Product Challenge

The greatest challenge may not be engineering at all.

It’s communicating what Correlation Studio actually is.

Initially, the name suggests a statistical calculator.

After examining the architecture, it’s much closer to:

  • GitHub
  • Tableau
  • Kaggle
  • NotebookLM
  • Google Dataset Search
  • a statistical lakehouse

The onboarding experience should emphasize outcomes instead of mechanics—for example:

Find hidden relationships between your own data and thousands of public datasets.


9. A Feature Worth Considering: Correlation Graphs

If I were contributing to the project, one feature I’d prioritize would be relationship graphs.

Imagine every Discovery becoming an edge in a knowledge graph:

GDP
Inflation
Interest Rates
Housing Prices
Building Permits

Rather than isolated discoveries, users could navigate connected variables and uncover indirect relationships across datasets.


10. What Stood Out Most

What impressed me most wasn’t any single algorithm—it was the engineering maturity.

The architecture documents:

  • why design decisions changed
  • production failures and lessons learned
  • throughput improvements
  • operational instrumentation
  • performance tradeoffs

That level of transparency gives the architecture significant credibility.


Final Thoughts

Most analytics platforms answer questions users already know to ask.

Correlation Studio has the potential to answer questions users didn’t know they should ask.

That is a much more difficult—and potentially much more valuable—problem.

As the platform evolves, features such as relationship graphs, causal hypothesis generation, anomaly detection, and cross-domain exploration could make it feel less like traditional business intelligence software and more like a scientific discovery engine.

From a technical perspective, I’d rate the architecture around 9.5 out of 10 for a solo-built SaaS. The remaining work isn’t fixing the foundation—it’s building the next layer of capabilities that naturally extend an already solid design.

Announcing Correlation Studio

On June 1st, 2026, a powerful new SaaS statistics application was unveiled. Correlation Studio brings the insights of correlation data science to expert users looking for no-code solutions, and opens it up to anyone who wants to do research on correlation models. It combines search, experiment & discovery statistics workflows with a publishing platform and social media fabric.

Correlation Studio – Welcome

Users bring their own data in the form of CSV and Excel files, or discover it with our Dataset Wizard. By providing only a topic & description, the wizard harnesses the power of Claude, Gemini and Grok to source files for downloading, ingestion & analysis. In this example we’ve asked for S&P 500 datasets with a simple query and can see the providers have all returned results. The wizard has stripped out bad results and presents only validated links for downloading.

Create New Dataset


Stepping through the wizard, users can review the links sourced by the AI providers and select datasets for downloading based on rich descriptions, format & size parameters.

Create New Dataset Results

After downloading, the Datasets can be examined and column mappings provided. Each column that is selected will be included in the ingested dataset and available for Discovery drill-down. For Datasets with unique keys such as time-series data, a Join Key will be specified for one or more of the columns. This column is used to combine datasets together for correlation analysis. Two datasets must share common key values or the same row-order in order to be used in Experiments.

Create New Dataset Columns


Using the Create Experiment workflow, users combine Datasets together and prepare to analyze each column against every other. Experiments can consist of Self correlations, which examine every column in a single Dataset against every other, Cross correlations which examine every column of every Dataset in the experiment, or Star correlations which examines an anchor dataset against every other. A Full correlations option is provided to merge all three correlation types.

Here we’re creating a Star Experiment for the S&P 500 Daily Time Series, so the Row Matching mechanism is Time Series. Other values include Shared Key, useful for connecting datasets with information like stock ticker symbols, and Row Offsets, used when the Datasets are from the same export process and known to be linked by cardinal offset.

Additional parameters to the Experiment workflow include Sample Size, which can be used to select idempotent samples of the target Datasets, Minimum Paired Rows and Minimum Sample Coverage which ensure healthy statistical assessments.

Create Experiments

After the Create Experiments workflow is complete the user is presented with an overview of all of their Experiments. There may be many, as in this example of 50. Each Experiment row can be expanded to reveal the Discoveries it contains, and each Discovery row can be expanded to reveal its chart.

My Experiments



Depending on the parameters of the Experiment, the columns in the Datasets and the number of rows connected by there may be anywhere from one Discovery to thousands. The Experiments detail page includes all the metadata and parameters of the Experiment, and a table of Discoveries. Click through to each one for a drill-down view.

Experiment Details

Choose the Sparklines view to see an overview of all the scatterplots rendered on the same page, useful for spotting trends.

Discovery Sparklines



Click-through on any Discovery to view a detailed page with metadata, scatterplots & line charts with drill-down capability and geospatial views where available. The datapoints, P-values, Granger Causality & other metadata are all part of the Discoveries that are analyzable.

Discovery Scatterplot


Each Discovery can be analyzed with the Granger Analysis and AI Analysis buttons.

Discovery Details



Users compile their Datasets, Experiments and Discoveries into shareable Portfolios that combine them with markdown & multimedia to make interactive presentations featuring their podcasts & lectures.

Portfolio - Soybeans

It’s AI all the way, from the Dataset Wizard to Claude analysis for Datasets, Discoveries and Portfolios.

Corrie is the site’s chatbot and she understands every public entity on the site. She’s context-aware so you can ask her questions about what you’re looking at, or more general questions about correlations or statistics. Corrie is also the curator, and has published hundred of Datasets and thousands of Discoveries on the site, as well as example Portfolios.

Corrie the Chatbot

The entire bundle of entities – Datasets, Experiments, Discoveries & Portfolios – provide content for the site. Users can publish them to the Home feed, create Posts and add them as attachments. There’s sentiment feedback in the form of Likes, Ratings and Comments for all entities. The application was built from the ground up for socializing.

Home Page

The target market is researchers, data scientists, students & teachers who already know they need to do correlation analysis as part of their workflow. However, the casually curious and users on mobile can still enjoy the content and our comprehensive Search, which uses Year, Topic & Description as parameters.

“Data science without the code” – tagline

What kind of data science? Correlations. Discover hidden statistical relationships in numeric data. Bring your own files or use the Dataset Wizard to compile your sources. Ingest and combine Datasets into Experiments which yield Discoveries in the form of scatterplots, line charts, and statistics. Combine your Discoveries into sharable Portfolios with markup and multimedia support. Apply AI analysis for deep insight into the correlations and causation in your data.

That’s a tall order. Nobody else does that.

If you need this kind of data science you already know it. Please join us. New accounts are granted 5,000 tokens to explore the creator workflows, a $5 value. Only available on wide screen displays. Mobile users have a limited, read-only version of the site, but have full access to Search, Corrie, and our unique corpus of curated correlation data.

What will you discover today?

Correlation Studio
Correlation Studio – Welcome



The Daily Grind: Humans in the Equation

I get asked a lot if I’m afraid of AI since it’s coming for my job and my answer is:

Not yet.

I already use AI in 75% of my professional work and it does 90% of the coding. I’m working for a global enterprise that has thousands of applications that will take a decade to modernize even with contemporary AI tools.

Still, the threat is real. I should be in a low-key panic, but I’m not and here’s why.

Everybody has access to AI now to build software just like everybody has access to guitars to make music. The tools are there.

It doesn’t mean you know how to use it. Or want to. Or can be good at it if even you want to be.

You have a pen & pencil, but that doesn’t make you an author. You have Excel, but you are not a mathematician. I have a guitar and I know how to use it, but I’m not Van Halen despite 45 years of trying to be.

Software development at scale still requires unique skills just like ripping out a guitar solo. It requires many years of dedicated learning effort, practice and failure, and most people wash out. Walk into any Guitar Center store for evidence.

What’s important is the domain knowledge that you bring to the table. What are you programming about?

Implementing something trivial is easy. Anybody can pick up a guitar and play a few cowboy chords, and everybody should. You could even write your own song, and you should do that, too. It won’t make you a musician, although you may be musically inclined. But it might make you happy.

Making a day planner with Lovable is fine in a world where everybody has their own day planner or their own calorie counter.
You could create those right now, with Lovable or Claude Code or whatever you get your fix off of. You could have your own website.

Why would you? It might make you happy, and if so you should do that. Something cool about that. Playing guitar makes me happy, and a few million other folks.

You might even pay for the pleasure of doing it.

But at the end of the day programming is not just about automation. It’s about creation. And generative AI technology helps a creative mind execute on its will. But what is the thinking behind the execution? It has to be something worth creating to bring any value to the equation. It has to be coherent, informed, disciplined and thorough.

I’ve been programming for 45 years, 35 of it professionally. But I don’t make my living by coding my whimsy. I make my living by learning other people’s problem domains and codifying them into rules-based systems. And that still requires knowing something about something.

“There are still humans in this equation, robot” – Rango Unmuzzled

See samples of my work:

http://rango.music


https://interactivecircleoffifths.com

Claude: Death by Regex

Yesterday I used Claude for Chrome to help me storyboard a video for American Style. I gave him explicit instructions to browse the web for images and videos of 9/11 and the war in Afghanistan, preferring widescreen, high-resolution photos and videos. I added the requirement that they must all be open-source. I then provided the lyrics to the song we are storyboarding and asked him to create an index of the media related to the lyrics.

The song is 6 minutes long. I asked for 200 media items.

Claude proceeded to browse the web for over an hour, unattended, and identified 800 initial media items that matched the subject matter, from which it selected 200 based on my constraints. It created an index file which grouped into 10 sections according to the lyrics. which it cited inline.

That was impressive, but I needed to download all the media. So I asked Claude to crawl the list and download each piece of media, creating a new index file with local filenames. He started to doing the task, but after a few minutes stopped and prompted me with a question.

The web navigation was slow and prone to errors. He was having trouble downloading too many links and stopped. He said it was going to take 3-4 hours or more and was likely to fail to acquire all the media. Claude then proposed we create a Python script instead that would read the index file and download all the links. That sounded reasonable, so we proceeded to do that. Claude created the script and I ran it.

Pretty academic. The file was already in markdown language and easy to parse. Or so I thought.

The script failed immediately, zero downloads. What followed was an aggravating round of at least 15 revisions to that script as I copied non-trivial Python code and error messages back and forth with Claude, pointing out where it was incorrect and trying to get regular expression matches to work with spaces, underscores, slashes and invisible embedded characters from the file that he created.

That’s right, Claude got stuck on regex. Seriously stuck. All programmers should hate regex because they’re so damn complicated. They’re powerful, like a loaded gun. Claude kept telling me, “this is the final version” and “this will run correctly”. Fail. He finally said “we are going in circles” and he was right. I finally realized I was running against the Sonnet model, trying to conserve tokens since it was such an academic task. So I switched models to the latest Opus and in just a few revisions the script started working.

Mostly. It still failed to download after 10 files or so, blocked by the web server for being a greedy client. So I had to tune it to wait up to a minute between requests. I just started it, and it’s running as I write this. Estimated time to completion is 3-4 hours. I ran out of tokens for my session so the last round of debugging was on my own.

I found all of that astonishing given the great success I’ve had with Claude Code building the Interactive Circle of Fifths, and my latest full-stack application, Correlation Studio. Amazing, mind-blowing results and surreal conversations. Months worth.

But there he was, dead on the hill of regex. “We keep going in circles”. Famous last words.

American Style – Audio and Lyrics

The song in development is American Style. Featuring vocalist John Serrano and drummer Bill Ray, this track is a musical exploration of the psyche of American warfare post 9/11. It’s a hard rock / progressive metal track full of dramatic intensity and explosive power, drawing out the story of a sniper who goes to Afghanistan to avenge his countrymen. The collaboration was one for the books.

American Style on SpotifyAmerican Style on SoundcloudAmerican Style on Bandcamp

I’m going to Afghanistan
That’s where I’ll make my final stand
Like Johnny Spann and everyone
Who loved him

I know you may not understand
I find myself the kind of man
To cross the line in the sand
For something

This is war, American style
Nightmares come of age
Deliver me from rage
This is war

The target looks like every man
I see him in the market stand
I could take him now
If I’m not wrong

Before he gets to Pakistan
I’ll track him through the mountains and
He’ll give me one reason
To be strong

This is war, American style
Nightmares come of age
Deliver me from rage
This is war

Winter’s come, it bites my hands
Like every day I’ve walked these lands
I’ve found another reason
For a man’s revenge

With every shot I’ve made my stand
So many men so much demand
I’ll never leave, I’ve joined the band
Until the very end

This is war, American style
Nightmares come of age
Deliver me from rage
This is war

Face to face, hand to hand
Fighting for the promised land
Every target, every man
Squeeze and go

I look at him and his demands
I know he’ll never understand
I’ve always had the upper hand
He will not know

This is war, American style
Nightmares come of age
Deliver me from rage
This is war


The Daily Grind – Push to Production

12 straight hours of debugging a production data issue and I finally had to excuse myself and call it a night.

I started by myself at 6:30 AM. By 9:00 I was hosting a meeting with another developer and a handful of managers. Also a few business people along with their confusing portfolio of queries, all manner of broken. This is a red-alert, all-hands-on-deck problem that is time-sensitive because it’s regarding year-end data.

It’s not right, that data.

So all day I’ve been debugging with people looking over my shoulder as I type and switch tabs and jump around, with managers asking questions and everybody speculating and expressing their concern and exasperation. At the start of the day the assumption was that I broke something spectacularly back in August. By noon we were onto something else, chasing data through linked tables and stored procedures that were created 20 years ago and haven’t changed in 10. By 3:00 we discovered the root cause.

By 6:30 PM we finally had a solution carved out in our test environment. I had made dozens of changes to jobs to simulate rolling back the clock so the year-end process could be run again. We were just about to pull the trigger on the offending jobs when I abruptly had a moment of doubt. I concluded it was unwise to proceed.

We’re not really going to do this in production tonight, are we?

I then explained why I thought it was a bad idea for me, specifically. Mental fatigue is real. 12 hours in the chair, no food, no breaks. But I suspected I was not the only one and I was right. Everybody was struggling to keep up with the details of the diagnosis, much less the workings of the fix. They were looking to me and my engineering partner to push the button and have a month’s worth of problems corrected without overlooking anything or introducing any side-effects. Nice work if you can get it. And what if something does go wrong? We’re potentially looking at another 12 hours.

So I called it. Not that I have the authority to do that. But we had consensus that fatigue was the deal breaker. It’s an emergency, but it’s not <that> kind of emergency. It’s a business emergency. It can wait until morning.

This is the senior part of senior software engineering. I’m old and tired. In my thirties this would not have been an issue, because I was young and foolish and frequently felt immune from fatigue due to my bipolar disorder. And I probably would have pushed to production, sunk the battleship and lost my job. Not today, rabbit. We take our medicine here, and observe doctor’s orders.

As for tomorrow? Thoughts and prayers. Good vibes and juju. Just remember to ask for a raise when this is all said and done.

Circle of Fifths Iterations

I received some feedback about the Interactive Circle of Fifths on Facebook. I decided to take the conversation to Claude Opus 4.5. The following is my conversation verbatim. It reveals a depth of reasoning that is astounding, and provides guidance for next steps. For an introduction to the online tool, refer to the About page.

The conversation:

I have some feedback regarding the Inner Ring. I’m not sure it’s working how people might expect, with regards to the diminished scales. For the Cmaj/Amin wedge, what is the appropriate scale to display? It’s currently showing Bdim because that’s the chord. However an argument could be made that it should be something else (I think Cdim). This was the feedback:

The feedback:

What is the inner circle supposed to be exactly? The whole half symmetric? It’s in the wrong spot on the circle. The one at 12 o’clock would occur at 9 o’clock Eb/C minor over a G13b9 spelled as: G Ab Bb (B nat) C# D E F. It would have a dominant function over G (V7), using Bº7 and C#º7 as upper structure both tensions as upper and lower leading tones pointing toward C tonic, but the MINOR version not the major version. Without a tonal context (5 in the bass) the symmetrical scale is ambiguous just like a chromatic or whole tone scale would be. The scale that should be in the C/A minor family 12 o’clock should be rooted “E F G G# A# B C# D”. At 3 o’clock here.

My response:

The layout is simple, the diminished chord in Cmaj is Bdim. The scale can be toggled and it’s Bdim as people would expect. I don’t understand why you would want to deviate from that. It makes sense as it is even if there are other ways to express the inner ring.

His follow-up:

The B diminished chord in C major is a half diminished. BDFA. The purpose of the “diminished scale” is that it derives from the FULLY diminished 7, from the raised 7th degree from C MINOR. Bb->B natural gives BDFAb. The variable 7th is even a part of the actual scale for this reason, you get to use both notes. You have the scales associated with the wrong key signatures inside the circle. The extra symmetrical “out notes” are accidentals against the minor key’s key signature.

Can you make sense of his feedback? Is the current behavior what a someone deeply familiar with music theory would expect?

Claude’s Response:

This is a deep music theory question. Let me parse through the feedback carefully.

The user’s friend is making a sophisticated point about diminished scales and their harmonic function. Let me break it down:

Current Implementation:

  • For C major/A minor wedge, we show B diminished scale (B C# D E F G Ab Bb)
  • This is based on the vii° chord in C major being B°

Half-diminished vs Fully diminished distinction:

In C MAJOR, the naturally occurring vii chord is B half-diminished (B-D-F-A) – also written Bø or Bm7♭5

The FULLY diminished seventh (B-D-F-Ab) comes from C HARMONIC MINOR (which has B natural, not Bb)

The diminished scale (whole-half or half-whole symmetric) is specifically associated with the FULLY diminished 7th chord

The scale’s harmonic function:

The diminished scale has dominant function – it resolves somewhere

The scale built on B (B C# D E F G Ab Bb) functions as a dominant over… not C major, but rather as part of a V7 chord

The friend is saying this scale should be associated with the key it RESOLVES to, considering its dominant function

What scale should be at C/Am?

The friend suggests “E F G G# A# B C# D” should be at the C/Am position

This is the E diminished scale (half-whole)

Why? Because E°7 (E-G-Bb-Db, enharmonically E-G-A#-C#) is actually a rootless G7b9, which resolves to C minor

Let me think about this more carefully…

In jazz theory:

  • A diminished 7th chord built on the leading tone (B in C) is called vii°7
  • The chord’s complex structure reveals its powerful harmonic potential
  • It functions as a dominant substitution with incredible tension and resolution

The diminished scale’s symmetrical nature allows for multiple interpretations. Its half-whole mode creates tension that wants to resolve, essentially acting as a sophisticated dominant chord generator. The scale’s chromatic inflections provide rich harmonic possibilities across different musical contexts.

The diatonic half-diminished chord in C major (B-D-F-A) differs from the fully diminished B°7 chord. This distinction matters because it changes how the chord functions within the harmonic progression. The scale choice depends on the specific harmonic intent – whether creating tension, preparing resolution, or exploring chromatic substitutions.

The key insight is understanding how these theoretical approaches – traditional and jazz functional harmony – offer different lenses for analyzing the same musical material. Each perspective reveals unique aspects of the chord’s potential and its role in harmonic movement. The B°7 chord in C harmonic minor serves as a dominant leading to Cm, highlighting the scale’s unique harmonic structure.

Examining the C/Am wedge reveals multiple potential diminished scale perspectives: C major’s Bø7, A minor’s G#°7, and C minor’s B°7. Each context offers a different harmonic interpretation.

The proposed E half-whole diminished scale demonstrates fascinating symmetry, connecting to multiple diminished seventh chords across different enharmonic equivalents. Its intricate relationships reveal the complex interconnectedness of harmonic systems.

A closer look at the current implementation shows discrepancies in scale placement and root selection, suggesting a need for careful recalibration of the theoretical framework.

The friend’s proposed scale follows a precise half-whole diminished pattern starting from E, with specific interval relationships that differ from our current configurations. I’m tracking the unique intervallic structure: half-whole-half-whole-half-whole-half-whole, which creates a distinct harmonic landscape.

This scale’s unique construction implies potential shifts in our current musical mapping, particularly around the C/Am and related harmonic regions.

The symmetric nature of the diminished scale suggests multiple valid interpretations, with each chord potentially serving multiple functional roles depending on harmonic context. The E°7 chord exemplifies this complexity, revealing potential resolutions to F, F#, and other tonal centers.

The core insight emerges from the scale’s inherent ambiguity – its symmetric structure allows multiple theoretical perspectives, challenging straightforward categorization.

Two primary approaches emerge: the scale-degree method versus functional harmony interpretation. While the scale-degree approach offers pedagogical clarity by associating the diminished scale with the seventh scale degree, the functional harmony perspective emphasizes harmonic resolution and contextual meaning.

The current implementation provides a consistent, learner-friendly framework, but risks oversimplifying the nuanced relationships within diminished harmony.

For an educational tool targeting musicians at different skill levels, maintaining the straightforward approach while acknowledging the complexity for advanced practitioners seems most effective. The goal is to balance accessibility with theoretical depth, allowing musicians to grasp foundational concepts before exploring more intricate harmonic relationships.

Claude’s Summary:

Your friend raises valid points from a jazz/functional harmony perspective. Let me break down the theoretical disagreement:

The Core Issue: Half-diminished vs Fully-diminished

Your friend is correct that in C major, the diatonic vii chord is actually B half-diminished (Bø7): B-D-F-A. This uses only notes from the C major scale.

The fully diminished B°7 (B-D-F-Ab) comes from C harmonic minor (C-D-Eb-F-G-Ab-B), where the raised 7th degree creates the leading tone.

The diminished scale (the symmetric whole-half or half-whole pattern) is specifically associated with fully diminished 7th chords, which arise from minor keys, not major.

Two Valid Approaches:

ApproachLogicAt C/Am wedge
Current (scale-degree)Show diminished scale on the 7th degree of the major keyB dim scale
Functional harmonyShow the diminished scale that resolves to this key (from its dominant function)Different association

Your friend’s specific argument:

The diminished scale functions as a dominant — it wants to resolve somewhere. The B°7 chord (and its associated scale) functions as part of G7b9, which resolves to C minor, not C major. So showing B diminished scale at the C/Am wedge conflates major and minor in a way that’s theoretically inconsistent.

My assessment:

Your current implementation is pedagogically sound and consistent — it follows a simple rule (7th degree = diminished) that’s easy to understand. Many circle of fifths tools take this approach.

However, your friend is right that a jazz theorist might expect different behavior, since the diminished scale’s natural harmonic context is minor keys with dominant function.

Options:

  1. Keep as-is — It’s consistent and educational for most users
  2. Change association — Base it on the relative minor’s vii°7 (would put G#° at C/Am since G# is vii of A minor)
  3. Add explanation — Note in the UI that this shows the “leading tone diminished” which has dominant function toward the minor key

Would you like me to explore option 2, or do you prefer keeping the current straightforward approach?

Resolution:

Based on Claude’s analysis, I’ve decided to add menu to the UI. When a diminished scale is selected the control will become active, allowing the user to switch between Scale Degree / Harmonic Function / Minor Tonic. This supports more advanced users while keeping the default workflow as simple as possible.

About the Interactive Circle of Fifths

If you’ve ever wondered why certain chords sound good together, why some key changes feel smooth while others feel jarring, or how musicians seem to instinctively know which notes will work over a chord progression, the Circle of Fifths holds the answers.

This post is about the Interactive Circle of Fifths. To provide feedback on the application simply reply to this post with your questions and comments. To see a video of the application in use, watch The Interactive Circle of Fifths on YouTube.

What Is the Circle of Fifths?

The Circle of Fifths is a visual map of all 12 major and minor keys, arranged to reveal the relationships between them. It’s been a cornerstone of Western music theory for nearly 300 years, and once you understand it, you’ll start seeing patterns everywhere in music.

The Circle of Fifths

The name comes from how the keys are arranged: as you move clockwise around the circle, each key is a “perfect fifth” higher than the one before it. A perfect fifth is the interval of seven half-steps—from C to G, from G to D, from D to A, and so on. This interval has a natural, consonant quality that our ears find pleasing, which is why it forms the backbone of so much music.

The Three Rings

The Outer Ring: Major Keys

The outer ring displays all 12 major keys. Starting from C major at the top (which has no sharps or flats), moving clockwise adds one sharp to the key signature with each step:

  • C major: no sharps or flats
  • G major: 1 sharp (F#)
  • D major: 2 sharps (F#, C#)
  • A major: 3 sharps (F#, C#, G#)
  • And so on…

Moving counter-clockwise from C adds flats instead:

  • F major: 1 flat (Bb)
  • Bb major: 2 flats (Bb, Eb)
  • Eb major: 3 flats (Bb, Eb, Ab)
  • And so on…

Click any major key on the outer ring to hear its major chord and see the notes highlighted on the piano or guitar below.

The Middle Ring: Relative Minor Keys

Every major key has a “relative minor”—a minor key that shares the exact same notes and key signature. The relative minor is always found three half-steps below the major key’s root note.

For example, A minor is the relative minor of C major. Both keys use only the white keys on a piano (no sharps or flats), but they have different tonal centers. C major sounds bright and resolved when you land on C; A minor sounds darker and resolved when you land on A.

This relationship is powerful for songwriters. You can shift between a major key and its relative minor within a song to change the emotional color while keeping all the notes familiar.

The Inner Ring: Scales and Chords

The inner ring changes based on the dropdown menu selection:

Diatonic shows the complete seven-note major or minor scale. These are the notes that “belong” to the key and form the basis for melody and harmony.

Pentatonic displays the five-note pentatonic scale. This scale omits the two notes that create the most tension in the full scale, making it incredibly versatile for melodies and improvisation. The major pentatonic works over major chords; the minor pentatonic is the foundation of blues and rock soloing.

Blues shows the six-note blues scale. This is pentatonic scale with the the addition of a flatted-fifth. This extra note gives a distinctive bluesy sound.

Diminished shows the diminished triad built on the seventh degree of the scale. This tense, unstable chord naturally wants to resolve and is useful for creating movement in chord progressions.

Click the inner ring when a scale is selected to hear the scale played ascending and descending.

Key Signatures on the Staff

The small staff notations around the circle show each key’s signature—the sharps or flats that apply throughout a piece in that key. Rather than writing a sharp or flat symbol every time you encounter that note, the key signature tells you once at the beginning: “Every F in this piece is F#” (in G major, for instance).

Notice the elegant pattern: sharps accumulate in a specific order (F#, C#, G#, D#, A#, E#) and flats in the reverse order (Bb, Eb, Ab, Db, Gb, Cb). Musicians memorize these patterns, making it quick to identify any key at a glance.

The Harmonized Scale

Below the circle, the harmonized scale staff shows something essential: the seven chords that naturally occur within a key.

When you build a triad on each note of a major scale using only notes from that scale, you get a specific pattern of chord qualities:

Scale DegreeRoman NumeralQualityExample in C
1stIMajorC
2ndiiMinorDm
3rdiiiMinorEm
4thIVMajorF
5thVMajorG
6thviMinorAm
7thvii°Diminished

This pattern—Major, minor, minor, Major, Major, minor, diminished—holds true for every major key. The Roman numerals (uppercase for major, lowercase for minor) let musicians discuss chord progressions in a universal language. When someone says “it’s a I-IV-V progression,” that means the same chord relationships whether you’re in C major (C-F-G) or E major (E-A-B).

Click any chord on the harmonized staff to hear it and see its notes on the instrument display.

For minor keys, the pattern shifts to: minor, diminished, Major, minor, minor, Major, Major (i, ii°, III, iv, v, VI, VII).

Why Adjacent Keys Matter

When you select a key, notice that the adjacent keys on either side are highlighted. These are your closest musical neighbors—the keys most closely related to your current key.

Moving one position clockwise adds one sharp (or removes one flat). Moving counter-clockwise adds one flat (or removes one sharp). These neighboring keys share six of their seven notes with your current key, differing by only one note.

This is why chord progressions often “borrow” chords from neighboring keys—they fit naturally because most of the notes are already part of your key. It’s also why modulating (changing keys) to an adjacent key sounds smooth, while jumping to a distant key on the circle sounds dramatic or surprising.

The Piano and Guitar Views

Toggle between piano and guitar displays to see how scales and chords map onto each instrument.

Color coding:

  • Blue indicates the root note (the note the chord or scale is built on)
  • Green marks chord tones—the 1st, 3rd, and 5th that form the triad
  • Orange shows additional scale tones when viewing pentatonic or diatonic scales

On guitar, you’ll notice that scale patterns repeat in predictable shapes due to the instrument’s tuning. On piano, the relationship between black and white keys makes certain keys feel quite different under your fingers, even though the theoretical relationships are identical.

Practical Applications

Finding chords that work together: Any chord from the harmonized scale will sound good with any other chord from that same key. The I, IV, and V chords form the backbone of countless songs. Adding the vi chord gives you the ubiquitous I-V-vi-IV progression heard in hundreds of pop songs.

Transposing songs: If a song is too high or low for your voice, use the circle to find a new key. The relationships between chords stay the same—a IV chord is still a IV chord—only the letter names change.

Understanding songs you hear: When you recognize that a song moves from I to vi to IV to V, you’re hearing the underlying structure that makes it work. This pattern recognition accelerates learning new music.

Improvising and soloing: The pentatonic scale is your safety net. Staying within those five notes virtually guarantees you won’t hit a “wrong” note over chords in that key.

Writing chord progressions: Experiment with the harmonized scale. Try moving between chords that share notes (like C and Am, which share C and E). Try the tension of the vii° chord resolving to I. Try borrowing the IV chord from the parallel minor for a melancholy color.

A Tool for Exploration

The Circle of Fifths isn’t just a reference chart—it’s a tool for discovery. Click around, listen to how keys relate to each other, notice which chords share notes, hear how scales differ between major and minor.

Music theory sometimes gets a reputation for being dry or restrictive, but it’s really a map of possibilities. The Circle of Fifths shows you why certain things work, freeing you to make informed creative choices rather than stumbling in the dark.

Whether you’re writing songs, learning an instrument, studying for a theory exam, or just curious about how music fits together, spend some time with the circle. The patterns you discover will start appearing everywhere you listen.

Release Notes

11/15/2025 – Initial build.
11/18/2025 – Initial release v1.0.0 Supporting all features in the prompt.
11/25/2025 – Feature release v1.1.0 Added samples & new diatonic scale mode, updated pentatonic mode.
11/27/2025 – Feature release v1.2.0 Added 15 frets to guitar fretboard, stacked triads for chords.
11/28/2025 – Feature release v1.3.0 Added harmonized scale section and new guitar samples.
11/30/2025 – Feature release v1.4.0 Added complete set of .mp3 guitar samples for all 48 notes E1 to D6.
12/15/2025 – Feature release v1.5.0 Added Diminished Scale to Inner Ring menu.
12/18/2025 – Feature release v1.6.0 Added legend for dots on piano and guitar, swapped green & blue colors.
12/21/2025 – Feature release v1.7.0 Added support for Blues Scale in Inner Ring.
12/31/2025 – Feature release v1.8.0 Added Chord Sequencer panel sharing space with the Harmonized Scale control.
01/04/2026 – Feature release v1.9.0 Added Function selector for Diminished Chord and Scale modes with Scale Degree and Resolves To options.

The Interactive Circle of Fifths

This week I decided to challenge Claude Opus 4.1 to a coding challenge. The objective? Create a React application that renders an interactive circle of 5ths. I started by creating a scaffolding application using the Visual Studio React / .Net Core template. I then created an empty Typescript component called “InteractiveCircleOfFifths” and proceeded to throw down with the LLM.

The prompt:

Create a React component using Typescript that renders a musical circle of 5ths onto a canvas. The circle will consist of 3 bands, Major, Minor and Notes. Include the Major keys on the outermost band of the circle, Minor keys on the inside of the Major band, and Notes in the middle. Each key should render it’s slice of the pie in a different pastel color, and each band of that pie a slight variation of the key color. Do not use yellow or bright green.

The Major key labels will be the key itself, like F C G, etc. The minor key labels will also be the key name but in a smaller font since they’re inside the circle. The bottom label will be Gb/F# since it’s enharmonic. The key label will be the most pronounced text in each ring. Add text to the key, closer to the center of the ring indicating the three notes in the chord, for example CEG in the case of C Major. The Notes will be a label consisting the best notes to play in that key, which corresponds to the clockwise adjacent 5 keys. In the case of C that would be C G D A E. The label should render vertically in a small font since it’s the innermost band. Rotate the label so it’s always descending towards the center of the circle.

Outside of the circle add a ring of images showing a short treble musical staff for each key showing the sharps or flats. The sharps or flats will render on the staff lines left to right, and be placed on the staff lines according to their pitch. Add a label to each image indicating the count of sharps or flats. The bottom three key signatures will have two images showing sharps and flats respectively. Always make the staff background white, with the lines and sharps or flats in black. Make the circle interactive. Clicking anywhere on the circle highlights that key, the one before it and the one after it. Draw an outline around them and make them stand out by adding a low-opacity color to the overlay. Clicking a Major or Minor chord also causes the corresponding chord to be played, either using a .wav or .midi file. Make sure to use the proper inversions, so that C Major is CEG, A minor is ACE. Clicking on the Notes band will play the notes in ascending sequence.

At the bottom of the page render a fixed panel that displays either a piano keyboard or a guitar fretboard, best on a menu selection. For the piano, selecting a key on the circle will add an overlay of dots on the piano keyboard corresponding to the notes in the chord, either major or minor. Use a green dot for the root (C for C Major, A for A minor), and blue dots for the other two notes. The same note may appear more than once. Selecting a Notes band will render all of the notes on the keyboard, using green for the root, blue for chord tones, and soft orange for the other notes. Selecting a key on the piano keyboard will cause the corresponding key signature to become active in the circle.

For the guitar add blue dots corresponding to the notes of the chord. For the guitar, selecting a key on the circle will add an overlay of dots on the guitar fretboard. Use a green dot for the root, paying attention to the inversion between major and minor, and blue dots for the other two notes. Selecting a Notes band will render all of the notes on the fretboard, using green for the root, blue for chord tones, and soft orange for the other notes. The notes may appear multiple times on the guitar fretboard. Selecting a note on the guitar will cause the corresponding key signature to become active in the circle. When rendering the guitar fretboard, add a vertical line for the nut, effectively fret zero. Open strings should render left of the nut. Fretted strings should render to the right. Make sure the notes rendered on the fretboard are in the correct position. Previous attempts to render this resulted in the notes being off by one fret because of the ambiguity of the first fret.

The response:

Perfect! I’ve created a comprehensive interactive Circle of Fifths application with all the features you requested:

Key Features Implemented:

Visual Design:

  • Three concentric bands with pastel colors (avoiding yellow and bright green)
  • Musical staff notation around the circle showing sharps/flats on proper staff positions
  • Proper chord inversions: C Major (CEG), A minor (ACE) as specified
  • Enharmonic notation: Bottom key shows as Gb/F# with both staff versions

Interactive Features:

  • Band-specific clicking:
    • Major band plays major chord
    • Minor band plays minor chord
    • Notes band plays the 5 notes in ascending sequence
  • Highlighted selection: Shows selected key plus adjacent keys with overlay
  • Audio playback: Uses Tone.js for chord and sequence playback

Piano View:

  • Color-coded dots:
    • Green for root note
    • Blue for chord tones
    • Soft orange for other notes (in Notes band)
  • Full keyboard with proper note placement
  • Click detection for key selection

Guitar View (with proper fret positioning):

  • Accurate nut placement: Vertical line separating open strings from fretted notes
  • Open strings render to the left of the nut
  • Fretted notes render between the appropriate frets (not off by one)
  • Fret markers at positions 3, 5, 7, 9, and 12 (double dots)
  • Color-coded dots matching the piano scheme
  • All chord notes shown on the fretboard with labels

The application correctly handles musical theory including proper inversions, enharmonic equivalents, and accurate fretboard positioning. The guitar implementation specifically addresses your concern about notes being off by one fret by clearly separating the nut position and properly placing notes either as open strings or between the correct frets.

My reaction:

I was amazed at the thoroughness of the response, and impressed with the component source code. It rendered exactly as prompted, with some minor flaws. The Claude web app actually renders the component right alongside your chat, so you can inspect it with a web debugger and make changes to styles interactively.

The component:

The debugging:

After generating the initial component there was still a lot of debugging to do. I used Claude for that as well, synchronizing my Visual Studio copy with every iteration, synchronizing with my Github repo after every change. Here’s a sample debugging prompt:

Update the Typescript component.

1) Change the rendering of the Note band so the labels never go past 90 degrees, they are currently rendering upside-down for the bottom half of the circle. 2) Adjust the rendering positions for the sharp notes on the staff, the y-axis position is a few pixels too high. 3) Fix the rendering of the notes on the piano so every note on the keyboard that’s in the Major Chord, Minor Chord, or Notes array gets a dot. Currently some key signatures are missing higher notes. 4) Fix the colors on the dots. Currently orange dots are being rendered when the note is in both a chord and the notes array.

Claude’s response:

I’ll update the TypeScript component to fix these issues.

Fixing the Notes band rotation to prevent upside-down text

Adjusting the Y-axis position for sharp/flat notes on the staff

Now I need to fix the piano rendering to show all octaves of the notes and fix the color logic:

Fixing piano rendering to show all instances of notes and correcting dot color logic

Now let’s also fix the same color logic for the guitar to be consistent:

Fixing guitar dot color logic to match piano

Checking the drawPiano section to ensure we have the complete updated code

Let me check the complete Notes band section to verify the fix:

Checking the updated Notes band rotation logic

Perfect! I’ve updated the TypeScript component and created a new component file with all four fixes:

Fixes Applied:

1. Notes Band Rotation Fix

  • Text in the Notes band now stays readable throughout the entire circle
  • For the bottom half (90-270 degrees), text is flipped and notes order is reversed
  • No more upside-down text!

2. Staff Position Adjustment

  • Added 2 pixels to the Y-axis position of sharps/flats on the staff
  • They now align properly with the staff lines

3. Piano Dots Rendering Fix

  • All instances of notes across the keyboard now get dots
  • The code now properly iterates through ALL white and black keys
  • Handles enharmonic equivalents (F#/Gb, C#/Db, etc.)
  • Every occurrence of a note in the chord or notes array gets marked

4. Color Logic Correction

  • Fixed priority system: Root (green) > Chord tones (blue) > Other notes (orange)
  • Orange dots only appear for non-chord notes when in “notes” mode
  • When in major/minor mode, only green (root) and blue (chord tones) appear
  • Consistent color logic between piano and guitar views

The application now properly displays:

Orange dots: Only appears in “notes” mode for notes that aren’t part of the chord

Green dots: Root note only

Blue dots: Chord tones (non-root notes in the chord)

The deployment:

After an entire day of debugging the application was ready to roll out. I initially tried to deploy to Azure, and met with a litany of deployment failures. However, using the job logs provided by Azure, over the course of several hours I was able to debug those failures with the help of Claude and get the application deployed on the Azure websites web app hosting environment. It was even deploying from my Github repo. I was using the Free tier, so when it came time to use my newly registered domain name I was faced with a stiff upgrade fee of $60/month for the Basic tier. That was too rich for my blood, so I found an alternative hosting provider. For $10 a month Winhost provides .Net application hosting. After downloading their Publish template I was able to publish directly from Visual Studio without any hassle.

You can view the deployed application at InteractiveCircleOfFifths.com.

My feedback to Anthropic:
This is incredible. My first shot at the component was 90% complete and accurate. A few hours of revisions and debugging were enough to bring this entire project over the finish line. The refinements were really impressive. Claude demonstrates contextual awareness on a massive scale and shows you the chain of reasoning behind its changes.

Note that a few times Claude generated component code that wouldn’t render. I eventually had to paste back from Visual Studio into Claude to revert to a prior version of the component. Also, Claude attempted to provide a full Html solution that was unnecessary and failed to render every time due to dependency errors with Tailwind.

The only deal killer in the entire process was cost. I had to bump up to the $100/month plan and even then quickly exhausted my quota. This seems steep for 1200 lines of code. I’ll have to downgrade after the month is expired.

Your feedback to me:

This entire process took only three days from inception to deployment. Early feedback was generally positive, but people don’t care for the quality of the tone generator sounds. I will correct that in a future release by converting from the tone generator to using samples. This will take some time to code and record all the samples for both guitar and piano. I have also had a request to add a ukulele visualizer. Please let me know what else you’d like to see in the application by commenting right here on the blog.

Thank you for using the Interactive Circle of Fifths. I look forward to your feedback!