Building a statistical discovery engine with DuckDB, Parquet, Cloudflare R2, PostgreSQL, and AI-assisted causation analysis.
Correlation Studio is a vertically integrated correlation-analysis SaaS built to mine relationships across datasets, surface statistically meaningful findings as persistent knowledge objects, and make advanced data science usable without code.
Most analytics software answers questions users already know to ask. Correlation Studio was built for a different problem: discovering relationships users did not know were there.
The premise is simple but technically demanding. A user uploads or discovers tabular datasets, pairs them into experiments, and the system exhaustively compares numeric columns against one another. When a relationship clears the configured threshold, Correlation Studio promotes it into a first-class object called a Discovery.
A Discovery is not a temporary chart. It is a persistent, publishable, analyzable artifact containing statistical metadata, visualizations, regression overlays, confidence intervals, Granger-causality testing, source-row drilldowns, and AI-generated interpretation.
A one-person team armed with the right primitives for discovery mining and causation analysis can deliver useful analytical insight faster than a larger team using less integrated tools.
That belief shaped every architectural decision. Correlation Studio favors simplicity over optionality: one database, one app server, one object storage account, one embedded analytical engine, and one consistent set of product abstractions.
Whenever the system had to choose between a flexible-but-complex pattern and a rigid-but-simple one, the simpler design won. Per-user database shards were collapsed into a single PostgreSQL database. Bulk analytical rows were moved out of PostgreSQL and into Parquet. In-row binary payloads were replaced with per-discovery columnar files. Multiple state flags were consolidated into simpler state machines.
The result is a product that is technically ambitious but operationally small enough to run as an independent SaaS.
Correlation Studio was designed around a severe constraint: it needed to be sustainable before it became large.
The free tier is genuinely free. New users receive 10 GB of storage and can browse the public corpus without a recurring charge. Revenue comes from token packs that fund ingestion, correlation work, AI search, AI analysis, and chatbot usage.
At roughly $5 per month of token spend per actively engaged user, about 60 engaged users cover the platform’s operating run rate: approximately $300 per month for two VPS instances, Cloudflare, R2 object storage, and the domain.
At larger scale, the infrastructure cost profile stays intentionally flat. Object storage is metered by byte with zero egress through Cloudflare R2. Compute grows in fixed steps. AI API spend is the primary engagement-scaled cost, and the token margin exists specifically to cover it.
The system is not only optimized for query speed. It is optimized so the business can remain alive while the corpus, audience, and go-to-market motion are still forming.
Correlation Studio was built solo over roughly five months. Commit zero landed on December 29, 2025. Around 570 commits later, the project had become a production SaaS.
It did not get there in a straight line.
The original v1 architecture used a central PostgreSQL database plus per-user content shards. Parsed source data lived in relational tables named Rows, KeyIndex, and ValueIndex.
That design worked until realistic ingestion traffic arrived: stress tests designed to break it.
Wide datasets generated enormous write amplification. A CSV with thousands of columns and millions of rows could produce hundreds of millions of B-tree inserts. On SATA-backed shards, that turned into IOPS queue starvation. Under realistic ingestion load, the system collapsed at about 10 concurrent active jobs.
The fix was not a tuning pass.
It was a clean-room replacement of the analytical substrate.
The v2 Lakehouse migration, completed on May 17, 2026, deleted all content sharding, retired the bulk-data PostgreSQL tables, moved source data into columnar Parquet files on Cloudflare R2, and replaced row-range index scans with DuckDB queries against Parquet.
Roughly 5,000 lines of C# disappeared in the cutover. More importantly, the frontend never noticed. Dataset, experiment, and discovery API contracts survived the migration intact.
v1: PostgreSQL as data lakeUpload ↓Rows / KeyIndex / ValueIndex ↓B-tree write amplification ↓IOPS starvation under ingestion load
The original architecture failed because row-oriented storage was doing analytical storage’s job.
v2: Lakehouse substrateUpload ↓Parquet on Cloudflare R2 ↓Local SSD hot cache ↓DuckDB analytical queries ↓PostgreSQL metadata only
The v2 architecture separates metadata, bulk storage, and analytical execution.
Correlation Studio’s current architecture is deliberately compact. It uses a single PostgreSQL database for metadata, Cloudflare R2 for immutable bulk data, DuckDB for analytical execution, and a .NET application server to coordinate ingestion, statistics, AI workflows, billing, and the public web app.
Stores users, billing, datasets, experiments, discoveries, jobs, audit records, and RAG vectors through pgvector.
Stores bulk row data as columnar Parquet files: one Parquet file per dataset and one per discovery datapoint payload.
Runs embedded inside the .NET app server and queries Parquet through a local SSD hot cache or R2’s S3-compatible interface.
Provides the ASP.NET Core backend in a three-project solution: Core, Infrastructure, and Api.
Powers the SPA with Vite, canvas-rendered charts, route-level code splitting, React Query, and Zustand.
Claude, Gemini, and Grok run in parallel for dataset search, AI analysis, and Corrie, the in-app RAG chatbot.
The architecture is not cloud-native in the fashionable sense. It is not a swarm of microservices. It is not built around Kubernetes. It is a vertically integrated system with a small number of moving parts, each chosen because it eliminates an entire class of complexity.
Correlation Studio turns datasets into reusable statistical knowledge. The user workflow is intentionally linear: create data, combine data, discover relationships, explain relationships, and publish the results.
Most analytics tools treat correlations as temporary calculations. A user runs a query, sees a chart, and the result disappears unless someone manually saves it.
Correlation Studio uses a different abstraction. It treats each meaningful relationship as an entity.
Dataset ↓Experiment ↓Discovery ↓Portfolio
The core entity chain turns raw data into reusable, publishable knowledge.
That model matters because it allows discoveries to accumulate. A Discovery can be visualized, analyzed, commented on, published, embedded, rated, searched, indexed for retrieval, and assembled into a portfolio.
The result is not just an analytics app. It is a corpus of statistical relationships.
Correlation Studio prices expensive operations through tokens rather than forcing all users into recurring subscriptions.
| Operation | Token Pricing |
|---|---|
| Dataset ingestion | 25 tokens + 1 token per MB |
| Correlation experiments | 25 tokens + 1 token per MB of joined data |
| Tool transforms | Same rate as ingestion |
| AI search | 125–250 tokens per call, depending on provider |
| AI analysis | 125 tokens per call |
| AI chatbot | 25 tokens per call |
Token packs range from $5 to $200 and do not expire. Storage subscriptions are optional and exist only to raise the default 10 GB storage cap.
This keeps the free tier useful while ensuring the costly parts of the system—large ingestion jobs and AI calls—pay for themselves.
System Architecture
Correlation Studio is intentionally built as a small, vertically integrated system. The architecture avoids distributed complexity wherever possible, but it is not simplistic. Each layer has a narrow responsibility.
PostgreSQL owns transactional truth. R2 owns immutable bulk storage. DuckDB owns analytical execution. The .NET app server orchestrates the workflows that turn user input into datasets, experiments, discoveries, AI analysis, thumbnails, billing records, and searchable public content.
Keep the number of operational primitives small, but make each primitive do the job it is naturally good at.
The backend is split into three .NET projects with a React/Vite frontend beside it. The split is conventional by design: domain contracts at the center, infrastructure around them, and API orchestration at the edge.
CorrelationStudio.Core/ Entities, DTOs, Enums, Interfaces, Utilities, Exceptions No dependencies. Pure POCOs and contracts.CorrelationStudio.Infrastructure/ EF Core data access, ingestion services, R2 / Parquet / DuckDB plumbing, RAG subsystem. Depends on Core.CorrelationStudio.Api/ Controllers, business services, middleware, background hosted services, logging. Depends on Core + Infrastructure.correlationstudio.client/ React 18 + TypeScript SPA. Vite dev server proxies to the API and produces a static SPA bundle for production.
This layout keeps the domain model independent of persistence and hosting concerns. It also keeps the solo-developer ergonomics sane: most feature work touches the same predictable layers rather than scattering through a service mesh or multiple deployable applications.
The production topology consists of two VPS instances and one Cloudflare R2 account. One server runs PostgreSQL. One server runs the application, static frontend, ingestion workers, DuckDB queries, AI orchestration, thumbnail rendering, Stripe webhooks, and background services.
| Role | Plan | Cores | RAM | Disk | Transfer |
|---|---|---|---|---|---|
| Central DB | SSD VPS | 16 | 32 GB | 1.2 TB SSD | 10 TB |
| App server | SSD VPS | 16 | 64 GB | 1.2 TB SSD | 10 TB |
| Cloudflare R2 | Object storage | — | — | Metered | $0 egress |
The split is intentionally boring. The database server holds PostgreSQL only. The app server handles product logic and analytical computation. R2 stores bulk row data and media assets: dataset Parquet files, per-discovery datapoint Parquets, profile images, thumbnails, and portfolio attachments.
Browser ↓React + Vite SPA ↓ASP.NET Core API ├── PostgreSQL 17 metadata, billing, jobs, RAG vectors, audit ├── DuckDB embedded analytical execution ├── SSD hot cache local Parquet working set └── Cloudflare R2 durable Parquet + media storage
The production topology keeps transactional metadata, analytical compute, and durable bulk storage separate without introducing distributed orchestration.
The app server’s local SSD disk also acts as an LRU hot tier for Parquet. Frequently accessed dataset Parquets and per-discovery datapoint Parquets live locally for sub-millisecond DuckDB reads. Cold datasets remain in R2 and are fetched on demand.
The cache is bounded to 80% of disk, and its hit-rate degradation is surfaced in the Admin system metrics. That matters because the cache is not just a performance optimization; it is what allows the system to feel interactive while retaining cheap, durable object storage as the source of truth.
The central architectural lesson from v1 was that PostgreSQL should not be asked to act as an analytical file format.
The v1 design materialized every parsed row into relational tables. For narrow, modest datasets, that was acceptable. For wide datasets, it was fatal. A CSV with more than a thousand columns and millions of rows produced hundreds of millions of B-tree inserts across row and index tables. Under concurrent ingestion, write amplification consumed the shard’s disk I/O budget and collapsed throughput.
DuckDB-on-Parquet inverted the workload.
Ingestion now writes one Parquet file per dataset. Analytical queries then issue DuckDB SQL directly against that Parquet. Column statistics, experiment correlations, drilldown lookups, thumbnails, and chart payloads all operate against columnar files rather than row-oriented database tables.
Scanning one numeric column reads bytes proportional to that column, not the whole row.
Dictionary and RLE encoding compress low-cardinality columns aggressively, often by 10–100×.
Row groups allow DuckDB to skip entire chunks that do not satisfy a filter.
The app retains SQL expressiveness without implementing a custom analytical engine.
The DuckDB pool is sized to approximately 16 connections. Every read path that needs analytical data checks out a connection: drilldown, thumbnails, sparklines, chart loading, experiment ingestion, and RAG corpus building.
Under load, connection pressure is observable. Checkout wait time appears in per-request logs as duckdbPoolWaitEma, and ingestion uses that signal to back off through a square-root yield curve. Instead of blindly flooding the local analytical engine, the ingest loop lets pool pressure slow it down.
The v2 migration did not merely make the old design faster. It moved the analytical workload onto the substrate it wanted all along: columnar files queried by an embedded analytical database.
PostgreSQL access uses Npgsql with Entity Framework Core and a custom resilient retry strategy. Every DbContext resolved through dependency injection carries the same transient-error policy.
Each process also stamps an application name onto its PostgreSQL connections. On restart, ServerLifecycleService can identify and terminate orphaned backends left behind by crashed prior instances. That small operational detail keeps restarts from accumulating stale database connections.
Slow-query observability is handled by a singleton command interceptor. Any database command slower than the configured threshold, defaulting to 1000 ms, is logged with a snapshot of in-flight commands at the time of slowness.
Those logs surface live in the Admin Runtime tab. For a solo-operated system, that visibility is essential. The goal is not perfect distributed tracing. The goal is immediate operational attribution when a request, background job, or query path starts to misbehave.
Correlation Studio accepts large datasets. That requirement shows up in places that are easy to overlook.
Kestrel’s maximum request body size is raised to the configured maximum file size, which defaults to 100 GB. Keep-alive timeout is set to four hours so legacy synchronous remote-download requests are not reaped while the server is still fetching a large file. Request header timeout is extended to two minutes.
Development needed the same treatment. The Vite dev proxy forwards API calls with four-hour timeouts, and a custom large-upload plugin disables Node’s default request, header, socket, and keep-alive timeouts on the browser-to-Vite side.
Supporting multi-gigabyte user files is not only a storage problem. It affects request limits, proxy defaults, development tooling, cancellation semantics, and background-job contracts.
Correlation Studio’s schema is broad because the product is not only an analytical engine. It is also a publishing system, a social graph, a billing system, a job queue, an audit trail, and a RAG corpus.
The full PostgreSQL DDL is roughly 3,200 lines and is organized into predictable domains:
The schema reflects the product strategy: every entity that can become part of the public knowledge base needs metadata, ownership, visibility, social feedback, searchable text, and deletion semantics.
Public-facing entities use short, opaque identifiers instead of exposing integer primary keys.
Users, datasets, experiments, discoveries, portfolios, topics, references, jobs, and invoices carry a PublicId: a unique, indexed, 10-character base62 string. API routes and frontend URLs use the public ID. Internal foreign-key relationships remain ordinary integers.
This provides two benefits at once. Public URLs are compact and shareable, and the system does not leak growth rate through sequential IDs. Internally, joins remain efficient and simple.
Public route: /discoveries/a7Xk92LmQpInternal tables: Discoveries.DiscoveryId = 18427 DiscoveryDatasets.DiscoveryId = 18427 Datapoints.DiscoveryId = 18427
Public IDs protect external surfaces without complicating internal relational joins.
Dataset and experiment states are represented as lookup-backed enum values. Some v1 states remain defined for compatibility but are no longer used by the v2 ingestion flow. For example, v2 has no separate indexing phase because indexing bulk rows into PostgreSQL no longer exists.
DatasetStateType: Pending=1 Queued=2 Reading=3 Read=4 Waiting=5 Indexing=6 Ready=7 Failed=8 Cleaning=9 Downloading=10ExperimentStateType: Pending=1 Queued=2 Running=3 Ready=4 Failed=5 Cleaning=6
Repository access is centralized through IUnitOfWork, which exposes one repository property per entity type. The pre-v2 split between central and per-shard units of work was removed when sharding disappeared.
The unit of work and its CorrelationStudioDbContext are scoped per request. Background services that need scoped dependencies create fresh service scopes through IServiceScopeFactory. That pattern is used by deletion sweeps, cleanup runners, RAG workers, ingestion dispatchers, and background download tasks.
This is not glamorous architecture, but it is the kind of boring correctness that keeps long-running jobs from holding onto request-lifetime dependencies after the request is gone.
Storage Layer
The storage layer is where the v2 architecture earns most of its leverage. PostgreSQL stores metadata and transactional state. Cloudflare R2 stores durable bulk data. Local SSD provides a hot working set. DuckDB bridges the two by querying Parquet files as analytical tables.
That separation lets Correlation Studio keep the operational model small while avoiding the failure mode that broke v1: using a row-oriented transactional database as a high-volume analytical substrate.
Correlation Studio stores bulk data in Cloudflare R2. The abstraction is IR2Storage, backed by an AmazonS3Client configured against R2’s S3-compatible endpoint.
R2 needs a few specific settings:
ForcePathStyle = true, because R2 does not use virtual-host style addressing.AuthenticationRegion = "auto".UseChunkEncoding = false on PUT requests.DisablePayloadSigning = true for uploads, while SigV4 still signs headers and HTTPS encrypts the body.Upload and deletion concurrency are controlled separately with two SemaphoreSlim gates. That prevents a deletion sweep from competing with active ingestion uploads for the same R2 per-prefix throttle budget.
The workload wants cheap durable storage, high read availability, and zero-egress economics more than it wants a full managed data warehouse. R2 fits that shape well.
Large R2 uploads exposed a real production edge case.
Dataset Parquets can be gigabytes in size. Uploads are single PUT requests and may run for several minutes. Under burst load on Windows, pooled HTTPS connections could occasionally fall into a bad SChannel state. The symptom was an IOException saying the decryption operation failed, wrapping Win32Exception 0x8009030F SEC_E_MESSAGE_ALTERED from SslStream.ReadAsyncInternal.
The problem was not that one upload failed. The problem was that the pooled connection could remain poisoned. Retrying through the same client risked receiving the same broken socket again.
The upload path now runs three attempts beyond the SDK’s own retry budget. If the exception chain contains a Win32Exception, the system atomically swaps the AmazonS3Client instance using Interlocked.Exchange. The replacement client carries a fresh HttpClient, handler, connection pool, TLS handshake, and socket.
The displaced client is disposed after a 30-second delay so any concurrent upload that already captured it can finish cleanly.
A 1.39 GB Array of Things Chicago dataset hit the same SChannel failure on every retry because the pool kept returning the poisoned socket. The client-rebuild path exists because that happened in production.
Dataset ingestion writes Parquet through IParquetWriter.WriteAsync, implemented with Parquet.Net.
The writer builds a schema from ParquetColumnDefinition records: one Parquet field per inferred dataset column. Type mapping is intentionally conservative.
| Correlation Studio Type | Parquet Type |
|---|---|
Integer | Int64 |
Decimal | Double |
Date | String using ISO 8601 canonical form |
Time | String using ISO 8601 canonical form |
DateTime | String using ISO 8601 canonical form |
String | String |
Row groups are buffered in per-column staging arrays and flushed at a default 50,000 rows per group. That size gives DuckDB useful predicate pushdown granularity without creating an oversized footer index.
Typical user datasets in the 100,000 to 5 million row range become 2 to 100 row groups.
Values are coerced at write time through typed conversion helpers. Non-nullable columns throw on null. Nullable columns use nullable staging arrays. The writer is always disposed before returning so the Parquet footer is flushed, and output size is measured from stream position.
R2 is the durable store, but local SSD is the interactive working set.
IParquetCache implements an LRU disk cache for dataset Parquets and per-discovery datapoint Parquets. Its public surface is small:
EnsureLocalAsync(r2Key, ct) returns a local file path, downloading from R2 on cache miss.SeedAsync(r2Key, sourceFile, ct) moves a freshly written ingest temp file into the cache so the first query is already local.Pin(r2Key) and Unpin(r2Key) reference-count files during active scans, ingests, drilldowns, and thumbnail renders.Pinned files are never evicted. Eviction runs in the background and uses last access time. The cache is bounded by configuration, defaulting to 80% of disk free space at startup.
This turns the app server’s SSD disk into a practical lakehouse hot tier. Cold data stays cheap in R2. Hot data behaves like local analytical storage.
DuckDB sessions are not thread-safe, so Correlation Studio gates analytical work through a fixed-size connection pool. The default size is 16.
The IDuckDbQueryService surface is deliberately small:
ParquetTable(localPath) returns a safe read_parquet('...') table-valued expression.QueryAsync<T>(sql, projector, ct) runs a query and materializes rows through a projector.StreamAsync<T>(sql, projector, ct) yields rows as an IAsyncEnumerable<T> without materializing the full result.RecentCheckoutWaitMs exposes an exponential moving average for pool checkout wait time.That last metric is especially useful. When a request emits a slow-load warning, the logs include pool wait time so the operator can distinguish slow SQL from pool contention.
The DuckDB AWS extension is loaded at startup so the engine can read from s3:// URIs directly, but in practice almost all hot paths go through the local cache.
Images live in the same R2 bucket as Parquet data, under an images/ prefix. That includes profile avatars, banners, thumbnails, discovery scatterplot PNGs, portfolio thumbnails, post attachments, message images, and forum images.
User-facing image URLs exist in two forms.
In production, image URLs use:
https://images.correlationstudio.com/images/<key>
Bytes flow from R2 to Cloudflare’s edge cache to the browser. The app server never touches them. Egress through Cloudflare’s CDN is free, and the custom domain is attached at the R2 bucket level.
The legacy and development fallback path is:
/api/storage/images/<key>
In that mode, bytes flow from R2 through the ASP.NET storage controller to the browser. The controller consults image safety verdicts, serves bytes through a bounded memory cache, and applies a hard five-second timeout to R2 reads so a bad object-store hop returns a fast 504 instead of riding the AWS SDK retry budget.
The proxy path remains alive for legacy URLs embedded in social previews, RSS feeds, email screenshots, bookmarks, and external caches.
Both URL shapes are accepted during the transition. A storage URL migration rewrote persisted proxy URLs to the custom-domain form across user images, discovery thumbnails, portfolio thumbnails, portfolio block images, and image storage paths.
Serving public images through Cloudflare lets the product keep app-server bandwidth focused on application work while still retaining a safe fallback for moderation and legacy URLs.
Dataset ingestion is the pipeline that turns messy user input into typed, queryable, columnar data.
It is also where many analytics products fail. Real-world datasets are rarely clean. They contain preambles, footers, multi-row headers, sparse section labels, partial dates, currency-formatted numbers, fake headers, long sentence-like labels, and server-side download failures.
Correlation Studio’s ingestion path is designed around that reality.
DatasetIngestionService.ProcessQueuedIngestionAsync orchestrates four phases:
IRowParser, stream rows from CSV, Excel, or HTML sources, buffer them through IParquetWriter, and emit a temp Parquet file.datasets/{userPublicId}/{datasetPublicId}/data.parquet. On success, move the temp file into the local hot cache so the first DuckDB query is local.Ready and debit tokens in the same EF Core transaction.The v2 state machine is simple:
Pending → Queued → Reading → Ready / Failed
The older v1 states such as Read, Waiting, and Indexing remain defined but are not set by the v2 pipeline. There is no separate indexing phase because the row-index tables no longer exist.
Per-dataset mutual exclusion is enforced with an in-memory ConcurrentDictionary<int, Task>. Only one ingest for a dataset ID may run at a time. A losing dispatcher logs the conflict, closes the orphan job, and exits. If the process crashes, the dictionary disappears and the next dispatcher tick handles recovery.
CSV ingestion uses a raw StreamReader and a custom line parser rather than CsvHelper. That choice was made for speed on wide files.
The harder problem is not splitting lines. It is finding where the table actually starts.
Many public CSV files begin with metadata, comments, blank rows, source descriptions, license text, or sparse heading blocks. CsvPreambleDetector.DetectAsync buffers the first roughly 50 lines and runs a two-pass heuristic.
This detector has been revised repeatedly against real dataset failures. Each iteration represents one ugly public CSV that broke the previous heuristic.
Ingestion quality is not determined by how well the parser handles perfect CSV. It is determined by how gracefully it handles the thousand strange almost-tables people upload from the real world.
Excel Parsing
Excel files have their own failure modes. A workbook can contain multiple sheets, decorative title rows, sparse metadata, merged-looking header regions, and footers that look enough like data to confuse a naïve parser.
Correlation Studio handles Excel ingestion with ExcelDataReader in two passes.
ExcelPreambleDetector.FindDataStart scans sheets and chooses the one with the widest tabular region.ExcelPreambleDetector.SeekTo reopens the workbook at the selected sheet and header row, then streams rows forward.Empty rows are dropped. Parsing stops when the detector sees a blank-followed-by-sparse data pattern that indicates a footer rather than more table content.
The Excel preamble detector runs both exact and fuzzy passes. The exact pass looks for rigid column-count blocks. The fuzzy pass allows ±1 column of slop, which helps with exports that carry inconsistent trailing empty cells. If both passes find a predecessor, the stronger match wins. Otherwise the denser region wins, with ties broken toward the earliest plausible row.
A surprising number of real-world CSV and Excel exports have two or three header rows. Some use year / quarter / metric stacks. Some repeat section names. Some carry patterns such as Total | Total | 2024.
Treating only one row as the header loses useful labels. Concatenating every row can create unreadable column names. Correlation Studio uses a more selective merge.
Male or Female, not as a header row.In many public datasets, the bottom header row contains the actual variable name, while rows above it contain broad grouping labels. Replacing rather than concatenating produces shorter, more useful column names.
Once rows and headers are identified, the platform has to infer column types. The classifier works cell by cell first, then refines the result after sampling the column.
ParseValueHelper.ParseValue follows a fixed order of operations.
The numeric cleaner strips currency symbols, percent signs, commas, and accounting-style parentheses for negatives.
($1,234.56) → -1234.56
If double.TryParse succeeds and the result is finite, the value is classified as either Integer or Decimal.
Temporal parsing runs on the uncleaned value. If DateTime.TryParse succeeds, the result is classified as Time, DateTime, or Date depending on the shape of the parsed value.
Time.DateTime.Date.Public data often contains partial dates such as 1992.12, 2020-3, 1999/06, or simply 2023. The partial date parser recognizes these shapes and canonicalizes them as dates.
Anything that does not parse as numeric or temporal becomes String.
After the sampled-row type census, Correlation Studio runs additional passes to catch cases the cell-level parser cannot safely resolve alone.
1871.10 in date-labeled columns.Col1, Col2, and so on.After a dataset is written to Parquet, Correlation Studio immediately computes column statistics with DuckDB and stores the results in PostgreSQL.
For numeric columns, the core query computes:
Additional fan-out queries compute 20-bin histograms and top outliers by distance from the mean. For non-numeric columns, the system stores counts and top categorical values.
Results are upserted into ColumnStats, one row per column.
This replaced an expensive v1 behavior where opening a dataset’s Distributions or Quality tab could trigger 30–90 seconds of recomputation on a 10-million-row dataset. In v2, those tabs render from precomputed rows in under 100 ms.
Column statistics are not a rare analytical side path. Users repeatedly inspect quality, distributions, and summaries. Precomputing them turns a recurring interaction from a scan into a metadata read.
Token accounting has to be correct under concurrency. A billing system that is usually right is not good enough.
Ingestion debits use an atomic PostgreSQL UPDATE … RETURNING statement against Users.TokenCount.
UPDATE "Users"SET "TokenCount" = GREATEST(0, "TokenCount" - @amount), "ModifiedDate" = NOW()WHERE "UserId" = @userIdRETURNING "TokenCount"
PostgreSQL takes a row lock for the duration of the transaction. Concurrent debits queue behind the lock and each receives a consistent post-debit balance via RETURNING.
The ledger row uses that returned balance. The operation history row commits in the same save. Token count, ledger, operation history, dataset state, and any prior-ingest refund all move atomically.
This replaced a read-modify-write implementation that could lose updates under concurrency. In production, the old path created drift where the ledger balance could exceed the stored user token count.
The ingestion cost formula is:
BaseTokenCost + ceil(fileSizeMB) × TokenCost
As of the v2 migration period, ingestion costs 25 base tokens plus 1 token per MB. That rate was reduced after a real 45-file COVID ingestion test proved the previous per-MB rate could make useful workflows too expensive.
Every storage-consuming endpoint is gated by IStorageQuotaService: uploads, paste flows, ingestion, remote downloads, and dataset transforms.
On every successful Parquet upload, User.StorageUsedBytes is incremented by the Parquet size. The default cap is 10 GB and can be raised by an active storage subscription. Over-quota requests return HTTP 402, which the frontend renders as an over-quota banner.
A weekly reconciliation job compares recorded storage usage against the sum of dataset and experiment sizes, then writes drift metrics into system metrics.
One of Correlation Studio’s core product bets is that users should not need to bring every dataset themselves. They should be able to ask for data, let AI providers suggest public sources, verify those sources, and ingest them into the same pipeline as uploaded files.
That workflow has two halves: AI-powered URL search and robust remote download.
Remote Search runs Claude, Gemini, and Grok in parallel. Each provider receives a prompt assembled from database-backed prompt templates with the user’s topic and description interpolated.
Providers are asked for structured JSON containing:
urltitledescriptionformatestimatedSizeMBEach provider has its own integration path: Anthropic SDK for Claude, raw HttpClient against Gemini’s REST endpoint, and raw HttpClient against xAI’s Responses API.
Search progress is tracked per user and provider. The frontend polls a progress endpoint every two seconds and updates provider status chips as each one moves through searching, verifying, done, or error.
Results are merged and deduplicated by URL. A URL returned by more than one provider is boosted because independent agreement is a useful quality signal.
Token costs are tracked per provider and debited when results are produced.
LLMs can produce plausible but unusable links. Every returned URL is therefore verified before it is offered as an ingest candidate.
Verification starts with a HEAD request and falls back to GET for servers that reject HEAD with 405 or 501 responses. A domain throttler controls politeness and prevents one search from hammering a public data host.
Retry-After.Rejected URLs are still shown in a details panel with the reason: HTML returned instead of CSV, HTTP error, dead link, and so on. The user is still charged for the AI search call because the provider performed the work; sourcing failure is a separate outcome from model usage.
Topics accumulate verified references over time. Those references can be reused instead of asking the AI providers to rediscover the same sources.
Remote Search supports three modes:
| Mode | Behavior |
|---|---|
| ExistingOnly | Skip AI search and return only verified references already associated with the topic. |
| New | Run a fresh AI search without reference reuse. |
| Both | Run a fresh AI search and merge existing references into the result set. |
Link verification is triggered on user login rather than running as a giant timed background sweep. The original 24-hour batch model did not scale: at 10,000 users and 100 references each, one full cycle would take years. Login-triggered checks keep the verification work close to active users.
References follow a three-strike failure rule. A URL that repeatedly fails is marked dead. If it later comes back online, the dead flag is cleared.
The Remote Download Pipeline
After a URL is selected, Correlation Studio downloads the file server-side. That path needs to be resilient because public data servers are inconsistent. Some throttle. Some disconnect mid-stream. Some omit content lengths. Some support range requests. Some ignore them.
The download pipeline uses a named HttpClient configured to look like a browser, with automatic decompression, HTTP/1.1, and manual timeout management.
A domain throttler wraps each download so multiple users or tabs do not overwhelm the same host. The cancel button in the UI propagates through the throttler and can release a waiting download before it even starts.
The old single total timeout was replaced with three separate timers:
If a previous attempt for the same user and URL left bytes on disk, the next attempt reads the number of bytes already present and sends a range request:
Range: bytes={N}-
This replaced an earlier stubbed resume-upload path with a simpler and more durable invariant: a partially downloaded remote file can resume by URL and byte count.
The server also guards against duplicate downloads for the same user and URL. If another browser tab or retry path is already downloading the same source, the second request is refused instead of starting a duplicate transfer.
Some of the hardest remote-download bugs came from chunked-transfer endpoints. Socrata-backed sources such as data.cdc.gov can stream CSV data with Transfer-Encoding: chunked. In that mode, the GET response may not expose a content length.
That creates a nasty ambiguity. At the read layer, a mid-stream disconnect and a clean end-of-file can both look like ReadAsync returning zero bytes.
Correlation Studio defends against this with three layers.
Content-Length. That value is cached and reused later. If the GET response arrives without a content length, the HEAD-cached size becomes the expected total.Range: bytes=0-0 request with Accept-Encoding: identity. Servers that omit content length on ordinary requests frequently return Content-Range for ranged requests.A clean EOF and a severed chunked stream can look identical unless the system establishes the truth from another signal: HEAD, Content-Range, or a resume probe.
The original remote-download endpoint was synchronous. The client posted a URL, the server downloaded the file, and the server returned an UploadDto when the download finished.
That worked for small files. It failed for long downloads.
On a 25-minute server-side fetch, the browser-to-server connection had to stay open the entire time. If the proxy or Kestrel reaped the connection, the client saw a network error, the server-side bytes became orphaned, and the wizard had no durable way to recover.
The replacement contract starts the download asynchronously.
POST /api/datasets/download-remote/start returns HTTP 202 immediately with a download key.GET /api/datasets/download-remote/status?url=... polls live state.completed, partial, failed, cancelled, and not_found.The frontend wraps the start-and-poll sequence behind the same conceptual surface: await a completed upload or receive an error. Internally, it tolerates transient poll failures and threads cancellation through to the server.
Active downloads are stored durably in the ActiveDownloads table, one row per user and URL. The row records filename, path, bytes on disk, status, heartbeat, and other recovery state.
During the chunk loop, database writes are throttled to once every 10 seconds or 50 MB. On startup, a reconciliation service waits for a short grace period, then finds stale non-terminal downloads. If bytes exist on disk, the row becomes partial. If the file is missing, the row becomes failed. Old terminal rows are evicted.
The result is a download model that survives client disconnects, proxy timeouts, process restarts, and ordinary public-data-server unreliability.
The experiment engine is where Correlation Studio turns datasets into discoveries. It takes two datasets, selects a join strategy, enumerates numeric-column pairs, and computes correlation statistics for every compatible pair.
The central challenge is combinatorics. If dataset X has 400 numeric columns and dataset Y has 500, the experiment has 200,000 pairs. At 1,000 by 1,000, it has one million.
The engine therefore has to be fast, resumable, deterministic, and selective about what becomes a persisted Discovery.
ExperimentIngestionService.ProcessQueuedExperimentAsync orchestrates the experiment flow.
TotalColumnPairs = |X_numeric| × |Y_numeric|The performance arc is instructive. Throughput improved from roughly 3 pairs per second to 14–15 pairs per second, with peaks around 21, through a sequence of targeted changes:
In v2, the Datapoints row stores metadata and a Parquet path. The chart payload itself lives beside the dataset Parquets in object storage.
Every correlation requires paired X and Y values. Correlation Studio supports three join strategies: row sequence, shared key, and time series.
RowSequence is a positional zip. The engine assigns a synthetic row number to both datasets and joins on that position.
x_agg AS ( SELECT CAST(rn AS VARCHAR) AS k, CAST("xval" AS DOUBLE) AS v FROM ( SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn, "xval" FROM read_parquet('x.parquet') ))
Numbering happens before null filtering. That preserves row alignment: X row 5 still pairs with Y row 5 even if row 3 has a missing value.
SharedKey joins both sides on an exact string key. Values are aggregated per key before correlation.
x_agg AS ( SELECT CAST("xkey" AS VARCHAR) AS k, AVG(CAST("xval" AS DOUBLE)) AS v FROM read_parquet('x.parquet') WHERE "xval" IS NOT NULL GROUP BY 1)
Date and DateTime columns are stored in canonical ISO 8601 form during ingestion, so equivalent dates join cleanly as strings.
TimeSeries joins snap timestamps to a tolerance bucket, such as hourly, daily, weekly, monthly, or yearly.
x_agg AS ( SELECT strftime( to_timestamp( FLOOR(EPOCH(TRY_CAST("xkey" AS TIMESTAMP)) / 86400) * 86400 ) AT TIME ZONE 'UTC', '%Y-%m-%dT%H:%M:%S' ) AS k, AVG(CAST("xval" AS DOUBLE)) AS v FROM read_parquet('x.parquet') WHERE TRY_CAST("xkey" AS TIMESTAMP) IS NOT NULL AND "xval" IS NOT NULL GROUP BY 1)
Both sides snap to the same epoch grid. A single tolerance knob controls the bucket size.
A prior implementation cast to_timestamp(...) directly to a string. DuckDB returns TIMESTAMPTZ, and string output included the local timezone offset. Drilldown queried plain timestamps and produced a different string shape for the same logical day. The UTC strftime expression fixed the mismatch across machines and drilldown paths.
Sampling must preserve pairs. Independent random sampling on X and Y would break the join by selecting different key sets. Correlation Studio instead samples deterministically on the bucketed join key.
WHERE hash({bucketedKey}) % 100 < {samplePercent}
Both sides compute hash() on identical key values. The surviving key sets therefore line up exactly.
Re-running an experiment on the same datasets with the same sample percent produces the same sample. That stability matters because published Discoveries should not change simply because an experiment was reprocessed.
When multiple rows fall into the same key or time bucket, each selected column’s aggregate function determines the per-bucket value.
| Aggregate | Expression |
|---|---|
| Avg | AVG(CAST(col AS DOUBLE)) |
| Sum | SUM(CAST(col AS DOUBLE)) |
| Min | MIN(CAST(col AS DOUBLE)) |
| Max | MAX(CAST(col AS DOUBLE)) |
| First | FIRST(col) |
| Last | LAST(col) |
| Count | CAST(COUNT(col) AS DOUBLE) |
Average is the default because it fits most exploratory datasets. Everything is cast into a numeric shape that DuckDB’s CORR() can consume.
After both sides have been joined and aggregated into matching key-value pairs, DuckDB computes Pearson and Spearman correlation from the joined set.
joined AS ( SELECT x.k AS k, x.v AS xv, y.v AS yv FROM x_agg x INNER JOIN y_agg y USING (k) WHERE x.v IS NOT NULL AND y.v IS NOT NULL AND NOT isnan(x.v) AND NOT isnan(y.v) AND NOT isinf(x.v) AND NOT isinf(y.v)),ranked AS ( SELECT RANK() OVER (ORDER BY xv) AS xr, RANK() OVER (ORDER BY yv) AS yr FROM joined)SELECT CORR(xv, yv) AS pearson, (SELECT CORR(xr, yr) FROM ranked) AS spearman, COUNT(*) AS nFROM joined
Pearson correlation uses DuckDB’s CORR() aggregate. Spearman correlation is implemented as Pearson correlation over ranked values. Pairs with fewer than three joined observations are discarded. Pairs with zero variance on either side produce no Discovery.
The p-value is computed in C# using a two-tailed Student-t test:
t = r · √[(n − 2) / (1 − r²)]p = 2 · (1 − CDF_t(|t|; df = n − 2))
Confidence intervals use Fisher’s z-transform:
z = atanh(r) = 0.5 · ln((1 + r) / (1 − r))SE = 1 / √(n − 3)z_lo = z − 1.96 · SEz_hi = z + 1.96 · SECI = [tanh(z_lo), tanh(z_hi)]
Perfect correlation is handled explicitly. When |r| = 1, the p-value path avoids divide-by-zero and the confidence interval degenerates to [r, r].
When a pair clears the threshold, the system writes a small Parquet file containing the joined datapoints used for charting.
COPY ( SELECT k, xv, yv FROM joined ORDER BY hash(k) LIMIT 5000) TO '{tempFile}' (FORMAT PARQUET)
The deterministic ORDER BY hash(k) keeps chart sampling stable across reruns. The hard cap of 5,000 rows ensures the frontend can render a discovery without an additional downsampling pass.
The file is uploaded to R2 under the discovery’s path and seeded into the local cache. The database row stores the Parquet path, counts, and data size. The payload itself no longer lives inside PostgreSQL.
Granger causality is available on demand from the Discovery detail page. It is not computed during bulk experiment ingestion because doing so added a 30–60% throughput penalty to the hot path.
The on-demand algorithm has four stages.
Δy_t = α + δ·y_{t-1} + ε. The t-statistic is δ / SE(δ), using a 5% critical value of −2.86 for larger samples. Non-stationary series are differenced up to twice.k · log(n) + n · log(RSS / n) The search considers lags from 1 to min(10, n/4) and chooses the best score.F = [(RSS_restricted − RSS_unrestricted) / k] / [RSS_unrestricted / (n − 2k − 1)] p = 1 − CDF_F(F; k, n − 2k − 1)OLS is implemented through MathNet matrix inversion. Singular systems fall back to full residual sum of squares and report no causality.
Granger causality is valuable, but it is not cheap enough to run for every candidate pair during brute-force experiment ingestion. Deferring it keeps discovery mining fast while still making causal testing available when a user cares about a specific relationship.
Not every computed pair becomes a Discovery. The ingestion loop applies a series of gates, each designed to avoid storing obvious noise or redundant relationships.
Older duplicate history created before the gate can be cleaned by an admin utility. The cleanup keeps the oldest Discovery in each duplicate group so established permalinks, thumbnails, search records, and published states are preserved.
Correlation Studio intentionally uses standard statistical methods. The goal is not to invent new statistics. The goal is to make proven statistical workflows available at corpus scale and wrap them in useful product primitives.
r = Σ[(xᵢ − x̄)(yᵢ − ȳ)] / √[Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²]
Pearson appears in three places: DuckDB during experiment ingestion, MathNet for one-off server-side computations, and a TypeScript implementation for client-side rolling windows.
Spearman is Pearson applied to ranks:
ρ = Pearson(rank(x), rank(y))
DuckDB computes ranks with window functions and then correlates them. MathNet handles one-off server paths. Ties are handled by the respective rank implementations.
t = r · √[(n − 2) / (1 − r²)]df = n − 2p = 2 · (1 − CDF_t(|t|; df))
The p-value path uses MathNet’s Student-t distribution. Correlations with fewer than three observations return no p-value.
z = atanh(r)SE = 1 / √(n − 3)z_lo = z − 1.96 · SEz_hi = z + 1.96 · SECI = [tanh(z_lo), tanh(z_hi)]
Confidence intervals use Fisher z without Olkin-Pratt small-sample bias correction. The interval is reported as-is.
The frontend regression utility implements four model families: linear, polynomial, logarithmic, and exponential.
slope = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²)intercept = (Σy − slope·Σx) / n
Degenerate X variance invalidates the model. R² is computed as 1 − SS_resid / SS_total.
Degree-2 and degree-3 polynomial fits build normal-equation matrices from power sums and solve them with Gaussian elimination and partial pivoting. Singular pivots invalidate the model.
The logarithmic model fits y = a + b·ln(x). Values where x ≤ 0 are filtered out.
The exponential model fits y = a·e^(bx) by fitting a linear model over (x, ln(y)), then exponentiating the intercept. Values where y ≤ 0 are filtered out.
Every model returns a prediction function, equation string, coefficients, and R². The equation formatter uses adaptive precision so extremely large, normal-sized, small, and tiny coefficients remain readable.
Regression overlays include a 95% prediction interval. For each x value on the regression curve, the margin is:
margin = 1.96 · SE · √[1 + 1/n + (x − x̄)² / SS_x]SE = √(SS_resid / (n − p))
The chart draws the band as a filled polygon above and below the curve. For nonlinear models, the prediction function is sampled across the x-range.
Correlation Studio uses the difference between Spearman and Pearson strength to flag relationships that may be nonlinear or outlier-influenced.
Δ = |Spearman| − |Pearson|"non-linear" if Δ > 0.15"outlier-influenced" if Δ < −0.15null otherwise
A large positive delta suggests a monotonic but nonlinear relationship. Spearman preserves the rank order while Pearson underreports the curve. A large negative delta suggests outliers may be propping up Pearson more than the rank relationship supports.
This drives the small divergence indicators in discovery lists and powers the dedicated divergence visualization.
Rolling correlation runs client-side for interactive exploration. The implementation skips non-finite values, returns zero for windows with fewer than three valid pairs, and guards zero-variance denominators.
function pearson(xs: number[], ys: number[], start: number, end: number): number { let sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0, sumY2 = 0, n = 0; for (let i = start; i < end; i++) { if (!Number.isFinite(xs[i]) || !Number.isFinite(ys[i])) continue; sumX += xs[i]; sumY += ys[i]; sumXY += xs[i] * ys[i]; sumX2 += xs[i] * xs[i]; sumY2 += ys[i] * ys[i]; n++; } if (n < 3) return 0; const denom = Math.sqrt( (n * sumX2 - sumX * sumX) * (n * sumY2 - sumY * sumY) ); if (denom === 0) return 0; const r = (n * sumXY - sumX * sumY) / denom; return Number.isFinite(r) ? r : 0;}
Users can adjust the window size. The chart overlays strength bands and reports mean and volatility below the line.
Correlation Studio does not implement Largest-Triangle-Three-Buckets. Instead, it uses three practical strategies tailored to the product.
log2(count).The goal is not to preserve every rendered primitive. The goal is to preserve the visual meaning of dense data without making the browser draw thousands of redundant pixels.
The primary chart engine is DiscoveryChart.tsx: a canvas-based, double-buffered renderer that handles scatterplots, line graphs, residual plots, trajectory mode, and regression overlays.
Device-pixel-ratio scaling keeps output crisp on high-DPI displays. Double buffering keeps redraws clean: a new frame renders into an offscreen canvas, then atomically replaces the visible canvas.
| Component | Technology | Purpose |
|---|---|---|
MiniChart | SVG | Sparkline previews in feed cards and grids |
CorrelationHeatmap | SVG grid | Pairwise correlation matrix across columns |
DivergenceChart | SVG scatter | Pearson vs. Spearman per discovery |
BubbleChart | SVG | Correlation vs. sample size with bubble size as divergence |
LagAnalysisChart | SVG polyline | Correlation at positive and negative lags |
RollingCorrelationChart | SVG polyline | Windowed correlation across a series |
CorrelationNetwork | D3 force graph | Node-link graph of inter-column correlations |
DatasetMapView | Leaflet | Geographic scatter for geo-typed datasets |
ColumnHistogram | SVG | Per-column distribution histogram |
ColumnBoxPlot | SVG | Quartiles and outliers |
The main chart pipeline starts with memoized data preparation. It recomputes when the selected datapoint payload or visualization mode changes.
Scatterplot and residual modes use a 1000×600 bucket grid. Each point is mapped into a bucket by its normalized x and y position. Duplicate bucket hits collapse into a single rendered dot, with radius scaled by density.
Line graph mode uses arrays indexed by row order. Trajectory mode parses compound join keys into group and time pairs, then renders grouped alpha-blended lines.
Residual mode transforms the plotted y value before bucketing:
plotY = y − (slope · x + intercept)
Row order is sometimes meaningful and sometimes noise. Before recommending ordered visualizations, the chart computes lag-1 autocorrelation for both series. If both series have weak autocorrelation, the UI warns that row order may not carry signal and suggests scatterplot or trajectory mode instead.
Rendering happens on an offscreen canvas. When the frame is complete, the offscreen buffer is copied into the visible canvas using globalCompositeOperation = 'copy'. This avoids the white flash that can occur when resizing or clearing the visible canvas directly.
Resize events are handled with a leading-edge throttle and a short cooldown. The first resize applies immediately. Subsequent resize events coalesce into a pending size and apply when the cooldown expires.
The chart accumulates stage timings for data preparation, trajectory grouping, and regression. If a render exceeds 100 ms, it logs a warning with stage-level detail.
In scatterplot mode, users can choose among linear, polynomial, logarithmic, and exponential regression overlays. The selected model is fitted client-side and rendered directly over the discovery datapoints.
The overlay draws four elements:
Nonlinear curves are sampled at 80 points across the visible x-range. Linear fits use only the two endpoints.
Discovery thumbnails are rendered server-side with SkiaSharp. The server renderer mirrors the client chart logic closely enough that thumbnails and interactive charts have the same visual language.
The thumbnail pipeline is:
Portfolio thumbnails use a similar renderer, composing mini previews for dataset, experiment, and discovery blocks.
One of the most important product affordances is the ability to click a chart point and inspect the source rows behind it.
The client sends selected keys to:
POST /api/discoveries/{id}/datapoint-rows
Multi-select is supported through lasso and Shift-click. The endpoint streams NDJSON: one JSON object per line. The client parses rows lazily instead of waiting for a giant JSON array.
Streaming uses:
Content-Type: application/x-ndjson; charset=utf-8X-Accel-Buffering: noResponse.Body.DisableBuffering()FlushAsync callsX and Y sides are scanned concurrently. Each side queries its source dataset Parquet through DuckDB:
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rid, c1, c2, ...FROM read_parquet('dataset.parquet')WHERE {bucketed-key-expression} IN ('key1', 'key2', ...)ORDER BY {bucketed-key-expression}LIMIT {DrilldownPerSideRowCap + 1}
The bucketed key expression is generated by the same helper used during experiment ingestion. That consistency is what makes time-series drilldown correct: a daily bucket returns every X row and every Y row whose timestamp falls in that day.
The per-side row cap is 5,000. The query fetches one extra row so truncation can be detected without issuing a separate count query.
Dataset tools let users produce derivative datasets without leaving the platform. They are implemented by DatasetToolService and share the same storage model as ordinary ingestion: DuckDB transforms source Parquet, the result is written to a new Parquet file, and the new dataset becomes a normal first-class dataset.
The transformation pipeline mirrors ingestion:
Output Parquet column names use the same c{ColumnId} convention as ordinary ingested datasets.
A transformed dataset is not a temporary view. It has columns, stats, storage usage, tokens, join-key detection, publication state, and can participate in future experiments.
If no output column is explicitly marked as a join key, the tool pipeline scans for the first Date or DateTime column. If none exists, it falls back to the first String column. This mirrors the dataset wizard’s detection behavior so downstream experiments can time-series join derivative datasets with minimal user intervention.
The tool catalog covers common reshaping, time-series, normalization, filtering, and joining workflows.
Converts wide data to long form using an unpivot-style operation. Value columns are cast to double before unnesting.
Computes row-over-row deltas using LAG() over deterministic row order.
Shifts a column forward or backward with LAG or LEAD.
Computes rolling SUM, MIN, MAX, COUNT, FIRST, LAST, or AVG over a row window.
Groups timestamps by period using DATE_TRUNC and aggregates numeric columns per period.
Builds a WHERE clause from user conditions such as equality, range, contains, starts with, ends with, null, and not-null.
Supports z-score and min-max normalization with zero-variance guards.
Computes percent change from the prior row and returns null when the prior value is zero.
Uses a windowed cumulative sum from the beginning of the row order to the current row.
Supports forward-fill, backward-fill, numeric interpolation, and constant fill.
Ranks rows by key group and keeps the first or last row per group.
Groups by one or more key columns and applies an aggregate function.
Discovers distinct pivot values and builds one output column per value.
Computes rank over a selected column with nulls placed last.
Computes numeric buckets from precomputed min and max values.
Filters rows whose target values exceed a configured standard-deviation threshold.
Performs inner, left, right, or full outer joins on string-cast key columns.
Unions two datasets by discovering a case-insensitive union schema and projecting missing columns as null.
Tools that depend on row order inject a ROW_NUMBER() OVER () AS rowid CTE. Window frames use row-based frames rather than value-based ranges.
The streaming transform path enforces DuckDB reader output into the target Parquet type. Integer outputs become Int64. Decimal outputs become Double. Temporal and string outputs become canonical strings.
Missing values are handled explicitly. Forward-fill and backward-fill use IGNORE NULLS. Interpolation tracks null positions and neighboring non-null values. Rank uses NULLS LAST. Aggregates coalesce to safe defaults where needed to avoid division-by-zero and invalid numeric output.
Corrie is Correlation Studio’s in-app RAG chatbot. The name is user-facing; the historical frontend component remains StudioChat for compatibility, but the UI, empty states, button labels, error banners, and system prompt all identify the assistant as Corrie.
Corrie answers two kinds of questions:
The chatbot is not bolted on as a generic assistant. It is wired into the same entity model and publication lifecycle as the rest of the platform.
The RAG system has three core database entities.
RagChunk: one retrievable unit with corpus type, source type, source entity ID, title, URL, chunk text, embedding, and creation date.ChatConversation and ChatMessage: persistent chat history per user.ChatQueryAudit: per-query audit records containing token counts, cost estimates, retrieved chunks, and escalation flags.Chunks belong to several corpus types: documentation, entities, and pre-seeded content. Source types cover document sections, discoveries, experiments, portfolios, topics, datasets, forum questions, and seed content.
This gives the chatbot a uniform retrieval interface across static docs and user-generated analytical content.
Embedding is dual-vendor. OpenAI’s text-embedding-3-large is the primary path, reduced to 1536 dimensions. Gemini text-embedding-004 is the fallback.
The 1536-dimensional shape was retained because the existing pgvector column already used 1536 dimensions. That made the cutover to OpenAI primary a service change rather than a schema migration.
Each embedding service owns its own retry loop with jittered backoff and Retry-After support. The composite fallback service tries OpenAI first, then Gemini on transient failure. Vendor and cost data are captured separately so admin reporting can distinguish normal and fallback spend.
Documentation chunks are built from the user guide by splitting Markdown sections on H2 and H3 headings. The docs corpus is hashed at startup and re-chunked only when the hash changes.
Entity chunks are built by source-specific chunk builders. Publishing, unpublishing, deletion, and analysis changes enqueue RAG upsert or remove commands.
Index commands are durable rows, not in-memory events. The worker polls due rows, coalesces multiple commands for the same source, embeds upserts in vendor batches, and writes precomputed embeddings into RagChunks.
Failed rows are retried with exponential backoff and are never silently dropped.
Retrieval uses pgvector cosine similarity:
SELECT * FROM "RagChunks"WHERE "CorpusType" = 1 OR "SourceUserId" = @userIdORDER BY "Embedding" <=> @queryEmbeddingLIMIT @k
The default retrieval count is eight chunks. Visibility filtering ensures private entities are not retrieved for other users.
ChatService orchestrates the full query lifecycle:
RagChunks.ChatQueryAudit row with estimated cost.Haiku is the default because it is fast and inexpensive. Sonnet is reserved for ambiguous or low-confidence cases where a stronger model noticeably improves answer quality.
Citations are rendered inline by the Corrie panel. Provenance links are resolved from stored entity URLs at service startup.
Corrie uses a flat token price per query. Each chat query debits 25 tokens regardless of whether the answer used Haiku alone or escalated to Sonnet.
That pricing model lets cheaper Haiku queries subsidize occasional Sonnet escalations, while the daily per-user query cap bounds worst-case spend.
Exceeding the query cap returns HTTP 429. Insufficient token balance returns HTTP 402. The chat panel renders both as inline banners.
Correlation Studio’s billing model is built around non-expiring token packs. Tokens fund usage-based operations: ingestion, correlation, transforms, AI search, AI analysis, and Corrie.
The design goal is to keep browsing and lightweight exploration free while making expensive compute and AI work pay for itself.
Users can buy token packs in four tiers. Packs never expire and stack into one user balance.
| Pack | Price | Tokens | Rate |
|---|---|---|---|
| Small | $5 | 5,000 | $1.00 / 1K tokens |
| Medium | $20 | 25,000 | $0.80 / 1K tokens |
| Large | $50 | 65,000 | $0.77 / 1K tokens |
| Bulk | $200 | 300,000 | $0.67 / 1K tokens |
The volume discount is intentionally modest. A bulk-pack buyer should receive a better rate, but not enough to destroy margin on expensive AI-backed operations.
Token pricing is tied to the kind of work being performed.
| Operation | Base | Per Unit | Unit | Approximate Retail Cost |
|---|---|---|---|---|
| Ingest Dataset | 25 | 1 | per MB | $0.025 + $0.001/MB |
| Run Correlation | 25 | 1 | per MB | $0.025 + $0.001/MB |
| Tool Transform | 25 | 1 | per MB output | Shares ingest rate |
| AI Analysis | 0 | 125 | per call | $0.125 |
| AI Search: Claude | 0 | 250 | per call | $0.25 |
| AI Search: Gemini | 0 | 125 | per call | $0.125 |
| AI Search: Grok | 0 | 125 | per call | $0.125 |
| AI Chatbot | 0 | 25 | per call | $0.025 |
The per-MB compute rate was reduced after real ingestion testing showed that the old price could make legitimate multi-file workflows feel punitive. The per-MB charge is compute markup, not storage rent; R2 storage is cheap enough that the ingest fee covers many years of storage at retail rates.
AI Analysis was also repriced after bulk-analysis workflows exposed a power-user failure mode. One click could analyze hundreds of child discoveries. Reducing analysis cost kept that path viable without damaging margin.
All token-spending call sites route through the same accounting service. Dataset search, AI analysis, Corrie, ingestion, experiment ingestion, and dataset transforms all use the atomic debit path.
Every token movement writes a ledger row: purchases, grants, debits, refunds, and reclamation credits. The ledger row stores a running balance, and that balance comes from the same RETURNING TokenCount value produced by the atomic PostgreSQL update.
By construction, the ledger balance matches the stored user balance.
A reconciliation path can recompute token balances from the ledger. It finds the most recent token-reset sentinel row per user, then sums all later ledger movements. Older users without a sentinel fall back to the latest ledger balance.
Deleting a dataset, experiment, or discovery can return part of the original per-MB token cost to the user.
The reclaim rule credits back 70% of the per-MB portion. The 25-token base fee is retained, and the remaining 30% acts as a reclamation surcharge. Very small entities whose reclaim amount would round below one token receive no credit.
Credits appear in the ledger as token reclamation entries tied to the deleted entity title. The deletion service writes the ledger entry in the same transaction as the soft-delete flag.
Stripe handles saved cards, subscriptions, token-pack purchases, and refunds.
The Stripe webhook controller is the canonical entry point for billing state changes. It verifies webhook signatures, deduplicates events by Stripe event ID, and delegates side effects to transactional billing-service methods.
Invoices are rendered as PDFs with monotonic invoice numbers. Each invoice stores an HTML snapshot at issue time so future re-rendering cannot drift from what the user originally received.
Correlation Studio supports email/password login, Google OAuth, Microsoft OAuth, and JWT session tokens.
External identities map provider and provider user ID to the local user account. If a mapping exists, OAuth login signs in that user. If not, the flow either creates a new user or links the external login to the currently authenticated account.
The role hierarchy is strict:
Anonymous=1Registered=2Moderator=3Curator=4Administrator=5
Publishable entities have three visibility states:
| Visibility | Who Can See It |
|---|---|
| Private | The owner only |
| Public | Visible on Home feed, search, profiles, crawlers, and public entity pages |
| Workgroup | Members of the assigned workgroup |
Several straightforward hardening layers protect accounts and public surfaces.
Connection logs are retained for 30 days. They are useful not only for security but also for diagnosing slow routes and operational anomalies.
Every non-request-scoped task runs as an ASP.NET hosted service. The system relies on background workers for ingestion, cleanup, expiration, subscriptions, link verification, metrics, RAG indexing, thumbnails, active downloads, and more.
| Service | Purpose | Cadence |
|---|---|---|
IngestionQueueService | Dispatch dataset, experiment, search, index, and transform jobs | Every 2 seconds with per-lane caps |
DatasetCleanupService | Hard-delete soft-deleted datasets after grace period | Runner cycle with bounded time budget |
ExpirationService | Expire datasets based on configured expiration dates | Every 15 minutes |
SecurityCleanupService | Expire registration and password-change tokens | Daily |
UploadCleanupService | Delete old staging uploads | Every 15 minutes |
SubscriptionRenewalService | Renew subscriptions and send notices | Daily |
LinkVerificationService | HEAD-check references | Triggered on login with cooldown |
SystemMetricsCollectorService | Sample CPU, RAM, disk, connections, and slow queries | Every 30 seconds |
RagIndexWorker | Drain the RAG index queue and embed chunks | Continuous loop |
ActiveDownloadReconcileService | Recover stale active-download rows | Startup plus periodic sweep |
IngestionQueueService is the dispatcher for long-running analytical work. It has separate lanes for dataset jobs, experiment jobs, and tool jobs.
Each lane has its own concurrency semaphore. The dispatcher selects the oldest pending job up to the lane cap, flips it to running, and invokes the appropriate service.
An admin-controlled pause flag stops new dispatches without killing running jobs. This is important for maintenance: the operator can let active work drain while preventing new long-running jobs from starting.
A duplicate guard prevents concurrent jobs against the same entity. Without that, a stale redispatch could create two pipelines writing the same Parquet path.
A stale-job watchdog inspects long-running jobs. Jobs that are still reporting progress are left alone. Truly hung jobs are canceled.
Long-running jobs can be interrupted by process crashes, server restarts, or deploys. Startup recovery hooks repair common partial states.
Long-running work should not require perfect uptime. Every state that can be interrupted should have a deterministic next step on startup.
In-app notifications are stored as rows and surfaced through the bell UI. Transactional emails use database-backed templates for registration verification, password reset, login alerts, renewal notices, invoices, refunds, and related account events.
Development sends through Gmail SMTP with quota limits. A production migration to SES or SendGrid is planned if outbound email volume grows beyond lightweight usage.
The frontend is a React 18 and TypeScript SPA built with Vite. It is designed around rich client-side interaction: dataset creation, remote download progress, chart rendering, drag-and-drop portfolio editing, maps, feeds, search, and AI chat.
The frontend is not a thin CRUD shell. Several product-defining features happen in the browser: chart interaction, client-side thumbnail capture, wizard state persistence, optimistic feed updates, and search-as-feed navigation.
Server state lives in React Query. Query keys follow predictable shapes such as ['my-datasets'], ['experiment', id], and ['available-datasets-my']. Mutations invalidate the relevant keys on success.
One recurring frontend lesson is that similar pages may use different query keys for good reasons. Dataset creation, deletion, and publish actions therefore need to invalidate all surfaces that can show the entity, not just the page where the mutation happened.
Client state is intentionally narrow. Zustand stores the authenticated user, token, and authentication status in localStorage. A global confirmation-dialog provider replaces native browser confirms and alerts with themed, promise-based dialogs.
The frontend API layer exports named service objects such as datasetsApi, experimentsApi, discoveriesApi, and portfoliosApi. Each wraps an Axios instance with JWT handling.
A sliding-refresh interceptor watches authenticated responses. If the JWT is inside the refresh window, the client triggers a background refresh and updates the auth store. A failed refresh redirects to the login flow.
The dataset-create wizard is one of the most performance-sensitive parts of the frontend. It can host dozens of in-flight remote downloads, each polling progress every few seconds, while preserving enough state to survive navigation and refreshes.
Several optimizations were required after large remote-search runs exposed rendering and storage bottlenecks.
Wizard state is persisted to sessionStorage, but progress ticks can be frequent. Without debouncing, every tick triggered a full parse, mutate, stringify, and write cycle against a large state object. Autosaves now debounce on a short timer.
Downloaded file rows are isolated into a memoized component. Stable callback identities prevent every row from rerendering whenever one file receives a progress update.
A parent-level timer recalculates stalled status every 10 seconds. That allows “stalled at N MB” labels to update without forcing all rows to rerender on every poll.
The file list uses a nested flex layout with min-h-0 and overflow-y-auto, so the list scrolls inside the wizard instead of pushing the footer below the viewport.
A 42-file remote-search run with multi-gigabyte CDC files exposed three problems at once: autosave churn, row rerender storms, and layout overflow. Fixing all three turned the wizard from a stress case into a usable workflow.
Discovery thumbnails are generated both client-side and server-side.
The client-side path gives users immediate visual feedback when publishing or saving a discovery:
The server-side renderer uses SkiaSharp and the same scatter-bucket logic. Keeping the two paths visually synchronized requires discipline: changes to bucket math need to land in both renderers.
Search results reuse the same rich feed cards as the normal Home feed. Pressing Enter in the header search box navigates to a Home URL with query parameters rather than a separate search-results page.
/?q=<term> /?q=<term>&entityTypeId=<n>
The backend unifies both paths by extracting feed DTO construction into a shared pipeline. The normal feed reads rows from HomeFeed. Search produces synthetic feed references from ranked entity results. Both hydrate through the same DTO-building path and receive the same cards, thumbnails, sentiment state, and mini-chart payloads.
Search pagination treats results as one deterministic stream clustered by type: discoveries, experiments, datasets, portfolios, and posts. Each type ranks within itself, then fixed type-order concatenation produces stable infinite-scroll pages.
Search rows receive synthetic feed IDs so existing React keys and optimistic sentiment updates continue working.
Search is term-based rather than exact-phrase only. A shared ranking helper converts the raw query into search terms, a match predicate, and a relevance score.
Matching and scoring run server-side as expression trees over EF.Functions.ILike. The database evaluates the score through SQL CASE expressions rather than materializing every candidate into memory.
Because ILIKE '%term%' cannot use ordinary B-tree indexes, production search relies on pg_trgm GIN indexes over titles, descriptions, and post text. That turns multi-term OR searches into bitmap index scans instead of sequential scans across the corpus.
Rich search over user-facing text is not only a ranking problem. Without trigram indexes, a friendly multi-term query becomes an expensive full-table scan.
Version 2.4 focused on making Correlation Studio’s public corpus visible to search engines, AI crawlers, social-media unfurlers, and dataset discovery tools.
The corpus is large: public datasets, experiments, discoveries, and portfolios form a body of structured, citable analytical content. But before 2.4, that corpus lived behind a client-rendered React SPA. Non-JavaScript crawlers fetched entity URLs and received the same generic shell: a root div and script tag, without entity-specific title, body, or structured data.
Version 2.4 adds crawler-grade HTML without changing runtime behavior for human users.
EntityShellMiddleware sits ahead of the SPA static-file fallback. When a request path matches a public entity route, the middleware intercepts the shell and rewrites it before bytes leave the server.
The middleware injects three kinds of crawler-visible content.
The middleware resolves the public ID, loads the entity, and injects a real <title>, meta description, Open Graph tags, and Twitter Card tags. Entity thumbnails become social preview images.
A readable entity article is emitted inside <noscript>. Human users with JavaScript hydrate the SPA as before. Non-executing crawlers receive an indexable page body containing titles, analysis text, coefficients, dataset columns, and other entity-specific content.
Each entity receives structured JSON-LD in the document head. That gives crawlers a machine-readable description of the page, its entity type, breadcrumbs, authorship, dates, and distributions where applicable.
Visibility rules are enforced in the middleware. Private and workgroup-only entities do not leak metadata or noscript content to anonymous crawlers.
Each public entity maps to an appropriate schema.org payload.
Dataset with name, description, creator, modified date, keywords, license, and distribution.Article with the AI analysis as article body and source datasets linked through about.Article or CollectionPage.BreadcrumbList so crawlers can understand and traverse the site hierarchy.Structured data is generated through a shared builder, keeping entity-to-schema mapping centralized.
The original product had per-user topics: useful for organization, but not a global taxonomy. Version 2.4 adds a two-level controlled vocabulary that organizes content across users.
OntologyTopics defines curated parent and child topics, each with a stable slug, display name, and description.EntityTopics maps datasets, experiments, and discoveries to ontology topics through an entity-type plus entity-ID pattern.Assignment is automatic. A Claude Haiku classifier reads existing AI analysis text and returns allowed topic slugs from the fixed vocabulary. Out-of-vocabulary results are discarded.
This is intentionally cheap. It reuses analysis text the platform already paid to generate, and experiment topics inherit to public discoveries. That means classifying experiments and datasets can cover the discovery corpus without one classifier call per discovery.
Topic pages turn the public corpus into a navigable subject graph.
The /topics index and each /topics/{slug} hub are public, crawler-grade pages rendered through the same shell middleware used for entity pages. Each hub exposes a noscript body listing mapped entities, plus CollectionPage and BreadcrumbList JSON-LD.
Topic hubs also receive their own sitemap shard so crawlers can discover them directly.
Functionally, the hubs act as an internal-linking layer. They connect otherwise isolated entity pages by subject:
Topic index ↓Topic hub ↓Dataset / Experiment / Discovery ↓Breadcrumb back to topic
Topic hubs give crawlers a dense, navigable graph instead of a flat list of orphan URLs.
That matters because internal linking is one of the strongest levers for encouraging crawlers to explore a large site deeply.
Version 2.4 adds several mechanisms to shorten the lag between publishing content and having search engines revisit it.
When a public entity is published, unpublished, or deleted, the platform sends an IndexNow ping for the changed URL. The ping runs asynchronously from the same background lifecycle path that already handles RAG indexing, so publishing remains fast for the user.
Domain ownership is proven through the standard key-file-at-root mechanism.
The sitemap index now emits real MAX(ModifiedDate) values per entity type and shard instead of static build-time timestamps. Crawlers can therefore use conditional requests correctly and distinguish changed sections from unchanged ones.
Discovery scatterplot thumbnails and portfolio composite thumbnails are advertised through image-sitemap extensions. That exposes the generated visual corpus to image search, not just ordinary web search.
The public corpus is published as open data under CC BY 4.0, with attribution to the content creator.
It is exposed through three surfaces.
/llms-full.txtThe full public corpus rendered as one Markdown document containing titles, analysis, and coefficients. It is designed as a single-fetch corpus surface for language models.
/open-data/correlation-studio-public.ndjsonA streamed bulk dump with one self-describing JSON object per public entity. Streaming avoids a memory ceiling as the corpus grows.
/dataA human-facing landing page documenting the open-data surfaces, license, and NDJSON schema.
The open-data layer is also represented in structured data. Dataset JSON-LD includes a real license and downloadable distribution. The entire public corpus dump is itself typed as a schema.org Dataset, with distribution pointing to the NDJSON file.
That makes the corpus discoverable both as individual analytical pages and as a downloadable dataset in its own right.
The discoverability work runs against live production traffic. A 30-day operational snapshot through June 29, 2026 showed:
| Metric | Value |
|---|---|
| Requests | ~975,000 |
| Unique IPs | ~45,000 |
| Average request duration | 79 ms |
| Uptime | 100.00% |
| Error rate | 2.38% |
| Response mix | 2xx 93.2% / 3xx 4.4% / 4xx 2.4% / 5xx ~0% |
The average request duration held with SSR middleware in the request path because the shell rewrite is lightweight. It uses data the API already loads and string-templated HTML insertion, not a second round trip through an external rendering service.
The most valuable architectural lessons in Correlation Studio were not anticipated in design documents. They came from production failures, load tests, and real workflows that broke assumptions.
The advantage of a solo build is not that one person can predict every failure. It is that the loop between “this broke” and “this is fixed permanently” can be short.
Delete cleverness the moment the simpler design proves more durable.
Before the state-machine refactor, dataset ingestion used multiple flags: ingestion status, indexing status, queued flags, and related booleans. Every dispatcher race created another corner case.
The fix was a single DatasetStateId column backed by a DatasetStates lookup table. Experiments received the same treatment.
One column. One writer per transition. Deterministic startup recovery.
It was one of the highest-leverage structural cleanups in the project.
Token balances originally lived as a column on Users and were debited through read-modify-write. Under concurrency, lost updates produced silent drift.
The ledger forced correctness. Every debit now atomically updates the user balance and writes a ledger row whose balance field is the returned post-update value.
Drift recovery becomes deterministic: find the latest reset sentinel and sum every ledger movement after it.
The v1 storage design used row tables and index tables for bulk dataset content. It failed when write amplification overwhelmed disk I/O under concurrent ingestion.
The v2 migration replaced that entire path with one columnar Parquet file per dataset, stored in R2 and queried by DuckDB. Sharding disappeared. Thousands of lines of code were deleted. The frontend stayed unchanged.
The lesson was not “optimize PostgreSQL harder.” The lesson was “stop asking PostgreSQL to do the wrong job.”
A time-series bug exposed a subtle mismatch. The experiment join bucketed timestamps through one DuckDB expression. Drilldown queried the raw column through a different expression. The chart showed correlations because both sides of the join collided on the same bucket string, but drilldown returned no rows because the stored key shape did not match.
The fix was to centralize bucket-key generation. Both ingest and drilldown now use the same helper, and time-series buckets are formatted in UTC ISO 8601 form.
The lesson: if a derived key is used in more than one path, it deserves one canonical builder.
R2 upload retries initially failed to recover from a poisoned SChannel socket because the same client pool kept returning the same bad connection.
The fix was not more backoff. The fix was to replace the client and its underlying connection pool when the failure chain indicated a Windows TLS failure.
Retrying the same broken resource is not resilience. Sometimes the resource must be evicted.
The remote-download wizard did not become slow because of one bug. It became slow because autosave churn, row rerender storms, and layout overflow compounded under a realistic 40-file workflow.
Fixing only one layer would not have been enough. Debounced autosaves, memoized rows, stable callbacks, stall timers, and flex-chain layout all had to land together.
Rich workflow pages need performance design just as much as backend ingestion loops.
Stripe retries webhooks aggressively. Without idempotency on event ID, one checkout completion can produce multiple token credits.
The webhook controller now writes a deduplication row at the start of every handler and exits immediately if the event has already been processed.
For billing systems, “probably once” is not a delivery guarantee. Build for repeated delivery.
Fire-and-forget work launched from a request cannot safely reuse scoped dependencies from the request. The request scope may be disposed before the background task touches its first DbContext.
The canonical pattern is to inject IServiceScopeFactory, create a fresh scope inside the background task, and resolve the required service from that scope.
That pattern now underlies background downloads, RAG indexing paths, and ingestion queue dispatch.
In v1, users lived in the central database while analytical content lived in shards. Any accidental navigation property across that boundary could fail because the related entity lived in a different database.
The interim fix was a systemic ban on cross-shard navigation: either denormalize the fields needed for display or explicitly fan out through shard routing.
The v2 migration collapsed the system into one PostgreSQL database, so the constraint disappeared. The useful denormalized counters remained because they make common reads cheap.
Correlation Studio evolved through several major architectural generations. Although the implementation details changed dramatically, a small set of design principles remained constant throughout the project.
Expensive work should be performed once and reused everywhere. Dataset statistics, thumbnails, embeddings, chart data, and AI analysis are all persisted rather than regenerated on every request.
Whenever possible, computation moves to the data rather than moving the data to the application. DuckDB executes directly against Parquet files instead of materializing millions of rows through EF Core.
Local disk, R2 object storage, CDN caching, HTTP caching, database caches, and application caches all reduce latency while keeping expensive analytical operations off the critical request path.
Metadata is inexpensive. Raw analytical data is not. The application stores lightweight metadata in PostgreSQL while leaving high-volume numerical payloads in Parquet.
Every long-running workflow is designed to survive crashes, deployments, connection loss, or browser refreshes. Durable state always wins over in-memory state.
Large language models enhance the platform but never define its architecture. AI-generated results become first-class entities that users can edit, publish, search, cite, and build upon.
Correlation Studio is built almost entirely on proven open-source technologies, selected for stability, performance, and operational simplicity rather than novelty.
| Layer | Technology |
|---|---|
| Backend | ASP.NET Core 9 |
| Language | C# 13 |
| Frontend | React 18 + TypeScript |
| ORM | Entity Framework Core |
| Database | PostgreSQL + pgvector |
| Analytics Engine | DuckDB |
| Columnar Storage | Apache Parquet |
| Object Storage | Cloudflare R2 |
| Authentication | JWT, Google OAuth, Microsoft OAuth |
| Payments | Stripe |
| Embeddings | OpenAI + Gemini |
| Language Models | Claude, Gemini, Grok |
| Maps | Leaflet |
| Charts | Canvas, SVG, D3 |
Every major technology choice was made to support a specific architectural goal: DuckDB for analytical execution, Parquet for compact columnar storage, PostgreSQL for transactional metadata, and R2 for inexpensive durable object storage.
As of Version 2.4, the public corpus contains:
600+Datasets
8,000+Experiments
45,000+Discoveries
165M+Rows
15,000+Columns
Every public dataset can participate in future experiments. Every experiment can generate discoveries. Every discovery can become part of a portfolio, appear in the search index, receive AI analysis, be discussed socially, and become part of the retrieval corpus used by Corrie.
That network effect is the central design objective of the platform. Correlation Studio is not simply a statistics application. It is a knowledge graph built from statistical relationships.
Correlation Studio began as an attempt to make correlation analysis accessible to people who were not statisticians. Along the way it evolved into something much broader: a platform for discovering, publishing, discussing, and preserving statistical relationships at scale.
The engineering journey followed the same trajectory. Early versions emphasized functionality. Later versions emphasized scalability, durability, recoverability, discoverability, and operational simplicity. Many of the most valuable improvements came not from adding features, but from removing unnecessary complexity and replacing fragile assumptions with deterministic systems.
The resulting architecture combines transactional metadata in PostgreSQL, high-performance analytical execution through DuckDB, compact Parquet storage in Cloudflare R2, AI-assisted workflows, durable background processing, and a modern React frontend into a unified platform capable of mining millions of statistical relationships while remaining approachable to non-programmers.
Correlation Studio’s mission is simple:
Make data science accessible, make discovery collaborative, and make statistical insight searchable.
Correlation Studio
Rapid Discovery Mining • Causation Analysis
Version 2.4 is a major SEO and content release focused on improving the site’s reputation, discoverability, and visibility for both search engines and AI-powered crawlers.
Most of the improvements happen behind the scenes, but users will notice several significant new features.
A new Topics menu provides an ontological view of the entire platform, making it easy to browse datasets and discoveries by category. The Topics browser is fully available on both desktop and mobile.
Datasets and Discoveries now include a Reanalyze button that generates fresh AI analysis using Claude. Users can also edit the generated analysis directly from the detail pages.
The platform has been rebranded around its core mission:
Rapid Discovery Mining & Causation Analysis
A shorter version also appears throughout the site:
Discovery Mining • Causation Analysis
Don’t forget to try the new full-panel Search, introduced in Version 2.3.
Correlation Studio is a powerful SaaS platform that brings the insights of modern correlation data science to everyone—without requiring users to write code.
Users can upload their own data or discover new datasets through the AI-powered Dataset Wizard, driven by Claude, Gemini, and Grok. Imported datasets become the foundation for Experiments, which automatically compare every compatible column against every other column.
Each experiment produces reusable Discoveries containing:
Every Discovery becomes a searchable, analyzable knowledge object rather than a temporary calculation.
Datasets, Experiments, and Discoveries can be assembled into shareable Portfolios that combine:
Portfolios make it easy to present research findings in an engaging, interactive format.
Artificial intelligence is integrated into every stage of the workflow.
The platform also includes Corrie, an AI assistant that understands every public entity on the site.
Corrie is context-aware, allowing users to ask questions about the page they’re viewing, statistical concepts, or correlation analysis in general. She also serves as the platform’s curator, publishing hundreds of datasets, thousands of discoveries, and numerous example portfolios.
Every major entity—Datasets, Experiments, Discoveries, and Portfolios—can be published to the community feed.
Users can:
From its inception, Correlation Studio was designed as a social platform for statistical discovery.
Correlation Studio is designed for:
Many users may not realize that correlation analysis could improve their workflow until they see it in action. The platform makes advanced statistical discovery accessible to technical and non-technical users alike.
Even casual visitors and mobile users can explore the growing public knowledge base containing over:
Correlation Studio – Data Science Without the Code.
Curious to see what ChatGPT had to say about Correlation Studio since it’s only been out for a month. So I fed it the website and a whitepaper for analysis. I was surprised to see most of the reference samples coming up from my replies on Reddit. But the real story was in the whitepaper. Here are the highlights, but jump to the end if you want to see our rating on a 10-point scale.
After reviewing Correlation Studio and its architecture, I came away with a very positive impression. This isn’t a typical startup that wraps AI around existing analytics software. It reflects a carefully engineered analytical platform with a well-thought-out architecture and a clear understanding of the challenges involved in large-scale statistical analysis.
Below are my technical observations.
The most significant architectural decision was migrating from using PostgreSQL as both metadata store and analytical engine to a true lakehouse architecture.
Rather than attempting to optimize indexes indefinitely, the storage model itself was redesigned.
This mirrors many of the architectural principles used by modern analytical systems such as Snowflake, Databricks, ClickHouse, and MotherDuck, while avoiding the operational complexity of distributed infrastructure.
The separation of concerns is particularly clean:
Choosing DuckDB was probably the most important technical decision in the project.
Instead of building:
the platform leverages an extremely capable analytical database that already provides:
As a result, many future performance improvements arrive automatically through DuckDB itself.
This may be the most underappreciated aspect of Correlation Studio.
Traditional analytics platforms treat correlations as temporary calculations.
Correlation Studio persists them as first-class objects called Discoveries.
Each Discovery contains:
Instead of following the traditional workflow:
Run Query↓View Chart↓Discard Results
Correlation Studio models knowledge as:
Dataset ↓Experiment ↓Discovery ↓Portfolio
This makes statistical discoveries reusable rather than disposable.
Several implementation details demonstrate experience with messy real-world datasets.
These are not academic problems—they are operational ones encountered only after processing thousands of imperfect datasets.
Rather than inventing new statistical methods, Correlation Studio assembles proven techniques including:
Using established statistical methods alongside DuckDB and MathNet makes the platform significantly more trustworthy than many AI-first analytics products.
The primary scaling challenge isn’t dataset size—it’s the explosion of possible column pairs.
For example:
Even with excellent execution speed, brute-force analysis eventually becomes impractical.
Future optimization opportunities include:
One of the platform’s strengths is that AI explains statistical discoveries rather than replacing statistics altogether.
This architecture creates an important dependency inversion:
That makes the system considerably more durable than products that rely entirely on AI.
The greatest challenge may not be engineering at all.
It’s communicating what Correlation Studio actually is.
Initially, the name suggests a statistical calculator.
After examining the architecture, it’s much closer to:
The onboarding experience should emphasize outcomes instead of mechanics—for example:
Find hidden relationships between your own data and thousands of public datasets.
If I were contributing to the project, one feature I’d prioritize would be relationship graphs.
Imagine every Discovery becoming an edge in a knowledge graph:
GDP│Inflation│Interest Rates│Housing Prices│Building Permits
Rather than isolated discoveries, users could navigate connected variables and uncover indirect relationships across datasets.
What impressed me most wasn’t any single algorithm—it was the engineering maturity.
The architecture documents:
That level of transparency gives the architecture significant credibility.
Most analytics platforms answer questions users already know to ask.
Correlation Studio has the potential to answer questions users didn’t know they should ask.
That is a much more difficult—and potentially much more valuable—problem.
As the platform evolves, features such as relationship graphs, causal hypothesis generation, anomaly detection, and cross-domain exploration could make it feel less like traditional business intelligence software and more like a scientific discovery engine.
From a technical perspective, I’d rate the architecture around 9.5 out of 10 for a solo-built SaaS. The remaining work isn’t fixing the foundation—it’s building the next layer of capabilities that naturally extend an already solid design.
On June 1st, 2026, a powerful new SaaS statistics application was unveiled. Correlation Studio brings the insights of correlation data science to expert users looking for no-code solutions, and opens it up to anyone who wants to do research on correlation models. It combines search, experiment & discovery statistics workflows with a publishing platform and social media fabric.
Correlation Studio – Welcome
Users bring their own data in the form of CSV and Excel files, or discover it with our Dataset Wizard. By providing only a topic & description, the wizard harnesses the power of Claude, Gemini and Grok to source files for downloading, ingestion & analysis. In this example we’ve asked for S&P 500 datasets with a simple query and can see the providers have all returned results. The wizard has stripped out bad results and presents only validated links for downloading.

Stepping through the wizard, users can review the links sourced by the AI providers and select datasets for downloading based on rich descriptions, format & size parameters.

After downloading, the Datasets can be examined and column mappings provided. Each column that is selected will be included in the ingested dataset and available for Discovery drill-down. For Datasets with unique keys such as time-series data, a Join Key will be specified for one or more of the columns. This column is used to combine datasets together for correlation analysis. Two datasets must share common key values or the same row-order in order to be used in Experiments.

Using the Create Experiment workflow, users combine Datasets together and prepare to analyze each column against every other. Experiments can consist of Self correlations, which examine every column in a single Dataset against every other, Cross correlations which examine every column of every Dataset in the experiment, or Star correlations which examines an anchor dataset against every other. A Full correlations option is provided to merge all three correlation types.
Here we’re creating a Star Experiment for the S&P 500 Daily Time Series, so the Row Matching mechanism is Time Series. Other values include Shared Key, useful for connecting datasets with information like stock ticker symbols, and Row Offsets, used when the Datasets are from the same export process and known to be linked by cardinal offset.
Additional parameters to the Experiment workflow include Sample Size, which can be used to select idempotent samples of the target Datasets, Minimum Paired Rows and Minimum Sample Coverage which ensure healthy statistical assessments.

After the Create Experiments workflow is complete the user is presented with an overview of all of their Experiments. There may be many, as in this example of 50. Each Experiment row can be expanded to reveal the Discoveries it contains, and each Discovery row can be expanded to reveal its chart.

Depending on the parameters of the Experiment, the columns in the Datasets and the number of rows connected by there may be anywhere from one Discovery to thousands. The Experiments detail page includes all the metadata and parameters of the Experiment, and a table of Discoveries. Click through to each one for a drill-down view.

Choose the Sparklines view to see an overview of all the scatterplots rendered on the same page, useful for spotting trends.

Click-through on any Discovery to view a detailed page with metadata, scatterplots & line charts with drill-down capability and geospatial views where available. The datapoints, P-values, Granger Causality & other metadata are all part of the Discoveries that are analyzable.

Each Discovery can be analyzed with the Granger Analysis and AI Analysis buttons.

Users compile their Datasets, Experiments and Discoveries into shareable Portfolios that combine them with markdown & multimedia to make interactive presentations featuring their podcasts & lectures.

It’s AI all the way, from the Dataset Wizard to Claude analysis for Datasets, Discoveries and Portfolios.
Corrie is the site’s chatbot and she understands every public entity on the site. She’s context-aware so you can ask her questions about what you’re looking at, or more general questions about correlations or statistics. Corrie is also the curator, and has published hundred of Datasets and thousands of Discoveries on the site, as well as example Portfolios.

The entire bundle of entities – Datasets, Experiments, Discoveries & Portfolios – provide content for the site. Users can publish them to the Home feed, create Posts and add them as attachments. There’s sentiment feedback in the form of Likes, Ratings and Comments for all entities. The application was built from the ground up for socializing.

The target market is researchers, data scientists, students & teachers who already know they need to do correlation analysis as part of their workflow. However, the casually curious and users on mobile can still enjoy the content and our comprehensive Search, which uses Year, Topic & Description as parameters.
“Data science without the code” – tagline
What kind of data science? Correlations. Discover hidden statistical relationships in numeric data. Bring your own files or use the Dataset Wizard to compile your sources. Ingest and combine Datasets into Experiments which yield Discoveries in the form of scatterplots, line charts, and statistics. Combine your Discoveries into sharable Portfolios with markup and multimedia support. Apply AI analysis for deep insight into the correlations and causation in your data.
That’s a tall order. Nobody else does that.
If you need this kind of data science you already know it. Please join us. New accounts are granted 5,000 tokens to explore the creator workflows, a $5 value. Only available on wide screen displays. Mobile users have a limited, read-only version of the site, but have full access to Search, Corrie, and our unique corpus of curated correlation data.
What will you discover today?

I get asked a lot if I’m afraid of AI since it’s coming for my job and my answer is:
Not yet.
I already use AI in 75% of my professional work and it does 90% of the coding. I’m working for a global enterprise that has thousands of applications that will take a decade to modernize even with contemporary AI tools.
Still, the threat is real. I should be in a low-key panic, but I’m not and here’s why.
Everybody has access to AI now to build software just like everybody has access to guitars to make music. The tools are there.
It doesn’t mean you know how to use it. Or want to. Or can be good at it if even you want to be.
You have a pen & pencil, but that doesn’t make you an author. You have Excel, but you are not a mathematician. I have a guitar and I know how to use it, but I’m not Van Halen despite 45 years of trying to be.
Software development at scale still requires unique skills just like ripping out a guitar solo. It requires many years of dedicated learning effort, practice and failure, and most people wash out. Walk into any Guitar Center store for evidence.
What’s important is the domain knowledge that you bring to the table. What are you programming about?
Implementing something trivial is easy. Anybody can pick up a guitar and play a few cowboy chords, and everybody should. You could even write your own song, and you should do that, too. It won’t make you a musician, although you may be musically inclined. But it might make you happy.
Making a day planner with Lovable is fine in a world where everybody has their own day planner or their own calorie counter.
You could create those right now, with Lovable or Claude Code or whatever you get your fix off of. You could have your own website.
Why would you? It might make you happy, and if so you should do that. Something cool about that. Playing guitar makes me happy, and a few million other folks.
You might even pay for the pleasure of doing it.
But at the end of the day programming is not just about automation. It’s about creation. And generative AI technology helps a creative mind execute on its will. But what is the thinking behind the execution? It has to be something worth creating to bring any value to the equation. It has to be coherent, informed, disciplined and thorough.
I’ve been programming for 45 years, 35 of it professionally. But I don’t make my living by coding my whimsy. I make my living by learning other people’s problem domains and codifying them into rules-based systems. And that still requires knowing something about something.
“There are still humans in this equation, robot” – Rango Unmuzzled
See samples of my work:
http://rango.music

https://interactivecircleoffifths.com

Yesterday I used Claude for Chrome to help me storyboard a video for American Style. I gave him explicit instructions to browse the web for images and videos of 9/11 and the war in Afghanistan, preferring widescreen, high-resolution photos and videos. I added the requirement that they must all be open-source. I then provided the lyrics to the song we are storyboarding and asked him to create an index of the media related to the lyrics.
The song is 6 minutes long. I asked for 200 media items.
Claude proceeded to browse the web for over an hour, unattended, and identified 800 initial media items that matched the subject matter, from which it selected 200 based on my constraints. It created an index file which grouped into 10 sections according to the lyrics. which it cited inline.
That was impressive, but I needed to download all the media. So I asked Claude to crawl the list and download each piece of media, creating a new index file with local filenames. He started to doing the task, but after a few minutes stopped and prompted me with a question.
The web navigation was slow and prone to errors. He was having trouble downloading too many links and stopped. He said it was going to take 3-4 hours or more and was likely to fail to acquire all the media. Claude then proposed we create a Python script instead that would read the index file and download all the links. That sounded reasonable, so we proceeded to do that. Claude created the script and I ran it.
Pretty academic. The file was already in markdown language and easy to parse. Or so I thought.
The script failed immediately, zero downloads. What followed was an aggravating round of at least 15 revisions to that script as I copied non-trivial Python code and error messages back and forth with Claude, pointing out where it was incorrect and trying to get regular expression matches to work with spaces, underscores, slashes and invisible embedded characters from the file that he created.
That’s right, Claude got stuck on regex. Seriously stuck. All programmers should hate regex because they’re so damn complicated. They’re powerful, like a loaded gun. Claude kept telling me, “this is the final version” and “this will run correctly”. Fail. He finally said “we are going in circles” and he was right. I finally realized I was running against the Sonnet model, trying to conserve tokens since it was such an academic task. So I switched models to the latest Opus and in just a few revisions the script started working.
Mostly. It still failed to download after 10 files or so, blocked by the web server for being a greedy client. So I had to tune it to wait up to a minute between requests. I just started it, and it’s running as I write this. Estimated time to completion is 3-4 hours. I ran out of tokens for my session so the last round of debugging was on my own.
I found all of that astonishing given the great success I’ve had with Claude Code building the Interactive Circle of Fifths, and my latest full-stack application, Correlation Studio. Amazing, mind-blowing results and surreal conversations. Months worth.
But there he was, dead on the hill of regex. “We keep going in circles”. Famous last words.
American Style – Audio and Lyrics
The song in development is American Style. Featuring vocalist John Serrano and drummer Bill Ray, this track is a musical exploration of the psyche of American warfare post 9/11. It’s a hard rock / progressive metal track full of dramatic intensity and explosive power, drawing out the story of a sniper who goes to Afghanistan to avenge his countrymen. The collaboration was one for the books.
American Style on Spotify • American Style on Soundcloud • American Style on Bandcamp
I’m going to Afghanistan
That’s where I’ll make my final stand
Like Johnny Spann and everyone
Who loved him
I know you may not understand
I find myself the kind of man
To cross the line in the sand
For something
This is war, American style
Nightmares come of age
Deliver me from rage
This is war
The target looks like every man
I see him in the market stand
I could take him now
If I’m not wrong
Before he gets to Pakistan
I’ll track him through the mountains and
He’ll give me one reason
To be strong
This is war, American style
Nightmares come of age
Deliver me from rage
This is war
Winter’s come, it bites my hands
Like every day I’ve walked these lands
I’ve found another reason
For a man’s revenge
With every shot I’ve made my stand
So many men so much demand
I’ll never leave, I’ve joined the band
Until the very end
This is war, American style
Nightmares come of age
Deliver me from rage
This is war
Face to face, hand to hand
Fighting for the promised land
Every target, every man
Squeeze and go
I look at him and his demands
I know he’ll never understand
I’ve always had the upper hand
He will not know
This is war, American style
Nightmares come of age
Deliver me from rage
This is war
12 straight hours of debugging a production data issue and I finally had to excuse myself and call it a night.
I started by myself at 6:30 AM. By 9:00 I was hosting a meeting with another developer and a handful of managers. Also a few business people along with their confusing portfolio of queries, all manner of broken. This is a red-alert, all-hands-on-deck problem that is time-sensitive because it’s regarding year-end data.
It’s not right, that data.
So all day I’ve been debugging with people looking over my shoulder as I type and switch tabs and jump around, with managers asking questions and everybody speculating and expressing their concern and exasperation. At the start of the day the assumption was that I broke something spectacularly back in August. By noon we were onto something else, chasing data through linked tables and stored procedures that were created 20 years ago and haven’t changed in 10. By 3:00 we discovered the root cause.
By 6:30 PM we finally had a solution carved out in our test environment. I had made dozens of changes to jobs to simulate rolling back the clock so the year-end process could be run again. We were just about to pull the trigger on the offending jobs when I abruptly had a moment of doubt. I concluded it was unwise to proceed.
We’re not really going to do this in production tonight, are we?
I then explained why I thought it was a bad idea for me, specifically. Mental fatigue is real. 12 hours in the chair, no food, no breaks. But I suspected I was not the only one and I was right. Everybody was struggling to keep up with the details of the diagnosis, much less the workings of the fix. They were looking to me and my engineering partner to push the button and have a month’s worth of problems corrected without overlooking anything or introducing any side-effects. Nice work if you can get it. And what if something does go wrong? We’re potentially looking at another 12 hours.
So I called it. Not that I have the authority to do that. But we had consensus that fatigue was the deal breaker. It’s an emergency, but it’s not <that> kind of emergency. It’s a business emergency. It can wait until morning.
This is the senior part of senior software engineering. I’m old and tired. In my thirties this would not have been an issue, because I was young and foolish and frequently felt immune from fatigue due to my bipolar disorder. And I probably would have pushed to production, sunk the battleship and lost my job. Not today, rabbit. We take our medicine here, and observe doctor’s orders.
As for tomorrow? Thoughts and prayers. Good vibes and juju. Just remember to ask for a raise when this is all said and done.
I received some feedback about the Interactive Circle of Fifths on Facebook. I decided to take the conversation to Claude Opus 4.5. The following is my conversation verbatim. It reveals a depth of reasoning that is astounding, and provides guidance for next steps. For an introduction to the online tool, refer to the About page.
The conversation:
I have some feedback regarding the Inner Ring. I’m not sure it’s working how people might expect, with regards to the diminished scales. For the Cmaj/Amin wedge, what is the appropriate scale to display? It’s currently showing Bdim because that’s the chord. However an argument could be made that it should be something else (I think Cdim). This was the feedback:
The feedback:
What is the inner circle supposed to be exactly? The whole half symmetric? It’s in the wrong spot on the circle. The one at 12 o’clock would occur at 9 o’clock Eb/C minor over a G13b9 spelled as: G Ab Bb (B nat) C# D E F. It would have a dominant function over G (V7), using Bº7 and C#º7 as upper structure both tensions as upper and lower leading tones pointing toward C tonic, but the MINOR version not the major version. Without a tonal context (5 in the bass) the symmetrical scale is ambiguous just like a chromatic or whole tone scale would be. The scale that should be in the C/A minor family 12 o’clock should be rooted “E F G G# A# B C# D”. At 3 o’clock here.
My response:
The layout is simple, the diminished chord in Cmaj is Bdim. The scale can be toggled and it’s Bdim as people would expect. I don’t understand why you would want to deviate from that. It makes sense as it is even if there are other ways to express the inner ring.
His follow-up:
The B diminished chord in C major is a half diminished. BDFA. The purpose of the “diminished scale” is that it derives from the FULLY diminished 7, from the raised 7th degree from C MINOR. Bb->B natural gives BDFAb. The variable 7th is even a part of the actual scale for this reason, you get to use both notes. You have the scales associated with the wrong key signatures inside the circle. The extra symmetrical “out notes” are accidentals against the minor key’s key signature.
Can you make sense of his feedback? Is the current behavior what a someone deeply familiar with music theory would expect?
Claude’s Response:
This is a deep music theory question. Let me parse through the feedback carefully.
The user’s friend is making a sophisticated point about diminished scales and their harmonic function. Let me break it down:
Current Implementation:
Half-diminished vs Fully diminished distinction:
In C MAJOR, the naturally occurring vii chord is B half-diminished (B-D-F-A) – also written Bø or Bm7♭5
The FULLY diminished seventh (B-D-F-Ab) comes from C HARMONIC MINOR (which has B natural, not Bb)
The diminished scale (whole-half or half-whole symmetric) is specifically associated with the FULLY diminished 7th chord
The scale’s harmonic function:
The diminished scale has dominant function – it resolves somewhere
The scale built on B (B C# D E F G Ab Bb) functions as a dominant over… not C major, but rather as part of a V7 chord
The friend is saying this scale should be associated with the key it RESOLVES to, considering its dominant function
What scale should be at C/Am?
The friend suggests “E F G G# A# B C# D” should be at the C/Am position
This is the E diminished scale (half-whole)
Why? Because E°7 (E-G-Bb-Db, enharmonically E-G-A#-C#) is actually a rootless G7b9, which resolves to C minor
Let me think about this more carefully…
In jazz theory:
The diminished scale’s symmetrical nature allows for multiple interpretations. Its half-whole mode creates tension that wants to resolve, essentially acting as a sophisticated dominant chord generator. The scale’s chromatic inflections provide rich harmonic possibilities across different musical contexts.
The diatonic half-diminished chord in C major (B-D-F-A) differs from the fully diminished B°7 chord. This distinction matters because it changes how the chord functions within the harmonic progression. The scale choice depends on the specific harmonic intent – whether creating tension, preparing resolution, or exploring chromatic substitutions.
The key insight is understanding how these theoretical approaches – traditional and jazz functional harmony – offer different lenses for analyzing the same musical material. Each perspective reveals unique aspects of the chord’s potential and its role in harmonic movement. The B°7 chord in C harmonic minor serves as a dominant leading to Cm, highlighting the scale’s unique harmonic structure.
Examining the C/Am wedge reveals multiple potential diminished scale perspectives: C major’s Bø7, A minor’s G#°7, and C minor’s B°7. Each context offers a different harmonic interpretation.
The proposed E half-whole diminished scale demonstrates fascinating symmetry, connecting to multiple diminished seventh chords across different enharmonic equivalents. Its intricate relationships reveal the complex interconnectedness of harmonic systems.
A closer look at the current implementation shows discrepancies in scale placement and root selection, suggesting a need for careful recalibration of the theoretical framework.
The friend’s proposed scale follows a precise half-whole diminished pattern starting from E, with specific interval relationships that differ from our current configurations. I’m tracking the unique intervallic structure: half-whole-half-whole-half-whole-half-whole, which creates a distinct harmonic landscape.
This scale’s unique construction implies potential shifts in our current musical mapping, particularly around the C/Am and related harmonic regions.
The symmetric nature of the diminished scale suggests multiple valid interpretations, with each chord potentially serving multiple functional roles depending on harmonic context. The E°7 chord exemplifies this complexity, revealing potential resolutions to F, F#, and other tonal centers.
The core insight emerges from the scale’s inherent ambiguity – its symmetric structure allows multiple theoretical perspectives, challenging straightforward categorization.
Two primary approaches emerge: the scale-degree method versus functional harmony interpretation. While the scale-degree approach offers pedagogical clarity by associating the diminished scale with the seventh scale degree, the functional harmony perspective emphasizes harmonic resolution and contextual meaning.
The current implementation provides a consistent, learner-friendly framework, but risks oversimplifying the nuanced relationships within diminished harmony.
For an educational tool targeting musicians at different skill levels, maintaining the straightforward approach while acknowledging the complexity for advanced practitioners seems most effective. The goal is to balance accessibility with theoretical depth, allowing musicians to grasp foundational concepts before exploring more intricate harmonic relationships.
Claude’s Summary:
Your friend raises valid points from a jazz/functional harmony perspective. Let me break down the theoretical disagreement:
The Core Issue: Half-diminished vs Fully-diminished
Your friend is correct that in C major, the diatonic vii chord is actually B half-diminished (Bø7): B-D-F-A. This uses only notes from the C major scale.
The fully diminished B°7 (B-D-F-Ab) comes from C harmonic minor (C-D-Eb-F-G-Ab-B), where the raised 7th degree creates the leading tone.
The diminished scale (the symmetric whole-half or half-whole pattern) is specifically associated with fully diminished 7th chords, which arise from minor keys, not major.
Two Valid Approaches:
| Approach | Logic | At C/Am wedge |
|---|---|---|
| Current (scale-degree) | Show diminished scale on the 7th degree of the major key | B dim scale |
| Functional harmony | Show the diminished scale that resolves to this key (from its dominant function) | Different association |
Your friend’s specific argument:
The diminished scale functions as a dominant — it wants to resolve somewhere. The B°7 chord (and its associated scale) functions as part of G7b9, which resolves to C minor, not C major. So showing B diminished scale at the C/Am wedge conflates major and minor in a way that’s theoretically inconsistent.
My assessment:
Your current implementation is pedagogically sound and consistent — it follows a simple rule (7th degree = diminished) that’s easy to understand. Many circle of fifths tools take this approach.
However, your friend is right that a jazz theorist might expect different behavior, since the diminished scale’s natural harmonic context is minor keys with dominant function.
Options:
Would you like me to explore option 2, or do you prefer keeping the current straightforward approach?
Resolution:
Based on Claude’s analysis, I’ve decided to add menu to the UI. When a diminished scale is selected the control will become active, allowing the user to switch between Scale Degree / Harmonic Function / Minor Tonic. This supports more advanced users while keeping the default workflow as simple as possible.
If you’ve ever wondered why certain chords sound good together, why some key changes feel smooth while others feel jarring, or how musicians seem to instinctively know which notes will work over a chord progression, the Circle of Fifths holds the answers.
This post is about the Interactive Circle of Fifths. To provide feedback on the application simply reply to this post with your questions and comments. To see a video of the application in use, watch The Interactive Circle of Fifths on YouTube.
The Circle of Fifths is a visual map of all 12 major and minor keys, arranged to reveal the relationships between them. It’s been a cornerstone of Western music theory for nearly 300 years, and once you understand it, you’ll start seeing patterns everywhere in music.

The name comes from how the keys are arranged: as you move clockwise around the circle, each key is a “perfect fifth” higher than the one before it. A perfect fifth is the interval of seven half-steps—from C to G, from G to D, from D to A, and so on. This interval has a natural, consonant quality that our ears find pleasing, which is why it forms the backbone of so much music.
The outer ring displays all 12 major keys. Starting from C major at the top (which has no sharps or flats), moving clockwise adds one sharp to the key signature with each step:
Moving counter-clockwise from C adds flats instead:
Click any major key on the outer ring to hear its major chord and see the notes highlighted on the piano or guitar below.
Every major key has a “relative minor”—a minor key that shares the exact same notes and key signature. The relative minor is always found three half-steps below the major key’s root note.
For example, A minor is the relative minor of C major. Both keys use only the white keys on a piano (no sharps or flats), but they have different tonal centers. C major sounds bright and resolved when you land on C; A minor sounds darker and resolved when you land on A.
This relationship is powerful for songwriters. You can shift between a major key and its relative minor within a song to change the emotional color while keeping all the notes familiar.
The inner ring changes based on the dropdown menu selection:
Diatonic shows the complete seven-note major or minor scale. These are the notes that “belong” to the key and form the basis for melody and harmony.
Pentatonic displays the five-note pentatonic scale. This scale omits the two notes that create the most tension in the full scale, making it incredibly versatile for melodies and improvisation. The major pentatonic works over major chords; the minor pentatonic is the foundation of blues and rock soloing.
Blues shows the six-note blues scale. This is pentatonic scale with the the addition of a flatted-fifth. This extra note gives a distinctive bluesy sound.
Diminished shows the diminished triad built on the seventh degree of the scale. This tense, unstable chord naturally wants to resolve and is useful for creating movement in chord progressions.
Click the inner ring when a scale is selected to hear the scale played ascending and descending.
The small staff notations around the circle show each key’s signature—the sharps or flats that apply throughout a piece in that key. Rather than writing a sharp or flat symbol every time you encounter that note, the key signature tells you once at the beginning: “Every F in this piece is F#” (in G major, for instance).
Notice the elegant pattern: sharps accumulate in a specific order (F#, C#, G#, D#, A#, E#) and flats in the reverse order (Bb, Eb, Ab, Db, Gb, Cb). Musicians memorize these patterns, making it quick to identify any key at a glance.
Below the circle, the harmonized scale staff shows something essential: the seven chords that naturally occur within a key.
When you build a triad on each note of a major scale using only notes from that scale, you get a specific pattern of chord qualities:
| Scale Degree | Roman Numeral | Quality | Example in C |
|---|---|---|---|
| 1st | I | Major | C |
| 2nd | ii | Minor | Dm |
| 3rd | iii | Minor | Em |
| 4th | IV | Major | F |
| 5th | V | Major | G |
| 6th | vi | Minor | Am |
| 7th | vii° | Diminished | B° |
This pattern—Major, minor, minor, Major, Major, minor, diminished—holds true for every major key. The Roman numerals (uppercase for major, lowercase for minor) let musicians discuss chord progressions in a universal language. When someone says “it’s a I-IV-V progression,” that means the same chord relationships whether you’re in C major (C-F-G) or E major (E-A-B).
Click any chord on the harmonized staff to hear it and see its notes on the instrument display.
For minor keys, the pattern shifts to: minor, diminished, Major, minor, minor, Major, Major (i, ii°, III, iv, v, VI, VII).
When you select a key, notice that the adjacent keys on either side are highlighted. These are your closest musical neighbors—the keys most closely related to your current key.
Moving one position clockwise adds one sharp (or removes one flat). Moving counter-clockwise adds one flat (or removes one sharp). These neighboring keys share six of their seven notes with your current key, differing by only one note.
This is why chord progressions often “borrow” chords from neighboring keys—they fit naturally because most of the notes are already part of your key. It’s also why modulating (changing keys) to an adjacent key sounds smooth, while jumping to a distant key on the circle sounds dramatic or surprising.
Toggle between piano and guitar displays to see how scales and chords map onto each instrument.
Color coding:
On guitar, you’ll notice that scale patterns repeat in predictable shapes due to the instrument’s tuning. On piano, the relationship between black and white keys makes certain keys feel quite different under your fingers, even though the theoretical relationships are identical.
Finding chords that work together: Any chord from the harmonized scale will sound good with any other chord from that same key. The I, IV, and V chords form the backbone of countless songs. Adding the vi chord gives you the ubiquitous I-V-vi-IV progression heard in hundreds of pop songs.
Transposing songs: If a song is too high or low for your voice, use the circle to find a new key. The relationships between chords stay the same—a IV chord is still a IV chord—only the letter names change.
Understanding songs you hear: When you recognize that a song moves from I to vi to IV to V, you’re hearing the underlying structure that makes it work. This pattern recognition accelerates learning new music.
Improvising and soloing: The pentatonic scale is your safety net. Staying within those five notes virtually guarantees you won’t hit a “wrong” note over chords in that key.
Writing chord progressions: Experiment with the harmonized scale. Try moving between chords that share notes (like C and Am, which share C and E). Try the tension of the vii° chord resolving to I. Try borrowing the IV chord from the parallel minor for a melancholy color.
The Circle of Fifths isn’t just a reference chart—it’s a tool for discovery. Click around, listen to how keys relate to each other, notice which chords share notes, hear how scales differ between major and minor.
Music theory sometimes gets a reputation for being dry or restrictive, but it’s really a map of possibilities. The Circle of Fifths shows you why certain things work, freeing you to make informed creative choices rather than stumbling in the dark.
Whether you’re writing songs, learning an instrument, studying for a theory exam, or just curious about how music fits together, spend some time with the circle. The patterns you discover will start appearing everywhere you listen.
11/15/2025 – Initial build.
11/18/2025 – Initial release v1.0.0 Supporting all features in the prompt.
11/25/2025 – Feature release v1.1.0 Added samples & new diatonic scale mode, updated pentatonic mode.
11/27/2025 – Feature release v1.2.0 Added 15 frets to guitar fretboard, stacked triads for chords.
11/28/2025 – Feature release v1.3.0 Added harmonized scale section and new guitar samples.
11/30/2025 – Feature release v1.4.0 Added complete set of .mp3 guitar samples for all 48 notes E1 to D6.
12/15/2025 – Feature release v1.5.0 Added Diminished Scale to Inner Ring menu.
12/18/2025 – Feature release v1.6.0 Added legend for dots on piano and guitar, swapped green & blue colors.
12/21/2025 – Feature release v1.7.0 Added support for Blues Scale in Inner Ring.
12/31/2025 – Feature release v1.8.0 Added Chord Sequencer panel sharing space with the Harmonized Scale control.
01/04/2026 – Feature release v1.9.0 Added Function selector for Diminished Chord and Scale modes with Scale Degree and Resolves To options.
This week I decided to challenge Claude Opus 4.1 to a coding challenge. The objective? Create a React application that renders an interactive circle of 5ths. I started by creating a scaffolding application using the Visual Studio React / .Net Core template. I then created an empty Typescript component called “InteractiveCircleOfFifths” and proceeded to throw down with the LLM.
The prompt:
Create a React component using Typescript that renders a musical circle of 5ths onto a canvas. The circle will consist of 3 bands, Major, Minor and Notes. Include the Major keys on the outermost band of the circle, Minor keys on the inside of the Major band, and Notes in the middle. Each key should render it’s slice of the pie in a different pastel color, and each band of that pie a slight variation of the key color. Do not use yellow or bright green.
The Major key labels will be the key itself, like F C G, etc. The minor key labels will also be the key name but in a smaller font since they’re inside the circle. The bottom label will be Gb/F# since it’s enharmonic. The key label will be the most pronounced text in each ring. Add text to the key, closer to the center of the ring indicating the three notes in the chord, for example CEG in the case of C Major. The Notes will be a label consisting the best notes to play in that key, which corresponds to the clockwise adjacent 5 keys. In the case of C that would be C G D A E. The label should render vertically in a small font since it’s the innermost band. Rotate the label so it’s always descending towards the center of the circle.
Outside of the circle add a ring of images showing a short treble musical staff for each key showing the sharps or flats. The sharps or flats will render on the staff lines left to right, and be placed on the staff lines according to their pitch. Add a label to each image indicating the count of sharps or flats. The bottom three key signatures will have two images showing sharps and flats respectively. Always make the staff background white, with the lines and sharps or flats in black. Make the circle interactive. Clicking anywhere on the circle highlights that key, the one before it and the one after it. Draw an outline around them and make them stand out by adding a low-opacity color to the overlay. Clicking a Major or Minor chord also causes the corresponding chord to be played, either using a .wav or .midi file. Make sure to use the proper inversions, so that C Major is CEG, A minor is ACE. Clicking on the Notes band will play the notes in ascending sequence.
At the bottom of the page render a fixed panel that displays either a piano keyboard or a guitar fretboard, best on a menu selection. For the piano, selecting a key on the circle will add an overlay of dots on the piano keyboard corresponding to the notes in the chord, either major or minor. Use a green dot for the root (C for C Major, A for A minor), and blue dots for the other two notes. The same note may appear more than once. Selecting a Notes band will render all of the notes on the keyboard, using green for the root, blue for chord tones, and soft orange for the other notes. Selecting a key on the piano keyboard will cause the corresponding key signature to become active in the circle.
For the guitar add blue dots corresponding to the notes of the chord. For the guitar, selecting a key on the circle will add an overlay of dots on the guitar fretboard. Use a green dot for the root, paying attention to the inversion between major and minor, and blue dots for the other two notes. Selecting a Notes band will render all of the notes on the fretboard, using green for the root, blue for chord tones, and soft orange for the other notes. The notes may appear multiple times on the guitar fretboard. Selecting a note on the guitar will cause the corresponding key signature to become active in the circle. When rendering the guitar fretboard, add a vertical line for the nut, effectively fret zero. Open strings should render left of the nut. Fretted strings should render to the right. Make sure the notes rendered on the fretboard are in the correct position. Previous attempts to render this resulted in the notes being off by one fret because of the ambiguity of the first fret.
The response:
Perfect! I’ve created a comprehensive interactive Circle of Fifths application with all the features you requested:
The application correctly handles musical theory including proper inversions, enharmonic equivalents, and accurate fretboard positioning. The guitar implementation specifically addresses your concern about notes being off by one fret by clearly separating the nut position and properly placing notes either as open strings or between the correct frets.
My reaction:
I was amazed at the thoroughness of the response, and impressed with the component source code. It rendered exactly as prompted, with some minor flaws. The Claude web app actually renders the component right alongside your chat, so you can inspect it with a web debugger and make changes to styles interactively.
The component:

The debugging:
After generating the initial component there was still a lot of debugging to do. I used Claude for that as well, synchronizing my Visual Studio copy with every iteration, synchronizing with my Github repo after every change. Here’s a sample debugging prompt:
Update the Typescript component.
1) Change the rendering of the Note band so the labels never go past 90 degrees, they are currently rendering upside-down for the bottom half of the circle. 2) Adjust the rendering positions for the sharp notes on the staff, the y-axis position is a few pixels too high. 3) Fix the rendering of the notes on the piano so every note on the keyboard that’s in the Major Chord, Minor Chord, or Notes array gets a dot. Currently some key signatures are missing higher notes. 4) Fix the colors on the dots. Currently orange dots are being rendered when the note is in both a chord and the notes array.
Claude’s response:
I’ll update the TypeScript component to fix these issues.
Fixing the Notes band rotation to prevent upside-down text
Adjusting the Y-axis position for sharp/flat notes on the staff
Now I need to fix the piano rendering to show all octaves of the notes and fix the color logic:
Fixing piano rendering to show all instances of notes and correcting dot color logic
Now let’s also fix the same color logic for the guitar to be consistent:
Fixing guitar dot color logic to match piano
Checking the drawPiano section to ensure we have the complete updated code
Let me check the complete Notes band section to verify the fix:
Checking the updated Notes band rotation logic
Perfect! I’ve updated the TypeScript component and created a new component file with all four fixes:
The application now properly displays:
Orange dots: Only appears in “notes” mode for notes that aren’t part of the chord
Green dots: Root note only
Blue dots: Chord tones (non-root notes in the chord)
The deployment:
After an entire day of debugging the application was ready to roll out. I initially tried to deploy to Azure, and met with a litany of deployment failures. However, using the job logs provided by Azure, over the course of several hours I was able to debug those failures with the help of Claude and get the application deployed on the Azure websites web app hosting environment. It was even deploying from my Github repo. I was using the Free tier, so when it came time to use my newly registered domain name I was faced with a stiff upgrade fee of $60/month for the Basic tier. That was too rich for my blood, so I found an alternative hosting provider. For $10 a month Winhost provides .Net application hosting. After downloading their Publish template I was able to publish directly from Visual Studio without any hassle.
You can view the deployed application at InteractiveCircleOfFifths.com.
My feedback to Anthropic:
This is incredible. My first shot at the component was 90% complete and accurate. A few hours of revisions and debugging were enough to bring this entire project over the finish line. The refinements were really impressive. Claude demonstrates contextual awareness on a massive scale and shows you the chain of reasoning behind its changes.
Note that a few times Claude generated component code that wouldn’t render. I eventually had to paste back from Visual Studio into Claude to revert to a prior version of the component. Also, Claude attempted to provide a full Html solution that was unnecessary and failed to render every time due to dependency errors with Tailwind.
The only deal killer in the entire process was cost. I had to bump up to the $100/month plan and even then quickly exhausted my quota. This seems steep for 1200 lines of code. I’ll have to downgrade after the month is expired.
Your feedback to me:
This entire process took only three days from inception to deployment. Early feedback was generally positive, but people don’t care for the quality of the tone generator sounds. I will correct that in a future release by converting from the tone generator to using samples. This will take some time to code and record all the samples for both guitar and piano. I have also had a request to add a ukulele visualizer. Please let me know what else you’d like to see in the application by commenting right here on the blog.
Thank you for using the Interactive Circle of Fifths. I look forward to your feedback!