Correlation Studio Whitepaper
Building a statistical discovery engine with DuckDB, Parquet, Cloudflare R2, PostgreSQL, and AI-assisted causation analysis.
Correlation Studio is a vertically integrated correlation-analysis SaaS built to mine relationships across datasets, surface statistically meaningful findings as persistent knowledge objects, and make advanced data science usable without code.
Executive Summary
Most analytics software answers questions users already know to ask. Correlation Studio was built for a different problem: discovering relationships users did not know were there.
The premise is simple but technically demanding. A user uploads or discovers tabular datasets, pairs them into experiments, and the system exhaustively compares numeric columns against one another. When a relationship clears the configured threshold, Correlation Studio promotes it into a first-class object called a Discovery.
A Discovery is not a temporary chart. It is a persistent, publishable, analyzable artifact containing statistical metadata, visualizations, regression overlays, confidence intervals, Granger-causality testing, source-row drilldowns, and AI-generated interpretation.
The Core Bet
A one-person team armed with the right primitives for discovery mining and causation analysis can deliver useful analytical insight faster than a larger team using less integrated tools.
That belief shaped every architectural decision. Correlation Studio favors simplicity over optionality: one database, one app server, one object storage account, one embedded analytical engine, and one consistent set of product abstractions.
Whenever the system had to choose between a flexible-but-complex pattern and a rigid-but-simple one, the simpler design won. Per-user database shards were collapsed into a single PostgreSQL database. Bulk analytical rows were moved out of PostgreSQL and into Parquet. In-row binary payloads were replaced with per-discovery columnar files. Multiple state flags were consolidated into simpler state machines.
The result is a product that is technically ambitious but operationally small enough to run as an independent SaaS.
Default-Alive by Design
Correlation Studio was designed around a severe constraint: it needed to be sustainable before it became large.
The free tier is genuinely free. New users receive 10 GB of storage and can browse the public corpus without a recurring charge. Revenue comes from token packs that fund ingestion, correlation work, AI search, AI analysis, and chatbot usage.
At roughly $5 per month of token spend per actively engaged user, about 60 engaged users cover the platform’s operating run rate: approximately $300 per month for two VPS instances, Cloudflare, R2 object storage, and the domain.
At larger scale, the infrastructure cost profile stays intentionally flat. Object storage is metered by byte with zero egress through Cloudflare R2. Compute grows in fixed steps. AI API spend is the primary engagement-scaled cost, and the token margin exists specifically to cover it.
Capital Efficiency as Architecture
The system is not only optimized for query speed. It is optimized so the business can remain alive while the corpus, audience, and go-to-market motion are still forming.
The Crucible: When v1 Broke
Correlation Studio was built solo over roughly five months. Commit zero landed on December 29, 2025. Around 570 commits later, the project had become a production SaaS.
It did not get there in a straight line.
The original v1 architecture used a central PostgreSQL database plus per-user content shards. Parsed source data lived in relational tables named Rows, KeyIndex, and ValueIndex.
That design worked until realistic ingestion traffic arrived: stress tests designed to break it.
Wide datasets generated enormous write amplification. A CSV with thousands of columns and millions of rows could produce hundreds of millions of B-tree inserts. On SATA-backed shards, that turned into IOPS queue starvation. Under realistic ingestion load, the system collapsed at about 10 concurrent active jobs.
The fix was not a tuning pass.
It was a clean-room replacement of the analytical substrate.
The v2 Lakehouse migration, completed on May 17, 2026, deleted all content sharding, retired the bulk-data PostgreSQL tables, moved source data into columnar Parquet files on Cloudflare R2, and replaced row-range index scans with DuckDB queries against Parquet.
Roughly 5,000 lines of C# disappeared in the cutover. More importantly, the frontend never noticed. Dataset, experiment, and discovery API contracts survived the migration intact.
v1: PostgreSQL as data lakeUpload ↓Rows / KeyIndex / ValueIndex ↓B-tree write amplification ↓IOPS starvation under ingestion load
The original architecture failed because row-oriented storage was doing analytical storage’s job.
v2: Lakehouse substrateUpload ↓Parquet on Cloudflare R2 ↓Local SSD hot cache ↓DuckDB analytical queries ↓PostgreSQL metadata only
The v2 architecture separates metadata, bulk storage, and analytical execution.
The Current Architecture
Correlation Studio’s current architecture is deliberately compact. It uses a single PostgreSQL database for metadata, Cloudflare R2 for immutable bulk data, DuckDB for analytical execution, and a .NET application server to coordinate ingestion, statistics, AI workflows, billing, and the public web app.
PostgreSQL 17
Stores users, billing, datasets, experiments, discoveries, jobs, audit records, and RAG vectors through pgvector.
Cloudflare R2
Stores bulk row data as columnar Parquet files: one Parquet file per dataset and one per discovery datapoint payload.
DuckDB
Runs embedded inside the .NET app server and queries Parquet through a local SSD hot cache or R2’s S3-compatible interface.
.NET 10 API
Provides the ASP.NET Core backend in a three-project solution: Core, Infrastructure, and Api.
React + TypeScript
Powers the SPA with Vite, canvas-rendered charts, route-level code splitting, React Query, and Zustand.
AI Providers
Claude, Gemini, and Grok run in parallel for dataset search, AI analysis, and Corrie, the in-app RAG chatbot.
The architecture is not cloud-native in the fashionable sense. It is not a swarm of microservices. It is not built around Kubernetes. It is a vertically integrated system with a small number of moving parts, each chosen because it eliminates an entire class of complexity.
Product Overview
Correlation Studio turns datasets into reusable statistical knowledge. The user workflow is intentionally linear: create data, combine data, discover relationships, explain relationships, and publish the results.
- Sign In Users can sign in with email and password, Google OAuth, or Microsoft OAuth. New accounts receive 10 GB of storage and zero tokens. The platform remains free until the user wants to ingest data, run AI search, or perform AI analysis.
- Create a Dataset Datasets can be created by uploading local files, pasting URLs to remote files, or using the AI-powered Remote Search workflow. Remote Search asks Claude, Gemini, and Grok for relevant open-data URLs. Every dataset is assigned a Topic, receives inferred column types, and produces both a metadata row in PostgreSQL and a Parquet file in R2.
- Create an Experiment An experiment pairs two datasets, chooses a join strategy, and sets an absolute correlation threshold. The experiment ingester evaluates every numeric-column pair and emits one Discovery for each pair whose absolute correlation exceeds the threshold.
- Browse Discoveries Discoveries can be explored through scatterplots, line graphs, residual plots, heatmaps, Pearson-vs-Spearman divergence charts, bubble charts, sparklines, lag analysis, rolling correlation, regression overlays, and correlation networks.
- Drill Into Source Rows Every chart datapoint can be traced back to the source rows on both sides of the join. For time-series joins, the drilldown is bucket-aware: a single day can collect all X and Y rows whose timestamps fall inside that bucket.
- Analyze and Reanalyze Datasets, experiments, discoveries, and portfolios can be analyzed with Claude, Gemini, or Grok. The resulting Markdown narrative is pinned to the entity. Once analysis exists, the action becomes Reanalyze, replacing the prior model output with a fresh one. Datasets and Discoveries also expose an edit path so owners, and in some cases moderators, can refine the analysis text manually.
- Publish Datasets, experiments, discoveries, portfolios, posts, and forum questions can be published to the Home feed and made searchable.
- Compose Portfolios Portfolios are block-based pages that combine Markdown, images, embedded datasets, experiments, discoveries, analysis output, audio, and video into interactive presentations.
- Transform Data The Tools page supports derivative dataset operations such as pivot, melt, normalize, resample, rank, bin, outlier removal, missing-value filling, merge, join, and append workflows.
- Ask Corrie Corrie, the in-app RAG chatbot, can answer product questions over the user guide and content questions over public or user-owned entities. A user can ask how to use the product or ask Corrie to summarize a set of published experiments.
The Product Is a Relationship Graph
Most analytics tools treat correlations as temporary calculations. A user runs a query, sees a chart, and the result disappears unless someone manually saves it.
Correlation Studio uses a different abstraction. It treats each meaningful relationship as an entity.
Dataset ↓Experiment ↓Discovery ↓Portfolio
The core entity chain turns raw data into reusable, publishable knowledge.
That model matters because it allows discoveries to accumulate. A Discovery can be visualized, analyzed, commented on, published, embedded, rated, searched, indexed for retrieval, and assembled into a portfolio.
The result is not just an analytics app. It is a corpus of statistical relationships.
The Token Economy
Correlation Studio prices expensive operations through tokens rather than forcing all users into recurring subscriptions.
| Operation | Token Pricing |
|---|---|
| Dataset ingestion | 25 tokens + 1 token per MB |
| Correlation experiments | 25 tokens + 1 token per MB of joined data |
| Tool transforms | Same rate as ingestion |
| AI search | 125–250 tokens per call, depending on provider |
| AI analysis | 125 tokens per call |
| AI chatbot | 25 tokens per call |
Token packs range from $5 to $200 and do not expire. Storage subscriptions are optional and exist only to raise the default 10 GB storage cap.
This keeps the free tier useful while ensuring the costly parts of the system—large ingestion jobs and AI calls—pay for themselves.
System Architecture
Correlation Studio is intentionally built as a small, vertically integrated system. The architecture avoids distributed complexity wherever possible, but it is not simplistic. Each layer has a narrow responsibility.
PostgreSQL owns transactional truth. R2 owns immutable bulk storage. DuckDB owns analytical execution. The .NET app server orchestrates the workflows that turn user input into datasets, experiments, discoveries, AI analysis, thumbnails, billing records, and searchable public content.
Design Principle
Keep the number of operational primitives small, but make each primitive do the job it is naturally good at.
Solution Layout
The backend is split into three .NET projects with a React/Vite frontend beside it. The split is conventional by design: domain contracts at the center, infrastructure around them, and API orchestration at the edge.
CorrelationStudio.Core/ Entities, DTOs, Enums, Interfaces, Utilities, Exceptions No dependencies. Pure POCOs and contracts.CorrelationStudio.Infrastructure/ EF Core data access, ingestion services, R2 / Parquet / DuckDB plumbing, RAG subsystem. Depends on Core.CorrelationStudio.Api/ Controllers, business services, middleware, background hosted services, logging. Depends on Core + Infrastructure.correlationstudio.client/ React 18 + TypeScript SPA. Vite dev server proxies to the API and produces a static SPA bundle for production.
This layout keeps the domain model independent of persistence and hosting concerns. It also keeps the solo-developer ergonomics sane: most feature work touches the same predictable layers rather than scattering through a service mesh or multiple deployable applications.
The v2 Lakehouse Topology
The production topology consists of two VPS instances and one Cloudflare R2 account. One server runs PostgreSQL. One server runs the application, static frontend, ingestion workers, DuckDB queries, AI orchestration, thumbnail rendering, Stripe webhooks, and background services.
| Role | Plan | Cores | RAM | Disk | Transfer |
|---|---|---|---|---|---|
| Central DB | SSD VPS | 16 | 32 GB | 1.2 TB SSD | 10 TB |
| App server | SSD VPS | 16 | 64 GB | 1.2 TB SSD | 10 TB |
| Cloudflare R2 | Object storage | — | — | Metered | $0 egress |
The split is intentionally boring. The database server holds PostgreSQL only. The app server handles product logic and analytical computation. R2 stores bulk row data and media assets: dataset Parquet files, per-discovery datapoint Parquets, profile images, thumbnails, and portfolio attachments.
Browser ↓React + Vite SPA ↓ASP.NET Core API ├── PostgreSQL 17 metadata, billing, jobs, RAG vectors, audit ├── DuckDB embedded analytical execution ├── SSD hot cache local Parquet working set └── Cloudflare R2 durable Parquet + media storage
The production topology keeps transactional metadata, analytical compute, and durable bulk storage separate without introducing distributed orchestration.
The app server’s local SSD disk also acts as an LRU hot tier for Parquet. Frequently accessed dataset Parquets and per-discovery datapoint Parquets live locally for sub-millisecond DuckDB reads. Cold datasets remain in R2 and are fetched on demand.
The cache is bounded to 80% of disk, and its hit-rate degradation is surfaced in the Admin system metrics. That matters because the cache is not just a performance optimization; it is what allows the system to feel interactive while retaining cheap, durable object storage as the source of truth.
Why DuckDB on Parquet
The central architectural lesson from v1 was that PostgreSQL should not be asked to act as an analytical file format.
The v1 design materialized every parsed row into relational tables. For narrow, modest datasets, that was acceptable. For wide datasets, it was fatal. A CSV with more than a thousand columns and millions of rows produced hundreds of millions of B-tree inserts across row and index tables. Under concurrent ingestion, write amplification consumed the shard’s disk I/O budget and collapsed throughput.
DuckDB-on-Parquet inverted the workload.
Ingestion now writes one Parquet file per dataset. Analytical queries then issue DuckDB SQL directly against that Parquet. Column statistics, experiment correlations, drilldown lookups, thumbnails, and chart payloads all operate against columnar files rather than row-oriented database tables.
Column Pruning
Scanning one numeric column reads bytes proportional to that column, not the whole row.
Compression
Dictionary and RLE encoding compress low-cardinality columns aggressively, often by 10–100×.
Predicate Pushdown
Row groups allow DuckDB to skip entire chunks that do not satisfy a filter.
Vectorized SQL
The app retains SQL expressiveness without implementing a custom analytical engine.
The DuckDB pool is sized to approximately 16 connections. Every read path that needs analytical data checks out a connection: drilldown, thumbnails, sparklines, chart loading, experiment ingestion, and RAG corpus building.
Under load, connection pressure is observable. Checkout wait time appears in per-request logs as duckdbPoolWaitEma, and ingestion uses that signal to back off through a square-root yield curve. Instead of blindly flooding the local analytical engine, the ingest loop lets pool pressure slow it down.
The Important Shift
The v2 migration did not merely make the old design faster. It moved the analytical workload onto the substrate it wanted all along: columnar files queried by an embedded analytical database.
Connection Resilience and Observability
PostgreSQL access uses Npgsql with Entity Framework Core and a custom resilient retry strategy. Every DbContext resolved through dependency injection carries the same transient-error policy.
Each process also stamps an application name onto its PostgreSQL connections. On restart, ServerLifecycleService can identify and terminate orphaned backends left behind by crashed prior instances. That small operational detail keeps restarts from accumulating stale database connections.
Slow-query observability is handled by a singleton command interceptor. Any database command slower than the configured threshold, defaulting to 1000 ms, is logged with a snapshot of in-flight commands at the time of slowness.
Those logs surface live in the Admin Runtime tab. For a solo-operated system, that visibility is essential. The goal is not perfect distributed tracing. The goal is immediate operational attribution when a request, background job, or query path starts to misbehave.
Large Uploads and Long-Lived Requests
Correlation Studio accepts large datasets. That requirement shows up in places that are easy to overlook.
Kestrel’s maximum request body size is raised to the configured maximum file size, which defaults to 100 GB. Keep-alive timeout is set to four hours so legacy synchronous remote-download requests are not reaped while the server is still fetching a large file. Request header timeout is extended to two minutes.
Development needed the same treatment. The Vite dev proxy forwards API calls with four-hour timeouts, and a custom large-upload plugin disables Node’s default request, header, socket, and keep-alive timeouts on the browser-to-Vite side.
Operational Lesson
Supporting multi-gigabyte user files is not only a storage problem. It affects request limits, proxy defaults, development tooling, cancellation semantics, and background-job contracts.
Data Model
Correlation Studio’s schema is broad because the product is not only an analytical engine. It is also a publishing system, a social graph, a billing system, a job queue, an audit trail, and a RAG corpus.
The full PostgreSQL DDL is roughly 3,200 lines and is organized into predictable domains:
- Lookup tables for enum values materialized as rows.
- Identity tables for users, external logins, workgroups, and registrations.
- Communication tables for email templates, password changes, notifications, and blacklists.
- Configuration tables for runtime settings, prompt templates, and AI-provider metadata.
- Jobs tables for queues, history, and durable active downloads.
- Geocode caches for resolved geographic column values.
- Billing tables for invoices, subscriptions, token packs, ledgers, refunds, and operation histories.
- Topics and references for dataset discovery and reusable source URLs.
- Content tables for datasets, columns, stats, experiments, discoveries, and datapoints.
- Social tables for portfolios, posts, forum content, messages, and the Home feed.
- Cross-cutting feedback tables for tags, comments, ratings, and sentiment.
- Audit tables for events, connection logs, server sessions, and system metrics.
- RAG tables for chunks, embeddings, chat conversations, messages, query audits, and embed audits.
- System tables for terms, privacy policy, and other platform records.
The schema reflects the product strategy: every entity that can become part of the public knowledge base needs metadata, ownership, visibility, social feedback, searchable text, and deletion semantics.
The PublicId Pattern
Public-facing entities use short, opaque identifiers instead of exposing integer primary keys.
Users, datasets, experiments, discoveries, portfolios, topics, references, jobs, and invoices carry a PublicId: a unique, indexed, 10-character base62 string. API routes and frontend URLs use the public ID. Internal foreign-key relationships remain ordinary integers.
This provides two benefits at once. Public URLs are compact and shareable, and the system does not leak growth rate through sequential IDs. Internally, joins remain efficient and simple.
Public route: /discoveries/a7Xk92LmQpInternal tables: Discoveries.DiscoveryId = 18427 DiscoveryDatasets.DiscoveryId = 18427 Datapoints.DiscoveryId = 18427
Public IDs protect external surfaces without complicating internal relational joins.
States, Jobs, and Unit of Work
Dataset and experiment states are represented as lookup-backed enum values. Some v1 states remain defined for compatibility but are no longer used by the v2 ingestion flow. For example, v2 has no separate indexing phase because indexing bulk rows into PostgreSQL no longer exists.
DatasetStateType: Pending=1 Queued=2 Reading=3 Read=4 Waiting=5 Indexing=6 Ready=7 Failed=8 Cleaning=9 Downloading=10ExperimentStateType: Pending=1 Queued=2 Running=3 Ready=4 Failed=5 Cleaning=6
Repository access is centralized through IUnitOfWork, which exposes one repository property per entity type. The pre-v2 split between central and per-shard units of work was removed when sharding disappeared.
The unit of work and its CorrelationStudioDbContext are scoped per request. Background services that need scoped dependencies create fresh service scopes through IServiceScopeFactory. That pattern is used by deletion sweeps, cleanup runners, RAG workers, ingestion dispatchers, and background download tasks.
This is not glamorous architecture, but it is the kind of boring correctness that keeps long-running jobs from holding onto request-lifetime dependencies after the request is gone.
Storage Layer
The storage layer is where the v2 architecture earns most of its leverage. PostgreSQL stores metadata and transactional state. Cloudflare R2 stores durable bulk data. Local SSD provides a hot working set. DuckDB bridges the two by querying Parquet files as analytical tables.
That separation lets Correlation Studio keep the operational model small while avoiding the failure mode that broke v1: using a row-oriented transactional database as a high-volume analytical substrate.
Cloudflare R2 as Durable Bulk Storage
Correlation Studio stores bulk data in Cloudflare R2. The abstraction is IR2Storage, backed by an AmazonS3Client configured against R2’s S3-compatible endpoint.
R2 needs a few specific settings:
ForcePathStyle = true, because R2 does not use virtual-host style addressing.AuthenticationRegion = "auto".UseChunkEncoding = falseon PUT requests.DisablePayloadSigning = truefor uploads, while SigV4 still signs headers and HTTPS encrypts the body.
Upload and deletion concurrency are controlled separately with two SemaphoreSlim gates. That prevents a deletion sweep from competing with active ingestion uploads for the same R2 per-prefix throttle budget.
Why R2 Works Here
The workload wants cheap durable storage, high read availability, and zero-egress economics more than it wants a full managed data warehouse. R2 fits that shape well.
SChannel-Aware Retry
Large R2 uploads exposed a real production edge case.
Dataset Parquets can be gigabytes in size. Uploads are single PUT requests and may run for several minutes. Under burst load on Windows, pooled HTTPS connections could occasionally fall into a bad SChannel state. The symptom was an IOException saying the decryption operation failed, wrapping Win32Exception 0x8009030F SEC_E_MESSAGE_ALTERED from SslStream.ReadAsyncInternal.
The problem was not that one upload failed. The problem was that the pooled connection could remain poisoned. Retrying through the same client risked receiving the same broken socket again.
The upload path now runs three attempts beyond the SDK’s own retry budget. If the exception chain contains a Win32Exception, the system atomically swaps the AmazonS3Client instance using Interlocked.Exchange. The replacement client carries a fresh HttpClient, handler, connection pool, TLS handshake, and socket.
The displaced client is disposed after a 30-second delay so any concurrent upload that already captured it can finish cleanly.
Production Scar
A 1.39 GB Array of Things Chicago dataset hit the same SChannel failure on every retry because the pool kept returning the poisoned socket. The client-rebuild path exists because that happened in production.
The Parquet Writer
Dataset ingestion writes Parquet through IParquetWriter.WriteAsync, implemented with Parquet.Net.
The writer builds a schema from ParquetColumnDefinition records: one Parquet field per inferred dataset column. Type mapping is intentionally conservative.
| Correlation Studio Type | Parquet Type |
|---|---|
Integer | Int64 |
Decimal | Double |
Date | String using ISO 8601 canonical form |
Time | String using ISO 8601 canonical form |
DateTime | String using ISO 8601 canonical form |
String | String |
Row groups are buffered in per-column staging arrays and flushed at a default 50,000 rows per group. That size gives DuckDB useful predicate pushdown granularity without creating an oversized footer index.
Typical user datasets in the 100,000 to 5 million row range become 2 to 100 row groups.
Values are coerced at write time through typed conversion helpers. Non-nullable columns throw on null. Nullable columns use nullable staging arrays. The writer is always disposed before returning so the Parquet footer is flushed, and output size is measured from stream position.
The Local Hot-Tier Cache
R2 is the durable store, but local SSD is the interactive working set.
IParquetCache implements an LRU disk cache for dataset Parquets and per-discovery datapoint Parquets. Its public surface is small:
EnsureLocalAsync(r2Key, ct)returns a local file path, downloading from R2 on cache miss.SeedAsync(r2Key, sourceFile, ct)moves a freshly written ingest temp file into the cache so the first query is already local.Pin(r2Key)andUnpin(r2Key)reference-count files during active scans, ingests, drilldowns, and thumbnail renders.
Pinned files are never evicted. Eviction runs in the background and uses last access time. The cache is bounded by configuration, defaulting to 80% of disk free space at startup.
This turns the app server’s SSD disk into a practical lakehouse hot tier. Cold data stays cheap in R2. Hot data behaves like local analytical storage.
The DuckDB Connection Pool
DuckDB sessions are not thread-safe, so Correlation Studio gates analytical work through a fixed-size connection pool. The default size is 16.
The IDuckDbQueryService surface is deliberately small:
ParquetTable(localPath)returns a saferead_parquet('...')table-valued expression.QueryAsync<T>(sql, projector, ct)runs a query and materializes rows through a projector.StreamAsync<T>(sql, projector, ct)yields rows as anIAsyncEnumerable<T>without materializing the full result.RecentCheckoutWaitMsexposes an exponential moving average for pool checkout wait time.
That last metric is especially useful. When a request emits a slow-load warning, the logs include pool wait time so the operator can distinguish slow SQL from pool contention.
The DuckDB AWS extension is loaded at startup so the engine can read from s3:// URIs directly, but in practice almost all hot paths go through the local cache.
Image Bytes Through Cloudflare-Fronted R2
Images live in the same R2 bucket as Parquet data, under an images/ prefix. That includes profile avatars, banners, thumbnails, discovery scatterplot PNGs, portfolio thumbnails, post attachments, message images, and forum images.
User-facing image URLs exist in two forms.
Custom-Domain Path
In production, image URLs use:
https://images.correlationstudio.com/images/<key>
Bytes flow from R2 to Cloudflare’s edge cache to the browser. The app server never touches them. Egress through Cloudflare’s CDN is free, and the custom domain is attached at the R2 bucket level.
Proxy Path
The legacy and development fallback path is:
/api/storage/images/<key>
In that mode, bytes flow from R2 through the ASP.NET storage controller to the browser. The controller consults image safety verdicts, serves bytes through a bounded memory cache, and applies a hard five-second timeout to R2 reads so a bad object-store hop returns a fast 504 instead of riding the AWS SDK retry budget.
The proxy path remains alive for legacy URLs embedded in social previews, RSS feeds, email screenshots, bookmarks, and external caches.
Both URL shapes are accepted during the transition. A storage URL migration rewrote persisted proxy URLs to the custom-domain form across user images, discovery thumbnails, portfolio thumbnails, portfolio block images, and image storage paths.
Why This Matters
Serving public images through Cloudflare lets the product keep app-server bandwidth focused on application work while still retaining a safe fallback for moderation and legacy URLs.
Dataset Ingestion
Dataset ingestion is the pipeline that turns messy user input into typed, queryable, columnar data.
It is also where many analytics products fail. Real-world datasets are rarely clean. They contain preambles, footers, multi-row headers, sparse section labels, partial dates, currency-formatted numbers, fake headers, long sentence-like labels, and server-side download failures.
Correlation Studio’s ingestion path is designed around that reality.
Pipeline Overview
DatasetIngestionService.ProcessQueuedIngestionAsync orchestrates four phases:
- Source Resolution Prefer the local uploaded file from the wizard. If it is no longer available, fall back to the raw upload archived in R2 and download it to a temp file.
- Parse and Write Parquet Select a parser through
IRowParser, stream rows from CSV, Excel, or HTML sources, buffer them throughIParquetWriter, and emit a temp Parquet file. - Upload and Seed Cache Upload the Parquet file to R2 at
datasets/{userPublicId}/{datasetPublicId}/data.parquet. On success, move the temp file into the local hot cache so the first DuckDB query is local. - Compute Stats and Debit Tokens Run DuckDB queries to populate column statistics. Then flip the dataset to
Readyand debit tokens in the same EF Core transaction.
The v2 state machine is simple:
Pending → Queued → Reading → Ready / Failed
The older v1 states such as Read, Waiting, and Indexing remain defined but are not set by the v2 pipeline. There is no separate indexing phase because the row-index tables no longer exist.
Per-dataset mutual exclusion is enforced with an in-memory ConcurrentDictionary<int, Task>. Only one ingest for a dataset ID may run at a time. A losing dispatcher logs the conflict, closes the orphan job, and exits. If the process crashes, the dictionary disappears and the next dispatcher tick handles recovery.
CSV Parsing and Preamble Detection
CSV ingestion uses a raw StreamReader and a custom line parser rather than CsvHelper. That choice was made for speed on wide files.
The harder problem is not splitting lines. It is finding where the table actually starts.
Many public CSV files begin with metadata, comments, blank rows, source descriptions, license text, or sparse heading blocks. CsvPreambleDetector.DetectAsync buffers the first roughly 50 lines and runs a two-pass heuristic.
- First Pass Parse each line’s field count and fill ratio. Mark rows containing alphabetic content, comment-only rows, and blank rows.
- Richest Header Scan In the first eight lines, choose the row with the most non-empty fields that also contains alphabetic content and at least two populated columns.
- Consecutive Field-Count Streak Find the first point where two successive rows share compatible field counts and sufficient fill ratio. One blank or comment row may break the streak, which handles NOAA and NASA CSVs that put blank lines between headers and data.
- Decision If the streak begins at or before the richest candidate, use the streak. Otherwise use the richest candidate. If neither works, fall back to the first line with at least two fields.
This detector has been revised repeatedly against real dataset failures. Each iteration represents one ugly public CSV that broke the previous heuristic.
The Lesson
Ingestion quality is not determined by how well the parser handles perfect CSV. It is determined by how gracefully it handles the thousand strange almost-tables people upload from the real world.
Excel Parsing
Excel files have their own failure modes. A workbook can contain multiple sheets, decorative title rows, sparse metadata, merged-looking header regions, and footers that look enough like data to confuse a naïve parser.
Correlation Studio handles Excel ingestion with ExcelDataReader in two passes.
- Find the Data Region
ExcelPreambleDetector.FindDataStartscans sheets and chooses the one with the widest tabular region. - Reopen and Stream
ExcelPreambleDetector.SeekToreopens the workbook at the selected sheet and header row, then streams rows forward.
Empty rows are dropped. Parsing stops when the detector sees a blank-followed-by-sparse data pattern that indicates a footer rather than more table content.
The Excel preamble detector runs both exact and fuzzy passes. The exact pass looks for rigid column-count blocks. The fuzzy pass allows ±1 column of slop, which helps with exports that carry inconsistent trailing empty cells. If both passes find a predecessor, the stronger match wins. Otherwise the denser region wins, with ties broken toward the earliest plausible row.
Multi-Row Header Merge
A surprising number of real-world CSV and Excel exports have two or three header rows. Some use year / quarter / metric stacks. Some repeat section names. Some carry patterns such as Total | Total | 2024.
Treating only one row as the header loses useful labels. Concatenating every row can create unreadable column names. Correlation Studio uses a more selective merge.
- Forward Merge After identifying the primary header row, the parser scans forward for rows with compatible field counts and no corrupting numerics. Numeric cells are considered corrupting if they appear in columns that already have labels; numeric year-style labels are allowed in columns that are still blank.
- Section Divider Skip A row with a single short cell in a column that already has a label is treated as a stratification banner, such as
MaleorFemale, not as a header row. - Last Row Wins When a merge row is accepted, non-empty labels replace prior labels rather than being concatenated. Multi-row metadata stacks collapse to the most useful bottom label.
- Backward Fill For columns whose labels remain blank, the parser walks backward from the selected header row and fills missing labels from alphabetic rows with compatible field counts. Existing non-empty labels are never overwritten.
Why “Last Row Wins” Works
In many public datasets, the bottom header row contains the actual variable name, while rows above it contain broad grouping labels. Replacing rather than concatenating produces shorter, more useful column names.
Datatype Inference
Once rows and headers are identified, the platform has to infer column types. The classifier works cell by cell first, then refines the result after sampling the column.
ParseValueHelper.ParseValue follows a fixed order of operations.
1. Numeric Detection
The numeric cleaner strips currency symbols, percent signs, commas, and accounting-style parentheses for negatives.
($1,234.56) → -1234.56
If double.TryParse succeeds and the result is finite, the value is classified as either Integer or Decimal.
2. Temporal Detection
Temporal parsing runs on the uncleaned value. If DateTime.TryParse succeeds, the result is classified as Time, DateTime, or Date depending on the shape of the parsed value.
- Time-only strings become
Time. - Values with a non-zero time-of-day become
DateTime. - Plain calendar values become
Date. - All are canonicalized to ISO 8601.
3. Partial Dates
Public data often contains partial dates such as 1992.12, 2020-3, 1999/06, or simply 2023. The partial date parser recognizes these shapes and canonicalizes them as dates.
4. Fallback
Anything that does not parse as numeric or temporal becomes String.
Refinement Passes
After the sampled-row type census, Correlation Studio runs additional passes to catch cases the cell-level parser cannot safely resolve alone.
- YYYYMMDD integers become dates when the column label looks like a date label and every sampled value parses as a valid calendar date.
- Partial-date numerics become dates for year-month values such as
1871.10in date-labeled columns. - Headerless tables are detected when multiple numeric or date columns have type-consistent “headers.” In that case the supposed header row is returned to the data and columns are renamed
Col1,Col2, and so on. - Long headers are demoted when average column-label length exceeds 80 characters, because URLs and sentences are usually data rather than real column names.
Column Statistics Precomputation
After a dataset is written to Parquet, Correlation Studio immediately computes column statistics with DuckDB and stores the results in PostgreSQL.
For numeric columns, the core query computes:
- count
- distinct count
- minimum
- maximum
- average
- sample standard deviation
- quartiles
- skewness
Additional fan-out queries compute 20-bin histograms and top outliers by distance from the mean. For non-numeric columns, the system stores counts and top categorical values.
Results are upserted into ColumnStats, one row per column.
This replaced an expensive v1 behavior where opening a dataset’s Distributions or Quality tab could trigger 30–90 seconds of recomputation on a 10-million-row dataset. In v2, those tabs render from precomputed rows in under 100 ms.
Precompute What Users Reopen
Column statistics are not a rare analytical side path. Users repeatedly inspect quality, distributions, and summaries. Precomputing them turns a recurring interaction from a scan into a metadata read.
Token Accounting
Token accounting has to be correct under concurrency. A billing system that is usually right is not good enough.
Ingestion debits use an atomic PostgreSQL UPDATE … RETURNING statement against Users.TokenCount.
UPDATE "Users"SET "TokenCount" = GREATEST(0, "TokenCount" - @amount), "ModifiedDate" = NOW()WHERE "UserId" = @userIdRETURNING "TokenCount"
PostgreSQL takes a row lock for the duration of the transaction. Concurrent debits queue behind the lock and each receives a consistent post-debit balance via RETURNING.
The ledger row uses that returned balance. The operation history row commits in the same save. Token count, ledger, operation history, dataset state, and any prior-ingest refund all move atomically.
This replaced a read-modify-write implementation that could lose updates under concurrency. In production, the old path created drift where the ledger balance could exceed the stored user token count.
The ingestion cost formula is:
BaseTokenCost + ceil(fileSizeMB) × TokenCost
As of the v2 migration period, ingestion costs 25 base tokens plus 1 token per MB. That rate was reduced after a real 45-file COVID ingestion test proved the previous per-MB rate could make useful workflows too expensive.
Storage Quota
Every storage-consuming endpoint is gated by IStorageQuotaService: uploads, paste flows, ingestion, remote downloads, and dataset transforms.
On every successful Parquet upload, User.StorageUsedBytes is incremented by the Parquet size. The default cap is 10 GB and can be raised by an active storage subscription. Over-quota requests return HTTP 402, which the frontend renders as an over-quota banner.
A weekly reconciliation job compares recorded storage usage against the sum of dataset and experiment sizes, then writes drift metrics into system metrics.
Remote Dataset Search and Download
One of Correlation Studio’s core product bets is that users should not need to bring every dataset themselves. They should be able to ask for data, let AI providers suggest public sources, verify those sources, and ingest them into the same pipeline as uploaded files.
That workflow has two halves: AI-powered URL search and robust remote download.
AI-Powered URL Search
Remote Search runs Claude, Gemini, and Grok in parallel. Each provider receives a prompt assembled from database-backed prompt templates with the user’s topic and description interpolated.
Providers are asked for structured JSON containing:
urltitledescriptionformatestimatedSizeMB
Each provider has its own integration path: Anthropic SDK for Claude, raw HttpClient against Gemini’s REST endpoint, and raw HttpClient against xAI’s Responses API.
Search progress is tracked per user and provider. The frontend polls a progress endpoint every two seconds and updates provider status chips as each one moves through searching, verifying, done, or error.
Results are merged and deduplicated by URL. A URL returned by more than one provider is boosted because independent agreement is a useful quality signal.
Token costs are tracked per provider and debited when results are produced.
URL Verification
LLMs can produce plausible but unusable links. Every returned URL is therefore verified before it is offered as an ingest candidate.
Verification starts with a HEAD request and falls back to GET for servers that reject HEAD with 405 or 501 responses. A domain throttler controls politeness and prevents one search from hammering a public data host.
- Per-domain concurrency defaults to two simultaneous requests.
- Requests maintain at least one second between hits to the same domain.
- HTTP 429 and 503 responses honor
Retry-After. - Bad URL verdicts are cached so fabricated links are not rechecked repeatedly.
- Content-Length values from HEAD are cached for later truncation detection during GET.
Rejected URLs are still shown in a details panel with the reason: HTML returned instead of CSV, HTTP error, dead link, and so on. The user is still charged for the AI search call because the provider performed the work; sourcing failure is a separate outcome from model usage.
Reference Reuse
Topics accumulate verified references over time. Those references can be reused instead of asking the AI providers to rediscover the same sources.
Remote Search supports three modes:
| Mode | Behavior |
|---|---|
| ExistingOnly | Skip AI search and return only verified references already associated with the topic. |
| New | Run a fresh AI search without reference reuse. |
| Both | Run a fresh AI search and merge existing references into the result set. |
Link verification is triggered on user login rather than running as a giant timed background sweep. The original 24-hour batch model did not scale: at 10,000 users and 100 references each, one full cycle would take years. Login-triggered checks keep the verification work close to active users.
References follow a three-strike failure rule. A URL that repeatedly fails is marked dead. If it later comes back online, the dead flag is cleared.
The Remote Download Pipeline
After a URL is selected, Correlation Studio downloads the file server-side. That path needs to be resilient because public data servers are inconsistent. Some throttle. Some disconnect mid-stream. Some omit content lengths. Some support range requests. Some ignore them.
The download pipeline uses a named HttpClient configured to look like a browser, with automatic decompression, HTTP/1.1, and manual timeout management.
A domain throttler wraps each download so multiple users or tabs do not overwhelm the same host. The cancel button in the UI propagates through the throttler and can release a waiting download before it even starts.
Timeouts
The old single total timeout was replaced with three separate timers:
- Headers timeout, defaulting to 30 seconds, covers DNS, TCP, TLS, and waiting for response headers.
- Idle timeout, defaulting to 60 seconds, is rearmed after every successful chunk. This is the real “is the wire alive?” guard.
- Maximum total seconds is optional and defaults to disabled.
Resume by URL
If a previous attempt for the same user and URL left bytes on disk, the next attempt reads the number of bytes already present and sends a range request:
Range: bytes={N}-
This replaced an earlier stubbed resume-upload path with a simpler and more durable invariant: a partially downloaded remote file can resume by URL and byte count.
Duplicate In-Flight Protection
The server also guards against duplicate downloads for the same user and URL. If another browser tab or retry path is already downloading the same source, the second request is refused instead of starting a duplicate transfer.
Truncation Detection for Chunked Transfers
Some of the hardest remote-download bugs came from chunked-transfer endpoints. Socrata-backed sources such as data.cdc.gov can stream CSV data with Transfer-Encoding: chunked. In that mode, the GET response may not expose a content length.
That creates a nasty ambiguity. At the read layer, a mid-stream disconnect and a clean end-of-file can both look like ReadAsync returning zero bytes.
Correlation Studio defends against this with three layers.
- HEAD-Cached Size During URL verification, HEAD responses often include
Content-Length. That value is cached and reused later. If the GET response arrives without a content length, the HEAD-cached size becomes the expected total. - Post-Download Range Probe When neither HEAD nor GET reveals a total, the downloader issues a
Range: bytes=0-0request withAccept-Encoding: identity. Servers that omit content length on ordinary requests frequently returnContent-Rangefor ranged requests. - Unknown-Size Resume Probe If the total is still unknown, the downloader performs up to eight resume probes using the number of bytes already read. A 416 confirms completion. A 206 with a declared total hands off to known-size resume. A 206 without a declared total appends data and probes again.
The Edge Case
A clean EOF and a severed chunked stream can look identical unless the system establishes the truth from another signal: HEAD, Content-Range, or a resume probe.
The Async Download Contract
The original remote-download endpoint was synchronous. The client posted a URL, the server downloaded the file, and the server returned an UploadDto when the download finished.
That worked for small files. It failed for long downloads.
On a 25-minute server-side fetch, the browser-to-server connection had to stay open the entire time. If the proxy or Kestrel reaped the connection, the client saw a network error, the server-side bytes became orphaned, and the wizard had no durable way to recover.
The replacement contract starts the download asynchronously.
POST /api/datasets/download-remote/startreturns HTTP 202 immediately with a download key.- The server runs the download in a background task created against a fresh dependency injection scope.
GET /api/datasets/download-remote/status?url=...polls live state.- Terminal states include
completed,partial,failed,cancelled, andnot_found.
The frontend wraps the start-and-poll sequence behind the same conceptual surface: await a completed upload or receive an error. Internally, it tolerates transient poll failures and threads cancellation through to the server.
Cross-Restart Durability
Active downloads are stored durably in the ActiveDownloads table, one row per user and URL. The row records filename, path, bytes on disk, status, heartbeat, and other recovery state.
During the chunk loop, database writes are throttled to once every 10 seconds or 50 MB. On startup, a reconciliation service waits for a short grace period, then finds stale non-terminal downloads. If bytes exist on disk, the row becomes partial. If the file is missing, the row becomes failed. Old terminal rows are evicted.
The result is a download model that survives client disconnects, proxy timeouts, process restarts, and ordinary public-data-server unreliability.
The Experiment Engine
The experiment engine is where Correlation Studio turns datasets into discoveries. It takes two datasets, selects a join strategy, enumerates numeric-column pairs, and computes correlation statistics for every compatible pair.
The central challenge is combinatorics. If dataset X has 400 numeric columns and dataset Y has 500, the experiment has 200,000 pairs. At 1,000 by 1,000, it has one million.
The engine therefore has to be fast, resumable, deterministic, and selective about what becomes a persisted Discovery.
Experiment Pipeline
ExperimentIngestionService.ProcessQueuedExperimentAsync orchestrates the experiment flow.
- Resolve Inputs Load datasets X and Y, selected columns, join type, tolerance, sample percent, and absolute correlation threshold.
- Enumerate Column Pairs Build the cross product of numeric X columns and numeric Y columns.
TotalColumnPairs = |X_numeric| × |Y_numeric| - Compute Each Pair For every pair, build a DuckDB query that joins both sides on the bucketed key, computes Pearson, Spearman, p-value, confidence interval, and paired row count.
- Emit Discoveries If the pair clears the threshold and all filter gates, insert a Discovery and write a per-discovery Datapoint Parquet.
- Batch Database Writes Discoveries are inserted in batches of 50 to amortize EF Core overhead.
- Run Bounded Parallel Work Inner branches run concurrently, defaulting to four, while respecting DuckDB pool pressure.
The performance arc is instructive. Throughput improved from roughly 3 pairs per second to 14–15 pairs per second, with peaks around 21, through a sequence of targeted changes:
- batching Discovery inserts instead of writing one at a time,
- finding a better flush-window size,
- moving datapoint payloads away from PostgreSQL JSONB / bytea hot paths,
- deferring autovacuum pressure during ingest-heavy operations,
- and finally moving per-discovery datapoints into Parquet on R2.
In v2, the Datapoints row stores metadata and a Parquet path. The chart payload itself lives beside the dataset Parquets in object storage.
Join Types
Every correlation requires paired X and Y values. Correlation Studio supports three join strategies: row sequence, shared key, and time series.
RowSequence
RowSequence is a positional zip. The engine assigns a synthetic row number to both datasets and joins on that position.
x_agg AS ( SELECT CAST(rn AS VARCHAR) AS k, CAST("xval" AS DOUBLE) AS v FROM ( SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn, "xval" FROM read_parquet('x.parquet') ))
Numbering happens before null filtering. That preserves row alignment: X row 5 still pairs with Y row 5 even if row 3 has a missing value.
SharedKey
SharedKey joins both sides on an exact string key. Values are aggregated per key before correlation.
x_agg AS ( SELECT CAST("xkey" AS VARCHAR) AS k, AVG(CAST("xval" AS DOUBLE)) AS v FROM read_parquet('x.parquet') WHERE "xval" IS NOT NULL GROUP BY 1)
Date and DateTime columns are stored in canonical ISO 8601 form during ingestion, so equivalent dates join cleanly as strings.
TimeSeries
TimeSeries joins snap timestamps to a tolerance bucket, such as hourly, daily, weekly, monthly, or yearly.
x_agg AS ( SELECT strftime( to_timestamp( FLOOR(EPOCH(TRY_CAST("xkey" AS TIMESTAMP)) / 86400) * 86400 ) AT TIME ZONE 'UTC', '%Y-%m-%dT%H:%M:%S' ) AS k, AVG(CAST("xval" AS DOUBLE)) AS v FROM read_parquet('x.parquet') WHERE TRY_CAST("xkey" AS TIMESTAMP) IS NOT NULL AND "xval" IS NOT NULL GROUP BY 1)
Both sides snap to the same epoch grid. A single tolerance knob controls the bucket size.
The Timezone Bug
A prior implementation cast to_timestamp(...) directly to a string. DuckDB returns TIMESTAMPTZ, and string output included the local timezone offset. Drilldown queried plain timestamps and produced a different string shape for the same logical day. The UTC strftime expression fixed the mismatch across machines and drilldown paths.
Deterministic Sampling
Sampling must preserve pairs. Independent random sampling on X and Y would break the join by selecting different key sets. Correlation Studio instead samples deterministically on the bucketed join key.
WHERE hash({bucketedKey}) % 100 < {samplePercent}
Both sides compute hash() on identical key values. The surviving key sets therefore line up exactly.
Re-running an experiment on the same datasets with the same sample percent produces the same sample. That stability matters because published Discoveries should not change simply because an experiment was reprocessed.
Aggregation
When multiple rows fall into the same key or time bucket, each selected column’s aggregate function determines the per-bucket value.
| Aggregate | Expression |
|---|---|
| Avg | AVG(CAST(col AS DOUBLE)) |
| Sum | SUM(CAST(col AS DOUBLE)) |
| Min | MIN(CAST(col AS DOUBLE)) |
| Max | MAX(CAST(col AS DOUBLE)) |
| First | FIRST(col) |
| Last | LAST(col) |
| Count | CAST(COUNT(col) AS DOUBLE) |
Average is the default because it fits most exploratory datasets. Everything is cast into a numeric shape that DuckDB’s CORR() can consume.
Correlation Computation
After both sides have been joined and aggregated into matching key-value pairs, DuckDB computes Pearson and Spearman correlation from the joined set.
joined AS ( SELECT x.k AS k, x.v AS xv, y.v AS yv FROM x_agg x INNER JOIN y_agg y USING (k) WHERE x.v IS NOT NULL AND y.v IS NOT NULL AND NOT isnan(x.v) AND NOT isnan(y.v) AND NOT isinf(x.v) AND NOT isinf(y.v)),ranked AS ( SELECT RANK() OVER (ORDER BY xv) AS xr, RANK() OVER (ORDER BY yv) AS yr FROM joined)SELECT CORR(xv, yv) AS pearson, (SELECT CORR(xr, yr) FROM ranked) AS spearman, COUNT(*) AS nFROM joined
Pearson correlation uses DuckDB’s CORR() aggregate. Spearman correlation is implemented as Pearson correlation over ranked values. Pairs with fewer than three joined observations are discarded. Pairs with zero variance on either side produce no Discovery.
The p-value is computed in C# using a two-tailed Student-t test:
t = r · √[(n − 2) / (1 − r²)]p = 2 · (1 − CDF_t(|t|; df = n − 2))
Confidence intervals use Fisher’s z-transform:
z = atanh(r) = 0.5 · ln((1 + r) / (1 − r))SE = 1 / √(n − 3)z_lo = z − 1.96 · SEz_hi = z + 1.96 · SECI = [tanh(z_lo), tanh(z_hi)]
Perfect correlation is handled explicitly. When |r| = 1, the p-value path avoids divide-by-zero and the confidence interval degenerates to [r, r].
Per-Discovery Datapoint Parquet
When a pair clears the threshold, the system writes a small Parquet file containing the joined datapoints used for charting.
COPY ( SELECT k, xv, yv FROM joined ORDER BY hash(k) LIMIT 5000) TO '{tempFile}' (FORMAT PARQUET)
The deterministic ORDER BY hash(k) keeps chart sampling stable across reruns. The hard cap of 5,000 rows ensures the frontend can render a discovery without an additional downsampling pass.
The file is uploaded to R2 under the discovery’s path and seeded into the local cache. The database row stores the Parquet path, counts, and data size. The payload itself no longer lives inside PostgreSQL.
Granger Causality
Granger causality is available on demand from the Discovery detail page. It is not computed during bulk experiment ingestion because doing so added a 30–60% throughput penalty to the hot path.
The on-demand algorithm has four stages.
- Stationarity A simplified ADF-style test fits
Δy_t = α + δ·y_{t-1} + ε. The t-statistic isδ / SE(δ), using a 5% critical value of −2.86 for larger samples. Non-stationary series are differenced up to twice. - Lag Selection Candidate lags are scored with BIC:
k · log(n) + n · log(RSS / n)The search considers lags from 1 tomin(10, n/4)and chooses the best score. - F-Test Restricted and unrestricted models are compared with:
F = [(RSS_restricted − RSS_unrestricted) / k] / [RSS_unrestricted / (n − 2k − 1)] p = 1 − CDF_F(F; k, n − 2k − 1) - Bidirectional Evaluation The service tests X→Y and Y→X separately. The resulting direction can be none, X causes Y, Y causes X, or bidirectional.
OLS is implemented through MathNet matrix inversion. Singular systems fall back to full residual sum of squares and report no causality.
Why It Is On Demand
Granger causality is valuable, but it is not cheap enough to run for every candidate pair during brute-force experiment ingestion. Deferring it keeps discovery mining fast while still making causal testing available when a user cares about a specific relationship.
Discovery Filter Gates
Not every computed pair becomes a Discovery. The ingestion loop applies a series of gates, each designed to avoid storing obvious noise or redundant relationships.
- Duplicate Gate Before computing a pair, the engine checks whether the same user already has a Discovery for that exact X-column and Y-column pair in another experiment. If so, the pair is skipped without running DuckDB, writing Parquet, or uploading to R2. This makes the common “rerun the same cross-matrix as the corpus grows” workflow cost-proportional to new pairs only.
- Identity Gate After computation, near-perfect identities are skipped. If the rounded displayed correlation is 1.0000, the pair is treated as column-vs-self, duplicate column, or perfect linear transform.
- Minimum Evidence Gates Pairs are skipped if they have too few joined rows, fall below the configured absolute correlation threshold, or fail the minimum paired-row percentage.
Older duplicate history created before the gate can be cleaned by an admin utility. The cleanup keeps the oldest Discovery in each duplicate group so established permalinks, thumbnails, search records, and published states are preserved.
Statistical Algorithms
Correlation Studio intentionally uses standard statistical methods. The goal is not to invent new statistics. The goal is to make proven statistical workflows available at corpus scale and wrap them in useful product primitives.
Pearson Correlation
r = Σ[(xᵢ − x̄)(yᵢ − ȳ)] / √[Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²]
Pearson appears in three places: DuckDB during experiment ingestion, MathNet for one-off server-side computations, and a TypeScript implementation for client-side rolling windows.
Spearman Rank Correlation
Spearman is Pearson applied to ranks:
ρ = Pearson(rank(x), rank(y))
DuckDB computes ranks with window functions and then correlates them. MathNet handles one-off server paths. Ties are handled by the respective rank implementations.
P-Value
t = r · √[(n − 2) / (1 − r²)]df = n − 2p = 2 · (1 − CDF_t(|t|; df))
The p-value path uses MathNet’s Student-t distribution. Correlations with fewer than three observations return no p-value.
Confidence Interval
z = atanh(r)SE = 1 / √(n − 3)z_lo = z − 1.96 · SEz_hi = z + 1.96 · SECI = [tanh(z_lo), tanh(z_hi)]
Confidence intervals use Fisher z without Olkin-Pratt small-sample bias correction. The interval is reported as-is.
Regression Models
The frontend regression utility implements four model families: linear, polynomial, logarithmic, and exponential.
Linear
slope = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²)intercept = (Σy − slope·Σx) / n
Degenerate X variance invalidates the model. R² is computed as 1 − SS_resid / SS_total.
Polynomial
Degree-2 and degree-3 polynomial fits build normal-equation matrices from power sums and solve them with Gaussian elimination and partial pivoting. Singular pivots invalidate the model.
Logarithmic
The logarithmic model fits y = a + b·ln(x). Values where x ≤ 0 are filtered out.
Exponential
The exponential model fits y = a·e^(bx) by fitting a linear model over (x, ln(y)), then exponentiating the intercept. Values where y ≤ 0 are filtered out.
Every model returns a prediction function, equation string, coefficients, and R². The equation formatter uses adaptive precision so extremely large, normal-sized, small, and tiny coefficients remain readable.
Prediction Intervals
Regression overlays include a 95% prediction interval. For each x value on the regression curve, the margin is:
margin = 1.96 · SE · √[1 + 1/n + (x − x̄)² / SS_x]SE = √(SS_resid / (n − p))
The chart draws the band as a filled polygon above and below the curve. For nonlinear models, the prediction function is sampled across the x-range.
Pearson vs. Spearman Divergence
Correlation Studio uses the difference between Spearman and Pearson strength to flag relationships that may be nonlinear or outlier-influenced.
Δ = |Spearman| − |Pearson|"non-linear" if Δ > 0.15"outlier-influenced" if Δ < −0.15null otherwise
A large positive delta suggests a monotonic but nonlinear relationship. Spearman preserves the rank order while Pearson underreports the curve. A large negative delta suggests outliers may be propping up Pearson more than the rank relationship supports.
This drives the small divergence indicators in discovery lists and powers the dedicated divergence visualization.
NaN-Safe Rolling Correlation
Rolling correlation runs client-side for interactive exploration. The implementation skips non-finite values, returns zero for windows with fewer than three valid pairs, and guards zero-variance denominators.
function pearson(xs: number[], ys: number[], start: number, end: number): number { let sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0, sumY2 = 0, n = 0; for (let i = start; i < end; i++) { if (!Number.isFinite(xs[i]) || !Number.isFinite(ys[i])) continue; sumX += xs[i]; sumY += ys[i]; sumXY += xs[i] * ys[i]; sumX2 += xs[i] * xs[i]; sumY2 += ys[i] * ys[i]; n++; } if (n < 3) return 0; const denom = Math.sqrt( (n * sumX2 - sumX * sumX) * (n * sumY2 - sumY * sumY) ); if (denom === 0) return 0; const r = (n * sumXY - sumX * sumY) / denom; return Number.isFinite(r) ? r : 0;}
Users can adjust the window size. The chart overlays strength bands and reports mean and volatility below the line.
Downsampling and Density
Correlation Studio does not implement Largest-Triangle-Three-Buckets. Instead, it uses three practical strategies tailored to the product.
- Per-discovery cap: every Datapoint Parquet is capped at 5,000 rows with deterministic hash ordering.
- Scatter bucket deduplication: the chart computes a 1000×600 data-value bucket grid. Multiple points in the same bucket render as a single dot with radius scaled by
log2(count). - Density-aware line rendering: dense line graphs lower alpha as segments-per-column increases, letting opacity accumulate where many segments overlap.
The goal is not to preserve every rendered primitive. The goal is to preserve the visual meaning of dense data without making the browser draw thousands of redundant pixels.
Visualizations
The primary chart engine is DiscoveryChart.tsx: a canvas-based, double-buffered renderer that handles scatterplots, line graphs, residual plots, trajectory mode, and regression overlays.
Device-pixel-ratio scaling keeps output crisp on high-DPI displays. Double buffering keeps redraws clean: a new frame renders into an offscreen canvas, then atomically replaces the visible canvas.
| Component | Technology | Purpose |
|---|---|---|
MiniChart | SVG | Sparkline previews in feed cards and grids |
CorrelationHeatmap | SVG grid | Pairwise correlation matrix across columns |
DivergenceChart | SVG scatter | Pearson vs. Spearman per discovery |
BubbleChart | SVG | Correlation vs. sample size with bubble size as divergence |
LagAnalysisChart | SVG polyline | Correlation at positive and negative lags |
RollingCorrelationChart | SVG polyline | Windowed correlation across a series |
CorrelationNetwork | D3 force graph | Node-link graph of inter-column correlations |
DatasetMapView | Leaflet | Geographic scatter for geo-typed datasets |
ColumnHistogram | SVG | Per-column distribution histogram |
ColumnBoxPlot | SVG | Quartiles and outliers |
DiscoveryChart Render Pipeline
The main chart pipeline starts with memoized data preparation. It recomputes when the selected datapoint payload or visualization mode changes.
Scatterplot and residual modes use a 1000×600 bucket grid. Each point is mapped into a bucket by its normalized x and y position. Duplicate bucket hits collapse into a single rendered dot, with radius scaled by density.
Line graph mode uses arrays indexed by row order. Trajectory mode parses compound join keys into group and time pairs, then renders grouped alpha-blended lines.
Residual mode transforms the plotted y value before bucketing:
plotY = y − (slope · x + intercept)
Autocorrelation Gate
Row order is sometimes meaningful and sometimes noise. Before recommending ordered visualizations, the chart computes lag-1 autocorrelation for both series. If both series have weak autocorrelation, the UI warns that row order may not carry signal and suggests scatterplot or trajectory mode instead.
Double-Buffered Rendering
Rendering happens on an offscreen canvas. When the frame is complete, the offscreen buffer is copied into the visible canvas using globalCompositeOperation = 'copy'. This avoids the white flash that can occur when resizing or clearing the visible canvas directly.
Resize Throttling
Resize events are handled with a leading-edge throttle and a short cooldown. The first resize applies immediately. Subsequent resize events coalesce into a pending size and apply when the cooldown expires.
Performance Instrumentation
The chart accumulates stage timings for data preparation, trajectory grouping, and regression. If a render exceeds 100 ms, it logs a warning with stage-level detail.
Regression Overlay Rendering
In scatterplot mode, users can choose among linear, polynomial, logarithmic, and exponential regression overlays. The selected model is fitted client-side and rendered directly over the discovery datapoints.
The overlay draws four elements:
- a 95% prediction interval as a filled polygon band,
- the regression curve as a dashed line,
- the equation in a corner badge,
- and R² alongside the discovery’s correlation coefficient.
Nonlinear curves are sampled at 80 points across the visible x-range. Linear fits use only the two endpoints.
Server-Side Thumbnails
Discovery thumbnails are rendered server-side with SkiaSharp. The server renderer mirrors the client chart logic closely enough that thumbnails and interactive charts have the same visual language.
The thumbnail pipeline is:
- Read Discovery Data Resolve the Discovery, load its Datapoint Parquet, and query it through DuckDB while filtering null, NaN, and non-finite values.
- Render PNG Draw a 640×360 PNG with SkiaSharp using the same scatter-bucket logic as the client.
- Upload to R2 Store the PNG under the user’s discovery thumbnail path. URLs are versioned with the modified date for CDN cache-busting.
- Trigger Refresh Experiment ingestion enqueues a thumbnail refresh job after each discovery’s datapoint Parquet is uploaded. The experiment detail page can also trigger server-side renders for visible discoveries that are missing thumbnails.
Portfolio thumbnails use a similar renderer, composing mini previews for dataset, experiment, and discovery blocks.
Datapoint Drilldown
One of the most important product affordances is the ability to click a chart point and inspect the source rows behind it.
The client sends selected keys to:
POST /api/discoveries/{id}/datapoint-rows
Multi-select is supported through lasso and Shift-click. The endpoint streams NDJSON: one JSON object per line. The client parses rows lazily instead of waiting for a giant JSON array.
Streaming uses:
Content-Type: application/x-ndjson; charset=utf-8X-Accel-Buffering: noResponse.Body.DisableBuffering()- explicit
FlushAsynccalls
X and Y sides are scanned concurrently. Each side queries its source dataset Parquet through DuckDB:
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rid, c1, c2, ...FROM read_parquet('dataset.parquet')WHERE {bucketed-key-expression} IN ('key1', 'key2', ...)ORDER BY {bucketed-key-expression}LIMIT {DrilldownPerSideRowCap + 1}
The bucketed key expression is generated by the same helper used during experiment ingestion. That consistency is what makes time-series drilldown correct: a daily bucket returns every X row and every Y row whose timestamp falls in that day.
The per-side row cap is 5,000. The query fetches one extra row so truncation can be detected without issuing a separate count query.
Dataset Tools
Dataset tools let users produce derivative datasets without leaving the platform. They are implemented by DatasetToolService and share the same storage model as ordinary ingestion: DuckDB transforms source Parquet, the result is written to a new Parquet file, and the new dataset becomes a normal first-class dataset.
The transformation pipeline mirrors ingestion:
- Resolve source dataset or datasets and materialize their Parquets to the local cache.
- Build a transformation plan from tool type and options.
- Save output columns to acquire fresh column IDs.
- Run DuckDB SQL and stream transformed rows through type coercion into the Parquet writer.
- Upload the output Parquet to R2 and seed the cache.
- Flip the new dataset to Ready and debit tokens in one transaction.
Output Parquet column names use the same c{ColumnId} convention as ordinary ingested datasets.
Derivative Datasets Are Real Datasets
A transformed dataset is not a temporary view. It has columns, stats, storage usage, tokens, join-key detection, publication state, and can participate in future experiments.
Auto Join-Key Heuristic
If no output column is explicitly marked as a join key, the tool pipeline scans for the first Date or DateTime column. If none exists, it falls back to the first String column. This mirrors the dataset wizard’s detection behavior so downstream experiments can time-series join derivative datasets with minimal user intervention.
The 18 Dataset Tools
The tool catalog covers common reshaping, time-series, normalization, filtering, and joining workflows.
Transpose
Converts wide data to long form using an unpivot-style operation. Value columns are cast to double before unnesting.
Difference
Computes row-over-row deltas using LAG() over deterministic row order.
LagLead
Shifts a column forward or backward with LAG or LEAD.
RollingAggregate
Computes rolling SUM, MIN, MAX, COUNT, FIRST, LAST, or AVG over a row window.
ResampleTimeSeries
Groups timestamps by period using DATE_TRUNC and aggregates numeric columns per period.
FilterRows
Builds a WHERE clause from user conditions such as equality, range, contains, starts with, ends with, null, and not-null.
Normalize
Supports z-score and min-max normalization with zero-variance guards.
PercentChange
Computes percent change from the prior row and returns null when the prior value is zero.
CumulativeSum
Uses a windowed cumulative sum from the beginning of the row order to the current row.
FillMissing
Supports forward-fill, backward-fill, numeric interpolation, and constant fill.
Deduplicate
Ranks rows by key group and keeps the first or last row per group.
GroupAggregate
Groups by one or more key columns and applies an aggregate function.
Pivot
Discovers distinct pivot values and builds one output column per value.
Rank
Computes rank over a selected column with nulls placed last.
Bin
Computes numeric buckets from precomputed min and max values.
OutlierRemoval
Filters rows whose target values exceed a configured standard-deviation threshold.
MergeJoin
Performs inner, left, right, or full outer joins on string-cast key columns.
AppendStack
Unions two datasets by discovering a case-insensitive union schema and projecting missing columns as null.
Shared Tool Infrastructure
Tools that depend on row order inject a ROW_NUMBER() OVER () AS rowid CTE. Window frames use row-based frames rather than value-based ranges.
The streaming transform path enforces DuckDB reader output into the target Parquet type. Integer outputs become Int64. Decimal outputs become Double. Temporal and string outputs become canonical strings.
Missing values are handled explicitly. Forward-fill and backward-fill use IGNORE NULLS. Interpolation tracks null positions and neighboring non-null values. Rank uses NULLS LAST. Aggregates coalesce to safe defaults where needed to avoid division-by-zero and invalid numeric output.
Corrie: The RAG Chatbot
Corrie is Correlation Studio’s in-app RAG chatbot. The name is user-facing; the historical frontend component remains StudioChat for compatibility, but the UI, empty states, button labels, error banners, and system prompt all identify the assistant as Corrie.
Corrie answers two kinds of questions:
- product and “how do I…” questions over the documentation corpus,
- and content questions over public or user-visible datasets, experiments, discoveries, portfolios, topics, and forum questions.
The chatbot is not bolted on as a generic assistant. It is wired into the same entity model and publication lifecycle as the rest of the platform.
RAG Architecture
The RAG system has three core database entities.
RagChunk: one retrievable unit with corpus type, source type, source entity ID, title, URL, chunk text, embedding, and creation date.ChatConversationandChatMessage: persistent chat history per user.ChatQueryAudit: per-query audit records containing token counts, cost estimates, retrieved chunks, and escalation flags.
Chunks belong to several corpus types: documentation, entities, and pre-seeded content. Source types cover document sections, discoveries, experiments, portfolios, topics, datasets, forum questions, and seed content.
This gives the chatbot a uniform retrieval interface across static docs and user-generated analytical content.
Embedding and Retrieval
Embedding is dual-vendor. OpenAI’s text-embedding-3-large is the primary path, reduced to 1536 dimensions. Gemini text-embedding-004 is the fallback.
The 1536-dimensional shape was retained because the existing pgvector column already used 1536 dimensions. That made the cutover to OpenAI primary a service change rather than a schema migration.
Each embedding service owns its own retry loop with jittered backoff and Retry-After support. The composite fallback service tries OpenAI first, then Gemini on transient failure. Vendor and cost data are captured separately so admin reporting can distinguish normal and fallback spend.
Documentation chunks are built from the user guide by splitting Markdown sections on H2 and H3 headings. The docs corpus is hashed at startup and re-chunked only when the hash changes.
Entity chunks are built by source-specific chunk builders. Publishing, unpublishing, deletion, and analysis changes enqueue RAG upsert or remove commands.
The RAG Index Queue
Index commands are durable rows, not in-memory events. The worker polls due rows, coalesces multiple commands for the same source, embeds upserts in vendor batches, and writes precomputed embeddings into RagChunks.
Failed rows are retried with exponential backoff and are never silently dropped.
Retrieval
Retrieval uses pgvector cosine similarity:
SELECT * FROM "RagChunks"WHERE "CorpusType" = 1 OR "SourceUserId" = @userIdORDER BY "Embedding" <=> @queryEmbeddingLIMIT @k
The default retrieval count is eight chunks. Visibility filtering ensures private entities are not retrieved for other users.
Chat Synthesis
ChatService orchestrates the full query lifecycle:
- Embed the user’s query with the fallback embedding service.
- Retrieve top-k chunks from
RagChunks. - Build a system prompt containing retrieved chunks and provenance.
- Call Claude Haiku for synthesis.
- Escalate to Claude Sonnet if response confidence appears low.
- Stream the answer to the client.
- Write a
ChatQueryAuditrow with estimated cost.
Haiku is the default because it is fast and inexpensive. Sonnet is reserved for ambiguous or low-confidence cases where a stronger model noticeably improves answer quality.
Citations are rendered inline by the Corrie panel. Provenance links are resolved from stored entity URLs at service startup.
Chat Billing and Rate Limits
Corrie uses a flat token price per query. Each chat query debits 25 tokens regardless of whether the answer used Haiku alone or escalated to Sonnet.
That pricing model lets cheaper Haiku queries subsidize occasional Sonnet escalations, while the daily per-user query cap bounds worst-case spend.
Exceeding the query cap returns HTTP 429. Insufficient token balance returns HTTP 402. The chat panel renders both as inline banners.
Token Economy and Billing
Correlation Studio’s billing model is built around non-expiring token packs. Tokens fund usage-based operations: ingestion, correlation, transforms, AI search, AI analysis, and Corrie.
The design goal is to keep browsing and lightweight exploration free while making expensive compute and AI work pay for itself.
Token Packs
Users can buy token packs in four tiers. Packs never expire and stack into one user balance.
| Pack | Price | Tokens | Rate |
|---|---|---|---|
| Small | $5 | 5,000 | $1.00 / 1K tokens |
| Medium | $20 | 25,000 | $0.80 / 1K tokens |
| Large | $50 | 65,000 | $0.77 / 1K tokens |
| Bulk | $200 | 300,000 | $0.67 / 1K tokens |
The volume discount is intentionally modest. A bulk-pack buyer should receive a better rate, but not enough to destroy margin on expensive AI-backed operations.
Operation Costs
Token pricing is tied to the kind of work being performed.
| Operation | Base | Per Unit | Unit | Approximate Retail Cost |
|---|---|---|---|---|
| Ingest Dataset | 25 | 1 | per MB | $0.025 + $0.001/MB |
| Run Correlation | 25 | 1 | per MB | $0.025 + $0.001/MB |
| Tool Transform | 25 | 1 | per MB output | Shares ingest rate |
| AI Analysis | 0 | 125 | per call | $0.125 |
| AI Search: Claude | 0 | 250 | per call | $0.25 |
| AI Search: Gemini | 0 | 125 | per call | $0.125 |
| AI Search: Grok | 0 | 125 | per call | $0.125 |
| AI Chatbot | 0 | 25 | per call | $0.025 |
The per-MB compute rate was reduced after real ingestion testing showed that the old price could make legitimate multi-file workflows feel punitive. The per-MB charge is compute markup, not storage rent; R2 storage is cheap enough that the ingest fee covers many years of storage at retail rates.
AI Analysis was also repriced after bulk-analysis workflows exposed a power-user failure mode. One click could analyze hundreds of child discoveries. Reducing analysis cost kept that path viable without damaging margin.
Atomic Debits and the Ledger
All token-spending call sites route through the same accounting service. Dataset search, AI analysis, Corrie, ingestion, experiment ingestion, and dataset transforms all use the atomic debit path.
Every token movement writes a ledger row: purchases, grants, debits, refunds, and reclamation credits. The ledger row stores a running balance, and that balance comes from the same RETURNING TokenCount value produced by the atomic PostgreSQL update.
By construction, the ledger balance matches the stored user balance.
A reconciliation path can recompute token balances from the ledger. It finds the most recent token-reset sentinel row per user, then sums all later ledger movements. Older users without a sentinel fall back to the latest ledger balance.
Deletion Reclaim
Deleting a dataset, experiment, or discovery can return part of the original per-MB token cost to the user.
The reclaim rule credits back 70% of the per-MB portion. The 25-token base fee is retained, and the remaining 30% acts as a reclamation surcharge. Very small entities whose reclaim amount would round below one token receive no credit.
Credits appear in the ledger as token reclamation entries tied to the deleted entity title. The deletion service writes the ledger entry in the same transaction as the soft-delete flag.
Stripe Integration
Stripe handles saved cards, subscriptions, token-pack purchases, and refunds.
- Saved cards use Stripe Checkout setup sessions. The application stores the payment method ID and card brand for display.
- Subscriptions fund storage tiers. Upgrades and downgrades prorate against the current period.
- Token packs use Stripe Checkout in payment mode. Webhooks are idempotent on payment intent.
- Refunds flow from user request to admin approval to Stripe refund and ledger reversal.
The Stripe webhook controller is the canonical entry point for billing state changes. It verifies webhook signatures, deduplicates events by Stripe event ID, and delegates side effects to transactional billing-service methods.
Invoices are rendered as PDFs with monotonic invoice numbers. Each invoice stores an HTML snapshot at issue time so future re-rendering cannot drift from what the user originally received.
Authentication and Authorization
Correlation Studio supports email/password login, Google OAuth, Microsoft OAuth, and JWT session tokens.
- Email/password accounts use BCrypt hashing and email verification codes.
- Password reset uses an email-token flow.
- Google OAuth uses client-side Google auth and server-side token verification.
- Microsoft OAuth follows Microsoft Identity patterns.
- JWT tokens are HS256-signed and refreshed on authenticated requests inside the refresh window.
External Login Linking
External identities map provider and provider user ID to the local user account. If a mapping exists, OAuth login signs in that user. If not, the flow either creates a new user or links the external login to the currently authenticated account.
Roles and Visibility
The role hierarchy is strict:
Anonymous=1Registered=2Moderator=3Curator=4Administrator=5
- Anonymous users can view public content and search.
- Registered users can create, edit, publish, comment, rate, tag, and react.
- Moderators can moderate content through soft delete and restore workflows.
- Curators can feature content and pin forum questions.
- Administrators get the full admin console.
Publishable entities have three visibility states:
| Visibility | Who Can See It |
|---|---|
| Private | The owner only |
| Public | Visible on Home feed, search, profiles, crawlers, and public entity pages |
| Workgroup | Members of the assigned workgroup |
Security Hardening
Several straightforward hardening layers protect accounts and public surfaces.
- IP blacklist checks run on authenticated requests.
- Email blacklist rules prevent signup from banned domains.
- Suspended users receive a 403 and a suspension-reason page.
- Password history prevents reuse of recent passwords.
- Connection logging records method, path, status, user agent, user ID, duration, and IP address.
Connection logs are retained for 30 days. They are useful not only for security but also for diagnosing slow routes and operational anomalies.
Background Services and Lifecycle
Every non-request-scoped task runs as an ASP.NET hosted service. The system relies on background workers for ingestion, cleanup, expiration, subscriptions, link verification, metrics, RAG indexing, thumbnails, active downloads, and more.
| Service | Purpose | Cadence |
|---|---|---|
IngestionQueueService | Dispatch dataset, experiment, search, index, and transform jobs | Every 2 seconds with per-lane caps |
DatasetCleanupService | Hard-delete soft-deleted datasets after grace period | Runner cycle with bounded time budget |
ExpirationService | Expire datasets based on configured expiration dates | Every 15 minutes |
SecurityCleanupService | Expire registration and password-change tokens | Daily |
UploadCleanupService | Delete old staging uploads | Every 15 minutes |
SubscriptionRenewalService | Renew subscriptions and send notices | Daily |
LinkVerificationService | HEAD-check references | Triggered on login with cooldown |
SystemMetricsCollectorService | Sample CPU, RAM, disk, connections, and slow queries | Every 30 seconds |
RagIndexWorker | Drain the RAG index queue and embed chunks | Continuous loop |
ActiveDownloadReconcileService | Recover stale active-download rows | Startup plus periodic sweep |
Ingestion Queue
IngestionQueueService is the dispatcher for long-running analytical work. It has separate lanes for dataset jobs, experiment jobs, and tool jobs.
- The dataset lane handles ingest, index, and transform jobs.
- The experiment lane handles experiment ingestion.
- The tool lane shares capacity with dataset work.
Each lane has its own concurrency semaphore. The dispatcher selects the oldest pending job up to the lane cap, flips it to running, and invokes the appropriate service.
An admin-controlled pause flag stops new dispatches without killing running jobs. This is important for maintenance: the operator can let active work drain while preventing new long-running jobs from starting.
A duplicate guard prevents concurrent jobs against the same entity. Without that, a stale redispatch could create two pipelines writing the same Parquet path.
A stale-job watchdog inspects long-running jobs. Jobs that are still reporting progress are left alone. Truly hung jobs are canceled.
Orphan Recovery on Startup
Long-running jobs can be interrupted by process crashes, server restarts, or deploys. Startup recovery hooks repair common partial states.
- Reading-state datasets: if partial Parquet exists, queue indexing; if upload file exists, queue fresh ingest; if a reference exists, restart download; otherwise reset to pending.
- Downloading-state rows: stale active downloads become partial if bytes exist or failed if the file is missing.
- Cleaning-state entities: cleanup resumes chunked deletion from the persisted cursor.
- Zombie PostgreSQL backends: old process connections are found by application-name suffix and terminated.
Crash Recovery Philosophy
Long-running work should not require perfect uptime. Every state that can be interrupted should have a deterministic next step on startup.
Notifications
In-app notifications are stored as rows and surfaced through the bell UI. Transactional emails use database-backed templates for registration verification, password reset, login alerts, renewal notices, invoices, refunds, and related account events.
Development sends through Gmail SMTP with quota limits. A production migration to SES or SendGrid is planned if outbound email volume grows beyond lightweight usage.
Frontend Architecture
The frontend is a React 18 and TypeScript SPA built with Vite. It is designed around rich client-side interaction: dataset creation, remote download progress, chart rendering, drag-and-drop portfolio editing, maps, feeds, search, and AI chat.
- React 18 with strict mode.
- TypeScript end to end, with frontend types mirroring C# DTOs.
- Vite for development and production builds.
- Tailwind CSS for utility-first styling.
- React Query for server state.
- Zustand for persisted auth state.
- React Router v6 with route-level code splitting.
- Recharts, D3, Leaflet, and canvas for specialized visualizations.
The frontend is not a thin CRUD shell. Several product-defining features happen in the browser: chart interaction, client-side thumbnail capture, wizard state persistence, optimistic feed updates, and search-as-feed navigation.
State Management
Server state lives in React Query. Query keys follow predictable shapes such as ['my-datasets'], ['experiment', id], and ['available-datasets-my']. Mutations invalidate the relevant keys on success.
One recurring frontend lesson is that similar pages may use different query keys for good reasons. Dataset creation, deletion, and publish actions therefore need to invalidate all surfaces that can show the entity, not just the page where the mutation happened.
Client state is intentionally narrow. Zustand stores the authenticated user, token, and authentication status in localStorage. A global confirmation-dialog provider replaces native browser confirms and alerts with themed, promise-based dialogs.
API Client
The frontend API layer exports named service objects such as datasetsApi, experimentsApi, discoveriesApi, and portfoliosApi. Each wraps an Axios instance with JWT handling.
A sliding-refresh interceptor watches authenticated responses. If the JWT is inside the refresh window, the client triggers a background refresh and updates the auth store. A failed refresh redirects to the login flow.
Wizard Optimization Patterns
The dataset-create wizard is one of the most performance-sensitive parts of the frontend. It can host dozens of in-flight remote downloads, each polling progress every few seconds, while preserving enough state to survive navigation and refreshes.
Several optimizations were required after large remote-search runs exposed rendering and storage bottlenecks.
Debounced Autosaves
Wizard state is persisted to sessionStorage, but progress ticks can be frequent. Without debouncing, every tick triggered a full parse, mutate, stringify, and write cycle against a large state object. Autosaves now debounce on a short timer.
Memoized Download Rows
Downloaded file rows are isolated into a memoized component. Stable callback identities prevent every row from rerendering whenever one file receives a progress update.
Stall Detection
A parent-level timer recalculates stalled status every 10 seconds. That allows “stalled at N MB” labels to update without forcing all rows to rerender on every poll.
Flex-Chain Layout
The file list uses a nested flex layout with min-h-0 and overflow-y-auto, so the list scrolls inside the wizard instead of pushing the footer below the viewport.
Frontend Scar
A 42-file remote-search run with multi-gigabyte CDC files exposed three problems at once: autosave churn, row rerender storms, and layout overflow. Fixing all three turned the wizard from a stress case into a usable workflow.
Chart Capture for Thumbnails
Discovery thumbnails are generated both client-side and server-side.
The client-side path gives users immediate visual feedback when publishing or saving a discovery:
- Generate a mini-chart SVG string from the datapoint payload.
- Load the SVG into an image object through a data URL.
- Draw the image to a canvas.
- Convert the canvas to PNG.
- Upload the PNG thumbnail to the API, which writes it to R2.
The server-side renderer uses SkiaSharp and the same scatter-bucket logic. Keeping the two paths visually synchronized requires discipline: changes to bucket math need to land in both renderers.
Search as Home Feed
Search results reuse the same rich feed cards as the normal Home feed. Pressing Enter in the header search box navigates to a Home URL with query parameters rather than a separate search-results page.
/?q=<term> /?q=<term>&entityTypeId=<n>
The backend unifies both paths by extracting feed DTO construction into a shared pipeline. The normal feed reads rows from HomeFeed. Search produces synthetic feed references from ranked entity results. Both hydrate through the same DTO-building path and receive the same cards, thumbnails, sentiment state, and mini-chart payloads.
Search pagination treats results as one deterministic stream clustered by type: discoveries, experiments, datasets, portfolios, and posts. Each type ranks within itself, then fixed type-order concatenation produces stable infinite-scroll pages.
Search rows receive synthetic feed IDs so existing React keys and optimistic sentiment updates continue working.
Multi-Term Matching and Relevance Ranking
Search is term-based rather than exact-phrase only. A shared ranking helper converts the raw query into search terms, a match predicate, and a relevance score.
- Whitespace-split words become individual terms.
- Quoted spans are preserved as exact literal terms.
- An entity matches if any term appears in title or description.
- Title hits score higher than description hits.
- All-terms-present gets a bonus.
- The full query appearing as a contiguous phrase gets the strongest bonus.
Matching and scoring run server-side as expression trees over EF.Functions.ILike. The database evaluates the score through SQL CASE expressions rather than materializing every candidate into memory.
Because ILIKE '%term%' cannot use ordinary B-tree indexes, production search relies on pg_trgm GIN indexes over titles, descriptions, and post text. That turns multi-term OR searches into bitmap index scans instead of sequential scans across the corpus.
Search Lesson
Rich search over user-facing text is not only a ranking problem. Without trigram indexes, a friendly multi-term query becomes an expensive full-table scan.
Discoverability: Crawler SSR, Structured Data, and Open Data
Version 2.4 focused on making Correlation Studio’s public corpus visible to search engines, AI crawlers, social-media unfurlers, and dataset discovery tools.
The corpus is large: public datasets, experiments, discoveries, and portfolios form a body of structured, citable analytical content. But before 2.4, that corpus lived behind a client-rendered React SPA. Non-JavaScript crawlers fetched entity URLs and received the same generic shell: a root div and script tag, without entity-specific title, body, or structured data.
Version 2.4 adds crawler-grade HTML without changing runtime behavior for human users.
Crawler SSR with EntityShellMiddleware
EntityShellMiddleware sits ahead of the SPA static-file fallback. When a request path matches a public entity route, the middleware intercepts the shell and rewrites it before bytes leave the server.
The middleware injects three kinds of crawler-visible content.
Per-Route Metadata
The middleware resolves the public ID, loads the entity, and injects a real <title>, meta description, Open Graph tags, and Twitter Card tags. Entity thumbnails become social preview images.
Server-Rendered Noscript Body
A readable entity article is emitted inside <noscript>. Human users with JavaScript hydrate the SPA as before. Non-executing crawlers receive an indexable page body containing titles, analysis text, coefficients, dataset columns, and other entity-specific content.
Schema.org JSON-LD
Each entity receives structured JSON-LD in the document head. That gives crawlers a machine-readable description of the page, its entity type, breadcrumbs, authorship, dates, and distributions where applicable.
Visibility rules are enforced in the middleware. Private and workgroup-only entities do not leak metadata or noscript content to anonymous crawlers.
Per-Entity Structured Data
Each public entity maps to an appropriate schema.org payload.
- Datasets emit
Datasetwith name, description, creator, modified date, keywords, license, and distribution. - Experiments and Discoveries emit
Articlewith the AI analysis as article body and source datasets linked throughabout. - Portfolios emit
ArticleorCollectionPage. - All entities emit
BreadcrumbListso crawlers can understand and traverse the site hierarchy.
Structured data is generated through a shared builder, keeping entity-to-schema mapping centralized.
Topic Ontology
The original product had per-user topics: useful for organization, but not a global taxonomy. Version 2.4 adds a two-level controlled vocabulary that organizes content across users.
OntologyTopicsdefines curated parent and child topics, each with a stable slug, display name, and description.EntityTopicsmaps datasets, experiments, and discoveries to ontology topics through an entity-type plus entity-ID pattern.
Assignment is automatic. A Claude Haiku classifier reads existing AI analysis text and returns allowed topic slugs from the fixed vocabulary. Out-of-vocabulary results are discarded.
This is intentionally cheap. It reuses analysis text the platform already paid to generate, and experiment topics inherit to public discoveries. That means classifying experiments and datasets can cover the discovery corpus without one classifier call per discovery.
Topic Hub Pages
Topic pages turn the public corpus into a navigable subject graph.
The /topics index and each /topics/{slug} hub are public, crawler-grade pages rendered through the same shell middleware used for entity pages. Each hub exposes a noscript body listing mapped entities, plus CollectionPage and BreadcrumbList JSON-LD.
Topic hubs also receive their own sitemap shard so crawlers can discover them directly.
Functionally, the hubs act as an internal-linking layer. They connect otherwise isolated entity pages by subject:
Topic index ↓Topic hub ↓Dataset / Experiment / Discovery ↓Breadcrumb back to topic
Topic hubs give crawlers a dense, navigable graph instead of a flat list of orphan URLs.
That matters because internal linking is one of the strongest levers for encouraging crawlers to explore a large site deeply.
IndexNow and Sitemap Improvements
Version 2.4 adds several mechanisms to shorten the lag between publishing content and having search engines revisit it.
IndexNow Pings
When a public entity is published, unpublished, or deleted, the platform sends an IndexNow ping for the changed URL. The ping runs asynchronously from the same background lifecycle path that already handles RAG indexing, so publishing remains fast for the user.
Domain ownership is proven through the standard key-file-at-root mechanism.
True lastmod Values
The sitemap index now emits real MAX(ModifiedDate) values per entity type and shard instead of static build-time timestamps. Crawlers can therefore use conditional requests correctly and distinguish changed sections from unchanged ones.
Image Sitemap Extension
Discovery scatterplot thumbnails and portfolio composite thumbnails are advertised through image-sitemap extensions. That exposes the generated visual corpus to image search, not just ordinary web search.
Open Data
The public corpus is published as open data under CC BY 4.0, with attribution to the content creator.
It is exposed through three surfaces.
/llms-full.txt
The full public corpus rendered as one Markdown document containing titles, analysis, and coefficients. It is designed as a single-fetch corpus surface for language models.
/open-data/correlation-studio-public.ndjson
A streamed bulk dump with one self-describing JSON object per public entity. Streaming avoids a memory ceiling as the corpus grows.
/data
A human-facing landing page documenting the open-data surfaces, license, and NDJSON schema.
The open-data layer is also represented in structured data. Dataset JSON-LD includes a real license and downloadable distribution. The entire public corpus dump is itself typed as a schema.org Dataset, with distribution pointing to the NDJSON file.
That makes the corpus discoverable both as individual analytical pages and as a downloadable dataset in its own right.
Operational Snapshot
The discoverability work runs against live production traffic. A 30-day operational snapshot through June 29, 2026 showed:
| Metric | Value |
|---|---|
| Requests | ~975,000 |
| Unique IPs | ~45,000 |
| Average request duration | 79 ms |
| Uptime | 100.00% |
| Error rate | 2.38% |
| Response mix | 2xx 93.2% / 3xx 4.4% / 4xx 2.4% / 5xx ~0% |
The average request duration held with SSR middleware in the request path because the shell rewrite is lightweight. It uses data the API already loads and string-templated HTML insertion, not a second round trip through an external rendering service.
Operational Patterns and Hard-Won Lessons
The most valuable architectural lessons in Correlation Studio were not anticipated in design documents. They came from production failures, load tests, and real workflows that broke assumptions.
The advantage of a solo build is not that one person can predict every failure. It is that the loop between “this broke” and “this is fixed permanently” can be short.
The Operating Principle
Delete cleverness the moment the simpler design proves more durable.
Lesson 1: Replace Boolean State Soup
Before the state-machine refactor, dataset ingestion used multiple flags: ingestion status, indexing status, queued flags, and related booleans. Every dispatcher race created another corner case.
The fix was a single DatasetStateId column backed by a DatasetStates lookup table. Experiments received the same treatment.
One column. One writer per transition. Deterministic startup recovery.
It was one of the highest-leverage structural cleanups in the project.
Lesson 2: Make the Ledger the Source of Truth
Token balances originally lived as a column on Users and were debited through read-modify-write. Under concurrency, lost updates produced silent drift.
The ledger forced correctness. Every debit now atomically updates the user balance and writes a ledger row whose balance field is the returned post-update value.
Drift recovery becomes deterministic: find the latest reset sentinel and sum every ledger movement after it.
Lesson 3: PostgreSQL Is Not a Data Lake
The v1 storage design used row tables and index tables for bulk dataset content. It failed when write amplification overwhelmed disk I/O under concurrent ingestion.
The v2 migration replaced that entire path with one columnar Parquet file per dataset, stored in R2 and queried by DuckDB. Sharding disappeared. Thousands of lines of code were deleted. The frontend stayed unchanged.
The lesson was not “optimize PostgreSQL harder.” The lesson was “stop asking PostgreSQL to do the wrong job.”
Lesson 4: Canonicalize Time Once
A time-series bug exposed a subtle mismatch. The experiment join bucketed timestamps through one DuckDB expression. Drilldown queried the raw column through a different expression. The chart showed correlations because both sides of the join collided on the same bucket string, but drilldown returned no rows because the stored key shape did not match.
The fix was to centralize bucket-key generation. Both ingest and drilldown now use the same helper, and time-series buckets are formatted in UTC ISO 8601 form.
The lesson: if a derived key is used in more than one path, it deserves one canonical builder.
Lesson 5: Retry Is Not Enough if the Pool Is Poisoned
R2 upload retries initially failed to recover from a poisoned SChannel socket because the same client pool kept returning the same bad connection.
The fix was not more backoff. The fix was to replace the client and its underlying connection pool when the failure chain indicated a Windows TLS failure.
Retrying the same broken resource is not resilience. Sometimes the resource must be evicted.
Lesson 6: Frontend Performance Fails in Layers
The remote-download wizard did not become slow because of one bug. It became slow because autosave churn, row rerender storms, and layout overflow compounded under a realistic 40-file workflow.
Fixing only one layer would not have been enough. Debounced autosaves, memoized rows, stable callbacks, stall timers, and flex-chain layout all had to land together.
Rich workflow pages need performance design just as much as backend ingestion loops.
Lesson 7: Webhooks Must Be Idempotent
Stripe retries webhooks aggressively. Without idempotency on event ID, one checkout completion can produce multiple token credits.
The webhook controller now writes a deduplication row at the start of every handler and exits immediately if the event has already been processed.
For billing systems, “probably once” is not a delivery guarantee. Build for repeated delivery.
Lesson 8: Background Work Needs Its Own Scope
Fire-and-forget work launched from a request cannot safely reuse scoped dependencies from the request. The request scope may be disposed before the background task touches its first DbContext.
The canonical pattern is to inject IServiceScopeFactory, create a fresh scope inside the background task, and resolve the required service from that scope.
That pattern now underlies background downloads, RAG indexing paths, and ingestion queue dispatch.
Lesson 9: Cross-Shard Navigation Was a Trap
In v1, users lived in the central database while analytical content lived in shards. Any accidental navigation property across that boundary could fail because the related entity lived in a different database.
The interim fix was a systemic ban on cross-shard navigation: either denormalize the fields needed for display or explicitly fan out through shard routing.
The v2 migration collapsed the system into one PostgreSQL database, so the constraint disappeared. The useful denormalized counters remained because they make common reads cheap.
Architecture Principles
Correlation Studio evolved through several major architectural generations. Although the implementation details changed dramatically, a small set of design principles remained constant throughout the project.
Compute Once
Expensive work should be performed once and reused everywhere. Dataset statistics, thumbnails, embeddings, chart data, and AI analysis are all persisted rather than regenerated on every request.
Move Data Less
Whenever possible, computation moves to the data rather than moving the data to the application. DuckDB executes directly against Parquet files instead of materializing millions of rows through EF Core.
Cache Aggressively
Local disk, R2 object storage, CDN caching, HTTP caching, database caches, and application caches all reduce latency while keeping expensive analytical operations off the critical request path.
Prefer Metadata
Metadata is inexpensive. Raw analytical data is not. The application stores lightweight metadata in PostgreSQL while leaving high-volume numerical payloads in Parquet.
Everything Is Restartable
Every long-running workflow is designed to survive crashes, deployments, connection loss, or browser refreshes. Durable state always wins over in-memory state.
AI Is a Service Layer
Large language models enhance the platform but never define its architecture. AI-generated results become first-class entities that users can edit, publish, search, cite, and build upon.
Technology Stack
Correlation Studio is built almost entirely on proven open-source technologies, selected for stability, performance, and operational simplicity rather than novelty.
| Layer | Technology |
|---|---|
| Backend | ASP.NET Core 9 |
| Language | C# 13 |
| Frontend | React 18 + TypeScript |
| ORM | Entity Framework Core |
| Database | PostgreSQL + pgvector |
| Analytics Engine | DuckDB |
| Columnar Storage | Apache Parquet |
| Object Storage | Cloudflare R2 |
| Authentication | JWT, Google OAuth, Microsoft OAuth |
| Payments | Stripe |
| Embeddings | OpenAI + Gemini |
| Language Models | Claude, Gemini, Grok |
| Maps | Leaflet |
| Charts | Canvas, SVG, D3 |
Every major technology choice was made to support a specific architectural goal: DuckDB for analytical execution, Parquet for compact columnar storage, PostgreSQL for transactional metadata, and R2 for inexpensive durable object storage.
Current Scale
As of Version 2.4, the public corpus contains:
600+Datasets
8,000+Experiments
45,000+Discoveries
165M+Rows
15,000+Columns
Every public dataset can participate in future experiments. Every experiment can generate discoveries. Every discovery can become part of a portfolio, appear in the search index, receive AI analysis, be discussed socially, and become part of the retrieval corpus used by Corrie.
That network effect is the central design objective of the platform. Correlation Studio is not simply a statistics application. It is a knowledge graph built from statistical relationships.
Conclusion
Correlation Studio began as an attempt to make correlation analysis accessible to people who were not statisticians. Along the way it evolved into something much broader: a platform for discovering, publishing, discussing, and preserving statistical relationships at scale.
The engineering journey followed the same trajectory. Early versions emphasized functionality. Later versions emphasized scalability, durability, recoverability, discoverability, and operational simplicity. Many of the most valuable improvements came not from adding features, but from removing unnecessary complexity and replacing fragile assumptions with deterministic systems.
The resulting architecture combines transactional metadata in PostgreSQL, high-performance analytical execution through DuckDB, compact Parquet storage in Cloudflare R2, AI-assisted workflows, durable background processing, and a modern React frontend into a unified platform capable of mining millions of statistical relationships while remaining approachable to non-programmers.
Correlation Studio’s mission is simple:
Make data science accessible, make discovery collaborative, and make statistical insight searchable.
Correlation Studio
Rapid Discovery Mining • Causation Analysis