Building a Bulletproof Data Infrastructure To Support Explosive Growth
Marketing has evolved from the smoke-filled rooms of Mad Men to a data-driven, analytical discipline focused on driving revenue and growth. In this new era, a solid data infrastructure is paramount. Without it, achieving measurable and sustainable growth becomes exponentially more challenging. Evaluating and enhancing your existing data infrastructure should be a priority to optimize your growth.
In this article, I’ll walk you through how to build a scalable and easily maintained data infrastructure that will form the foundation for your future growth.
Customer Data Platform (CDP)
A Customer Data Platform (CDP) centralizes the management of your customer data flow. It collects data from your digital products and forwards it to the third-party services and databases that power your product. The CDP enables precise control over which user data is shared with each ad network, service, and database. Its SDK dramatically simplifies implementation, allowing new SDKs to piggyback on your existing events rather than requiring duplicative implementations. The CDP handles user- and device-level information collection, event batching, retry logic, and much more.
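To make the piggybacking concrete, here is a minimal sketch of instrumenting events through Segment's Python library (the segment-analytics-python package); the write key, user ID, event name, and properties are placeholders:

```python
# Sketch: sending one identify and one track call through a CDP, using
# Segment's segment-analytics-python library. All IDs and values below
# are placeholders.
import segment.analytics as analytics

analytics.write_key = "YOUR_WRITE_KEY"  # placeholder

# Identify ties traits to a user once; every destination you enable later
# (ad networks, analytics tools, your warehouse) receives them automatically.
analytics.identify("user_123", {"plan": "pro", "country": "US"})

# Track fires a single event; the CDP handles batching, retries, and fan-out,
# so adding a new partner requires no new client-side code.
analytics.track("user_123", "Order Completed", {"revenue": 29.99})

# Flush pending events before the process exits (useful in short scripts).
analytics.flush()
```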
Key Benefits of Using a CDP:
Enormous Opportunity Cost Savings: Building and managing this infrastructure yourself is a massive undertaking. Trust me - I’ve done it, and it was a headache, especially during usage spikes. Don’t do it yourself. The opportunity cost is enormous; focus on your product instead.
Simplify Your Code Base and Speed Development: Integrating a new partner requires minimal to no engineering resources. A CDP forwards existing events to partners, reducing code complexity, bundle size, and resource requirements. At the same time, it collects standard user and device information and provides event batching and retry logic - capabilities that are time-consuming, resource-intensive, and error-prone to build yourself.
Data Forwarding: Seamlessly forward collected user data to a wide array of services, often without client-side modifications.
Error Reduction: Without a CDP, engineering teams tend to duplicate analytical events for every SDK and service, and the resulting spaghetti code inevitably leads to errors and inconsistent metrics. A CDP centralizes events, minimizing mistakes.
Replayability: In fast-moving startups, you will break systems, causing data loss. A CDP allows replaying and resending missing data, ensuring all relevant data is captured.
API Data Pull: CDPs support pulling data from your ad networks, CRM, and other services and sending that data to your database.
While a CDP adds to your budget, the savings in engineering resources, time, and error reduction make it a worthwhile investment.
Recommendation: Segment
Data Movement Platform (DMP)
A Data Movement Platform (DMP) efficiently manages the API connections and schemas for hundreds of providers and services. It lets your team quickly load detailed performance data from hundreds of ad networks, CRM tools, and cloud spreadsheets into your internal database. Without a DMP, you may either miss out on critical data or take on the costly, time-consuming, and error-prone task of maintaining this infrastructure internally. Having done the latter, I do not recommend it. With no prior experience, I was able to set up a custom schema in a DMP - from Meta’s ad API into our analytics database - in under 10 minutes.
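For contrast, here is a rough sketch of the kind of hand-rolled connector a DMP replaces, written against Meta's Marketing API; the access token, account ID, and field list are placeholders. Now multiply this by every network, plus schema changes, rate limits, and backfills:

```python
# Sketch: a manual daily pull of campaign performance from Meta's Marketing
# API - the kind of integration a DMP maintains for you. Token and account
# ID are placeholders.
import requests

ACCESS_TOKEN = "YOUR_TOKEN"    # placeholder
ACCOUNT_ID = "act_1234567890"  # placeholder ad account

def fetch_meta_insights():
    """Pull yesterday's campaign-level performance, following pagination."""
    url = f"https://graph.facebook.com/v19.0/{ACCOUNT_ID}/insights"
    params = {
        "access_token": ACCESS_TOKEN,
        "level": "campaign",
        "fields": "campaign_name,spend,impressions,clicks",
        "date_preset": "yesterday",
    }
    rows = []
    while url:
        resp = requests.get(url, params=params)
        resp.raise_for_status()
        payload = resp.json()
        rows.extend(payload.get("data", []))
        # Meta paginates results; follow the "next" cursor until exhausted.
        url = payload.get("paging", {}).get("next")
        params = {}  # the "next" URL already carries the query string
    return rows
```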
Why Not Just Use a CDP?
While a CDP is excellent for moving customer data, it only provides access to a limited portion of the data available via APIs. For instance, valuable ad metrics related to video performance or ad quality are typically only accessible through full API pulls, which aren’t available via a CDP. This limitation makes a DMP crucial for comprehensive data analysis.
Key Benefits of Using a DMP:
Free Up Engineering Resources: Building and maintaining integrations with every ad network is time-intensive. Your resources are better spent on internal priorities and product development.
Reduce Downtime: Outsourcing API and schema maintenance minimizes downtime and errors, which can cost hundreds of thousands of dollars in misallocated marketing spend.
Quick Data Integration: Establishing a new data connection is remarkably fast, often taking as little as five minutes.
Google Sheets Integration: Easily load Google Sheets into your database, enabling marketers to track and merge manually tracked metrics or creative information with other datasets seamlessly.
Real-Time Data Streaming: Stream data from production databases to analytics databases in real-time, allowing your team to focus on data analysis rather than data management.
Incorporating a DMP into your data infrastructure ensures you have access to detailed and valuable performance data without the headache of maintaining numerous API connections and schemas. This strategic move saves time, reduces errors, and enhances your ability to make informed, data-driven decisions.
Recommendation: Fivetran
Mobile Measurement Partner - Attribution (MMP)
If you have a mobile app and plan any paid marketing or need to track the source of your users, a Mobile Measurement Partner (MMP) is a must-have. While a CDP and DMP can load raw tracking data into your database for analysis, they can't handle the complexities of mobile attribution, especially considering Apple’s SKAdNetwork privacy program, which I’ll cover more in-depth in a future article.
Key Benefit of Using an MMP:
Essential for Mobile Attribution: Accurate mobile attribution isn’t possible without an MMP. These platforms are designed to track and attribute user interactions from various sources, ensuring you understand where your users are coming from and how your marketing efforts are performing.
Recommendation: Adjust or AppsFlyer
Both Adjust and AppsFlyer offer nearly identical services. Choose the one that best fits your budget.
Cloud Database (CD)
Choosing a suitable database is crucial for your business’s long-term success. With experience using MongoDB, Redshift, BigQuery, and Snowflake, I can confidently say that selecting the wrong database can create persistent issues, including premature hair loss. While Redshift and other solutions have significantly improved, Snowflake stands out as the clear winner.
Key Advantages of Snowflake:
Eliminates Nightly Job Locking: Snowflake’s scalable architecture ensures that nightly jobs don't lock due to insufficient server resources, a common problem with many databases.
Prevents Slow Queries: Snowflake efficiently handles peak-hour queries, ensuring that data is readily available for business reviews without delay.
Snowflake's ability to scale based on usage has resolved these issues, saving hundreds, if not thousands, of personnel hours.
Opting for Snowflake as your cloud database ensures scalability, efficiency, and reliability. By addressing common pain points such as job locking and slow queries, Snowflake lets your team focus on analysis and decision-making rather than troubleshooting database issues.
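As an illustration of that scaling behavior, here is a minimal sketch using the official snowflake-connector-python package; the credentials, warehouse name, and sizing parameters are placeholders, and multi-cluster warehouses require Snowflake's Enterprise edition:

```python
# Sketch: configuring a Snowflake warehouse that scales with demand instead of
# locking nightly jobs or queueing peak-hour queries. Credentials and the
# warehouse name (ANALYTICS_WH) are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account",    # placeholder
    user="your_user",          # placeholder
    password="your_password",  # placeholder
)

with conn.cursor() as cur:
    # A multi-cluster warehouse adds clusters under peak query load and
    # suspends after 60 idle seconds, so nightly jobs and ad-hoc BI queries
    # stop competing for one fixed-size server.
    cur.execute("""
        CREATE WAREHOUSE IF NOT EXISTS ANALYTICS_WH WITH
          WAREHOUSE_SIZE = 'MEDIUM'
          MIN_CLUSTER_COUNT = 1
          MAX_CLUSTER_COUNT = 4
          AUTO_SUSPEND = 60
          AUTO_RESUME = TRUE
    """)
conn.close()
```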
Recommendation: Snowflake
Modern ELT (an Evolution of ETL) Using dbt
The rise of modern database solutions that can handle raw data and transformations at scale has reshaped how companies manage their data pipelines. The paradigm shift is toward ELT (Extract - Load - Transform), where the database handles the bulk of data transformations.
Extract & Ingest
The first and most important aspect is keeping all raw data in its original, unaltered form. Regardless of the source, all raw data should be stored in your database or in a storage solution such as Amazon S3, from which it can be loaded into the database.
Bugs will happen: type-casting a field incorrectly, passing the wrong data, or accidentally deleting a table. A proper architecture will prevent data loss, ensure data integrity, and allow quick recovery, while also preserving crucial training data for any future AI systems. The volume and quality of training data is one of the main limiting factors on any AI system, and you do not want to lose data that might become the linchpin of a future AI strategy.
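Here is a minimal sketch of what "unaltered" looks like in practice, assuming raw JSON events already land in an S3 bucket exposed as a Snowflake external stage; the stage, table, and credential names are placeholders:

```python
# Sketch: loading raw JSON events from S3 into a single VARIANT column, so the
# original payloads survive any downstream bug. Stage, table, and credentials
# are placeholders; the raw schema is assumed to exist.
import snowflake.connector

conn = snowflake.connector.connect(account="...", user="...", password="...")
with conn.cursor() as cur:
    # One VARIANT column keeps each payload byte-for-byte; transformations
    # happen later, in dbt, never on the raw copy.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS raw.events (
            payload   VARIANT,
            loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
        )
    """)
    cur.execute("""
        COPY INTO raw.events (payload)
        FROM (SELECT $1 FROM @raw_events_stage)
        FILE_FORMAT = (TYPE = 'JSON')
    """)
conn.close()
```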
dbt
Your business runs on data. To build a competitive advantage, you must integrate datasets, transform them into actionable insights, and deliver them to stakeholders quickly and accurately. The challenge is to do this without creating a system that becomes a giant, untraceable, and unscalable mess.
This is coming from experience making giant messes. I’ve written rollup scripts thousands of lines long, containing multiple SQL blocks and pulling from several external sources, to clean purchase and marketing data. Failures occur if one source table doesn’t update properly or runs a tad slow. There’s nothing like getting a dozen Slacks on a Monday morning because the weekend jobs crashed for one reason or another. With dbt, we broke those multi-thousand-line SQL files into smaller, retryable blocks, dropping the error rate to nearly nothing.
dbt helps manage this complexity in a way that’s modular, scalable, repeatable, and governed—all directly inside your data platform. Instead of completing transformations before loading the data into your warehouse, you load raw data and perform transformations using dbt within the warehouse itself.
Using dbt, data teams can build, test, and deploy analytics code using software development best practices (like portability, CI/CD, observability, documentation, etc.) to create production-grade analytics pipelines that scale. This results in modular, clean data models that can be delivered into BI tools, LLMs, and APIs, ensuring stakeholders have accurate data when and where they need it.
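As a small illustration of that modularity, here is a hedged sketch of a single dbt model. dbt models are most commonly plain SQL files, but dbt also supports Python models on Snowflake via Snowpark; the stg_purchases staging model and column names below are hypothetical:

```python
# Sketch of a dbt Python model (e.g., models/daily_revenue.py), assuming
# dbt-snowflake with Snowpark. The upstream model "stg_purchases" and its
# columns are hypothetical.
import snowflake.snowpark.functions as F

def model(dbt, session):
    # Materialize as a table; dbt handles the DDL, dependencies, and reruns.
    dbt.config(materialized="table")

    # dbt.ref() wires this model into the DAG: if stg_purchases fails or runs
    # late, only this node and its descendants rerun - not a 3,000-line script.
    purchases = dbt.ref("stg_purchases")

    return (
        purchases
        .with_column("purchase_date", F.to_date(F.col("purchased_at")))
        .group_by("purchase_date")
        .agg(F.sum("amount").alias("daily_revenue"))
    )
```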
Raw Data Storage - S3 vs Data Warehouse
With the drop in storage costs, loading all your raw data directly into your data warehouse is now viable. However, there are times when a lower-cost solution or an intermediary before loading the data into your database might be desirable. This is where Amazon S3 and similar solutions provide value. For instance, if you call an external API to append user information or to geolocate a user’s IP address, you can write the results to S3 as an intermediary before loading them into the data warehouse, guarding against loading failures.
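A minimal sketch of that intermediary pattern follows; the geolocation endpoint, bucket name, and key layout are hypothetical placeholders:

```python
# Sketch: staging API enrichment results in S3 before the warehouse load, so a
# failed load never loses data. The geo endpoint, bucket, and key layout are
# hypothetical.
import datetime
import json

import boto3
import requests

s3 = boto3.client("s3")

def enrich_and_stage(ip_addresses):
    # Hypothetical external geolocation API - substitute your provider.
    enriched = [
        requests.get(f"https://geo.example.com/lookup/{ip}").json()
        for ip in ip_addresses
    ]
    # Land the raw responses in S3 first; the warehouse COPY can be retried
    # from this file at any time if the load fails.
    key = f"enrichment/geo/{datetime.date.today():%Y/%m/%d}/batch.json"
    s3.put_object(
        Bucket="my-raw-data-bucket",  # placeholder bucket
        Key=key,
        Body="\n".join(json.dumps(r) for r in enriched),
    )
    return key
```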
Key Benefits of Using an ELT Process:
Fix Errors Fast: Correcting an error is straightforward—just adjust the transform step and rerun the script.
Near Real-Time Access To Live Data: With these tools, your team can access data as soon as 10 minutes after events occur, giving executives the most recent data possible.
Minimize Lost Data: By storing raw data, you can recover as much data as possible when errors occur, ensuring no valuable information is permanently lost.
Recommendation: Snowflake in combination with Amazon S3 using Kinesis or Kafka Streams
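For the streaming leg of that recommendation, here is a minimal sketch of publishing production events to a Kinesis stream with boto3, assuming a stream named raw-events whose contents are delivered to S3 (for example, via Firehose); the stream name and event shape are placeholders:

```python
# Sketch: pushing production events onto a Kinesis stream; a delivery stream
# downstream would land them in S3 for the warehouse to pick up. Stream name
# and event shape are placeholders.
import json

import boto3

kinesis = boto3.client("kinesis")

def publish_event(user_id, event_name, properties):
    record = {"user_id": user_id, "event": event_name, "properties": properties}
    kinesis.put_record(
        StreamName="raw-events",           # placeholder stream
        Data=json.dumps(record).encode(),
        PartitionKey=user_id,              # keeps one user's events on one shard, in order
    )

publish_event("user_123", "Order Completed", {"revenue": 29.99})
```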
Note: These recommendations are unbiased and uncompensated.