7  Data Engineering & Databases

7.1 Introduction

In marketing analytics, data is both the starting point and the greatest obstacle. Firms collect a wide variety of information: sales transactions, website clicks, loyalty card activity, advertising impressions, survey responses, or customer support logs. Social media adds another layer, with streams of likes, comments, and shares. At first glance, these sources look rich and promising. Yet in their raw form, they are rarely suitable for analysis. They tend to be fragmented, inconsistent, and overly detailed.

Data engineering is the task of turning these raw inputs into a dataset that can actually be analyzed. It is not a separate discipline—it is a core part of what marketing analysts do. In practice, it often falls to junior analysts when they join a company: before they are asked to build dashboards or predictive models, they are asked to “get the data into shape.” This involves cleaning up transaction logs, linking customer records across systems, or aggregating daily click data into weekly campaign summaries.

The goals of data engineering are straightforward but fundamental:

  • Cleaning: removing errors, duplicates, and inconsistencies. For example, two slightly different spellings of the same retailer can lead to double-counting.
  • Operationalizing: turning abstract concepts into measurable variables. “Churn” becomes a flag for whether a customer has not purchased for 90 days; “loyalty” becomes the share of purchases with a focal brand (see the sketch after this list).
  • Validation: checking whether the data make sense. Do weekly sales figures add up to monthly totals? Are there negative prices in the dataset?
  • Generating insights: creating derived metrics that extend the raw information. From click data we might compute conversion rates; from transactions we might compute average basket size.
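
To make the operationalizing goal concrete, here is a minimal pandas sketch that derives a churn flag and a loyalty share from a small, hypothetical transaction table (the column names and the 90-day window are illustrative):

    import pandas as pd

    # Hypothetical transaction-level data
    tx = pd.DataFrame({
        "customer_id": ["C1", "C1", "C2"],
        "brand": ["Focal", "Other", "Focal"],
        "order_date": pd.to_datetime(["2025-01-05", "2025-03-20", "2024-10-01"]),
    })

    as_of = pd.Timestamp("2025-04-01")
    per_customer = tx.groupby("customer_id").agg(
        last_purchase=("order_date", "max"),
        loyalty_share=("brand", lambda b: (b == "Focal").mean()),
    )
    # Churn flag: no purchase in the 90 days before the reference date
    per_customer["churned"] = (as_of - per_customer["last_purchase"]).dt.days > 90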

To illustrate, consider a company evaluating the effectiveness of an online advertising campaign. The marketing team has three main data sources:

  1. Ad impressions: every time a user saw an ad, stored with a timestamp and user ID.
  2. Transactions: records of customer purchases, also linked to user IDs, with information on products, quantities, and prices.
  3. Weather data: daily temperature and rainfall by region, thought to influence online shopping behavior.

Each dataset has its own structure and quirks. Ad impression logs contain millions of rows, often many for the same customer. Transaction records sometimes list the same order twice. Weather data is organized by day and region, which does not match the structure of the other two datasets. Before any analysis can happen—say, estimating the campaign’s impact on sales—the analyst must engineer these sources into a coherent dataset.

In this chapter, we begin by looking at data structures, or the way raw data is organized and what issues commonly appear. We then consider how to design a derived dataset: deciding what each row should represent and what variables it should contain. Finally, we review the set of data operations that allow us to move from messy raw inputs to a clean, analysis-ready dataset.

7.2 Data Structures

Before any transformation can take place, analysts must first understand the structure of the raw data they are working with. This step is about diagnosis rather than action: examining how information is organized and identifying the keys that uniquely define observations. Only after this groundwork can different sources be engineered into a single dataset.

Keys and Identifiers

Every dataset must contain a way to uniquely identify each row. In database terminology, this is called the primary key. Sometimes it is a single variable, such as a customer ID or a transaction number. At other times, uniqueness only emerges through a composite key, which is a combination of two or more variables.

Returning to our campaign example:

  • In the ad impression logs, the primary key is not a single field. Each row is defined by a combination of user ID, timestamp, and ad identifier.
  • In the transaction records, a transaction ID ensures uniqueness. However, in practice duplicates may still appear, for example if an order was logged twice.
  • In the weather dataset, the primary key is a composite of date and region.

Clear keys are critical because they allow datasets to be linked and merged reliably. If identifiers are missing, duplicated, or inconsistently coded, later stages of analysis will be compromised.
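
A quick way to confirm that a key really is a key is to count duplicated combinations. A minimal pandas sketch, using a tiny hypothetical extract of the impression logs:

    import pandas as pd

    impressions = pd.DataFrame({
        "user_id":   ["U101", "U101", "U102"],
        "timestamp": ["2025-01-01 09:00", "2025-01-01 09:00", "2025-01-01 09:05"],
        "ad_id":     ["A", "A", "B"],
    })

    # A clean composite key means no duplicated (user_id, timestamp, ad_id) combinations
    n_duplicates = impressions.duplicated(subset=["user_id", "timestamp", "ad_id"]).sum()
    print(n_duplicates)  # 1 here: the first two rows share the same composite key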

Data Representation

The same data can be organized in different shapes, most commonly long or wide formats.

  • In long format, each row corresponds to one observation of a variable at a point in time. The ad impression logs are naturally long: every row is one impression.
  date        user_id  campaign  impressions
  2025-01-01  U101     A         1
  2025-01-01  U102     B         1
  2025-01-02  U101     A         1
  • In wide format, multiple observations are spread across columns. A summary of impressions by date and campaign would look like this:
  date        impressions_A  impressions_B
  2025-01-01  1              1
  2025-01-02  1              0

Neither form is inherently better. Instead, the choice depends on purpose: statistical models usually expect long data, while dashboards and machine learning pipelines often work with wide data. Analysts need to recognize these formats and be ready to pivot between them.
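
As a small illustration, the pivot from the long table to the wide one can be sketched in pandas as follows (the column names are illustrative):

    import pandas as pd

    long_df = pd.DataFrame({
        "date": ["2025-01-01", "2025-01-01", "2025-01-02"],
        "campaign": ["A", "B", "A"],
        "impressions": [1, 1, 1],
    })

    # Long -> wide: one column per campaign, missing combinations filled with 0
    wide_df = (long_df
               .pivot_table(index="date", columns="campaign",
                            values="impressions", aggfunc="sum", fill_value=0)
               .add_prefix("impressions_")
               .reset_index())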

Common Issues

Even with clear keys and a chosen representation, raw data often brings practical challenges:

  • Missing values in important variables.
  • Duplicate rows caused by technical errors.
  • Inconsistent labels (e.g., “NYC” vs. “New York”).
  • Implausible figures such as negative sales or extremely large order quantities.

In our campaign example, the ad impression logs contain missing user IDs for some records, preventing them from being linked to transactions. The transaction data lists certain orders twice. The weather dataset uses Celsius while earlier company reports used Fahrenheit, creating the risk of misinterpretation.
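
Issues like these can be surfaced with a short diagnostic pass before any fix is applied. A sketch in pandas, using a hypothetical extract of the transaction data:

    import pandas as pd

    transactions = pd.DataFrame({
        "transaction_id": [1, 1, 2, 3],
        "city": ["NYC", "NYC", "New York", "Boston"],
        "price": [19.99, 19.99, -5.00, None],
    })

    print(transactions.isna().sum())            # missing values per column
    print(transactions.duplicated().sum())      # fully duplicated rows
    print(transactions["city"].value_counts())  # inconsistent labels
    print((transactions["price"] < 0).sum())    # implausible figures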

Understanding data structures in this way—keys, representation, and common issues—provides the foundation for engineering. It gives analysts a map of what they have before they make decisions about what they want to build.

7.3 Creating Your Derived Dataset

Understanding the structure of raw data is only the first step. Analysts must also decide what the final dataset for analysis should look like. This step involves making choices about the unit of analysis—what each row in the dataset will represent—and about the variables that will be included. In other words, you are designing the blueprint of your dataset before applying the technical operations that make it real.

Deciding on the Unit of Analysis

The unit of analysis determines the level of granularity. In marketing analytics, there are many possibilities: one row per customer, per product, per transaction, per campaign impression, or per week of activity. The right choice depends on the research or business question.

In our campaign example, the raw sources are all at different levels:

  • Ad impression logs: one row per user per impression.
  • Transaction records: one row per purchase.
  • Weather data: one row per region per day.

Suppose the marketing team wants to evaluate the effect of advertising on weekly sales by region. The derived dataset must therefore be structured at the region–week level, with columns capturing advertising exposure, transactions, and weather. Each row now represents one region in one week.

Designing the Final Structure

Once the unit of analysis is clear, analysts can design the structure of the derived dataset. This includes:

  • Keys and identifiers: in the example, a composite key of region and week.
  • Variables: derived measures such as the number of impressions in that week and region, total sales revenue, average basket size, or average daily temperature.
  • Consistency: ensuring that all sources can be harmonized to the chosen unit of analysis.

The resulting dataset might look as follows:

  region  week  ad_impressions  total_sales  avg_basket  avg_temp
  North   W01   54,200          €125,000     €32.50      2.1°C
  South   W01   33,800          €98,500      €29.80      6.4°C

Here, each row is cleanly defined, and the dataset is analysis-ready. The structure makes it possible to run models, generate reports, or design dashboards without ambiguity.

Creating a derived dataset is therefore a matter of decision-making as much as it is a technical task. Analysts must clarify what the dataset should represent before applying operations to build it. Without this step, there is a risk of merging incompatible data or constructing variables at the wrong level of detail, which can lead to misleading conclusions.

7.4 Common Data Operations

Once you know what you want your dataset to look like, the question becomes: how do we actually get there? This is where the real work of data engineering begins. Raw marketing data is rarely delivered neatly packaged. Instead, it arrives as click-level logs, line after line of transaction records, survey responses scattered across files, or contextual variables such as daily weather reports. To turn all this into a coherent dataset that matches your chosen unit of analysis, you need a set of reliable operations.

Think of these as the toolbox of the analyst. Each operation has its purpose, and together they allow you to shape, refine, and harmonize raw information into something that is not only analyzable, but also meaningful.

Cleaning and Auditing

The first responsibility of an analyst is to ensure that the data is free from obvious flaws. Just as a chef inspects ingredients before cooking, analysts must examine and clean their inputs before attempting analysis. If we don’t, the risk is clear: every model or report built on the flawed data will mislead.

Filtering. Rarely do we need every single observation from a dataset. Filtering helps us focus on the relevant parts. Suppose our campaign ran only in January—then we filter the impression logs to those dates. Or perhaps the company only cares about two focal regions—then we restrict the data accordingly. Filtering reduces noise and directs attention to what matters.
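
Both filters can be expressed in a couple of lines of pandas; the data frame and column names here are hypothetical:

    import pandas as pd

    impressions = pd.DataFrame({
        "date": pd.to_datetime(["2024-12-31", "2025-01-15", "2025-01-20"]),
        "region": ["North", "North", "West"],
    })

    in_january = impressions["date"].between("2025-01-01", "2025-01-31")
    focal_regions = impressions["region"].isin(["North", "South"])
    campaign_window = impressions[in_january & focal_regions]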

Deduplication. Marketing data is notoriously messy when it comes to duplicates. The same customer might appear under two slightly different IDs, or the same transaction could be logged twice because of a system hiccup. Duplicates inflate totals and distort averages. Removing them means making sure that each key (say, transaction ID) truly corresponds to only one record.
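
In pandas, once the key is known, deduplication is a single call; a minimal sketch with hypothetical names:

    import pandas as pd

    transactions = pd.DataFrame({
        "transaction_id": [1001, 1001, 1002],
        "amount": [25.0, 25.0, 40.0],
    })

    # Keep the first record per transaction ID; later copies are assumed to be logging glitches
    transactions = transactions.drop_duplicates(subset="transaction_id", keep="first")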

Handling missing values. Missing data is everywhere. A customer leaves a survey question blank, a cookie expires mid-session, or an entire reporting system goes offline for a day. Missing values are not just an inconvenience—they can bias analysis if ignored. Sometimes it is safe to drop a few rows. Sometimes we impute values, for example by carrying forward the last known sales figure when one week of data is absent. And sometimes, the fact that data is missing is itself meaningful, so we flag it as a new variable.
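
The three options (drop, impute, or flag) look like this in pandas, sketched on a hypothetical weekly sales series:

    import pandas as pd

    weekly = pd.DataFrame({
        "week": ["W01", "W02", "W03"],
        "sales": [1200.0, None, 1350.0],
    })

    dropped = weekly.dropna(subset=["sales"])                      # option 1: drop the row
    imputed = weekly.assign(sales=weekly["sales"].ffill())         # option 2: carry forward
    flagged = weekly.assign(sales_missing=weekly["sales"].isna())  # option 3: flag it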

Standardization. Even when the data are complete, they often speak different “languages.” One system records prices in euros, another in dollars. Dates appear as “02/01/2025” in one file and “2025-01-02” in another. Standardization is about making sure that all sources use consistent formats. Without it, we risk combining apples with oranges—or euros with dollars.
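
A sketch of two common standardizations, dates and currencies; the date format and the exchange rate are stand-ins for whatever the real sources require:

    import pandas as pd

    orders = pd.DataFrame({
        "order_date": ["02/01/2025", "03/01/2025"],   # day/month/year in this source
        "price": [100.0, 250.0],
        "currency": ["USD", "EUR"],
    })

    orders["order_date"] = pd.to_datetime(orders["order_date"], format="%d/%m/%Y")

    usd_to_eur = 0.92                                 # illustrative rate only
    is_usd = orders["currency"] == "USD"
    orders.loc[is_usd, "price"] = orders.loc[is_usd, "price"] * usd_to_eur
    orders.loc[is_usd, "currency"] = "EUR"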

Anonymization. In marketing analytics, privacy is always a concern. Names, addresses, and emails are rarely needed for analysis. Instead, analysts replace them with pseudonyms such as customer IDs. This protects individuals while still allowing data to be linked. Anonymization is not only an ethical practice but increasingly a legal requirement.
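
A minimal pseudonymization sketch: the e-mail column is replaced by an opaque ID that still links records consistently (the names are hypothetical, and in practice the mapping itself must be stored securely or discarded):

    import pandas as pd

    customers = pd.DataFrame({
        "email": ["ann@example.com", "bob@example.com", "ann@example.com"],
        "spend": [120.0, 80.0, 60.0],
    })

    # Same e-mail -> same pseudonymous ID
    codes = pd.factorize(customers["email"])[0] + 1
    customers["customer_id"] = "C" + pd.Series(codes, index=customers.index).astype(str)
    customers = customers.drop(columns="email")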

Validation checks. Finally, before moving on, analysts should ask: do these numbers make sense? Do weekly totals of transactions add up to the monthly numbers reported by accounting? Are there negative prices lurking in the dataset? Validation is the last inspection before the data can be trusted.
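
Such checks can be written as simple assertions; the reference figure below is hypothetical:

    import pandas as pd

    weekly = pd.DataFrame({"week": ["W01", "W02"], "sales": [1200.0, 1350.0]})

    assert (weekly["sales"] >= 0).all(), "negative sales found"

    monthly_total_from_finance = 2550.0               # hypothetical reference figure
    assert abs(weekly["sales"].sum() - monthly_total_from_finance) < 1e-6, \
        "weekly totals do not reconcile with the monthly report"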

Aggregation

Once the data is clean, the next challenge is to align everything to the right unit of analysis. This is where aggregation comes in. Aggregation is one of the most powerful and practical operations in data engineering. It allows us to take highly detailed data—transactions, clicks, impressions—and roll them up into the summaries that matter for analysis.

Imagine our campaign example. Ad impression logs contain millions of rows, one for each time a customer saw an ad. By themselves, these are far too granular for analysis. But when we group them by region and week, they become meaningful: “54,200 impressions in the North in week 1.” That single number now speaks the language of our derived dataset.

Aggregation is also the moment when new measures are born. By grouping data, we can compute:

  • Totals, such as total sales revenue per region per week.
  • Counts, such as the number of distinct customers who purchased.
  • Averages, such as average basket size.
  • Shares, such as the proportion of impressions from campaign A versus campaign B.
  • Growth rates, such as week-over-week changes in sales.
  • Indicators, such as whether a customer has churned.

Take a simple example.

Raw impression logs:

  user_id  region  date        campaign
  U101     North   2025-01-01  A
  U102     North   2025-01-01  A
  U103     South   2025-01-01  B

After aggregation (region–week):

  region  week  impressions  distinct_users
  North   W01   2            2
  South   W01   1            1

By designing the right summaries, aggregation doesn’t just reduce data—it enriches it with new, analytically useful variables.
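
A pandas sketch that reproduces the region–week summary above; the week label is derived from the date, and the names are illustrative:

    import pandas as pd

    impressions = pd.DataFrame({
        "user_id": ["U101", "U102", "U103"],
        "region": ["North", "North", "South"],
        "date": pd.to_datetime(["2025-01-01", "2025-01-01", "2025-01-01"]),
        "campaign": ["A", "A", "B"],
    })

    week_num = impressions["date"].dt.isocalendar().week
    impressions["week"] = "W" + week_num.astype(str).str.zfill(2)

    summary = (impressions
               .groupby(["region", "week"])
               .agg(impressions=("user_id", "size"),
                    distinct_users=("user_id", "nunique"))
               .reset_index())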

Merging and Integration

Once datasets are aggregated to a common key, they can be brought together. This is the heart of marketing analytics: weaving disparate sources into one coherent dataset. But merging is never trivial—it requires careful choices about how to align rows.

The mechanism for combining data is the join. Each type of join handles mismatches differently:

  • An inner join keeps only rows present in both datasets.
  • A left join keeps all rows from the left dataset, filling in columns from the right when matches exist.
  • A full join keeps everything, leaving blanks where no match is found.

For example, consider weekly transaction totals and weekly weather data. If we inner join them, regions or weeks missing in one source disappear entirely. If we use a left join, we keep all transaction weeks, even if weather is unavailable. Each choice has consequences: we might sacrifice completeness for precision, or vice versa. The analyst must decide deliberately, based on the research question.
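
The difference is easy to see in a small pandas sketch (hypothetical names and values):

    import pandas as pd

    sales = pd.DataFrame({"region": ["North", "North"], "week": ["W01", "W02"],
                          "total_sales": [125000, 118000]})
    weather = pd.DataFrame({"region": ["North"], "week": ["W01"], "avg_temp": [2.1]})

    inner = sales.merge(weather, on=["region", "week"], how="inner")  # W02 disappears
    left = sales.merge(weather, on=["region", "week"], how="left")    # W02 kept, avg_temp is NaN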

Reshaping and Pivoting

Datasets do not always arrive in the form we need. Sometimes they are “long,” sometimes “wide.” Reshaping allows us to move between these structures.

  • In long format, each row is one observation. It is tidy and often required for statistical models.
  • In wide format, multiple variables are spread across columns. This is intuitive for reporting and dashboards.

Take campaign impressions.

Long format:

  week  campaign  impressions
  W01   A         54,200
  W01   B         33,800
  W02   A         50,100

Wide format:

  week  impressions_A  impressions_B
  W01   54,200         33,800
  W02   50,100

Neither format is superior. Analysts pivot between them depending on the task. Long format is often required for regressions, while wide format simplifies comparisons in a dashboard. Recognizing and controlling these representations prevents confusion and allows the same data to serve multiple purposes.
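
Both directions can be sketched in pandas, using the tables above as a template:

    import pandas as pd

    long_df = pd.DataFrame({"week": ["W01", "W01", "W02"],
                            "campaign": ["A", "B", "A"],
                            "impressions": [54200, 33800, 50100]})

    # Long -> wide for the dashboard
    wide_df = (long_df.pivot(index="week", columns="campaign", values="impressions")
                      .add_prefix("impressions_")
                      .reset_index())

    # Wide -> long again for a regression
    back_to_long = (wide_df
                    .melt(id_vars="week", var_name="campaign", value_name="impressions")
                    .assign(campaign=lambda d: d["campaign"].str.removeprefix("impressions_")))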

Working with Time and Text

Two types of variables deserve special mention because they are both common and messy in marketing data: time and text.

Time. Dates rarely come in a tidy format. One dataset uses “01-02-2025,” another “2025/02/01.” Some report daily data, others weekly. The first task is to convert everything into a consistent form. Analysts often extract components like week, month, or year so that datasets can be aligned. When data are missing, methods like carrying forward the last observation or interpolating between two points help maintain continuity.
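
A sketch of the typical time fixes: parse the dates, derive a week label, and fill a one-day gap (the format string and names are illustrative):

    import pandas as pd

    weather = pd.DataFrame({
        "date": ["01-02-2025", "02-02-2025", "03-02-2025"],   # day-month-year in this source
        "avg_temp": [2.1, None, 3.4],
    })

    weather["date"] = pd.to_datetime(weather["date"], format="%d-%m-%Y")
    weather["week"] = weather["date"].dt.isocalendar().week
    weather["avg_temp"] = weather["avg_temp"].interpolate()   # or .ffill() to carry forward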

Text. Text fields are equally challenging. Brand names, artist names, or product descriptions often come with typos, abbreviations, or inconsistent spellings. Left untreated, they prevent proper grouping or merging. This is where regular expressions become invaluable: they allow us to detect patterns such as “all names containing numbers” or “strings ending in ‘Ltd.’” Cleaning text transforms chaotic strings into standardized categories that can be used in analysis.
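
A regular-expression sketch in pandas; the patterns are illustrative of the idea rather than a complete cleaning rule set:

    import pandas as pd

    retailers = pd.Series(["Acme Ltd.", "ACME Limited", "acme ltd", "Beta GmbH"])

    cleaned = (retailers
               .str.lower()
               .str.replace(r"\s+(ltd\.?|limited)$", "", regex=True)  # strip legal suffixes
               .str.strip())

    has_digits = retailers.str.contains(r"\d")   # flag names containing numbers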

From Raw to Analysis-Ready Data

When analysts first encounter marketing datasets, it is tempting to think that analysis begins the moment the file is loaded. In practice, the real work starts much earlier, as raw data is not a finished product—it is more like raw material. Just as timber must be cut, sanded, and assembled before it becomes furniture, so must data be cleaned, reshaped, and integrated before it becomes analysis-ready.

Hence, through common data operations—cleaning, aggregating, merging, reshaping, and working with time and text—analysts gradually transform raw inputs into a dataset that truly supports analysis. Each step is both technical and conceptual, requiring judgment about what to keep, what to discard, and how to represent information faithfully. Done well, this process ensures that the final dataset is not just a collection of numbers, but a solid foundation for insight.

Note that the path from raw to analysis-ready data is rarely linear. Analysts move back and forth between the toolbox of operations described earlier, revisiting cleaning steps after a merge, or re-aggregating once they notice that a variable is too detailed. Still, the overall journey follows a recognizable logic.

Take again our campaign example. We began with three different sources: impression logs, transaction records, and weather data. Each had its own keys, quirks, and structure. The first step was to clean them: remove duplicate transaction IDs, standardize date formats, and anonymize customer identifiers. Next, we aggregated each dataset to a common unit of analysis—region by week—so that they could eventually be merged. Impressions were summed, transactions converted into total sales and average basket size, and weather averaged across days.

With this alignment, we could then merge the datasets. Using region and week as composite keys, we joined impressions, transactions, and weather into a single panel. Along the way, we made conscious choices about join types—whether to keep incomplete weeks, or to drop regions with no coverage.

Finally, we reshaped the data for analysis. Some models require long format (one row per region, week, and variable), while the marketing dashboard team prefers wide format (columns for each campaign). We also engineered additional features: conversion rates from impressions to sales, week-to-week growth in revenue, and churn indicators for customers who stopped purchasing.

The result is a derived dataset that is coherent, documented, and analysis-ready. Each row now represents one region in one week, with all relevant variables aligned. Analysts can confidently use this dataset to run regressions, build forecasts, or design dashboards.

This example illustrates a larger point: an analysis-ready dataset is not found, it is built. Every decision along the way—how to handle missing values, how to define churn, which join to use—shapes the insights that follow. Data engineering is therefore not only a technical exercise but also an analytical one. It forces us to think carefully about what the data should represent, long before we start fitting models or drawing conclusions.

7.5 Tools and Practices

The operations we have discussed—cleaning, aggregating, merging, reshaping—are concepts. They describe what analysts need to do, not necessarily how they should do it. To put these ideas into practice, we rely on software ecosystems that provide efficient and reliable ways of working with data. While each ecosystem has its own syntax, they all serve the same goal: helping analysts transform raw inputs into analysis-ready datasets.

R: dplyr and data.table

Within R, two packages dominate day-to-day data engineering: dplyr and data.table.

dplyr is designed around the idea of a “grammar of data manipulation.” Each operation—filtering rows, creating new variables, summarizing groups—is expressed through a clear, intuitive verb. These verbs can be strung together into pipelines, allowing analysts to read their code almost like a sentence: take this dataset, then filter it, then group it, then summarize it. For teaching and for collaborative work, dplyr has become the default because of this readability.

data.table, by contrast, emphasizes speed and efficiency. Its syntax can be less intuitive at first, but it is remarkably powerful for large datasets. Many analysts describe data.table as “closer to the metal”: it handles memory carefully, executes operations quickly, and scales to millions of rows without slowing down. Whereas dplyr favors clarity, data.table favors performance.

For most marketing analytics projects, either package will suffice. The important point is that both provide a consistent way to apply the data operations we have outlined.

Python Ecosystem

Python has grown into one of the most widely used languages for data science, and at the center of this ecosystem is pandas. Pandas was designed to make working with structured, tabular data easy. At its heart is the DataFrame, a table-like object that feels similar to a spreadsheet but with far greater flexibility.

With pandas, analysts can filter rows, select columns, join datasets, and perform aggregations with just a few lines of code. Much like dplyr in R, pandas supports the typical workflow of “take this dataset, then filter it, then group it, then summarize it.” The appeal of pandas lies in its versatility: it can handle anything from a small CSV file to moderately large datasets.
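
That workflow reads as a method chain; a minimal sketch with hypothetical names:

    import pandas as pd

    transactions = pd.DataFrame({
        "region": ["North", "North", "South"],
        "week": ["W01", "W01", "W01"],
        "revenue": [40.0, 85.0, 60.0],
    })

    weekly_sales = (transactions
                    .query("region == 'North'")
                    .groupby(["region", "week"], as_index=False)
                    .agg(total_sales=("revenue", "sum"),
                         avg_basket=("revenue", "mean")))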

More recently, Polars has emerged as a modern alternative to pandas. Written in Rust and optimized for multi-core processors, Polars is considerably faster and more memory-efficient, especially for larger datasets. Its syntax is similar to pandas, but it also borrows ideas from dplyr, including the use of expression chains that make workflows easier to read. For many analysts, pandas remains the default, but Polars signals the direction in which the Python ecosystem is moving: faster, more scalable, and closer in spirit to R’s data.table.
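
For comparison, the same summary expressed in Polars as a sketch; recent Polars versions use group_by, while older releases spelled it groupby:

    import polars as pl

    transactions = pl.DataFrame({
        "region": ["North", "North", "South"],
        "week": ["W01", "W01", "W01"],
        "revenue": [40.0, 85.0, 60.0],
    })

    weekly_sales = (transactions
                    .filter(pl.col("region") == "North")
                    .group_by(["region", "week"])
                    .agg(pl.col("revenue").sum().alias("total_sales"),
                         pl.col("revenue").mean().alias("avg_basket")))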

The key takeaway is that the concepts are the same: grouping, aggregating, merging, reshaping. Whether expressed in pandas, Polars, or R, the building blocks of data engineering remain identical.

SQL

If pandas and dplyr are the tools analysts use on their laptops, SQL (Structured Query Language) is the language of data inside organizations. Most companies store their marketing data in relational databases: customer tables, transaction tables, product catalogs, and campaign logs all live in these systems. To access them, analysts must speak SQL.

SQL has a reputation for being old-fashioned, but its core commands are timeless. SELECT chooses columns, WHERE filters rows, GROUP BY aggregates, and JOIN merges. These are the very same operations we have discussed in this chapter. The difference is that SQL operates directly on the database, often holding billions of rows, rather than in memory on your machine.

For example, imagine a marketing database with two tables: one containing customer transactions and another containing campaign impressions. With SQL, you can:

  • Summarize transactions by customer and month (GROUP BY customer_id, month).
  • Join them with impressions to compute conversion rates (JOIN ON customer_id).
  • Filter to only include active customers (WHERE last_purchase_date > '2025-01-01').

Many firms expect analysts to have at least a working knowledge of SQL, because it is the gateway to the data warehouse. Even if the actual modeling is done later in R or Python, the first steps—extracting, filtering, and aggregating—are usually written as SQL queries.
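
A self-contained sketch, using Python's built-in sqlite3 as a stand-in for the corporate database (the table and column names are illustrative):

    import sqlite3
    import pandas as pd

    con = sqlite3.connect(":memory:")
    pd.DataFrame({"customer_id": [1, 1, 2],
                  "month": ["2025-01", "2025-01", "2025-02"],
                  "amount": [20.0, 35.0, 50.0]}).to_sql("transactions", con, index=False)

    query = """
        SELECT customer_id,
               month,
               SUM(amount) AS total_spend,
               COUNT(*)    AS n_orders
        FROM transactions
        GROUP BY customer_id, month
    """
    monthly = pd.read_sql_query(query, con)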

SQL also has the advantage of reproducibility: queries can be saved and rerun at any time, ensuring that the same subset of data can be regenerated. For collaborative environments, this makes SQL indispensable.

Beyond the Basics: Databases and Modern Data Platforms

Not all marketing data lives neatly in a spreadsheet-sized table. As organizations grow, so do their data infrastructures, and analysts increasingly encounter systems that are bigger, faster, and more complex than what can be handled with pandas or dplyr alone. It is important to have at least a conceptual map of this landscape, even if you never become a database administrator yourself.

Relational Databases (SQL)

Most corporate data still resides in relational databases, systems built around the familiar logic of tables, rows, and columns. Tools such as MySQL, PostgreSQL, or Microsoft SQL Server are workhorses in this space. Their strength lies in enforcing structure: every table has a schema, every row follows the same rules, and relationships across tables are defined by keys. For analysts, the advantage is clear—data is consistent and queries are predictable. The trade-off is that relational databases can struggle when data becomes highly unstructured (e.g., free text, images) or when scale grows beyond what a single machine can handle.

NoSQL Databases

To deal with new types of data, many firms also use NoSQL databases. These systems relax the rigid structure of relational databases and allow information to be stored in more flexible forms. Document stores such as MongoDB save data as JSON-like objects, key–value stores such as Redis allow lightning-fast retrieval, and graph databases such as Neo4j capture relationships between entities. For marketing analytics, NoSQL databases are especially useful when dealing with semi-structured data like web logs, clickstreams, or social network connections. The trade-off is that queries can be less standardized, and consistency is not always guaranteed.

Cloud-Based Data Warehouses

A more recent trend is the rise of cloud-based data warehouses such as Snowflake, Google BigQuery, and Amazon Redshift. These platforms combine the familiar SQL interface with massive scalability: they can store terabytes or even petabytes of data and run complex queries in seconds. For analysts, this means that datasets once considered “too big” for analysis are now accessible with a few lines of SQL. Because these systems run in the cloud, they also enable collaboration across teams, with secure access controls and integrated tools for analytics.

Choosing the Right Platform

The proliferation of options can feel overwhelming, but the principle remains simple:

  • Use relational databases for structured, well-defined data such as transactions and customer records.
  • Turn to NoSQL systems for semi-structured or highly connected data such as web logs or social networks.
  • Rely on cloud warehouses when the scale exceeds local or on-premise capabilities.

For the analyst, the key is not to master every platform, but to understand their roles. Whether you write queries in PostgreSQL, explore JSON documents in MongoDB, or summarize terabytes of impressions in Snowflake, the same concepts apply: cleaning, aggregating, merging, reshaping. The tools evolve, but the logic of data engineering remains constant.

Good Practices

Regardless of the software ecosystem, good habits make the difference between chaotic scripts and reproducible workflows. The practices below allow analysts not only to transform data, but to do so in a way that is efficient, transparent, and responsible. The concepts of data engineering remain constant across ecosystems; the choice of tool is secondary to a mindset of careful preparation and reproducibility.

Documentation. Write down every transformation you make. A few months later, you will not remember why you recoded “NYC” to “New York” or why you dropped a certain set of customers. Documentation is not busywork—it is the memory of your analysis.

Separating raw and derived data. Never overwrite your original data. Keep raw files intact and generate new, derived datasets as you go. This allows you to retrace your steps if something goes wrong and protects against accidental loss of information.

Version control. Treat your scripts the way software engineers treat code. Tools such as Git record every change, making it possible to go back to earlier versions or to collaborate with others without overwriting each other’s work.

Validation. Make checking a routine habit. At each stage, ask whether the results are plausible. Do totals add up? Do averages look reasonable? Validation should happen throughout the workflow, not just at the end.

Privacy-aware workflows. Finally, never forget that marketing data often concerns individuals. Anonymization, pseudonymization, and secure storage are not optional—they are responsibilities. Regulations such as the GDPR in Europe make this explicit, but even where rules are looser, privacy-aware practices build trust with customers and ensure that analysis does not cause harm.

7.6 Conclusion

Data engineering may not be the most glamorous part of marketing analytics, but it is without question the most essential. Without a solid foundation, every model, dashboard, or forecast risks being built on sand. What this chapter has shown is that engineering data is more than a set of technical tricks: it is a way of thinking.

First, we learned to read the structure of raw data—what the rows represent, how observations are keyed, and what common pitfalls to look out for. Then we shifted perspective to the dataset we want to build, deciding on the appropriate unit of analysis and sketching its blueprint. From there, we opened the toolbox: cleaning and auditing to ensure quality, aggregation to align different granularities, merging to integrate multiple sources, reshaping to serve different purposes, and finally, the special care required when working with time and text. Along the way, we saw how the same logic applies whether we use R, Python, SQL, or modern cloud-based platforms.

Perhaps the most important message is this: analysis-ready datasets are not discovered, they are created. Each filtering decision, each definition of churn, each choice of join type represents a judgment that shapes the insights to come. That is why data engineering is not just a preliminary step but an analytical exercise in itself.

In practice, this work is often given to junior analysts. Yet those who master it gain an outsized advantage: they become the people others trust, because their results are reliable. Over time, this skill builds credibility.

In the next chapter, we move one step further. Once you know how to engineer a dataset, the question becomes: how do you make this process repeatable, scalable, and reliable over time? Here, concepts like pipelines, automation, and continuous integration come into play. We will explore how to move from one-off data preparation to systematic workflows that can be shared, scheduled, and reproduced—so that good data engineering becomes not just a task, but a habit.