Data Warehouse - The Ultimate Guide [2025] | Master Data Modeling

TL;DR

This video is a crash course on data warehousing, covering everything from the basics to advanced concepts. Ansh Lamba breaks down what data warehouses are, how they differ from regular databases, and how to build and manage them using tools like Databricks and Spark SQL. He also covers data modeling, dimensional data modeling, and slowly changing dimensions.

  • Data warehousing is a must-have skill for data engineers.
  • The video covers ETL, incremental loading, and different data modeling schemas.
  • Practical implementation using Databricks and Spark SQL is shown.
  • Slowly changing dimensions (SCDs) are explained with code examples.

Introduction [0:00]

The demand for data engineers is increasing, and mastering data warehousing is key to acing interviews. This video covers data warehousing from scratch, including ETL layers, incremental data loading, CDC, dimensional data modeling, star and snowflake schemas, and different types of fact and dimension tables. It also covers advanced topics like transactional vs. periodic fact tables, degenerate dimension tables, role-playing and junk dimensions, and slowly changing dimensions with code implementation. This guide is based on "The Data Warehouse Toolkit" by Kimball and includes practical implementations using the latest tools and technologies.

What is Data Warehouse? [7:15]

A data warehouse is a central location or repository where an organization stores all its data and information. This includes data regarding transactions, sales, budgets, marketing, HR, and more. It allows tracking of the total data profile that a company or organization is holding. Think of it as a luggage bag where you store every single piece of your outfit for a vacation, including formal, informal, and party wear.

Database VS Data Warehouse [12:42]

SQL databases like MySQL, PostgreSQL, and MS SQL Server are database management systems (DBMS) that manage databases. Both databases and data warehouses store data in the form of tables and columns, but the key difference lies in how data is stored and pushed between the two architectures. Databases store data in the form of transactions, which are small chunks of near real-time data. Data warehouses, on the other hand, load data in bulk using ETL processes, typically on a daily basis, to avoid slowing down the performance of the constantly interacting database.

What is Data Warehousing [23:19]

Data warehousing is the process of fetching data from a database and storing it in a data warehouse. This process is facilitated by ETL (Extract, Transform, Load) operations. A data engineer applies ETL logic to pull and store data in the data warehouse. While many tools now allow even non-data engineers to perform ETL, the core importance of a data engineer lies in data warehousing itself, specifically in how to store the data effectively, which depends on the unique business requirements.

ETL Layers [28:29]

ETL (Extract, Transform, Load) is a process that contains three stages: extract, transform, and load. Data is first pulled from the database, which acts as the source. Within the data warehouse network, there are two layers: staging and core. The staging layer receives data in its raw form to avoid putting a load on the database. Transformations such as adding, removing, or transforming columns are then applied. Finally, the transformed data is stored in the core layer, which is exposed to data analysts, data scientists, and business analysts. Data marts, which are subsets of the data warehouse, can be created for specific domains like finance or HR.
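
Concretely, the flow from the staging layer to the core layer might look like the following Spark SQL sketch; the database, table, and column names here are illustrative assumptions, not taken from the video.

  -- Create the two warehouse layers (names are illustrative)
  CREATE DATABASE IF NOT EXISTS staging;
  CREATE DATABASE IF NOT EXISTS core;

  -- Staging: land the source data as-is, without transformations
  CREATE TABLE staging.orders_raw AS
  SELECT * FROM source_db.orders;

  -- Core: apply transformations before exposing the data to analysts
  CREATE TABLE core.orders AS
  SELECT
    order_id,
    customer_id,
    CAST(order_date AS DATE) AS order_date,
    ROUND(amount, 2)         AS amount
  FROM staging.orders_raw;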

Incremental Loading [34:04]

Incremental loading involves loading only the new records to the staging area instead of the entire dataset. This is achieved using Change Data Capture (CDC). During the initial load, the maximum date is stored, and subsequent loads only fetch data with dates greater than this stored date. The staging layer is typically truncated every time to ensure only new records are processed, although there are cases where the history of records needs to be maintained. Staging layers can be transient (truncated every time) or persistent (data is kept).
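
A minimal Spark SQL sketch of this watermark-based pattern with a transient staging table; all table names, column names, and date values here are illustrative assumptions.

  -- 1. Find the highest date already loaded into the core layer
  SELECT MAX(order_date) AS last_loaded FROM core.orders;

  -- 2. Truncate the transient staging table
  TRUNCATE TABLE staging.orders_raw;

  -- 3. Pull only records newer than the stored watermark
  INSERT INTO staging.orders_raw
  SELECT * FROM source_db.orders
  WHERE order_date > DATE '2024-06-30';   -- value returned by step 1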

Databricks Free Account [46:10]

To create a free Databricks account, search for "Databricks Community Edition" on Google and sign up. If you already have an account, click on the login link. This free account provides a workspace to run SQL queries without needing to install a local SQL workbench.

Databricks Overview [48:27]

In Databricks, go to the workspace to create folders for storing notebooks. Create a new folder, such as "data warehouse," and then create a new notebook within this folder. To perform SQL operations, a compute resource (cluster) is needed. Click on the connect button and create a new resource, which is a Spark cluster. This cluster will automatically terminate after 20 minutes of inactivity.

Incremental Data Loading using Spark SQL [57:32]

To use SQL in Databricks notebooks, change the default language from Python to SQL. This can be done by clicking on the language option in a cell or by changing the default settings in the edit menu. Use %md or select "markdown" to create headings for readability. To view tables and databases, click on the catalog button. Create a database called "sales" using the SQL syntax CREATE DATABASE sales. Then, create a table within this database and insert data using SQL commands.
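
The walkthrough boils down to statements along these lines; the database name "sales" comes from the video, while the table and column names are illustrative assumptions.

  -- Create the database and a table inside it, then load sample rows
  CREATE DATABASE sales;

  CREATE TABLE sales.orders (
    order_id   INT,
    order_date DATE,
    customer   STRING,
    amount     DOUBLE
  );

  INSERT INTO sales.orders VALUES
    (1, DATE '2024-01-01', 'Alice', 120.50),
    (2, DATE '2024-01-02', 'Bob',   75.00);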

What is Data Modeling [1:19:02]

Data modeling is essential for efficiently storing data without redundancy and saving storage costs. It involves structuring data to give it a defined structure. Without data modeling, it is difficult to build efficient reports, make data-driven decisions, and maintain dashboards.

What is Dimensional Data Modeling [1:25:45]

Dimensional data modeling is a technique used to store data in the form of facts and dimensions tables. This is a type of logical data model. Instead of using traditional ER (entity relationship) models with normalization, data engineers use dimensional data modeling.

Fact Table and Dimension Tables [1:29:06]

A fact table stores data at the most granular level and holds numeric columns (facts) along with foreign keys to the dimension tables. Dimension tables, on the other hand, hold the context of the data and do not store numeric measures. Dimension tables are created based on different contexts, such as customers, products, or regions. In a data model, there is one fact table with multiple dimension tables around it, and the relationship between the fact table and each dimension table is always many-to-one.
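
A rough sketch of what such tables could look like in Spark SQL; all names here are assumptions, not the video's exact schema.

  -- Dimension table: descriptive context, no measures
  CREATE TABLE dim_customer (
    customer_key  BIGINT,   -- surrogate key
    customer_name STRING,
    city          STRING
  );

  -- Fact table: numeric measures plus foreign keys (many-to-one to each dimension)
  CREATE TABLE fact_sales (
    customer_key BIGINT,    -- foreign key to dim_customer
    product_key  BIGINT,    -- foreign key to dim_product
    order_date   DATE,      -- could also be a key to a date dimension
    quantity     INT,       -- numeric fact
    amount       DOUBLE     -- numeric fact
  );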

STAR Schema VS SNOWFLAKE Schema [1:37:37]

When building a dimensional data model, there are two schema options: star schema and snowflake schema. Star schema consists of one fact table and multiple dimension tables directly connected to it, without any hierarchy between the dimensions. Snowflake schema, on the other hand, includes a hierarchy and indirect linkage of a third dimension with the fact table. Star schema is more commonly used because it is easier to maintain.
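For illustration, the same product dimension under both schemas might be defined as follows; the names are assumptions.

  -- Star schema: the product dimension holds the category directly
  CREATE TABLE dim_product (
    product_key  BIGINT,
    product_name STRING,
    category     STRING
  );

  -- Snowflake schema: the category is split into its own table,
  -- linked to the fact table only indirectly through the product dimension
  CREATE TABLE dim_category (
    category_key  BIGINT,
    category_name STRING
  );

  CREATE TABLE dim_product_snowflake (
    product_key  BIGINT,
    product_name STRING,
    category_key BIGINT    -- points to dim_category
  );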

Dimension Tables and Fact Table using Spark SQL [1:46:23]

A new database called "sales_new" is created with a source table inside it, and a separate set of data warehouse tables is built on top. In the transformation layer, the data is cleaned and transformed. Dimension tables are created for customers, products, and regions, with surrogate keys generated using the ROW_NUMBER() function. The fact table is then built, and the dimension surrogate keys are brought into it by joining back to the dimension tables.
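
A condensed Spark SQL sketch of these steps; the database name "sales_new" comes from the video, while the table and column names are assumptions.

  -- Dimension table with a surrogate key generated by ROW_NUMBER()
  CREATE TABLE sales_new.dim_customer AS
  SELECT
    ROW_NUMBER() OVER (ORDER BY customer_name) AS dim_customer_key,
    customer_name,
    city
  FROM (SELECT DISTINCT customer_name, city FROM sales_new.orders) src;

  -- Fact table: join back to the dimension to pick up its surrogate key
  CREATE TABLE sales_new.fact_sales AS
  SELECT
    d.dim_customer_key,
    o.amount
  FROM sales_new.orders o
  JOIN sales_new.dim_customer d
    ON o.customer_name = d.customer_name;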

Types of Fact Tables [2:30:39]

There are three types of fact tables: granular or transactional fact tables, periodic fact tables, and accumulating fact tables. Transactional fact tables keep data at the most granular level, with one transaction equaling one row. Periodic fact tables, also known as snapshot fact tables, aggregate data over a specific period (e.g., monthly). Accumulating fact tables describe a process or journey, with one row representing the entire process and including multiple date columns.
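
For example, a periodic (snapshot) fact table could be derived from a transactional fact table with a simple roll-up like this; the table and column names are assumptions.

  -- Monthly snapshot: one row per customer per month instead of per transaction
  CREATE TABLE fact_sales_monthly AS
  SELECT
    date_trunc('month', order_date) AS sales_month,
    customer_key,
    SUM(amount) AS total_amount,
    COUNT(*)    AS order_count
  FROM fact_sales
  GROUP BY date_trunc('month', order_date), customer_key;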

Types of Dimension Tables [2:40:34]

There are several types of dimension tables: conformed dimensions, role-playing dimensions, junk dimensions, and degenerate dimensions. Conformed dimensions are shared by more than one fact table. Role-playing dimensions are used when a dimension table is connected to a fact table on multiple conditions (for example, a date dimension joined as order date and as ship date). Junk dimensions group miscellaneous low-cardinality flags and indicators (columns with only a few distinct values) into a single small dimension. Degenerate dimensions consist of a single identifier column, such as an order or invoice number, that carries no further contextual attributes and so has no separate dimension table.
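
As a small illustration, a junk dimension might be defined like this; the table name and flag columns are assumptions.

  -- Junk dimension: miscellaneous low-cardinality flags combined into one table
  CREATE TABLE dim_order_flags (
    order_flags_key BIGINT,   -- surrogate key referenced by the fact table
    is_gift         BOOLEAN,
    is_returned     BOOLEAN,
    payment_type    STRING    -- e.g. 'CARD', 'CASH', 'UPI'
  );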

Slowly Changing Dimensions [2:49:47]

Slowly changing dimensions (SCDs) are used to manage changes in dimension tables. The common types are Type 0 (no change), Type 1 (upsert), Type 2 (preserve history), and Type 3 (track the previous value). Type 1 (upsert) updates existing records in place and inserts new ones, so no history is kept. Type 2 adds an effective start date, an effective end date, and a current-record flag to track changes over time. Type 3 preserves the previous value by adding a new column for it.
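
A sketch of the extra columns a Type 2 dimension typically carries; the table and column names are assumptions.

  -- SCD Type 2 dimension: each change adds a new row rather than overwriting
  CREATE TABLE dim_product_scd2 (
    product_key     BIGINT,    -- surrogate key, new value per version
    product_id      INT,       -- business key, stays the same across versions
    product_name    STRING,
    price           DOUBLE,
    effective_start DATE,      -- when this version became valid
    effective_end   DATE,      -- when it was superseded (open-ended if current)
    is_current      BOOLEAN    -- flag marking the row currently in use
  );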

Implementing SCD Type 1 in Databricks with Spark SQL [3:00:00]

A new database called "sales_SCD" is created. A table is created and populated with initial data. To implement SCD Type 1, a merge statement is used. The merge statement updates the target table (dimension table) with data from the source (view). If a record matches based on the product ID, it is updated; otherwise, a new record is inserted.
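
The merge described here could look roughly like this in Spark SQL; the database name "sales_SCD" comes from the video, while the table, view, and column names are assumptions.

  -- SCD Type 1 upsert: update matching rows, insert new ones, keep no history
  MERGE INTO sales_SCD.dim_product AS tgt
  USING source_view AS src
    ON tgt.product_id = src.product_id
  WHEN MATCHED THEN
    UPDATE SET
      tgt.product_name = src.product_name,
      tgt.price        = src.price
  WHEN NOT MATCHED THEN
    INSERT (product_id, product_name, price)
    VALUES (src.product_id, src.product_name, src.price);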

Date: 1/9/2026 Source: www.youtube.com