TL;DR
Alright, so this data engineering crash course, put together by Justin Chow, gives you the lowdown on what it takes to be a data engineer. You'll get a handle on databases, Docker, and analytical engineering. Plus, there's some cool stuff on building data pipelines with Airflow, doing batch processing with Spark, and handling streaming data with Kafka. And to top it off, you'll build a complete end-to-end pipeline project to test your skills.
- Data engineering is super important because a lot of big data projects fail due to unreliable data.
- Docker is a lifesaver for making sure your applications work the same everywhere.
- SQL is your go-to language for bossing around databases.
Introduction to Data Engineering [1:03]
So, why data engineering? Well, a whopping 85-87% of big data projects tank because of dodgy data infrastructure and poor data quality. In the past, data scientists were expected to handle the data infrastructure themselves, which led to incorrect data modeling and lots of repetitive work, and in turn to high turnover among data scientists. Now there's a big need for data engineers to clean up the data and extract value from it. Data engineers are also crucial for making data-driven decisions, especially with all the AI and ML stuff going on. In the US, the median salary for data engineers is roughly $90k to $150k a year, depending on where you are.
Introduction to Docker [3:41]
Docker is an open-source platform that makes it easy to build, ship, and run applications inside containers. Think of it as a way to package your entire environment, dependencies and all, into one neat container that anyone can use. Containers are lightweight, portable, and self-sufficient, bundling everything an application needs to run. Docker helps engineers package their applications, avoid issues during testing and development, and ensure environments are reproducible. There are three main concepts in Docker: Dockerfiles, Docker images, and Docker containers.
A Dockerfile is a text file with instructions (a blueprint) for creating a Docker image. The Docker image is a lightweight, standalone, executable package that includes everything needed to run a piece of software. Images are read-only and immutable. A Docker container is a runtime instance of a Docker image. It's isolated from the host and other containers, with its own file system, and can be started, stopped, and deleted independently.
To get started with Docker, first install Docker Desktop, which includes Docker Compose. Then follow the Docker documentation's getting started guide to containerize an application. This involves cloning a sample app, creating a Dockerfile with instructions on how to build the image, and then building and running the image inside a container.
Containerizing an Application [9:15]
To containerize an application, clone the getting started app and create a Dockerfile in the root directory. The Dockerfile contains the instructions for building the image, such as using the node:18 base image, setting the working directory, copying files, running commands, and exposing a port. Use the docker build command to tag and build the image. Then, use the docker run command to start the container, mapping a local port to the container's port. You can then access the containerized app via localhost.
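The course does this with the docker CLI, but if you'd rather script those steps, the Docker SDK for Python (pip install docker) can drive the same build-and-run flow. Treat this as a minimal sketch: the image tag, container name, and port 3000 are assumptions based on the getting-started app, not something prescribed by the course.

```python
import docker

client = docker.from_env()  # talks to the local Docker daemon

# Build the image from the Dockerfile in the current directory and tag it.
image, build_logs = client.images.build(path=".", tag="getting-started")

# Run a container from that image, mapping host port 3000 to container port 3000
# (the port the getting-started app is assumed to listen on).
container = client.containers.run(
    "getting-started",
    name="getting-started-app",
    detach=True,
    ports={"3000/tcp": 3000},
)
print(container.status)  # the app should now be reachable at http://localhost:3000
```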
To update the application, modify the code and rebuild the image. Remove the old container and create a new one with the updated image. To persist data, create a volume and mount it to the container. This ensures that data is saved even when the container is stopped or removed. For multi-container apps, create a network to allow the containers to communicate. Attach the containers to the network and use environment variables to configure the connections.
Docker Compose simplifies the process of defining and sharing multi-container applications. Create a compose.yaml file to define the services, images, commands, ports, volumes, and environment variables for each container. Use the docker compose up command to start the application and the docker compose down command to stop it.
Introduction to SQL [30:58]
SQL (Structured Query Language) is the standard language for database creation and manipulation. To start a playground, set up a database using Postgres within a Docker container. First, pull the Postgres image from Docker Hub. Then, create a container running the Postgres image, setting environment variables for the database, user, and password. Create a new database inside Postgres and connect to it using the psql command. Create tables with fake data to play around with.
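If you want to script that playground setup instead of typing the docker commands by hand, the Docker SDK for Python can pull the Postgres image and start the container for you. The image tag, credentials, and database name below are placeholders, so swap in your own values.

```python
import docker

client = docker.from_env()

# Pull the official Postgres image (any recent tag works) and start a
# throwaway playground container with the usual environment variables.
client.images.pull("postgres", tag="16")
playground = client.containers.run(
    "postgres:16",
    name="sql-playground",
    detach=True,
    ports={"5432/tcp": 5432},
    environment={
        "POSTGRES_USER": "postgres",      # placeholder credentials
        "POSTGRES_PASSWORD": "secret",
        "POSTGRES_DB": "test_db",
    },
)
print(playground.name, playground.status)
```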
Basic SQL Commands [34:46]
The SELECT command is used to query data from a table. Use SELECT * FROM users to select all columns from the users table. The DISTINCT keyword is used to select unique values from a column. For example, SELECT DISTINCT email FROM users selects all unique emails from the users table. The UPDATE command is used to modify data in a table. For example, UPDATE users SET email = 'john@example.com' WHERE first_name = 'John' updates John's email in the users table.
The INSERT command is used to add new data into a table. The LIMIT command is used to limit the number of records returned by a query. For example, SELECT * FROM films LIMIT 5 returns only the first five records from the films table. Aggregate functions in SQL allow you to perform computations on a single column across all rows of a table.
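As a quick worked example, here's how a couple of those commands look when run from Python with psycopg2 against the playground database. The connection details are placeholders, and the users and films tables are the ones used in this summary, so adjust to your own setup.

```python
import psycopg2

# Placeholders: match the credentials you gave the Postgres container.
conn = psycopg2.connect(
    host="localhost", port=5432,
    dbname="test_db", user="postgres", password="secret",
)
cur = conn.cursor()

# Parameterized INSERT -- psycopg2 handles quoting and escaping for you.
cur.execute(
    "INSERT INTO users (first_name, last_name, email) VALUES (%s, %s, %s)",
    ("Jane", "Doe", "jane@example.com"),
)
conn.commit()

# LIMIT caps how many rows come back.
cur.execute("SELECT * FROM films LIMIT 5;")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```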
Aggregate Functions and Group By [43:50]
The COUNT function counts the number of rows that match a condition. The SUM function calculates the sum of all numbers in a numeric column. The AVG function calculates the average of a numeric column. The MAX and MIN functions find the highest and lowest numeric values in a column. The GROUP BY statement groups identical data into groups, typically used with aggregate functions. For example, SELECT rating, AVG(user_rating) FROM films GROUP BY rating groups the average user rating by each rating category.
Joins, Unions, and Subqueries [49:31]
JOIN combines data from two different tables. INNER JOIN returns only the rows where there is a match in both tables. LEFT JOIN returns all rows from the left table and the matched rows from the right table, with null values for non-matching rows. Aliases are used to give tables a temporary name in a query. UNION combines the result sets of two SELECT statements, removing duplicate rows. UNION ALL combines the result sets, including duplicate rows. Subqueries are queries nested inside another query, used to retrieve data based on a condition.
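A couple of worked examples make these easier to picture. The snippet below runs an INNER JOIN with aliases and a subquery through psycopg2; the film_actors link table and the column names are illustrative guesses, not necessarily the course's exact schema.

```python
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432,
    dbname="test_db", user="postgres", password="secret",  # placeholders
)
cur = conn.cursor()

# INNER JOIN with table aliases: films paired with the actors that appear in them
# (assumes a film_actors table linking the two).
cur.execute("""
    SELECT f.title, a.name
    FROM films AS f
    INNER JOIN film_actors AS fa ON fa.film_id = f.id
    INNER JOIN actors AS a ON a.id = fa.actor_id;
""")
print(cur.fetchall())

# Subquery: films rated above the overall average user rating.
cur.execute("""
    SELECT title
    FROM films
    WHERE user_rating > (SELECT AVG(user_rating) FROM films);
""")
print(cur.fetchall())

cur.close()
conn.close()
```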
Building a Data Pipeline [1:05:00]
Build a data pipeline from scratch using a Python script, with Docker hosting the source and destination Postgres databases. The pipeline will extract data from the source database, load it into the destination database, and then transform it there. The first step is to create a Docker Compose file to define the source and destination Postgres services, specifying the image, ports, networks, environment variables, and volumes for each service. Also create a source DB init folder with an init SQL file to initialize the source database with fake data.
Setting Up the ELT Script [1:17:01]
The ELT (Extract, Load, Transform) script is a Python script that moves data from the source to the destination database. The script uses the subprocess module to run shell commands and the time module to add delays. The script first defines a function to wait for the Postgres databases to be ready. It then defines the source and destination database configurations, including the database name, user, password, and host. The script uses the pg_dump command to create a dump file of the source database and the psql command to load the data into the destination database.
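Here's a condensed sketch of what such a script can look like. Hostnames, credentials, and the dump file name are placeholders; the structure (wait for Postgres, pg_dump the source, psql into the destination) follows the description above.

```python
import os
import subprocess
import time

def wait_for_postgres(host, max_retries=5, delay_seconds=5):
    """Poll pg_isready until the database accepts connections."""
    for _ in range(max_retries):
        result = subprocess.run(
            ["pg_isready", "-h", host], capture_output=True, text=True
        )
        if result.returncode == 0:
            return True
        time.sleep(delay_seconds)
    return False

# Placeholder configs -- hostnames are the Docker Compose service names.
source_config = {
    "dbname": "source_db", "user": "postgres",
    "password": "secret", "host": "source_postgres",
}
destination_config = {
    "dbname": "destination_db", "user": "postgres",
    "password": "secret", "host": "destination_postgres",
}

if wait_for_postgres(source_config["host"]) and wait_for_postgres(destination_config["host"]):
    # Dump the whole source database to a plain SQL file...
    subprocess.run(
        ["pg_dump", "-h", source_config["host"], "-U", source_config["user"],
         "-d", source_config["dbname"], "-f", "data_dump.sql"],
        env={**os.environ, "PGPASSWORD": source_config["password"]},
        check=True,
    )
    # ...then replay that dump into the destination database.
    subprocess.run(
        ["psql", "-h", destination_config["host"], "-U", destination_config["user"],
         "-d", destination_config["dbname"], "-f", "data_dump.sql"],
        env={**os.environ, "PGPASSWORD": destination_config["password"]},
        check=True,
    )
```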
Implementing DBT [1:31:31]
DBT (Data Build Tool) is an open-source tool used to write custom transformations or models on top of the data in the destination database. To use DBT, first install DBT core and the Postgres adapter locally using pip. Then, initialize a DBT project using the dbt init command. Configure the DBT profile in the profiles.yml file, specifying the host, port, user, password, database name, and schema for the destination database. In the dbt_project.yml file, change the materialization setting to table.
Writing DBT Models [1:38:23]
Create DBT models as SQL files in the models folder. Source the data by referencing tables in the destination database. Define the schema for the models in the schema.yml file. Create custom models to transform and modify the data. For example, create a film_ratings.sql model that combines the films and actors tables into a new table focused on film ratings. Use CTEs (Common Table Expressions) to create reusable queries within the model.
Advanced DBT Features: Macros and Jinja [1:56:18]
Macros are reusable components in DBT that allow you to avoid rewriting the same queries. Create a macro in the macros folder and then call it from your models. Jinja templating lets you add control structures to your SQL queries. For example, use a Jinja conditional to select a specific title based on a variable. This allows you to create dynamic queries that can be easily modified.
Adding a Cron Job [2:04:37]
To automate the data pipeline, add a cron job via the Dockerfile. Install cron and create a start.sh file that starts the cron daemon. Add a cron schedule in the Dockerfile to run the ELT script at a specific time. This automates the data pipeline, running it on a schedule without manual intervention.
Implementing Airflow [2:08:21]
Airflow is an orchestration tool that gives you a top-down view of every task and lets you schedule them properly. To add Airflow to the project, create an airflow folder with a dags folder and an airflow.cfg file. Add the Airflow service to the Docker Compose file, along with a Postgres service for Airflow's metadata. Create an init Airflow service to initialize the Airflow database and user. Create a Dockerfile for the Airflow web server, installing the necessary providers.
Writing an Airflow DAG [2:20:07]
Write an Airflow DAG (Directed Acyclic Graph) to orchestrate the tasks. Import the necessary modules, including datetime, timedelta, DAG, Mount, PythonOperator, and DockerOperator. Define default arguments for the DAG, such as the owner, start date, and catchup setting. Create a function to run the ELT script using the subprocess module. Define the DAG, specifying the name, default arguments, description, and start date. Create tasks using the PythonOperator and DockerOperator, specifying the task ID, Python callable, image, command, mounts, and dependencies. Define the order of operations using the >> operator.
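Putting those pieces together, a DAG along these lines captures the structure described above. Treat it as a sketch: the DAG name, script path, network name, and dbt image tag are placeholders, and the exact operator arguments vary with your Airflow and provider versions.

```python
from datetime import datetime, timedelta
import subprocess

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.docker.operators.docker import DockerOperator
from docker.types import Mount

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

def run_elt_script():
    """Run the ELT script in a subprocess and fail the task if it errors."""
    result = subprocess.run(
        ["python", "/opt/airflow/elt/elt_script.py"],  # path is a placeholder
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        raise Exception(f"ELT script failed: {result.stderr}")
    print(result.stdout)

with DAG(
    dag_id="elt_and_dbt",
    default_args=default_args,
    description="Copy data from source to destination, then run DBT models",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # trigger manually; use a cron string to schedule it
    catchup=False,
) as dag:

    elt = PythonOperator(
        task_id="run_elt_script",
        python_callable=run_elt_script,
    )

    dbt_run = DockerOperator(
        task_id="dbt_run",
        image="ghcr.io/dbt-labs/dbt-postgres:1.7.4",  # placeholder tag
        command=["run", "--profiles-dir", "/root", "--project-dir", "/dbt"],
        docker_url="unix://var/run/docker.sock",
        network_mode="elt_network",  # placeholder: your Compose network name
        mounts=[
            Mount(source="/path/to/dbt_project", target="/dbt", type="bind"),  # placeholders
            Mount(source="/path/to/.dbt", target="/root", type="bind"),
        ],
    )

    # The copy must finish before the DBT transformations run.
    elt >> dbt_run
```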
Integrating Airbyte [2:41:39]
Airbyte is an open-source data integration tool that simplifies the process of syncing data from various sources to destinations. To integrate Airbyte, clone the Airbyte repository into the project. Modify the Dockerfile to install the Airflow providers for Docker, HTTP, and Airbyte. Create start.sh and stop.sh files to manage the Docker containers. In the Airflow DAG, replace the Python operator with the Airbyte trigger sync operator.
Configure the Airbyte connection in the Airflow UI, specifying the connection type, host, login, and password. Trigger the DAG to start the Airbyte sync. This will orchestrate the data pipeline, using Airbyte to move data from the source to the destination and DBT to transform the data.
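For the swap itself, the Airbyte provider ships an AirbyteTriggerSyncOperator. The snippet below is a sketch of how that task could look inside the DAG from the previous section; the Airflow connection ID and the Airbyte connection UUID are placeholders you'd replace with your own.

```python
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator

# Goes inside the `with DAG(...)` block from the sketch above, replacing the
# PythonOperator-based extract/load task. Both IDs below are placeholders:
# "airbyte" is the Airflow connection configured in the UI, and connection_id
# is the UUID of the Airbyte source -> destination connection.
trigger_airbyte_sync = AirbyteTriggerSyncOperator(
    task_id="airbyte_postgres_sync",
    airbyte_conn_id="airbyte",
    connection_id="YOUR-AIRBYTE-CONNECTION-UUID",
    asynchronous=False,  # block until the sync finishes before moving on
    timeout=3600,
)

# Then chain it ahead of the DBT task, e.g. trigger_airbyte_sync >> dbt_run
```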