The UDP's batch-ingest is an Apache Airflow application that runs in a Kubernetes cluster and orchestrates the extract, transform, and load (ETL) process for all context data pushed into a Unizin Data Platform instance.
Within a single UDP instance, the batch-ingest application will:
- Wake up at regular time intervals
- Fetch and verify context data for import
- Stage data for import
- Normalize and relate all context data in the Unizin Common Data Model
- Update the UDP keymap
- Update the UDP Context store
- Create a custom backup of the keymap and context store
The batch-ingest application runs every night on a schedule. Ideally, all new context data is pushed into the UDP instance prior to the scheduled import.
This section assumes that you are familiar with the basic concepts of Apache Airflow, an open-source platform for programmatically authoring, scheduling, and monitoring workflows. Airflow is widely used to orchestrate ETL pipelines, which is its role in the Unizin Data Platform.
The ingest DAG
The UDP’s batch-ingest process is organized by a single Directed Acyclic Graph (DAG) named “ingest.”
The “ingest” DAG is composed of hundreds of individual, interdependent sub-DAGs and tasks that collectively execute the overall ETL process. In any given import cycle, sub-DAGs and tasks that do not depend on one another run in parallel. This lets the import process use shared computing resources efficiently and fail gracefully: if any part of the ETL process fails, only it and its downstream dependencies are affected. All other parallel processing in the ingest DAG continues and completes unaffected.
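The failure behavior described above can be illustrated with a small, self-contained sketch. It is not the UDP's generated code: the task names and the miniature dependency graph are hypothetical, and the function only computes which tasks a single failure would reach by walking downstream dependencies.

```python
from collections import deque


def affected_by_failure(downstream, failed_task):
    """Return the failed task plus every task downstream of it."""
    affected = {failed_task}
    queue = deque([failed_task])
    while queue:
        task = queue.popleft()
        for child in downstream.get(task, ()):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


# Hypothetical miniature graph: two entities, each with its own task chain.
downstream = {
    "populate_person": ["keymap_person"],
    "keymap_person": ["entity_person"],
    "entity_person": [],
    "populate_course": ["keymap_course"],
    "keymap_course": ["entity_course"],
    "entity_course": [],
}

# A failure in one entity's chain leaves the other entity's chain untouched.
blocked = affected_by_failure(downstream, "keymap_person")
unaffected = sorted(set(downstream) - blocked)
print(sorted(blocked))
print(unaffected)
```

In this toy graph, a failure in the "person" keymap task blocks only the later "person" task, while every "course" task can still run to completion.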
The code executed in the ingest DAG, sub-DAGs, and tasks is automatically generated and configured by the UDP. This code includes, for example, the automated verification of manifest files, fetching of LMS data (via available APIs), keymap maintenance, and other tasks that ensure consistency and quality in the context data.
A screenshot of the Apache Airflow UI showing the "ingest" DAG during a single run.
Parallelism and interdependence
As noted above, the ingest DAG is composed of hundreds of interdependent sub-DAGs and tasks that create a highly parallelized, entity-based import process. The structure and order of these tasks are preserved in Apache Airflow's various DAG views, such as the graph view below.
Because the ETL process is organized on an entity-by-entity basis, it is possible for the data of any given entity to complete through to the “publish” step before other entities have been completed. Consequently, one might think of the overall "ingest" DAG as the orchestrator for independent, entity-based import processes that are only loosely coupled to one another.
The Apache Airflow graph view of the UDP ingest DAG.
It is often useful to look at the history of DAG runs (i.e., daily imports of the context data) for a particular batch-ingest application and UDP instance. In the example below, we see a DAG run currently in progress alongside 4 weeks of DAG runs. The imports succeeded on all days but one and, on the day when the import failed, only a small subset of the overall import process failed.
Phases of the ETL
As noted above, the context data ETL stages unfold on an entity-by-entity basis. The import process for all entities follows the same phases. These phases are reflected in the prefixes of the names for sub-DAGs and tasks in the batch-ingest DAG.
The following phases are common to the ingestion of all entities:
- Populate. Entity data from validated context datasets are imported into Postgres, cleaned, and prepared for further transformation.
- Keymap. Once the context data for a particular entity has been populated from all systems, its UDP keymap is updated.
- Entity. After the keymap phase is complete, a unified presentation of an entity's data and relationships is generated.
- Backup. An entity's keymap and descriptive data are backed up to a per-entity CSV file in Cloud Storage.
- Publish. The keymap and entity database schemas are replicated in the "context_store" database.
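One way to picture how these phases chain together for a single entity is the sketch below. It rests on several assumptions: the task names are hypothetical (the real prefixes and names are generated by the UDP), the source systems are taken to be an SIS and an LMS, and the backup and publish phases are assumed to run in the order listed above. It uses Python's standard-library graphlib module (Python 3.9+) rather than Airflow itself.

```python
from graphlib import TopologicalSorter


def entity_task_graph(entity, systems=("sis", "lms")):
    """Map each hypothetical phase task of one entity to the tasks it waits on."""
    # One populate task per source system; none of them has prerequisites.
    populate = {f"populate_{system}_{entity}" for system in systems}
    graph = {task: set() for task in populate}

    # The keymap phase waits until data from all systems has been populated.
    previous = f"keymap_{entity}"
    graph[previous] = populate

    # Assumed linear order for the remaining phases.
    for phase in ("entity", "backup", "publish"):
        current = f"{phase}_{entity}"
        graph[current] = {previous}
        previous = current
    return graph


graph = entity_task_graph("person")
order = list(TopologicalSorter(graph).static_order())
print(order)
```

Because each entity gets its own graph of this shape, the graphs for different entities share no edges in this sketch, which is what allows one entity to reach its publish task while another is still being populated.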
Every time the “ingest” DAG is run, its first sub-DAGs are designed to set environment variables that are used throughout the rest of the ETL process. In particular, dates are set for the day on which the “ingest” DAG is executed. These dates are then used to identify new UDP loading schema datasets (since each is located in a date-based folder in a Cloud Storage bucket).
You can examine which dates were set by the ingest DAG in the “Variables” section of the batch-ingest application. Under the “Admin” menu, click the “Variables” option.
You’ll be presented with the environment variables for the batch-ingest app.
Notice that the variables are key-value pairs. In the example above, a UDP instance is expecting UDP loading schema data from the SIS and the LMS. Accordingly, it sets two dates for each DAG run – one for the SIS dataset and another for the LMS dataset. Notice the pattern of the key values:
In this pattern, dag-run-date refers to the date of the DAG’s run or execution. By contrast, the <sis/lms>_date variable refers to the date of the latest available SIS or LMS data.
Note that the values for these “date” keys are not always plain dates. In the case of the LMS data (where Canvas Data is ingested), for example, the value of the date key is a date plus a unique identifier for the Canvas Data dump.
The values of the date keys are used to create a path in the relevant Cloud Storage bucket where a UDP loading schema dataset is located.
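As a rough illustration of that last step, the sketch below turns a date key value into a storage prefix. The bucket name, the folder layout (bucket, then system, then date value), and the sample values are all assumptions made for the example; the actual layout of a UDP instance's bucket may differ. The value is treated as an opaque string because, as noted above, it may carry more than a date.

```python
def dataset_prefix(bucket: str, system: str, date_value: str) -> str:
    """Build a hypothetical Cloud Storage prefix for one loading schema dataset.

    Assumed layout (not confirmed by the UDP): gs://<bucket>/<system>/<date value>/
    """
    if "/" in date_value:
        # Keep the value a single path segment in this sketch.
        raise ValueError(f"unexpected '/' in date value: {date_value!r}")
    return f"gs://{bucket}/{system}/{date_value}/"


# Hypothetical values: a plain date for the SIS, and a date plus an
# identifier for the LMS (the real identifier format is not shown here).
sis_prefix = dataset_prefix("example-udp-bucket", "sis", "2021-03-01")
lms_prefix = dataset_prefix("example-udp-bucket", "lms", "2021-03-01_example-dump-id")
print(sis_prefix)
print(lms_prefix)
```

Passing the value through unchanged, rather than parsing it as a date, keeps the sketch valid for both the SIS and LMS cases.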