Batch-ingest application

The UDP's batch-ingest is an Apache Airflow application that runs in a Kubernetes cluster and orchestrates the extract, transform, and load (ETL) process for all context data pushed into a Unizin Data Platform instance.

Within a single UDP instance, the batch-ingest application will:

Wake up at regular time intervals
Fetch and verify context data for import
Stage data for import
Normalize and relate all context data in the Unizin Common Data Model
Update the UDP keymap
Update the UDP Context store
Create a custom backup of the keymap and context store

The batch-ingest application runs every night on a schedule. Ideally, all new context data is pushed into the UDP instance prior to the scheduled import.

Apache Airflow

This section assumes that you are familiar with the basic concepts of Apache Airflow, an open-source platform for programmatically authoring, scheduling, and monitoring workflows. Airflow is widely used to orchestrate ETL pipelines, which is how the Unizin Data Platform uses it.

The ingest DAG

The UDP’s batch-ingest process is organized by a single Directed Acyclic Graph (DAG) named “ingest.”

The “ingest” DAG is composed of hundreds of individual, interdependent sub-DAGs and tasks that collectively execute the overall ETL process. In any given import cycle, sub-DAGs and tasks that do not depend on one another run in parallel. This enables the import process to use shared computing resources efficiently and to fail gracefully: if any part of the ETL process fails, only that part and its downstream dependencies are affected. All other parallel processing in the ingest DAG continues and completes unaffected.
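The failure behavior described above can be sketched in plain Python. This is only an illustration, not the UDP's generated code: the task names and the shape of the graph are hypothetical, and the function simply walks the dependency graph to find which tasks a single failure would affect.

```python
from collections import deque

# Hypothetical task graph (names are illustrative, not the UDP's real task IDs).
# Each key maps a task to the tasks that directly depend on it.
DOWNSTREAM = {
    "populate_person": ["keymap_person"],
    "keymap_person": ["entity_person"],
    "entity_person": ["backup_person"],
    "backup_person": ["publish_person"],
    "publish_person": [],
    "populate_course": ["keymap_course"],
    "keymap_course": ["entity_course"],
    "entity_course": [],
}


def affected_by_failure(failed_task, downstream=DOWNSTREAM):
    """Return the failed task plus every task downstream of it."""
    affected = {failed_task}
    queue = deque([failed_task])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


def unaffected_by_failure(failed_task, downstream=DOWNSTREAM):
    """Return the tasks that can still complete after the given failure."""
    return set(downstream) - affected_by_failure(failed_task, downstream)
```

In this sketch, a failure in the hypothetical "keymap_person" task affects only the later tasks for that entity. The already completed "populate_person" task and every "course" task are left untouched.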

The code executed in the ingest DAG, its sub-DAGs, and its tasks is automatically generated and configured by the UDP. This code includes, for example, the automated verification of manifest files, the fetching of LMS data (via available APIs), keymap maintenance, and other tasks that ensure consistency and quality in the context data.

Parallelism and interdependence

As noted above, the ingest DAG is composed of hundreds of interdependent sub-DAGs and tasks that create a highly parallelized, entity-based import process. The structure and order of these tasks are shown in Apache Airflow's various DAG views, such as the graph view.

Because the ETL process is organized on an entity-by-entity basis, the data for any given entity can progress through to the “publish” step before other entities have finished. Consequently, one might think of the overall “ingest” DAG as the orchestrator for independent, entity-based import processes that are only loosely coupled to one another.

It is often useful to look at the history of DAG runs (i.e., daily imports of the context data) for a particular batch-ingest application and UDP instance. In the example below, we see a DAG run currently in progress alongside 4 weeks of DAG runs. The imports succeeded on all days but one and, on the day the import failed, only a small subset of the overall import process failed.

Phases of the ETL

As noted above, the context data ETL stages unfold on an entity-by-entity basis. The import process for all entities follows the same phases. These phases are reflected in the prefixes of the names for sub-DAGs and tasks in the batch-ingest DAG.

The following phases are common to the ingestion of all entities:

Populate. Entity data from validated context datasets are imported into Postgres, cleaned, and prepared for further transformation.
Keymap. Once the context data for a particular entity has been populated from all systems, the entity's UDP keymap is updated.
Entity. After the keymap phase is complete, a unified presentation of an entity's data and relationships is generated.
Backup. An entity's keymap and descriptive data are backed up to a per-entity CSV file in Cloud Storage.
Publish. The keymap and entity database schemas are replicated in the "context_store" database.
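As a rough sketch of how these phases could map onto prefixed task names, the Python below builds one task chain per entity and computes a valid execution order. It rests on assumptions: the `<phase>_<entity>` naming scheme, the entity names, and the strictly linear order of the five phases are all hypothetical, and the real ingest DAG is generated by the UDP and may be wired differently.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Phase names taken from the list above; their strict linear order is an assumption.
PHASES = ["populate", "keymap", "entity", "backup", "publish"]


def entity_pipeline(entity):
    """Map each phase task of one entity to the set of tasks it waits for."""
    # Hypothetical naming scheme: "<phase>_<entity>".
    names = [f"{phase}_{entity}" for phase in PHASES]
    graph = {names[0]: set()}
    for upstream, task in zip(names, names[1:]):
        graph[task] = {upstream}
    return graph


def run_order(entities):
    """Return one valid execution order across all entity pipelines."""
    graph = {}
    for entity in entities:
        graph.update(entity_pipeline(entity))
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    # Entity names here are illustrative only.
    for task in run_order(["person", "course"]):
        print(task)
```

Because the pipelines for different entities share no edges in this sketch, a scheduler is free to interleave them or run them in parallel. Only the order of phases within each entity is constrained.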

DAG configuration

Every time the “ingest” DAG is run, its first sub-DAGs are designed to set environment variables that are used throughout the rest of the ETL process. In particular, dates are set for the day on which the “ingest” DAG is executed. These dates are then used to identify new UDP loading schema datasets (since each is located in a date-based folder in a Cloud Storage bucket).

You can examine which dates were set by the ingest DAG in the “Variables” section of the batch-ingest application. Under the “Admin” menu, click the “Variables” option.

You’ll be presented with the environment variables for the batch-ingest app.

Notice that the variables are key-value pairs. In the example above, a UDP instance is expecting UDP loading schema data from the SIS and the LMS. Accordingly, it sets two dates for each DAG run: one for the SIS dataset and another for the LMS dataset. Notice the pattern of the keys:

ingest.<dag-run-date>.<sis/lms>_date

In this pattern, the dag-run-date refers to the date of the DAG’s run or execution. By contrast, the value of the <sis/lms>_date variable is the date that was determined to be that of the latest available SIS or LMS data.
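The key pattern can be illustrated with a small Python helper that builds and parses such keys. This is a sketch under one assumption: the run date is written in ISO format (YYYY-MM-DD), which the pattern above does not confirm. The function names are hypothetical.

```python
from datetime import date

SOURCES = ("sis", "lms")


def variable_key(dag_run_date, source):
    """Build a key of the form ingest.<dag-run-date>.<sis/lms>_date."""
    if source not in SOURCES:
        raise ValueError(f"unknown source: {source!r}")
    # Assumption: the run date appears in ISO format (YYYY-MM-DD).
    return f"ingest.{dag_run_date.isoformat()}.{source}_date"


def parse_variable_key(key):
    """Split a key back into its DAG run date and source system."""
    parts = key.split(".")
    if len(parts) != 3 or parts[0] != "ingest" or not parts[2].endswith("_date"):
        raise ValueError(f"unexpected key format: {key!r}")
    source = parts[2][: -len("_date")]
    return date.fromisoformat(parts[1]), source
```

For example, `variable_key(date(2024, 1, 15), "sis")` yields `ingest.2024-01-15.sis_date` under the ISO-date assumption.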

Note that the values for these “date” keys may not be exactly a date. In the case of the LMS data (where Canvas Data is ingested), for example, the value of the date key is a date plus a unique identifier for the Canvas Data dump.

The values of the date keys are used to create a path in the relevant Cloud Storage bucket where a UDP loading schema dataset is located.
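The lookup from variable to storage path might look like the following sketch. The bucket name, the `gs://<bucket>/<source>/<value>/` folder layout, and the example values are all hypothetical; the actual layout of a UDP instance's bucket is not specified here. The variable values are treated as opaque strings because, as noted above, they may carry more than a date.

```python
def dataset_paths(variables, dag_run_date, bucket):
    """Resolve the dataset folder for each source from the ingest variables.

    `variables` is a plain dict of key-value pairs like those shown in the
    Airflow "Variables" view; `dag_run_date` is the date string used in the keys.
    """
    paths = {}
    for source in ("sis", "lms"):
        key = f"ingest.{dag_run_date}.{source}_date"
        value = variables.get(key)
        if value is None:
            continue  # no dataset was identified for this source on this run
        # Hypothetical layout: one folder per source, then one per date value.
        paths[source] = f"gs://{bucket}/{source}/{value}/"
    return paths


if __name__ == "__main__":
    # Illustrative values only; real keys and values come from the Airflow UI.
    example = {
        "ingest.2024-01-15.sis_date": "2024-01-14",
        "ingest.2024-01-15.lms_date": "2024-01-14_abc123",
    }
    print(dataset_paths(example, "2024-01-15", "example-udp-bucket"))
```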

