Unizin Product Documentation
Copyright © 2023, Unizin, Ltd.

Batch-ingest application


Last updated 1 year ago

The UDP's batch-ingest is an Apache Airflow application that runs in a Kubernetes cluster and orchestrates the extract, transform, and load (ETL) process for all context data pushed into a Unizin Data Platform instance.

Within a single UDP instance, the batch-ingest application will:

  • Wake up at regular time intervals

  • Fetch and verify context data for import

  • Stage data for import

  • Normalize and relate all context data in the Unizin Common Data Model

  • Update the UDP keymap

  • Update the UDP Context store

  • Create a custom backup of the keymap and context store

The batch-ingest application runs every night on a schedule. Ideally, all new context data is pushed into the UDP instance prior to the scheduled import.

Apache Airflow

This section assumes that you are familiar with the basic concepts of Apache Airflow, an open-source platform for programmatically authoring, scheduling, and monitoring workflows. It is widely used for ETL pipeline orchestration, which is how it is used in the Unizin Data Platform.

The ingest DAG

The UDP’s batch-ingest process is organized by a single Directed Acyclic Graph (DAG) named “ingest.”

The “ingest” DAG is composed of hundreds of individual sub-DAGs and tasks that collectively execute the overall ETL process. In any given import cycle, sub-DAGs and tasks that do not depend on one another run in parallel and independently. This lets the import process use shared computing resources efficiently and fail gracefully: if any part of the ETL process fails, only it and its downstream dependencies are affected. All other parallel processing in the ingest DAG continues and completes unaffected.
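The failure behavior described above can be illustrated with a small, self-contained sketch. It does not use Airflow itself; it only models a dependency graph and computes which tasks a failure would affect. The task names and the graph are hypothetical, since the real task ids are generated by the UDP.

```python
from collections import deque

# Hypothetical task names and dependencies (task -> tasks directly downstream).
# The real ingest DAG is generated by the UDP and is far larger.
DOWNSTREAM = {
    "populate_person": ["keymap_person"],
    "keymap_person": ["entity_person"],
    "entity_person": [],
    "populate_course": ["keymap_course"],
    "keymap_course": ["entity_course"],
    "entity_course": [],
}


def affected_by_failure(failed_task, downstream):
    """Return the failed task plus every task reachable downstream of it."""
    affected = {failed_task}
    queue = deque([failed_task])
    while queue:
        task = queue.popleft()
        for child in downstream.get(task, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


affected = affected_by_failure("keymap_person", DOWNSTREAM)
unaffected = set(DOWNSTREAM) - affected
print("affected:", sorted(affected))
print("unaffected:", sorted(unaffected))
```

In this toy graph, a failure in the hypothetical keymap_person task touches only that task and entity_person; every task for the course entity is free to complete.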

The code executed in the ingest DAG, sub-DAGs, and tasks is automatically generated and configured by the UDP. This code includes, for example, the automated verification of manifest files, fetching of LMS data (via available APIs), keymap maintenance, and other tasks that ensure consistency and quality in the context data.

Parallelism and interdependence

As noted above, the ingest DAG is composed of hundreds of interdependent sub-DAGs and tasks that create a highly parallelized, entity-based import process. The structure and order of these tasks are preserved in Apache Airflow's various DAG views, such as the graph view below.

Because the ETL process is organized on an entity-by-entity basis, it is possible for the data of any given entity to complete through to the “publish” step before other entities have been completed. Consequently, one might think of the overall "ingest" DAG as the orchestrator for independent, entity-based import processes with loose couplings and interdependence.

It is often useful to look at the history of DAG runs (i.e., daily imports of the context data) for a particular batch-ingest application and UDP instance. In the example below, we see a DAG run currently in progress alongside 4 weeks of DAG runs. The imports succeeded on all days but one and, on the day when the import failed, only a small subset of the overall import process failed.

Phases of the ETL

As noted above, the context data ETL stages unfold on an entity-by-entity basis. The import process for all entities follows the same phases. These phases are reflected in the prefixes of the names for sub-DAGs and tasks in the batch-ingest DAG.

The following phases are common to the ingestion of all entities:

  1. Populate. Entity data from validated context datasets are imported into Postgres, cleaned, and prepared for further transformation.

  2. Keymap. Once the context data for a particular entity has been populated from all source systems, the entity's UDP keymap is updated.

  3. Entity. After the keymap phase is complete, a unified presentation of an entity's data and relationships is generated.

  4. Backup. An entity's keymap and descriptive data are backed up to a per-entity CSV file in Cloud Storage.

  5. Publish. The keymap and entity database schemas are replicated in the "context_store" database.
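The phases above can be sketched as a per-entity chain of tasks. The following is a minimal illustration using only the Python standard library, not the UDP's generated Airflow code. It assumes a hypothetical task naming scheme of "<phase>_<entity>", hypothetical entity names, and a strictly linear order of phases within an entity; in the real DAG, an entity may have several tasks per phase.

```python
from graphlib import TopologicalSorter

# Phase names taken from the list above, in the order given there.
PHASES = ["populate", "keymap", "entity", "backup", "publish"]


def entity_chain(entity):
    """Map each task id for one entity to the set of task ids it depends on."""
    # Hypothetical naming scheme: "<phase>_<entity>".
    task_ids = [f"{phase}_{entity}" for phase in PHASES]
    graph = {task_ids[0]: set()}
    for upstream, task in zip(task_ids, task_ids[1:]):
        graph[task] = {upstream}
    return graph


# Hypothetical entity names, used only for illustration.
graph = {}
for entity in ("person", "course_section"):
    graph.update(entity_chain(entity))

# One valid execution order; tasks of different entities share no dependencies,
# so a scheduler is free to interleave them or run them in parallel.
order = list(TopologicalSorter(graph).static_order())
print(order)
```

Because no task of one entity depends on a task of another in this sketch, either entity can reach its publish step first, which matches the entity-by-entity behavior described earlier.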

DAG configuration

Every time the “ingest” DAG is run, its first sub-DAGs are designed to set environment variables that are used throughout the rest of the ETL process. In particular, dates are set for the day on which the “ingest” DAG is executed. These dates are then used to identify new UDP loading schema datasets (since each is located in a date-based folder in a Cloud Storage bucket).

You can examine which dates were set by the ingest DAG in the “Variables” section of the batch-ingest application. Under the “Admin” menu, click the “Variables” option.

You’ll be presented with the environment variables for the batch-ingest app.

Notice that the variables are key-value pairs. In the example above, a UDP instance is expecting UDP loading schema data from the SIS and the LMS. Accordingly, it sets two dates for each DAG run: one for the SIS dataset and another for the LMS dataset. Notice the pattern of the key names:

ingest.<dag-run-date>.<sis/lms>_date

In this pattern, <dag-run-date> refers to the date of the DAG’s run or execution. By contrast, the value of the <sis/lms>_date variable is the date of the latest SIS or LMS data that was determined to be available.

Note that the values for these “date” keys may not be exactly a date. In the case of the LMS data (where Canvas Data is ingested), for example, the value of the date key is a date plus a unique identifier for the Canvas Data dump.

The values of the date keys are used to create a path in the relevant Cloud Storage bucket where a UDP loading schema dataset is located.
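As a rough sketch of how these keys and values fit together, the snippet below builds variable keys following the pattern above and turns a date value into a storage path. Only the key pattern comes from this page. The date format, the example values, the bucket name, and the folder layout are all assumptions made for illustration.

```python
def variable_key(dag_run_date, source):
    """Build a key following the pattern ingest.<dag-run-date>.<sis/lms>_date."""
    if source not in ("sis", "lms"):
        raise ValueError(f"unexpected source: {source!r}")
    return f"ingest.{dag_run_date}.{source}_date"


def dataset_path(bucket, source, date_value):
    """Build a dataset path. The bucket layout here is hypothetical."""
    return f"gs://{bucket}/{source}/{date_value}/"


# Hypothetical values: an ISO date for the SIS, and a date plus a made-up
# dump identifier for the LMS (the real identifier format is not shown here).
variables = {
    variable_key("2023-01-15", "sis"): "2023-01-14",
    variable_key("2023-01-15", "lms"): "2023-01-14_dump-0001",
}

for key, value in variables.items():
    source = key.rsplit(".", 1)[1].removesuffix("_date")
    print(key, "->", dataset_path("example-bucket", source, value))
```

Keeping the key builder separate from the path builder mirrors the text: the key identifies a DAG run and a source system, while the value alone determines where the dataset is read from.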

A screenshot of the Apache Airflow UI showing the "ingest" DAG during a single run.
The Apache Airflow graph view of the UDP ingest DAG.