Batch-ingest db server

Beyond a Kubernetes cluster, the batch-ingest application depends on a Google CloudSQL instance to perform a variety of data transformations.

The Google CloudSQL instance, also called "batch-ingest," is created during the UDP installation process.

The "batch-ingest" database server contains two databases relevant to the batch-ingest process: (1) the "ingest" database where data transformations are conducted by batch-ingest; (2) the "context_store" database that serves as the UDP Context store and where imported data is published for use.

The "batch-ingest" database server

At the heart of the Context data pipeline is a Google CloudSQL instance called "batch-ingest." The "batch-ingest" database server hosts two databases:

"ingest," which is used by the batch-ingest Airflow Application to transform context data
"context_store," where the UDP Context store is presented

The "batch-ingest" CloudSQL instance runs PostgreSQL, a relational database server. A relational database suits the UDP Context store for two reasons: the Unizin Common Data Model is a relational model, and the ETL process enforces and preserves the referential integrity of context data.

The "batch-ingest" database server is created automatically during the UDP installation process, as are the "ingest" and "context_store" databases.

User management

Academic institutions must create users and passwords for the individuals and systems that they wish to grant access to the databases on the "batch-ingest" database server. We strongly recommend that these users be read-only users with limited privileges.
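As an illustration of what a read-only user might look like in PostgreSQL, the following minimal sketch builds the GRANT statements for an existing role. The role name "report_reader" and the schema list are assumptions chosen for the example, not names defined by the UDP. Creating the role and setting its password is left to your usual tooling.

```python
import re

# Accept only plain lowercase identifiers so that names can be safely
# embedded in the generated SQL text.
_IDENTIFIER = re.compile(r"[a-z_][a-z0-9_]*")


def _ident(name: str) -> str:
    """Return a double-quoted SQL identifier, rejecting anything unusual."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unexpected identifier: {name!r}")
    return f'"{name}"'


def read_only_grants(role: str, database: str, schemas: list) -> list:
    """Build GRANT statements that give an existing role read-only access."""
    statements = [
        f"GRANT CONNECT ON DATABASE {_ident(database)} TO {_ident(role)};"
    ]
    for schema in schemas:
        # USAGE lets the role see objects in the schema; SELECT lets it read them.
        statements.append(
            f"GRANT USAGE ON SCHEMA {_ident(schema)} TO {_ident(role)};"
        )
        statements.append(
            f"GRANT SELECT ON ALL TABLES IN SCHEMA {_ident(schema)} TO {_ident(role)};"
        )
    return statements


if __name__ == "__main__":
    # "report_reader" and the schema names are hypothetical example values.
    for stmt in read_only_grants("report_reader", "context_store", ["entity", "keymap"]):
        print(stmt)
```

The sketch only generates SQL text. A database administrator would review the statements and run them with an administrative account. Note that GRANT SELECT ON ALL TABLES covers only the tables that exist when it is run.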

Ingest database

The “ingest” database is used throughout the ETL process to perform the majority of data transformations. It serves as the workhorse for data importing, keymapping, and other important functions that ensure data is successfully normalized and consolidated.

The ingest database’s design reflects the overall design of the context data ETL process. In most cases, each stage of the ETL process uses one or more database schemas (namespaces) whose table definitions support the work of that stage. The three major phases of the ETL process are:

Populate, during which UDP loading schema data is imported and normalized to the UCDM
Keymap, during which the UDP ID relationships to native IDs are updated
Publish, during which the UDP Context store is produced and a backup created

Each UDP loading schema supported by a UDP instance is represented by a distinct schema in the ingest database.

For example, in the Populate phase the SIS loading schema is imported into the "sis" schema in the ingest database. Likewise, the "lms," "tii," and "tophat" database schemas are used to import and normalize context data from the LMS, TurnItIn, and TopHat applications. This logic extends to all applications that support a context data integration.

During the Keymap phase of the ETL, two schemas are used to associate records about the same entity, drawn from various loading schemas, through new surrogate identifiers. The "keymap" schema maintains the actual UDP Keymap (extra material) for the UDP instance. In this schema, each table represents a single entity's UDP keymap. The "ingest_system" schema is used by the UDP’s batch-ingest application to orchestrate some of the keymapping logistics of the overall ETL process.

In the Publish phase of the ETL, consolidated data is presented in the UDP Context store. The "entity" schema fully represents context data that is coalesced from multiple loading schemas around a single surrogate identifier (called the UDP identifier).
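The relationship between ETL phases and ingest database schemas described above can be summarized in a small lookup table. The sketch below is only a reading aid: the dictionary and the helper function are hypothetical and not part of the batch-ingest application, and a real UDP instance may have more loading schemas than the four named in the text.

```python
from typing import Optional

# Schemas named in this section, grouped by the ETL phase that uses them.
PHASE_SCHEMAS = {
    "populate": ["sis", "lms", "tii", "tophat"],  # one schema per loading schema
    "keymap": ["keymap", "ingest_system"],
    "publish": ["entity"],
}


def phase_for_schema(schema: str) -> Optional[str]:
    """Return the ETL phase that uses the given schema, or None if unknown."""
    for phase, schemas in PHASE_SCHEMAS.items():
        if schema in schemas:
            return phase
    return None


if __name__ == "__main__":
    for name in ("sis", "keymap", "entity"):
        print(name, "is used in the", phase_for_schema(name), "phase")
```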

Context store database

The “context_store” database is used to present data after the data transformation phases of the context ingestion pipeline are complete. In the “context_store” database, users can connect and query data in the UCDM’s relational schema. Users may also interact with their context store’s keymap schema to understand the relationships between the surrogate keys that the UDP generates and native system keys from the source data.
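To make the relationship between surrogate keys and native system keys concrete, here is a toy sketch that uses an in-memory SQLite database as a stand-in for PostgreSQL. Everything in it is an assumption made for illustration: the table layout, the column names (udp_id, source, native_id), and the sample rows are hypothetical and do not reflect the actual keymap or UCDM table definitions.

```python
import sqlite3


def build_demo() -> sqlite3.Connection:
    """Create a toy database with hypothetical 'keymap' and 'entity' schemas."""
    conn = sqlite3.connect(":memory:")
    # Attached in-memory databases stand in for PostgreSQL schemas.
    conn.execute("ATTACH DATABASE ':memory:' AS keymap")
    conn.execute("ATTACH DATABASE ':memory:' AS entity")
    conn.execute(
        "CREATE TABLE keymap.person (udp_id INTEGER, source TEXT, native_id TEXT)"
    )
    conn.execute("CREATE TABLE entity.person (udp_id INTEGER PRIMARY KEY, name TEXT)")
    # Invented sample data: person 1 is known to both the SIS and the LMS.
    conn.executemany(
        "INSERT INTO keymap.person VALUES (?, ?, ?)",
        [(1, "sis", "S-1001"), (1, "lms", "L-77"), (2, "sis", "S-1002")],
    )
    conn.executemany(
        "INSERT INTO entity.person VALUES (?, ?)", [(1, "Ada"), (2, "Grace")]
    )
    return conn


def native_ids(conn: sqlite3.Connection, udp_id: int) -> dict:
    """Map each source system to the native ID recorded for one surrogate key."""
    rows = conn.execute(
        "SELECT source, native_id FROM keymap.person "
        "WHERE udp_id = ? ORDER BY source",
        (udp_id,),
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    demo = build_demo()
    print(native_ids(demo, 1))
```

In this toy layout, one surrogate key maps to several native keys, one per source system. Consult the keymap schema in your own context store for the real table and column names.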

For more information on the "context_store" database, please see our overview of the UDP Context store.

