# Data marts

## Overview <a href="#datamarts-overview" id="datamarts-overview"></a>

Unizin has developed new data mart construction and delivery infrastructure that aligns more closely with the rest of the batch ingest process. Previously, we built the marts in a Unizin internal GCP project based on SQL queries scheduled directly in BigQuery for refresh. The marts were then “passed through” into each tenant’s prod GCP project in *mart-prefixed* datasets containing materialized views. Image 1 below shows the previous delivery of the marts:

<div align="left"><figure><img src="https://3709019308-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKYwtHNGgdPXS3PWAlZUr%2Fuploads%2Fgit-blob-05d110dd15edff45c4fff3f33fae257d4bbc10b2%2F85459108.png?alt=media&#x26;token=6ddcd6c5-c630-478c-a476-b542021b5c25" alt=""><figcaption><p><em>Image 1: Previous Mart Presentation in BigQuery</em></p></figcaption></figure></div>

### Motivation for Change <a href="#datamarts-motivationforchange" id="datamarts-motivationforchange"></a>

Unizin’s data mart development began in 2021 as “proofs of concept” assets aiding use-case-driven analytics essential and shared across the consortium. Organically, the data marts have elevated into a first-party Unizin service core to the UDP and membership value.

The data marts have reached a point where convergence with the rest of Unizin’s cloud-based infrastructure and processes will enable more performant, scalable, and reliable data mart development moving forward.

Building the marts requires a series of dependencies and intermediary queries that we previously kept “behind the scenes” in favor of cleanliness in the prod GCP projects. However, Unizin has received feedback that seeing the “helper tables” has value on their own as well! The new infrastructure for building the marts will expose all dependencies and intermediary queries alongside the final mart presentations.

## Refresh Frequency <a href="#datamarts-refreshfrequency" id="datamarts-refreshfrequency"></a>

The refresh frequency of the marts and helper tables depends mainly on the source UDP data that fuels the tables in scope.

* **Dependency on context\_store data** - refresh frequency is daily; the current context\_store data update daily, so there is no need to refresh these tables more often than daily.
* **Dependency on event\_store data** - refresh frequency is hourly (usually); the event\_store data refresh almost realtime. We approximate “realtime” with hourly refreshes to strike a balance with refresh scanning costs in BigQuery. Nearing realtime requires more frequent scans of data which becomes more costly. Hourly refreshes is our current optimization of data freshness against refresh costs.
* **Mixed dependencies** - these tables often refresh on the most restrictive cadence. If a dependency refreshes hourly, these tables will refresh hourly too.
* **UDP Distributions** - these tables refresh on a monthly cadence.&#x20;

Each table in BigQuery has a *Details* tab that can be consulted for questions about refresh frequency; the field in scope is *Last Modified*:

<div align="left"><figure><img src="https://3709019308-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKYwtHNGgdPXS3PWAlZUr%2Fuploads%2Fgit-blob-49716d9f840f6d4d0c40b1bd0666d14088586654%2F85459109.png?alt=media&#x26;token=d620f808-6bd8-4aed-8712-478e1f7d82ca" alt=""><figcaption><p><em>Image 2: Helpful Table Timestamp Information in BigQuery</em></p></figcaption></figure></div>

## BigQuery Datasets <a href="#datamarts-bigquerydatasets" id="datamarts-bigquerydatasets"></a>

The delivery of marts for user consumption live in datasets prefixed with *mart*. The newly surfaced dependency tables all live together in *mart\_helper*. The distribution marts live in the *udp\_distributions* dataset.

The final, use-case driven marts live in the other *mart* datasets:

* mart\_general
* mart\_course\_section
* mart\_course\_offering
* mart\_taskforce
* mart\_student\_activity\_score

These *mart\_\** datasets are unchanged from the previous mart infrastructure to the new one. The table schemas, field names, datatypes, and everything else are unchanged, so existing scripts and applications do not break. The only technical difference in these datasets is that the assets inside the datasets are now native tables instead of materialized views. The previous infrastructure favored surfacing materialized views since the computation and construction all occurred in an internal Unizin GCP project. However, this is no longer a requirement, so native BQ tables are preferred for mart presentation.

### mart\_helper Deep Dive <a href="#datamarts-mart_helperdeepdive" id="datamarts-mart_helperdeepdive"></a>

The newly visible dataset in prod BQ environments is called *mart\_helper*. The tables in here are all the accumulated dependencies that build the final tables in the other *mart\_* datasets. Since all the dependencies are compressed into one hierarchical level, a consistent naming convention is enforced on each asset in the *mart\_helper* dataset. Double underscore characters are used as delimiters.

#### First Substring: context, event, or utility <a href="#datamarts-firstsubstring-context-event-orutility" id="datamarts-firstsubstring-context-event-orutility"></a>

The first part of a *mart\_helper* table name is either “context,” “event,” or “utility.” These strings dictate the base data source on which the helper table depends:

* **Context -** the data source is the UDP context\_store, governed by the UCDM
* **Event** - the data source is the UDP event\_store, governed by the Caliper standard
* **Utility -** the data source is static mappings for tool names and such. This is the only part of the *udp\_marts* that is manually compiled and maintained. However, these data will rarely (if ever) update.

#### Second Substring: Scope <a href="#datamarts-secondsubstring-scope" id="datamarts-secondsubstring-scope"></a>

The middle substring can be vague by looking at the tables in BQ, but its original purpose is as an organizational pointer to where SQL code lives in the *udp\_marts* code repository. Values for pivot scope could include “base,” “course\_offering,” “course\_section”, “taskforce,” “general,” or "student\_activity\_score." Any table with “context\_\_base” will have its SQL definition in the *context/base/* directory in the Gitlab repository.

Analytically, there is a loose interpretation of this substring, but we don’t recommend focusing on these reasons too strictly:

* **Base -** these tables are the first layer on top of raw UCDM data
* **Course\_offering -** a useful pivot in reporting for these tables may be per-course offering.
* **Course\_section** - same interpretation as course\_offering but at the more granular section level.
* **Taskforce** - these assets have this specific string to denote work directly related to the learning analytics taskforces in 2022.
* **General -** No direct pivot is specified or favored; appropriate usage can include looking across courses, terms, cohorts, etc
* **Student\_activity\_score** - these assets have this specific string to denote work related to the student activity score metric.

#### Third Substring: Asset Name <a href="#datamarts-thirdsubstring-assetname" id="datamarts-thirdsubstring-assetname"></a>

This names the scope and purpose of the table. Some values are more straightforward (such as *person*, *learner\_activity*, *course\_grade*, etc.), which point to a direct UCDM entity in scope for aggregations or denormalization. Other values are more use-case descriptive (such as *interaction\_sessions, last\_activity, weekly\_discussions,* etc.*)*. These tables likely have multiple dependencies on UCDM tables or Caliper events and are less tied to a single entity in scope. Instead, these tables likely answer a targeted analytics question.

## Code Repository and Visualized Dependencies <a href="#datamarts-coderepositoryandvisualizeddependencies" id="datamarts-coderepositoryandvisualizeddependencies"></a>

All of the SQL code and deployment scripts used to build the marts are stored in this [Unizin Community Gitlab repository called *Marts*](https://gitlab.com/unizin-community/unizin-data-platform/marts). For now, this repository is read-only and for logic reference purposes only; however, users are encouraged to [create issues ](https://gitlab.com/unizin-community/unizin-data-platform/marts/-/issues)against the repository for additions, bug fixes, and requests.

The directories in scope for viewing SQL logic defining the marts are the following:

* **Context** - all helper tables that have UDP context\_store dependencies
* **Distribution** - tables that define categorical and continuous distributions of data in the UDP context\_store
* **Event** - all helper tables that have UDP event\_store dependencies
* **Mart** - highest level tables that combine both context and event dependencies together

The other Python scripts, directories, and configuration files are totally fine to see! However, their only purpose is internal to Unizin DevOps to maintain and deploy the marts at scale.

### Interactive Dependency Charts <a href="#datamarts-interactivedependencycharts" id="datamarts-interactivedependencycharts"></a>

Each corresponding mart documentation page nested beneath this page on the resources site has an interactive dependency visualization (built using [Mermaid](https://mermaid.js.org/)) showing the sequence of helper tables built from foundational UDP data to the use-case driven marts.

Below is an example of the *mart\_general.lms\_tool* dependency visualization:

{% hint style="info" %}
[Click here to open the interactive chart](https://assets.public.unizin.org/udp-marts/mart/mart__general__lms_tool.svg).
{% endhint %}

<div data-full-width="true"><figure><img src="https://3709019308-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKYwtHNGgdPXS3PWAlZUr%2Fuploads%2Fgit-blob-a499e16048d47500863a8754c4a4dd35d7afe256%2Flms_tool.png?alt=media" alt=""><figcaption><p>mart_general.lms_tool</p></figcaption></figure></div>

Read left-to-right, this shows the transformation steps enforced from foundational UDP data to the final mart. The following color coding applies:

* **Blue** - these correspond to SQL in the *context* directory of the repo; faded/light blue is a foundational table in the context\_store. Light blue rectangles are not clickable; these are foundational UDP entities and not defined with SQL.
* **Red** - these correspond to SQL in the *event* directory of the repo; faded/light red is the UDP’s foundational *expanded* table in BigQuery. Light red rectangles are not clickable; these are foundational UDP entities and not defined with SQL.
* **Green** - these correspond to SQL in the *utility* directory of the repo
* **Gray** - these are for all other queries. These may be referencing a BigQuery public dataset (e.g. bringing in census names instead of using real student names). Also, *starter.sql* files fall in this category. Starter files are helpful for partitioned tables, and these need to run once just to provision the table schemas in BigQuery correctly. No actual data get passed through these *starter.sql* files, however.

Clicking on a node in the dependency diagram will take users to the corresponding SQL file in the GitLab repository. Note that you may need to `ctrl+click` (for Windows) or `cmd+click` (for Mac) to open GitLab in a new tab.

<details>

<summary>A full list of charts can be found here.</summary>

* [mart\_course\_offering.file\_interaction](https://assets.public.unizin.org/udp-marts/mart/mart__course_offering__file_interaction.svg)
* [mart\_course\_offering.status](https://assets.public.unizin.org/udp-marts/mart/mart__course_offering__status.svg)
* [mart\_course\_section.file\_interaction](https://assets.public.unizin.org/udp-marts/mart/mart__course_section__file_interaction.svg)
* [mart\_course\_section.status](https://assets.public.unizin.org/udp-marts/mart/mart__course_section__status.svg)
* [mart\_general.lms\_tool](https://assets.public.unizin.org/udp-marts/mart/mart__general__lms_tool.svg)
* [mart\_general.lti\_tool](https://assets.public.unizin.org/udp-marts/mart/mart__general__lti_tool.svg)
* [mart\_taskforce.course\_profile](https://assets.public.unizin.org/udp-marts/mart/mart__taskforce__course_profile.svg)
* [mart\_taskforce.level1\_aggregated](https://assets.public.unizin.org/udp-marts/mart/mart__taskforce__level1_aggregated.svg)
* [mart\_taskforce.level2\_aggregated](https://assets.public.unizin.org/udp-marts/mart/mart__taskforce__level2_aggregated.svg)
* [mart\_taskforce.level2\_course\_week\_files](https://assets.public.unizin.org/udp-marts/mart/mart__taskforce__level2_course_week_files.svg)
* [mart\_taskforce.level2\_course\_week\_metrics](https://assets.public.unizin.org/udp-marts/mart/mart__taskforce__level2_course_week_metrics.svg)
* [mart\_taskforce.level2\_course\_week\_tools](https://assets.public.unizin.org/udp-marts/mart/mart__taskforce__level2_course_week_tools.svg)
* [mart\_taskforce.level2\_course\_weekly\_distribution\_summary](https://assets.public.unizin.org/udp-marts/mart/mart__taskforce__level2_course_weekly_distribution_summary.svg)
* [mart\_taskforce.student\_term\_profile](https://assets.public.unizin.org/udp-marts/mart/mart__taskforce__student_term_profile.svg)
* [mart\_course\_offering.interaction\_sessions](https://assets.public.unizin.org/udp-marts/mart/mart_course_offering__interaction_sessions.svg)
* [mart\_course\_offering.last\_activity](https://assets.public.unizin.org/udp-marts/mart/mart_course_offering__last_activity.svg)
* [mart\_course\_offering.daily\_course\_grade\_record](https://assets.public.unizin.org/udp-marts/mart/mart__course_offering__daily_course_grade_record.svg)
* [mart\_course\_section.interaction\_sessions](https://assets.public.unizin.org/udp-marts/mart/mart_course_section__interaction_sessions.svg)
* [mart\_course\_section.last\_activity](https://assets.public.unizin.org/udp-marts/mart/mart_course_section__last_activity.svg)
* [mart\_course\_section.daily\_course\_grade\_record](https://assets.public.unizin.org/udp-marts/mart/mart__course_section__daily_course_grade_record.svg)
* [mart\_student\_activity\_score.student\_course\_metrics](https://assets.public.unizin.org/udp-marts/mart/mart__student_activity_score__student_course_metrics.svg)
* [mart\_student\_activity\_score.final](https://assets.public.unizin.org/udp-marts/mart/mart__student_activity_score__final.svg)

</details>

## Questions and Requests For Changes/Additions <a href="#datamarts-questionsandrequestsforchanges-additions" id="datamarts-questionsandrequestsforchanges-additions"></a>

Emailing <support@unizin.org> is the best way to ask questions and request additions or changes to the marts. The Services and DSS team will follow up directly on each request.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://resources.unizin.org/products/data-and-analytics/unizin-data-platform/data-stores/data-marts.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
