Synthetic Data [beta]

In addition to production UDP data, Unizin has created a synthetic dataset modeled identically to UDP data. This dataset contains fake students and courses, and can be safely used alongside real production data.

Please note that the synthetic dataset is a beta version which is still undergoing development and testing before its official release. The dataset is provided on an “as is” and “as available” basis. The primary purpose of this beta testing is to obtain feedback on the contents and the identification of defects. Should you encounter any bugs, glitches, lack of functionality or other problems with this dataset, please let us know immediately so we can rectify these accordingly. Your help in this regard is greatly appreciated.

Synthetic Data Creation Process

At the highest level, the creation of synthetic data follows this procedure:

Define the size of a synthetic institution, based on student FTE. Currently, our dataset has a student population of 20,000 FTE.
Calculate distributions of possible values for data rows based on a consortium view of real UDP instances. No student PII is collected or stored at any point in this process, and this is the only step that uses real data. See the UDP Distributions section for more details.
Iteratively generate rows for all UDP context store entities, considering dependencies. For example, we must generate the learner_activity_result entity before we generate the course_grade entity. The learner_activity_result entity contains assignment scores that will inform believable letter grades students can receive in the course_grade entity.
Iteratively generate Caliper events to mimic believable clicks and student activity in the synthetic courses. We have a slightly different distribution calculation process for the Caliper events.
Generate all Unizin-defined data marts on top of the synthetic UDP data. The goal is to mimic production UDP environments with fake, believable data. Since we deliver the data marts to every tenant, we choose to build the data marts on top of the synthetic data as well.
Deliver the final datasets and tables in Google BigQuery.
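
The dependency-aware ordering in the entity generation step can be illustrated with a topological sort. This is a minimal sketch, not Unizin's actual implementation: only the learner_activity_result before course_grade dependency comes from the text, and the rest of the DEPENDS_ON map and the generation_order helper are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each entity lists the entities that must be
# generated before it. Only learner_activity_result -> course_grade comes
# from the text; every other dependency here is an illustrative assumption.
DEPENDS_ON = {
    "person": set(),
    "course_offering": set(),
    "course_section": {"course_offering"},
    "course_section_enrollment": {"person", "course_section"},
    "learner_activity": {"course_offering"},
    "learner_activity_result": {"learner_activity", "person"},
    "course_grade": {"learner_activity_result", "course_section_enrollment"},
}


def generation_order(depends_on):
    """Return entity names ordered so every dependency comes first."""
    return list(TopologicalSorter(depends_on).static_order())


order = generation_order(DEPENDS_ON)
print(order)
```

Generating entities in this order means that assignment scores already exist by the time letter grades are generated.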

UDP Distributions

To produce believable datasets, synthetic and production UDP data are correlated via the distribution of values in a given dataset. Just as a sample can be drawn from an assumed distribution of values for a given parameter, the synthetic data generated by Unizin is built on the calculated frequencies of values found in members' UDP tables. The source of most of the context store distributions is the UDP Distributions mart, which defines the distributions of categorical and continuous fields in the UDP for each tenant.

We build these distributions for both the context data and the event data found in a UDP instance.

Context Data Distributions - Categorical

Context data often includes categorical fields. For example, the role field in course_section_enrollment takes a defined, predictable set of values. For categorical fields, we build discrete distributions based on the proportion of each value we see in real UDP data sources. Using the role field again, the distribution we build looks like the following:

When we generate rows in the synthetic course_section_enrollment entity, there is an 89.3% chance the enrollment will be a Student, a 7.8% chance it will be a Teacher, a 1.5% chance it will be Faculty, and so on.
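
Drawing from a discrete distribution like this one can be sketched with weighted random sampling. The example below is an illustration under assumptions rather than the production generator: the Student, Teacher, and Faculty weights are the percentages quoted above, while the "Other" bucket, the ROLE_WEIGHTS name, and the sample_roles helper are hypothetical.

```python
import random

# Student, Teacher, and Faculty percentages come from the text above.
# "Other" is a placeholder that lumps the remaining roles together so the
# weights sum to 100; the real distribution lists each role separately.
ROLE_WEIGHTS = {
    "Student": 89.3,
    "Teacher": 7.8,
    "Faculty": 1.5,
    "Other": 1.4,
}


def sample_roles(n, weights=ROLE_WEIGHTS, seed=None):
    """Draw n role values in proportion to their weights."""
    rng = random.Random(seed)
    roles = list(weights)
    return rng.choices(roles, weights=[weights[r] for r in roles], k=n)


sample = sample_roles(10_000, seed=42)
share_student = sample.count("Student") / len(sample)
print(f"Student share in sample: {share_student:.3f}")
```

With a large enough sample, the share of each role in the synthetic rows approaches the proportion observed in the real data.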

Some distributions depend on the intersection of two fields. For instance, the role_status field in course_section_enrollment depends on the role field. It's more common for Student enrollments to withdraw from courses than other roles, like TeachingAssistant or Observer. The role_status distribution for Student enrollments is:

Whereas, the role_status distribution for Teacher enrollments is:

The likelihoods of role_status values look very different for students and teachers, so our synthetic categorical distributions must account for this.
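
One way to picture a distribution that depends on two fields is a lookup keyed by role, with a separate role_status distribution under each key. The sketch below is purely illustrative: the probabilities and status names are placeholders invented for the example, not real UDP figures, and sample_role_status is a hypothetical helper.

```python
import random

# Placeholder probabilities and status names, NOT real UDP figures. They only
# encode the idea from the text that the role_status mix differs by role.
STATUS_GIVEN_ROLE = {
    "Student": {"Enrolled": 0.85, "Withdrawn": 0.15},
    "Teacher": {"Enrolled": 0.99, "Withdrawn": 0.01},
}


def sample_role_status(role, rng):
    """Pick a role_status from the distribution conditioned on role."""
    dist = STATUS_GIVEN_ROLE[role]
    statuses = list(dist)
    return rng.choices(statuses, weights=[dist[s] for s in statuses], k=1)[0]


rng = random.Random(7)
student_statuses = [sample_role_status("Student", rng) for _ in range(5_000)]
teacher_statuses = [sample_role_status("Teacher", rng) for _ in range(5_000)]
student_withdrawn = student_statuses.count("Withdrawn") / 5_000
teacher_withdrawn = teacher_statuses.count("Withdrawn") / 5_000
print(student_withdrawn, teacher_withdrawn)
```

Sampling the role first and then the status conditioned on that role keeps the two fields consistent with each other in every generated row.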

Context Data Distributions - Continuous

Other fields in the UDP context store are continuous: they can take any numeric value. Quiz and assignment scores are common examples, as are counts such as the number of files or discussions in a course. To define distributions for continuous fields, we calculate the average, standard deviation, minimum, and maximum values.

For example, we calculate these metrics for the number of learner activities per group in a course. This helps us know how assignments should be organized together in our synthetic courses. This distribution looks like:

Metric               Distribution Value
Average              6.477
Standard Deviation   7.517
Min                  1
Max                  50

On average, there are a little over 6 assignments per assignment group in a course. However, we can see as few as 1 and as many as 50 assignments in groups. We take these metrics into account when synthetically generating values for continuous fields.
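
The text does not say exactly how these four metrics are combined, so the following is only one plausible sketch: draw from a normal curve with the stated average and standard deviation, round to a whole number, and redraw until the value falls between the minimum and maximum. The choice of a bounded normal curve and the sample_group_size helper are assumptions.

```python
import random

# Metrics quoted above: learner activities per group in a course.
AVERAGE = 6.477
STD_DEV = 7.517
MINIMUM = 1
MAXIMUM = 50


def sample_group_size(rng, max_tries=1000):
    """Draw an integer group size from a normal curve bounded to [min, max].

    Rejection sampling: redraw until the rounded value is within bounds.
    """
    for _ in range(max_tries):
        value = round(rng.gauss(AVERAGE, STD_DEV))
        if MINIMUM <= value <= MAXIMUM:
            return value
    # Fallback: the rounded average, clamped so it is always within bounds.
    return min(MAXIMUM, max(MINIMUM, round(AVERAGE)))


rng = random.Random(2024)
sizes = [sample_group_size(rng) for _ in range(1_000)]
print(min(sizes), max(sizes))
```

Because the standard deviation is larger than the average, discarding out-of-range draws pulls the sample mean above 6.477. A skewed distribution may fit such a field more closely than a bounded normal curve.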

Event Data Distributions

The event data in the UDP is unique due to the time component throughout academic terms. We need to mimic the ebb and flow of students interacting with tools throughout semesters. We define the distribution of event data along four primary intersections:

Course CIP Category: Math, Engineering, English, etc.
Course Level: freshman, sophomore, junior, senior
Day in term: day 1 through the last day of the term
Tool: Assignments, Modules, Zoom, Kaltura, Turnitin, etc.

The course CIP category and level intersections capture how behavior patterns differ across types of courses and levels of instruction. The tool intersection accounts for courses leveraging different types of tools for instruction. The day in term intersection accounts for the highs and lows of activity throughout a term. During spring break, activity dips. Before major midterm exams, activity usually spikes.

The result of this is a per-course category, per-level, per-day, per-tool view of the number of events that could exist for people in our synthetic courses. This gives us a believable timeline view of tool usage and behaviors throughout academic terms.
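
The four-way view described above can be pictured as a lookup keyed by category, level, day, and tool. The sketch below is hypothetical throughout: the counts are placeholders rather than real UDP figures, treating each value as an expected count per student is an assumption, and daily_timeline is an illustrative helper.

```python
from collections import defaultdict

# Hypothetical expected event counts per student, keyed by the four
# intersections from the text: (CIP category, course level, day in term, tool).
# The numbers are placeholders, not real UDP figures.
EXPECTED_EVENTS = {
    ("Math", "freshman", 1, "Assignments"): 2.0,
    ("Math", "freshman", 1, "Modules"): 3.5,
    ("Math", "freshman", 2, "Assignments"): 1.0,
    ("English", "senior", 1, "Turnitin"): 0.5,
}


def daily_timeline(category, level, enrolled_students):
    """Total expected events per day for one course, summed across tools."""
    timeline = defaultdict(float)
    for (cat, lvl, day, _tool), per_student in EXPECTED_EVENTS.items():
        if cat == category and lvl == level:
            timeline[day] += per_student * enrolled_students
    return dict(sorted(timeline.items()))


timeline = daily_timeline("Math", "freshman", enrolled_students=30)
print(timeline)
```

Summing across tools gives the day-by-day activity curve for a course, while the individual keys preserve the per-tool breakdown needed to generate events for each tool.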

Accessing and Using Synthetic Data

The synthetic data lives in a GCP project called unizin-shared. This project is a Unizin-managed space where consortium-shared datasets will live. Currently, synthetic data is the only resource in this project, but more resources can live there in the future!

Approval Access Process

The approval process is similar to production UDP access requests. However, we need separate approval, even for previously approved users, because we have custom Google Cloud roles that allow users to query from unizin-shared. The process is as follows:

1. Submit a service ticket request via email to [email protected]
2. In the service ticket email, explicitly request access to the synthetic data in unizin-shared. This lets us know we are not provisioning access to production data.
3. Attach written institution Data Steward approval.
4. Once we receive items 1-3, Unizin will provision the correct role access in Google Cloud.
5. Unizin Services team will reply to the service ticket confirming access has been granted.

Query Examples

Even though the data lives in unizin-shared, users will still use their production BigQuery environments to run queries. This helps us with usage tracking of the synthetic data across the consortium. In queries to synthetic data, users will specify the unizin-shared project.

For example, let's assume a user at University of Nebraska has access both to production UDP data and synthetic UDP data. A production query to list all academic terms would look like the following:

The user is logged into the udp-unl-prod GCP project, and their query scans the production academic_term entity.

Now, this user wants to query the synthetic academic_term entity in the synthetic data. Their query changes to the following:

The user still remains in udp-unl-prod to run the query! They do not change GCP projects to run this query. Instead, their query explicitly calls unizin-shared instead of udp-unl-prod in the query editor. This will pull the data from unizin-shared instead of the real, production data.
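
The two queries described above differ only in the project ID at the front of the table path. As a rough sketch, the helper below builds both query strings. The dataset name context_store_entity is an assumption, as is the academic_term_query helper; confirm the actual dataset names before relying on them.

```python
def academic_term_query(project, dataset="context_store_entity"):
    """Build a query listing academic terms from the given GCP project.

    The dataset name is an assumption; check which datasets actually
    exist in the project before using it.
    """
    # Backticks keep the hyphenated project ID unambiguous in the table path.
    return f"SELECT * FROM `{project}.{dataset}.academic_term`"


production_query = academic_term_query("udp-unl-prod")
synthetic_query = academic_term_query("unizin-shared")
print(production_query)
print(synthetic_query)
```

Both strings would be run from the same udp-unl-prod environment; only the project named inside the query changes.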

In order to see all of the available datasets and tables in the synthetic data living in unizin-shared, this query can be run from the appropriate BigQuery environment:

-- list every dataset and table available in unizin-shared
SELECT
  table_catalog AS PROJECT_ID,
  table_schema AS DATASET_ID,
  table_name AS TABLE_ID
FROM
  `unizin-shared.region-us`.INFORMATION_SCHEMA.TABLES

To get more granular column and schema information about a particular table, run the following query:

-- replace {DATASET_ID} and {TABLE_ID} with proper values
SELECT 
  *
FROM
  `unizin-shared.{DATASET_ID}`.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
WHERE
  table_name = "{TABLE_ID}"

Please note: If you are attempting to query synthetic data using a service account (.json file) method, you will need to update your application's settings to reference udp-<tenant>-prod to return results.

