# Synthetic Data \[beta]

In addition to production UDP data, Unizin has created a synthetic dataset that follows the same data model as UDP data. This dataset contains fake students and courses, and can be used safely alongside real production data.

{% hint style="info" %}
Please note that the synthetic dataset is a beta version which is still undergoing development and testing before its official release. The dataset is provided on an “as is” and “as available” basis. The primary purpose of this beta testing is to obtain feedback on the contents and the identification of defects. Should you encounter any bugs, glitches, lack of functionality or other problems with this dataset, please let us know immediately so we can rectify these accordingly. Your help in this regard is greatly appreciated.
{% endhint %}

## Synthetic Data Creation Process <a href="#udpcontextstore-accessingthecontextstore" id="udpcontextstore-accessingthecontextstore"></a>

At the highest level, the creation of synthetic data follows this procedure:

1. Define the size of a synthetic institution, based on student FTE. Currently, our dataset has a student population of 20,000 FTE.
2. Calculate distributions of possible values for data rows based on a consortium view of real UDP instances. No student PII is collected or stored at any point in this process, and this is the only step that uses real data. See the [UDP Distributions](#udpcontextstore-unizinmemberaccess) section for more details.
3. Iteratively generate rows for all UDP context store entities, considering dependencies. For example, we must generate the *learner\_activity\_result* entity before we generate the *course\_grade* entity. The *learner\_activity\_result* entity contains assignment scores that will inform believable letter grades students can receive in the *course\_grade* entity.
4. Iteratively generate Caliper events to mimic believable clicks and student activity in the synthetic courses. We have a slightly different distribution calculation process for the Caliper events.
5. Generate all Unizin-defined data marts on top of the synthetic UDP data. The goal is to mimic production UDP environments with fake, believable data. Since we deliver the data marts to every tenant, we choose to build the data marts on top of the synthetic data as well.
6. Deliver the final datasets and tables in Google BigQuery.
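The dependency-aware ordering in step 3 can be sketched as a topological sort. This is a minimal illustration, not Unizin's actual generator: only the *learner\_activity\_result* before *course\_grade* dependency comes from the list above, and every other entry in the dependency map is a hypothetical assumption.

```python
from graphlib import TopologicalSorter

# Each entity maps to the entities that must be generated before it.
# Only learner_activity_result -> course_grade is stated in the text;
# the remaining dependencies are illustrative assumptions.
DEPENDENCIES = {
    "person": set(),
    "course_offering": set(),
    "course_section": {"course_offering"},
    "course_section_enrollment": {"person", "course_section"},
    "learner_activity": {"course_offering"},
    "learner_activity_result": {"learner_activity", "person"},
    "course_grade": {"learner_activity_result", "course_section_enrollment"},
}


def generation_order(dependencies):
    """Return entity names ordered so that every dependency comes first."""
    return list(TopologicalSorter(dependencies).static_order())


order = generation_order(DEPENDENCIES)
print(order)
```

Any order returned this way places *learner\_activity\_result* ahead of *course\_grade*, so assignment scores exist before letter grades are derived from them.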

## UDP Distributions <a href="#udpcontextstore-unizinmemberaccess" id="udpcontextstore-unizinmemberaccess"></a>

To produce believable datasets, synthetic and production UDP data are correlated via the distribution of values in a given dataset. Just as a sample can be drawn from an assumed distribution of values for a given parameter, synthetic data generated by Unizin is built on the calculated frequencies of values found in members' UDP tables. The source of most of the context store distributions is the [UDP Distributions](/products/data-and-analytics/unizin-data-platform/data-stores/data-marts/udp-distributions.md) mart, which defines the distributions of categorical and continuous fields in the UDP for each tenant.

We build these distributions for both the context data and the event data found in a UDP instance.

### Context Data Distributions - Categorical

Context data often includes categorical fields. For example, the *role* field in *course\_section\_enrollment* takes a defined, predictable set of values. For categorical fields, we build discrete distributions based on the proportion of each value we see in real UDP data sources. Using the *role* field again, the distribution we build looks like the following:

<figure><img src="/files/CKtETfbeQHrPHLKOjaBA" alt="" width="375"><figcaption></figcaption></figure>

When we generate rows in the synthetic *course\_section\_enrollment* entity, there is an 89.3% chance the enrollment will be a Student, a 7.8% chance it will be a Teacher, a 1.5% chance it will be Faculty, etc.
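Drawing from a discrete distribution like this can be sketched with weighted random choice. The Student, Teacher, and Faculty weights are the ones quoted above; the `Other` bucket is a placeholder assumption that lumps the remaining roles together so the weights sum to 100.

```python
import random

# Student, Teacher, and Faculty weights come from the text above.
# "Other" is a hypothetical bucket standing in for the remaining roles.
ROLE_WEIGHTS = {
    "Student": 89.3,
    "Teacher": 7.8,
    "Faculty": 1.5,
    "Other": 1.4,
}


def sample_roles(n, rng):
    """Draw n role values in proportion to ROLE_WEIGHTS."""
    roles = list(ROLE_WEIGHTS)
    weights = list(ROLE_WEIGHTS.values())
    return rng.choices(roles, weights=weights, k=n)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
sample = sample_roles(10_000, rng)
student_share = sample.count("Student") / len(sample)
print(f"Student share: {student_share:.3f}")
```

With enough rows, the share of each role in the synthetic entity approaches its weight in the distribution.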

Some distributions depend on the intersection of two fields. For instance, the *role\_status* field in *course\_section\_enrollment* depends on the *role* field. It's more common for *Student* enrollments to withdraw from courses than for enrollments in other roles, like *TeachingAssistant* or *Observer*. The *role\_status* distribution for Student enrollments is:

<figure><img src="/files/jTwXvktDo3bypVXDVsuK" alt=""><figcaption></figcaption></figure>

Whereas, the *role\_status* distribution for Teacher enrollments is:

<figure><img src="/files/LytLbrYnCIt2C3tG8XT2" alt=""><figcaption></figcaption></figure>

The likelihoods of *role\_status* values look very different for students and teachers, so our synthetic categorical distributions must account for this.
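One way to picture a distribution that depends on two fields is a lookup keyed by the first field. The sketch below is an assumption-laden illustration: the status names and all weights are hypothetical placeholders, not the values shown in the figures above.

```python
import random

# Hypothetical status names and weights, for illustration only.
# The real distributions are the ones shown in the figures above.
ROLE_STATUS_WEIGHTS = {
    "Student": {"Enrolled": 90.0, "Withdrawn": 8.0, "Completed": 2.0},
    "Teacher": {"Enrolled": 98.0, "Withdrawn": 0.5, "Completed": 1.5},
}


def sample_role_status(role, rng):
    """Pick a role_status from the distribution that belongs to `role`."""
    weights = ROLE_STATUS_WEIGHTS[role]
    return rng.choices(list(weights), weights=list(weights.values()), k=1)[0]


rng = random.Random(7)
print(sample_role_status("Student", rng))
print(sample_role_status("Teacher", rng))
```

The *role* value is generated first, and that value then selects which *role\_status* distribution to draw from.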

### Context Data Distributions - Continuous

Other fields in the UDP context store are continuous fields, which are fields that can take any numeric value. Quiz and assignment scores are common examples of continuous fields. The number of files in a course, number of discussions, etc. also fall in the continuous category of fields. To define distributions for continuous fields, we calculate the average, standard deviation, minimum, and maximum values.
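As a rough sketch, the four metrics can be computed from a list of observed values as shown below. The input numbers are made up, and the use of the population standard deviation is an assumption, since the text does not say which variant is calculated.

```python
import statistics


def summarize(values):
    """Compute the four metrics used to describe a continuous field."""
    return {
        "average": statistics.mean(values),
        # Population standard deviation; an assumption, see the note above.
        "standard_deviation": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
    }


# Hypothetical counts of files per course.
files_per_course = [3, 12, 7, 0, 25, 9, 14]
metrics = summarize(files_per_course)
print(metrics)
```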

For example, we calculate these metrics for the number of learner activities per group in a course. This helps us know how assignments should be organized together in our synthetic courses. This distribution looks like:

| Metric             | Distribution Value |
| ------------------ | ------------------ |
| Average            | 6.477              |
| Standard Deviation | 7.517              |
| Min                | 1                  |
| Max                | 50                 |

On average, there are a little over 6 assignments per assignment group in a course. However, we can see as few as 1 and as many as 50 assignments in groups. We take these metrics into account when synthetically generating values for continuous fields.
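A minimal sketch of generating such a value uses the four metrics from the table. It assumes a normal distribution that is rounded and redrawn until the value lands between the minimum and maximum. The text does not state which distribution family Unizin uses, so treat the normal shape as an assumption; note also that truncating this way shifts the sample average above 6.477.

```python
import random

# Metrics from the table above.
AVERAGE = 6.477
STD_DEV = 7.517
MINIMUM = 1
MAXIMUM = 50


def sample_activities_per_group(rng):
    """Draw a rounded normal value, retrying until it is within bounds."""
    while True:
        value = round(rng.gauss(AVERAGE, STD_DEV))
        if MINIMUM <= value <= MAXIMUM:
            return value


rng = random.Random(0)
draws = [sample_activities_per_group(rng) for _ in range(10)]
print(draws)
```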

### Event Data Distributions

The event data in the UDP is unique due to the time component throughout academic terms. We need to mimic the ebb and flow of students interacting with tools throughout semesters. We define the distribution of event data along four primary intersections:

1. **Course CIP Category**: Math, Engineering, English, etc.
2. **Course Level**: freshman, sophomore, junior, senior
3. **Day in term**: day 1 through the last day of the term
4. **Tool:** Assignments, Modules, Zoom, Kaltura, Turnitin, etc.

The course CIP category and level intersections capture how behavior patterns differ across types of courses and levels of instruction. The tool intersection accounts for courses using different types of tools for instruction. The day in term intersection accounts for the highs and lows of activity throughout a term. During spring break, activity dips. Before major mid-term exams, activity usually spikes.

The result is a per-course-category, per-level, per-day, per-tool view of the number of events that could exist for people in our synthetic courses. This gives us a believable timeline view of tool usage and behaviors throughout academic terms.
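The four-way view can be pictured as a lookup keyed by (CIP category, course level, day in term, tool). The sketch below shows only that data structure; all keys and counts are hypothetical, and it is not a description of how Unizin stores these distributions.

```python
# Keys are (CIP category, course level, day in term, tool).
# Values are hypothetical expected event counts per person.
EVENT_COUNTS = {
    ("Math", "freshman", 1, "Assignments"): 4.0,
    ("Math", "freshman", 2, "Assignments"): 2.5,
    ("Math", "freshman", 3, "Assignments"): 9.0,
    ("Math", "freshman", 1, "Modules"): 6.5,
}


def expected_events(cip, level, day, tool):
    """Look up the expected count, defaulting to 0 for unseen combinations."""
    return EVENT_COUNTS.get((cip, level, day, tool), 0.0)


def daily_timeline(cip, level, tool, days):
    """Build the per-day series for one course category, level, and tool."""
    return [expected_events(cip, level, day, tool) for day in days]


timeline = daily_timeline("Math", "freshman", "Assignments", range(1, 5))
print(timeline)
```

Reading the series day by day gives the kind of timeline described above, with dips and spikes determined by the per-day values.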

## Accessing and Using Synthetic Data

The synthetic data lives in a GCP project called *unizin-shared*. This project is a Unizin-managed space where consortium-shared datasets will live. Currently, synthetic data is the only resource in this project, but more resources can live there in the future!

### Approval Access Process

The approval process is similar to production UDP access requests. However, we need separate approval, even for previously approved users, because we have custom Google Cloud roles that allow users to query from *unizin-shared*. The process is as follows:

1. Submit a service ticket request via email to <support@unizin.org>
2. In the service ticket email, explicitly request access to the synthetic data in *unizin-shared*. This lets us know we are not provisioning access to production data.
3. Attach written institution Data Steward approval.
4. Once we receive items 1-3, Unizin will provision the correct role access in Google Cloud.
5. Unizin Services team will reply to the service ticket confirming access has been granted.

### Query Examples

Even though the data lives in *unizin-shared*, users still run queries from their production BigQuery environments. This lets us track usage of the synthetic data across the consortium. Queries against synthetic data specify the *unizin-shared* project.

For example, let's assume a user at University of Nebraska has access both to production UDP data and synthetic UDP data. A production query to list all academic terms would look like the following:

<figure><img src="https://lh7-us.googleusercontent.com/BDA-KERutOCUajXJlPJ8EHf7S7iI51Ei6QUyJA37ZdfWM-u6iBAlFCvGgcAMAbPVrAOkgsB5JJVZYOizqwsmXy57NGfeKmHfXO1m1kfQehSKbVYJ1v9ceUuFUn4sqAoyreA6saVjrI-tukVp1bePkQ6iiA=s2048" alt=""><figcaption></figcaption></figure>

The user is logged into the *udp-unl-prod* GCP project, and their query scans the production *academic\_term* entity.

Now, this user wants to query the synthetic *academic\_term* entity. Their query changes to the following:

<figure><img src="https://lh7-us.googleusercontent.com/LlL5rchKGo4VwgS9QBV8pCvNvPPQ2wBO26rPNuYNV7Dxx93-f86-a_hytncx9wxaI9hXPpggjSK8hrqMtvUIN-UmKJPT63v8ifBr8Jq7jkJ4dBKMQF9FBZ9bAhUjT7pyikdOaA5XxXb3hh7yIdPzTWqAaQ=s2048" alt=""><figcaption></figcaption></figure>

The user remains in *udp-unl-prod* to run the query! They do not change GCP projects. Instead, their query explicitly references *unizin-shared* rather than *udp-unl-prod* in the query editor. This pulls the data from *unizin-shared* instead of the real production data.
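The only difference between the two queries is the project in the table reference, which the short sketch below makes explicit. The helper names are hypothetical, and the `context_store_entity` dataset name is an assumption; confirm the actual dataset names with the `INFORMATION_SCHEMA` query later in this section.

```python
def table_ref(project, dataset, table):
    """Build a backtick-quoted, fully qualified BigQuery table reference."""
    return f"`{project}.{dataset}.{table}`"


def list_terms_sql(project, dataset="context_store_entity"):
    """Build a query listing academic terms; the dataset name is an assumption."""
    return f"SELECT * FROM {table_ref(project, dataset, 'academic_term')}"


production_sql = list_terms_sql("udp-unl-prod")
synthetic_sql = list_terms_sql("unizin-shared")
print(production_sql)
print(synthetic_sql)
```

Both strings would be run from the same *udp-unl-prod* environment; only the project named inside the query differs.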

In order to see all of the available datasets and tables in the synthetic data living in *unizin-shared,* this query can be run from the appropriate BigQuery environment:

```
SELECT
  table_catalog as PROJECT_ID,
  table_schema as DATASET_ID,
  table_name as TABLE_ID
FROM
  `unizin-shared.region-us`.INFORMATION_SCHEMA.TABLES
```

To get more granular column and schema information about a particular table, run the following query:

```
-- replace {DATASET_ID} and {TABLE_ID} with proper values
SELECT 
  *
FROM
  `unizin-shared.{DATASET_ID}`.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
WHERE
  table_name = "{TABLE_ID}"
```

{% hint style="info" %}
Please note: If you are querying synthetic data with a service account (.json key file), you will need to update your application's settings to reference `udp-<tenant>-prod` to return results.
{% endhint %}
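For service account access, a minimal sketch with the `google-cloud-bigquery` Python client might look like the following. It reads the note above as meaning that the client's project setting should be the tenant's production project while the query text still names *unizin-shared*; that reading, the function names, and the key file path are assumptions.

```python
def billing_project(tenant):
    """Project the client runs under: the tenant's production project."""
    return f"udp-{tenant}-prod"


# The query text still points at unizin-shared.
SYNTHETIC_TABLES_SQL = """
SELECT table_schema, table_name
FROM `unizin-shared.region-us`.INFORMATION_SCHEMA.TABLES
"""


def list_synthetic_tables(key_path, tenant):
    """List (dataset, table) pairs in unizin-shared using a service account key."""
    # Imported here so the module loads even without the library installed.
    from google.cloud import bigquery  # third-party: google-cloud-bigquery

    client = bigquery.Client.from_service_account_json(
        key_path, project=billing_project(tenant)
    )
    rows = client.query(SYNTHETIC_TABLES_SQL).result()
    return [(row["table_schema"], row["table_name"]) for row in rows]


print(billing_project("unl"))
```

A call such as `list_synthetic_tables("service-account.json", "unl")` would then run under *udp-unl-prod*, provided the service account has been granted access through the approval process described above.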

