Unizin Product Documentation
Copyright © 2023, Unizin, Ltd.

Synthetic Data [beta]


Last updated 2 months ago

In addition to production UDP data, Unizin has created a synthetic dataset modeled identically to UDP data. This dataset contains fake students and courses, and can be used safely alongside real production data.

Please note that the synthetic dataset is a beta version that is still undergoing development and testing before its official release. The dataset is provided on an “as is” and “as available” basis. The primary purpose of this beta testing is to gather feedback on the contents and to identify defects. Should you encounter any bugs, glitches, missing functionality, or other problems with this dataset, please let us know immediately so we can fix them. Your help in this regard is greatly appreciated.

Synthetic Data Creation Process

At the highest level, the creation of synthetic data follows this procedure:

  1. Define the size of a synthetic institution, based on student FTE. Currently, our dataset has a student population of 20,000 FTE.

  2. Calculate distributions of possible values for data rows based on a consortium view of real UDP instances. No PII data for students is collected or stored at any point in this process, and this is the only part of the process that uses real data. See the UDP Distributions section for more details.

  3. Iteratively generate rows for all UDP context store entities, considering dependencies. For example, we must generate the learner_activity_result entity before we generate the course_grade entity. The learner_activity_result entity contains assignment scores that will inform believable letter grades students can receive in the course_grade entity.

  4. Iteratively generate Caliper events to mimic believable clicks and student activity in the synthetic courses. We have a slightly different distribution calculation process for the Caliper events.

  5. Generate all Unizin-defined data marts on top of the synthetic UDP data. The goal is to mimic production UDP environments with fake, believable data. Since we deliver the data marts to every tenant, we choose to build the data marts on top of the synthetic data as well.

  6. Deliver the final datasets and tables in Google BigQuery.
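The dependency-aware ordering in step 3 can be sketched as a topological sort. This is only an illustration, not Unizin's actual generator: the dependency map below is hypothetical, and only the rule that learner_activity_result precedes course_grade comes from the text above.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each entity lists the entities that must be
# generated before it. Only the learner_activity_result -> course_grade
# dependency is stated in the text; the rest are illustrative assumptions.
DEPENDENCIES = {
    "person": set(),
    "course_offering": set(),
    "course_section": {"course_offering"},
    "course_section_enrollment": {"person", "course_section"},
    "learner_activity": {"course_offering"},
    "learner_activity_result": {"learner_activity", "person"},
    "course_grade": {"learner_activity_result", "course_section_enrollment"},
}


def generation_order(dependencies):
    """Return entity names so that every entity follows its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


order = generation_order(DEPENDENCIES)
print(order)
```

Any order produced this way places course_grade after learner_activity_result, so assignment scores exist before letter grades are derived from them.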

UDP Distributions

To produce believable datasets, synthetic and production UDP data are correlated via the distribution of values in a given dataset. Just as a sample may be drawn from an assumed distribution of values for a given parameter, synthetic data generated by Unizin is built on the calculated frequencies of values found in members' UDP tables. The source of most of the context store distributions is the UDP Distributions mart, which defines the distributions of categorical and continuous fields in the UDP for each tenant.

We build these distributions for both the context data and the event data found in a UDP instance.

Context Data Distributions - Categorical

Context data often includes categorical fields. For example, the role field in course_section_enrollment takes a defined, predictable set of values. For categorical fields, we build discrete distributions based on the proportion of each value we see in real UDP data sources. Using the role field again, the distribution we build looks like the following:

When we generate rows in the synthetic course_section_enrollment entity, there is an 89.3% chance the enrollment will be a Student, a 7.8% chance it will be a Teacher, a 1.5% chance it will be Faculty, etc.
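As a minimal sketch of sampling from such a discrete distribution, the snippet below draws roles with Python's standard library. The Student, Teacher, and Faculty proportions are the ones quoted above; the "Other" bucket holding the remaining 1.4% is an assumption made for illustration, as are the function and variable names.

```python
import random

# Student, Teacher, and Faculty proportions come from the text. The remaining
# 1.4% is lumped into a hypothetical "Other" bucket for illustration only.
ROLE_WEIGHTS = {
    "Student": 89.3,
    "Teacher": 7.8,
    "Faculty": 1.5,
    "Other": 1.4,
}


def sample_roles(n, rng):
    """Draw n role values according to the weighted discrete distribution."""
    roles = list(ROLE_WEIGHTS)
    weights = list(ROLE_WEIGHTS.values())
    return rng.choices(roles, weights=weights, k=n)


rng = random.Random(42)  # seeded so the sketch is repeatable
sample = sample_roles(10_000, rng)
student_share = sample.count("Student") / len(sample)
print(f"Student share: {student_share:.3f}")
```

With a large enough sample, the observed share of each role should land close to its configured proportion.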

Some distributions depend on the intersection of two fields. For instance, the role_status field in course_section_enrollment depends on the role field. It's more common for Student enrollments to withdraw from courses than other roles, like TeachingAssistant or Observer. The role_status distribution for Student enrollments is:

Whereas, the role_status distribution for Teacher enrollments is:

The likelihoods of role_status values look very different for students and teachers, so our synthetic categorical distributions must account for this.
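A conditional distribution like this can be sketched as one weight table per role. The weights and role_status names below are placeholders chosen only to show the mechanism; they are not the real proportions from UDP data.

```python
import random

# Placeholder weights only: the real role_status proportions are not
# reproduced here, and the status names are illustrative assumptions.
ROLE_STATUS_WEIGHTS = {
    "Student": {"Enrolled": 0.90, "Withdrawn": 0.08, "Completed": 0.02},
    "Teacher": {"Enrolled": 0.98, "Withdrawn": 0.01, "Completed": 0.01},
}


def sample_role_status(role, rng):
    """Draw a role_status from the distribution conditioned on role."""
    weights = ROLE_STATUS_WEIGHTS[role]
    return rng.choices(list(weights), weights=list(weights.values()), k=1)[0]


rng = random.Random(7)
student_statuses = [sample_role_status("Student", rng) for _ in range(5_000)]
teacher_statuses = [sample_role_status("Teacher", rng) for _ in range(5_000)]
student_withdrawn = student_statuses.count("Withdrawn")
teacher_withdrawn = teacher_statuses.count("Withdrawn")
print(student_withdrawn, teacher_withdrawn)
```

Sampling the role first and then looking up the matching role_status table keeps the two fields consistent with each other.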

Context Data Distributions - Continuous

Other fields in the UDP context store are continuous fields, which are fields that can take any numeric value. Quiz and assignment scores are common examples of continuous fields. The number of files in a course, number of discussions, etc. also fall in the continuous category of fields. To define distributions for continuous fields, we calculate the average, standard deviation, minimum, and maximum values.

For example, we calculate these metrics for the number of learner activities per group in a course. This helps us know how assignments should be organized together in our synthetic courses. This distribution looks like:

Metric               Distribution Value
Average              6.477
Standard Deviation   7.517
Min                  1
Max                  50

On average, there are a little over 6 assignments per assignment group in a course. However, we can see as few as 1 and as many as 50 assignments in groups. We take these metrics into account when synthetically generating values for continuous fields.
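One way to turn these four metrics into synthetic values is sketched below. The text does not say which distribution shape Unizin uses, so the truncated normal with rejection sampling here is an assumption, and the function name is hypothetical.

```python
import random

# Metrics taken from the table above.
AVERAGE, STD_DEV, MINIMUM, MAXIMUM = 6.477, 7.517, 1, 50


def sample_activities_per_group(rng):
    """Draw from a normal distribution, retrying until the rounded value
    falls inside [MINIMUM, MAXIMUM] (an assumed approach, not Unizin's)."""
    while True:
        value = round(rng.gauss(AVERAGE, STD_DEV))
        if MINIMUM <= value <= MAXIMUM:
            return value


rng = random.Random(0)
group_sizes = [sample_activities_per_group(rng) for _ in range(1_000)]
print(min(group_sizes), max(group_sizes))
```

Truncating at the minimum shifts the sample mean above the configured average, so a real generator would likely use a skewed distribution or adjust its parameters to compensate.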

Event Data Distributions

The event data in the UDP is unique due to the time component throughout academic terms. We need to mimic the ebb and flow of students interacting with tools throughout semesters. We define the distribution of event data along four primary intersections:

  1. Course CIP Category: Math, Engineering, English, etc.

  2. Course Level: freshman, sophomore, junior, senior

  3. Day in term: day 1 through the last day of the term

  4. Tool: Assignments, Modules, Zoom, Kaltura, Turnitin, etc.

The course CIP category and level intersections capture how behavior patterns differ across types of courses and levels of instruction. The tool intersection accounts for courses leveraging different types of tools for instruction. The day in term intersection accounts for the highs and lows of activity throughout a term. During spring break, activity dips. Before major mid-term exams, activity usually spikes.

The result is a per-course-category, per-level, per-day, per-tool view of the number of events that could exist for people in our synthetic courses. This gives us a believable timeline view of tool usage and behaviors throughout academic terms.
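The four-way intersection can be pictured as a lookup keyed by (CIP category, course level, day in term, tool). The sketch below is an assumption-heavy illustration: the expected counts are invented, and drawing counts from a Poisson distribution is a choice made for this example, not something the text specifies.

```python
import math
import random

# Hypothetical expected event counts per student, keyed by
# (CIP category, course level, day in term, tool). All values are made up.
EXPECTED_EVENTS = {
    ("Math", "freshman", 1, "Assignments"): 2.0,
    ("Math", "freshman", 1, "Modules"): 3.5,
    ("Math", "freshman", 2, "Assignments"): 1.2,
    ("Math", "freshman", 2, "Modules"): 2.8,
}


def poisson(mean, rng):
    """Knuth's algorithm for a Poisson draw; adequate for small means."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def events_for_day(category, level, day, rng):
    """Return {tool: event_count} for one student on one day of the term."""
    return {
        tool: poisson(mean, rng)
        for (cat, lvl, d, tool), mean in EXPECTED_EVENTS.items()
        if (cat, lvl, d) == (category, level, day)
    }


rng = random.Random(1)
day_one = events_for_day("Math", "freshman", 1, rng)
print(day_one)
```

Repeating the draw for every day in the term yields a per-tool activity timeline for a synthetic student in that kind of course.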

Accessing and Using Synthetic Data

The synthetic data lives in a GCP project called unizin-shared. This project is a Unizin-managed space where consortium-shared datasets will live. Currently, synthetic data is the only resource in this project, but more resources may be added in the future.

Approval Access Process

The approval process is similar to production UDP access requests. However, we need separate approval, even for previously approved users, because we have custom Google Cloud roles that allow users to query from unizin-shared. The process is as follows:

  1. Submit a service ticket request via email to support@unizin.org.

  2. In the service ticket email, explicitly request access to the synthetic data in unizin-shared. This lets us know we are not provisioning access to production data.

  3. Attach written institution Data Steward approval.

  4. Once we receive items 1-3, Unizin will provision the correct role access in Google Cloud.

  5. Unizin Services team will reply to the service ticket confirming access has been granted.

Query Examples

Even though the data live in unizin-shared, users will still use their production BigQuery environments to run queries. This helps us with usage tracking of the synthetic data across the consortium. In queries to synthetic data, users will specify the unizin-shared project.

For example, let's assume a user at the University of Nebraska has access to both production UDP data and synthetic UDP data. A production query to list all academic terms would look like the following:

The user is logged into the udp-unl-prod GCP project, and their query scans the production academic_term entity.

Now, this user wants to query the synthetic academic_term entity in the synthetic data. Their query changes to the following:

The user still remains in udp-unl-prod to run the query! They do not change GCP projects to run this query. Instead, their query explicitly calls unizin-shared instead of udp-unl-prod in the query editor. This will pull the data from unizin-shared instead of the real, production data.
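As a rough sketch of the project swap described above, the helper below builds the same query against either project. The project names and the academic_term entity come from the text; the dataset name context_store_entity and the function name are assumptions, so check your own BigQuery environment for the real dataset name.

```python
# Assumption: the context store entities live in a dataset named
# "context_store_entity"; verify the real name in your BigQuery environment.
DATASET = "context_store_entity"


def academic_term_query(project):
    """Build a query listing academic terms from the given GCP project."""
    return f"SELECT * FROM `{project}.{DATASET}.academic_term`"


production_sql = academic_term_query("udp-unl-prod")
synthetic_sql = academic_term_query("unizin-shared")
print(production_sql)
print(synthetic_sql)
```

Only the project portion of the table reference changes between the two queries; both would be run from the same production BigQuery environment.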

In order to see all of the available datasets and tables in the synthetic data living in unizin-shared, this query can be run from the appropriate BigQuery environment:

SELECT
  table_catalog as PROJECT_ID,
  table_schema as DATASET_ID,
  table_name as TABLE_ID
FROM
  `unizin-shared.region-us`.INFORMATION_SCHEMA.TABLES

To get more granular column and schema information about a particular table, run the following query:

-- replace {DATASET_ID} and {TABLE_ID} with proper values
SELECT 
  *
FROM
  `unizin-shared.{DATASET_ID}`.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
WHERE
  table_name = "{TABLE_ID}"

Please note: If you are attempting to query synthetic data using a service account (.json key file), you will need to update your application's settings to reference udp-<tenant>-prod to return results.
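For a service account setup, this might look like the sketch below, assuming the google-cloud-bigquery Python client library is installed. The key file path, tenant value, and helper names are hypothetical; the point is only that the client's project is set to the tenant's production project while the query itself names unizin-shared.

```python
def billing_project(tenant):
    """Project the client should run jobs in: the tenant's production project."""
    return f"udp-{tenant}-prod"


def make_client(key_path, tenant):
    """Create a BigQuery client from a service account key file.

    Requires the google-cloud-bigquery package. The imports are done lazily
    so billing_project() can be used without that package installed.
    """
    from google.cloud import bigquery
    from google.oauth2 import service_account

    credentials = service_account.Credentials.from_service_account_file(key_path)
    return bigquery.Client(project=billing_project(tenant), credentials=credentials)


# Usage sketch (hypothetical key path and tenant):
# client = make_client("service-account.json", "unl")
# sql = "SELECT table_name FROM `unizin-shared.region-us`.INFORMATION_SCHEMA.TABLES"
# for row in client.query(sql).result():
#     print(row.table_name)
print(billing_project("unl"))
```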
