UDP Distributions

The udp_distributions.context_store_distributions mart describes the distributions of categorical and continuous fields in the UDP context store. The distribution of a categorical field is the proportion of the field's records that take each possible value. The distribution of a continuous field is summarized by its average, median, standard deviation, minimum, and maximum. Some categorical distributions are defined with respect to other fields, to preserve relationships between fields. Distributions are defined for most fields and entities in the UDP context store.

BQ Prod Dataset Location

udp_distributions

Schema

udp_distributions.context_store_distributions

Categorical Distributions

In this mart, the distribution of a categorical field is the number of records associated with each possible value of the field, divided by the total number of records for the field. For example, a boolean field has three possible values: true, false, and null. The proportions of records associated with these three values make up the distribution of that field.
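The calculation described above can be sketched in a few lines of Python. This is only an illustration of the definition, not the mart's actual implementation: the function name and the sample is_active field are hypothetical, and null is represented by Python's None.

```python
from collections import Counter

def categorical_distribution(values):
    """Return {value: proportion of records} for one field.

    None (null) is counted as its own value, like true and false.
    """
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

# Hypothetical boolean field with six records (illustrative data only).
is_active = [True, True, True, False, None, None]
distribution = categorical_distribution(is_active)

for value, proportion in distribution.items():
    print(value, round(proportion, 3))
```

Because every record falls under exactly one value, the proportions for a field sum to 1.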

Some categorical distributions are defined with respect to other fields. These distributions have a non-zero number_of_dependencies value, and they are included in the mart so that important relationships between fields are preserved. For example, in course_section_enrollment, role_status has a distribution with one dependency (role), and enrollment_status has a distribution with two dependencies (role_status and role). For a distribution with dependencies, the proportion is no longer computed against the total number of records for the field. Instead, it is the number of records with each possible field value, divided by the number of records that share the same values of the related fields.
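A minimal Python sketch of a dependent distribution follows, assuming the proportion is computed within each combination of values of the related fields. The function name, the record layout, and the sample values ("Learner", "Enrolled", and so on) are hypothetical and chosen only for illustration.

```python
from collections import Counter

def dependent_distribution(records, field, depends_on):
    """Proportion of each `field` value within each combination of
    `depends_on` values. len(depends_on) plays the role of
    number_of_dependencies.
    """
    # Number of records for each combination of the related fields.
    group_counts = Counter(tuple(r[d] for d in depends_on) for r in records)
    # Number of records for each (related-field combination, field value) pair.
    joint_counts = Counter(
        (tuple(r[d] for d in depends_on), r[field]) for r in records
    )
    return {
        (group, value): count / group_counts[group]
        for (group, value), count in joint_counts.items()
    }

# Hypothetical enrollment records (illustrative data only).
records = [
    {"role": "Learner", "role_status": "Enrolled"},
    {"role": "Learner", "role_status": "Enrolled"},
    {"role": "Learner", "role_status": "Dropped"},
    {"role": "Instructor", "role_status": "Enrolled"},
]
result = dependent_distribution(records, "role_status", ["role"])
```

Under this reading, the proportions sum to 1 within each value of role rather than across the whole field.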

Continuous Distributions

Distributions of continuous fields are defined as the average, median, standard deviation, minimum, and maximum values of the field. The distributions in this mart are meant to serve both as an accurate reflection of the UDP data and as a useful representation of learners and learning behavior. To achieve the latter, we remove some outliers before calculating the continuous distribution metrics. We detect possible outliers for a field using the interquartile range (IQR), which is the difference between the third quartile (Q3) and the first quartile (Q1), that is, IQR = Q3 - Q1. We remove any values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
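The outlier rule and the five metrics can be illustrated with Python's standard statistics module. This is a sketch under stated assumptions: the quartile interpolation method (the module's default) and the use of the sample standard deviation are choices made here, and the mart's own calculation may differ. The function name and the scores data are hypothetical.

```python
import statistics

def continuous_distribution(values):
    """Summarize a numeric field after removing IQR outliers."""
    # Quartiles using the statistics module's default method (an assumption).
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Keep only values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    kept = [v for v in values if lower <= v <= upper]
    return {
        "average": statistics.mean(kept),
        "median": statistics.median(kept),
        # Sample standard deviation (an assumption).
        "standard_deviation": statistics.stdev(kept),
        "minimum": min(kept),
        "maximum": max(kept),
    }

# Hypothetical scores with one extreme value (illustrative data only).
scores = [70, 72, 75, 78, 80, 82, 85, 88, 90, 400]
summary = continuous_distribution(scores)
print(summary)
```

In this sample the value 400 lies above Q3 + 1.5*IQR, so it is excluded and the reported maximum is 90.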

Staging Helper Tables

Three other tables exist in the udp_distributions dataset:

  • categorical_distributions - categorical fields from the UCDM that have no dependency on other fields; the table describes the proportion of each categorical value (e.g. role, grade_state, completion_type)

  • continuous_distributions - fields from the UCDM that are numeric and continuous in nature (e.g. score, weight, time_limit)

  • categorical_dependent_distributions - categorical fields from the UCDM that have dependency on other fields (e.g. role_status that depends on role, grade_on_official_transcript that depends on grading_basis, is_anonymous_peer_reviews that depends on has_peer_reviews)

The schema of all three staging tables matches the schema above. The final context_store_distributions table is the union of these three tables.


Copyright © 2023, Unizin, Ltd.