UDP Distributions

The udp_distributions.context_store_distributions mart defines the distributions of categorical and continuous fields in the UDP context store. The distribution of a categorical field is defined as the proportion of records associated with each possible field value to the total records for the field. The distribution of a continuous field is defined as the average, median, standard deviation, minimum, and maximum value of the field. Some categorical distributions are defined with respect to other fields, to account for relationships between fields. Distributions are defined for most fields and entities in the UDP context store.

BQ Prod Dataset Location

udp_distributions

Schema

udp_distributions.context_store_distributions

Field (Type) - Description

entity (STRING) - The entity that includes the field for which the distribution is defined.
field_name (STRING) - The field for which the distribution is defined.
field_value (STRING) - For categorical distributions, a possible value of the field. Null for continuous distributions.
distribution_type (STRING) - Whether the distribution is categorical or continuous.
avg_field_value (NUMERIC) - For continuous distributions, the average of the values associated with the field. Null for categorical distributions.
median_field_value (NUMERIC) - For continuous distributions, the median of the values associated with the field. Null for categorical distributions.
sd_field_value (NUMERIC) - For continuous distributions, the standard deviation of the values associated with the field. Null for categorical distributions.
min_field_value (NUMERIC) - For continuous distributions, the minimum of the values associated with the field. Null for categorical distributions.
max_field_value (NUMERIC) - For continuous distributions, the maximum of the values associated with the field. Null for categorical distributions.
field_source (STRING) - Whether the field is sourced directly from the UCDM or is a custom field. Currently, this field will always be 'UCDM'.
num_records (INTEGER) - For categorical distributions, the number of records associated with the field value. Null for continuous distributions.
pct_records (FLOAT) - For categorical distributions, the proportion of records for that field value relative to the number of records for all possible values of that field. Null for continuous distributions.
with_respect_to_field_name_1 (STRING) - If the distribution is defined with respect to other fields, the first field that the distribution may be defined with respect to.
with_respect_to_field_name_2 (STRING) - If the distribution is defined with respect to other fields, the second field that the distribution may be defined with respect to.
with_respect_to_field_name_3 (STRING) - If the distribution is defined with respect to other fields, the third field that the distribution may be defined with respect to.
with_respect_to_field_value_1 (STRING) - If the distribution is defined with respect to other fields, a possible value for the first field that the distribution may be defined with respect to.
with_respect_to_field_value_2 (STRING) - If the distribution is defined with respect to other fields, a possible value for the second field that the distribution may be defined with respect to.
with_respect_to_field_value_3 (STRING) - If the distribution is defined with respect to other fields, a possible value for the third field that the distribution may be defined with respect to.
number_of_dependencies (INTEGER) - The number of dependencies of the distribution. If the distribution is defined with respect to no other fields, the number of dependencies is zero; if the distribution is defined with respect to one other field, the number of dependencies is one; and so on. Currently, the maximum possible number of dependencies is three.

Categorical Distributions

In this mart, the distribution of a categorical field is defined as the proportion of records associated with each possible value of the field relative to the total number of records for the field. For example, a boolean field has three possible values: true, false, and null. The proportions of records associated with these three values make up the distribution of that field.
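The definition above can be sketched in a few lines of Python. This is only an illustration, not the mart's actual implementation: the function name and sample data are hypothetical, None stands in for a null value, and the proportion is assumed to be stored as a fraction between 0 and 1 (the source does not say whether pct_records is a fraction or a percentage).

```python
from collections import Counter


def categorical_distribution(values):
    """Return {field_value: (num_records, pct_records)} for a list of values.

    None stands in for SQL NULL and is counted as its own category.
    pct_records is expressed as a fraction of all records (an assumption).
    """
    total = len(values)
    if total == 0:
        return {}
    counts = Counter(values)
    return {value: (n, n / total) for value, n in counts.items()}


# Hypothetical boolean field with 10 records: 6 true, 3 false, 1 null.
sample = [True] * 6 + [False] * 3 + [None]
dist = categorical_distribution(sample)
```

Each key of the result corresponds to one field_value row, and the proportions across all keys sum to one.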

Some categorical distributions are defined with respect to other fields; these distributions have a non-zero number_of_dependencies value. Dependent distributions are included in the mart so that important relationships between fields are preserved. For example, role_status in course_section_enrollment has a distribution with one dependency (role), and enrollment_status, in turn, has a distribution with two dependencies (role_status and role). For distributions with dependencies, the proportion of records is no longer computed against the total number of records for the field. Instead, it is the number of records with each possible field value divided by the number of records that share the same values of the related fields.
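One way to read the dependent case is as a proportion computed within each combination of the related fields' values. The sketch below illustrates that reading; it is an assumption about the calculation, not the mart's code, and the function name, record layout, and sample values are all hypothetical.

```python
from collections import Counter


def dependent_distribution(records, field, with_respect_to):
    """Proportion of each `field` value within each combination of
    `with_respect_to` values.

    records: list of dicts; with_respect_to: list of field names.
    Returns {(dependency_values, field_value): (num_records, pct_records)}.
    """
    group_totals = Counter()  # records per combination of dependency values
    cell_counts = Counter()   # records per (combination, field value)
    for rec in records:
        group = tuple(rec.get(name) for name in with_respect_to)
        group_totals[group] += 1
        cell_counts[(group, rec.get(field))] += 1
    return {
        (group, value): (n, n / group_totals[group])
        for (group, value), n in cell_counts.items()
    }


# Hypothetical enrollment records; the values are made up for illustration.
enrollments = (
    [{"role": "Student", "role_status": "Enrolled"}] * 3
    + [{"role": "Student", "role_status": "Dropped"}] * 1
    + [{"role": "Teacher", "role_status": "Enrolled"}] * 2
)
dist = dependent_distribution(enrollments, "role_status", ["role"])
```

Under this reading, the proportions sum to one within each value of role rather than across the whole field. Passing two names in with_respect_to gives the two-dependency case.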

Continuous Distributions

Distributions of continuous fields are defined as the average, median, standard deviation, minimum, and maximum values of the field. When defining the distributions for this mart, we wanted them to serve both as an accurate reflection of the UDP data and as a useful representation of learners and learning behavior. To achieve the latter, we remove some outliers before calculating the continuous distribution metrics. We detect possible outlier values for a field using the interquartile range (IQR), which is the range between the first quartile (Q1) and the third quartile (Q3), that is, IQR = Q3 - Q1. We remove any values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
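The outlier rule and the five metrics can be sketched with Python's standard library. Treat this as an approximation under stated assumptions: the source does not say which quartile method or which standard deviation (sample or population) the mart uses, so the default "exclusive" method of statistics.quantiles and the sample standard deviation are stand-ins. The function name and sample scores are hypothetical.

```python
import statistics


def continuous_distribution(values):
    """Summary statistics after dropping values outside the IQR fences.

    Needs at least two values that survive the filter. The quartile method
    and the sample standard deviation are assumptions.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    kept = [v for v in values if lower <= v <= upper]
    return {
        "avg_field_value": statistics.mean(kept),
        "median_field_value": statistics.median(kept),
        "sd_field_value": statistics.stdev(kept),
        "min_field_value": min(kept),
        "max_field_value": max(kept),
    }


# Hypothetical scores with one obvious outlier (1000).
scores = [70, 72, 75, 78, 80, 82, 85, 88, 90, 1000]
stats = continuous_distribution(scores)
```

In this sample the value 1000 falls above the upper fence and is excluded, so the maximum is taken from the remaining nine scores.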

Staging Helper Tables

Three other tables exist in the udp_distributions dataset:

categorical_distributions - fields from the UCDM, without dependency on other fields, that describe the proportion of categorical values (e.g. role, grade_state, completion_type)
continuous_distributions - fields from the UCDM that are numeric and of a continuous nature (e.g. score, weight, time_limit)
categorical_dependent_distributions - categorical fields from the UCDM that depend on other fields (e.g. role_status, which depends on role; grade_on_official_transcript, which depends on grading_basis; is_anonymous_peer_reviews, which depends on has_peer_reviews)

The schema of all three staging tables matches the schema above. The final context_store_distributions table is the union of these three tables.
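Because the three staging tables share one schema, the union amounts to stacking their rows. The sketch below models each table as a list of dicts. It assumes every row is kept (the source does not say whether duplicates are removed), and the helper names, the sample rows, and the stored distribution_type strings are hypothetical.

```python
# Column names taken from the schema above.
COLUMNS = (
    "entity", "field_name", "field_value", "distribution_type",
    "avg_field_value", "median_field_value", "sd_field_value",
    "min_field_value", "max_field_value", "field_source",
    "num_records", "pct_records",
    "with_respect_to_field_name_1", "with_respect_to_field_name_2",
    "with_respect_to_field_name_3", "with_respect_to_field_value_1",
    "with_respect_to_field_value_2", "with_respect_to_field_value_3",
    "number_of_dependencies",
)


def make_row(**known):
    """Build a row with every schema column, defaulting to None (null)."""
    return {**dict.fromkeys(COLUMNS), **known}


def union_all(*tables):
    """Stack rows from all tables, checking each row against the schema."""
    combined = []
    for table in tables:
        for row in table:
            if set(row) != set(COLUMNS):
                raise ValueError("row does not match the shared schema")
            combined.append(row)
    return combined


# Hypothetical one-row stand-ins for the three staging tables.
categorical = [make_row(field_name="role", distribution_type="categorical",
                        number_of_dependencies=0)]
continuous = [make_row(field_name="score", distribution_type="continuous",
                       number_of_dependencies=0)]
dependent = [make_row(field_name="role_status", distribution_type="categorical",
                      with_respect_to_field_name_1="role",
                      number_of_dependencies=1)]
combined = union_all(categorical, continuous, dependent)
```

Columns that do not apply to a given distribution type stay null, which matches the null behavior described in the schema.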

