UDP Distributions
The udp_distributions.context_store_distributions mart defines the distributions of categorical and continuous fields in the UDP context store. The distribution of a categorical field is defined as the proportion of records associated with each possible field value to the total records for the field. The distribution of a continuous field is defined as the average, median, standard deviation, minimum, and maximum value of the field. Some categorical distributions are defined with respect to other fields, to account for relationships between fields. Distributions are defined for most fields and entities in the UDP context store.
BQ Prod Dataset Location
udp_distributions
Schema
udp_distributions.context_store_distributions
Field | Type | Description |
---|---|---|
entity | STRING | The entity that includes the field for which the distribution is defined. |
field_name | STRING | The field for which the distribution is defined. |
field_value | STRING | For categorical distributions, a possible value of the field. Null for continuous distributions. |
distribution_type | STRING | Whether the distribution is categorical or continuous. |
avg_field_value | NUMERIC | For continuous distributions, the average of the values associated with the field. Null for categorical distributions. |
median_field_value | NUMERIC | For continuous distributions, the median of the values associated with the field. Null for categorical distributions. |
sd_field_value | NUMERIC | For continuous distributions, the standard deviation of the values associated with the field. Null for categorical distributions. |
min_field_value | NUMERIC | For continuous distributions, the minimum of the values associated with the field. Null for categorical distributions. |
max_field_value | NUMERIC | For continuous distributions, the maximum of the values associated with the field. Null for categorical distributions. |
field_source | STRING | Whether the field is sourced directly from the UCDM or is a custom field. Currently, this field will always be 'UCDM'. |
num_records | INTEGER | For categorical distributions, the number of records associated with the field value. Null for continuous distributions. |
pct_records | FLOAT | For categorical distributions, the proportion of records for that field value relative to the number of records for all possible values of that field. Null for continuous distributions. |
with_respect_to_field_name_1 | STRING | If the distribution is defined with respect to other fields, the first field that the distribution may be defined with respect to. |
with_respect_to_field_name_2 | STRING | If the distribution is defined with respect to other fields, the second field that the distribution may be defined with respect to. |
with_respect_to_field_name_3 | STRING | If the distribution is defined with respect to other fields, the third field that the distribution may be defined with respect to. |
with_respect_to_field_value_1 | STRING | If the distribution is defined with respect to other fields, a possible value for the first field that the distribution may be defined with respect to. |
with_respect_to_field_value_2 | STRING | If the distribution is defined with respect to other fields, a possible value for the second field that the distribution may be defined with respect to. |
with_respect_to_field_value_3 | STRING | If the distribution is defined with respect to other fields, a possible value for the third field that the distribution may be defined with respect to. |
number_of_dependencies | INTEGER | The number of dependencies of the distribution. If the distribution is defined with respect to no other fields, the number of dependencies is zero; if the distribution is defined with respect to one other field, the number of dependencies is one; and so on. Currently, the maximum possible number of dependencies is three. |
Categorical Distributions
In this mart, the distribution of a categorical field is defined as the proportion of records associated with each possible value of the field relative to the total number of records for the field. For example, a boolean field has three possible values: true, false, and null. The proportions of records associated with each of these three values make up the distribution of that field.
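As a minimal sketch of this definition, assuming a small in-memory list of values rather than the actual context store tables, the distribution of a boolean field could be computed as shown here. The function name `categorical_distribution` and the sample data are hypothetical and are not part of the mart; the output names mirror the num_records and pct_records columns.

```python
from collections import Counter


def categorical_distribution(values):
    """Return {field_value: (num_records, pct_records)} for a list of values."""
    counts = Counter(values)
    total = len(values)
    return {value: (n, n / total) for value, n in counts.items()}


# Hypothetical boolean field with true, false, and null (None) values.
sample = [True, True, False, None, True, False, None, True]
dist = categorical_distribution(sample)

for field_value, (num_records, pct_records) in dist.items():
    print(field_value, num_records, pct_records)
```

The proportions across all possible values of the field sum to one.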
Some categorical distributions are defined with respect to other fields. These distributions have a non-zero number_of_dependencies value. These dependent distributions are included in the mart to make sure important relationships between fields are maintained. For example, role_status in course_section_enrollment has a distribution with one dependency, with respect to role, and enrollment_status, in turn, has a distribution with two dependencies, with respect to role_status and role. For distributions with dependencies, the proportion of records is no longer the proportion of the number of records for each possible field value to the total number of records for the field. Instead, it is the proportion of the number of records of each possible field value to the number of records of each possible value for the related fields.
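The difference between a dependent and an independent distribution can be illustrated with a short sketch. It assumes records held as Python dictionaries; the function name `dependent_distribution` and the sample values for role and role_status are hypothetical and do not come from the mart. The denominator is the number of records in each combination of related-field values, not the total number of records.

```python
from collections import Counter


def dependent_distribution(records, field, with_respect_to):
    """Proportion of each field value within each combination of related-field values."""
    # Denominator: number of records per combination of related-field values.
    group_totals = Counter(
        tuple(r[f] for f in with_respect_to) for r in records
    )
    # Numerator: number of records per (combination, field value) pair.
    value_counts = Counter(
        (tuple(r[f] for f in with_respect_to), r[field]) for r in records
    )
    return {
        (group, value): n / group_totals[group]
        for (group, value), n in value_counts.items()
    }


# Hypothetical records; the role and role_status values are made up.
records = [
    {"role": "Student", "role_status": "Enrolled"},
    {"role": "Student", "role_status": "Enrolled"},
    {"role": "Student", "role_status": "Dropped"},
    {"role": "Teacher", "role_status": "Enrolled"},
]
dist = dependent_distribution(records, "role_status", ["role"])

for (group, value), pct in dist.items():
    print(group, value, pct)
```

In this sketch the proportions sum to one within each value of role, rather than across the whole field.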
Continuous Distributions
Distributions of continuous fields are defined as the average, median, standard deviation, minimum, and maximum values of the field. When defining the distributions for this mart, we wanted them to serve both as an accurate reflection of how the UDP data looks and as a useful representation of learners and learning behavior. To achieve the latter, we remove some outliers from the calculation of the continuous distribution metrics. We detect possible outlier values for a field using the interquartile range (IQR), which is the range between the first quartile (Q1) and the third quartile (Q3). We remove any values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
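The outlier rule and the five summary metrics can be sketched with Python's standard `statistics` module. This is an illustration under stated assumptions, not the mart's implementation: the source does not say which quartile method or which standard deviation (sample or population) is used, so the inclusive quartile method and the sample standard deviation are assumptions here. The function name `continuous_distribution` and the sample scores are hypothetical.

```python
import statistics


def continuous_distribution(values):
    """Summarize values after removing IQR outliers."""
    # Assumption: inclusive quartile method; the mart's method is not documented.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    kept = [v for v in values if lower <= v <= upper]
    return {
        "avg_field_value": statistics.mean(kept),
        "median_field_value": statistics.median(kept),
        # Assumption: sample standard deviation (needs at least two kept values).
        "sd_field_value": statistics.stdev(kept),
        "min_field_value": min(kept),
        "max_field_value": max(kept),
    }


# Hypothetical scores with one extreme value.
scores = [70, 72, 75, 78, 80, 82, 85, 88, 90, 400]
summary = continuous_distribution(scores)
print(summary)
```

With these sample scores, the value 400 lies above Q3 + 1.5 * IQR and is excluded before the metrics are computed, so the maximum reported is 90.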
Staging Helper Tables
Three other tables exist in the udp_distributions dataset:
- categorical_distributions - fields from the UCDM, without dependency on other fields, that describe the proportion of categorical values (e.g. role, grade_state, completion_type, etc.)
- continuous_distributions - fields from the UCDM that are numeric and of a continuous nature (e.g. score, weight, time_limit, etc.)
- categorical_dependent_distributions - categorical fields from the UCDM that have dependency on other fields (e.g. role_status that depends on role, grade_on_official_transcript that depends on grading_basis, is_anonymous_peer_reviews that depends on has_peer_reviews)
The schema for all three of these staging tables matches the schema above. These three tables are created as staging tables, and the final context_store_distributions table is their union.
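Because the staging tables share one schema, combining them is a row-wise concatenation. A minimal sketch, assuming each table is a list of dictionaries with an abbreviated set of columns; the helper `union_tables`, the sample rows, and the literal distribution_type strings are hypothetical:

```python
def union_tables(*tables):
    """Concatenate staging tables after checking that they share one schema."""
    columns = None
    combined = []
    for table in tables:
        for row in table:
            if columns is None:
                columns = set(row)
            elif set(row) != columns:
                raise ValueError("staging tables must share the same schema")
            combined.append(row)
    return combined


# Hypothetical rows; only a few of the schema's columns are shown.
categorical = [
    {"field_name": "role", "distribution_type": "categorical", "number_of_dependencies": 0},
]
continuous = [
    {"field_name": "score", "distribution_type": "continuous", "number_of_dependencies": 0},
]
categorical_dependent = [
    {"field_name": "role_status", "distribution_type": "categorical", "number_of_dependencies": 1},
]

context_store_distributions = union_tables(
    categorical, continuous, categorical_dependent
)
print(len(context_store_distributions))
```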