# UDP Distributions

Copyright © 2023, Unizin, Ltd.

The *udp_distributions.context_store_distributions* mart defines the distributions of categorical and continuous fields in the UDP context store. The distribution of a categorical field is the proportion of records associated with each possible field value relative to the total number of records for the field. The distribution of a continuous field is the average, median, standard deviation, minimum, and maximum value of the field. Some categorical distributions are defined with respect to other fields, to account for relationships between fields. Distributions are defined for most fields and entities in the UDP context store.

BQ Prod Dataset Location

*udp_distributions*

Schema

udp_distributions.context_store_distributions

Field | Type | Description |
---|---|---|

Categorical Distributions

In this mart, the distribution of a categorical field is the number of records associated with each possible value of the field, expressed as a proportion of the total number of records for the field. For example, a boolean field has three possible values: *true*, *false*, and *null*. The proportions of records associated with these three values make up the distribution of that field.
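The definition above can be sketched in a few lines of Python. This is only an illustration, not the mart's actual build logic: the function name `categorical_distribution` and the sample records are hypothetical, and `None` stands in for a SQL null.

```python
from collections import Counter


def categorical_distribution(values):
    """Return {value: (num_records, pct_records)} for one field.

    None (standing in for SQL NULL) is counted as its own category,
    so a boolean field can yield up to three entries.
    """
    total = len(values)
    if total == 0:
        return {}
    counts = Counter(values)
    return {value: (n, n / total) for value, n in counts.items()}


# Hypothetical boolean field with five records.
records = [True, True, False, None, True]
dist = categorical_distribution(records)
# dist[True] is (3, 0.6); dist[False] and dist[None] are each (1, 0.2)
```

The proportions for all values of a field sum to 1, which is a convenient sanity check when reading rows for a single field.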

Some categorical distributions are defined with respect to other fields. These distributions have a non-zero *number_of_dependencies* value. They are included in the mart to preserve important relationships between fields. For example, *role_status* in *course_section_enrollment* has a distribution with one dependency, with respect to *role*, and *enrollment_status*, in turn, has a distribution with two dependencies, with respect to *role_status* and *role*. For a distribution with dependencies, the proportion of records is no longer the number of records for each possible field value relative to the total number of records for the field. Instead, it is the number of records for each possible field value relative to the number of records with the given values of the related fields.
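As a rough sketch of how a dependent distribution differs, the Python below divides each count by the total for its group of related-field values rather than by the overall total. The function name, the row layout, and the role and status values are hypothetical and are not taken from the UCDM option sets.

```python
from collections import Counter


def dependent_distribution(rows, field, with_respect_to):
    """Proportion of each `field` value within each combination of
    values of the `with_respect_to` fields."""
    group_totals = Counter()
    pair_counts = Counter()
    for row in rows:
        group = tuple(row[name] for name in with_respect_to)
        group_totals[group] += 1
        pair_counts[(group, row[field])] += 1
    return {
        (group, value): n / group_totals[group]
        for (group, value), n in pair_counts.items()
    }


# Hypothetical enrollment records: role_status depends on role.
rows = [
    {"role": "Student", "role_status": "Enrolled"},
    {"role": "Student", "role_status": "Enrolled"},
    {"role": "Student", "role_status": "Dropped"},
    {"role": "Teacher", "role_status": "Enrolled"},
]
dist = dependent_distribution(rows, "role_status", ["role"])
```

Here `Enrolled` accounts for two of the three `Student` records and for the only `Teacher` record, so the proportions sum to 1 within each role rather than across the whole field.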

Continuous Distributions

Distributions of continuous fields are defined as the average, median, standard deviation, minimum, and maximum values of the field. When defining the distributions for this mart, we wanted them to serve both as an accurate reflection of how the UDP data looks and as a useful representation of learners and learning behavior. To achieve the latter, we remove some outliers from the calculation of the continuous distribution metrics. We detect possible outlier values for a field using the interquartile range (IQR), which is the range between the first quartile (Q1) and the third quartile (Q3). We remove any values below `Q1-1.5*IQR` or above `Q3+1.5*IQR`.
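The outlier rule and the five summary metrics can be illustrated with Python's standard `statistics` module. This is a sketch under stated assumptions: the quartile interpolation method (`method="inclusive"`) and the use of the sample standard deviation are choices made for the example and may differ from how the mart computes them, and the scores are made up.

```python
import statistics


def continuous_distribution(values):
    """Summarize a continuous field after dropping IQR outliers.

    Requires at least two values. Quartile method and sample standard
    deviation are assumptions made for this illustration.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Keep only values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    kept = [v for v in values if low <= v <= high]
    return {
        "avg_field_value": statistics.mean(kept),
        "median_field_value": statistics.median(kept),
        "sd_field_value": statistics.stdev(kept) if len(kept) > 1 else 0.0,
        "min_field_value": min(kept),
        "max_field_value": max(kept),
    }


# Hypothetical scores; 1000 lies far above Q3 + 1.5*IQR and is removed.
scores = [70, 75, 80, 85, 90, 95, 1000]
stats = continuous_distribution(scores)
```

With these values, Q1 is 77.5 and Q3 is 92.5, so the accepted range is 55 to 115 and the maximum reported for the field is 95 rather than 1000.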

Staging Helper Tables

Three other tables exist in the `udp_distributions` dataset:

- `categorical_distributions`: fields from the UCDM, without dependency on other fields, that describe the proportion of categorical values (e.g. `role`, `grade_state`, `completion_type`, etc.)
- `continuous_distributions`: fields from the UCDM that are numeric and of a continuous nature (e.g. `score`, `weight`, `time_limit`, etc.)
- `categorical_dependent_distributions`: categorical fields from the UCDM that depend on other fields (e.g. `role_status`, which depends on `role`; `grade_on_official_transcript`, which depends on `grading_basis`; and `is_anonymous_peer_reviews`, which depends on `has_peer_reviews`)

All three staging tables share the schema of `context_store_distributions`. The staging tables are created first, and the final `context_store_distributions` table is the union of the three.
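Because the staging tables share one schema, combining them amounts to stacking their rows. The sketch below mimics that in Python with lists of dictionaries; it assumes row-preserving `UNION ALL` semantics, which the source does not specify, and the rows show only a subset of columns with made-up values.

```python
def union_all(*tables):
    """Concatenate rows from tables that share a schema, keeping every
    row (assumed UNION ALL behavior, not confirmed by the source)."""
    combined = []
    for table in tables:
        combined.extend(table)
    return combined


# Hypothetical one-row staging tables (subset of columns, made-up values).
categorical = [{"field_name": "role", "distribution_type": "categorical",
                "number_of_dependencies": 0}]
continuous = [{"field_name": "score", "distribution_type": "continuous",
               "number_of_dependencies": 0}]
dependent = [{"field_name": "role_status", "distribution_type": "categorical",
              "number_of_dependencies": 1}]

context_store_distributions = union_all(categorical, continuous, dependent)
```

In the combined table, `distribution_type` and `number_of_dependencies` together indicate which kind of staging table a row resembles.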

Fields of udp_distributions.context_store_distributions:

Field | Type | Description |
---|---|---|
entity | STRING | The entity that includes the field for which the distribution is defined. |
field_name | STRING | The field for which the distribution is defined. |
field_value | STRING | For categorical distributions, a possible value of the field. *Null* for continuous distributions. |
distribution_type | STRING | Whether the distribution is categorical or continuous. |
avg_field_value | NUMERIC | For continuous distributions, the average of the values associated with the field. *Null* for categorical distributions. |
median_field_value | NUMERIC | For continuous distributions, the median of the values associated with the field. *Null* for categorical distributions. |
sd_field_value | NUMERIC | For continuous distributions, the standard deviation of the values associated with the field. *Null* for categorical distributions. |
min_field_value | NUMERIC | For continuous distributions, the minimum of the values associated with the field. *Null* for categorical distributions. |
max_field_value | NUMERIC | For continuous distributions, the maximum of the values associated with the field. *Null* for categorical distributions. |
field_source | STRING | Whether the field is sourced directly from the UCDM or is a custom field. Currently, this field will always be 'UCDM'. |
num_records | INTEGER | For categorical distributions, the number of records associated with the field value. *Null* for continuous distributions. |
pct_records | FLOAT | For categorical distributions, the number of records for that field value compared to the number of records for all possible values for that field. *Null* for continuous distributions. |
with_respect_to_field_name_1 | STRING | If the distribution is defined with respect to other fields, the first field that the distribution may be defined with respect to. |
with_respect_to_field_name_2 | STRING | If the distribution is defined with respect to other fields, the second field that the distribution may be defined with respect to. |
with_respect_to_field_name_3 | STRING | If the distribution is defined with respect to other fields, the third field that the distribution may be defined with respect to. |
with_respect_to_field_value_1 | STRING | If the distribution is defined with respect to other fields, a possible value for the first field that the distribution may be defined with respect to. |
with_respect_to_field_value_2 | STRING | If the distribution is defined with respect to other fields, a possible value for the second field that the distribution may be defined with respect to. |
with_respect_to_field_value_3 | STRING | If the distribution is defined with respect to other fields, a possible value for the third field that the distribution may be defined with respect to. |
number_of_dependencies | INTEGER | The number of dependencies of the distribution. If the distribution is defined with respect to no other fields, the number of dependencies is zero; if the distribution is defined with respect to one other field, the number of dependencies is one; and so on. Currently, the maximum possible number of dependencies is three. |