Links

The links mart presents extracted links from content in the LMS. This mart is based on the 'Links Datamart' developed by Jason Heffner at Pennsylvania State University. This mart employs the Javascript libraries cheerio.js and URI.js. The cheerio.js library is used to parse the HTML of the content. URLs included in the HTML are extracted, as well as attributes and contextual information about the HTML element. The URI.js library is used to parse the URL, extracting components such as the host or path of the URL.

Jason's own overview of the Links datamart is as follows:

The links data mart extracts and catalogs every URL found in Canvas course content (Pages, Assignments, Discussions, Modules, Announcements) and gives us a centralized view of link usage across courses.

It was originally built to help with transitions between services, like updating to a new version of Kaltura’s API or larger migrations from Box to OneDrive. It’s been a key tool for identifying and updating embedded links in Canvas content during these kinds of changes.

Instead of relying on regex or basic text matching, it walks the DOM and captures the full picture: the URL, the link text, the context around it and how it was embedded, whether as inline text, a button, an iframe, etc. That extra detail makes a big difference when trying to understand how the link is being used.

When we’re working through a platform transition or deprecation, we can quickly:

  1. Search by hostname (like *.box.com)

  2. Identify embedded tools that need migration

  3. Audit identified links that are broken, outdated, or still actively used

It’s taken a lot of the manual work and uncertainty out of link migrations, across thousands of Canvas pages and different content types, and helped ensure that content keeps working for students without surprises.

BQ Prod Dataset Locations

  • mart_general

Interactive Mart Dependency Diagram

The following visualization shows the construction of this data mart defined in the UDP marts repository. More information on the repository and diagram can be found on this marts page.

mart_general.links

Schema

Field name
Type
Description

udp_link_id

STRING

An unique ID of the link, generated by combining the UDP content ID, the occurrence number, the content type, and the UDP course offering ID.

lms_link_id

STRING

An unique ID of the link, generated by combining the LMS content ID, the occurrence number, the content type, and the LMS course offering ID.

udp_content_id

INTEGER

The UDP ID of the content that contains the link.

lms_content_id

STRING

The LMS ID of the content that contains the link.

udp_parent_content_id

INTEGER

The UDP ID of the parent of the content. May be null if the content does not have a parent.

lms_parent_content_id

STRING

The LMS ID of the parent of the content. May be null if the content does not have a parent.

udp_course_offering_id

INTEGER

The UDP ID of the course offering.

lms_course_offering_id

STRING

The LMS ID of the course offering.

sis_course_offering_id

STRING

The SIS ID of the course offering.

udp_person_id

INTEGER

The UDP ID of the person if a person is associated with the content.

lms_person_id

STRING

The LMS ID of the person if a person is associated with the content.

sis_person_id

STRING

The SIS ID of the person if a person is associated with the content.

content_type

STRING

The type of the content associated with the link, i.e. 'learner_activity', 'discussion', etc.

parent_content_type

STRING

The type of the parent content, i.e. 'learner_activity', 'quiz', etc. May be null if the content does not have a parent.

content

STRING

The full HTML text of the content.

content_name

STRING

The name or title of the content.

status

STRING

The status of the content, i.e. 'active', 'deleted', 'published', etc.

is_active

BOOLEAN

A boolean field indicating if the status of the content is 'active', 'published', 'available', or 'post_delayed'.

created_date

DATETIME

The date the content was created.

updated_date

DATETIME

The date the content was last updated.

url

STRING

The extracted URL from the content.

context

STRING

The HTML element that contains the URL.

word_count

INTEGER

The number of words in the text content of the HTML element.

char_count

INTEGER

The number of characters in the text content of the HTML element.

usage

STRING

How the link is used in the content, i.e. 'hyperlink', 'embed', 'image', etc.

scheme

STRING

The scheme of the URL, i.e. 'http' or 'https'.

host

STRING

The host or domain name of the URL, ie. 'example.com'.

path

STRING

The path of the URL, i.e. 'path/to/resource'.

query_string

STRING

The query or search string of the URL, i.e. 'key=value&anotherKey=anotherValue'.

file_extension

STRING

If the link contains a file name, the extension of the file, i.e. 'html' or 'pdf'.

tag

STRING

The tag of the HTML element, i.e. 'a', 'iframe', 'embed', etc.

attribute

`STRING

The attribute of the HTML element, i.e. 'src' or 'href'.

classes

STRING

The class of the HTML element.

is_safe_link

BOOLEAN

A boolean field that indicates if the URL is a safe link.

is_canvas

BOOLEAN

A boolean field that indicates if the URL is a Canvas URL.

is_shortener

BOOLEAN

A boolean field that indicates if the URL is a shortened URL.

occurrence

INTEGER

The occurrence number of the link in the content.

service_name

STRING

The name of the service linked to in the URL.

Javascript Usage

As mentioned in the introduction, this mart employs Javascript libraries to extract information from HTML and URLs. This is done using the Javascript UDFs functionality in BigQuery. The OPTIONS section in the UDF references the Javascript libraries. The libraries are publicly available resources:

  • "gs://assets.public.unizin.org/udp-marts/links-datamart/cheerio.js"

  • "gs://assets.public.unizin.org/udp-marts/links-datamart/uri.js"

Using BigQuery SQL, the Javascript UDF is called on content from the LMS in order to extract links and additional contextual information.

Content Sources

This mart extracts URLs from content in the LMS. The content is defined as any HTML text fields associated with tools in the LMS. The content types are defined based on the LMS tool. The possible content types and the corresponding UDP fields of the content are:

  • annotation

    • annotation.message

  • conversation_message

    • conversation_message.body

  • discussion

    • discussion.body

  • discussion_entry

    • discussion_entry.body

  • learner_activity

    • learner_activity.description

  • learner_activity_result

    • learner_activity_result.body

  • learning_outcome

    • learning_outcome.description

  • learning_outcome_group

    • learning_outcome_group.description

  • learning_outcome_rubric_criteria

    • learning_outcome_rubric_criteria.description

  • module_item

    • module_item.url

  • quiz

    • quiz.description

  • quiz_item

    • quiz_item.body

  • quiz_item_response

    • quiz_item_response.body

  • syllabus

    • course_offering.syllabus_content

  • wiki

    • wiki.front_page_url

  • wiki_page

    • wiki_page.body

The udp_link_id and lms_link_id fields do not correspond to any existing IDs in the UDP or Canvas. They are unique identifiers generated for the links and are defined based on other fields included in this mart. The udp_link_id field is defined as:

  CONCAT(CAST(cu.udp_content_id AS STRING),'-',CAST(cu.occurrence AS STRING),'-',cu.content_type,'-',cu.udp_course_offering_id) as udp_link_id,

or the combination of udp_content_id, occurrence, content_type, and udp_course_offering_id. Similarly, lms_link_id is defined as the combination of lms_content_id, occurrence, content_type, and lms_course_offering_id.

References

  • Cheerio.js - Javascript library used to parse and extract elements of HTML.

  • URI.js - Javascript library used to parse and extract elements of URLs.

  • url-shorteners - Project with list of common URL shortener domains used to define the is_shortener field.

  • file-extensions-list - Project with list of common file extensions used to define the file_extension field.

Last updated