# Links

The *links* mart presents extracted links from content in the LMS. This mart is based on the 'Links Datamart' developed by Jason Heffner at Pennsylvania State University. This mart employs the Javascript libraries [cheerio.js](https://github.com/cheeriojs/cheerio) and [URI.js](https://github.com/medialize/URI.js). The *cheerio.js* library is used to parse the HTML of the content. URLs included in the HTML are extracted, as well as attributes and contextual information about the HTML element. The *URI.js* library is used to parse the URL, extracting components such as the host or path of the URL.&#x20;

Jason's own overview of the Links datamart is as follows:

> The links data mart extracts and catalogs every URL found in Canvas course content (Pages, Assignments, Discussions, Modules, Announcements) and gives us a centralized view of link usage across courses.
>
> It was originally built to help with transitions between services, like updating to a new version of Kaltura’s API or larger migrations from Box to OneDrive. It’s been a key tool for identifying and updating embedded links in Canvas content during these kinds of changes.
>
> Instead of relying on regex or basic text matching, it walks the DOM and captures the full picture: the URL, the link text, the context around it and how it was embedded, whether as inline text, a button, an iframe, etc. That extra detail makes a big difference when trying to understand how the link is being used.
>
> When we’re working through a platform transition or deprecation, we can quickly:
>
> 1. Search by hostname (like \*.[box.com](http://box.com/))
> 2. Identify embedded tools that need migration
> 3. Audit identified links that are broken, outdated, or still actively used
>
> &#x20;\
> It’s taken a lot of the manual work and uncertainty out of link migrations, across thousands of Canvas pages and different content types, and helped ensure that content keeps working for students without surprises.

## **BQ Prod Dataset Locations**

* mart\_general

## Interactive Mart Dependency Diagram <a href="#user-content-schema" id="user-content-schema"></a>

The following visualization shows the construction of this data mart defined in the [UDP marts](https://gitlab.com/unizin-community/unizin-data-platform/marts) repository. More information on the repository and diagram can be found on [this marts page](https://resources.unizin.org/products/data-and-analytics/unizin-data-platform/data-stores/data-marts).

{% hint style="info" %}
[Click here to open the interactive chart.](https://assets.public.unizin.org/udp-marts/mart/mart__general__links.svg)
{% endhint %}

<figure><img src="https://3709019308-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FKYwtHNGgdPXS3PWAlZUr%2Fuploads%2Fme1lDZOIEHjYzDYksmNb%2Fmart__general__links.svg?alt=media&#x26;token=7f99a38c-dbd5-49f2-b6d4-9b89ec79ab7a" alt=""><figcaption><p>mart_general.links</p></figcaption></figure>

## Schema <a href="#user-content-schema" id="user-content-schema"></a>

### **mart/general/links**

| Field name                | Type     | Description                                                                                                                                   |
| ------------------------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
| udp\_link\_id             | STRING   | An unique ID of the link, generated by combining the UDP content ID, the occurrence number, the content type, and the UDP course offering ID. |
| lms\_link\_id             | STRING   | An unique ID of the link, generated by combining the LMS content ID, the occurrence number, the content type, and the LMS course offering ID. |
| udp\_content\_id          | INTEGER  | The UDP ID of the content that contains the link.                                                                                             |
| lms\_content\_id          | STRING   | The LMS ID of the content that contains the link.                                                                                             |
| udp\_parent\_content\_id  | INTEGER  | The UDP ID of the parent of the content. May be null if the content does not have a parent.                                                   |
| lms\_parent\_content\_id  | STRING   | The LMS ID of the parent of the content.  May be null if the content does not have a parent.                                                  |
| udp\_course\_offering\_id | INTEGER  | The UDP ID of the course offering.                                                                                                            |
| lms\_course\_offering\_id | STRING   | The LMS ID of the course offering.                                                                                                            |
| sis\_course\_offering\_id | STRING   | The SIS ID of the course offering.                                                                                                            |
| udp\_person\_id           | INTEGER  | The UDP ID of the person if a person is associated with the content.                                                                          |
| lms\_person\_id           | STRING   | The LMS ID of the person if a person is associated with the content.                                                                          |
| sis\_person\_id           | STRING   | The SIS ID of the person if a person is associated with the content.                                                                          |
| content\_type             | STRING   | The type of the content associated with the link, i.e. 'learner\_activity', 'discussion', etc.                                                |
| parent\_content\_type     | STRING   | The type of the parent content, i.e. 'learner\_activity', 'quiz', etc. May be null if the content does not have a parent.                     |
| content                   | STRING   | The full HTML text of the content.                                                                                                            |
| content\_name             | STRING   | The name or title of the content.                                                                                                             |
| status                    | STRING   | The status of the content, i.e. 'active', 'deleted', 'published', etc.                                                                        |
| is\_active                | BOOLEAN  | A boolean field indicating if the status of the content is 'active', 'published', 'available', or 'post\_delayed'.                            |
| created\_date             | DATETIME | The date the content was created.                                                                                                             |
| updated\_date             | DATETIME | The date the content was last updated.                                                                                                        |
| url                       | STRING   | The extracted URL from the content.                                                                                                           |
| context                   | STRING   | The HTML element that contains the URL.                                                                                                       |
| word\_count               | INTEGER  | The number of words in the text content of the HTML element.                                                                                  |
| char\_count               | INTEGER  | The number of characters in the text content of the HTML element.                                                                             |
| usage                     | STRING   | How the link is used in the content, i.e. 'hyperlink', 'embed', 'image', etc.                                                                 |
| scheme                    | STRING   | The scheme of the URL, i.e. 'http' or 'https'.                                                                                                |
| host                      | STRING   | The host or domain name of the URL, ie. 'example.com'.                                                                                        |
| path                      | STRING   | The path of the URL, i.e. 'path/to/resource'.                                                                                                 |
| query\_string             | STRING   | The query or search string of the URL, i.e. 'key=value\&anotherKey=anotherValue'.                                                             |
| file\_extension           | STRING   | If the link contains a file name, the extension of the file, i.e. 'html' or 'pdf'.                                                            |
| tag                       | STRING   | The tag of the HTML element, i.e. 'a', 'iframe', 'embed', etc.                                                                                |
| attribute                 | \`STRING | The attribute of the HTML element, i.e. 'src' or 'href'.                                                                                      |
| classes                   | STRING   | The class of the HTML element.                                                                                                                |
| is\_safe\_link            | BOOLEAN  | A boolean field that indicates if the URL is a safe link.                                                                                     |
| is\_canvas                | BOOLEAN  | A boolean field that indicates if the URL is a Canvas URL.                                                                                    |
| is\_shortener             | BOOLEAN  | A boolean field that indicates if the URL is a shortened URL.                                                                                 |
| occurrence                | INTEGER  | The occurrence number of the link in the content.                                                                                             |
| service\_name             | STRING   | The name of the service linked to in the URL.                                                                                                 |

## Javascript Usage&#x20;

As mentioned in the introduction, this mart employs Javascript libraries to extract information from HTML and URLs. This is done using the [Javascript UDFs](https://cloud.google.com/bigquery/docs/user-defined-functions#javascript-udf-structure) functionality in BigQuery. The `OPTIONS` section in the UDF references the Javascript libraries. The libraries are publicly available resources:

* `"gs://assets.public.unizin.org/udp-marts/links-datamart/cheerio.js"`
* `"gs://assets.public.unizin.org/udp-marts/links-datamart/uri.js"`

Using BigQuery SQL, the Javascript UDF is called on content from the LMS in order to extract links and additional contextual information.&#x20;

## Content Sources

This mart extracts URLs from content in the LMS. The content is defined as any HTML text fields associated with tools in the LMS. The content types are defined based on the LMS tool. The possible content types and the corresponding UDP fields of the content are:

* annotation&#x20;
  * *annotation.message*
* conversation\_message
  * *conversation\_message.body*
* discussion
  * *discussion.body*
* discussion\_entry
  * *discussion\_entry.body*
* learner\_activity
  * *learner\_activity.description*
* learner\_activity\_result
  * *learner\_activity\_result.body*
* learning\_outcome
  * *learning\_outcome.description*
* learning\_outcome\_group
  * *learning\_outcome\_group.description*
* learning\_outcome\_rubric\_criteria
  * *learning\_outcome\_rubric\_criteria.description*
* module\_item
  * *module\_item.url*
* quiz
  * *quiz.description*
* quiz\_item
  * *quiz\_item.body*
* quiz\_item\_response
  * *quiz\_item\_response.body*
* syllabus
  * *course\_offering.syllabus\_content*
* wiki
  * *wiki.front\_page\_url*
* wiki\_page
  * *wiki\_page.body*

## Link IDs

The *udp\_link\_id* and *lms\_link\_id* fields do not correspond to any existing IDs in the UDP or Canvas. They are unique identifiers generated for the links and are defined based on other fields included in this mart. The *udp\_link\_id* field is defined as:

```
  CONCAT(CAST(cu.udp_content_id AS STRING),'-',CAST(cu.occurrence AS STRING),'-',cu.content_type,'-',cu.udp_course_offering_id) as udp_link_id,
```

or the combination of *udp\_content\_id*, *occurrence*, *content\_type*, and *udp\_course\_offering\_id*. Similarly, *lms\_link\_id* is defined as the combination of *lms\_content\_id*, *occurrence*, *content\_type*, and *lms\_course\_offering\_id*.&#x20;

## References

* [Cheerio.js](https://github.com/cheeriojs/cheerio) - Javascript library used to parse and extract elements of HTML.
* [URI.js ](https://github.com/medialize/URI.js/) - Javascript library used to parse and extract elements of URLs.&#x20;
* [url-shorteners](https://github.com/PeterDaveHello/url-shorteners) - Project with list of common URL shortener domains used to define the *is\_shortener* field.&#x20;
* [file-extensions-list](https://github.com/dyne/file-extension-list) - Project with list of common file extensions used to define the *file\_extension* field.
