The batch-ingest Airflow application ensures that it does not mix LMS data from distinct context datasets. If it determines that restarting an import task would mix LMS data from distinct context datasets, it aborts the task.

Diagnosis

Occasionally, an Airflow task that imports context data from a CSV file into the ingest database will present the following error:

ERROR - [opt/ingest/batch_ingest/fetch_sources/pg_copy_ffrom/copy_blogs.py: 185] Skipping copying blogs to table: to prevent accidentally mixing dumps, cd.file_dim is expected to be empty.


With this error, the batch-ingest Airflow application reports that executing the task would mix LMS data for the File entity from multiple, distinct context datasets.

Root cause

This error occurs when the batch-ingest Airflow application runs an import task that, if executed, could mix data from two separate context datasets. The application can reach this state when a previous import failed and was improperly terminated, which leads it to infer that the previous import simply has not completed. In that case, the batch-ingest application will not execute the newer import task.
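The guard described above can be illustrated with a minimal sketch. This is not the batch-ingest source code: the function name `copy_if_empty`, the two-column table layout, and the use of an in-memory SQLite database in place of the Cloud SQL instance are all assumptions made so the example runs on its own.

```python
import sqlite3


def copy_if_empty(conn, table, rows):
    """Insert rows only when the target table is empty; return True if copied.

    Hypothetical helper for illustration only. The table name is interpolated
    into the SQL text, so it must come from trusted code, never user input.
    """
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    if count > 0:
        # Leftover rows suggest an earlier import never finished cleanly,
        # so skip the copy rather than risk mixing two datasets.
        print(f"Skipping copy: {table} is expected to be empty.")
        return False
    conn.executemany(f"INSERT INTO {table} (id, name) VALUES (?, ?)", rows)
    conn.commit()
    return True


# Stand-in for the ingest database (assumption: the real target is Cloud SQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file_dim (id INTEGER, name TEXT)")

first = copy_if_empty(conn, "file_dim", [(1, "a.pdf"), (2, "b.pdf")])
second = copy_if_empty(conn, "file_dim", [(3, "c.pdf")])
row_count = conn.execute("SELECT COUNT(*) FROM file_dim").fetchone()[0]
```

In this sketch the first call copies its rows, and the second call is refused because the table already holds data from the earlier copy.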

Solution

If a previous DAG run has not completed, complete the previous DAG run before clearing/restarting the failed import task.

However, if the previous DAG run has completed, you can force the batch-ingest application to restart the last DAG run. To do that, clear (restart) the upstream task that stream-inserts the relevant LMS data files from Google Cloud Storage into the Cloud SQL instance, even though that upstream task succeeded.

Identify the relevant task

The batch-ingest Airflow app decomposes its overall ETL process into many parallel tasks. These tasks are grouped by UCDM entity. The error at the top of this article, for example, is about the File entity (notice the reference to file_dim).

To troubleshoot this error, you must find the upstream cd_to_gcs task that corresponds to the failed task.

Although the cd_to_gcs task completed properly, this is the task you want to clear.
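When scanning task logs for this error, it can help to pull the table name out of the message programmatically. The following is a rough sketch: the helper name `table_from_error` and the regular expression are assumptions based only on the sample error shown in this article, and the exact cd_to_gcs task ID for a given entity still has to be looked up in the DAG.

```python
import re

# Assumption: the message always ends with "<table> is expected to be empty",
# as in the sample error quoted in this article.
ERROR_PATTERN = re.compile(r"(?P<table>[\w.]+) is expected to be empty")


def table_from_error(line):
    """Return the table named in the error line, or None if it does not match."""
    match = ERROR_PATTERN.search(line)
    return match.group("table") if match else None


line = (
    "ERROR - [opt/ingest/batch_ingest/fetch_sources/pg_copy_ffrom/copy_blogs.py: 185] "
    "Skipping copying blogs to table: to prevent accidentally mixing dumps, "
    "cd.file_dim is expected to be empty."
)
table = table_from_error(line)
```

For the sample error, `table` holds `cd.file_dim`, which points to the File entity.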

Clear the task and its downstream tasks

To clear the task, follow our guide on clearing tasks.
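If you prefer the command line to the Airflow UI, the same operation can be assembled as an `airflow tasks clear` command. This sketch assumes an Airflow 2.x CLI; the DAG ID `batch_ingest`, the task pattern, and the dates are placeholders, not values taken from the real deployment. The internal guide on clearing tasks remains the authoritative procedure.

```python
import shlex


def build_clear_command(dag_id, task_regex, start_date, end_date):
    """Build an `airflow tasks clear` command for a task and its downstream tasks.

    Hypothetical helper; it only assembles the arguments and does not run them.
    """
    return [
        "airflow", "tasks", "clear",
        "--task-regex", task_regex,   # which task(s) to clear
        "--downstream",               # also clear the tasks that depend on them
        "--start-date", start_date,   # limit clearing to the affected DAG run
        "--end-date", end_date,
        dag_id,
    ]


# Placeholder values (assumptions): replace with the real DAG ID, task ID, and run date.
cmd = build_clear_command("batch_ingest", "cd_to_gcs", "2024-01-01", "2024-01-01")
printable = shlex.join(cmd)
print(printable)
```

The command is built without `--yes`, so Airflow still asks for confirmation and lists the task instances it is about to clear before doing anything.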
