Overview
This article explains how to investigate and resolve Predictive workflow warnings where Amperity is unable to fetch the Databricks job status, even though the predictive jobs themselves have completed successfully.
This issue is caused by a temporary Databricks API availability issue and does not indicate a failure of the prediction models or workflow.
Symptoms
You may observe one or more of the following:
Workflow shows a warning or failure in the Predictions section
Error message similar to:
Failed to fetch Databricks job status for task <task_id>:
The service at /api/2.1/jobs/runs/get-output is temporarily unavailable.
Affected tasks are typically:
- Product Affinity – Inference
- PCLV – Inference
Additional context that points to this issue:
- No recent workflow or configuration changes
- The next scheduled workflow run completes successfully
Root Cause
The Databricks Jobs API was temporarily unavailable when Amperity attempted to retrieve the job output.
Important clarifications:
- The predictive job did run successfully
- Only the status/output fetch failed
- This is a transient infrastructure issue, not a tenant or query problem
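Transient "temporarily unavailable" responses of this kind are typically absorbed by retrying with backoff. The sketch below is illustrative only (Amperity's actual retry behavior is internal); `fetch_with_retry` and `TransientServiceError` are hypothetical names:

```python
import time

class TransientServiceError(Exception):
    """Stand-in for a 'temporarily unavailable' (HTTP 503-style) response."""

def fetch_with_retry(fetch, retries=3, base_delay=1.0):
    """Call fetch(), retrying with exponential backoff on transient errors.

    Illustrative sketch only: `fetch` is any zero-argument callable that
    raises TransientServiceError when the service is temporarily unavailable.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except TransientServiceError:
            if attempt == retries - 1:
                raise  # persistent failure: surface it (see "When to Escalate")
            time.sleep(base_delay * (2 ** attempt))  # wait 1s, 2s, 4s, ...
```

If the unavailability window outlasts every retry, the warning shown in Symptoms is what surfaces in the workflow, even though the underlying job ran.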
Investigation Steps
Follow the steps below in order.
Step 1: Identify the Affected Predictive Tasks
- Open the failed workflow run
- Note which predictive tasks show warnings or failures
- Capture the task ID(s) referenced in the error message
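When several warnings are present, the task IDs can be pulled out of the messages programmatically. A minimal sketch, assuming the message format shown in Symptoms (`extract_task_id` is a hypothetical helper):

```python
import re

def extract_task_id(message):
    """Pull the task ID out of a status-fetch warning message.

    Assumes the format shown in Symptoms:
    'Failed to fetch Databricks job status for task <task_id>: ...'
    Returns None if the message does not match.
    """
    match = re.search(
        r"Failed to fetch Databricks job status for task (\S+):", message
    )
    return match.group(1) if match else None
```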
Step 2: Confirm Error Type
Verify the error includes:
- /api/2.1/jobs/runs/get-output
Confirm the message states “temporarily unavailable”
- If yes, continue to Step 3
- If no, escalate for further investigation
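The two checks in this step can be expressed as a single predicate: both the `get-output` endpoint and the "temporarily unavailable" wording must appear. A sketch (`is_transient_status_fetch_error` is a hypothetical name):

```python
def is_transient_status_fetch_error(message):
    """Return True when a warning matches the transient pattern from Step 2.

    Both markers must be present: the get-output endpoint path and the
    'temporarily unavailable' wording. Anything else should be escalated.
    """
    return ("/api/2.1/jobs/runs/get-output" in message
            and "temporarily unavailable" in message.lower())
```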
Step 3: Verify Job Execution
- Confirm there is no evidence of job execution failure
- Ensure the error only relates to fetching job status, not job execution
- This confirms the issue is status retrieval only.
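If you have Databricks workspace access, the run can be double-checked directly via the Jobs API (`GET /api/2.1/jobs/runs/get?run_id=...`). A minimal helper for reading the response, assuming the standard 2.1 response shape (the network call itself is omitted; `run_succeeded` is a hypothetical name):

```python
def run_succeeded(run_response):
    """Check a Databricks Jobs API 2.1 runs/get response for success.

    Inspects state.life_cycle_state and state.result_state, which report
    "TERMINATED" / "SUCCESS" for a run that executed cleanly even when a
    later get-output call failed.
    """
    state = run_response.get("state", {})
    return (state.get("life_cycle_state") == "TERMINATED"
            and state.get("result_state") == "SUCCESS")
```

A `SUCCESS` result state alongside a get-output warning is direct confirmation that only the status retrieval failed.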
Step 4: Review Version History
Check workflow and prediction version history
Confirm:
- No recent edits
- No configuration changes prior to the run
- This rules out tenant-side causes.
Step 5: Check Logs (If Needed)
Review Honeycomb or internal logs
Validate that:
- The Databricks service was temporarily unavailable
- No repeated or persistent failures are logged
Step 6: Monitor the Next Run
Check the next scheduled workflow run
Confirm:
- Workflow completes successfully
- Predictive tasks run without warnings
- If the next run succeeds, the issue is resolved.
Resolution
No corrective action is required. Once the next run completes successfully:
- Inform the customer the issue was transient
- Confirm no impact to predictions or downstream data
When to Escalate
Escalate only if any of the following occur:
- The same error appears in multiple consecutive runs
- Predictive jobs fail to execute (not just status fetch)
- Multiple tenants experience the issue at the same time
If escalating, include:
- Tenant name and ID
- Workflow ID
- Task ID(s)
- Error timestamps