Cluster Error Scenarios

Understand common error scenarios that may occur during clustering.
| Description | Node Steering | Producers | Consumers | Result | Notes |
| --- | --- | --- | --- | --- | --- |
| Job is submitted to the cluster and no consumers are available. | No | 1 | 0 | Job is accepted and a function key is returned. AsyncFunctionStatus can be retrieved and shows the job is not done and not started. | Job is processed when a consumer becomes available. |
| Job is submitted to the cluster and no consumers are available. | Yes | 1 | 0 | Job is accepted and a function key is returned. AsyncFunctionStatus can be retrieved and shows the job is not done and not started. | Job is processed when a consumer becomes available. |
| Job is submitted to the cluster, consumer available, node steering invalid. | Yes | 1 | 1 | Job is not accepted. I18nNoAvailableNodeForTaggedTaskException is thrown. | Expected behavior. |
| Job is submitted to the cluster, consumer available, no node steering, consumer crashes before claiming task. | No | 1 | 1 | Job is accepted. Consumer crashes and task stays in NOT STARTED state. | Consumer picks up task when it restarts. |
| Job is submitted to the cluster, consumer available, no node steering, consumer crashes before claiming task. | No | 1 | 2 | Job is accepted. Consumer crashes and task stays in NOT STARTED state. | Consumer picks up task when it restarts. |
| Job is submitted to the cluster, consumers available, no node steering, consumer crashes after claiming task. | No | 1 | 1 | Job is accepted. Consumer crashes and task remains trapped in progress. | See footnote 1. |
| Job is submitted to the cluster, consumers available, no node steering, consumer crashes after claiming task. | No | 1 | 2 | Job is accepted, started, and in progress. Consumer dies and the job is orphaned while still marked as in progress. | See footnote 2. |
| Job is submitted to the cluster, consumers available, no node steering, task errors. | No | 1 | 1 | Job is accepted, started, and in progress. Job errors. | Subsequent status calls on the cron job continue to throw the exception when retrieving the future status. See footnote 3. |
| Job is submitted to the cluster after error, consumer available, no node steering. | No | 1 | 1 | Job is accepted, started, and in progress. Job completes. | |
| Multiple worker-based jobs for the same case are submitted to the cluster, consumer available, no node steering. | No | 1 | 2 | Jobs are accepted, started, and in progress. Job completion depends on when the case gets closed, either by REST or by the Nuix Engine, which briefly closes the ES case prior to ingestion. | This is a race condition and is not supported by the Nuix Engine. ES client errors are not serializable. Subsequent status calls throw an exception on the server that executed the job, which can look like a problem when collectMetrics is called. |
| Job submitted to cluster, producer crashes before task is claimed. | No | 1 | 1 | Job remains not started until claimed. Consumer processes the job even though the producer is down. Job completes. | See footnote 4. |
| Job submitted directly to a consumer. | No | 0 | 1 | Job is processed locally by the consumer. | |
| All nodes crash. | No | 1 | 2 | All jobs are lost. | Jobs exist in memory only. This is expected. |
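
The scenarios above suggest that a job reported as not started is waiting for a consumer rather than failing, while a job trapped in progress may never finish on its own. A client that polls for status can reflect this by treating "not started" as pending and by always applying a timeout. The following is a minimal sketch under stated assumptions: `fetch_status` is a caller-supplied callable, and the `started`, `done`, and `error` field names are hypothetical placeholders, not fields taken from the REST API.

```python
import time
from typing import Callable, Dict


def wait_for_job(
    fetch_status: Callable[[], Dict[str, object]],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the job completes, errors, or the timeout elapses.

    Returns "complete", "error", "timeout_not_started",
    or "timeout_in_progress".
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()  # hypothetical status lookup supplied by the caller
        if status.get("error"):
            return "error"
        if status.get("done"):
            return "complete"
        started = bool(status.get("started"))
        if clock() >= deadline:
            # Distinguish "no consumer has claimed the job yet" from
            # "claimed but never finished" (possibly trapped or orphaned).
            return "timeout_in_progress" if started else "timeout_not_started"
        sleep(interval_s)
```

A `timeout_not_started` result is consistent with no consumer being available, whereas `timeout_in_progress` may point to a consumer that crashed after claiming the task.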

Footnotes

The following responses are from the /cluster/queue endpoint. The response you see depends on the error scenario you encounter.


  1. When the result is a trapped task:

    {
      "type": "com.nuix.us.ws.function.AsyncBulkIngestionFunction",
      "processedBy": "nuix-restful-server-2",
      "caseId": "70a7c712d6df402eac2b2a75cadcb87e",
      "status": "IN_PROGRESS"
    }
    
  2. When the result is an orphaned task:

    {
      "type": "com.nuix.us.ws.function.AsyncBulkIngestionFunction",
      "processedBy": "nuix-restful-server-3",
      "caseId": "70a7c712d6df402eac2b2a75cadcb87e",
      "status": "IN_PROGRESS"
    }
    
  3. When the result is an errored task:

    {
      "type": "com.nuix.us.ws.function.AsyncBulkIngestionFunction",
      "processedBy": "nuix-restful-server-2",
      "caseId": "70a7c712d6df402eac2b2a75cadcb87e",
      "status": "ERROR"
    }
    

    The REST logs show a stack trace of the error, for example:

    Caused by: com.nuix.us.ws.exception.NuixServiceException: Could not open case.: java.net.ConnectException: Connection refused
    at com.nuix.us.ws.caseaccess.UserNuixEnvironment.openCase(UserNuixEnvironment.java:182)
    at com.nuix.us.ws.security.UserNuixEnvironmentManager.openCase(UserNuixEnvironmentManager.java:172)
    at com.nuix.us.ws.caseaccess.AsyncFunction.openCase(AsyncFunction.java:323)
    at com.nuix.us.ws.caseaccess.AsyncFunction.call(AsyncFunction.java:258)
    at com.nuix.us.ws.caseaccess.AsyncFunction.lambda$executeAsync$0(AsyncFunction.java:358)
    at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
    at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
    at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
    ... 3 common frames omitted
    
  4. When the result is a completed task:

    {
      "type": "com.nuix.us.ws.function.AsyncBulkIngestionFunction",
      "processedBy": "nuix-restful-server-2",
      "caseId": "70a7c712d6df402eac2b2a75cadcb87e",
      "status": "COMPLETE"
    }
    
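
Because trapped and orphaned tasks both appear as `IN_PROGRESS` in the queue response, one way to spot candidates is to compare each entry's `processedBy` value against the nodes known to be running. The sketch below assumes the queue entries have already been fetched and parsed into a list of dicts shaped like the examples above, and that the caller supplies the set of live node names. The function name `find_stuck_tasks` and the `live_nodes` input are hypothetical and are not part of the REST API.

```python
from typing import Dict, Iterable, List, Set


def find_stuck_tasks(
    entries: Iterable[Dict[str, str]], live_nodes: Set[str]
) -> List[Dict[str, str]]:
    """Return IN_PROGRESS entries whose processing node is not in live_nodes."""
    return [
        entry
        for entry in entries
        if entry.get("status") == "IN_PROGRESS"
        and entry.get("processedBy") not in live_nodes
    ]


# Example entries modeled on the footnote responses above.
queue = [
    {
        "type": "com.nuix.us.ws.function.AsyncBulkIngestionFunction",
        "processedBy": "nuix-restful-server-3",
        "caseId": "70a7c712d6df402eac2b2a75cadcb87e",
        "status": "IN_PROGRESS",
    },
    {
        "type": "com.nuix.us.ws.function.AsyncBulkIngestionFunction",
        "processedBy": "nuix-restful-server-2",
        "caseId": "70a7c712d6df402eac2b2a75cadcb87e",
        "status": "COMPLETE",
    },
]

stuck = find_stuck_tasks(queue, live_nodes={"nuix-restful-server-2"})
```

This check only flags tasks whose node is currently down. A consumer that crashed and has since restarted would appear live, so a task it left in progress would not be flagged by this comparison alone.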