This should include graphs and suggested alert levels for metrics that affect users, such as error rates and latency. Operations should be grouped to indicate likely symptoms and affected upstream services (for example, those responsible for creating or updating workflow state, or those related to task tracking).
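
As a rough illustration of the grouping described above, the sketch below ties each group of operations to a likely symptom and a pair of alert levels. Everything specific in it is an assumption made for the example: the operation names (`StartWorkflow`, `PollTask`, and so on), the group names, the symptom text, and the threshold values are hypothetical placeholders, not recommendations or real API names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OperationGroup:
    """A set of operations that share a likely symptom and alert levels."""
    name: str
    operations: tuple          # operation names belonging to this group
    likely_symptom: str        # what users would probably notice
    warn_error_rate: float     # fraction of failed requests, 0.0 to 1.0
    page_error_rate: float
    warn_p99_latency_ms: float
    page_p99_latency_ms: float


# Hypothetical groups and thresholds, chosen only to show the shape of the data.
GROUPS = (
    OperationGroup(
        name="workflow_state",
        operations=("StartWorkflow", "UpdateWorkflow"),
        likely_symptom="workflows fail to start or stop making progress",
        warn_error_rate=0.01,
        page_error_rate=0.05,
        warn_p99_latency_ms=500.0,
        page_p99_latency_ms=2000.0,
    ),
    OperationGroup(
        name="task_tracking",
        operations=("PollTask", "CompleteTask"),
        likely_symptom="tasks are picked up late or reported as stuck",
        warn_error_rate=0.01,
        page_error_rate=0.05,
        warn_p99_latency_ms=500.0,
        page_p99_latency_ms=2000.0,
    ),
)


def group_for(operation: str) -> Optional[OperationGroup]:
    """Return the group containing the operation, or None if it is ungrouped."""
    for group in GROUPS:
        if operation in group.operations:
            return group
    return None


def alert_level(group: OperationGroup, error_rate: float, p99_latency_ms: float) -> str:
    """Map observed error rate and p99 latency to 'ok', 'warn', or 'page'."""
    if (error_rate >= group.page_error_rate
            or p99_latency_ms >= group.page_p99_latency_ms):
        return "page"
    if (error_rate >= group.warn_error_rate
            or p99_latency_ms >= group.warn_p99_latency_ms):
        return "warn"
    return "ok"
```

Keeping the symptom text next to the thresholds means an alert for a group can state what users are likely seeing, not just which metric crossed a line. In practice the thresholds would probably live in the monitoring system's own rule format; the Python structure here only shows the grouping.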