KPIs are optional and, when defined, will require threshold configurations. Dependent services are also optional and are simply references to other already-configured ITSI services on which this service depends.

Let's first clarify the difference between thresholds and alerts; in ITSI, these are related but separate concepts. Thresholds apply only to KPIs: they dictate when a KPI severity (or status, as it is sometimes called) changes from normal to critical, high, low, etc. KPI severities are viewable in the service analyzer dashboard, deep dives, and other UI locations, but in and of themselves they don't generate alerts. Alerts are generated from additional configurations, driven by KPI severity and service health score changes. We'll dig into these configurations later, but for now, we just want to acknowledge the difference between the two concepts.

At the risk of being pedantic, what is an alert anyway? Is it an email? A text message? A ticket to a ticketing system? A flashing red light in the NOC? Something else? Within ITSI, we take a two-tiered approach to generating alerts. Notable events are created first, which then lead to one or more traditional alerts. We configure ITSI to continuously monitor KPI statuses and service health scores; when we detect problems or concerns, we can then create notable events. Notable events are not alerts, at least in the traditional sense; they are simply events of interest viewable from the Notable Events Review dashboard. A separate and final configuration, using notable event aggregation policies, turns one or more notable events into your traditional alerts like emails, tickets, etc. Put the process all together and it looks like this:

[flow diagram: KPI thresholds → KPI severities and health scores → notable events → aggregation policies → traditional alerts]

Correct Entity Groupings Are Paramount

The entities selected for your service directly impact the aggregate and per-entity results for each KPI. Therefore, grouping the right entities together in your service is important to ensure success with thresholds and alerts. Typically, you'll want to ensure that each entity in your service behaves about the same as every other entity. Predictably different entities should be broken out into their own subservice. Let's use a handful of examples to clarify:

- Entities that span two different data centers should typically be broken out into DC-specific subservices.
- Batch servers or dedicated-purpose servers should typically be broken out from their general-purpose counterparts.
- Entities spanning different architectural tiers (DB, web, application, etc.) should be broken out.

You may be asking, why is this the case? Let's use the batch servers example above and assume I have a farm of 20 app servers associated with a critical business app that I'm monitoring in Splunk ITSI. Let's then assume that 3 of those 20 servers are solely responsible for processing nightly batch operations jobs. If all 20 servers are defined as entities in one service, KPIs like average CPU are nearly meaningless in aggregate (and therefore difficult to accurately threshold) because the batch servers are expected to exhibit different behavior. It also makes leveraging per-entity thresholds and anomaly detection nearly impossible down the road. So instead, break the 3 batch servers out into their own subservice and keep the other 17 grouped together. (The sketches at the end of this section make these points concrete.)

KISS (Keep It Simple)

Before we get into the meat and potatoes of how to configure thresholds and alerts, please remember to keep it simple.
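To ground the thresholds-versus-alerts distinction, here is a minimal sketch of what a threshold does: map a KPI's aggregate value to a severity. This is purely illustrative Python, not ITSI's implementation; the band boundaries and the set of severity names are assumptions for the example (real thresholds are configured per KPI in the service definition).

```python
# Illustrative only -- NOT ITSI's actual threshold engine.
# Hypothetical severity bands for an "average CPU %" KPI,
# ordered from most to least severe.
SEVERITY_BANDS = [
    (90.0, "critical"),
    (80.0, "high"),
    (70.0, "medium"),
    (60.0, "low"),
]

def kpi_severity(value: float) -> str:
    """Return the severity whose band floor the KPI value crosses."""
    for floor, severity in SEVERITY_BANDS:
        if value >= floor:
            return severity
    return "normal"

assert kpi_severity(95.0) == "critical"
assert kpi_severity(72.0) == "medium"
assert kpi_severity(42.0) == "normal"
```

Note that nothing here sends an email or opens a ticket; a threshold crossing only changes a severity, which is exactly why alerting needs its own configuration.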
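The two-tiered flow can be sketched the same way. Below, a toy "aggregation policy" groups hypothetical notable events by service and raises a single traditional alert per group. The field names, event contents, and grouping rule are all invented for illustration and are far simpler than ITSI's real notable event aggregation policies.

```python
from collections import defaultdict

# Toy notable events -- in ITSI these would be created from KPI
# severity and health score changes; the fields here are made up.
notables = [
    {"service": "web_checkout", "severity": "high"},
    {"service": "web_checkout", "severity": "critical"},
    {"service": "payments",     "severity": "low"},
]

# Tier 2: group notable events (here, by service) and emit one
# traditional alert per group only when the group is severe enough.
groups = defaultdict(list)
for event in notables:
    groups[event["service"]].append(event)

for service, events in groups.items():
    if any(e["severity"] == "critical" for e in events):
        print(f"ALERT: open ticket for {service} "
              f"({len(events)} notable events)")
```

The point of the sketch: many notable events can collapse into one actionable alert, which is what keeps the NOC from drowning in noise.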
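Finally, to see why mixing batch and general-purpose servers makes the aggregate so hard to threshold, here is the 20-server example with made-up CPU numbers (host names and values are hypothetical):

```python
# Hypothetical CPU readings (%) during the nightly batch window:
# 17 app servers idling, 3 batch servers running hot.
app_servers   = {f"app{i:02d}": 20.0 for i in range(1, 18)}  # ~20% CPU
batch_servers = {f"batch{i}":   95.0 for i in range(1, 4)}   # ~95% CPU

def avg(readings: dict[str, float]) -> float:
    return sum(readings.values()) / len(readings)

# One mixed service: the aggregate describes neither population.
mixed = avg({**app_servers, **batch_servers})

print(f"mixed service avg CPU: {mixed:.1f}%")              # 31.3%
print(f"app subservice avg:    {avg(app_servers):.1f}%")   # 20.0%
print(f"batch subservice avg:  {avg(batch_servers):.1f}%") # 95.0%
```

A 31% average looks healthy even while every batch server is pegged, and no single threshold fits both groups; split the service and each subservice's aggregate becomes meaningful again.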