Complex workloads are common in data warehousing. This poses a performance problem, since different workloads are competing for the same CPU, memory and I/O resources. In order to maximize investment in a data warehouse, organizations need to use a mix of workload balancing, monitoring and alerting to control the workloads and to ensure acceptable performance.
To illustrate data warehouse traffic, consider the roads and on- and off-ramps of a city’s highway system. The workload of a typical highway can be supply trucks, passenger vehicles, construction and maintenance vehicles, and emergency vehicles. As each type of workload takes an entrance ramp, it enters the highway system and requires the system’s resources (i.e. highway infrastructure). As each type of workload takes an exit ramp, it leaves the highway system and releases the system’s resources.
Each workload has its own characteristics and demands: supply trucks are heavy and slower than the rest of the workloads and typically have tight distribution deadlines. Passenger vehicles are smaller, but more numerous and their drivers have individual time expectations for their commute. Construction and maintenance vehicles tend to put an overhead on the highway system – for example lane closures or slower speed limits due to their work. However, they perform necessary work that must be carefully scheduled to minimize impact on the highway system. Emergency vehicles, with their sudden and needed use of the highway system, must have absolute priority.
These workloads put stress on the highway system, especially during peak usage hours. Infrastructure expansion (e.g. adding new highway lanes) is a costly and lengthy process, hence a number of techniques are applied to make the system work efficiently: speed limits for accident prevention, High Occupancy Vehicle (HOV) lanes to encourage behavioral changes, highway shoulders for accidents or emergency vehicle traffic, metered ramps to control the flow of new traffic into the system, highway patrols to enforce highway regulations, etc.
Similar to the above highway system, a typical data warehouse has a mix of workloads: batch, end user, system and tactical. The main goal is to maximize use of current infrastructure in order to provide consistent and acceptable query response times. In order to help accomplish this goal, current workloads must be analyzed and understood.
During peak usage times in a data warehouse, it is critical to start controlling (throttling) the number of new queries reaching the system. Like the congested highway, database performance will grind to a halt if more queries attempt to run than the system can handle.
To establish “metered ramps,” relational database management systems (RDBMS) must support some sort of active workload management mechanism. In order to establish appropriate query throttles, multiple approaches can be taken:
- Set up query limits based on system observation. For example, if observation shows that Workload A’s response times start to degrade after 20 concurrent queries, implement a throttle that delays any query in excess of 20 concurrent queries. The RDBMS will release a delayed query as soon as the number of active queries drops below 20.
- Trial and error. Pick a throttle limit, monitor system performance and measure query throughput. Adjust the throttle limit as needed, and repeat the process until the limit is appropriate.
- Extensive simulation with benchmark data and queries. It is important that the system is inactive while the simulation is running. Run the various concurrent queries and measure response times and CPU consumed. Chart the query response times vs. CPU consumed/total number of queries and determine the breaking point where query response times start to degrade. Regression analysis can also be used.
- Metrics analysis on existing system data. This approach is difficult if the system workload varies by day or time. Average the total number of concurrent active queries and total CPU per time slice (e.g. every five minutes): take a snapshot of active queries every five minutes; add up CPU and the number of concurrent active queries for each time period (e.g. 10:05 AM, 10:10 AM, 10:15 AM, etc.) over the past month; then average each time period’s CPU and query counts by dividing the totals by the number of days in the past month. Do the same chart/regression analysis as in the simulation approach.
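The observation-based throttle from the first approach can be sketched as a simple concurrency gate. This is an illustrative assumption, not a specific RDBMS feature: the limit of 20 and the `run_throttled` helper stand in for whatever delay mechanism the workload manager provides.

```python
import threading

# Hypothetical throttle: delay any query beyond 20 concurrent
# (the limit determined by observation in the text).
MAX_CONCURRENT = 20
_gate = threading.BoundedSemaphore(MAX_CONCURRENT)

def run_throttled(execute_query, sql):
    """Block (delay) until an active-query slot frees up, then run the query.

    The slot is released when the query finishes, letting the next
    delayed query proceed -- the "metered ramp" behavior.
    """
    with _gate:
        return execute_query(sql)
```

In a real system this gating happens inside the RDBMS workload manager; the sketch only shows the delay-until-below-limit semantics.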
A number of resource constraints can be placed on an ID to help avoid an “accident”:
- Resource limitations (hard limit on memory, space, CPU, and/or I/O) are used frequently to prevent run-away queries from causing performance issues on the system. Once an ID exceeds this limit, query execution is automatically aborted;
- Limit number of concurrent queries per ID (delay/reject any query that exceeds the concurrent limit threshold);
- Limit number of concurrent sessions per ID; and,
- Limit CPU and I/O per query. Downgrade or abort a query if it exceeds the threshold. A note of caution: be very judicious about aborting write queries (e.g. INSERT, UPDATE), because uncommitted transactions will roll back, causing an additional system performance hit.
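A minimal sketch of evaluating such per-ID constraints, assuming the current resource stats have already been collected from the RDBMS; the limit names and values here are hypothetical examples, not defaults of any particular system:

```python
# Hypothetical per-ID resource limits; a real RDBMS enforces these in its
# workload manager, but the comparison logic looks like this.
LIMITS = {
    "concurrent_queries": 10,      # delay/reject queries beyond this
    "concurrent_sessions": 5,      # reject new sessions beyond this
    "cpu_seconds_per_query": 600,  # downgrade or abort beyond this
}

def violated_limits(stats, limits=LIMITS):
    """Return the names of the limits this ID's current stats exceed."""
    return [name for name, cap in limits.items() if stats.get(name, 0) > cap]
```

For example, `violated_limits({"concurrent_queries": 14, "cpu_seconds_per_query": 30})` flags only the concurrency limit, which would translate into delaying or rejecting that ID’s next query.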
It is also important to track frequent speed limit “violators” and suggest training or provide tips and techniques. If a business unit has a high number of violators, it is beneficial to schedule training or classes for the members of that business unit.
Queries that require minimal system resources and that are typically sub-second in duration should get most of the system CPU. By doing this, you get these queries out of the system as quickly as possible and remove them from the rest of the lanes of traffic, which averts further congestion. Creating an HOV Lane for such queries also rewards users of the system and encourages SQL tuning and optimization.
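The HOV-lane routing above can be sketched as a simple classifier; the thresholds and workload names are illustrative assumptions about what "minimal resources, sub-second" might mean on a given system:

```python
def assign_lane(est_cpu_seconds, est_rows):
    """Route short, cheap queries to a high-priority "HOV lane" workload.

    Thresholds are hypothetical; in practice they come from analyzing
    the system's own tactical query profile.
    """
    if est_cpu_seconds < 1.0 and est_rows < 1000:
        return "tactical_high_priority"  # the HOV lane
    return "standard"
```

Routing would normally be done by the RDBMS workload classification rules; the sketch only shows the decision shape.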
To maintain good traffic control of a data warehouse, there must be policing and rule enforcement. It is important to understand that these enforcements should by no means be a dictatorship. Communication with query offenders should be cooperative in nature and should provide guidance and education. In practice, most violators will eventually be thankful the matter was raised, since the correction techniques should help them do their job more quickly and more efficiently.
Measure the following metrics for each workload:
- Average total CPU and I/O consumed per day. These metrics should exclude higher than normal system utilization such as month-end processing or year-end reporting. Add all CPU and I/O consumed by a workload in the past month, then divide by the number of days in the past month. Weekends can also be excluded if a workload is primarily active Monday through Friday.
- Average total impact CPU per day. For highly parallel systems, the impact CPU of a query is calculated as the maximum CPU consumed on a virtual processor, multiplied by the number of virtual processors. This number essentially captures how “badly” a query impacted the system. A large difference between total CPU and total impact CPU for a given workload indicates that the workload is not efficiently using system parallelism and that workload/query tuning is required.
- Average total CPU and I/O for an ID of a workload per day. Add all CPU for a workload for the past month, then divide by number of days in the past month and then divide by the number of IDs that belong to that workload.
- Average total impact CPU for an ID of a workload per day. Add all impact CPU for a workload for the past month, then divide by number of days in the past month. Then, divide by the number of IDs that belong to that workload.
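The impact CPU calculation from metric 2, and the per-ID daily averages from metrics 3 and 4, can be sketched as follows; the virtual processor counts and figures are illustrative:

```python
def impact_cpu(cpu_per_vproc):
    """Impact CPU = max CPU on any virtual processor * number of vprocs.

    A skewed query that does all its work on one vproc "occupies" the
    whole parallel system for that duration, hence the multiplication.
    """
    return max(cpu_per_vproc) * len(cpu_per_vproc)

def avg_daily_per_id(workload_total, days_in_month, num_ids):
    """Metrics 3 and 4: divide the monthly workload total by days, then IDs."""
    return workload_total / days_in_month / num_ids

# A skewed query: all 100 CPU seconds land on one of four vprocs.
skewed = impact_cpu([100, 0, 0, 0])       # 400 -- far above the 100 consumed
balanced = impact_cpu([25, 25, 25, 25])   # 100 -- equal to the CPU consumed
```

The gap between `skewed` and `balanced` is exactly the total-CPU-vs-impact-CPU difference the text says signals poor use of parallelism.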
Once the above metrics are captured, set up the trend alerts below, for example as Autosys scheduled jobs. Since the metrics above are daily averages, each alert should run multiple times a day. The threshold for each alert will depend on how aggressively violators are to be pursued and how many DBA resources are available. The lower the threshold, the more work there is for the DBA in pursuing each violation, but also the more control over the system and the fewer accidents that can slow down overall database performance. The DBA needs to evaluate whether the violator is causing a system-wide performance impact. If so, the ID should either be placed in a workload with low CPU allocation (i.e. a penalty box) or be aborted and locked until the ID’s owner can be reached.
Database ID is exceeding its normal CPU or I/O usage
Add CPU and I/O for each ID for the current day. If a database ID exceeds a chosen threshold (e.g. 150 percent above the normal CPU usage of an ID within that workload, per workload metric 3 above), send an alert to the DBA. Also, if an ID exceeds a threshold percentage of its workload (e.g. the ID is consuming more than 30 percent of the total CPU consumed by its workload, per workload metric 1 above), send an alert to the DBA. Sometimes there is a valid reason that an ID is exceeding its threshold, for example when month-end processing or quarterly reports are running. However, in practice, most of the violations are due to an issue.
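A sketch of the two checks above, using the example thresholds from the text. Reading “150 percent above normal” as 2.5x the normal usage is an interpretation, so both multipliers are parameters:

```python
def id_cpu_alerts(id_cpu_today, id_normal_cpu, workload_cpu_today,
                  over_normal=2.5, workload_share=0.30):
    """Flag an ID whose CPU today is 150% above its normal daily average
    (interpreted here as 2.5x normal) or exceeds 30% of its workload's CPU."""
    alerts = []
    if id_cpu_today > over_normal * id_normal_cpu:
        alerts.append("CPU far above this ID's normal usage")
    if id_cpu_today > workload_share * workload_cpu_today:
        alerts.append("CPU is an outsized share of the workload")
    return alerts
```

An empty return means no alert; either string would be routed to the DBA for the valid-reason check described above.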
Database ID is exceeding its normal CPU impact
Similar to the alerts above, except using workload metrics 2 and 4. Common issues exposed by this alert are:
- Missing table statistics
- New users unfamiliar with the database running poorly written SQL
- Tables involved have poor physical design
- Source file issue
Canary query response times are trending high
Set up canary queries (automated queries launched periodically to gauge the health of a system) and capture their response times. If a canary query exceeds the threshold response time (e.g. the response time is three times longer than the average canary response time) for multiple collection periods, and each collection period’s response time is longer than the previous one, send an alert to the DBA. This trending approach avoids sending too many canary alerts triggered by brief spikes in system resource utilization; alerts are sent only when it is clear that query response times are consistently slowing down.
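The trending rule above can be sketched as follows; the three-period window and the 3x factor are the example values from the text, not fixed recommendations:

```python
def canary_alert(history, baseline, factor=3.0, periods=3):
    """Alert only when the last `periods` canary response times all exceed
    factor * baseline AND are strictly increasing -- a consistent slowdown,
    not a brief spike."""
    if len(history) < periods:
        return False
    recent = history[-periods:]
    above = all(t > factor * baseline for t in recent)
    rising = all(b > a for a, b in zip(recent, recent[1:]))
    return above and rising
```

A single slow sample (above the factor but followed by recovery) never fires; only a run of worsening, over-threshold samples does.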
The above techniques all have common goals of maximizing current DW infrastructure, encouraging SQL/process tuning and optimization, and reducing query response times and the number of outages needed for system expansion. At a minimum, the following metrics should be reported each week and shown on a trend graph: query response times for each workload; CPU, memory and I/O utilization; query delay queue; and impact CPU per workload. It is important to monitor the system and adjust the above techniques as appropriate.