Data growth rates are climbing at an astonishing pace. Oracle databases today are holding many terabytes and in some cases petabytes worth of data. In most cases, the data is not deleted regularly even though the new transactions continue to occur within the application. Companies either ignore this issue or resolve it temporarily by adding more space because, after all, disk is cheap.
Well, its not all about saving disk space. Instead, its about truly protecting corporate data for long periods of time in order to meet regulatory requirements. Its also about achieving maximum performance of applications through a reduced amount of data to process. Not only will applications run faster, but utility functions such as backups, recoveries and reorganizations will execute more efficiently. Its also about deleting any information that can be legally used against the company. As a result of these, you might also save some disk space.
Archiving data requires creating a strategy for long-term data retention. This strategy is usually determined in collaboration with IT, line of business and legal teams. This strategy could be impacted by one of two types of requirements - internal requirements, where for example a company must keep all customer records accessible and online for a period of 24 months, or requirements that stem from government regulations like Sarbanes-Oxley, HIPAA, PCI, etc. For example, HIPAA requires that medical records be retained up to two years after a persons death. Sarbanes-Oxley requires audit data be kept for seven years after the conclusion of the audit and CFR (Life Sciences) requires that pharmaceutical companies retain clinical trial data for 35 years. Requirements are extremely stringent for companies that must retain data for legal purposes, and most are not prepared to hold data for such long time frames, let alone guarantee it hasnt been altered in any way.
The life span of data in Figure 1 begins with data inserted into the operational system that is needed for regular business transactions. Later, it is used primarily for reporting purposes, where decision support and data warehousing applications come into play. This data is often read only and highly summarized. When the data is no longer required for quick online accessiblity, it should be moved to a data archive. At this stage, the data can be accessed, just not typically at the same efficiency as within the application. Finally, when the data is no longer required for long-term retention, it can and should be discarded.
Archive or Reference Database?
Implementing a reference database involves taking operational data from one Oracle database and placing it in another read-only Oracle database. Sometimes this data is summarized as it is loaded, other times it is just moved as is. This should not be confused with a database archive. A database archive is not tied to any particular database or application. It stands on its own and can outlive the application or the database and is truly read-only.
Reference databases serve their purpose, but data that is truly an archive candidate should be locked up, put away and independent of the application or the database. The archive should be directly accessible via SQL while also supporting the ability to be reloaded back into a database. Archives should support schema changes (new columns, deleted columns, different data type lengths, etc.) that will not be inherently supportable in a reference database as the application evolves over many years and the characteristics of the data change.
Organizations should develop data retention policies based on a) its strategy and b) the data itself. Data to be archived must be done so in a manner that encapsulates the entire set of data (or business transaction); otherwise it is not a true representation of the data and its related data. For example, a typical business transaction may result in inserting data into 10 to 20-plus tables. If archiving orders, we would most likely need to archive data from tables named ORDERS, ORDER_LINE_ITEMS and possibly data from tables that are also related to these. This process may be fairly straightforward when referential integrity constraints are defined within Oracle; however, it gets much more difficult when defined through the application itself.
Data destruction is the process of deleting data that is no longer needed by the company, but even more importantly, it is data that needs to be discarded. Think of it as a digital shredder you dont need the documents you shred any longer and you dont want anyone else to have access to them. The same applies to corporate data. A discard policy simply establishes the amount of time to retain data before it is automatically shredded from the archive. While an organization may be legally obligated to store the data for a specified amount of time, it would be wise to delete the data when that obligation expires. Not only would this free disk space, it would also free the company from any possible legal action as a result of still having that information. For example, when dealing with pacemaker data within the health care industry, the required retention period is death of patient plus seven years. That data should be discarded at the end of that time period, thereby reducing the possibility of lawsuits.
When moving data to a true database archive, it needs to be copied from Oracle to the archive data store. This process is known as data extraction. While it may seem quite trivial to select rows from one data source and put them into another, it gets more complicated if we plan to guarantee the data and all of its relations are in tact. Most database designs do involve relationships via referential integrity constraints, and it is very important that all of that data is captured if we ever plan to access it again. What data items need to be archived? How should relationships be reflected in the archive? Are there relationships (perhaps coded in the application) that are not represented in the database? There are many decisions to be made concerning how and what actually needs to be archived.
Operationally, our mission is to take rows out of our production tables and place them into an archive. What are the different options that we have when removing the rows from Oracle tables? Below are some different row removal options - the best one is determined by the nature of the application architecture, the data structures and the quantity of data to be archived.
- SQL Delete. The most obvious approach, but one that might not always be the best fit. A SQL delete statement will let the database do all of the work for us. It will also allow Oracle to perform cascade deletes where referential integrity constraints are enabled. If integrity constraints do exist, it is important that foreign key columns are indexed. Otherwise, a full scan will be performed for each parent row, and that is very resource intensive.
- Truncate. The truncate command will remove all rows in a table or partition, and this process is not able to be undone (or rolled back). That is why it is so speedy; it does not require undo resources or rollback segments. Congratulations if your application lends itself to such a technique (but thats probably not the case).
- CTAS. Create Table As Select - This approach involves copying the rows that we want to keep into a temporary table and then either truncating the original table or dropping it and renaming the temporary one back to the original name. This technique works very well if the number of rows we want to keep is a small percentage of the table. Before you delete 100 million rows and keep only 100,000, just copy the 100,000 over and drop the table. Then rename the interim table back to the table name that was dropped, rebuild the indexes and you are done.
- Oracle Partitioning. If you have the partitioning option, Oracle provides some features that aid in aging data. Partitioning allows the separation of data by specifying a partition key column (usually a date). Each partition has its own storage characteristics, and older partitions (less-accessed data) can be placed on cheaper storage. Basic partitioning options have been around since Version 8 of Oracle and beginning in Version 11G, Oracle introduced Interval Partitioning in which it will automatically create a new partition when new data is inserted that doesnt match an existing partition a perfect rolling window. For example, Oracle could create a new partition for every month of the year, and a partition for February 2008 would be automatically created as soon as the first record for this month is inserted.
One would expect instant benefits after a large number of rows have been removed from a table things like queries executing faster and the space used by those rows be freed for other purposes. However, none of these things happen on their own, and now the database looks like Swiss cheese instead of provolone.
After a considerable number of rows have been deleted from a table, there are some things that should be done to increase efficiency of database operations. First, we will reset the high water mark (HWM) for the table. The HWM is the marker within Oracle that tracks the mark on each table that has contained data. In Version 9.2, this can be done by moving the table to another tablespace. The tablespace must be a locally managed tablespace (LMT) and the table can be moved into its same tablespace (as long as its a LMT). In Version 10, Oracle makes this a bit simpler for us via the SHRINK command, which does a basic reorganization in place and will reset the HWM.
Why do we want to reset the HWM? Any SQL operations that require a full scan will always scan to the HWM. If a table had 1 million rows and they were all deleted, any operation performing a full scan including the select count(*) command, would take the same amount of time because it scans all the way to the HWM.
After the HWM has been reset, we want to aid the optimizer even more by regenerating statistics for the table and indexes. These statistics include number of rows as well as other vital information to aid the optimizer in making intelligent decisions for how to process a query. This can be done with a simple call to DBMS_STATS, of which there is plenty of online documentation (outside the scope of this article).
Data archiving is a very important topic and one that should be planned at the start of every project, just like disaster recovery and performance. Given the phenomenal growth rates of data, most applications will require pruning data from operational databases either for compliance, internal business requirements or application performance. Once in a true data archive, the data can be retained for the specified time frame or automatically purged via a discard policy.
When deleting large amounts of data from Oracle tables, there are many considerations that must be explored. From resetting the HWM to gathering fresh optimizer statistics to resizing Oracle data files, the goal is to remove data from the database and then optimize so the database can perform better due to less data being stored.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access