Continue in 2 seconds

CoSORT Tackles Major Korean ETL Project

  • December 01 2001, 1:00am EST

CATEGORY: DW Design, Administration & Management

REVIEWER: Byung-Chul, Min, adviser for NACF.

BACKGROUND: Korea's National Agricultural Cooperative Federation (NACF) was established in 1961 to increase agricultural productivity and enhance the welfare of five million member farmers in some 1,200 cooperatives. The NACF's business network handles marketing, processing, supply of farm inputs and consumer goods, credit and banking, insurance, warehousing, transportation, farm extension and relevant support activities such as research and publication. The NACF has a 40 percent share of Korea's agricultural market and is also one of the country's largest deposit banks.

PLATFORMS: The NACF's data warehouse is staged on a multi-CPU IBM RS/6000 S80 server running AIX 4.3.3.

PROBLEM SOLVED: The NACF's first attempt at a data warehouse project failed in 1999 because of the complexity of the database design and a tight schedule. The consultants had difficulty finding a proper ETL solution for handling the massive data volumes involved. The initial table data extracted from the main DB log was estimated above 100GB, and the data to be loaded exceeded 400GB. The total size of the data warehouse was approximately 1TB. In addition to the initial load, daily jobs needed to process 3GB of data and load 10GB to the data warehouse. NACF's second attempt at building the warehouse began in mid-2000, with successful project completion in the spring of 2001. By choosing and using CoSORT as a massive flat file manipulation solution, it took only two months to design and finish all of our ETL process after the database design was set up. Both our consultants and engineers were surprised at and grateful to CoSORT Korea for the implementation turnaround.

PRODUCT FUNCTIONALITY: The major features of CoSORT's sort control language (sortcl) program used in the NACF data warehouse are sorting, merging, joining (matching), data conversion, record filtering and summarizing (aggregation); in other words, flat file ETL. As the data sizes are very big, CoSORT's performance tuning options were naturally critical. Through resource control files, we can tweak how much memory each sort process will use (maximum and minimum amounts), shared memory, I/O buffer sizes, workspace assignments, and so on. User exits, as well as CoSORT's major features, were written and linked with sortcl to create tables using complex business logic.

STRENGTHS: CoSORT's speed in sorting and joining massive, multigigabyte files was the key to NACF's data warehouse success. CoSORT's built-in data type conversion feature (EBCDIC to ASCII) was also helpful and reduced the number of data transformation steps we need to perform.

WEAKNESSES: CoSORT was unable to perform joins on more than two files at a time. Our developers would like to see a multitable join within a single sortcl script; currently they accomplish this by piping scripts.

SELECTION CRITERIA: CoSORT was chosen because of the speed given to the process. Sorts of 1GB were completed in less than four minutes on our heavily loaded system. Massive file joins at I/O speed had another very strong appeal for NACF developers. NACF developers had reviewed a popular ETL tool prior to CoSORT, but gave up on it because of its slow sort and join performance and relative inconvenience and expense.

DELIVERABLES: Although CoSORT is actually a suite of sorting programs, third-party sort replacements (drop-ins) and API calls, we focused on its most feature-rich interface ­ sortcl. The sort control language is a mainframe sort and SQL-like 4GL for defining and manipulating various flat file formats. Whether data definition files, job specification scripts or some combination, sortcl scripts are text files we easily create and modify using VI or CoSORT's Java GUI for sortcl called gui2scl. The! GUI allows users with no sortcl knowledge to read, write, modify and launch sortcl job scripts on the client where the GUI is and/or a remote Windows or UNIX server. The user exits (in this case, the input procedure) have to be customized by the user, but the product is flexible enough to accept them in combination with natively supported commands. The sortcl program can produce all manner of structured, summary and detail reports, as well as handoffs to the database.

VENDOR SUPPORT: Both the developers at CoSORT/IRI in the USA and the assisting implementers at CoSORT Korea were highly responsive to our customization requests.

DOCUMENTATION: The online (and matching hard copy) documentation was translated into Korean, including many sortcl scripting examples. Basic usage of sortcl was so easy that the CoSORT manual was rarely used.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access