I started my career as an Oracle developer, and the firstthing I liked about it was the structured query language, though it was after ayear or so when I realized that SQL was not proprietary to Oracle.
An offshoot of the 70s Sequel standard, SQL has matured andis still the most widely used language by database users around the world. Thepart I liked most about SQL was its simplicity (basic English words), but atthe same time, it has the capability of providing answers to most of the complexproblems. SQL is one of the easiest languages used today both to learn and touse. ANSI-SQL is a standard used by virtually all the major relationaldatabases such as Oracle, DB2 and Microsoft SQL Server, which greatly improves portability,in case the customer decides to switch from one database to the other.
I enjoyed creating complex Oracle programming language/SQL codefor approximately five years, before I was introduced to an ETL tool. Being anexpert in SQL, my Initial ETL mappings were Source Structure -> Complex SQL-> Target, with all the complex transformation logic written in SQL.
I rarely used many of the excellent features which the ETLtool provided. Dont get me wrong, I am not advocating SQLs to be used insteadof an ETL tool, because ETL tools have much more to offer than just SQL. Infact, I am a very big proponent of ETL tools because of their ease of use, graphicaluser interface, and the ability to connect to different environments for extractionand load. Many ETL tools today have the ability to read data from disparatesource systems like COBOL files, flat files, XML, databases and to an extentsupport real-time data capture.
Some Drawbacks of ETL
However robust an ETL tool is, the shortcoming of an ETL toolis that it is still at times unable to figure out how to write an efficient SQLquery and get the best performance out of the database. I have experienced thesekinds of issues in data warehousing projects where the volume of data in someof the tables is in billions, maybe trillions. The drawbacks are due to thegeneric way ETL tools are built to cater to all kinds of Databases and thus notable to use certain features of databases which would increase the performanceof an ETL process. The strength of the SQL, if used intelligently, can increasethe performance and reduce complexity of ETL mappings.
Types of Database Commands
The Standard Databases have three major types of commands,DDL, DML and DCL.
Data Definition Language (DDL) - Used to define the databasestructure (e.g., create, alter, drop etc.)
Data Manipulation Language (DML) - Used for managing data (e.g.,select insert, update, and delete etc.)
Data Control Language (DML) - Used for managing security (e.g.,Grant, Revoke etc.)
The Built-In Database ETL
An Insert with Select statement is the best and basicexample of ETL, where extract and transformation is performed by the selectstatement and the load is performed by the insert statement. In the below statement, the calculation of bonusis a very good example of transformation applied at the SQL query level.
E.g., Insert into EMP_BONUS select empno, Sal, (Sal*0.1)Bonus from EMP;
There are many complex transformations you can perform atthe database side with the help of SQL featureslike CASE statement, which works as an If ,Else if, and Else logic. Ihave used this feature many times to create complex transformation logic at thedatabase side.
E.g., Select Empid,
CASE WHEN sal < 500 THEN POOR
WHEN sal between 500 and 1000 THEN MID
WHEN sal > 1000 THEN RICH
END As Status
Some Real-Life Experiences
I would like to share an experience where SQL played a majorpart in providing a critical solution on an engagement for a large client, whenit mattered most. In one of the ETL mappings, we had logic to provide thelatest record for a particular ID based on timestamp. This logic was replicatedin many jobs via a reusable component. We had all the logic tested andvalidated in our development area before moving it to a test area for UAT.Suddenly we realized that the tool was behaving rather oddly and provided thewrong records for some IDs, although it worked perfectly in development. Justtwo days to go before the client team would start the UAT, an entire day wasspent trying to debug the issue in the tool, without success. Afterbrainstorming for hours, I finally suggested, Lets use the analyticalfunction feature in SQL query and get the records correctly.
Analytics Rocks. I realized how valid this statement wasfrom the great Oracle Guru Thomas Kyte was. The team started rewriting theentire logic at the database side using SQL query and the Analytical functionfeatures. The output from this query gave us the perfect answers, and we wereready with the UAT.
I have reproduced a simple sample logic which provided usthe answer and demonstrates the power of SQL.
SELECT a.Custid, a.Trans_date
Row_Number() OVER(PARTITIONBY Custid ORDER BY custid, Trans_date desc) Rowno
WHERE a. Rowno = 1;
Apart from the Row_Number function used in the example,there are many other analytic functions that are very useful and can be used inqueries, for example rank and dense rank, lead and lag etc.
I wont get into the detail of each function, because as youcan gather information related to the function along with examples with anInternet search.
The Great Escape
I will use another example from my career to demonstrate theimportance of SQL knowledge. While working for a big pharmaceutical client, ourteam created an ETL job which was working beautifully in terms of performance indevelopment and test environments. It had passed all the stress tests and receivedUAT signoff from the client. The code was moved to production, and on the veryfirst run, the ETL job was crawling on its knees. The team did not understandwhat went wrong, and after lots of firefighting, the team realized that theoptimizer setting in the production environment was set as rule based, whereasthe development and test environment was cost based. Because of this, theoptimizer was not using an index, which should have ideally been used formaking the query access fast. With not enough time to tamper with the optimizersettings on the production environment and the criticality involved in changinga database setting, the solution we arrived at was to create a view with a hintin the select statement, so that the query uses the desired index. The creditgoes to the team who had this excellent knowledge on SQL and knew how to makeuse of the hint in the SQL.
With all these advantages, I would also like to highlight aproblem with regards to using complex SQL in ETL. It will not be an easy task toprovide clear, end-to-end logical flow in the ETL for companies who are planningfor metadata management.
In my career span, I have met many ETL developers, gettinginto situations where the ETL tool is not able to provide any help and are totallyblank, when they have to go behind the screen and work on the database, mainlyon SQL.
My experience has shown that an ETL developer shouldunderstand the database nitty-gritty. The developer should understand the basicconcepts of database like data types, views, indexes, partitioning, etc. Theseconcepts can help you in situations like when the ETL tool is providing lowperformance and every tips and trick for the tool has been exhausted.
I would recommend developers build expertise not only in ETLtools but also in SQL, which at times may act as a backup in situations where theETL tool is not able to provide the solution.
The key point I want to convey is the importance of SQLknowledge for the ETL developer. My main intention is to create awareness ofhow SQL and ETL tools are closely connected and can be used to make your ETL projecta success.
Register or login for access to this item and much more
All Information Management content is archived after seven days.
Community members receive:
- All recent and archived articles
- Conference offers and updates
- A full menu of enewsletter options
- Web seminars, white papers, ebooks
Already have an account? Log In
Don't have an account? Register for Free Unlimited Access