In firms where the majority of systems are outsourced, the real challenge is data management. Vendor systems don't necessarily talk to each other that well; many still prefer CSV or XML files as the primary interface. Often, overlapping data exists in each vendor system under different names. In these firms, gluing the systems together becomes the major IT challenge. Over time, the glue - extract, transform and load (ETL) - can become onerous, going through multiple revisions and data formats as system versions change. Commercial ETL tools are powerful but expensive, and to use them fully you need professional services or experienced staff. For firms lucky enough to agree on and standardize on one ETL tool, it becomes the new application around which much of the IT work happens. But what if the ETL logic could become data itself, fluidly and flexibly adapting to changing conditions (read: business requirements)? What if you could avoid the lead time of ETL infrastructure upgrades too? This article shows how the glue can stay pliable yet strong through data-driven ETL.

Most firms maintain a reasonable balance between internally developed applications and vendor systems, but occasionally you encounter a firm that sits at one extreme. The focus here is the set of unique challenges facing firms that have outsourced the majority of their systems. What observations can we reasonably make (in comparison to firms favoring internal development)?

  • Change requests (regarding data feeds) are often developed and managed by the application vendors (and are prioritized and costed with other requests).
  • IT staff is usually more geared toward managing coordination and support with the vendors (as opposed to serious application development).
  • There is less-than-complete understanding of what happens to the data traveling between and transformed within the vendor systems.
  • There are more variations of the same data, such as different names and formats for the same entities in different systems – e.g., many versions of the same customer.

The major IT work becomes gluing the systems together – the data starts in one system and usually travels to other systems, where it is enriched/analyzed and then travels again. The lowest common denominator is still CSV or XML files. Files fly around, usually on a batch schedule, either via FTP or dropped in a shared folder for pickup by the destination system.
Any reasonably sized organization is going to have a few major systems and a plethora of suppliers and consumers connected to them. The cumulative ETL code can turn into a monster to manage. Even for firms lucky enough to agree and standardize on one ETL tool, it becomes the “internal” application around which much of the IT work happens. (But if the ETL development and support itself is outsourced, then one can characterize it as a vendor app.)

ETL platforms are certainly maturing. Many come with graphical tools, rich libraries of functions and full-blown development languages. Sure, the ETL logic is generally centralized and organized when you use a tool. But these platforms are expensive and, in my observation, require just as much time to develop a solution as good old Java or .NET. To be fair, these ETL tools are particularly good at providing infrastructure for different messaging methods. But some vendors upgrade their systems as frequently as quarterly, necessitating a review of, or enhancement to, the data formats of the files. An upgrade to the ETL logic requires the usual requirements, design, code, test and approval cycle if you are following the waterfall approach. Change requests compete with other priorities and usually don't make their way into production for quite some time. As a result, business users are tempted to apply workarounds to deal with the deficiencies. This scenario plays itself out over and over as the upgrade and compatibility cycle continues. And don't discount that with the power of commercial ETL packages comes the need for experienced (expensive) practitioners.

One might resist the metadata solution I am about to describe because “once you set up interfaces, they never change.” In a perfect world, I'd agree, but in my experience, tweaks are common, whether it is a system upgrade, a missed requirement, an enhancement the business wants or even a defect that made it through to production.

So, if constant change is a given (just reflect on how often business requirements changed in your last project, even at the last minute), why wouldn’t we want to try a different approach? The remainder of this article describes how a firm, choosing to heavily outsource its systems, can deal with the challenges noted.

If we were faced with this situation, what, ideally, would we want?

  • To insulate ourselves from version compatibility issues,
  • To respond to and implement change requests as quickly as possible,
  • To avoid needing specialized or hard-to-find expertise,
  • To avoid the time and cost of installing a large ETL platform, and
  • To understand our own data better.

The solution I propose is based on a metadata repository. The repository holds information about the entities involved in a data feed: applications (i.e., vendor systems), data elements (fields), interfaces (a source application plus a destination application), element maps (one or more data elements from the source system connected to one or more data elements from the destination system) and value maps (value translations from source to destination). Along with this, we have a small app that processes the metadata – call it a metadata engine. There are a few more pieces we'll need, but we'll cover those later. By breaking these out and turning them into data, we start to replace what is normally ETL code with data.
At a high level, the metadata engine runs through all the element maps to transform a source file into another format (or into a SQL table). The transformation “language” is SQL, something most developers and many business systems analysts know. So a simple transformation can draw upon SQL's rich set of functions, like dest_field = substring(src_fld, 2, 3). Although not mandatory, we may wish to put a graphical user interface on top of the repository, especially if non-IT staff will be supporting and managing the metadata. Ideally, you'd want a data management group or even a technical data steward to have access and get comfortable enough to make straightforward changes without getting into the IT work queue. But the solution can work just as well without a GUI - in that case, you'd need IT to perform updates as required.
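To make this concrete, here is a minimal sketch (in SQL Server-style T-SQL, though any database with dynamic SQL would do) of how stored transformation fragments might be assembled into a statement the engine executes. The table and column names are illustrative assumptions, not a prescribed design.

    -- Hypothetical element map rows: the transformation column holds a SQL fragment
    -- written against the source field names.
    --   source_elements    dest_element    transformation
    --   'security_name'    'sec_name'      'SUBSTRING(security_name, 2, 3)'
    --   'trade_dt'         'trade_date'    'CONVERT(date, trade_dt, 112)'

    -- The engine concatenates the fragments into one dynamic INSERT...SELECT and runs it.
    DECLARE @sql nvarchar(max) =
        N'INSERT INTO dest_staging (sec_name, trade_date) '
      + N'SELECT SUBSTRING(security_name, 2, 3), CONVERT(date, trade_dt, 112) '
      + N'FROM src_parsed WHERE is_valid = 1;';

    EXEC sp_executesql @sql;

Because the fragments live in the repository rather than in compiled code, changing a transformation is a data update rather than a build-and-deploy cycle.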

Consider a file, or data in a SQL table, that you need to load into another system. Here are the key components and entities of the solution, most of which are stored in the metadata repository (a rough schema sketch follows the list):

  • Application – a repository of data (e.g., a market data system).
  • Data element – a column or attribute in an application (e.g., the name of a financial security); each has a name, type and maximum length.
  • Interface – a source application, a destination application and a direction (data travels from the source application to the destination application); defines the file name or SQL table to process and, for files, whether the layout is positional (POS) or delimited (CSV).
  • Element map – a link from one or more data elements in the source application to one or more data elements in the destination application.
  • Filter rules – rules specifying which rows to include/exclude from the source document.
  • Element validations – rules that, if true, mark the data as invalid so it won't be loaded into the destination application.
  • Value map – discrete translations of specific values.
  • Metadata engine – reads and parses the source document according to the element map, validates it, performs transformations and/or value mappings and inserts it into the destination application.
  • Common data elements – basic elements common to your business domain (like trade date, broker, counterparty) that you can inherit for new interfaces.
  • Audit trail – every change is captured in a set of audit tables via simple triggers, much like a version control system.
  • Header/footer – parameterized header and footer lines for files written out to consumer systems, e.g., {current date} {number of rows}; the metadata engine replaces the placeholders with actual values.
  • Data elements can be used in multiple interfaces.
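As a concrete illustration, the core entities above might be stored in tables shaped roughly like the following minimal T-SQL sketch. Every name, column and type here is an assumption for illustration, not a prescribed schema.

    CREATE TABLE application (
        application_id  int IDENTITY PRIMARY KEY,
        name            varchar(100) NOT NULL,          -- e.g., 'Market Data System'
        is_active       bit NOT NULL DEFAULT 1
    );

    CREATE TABLE data_element (
        data_element_id int IDENTITY PRIMARY KEY,
        application_id  int NOT NULL REFERENCES application(application_id),
        name            varchar(100) NOT NULL,
        data_type       varchar(30)  NOT NULL,          -- e.g., 'varchar', 'decimal', 'date'
        max_length      int NULL,
        is_active       bit NOT NULL DEFAULT 1
    );

    CREATE TABLE interface (
        interface_id    int IDENTITY PRIMARY KEY,
        source_app_id   int NOT NULL REFERENCES application(application_id),
        dest_app_id     int NOT NULL REFERENCES application(application_id),
        source_name     varchar(255) NOT NULL,          -- file name or SQL table to process
        file_format     char(3) NULL,                   -- 'POS' or 'CSV' when the source is a file
        is_active       bit NOT NULL DEFAULT 1
    );

    CREATE TABLE element_map (
        element_map_id   int IDENTITY PRIMARY KEY,
        element_map_guid uniqueidentifier NOT NULL DEFAULT NEWID(),  -- stable key across environments
        interface_id     int NOT NULL REFERENCES interface(interface_id),
        map_order        int NOT NULL,                  -- element maps are ordered (allows chaining)
        source_elements  varchar(500) NOT NULL,         -- one or more source element names
        dest_element_id  int NOT NULL REFERENCES data_element(data_element_id),
        transformation   varchar(1000) NULL,            -- NULL = direct, a SQL fragment, or 'VALUE MAP'
        is_active        bit NOT NULL DEFAULT 1
    );

    CREATE TABLE value_map (
        value_map_id    int IDENTITY PRIMARY KEY,
        element_map_id  int NOT NULL REFERENCES element_map(element_map_id),
        source_value    varchar(255) NOT NULL,
        dest_value      varchar(255) NOT NULL
    );

Filter rules, element validations, header/footer definitions and the audit tables would follow the same pattern, each keyed to an interface or element map.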

The metadata engine would normally follow these steps (a procedural sketch follows the list):

  1. Read in the metadata from the repository.
  2. Read the source document into a “RAW” table (unparsed, one column).
  3. Apply filters to remove unwanted source data.
  4. Parse the data into fields.
  5. Perform element validations and mark failing rows as invalid.
  6. Apply transformations as defined in the element maps.
  7. Store the data in the destination location.
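Sketched in T-SQL for a hypothetical positional feed, a single run might look like the following. The table names, file path and field offsets are illustrative; a real engine would generate the parsing, validation and transformation statements dynamically from the repository metadata.

    -- Step 2: read the source document into a one-column RAW table.
    CREATE TABLE raw_feed (raw_line varchar(max));

    CREATE TABLE src_parsed (
        security_name varchar(30),
        trade_dt      varchar(8),
        qty           varchar(12),
        is_valid      bit
    );

    BULK INSERT raw_feed
    FROM '\\share\inbound\positions.txt'            -- illustrative file path
    WITH (ROWTERMINATOR = '\n');

    -- Step 3: apply filter rules to remove unwanted rows (e.g., header/trailer records).
    DELETE FROM raw_feed
    WHERE raw_line LIKE 'HDR%' OR raw_line LIKE 'TRL%' OR LEN(raw_line) = 0;

    -- Step 4: parse the positional (POS) lines into fields; the offsets would come
    -- from the data element metadata rather than being hard-coded as they are here.
    INSERT INTO src_parsed (security_name, trade_dt, qty, is_valid)
    SELECT SUBSTRING(raw_line, 1, 30),
           SUBSTRING(raw_line, 31, 8),
           SUBSTRING(raw_line, 39, 12),
           1
    FROM raw_feed;

    -- Step 5: element validations mark failing rows so they are not loaded.
    UPDATE src_parsed
    SET    is_valid = 0
    WHERE  TRY_CONVERT(date, trade_dt, 112) IS NULL;

    -- Steps 6 and 7: transformations and the final load are applied via dynamic SQL
    -- assembled from the element maps, as sketched earlier.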

Common change requests that I have observed, and that this solution handles easily, include (a sample of the corresponding metadata updates follows the list):

  • An additional data element (i.e., field) is added to the source application which needs to be loaded into the destination (additional data element and element map).
  • The length of one field is expanded.
  • A new value needs to be mapped (i.e., new portfolio code sent from the source system).
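Under the illustrative schema sketched earlier, requests like these typically land as a metadata row or two rather than a code change. The ids, codes and lengths below are made up for the example.

    -- A new portfolio code arrives from the source system: add a value map entry.
    INSERT INTO value_map (element_map_id, source_value, dest_value)
    VALUES (412, 'PRTF_EMEA_07', 'EMEA GROWTH');     -- illustrative id and codes

    -- A source field's maximum length has been expanded: update the data element.
    UPDATE data_element
    SET    max_length = 60
    WHERE  data_element_id = 118;                    -- illustrative id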

While the solution looks (and is) basic at a high level, here are a few insights that show it can tackle some tougher data feed challenges (an example of chained element maps follows the list).

  • Transformations can be direct (no transformation), well-formed SQL fragments (using data element names) or “value map.”
  • The result of a previous transformation can be used in a subsequent transformation (chaining).
  • Complex transformations can be achieved by applying multiple smaller transformations in sequence (element maps are ordered).
  • Data elements that exist in the destination but not the source can be assigned a literal or calculated value.
  • For source documents containing multiple record layouts, define one interface per layout.
  • To depivot data (where data exists as multiple columns, or fields, in the same row but each needs to be a separate row in the destination), create one interface per column to be depivoted.
  • Any component can be made active/inactive.
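For instance, chaining could be represented as ordered element map rows in the illustrative schema above, where a later map's transformation refers to the element produced by an earlier one. The ids, element names and fee multiplier are hypothetical.

    INSERT INTO element_map (interface_id, map_order, source_elements, dest_element_id, transformation)
    VALUES (7, 1, 'price,qty',    201, 'price * qty'),                      -- gross_amount
           (7, 2, 'gross_amount', 202, 'ROUND(gross_amount * 1.0025, 2)');  -- gross amount plus a 25 bp fee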

Error Handling

No data feed is complete without rejected records. These can result from syntactic and semantic errors, as well as failures of the element validations defined in the interface. Where should they go? In a fully developed solution, you'd have an exception queue where the error records sit. You'd associate exception codes with each record and give the user the ability to delete the record or to reprocess it after adjusting one or more fields (fully audited, of course). This is where the GUI makes a difference.
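One possible shape for that exception queue, again sketched in T-SQL with illustrative names:

    CREATE TABLE exception_queue (
        exception_id     int IDENTITY PRIMARY KEY,
        interface_id     int NOT NULL,
        raw_line         varchar(max) NOT NULL,      -- the rejected source record
        exception_code   varchar(20)  NOT NULL,      -- e.g., 'PARSE_ERR', 'VALIDATION_FAIL'
        exception_detail varchar(500) NULL,
        status           varchar(20)  NOT NULL DEFAULT 'OPEN',   -- OPEN / REPROCESSED / DELETED
        created_at       datetime2    NOT NULL DEFAULT SYSUTCDATETIME(),
        updated_by       varchar(50)  NULL           -- who edited, deleted or reprocessed it
    );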

Deployment

Using the GUI is sufficient for small changes, but large, new interfaces built in a development environment need an automated deployment tool to promote them to test and production environments. This can be constructed via stored procedures and linked servers (hint: some of the metadata engine code can be reused for this). Such a deployment tool should match on globally unique identifiers that stay constant across environments, not on identity-type columns.
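As an illustration, a promotion run executed on the target server might pull metadata from development over a linked server and match on the element_map_guid column included in the earlier sketch. The linked server, database and column names here are assumptions.

    -- Run on the target (e.g., test) server; [DEV_SRV] is an illustrative linked server.
    MERGE dbo.element_map AS tgt
    USING [DEV_SRV].etl_meta.dbo.element_map AS src
          ON tgt.element_map_guid = src.element_map_guid          -- GUID stays constant across environments
    WHEN MATCHED THEN
        UPDATE SET tgt.map_order       = src.map_order,
                   tgt.source_elements = src.source_elements,
                   tgt.transformation  = src.transformation,
                   tgt.is_active       = src.is_active
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (element_map_guid, interface_id, map_order, source_elements, dest_element_id, transformation, is_active)
        VALUES (src.element_map_guid, src.interface_id, src.map_order, src.source_elements,
                src.dest_element_id, src.transformation, src.is_active);

    -- In practice the foreign keys (interface_id, dest_element_id) would also be
    -- resolved through their own GUIDs, since identity values differ per environment.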

Version Control

Version control is a little different for metadata than it is for software. While many software systems use a four-segment number (major.minor.patch.hotfix), an interface might be better off with InterfaceChg.ElementMapChg.DataElementChg.ValueMapChg, roughly ordered by importance. And one would be best advised to increment the segments automatically via triggers.
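A trigger for one of those segments might look like the following sketch, assuming a hypothetical interface_version table that holds the four segments for each interface.

    CREATE TABLE interface_version (
        interface_id     int PRIMARY KEY,
        interface_chg    int NOT NULL DEFAULT 0,
        element_map_chg  int NOT NULL DEFAULT 0,
        data_element_chg int NOT NULL DEFAULT 0,
        value_map_chg    int NOT NULL DEFAULT 0
    );
    GO

    -- Bump the ElementMapChg segment whenever an element map is added, changed or removed.
    CREATE TRIGGER trg_element_map_version
    ON element_map
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;
        UPDATE v
        SET    v.element_map_chg = v.element_map_chg + 1
        FROM   interface_version v
        WHERE  v.interface_id IN (SELECT interface_id FROM inserted
                                  UNION
                                  SELECT interface_id FROM deleted);
    END;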

Necessary work goes along with a solution like this: file watchers, logging, user access control, configuration files (i.e., pointing to the right database) and archiving. Moving business logic out of commercial ETL tools and into the database makes the glue between systems more pliable and quicker and easier to adapt and maintain. But it doesn't replace software completely - you still need a database management system, a metadata engine and a few supporting stored procedures. You'll need a couple of good developers and data architects and, most importantly, senior management backing to go in this direction. Once established, this framework will put you back in control of your data feeds.