April 15, 2011 – A survey conducted by the Financial Technologies Forum in February found that 67 percent of banks, buy-side firms and sell-side firms are using public clouds for back-office applications, as well as for clearing, reconciliation and settlement.

In addition, specialized services for securities firms are cropping up from familiar names including Bloomberg, Cisco Systems, Microsoft and SunGard Data Systems. NYSE Euronext opened a multi-tenant data center in Mahwah, New Jersey, in August, where it aims to house the trading and matching engines of as many market players as possible.

The challenge is finding ways to bring on enough computing capacity, on-demand, while controlling costs.

This means the cloud, both in its private and public forms, will need a new generation of components to fully deliver on the promise of elastic computing.

Infrastructure that can’t “stretch” to meet the needs of volatile markets and deal with unusual circumstances like the May 6, 2010, Flash Crash won’t suffice. In cloud computing, it’s not acceptable for the network to become the bottleneck. Applications need uninterrupted access to huge flows of market data at all times, the processing power to analyze that data, and sub-millisecond performance to execute on it.

One emerging technology could help overcome these bottlenecks and enable fast, parallel data analysis spread across many cloud-based servers.

This is what is called a distributed data grid. It employs middleware in the cloud that lets applications work with large datasets hosted in memory across multiple virtual servers. This gives applications instant access to the data needed to perform their computations quickly.
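
In rough terms, a data grid behaves like a key/value store whose entries live in the memory of many machines at once. The toy Python sketch below illustrates only that idea; the GridClient class, its methods and its hashing scheme are assumptions made for the example, not any vendor's API.

    # Illustrative only: a toy "data grid" that partitions keys across
    # in-memory stores, the way a real grid spreads data over servers.
    # GridClient and its methods are hypothetical, not a vendor API.
    import hashlib

    class GridClient:
        def __init__(self, nodes):
            # Each "node" stands in for the memory of one virtual server.
            self.nodes = [dict() for _ in range(nodes)]

        def _node_for(self, key):
            # Hash the key to pick the node that owns it.
            digest = hashlib.md5(key.encode()).hexdigest()
            return self.nodes[int(digest, 16) % len(self.nodes)]

        def put(self, key, value):
            self._node_for(key)[key] = value

        def get(self, key):
            return self._node_for(key).get(key)

    # Price histories stay resident in grid memory between reads.
    grid = GridClient(nodes=8)
    grid.put("AAPL", [350.1, 348.9, 352.4])
    print(grid.get("AAPL"))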

In addition, these data grids can integrate a computational engine for performing map/reduce-style analysis of the data. In such a framework for working on large datasets, a master node on the grid takes in the input and divides it into smaller sub-problems. These are distributed to worker nodes. The master node then takes the answers to the sub-problems and combines them to create the final output.
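
Stripped of the distributed machinery, that pattern reduces to a map step and a reduce step. The Python sketch below uses local worker processes as stand-ins for grid worker nodes; the sub-problem (summing slices of a list) and the function names are illustrative, not drawn from any particular framework.

    # A minimal map/reduce sketch: the "master" splits the input, workers
    # handle sub-problems in parallel, and the master combines the answers.
    from multiprocessing import Pool

    def map_task(chunk):
        # Worker node: solve one sub-problem (here, summing a slice).
        return sum(chunk)

    def reduce_task(partials):
        # Master node: combine the workers' partial answers.
        return sum(partials)

    if __name__ == "__main__":
        data = list(range(1000000))
        chunks = [data[i:i + 100000] for i in range(0, len(data), 100000)]
        with Pool(processes=4) as pool:      # stand-ins for worker nodes
            partials = pool.map(map_task, chunks)
        print(reduce_task(partials))         # final output on the master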

This grid technology holds the promise of both accelerating performance and shortening development cycles.

Consider, for example, an application in computational finance that analyzes a large dataset of stock price histories for tens of thousands of ticker symbols to backtest and optimize stock trading strategies.

By structuring the analysis code using the map/reduce programming pattern popularized by Google search and by open source technologies such as Hadoop, the stock price histories can quickly be examined in parallel across a set of cloud-based virtual servers.

To minimize the time required for analysis, this application can store the dataset in a memory-based, distributed data grid instead of in disk-based storage. This avoids the need to migrate the dataset into memory for each analysis. By doing so, it can deliver both elasticity and fast turnaround time: users can quickly retune and retest their strategies while the dataset stays spread across the cloud from one test run to the next.
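
As a hedged sketch of that workflow, the Python code below keeps a much smaller set of simulated price histories resident in memory and re-runs a toy moving-average backtest for several parameter choices without reloading the data between runs. The data sizes, the strategy and the plain in-memory dictionary standing in for a distributed grid are all assumptions made for illustration.

    # Sketch only: `histories` stands in for price data held in a distributed
    # grid; the moving-average strategy and its parameters are hypothetical.
    from concurrent.futures import ProcessPoolExecutor
    import random

    random.seed(0)
    histories = {"SYM%05d" % i: [100 + random.gauss(0, 1) for _ in range(252)]
                 for i in range(1000)}   # a real grid would hold far more

    def backtest(args):
        symbol, prices, window = args
        # Toy strategy: hold the stock whenever price is above its moving average.
        pnl = 0.0
        for t in range(window, len(prices) - 1):
            avg = sum(prices[t - window:t]) / window
            if prices[t] > avg:
                pnl += prices[t + 1] - prices[t]
        return symbol, pnl

    def run_pass(window):
        # Map the backtest over every symbol in parallel, then reduce by
        # keeping the best performers; no data reload between passes.
        tasks = [(s, p, window) for s, p in histories.items()]
        with ProcessPoolExecutor() as pool:
            results = list(pool.map(backtest, tasks))
        return sorted(results, key=lambda r: r[1], reverse=True)[:5]

    if __name__ == "__main__":
        for window in (10, 20, 50):      # retune and retest, data stays put
            print(window, run_pass(window))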

This approach to hosting and analyzing data stands in contrast to the more conventional technique of keeping the dataset in disk-based storage, such as databases and file systems, and migrating data into memory each time a computation needs to run.

A recent performance study by consulting firm Lab49 examined the use of distributed data grids to perform map/reduce analysis in comparison to alternative techniques.

The study stored 4,096 stock price histories – each requiring 2 megabytes of storage – in a distributed data grid and backtested trading strategies against this data.

The distributed data grid was able to scale up immediately, whereas alternative approaches required data to be loaded into the servers first. This step added significant network overhead, which limited performance.

“Though adequate in situations involving relatively few or small inputs, the approach of requiring each grid node worker to retrieve its data from remote storage at runtime can generate debilitating levels of network traffic. This traffic can potentially saturate network bandwidth and hinder if not completely prohibit throughput gains normally associated with scaling out a grid,” the study stated.

Distributed data grids can hold multiple terabytes of data in memory, with computation performed in parallel across hundreds of virtual servers. That means quants in capital markets can make highly effective use of the cloud for quickly developing, testing and refining their models.

This column originally appeared on Securities Technology Monitor.
