Posted by Jason Elizaitis ● Apr 14, 2017 5:16:37 PM
Quantamentals: Bridging from Financial Services to Big Data
Although the term “Big Data” leads many people to think solely in terms of data volume, big data is really about selecting appropriate data stores and appropriate tools to process data. For example, in the data center, the staff, budget, and complexity of operating services motivate choosing a few data stores, like the relational database, and then trying to get as much from those services as possible.
In the cloud, the economics shift. If you need one NoSQL table, you don’t need to hire a whole team to operate the service. The Cloud invites “And”: choosing one datastore, such as NoSQL, does not mean excluding all others. Unlike in the data center, the shared capital infrastructure also means that firms can choose the best data store for a specific use case. In the Cloud we find a multitude of options when deciding where to store and how to process data, and this alone can make a tremendous difference in how we approach data processing.
What motivates the many storage and data processing options we find in the Public Cloud? What tools and data stores are now fundamental to financial services? What are the underlying qualities of these “new” engines, tools, and stores?
Technology in the data center may progress, yet the “escape,” or migration, of infrastructure capital to the Public Cloud has now been ongoing for over ten years. No matter how big your organization is, you will simply not be able to keep pace. By all means run applications in the data center, but innovate in the Cloud and don’t pretend there’s an alternative. You may be doing something in the data center that’s important, but it’s not innovation.
Now let’s rethink processing via abstraction from the ground up, with continuous integration, serverless computing, and DevSecOps in mind. Think of DevSecOps as the integration of Development, Security, and Operations, but also as a “feedback loop” and pipeline that itself continuously emits information. In the data center, the release process is fraught with manual approvals based on very little empirical data. In the Cloud we build automated processes that capture and report information about code tests, vulnerability scans, and provenance as and when needed, without human intervention. Finally the computers are doing some work! It turns out people aren’t very good at scanning volumes of log files. So what is it people actually do when they approve code releases in the data center? How much evidence can they actually review?
We can begin by removing the physical layer, moving our underlying physical resources to the cloud. With GPU or FPGA processing available in addition to traditional processing resources, we gain elasticity for storage and compute, providing on-demand scale for massive data volumes. We also need to address the long-term availability of our data, so Information Lifecycle Management (ILM) is handled by multiple storage platforms that work together throughout the lifecycle of our information, from creation to destruction. In addition to capability, we free up time for our entire storage management team and reduce processing times for large batch cycles and for recovery from failed processing.
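As a rough illustration of lifecycle management across storage tiers, here is a minimal Python sketch. The tier names, the age thresholds, and the `lifecycle_action` helper are all assumptions made up for this example; real retention periods come from your own policy and your regulators, and cloud providers expose their own lifecycle configuration rather than this code.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class LifecycleRule:
    min_age_days: int  # rule applies once the object is at least this old
    action: str        # storage tier to use, or "delete"


# Hypothetical tiers and thresholds, for illustration only.
RULES = [
    LifecycleRule(0, "hot"),
    LifecycleRule(90, "infrequent-access"),
    LifecycleRule(365, "archive"),
    LifecycleRule(7 * 365, "delete"),
]


def lifecycle_action(created: date, today: date) -> str:
    """Return the action of the highest-threshold rule the object has reached."""
    age_days = (today - created).days
    if age_days < 0:
        raise ValueError("created date is in the future")
    action = RULES[0].action
    for rule in sorted(RULES, key=lambda r: r.min_age_days):
        if age_days >= rule.min_age_days:
            action = rule.action
    return action


today = date(2017, 4, 14)
print(lifecycle_action(today - timedelta(days=10), today))   # hot
print(lifecycle_action(today - timedelta(days=400), today))  # archive
```

The point of the sketch is that the policy is data: the same rule table covers an object from creation to destruction, so nobody has to move files between tiers by hand.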
As we approach the data layer, we can bring a broader variety of tools, in addition to old friends we know well, for processing and analysis. We can use an integrated toolset to manage both batch and real-time data processing. We can matrix inputs and outputs into stream processes, with sources and targets such as Message Queues, Data Warehouses, Real-Time Streams, or Files. In addition to our traditional tasks, we can rapidly and cost-effectively prototype new approaches and technologies like containers, serverless computing, the Hadoop ecosystem, and a variety of NoSQL, graph, and object datastores. Unstructured data can be processed using any number of tools, with the added benefit of not needing to create additional copies of our data. Again, this frees up innovation time for our DBAs, Data Architects, Data Scientists, and many others. Data transformation is more reliable, and the broader toolset allows innovation and optimization efforts to move forward more rapidly.
To address the veracity of data and our regulatory requirements, we follow our DevSecOps approach to data management and processing. Our pipeline, from “Infrastructure as Code” through the Software Development Life Cycle (SDLC), is automated using fully audited and repeatable processes to satisfy regulatory reporting burdens. Organizations subject to Sarbanes-Oxley (SOX), FINRA, or other auditing bodies can reduce the time spent on evidence production from weeks to moments. Auditing artifacts are created as an integral part of the development process.
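To make “auditing artifacts as part of the development process” a little more concrete, here is a minimal sketch of a pipeline step that records evidence for a release candidate. The `build_evidence` function, its field names, and the approval rule (no failed tests, no open vulnerabilities) are assumptions for illustration, not a regulatory standard or any particular tool’s format.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_evidence(artifact: bytes, commit: str, tests_passed: int,
                   tests_failed: int, open_vulnerabilities: int) -> dict:
    """Assemble one audit record for a release candidate (hypothetical schema)."""
    return {
        "commit": commit,
        # The hash ties the evidence to the exact bytes that were built.
        "artifact_sha256": hashlib.sha256(artifact).hexdigest(),
        "tests": {"passed": tests_passed, "failed": tests_failed},
        "open_vulnerabilities": open_vulnerabilities,
        # Assumed approval rule: nothing failed and nothing open.
        "approved": tests_failed == 0 and open_vulnerabilities == 0,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


record = build_evidence(b"example build output", "abc123", 42, 0, 0)
print(json.dumps(record, indent=2, sort_keys=True))
```

Because the record is produced on every run, evidence production becomes a query over stored records instead of a manual collection exercise.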
Addressing processing in this way allows us to perform continuous integration on our entire stack and master change control. Not only does this eliminate the fear of change that many shops have, it also provides cost optimization. Because we can scale horizontally or vertically as needed, we can process in the most efficient manner available, optimizing costs and time and allowing our operations staff to spend more time innovating and less time fearing change.
The cloud and Big Data processing go well together. The challenges of financial data processing are commonly represented by the “Vs”: volume, velocity, variety, and veracity. Big Data processing techniques and cloud services let us address those challenges easily and flexibly over a long time horizon, and enable innovation.
The amount of data that needs to be stored, analyzed, or reported is staggering. Think of all of the tapes, tape infrastructure, SANs, replicas, NAS, etc. that exist just for storing data at the physical level. Then all of the databases, analytics, and document data stores are stacked on top of those. Over time, we also need to retain more data for longer periods. The volume challenge speaks for itself.
The speed at which new data becomes available, or needs to be output, can vary from microseconds to a year and anything in between. Creating a system to manage, normalize, and process data arriving at these various speeds is challenging. Many duplicates of a data point are created as it moves through the processing environment, and maintaining consistency of the data and of the calculations used in processing is also an issue.
As we have evolved, more unstructured (text) data has become integral to the research process. If we include the massive volumes of data stored in our file systems, there are substantial amounts of unstructured data which have not typically been part of the numerically-driven financial community.
When performing research, not only do we need to trust our sources, but backtesting of algorithms also needs to be done rapidly and accurately. Beyond managing the truthfulness of our research, we face daunting regulatory requirements. From the data stored, to how it is processed, to trade and financial reporting, the regulatory bodies have strict requirements. Veracity of the data is important from both an input and an output perspective.
More data, and a desire to keep both original and historical data. Data does not fit into a relational database. Relational databases do not scale well for complex workflow applications. Physical infrastructure is finite; moreover, tightly coupled patterns make change difficult at best. We build it and leave it alone until we are ready to upgrade it.
Elasticity. Scale the capacity to ingest any amount of data. Scale the rate at which we can process requests for objects.
Ability to escape 3 refresh cycle, elastic storage, elimination of archival issues. Multiple data stores, at lower cost than data-center-managed data stores.
Speed from microseconds to quarters. The shift from batch to real-time views is as fundamental as the shift from procedural to object-oriented programming. Various platforms had to be integrated for final outputs. Objects are stored in a relational database regardless of fitness for purpose, because often the RDBMS is the only choice. Specific data stores, including streams, can scale out to ingest any amount of data and durably buffer that data so that it can be processed and replayed. Faster time to market than traditional ETL, and the ability to matrix inputs and outputs.
Working with streams enables “natural” modeling of changes in the state of complex systems. Labor-intensive backtesting process. Lean operations teams must shuffle data around and try to serve many applications using constrained storage capacity. Innovation and strategy must compete with increasingly complex regulatory data requirements. Integrated DevOps approach to data management, SDLC for models, and the ability to rapidly prototype.
Low communication and coordination noise, as web services are the contract. Rapid backtesting under more scenarios, and rapid innovation due to a low barrier to entry, while maintaining tight source control and auditing artifacts.
Stream and replay any amount of data at production volumes when testing strategies and new products.
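The stream-and-replay idea can be sketched in a few lines of Python. This is a toy, in-memory stand-in for a durable managed stream, and every name in it (`Tick`, `ReplayableStream`, `run_backtest`, the two signal functions, and the prices) is hypothetical. It only shows why an append-only, replayable log makes backtesting under several scenarios cheap: the data is recorded once and each strategy reads it again from the start.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List


@dataclass(frozen=True)
class Tick:
    symbol: str
    price: float


@dataclass
class ReplayableStream:
    """In-memory, append-only log; a stand-in for a durable managed stream."""
    log: List[Tick] = field(default_factory=list)

    def append(self, tick: Tick) -> int:
        """Store a tick and return its offset in the log."""
        self.log.append(tick)
        return len(self.log) - 1

    def replay(self, from_offset: int = 0) -> Iterator[Tick]:
        """Yield every stored tick in order, starting at from_offset."""
        yield from self.log[from_offset:]


def run_backtest(stream: ReplayableStream, signal: Callable[[float], int]) -> float:
    """Replay all ticks; hold signal(price) units from one tick to the next."""
    pnl = 0.0
    position = 0
    last_price = None
    for tick in stream.replay():
        if last_price is not None:
            pnl += position * (tick.price - last_price)
        position = signal(tick.price)
        last_price = tick.price
    return pnl


def buy_dips(price: float) -> int:
    return 1 if price < 100.5 else 0   # toy rule: long only below 100.5


def always_long(price: float) -> int:
    return 1


stream = ReplayableStream()
for p in (100.0, 101.0, 99.0, 102.0):  # made-up prices
    stream.append(Tick("XYZ", p))

# The same recorded data is replayed once per scenario.
print(run_backtest(stream, buy_dips))     # 4.0
print(run_backtest(stream, always_long))  # 2.0
```

In a real deployment the log would live in a managed streaming service and replay would start from a stored offset or timestamp, but the contract is the same: consumers never mutate the data, so any number of strategies can be tested against identical inputs.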