We’ve all heard Big Data is coming. The raw marketing information from mouse-overs, key-clicks and in-store activities constitutes a potentially huge increase in total data flow, all of which has to be temporarily stored and processed, with the results saved. One of the challenges of the next wave of IT systems and software is the efficient and economical handling of this data.
Big Data is so-called “unstructured” data, typically small records consisting of an identifying key and a data field, which can vary in content type from record to record. This contrasts with records in typical applications and especially traditional databases, where the data is formally structured into a set of fields, which in aggregate constitute a record.
Generally, analysis of Big Data concentrates on large files that aggregate many key/data pairs. These are processed by parallel computer nodes or massively parallel GPU chips. Over the last two or three years, parallel processing has become a standard offering in public clouds, giving the datacentre a broad choice between in-house and public cloud workflows, including a hybrid approach where excess workload is “cloudbursted” to a public cloud.
These approach choices underline the need for a good data strategy in a modern datacentre. The old view of data as primary or secondary isn’t adequate for the complexity of a hybrid cloud, providing neither agility nor optimal economics. This is especially true of Big Data.
We need additional characterizations for data, adding hot/cold, transient/long-term and high/low security to our view. Let’s look at a couple of examples. Mouse-over data is intrinsically somewhat valuable, since it indicates potential customer interest in a product or supplier, but that value diminishes rapidly over time. In fact, the derived analysis that a customer is interested in gas grills is much more valuable and longer lasting. Thus the raw data goes from hot/transient/low-security to cold/disposable/low-security right after it is processed, while the processed results may be hot/medium-term/medium-security. All the while, the customer ID data is hot/long-term/high-security.
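To make that three-axis view concrete, here is a minimal Python sketch of the classifications used in the examples above. The class and function names are illustrative assumptions, not part of any product or standard; only the labels themselves come from the text.

```python
from dataclasses import dataclass
from enum import Enum


class Temperature(Enum):
    HOT = "hot"
    COLD = "cold"


class Lifetime(Enum):
    TRANSIENT = "transient"
    DISPOSABLE = "disposable"
    MEDIUM_TERM = "medium-term"
    LONG_TERM = "long-term"


class Security(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class DataClass:
    """One point in the temperature / lifetime / security space."""
    temperature: Temperature
    lifetime: Lifetime
    security: Security

    def label(self) -> str:
        return (f"{self.temperature.value}/{self.lifetime.value}/"
                f"{self.security.value}-security")


def classify_raw_events(processed: bool) -> DataClass:
    """Raw mouse-over data is hot and transient until analysed,
    then cold and disposable; its security level stays low."""
    if processed:
        return DataClass(Temperature.COLD, Lifetime.DISPOSABLE, Security.LOW)
    return DataClass(Temperature.HOT, Lifetime.TRANSIENT, Security.LOW)


# Hypothetical constants for the other two examples in the text.
ANALYTIC_RESULTS = DataClass(Temperature.HOT, Lifetime.MEDIUM_TERM, Security.MEDIUM)
CUSTOMER_IDS = DataClass(Temperature.HOT, Lifetime.LONG_TERM, Security.HIGH)
```

A storage policy engine could key its placement decisions on labels like these rather than on the old primary/secondary split.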
Deciding how and where to store this data is thus a bit complex. What follows should shed some light on how best to approach this.
Let’s assume that the chosen architecture for the solution is a hybrid cloud. Over 70% of enterprises have made that choice. This choice runs immediately into a major data positioning issue. Work that is processed across both a public and a private cloud requires data to move between them and, with our woefully slow WANs, that movement is a critical issue in cost and efficiency.
The most feasible current solutions are either to partition the data, much like the “map” process of Hadoop, so that part of the data is in the public cloud and the rest in the private space, or to keep all raw data in the cloud and use in-house cache controllers to hide the latency this creates.
The first of these, mapping the data, actually makes sense for Big Data itself, since a good vehicle for sharing that data is to store it in the cloud straight from its point of origin, making it available for a variety of analytics including SaaS-based solutions. Caching works best for data that must be shared between the clouds and for structured data that has multiple hits on the same records within a short timeframe. Caching is therefore the likely choice for analytic results and for other quasi-static data, including databases of customer information.
Moving customer information into the public cloud raises security questions. Generally, opinions now hold the cloud to be more secure than local private storage. Cloud service providers can afford to spend a lot more on security issues than a typical large company.
Storing Big Data is usually done with object storage, simply because the storage paradigm maps well to large files. Some have suggested Hadoop file storage, but, while the key/data filing scheme fits both data and processing, it isn’t very efficient or fast in the cloud. It’s generally better here to convert from object format to key/data in-memory.
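As a rough illustration of converting from object format to key/data in memory, the Python sketch below takes the bytes of one stored object and builds an in-memory key/data map. It assumes, purely for the example, that each record is a UTF-8 line of the form key, tab, data; real objects may use a different layout, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def load_key_data(blob: bytes) -> Dict[str, List[str]]:
    """Turn one object's bytes into {key: [data, ...]} held in memory.

    Assumed record layout (not from the article): one record per line,
    key and data separated by the first tab character.
    """
    records = defaultdict(list)
    for line in blob.decode("utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, _, data = line.partition("\t")
        records[key].append(data)
    return dict(records)
```

The object is fetched once as a large sequential read, which suits object storage, and the key/data grouping happens in memory next to the analytics code.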
Structured data from apps or databases should use traditional formats, mostly residing in old-school block-IO or filer structures. With gateway software from companies such as Zadara or Velostrata, this can be located in the cloud, with the gateway caching the datasets to obviate latency. The cloud-based storage solution allows cloudbursting to be started up very quickly and also brings clear benefits in disaster recovery and data sharing with SaaS applications.
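The caching idea behind such gateways can be sketched as a small read-through cache with least-recently-used eviction. This is a generic illustration in Python, not how Zadara, Velostrata or any other vendor implements its gateway; the class name, the `fetch` callback and the block identifiers are all assumptions.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Keeps recently read blocks locally; misses go to the remote store."""

    def __init__(self, fetch: Callable[[str], bytes], capacity: int):
        self._fetch = fetch          # slow call across the WAN (assumed)
        self._capacity = capacity    # maximum number of cached blocks
        self._blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, block_id: str) -> bytes:
        if block_id in self._blocks:
            self._blocks.move_to_end(block_id)  # mark as most recently used
            self.hits += 1
            return self._blocks[block_id]
        self.misses += 1
        data = self._fetch(block_id)
        self._blocks[block_id] = data
        if len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)    # evict least recently used
        return data
```

The cache only pays off when the same records are hit repeatedly within a short timeframe, which is why it suits structured and quasi-static data better than use-once raw data.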
Data protection in the cloud should come from zone-crossing replication. Zones are different geographical areas in the public clouds. This is common-sense disaster insurance, though there might be some associated cost for transferring replicas. The new erasure code approach may reduce the cost of this considerably.
While your use case will likely differ on details, most Big Data is a use-once item. Once it’s ground its way through the analytics process, it is all used up and can be discarded. This points to one copy in each of two zones being adequate, rather than the normal two local copies plus one in another zone.
The analytic results deserve a normal data protection level, with the ability to take multiple hits across the zones. For block-IO data, RAID parity is marginal these days, since large drives take a long time to rebuild. Erasure coded protection is a better choice. More advanced protection of results data involves snapshots and copying data to an incremental backup system. These could reside in the cloud or in-house.
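The capacity argument for erasure coding over straight replication is simple arithmetic, sketched below in Python. The 10+4 fragment layout is an illustrative assumption rather than a recommendation from any provider, and the calculation ignores zone placement and transfer charges.

```python
def replication_overhead(copies: int) -> float:
    """Raw capacity consumed per unit of user data with full replicas."""
    return float(copies)


def erasure_overhead(data_fragments: int, parity_fragments: int) -> float:
    """Raw capacity per unit of user data for a k+m erasure code.

    The data is split into k fragments and m parity fragments are added;
    any k of the k+m fragments are enough to rebuild the original.
    """
    return (data_fragments + parity_fragments) / data_fragments


three_copies = replication_overhead(3)   # two local copies plus one remote
two_copies = replication_overhead(2)     # one copy in each of two zones
ten_plus_four = erasure_overhead(10, 4)  # hypothetical 10+4 layout
```

Under these assumptions the 10+4 layout stores 1.4 units of raw capacity per unit of data and survives the loss of any four fragments, against 3.0 units for the usual three-copy scheme.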
Realistically, though, not all businesses will feel confident about committing master data to the public cloud. For these, the issue is building an economic and high-performance repository in-house that can be available to the public cloud as needed. In a sense, this is turning the discussion 180 degrees around. Caching will be needed in virtual instances in the public cloud and software to access those caches is also essential.
With cloud instances today, this will probably mean more cache instances than would be seen in the first scenario, where the master data sits in the cloud, but the model should still work. Those instances should be large-memory, high-network-bandwidth instances.
Storage in the local cloud now becomes the issue. Object storage could be whitebox appliance based, with object storage software from Red Hat (Ceph), Scality or Caringo. With the private cloud segment based on OpenStack, the other storage probably comes from that source. An alternative to the whitebox appliance is to use hyper-converged gear, which shares server storage across the whole cloud.
We haven’t dealt with hot, warm and cold data. Data’s value over time tends to drop rapidly. Moving data automatically (auto-tiering) from expensive ultra-fast solid-state drives to slower storage is a sensible imperative for any configuration. As anyone looking at all-flash arrays learns, these units typically outperform the servers they connect to, allowing the arrays’ excess bandwidth to be used to compress data before moving it to slower secondary storage. With a reduction of around 80% being typical, and sometimes much more, compression can save a fortune on both capacity costs and retrieval costs in the cloud, while being much faster in-house and reducing capital expense drastically.
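A minimal sketch of the compress-before-tiering step, using Python’s standard zlib module, looks like this. The function names and the sample record are assumptions for illustration, and the reduction actually achieved depends entirely on the data.

```python
import zlib
from typing import Tuple


def compress_for_cold_tier(payload: bytes, level: int = 6) -> Tuple[bytes, float]:
    """Compress a payload and report the fraction of space saved."""
    packed = zlib.compress(payload, level)
    if not payload:
        return packed, 0.0
    reduction = 1 - len(packed) / len(payload)
    return packed, reduction


def capacity_after(raw_tb: float, reduction: float) -> float:
    """Capacity still needed once a given reduction has been applied."""
    return raw_tb * (1 - reduction)


# Highly repetitive clickstream-style records compress very well.
sample = b"customer=42;item=gas-grill;event=mouseover\n" * 1000
packed, saved = compress_for_cold_tier(sample)
```

At the 80% reduction quoted above, `capacity_after(100.0, 0.8)` gives roughly 20 TB to store and retrieve instead of 100 TB.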
Cold data should end up in the cloud as an archive. Long-term, low access storage is very cheap and is now disk-based to allow relatively fast retrieval. All of these compression and auto-tiering approaches should be software based and fully automated. A policy system can ensure the data is treated appropriately.
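A policy system of this kind can be as simple as an ordered list of age thresholds. The Python sketch below is one hypothetical shape for it; the 7-day and 90-day limits and the tier names are assumptions chosen for the example, not figures from any vendor.

```python
from datetime import timedelta

# Ordered rules: data younger than the limit stays on that tier.
# Both thresholds are illustrative assumptions.
TIER_RULES = [
    (timedelta(days=7), "flash"),
    (timedelta(days=90), "disk"),
]
ARCHIVE_TIER = "cloud-archive"


def choose_tier(age: timedelta) -> str:
    """Return the tier a dataset of the given age should live on."""
    for limit, tier in TIER_RULES:
        if age < limit:
            return tier
    return ARCHIVE_TIER
```

A scheduled job could call `choose_tier` for each dataset and trigger compression and migration whenever the answer differs from the dataset’s current location.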
The cloud is still evolving and new approaches to these issues are bound to arise. This should make future configurations much more capable of getting alpha from those nuggets of Big Data.