Big Data

Posted by Aoife McMonagle on Mar 11, 2014 4:30:00 AM

By Michael Traves, Solution Architect at Scalar.

The volume of data being created is increasing exponentially, doubling every two years, and 90% of it is unstructured. While social media has certainly contributed to that trend in recent years, machine data is fast taking over as the main contributor. Anything with a sensor - and most devices have many - is creating data with inherent value that can no longer simply be discarded. Data generated in real time, such as Twitter and Facebook feeds, creates new challenges for analytic frameworks, while the explosive growth of unstructured data such as documents, pictures, video, and logs puts stress on traditional storage infrastructures. The velocity, variety, and volume of data being generated simply cannot be stored and analyzed in a cost-effective manner using traditional means.

Traditional storage solutions, built as controller-pair high-availability nodes, often included proprietary hardware ASICs to improve performance and reduce latency when serving data to applications, along with a tightly coupled proprietary operating system. Depending upon your application and workload performance and capacity requirements, an appropriately sized controller pair (CPU, memory, I/O profile) would typically be chosen, along with an appropriate number and type of disk spindles to meet the required specifications. While these solutions continue to dominate in environments where traditional applications drive capacity, performance, and growth projections, they fall short when massive scalability is required to meet both short- and long-term data volumes. The traditional scale-up paradigm of replacing controller pairs with larger ones (more CPU, memory, and I/O) to meet additional performance and capacity requirements cannot reach the scale required for Big Data and is no longer economical.

What’s changed in recent years is the economics of retaining data. While it is certainly true that the volume of data being generated has increased dramatically, the cost of retaining it has always been the limiting factor. Today’s storage economics allow significant capacities to be kept on disk for online access and, when coupled with compute and memory in a low-cost, commodity scale-out architecture, enable companies to retain and analyze all of their data rather than only the subset that might fit into a proprietary data warehouse or archival solution. The decision on whether to retain or discard data comes down to the value of the information that can be extracted from it versus the cost of retaining it and performing that analysis. Costs are decreasing, and the analytic approaches to extracting useful information from the data continue to improve.
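As a back-of-the-envelope illustration of that retain-versus-discard calculation, the short Python sketch below compares the yearly cost of keeping and analyzing a data set against the value expected from it. Every figure in it is an invented assumption for illustration, not a number from this post.

# Illustrative only: a simple retain-vs-discard check.
# All figures (cost per TB, analysis cost, expected value) are assumptions.

def retention_worthwhile(capacity_tb, storage_cost_per_tb_year,
                         analysis_cost_per_year, expected_annual_value):
    """Return True if the expected value of the insights exceeds
    the yearly cost of keeping and analyzing the data."""
    annual_cost = capacity_tb * storage_cost_per_tb_year + analysis_cost_per_year
    return expected_annual_value > annual_cost

# Example: 500 TB of machine data on commodity scale-out storage.
print(retention_worthwhile(capacity_tb=500,
                           storage_cost_per_tb_year=50,      # assumed commodity price
                           analysis_cost_per_year=20000,     # assumed cluster/operations cost
                           expected_annual_value=100000))    # assumed business value
# -> True: at these assumed prices, retaining the data pays for itself.

As commodity prices continue to fall, the break-even point increasingly tips toward keeping the data, which is exactly the trend the rest of this post builds on.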

Commoditization within the x86 market has dramatically lowered compute, memory, and storage costs to the point where it is now economical to retain what we once threw away. Moreover, by coupling intelligent data management software with this hardware as a scale-out storage platform, we take advantage of that commoditization, driving storage costs down and data retention up. Hence the birth of scale-out clustered NAS and object storage solutions, built on a foundation of commodity hardware and coupled with intelligent software that provides the reliability, availability, and serviceability typical of traditional scale-up technologies.

Big Data is more than just the data itself, however - it is about using that data to create information through analytics. Addressing the computational and capacity demands of analytics requires a re-architecture of traditional BI platforms. Unpredictable data structures dictate that solutions be flexible in how they store and retrieve data for analysis, while supporting ad hoc queries for questions not known at the time the data is captured. A software stack that distributes analytic queries across this scale-out cluster, providing high parallelism while keeping execution local to the data, is key to linear scalability in capacity and compute while keeping costs low.

One such framework is Hadoop MapReduce. Hadoop is an open-source software framework distributed as an Apache project; it includes the core framework for distributing data across a scale-out storage/compute architecture, the Hadoop Distributed File System (HDFS), and the associated analytic framework for executing parallel queries, MapReduce. Hadoop can store any data type, including structured, semi-structured, and unstructured, and provides flexibility through a schema-on-read approach to queries rather than the fixed schema typical of RDBMS and data-warehouse appliances. This effectively eliminates the need to redesign data structures and indices to accommodate new data types or answer new questions, a time-consuming process that provides little flexibility. By storing multiple data types in their raw format and allowing each query to define the data schema on read, new questions may be asked without infrastructure upheaval, greatly improving time-to-market and enabling creative 'what if?' analysis by non-traditional users.
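As a rough illustration of how MapReduce and schema-on-read fit together, here is a minimal Hadoop Streaming job written in Python. The raw JSON event logs, the 'event_type' field, and all paths are hypothetical, and Hadoop Streaming is only one of several ways to write MapReduce jobs, so treat this as a sketch rather than a prescribed approach. The mapper imposes structure on the raw lines only at read time, and the reducer tallies events per type, with Hadoop distributing both so that each node works on the data stored locally.

#!/usr/bin/env python
# mapper.py: schema-on-read - parse raw JSON log lines as they are read
# and emit one "event_type<TAB>1" record per event.
import sys
import json

for line in sys.stdin:
    try:
        record = json.loads(line)      # structure is imposed here, at read time
    except ValueError:
        continue                       # tolerate malformed lines in the raw data
    print("%s\t1" % record.get("event_type", "unknown"))

#!/usr/bin/env python
# reducer.py: Hadoop Streaming delivers mapper output sorted by key,
# so a running total per event type is enough.
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, count))
        current_key, count = key, 0
    count += int(value)
if current_key is not None:
    print("%s\t%d" % (current_key, count))

A job like this would be submitted with the Hadoop Streaming jar, along the lines of: hadoop jar hadoop-streaming.jar -input /logs/raw -output /logs/event_counts -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py (the paths here are placeholders). The key point is that the raw logs were never reshaped into a fixed schema before being stored; a different question tomorrow simply means a different mapper.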

Scale-out compute and storage architectures are quickly becoming the new standard for deploying data-centric applications with massive scalability and retention requirements, enabled by the commoditization of the x86 architecture. Whereas traditional storage solutions tightly coupled the storage hardware and operating system to provide predictable performance and RAS, new solutions that decouple the software from the hardware are being designed around scale-out infrastructure components, with Software Defined Storage as a fundamental building block. This approach allows the creation of capacity-dense object storage solutions with geographic redundancy just as easily as an analytic cluster for massive data stores, all using the same low-cost hardware.
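To make that decoupling concrete, the short sketch below stores and retrieves an object through a generic S3-compatible interface, which is typically how an application consumes a software-defined object store regardless of the commodity hardware behind it. The boto3 client, endpoint, bucket, and credentials are illustrative assumptions, not details of any particular solution discussed here.

# A minimal sketch, assuming the software-defined storage layer exposes an
# S3-compatible endpoint. Endpoint, bucket, and credentials are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.com",   # placeholder endpoint
    aws_access_key_id="ACCESS_KEY",                    # placeholder credentials
    aws_secret_access_key="SECRET_KEY",
)

# Store a raw log file; the storage software decides placement and redundancy
# (including geographic replication) across the commodity nodes.
with open("device-42.json", "rb") as f:
    s3.put_object(Bucket="sensor-logs", Key="2014/03/11/device-42.json", Body=f)

# Retrieve it later for analysis; the application never sees the hardware.
obj = s3.get_object(Bucket="sensor-logs", Key="2014/03/11/device-42.json")
data = obj["Body"].read()

Because the application only ever talks to this interface, the layer underneath is free to spread replicas across racks or regions on whatever low-cost hardware is available.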

As a leading provider of technology solutions and consulting services, Scalar is well versed in a variety of approaches and solutions from multiple vendors for building scale-out storage and compute environments designed to meet your data retention and analytic requirements. Through further dialogue, workshops, and technology reviews, Scalar will work with you to understand your specific requirements, select the best solution and technology to meet your needs, and deliver compelling ROI to the business.

Topics: Big Data, Data Management, Storage