
Get the Enterprise-class Hadoop you want from the infrastructure you’ve already got


How important is your data to you?

If the results and guidance you get from Hadoop-based analysis provide a significant benefit to your bottom line, then it clearly is important.  Data stored in Hadoop may start out as a small pilot project, but if the analysis repeatedly yields useful and profitable guidance, its importance to the organization grows accordingly.

The importance of data lies not only in the analysis itself; the data also embodies the bulk of the analysts' work.  As much as ninety percent of a data analyst's time is spent massaging and filtering data.  Does your data change over time?  Can the same transformations be applied to new data, or must they constantly be re-evaluated?  This preparation work can represent a significant investment embedded in the data itself.  It is not enough to be able to reproduce the raw data; you must also be able to reproduce the massaged and filtered data, otherwise that added value is lost.

If the data is shared among multiple analysts or for multiple analyses, then it becomes a single point of failure.  If the data becomes unavailable, analysis stops until it is recovered.  Thus the data needs to be both highly available and backed up for possible recovery.

How much work is needed to ingest the data?  If the data source is separate from Hadoop, then the data has to be migrated from the source into Hadoop, which can be a very time-consuming operation.  And if the data is sensitive and secured at the source, is it secure in transit and in Hadoop?  If you are handling data-loss prevention at the source, what about in Hadoop?  Data governance clearly becomes an issue as long as data must be migrated between the source and Hadoop.  A better approach is to analyze the data in place (in situ), so that no ingestion is needed and data governance can focus on one data repository rather than several.
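To make the cost of ingestion concrete, here is a minimal sketch using the stock Hadoop FileSystem API of the kind of copy step that in-place analysis avoids; the paths are hypothetical, and the point is simply that every byte must cross into HDFS before any analysis can begin.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch of a typical ingestion step: copying source data into HDFS
// before analysis can start. The paths are hypothetical; with large data sets
// this copy is slow, and the data then lives (and must be governed) in two places.
public class IngestToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml, etc.
        FileSystem hdfs = FileSystem.get(conf);     // the cluster's default file system
        hdfs.copyFromLocalFile(
            new Path("/data/source/events.log"),    // hypothetical source location
            new Path("/user/analyst/events.log"));  // hypothetical HDFS destination
    }
}
```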

Hadoop suffers from a number of limitations that make it less than enterprise-ready.  In particular:

1. Not highly available, since the NameNode and JobTracker represent single points of failure.

2. Concerns about data governance, due to risky and time-consuming data moves for batch processing.

3. Server and storage sprawl, resulting in poor storage and compute utilization.

4. Wasteful storage consumption: HDFS keeps three copies of each block by default, which consumes extra storage and computational resources (see the sketch below).
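As a point of reference on the fourth item, the triple-copy behavior comes from HDFS's dfs.replication setting, which defaults to 3.  The sketch below (paths hypothetical) shows how the setting can be read, and how replication can be changed per file, trading durability against raw capacity.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: HDFS stores every block dfs.replication times (default 3), so raw
// capacity needs are roughly three times the logical data size.
public class ShowReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        System.out.println("dfs.replication = " + conf.getInt("dfs.replication", 3));

        // Replication can be lowered per file (this path is hypothetical), but
        // only by trading durability for capacity: the underlying cost remains.
        FileSystem fs = FileSystem.get(conf);
        fs.setReplication(new Path("/user/analyst/events.log"), (short) 2);
    }
}
```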

Does it make sense, then, to store your data on commodity, open-source storage?  The overwhelming criterion used to justify this choice is cost, not availability, manageability, or performance.  Again, this makes sense if the data analysis is part of a research project, but not once it has matured to the point where it makes a significant contribution to your bottom line.  The community that supports the open-source implementations is starting to address these concerns, and it is happy to tout the improvements made in newer releases, but its notion of support is best-effort and consulting.  This is roughly equivalent to Linux support before Red Hat and SUSE.

So long as big data projects remain in the realm of skunk works, this lack of enterprise-class readiness is tolerable.  However, as soon as big data begins delivering results that contribute significantly to a company's bottom line, the availability, manageability, and performance of the open-source tools become an issue.  The quality of the Hadoop environment should match its value to the organization it supports.

One solution is the hardware appliance, which uses enterprise-class hardware and software to meet the three goals of availability, manageability, and performance.

However, it does so at considerable cost, and it still incurs the data-governance issues and the expense of ingestion: you still need to migrate the data from its source to the appliance.

What would be best is a solution that combines the best of both ideas: low-cost commodity hardware with enterprise readiness.  Symantec’s Enterprise Solution for Hadoop does just that.  It is a simple add-on software connector to our Storage Foundation Cluster File System.  Built on the Hortonworks distribution of HDFS, it seamlessly runs MapReduce or any other Hadoop component.  Without any data migration or additional hardware, you can immediately transform your existing enterprise-class storage into a Hadoop file system.  That is, you get instant Hadoop.
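The exact connector configuration is beyond the scope of this article, but the idea of analyzing data where it already sits can be illustrated with stock Hadoop.  This is a minimal sketch only, assuming a hypothetical cluster-file-system mount at /cfs/analytics and using the standard fs.default.name property from the Hadoop 1.x line; the real connector setup may differ.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative only: point Hadoop's default file system at a POSIX path so that
// jobs read data in place. The mount point /cfs/analytics is hypothetical and
// does not represent the actual Symantec connector configuration.
public class InPlaceAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "file:///");            // use the local/POSIX file system
        FileSystem fs = FileSystem.get(conf);
        Path data = new Path("/cfs/analytics/events.log");  // hypothetical shared-mount path
        System.out.println("readable in place: " + fs.exists(data));
    }
}
```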

Use the infrastructure you already have, even commodity servers, storage, and OS platforms.

Storage Foundation Cluster File System provides a number of features:

1.    Improves application performance and scalability.

2.    Minimizes application downtime.

3.    Provides sub-minute Oracle failover.

4.    Reduces costs with storage consolidation.

This is in addition to all the features supported by Storage Foundation in general:

1.      Dynamic Multi-Pathing – standardized enterprise-class DMP for all major storage vendors.

2.      Online storage management – modification without downtime.

3.      Snapshots – full and space-optimized, even to other tiers of storage.

4.      Replication – both for volumes and files.

5.      Compression – transparent and per file.

6.      Deduplication.

7.      Thin-provisioning – only pay for what you use.

8.      SmartTier – file systems that span multiple tiers of storage.

9.      SmartMove – migrations and copies move only blocks that contain data, skipping blocks that are not in use.

10.  Centralized management – one web interface to manage all your Storage Foundation components.

11.  Integration with NetBackup – for Enterprise-class backup and recovery.

Key use cases:

1.    Low latency, high IOPS.

2.    Fast failover of applications.

3.    Clustered Network File Systems.

4.    Scale-out applications.

How does Symantec Enterprise Solution for Hadoop provide high availability, ease of management, and high performance?

In addition to the advantages in availability and management, SFCFS-Hadoop also outperforms HDFS, whether HDFS is backed by SAN or DAS storage.  In our testing, we used:

1.    an 8-node cluster, with 8x 2.67 GHz cores per node,

2.    Hitachi HUS-1500 SAN storage (15K RPM SAS, 8 GB cache), and

3.    a 1 Gbps dedicated link.

For comparison, the HDFS-with-SAN configuration used the same hosts and SAN, but in a shared-nothing storage layout, with each node owning LUNs from individual RAID-5 groups.  The HDFS-with-DAS configuration used the same hosts, but each node had 4x 10K RPM SAS disks, again shared-nothing.

The actual test used the TeraSort suite, a three-stage serial benchmark: TeraGen, TeraSort, and TeraValidate.  We summarize the results by adding the times for the three stages together; as the table below shows, CFS-Hadoop outperforms either HDFS configuration by at least one-third.

TeraSort Suite (sum of TeraGen, TeraSort, and TeraValidate times; lower is better)

Data set      cfs-hadoop      hdfs-san      hdfs-das      CFS advantage
10g                6.467         9.034         8.717             34.79%
100g              57.617        79.650        78.483             36.22%
1000g            629.050       882.500       851.749             35.40%

(The CFS advantage column corresponds to the hdfs-das time minus the cfs-hadoop time, expressed as a percentage of the cfs-hadoop time.)
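For readers who want to reproduce a run of this shape, here is a hedged sketch of driving the three serial stages from Java with ToolRunner, using the TeraGen, TeraSort, and TeraValidate classes shipped in the standard hadoop-examples jar.  The row count and paths are illustrative: TeraGen rows are 100 bytes each, so 100,000,000 rows is roughly the 10g data set above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.terasort.TeraGen;
import org.apache.hadoop.examples.terasort.TeraSort;
import org.apache.hadoop.examples.terasort.TeraValidate;
import org.apache.hadoop.util.ToolRunner;

// Sketch of the three-stage TeraSort suite run back to back; each stage is a
// full MapReduce job. Paths and the row count are illustrative only.
public class TeraSuite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        long start = System.currentTimeMillis();
        ToolRunner.run(conf, new TeraGen(),      new String[] {"100000000", "/bench/in"});
        ToolRunner.run(conf, new TeraSort(),     new String[] {"/bench/in", "/bench/out"});
        ToolRunner.run(conf, new TeraValidate(), new String[] {"/bench/out", "/bench/report"});
        System.out.println("total seconds: " + (System.currentTimeMillis() - start) / 1000);
    }
}
```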

In summary, the Symantec Enterprise Solution for Hadoop supports enterprise-class big data analytics by:

·         Reducing storage costs by two-thirds.

·         Providing one-third better performance.

·         Removing single points of failure.

·         Improving service recovery times.

·         Providing easy snapshots for point-in-time rollback and recovery.

·         Providing a scalable architecture to meet growing big data needs (up to 16 PB).

