
Planning a Sahara Deployment

Sahara enables users to easily provision and manage Apache Hadoop clusters in an OpenStack environment. Sahara supports only the 2.x release line of Hadoop.

The Sahara control processes run on the Controller node. The entire Hadoop cluster runs in VMs on Compute nodes. A typical setup is:

  • One VM that runs the management and monitoring processes (Apache Ambari, Cloudera Manager, Ganglia, Nagios).
  • One VM that serves as the Hadoop master node and runs the ResourceManager and NameNode.
  • Many VMs that serve as Hadoop worker nodes, each of which runs a NodeManager and a DataNode.

You must have exactly one instance of each management and master process running in the environment. Beyond that, you are free to use other configurations. For example, you can run the NodeManager and DataNode in the same VM that runs the ResourceManager and NameNode; such a configuration may not produce performance levels that are acceptable for a production environment, but it works for evaluation and demonstration purposes. You could also run the DataNodes and NodeManagers in separate VMs.
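For evaluation purposes, such an all-in-one configuration can be expressed as a single node group template. The following is a minimal sketch for the vanilla plugin using the Sahara CLI; the file name, flavor ID, and floating IP pool name are illustrative placeholders for values from your environment. Create a file such as all-in-one.json with contents like:

    {
        "name": "vanilla-2-all-in-one",
        "plugin_name": "vanilla",
        "hadoop_version": "2.6.0",
        "flavor_id": "3",
        "floating_ip_pool": "public",
        "auto_security_group": true,
        "node_processes": ["namenode", "resourcemanager",
                           "historyserver", "oozie",
                           "datanode", "nodemanager"]
    }

and register it with:

    sahara node-group-template-create --json all-in-one.json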

Sahara can use either Swift or Ceph for object storage.

Note

If you have configured the Swift public URL with SSL, Sahara will only work with the prepared Sahara images.

Special steps are required to implement data locality for Swift; see Data locality for details. Data locality is not available for the Ceph storage backend.

Plan the size and number of nodes for your environment based on the information in the nodes and roles planning section.

When deploying an OpenStack environment that includes Sahara for running Hadoop, you need to consider a few special conditions.

Floating IPs

Fuel configures Sahara to use floating IPs to manage the VMs. This means that you must provide a floating IP pool in each Node Group Template you define. See the public and floating IP addresses architecture section for general information about floating IPs.

A special case arises if you are using Nova-Network and have set the auto_assign_floating_ip parameter to true by checking the appropriate box in the Fuel UI. In this case, a floating IP is automatically assigned to each VM and the "floating IP pool" dropdown menu is hidden in the OpenStack Dashboard.

In either case, Sahara assigns a floating IP to each VM it spawns, so be sure to allocate enough floating IPs.
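For example, with Nova-Network you can check which pools exist and how many addresses are already allocated with the following commands (with Neutron, the pool corresponds to an external network instead):

    nova floating-ip-pool-list    # show the available floating IP pools
    nova floating-ip-list         # show addresses already allocated to the tenant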

However, if you have a limited number of floating IPs or special security policies, you may not be able to provide access to all instances. In this case, you can use the instances that do have access as proxy gateways. To enable this functionality, set the is_proxy_gateway parameter to true for the node group you want to use as a proxy. Sahara will then communicate with all other cluster instances through the instances of this node group.
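A minimal sketch of how this appears in a node group template; is_proxy_gateway is the documented parameter, while the surrounding names and values are illustrative placeholders:

    {
        "name": "vanilla-2-proxy-master",
        "plugin_name": "vanilla",
        "hadoop_version": "2.6.0",
        "flavor_id": "3",
        "floating_ip_pool": "public",
        "is_proxy_gateway": true,
        "node_processes": ["namenode", "resourcemanager", "oozie"]
    }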

Note

If use_floating_ips is set to true and the cluster contains a node group that is used as proxy, the requirement to provision a pool of floating IPs is only applied to the proxy node group. Sahara accesses the other instances through proxy instances using the private network.

Note

The Cloudera Hadoop plugin does not support access to the Cloudera Manager through a proxy node. Therefore, you can only assign the nodes on which the Cloudera Manager runs as proxy gateways.

Security Groups

Sahara can create and configure security groups separately for each cluster, depending on the provisioning plugin and Hadoop version. See Security Groups for general information about security groups.
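To have Sahara create and manage these groups itself, enable the auto_security_group flag in your node group templates. A sketch of the relevant fragment (the optional security_groups list attaches additional pre-existing groups; the group name is a placeholder):

    {
        "name": "vanilla-2-worker",
        "auto_security_group": true,
        "security_groups": ["my-extra-group"]
    }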

VM Flavor Requirements

Hadoop requires at least 1 GB of RAM to run. This means you must use flavors that have at least 1 GB of memory for Hadoop cluster nodes.
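For example, a flavor with 4 GB of RAM, a 40 GB disk, and 2 vCPUs can be created as follows; the flavor name and sizes are illustrative:

    # nova flavor-create <name> <id> <ram in MB> <disk in GB> <vcpus>
    nova flavor-create hadoop.medium auto 4096 40 2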

Hardware-assisted virtualization

In order for Sahara to work properly, hardware-assisted virtualization must be enabled for the hypervisor used by OpenStack. Its absence leads to frequent random errors during Hadoop deployment, because in that case the VMs are too 'weak' to run such a heavyweight application. To ensure that Sahara will work properly, you should do two things:

  • While deploying the OpenStack environment via the Fuel UI, select a hypervisor other than QEMU.

  • Make sure that the CPUs on the compute nodes support hardware-assisted virtualization. To check this, run the following command on the deployed compute nodes:

    cat /proc/cpuinfo | grep --color "vmx\|svm"

While most modern x86 CPUs support hardware-assisted virtualization, it may still be unavailable on compute nodes if they are themselves running as virtual machines. In that case, the hypervisor running the compute nodes must support passing hardware-assisted virtualization through to nested VMs and must have that feature enabled. VirtualBox does not have that feature, and as a result environments deployed as described in the QuickStart Guide will have Sahara working poorly.
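If your compute nodes run on a Linux/KVM host, you can check on that host whether nested hardware-assisted virtualization is enabled; the command prints Y or 1 when it is:

    cat /sys/module/kvm_intel/parameters/nested    # Intel CPUs
    cat /sys/module/kvm_amd/parameters/nested      # AMD CPUs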

Communication between virtual machines

Be sure that communication between the virtual machines is not blocked, for example by restrictive security group rules.
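For example, if the cluster instances are members of the default security group and traffic between them is blocked, rules like the following allow all TCP and UDP traffic between members of that group (shown with the nova CLI; the group name is illustrative, and this is unnecessary when Sahara manages the security groups itself):

    nova secgroup-add-group-rule default default tcp 1 65535
    nova secgroup-add-group-rule default default udp 1 65535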

Default templates

Sahara bundles default templates that define simple clusters for the supported plugins. These templates are already added to the Sahara database; therefore, you do not need to create them.
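Assuming the python-saharaclient CLI is installed, you can verify that the default templates are present:

    sahara node-group-template-list
    sahara cluster-template-list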

Supported default templates for plugins

Below is an overview of the supported default templates for each plugin; a sketch of launching a cluster from one of these templates follows the list:

  • Vanilla Apache Hadoop 2.6.0:

    Two node groups are created for this plugin. The first one is named vanilla-2-master and contains all management Hadoop components: NameNode, HistoryServer, and ResourceManager. It also includes the Oozie server required to run Hadoop jobs. The second one is named vanilla-2-worker and contains the components required for data storage and processing: NodeManager and DataNode.

    A cluster template is also provided for this plugin. It is named vanilla-2 and contains 1 master node and 3 worker nodes.

  • Cloudera Hadoop Distribution (CDH) 5.4.0:

    Three node groups are created for this plugin. The first one is named cdh-5-master and contains all management Hadoop components: NameNode, HistoryServer, and ResourceManager. It also includes the Oozie server required to run Hadoop jobs. The second one is named cdh-5-manager and contains the Cloudera Manager component, which provides a UI for managing the Hadoop cluster. The third one is named cdh-5-worker and contains the components required for data storage and processing: NodeManager and DataNode.

    A cluster template is also provided for this plugin. It is named cdh-5 and contains 1 manager, 1 master, and 3 worker nodes.

  • Hortonworks Data Platform (HDP) 2.2:

    Two node groups are also created for this plugin. The first one, named hdp-2-2-master, contains all management Hadoop components: Ambari, NameNode, MapReduce HistoryServer, ResourceManager, YARN Timeline Server, and ZooKeeper. It also includes the Oozie server required to run Hadoop jobs. The second one, named hdp-2-2-worker, contains the components required for data storage and processing: NodeManager and DataNode.

    A cluster template is also provided for this plugin. It is named hdp-2-2 and contains 1 master and 4 worker nodes.
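Because the default templates already exist, launching a cluster is mostly a matter of referencing one of them. The following is a minimal sketch for the vanilla-2 cluster template; the file name, template ID, image ID, and keypair name are illustrative placeholders for values from your environment. Create a file such as my-cluster.json with contents like:

    {
        "name": "my-vanilla-cluster",
        "plugin_name": "vanilla",
        "hadoop_version": "2.6.0",
        "cluster_template_id": "REPLACE_WITH_TEMPLATE_ID",
        "default_image_id": "REPLACE_WITH_IMAGE_ID",
        "user_keypair_id": "my-keypair"
    }

and launch the cluster with:

    sahara cluster-create --json my-cluster.json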

For additional information about using Sahara to run Apache Hadoop, see the Sahara documentation.