workload-ref-archs/doc/source/big-data/big-data.rst

OpenStack Workload Reference Architecture: Big Data
===================================================

Introduction
------------

Big Data analytics has established itself as an important process to support
new or enhanced business models. Big Data is a term for data sets that are so
large or complex that traditional data processing applications are inadequate
to deal with them. Big Data analytics refers to the use of predictive
analytics, user behavior analytics, or certain other advanced data analytics
methods that extract value from data.

Since Big Data analytics can include and analyze all types of data sources,
the results are valuable for most departments in an enterprise. Each might
perform analytics with different business objectives. Considering the short
innovation cycle of most digital business models, Enterprise IT is often
under pressure to fulfill a multitude of demands quickly. A flexible, fast,
efficient and easy-to-manage Big Data deployment is critical.

Cloud is one approach to tackle the dynamic situation caused by high volumes
of analytics requests with rapid deployment time requirements. In an
OpenStack-based cloud environment, a Big Data cluster can be provisioned in
an automated manner. The value of Big Data on cloud contributes to it being
one of the top use cases for OpenStack. According to the April 2016
`OpenStack User Survey`_, 27 percent of users have deployed or are testing
Big Data analytics solutions.

Apache Hadoop on OpenStack offers a Big Data infrastructure that scales out
both compute and storage resources, and provides the secure and automated
capabilities for the analytics process. The `Apache Hadoop project`_ is the
de facto standard open source framework for Big Data analytics, used in the
vast majority of deployments. Multiple Hadoop clusters are often deployed to
respond to an enterprise’s needs.

This reference architecture is intended for enterprise architects who are
looking to deploy Big Data Hadoop clusters on an OpenStack cloud. It describes
a generic Hadoop architecture and uses open source technologies:

* `OpenStack cloud software`_

* `Ubuntu Linux`_ operating system

* `Apache Ambari`_ – open source software to provision, manage and monitor
  Hadoop clusters.

.. _OpenStack User Survey: https://www.openstack.org/assets/survey/April-2016-User-Survey-Report.pdf
.. _Apache Hadoop project: http://hadoop.apache.org/
.. _OpenStack cloud software: http://www.openstack.org/software/
.. _Ubuntu Linux: https://www.ubuntu.com/
.. _Apache Ambari: http://ambari.apache.org/

This reference architecture describes and includes installation files for a
basic Hadoop cluster. Additional services can be applied for a more complex
configuration and will be covered in future works.

Figure 1: High-level overview of Hadoop architecture

.. figure:: figures/figure01.png
   :alt: Figure 1: High-level overview of Hadoop architecture

OpenStack for Hadoop Clusters
-----------------------------

This Hadoop architecture is derived from actual use cases and experience.
Building a Hadoop-based Big Data environment can be a complex task. It is
highly recommended to use common approaches to reduce this complexity, such as
identifying the data processing models. These processing models demand high
availability of resources, networking, bandwidth, storage, as well as security
constraints in the enterprise context.

* **Batch processing model** – Analytics based on historic data

  In the batch processing model, the analytic tasks are executed or queried in
  a scheduled or recurring manner. Typically the data is already available for
  analysis in a static repository such as large files or databases. The batch
  processing model is often used to analyze business data of a certain period.
  One example is an ETL (extract, transform, load) process to extract business
  data from various ERP systems for supply chain planning.

* **Stream processing model** – Business real-time analytics

  In the stream processing model, data is continuously streamed and directly
  analyzed in real time. Actions can be triggered in case of occurrence of
  special or defined events. An example of a stream processing workload is
  fraud detection for credit card companies. A credit card transaction is
  transmitted online to the credit card company and is evaluated in real time
  based on certain parameters; for example, checking the card’s validity and
  the purchase amount against the limit. It is also possible to check the
  location of purchases and compare this to other recent purchases.
  For example, if purchases are made in the U.S. and Europe in a timespan of
  only a few hours, this indicates a high likelihood of fraud and action can be
  taken to decline the transaction.

* **Predictive processing model** – Predict outcome based on recent and
  historical data

  This model is used to predict an outcome, behavior or other actions for the
  future. Generally this analytic model consists of various predictive
  algorithms. One example is predictive maintenance. Data from machines,
  engines or other sensors is collected and analyzed so that predictive actions
  can be made to recommend the next maintenance cycle before a failure might
  occur.

Hadoop clusters use a master-slave architecture. The data is ingested into the
cluster and stored in blocks in the Hadoop distributed file system (HDFS). The
default block size is 64MB. The blocks of data are replicated to different
nodes in the clusters. Part of the core Hadoop project, `YARN`_ provides a
framework for job scheduling and cluster resource management. With YARN,
multiple data processing applications can be implemented in the Hadoop cluster.

.. _YARN: http://hortonworks.com/apache/yarn/

Typically a Hadoop cluster with YARN is composed of different types of cluster
nodes:

* **NameNode** – The metadata about the data blocks are stored in the NameNode.
  This provides lookup functionality and tracking for all data or files in the
  Hadoop cluster. NameNode does not store the actual data. Generally the
  NameNode requires high memory (RAM) allocation. The NameNode belongs to the
  "master" part of Hadoop architecture.

* **DataNode** – This is also referred as the worker node and belongs to the
  "slave" part of a Hadoop architecture. It is responsible for storing and
  computing the data and responds to the NameNode for filesystem operations.
  Generally a DataNode requires high amount of storage space.

* **ResourceManager** – This is the master that manages the resources in the
  Hadoop cluster. It has a scheduler to allocate resources to the various
  applications across the cluster.

* **NodeManager** – This takes instruction from the ResourceManager and is
  responsible for executing the applications. It monitors and reports the
  resources (cpu, memory, disk) to the ResourceManager.

An OpenStack cloud is powered by many different services (also known as
projects). Utilizing the core services and the Hadoop Common package, a
Hadoop cluster can be deployed in a virtualized environment with minimal
effort. Optional services such as the OpenStack Orchestration service (Heat)
can be added to automate deployment. This reference architecture does not
cover OpenStack Big Data Service (Sahara). Sahara provides a simple means to
provision as well as scale previously provisioned Hadoop clusters.
Sahara will be covered in future reference architecture documents.

Figure 2 shows the core and optional services in relation to one another,
and the services to confirm are available in your OpenStack cloud.

Figure 2. Logical representation of OpenStack services in support of Hadoop
clusters

.. figure:: figures/figure02.png
   :alt: Figure 2. Logical representation of OpenStack services in support of Hadoop clusters

Brief descriptions of the core and optional services are as follow.
The `OpenStack Project Navigator`_ provides additional information.

.. _OpenStack Project Navigator: http://www.openstack.org/software/project-navigator/

.. list-table:: **Core services**
   :widths: 20 50

   * - Compute (Nova)
     - Manages the life cycle of compute instances, including spawning,
       scheduling, and decommissioning of virtual machines (VMs) on demand.
   * - Image Service (Glance)
     - Stores and retrieves VM disk images. Used by OpenStack Compute during
       instance provisioning.
   * - Block Storage (Cinder)
     - Virtualizes the management of block storage devices and provides a
       self-service API to request and use those resources regardless of the
       physical storage location or device type. Supports popular storage
       devices.
   * - Networking (Neutron)
     - Enables network connectivity as a service for other OpenStack services,
       such as OpenStack Compute. Provides an API to define networks and their
       attachments. Supports popular networking vendors and technologies. Also
       provides LBaaS and Firewall-as-a-Service (FWaaS).
   * - Identity Service (Keystone)
     - Provides authentication and authorization for the other OpenStack
       services.
   * - Object Storage (Swift)
     - Stores and retrieves arbitrary unstructured data objects via a RESTful
       HTTP-based API. Highly fault-tolerant with data replication and
       scale-out architecture.

.. list-table:: **Optional services**
   :widths: 20 50

   * - Orchestration (Heat)
     - Orchestrates multiple composite cloud applications by using either the
       native HOT template format or the AWS CloudFormation template format,
       through both an OpenStack-native REST API and a
       CloudFormation-compatible Query API.
   * - Telemetry (Ceilometer)
     - Monitors and meters the OpenStack cloud for billing, benchmarking,
       scalability, and statistical purposes.
   * - Dashboard (Horizon)
     - Provides an extensible web-based self-service portal to interact with
       underlying OpenStack services, such as launching an instance, assigning
       IP addresses, or configuring access controls.

Figure 3 illustrates the basic functional interaction between these services.
For further details:
`OpenStack Conceptual Architecture Diagram <http://docs.openstack.org/admin-guide/common/get-started-conceptual-architecture.html>`_.

Figure 3. Functional interaction between OpenStack components

.. figure:: figures/figure03.png
   :alt: Figure 3. Functional interaction between OpenStack components

Structuring a Hadoop Cluster with OpenStack
-------------------------------------------

OpenStack provides the necessary compute, network and data storage services
for building a cloudbased Hadoop cluster to meet the needs of the various
processing models.

Networking
**********

Multiple networks can be created for the Hadoop cluster connectivity. Neutron
routers are created to route the traffic between networks.

* **Edge Network** – Provides connectivity to the client-facing and enterprise
  IT network. End users are accessing the Hadoop cluster through this network.

* **Cluster Network** – Provides inter-node communication for the Hadoop
  cluster.

* **Management Network** – Optionally provides a dedicated network for
  accessing the Hadoop nodes' operating system for maintenance and monitoring
  purposes.

* **Data Network** – Provides a dedicated network for accessing the object
  storage within an OpenStack Swift environment or to an external object
  storage such as Amazon S3. This is optional if object storage is not used.

Neutron security groups are used to filter traffic. Hadoop uses different
ports and protocols depending on the services deployed and communications
requirements. Different security groups can be created for different types of
nodes, depending on the Hadoop services running on it. With OpenStack security
groups, multiple rules can be specified that allow/deny traffic from certain
protocols, ports, or IP addresses or ranges. Each virtual machine (VM) can be
applied with one or more security groups. In OpenStack, each tenant has a
default security group, which is applied to instances that have no other
security group defined. Unless changed, this security group denies all
incoming traffic.

Image Management
****************

There are multiple options to provide operating system configuration for the
Hadoop nodes. On-the-fly configuration allows greater flexibility but can
increase spawning time. The operating system images can also be pre-configured
to contain all of the Hadoop-related packages required for the different types
of nodes. Pre-configuration can reduce instance build time, but includes its
own set of problems, such as patching and image lifecycle management. In this
example, the Heat orchestration features are used to configure the Hadoop
nodes on-the-fly. Additional Hadoop and operating system packages are installed
on-the-fly depending on the node type (e.g. NameNode, DataNode). These packages
can be downloaded from Internet-based or local repositories. For a more secure
enterprise environment, local package repository is recommended.

Data Management
***************

Similar to an external hard drive, Cinder volumes are persistent block-storage
virtual devices that may be mounted and dismounted from the VM. Cinder volumes
can be attached to only one instance at a time. A Cinder volume is attached to
each Hadoop DataNode to provide the HDFS.

If the data to be processed by a Hadoop cluster needs to be accessed by other
applications, the OpenStack Swift object storage can be used to store it.
Swift offers a cost-effective way of storing unstructured data. Hadoop provides
a built-in interface to access Swift or AWS S3 object storage; either can be
configured to serve data over HTTP to the Hadoop cluster.

Orchestration
*************

Heat uses template files to automate the deployment of complex cloud
environments. Orchestration is more than just standing up virtual servers;
it can also be used to install software, apply patches, configure networking
and security, and more. Heat templates are provided with this reference
architecture that allow the user to quickly and automatically setup and
configure a Hadoop cluster for different data processing models
(types of analytics).

Figure 4: A Hadoop cluster on OpenStack

.. figure:: figures/figure04.png
   :alt: Figure 4: A Hadoop cluster on OpenStack

Demonstration and Sample Code
-----------------------------

This section describes the Heat template provided for this workload. The
template is used to configure all of the Hadoop nodes. It has been created
for reference and training and is not intended to be used unmodified in a
production environment.

An Ambari Hadoop environment is created on a standard Ubuntu 14.04 server
cloud image in QEMU copy on write (qcow2). The qcow2 cloud image is stored in
the Glance repository. The Apache Ambari open source project makes Hadoop
management simpler by providing an easy-to-use Hadoop management web UI backed
by its RESTful APIs. Basically, Ambari is the central management service
for open source Hadoop. In this architecture, an Ambari service is installed
on the Master Node (NameNode). The Heat template also installs additional
required services such as the name server, Network Time Protocol (NTP) server,
database, and the operating system configuration customization required for
Ambari. Floating IP can be allocated to the Master Node to provide user access
to the Ambari service. In addition, an Ambari agent service is deployed on
each node of the cluster. This provides communication and authentication
functionality between the cluster nodes.

The following nodes are installed by the Heat template:

* **Master Node (NameNode)** – This node houses the cluster-wide management
  services that provide the internal functionality to manage the Hadoop cluster
  and its resources.

* **Data Nodes** – Services used for managing and analyzing the data, stored in
  HDFS, are located on these nodes. Analytics jobs access and compute the data
  on the Data Nodes.

* **Edge Node** – Services used to access the cluster environment or the data
  outside the cluster are on this node. For security, direct user access to the
  Hadoop cluster should be minimized. Users can access the cluster via the
  command line interface (CLI) from the Edge Node. All data-import and
  data-export processes can be channeled on one or more Edge Nodes.

* **Admin Node** – Used for system-wide administration

Multiple networks (edge, cluster, management, data) described in previous
sections are created by the Heat orchestration. A Neutron security group
is attached to each instance of the cluster node. The template also provisions
Cinder volumes and attaches one Cinder volume to each node. Swift is not
configured in this template and will be covered in future work.

The Heat template, BigData.yaml, can be downloaded from
http://www.openstack.org/software/sample-configs/#big-data.
Please review the README file for further details.

Scope and Assumptions
---------------------

The Heat template provided for this reference architecture assumes that the
Hadoop cluster workload is deployed in a single-region, single-zone OpenStack
environment. The deployment in a multi-zone/multiregion environment is outside
the scope of this document.

The Heat template is configured to address the minimum infrastructure
resources for deploying a Hadoop cluster. Architecting a Hadoop cluster is
highly dependent on the data volume and other performance indicators defined by
the business use cases, such as response times for analytic processes and how
and which services will be used.

The sample environment uses the Java environment. As such, the Heat template
installer will be required to accept the Java license agreement.

As mentioned, Sahara is not used in this implementation. Sahara is the
OpenStack Big Data Service that provisions a data-intensive application cluster
such as Hadoop or Spark. The Sahara project enables users to easily provision
and manage clusters with Hadoop and other data processing frameworks on
OpenStack. An update to this reference architecture to include Sahara is under
consideration.

Summary
-------

There are many possible choices or strategies for deploying a Hadoop cluster
and there are many possible variations in OpenStack deployment. This document
and the accompanying Heat templates serve as a general reference architecture
for a basic deployment and installation process via Openstack orchestration.
They are intended to demonstrate how easily and quickly a Hadoop Cluster can be
deployed, using the core OpenStack services. Complementary services will be
included in future updates.

These additional resources are recommended to delve into more depth on overall
OpenStack cloud architecture, the OpenStack services covered in this reference
architecture, and Hadoop and Ambari. The vibrant, global OpenStack community
and ecosystem can be invaluable for their experience and advice, especially the
users that have deployed Big Data solutions. Visit openstack.org to get started
or click on these resources to begin designing your OpenStack-based Big Data
analytics system.

.. list-table::
   :widths: 25 50
   :header-rows: 1

   * - Resource
     - Overview
   * - `OpenStack Marketplace`_
     - One-stop resource to the skilled global ecosystem for distributions,
       drivers, training, services and more.
   * - `OpenStack Architecture Design Guide`_
     - Guidelines for designing an OpenStack cloud architecture for common use
       cases. With examples.
   * - `OpenStack Networking Guide`_
     - How to deploy and manage OpenStack Networking (Neutron).
   * - `OpenStack Virtual Machine Image Guide`_
     - This guide describes how to obtain, create, and modify virtual machine
       images that are compatible with OpenStack.
   * - `Complete OpenStack documentation`_
     - Index to all documentation, for every role and step in planning and
       operating an OpenStack cloud.
   * - `Community Application Catalog`_
     - Download this LAMP/WordPress sample application and other free
       OpenStack applications here.
   * - `Apache Hadoop project`_
     - The de facto standard open source framework for Big Data analytics,
       used in this reference architecture.
   * - `Apache Ambari project`_
     - This reference architecture and files deploy Big Data using Ambari, an
       open source package for installing, configuring and managing a Hadoop
       cluster.
   * - `Welcome to the community!`_
     - Join mailing lists and IRC chat channels, find jobs and events, access
       the source code and more.
   * - `User groups`_
     - Find a user group near you, attend meetups and hackathons—or organize
       one!
   * - `OpenStack events`_
     - Global schedule of events including the popular OpenStack Summits and
       regional OpenStack Days.

.. _OpenStack Marketplace: http://www.openstack.org/marketplace/
.. _OpenStack Architecture Design Guide: http://docs.openstack.org/arch-design/
.. _OpenStack Networking Guide: http://docs.openstack.org/mitaka/networking-guide/
.. _OpenStack Virtual Machine Image Guide: http://docs.openstack.org/image-guide/
.. _Complete OpenStack Documentation: http://docs.openstack.org/
.. _Community Application Catalog: http://apps.openstack.org/
.. _Apache Ambari project: http://ambari.apache.org/
.. _Welcome to the community!: http://www.openstack.org/community/
.. _User groups: https://groups.openstack.org/
.. _OpenStack events: http://www.openstack.org/community/events/