Scientific Working Group
This project is no longer maintained.
The contents of this repository are still available in the Git
source code management system. To see the contents of this
repository before it reached its end of life, please check out the
previous commit with "git checkout HEAD^1".
This repository is for storing `Scientific Working Group
<>`_ resources.
Principally, this includes the text for the WG's book,
*The Crossroads of Cloud and HPC: OpenStack for Scientific Research*.
All documents are in RST format.
Python Tox is used to automate the creation of virtual environments for
building the working group's documentation resources.
To get started, you need to install all necessary tools:
* `virtualenv`
* `pip` (use the latest from ``)
* `tox`
Run the build
To build the resources::

    $ tox -e docs

This will generate build artifacts in ``doc/build``. To view the generated
HTML artifacts navigate to
For any further questions, please email or join #openstack-dev on

Scientific Working Group
The Scientific Working Group is dedicated to representing and
advancing the use-cases and needs of research and high-performance
computing atop OpenStack. It's also a great forum for cross-institutional
collaboration. If you run (or would like to run) OpenStack to support
researchers/scientists/academics and/or HPC/HTC, then please join!
* HPC/HTC Infrastructure
* Research Data Infrastructure
* Application Infrastructure
* Social Infrastructure
* Create opportunities for the scientific community to engage the wider OpenStack community, i.e. industry
By default, our members communicate via the following open
community mailing lists:
* for coordination of working
group activities. Please prefix email subject lines with the tag
* for operational discussion
of scientific OpenStack deployment issues. Please prefix email
subject lines with the tag "[scientific]".
Please use the hashtag "scientific-wg" for tagging any etherpad
URLs, code, blog posts, scientific research and/or other social
publishing platforms.
The Working Group also maintains a `wiki page`_.
The Working Group has `weekly IRC meetings`_ in alternating time zones.
.. _wiki page:
.. _weekly IRC meetings:
No formal membership is required. Please introduce yourself and/or
fellow colleagues to this working group using one of the mailing
lists below, or by attending one of the IRC meetings.
This group is open to all members of the scientific OpenStack
community and supporting vendors.
The Scientific Working Group maintains a guide to meeting the requirements of
scientific computing workloads on OpenStack, titled
*The Crossroads of Cloud and HPC: OpenStack for Scientific Research*.
.. toctree::
:maxdepth: 2

OpenStack and Federated Identity Management
Scientific research depends on the free flow of ideas, and the free
flow of ideas depends upon the free flow of people. Scientific
collaborations and research groups are often composed of users from
different institutions across different countries.
To support convenient and effective collaboration, compute resources
managed by one institution should be seamlessly accessible to
collaborators from other institutions, and vice versa. This is the
core principle of identity federation.
OpenStack is well positioned to enable this. Universities and research
institutions use OpenStack to deliver research computing services.
Through improved support for identity federation, OpenStack can
support collaboration between institutions.
Federation Terminology
* **National Research and Education Networks (NREN)** is a collective term used
for research federations.
* **Home Organisation** is the institution to which a user is affiliated, and in
federated use cases is where a user will be authenticated.
* **Security Assertion Markup Language (SAML)** ...
* **OpenID Connect (OIDC)** ...
* **Identity Provider (IdP)** ...
* ...
The Concepts of Identity Federation
NRENs use OpenStack to offer cloud compute services to authorized
users. The users may belong to universities, research institutions,
industrial partners or more generally to any organisation within
an identity federation.
We have three actors: the NREN running the public cloud, the user
and the user's home organisation, acting as an identity provider.
Federated public clouds are built on SAML-based Identity Federations,
leveraging Universities and Research Institutions as Identity
Providers. National Identity Federations are currently used to
authenticate and authorize millions of users, and enable single sign-on
across thousands of services. OpenStack-based public cloud
computing is provided to users through federated identity login.
OpenStack services and dashboards need to be configured as Service
Providers within identity federations and have to be accessed like
all other services.
The Challenges of Identity Federation
Federated research computing infrastructure needs to enable users
from other institutions within an identity federation to authenticate
and get authorized for access to use infrastructure resources. How is
this process controlled?
* *A user claims to be a user from another institution. How is it proven that
they are who they say they are?*
* *The user has successfully authenticated at their home organisation.
How are they authorized to use compute resources at this institution?
Is the affiliation to a project (and role within that project)
recorded at the home organisation or at the institution providing
the compute resources?*
* *A federated user wants to run periodic background jobs. How can
they do that without having to interactively submit their password?*
* *How does an institution monitor and account for the usage of a
user who has no presence in the institution's user database?*
* *How is a federated user contacted when required by an organisation,
and how would disciplinary actions be taken against a federated
user violating the terms of service?*
* *A federated user has already used a lot of compute resources as
a guest of this institution. How does the institution decide and
enforce the user's limits?*
* *A federated user has just left her home organisation, or left
the project on which she was collaborating with this institution.
How are active resources assigned to this user and project dealt with?*
* *A user is part of an industrial research project in which their
organisation pays for access to the resources. How is billing applied to the
resources used across the identity federation?*
* *How does an institution enable its users to be authenticated and
authorized to use resources at other institutions within an
identity federation?*
Federated Identity Management in OpenStack
Case Study 1
Case Study 2
Further Reading
* The INDIGO Identity and Access Management (IAM) service:
* INDIGO Keystone OpenID-Connect integration guide:
* ...
* **Person 1** from organisation 1
* **Person 2** from organisation 2
.. figure:: images/cc-by-sa.png
:width: 100
:alt: Creative commons licensing
This document is provided as open source with a Creative Commons license
with Attribution + Share-Alike (CC-BY-SA)

OpenStack and High Performance Data
What can data requirements mean in an HPC context? The range of use cases
is almost boundless. With considerable generalisation we can consider
some broad criteria for requirements, which expose the inherent tensions
between HPC-centric and cloud-centric storage offerings:
* The **data access** model: data objects could be stored and retrieved
using file-based, block-based, object-based or stream-based access.
HPC storage tends to focus on a model of file-based shared data storage
(with an emerging trend for object-based storage proposed for achieving
new pinnacles of scalability). Conversely cloud infrastructure favours
block-based storage models, often backed with and extended by object-based
storage. Support for data storage through shared filesystems is still
maturing in OpenStack.
* The **data sharing** model: applications may request the same data
from many clients, or the clients may make data accesses that are
segregated from one another. This distinction can have significant
consequences for storage architecture. Cloud storage and HPC storage
are both highly distributed, but often differ in the way in which data
access is parallelised. Providing high-performance access for many
clients to a shared dataset can be a niche requirement specific to HPC.
Cloud-centric storage architectures typically focus on delivering high
aggregate throughput on many discrete data accesses.
* The level of **data persistence**. An HPC-style tiered data storage
architecture does not need to incorporate data redundancy at every level
of the hierarchy. This can improve performance for tiers caching data
closer to the processor.
The cloud model offers capabilities that enable new possibilities for HPC:
* **Automated provisioning**. Software-defined infrastructure automates the
provisioning and configuration of compute resources, including storage.
Users and group administrators are able to create and configure storage
resources to their specific requirements at the exact time they are needed.
* **Multi-tenancy**. HPC storage does not offer multi-tenancy with the level
of segregation that cloud can provide. A virtualised storage resource
can be reserved for the private use of a single user, or could be shared
between a controlled group of collaborating users, or could even be
accessible by all users.
* **Data isolation**. Sensitive data requires careful data management.
Medical informatics workloads may contain patient genomes. Engineering
simulations may contain data that is a trade secret. OpenStack's
segregation model is stronger than ownership and permissions on a
POSIX-compliant shared filesystem, and also provides finer-grained
access control.
There is clear value in increased flexibility - but at what cost in
performance? In more demanding environments, HPC storage tends to focus
on and be tuned for delivering the requirements of a confined subset
of workloads. This is the opposite approach to the conventional cloud
model, in which assumptions may not be possible about the storage access
patterns of the supported workloads.
This study will describe some of these divergences in greater detail, and
demonstrate how OpenStack can integrate with HPC storage infrastructure.
Finally some methods of achieving high performance data management on
cloud-native storage infrastructure will be discussed.
File-based Data: HPC Parallel Filesystems in OpenStack
Conventionally in HPC, file-based data services are delivered
by parallel filesystems such as Lustre and Spectrum Scale (GPFS).
A parallel filesystem is a shared resource. Typically it is mounted on
all compute nodes in a system and available to all users of a system.
Parallel filesystems excel at providing low-latency, high-bandwidth
access to data.
Parallel filesystems can be integrated into an OpenStack environment in
a variety of configuration models.
Provisioned Client Model
Access to an external parallel filesystem is provided through an OpenStack
provider network. OpenStack compute instances - virtualised or bare
metal - mount the site filesystem as clients.
This use case is fairly well established. In the virtualised use case,
performance is achieved through use of SR-IOV (with only a moderate
level of overhead). In the case of Lustre, with a layer-2 VLAN provider
network the o2ib client drivers can use RoCE to perform Lustre data
transport using RDMA.
Cloud-hosted clients on a parallel filesystem raise issues with root in
a cloud compute context. On cloud infrastructure, privileged accesses
from a client do not have the same degree of trust as on conventional HPC
infrastructure. Lustre approaches this issue by introducing Kerberos
authentication for filesystem mounts and subsequent file accesses.
Kerberos credentials for Lustre filesystems can be supplied to OpenStack
instances upon creation as instance metadata.
Provisioned Filesystem Model
There are use cases where the dynamic provisioning of software-defined
parallel filesystems has considerable appeal. There have been
proof-of-concept demonstrations of provisioning Lustre filesystems from
scratch using OpenStack compute, storage and network resources.
The OpenStack Manila project aims to provision and manage shared
filesystems as an OpenStack service. IBM's Spectrum Scale integrates
with Manila to re-export GPFS parallel filesystems using the user-space
Ganesha NFS server.
Currently these projects demonstrate functionality over performance.
In future evolutions the overhead of dynamically provisioned parallel
filesystems on OpenStack infrastructure may improve.
A Parallel Data Substrate for OpenStack Services
IBM positions Spectrum Scale as a distributed data service for
underpinning OpenStack services such as Cinder, Glance, Swift and Manila.
More information about using Spectrum Scale in this manner can be found
in IBM Research's red paper on the subject (listed in the Further
Reading section).
Applying HPC Technologies to Enhance Data IO
A recurring theme throughout this study has been the use of remote DMA
for efficient data transfer in HPC environments. The advantages of this
technology are especially pertinent in data intensive environments.
OpenStack's flexibility enables the introduction of RDMA protocols
for many cloud infrastructure operations to reduce latency, increase
bandwidth and enhance processor efficiency:
Cinder block data IO can be performed using iSER (iSCSI extensions
for RDMA). iSER is a drop-in replacement for iSCSI that is easy to
configure and set up. Through providing tightly-coupled IO resources
using RDMA technologies, the functional equivalent of HPC-style burst
buffers can be added to the storage tiers of cloud infrastructure.
Ceph data transfers can be performed using the Accelio RDMA transport.
This technology was demonstrated some years ago but does not appear
to have achieved production levels of stability or gained significant
mainstream adoption.
The NOWLAB group at Ohio State University have developed extensions to
data analytics platforms such as HBase, Hadoop, Spark and Memcached to
optimise data movements using RDMA.
Optimising Ceph Storage for Data-Intensive Workloads
The versatility of Ceph embodies the cloud-native approach to storage,
and consequently Ceph has become a popular choice of storage technology
for OpenStack infrastructure. A single Ceph deployment can support
various protocols and data access models.
Ceph is capable of delivering strong read bandwidth. For large reads
from OpenStack block devices, Ceph is able to parallelise the delivery
of the read data across multiple OSDs.
Ceph's data consistency model commits writes to multiple OSDs before
a write transaction is completed. By default a write is replicated
three times. This can result in higher latency and lower performance
on write bandwidth.
Ceph can run on clusters of commodity hardware configurations. However,
in order to maximise the performance (or price performance) of a Ceph
cluster some design rules of thumb can be applied:
Use separate physical network interfaces for external storage network and
internal storage management. On the NICs and switches, enable Ethernet
flow control and raise the MTU to support jumbo frames.
Each drive used for Ceph storage is managed by an OSD process.
A Ceph storage node usually contains multiple drives (and multiple
OSD processes).
The best price/performance and highest density is achieved using fat
storage nodes, typically containing 72 HDDs. These work well for
large scale deployments, but can lead to very costly units of failure
in smaller deployments. Node configurations of 12-32 HDDs are usually
found in deployments of intermediate scale.
Ceph storage nodes usually contain a higher-speed write journal, which is
dedicated to service of a number of HDDs. An SSD journal can typically
feed 6 HDDs while an NVMe flash device can typically feed up to 20 HDDs.
About 10G of external storage network bandwidth balances the read
bandwidth of up to 15 HDDs. The internal storage management network
should be similarly scaled.
A rule of thumb for RAM is to provide 0.5GB-1GB of RAM per TB per
OSD daemon.
On multi-socket storage nodes, close attention should be paid to NUMA
considerations. The PCI storage devices attached to each socket should
be working together. Journal devices should be connected with HDDs
attached to HBAs on the same socket. IRQ affinity should be confined
to cores on the same socket. Associated OSD processes should be pinned
to the same cores.
For tiered storage applications in which data can be regenerated from
other storage, the replication count can safely be reduced from 3 to
2 copies.
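
The rules of thumb above can be combined into a rough sizing helper. The following is a minimal sketch only: the function name, parameters and result keys are hypothetical, the ratios are taken directly from the text, and the RAM rule is interpreted here as 0.5GB-1GB per TB of storage summed across all OSD daemons on the node.

```python
import math


def size_ceph_node(hdd_count, hdd_tb, journal="ssd", ram_gb_per_tb=1.0):
    """Estimate per-node resources from the rules of thumb (illustrative only)."""
    # Each drive used for Ceph storage is managed by one OSD process.
    osd_processes = hdd_count

    # An SSD journal typically feeds 6 HDDs; an NVMe device feeds up to 20.
    hdds_per_journal = {"ssd": 6, "nvme": 20}[journal]
    journal_devices = math.ceil(hdd_count / hdds_per_journal)

    # About 10G of external network bandwidth balances up to 15 HDDs.
    external_network_gbit = 10 * hdd_count / 15

    # 0.5GB-1GB of RAM per TB (assumed: per TB of storage behind each OSD).
    ram_gb = hdd_count * hdd_tb * ram_gb_per_tb

    return {
        "osd_processes": osd_processes,
        "journal_devices": journal_devices,
        "external_network_gbit": external_network_gbit,
        "ram_gb": ram_gb,
    }


if __name__ == "__main__":
    # A 12-drive node of 4TB HDDs with SSD journals.
    print(size_ceph_node(12, 4, journal="ssd"))
```

The internal storage management network would be scaled with the same ratio as the external network figure returned here.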
The Cancer Genome Collaboratory: Large-scale Genomics on OpenStack
Genome datasets can be hundreds of terabytes in size, sometimes requiring
weeks or months to download and significant resources to store and
.. image:: images/high_performance_data-oicr_logo.jpg
:width: 300
:align: right
:alt: OICR logo
The Ontario Institute for Cancer Research built the Cancer Genome
Collaboratory (or simply The Collaboratory) as a biomedical research
resource built upon OpenStack infrastructure. The Collaboratory aims
to facilitate research on the world's largest and most comprehensive
cancer genome dataset, currently produced by the International Cancer
Genome Consortium (ICGC).
By making the ICGC data available in cloud compute form in the
Collaboratory, researchers can bring their analysis methods to the cloud,
yielding benefits from the high availability, scalability and economy
offered by OpenStack, and avoiding the large investment in compute
resources and the time needed to download the data.
An OpenStack Architecture for Genomics
The Collaboratory's requirements for the project were to build a cloud
computing environment providing 3000 compute cores and 10-15 PB of raw
data stored in a scalable and highly-available storage. The project
has also met constraints of budget, data security, confined data centre
space, power and connectivity. In selecting the storage architecture,
capacity was considered to be more important than latency and performance.
Each rack hosts 16 compute nodes using 2U high-density chassis, and
between 6 and 8 Ceph storage nodes. Hosting a mix of compute and storage
nodes in each rack keeps some of the Nova-Ceph traffic in the same rack,
while also lowering the power requirement for these high density racks
(2 x 60A circuits are provided to each rack).
As of September 2016, Collaboratory has 72 compute nodes (2600 CPU
cores, Hyper-Threaded) with a physical configuration optimized for large
data-intensive workflows: 32 or 40 CPU cores and a large amount of RAM
(256 GB per node). The workloads make extensive use of high performance
local disk, incorporating hardware RAID10 across 6 x 2TB SAS drives.
The networking is provided by Brocade ICX 7750-48C top-of-rack switches
that use 6x40Gb cables to interconnect the racks in a ring stack topology,
providing 240 Gbps non-blocking redundant inter-rack connectivity,
at a 2:1 oversubscription ratio.
The Collaboratory is deployed using entirely community-supported free
software. The OpenStack control plane is Ubuntu 14.04 and deployment
configuration is based on Ansible. The Collaboratory was initially
deployed using OpenStack Juno and a year later upgraded to Kilo and
then Liberty.
Collaboratory deploys a standard HA stack based on Haproxy/Keepalived and
Mariadb-Galera using three controller nodes. The controller nodes also
perform the role of Ceph-mon and Neutron L3-agents, using three separate
RAID1 sets of SSD drives for MySQL, Ceph-mon and Mongodb processes.
The compute nodes have 10G Ethernet with GRE and SDN capabilities
for virtualized networking. The Ceph nodes use 2x10G NICs bonded for
client traffic and 2x10G NICs bonded for storage replication traffic.
The Controller nodes have 4x10G NICs in an active-active bond (802.3ad)
using layer3+4 hashing for better link utilisation. The OpenStack tenant
routers are highly-available with two routers distributed across the three
controllers. The configuration does not use Neutron DVR out of concern
for limiting the number of servers directly attached to the Internet.
The public VLAN is carried only on the trunk ports facing the controllers
and the monitoring server.
Optimising Ceph for Genomics Workloads
Upon workload start, the instances usually download data stored in Ceph's
object storage. OICR developed a download client that controls access
to sensitive ICGC protected data through managed tokens. Downloading a
100GB file stored in Ceph takes around 18 minutes, with another 10-12
minutes used to automatically check its integrity (md5sum), and is mostly
limited by the instance's local disk.
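
As a rough check of what these timings imply, the average throughput can be worked out from the figures quoted above. This is a sketch that assumes decimal units (1 GB = 1000 MB); the function name is hypothetical.

```python
def throughput_mb_per_s(size_gb, minutes):
    """Average throughput in MB/s, assuming 1 GB = 1000 MB."""
    return size_gb * 1000 / (minutes * 60)


# 100GB downloaded in around 18 minutes.
download = throughput_mb_per_s(100, 18)

# Integrity check (md5sum) of the same file in 10-12 minutes.
checksum_low = throughput_mb_per_s(100, 12)
checksum_high = throughput_mb_per_s(100, 10)

if __name__ == "__main__":
    print(f"download: about {download:.0f} MB/s")
    print(f"md5sum:   about {checksum_low:.0f}-{checksum_high:.0f} MB/s")
```

Under these assumptions the download averages a little over 90 MB/s, and the checksum pass reads the local copy at roughly 140-170 MB/s.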
The ICGC storage system adds a layer of control on top of Ceph's
object storage. Currently this is a 2-node cluster behind an Haproxy
instance serving the ICGC storage client. The server component uses
OICR's authorization and metadata systems to provide secure access to
related objects stored in Ceph. By using OAuth-based access tokens,
researchers can be given access to the Ceph data without having to
configure Ceph permissions. Access to individual project groups can
also be implemented in this layer.
Each Ceph storage node consists of 36 OSD drives (4, 6 or 8 TB) in
a large Ceph cluster currently providing 4 PB of raw storage, using
three replica pools. The radosgw pool has 90% of the Ceph space being
reserved for storing protected ICGC datasets, including the very large
whole genome aligned reads for almost 2000 donors. The remaining 10% of
Ceph space is used as a scalable and highly-available backend for Glance
and Cinder. Ceph radosgw was tuned for the specific genomic workloads,
mostly by increasing read-ahead on the OSD nodes, 65 MB as rados object
stripe for Radosgw and 8 MB for RBD.
Further Considerations and Future Directions
In the course of the development of the OpenStack infrastructure at the
Collaboratory, several issues have been encountered and addressed:
The instances used in cancer research are usually short lived
(hours/days/weeks), but with high resource requirements in terms of CPU
cores, memory and disk allocation. As a consequence of this pattern of
usage the Collaboratory OpenStack infrastructure does not support live
migration as a standard operating procedure.
The Collaboratory have encountered a few problems caused by Radosgw bugs
involving overlapping multipart uploads. However, these were detected by
the Collaboratory's monitoring system, and did not result in data loss.
The Collaboratory created a monitoring system that uses automated Rally
tests to monitor end-to-end functionality, and also downloads a random
large S3 object (around 100 GB) to confirm data integrity and monitor
object storage performance.
Because of the mix of very large (BAM), medium (VCF) and very small
(XML, JSON) files, the Ceph OSD nodes have imbalanced load and we have
to regularly monitor and rebalance data.
Currently, the Collaboratory is hosting 500TB of data from 2,000 donors.
Over the next 2 years, OICR will increase the number of ICGC genomes
available in the Collaboratory, with the goal of having the entire ICGC
data set of 25,000 donors, estimated to be 5PB, when the project completes
in 2018.
Although in a closed beta phase with only a few research labs having
accounts, there were more than 19,000 instances started in the last 18
months, with almost 7,000 in the last three months. One project that
uses the Collaboratory heavily is the PanCancer Analysis of Whole Genomes
(PCAWG), which characterizes the somatic and germline variants from
over 2,800 ICGC cancer whole genomes in 20 primary tumour sites.
In conclusion, the Collaboratory environment has been running well for
OICR and its partners. George Mihaiescu, senior cloud architect at OICR,
has many future plans for OpenStack and the Collaboratory:
“We hope to add new Openstack projects to the Collaboratory's offering
of services, with Ironic and Heat being the first candidates. We would
also like to provide new compute node configurations with RAID0 instead
of RAID10, or even SSD based local storage for improved IO performance.”
CLIMB: OpenStack, Parallel Filesystems and Microbial Bioinformatics
The Cloud Infrastructure for Microbial Bioinformatics (CLIMB) is a
collaboration between four UK universities (Swansea, Warwick, Cardiff
and Birmingham) and is funded by the UK's Medical Research Council.
CLIMB provides compute and storage as a free service to academic
microbiologists in the UK. After an extended period of testing, the
CLIMB service was formally launched in July 2016.
.. image:: images/high_performance_data-climb.jpg
:width: 400
:align: right
:alt: CLIMB hardware
CLIMB is a federation of 4 sites, configured as OpenStack regions.
Each site has an approximately equivalent configuration of compute nodes,
network and storage.
The compute node hardware configuration is tailored to support the
memory-intensive demands of bioinformatics workloads. The system as
a whole comprises 7680 CPU cores, in fat 4-socket compute nodes with
512GB RAM. Each site also has three large memory nodes with 3TB of RAM
and 192 hyper-threaded cores.
The infrastructure is managed and deployed using xCAT cluster management
software. The system runs the Kilo release of OpenStack, with packages
from the RDO distribution. Configuration management is automated
using Salt.
Each site has 500 TB of GPFS storage. Every hypervisor is a GPFS client,
and uses an InfiniBand fabric to access the GPFS filesystem. GPFS is
used for scratch storage space in the hypervisors.
For longer term data storage, to share datasets and VMs, and to provide
block storage for running VMs, CLIMB deploys a storage solution based
on Ceph. The Ceph storage is replicated between sites. Each site has 27
Dell R730XD nodes for Ceph storage servers. Each storage server contains
16x 4TB HDDs for Ceph OSDs, giving a total raw storage capacity of 6912TB.
After 3-way replication this yields a usable capacity of 2304TB.
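
The capacity figures above follow from simple arithmetic, sketched below. The constant names are illustrative, and the total is assumed to span all four sites of the federation, which is the reading that reproduces the quoted 6912TB.

```python
SITES = 4             # CLIMB is a federation of 4 sites
NODES_PER_SITE = 27   # Ceph storage servers per site
HDDS_PER_NODE = 16    # OSD drives per storage server
HDD_TB = 4            # capacity of each drive in TB
REPLICAS = 3          # 3-way replication

# Raw capacity across the whole federation.
raw_tb = SITES * NODES_PER_SITE * HDDS_PER_NODE * HDD_TB

# Usable capacity once every object is stored three times.
usable_tb = raw_tb // REPLICAS

if __name__ == "__main__":
    print(f"raw: {raw_tb}TB, usable: {usable_tb}TB")
```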
On two sites Ceph is used as the storage back end for Swift, Cinder
and Glance. At Birmingham GPFS is used for Cinder and Glance, with
plans to migrate to Ceph.
In addition to the InfiniBand network, a Brocade 10G Ethernet fabric is
used, in conjunction with dual-redundant Brocade Vyatta virtual routers
to manage cross-site connectivity.
In the course of deploying and trialling the CLIMB system, a number of
issues have been encountered and overcome.
* The Vyatta software routers were initially underperforming with
consequential impact on inter-site bandwidth.
* Some performance issues have been encountered due to NUMA topology
awareness not being passed through to VMs.
* Stability problems with Broadcom 10GBaseT drivers in the controllers
led to reliability issues. (Thankfully the HA failover mechanisms were
found to work as required).
* Problems with interactions between Ceph and Dell hardware RAID cards.
* Issues with InfiniBand and GPFS configuration.
CLIMB has future plans for developing their OpenStack infrastructure,
* Migrating from regions to Nova cells as the federation model between
* Integrating OpenStack Manila for exporting shared filesystems from
GPFS to guest VMs.
Further Reading
An IBM research study on integrating GPFS
(Spectrum Scale) within OpenStack environments:
A 2015 presentation from ATOS on using Kerberos authentication in Lustre:
Glyn Bowden of HPE and Alex Macdonald from SNIA discuss OpenStack
storage (including the Provisioned Filesystem Model using Lustre):
The High-Performance Big Data team at Ohio State University:
A useful talk from the 2016 Austin OpenStack Summit on Ceph design:
The Ontario Institute for Cancer Research Collaboratory:
Further details on the International Cancer Genome Consortium:
Dr Tom Connor presented CLIMB at the 2016 Austin OpenStack summit:
This document was written by Stig Telfer of `StackHPC Ltd <>`_ with the support
of Cambridge University, with contributions, guidance and feedback from
subject matter experts:
* **George Mihaiescu**, **Bob Tiernay**, **Andy Yang**, **Junjun Zhang**,
**Francois Gerthoffert**, **Christina Yung**, **Vincent Ferretti**
from the Ontario Institute for Cancer Research. The authors wish
to acknowledge the funding support from the Discovery Frontiers:
Advancing Big Data Science in Genomics Research program (grant
no. RGPGR/448167-2013, The Cancer Genome Collaboratory), which
is jointly funded by the Natural Sciences and Engineering Research
Council (NSERC) of Canada, the Canadian Institutes of Health Research
(CIHR), Genome Canada, and the Canada Foundation for Innovation (CFI),
and with in-kind support from the Ontario Research Fund of the Ministry
of Research, Innovation and Science.
* **Dr Tom Connor** from Cardiff University and the CLIMB collaboration.
.. figure:: images/cc-by-sa.png
:width: 100
:alt: Creative commons licensing
This document is provided as open source with a Creative Commons license
with Attribution + Share-Alike (CC-BY-SA)

OpenStack and HPC Infrastructure Management
In this section we discuss the emerging OpenStack use case for management
of HPC infrastructure. We introduce Ironic, the OpenStack bare metal
service and describe some of the differences, advantages and limitations
of managing HPC infrastructure as a bare metal OpenStack cloud.
Compared with OpenStack, established approaches to HPC infrastructure
management are very different. Conventional solutions offer much higher
scale, and much lower management plane overhead. However, they are also
inflexible, difficult to use and slow to evolve.
Through differences in the approach taken by cloud infrastructure
management, OpenStack brings new flexibility to HPC infrastructure
* OpenStack's integrated support for multi-tenancy infrastructure
introduces segregation between users and projects that require isolation.
* The cloud model enables the infrastructure deployed for different projects
to use entirely different software stacks.
* The software-defined orchestration of deployments is assumed.
This approach, sometimes referred to as “infrastructure as code”,
ensures that infrastructure is deployed and configured according to a
prescriptive formula, often maintained under source control in the same
manner as source code.
* The range of platforms supported by Ironic is highly diverse. Just about
any hardware can be, and has been, used in this context.
* The collaborative open development model of OpenStack ensures that
community support is quick and easy to obtain.
The “infrastructure as code” concept is also gaining traction among
some HPC infrastructure management platforms that are adopting proven
tools and techniques from the cloud infrastructure ecosystem.
Deploying HPC Infrastructure at Scale
HPC infrastructure deployment is not the same as cloud deployment.
A cloud assumes large numbers of users, each administering a small
number of instances compared to the overall size of the system. In a
multi-tenant environment, each user may use different software images.
Without coordination between the tenants, it would be very unlikely for
more than a few instances to be deployed at any one time. The software
architecture of the cloud deployment process is designed around this
Conversely, HPC infrastructure deployment has markedly different
* A single user (the cluster administrator). HPC infrastructure is a
managed service, not user-administered.
* A single software image. All user applications will run in a single
common environment.
* Large proportions of the HPC cluster will be deployed simultaneously.
* Many HPC infrastructures use diskless compute nodes that network-boot
a common software image.
In the terminology of the cloud world, a typical HPC infrastructure
deployment might even be considered a “black swan event”. Cloud
deployment strategies do not exploit the simplifying assumptions that
deployments are usually across many nodes using the same image and for
the same user. Consequently, OpenStack Ironic deployments tend to scale
to the low thousands of compute nodes with current software releases
and best-practice configurations. Network booting a common image is a
capability that has only recently become possible in OpenStack and has
yet to become an established practice.
Bare Metal Management Using OpenStack Ironic
Using Ironic, bare metal compute nodes are automatically provisioned at
a user's request. Once the compute allocation is released, the bare
metal hardware is automatically decommissioned ready for its next use.
Ironic requires no presence on the compute node instances that it manages.
The software-defined infrastructure configuration that would typically
be applied in the hypervisor environment must instead be applied in
the hardware objects that interface with the bare metal compute node:
local disks, network ports, etc.
Support for a Wide Range of Hardware
A wide range of hardware is supported, from full-featured BMCs on
enterprise server equipment down to devices whose power can only be
controlled through an SNMP-enabled data centre power strip.
An inventory of compute nodes is registered with Ironic and stored in
Ironic's node database. Ironic records configuration details and
current hardware state, including:
* Physical properties of the compute node, including CPU count, RAM size
and disk capacity.
* The MAC address of the network interface used for provisioning instance
software images.
* The hardware drivers used to control and interact with the compute node.
* Details needed by those drivers to address this specific compute node
(for example, BMC IP address and login credentials).
* The current power state and provisioning state of the compute node,
including whether it is in active service.
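
To make the shape of such an inventory entry concrete, the following is a minimal sketch of a record holding the items listed above. It is an illustration only and not Ironic's actual schema: the class name, the field names and the ``driver_info`` keys are assumptions chosen for readability, and the addresses are documentation placeholders.

.. code-block:: python

    from dataclasses import dataclass, field
    from typing import Dict


    @dataclass
    class NodeRecord:
        """Illustrative inventory entry (hypothetical, not Ironic's schema)."""
        name: str
        cpus: int                  # physical property: CPU count
        memory_mb: int             # physical property: RAM size
        local_gb: int              # physical property: disk capacity
        provisioning_mac: str      # MAC address of the provisioning interface
        driver: str                # name of the hardware driver in use
        # Details the driver needs to address this node, e.g. BMC address.
        driver_info: Dict[str, str] = field(default_factory=dict)
        power_state: str = "power off"
        provision_state: str = "available"

        def in_active_service(self) -> bool:
            """A node counts as in service once its provision state is active."""
            return self.provision_state == "active"


    node = NodeRecord(
        name="compute-001",
        cpus=24,
        memory_mb=131072,
        local_gb=400,
        provisioning_mac="52:54:00:12:34:56",
        driver="ipmi",
        driver_info={"bmc_address": "192.0.2.10", "bmc_username": "admin"},
    )
    print(node.name, node.power_state, node.in_active_service())

Keeping driver-specific details in a free-form mapping, separate from the physical properties, mirrors the split described in the list: the same record layout can then describe nodes managed by very different hardware drivers.
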
Inventory Grooming through Hardware Inspection
A node is initially registered with a minimal set of identifying
credentials - sufficient to power it on and boot a ramdisk. Ironic
generates a detailed hardware profile of every compute node through a
process called Hardware Inspection.
Hardware inspection uses this minimal bootstrap configuration provided
during node registration. During the inspection phase a custom ramdisk
is booted which probes the hardware configuration and gathers data.
The data is posted back to Ironic to update the node inventory. Large
amounts of additional hardware profile data are also made available for
later analysis.
The inspection process can optionally run benchmarks to identify
performance anomalies across a group of nodes. Anomalies in the hardware
inspection dataset of a group of nodes can be analysed using a tool
called Cardiff. Performance anomalies, once identified, can often be
traced to configuration anomalies. This process helps to isolate and
eliminate potential issues before a new system enters production.
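
The idea of spotting performance anomalies across a group of supposedly identical nodes can be sketched in a few lines. This is not how Cardiff works internally; it is a simplified stand-in that flags any node whose benchmark result differs from the group median by more than a chosen tolerance. The function name, the tolerance and the sample figures are all assumptions made for illustration.

.. code-block:: python

    from statistics import median


    def find_outliers(results, tolerance=0.10):
        """Return node names whose result deviates from the group median
        by more than `tolerance` (expressed as a fraction of the median)."""
        mid = median(results.values())
        return sorted(
            name for name, value in results.items()
            if abs(value - mid) > tolerance * mid
        )


    # Hypothetical memory bandwidth results in GB/s for four identical nodes.
    memory_bandwidth = {
        "node-01": 101.2,
        "node-02": 99.8,
        "node-03": 100.5,
        "node-04": 71.3,
    }
    print(find_outliers(memory_bandwidth))

A node flagged in this way would then be compared against its peers for configuration differences, such as BIOS settings or memory population.
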
Bare Metal and Network Isolation
Ironic's support for multi-tenant network isolation is a
new capability, first released in OpenStack's Newton release cycle.
This capability requires some mapping of the network switch ports
connected to each compute node. The mapping of an Ironic network port to
its link partner switch port is maintained with identifiers for switch
and switch port. These are stored as attributes in the Ironic network
port object. Currently the generation of the network mapping is not
automated by Ironic.
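
Because this mapping has to be generated outside Ironic, sites typically script it. The sketch below builds the description of one network port together with the identifiers of its link-partner switch port. The helper name is hypothetical. The attribute names (``local_link_connection``, ``switch_id``, ``port_id``, ``switch_info``) reflect our understanding of Ironic's port object and should be checked against the release in use.

.. code-block:: python

    def make_port(mac_address, switch_id, port_id, switch_info=""):
        """Describe a node network port and the switch port it is cabled to."""
        if not switch_id or not port_id:
            raise ValueError("both switch_id and port_id are required")
        return {
            "address": mac_address,
            "local_link_connection": {
                "switch_id": switch_id,      # identifies the switch
                "port_id": port_id,          # identifies the port on that switch
                "switch_info": switch_info,  # optional free-form hint
            },
        }


    # All values below are placeholders for illustration.
    port = make_port(
        mac_address="52:54:00:12:34:56",
        switch_id="00:1c:73:aa:bb:cc",
        port_id="Ethernet1/12",
        switch_info="tor-rack-07",
    )
    print(port["local_link_connection"])

In practice the switch and port identifiers might be harvested from LLDP data or from a cabling database. Either source is an assumption here, as the text above does not prescribe one.
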
Multi-tenant networking is implemented through configuring state in the
attached switch port. The state could be the access port VLAN ID for
a VLAN network, or VTEP state for a VXLAN network. Currently only a
subset of Neutron drivers are able to perform the physical switch port
state manipulations needed by Ironic. Switches with VXLAN VTEP support
and controllable through the OVSDB protocol are likely to be supported.
Ironic maintains two private networks of its own, dedicated to node
provisioning and node cleaning. Both are defined in Neutron as provider
networks. When a node is deployed, its network port is placed into the
provisioning network. Upon successful deployment the node is connected
to the virtual tenant network for active service. Finally, when the node
is destroyed it is placed on the cleaning network. Maintaining distinct
networks for each role enhances security, and the logical separation of
traffic enables different QoS attributes to be assigned for each network.
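
The sequence of network attachments described above amounts to a lookup from lifecycle phase to network. The sketch below uses only three phases, far fewer than Ironic's real state machine, and the network names are placeholders. It illustrates the concept and is not Ironic code.

.. code-block:: python

    from enum import Enum


    class NodePhase(Enum):
        """Simplified lifecycle phases (an assumption for illustration)."""
        DEPLOYING = "deploying"
        ACTIVE = "active"
        CLEANING = "cleaning"


    def network_for(phase, tenant_network,
                    provisioning_network="provisioning-net",
                    cleaning_network="cleaning-net"):
        """Return the network a node's port is attached to in each phase."""
        return {
            NodePhase.DEPLOYING: provisioning_network,
            NodePhase.ACTIVE: tenant_network,
            NodePhase.CLEANING: cleaning_network,
        }[phase]


    for phase in NodePhase:
        print(phase.value, "->", network_for(phase, tenant_network="project-a-net"))

Only the active phase depends on the tenant. The provisioning and cleaning networks are shared infrastructure that tenants never see.
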
Current Limitations of Ironic Multi-tenant Networking
In HPC hardware configurations, compute nodes are attached to multiple
networks. Separate networks dedicated to management and high-speed data
communication are typical.
Current versions of Ironic do not have adequate support for attaching
nodes to multiple physical networks. Multiple physical interfaces can
be defined for a node, and a node can be attached to multiple Neutron
networks. However, it is not possible to attach specific physical
interfaces to specific networks.
Consequently, with current capabilities only a single network interface
should be managed by Ironic. Other physical networks would be managed
outside of OpenStack's purview, and would not benefit from OpenStack's
multi-tenant network capabilities as a result. Furthermore, Ironic only
supports a single network per physical port: all network switch ports
for Ironic nodes are access ports. Trunk ports are not yet supported
although this feature is in the development backlog.
Remote Console Management
Many server management products include support for remote consoles,
both serial and video. Ironic includes drivers for serial consoles,
built upon support in the underlying hardware.
Recently developed capabilities within Ironic have seen bare metal
consoles integrated with OpenStack Nova's framework for managing
virtual consoles. Ironic's node kernel boot parameters are extended
with a serial console port, which is then redirected by the BMC to
serial-over-LAN. Server consoles can be presented in the Horizon web
interface in the same manner as virtualised server consoles.
Currently this capability is only supported for IPMI-based server
management.
Security and Integrity
When bare metal compute is sold as an openly-accessible service,
privileged access is granted to a bare metal system. There is substantial
scope for a malicious user to embed malware payloads in the BIOS and
device firmware of the system.
Ironic counters this threat in several ways:
* **Node Cleaning**: Ironic's node state machine includes states where
hardware state is reset and consistency checks can be run to detect
attempted malware injection. Ironic's default hardware manager does
not support these hardware-specific checks. However, custom hardware
drivers can be developed to include BIOS configuration settings and
firmware integrity tests.
* **Network Isolation**: Through using separate networks for node provisioning,
active tenant service and node cleaning, the opportunities for a
compromised system to probe and infect other systems across the network
are greatly reduced.
* **Trusted Boot**: this requires a Trusted Platform Module (TPM) and a
chain of trust built upon it. These processes are used to secure public
cloud deployments of Ironic-administered bare metal compute today.
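
As an indication of what a firmware integrity test in a custom cleaning step might involve, the sketch below compares SHA-256 digests of firmware images against a known-good manifest. It is deliberately generic. It does not use Ironic's hardware manager interface, and reading firmware back from real devices is vendor-specific and outside its scope. The component names and byte strings are hypothetical.

.. code-block:: python

    import hashlib


    def sha256_hex(blob: bytes) -> str:
        """Return the SHA-256 digest of a firmware image as a hex string."""
        return hashlib.sha256(blob).hexdigest()


    def find_tampered(known_good, observed):
        """Return the names of components whose observed image does not
        match the known-good digest, or that have no known-good entry."""
        return sorted(
            name for name, blob in observed.items()
            if known_good.get(name) != sha256_hex(blob)
        )


    # Manifest recorded when the node was commissioned (placeholder data).
    known_good = {
        "bios": sha256_hex(b"bios-image-v1.2"),
        "nic": sha256_hex(b"nic-firmware-v7"),
    }
    # Images read back during cleaning (placeholder data).
    observed = {
        "bios": b"bios-image-v1.2",
        "nic": b"nic-firmware-v7-modified",
    }
    print(find_tampered(known_good, observed))

A node reporting any mismatch would be held out of service for investigation rather than returned to the pool of available hardware.
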
None of these capabilities is enabled by default. Hardening Ironic's
security model requires expertise and some amount of effort.
Provisioning at Scale
The cloud model use case makes different assumptions to HPC. A cloud
is expected to support a large number of individual users. At any
time, each user is assumed to make comparatively small changes to their
compute resource usage. The HPC infrastructure use case is dramatically
different. HPC infrastructure typically runs a single software image
across the entire compute partition, and is likely to be deployed jointly
in one operation.
Ironic's current deployment models do not scale as well as the models
used by conventional HPC infrastructure management platforms. xCAT uses
a hierarchy of subordinate service nodes to fan out an iSCSI-based
image deployment. The Rocks cluster toolkit uses BitTorrent to distribute
RPM packages to all nodes. In the Rocks model, each deployment target
is a torrent peer. The capacity of the deployment infrastructure grows
alongside the number of targets being deployed.
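
The difference between a fixed-capacity server and a peer-assisted model can be shown with a deliberately idealised calculation. The sketch below is a toy model. It ignores protocol overhead, per-node download limits and startup effects, and every figure in it is an assumption rather than a measurement of xCAT, Rocks or Ironic.

.. code-block:: python

    def central_server_seconds(nodes, payload_gb, server_gbps):
        """Every node pulls the full payload from one server of fixed bandwidth."""
        return nodes * payload_gb * 8 / server_gbps


    def peer_assisted_seconds(nodes, payload_gb, server_gbps, peer_upload_gbps):
        """Idealised swarm: each target also uploads, so aggregate capacity
        grows with the number of targets."""
        aggregate_gbps = server_gbps + nodes * peer_upload_gbps
        return nodes * payload_gb * 8 / aggregate_gbps


    # Assumed figures: 2 GB payload, 10 Gbit/s server, 1 Gbit/s upload per peer.
    for nodes in (100, 1000):
        central = central_server_seconds(nodes, 2, 10)
        swarm = peer_assisted_seconds(nodes, 2, 10, 1)
        print(f"{nodes} nodes: central {central:.0f} s, peer-assisted {swarm:.1f} s")

In the central model the time grows linearly with the node count. In the peer-assisted model it levels off, which is the property the text above attributes to the Rocks approach.
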
However, the technologies for content distribution and caching that are
widely adopted by the cloud can be incorporated to address this issue.
Caching proxy servers can be used to speed up deployment at scale.
With appropriate configuration choices, Ironic can scale to handle
deployment to multiple thousands of servers.
.. figure:: images/hpc_infrastructure-ironic.png
:width: 600
:alt: Ironic node deployment flow diagram
*An overview of Ironic's node deployment process when using the Ironic
Python Agent ramdisk and Swift URLs for image retrieval. This strategy
demonstrates good scalability, but the deploy disk image cannot be bigger
than the RAM available on the node.*
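
The RAM constraint noted in the caption is easy to check before attempting a deployment. The sketch below is a hypothetical pre-flight check, not part of Ironic. The ``reserved_bytes`` parameter is an assumption that allows headroom for the ramdisk's own memory use.

.. code-block:: python

    GIB = 1024 ** 3


    def image_fits_in_ram(image_bytes, node_ram_bytes, reserved_bytes=0):
        """True if the image can be held in node memory while leaving
        `reserved_bytes` free for the deployment ramdisk itself."""
        return image_bytes <= node_ram_bytes - reserved_bytes


    # Placeholder sizes: a 64 GiB node with 2 GiB held back for the ramdisk.
    print(image_fits_in_ram(8 * GIB, 64 * GIB, reserved_bytes=2 * GIB))
    print(image_fits_in_ram(80 * GIB, 64 * GIB, reserved_bytes=2 * GIB))
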
Building Upon Ironic to Convert Infrastructure into HPC Platforms
The strengths of cloud infrastructure tooling become apparent once Ironic
has completed deployment. From this point a set of unconfigured compute
nodes must converge into the HPC compute platform required to meet the
user's needs. A rich ecosystem of flexible tools is available to
serve this purpose.
See the section
`OpenStack and HPC Workload Management <openstack-and-hpc-workloads.html>`_
for further details of some of the available approaches.
Chameleon: An Experimental Testbed for Computer Science
.. figure:: images/hpc_infrastructure-chameleon_logo.jpg
:width: 400
:alt: Chameleon logo
Chameleon is an infrastructure project implementing an experimental
testbed for Computer Science led by the University of Chicago, with the Texas
Advanced Computing Center (TACC), University of Texas at San Antonio
(UTSA), Northwestern University and Ohio State University as partners.
The Chameleon project is funded by the National Science Foundation.
The current system comprises ~600 nodes split between sites at TACC in
Austin and University of Chicago. The sites are interconnected with a
100G network. The compute nodes are divided into twelve racks, referred
to as “standard cloud units”, comprising 42 compute nodes, 4 storage
nodes with 16 2 TB hard drives each, and 10G Ethernet connecting all nodes
with an SDN-enabled top-of-rack switch. Each SCU has 40G Ethernet uplinks
into the Chameleon core network fabric. Onto this largely homogeneous
framework were grafted heterogeneous elements, allowing for different
types of experimentation. One SCU has Mellanox ConnectX-3 InfiniBand.
Two compute nodes were set up as storage hierarchy nodes with 512 GB
of memory, two Intel P3700 NVMe of 2.0 TB each, four Intel S3610 SSDs of
1.6 TB each, and four 15K SAS HDDs of 600 GB each. Two additional nodes
are equipped with NVIDIA Tesla K80 accelerators and two with NVIDIA
Tesla M40 accelerators.
In the near term, additional heterogeneous cloud units for experimentation
with alternate processors and networks will be incorporated, including
FPGAs, Intel Atom microservers and ARM microservers. Compute nodes with
GPU accelerators have already been added to Chameleon.
Chameleon's public launch was at the end of July 2015; since then
it has supported over 200 research projects into computer science and
cloud computing.
The system is designed to be deeply reconfigurable and adaptive, to
produce a wide range of flexible configurations for computer science
research. Chameleon uses the OpenStack Blazar project to manage advance
reservation of compute resources for research projects.
Chameleon deploys OpenStack packages from RDO, orchestrated using
OpenStack Puppet modules. Chameleon's management services currently
run CentOS 7 and OpenStack Liberty. Through Ironic a large proportion
of the compute nodes are provided to researchers as bare metal (a
few SCUs are dedicated to virtualised compute instances using KVM).
Chameleon's Ironic configuration uses the popular driver pairing of
PXE-driven iSCSI deployment and IPMItool power management.
Ironic's capabilities have expanded dramatically in the year since
Chameleon first went into production, and many of the new capabilities
will be integrated into this project.
The Chameleon project's wish list for Ironic capabilities includes:
* Ironic-Cinder integration, orchestrating the attachment of network block
devices to bare metal instances. This capability has been under active
development in Ironic and at the time of writing it is nearing completion.
* Network isolation, placing different research projects onto different
VLANs to minimise their interference with one another. Chameleon hosts
projects researching radically different forms of networking, which must
be segregated.
* Bare metal consoles, enabling researchers to interact with their allocated
compute nodes at the bare metal level.
* BIOS parameter management, enabling researchers to (safely) change
BIOS parameters, and then to restore default parameters at the end of
an experiment.
Pierre Riteau, DevOps lead for the Chameleon project, sees Chameleon as
an exciting use case for Ironic, which is currently developing many of
these features:
“With the Ironic project, OpenStack provides a modern bare-metal
provisioning system benefiting from an active upstream community, with
each new release bringing additional capabilities. Leveraging Ironic
and the rest of the OpenStack ecosystem, we were able to launch Chameleon
in a very short time.”
“However, the Ironic software is still maturing, and can lack in
features or scalability compared to some other bare-metal provisioning
software, especially in an architecture without a scalable Swift
“Based on our experience, we recommend getting familiar with the
other core OpenStack projects when deploying Ironic. Although Ironic
can be run as standalone using Bifrost, when deployed as part of an
OpenStack it interacts closely with Nova, Neutron, Glance, and Swift.
And as with all bare-metal provisioning systems, it is crucial to
have serial console access to compute nodes in order to troubleshoot
deployment failures, which can be caused by all sorts of hardware issues
and software misconfigurations.”
“We see the future of OpenStack in this area as providing a fully
featured system capable of efficiently managing data centre resources,
from provisioning operating systems to rolling out firmware upgrades
and identifying performance anomalies.”
BRIDGES: A Next-Generation HPC Resource for Data Analytics
Bridges is a supercomputer at the Pittsburgh Supercomputing Center funded
by the National Science Foundation. It is designed as a uniquely flexible
HPC resource, intended to support both traditional and non-traditional
workflows. The name implies the system's aim to “bridge the research
community with HPC and Big Data.”
Bridges supports a diverse range of use cases, including graph analytics,
machine learning and genomics. As a flexible resource, Bridges supports
traditional SLURM-based batch workloads, Docker containers and interactive
web-based workflows.
Bridges has 800 compute nodes, 48 of which have dual-GPU accelerators
from NVIDIA. There are also 46 high-memory nodes, including 4 with
12 TB of RAM each. The entire system is interconnected with an OmniPath
high-performance 100G network fabric.
Bridges is deployed using community-supported free software. The
OpenStack control plane is CentOS 7 and Red Hat RDO (a freely available
packaging of OpenStack for Red Hat systems). OpenStack deployment
configuration is based on the PackStack project. Bridges was deployed
using OpenStack Liberty and is scheduled to be upgraded to OpenStack
Mitaka in the near future.
Most of the nodes are deployed in a bare metal configuration using Ironic.
Puppet is used to select the software role of a compute node at boot
time, avoiding the need to re-image. For example, a configuration for
MPI, Hadoop or virtualisation could be selected according to workload.
OmniPath networking is delivered using the OFED driver stack. Compute
nodes use IP over OPA for general connectivity. HPC apps use RDMA verbs
to take full advantage of OmniPath's capabilities.
.. figure:: images/hpc_infrastructure-bridges.png
:width: 600
:alt: PSC BRIDGES network architecture
*Visualisation of the Bridges OmniPath network topology. 800 general-purpose
compute nodes and GPU nodes are arrayed along the bottom of the
topology. Special purpose compute nodes, storage and control plane nodes
are arrayed across the top of the topology. 42 compute nodes connect to
each OmniPath ToR switch (in yellow), creating a “compute island”,
with 7:1 oversubscription into the upper stages of the network.*
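
The oversubscription figure in the caption follows from the ratio of downlink to uplink capacity at each top-of-rack switch. In the sketch below, the count of six uplinks per switch is not stated in the source. It is inferred from the 42 downlinks and the 7:1 ratio, on the assumption that all links run at the same 100G speed.

.. code-block:: python

    from fractions import Fraction


    def oversubscription(downlinks, uplinks, downlink_gbps=100, uplink_gbps=100):
        """Ratio of total downlink capacity to total uplink capacity."""
        return Fraction(downlinks * downlink_gbps, uplinks * uplink_gbps)


    # 42 compute nodes per switch; 6 uplinks is an inferred assumption.
    ratio = oversubscription(downlinks=42, uplinks=6)
    print(f"{ratio.numerator}:{ratio.denominator}")
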
Bridges Exposes Issues at Scale
The Bridges system is a very large deployment for Ironic. While there are
no exact numbers, Ironic has been quoted to scale to thousands of nodes.
Coherency issues between Nova Scheduler and Ironic could arise if
too many nodes were deployed simultaneously. Introducing delays
during the scripting of the "nova boot" commands kept things in check.
Node deployments would be held to five building instances with
subsequent instances staggered by 25 seconds, resulting in automated
deployment of the entire machine taking 1-2 days.
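
The effect of this throttling can be explored with a small simulation. The sketch below models the two constraints described above: a cap of five instances building at once and a 25 second stagger between launches. It uses a virtual clock, so nothing is actually launched. The per-node build time is not given in the source, and the ten-minute default is purely an assumption.

.. code-block:: python

    import heapq


    def simulate_rollout(node_count, max_building=5, stagger_seconds=25,
                         build_seconds=600):
        """Return the simulated time in seconds at which the last node
        finishes building."""
        building = []        # min-heap of finish times for in-flight builds
        now = 0.0            # virtual clock
        last_finish = 0.0
        for _ in range(node_count):
            # Discard builds that have already finished.
            while building and building[0] <= now:
                heapq.heappop(building)
            # If the cap is reached, wait for the earliest build to finish.
            if len(building) >= max_building:
                now = heapq.heappop(building)
            finish = now + build_seconds
            heapq.heappush(building, finish)
            last_finish = max(last_finish, finish)
            # The next launch happens no sooner than one stagger interval later.
            now += stagger_seconds
        return last_finish


    hours = simulate_rollout(800) / 3600
    print(f"simulated rollout of 800 nodes: {hours:.1f} hours")

With these assumed parameters the cap on concurrent builds, rather than the stagger interval, determines the overall duration. A different build time would change the result proportionally.
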
Within Ironic the periodic polling of driver power states is serialised.
BMCs can be very slow to respond, and this can cause the time taken
to poll all power states in series to grow quite large. On Bridges,
the polling takes approximately 8 minutes to complete. This can also
lead to apparent inconsistencies of state between Nova and Ironic, and
the admin team work around this issue by enforcing “settling time”
between deleting a node and reprovisioning it.
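
A quick calculation shows what the eight-minute figure implies per BMC. The node count of 800 used below is an approximation taken from the compute node total quoted earlier, since the exact number of nodes polled is not stated. The parallel variant is a hypothetical what-if and does not describe how Ironic behaves.

.. code-block:: python

    import math


    def implied_seconds_per_bmc(total_poll_seconds, node_count):
        """Average time per BMC implied by one serial polling pass."""
        return total_poll_seconds / node_count


    def poll_pass_seconds(node_count, seconds_per_bmc, workers=1):
        """Duration of one polling pass; workers=1 models the serial case."""
        return math.ceil(node_count / workers) * seconds_per_bmc


    per_bmc = implied_seconds_per_bmc(total_poll_seconds=8 * 60, node_count=800)
    serial = poll_pass_seconds(800, per_bmc)
    parallel = poll_pass_seconds(800, per_bmc, workers=8)
    print(f"implied average per BMC: {per_bmc:.2f} s")
    print(f"serial pass: {serial / 60:.1f} min")
    print(f"hypothetical 8-way pass: {parallel / 60:.1f} min")

Even a sub-second response per BMC adds up to minutes when the requests are issued one after another, which is why the settling time is needed.
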
Benefiting from OpenStack and Contributing Back
The team at PSC have found benefits from using OpenStack for HPC system
management:
* The ability to manage system image creation using OpenStack tools such
as diskimage-builder.
* Ironic's automation of the management of PXE node booting.
* The prescriptive repeatable deployment process developed by the team
using Ironic and Puppet.
Robert Budden, senior cluster systems developer at PSC, has many future
plans for OpenStack and Bridges:
* Using other OpenStack services such as Magnum (Containerised workloads),
Sahara (Hadoop on the fly) and Trove (database as a service).
* Developing Ironic support for network boot over OmniPath.
* Diskless boot of extremely large memory nodes using Ironic's Cinder
integration.
* Deployment of a containerised OpenStack control plane using Kolla.
* Increased convergence between bare metal and virtualised OpenStack
Robert adds:
“One of the great things is that as OpenStack improves, Bridges can
improve. As these new projects come online, we can incorporate those
features and the Bridges architecture can grow with the community.”
“A big thing for me is to contribute back. I'm a developer by nature,
I want to fix some of the bugs and scaling issues that I've seen and
push these back to the OpenStack community.”
A $200 Million OpenStack-Powered Supercomputer
In 2014 and 2015 the US Department of Energy announced three new
giant supercomputers, totalling $525 million, to be procured under the
CORAL (Collaboration of Oak Ridge, Argonne and Livermore) initiative.
Argonne National Laboratory's $200 million system, Aurora, features a
peak performance of 180 PFLOPS delivered by over 50,000 compute nodes.
Aurora is expected to be 18 times more powerful than Argonne's current
flagship supercomputer (Mira).
Aurora is to be deployed in 2018 by Intel, in partnership with Cray.
Aurora exemplifies the full capabilities of Intel's Scalable System
Framework initiative. Whilst Intel are providing the processors,
memory technology and fabric interconnect, Cray's long experience
and technical expertise in system integration are also fundamental to
Aurora's successful delivery.
.. figure:: images/hpc_infrastructure-aurora.jpg
:width: 600
:alt: Aurora floorplan render
Cray's Vision of the OpenStack-Powered Supercomputer
Cray today sells a wide range of products for supercomputing, storage
and high-performance data analytics. Aside from the company's core
offering of supercomputer systems, much of Cray's product line has come
through acquisition. As a result of this historical path the system
management of each product is different, has different capabilities,
and different limitations.
The system management software that powers Cray's supercomputers has
developed through long experience to become highly scalable and efficient.
The software stack is bespoke and specialised to delivering this single
capability. In some ways, its inflexible excellence represents the
antithesis of OpenStack and software-defined cloud infrastructure.
Faced with these challenges, and with customer demands for open management
interfaces, in 2013 Cray initiated a development programme for a unified
and open solution for system management across the product range.
Cray's architects quickly settled on OpenStack. OpenStack relieves
the Cray engineering team of the generic aspects of system management
and frees them up to focus on problems specific to the demanding nature
of the products.
Successful OpenStack development strategies strongly favour an open
approach. Cray teams have worked with OpenStack developer communities to
bring forward the capabilities required for effective HPC infrastructure
management, for example:
* **Enhanced Ironic deployment**, using the Bareon ramdisk derived from the Fuel
deployment project. Cray management servers require complex deployment
configurations featuring multiple partitions and system images.
* **Diskless Ironic deployment**, through active participation in the
development of Cinder and Ironic integration.
* **Ironic multi-tenant networking**, through submission of bug fixes and
demonstration use cases.
* **Containerised OpenStack deployment**, through participation in the OpenStack
Kolla project.
* **Scalable monitoring infrastructure**, through participation in the Monasca
project.
Fundamental challenges still remain for Cray to deliver
OpenStack-orchestrated system management for supercomputer systems on
the scale of Aurora. Kitrick Sheets, senior principal engineer at Cray
and architect of Cray's OpenStack strategy, comments:
“Cray has spent many years developing infrastructure management
capabilities for high performance computing environments. The emergence
of cloud computing and OpenStack has provided a foundation for common
infrastructure management APIs. The abstractions provided within the
framework of OpenStack provide the ability to support familiar outward
interfaces for users who are accustomed to emerging elastic computing
environments while supporting the ability to provide features and
functions required for the support of HPC-class workloads. Normalizing
the user and administrator interfaces also has the advantage of increasing
software portability, thereby increasing the pace of innovation.”
“While OpenStack presents many advantages for the management of HPC
environments, there are many opportunities for improvement to support the
high performance, large scale use cases. Areas such as bulk deployment
of large collections of nodes, low-overhead state management, scalable
telemetry, etc. are a few of these. Cray will continue to work with
the community on these and other areas directly related to support of
current and emerging HPC hardware and software ecosystems.”
“We believe that additional focus on performance and scale which
drive toward the support of the highest-end systems will pay dividends
on systems of all sizes. In addition, as system sizes increase,
the incidents of hardware and software component failures become more
frequent, requiring increased resilience of services to support continual
operation. The community's efforts toward live service updates is one
area that will move us much further down that path.”
“OpenStack provides significant opportunities for providing core
management capabilities for diverse hardware and software ecosystems.
We look forward to continuing our work with the community to enhance
and extend OpenStack to address the unique challenges presented by high
performance computing environments.”
Most of the Benefits of Software-Defined Infrastructure...
In the space of HPC infrastructure management, OpenStack's attraction
is centred on the prospect of having all the benefits of software-defined
infrastructure while paying none of the performance overhead.
To date there is no single solution that can provide this. However,
a compromise can be struck in various ways:
* Fully-virtualised infrastructure provides all capabilities of cloud with
much of the performance overhead of cloud.
* Virtualised infrastructure using techniques such as SR-IOV and PCI
pass-through dramatically improves performance for network and IO
intensive workloads, but imposes some constraints on the flexibility of
software-defined infrastructure.
* Bare metal infrastructure management using Ironic incurs no performance
overhead, but has further restrictions on flexibility.
Each of these strategies is continually improving. Fully-virtualised
infrastructure using OpenStack private cloud provides control over
performance-sensitive parameters like resource over-commitment and
hypervisor tuning. It is anticipated that infrastructure using hardware
device pass-through optimisations will soon be capable of supporting cloud
capabilities like live migration. Ironic's bare metal infrastructure
management is continually developing new ways of presenting physical
compute resources as though they were virtual.
OpenStack has already arrived in the HPC infrastructure management
ecosystem. Projects using Ironic for HPC infrastructure management
have already demonstrated success. As Ironic matures, its proposition
of software-defined infrastructure without the overhead will become
increasingly compelling.
A Rapidly Developing Project
While it is rapidly becoming popular, Ironic is a relatively young
project within OpenStack. Some areas are still being actively developed.
For sites seeking to deploy Ironic-administered compute hardware, some
limitations remain. However, Ironic has a rapid pace of progress,
and new capabilities are released with every OpenStack release cycle.
HPC infrastructure management using OpenStack Ironic has been demonstrated
at over 800 nodes, while Ironic is claimed to scale to managing thousands
of nodes. However, new problems become apparent at scale. Currently,
large deployments using Ironic should plan for an investment in the
skill set of the administration team and active participation within
the Ironic developer community.
Further Reading
A clear and helpful introduction to the workings of Ironic in greater
depth:
Deployment of Ironic as a standalone tool:
Kate Keahey from University of Chicago presented an architecture
show-and-tell on Chameleon at the OpenStack Austin summit in April 2016:
Chameleon Cloud's home page is at:
Robert Budden presented an architecture show-and-tell
on Bridges at the OpenStack Austin summit in April 2016:
Further information on Bridges is available at its home page at PSC:
Argonne National Lab's home page for Aurora:
A presentation from Intel giving an overview of Aurora:
Intel's Scalable System Framework:
This document was originally written by Stig Telfer of `StackHPC Ltd <>`_ with the support
of Cambridge University, with contributions, guidance and feedback from
subject matter experts:
* **Pierre Riteau**, University of Chicago and Chameleon DevOps lead.
* **Kate Keahey**, University of Chicago and Chameleon Science Director.
* **Robert Budden**, Senior Cluster Systems Developer, Pittsburgh Supercomputing Center.
* **Kitrick Sheets**, Senior Principal Engineer, Cray Inc.
.. figure:: images/cc-by-sa.png
:width: 100
:alt: Creative commons licensing
This document is provided as open source with a Creative Commons license
with Attribution + Share-Alike (CC-BY-SA)

