User Requirements

A multi-site architecture is complex and carries its own risks and considerations. When contemplating the design of such an architecture, it is therefore important to make sure that it meets the user and business requirements.

Many jurisdictions have legislative and regulatory requirements governing the storage and management of data in cloud environments. Common areas of regulation include:

- Data retention policies ensuring storage of persistent data and records management to meet data archival requirements.
- Data ownership policies governing the possession of and responsibility for data.
- Data sovereignty policies governing the storage of data in foreign countries or otherwise separate jurisdictions.
- Data compliance policies governing types of information that need to reside in certain locations due to regulatory issues and, more importantly, cannot reside in other locations for the same reason.

Examples of such legal frameworks include the data protection framework of the European Union (http://ec.europa.eu/justice/data-protection) and the requirements of the Financial Industry Regulatory Authority (http://www.finra.org/Industry/Regulation/FINRARules) in the United States. Consult a local regulatory body for more information.

Workload Characteristics

The expected workload is a critical requirement that needs to be captured to guide decision-making. It is important to understand the workloads in the context of the desired multi-site environment and use case. Another way of thinking about a workload is as the way the systems are used. A workload could be a single application or a suite of applications that work together. It could also be a duplicate set of applications that need to run in multiple cloud environments. Often in a multi-site deployment the same workload will need to work identically in more than one physical location.

This multi-site scenario likely includes one or more of the other scenarios in this book, with the additional requirement of having the workloads in two or more locations. The following are some possible scenarios:

For many use cases the proximity of the user to their workloads has a direct influence on the performance of the application and should therefore be taken into consideration in the design. Certain applications require very low latency that can only be achieved by deploying the cloud in multiple locations. These locations could be in different data centers, cities, countries or geographical regions, depending on the user requirement and the location of the users.

Consistency of images and templates across different sites

It is essential that the deployment of instances is consistent across the different sites. This needs to be built into the infrastructure. If OpenStack Object Store is used as a back end for Glance, it is possible to create repositories of consistent images across multiple sites. A central endpoint with multiple storage nodes provides consistent, centralized storage for every site.

Without a centralized object store, maintaining a consistent image library requires more operational overhead. This could include developing a replication mechanism to handle the transport of images, and of changes to those images, across multiple sites.
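As an illustration of what checking image consistency could involve, the following is a minimal sketch that compares per-site image inventories against a reference library. The inventory format (image name mapped to a checksum), the site names and the function `find_image_drift` are all hypothetical; a real deployment would obtain this data from its image service rather than from literal dictionaries.

```python
from typing import Dict, List

# Hypothetical inventory format: image name -> checksum, one dict per site.
SiteImages = Dict[str, str]


def find_image_drift(reference: SiteImages,
                     sites: Dict[str, SiteImages]) -> Dict[str, List[str]]:
    """Return, per site, the image names that are missing or whose
    checksum differs from the reference library."""
    drift: Dict[str, List[str]] = {}
    for site_name, images in sites.items():
        stale = [name for name, checksum in reference.items()
                 if images.get(name) != checksum]
        if stale:
            drift[site_name] = sorted(stale)
    return drift


# Illustrative data only; names and checksums are made up.
reference = {"ubuntu-14.04": "a1f3", "centos-7": "9bc2"}
sites = {
    "site-a": {"ubuntu-14.04": "a1f3", "centos-7": "9bc2"},
    "site-b": {"ubuntu-14.04": "a1f3", "centos-7": "0000"},  # changed image
    "site-c": {"ubuntu-14.04": "a1f3"},                      # missing image
}

print(find_image_drift(reference, sites))
# {'site-b': ['centos-7'], 'site-c': ['centos-7']}
```

A report like this only identifies which sites need attention; transporting the images themselves is the job of whatever replication mechanism the deployment chooses.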

High Availability

If high availability is a requirement to provide continuous infrastructure operations, the basic high availability requirement should be defined. The OpenStack management components need to have a basic and minimal level of redundancy. The simplest example is that the loss of any single site has no significant impact on the availability of the OpenStack services of the entire infrastructure. The OpenStack High Availability Guide (http://docs.openstack.org/high-availability-guide/content/) contains more information on how to provide redundancy for the OpenStack components.

Multiple network links should be deployed between sites to provide redundancy for all components. This includes storage replication, which should be isolated to a dedicated network or VLAN with the ability to assign QoS to control or prioritize the replication traffic. Note that if the data store is highly changeable, the network requirements could have a significant effect on the operational cost of maintaining the sites.

The ability to maintain object availability in both sites has significant implications for the object storage design and implementation. It will also have a significant impact on the WAN network design between the sites.

Connecting more than two sites increases the challenges and adds complexity to the design considerations. Multi-site implementations require extra planning to address the additional topology complexity of internal and external connectivity. Some options include full mesh, hub-and-spoke, spine-leaf, or 3D torus topologies.

Not all the applications running in a cloud are cloud-aware. If that is the case, there should be clear measures and expectations to define what the infrastructure can support and, more importantly, what it cannot. An example would be shared storage between sites. It is possible, but such a solution is not native to OpenStack and requires a third-party hardware vendor to fulfill the requirement. Another example can be seen in applications that are able to consume resources in object storage directly. These applications need to be cloud-aware to make good use of an OpenStack Object Store.
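To give a feel for how the topology choice scales with the number of sites, the following sketch counts the minimum number of point-to-point inter-site links for two of the options mentioned above. It is plain arithmetic rather than a design tool: it assumes one link per pair of connected sites, so redundant links would multiply these figures, and the function names are illustrative.

```python
def full_mesh_links(sites: int) -> int:
    """Every site connects directly to every other site: n * (n - 1) / 2."""
    return sites * (sites - 1) // 2


def hub_and_spoke_links(sites: int) -> int:
    """One hub site, with every other site connected only to the hub."""
    return max(sites - 1, 0)


for n in (2, 3, 5, 8):
    print(f"{n} sites: full mesh = {full_mesh_links(n)}, "
          f"hub-and-spoke = {hub_and_spoke_links(n)}")
# 2 sites: full mesh = 1, hub-and-spoke = 1
# 3 sites: full mesh = 3, hub-and-spoke = 2
# 5 sites: full mesh = 10, hub-and-spoke = 4
# 8 sites: full mesh = 28, hub-and-spoke = 7
```

The full mesh grows quadratically but survives the loss of any single site, whereas hub-and-spoke needs far fewer links but loses all inter-site connectivity if the hub site fails.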

Application readiness

Some applications are tolerant of the lack of synchronized object storage, while others may need those objects to be replicated and available across regions. Understanding how the cloud implementation impacts new and existing applications is important for risk mitigation and the overall success of a cloud project. Applications may have to be written to expect an infrastructure with little to no redundancy. Existing applications not developed with the cloud in mind may need to be rewritten.

Cost

The requirement of having more than one site has a cost attached to it. The greater the number of sites, the greater the cost and complexity. Costs can be broken down into the following categories:

- Compute resources
- Networking resources
- Replication
- Storage
- Management
- Operational costs

Site Loss and Recovery

Outages can cause partial or full loss of a site's functionality. Strategies should be implemented to understand and plan for recovery scenarios.

The deployed applications need to continue to function and, more importantly, consideration should be given to the impact on the performance and reliability of the application when a site is unavailable.

It is important to understand what will happen to the replication of objects and data between the sites when a site goes down. If this causes queues to start building up, consider how long these queues can safely exist before they overflow or cause failures.

Determine the method for resuming proper operation of a site when it comes back online after a disaster. It is recommended to architect the recovery to avoid race conditions.
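The question of how long replication queues can safely build up can be estimated with simple arithmetic. The sketch below assumes a fixed queue capacity, a constant rate of new changes and a constant link rate once the site returns; real traffic is bursty, so treat the results as rough planning figures. All names and numbers are hypothetical.

```python
def seconds_until_queue_full(capacity_bytes: float, used_bytes: float,
                             write_rate_bps: float) -> float:
    """Time before the replication queue fills while the peer site is down.
    write_rate_bps is the rate of new changes, in bytes per second."""
    if write_rate_bps <= 0:
        return float("inf")  # nothing is being queued
    return max(capacity_bytes - used_bytes, 0) / write_rate_bps


def seconds_to_drain(backlog_bytes: float, link_rate_bps: float,
                     write_rate_bps: float) -> float:
    """Time to clear the backlog once the site is back, given that new
    changes keep arriving while the backlog is being replicated."""
    surplus = link_rate_bps - write_rate_bps
    if surplus <= 0:
        return float("inf")  # the link can never catch up
    return backlog_bytes / surplus


# Illustrative figures: 500 GB of queue space, 50 MB/s of changes,
# and a replication link that sustains 125 MB/s.
outage_budget = seconds_until_queue_full(500e9, 0, 50e6)
catch_up = seconds_to_drain(500e9, 125e6, 50e6)
print(f"queue fills after {outage_budget / 3600:.1f} h")
print(f"full backlog drains in {catch_up / 3600:.1f} h")
```

If the outage budget is shorter than the expected time to restore a site, the design needs more queue capacity, a lower change rate, or a fallback such as a full resynchronization.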

Compliance and Geo-location

An organization could have legal obligations and regulatory compliance measures that prohibit certain workloads or data from being located in certain regions.
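One way to make such restrictions explicit is to record them as data and filter candidate sites before placing a workload. The following is a minimal sketch under assumed names: the data classes, jurisdictions, site names and the function `allowed_sites` are hypothetical and do not correspond to any OpenStack API.

```python
from typing import Dict, List, Set

# Hypothetical policy: data class -> jurisdictions where it must NOT reside.
FORBIDDEN: Dict[str, Set[str]] = {
    "eu-personal-data": {"US", "APAC"},
    "public-content": set(),
}

# Hypothetical site inventory: site name -> jurisdiction.
SITES: Dict[str, str] = {
    "frankfurt-1": "EU",
    "dallas-1": "US",
    "singapore-1": "APAC",
}


def allowed_sites(data_class: str,
                  forbidden: Dict[str, Set[str]] = FORBIDDEN,
                  sites: Dict[str, str] = SITES) -> List[str]:
    """Return the sites where a workload of the given data class may run.
    Unknown data classes raise KeyError rather than defaulting to 'allowed'."""
    banned = forbidden[data_class]
    return sorted(name for name, jurisdiction in sites.items()
                  if jurisdiction not in banned)


print(allowed_sites("eu-personal-data"))  # ['frankfurt-1']
print(allowed_sites("public-content"))    # ['dallas-1', 'frankfurt-1', 'singapore-1']
```

Failing on an unknown data class is a deliberate choice in this sketch: for compliance purposes, an unclassified workload is safer rejected than placed anywhere.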

Auditing

A well thought-out auditing strategy is important for tracking down issues quickly. Keeping track of changes made to security groups and tenants can be useful for rolling back those changes if they affect production. For example, if all security group rules for a tenant disappeared, the ability to quickly track down the issue would be important for operational and legal reasons.
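As a small illustration of the kind of record that makes such a rollback possible, the sketch below compares two snapshots of a tenant's security group rules and reports what was added and removed. The rule representation (protocol, port, source CIDR) and the function `diff_rules` are assumptions made for the example; how snapshots are captured and stored is left to the deployment.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical rule representation: (protocol, port, source CIDR).
Rule = Tuple[str, int, str]


def diff_rules(before: Set[Rule], after: Set[Rule]) -> Dict[str, List[Rule]]:
    """Report rules added and removed between two snapshots."""
    return {
        "added": sorted(after - before),
        "removed": sorted(before - after),
    }


# Illustrative snapshots: every rule has disappeared from the tenant.
before = {("tcp", 22, "10.0.0.0/8"), ("tcp", 443, "0.0.0.0/0")}
after: Set[Rule] = set()

change = diff_rules(before, after)
print(change["removed"])
# [('tcp', 22, '10.0.0.0/8'), ('tcp', 443, '0.0.0.0/0')]

# The 'removed' list is exactly what would need to be re-created to roll back.
```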

Separation of duties

A common requirement is to define different roles for the different cloud administration functions. An example would be a requirement to segregate the duties and permissions by site.

Authentication between sites

Ideally, there is a single authentication domain rather than a separate implementation for every site. This requires an authentication mechanism that is highly available and distributed to ensure continuous operation. Authentication server locality might also be needed and should be planned for.