Building Continuous Cloud Infrastructures
Deepak Verma, Senior Manager, Data Protection Products & Solutions
John Harker, Senior Product Marketing Manager
October 8, 2014
WebTech Educational Series

Building Continuous Cloud Infrastructures
In this WebTech, Hitachi design experts will cover what is needed to build continuous cloud infrastructures: servers, networks and storage. Geographically distributed, fault-tolerant, stateless designs allow efficient distributed load balancing, easy migrations, and continuous uptime in the face of individual system element or site failures. Starting with distributed stretch-cluster server environments, learn how to design and deliver enterprise-class cloud storage with the Hitachi Storage Virtualization Operating System and Hitachi Virtual Storage Platform G1000.
In this session, you will learn:
• Options for designing continuous cloud infrastructures from an application point of view.
• Why a stretch-cluster server operating environment is important to continuous cloud infrastructure system design.
• How Hitachi global storage virtualization and global-active devices can simplify and improve server-side stretch-cluster systems.

Application Business Continuity Choices
• Types of failure scenarios
• Locality of reference of a failure
• How much data can we lose on failover? (RPO)
• How long does recovery take? (RTO)
• How automatic is failover?
• How much does the solution cost?

Types of Failure Events & Locality of Reference
(Matrix: failure type, logical vs. physical, against locality of recovery, localized vs. remote; probability and cost vary by quadrant.)
• Localized recovery, logical failure ‒ Probability: High. Causes: human error, bugs. Desired RTO/RPO: Low/Low. Remediation: savepoints, logs, backups, point-in-time snapshots. Cost: $
• Localized recovery, physical failure ‒ Probability: Medium. Causes: hardware failure. Desired RTO/RPO: Zero/Zero. Remediation: local high-availability clusters for servers and storage. Cost: $$
• Remote recovery, logical failure ‒ Probability: Low. Causes: rolling disasters. Desired RTO/RPO: Medium/Low. Remediation: remote replication with point-in-time snapshots. Cost: $$$$
• Remote recovery, physical failure ‒ Probability: Very Low. Causes: immediate site failure. Desired RTO/RPO: Low/Zero. Remediation: synchronous replication, remote high availability. Cost: $$$

Understanding RPO and RTO
(Timeline: RPO extends backward from the outage, from hours down through seconds to zero data loss; RTO extends forward from the outage, from zero through seconds up to hours until service is restored.)

Data Protection Options
• Traditional approach ‒ Multi-pathing, server clusters, backups, application- or database-driven
• Storage array based replication ‒ Remote and local protection
• Appliance based solutions ‒ Stretched clusters, quorums
• Array based high availability

Traditional Data Protection Approach
(Diagram: local and remote app/DB server clusters; application and database backed up to tape locally and restored from tape remotely via tape truck, tape copy, or VTL replication.)
• Focus has been on server failures at the local site only
• Coupled with enterprise storage for higher localized uptime

      | Local Physical | Local Logical | Remote Physical | Remote Logical & Physical
RPO   | 0*             | 4-24 hrs.     | 8-48 hrs.       | 8-48 hrs.
RTO   | 0*             | 4-8 hrs.      | 4+ hrs.         | 4+ hrs.

Caveats
‒ Logical failures and rolling disasters have high RPO/RTO
‒ Scalability and efficiency are contradictory goals
‒ Recovery involves manual intervention and scripting
*Assumes HA for every component and a cluster-aware application.
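To make the two metrics concrete before comparing the approaches, here is a minimal sketch (Python, not part of the original deck; the recovery_metrics helper and the timestamps are illustrative assumptions) that computes RPO and RTO for a timeline like the one on the slide:

```python
from datetime import datetime

def recovery_metrics(last_recovery_point, outage_start, service_restored):
    """RPO is the data-loss window: how far the last usable recovery point
    lags the outage. RTO is the downtime: outage start to service restored."""
    rpo = outage_start - last_recovery_point
    rto = service_restored - outage_start
    return rpo, rto

# Hypothetical timeline echoing the slide: nightly tape backup at 2 a.m.,
# outage at 8 p.m., application back in service at 2 a.m. the next day.
rpo, rto = recovery_metrics(
    last_recovery_point=datetime(2014, 10, 8, 2, 0),
    outage_start=datetime(2014, 10, 8, 20, 0),
    service_restored=datetime(2014, 10, 9, 2, 0),
)
print("RPO:", rpo)  # 18:00:00 -> up to 18 hours of data lost
print("RTO:", rto)  # 6:00:00  -> 6 hours of downtime
```

The approaches that follow differ mainly in how far they push these two numbers toward zero, and at what cost.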
Application Based Data Protection Approach
(Diagram: local and remote app/DB server clusters with application-level data transfer between sites; application and database backed up to tape locally and restored from tape remotely via tape truck, tape copy, or VTL replication.)
‒ Reduces remote physical recovery times
‒ Requires additional standby infrastructure and licenses
‒ Consumes processing capability of the application/DB servers
‒ Specific to every application type, OS type, etc.
‒ Fail-back involves manual intervention and scripting

      | Local Physical | Local Logical | Remote Physical   | Remote Logical & Physical
RPO   | 0*             | 4-24 hrs.     | 0-4 hrs.#         | 8-48 hrs.
RTO   | 0*             | 4-8 hrs.      | 15 min. - 4 hrs.# | 4+ hrs.

Caveats
*Assumes HA for every component and a cluster-aware application.
#Network latency and application overhead dictate values.

Array Based Data Protection Approach
(Diagram: array-based block replication, synchronous or asynchronous, between the local and remote arrays; app/DB-aware clones and snapshots on each array with single-I/O consistency; optional batch copy to tape; remote app/DB servers remain offline standby.)
‒ Reduces recovery times across the board
‒ No additional standby infrastructure, licenses, or compute power
‒ Generic to any application type, OS type, etc.
‒ Fail-back as easy as fail-over, with some scripting

      | Local Physical | Local Logical     | Remote Physical | Remote Logical & Physical
RPO   | 0*             | 15 min. - 24 hrs. | 0-4 hrs.#       | 15-24 hrs.
RTO   | 0*             | 1-5 min.          | 5-15 min.       | 1-5 min.

Caveats
‒ No application awareness; copies are usually crash-consistent
*Assumes HA for every component and a cluster-aware application.
#Network latency and application overhead dictate values.

Appliance Based High Availability Approach
(Diagram: an extended server cluster spans both sites; virtualization appliances sit in the data path between the clusters and the storage at each site, with a quorum; tape backup and restore via tape truck, tape copy, or VTL replication remain for logical recovery.)
‒ Takes remote physical recovery times to zero
‒ Combine with app/DB/OS clusters for "true" 0 RPO & RTO

      | Local Physical | Local Logical | Remote Physical | Remote Logical & Physical
RPO   | 0*             | 4-24 hrs.     | 0#              | 8-48 hrs.
RTO   | 0*             | 4-8 hrs.      | 0#              | 4+ hrs.

Caveats
‒ Introduces complexity (connectivity, quorum), risk, and latency to performance
‒ Does not address logical recovery RPO and RTO
*Assumes HA for every component and a cluster-aware application.
#Synchronous distances, coupled with app/DB/OS geo-clusters.

Array Based H/A + Data Protection Approach
(Diagram: an extended server cluster spans both sites; array-based bi-directional high-availability copy, synchronous or asynchronous block replication, runs between the arrays with a quorum; app/DB-aware clones and snapshots on each array with single-I/O consistency, plus optional copy to tape.)
‒ Takes remote physical recovery times down to zero
‒ Generic to any application type, OS type, etc.
‒ No performance impact; built-in capability of the array
‒ Combine with app/DB/OS clusters for "true" 0 RPO & RTO
‒ Fail-back as easy as fail-over, no scripting
‒ Combined with snaps/clones for dual logical protection

      | Local Physical | Local Logical     | Remote Physical | Remote Logical & Physical
RPO   | 0*             | 15 min. - 24 hrs. | 0#              | 15 min. - 24 hrs.
RTO   | 0*             | 1-5 min.          | 0#              | 1-5 min.

Caveats
*Assumes HA for every component and a cluster-aware application.
#Synchronous distances, coupled with app/DB/OS geo-clusters.
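Both the appliance-based and the array-based H/A approaches rely on a quorum to avoid split-brain when the inter-site link fails. The following minimal sketch (Python; the QuorumDisk class and its reservation behaviour are assumptions for illustration, not the Hitachi or appliance implementation) shows the basic arbitration idea: after losing its peer, an array keeps a volume online only if it can reach the quorum and wins the reservation.

```python
class QuorumDisk:
    """Toy arbiter: the first array to reserve it after a link failure wins."""
    def __init__(self):
        self._owner = None

    def try_reserve(self, array_id):
        if self._owner is None:
            self._owner = array_id
        return self._owner == array_id


def on_peer_link_failure(array_id, quorum, quorum_reachable):
    """Decide what an array should do with its copy after losing its peer."""
    if not quorum_reachable:
        return "block I/O"            # isolated: cannot rule out split-brain
    if quorum.try_reserve(array_id):
        return "keep volume online"   # this site won the arbitration
    return "block I/O"                # peer already won; stand down


quorum = QuorumDisk()
print("Site A:", on_peer_link_failure("A", quorum, quorum_reachable=True))
print("Site B:", on_peer_link_failure("B", quorum, quorum_reachable=True))
print("Isolated:", on_peer_link_failure("B", QuorumDisk(), quorum_reachable=False))
```

The practical point is that the quorum itself becomes a design consideration, which is why the next slide lists it explicitly.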
Considerations for Moving to an Active-Active Highly Available Architecture
• Storage platform capable of supporting H/A
• Application/DB/OS clusters capable of utilizing storage H/A functionality without impacts
• Network capable of running dual-site workloads with low latency
• Quorum site considerations to protect against split-brain or H/A downtime
• People and process maturity in managing active-active sites
• Coupled logical protection across both sites and 3rd-site DR

Options for Data Protection
(Diagram: the RPO/RTO timeline again, with protection options placed from hours-scale recovery down to zero.)
• Archive ‒ Hitachi Content Platform
• Backup ‒ Data Instance Manager, Data Protection Suite, Symantec NetBackup; restore from backup
• Application-aware snapshots and mirroring, for operational resiliency and recovery ‒ HAPRO, HDPS IntelliSnap, Thin Image or in-system replication; restore/recover from snapshots and database logs
• Continuous data protection (CDP) ‒ Data Instance Manager
• Disaster recovery ‒ Universal Replicator (async), TrueCopy (sync); mirroring and replication
• Transparent cluster failover, always on ‒ Global-active device

Always On: Hitachi Storage Virtualization Operating System

Introducing Global Storage Virtualization
Virtual server machines forever changed the way we see data centers. The Hitachi Storage Virtualization Operating System is doing the same for storage.
(Diagram: the server virtualization stack, application / operating system / virtual hardware: CPU, memory, NIC, drive / hardware, shown side by side with the storage virtualization stack, virtual storage identity / host I/O and copy management / virtual hardware: director, cache, front-end ports, media / hardware.)

Disaster Avoidance Simplified
• New SVOS global-active device (GAD)
• Virtual-storage machine abstracts the underlying physical arrays from hosts
• Storage-site failover is transparent to the host and requires no reconfiguration
• When new global-active device volumes are provisioned from the virtual-storage machine, they can be automatically protected
• Simplified management from a single pane of glass
(Diagram: a compute HA cluster and a storage HA cluster span Site A and Site B; global storage virtualization presents one virtual-storage machine across both sites.)

Supported Server Cluster Environments
SVOS global-active device: OS + multipath + cluster software support matrix

OS                    | Version                      | Cluster              | Global-active device support
VMware                | 4.x, 5.x                     | VMware HA (vMotion)  | Supported (August 2014)
IBM AIX               | 6.x, 7.x                     | HACMP / PowerHA      | Supported
Microsoft Windows     | 2008, 2008 R2, 2012, 2012 R2 | MSFC                 | Supported
Red Hat Linux         | 5.x, 6.x                     | Red Hat Cluster, VCS | Supported
Hewlett Packard HP-UX | 11iv2, 11iv3                 | MC/SG                | Supported
Oracle Solaris        | 10, 11.1                     | SC, VCS, Oracle RAC  | 1Q2015

Hitachi SVOS Global-Active Device: Clustered Active-Active Systems
(Diagram: servers with applications requiring high availability write to multiple copies simultaneously from multiple applications and read locally. Global storage virtualization presents the same virtual storage identity, 123456, from both arrays; virtual LDEVs 10:01 and 10:02 map to Resource Group 1, LDEVs 10:00/10:01/10:02, on one array and to Resource Group 2, LDEVs 20:00/20:01/20:02, on the other, arbitrated by a quorum.)

One Technology, Many Use Cases: Heterogeneous Storage Virtualization
(Diagram: a host accesses global-active devices through a virtual-storage machine, CPU, cache, ports, media, backed by a physical-storage machine and external-storage machines.)
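The clustered active-active picture above can be summarized with a toy model (Python; the class names, array serial numbers, and read-local preference are illustrative assumptions, not the SVOS implementation): one virtual storage identity and one virtual LDEV front two physical LDEVs, one per array, so the host sees a single device regardless of which array actually serves the I/O.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PhysicalVolume:
    array_serial: str   # which physical storage machine holds this copy
    ldev: str           # physical LDEV ID on that array

class VirtualStorageMachine:
    """Toy virtual-storage machine: one virtual identity over two arrays."""
    def __init__(self, virtual_serial):
        self.virtual_serial = virtual_serial   # e.g. "123456" on the slide
        self._map = {}                         # virtual LDEV -> list of copies

    def provision(self, virtual_ldev, *copies):
        self._map[virtual_ldev] = list(copies)

    def resolve(self, virtual_ldev, preferred_array=None):
        """Pick the copy to serve I/O; prefer the local array (read-local)."""
        copies = self._map[virtual_ldev]
        for pv in copies:
            if pv.array_serial == preferred_array:
                return pv
        return copies[0]

vsm = VirtualStorageMachine("123456")
vsm.provision("10:01",
              PhysicalVolume("VSP-SiteA", "10:01"),
              PhysicalVolume("VSP-SiteB", "20:01"))
print(vsm.resolve("10:01", preferred_array="VSP-SiteB"))  # read served locally at Site B
```

Because the virtual identity is what the host binds to, the same mechanism supports the migration, multi-tenancy, fault-tolerance, and load-balancing use cases shown next.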
One Technology, Many Use Cases: Non-Disruptive Migration
(Diagram: the host keeps the same identity during migration; global-active devices and logical devices in one virtual-storage machine move between two physical-storage machines.)

One Technology, Many Use Cases: Multi-Tenancy
(Diagram: two hosts each see their own virtual-storage machine, #1 and #2, with global-active devices, carved from a single physical-storage machine.)

One Technology, Many Use Cases: Fault Tolerance
(Diagram: a host accesses one virtual-storage machine whose global-active devices are mirrored across two physical-storage machines.)

One Technology, Many Use Cases: Application / Host Load-Balancing
(Diagram: an application accesses global-active devices in one virtual storage machine spread across physical storage machines #1 and #2.)

One Technology, Many Use Cases: Disaster Avoidance and Active-Active Data Center
(Diagram: hosts and a NAS server cluster access global-active devices in one virtual storage machine mirrored across physical storage machines at Site A and Site B.)

Delivering Always-Available VMware
• Extend native VMware functionality, with or without vMetro Storage Cluster
• Active/active over metro distances
• Fast, simple non-disruptive migrations
• 3-data-center high availability (with SRM support)
• Hitachi Thin Image snapshot support
(Diagram: a VMware stretch cluster of active production servers at Site 1 and Site 2 on a global-active device pair, with a quorum (QRM) system.)

VMware Continuous Infrastructure Scenarios
Each scenario uses VMware ESX + HDLM at both sites, managed by HCS, with a quorum.
• Application migration ‒ Read/write I/O switches to the local site's path
• Path/storage failover ‒ ESX switches paths to the alternate site's path
• HA failover ‒ VMware HA fails over the VM, and the local site's I/O path is used

Delivering Always-Available Oracle RAC
• Elegant distance extension to Oracle RAC
• Active/active over metro distances
• Simplified designs; fast, non-disruptive migrations
• 3-data-center high availability
• Increase infrastructure utilization and reduce costs
(Diagram: active Oracle RAC production servers at Site 1 and Site 2 on a global-active device pair, with a quorum (QRM).)

Delivering Always-Available Microsoft Hyper-V
• Active/active over metro distances
• Complement or avoid Microsoft geo-clustering
• Fast, simple and non-disruptive application migrations
• Hitachi Thin Image snapshot support
• Simple failover and failback
(Diagram: a Microsoft multisite/stretch cluster of active production servers at Site 1 and Site 2 on a global-active device pair, with a quorum (QRM).)
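To illustrate the path behaviour in the VMware scenarios above, here is a minimal path-selection sketch (Python; illustrative only, not HDLM's or the ESX native multipathing algorithm, and the path names are made up): with global-active device both sites present the same volume, so a healthy local-site path is preferred and I/O moves to the alternate site only when every local path has failed.

```python
def select_path(paths, local_site):
    """Prefer a healthy local-site path; fall back to any healthy path."""
    healthy = [p for p in paths if p["state"] == "online"]
    if not healthy:
        raise RuntimeError("no usable paths: volume is inaccessible")
    local = [p for p in healthy if p["site"] == local_site]
    return (local or healthy)[0]

paths = [
    {"name": "vmhba1:C0:T0:L1", "site": "Site1", "state": "online"},
    {"name": "vmhba2:C0:T1:L1", "site": "Site2", "state": "online"},
]
print(select_path(paths, "Site1"))   # normal case: the local Site1 path carries I/O
paths[0]["state"] = "failed"         # local path or array fails
print(select_path(paths, "Site1"))   # I/O transparently continues on the Site2 path
```

The HA-failover scenario is the complementary case: the VM restarts at the surviving site and then uses that site's local path.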
Global-Active Device Management
Hitachi Command Suite (HCS) offers efficient management of global-active devices while providing central control of multiple systems.
(Diagram: an active storage management server running HCS at the primary site and a passive one at the remote site, joined by HCS clustering; production servers 1 and 2, both active and each running the HCS agent, CCI, and the app/DBMS, joined by app/DBMS clustering; a pair-management server at each site with a command device (CMD) and HCS database; HA mirroring between the arrays, TC/HA mirroring of the HCS database, and primary and remote quorum volumes.)
• When a clustered HCS server is used, the local HCS server enables GAD management; if the local site fails, the remote HCS server takes over GAD management.
• The HCS database should be replicated with either TrueCopy or GAD.
• Pair-management servers are managed through Hitachi Replication Manager and run the Hitachi Device Manager agent/CCI.
• HCS management requests to configure and operate the HA pairs are mirrored via the command device.

3 Data Center Always Available Infrastructures: Protecting the Protected
• Global-active device (GAD) ‒ Active-active high availability, read-local, bi-directional synchronous writes, metro distance, consistency groups (supported early 2015)
• Hitachi Universal Replicator (HUR) ‒ Active/standby 'remote' paths, journal groups with delta resync, any distance, remote replication over FCIP
• Pair configuration is on a GAD consistency group and HUR journal group basis, with delta resync
(Diagram: a server cluster, e.g. Oracle RAC, drives active I/O to both GAD sites; HUR primary volumes and journal groups sit at both GAD sites, one HUR link active and one standby, replicating to an HUR secondary volume and journal group at the remote site; a quorum arbitrates the GAD pair.)

Global-Active Device Specifications (August 2014, with late-2014 changes noted)
• Global-active device management ‒ Hitachi Command Suite v8.0.1 or later
• Max number of volumes (creatable pairs) ‒ 64K
• Max pool capacity ‒ 12.3 PB
• Max volume capacity ‒ 46 MB to 4 TB (August 2014); 46 MB to 59.9 TB (late 2014)
• Supporting products in combination with global-active device, all on either side or both sides ‒ Dynamic Provisioning / Dynamic Tiering / Hitachi Universal Volume Manager, ShadowImage / Thin Image, HUR with Delta-Resync, Nondisruptive Migration (NDM)
• Campus distance support ‒ Can use any qualified path-failover software
• Metro distance support ‒ Hitachi Dynamic Link Manager is required (until ALUA support)

Hitachi Storage Software Implementation Services
Service description:
• Pre-deployment assessment of your environment
• Planning and design
• Prepare the subsystem for replication options
• Implementations:
  ‒ Create and delete test configuration
  ‒ Create production configuration
  ‒ Integrate the production environment with Hitachi Storage Software
• Test and validate the installation
• Knowledge transfer

Don't Pay the Appliance Tax!
With appliances, complexity scales faster than capacity:
• SAN port explosion
• Appliance proliferation
• Additional management tools
• Limited snapshot support
• Per-appliance capacity pools
• Disruptive migrations
• All of the above

Global-Active Device: Simplicity at Scale
Avoid the appliance tax with Hitachi:
• Native, high-performance design
• Single management interface
• Advanced non-disruptive migrations
• Simplified SAN topologies
• Large-scale data protection support
• Full access to the storage pool
• All of the above

Hitachi Global Storage Virtualization: Operational Simplicity, Enterprise Scale

Questions and Discussion

Upcoming WebTechs
WebTechs, 9 a.m. PT, 12 p.m. ET
‒ The Rise of Enterprise IT-as-a-Service, October 22
‒ Stay tuned for new sessions in November
Check www.hds.com/webtech for
‒ Links to the recording, the presentation, and Q&A (available next week)
‒ Schedule and registration for upcoming WebTech sessions
Questions will be posted in the HDS Community: http://community.hds.com/groups/webtech

Thank You