Building Continuous Cloud Infrastructures
Deepak Verma, Senior Manager, Data Protection Products & Solutions
John Harker, Senior Product Marketing Manager
October 8, 2014
WebTech Educational Series
In this WebTech, Hitachi design experts cover what is needed to build Continuous Cloud Infrastructures: servers, networks, and storage. Geographically distributed, fault-tolerant, stateless designs allow efficient distributed load balancing, easy migrations, and continuous uptime in the face of individual system-element or site failures. Starting with distributed stretch-cluster server environments, learn how to design and deliver enterprise-class cloud storage with the Hitachi Storage Virtualization Operating System and the Hitachi Virtual Storage Platform G1000.
In this session, you will learn:
 Options for designing Continuous Cloud Infrastructures from an application point of view.
 Why a stretch-cluster server operating environment is important to Continuous Cloud Infrastructure system design.
 How Hitachi global storage virtualization and global-active devices can simplify and improve server-side stretch-cluster systems.
Application Business Continuity Choices
 Types of failure scenarios
 Locality of reference of a failure
 How much data can we lose on failover? (RPO)
 How long does recovery take? (RTO)
 How automatic is failover?
 How much does the solution cost?
Types of Failure Events and Locality of Reference
[Quadrant chart: failure probability and remediation cost plotted against localized vs. remote recovery. A lookup-table sketch of the same mapping follows the table.]

Failure type     | Locality | Probability | Causes                 | Desired RTO/RPO | Remediation                                             | Cost
Logical failure  | Local    | High        | Human error, bugs      | Low/Low         | Savepoints, logs, backups, point-in-time snapshots      | $
Physical failure | Local    | Medium      | Hardware failure       | Zero/Zero       | Local high-availability clusters (servers and storage)  | $$
Logical failure  | Remote   | Low         | Rolling disasters      | Medium/-        | Remote replication with point-in-time snapshots         | $$$$
Physical failure | Remote   | Very low    | Immediate site failure | Low/Zero        | Synchronous replication, remote high availability       | $$$
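The quadrant reduces to a simple lookup from failure type and locality to remediation and relative cost. Below is a minimal Python sketch of that mapping; the dictionary keys, function name, and strings are illustrative, taken from the chart above rather than from any Hitachi tooling.

# A sketch of the quadrant as a lookup table; keys and remediation
# strings mirror the slide, everything else is illustrative.
REMEDIATION = {
    ("local",  "logical"):  ("savepoints, logs, backups, point-in-time snapshots", "$"),
    ("local",  "physical"): ("local HA clusters for servers and storage",          "$$"),
    ("remote", "logical"):  ("remote replication with point-in-time snapshots",    "$$$$"),
    ("remote", "physical"): ("synchronous replication, remote high availability",  "$$$"),
}

def plan(locality: str, failure: str) -> str:
    remedy, cost = REMEDIATION[(locality, failure)]
    return f"{failure} failure ({locality}): {remedy} (cost {cost})"

print(plan("local", "logical"))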
Understanding RPO and RTO
[Timeline figure: an outage strikes at 12am. RPO is measured backward from the outage to the most recent recoverable data, from hours of loss down to seconds or zero; RTO is measured forward from the outage until service is restored, again from zero to hours.]
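To make the two metrics concrete, here is a minimal Python sketch that computes RPO and RTO from three hypothetical timestamps (last recoverable copy, outage, service restoration); the times are invented for illustration.

# RPO looks backward from the outage to the last recoverable copy;
# RTO looks forward from the outage to the moment service is restored.
from datetime import datetime

last_recoverable_copy = datetime(2014, 10, 8, 4, 0)   # last good snapshot/replica
outage                = datetime(2014, 10, 8, 12, 0)  # failure event
service_restored      = datetime(2014, 10, 8, 14, 0)  # application back online

rpo = outage - last_recoverable_copy   # data you can lose: 8 hours
rto = service_restored - outage        # downtime you incur: 2 hours

print(f"RPO = {rpo}, RTO = {rto}")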
Data Protection Options
 Traditional approach
‒ Multipathing, server clusters, backups; application- or database-driven
 Storage-array-based replication
‒ Remote and local protection
 Appliance-based solutions
‒ Stretched clusters, quorums
 Array-based high availability
Traditional Data Protection Approach
[Diagram: clustered app/database servers at each site back up to tape; the offsite copy travels by truck, tape copy, or VTL replication.]
Caveats:
‒ Focus has been on server failures at the local site only
‒ Coupled with enterprise storage for higher localized uptime
‒ Logical failures and rolling disasters have high RPO/RTO
‒ Scalability and efficiency are mutually exclusive
‒ Recovery involves manual intervention and scripting

        Local Physical | Local Logical | Remote Physical | Remote Log. & Phy.
RPO     0*             | 4-24 hrs.     | 8-48 hrs.       | 8-48 hrs.
RTO     0*             | 4-8 hrs.      | 4+ hrs.         | 4+ hrs.

*Assumes HA for every component and a cluster-aware application.
Application-Based Data Protection Approach
[Diagram: the traditional layout plus application-level data transfer replicating the database to a standby cluster at the remote site; tape backup remains for logical protection.]
Caveats:
‒ Reduces remote physical recovery times
‒ Requires additional standby infrastructure and licenses
‒ Consumes processing capacity of the application/database servers
‒ Specific to every application type, OS type, etc.
‒ Fail-back involves manual intervention and scripting

        Local Physical | Local Logical | Remote Physical | Remote Log. & Phy.
RPO     0*             | 4-24 hrs.     | 0-4 hrs.#       | 8-48 hrs.
RTO     0*             | 4-8 hrs.      | 15 min.-4 hrs.# | 4+ hrs.

*Assumes HA for every component and a cluster-aware application.
#Network latency and application overhead dictate values.
Array-Based Data Protection Approach
[Diagram: array-based block replication (synchronous or asynchronous) mirrors volumes to a remote array whose attached servers stay offline until failover; application/database-aware clones and snapshots with single-I/O consistency are kept on both arrays, with an optional batch copy to tape.]
Caveats:
‒ Reduces recovery times across the board
‒ No additional standby infrastructure, licenses, or compute power
‒ Generic to any application type, OS type, etc.
‒ Fail-back is as easy as fail-over, with some scripting
‒ The block replication itself is not application-aware; copies are usually crash-consistent

        Local Physical | Local Logical   | Remote Physical | Remote Log. & Phy.
RPO     0*             | 15 min.-24 hrs. | 0-4 hrs.#       | 15-24 hrs.
RTO     0*             | 1-5 min.        | 5-15 min.       | 1-5 min.

*Assumes HA for every component and a cluster-aware application.
#Network latency and application overhead dictate values.
Appliance-Based High Availability Approach
[Diagram: a virtualization appliance in front of the storage at each site forms an extended cluster with a quorum; app/database servers run active at both sites, and tape backup remains for logical protection.]
Caveats:
‒ Takes remote physical recovery times to zero
‒ Combine with app/DB/OS clusters for "true" 0 RPO and RTO
‒ Introduces complexity (connectivity, quorum), risk, and added latency
‒ Does not address logical-recovery RPO and RTO

        Local Physical | Local Logical | Remote Physical | Remote Log. & Phy.
RPO     0*             | 4-24 hrs.     | 0#              | 8-48 hrs.
RTO     0*             | 4-8 hrs.      | 0#              | 4+ hrs.

*Assumes HA for every component and a cluster-aware application.
#Synchronous distances, coupled with app/DB/OS geo-clusters.
Array-Based H/A + Data Protection Approach
[Diagram: arrays at both sites form an extended cluster with bi-directional, array-based block high-availability copy (synchronous or asynchronous) and a quorum; app/database-aware clones and snapshots with single-I/O consistency are kept at each site, with an optional copy to tape.]
Caveats:
‒ Takes remote physical recovery times down to zero
‒ Generic to any application type, OS type, etc.
‒ No performance impact; a built-in capability of the array
‒ Combine with app/DB/OS clusters for "true" 0 RPO and RTO
‒ Fail-back is as easy as fail-over, with no scripting
‒ Combine with snaps/clones for dual logical protection

        Local Physical | Local Logical   | Remote Physical | Remote Log. & Phy.
RPO     0*             | 15 min.-24 hrs. | 0#              | 15 min.-24 hrs.
RTO     0*             | 1-5 min.        | 0#              | 1-5 min.

*Assumes HA for every component and a cluster-aware application.
#Synchronous distances, coupled with app/DB/OS geo-clusters.
Considerations for Moving to an Active-Active, Highly Available Architecture
 A storage platform capable of supporting H/A
 Application/DB/OS clusters capable of using storage H/A functionality without impact
 A network capable of running dual-site workloads with low latency
 Quorum-site considerations to protect against split-brain or H/A downtime (see the sketch after this list)
 People and process maturity in managing active-active sites
 Coupled logical protection across both sites, plus third-site DR
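To illustrate the split-brain concern in the quorum bullet above, here is a simplified Python sketch of third-site arbitration. It is a conceptual model, not the actual SVOS GAD algorithm: when the cross-site link drops, only a site that can still reach the quorum keeps serving I/O.

# A minimal sketch of quorum arbitration for a two-site active-active pair.
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    sees_peer: bool      # cross-site replication link healthy?
    sees_quorum: bool    # can this site still reach the quorum disk?

def survives(site: Site) -> bool:
    """A site keeps serving I/O if it sees its peer, or wins at the quorum."""
    if site.sees_peer:
        return True      # normal active-active operation
    # Link is down: only the site that can register with the quorum disk
    # continues; the other suspends I/O rather than risk split-brain.
    return site.sees_quorum

site_a = Site("A", sees_peer=False, sees_quorum=True)
site_b = Site("B", sees_peer=False, sees_quorum=False)
print(survives(site_a), survives(site_b))  # True False -> no split-brain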
Options for Data Protection
[Figure: Hitachi's portfolio mapped onto the RPO/RTO timeline, from hours down to seconds and zero.]
 Archive (hours): Hitachi Content Platform
 Backup (hours): Data Instance Manager, Data Protection Suite, Symantec NetBackup; restore/recover from backup, snapshots, and database logs
 Operational recovery and resiliency (seconds; application-aware snapshot and mirroring): HAPRO, HDPS IntelliSnap, Thin Image or in-system replication
 Continuous data protection (CDP): Data Instance Manager
 Disaster recovery (mirroring/replication): Universal Replicator (async), TrueCopy (sync)
 Always on (zero RPO/RTO; transparent cluster failover): global-active device
Hitachi Storage Virtualization Operating System
Introducing Global Storage Virtualization
Virtual server machines forever changed the way we see data centers. The Hitachi Storage Virtualization Operating System is doing the same for storage.
[Diagram: the server-virtualization stack (application, operating system, virtual hardware, physical hardware: CPU, memory, NIC, drive) set beside the storage-virtualization stack (virtual storage identity, host I/O and copy management, virtual hardware, physical hardware: virtual storage director, cache, front-end ports, media).]
Disaster Avoidance Simplified
New SVOS global-active device (GAD)
 A virtual-storage machine abstracts the underlying physical arrays from hosts
 Storage-site failover is transparent to the host and requires no reconfiguration
 When new global-active device volumes are provisioned from the virtual-storage machine, they can be automatically protected
 Simplified management from a single pane of glass
[Diagram: a compute HA cluster and a storage HA cluster span Site A and Site B; global storage virtualization presents one virtual-storage machine across both sites.]
Supported Server Cluster Environments
SVOS global-active device: OS + multipath + cluster software support matrix (as of August 2014)

OS                    | Version                      | Cluster software     | Global-active device support
VMware                | 4.x, 5.x                     | VMware HA (vMotion)  | Supported
Microsoft Windows     | 2008, 2008 R2, 2012, 2012 R2 | MSFC                 | Supported
IBM AIX               | 6.x, 7.x                     | HACMP / PowerHA      | Supported
Red Hat Linux         | 5.x, 6.x                     | Red Hat Cluster, VCS | Supported
Hewlett Packard HP-UX | 11iv2, 11iv3                 | MC/SG                | Supported
Oracle Solaris        | 10, 11.1                     | SC, VCS, Oracle RAC  | Supported (Oracle RAC: 1Q2015)
Hitachi SVOS Global-Active Device
Clustered Active-Active Systems
[Diagram: with global storage virtualization, both arrays expose the same virtual storage identity (123456) and virtual LDEVs (10:01, 10:02), backed by physical LDEVs 10:00-10:02 and 20:00-20:02 in separate resource groups, with a quorum arbitrating between them. Servers with apps requiring high availability write to multiple copies simultaneously from multiple applications, and read locally. A sketch of these two I/O paths follows.]
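A minimal sketch of the two I/O paths described above: writes are mirrored synchronously to both arrays before acknowledgment, while reads are served from the nearest copy. The Array class and LDEV labels are illustrative stand-ins, not Hitachi interfaces.

# Toy model of a global-active device pair: write-both, read-local.
class Array:
    def __init__(self, name):
        self.name, self.blocks = name, {}
    def write(self, lba, data):
        self.blocks[lba] = data
    def read(self, lba):
        return self.blocks[lba]

local, remote = Array("LDEV 10:01 @ Site A"), Array("LDEV 20:01 @ Site B")

def gad_write(lba, data):
    # A write is acknowledged only after BOTH copies are updated
    # (bi-directional synchronous mirroring), so RPO stays at zero.
    local.write(lba, data)
    remote.write(lba, data)

def gad_read(lba, at_site_a=True):
    # Reads are served from the nearest copy ("read local") to avoid
    # paying cross-site latency on every read.
    return (local if at_site_a else remote).read(lba)

gad_write(0, b"record")
assert gad_read(0) == gad_read(0, at_site_a=False) == b"record"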
One Technology, Many Use Cases
 Heterogeneous storage virtualization: a virtual-storage machine presents global-active devices to the host, pooling media from the physical-storage machine and the external-storage machines behind it.
 Non-disruptive migration: logical devices preserve their identity during migration, so volumes move between physical-storage machines without host disruption.
 Multi-tenancy: multiple virtual-storage machines are carved from one physical-storage machine, one per tenant host.
 Fault tolerance: global-active devices in a virtual-storage machine are mirrored across two physical-storage machines.
 Application/host load-balancing: one virtual storage machine spans two physical storage machines, spreading application I/O across both.
 Disaster avoidance and active-active data center: server clusters and NAS at Site A and Site B mirror global-active devices across sites under one virtual storage machine.
Delivering Always-Available VMware
 Extends native VMware functionality, with or without vSphere Metro Storage Cluster
 Active/active over metro distances
 Fast, simple, non-disruptive migrations
 3-data-center high availability (with SRM support)
 Hitachi Thin Image snapshot support
[Diagram: a VMware stretch cluster of active production servers spans Site 1 and Site 2 over a global-active device pair, with a quorum (QRM) system at a third location.]
VMware Continuous Infrastructure Scenarios
[Diagram: VMware ESX hosts with Hitachi Dynamic Link Manager (HDLM) at both sites, managed by Hitachi Command Suite, with a quorum in each scenario.]
 Application migration: read/write I/O switches to the local site's path
 Path/storage failover: ESX switches paths to the alternate site's path
 HA failover: VMware HA fails over the VM, and the local site's I/O path is used
Delivering Always-Available Oracle RAC
 Elegant distance extension to Oracle RAC
 Active/active over metro distances
 Simplified designs; fast, non-disruptive migrations
 3-data-center high availability
 Increased infrastructure utilization and reduced costs
[Diagram: active Oracle RAC production servers at Site 1 and Site 2 share a global-active device pair, with a quorum (QRM) at a third location.]
Delivering Always-Available Microsoft Hyper-V
 Active/active over metro distances
 Complements or avoids Microsoft geo-clustering
 Fast, simple, non-disruptive application migrations
 Hitachi Thin Image snapshot support
 Simple failover and failback
[Diagram: a Microsoft multisite/stretch cluster of active production servers spans Site 1 and Site 2 over a global-active device pair, with a quorum (QRM) at a third location.]
Global-Active Device Management
Hitachi Command Suite (HCS) offers efficient management of global-active devices while providing central control of multiple systems.
Storage-management servers:
 A clustered HCS server is used; the local HCS server handles GAD management
 If the local site fails, the remote HCS server takes over GAD management
 The HCS database should be replicated with either TrueCopy or GAD
Pair-management servers:
 Run the Hitachi Device Manager agent and CCI
 HCS management requests to configure and operate the HA mirror go through the command device
 Managed through Hitachi Replication Manager
[Diagram: active production servers at each site run the HCS agent, CCI, and the app/DBMS; pair-management servers at both sites reach the arrays through command devices (CMD); the HA mirror spans the primary and remote arrays with a quorum volume, and the HCS database is itself mirrored with TrueCopy/HA. A polling sketch follows.]
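As a rough illustration of what a pair-management server does, the sketch below polls GAD pair status by shelling out to CCI. `pairdisplay -g <group>` is a standard CCI command; the group name, the wrapper, and the absence of real parsing are assumptions for illustration, not Hitachi's management code.

# Hedged sketch: poll pair state via CCI from a pair-management server.
import subprocess

def pair_status(group: str = "GAD_ORA") -> str:   # group name is hypothetical
    out = subprocess.run(
        ["pairdisplay", "-g", group],  # queries pair state via the command device
        capture_output=True, text=True, check=True,
    ).stdout
    # A real script would parse the status column (e.g., PAIR, PSUS, PSUE);
    # here we simply return the raw report.
    return out

if __name__ == "__main__":
    print(pair_status())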
3-Data-Center Always-Available Infrastructures
Protecting the protected
[Diagram: clustered server nodes (e.g., Oracle RAC) issue active I/O to both sites of a global-active device pair; each GAD site holds an HUR primary volume and journal group, with an active HUR path from one site and a standby path from the other feeding the HUR secondary volume and journal group at a third site. Pair configuration is on a GAD consistency-group and HUR journal-group basis with Delta-Resync.]
Global-active device:
 Active-active high availability
 Read-local
 Bi-directional synchronous writes
 Metro distance
 Consistency groups (supported early 2015)
Hitachi Universal Replicator:
 Active/standby "remote" paths
 Journal groups with Delta-Resync (sketched below)
 Any distance
 Remote FCIP quorum
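A toy Python sketch of the Delta-Resync idea referenced above: if the active HUR source fails, the standby site takes over and sends only the journal entries the third site has not yet applied, rather than a full copy. Journal entries and site names are invented for illustration.

# Both GAD sites keep equivalent journals; Site C has applied up to j2.
journals = {"A": ["j1", "j2", "j3", "j4"], "B": ["j1", "j2", "j3", "j4"]}
applied_at_c = ["j1", "j2"]

def delta_resync(new_source: str):
    # Only the journal entries C has not yet applied are sent -
    # the "delta" - rather than re-copying the whole volume.
    pending = [j for j in journals[new_source] if j not in applied_at_c]
    applied_at_c.extend(pending)
    return pending

print(delta_resync("B"))   # ['j3', 'j4'] -> the B->C link catches up from deltas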
Global-Active Device Specifications

Index                                            | August 2014                                                | Late 2014
Global-active device management                  | Hitachi Command Suite v8.0.1 or later                      | -
Max number of volumes (creatable pairs)          | 64K                                                        | -
Max pool capacity                                | 12.3 PB                                                    | -
Max volume capacity                              | 46 MB to 4 TB                                              | 46 MB to 59.9 TB
Supporting products in combination with GAD      | Dynamic Provisioning / Dynamic Tiering / Hitachi Universal | HUR with Delta-Resync;
(all on either side or both sides)               | Volume Manager; ShadowImage / Thin Image                   | Nondisruptive Migration (NDM)
Campus distance support                          | Can use any qualified path-failover software               | -
Metro distance support                           | Hitachi Dynamic Link Manager is required (until ALUA support) | -
Hitachi Storage Software Implementation Services
Service Description
 Pre-deployment assessment of your environment
 Planning and design
 Prepare subsystem for replication options
 Implementations:
‒ Create and delete test configuration
‒ Create production configuration
‒ Integrate production environment with Hitachi
Storage Software
 Test and validate installation
 Knowledge transfer
Don’t Pay the Appliance Tax!
With appliances, complexity scales faster than capacity:
 SAN port explosion
 Appliance proliferation
 Additional management tools
 Limited snapshot support
 Per-appliance capacity pools
 Disruptive migrations
Global-Active Device: Simplicity at Scale
Avoid the appliance tax with Hitachi:
 Native, high-performance design
 Single management interface
 Advanced non-disruptive migrations
 Simplified SAN topologies
 Large-scale data protection support
 Full access to the storage pool
Hitachi Global Storage Virtualization: Operational Simplicity, Enterprise Scale
Questions and Discussion
Upcoming WebTechs
 WebTechs, 9 a.m. PT, 12 p.m. ET
‒ The Rise of Enterprise IT-as-a-Service, October 22
‒ Stay tuned for new sessions in November
 Check www.hds.com/webtech for
‒ Links to the recording, the presentation, and Q&A (available next week)
‒ Schedule and registration for upcoming WebTech sessions
Questions will be posted in the HDS Community:
http://community.hds.com/groups/webtech
Thank You