SEcure Cloud computing for CRitical Infrastructure IT
Resilience management for the cloud

Noor Shirazi, Steven Simpson, Andreas Mauthe & David Hutchison
Lancaster University
{n.shirazi, s.simpson, a.mauthe, d.hutchison}@lancaster.ac.uk

AIT Austrian Institute of Technology • ETRA Investigación y Desarrollo • Fraunhofer Institute for Experimental Software Engineering IESE • Karlsruhe Institute of Technology • NEC Europe • Lancaster University • Mirasys • Hellenic Telecommunications Organization OTE • Ayuntamiento de Valencia • Amaris

Critical cloud computing
• Cloud services which underpin CI (critical infrastructure)
  o Cloud computing services which are used by operators of CI to support the delivery of their core services, in cases where the reliability of the underlying cloud technology is itself essential to the safe functioning of the critical service.
• Cloud services which underpin the digital society
  o Cloud computing services which are critical in themselves, i.e. failure would have a significant impact on the health, safety, security or economic well-being of citizens or on the effective functioning of EU governments.
Source: “ENISA Incident reporting for cloud computing, 2013” www.enisa.europa.eu

• Secure cloud air traffic management solution (SESAR)
• NASDAQ OMX FinQloud for compliance and surveillance systems
• A Slovenia-based railway operator has a cloud-based platform to centralize passenger, freight and logistics systems
• A big oil company in the US has adopted a cloud solution; 60% of its infrastructure is virtual
• Jan 2013: Dropbox suffered a substantial loss of service for more than 15 hours, affecting all users across the globe
• March 2013: Microsoft email infrastructure suffered a loss of availability for nearly 16 hours, affecting business-critical services
• August 2013: Amazon Web Services suffered an outage, taking down Vine, Instagram and other applications for an hour
Resilience as a cloud need
• Deploying CI services in the cloud increases resilience and security concerns
• A resilient system is one that can continue to offer a satisfactory level of service even in the face of the challenges it experiences
• We need resilience as a property of the cloud (networks and systems (VMs)), such that they can withstand any challenge, whether from misconfigurations, congestion/overloads (including flash crowds), or attacks (such as DDoS, malware)
We define cloud resilience as “the ability to maintain an acceptable level of system operation and services even in the presence of challenges”.

Resilience strategy: D²R² + DR
• D²R² + DR → Resilience
• Real-time control loop
  o Defend against challenges and threats to normal operation
    • reduce the probability of a fault leading to a failure
    • reduce the impact of an adverse event or condition
  o Detect when an adverse event has occurred
    • determine when remediation needs to occur
  o Remediate the effects of the adverse event
    • minimize the impact of failure
    • graceful degradation of performance
  o Recover to original and normal operations once an adverse event has ended
• Off-line control loop
  o Diagnose, reflecting on past operational experiences
  o Refine; aim to improve the design of the system (e.g. cloud)
Source: James P. G. Sterbenz, David Hutchison, Egemen K. Çetinkaya, Abdul Jabbar, Justin P. Rohrer, Marcus Schöller, and Paul Smith. Resilience and survivability in communication networks: Strategies, principles, and survey of disciplines. Computer Networks, 54(8):1245–1265, 2010.

SECCRIT objective mapping to resilience
• Cloud resilience management framework (CRMF)
  o Design a joint network- and system-wide analysis via a unified resilience architecture
• The resilience framework helps to protect cloud infrastructures through dynamically observing the state of the cloud services and resources, analysing for threats and remediating against their effects.
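
The real-time and off-line loops of the D²R² + DR strategy can be illustrated with a minimal sketch. It is not the CRMF implementation: the class `ResilienceLoop`, the single "load" metric, the threshold value and the action names are all assumptions made for illustration, and the Defend step (hardening before any event occurs) is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class ResilienceLoop:
    """Toy D2R2+DR loop over one load metric (hypothetical model)."""
    threshold: float = 0.8      # assumed detection threshold
    degraded: bool = False      # True while remediation is active
    history: list = field(default_factory=list)  # input to the off-line loop

    def detect(self, load: float) -> bool:
        # Detect: decide whether an adverse event is occurring.
        return load > self.threshold

    def step(self, load: float) -> str:
        """One pass of the real-time loop; returns the action taken."""
        adverse = self.detect(load)
        self.history.append((load, adverse))
        if adverse and not self.degraded:
            self.degraded = True
            return "remediate"  # e.g. degrade gracefully
        if adverse:
            return "hold"       # remediation is already active
        if self.degraded:
            self.degraded = False
            return "recover"    # event has ended: back to normal operation
        return "normal"

    def refine(self) -> float:
        # Off-line loop: diagnose past observations, then refine the design.
        # Here the threshold moves midway between the highest normal load
        # and the lowest adverse load seen so far.
        normal = [load for load, adverse in self.history if not adverse]
        adverse = [load for load, adverse in self.history if adverse]
        if normal and adverse:
            self.threshold = (max(normal) + min(adverse)) / 2
        return self.threshold


loop = ResilienceLoop()
actions = [loop.step(x) for x in [0.2, 0.3, 0.95, 0.9, 0.4, 0.1]]
print(actions)
print(loop.refine())
```

The state flag `degraded` is what separates Remediate from Recover in this sketch: recovery is only triggered once an event that was being remediated is no longer detected.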
De-constructing D²R² + DR
• Detect
  o Implies a monitoring system (network and VM level)
    • Instrument the cloud
    • Aim to observe normal behaviour
    • Then look for anomalies
  o Employ suitable anomaly detection techniques (ADTs) that are migration-aware
    • Classify the detected anomalies
    • Attempt a root cause analysis
• Remediate
  o Policy-based remediation
  o Make adjustments as appropriate, e.g. migrate VMs, adjust firewall rules, sandbox a VM
  o Get as much context as possible
• Recover
  o Get back to normal behaviour if possible
  o Use policies for high-level guidance
• Diagnose & Refine
  o Learning phase

Anomaly evaluation framework
• How to quantify the impact of elasticity, such as VM migration, on state-of-the-art ADTs?
S. Simpson, N. Shirazi, D. Hutchison, and B. Helge, “Anomaly detection techniques for cloud computing,” Dec. 2013. [Online]. https://www.seccrit.eu/upload/D4-1Anomaly-Detection-Techniques-forCloud.pdf
• Anomaly evaluation framework
  o Composed of various pre-/post-processing modules (scripts, Perl libraries, Python and C)
  o Attack scripts for volume- and non-volume-based attacks with rate-limiting features
  o Monitoring scripts based on tcpdump
  o Background traffic
  o Summary extraction scripts
    • Convert traffic into normalized statistical properties on a per-packet basis
  o Detector scripts provide reference implementations of ADTs
  o Visualization scripts compare the anomaly score to a threshold and plot ROC and PRC curves

NW & VM level analysis
Network level analysis
• 8 different network features such that X is 600×2400
• Aggregation in 1-second bins
• 40-minute experiment
[Figure annotations: Attack (NS), Migration]
VM level analysis
• 33 different system features such that X is 400×33
• Aggregation in 3-second bins
• 20-minute experiment
[Figure annotations: Malware Injected, Migration]

Conclusion
• Deploying CI services in the cloud increases concerns about resilience and security
• There is a need for robust anomaly detection for cloud environments, especially for CI, that is aware of elastic behaviour and can work in real settings
• Elasticity has a direct impact on the underlying ADTs
• Policies are the backbone of remediation
• The resilience framework can help to protect cloud infrastructures through dynamically observing the state of the cloud services and resources, analysing for threats and remediating against their effects
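
As an illustration of the anomaly evaluation framework's pipeline (summary extraction into time bins, a detector producing an anomaly score, and a threshold sweep yielding ROC points), here is a minimal sketch. It is not the project's code, which is built from tcpdump-based scripts, Perl, Python and C. The synthetic packet list, the single packets-per-bin feature, the z-score detector and all function names are assumptions chosen to keep the example self-contained.

```python
from collections import Counter
from statistics import mean, pstdev


def bin_packets(packets, bin_seconds=1):
    """Aggregate (timestamp, size) records into packets-per-bin counts."""
    counts = Counter(int(ts // bin_seconds) for ts, _size in packets)
    return [counts.get(i, 0) for i in range(max(counts) + 1)]


def anomaly_scores(series, train_bins):
    """Score each bin by its absolute z-score against a training prefix."""
    baseline = series[:train_bins]
    mu = mean(baseline)
    sigma = pstdev(baseline) or 1.0   # guard against a constant baseline
    return [abs(x - mu) / sigma for x in series]


def roc_points(scores, labels):
    """Sweep the threshold over all scores; return (FPR, TPR) pairs."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for thr in sorted(set(scores), reverse=True):
        flagged = [s >= thr for s in scores]
        tp = sum(f and l for f, l in zip(flagged, labels))
        fp = sum(f and not l for f, l in zip(flagged, labels))
        points.append((fp / neg, tp / pos))
    return points


# Synthetic trace (assumed): packets per second, with a volume attack
# in seconds 6 and 7. Ground-truth labels mark the attack bins.
rates = [5, 6, 5, 4, 5, 6, 50, 55, 5, 4]
labels = [False] * 6 + [True] * 2 + [False] * 2
packets = [(sec + i / rate, 64)
           for sec, rate in enumerate(rates) for i in range(rate)]

series = bin_packets(packets, bin_seconds=1)
scores = anomaly_scores(series, train_bins=6)
alarms = [s > 3.0 for s in scores]    # compare score to a fixed threshold
roc = roc_points(scores, labels)
print(series, alarms, roc, sep="\n")
```

Running the same scoring and ROC computation on traces recorded with and without a VM migration is one way such a pipeline could expose how elasticity changes a detector's false-positive and true-positive rates.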