High Availability is more than five nines

Evolved Packet Core features build resiliency and preference
The availability of mobile broadband service has become essential to daily life. Without their smartphones, people cannot perform functions now taken for granted, such as mobile banking,
social networking and checking the weather forecast on the move. Just as important as the device,
however, is the network supporting it. The end user experience is enabled and increasingly personalized
by the Evolved Packet Core. This paper outlines examples of Evolved Packet Core products and features
developed to meet operators’ high availability needs.
INTRODUCTION
There is no ‘busy hour’ for operators now.
The network is always busy, and for most
subscribers, it’s mission critical. Faced with a
choice between having to leave either your wallet
or smartphone at home for the day, your decision
might be different from ten years ago when it
would have been either your wallet or mobile
phone. Subscribers expect their mobile networks
to support almost every aspect of their lives, and
that demands high availability.
Robust and ultra-resilient signaling will be
near the top of the requirements list, driven
by ever-increasing subscriber numbers, rapid
deployment of machine-to-machine applications,
Voice over LTE (with multiple bearers), and of
course a further proliferation of new devices and
applications.
Networks will also need to respond to further
increases in traffic and signaling created by the
implementation of heterogeneous networks and
handovers between different access networks:
principally between Wi-Fi and WCDMA/LTE.
Unfortunately, the network also needs to
respond well to natural disasters and increasingly
sophisticated and mobile network-centric
security attacks.
Network performance is redefined every time we
unlock our devices, open a new app or change
location. Delivering the right experience in the
right context is what counts, and makes us
recommend our operator to friends, neighbors
and colleagues. Any definition of performance
starts with high availability. Whatever bandwidth
you can provide, however low the round-trip
delay, however expansive the coverage, the
service has to be rock solid.
What price then will operators pay for ‘less-than-high’ network availability? There have been some
notable and presumably expensive outages
caused by excessive service demand, or by
unpredicted outcomes of new device or user
behaviors. As an illustration, for an operator whose
subscribers pay an average of $30 per month,
a 1% increase in churn rate can amount to a loss
of $3.6 million annually for every million subscribers.
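The arithmetic behind this illustration can be checked with a short sketch. It assumes, beyond what the text states, that the 1% means one percentage point of the subscriber base lost over a year, and that each lost subscriber would otherwise have paid for the full 12 months. The function and parameter names are illustrative only.

```python
def annual_churn_loss(subscribers: int, monthly_revenue_per_user: float,
                      extra_churn_fraction: float) -> float:
    """Revenue lost per year when an extra fraction of subscribers leaves.

    Assumption: every lost subscriber would have paid for all 12 months.
    """
    lost_subscribers = subscribers * extra_churn_fraction
    return lost_subscribers * monthly_revenue_per_user * 12


# Figures from the example: one million subscribers, $30 per month, 1% extra churn.
loss = annual_churn_loss(1_000_000, 30.0, 0.01)
print(f"${loss:,.0f} per year")
```

Under these assumptions the result is about $3.6 million per year, in line with the figure quoted above.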
The need for improved resiliency has increased
at the same time as network signaling volumes
have increased. The introduction of LTE, powerful
smartphones and changing user behaviors has
meant that operators and vendors are constantly
adapting to meet new demands.
The Networked Society will create a world
connected in real time, and a technology-driven
revolution transforming industries and the way
society interacts at work and play.
Correspondingly, the networks enabling this
transformation will need to respond to a swathe
of new demands.
Ericsson has responded to worldwide operator
needs with platforms and software features
capable of handling volume (number of users,
traffic & signaling), latent threats and failure
conditions in a predictable and reliable manner.
This paper outlines specific examples of Evolved
Packet Core products and features developed to
meet operators’ high availability needs.
NETWORK PERFORMANCE AND END
USER PREFERENCES
Putting superior performance into
practice:
An Ericsson ConsumerLab survey (2013) shows
the relative importance of a range of factors in
determining end user loyalty to an operator’s
brand. Network Performance was shown to be
the clear leader: more than twice as important as
customer support, and four times as important
as loyalty rewards. The survey showed that
more than two thirds of promoters (end users
who promote their operator to others) are very
satisfied with network performance.
Australia’s Telstra is a prime example of an
operator recognizing the strengths of network
performance in driving customer satisfaction and
business objectives. As Mike Wright, Telstra’s
Executive Director of Networks and Access
Technologies, stated:
“There’s a very clear linkage between superior network performance and our business outcomes. We see that customers are prepared to pay a little more when they know that the network is going to be reliable and give them great service.”
Mike Wright
Executive Director of Networks and Access Technologies, Telstra
[Figure: relative importance of factors in determining end user loyalty to an operator’s brand]
Network performance: 20%
Value for money: 16%
Ongoing communication: 11%
Tariff plans offered: 10%
Customer support: 9%
Account management: 8%
Billing and payment: 7%
Handset/devices offered: 7%
Initial purchase: 7%
Loyalty rewards: 5%
Factor categories shown in the figure: Customer service, Offer, Marketing, Network
Source: Ericsson ConsumerLab report (2013).
Base: 9,040 smartphone users in Brazil, China, South Korea, Japan, USA, UK, Sweden, Russia and Indonesia.
For survey methodology, based upon ‘Net Promoter Scores’, please refer to report.
Telstra provides unmatched performance across
Australia with Ericsson LTE 1800 MHz radio and
multi-access Evolved Packet Core networks.
HIGH AVAILABILITY EPC
The EPC nodes deliver high availability services
with telecom-grade platforms and applications.
Important additional features ensure the high
availability of user sessions and services, and
protect the rest of the mobile network. Without
high availability, an operator incurs costs to fix
outages, costs to improve network resiliency,
and perhaps most importantly – the loss of
subscriber loyalty.
The Evolved Packet Core (EPC) network
comprises three principal network elements.
Ericsson’s Evolved Packet Gateway (EPG)
comprises both Packet Gateway (P-GW) and
Serving Gateway (S-GW) functions and connects
the mobile operator’s network with external IP
networks. Ericsson’s SGSN-MME comprises both
Serving GPRS Support Node (for GSM and
WCDMA) and Mobility Management Entity (for
LTE) functions and handles most of the signaling
requirements in the EPC. The SGSN-MME Pool
operation provides efficiency gains and resiliently
handles changing conditions.
The Ericsson Service-Aware Policy Controller (SAPC) performs the Policy and Charging Rules Function (PCRF) and plays a pivotal role in delivering differentiated services and prioritizing network access.
Within the EPC network, the SGSN-MME can be seen as the signaling ‘nerve center’ or ‘control room’ of the mobile broadband network. It has a unique perspective in having visibility of, and a large degree of control over, the workings of both the RAN and EPC.
As signaling volumes and subscriber density increase, system availability requirements become increasingly demanding. A recent survey collected data from 115 SGSN-MMEs in ‘live’ service with North American operators providing LTE service to more than 30 million subscribers. The in-service performance (ISP) over the 12-month survey duration was 99.99985%. Thanks to SGSN-MME Pool, the operators were able to benefit from 100% network availability.
This level of ultra-resilience is due to Ericsson’s telecom-oriented mindset, and a long history of providing highly robust telecoms systems. This experience now provides highly resilient systems for today’s IP-connected world and the Networked Society.
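To put the quoted in-service performance in context, the sketch below converts an availability percentage into expected downtime per year. It assumes a 365-day year and treats the ISP figure as a simple availability ratio, which is a simplification. The function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Expected unavailable minutes per year for a given availability percentage."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


five_nines = downtime_minutes_per_year(99.999)
surveyed_isp = downtime_minutes_per_year(99.99985)
print(f"five nines: {five_nines:.2f} minutes per year")
print(f"99.99985%:  {surveyed_isp * 60:.0f} seconds per year")
```

Under these assumptions, five nines allows a little over five minutes of downtime per year, while the surveyed figure corresponds to well under one minute.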
[Figure: Evolved Packet Core network overview, showing GSM (BTS, BSC), WCDMA (RNC, 3G Direct Tunnel) and LTE (eNodeB) access, the SGSN-MME Pool, HLR/HSS, SAPC, EPG and the Internet, with traffic (user/data plane) and signaling paths.]
HIGH AVAILABILITY REALIZATION
Ericsson’s principal EPC components (SGSN-MME, SAPC and EPG) are designed for telecom-grade resiliency and based upon highly redundant, multi-application platforms: the Ericsson Blade System for the SGSN-MME and SAPC, and the Smart Services Router SSR 8000 family for the EPG.
Both platforms are designed to deliver at least five-nines availability, and feature ‘N+1’ card redundancy, providing session resilience in the case of an individual card failure. This is a far more efficient and cost-effective mechanism than traditional ‘1+1’ resiliency.
Ericsson EPC components add value when used together, as they are in the majority of operator deployments. The value is realized when components work together to sustain availability. The P-GW, for example, will protect against the temporary loss of a PCRF (SAPC) by falling back to local policies that have been configured in the EPG. This facility is typically a requirement to protect users of VoLTE services. The EPG and SGSN-MME also combine to raise service levels through EPG features such as S-GW Restoration and P-GW Restart Notification. Collectively these EPC features serve to protect PDN connections and user sessions, and exemplify Ericsson’s commitment to supporting operators’ user expectations.
SGSN-MME Pool and the other ‘Important additional features’, mentioned previously, take Ericsson’s Evolved Packet Core implementation beyond simply offering ‘five-nines’ levels of uptime.
These additional features ensure the high availability of user sessions and services, and protect the rest of the mobile network. They are part of the software available with the SGSN-MME or EPG products, or a combination of both. A selection of representative high availability features follows.
Processor Overload Protection
End users’ service availability requires network nodes to be functioning as expected, and within capacity constraints. If a node becomes overloaded then service availability can be affected and users can experience delays or, in particularly severe conditions, service disruption. Ericsson’s SGSN-MME uses an innovative and patented technique, based on service priorities, to provide CPU overload protection for processor cards. This ensures that the node is functioning correctly, managing sessions and traffic, even during a continuous and severe overload situation.
The Overload Protection (‘OLP’) function performs orderly message discards based on the CPU load and/or queue length. Rather than discarding messages indiscriminately, the OLP function discards control plane messages based upon prioritization of users and services, and according to the 3GPP-defined Multimedia Priority Service (MPS) settings. This provides a guarantee that higher priority users and services will be protected from service degradation.
Similarly, the SAPC features a load-regulation mechanism that discards messages in a controlled way to protect the node during continuous high-load conditions over a configured threshold. Messages are discarded based upon a combination of two criteria: message type and user/service priority.
Geo-Redundancy
The SGSN-MME ‘Geo-Redundant Pool’ feature maintains not just service continuity, but also the more challenging session continuity, upon an otherwise quite serious network failure. Example failure causes could include loss of S1 or S11 links or, in a disaster situation such as an earthquake or a flood, the loss of an SGSN-MME or a complete site.
This feature also requires support from the Serving Gateway (S-GW) in the EPG, as the S-GW needs to be constantly aware of which SGSN-MMEs are available in the pool. The S-GW also needs to respond to the service restoration event in an integrated manner.
Session continuity is made possible by replicating, or ‘mirroring’, user data (contexts) between SGSN-MMEs in a Pool during normal operation with stateful replication. Upon Serving SGSN-MME outage detection, the Backup SGSN-MME in the Pool takes over, retrieving the mirrored User Equipment (UE) context data and maintaining session continuity. Given adequate network dimensioning, there is virtually no service impact on end users; even VoLTE calls and SIP sessions are maintained.
The Geo-redundancy feature assures extremely
high service availability for end users and
business partners with no requirement for
external boxes or other failure-detection devices.
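The context mirroring behind the Geo-Redundant Pool feature can be pictured with a small in-memory model. This is a minimal sketch under stated assumptions, not Ericsson's implementation: the class names, the single backup per UE and the promotion step are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PoolMember:
    """One SGSN-MME in the pool (hypothetical model, not a product API)."""
    name: str
    alive: bool = True
    serving: dict = field(default_factory=dict)   # imsi -> UE context served here
    mirrored: dict = field(default_factory=dict)  # imsi -> (serving node name, context copy)


class MirroredPool:
    def __init__(self, names):
        self.members = {name: PoolMember(name) for name in names}

    def attach(self, imsi, context, serving, backup):
        """Store the context on the serving node and mirror a copy to the backup."""
        self.members[serving].serving[imsi] = dict(context)
        self.members[backup].mirrored[imsi] = (serving, dict(context))

    def fail(self, failed):
        """Mark a node as failed; backups promote the contexts they mirrored for it."""
        self.members[failed].alive = False
        recovered = []
        for member in self.members.values():
            if not member.alive:
                continue
            owned_by_failed = [imsi for imsi, (owner, _) in member.mirrored.items()
                               if owner == failed]
            for imsi in owned_by_failed:
                _, context = member.mirrored.pop(imsi)
                member.serving[imsi] = context
                recovered.append(imsi)
        return sorted(recovered)

    def serving_node(self, imsi):
        """Return the name of the live node serving this UE, or None."""
        for member in self.members.values():
            if member.alive and imsi in member.serving:
                return member.name
        return None


pool = MirroredPool(["mme-a", "mme-b", "mme-c"])
pool.attach("imsi-001", {"bearers": 2}, serving="mme-a", backup="mme-b")
pool.fail("mme-a")
print(pool.serving_node("imsi-001"))
```

Because the backup already holds a copy of the context, the UE does not need to re-attach after the failure. That is the property that distinguishes session continuity from service continuity.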
Without the Geo-Redundant Pool feature, all UEs would attempt to re-attach to the network simultaneously, creating an ‘attach signaling storm’ which could result in users waiting perhaps 15 to 20 minutes to re-attach and restore service.
The EPG also has a separate Geo-Redundancy feature called ‘Inter Chassis Redundancy’ (ICR). Like the SGSN-MME ‘Geo-Redundant Pool’ feature, ICR also preserves session continuity, but with ‘mirrored’ EPG pairs. In a contained laboratory environment, sessions can be moved from one EPG to another within as little as 20 ms.
The SAPC further supports EPC Geo-Redundancy by providing a guarantee of service continuity in the event of primary node failure. When deployed geo-redundantly, a pair of SAPC nodes is configured in an ‘active-standby’ mode. The two nodes are connected by an update channel that replicates the data and session information from the active node to the standby node. This provides full session and service continuity if a failover occurs. A failover event is transparent to the rest of the network through the use of a common ‘Geo-Redundancy IP address’ that automatically and independently redirects traffic to the newly active node.
Preserving sessions while moving users
It’s beneficial to both operators and end users if SGSN-MME maintenance or upgrade operations can be conducted during daily working hours and with service continuity. Operators benefit from lower resource costs because out-of-hours maintenance incurs additional costs, and end users benefit from minimal and tolerable service impact.
Ericsson is able to move all UEs from one SGSN-MME to other SGSN-MMEs in the same Pool while guaranteeing bearer preservation and traffic continuity. Users experience approximately 1-2 seconds of inactivity while the move takes place, which is typically faster than other vendors’ comparable schemes. Any time difference is particularly important as it permits the operation to take place during normal working hours, whereas a longer outage is sufficient for end users to perceive a service impact. Ericsson calculates that the working-hours operation will provide an approximate 75% reduction in operator costs.
The ‘UE Move’ takes as little as one operator command to activate, or just one ‘click’ if performed by the OSS management system.
Automatic Network Verification
The Automatic Network Validation (ANV) feature provides the ability to quickly and effectively validate a new SGSN-MME node, or a node running new software or features, using a pre-selected sample of active subscribers.
The operator benefits are reductions in operational costs and enhanced quality assurance. With ANV, operators can reduce verification times from hours to just minutes because this type of verification requires no time-consuming and resource-intensive drive tests.
The testing can be customized by performing a ‘selective move’ of a relatively small number of representative users before moving a larger number of users from the SGSN-MME pool onto the SGSN-MME under test. These initial users can be selected based upon a number of metrics (such as IMSI, APN, IMEI-TAC, RAT type, roaming status) to guarantee the expected status of the SGSN-MME.
The results of the selective UE move are quickly compiled in a validation report indicating the success rates of key signaling events. Upon successful verification a larger UE move is performed to populate the SGSN-MME and re-balance the pool.
Quality assurance is enhanced through ANV because operators can quickly and effectively extend the validation scope beyond what can be achieved through regular drive tests.
Reducing the impact of signaling storms
The consequences of a control-plane ‘signaling tsunami’ can be potentially quite damaging to services availability. These signaling-overload conditions can result from a variety of events such as:
• A malicious distributed denial of service (DDOS) security attack,
• A network outage caused, for example, by an event such as an earthquake or flood.
DDOS attacks create excessive signaling storms which attempt to overload networks and take them out of service. In the case of a network outage, an ‘attach storm’ can be caused by a significantly large number of user devices attempting to re-attach after a network comes back into service. When this happens, it’s better for the network to take a little longer to attach some of the users, by throttling, than to cause a control plane node to fail. In both cases a technique for mitigating the effects of overload provides protection for the network.
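The trade-off described above, admitting users a little more slowly rather than letting a control plane node fail, can be illustrated with a token bucket. This is a generic rate-limiting technique chosen here for illustration; it is not a description of Ericsson's throttling mechanism. The class name, rate and burst values are assumptions.

```python
class AttachThrottle:
    """Token bucket that admits at most `rate_per_second` attaches on average."""

    def __init__(self, rate_per_second: float, burst: int):
        if rate_per_second <= 0 or burst < 1:
            raise ValueError("rate and burst must be positive")
        self.rate = rate_per_second
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        """Return True if an attach arriving at time `now` (seconds) is admitted."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def simulate_attach_storm(ue_count: int, throttle: AttachThrottle,
                          retry_interval: float = 1.0) -> float:
    """All UEs attach at t=0 and rejected UEs retry every `retry_interval` seconds.

    Returns the time at which the last UE is admitted.
    """
    pending = ue_count
    now = 0.0
    while pending > 0:
        admitted = sum(1 for _ in range(pending) if throttle.allow(now))
        pending -= admitted
        if pending > 0:
            now += retry_interval
    return now


# Hypothetical numbers: 1,000 UEs re-attaching against a 100 attach/s limit.
finish = simulate_attach_storm(1000, AttachThrottle(rate_per_second=100, burst=100))
print(f"last UE admitted after {finish:.0f} s")
```

With these illustrative numbers the node never sees more than 100 attaches per second, and every UE is still admitted within a few seconds rather than after a node failure and restart.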
The SGSN-MME is a very powerful and flexible system, and it’s capable of providing a ‘safety net’ between the incoming signaling storm from the RAN and connected control plane nodes. Signaling rate adaptation reduces outbound or northbound signaling volumes on the SGSN-MME’s Diameter and GTP-C interfaces through a process called ‘Smart Signaling Throttling’.
The SGSN-MME is constantly measuring the time taken for other nodes (such as HLR/HSS, S-GW, PCRF) to respond to requests, and comparing that with the volume of signaling sent. In this way, it’s able to adaptively respond when signaling requests are not met within anticipated timeframes to improve network robustness.
Smart Signaling Throttling starts after the measured delay to respond to outgoing requests exceeds an automatically configured threshold. Once this threshold ‘window’ is exceeded then the SGSN-MME will provide dynamic throttling based on the current load situation to improve node stability and network robustness.
Handling ‘misbehaving’ devices & users
Over recent years, there has been an increasing requirement to protect mobile networks against ‘bad’ signaling, and optimize networks for ‘good’ signaling.
Excessive signaling conditions have been observed when operators have introduced new user devices to their networks. Smartphones and tablets, for example, are quite powerful devices and are able to generate significant signaling volumes if they are not working according to 3GPP specifications, or have not been tested sufficiently. In these cases the devices can work in unpredictable or erroneous ways. They can also generate excessive signaling when new apps are introduced that have not been developed with an understanding of mobile network impact. DDOS attacks are the most extreme example of ‘bad’ signaling, as they’re specifically designed to be destructive.
The ‘UE Signaling Control’ feature enables the Ericsson SGSN-MME to provide effective detection of, and protection against, these causes of excessive or destructive UE signaling. Using a UE ‘lock-out’ function it’s possible to lock a ‘misbehaving’ device out of a network, either by detaching it or rejecting an attach request. Lock-out takes place when signaling messages from a specific UE exceed a configured threshold. In the case of a problematic device type, it’s possible to prohibit network attachment for specific IMEI number series, which will be advantageous, for example, if a newly introduced smartphone device is generating problems across multiple UEs.
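The lock-out behavior of the ‘UE Signaling Control’ feature, a per-UE message threshold plus a block on specific IMEI number series, can be sketched as follows. This is a minimal illustration under stated assumptions: the class, the sliding time window and the use of the 8-digit Type Allocation Code (TAC) prefix to identify an IMEI series are hypothetical choices, not the product's actual logic.

```python
from collections import defaultdict, deque


class UeSignalingControl:
    """Hypothetical per-UE signaling limiter with IMEI-series blocking."""

    def __init__(self, max_messages: int, window_seconds: float, blocked_tacs=()):
        self.max_messages = max_messages
        self.window = window_seconds
        self.blocked_tacs = set(blocked_tacs)  # 8-digit IMEI prefixes (TAC)
        self.history = defaultdict(deque)      # imsi -> recent message timestamps
        self.locked_out = set()

    def on_message(self, imsi: str, now: float) -> bool:
        """Record one signaling message; return False once the UE is locked out."""
        if imsi in self.locked_out:
            return False
        timestamps = self.history[imsi]
        timestamps.append(now)
        # Forget messages that fell out of the sliding window.
        while timestamps and now - timestamps[0] > self.window:
            timestamps.popleft()
        if len(timestamps) > self.max_messages:
            self.locked_out.add(imsi)
            return False
        return True

    def allow_attach(self, imsi: str, imei: str) -> bool:
        """Reject attaches from locked-out UEs and from blocked device series."""
        return imsi not in self.locked_out and imei[:8] not in self.blocked_tacs


# Illustrative values: at most 3 messages per 10 s, one blocked device series.
control = UeSignalingControl(max_messages=3, window_seconds=10.0,
                             blocked_tacs={"35693803"})
results = [control.on_message("imsi-001", t) for t in (0.0, 1.0, 2.0, 3.0)]
print(results, control.allow_attach("imsi-001", "490154203237518"))
```

In this sketch the fourth message within the window exceeds the threshold, so the UE is locked out and its next attach request is rejected, while a UE with the same device model but normal behavior is unaffected.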
KEEPING ONE STEP AHEAD
Smartphone Lab
In the rapidly changing ecosystem of smartphone devices, applications and new services, Ericsson is taking a proactive measure to increase the resiliency of Ericsson-equipped operator networks. By working closely with device vendors, application developers and OS vendors, Ericsson gains a unique insight into how networks respond to the changing environment. This is the ‘Smartphone Lab’ initiative. Ericsson will stay ‘ahead of the curve’, protect networks and increase services availability by turning insight into solutions that will enable operators to stay in control of the signaling challenge.
Virtual Evolved Packet Core
Ericsson is providing a virtual EPC to support operators’ transitions to cloud. This will, for example, create new operator opportunities in the areas of machine-to-machine (M2M), Enterprise and Distributed Cloud for fast-growing markets. Existing ‘native’ and new ‘virtual’ network nodes will coexist seamlessly and feature the same high availability functions such as pooling, geo-redundancy and load sharing. This offers a very attractive proposition for operators moving to NFV.
Virtualized EPC solutions will benefit from the same full feature set and compatibility with native EPC by using the same software and a common Operations and Support System. This means that operators deploying Ericsson’s virtual EPC on Ericsson platforms or 3rd party platforms (having performed the required systems integration) will continue to enjoy market-leading compatibility with a whole range of connected devices and systems, from smartphones and RAN to charging systems and services.
The Ericsson Cloud System, based on the Ericsson platforms or certified 3rd party hardware, adds cloud capabilities for operators while extending carrier grade operations from physical EPC nodes to virtual EPC nodes too.
MEETING MISSION CRITICAL NEEDS
The most exacting mission critical application for the general public demands the best of network infrastructure. That’s why Motorola’s Public Safety LTE solution includes an Ericsson Evolved Packet Core with components including SGSN-MME, EPG and SAPC.
With fast mobile broadband, firefighters can work with real-time video from a support helicopter, for example. All communications require a reliable network connection.
High availability is not all that’s required though. Priority mechanisms must ensure that those who need the network resources most will gain access when and where it is needed.
Ericsson and Motorola Solutions have entered into a strategic alliance to provide real-time broadband services to the public safety community. Together, they deliver the best user experience with the highest performing networks available, adapted to meet the specific needs of this unique sector.
Motorola Solutions offers a turnkey Public Safety LTE solution including sector-specific devices, applications and management systems.
Ericsson contributes LTE hardware and software, including the system’s key enabler: the dynamic, interactive interplay between the network, applications and devices.
Differentiated services based on a user’s role, rank, jurisdiction, incident level and application are achieved through enhanced access control, QoS mechanisms, bandwidth modification and prioritization.
SUMMARY
Reliability enables innovation and growth
Every service-affecting network failure has some impact on customer confidence and operator brand perception. That’s why achieving a ‘five-nines’ level of availability is not really sufficient by itself. The Evolved Packet Core is a particularly important part of an operator’s network because all traffic flows through it, and it is the ‘nerve center’ or ‘control room’ for policy management and network signaling. Being such a strategically important part of the network, it’s vitally important to establish and maintain a high availability EPC. Recent mobile network outages have become big news on TV and in online media, so the potential costs are very clear.
As Hideyuki Tsukuda, Senior Vice President
of Networks, SoftBank Mobile Corp. in Japan
confirms, high availability is critical in helping
operators satisfy their business objectives:
“At SoftBank Mobile, providing highly reliable mobile broadband services to our customers is fundamental to achieving our aims. Ericsson’s highly resilient solutions have significantly contributed to our record of having no serious network incidents for more than three years, which is a key success factor for our customers.”
Hideyuki Tsukuda
Senior Vice President of Networks, SoftBank Mobile Corp.
Ericsson has had a continual focus on delivering high availability, particularly in the Evolved Packet Core. Being first to market in the first live LTE network and a market leader ever since means that Ericsson has a wealth of experience in every related aspect of network design and support. Especially important is the deep experience gained from helping operators to manage the complexities of introducing new end user devices and of evolving applications.
Elevating high availability beyond ‘five-nines’ requires a commitment to designing telecom-grade equipment and software. It also requires a commitment to developing high availability-specific additional features at the individual node level, between similar nodes in a high availability ‘Pool’, and at an Evolved Packet Core system level. Ericsson addresses all these areas and is constantly adding new high availability features, both for native and virtual EPC implementations.
End users will continue to express their
preferences for network performance when
considering brand loyalty, and with Ericsson
EPC your network couldn’t be in safer ‘hands’.
REFERENCES
Ericsson ConsumerLab Mobility Report, June 2013
http://www.ericsson.com/res/docs/2013/ericsson-mobility-report-june-2013.pdf
Reference Story: Telstra, Australia: Superior performance
http://www.ericsson.com/thecompany/our_publications/reference-stories-a-z/telstra-australia
Public Safety LTE
http://www.ericsson.com/ourportfolio/government/public-safety-lte
Press Release: Evolved Packet Core provided in a virtualized mode industrializes NFV, February 2014
http://www.ericsson.com/thecompany/press/releases/2014/02/1761217
Ericsson
SE- 126 25 Stockholm, Sweden
Telephone +46 10 719 00 00
www.ericsson.com
10/287 01-FGB 101 256 rev A
© Ericsson AB 2014