High Availability is more than five nines

Evolved Packet Core features build resiliency and preference

The availability of mobile broadband service has become essential to daily life. Without their smartphones, people cannot perform functions now taken for granted, such as mobile banking, social networking and checking the weather forecast on the move. Just as important as the device, however, is the network supporting it. The end user experience is enabled and increasingly personalized by the Evolved Packet Core. This paper outlines examples of Evolved Packet Core products and features developed to meet operators’ high availability needs.



INTRODUCTION

There is no ‘busy hour’ for operators now. The network is always busy, and for most subscribers, it’s mission critical. Faced with a choice between having to leave either your wallet or smartphone at home for the day, your decision might be different from ten years ago when it would have been either your wallet or mobile phone. Subscribers expect their mobile networks to support almost every aspect of their lives, and that demands high availability.

Network performance is redefined every time we unlock our devices, open a new app or change location. Delivering the right experience in the right context is what counts, and makes us recommend our operator to friends, neighbors and colleagues. Any definition of performance starts with high availability. Whatever bandwidth you can provide, however low the round-trip delay, however expansive the coverage, the service has to be rock solid.

The need for improved resiliency has increased at the same time as network signaling volumes have increased. The introduction of LTE, powerful smartphones and changing user behaviors has meant that operators and vendors are constantly adapting to meet new demands. The Networked Society will create a world connected in real time, and a technology-driven revolution transforming industries and the way society interacts at work and play. Correspondingly, the networks enabling this transformation will need to respond to a swathe of new demands.

Robust and ultra-resilient signaling will be near the top of the requirements list, driven by ever-increasing subscriber numbers, rapid deployment of machine-to-machine applications, Voice over LTE (with multiple bearers), and of course a further proliferation of new devices and applications.

Networks will also need to respond to further increases in traffic and signaling created by the implementation of heterogeneous networks and handovers between different access networks: principally between Wi-Fi and WCDMA/LTE. Unfortunately, the network also needs to respond well to natural disasters and increasingly sophisticated and mobile network-centric security attacks.

What price, then, will operators pay for ‘less-than-high’ network availability? There have been some notable and presumably expensive outages caused by excessive service demand, or by unforeseen consequences of new device or user behaviors. As an illustration, for every million subscribers paying an average of $30 per month, a 1% increase in churn rate can amount to a loss of $3.6 million annually.
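The churn figure can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function name is hypothetical, and it assumes the simplest reading of the example, namely that the additional 1% of subscribers is lost for a full twelve months.

```python
def annual_churn_loss(subscribers: int, revenue_per_month: float,
                      churn_increase: float) -> float:
    """Annual revenue lost when an extra fraction of subscribers leaves.

    Assumption: the lost subscribers would otherwise have paid for all
    twelve months of the year.
    """
    lost_subscribers = subscribers * churn_increase
    return lost_subscribers * revenue_per_month * 12


# One million subscribers, $30 per month, 1% additional churn.
loss = annual_churn_loss(1_000_000, 30.0, 0.01)
print(f"${loss:,.0f} per year")
```

Under these assumptions the result matches the $3.6 million quoted in the text.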

Ericsson has responded to worldwide operator needs with platforms and software features capable of handling volume (number of users, traffic & signaling), latent threats and failure conditions in a predictable and reliable manner. This paper outlines specific examples of Evolved Packet Core products and features developed to meet operators’ high availability needs.


Putting superior performance into practice:

Australia’s Telstra is a prime example of an operator recognizing the strengths of network performance in driving customer satisfaction and business objectives. As Mike Wright, Telstra’s Executive Director of Networks and Access Technologies stated:

There’s a very clear linkage between superior network performance and our business outcomes. We see that customers are prepared to pay a little more when they know that the network is going to be reliable and give them great service.

Telstra provides unmatched performance across Australia with Ericsson LTE 1800 MHz radio and multi-access Evolved Packet Core networks.

NETWORK PERFORMANCE AND END USER PREFERENCES

An Ericsson ConsumerLab survey (2013) shows the relative importance of a range of factors in determining end user loyalty to an operator’s brand. Network performance was shown to be the clear leader: more than twice as important as customer support, and four times as important as loyalty rewards. The survey showed that more than two thirds of promoters (end users who promote their operator to others) are very satisfied with network performance.


[Figure: Relative importance of factors in determining end user loyalty to an operator’s brand. Network performance leads at 20%; the other factors shown are value for money, ongoing communication, tariff plans offered, customer support, account management, billing and payment, handset/devices offered, initial purchase and loyalty rewards, with values between 5% and 16%. Factors are grouped under Customer service, Offer, Marketing and Network.]

Source: Ericsson ConsumerLab report (2013). Base: 9,040 smartphone users in Brazil, China, South Korea, Japan, USA, UK, Sweden, Russia and Indonesia. For the survey methodology, based upon ‘Net Promoter Scores’, please refer to the report.


HIGH AVAILABILITY EPC

The Evolved Packet Core (EPC) network comprises three principal network elements. Ericsson’s Evolved Packet Gateway (EPG) comprises both Packet Gateway (P-GW) and Serving Gateway (S-GW) functions and connects the mobile operator’s network with external IP networks. Ericsson’s SGSN-MME comprises both Serving GPRS Support Node (for GSM and WCDMA) and Mobility Management Entity (for LTE) functions and handles most of the signaling requirements in the EPC. The SGSN-MME Pool operation provides efficiency gains and resiliently handles changing conditions.

The Ericsson Service-Aware Policy Controller (SAPC) performs the Policy and Charging Rules Function (PCRF) and plays a pivotal role in delivering differentiated services and prioritizing network access.

Within the EPC network, the SGSN-MME can be seen as the signaling ‘nerve center’ or ‘control room’ of the mobile broadband network. It has a unique perspective in having visibility, and a large degree of control, over the workings of both the RAN and EPC.

The EPC nodes deliver high availability services with telecom-grade platforms and applications. Important additional features ensure the high availability of user sessions and services, and protect the rest of the mobile network. Without high availability, an operator incurs costs to fix outages, costs to improve network resiliency and, perhaps most importantly, the loss of subscriber loyalty.

As signaling volumes and subscriber density increase, maintaining system availability becomes increasingly demanding. A recent survey collected data from 115 SGSN-MMEs in live service with North American operators providing LTE service to more than 30 million subscribers. The in-service performance (ISP) over the 12-month survey period was 99.99985%. Thanks to SGSN-MME Pool operation, the operators were able to benefit from 100% network availability.
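To put the in-service performance figure in perspective, an availability percentage can be converted into expected downtime per year. The sketch below is a simple illustration (the function name is hypothetical, and a 365-day year is assumed); it compares the classic ‘five-nines’ target with the surveyed figure.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Expected unavailable minutes per year for a given availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


five_nines = downtime_minutes_per_year(99.999)
surveyed = downtime_minutes_per_year(99.99985)

print(f"99.999%   -> {five_nines:.2f} minutes per year")
print(f"99.99985% -> {surveyed * 60:.1f} seconds per year")
```

On this basis, five-nines corresponds to a little over five minutes of downtime per year, while 99.99985% corresponds to well under one minute per node.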

This level of ultra-resilience is due to Ericsson’s telecom-oriented mindset, and a long history in providing highly robust telecoms systems. This experience now provides highly resilient systems for today’s IP-connected world and the Networked Society.

[Figure: Evolved Packet Core network overview. Labeled elements: GSM access (BTS, BSC), WCDMA access (RNC), LTE access (eNodeB), a Pool of SGSN-MMEs, EPG, SAPC, HLR/HSS and the Internet. The legend distinguishes traffic from signaling; labels also mark the 3G Direct Tunnel and the user/data plane.]


HIGH AVAILABILITY REALIZATION

Ericsson’s principal EPC components (SGSN-MME, SAPC and EPG) are designed for telecom-grade resiliency and based upon highly redundant, multi-application platforms: the Ericsson Blade System for the SGSN-MME and SAPC, and the Smart Services Router SSR 8000 family for the EPG.

Both platforms are designed to deliver at least five-nines availability, and feature ‘N+1’ card redundancy, providing session resilience in the case of an individual card failure. This is a far more efficient and cost-effective mechanism than traditional ‘1+1’ resiliency.
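The efficiency difference between the two redundancy schemes comes down to how many spare cards are needed. The sketch below is illustrative only; the function names and the figure of eight active cards are assumptions chosen for the example, not platform specifications.

```python
def cards_needed(active_cards: int, scheme: str) -> int:
    """Total cards required to protect a given number of active cards."""
    if scheme == "N+1":
        return active_cards + 1      # one shared spare for all active cards
    if scheme == "1+1":
        return active_cards * 2      # a dedicated spare per active card
    raise ValueError(f"unknown redundancy scheme: {scheme}")


def spare_overhead(active_cards: int, scheme: str) -> float:
    """Spare cards as a fraction of active cards."""
    return (cards_needed(active_cards, scheme) - active_cards) / active_cards


# Hypothetical chassis with eight active cards.
for scheme in ("N+1", "1+1"):
    total = cards_needed(8, scheme)
    overhead = spare_overhead(8, scheme)
    print(f"{scheme}: {total} cards, {overhead:.1%} spare overhead")
```

With eight active cards, ‘N+1’ needs nine cards in total whereas ‘1+1’ needs sixteen, and the gap widens as the number of active cards grows.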

Ericsson EPC components add value when used together, as they are in the majority of operator deployments. The value is realized when components work together to sustain availability. The P-GW, for example, will protect against the temporary loss of a PCRF (SAPC) by falling back to local policies that have been configured in the EPG. This facility is typically a requirement to protect users of VoLTE services. The EPG and SGSN-MME also combine to raise service levels through EPG features such as S-GW Restoration and P-GW Restart Notification. Collectively these EPC features serve to protect PDN connections and user sessions, and exemplify Ericsson’s commitment to supporting operators’ user expectations.
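The fallback behavior described for the P-GW can be pictured as a simple ‘try the PCRF, otherwise use what is configured locally’ pattern. The following is a conceptual sketch only: the exception, the function names and the policy dictionaries are all hypothetical and do not represent the EPG’s actual interfaces or configuration model.

```python
class PcrfUnavailable(Exception):
    """Raised by the hypothetical PCRF client when the PCRF cannot be reached."""


def resolve_policy(apn, query_pcrf, local_policies):
    """Return (policy, source), preferring the PCRF over local configuration."""
    try:
        return query_pcrf(apn), "pcrf"
    except PcrfUnavailable:
        # Temporary loss of the PCRF: fall back to locally configured policy.
        policy = local_policies.get(apn, local_policies["default"])
        return policy, "local-fallback"


def unreachable_pcrf(apn):
    """Stand-in for a PCRF that is temporarily out of service."""
    raise PcrfUnavailable(apn)


# Hypothetical locally configured policies, keyed by APN.
local = {"ims": {"qci": 5}, "default": {"qci": 9}}

policy, source = resolve_policy("ims", unreachable_pcrf, local)
print(policy, source)
```

The point of the pattern is that session handling continues with a known-good policy rather than failing while the PCRF is unreachable.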

SGSN-MME Pool and the other ‘Important additional features’, mentioned previously, take Ericsson’s Evolved Packet Core implementation beyond simply offering ‘five-nines’ levels of uptime.

These additional features ensure the high availability of user sessions and services, and protect the rest of the mobile network. They are part of the software available with the SGSN-MME or EPG products, or a combination of both. A selection of representative high availability features follows.

Processor Overload Protection

End users’ service availability requires network nodes to be functioning as expected, and within capacity constraints. If a node becomes overloaded, service availability can be affected and users can experience delays or, in particularly severe conditions, service disruption. Ericsson’s SGSN-MME uses an innovative and patented technique, based on service priorities, to provide CPU overload protection for processor cards. This ensures that the node is functioning correctly, managing sessions and traffic, even during a continuous and severe overload situation.

The Overload Protection (‘OLP’) function performs orderly message discards based on the CPU load and/or queue length. Rather than discarding messages indiscriminately, the OLP function discards control plane messages based upon prioritization of users and services, and according to the 3GPP-defined Multimedia Priority Service (MPS) settings. This provides a guarantee that higher priority users and services will be protected from service degradation.
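The idea of priority-based discarding can be illustrated with a small sketch. This is not Ericsson’s patented technique: the load thresholds, the four priority levels and the function names are all assumptions made for the example, and only CPU load (not queue length) is considered.

```python
# (CPU load threshold, lowest priority level still admitted).
# Priority 0 is the highest, standing in for MPS-type users and services.
# All values are hypothetical.
DISCARD_STEPS = [(0.95, 0), (0.90, 1), (0.80, 2)]
LOWEST_PRIORITY = 3


def lowest_admitted_priority(cpu_load: float) -> int:
    """Return the lowest priority level still processed at this load."""
    for threshold, lowest in DISCARD_STEPS:
        if cpu_load >= threshold:
            return lowest
    return LOWEST_PRIORITY  # below all thresholds: nothing is discarded


def admit(messages, cpu_load):
    """Keep messages whose priority is at or above the current cutoff."""
    cutoff = lowest_admitted_priority(cpu_load)
    return [(name, prio) for name, prio in messages if prio <= cutoff]


queue = [("attach-mps", 0), ("service-request", 1),
         ("tau", 2), ("periodic-update", 3)]

print(admit(queue, 0.50))   # normal load: everything is processed
print(admit(queue, 0.92))   # overload: only priorities 0 and 1 remain
print(admit(queue, 0.99))   # severe overload: only priority 0 remains
```

As the load rises, lower-priority control plane messages are shed first, so the highest-priority traffic is the last to be affected.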

Similarly, the SAPC features a load-regulation mechanism that discards messages in a controlled way to protect the node during continuous high-load conditions over a configured threshold. Messages are discarded based upon a combination of two criteria: message type and user/service priority.

Geo-Redundancy

The SGSN-MME ‘Geo-Redundant Pool’ feature maintains not just service continuity, but also the more challenging session continuity, upon an otherwise quite serious network failure. Example failure causes could include loss of S1 or S11 links, or perhaps in a disaster situation, such as an earthquake or a flood, the loss of an SGSN-MME or a complete site.

This feature also requires support from the Serving-Gateway (S-GW) in the EPG as the S-GW needs to be constantly aware of which SGSN-MMEs are available in the pool. The S-GW also needs to respond to the service restoration event in an integrated manner.

Session continuity is made possible by replicating, or ‘mirroring’, user data (contexts) between SGSN-MMEs in a Pool during normal operation with stateful replication. When an outage of the Serving SGSN-MME is detected, the Backup SGSN-MME in the pool takes over using its replicated copy of the User Equipment (UE) context data, maintaining session continuity. Given adequate network dimensioning, there is virtually no service impact on end users; even VoLTE calls and SIP sessions are maintained.
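The mirroring and takeover steps can be pictured with a toy model. The sketch below is purely conceptual: the class, its methods and the context fields are hypothetical, and real replication, outage detection and S-GW interaction are far more involved.

```python
class PoolMember:
    """Toy model of one SGSN-MME in a pool (hypothetical structure)."""

    def __init__(self, name):
        self.name = name
        self.serving = {}    # UE id -> context for UEs served by this node
        self.mirrored = {}   # UE id -> (serving node name, replicated context)

    def attach(self, ue_id, context, backup):
        """Serve a UE and replicate its context to the backup node."""
        self.serving[ue_id] = context
        backup.mirrored[ue_id] = (self.name, dict(context))

    def take_over(self, failed_name):
        """Promote replicated contexts of a failed node to served contexts."""
        moved = [ue for ue, (owner, _) in self.mirrored.items()
                 if owner == failed_name]
        for ue in moved:
            _, context = self.mirrored.pop(ue)
            self.serving[ue] = context
        return moved


mme_a, mme_b = PoolMember("mme-a"), PoolMember("mme-b")
mme_a.attach("ue-1", {"bearer": "volte"}, backup=mme_b)
mme_a.attach("ue-2", {"bearer": "default"}, backup=mme_b)

# mme-a fails: mme-b continues the sessions from its replicas.
recovered = mme_b.take_over("mme-a")
print(recovered, mme_b.serving)
```

Because the context already exists on the backup node, the UEs do not need to re-attach, which is what preserves the session rather than just the service.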

The Geo-redundancy feature assures extremely high service availability for end users and business partners with no requirement for external boxes or other failure-detection devices.


Without the feature, all UEs would attempt to re-attach to the network simultaneously creating an ‘attach signaling storm’ which could result in users waiting perhaps 15 to 20 minutes to re-attach and restore service.

The EPG also has a separate Geo-Redundancy feature called ‘Inter Chassis Redundancy’ (ICR). Like the SGSN-MME ‘Geo-Redundant Pool’ feature, ICR also preserves session continuity, but with ‘mirrored’ EPG pairs. In a contained laboratory environment, sessions can be moved from one EPG to another in as little as 20 ms.

The SAPC further supports EPC Geo-Redundancy by providing a guarantee of service continuity in the event of primary node failure. When deployed geo-redundantly, a pair of SAPC nodes is configured in an ‘active-standby’ mode. The two nodes are connected by an update channel that replicates the data and session information from the active node to the standby node. This provides full session and service continuity if a failover occurs. A failover event is transparent to the rest of the network through the use of a common ‘Geo-Redundancy IP address’ that automatically and independently redirects traffic to the newly active node.

Preserving sessions while moving users

It’s beneficial to both operators and end users if SGSN-MME maintenance or upgrade operations can be conducted during daily working hours and with service continuity. Operators benefit from lower resource costs because out-of-hours maintenance incurs additional costs, and end users benefit from very minimal and tolerable service impact.

Ericsson is able to move all UEs from one SGSN-MME to other SGSN-MMEs in the same Pool while guaranteeing bearer preservation and traffic continuity. Users experience approximately 1-2 seconds of inactivity while the move takes place, which is typically faster than other vendors’ comparable schemes. This time difference is particularly important: a short interruption permits the operation to take place during normal working hours, whereas a longer outage is sufficient for end users to perceive a service impact. Ericsson calculates that the working-hours operation will provide an approximate 75% reduction in operator costs.

The ‘UE Move’ takes as little as one operator command to activate, or just one ‘click’ if performed by the OSS management system.

Automatic Network Validation

The Automatic Network Validation (ANV) feature provides the ability to quickly and effectively validate a new SGSN-MME node, or a node running new software or features, using a pre-selected sample of active subscribers.

The operator benefits are reductions in operational costs and enhanced quality assurance. With ANV, operators can reduce verification times from hours to just minutes because this type of verification requires no time-consuming and resource-intensive drive tests.

The testing can be customized by performing a ‘selective move’ of a relatively small number of representative users before moving a larger number of users from the SGSN-MME pool onto the SGSN-MME under test. These initial users can be selected based upon a number of metrics (such as IMSI, APN, IMEI-TAC, RAT type, roaming status) to guarantee the expected status of the SGSN-MME.

The results of the selective UE move are quickly compiled in a validation report indicating the success rates of key signaling events. Upon successful verification a larger UE move is performed to populate the SGSN-MME and re-balance the pool.
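The selective move and the validation report can be sketched as two small steps: pick a sample of users by attribute, then summarize the success rate of each signaling event. The code below is an illustration only; the field names, event names and the 99% acceptance threshold are assumptions, not ANV parameters.

```python
def select_sample(subscribers, sample_size, **criteria):
    """Pick up to sample_size subscribers matching all given attributes."""
    matches = [s for s in subscribers
               if all(s.get(key) == value for key, value in criteria.items())]
    return matches[:sample_size]


def validation_report(events):
    """Compute the success rate per signaling event name.

    events is an iterable of (event_name, succeeded) pairs.
    """
    totals, successes = {}, {}
    for name, ok in events:
        totals[name] = totals.get(name, 0) + 1
        successes[name] = successes.get(name, 0) + (1 if ok else 0)
    return {name: successes[name] / totals[name] for name in totals}


def ready_for_full_move(report, min_success_rate=0.99):
    """True if every event meets the (hypothetical) acceptance threshold."""
    return all(rate >= min_success_rate for rate in report.values())


subscribers = [
    {"imsi": "001010000000001", "rat_type": "LTE", "roaming": False},
    {"imsi": "001010000000002", "rat_type": "WCDMA", "roaming": False},
    {"imsi": "001010000000003", "rat_type": "LTE", "roaming": True},
]
sample = select_sample(subscribers, 10, rat_type="LTE", roaming=False)

report = validation_report([("attach", True), ("attach", True),
                            ("tau", True), ("tau", False)])
print(sample, report, ready_for_full_move(report))
```

In this example the tracking area update success rate is only 50%, so the larger UE move would be held back until the cause is understood.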

Quality assurance is enhanced through ANV because operators can quickly and effectively extend the validation scope beyond what can be achieved through regular drive tests.

Reducing the impact of signaling storms

The consequences of a control-plane ‘signaling tsunami’ can be quite damaging to service availability. These signaling-overload conditions can result from a variety of events such as:

• A malicious distributed denial of service (DDOS) security attack,
• A network outage caused, for example, by an event such as an earthquake or flood.

DDOS attacks create excessive signaling storms which attempt to overload networks and take them out of service. In the case of a network outage, an ‘attach storm’ can be caused by a large number of user devices attempting to re-attach after the network comes back into service. When this happens, it’s better for the network to take a little longer to attach some of the users, by throttling, than to cause a control plane node to fail. In both cases, a technique for mitigating the effects of overload provides protection for the network.

The SGSN-MME is a very powerful and flexible system, and it’s capable of providing a ‘safety net’ between the incoming signaling storm from the RAN and connected control plane nodes. Signaling rate adaptation reduces outbound or northbound signaling volumes on the SGSN-MME’s Diameter and GTP-C interfaces through a process called ‘Smart Signaling Throttling’.

The SGSN-MME is constantly measuring the time taken for other nodes (such as HLR/HSS, S-GW, PCRF) to respond to requests, and comparing that with the volume of signaling sent. In this way, it’s able to adaptively respond when signaling requests are not met within anticipated timeframes to improve network robustness.

Smart Signaling Throttling starts when the measured response delay for outgoing requests exceeds an automatically configured threshold. Once this threshold ‘window’ is exceeded, the SGSN-MME applies dynamic throttling based on the current load situation to improve node stability and network robustness.
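Delay-driven throttling of this kind can be illustrated with a generic control loop. The sketch below uses a simple ‘halve on excessive delay, recover gradually otherwise’ rule, which is an assumption chosen for clarity and not a description of how Smart Signaling Throttling is implemented; the class name, rates and threshold are likewise hypothetical.

```python
class AdaptiveThrottle:
    """Adjust the allowed outbound request rate from observed response delay."""

    def __init__(self, max_rate, delay_threshold_ms, min_rate=1.0):
        self.max_rate = max_rate
        self.min_rate = min_rate
        self.delay_threshold_ms = delay_threshold_ms
        self.allowed_rate = max_rate  # requests per second currently allowed

    def observe(self, response_delay_ms):
        """Update and return the allowed rate after one delay measurement."""
        if response_delay_ms > self.delay_threshold_ms:
            # Peer node is struggling: back off quickly.
            self.allowed_rate = max(self.min_rate, self.allowed_rate * 0.5)
        else:
            # Peer node is responding in time: recover step by step.
            step = 0.1 * self.max_rate
            self.allowed_rate = min(self.max_rate, self.allowed_rate + step)
        return self.allowed_rate


throttle = AdaptiveThrottle(max_rate=1000.0, delay_threshold_ms=200.0)
rates = [throttle.observe(delay) for delay in (100, 500, 500, 100)]
print(rates)
```

The outbound rate drops sharply while the peer is slow to respond and climbs back once responses return within the threshold.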

Handling ‘misbehaving’ devices and users

Over recent years, there has been an increasing requirement to protect mobile networks against ‘bad’ signaling, and optimize networks for ‘good’ signaling.

Excessive signaling conditions have been observed when operators have introduced new user devices to their networks. Smartphones and tablets, for example, are quite powerful devices and are able to generate significant signaling volumes if they are not working according to 3GPP specifications, or have not been tested sufficiently. In these cases the devices can work in unpredictable or erroneous ways. They can also generate excessive signaling when new apps are introduced that have not been developed with an understanding of mobile network impact. DDOS attacks are the most extreme example of ‘bad’ signaling, as they’re specifically designed to be destructive.

The ‘UE Signaling Control’ feature enables the Ericsson SGSN-MME to provide effective detection of, and protection against, these causes of excessive or destructive UE signaling. Using a UE ‘lock-out’ function, it’s possible to lock a ‘misbehaving’ device out of a network, either by detaching it or rejecting an attach request. Lock-out takes place when signaling messages from a specific UE exceed a configured threshold. In the case of a problematic device type, it’s possible to prohibit network attachment for specific IMEI number series, which is advantageous, for example, if a newly introduced smartphone model is generating problems across multiple UEs.
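A threshold-based lock-out can be sketched with a per-UE sliding window of message timestamps, plus a block list keyed on the Type Allocation Code (the first eight digits of the IMEI). This is a minimal illustration under stated assumptions: the class, the threshold of three messages per ten seconds and the IMEI values are hypothetical and are not the feature’s real configuration.

```python
from collections import defaultdict, deque


class UeSignalingControl:
    """Toy per-UE lock-out based on message count in a sliding time window."""

    def __init__(self, max_messages, window_seconds, blocked_tacs=()):
        self.max_messages = max_messages
        self.window_seconds = window_seconds
        self.blocked_tacs = set(blocked_tacs)   # IMEI series barred outright
        self.locked_out = set()
        self._history = defaultdict(deque)      # IMEI -> recent timestamps

    def attach_allowed(self, imei):
        """Reject attach for blocked device series and locked-out UEs."""
        return imei[:8] not in self.blocked_tacs and imei not in self.locked_out

    def on_message(self, imei, now):
        """Record one signaling message; return False once the UE is locked out."""
        window = self._history[imei]
        window.append(now)
        while window and now - window[0] > self.window_seconds:
            window.popleft()                     # forget messages outside the window
        if len(window) > self.max_messages:
            self.locked_out.add(imei)
        return imei not in self.locked_out


control = UeSignalingControl(max_messages=3, window_seconds=10,
                             blocked_tacs={"35123456"})
noisy, quiet = "490154203237518", "490154203237526"

results = [control.on_message(noisy, t) for t in (0, 1, 2, 3)]
print(results, control.attach_allowed(noisy), control.attach_allowed(quiet))
```

The fourth message inside the window trips the threshold for the noisy UE only; other UEs, and other device series, are unaffected.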


KEEPING ONE STEP AHEAD

Smartphone Lab

In the rapidly changing ecosystem of smartphone devices and applications and new services, Ericsson is taking a proactive measure to increase the resiliency of Ericsson-equipped operator networks. By working closely with device vendors, application developers and OS vendors, Ericsson gains a unique insight into how networks respond to the changing environment. This is the ‘Smartphone Lab’ initiative. Ericsson will stay ‘ahead of the curve’, protect networks and increase services availability by turning insight into solutions that will enable operators to stay in control of the signaling challenge.

Virtual Evolved Packet Core

Ericsson is providing a virtual EPC to support operators’ transitions to cloud. This will, for example, create new operator opportunities in the areas of machine-to-machine (M2M), Enterprise and Distributed Cloud for fast-growing markets. Existing ‘native’ and new ‘virtual’ network nodes will coexist seamlessly and feature the same high availability functions such as pooling, geo-redundancy and load sharing. This offers a very attractive proposition for operators moving to NFV.

Virtualized EPC solutions will benefit from the same full feature set and compatibility with native EPC by using the same software and a common Operations and Support System. This means that operators deploying Ericsson’s virtual EPC on Ericsson platforms or 3rd party platforms (having performed the required systems integration), will continue to enjoy market-leading compatibility with a whole range of connected devices and systems, from smartphones and RAN, to charging systems and services.

The Ericsson Cloud System, based on the Ericsson platforms or certified 3rd party hardware, adds cloud capabilities for operators while extending carrier grade operations from physical EPC nodes to virtual EPC nodes too.

MEETING MISSION CRITICAL NEEDS

The most exacting mission critical application for the general public demands the best of network infrastructure. That’s why Motorola’s Public Safety LTE solution includes an Ericsson Evolved Packet Core with components including SGSN-MME, EPG and SAPC.

With fast mobile broadband, firefighters can work with real-time video from a support helicopter, for example. All communications require a reliable network connection.

High availability is not all that’s required though. Priority mechanisms must ensure that those who need the network resources most will gain access when and where it is needed.

Ericsson and Motorola Solutions have entered into a strategic alliance to provide real-time broadband services to the public safety community. Together, they deliver the best user experience with the highest performing networks available, adapted to meet the specific needs of this unique sector.

Motorola Solutions offers a turnkey Public Safety LTE solution including sector-specific devices, applications and management systems. Ericsson contributes LTE hardware and software, including the system’s key enabler: the dynamic, interactive interplay between the network, applications and devices.

Differentiated services based on a user’s role, rank, jurisdiction, incident level and application are achieved through enhanced access control, QoS mechanisms, bandwidth modification and prioritization.


Reliability enables innovation and growth

As Hideyuki Tsukuda, Senior Vice President of Networks, SoftBank Mobile Corp. in Japan confirms, high availability is critical in helping operators satisfy their business objectives:

At SoftBank Mobile, providing highly reliable mobile broadband services to our customers is fundamental to achieving our aims. Ericsson’s highly resilient solutions have significantly contributed to our record of having no serious network incidents for more than three years, which is a key success factor for our customers.

Hideyuki Tsukuda, Senior Vice President of Networks, SoftBank Mobile Corp.

SUMMARY

Every service-affecting network failure has some impact on customer confidence and operator brand perception. That’s why achieving a ‘five-nines’ level of availability is not really sufficient by itself. The Evolved Packet Core is a particularly important part of an operator’s network because all traffic flows through it, and it is the ‘nerve-center’ or ‘control room’ for policy management and network signaling. Being such a strategically important part of the network it’s vitally important to establish and maintain a high availability EPC. Recent mobile network outages have become big news in the TV and on-line media, so the potential costs are very clear.

Ericsson has had a continual focus on delivering high availability, particularly in the Evolved Packet Core. Being first to market in the first live LTE network and a market leader ever since means that Ericsson has a wealth of experience in every related aspect of network design and support.

Especially important is the deep experience gained from helping operators to manage the complexities of introducing new end user devices and evolving applications.

Elevating high availability beyond ‘five-nines’ requires a commitment to designing telecom-grade equipment and software. It also requires a commitment to developing high availability-specific additional features both at the individual node level, between similar nodes in a high availability ‘Pool’, and at an Evolved Packet Core system level. Ericsson addresses all these areas and is constantly adding new high availability features, both for native and virtual EPC implementations.

End users will continue to express their preferences for network performance when considering brand loyalty, and with Ericsson EPC your network couldn’t be in safer ‘hands’.


Ericsson

SE-126 25 Stockholm, Sweden

Telephone +46 10 719 00 00

www.ericsson.com

10/287 01-FGB 101 256 rev A

© Ericsson AB 2014

REFERENCES

Ericsson ConsumerLab Mobility Report, June 2013: http://www.ericsson.com/res/docs/2013/ericsson-mobility-report-june-2013.pdf

Reference Story: Telstra, Australia: Superior performance http://www.ericsson.com/thecompany/our_publications/reference-stories-a-z/telstra-australia

Public Safety LTE http://www.ericsson.com/ourportfolio/government/public-safety-lte

Press Release: Evolved Packet Core provided in a virtualized mode industrializes NFV, February 2014. http://www.ericsson.com/thecompany/press/releases/2014/02/1761217