[RakutenTechConf2013] [A-4] The approach of Event in Japan Ichiba

Post on 23-Jan-2015


Rakuten Technology Conference 2013 "The approach of Big Event in Japan Ichiba" Yusuke Kobayashi, Osamu Iwasaki, Makito Hashiyama (Rakuten)

Transcript of [RakutenTechConf2013] [A-4] The approach of Event in Japan Ichiba

The approach of Big Event in Japan Ichiba

Vol.01   Oct/26/2013

Yusuke Kobayashi
Group Manager
Mall Group, Japan Ichiba Section
Rakuten Ichiba Development Department, Rakuten, Inc.
http://www.rakuten.co.jp/

2

Index

Big Sale in Ichiba

3

Introduce Me

Yusuke Kobayashi

Group Manager
Japan Ichiba Section, Japan Mall Group
Rakuten Ichiba Development Department
• Joined in 2005.
• First career in Rakuten was Infoseek.
• Transferred to Ichiba in 2009.

MALL RMS IBS IBE

Japan Ichiba Section

4

Index

1.Scale of Big Sales - Huge Traffic Scale, Amazing Sales.

2.History of success - Share how we could improve our services.

3.Case study - Show trouble case and explain the countermeasure. -- Checkout System -- Infrastructure(Cloud Environment/Network)

5

Shopping Marathon

Shop around points

6

Super Sale

Half Price Items, Point, Topic Items

7

Victory Sale

77% Off, Half Price, 1001 yen items

8

1.Scale of Big Sales - Huge Traffic Scale, Amazing Sales.

2.History of success - Share how we could improve our services.

3.Case study - Show trouble case and explain the countermeasure. -- Checkout System -- Infrastructure(Cloud Environment/Network)

9

1.Scale of Big Sale

15 billion yen in sales per day

10

1.Scale of Big Sale

Victory Sale: 5% of the traffic of entire Japan!!

11

1.Scale of Big Sale

Comparison of order numbers between big sale and usual.

[Chart: Order Number over Time (1 hour), two series: Sale and Usual]

12

1.Scale of Big Sales - Huge Traffic Scale, Amazing Sales.

2.History of success - Share how we could improve our services.

3.Case study - Show trouble case and explain the countermeasure. -- Checkout System -- Infrastructure(Cloud Environment/Network)

13

Monitoring by 100 DU members

14

Our Energy!

15

2. History

2012/03/04 00:00 – 2012/03/04 23:59

2012/06/03 00:00 – 2012/03/06 01:59

2012/12/02 00:00 – 2012/12/04 01:59

2013/03/03 00:00 – 2013/03/05 01:59

2013/06/02 00:00 – 2013/06/05 01:59

2013/09/01 00:00 – 2013/09/04 01:59

2013/09/27 00:00 – 2013/09/30 01:59

16

2012/03 – Super Sale

24h Limited, Half Price

Special Items

TV Commercial, Train AD

17

2012/03 – Super Sale

[Diagram: services involved in the sale: Top Page, Event Page, Search, ID, AD, Entry, Item Page, Bookmark, Purchase History, Review, Checkout, Search Engine, Point, Coupon]

18

2012/03 – Super Sale

19

2012/03 – Super Sale

20

2012/03 – Super Sale

Almost all services were delayed.
• Huge traffic.
• High application load.
• Bandwidth exceeded.
• High DB load.
• High NFS load.

The power of mass media was huge.

21

2012/03 – Super Sale

All we could do was restart and reboot.
• Change Apache configuration.
• Restart Apache.
• Reboot physical servers.
• Content deletion and access control by the Creative Web Design team.

22

2012/03 – Super Sale

Countermeasure
• Enhance Web/App/NW/DB servers.
• Bandwidth limitation.
• Tune middleware configuration.
• Decrease traffic.
• Content control.
• Web front-end speed-up.

23

2012/06 – Super Sale

!?

24

2012/06 – Super Sale

[Diagram: services involved in the sale: Top Page, Event Page, Search, ID, AD, Entry, Item Page, Bookmark, Purchase History, Review, Checkout, Search Engine, Point, Coupon]

25

2012/06 – Super Sale

The ID service went down and Checkout was delayed.

We extended the sale period because of the serious trouble.

26

ID Service

Fig.1: # of Login Successes
[Chart: every 2 hours from 6/2 22:00 to 6/4 2:00; series: #login 20120603, #login 20120527b, #login 20120304b]

Fig.2: # of DB Connection Errors
[Chart: every 30 minutes from 0:00AM Sun 6/3 to 2:00AM Mon 6/4; series: ID3 2/3, ID3 1/4]

[0:00 - 1:09] Just after the launch of Super Sale, DB connection errors occurred because of massive user access. Some users experienced connection errors. The errors resolved automatically as user access decreased.

[20:20 - 0:34] Serious DB connection errors occurred because of massive user access. Critical user login failures were caused by reboots of the ID services, login limitation, etc.

[22:37 - 0:42] Ichiba stopped using the ID service. Only purchases by non-members were processed.

[23:15 - 23:40] The ID service stopped because of a server reboot and a server going down.

[0:41] The DB connection errors ended just after we stopped the Fraud Access Management batch program, which ran every 10 min. Users were then able to log in smoothly.

Legend for Fig.1:
[Sun 6/3] Super Sale (2nd), this time
[Sun 5/27] Ordinary Sunday
[Sun 3/4] Super Sale (1st), last time

27

ID Service

ID Service had serious DB connection errors during the following time periods.

(1) 0:00 - 1:09 (Sun, 6/3-2012)
Just after the launch of Super Sale, DB connection errors occurred because of massive user access. Some users experienced connection errors.
# of DB Connection Errors = 160,497 (in 10 min) (ref. # of Login Successes = 2,167,126)

(2) 20:20 (Sun, 6/3-2012) - 0:34 (Mon, 6/4)
Serious DB connection errors occurred because of massive user access. Critical user login failures were caused by reboots of the ID services, login limitation, etc.
22:37 - 0:42: Ichiba stopped using the ID service. Only purchases by non-members were processed.
# of DB Connection Errors = 4,063,667 (in 4 hours 15 min)

This impacted the entire Rakuten Group.

28

Checkout

[Diagram: Checkout system with Web, APP, and API server tiers, annotated with the changes below]

• Enhance the instances (marked in two places)
• Change the configuration of thread
• Change the threshold

29

Have no time!!!!

30

2012/06 – Super Sale

Countermeasure for next Super Sale.
• Migration of DB servers (ID).
• Enhancement of application servers.
-> Transfer to the Cloud environment instead of using physical servers.
• Load test on the production environment.
-> We had tested on the staging environment, but it was not enough. The staging and production environments differ in:
• User Numbers
• Item Numbers
• Transaction
• Server Spec

31

2012/12 – Super Sale

32

2012/12 – Super Sale

[Diagram: services involved in the sale: Top Page, Event Page, Search, ID, AD, Entry, Item Page, Bookmark, Purchase History, Review, Cart, Search Engine, Point, Coupon]

33

2012/12 – Super Sale

The first peak time -> Down
This was the highest traffic of the year.

Search, Item Page and Checkout were down.

34

Search

[Diagram: Search Web and APP servers, with an LB in front of the Search Engine APP servers; the LB is marked as high load and enhanced]

Application load was high.

At the beginning of the first peak time, the search applications were under high load because huge traffic came from the event contents.
-> We added 65 instances using the Cloud environment within 4 hours.

The LB which we enhanced was under high load.

35

Item Page

12/02 9:00 pm to 24:00

Disk Util was 100%

36

Item Page

The connection between App and NFS was delayed.

37

Item Page

• We switched to Akamai during peak time.
• Cache for 25 min (at Mikitani-san's suggestion).
• Inventory data was not updated in real time.
• This was an emergency countermeasure.

38

Item Page

39

Checkout

[Diagram: Checkout system with Web, APP, and API server tiers]

High Load

40

2012/12 – Super Sale

Countermeasures for next
• Continue load testing the checkout system.
• Decrease the number of NFS calls in the Item Page system.
• Transfer to the Cloud environment gradually.

41

Item Page

AppServer
• Decrease unnecessary file calls.

Memcached Server
• Cache more kinds of files.
- Shop data, Layout data.
• Re-cache whenever the cache expires, whether or not the file has been updated.
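The re-cache rule above can be illustrated with a small sketch. It is not the actual Item Page code: `TtlCache`, `load_from_files`, and the 60-second TTL are hypothetical names and values, and a plain dictionary stands in for the Memcached server. The point it shows is that a fresh entry is served without touching the file store, and an expired entry is re-read unconditionally.

```python
import time
from typing import Callable, Dict, Tuple


class TtlCache:
    """In-memory stand-in for a memcached-style cache with a fixed TTL (hypothetical)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries: Dict[str, Tuple[float, bytes]] = {}
        self.backend_reads = 0  # how often we fell through to the file store

    def get(self, key: str, load_from_files: Callable[[str], bytes]) -> bytes:
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]  # still fresh: no file (NFS) call
        # Expired or missing: re-read without checking whether the file changed.
        value = load_from_files(key)
        self.backend_reads += 1
        self._entries[key] = (now + self.ttl_seconds, value)
        return value


# Demo with a fake clock so that expiry can be shown without waiting.
fake_now = [0.0]
cache = TtlCache(ttl_seconds=60, clock=lambda: fake_now[0])


def load_layout(key: str) -> bytes:
    # Stand-in for reading shop or layout data from the shared file system.
    return ("layout for " + key).encode()


first = cache.get("shop-1", load_layout)   # miss: reads the file store
second = cache.get("shop-1", load_layout)  # hit: served from cache
fake_now[0] = 61.0                         # the entry is now past its TTL
third = cache.get("shop-1", load_layout)   # expired: re-cached
```

Skipping the "has the file changed?" check trades a little freshness for fewer file-system calls, which is the direction the slide describes.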

42

Item Page: Before / After

30% Down

43

Checkout

[Diagram: JMeter sends load to the WEB server and APP server, which use the Cache, the DataBase, and API α, API β, API γ, API δ]

Load test run about 50 times at midnight…

44

Checkout

We always run many load tests on the checkout systems before Super Sale.
Hashiyama will explain later…

45

2013/03 – Super Sale

46

2013/03 – Super Sale

[Diagram: services involved in the sale: Top Page, Event Page, Search, ID, AD, Entry, Item Page, Bookmark, Purchase History, Review, STEP, Cart, Search Engine, BasketAPI, Point, Coupon]

47

No Trouble!!

Success!!

48

2013/06 & 2013/09 – Super Sale

No Big Trouble!

49

From this year…

50

Victory Sale

Only 3 weeks to prepare for this event..

51

Victory Sale

1st Spike. From Yahoo!

2nd Spike. Event Started.

3rd Spike. Newspaper AD

52

Victory Sale

[Diagram: services involved in the sale: Top Page, Event Page, Search, ID, AD, Entry, Item Page, Bookmark, Purchase History, Review, STEP, Cart, Search Engine, BasketAPI, Point, Coupon]

53

Top & Search

Top Page

Search

The traffic was much higher than we expected: around 6 or 7 times!!!!

The countermeasure for this was simply enhancement.

54

1.Scale of Big Sales - Huge Traffic Scale, Amazing Sales.

2.History of success - Share how we could improve our services.

3.Case study - Show trouble case and explain the countermeasure. -- Checkout System -- Infrastructure(Cloud Environment/Network)

55

About Me

Name Makito Hashiyama(@capyogu)

Role Team manager of APIs for Rakuten Ichiba

Recent activity GlassFish Community Feedback @ JavaOne 2013

Contact makito.hashiyama@mail.rakuten.com

56

Overview of Rakuten Ichiba Checkout

Architecture
• Behaves like a service bus based on SOA
• Calls more than 15 external APIs, mashes them up, then provides the result

Scale
• More than 100 application servers

[Diagram: Client side -> SOAP/REST -> Checkout API, which uses the KVS and the External APIs]

57

Overview of Rakuten Ichiba Checkout

Provides over 50 services to client side
• Get/Set shopper information
• Get/Set merchant information
• Validation
• Update inventory
• Register order data into DB
• etc.

[Diagram: Client side -> SOAP/REST -> Checkout API -> Item API, Cart API, Merchant API]

58

Overview of Rakuten Ichiba Checkout

Stateful API
• Manages session information on behalf of the client side
• Creates a unique key and manages it with the KVS
• The client side only has to call the API with the key

[Diagram: Client side -> SOAP/REST -> Checkout API -> KVS]

key     value
Key1    Session1
Key2    Session2
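The key/session mapping above can be sketched as follows. This is an illustration only, not the Checkout API itself: `SessionStore` and its methods are hypothetical names, a dictionary stands in for the KVS, and using a random UUID as the unique key is an assumption.

```python
import uuid
from typing import Any, Dict


class SessionStore:
    """Keeps checkout session state on the server side, keyed by a unique key (hypothetical)."""

    def __init__(self) -> None:
        # A plain dict stands in for the KVS in this sketch.
        self._kvs: Dict[str, Dict[str, Any]] = {}

    def create_session(self) -> str:
        key = uuid.uuid4().hex  # assumed key scheme: random and unique
        self._kvs[key] = {}
        return key

    def update(self, key: str, **fields: Any) -> None:
        self._kvs[key].update(fields)

    def load(self, key: str) -> Dict[str, Any]:
        return dict(self._kvs[key])  # return a copy so callers cannot mutate the store


store = SessionStore()
key1 = store.create_session()
key2 = store.create_session()

# The client side keeps only the key; everything else lives in the store.
store.update(key1, shopper_id="shopper-1", cart=["item-42"])
store.update(key2, shopper_id="shopper-2")
```

Because the state is looked up by key on every call, any application server that can reach the KVS can serve the next request for the same session.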

59

Overview of Rakuten Ichiba Checkout

Rakuten Super Sale
• Biggest online sale in Japan
• It causes a huge amount of traffic

Performance bottleneck
• External APIs called by our API slowed down
• We needed to improve the system at the peak time

[Diagram: Checkout API -> SOAP/REST -> External API; the external API slows down and the delay propagates back to the Checkout API]

60

How to execute load test

[Diagram: JMeter -> Checkout API -> SOAP/REST -> External API, with delay]

Environment
• On production at midnight
• Executed over 50 times with JMeter

Test case
• 100,000 dummy shoppers
• 1,000,000 dummy items / 6,500 dummy merchants
• Reproduce the sale's load as much as possible
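As a rough illustration of the test-case sizes above, the sketch below generates dummy identifiers lazily. The ID formats and the round-robin assignment of items to merchants are assumptions made for this example; the slides do not say how the real test data was built.

```python
from itertools import islice
from typing import Iterator, Tuple

# Sizes taken from the slide.
NUM_SHOPPERS = 100_000
NUM_ITEMS = 1_000_000
NUM_MERCHANTS = 6_500


def dummy_shoppers(count: int = NUM_SHOPPERS) -> Iterator[str]:
    """Yield shopper IDs one at a time (hypothetical ID format)."""
    return (f"shopper-{i:06d}" for i in range(count))


def dummy_items(count: int = NUM_ITEMS, merchants: int = NUM_MERCHANTS) -> Iterator[Tuple[str, str]]:
    """Yield (item_id, merchant_id) pairs, spreading items round-robin over merchants."""
    for i in range(count):
        yield f"item-{i:07d}", f"merchant-{i % merchants:04d}"


first_items = list(islice(dummy_items(), 3))
shopper_count = sum(1 for _ in dummy_shoppers())
items_per_merchant = NUM_ITEMS / NUM_MERCHANTS  # about 154 items per merchant on average
```

Generators keep memory use flat, so the same functions could feed a CSV file for a load tool without holding a million rows at once.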

61

How to execute load test

[Diagram: JMeter sends load to the WEB server and APP server, which use the Cache, the DataBase, and API α, API β, API γ, API δ]

As a result, the bottleneck moved to the APP server

62

Improvement to handle a huge traffic

[Diagram: the Client side sends Requests to the Checkout API; a Task Queue feeds many Worker Threads, each waiting on delayed External APIs]

CPU load was high

63

Improvement to handle a huge traffic

[Diagram: the Client side sends Requests to the Checkout API; the Task Queue now feeds fewer Worker Threads, which call the External APIs]

(1) According to vmstat, 'run queue' was very high
(2) Decrease worker threads to keep 'run queue' low
(3) As a result, latency increased but throughput was improved
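The structure in steps (1) to (3) can be sketched as a fixed-size worker pool in front of a task queue. This is only a structural illustration: `MAX_WORKERS = 4`, `handle_request`, and the sleep that stands in for an external API call are assumptions, and the sketch does not reproduce the CPU run-queue effect measured with vmstat.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4  # hypothetical cap; the slides do not give the real value


class ConcurrencyMeter:
    """Records how many requests are being processed at the same time."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.current = 0
        self.peak = 0

    def __enter__(self) -> "ConcurrencyMeter":
        with self._lock:
            self.current += 1
            self.peak = max(self.peak, self.current)
        return self

    def __exit__(self, *exc_info) -> bool:
        with self._lock:
            self.current -= 1
        return False


meter = ConcurrencyMeter()


def handle_request(request_id: int) -> int:
    with meter:
        time.sleep(0.01)  # stands in for a slow external API call
        return request_id


# Requests beyond MAX_WORKERS wait in the executor's internal queue,
# so each one may wait longer, but the number of active threads stays bounded.
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    results = list(pool.map(handle_request, range(20)))
```

The design choice is to let excess requests queue rather than start more threads than the machine can run, which is why per-request latency can rise while overall throughput improves.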

64

Improvement to handle a huge traffic

As a result…
• Checkout API could process over 12,000 transactions / minute
• We also achieved 30,000 TPM in load test (we did it just yesterday!!!)
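For readers who think in per-second rates, the two figures above convert as follows. This is plain arithmetic on the numbers from the slide; the function name is only for illustration.

```python
def per_second(transactions_per_minute: float) -> float:
    """Convert a transactions-per-minute figure to transactions per second."""
    return transactions_per_minute / 60


production_tps = per_second(12_000)  # rate the Checkout API could process
load_test_tps = per_second(30_000)   # rate achieved in the load test
load_test_ratio = 30_000 / 12_000    # load test result relative to the first figure
```

So the load test reached 2.5 times the 12,000 TPM figure, or roughly 500 transactions per second against 200.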

65

Overview of Rakuten Ichiba Checkout

In the future

• Set SLA for each external API
• Resolve performance issues
- Synchronous vs. Asynchronous
- Upgrade library / middleware / JDK
- Deep copy (copy constructor vs. serialize)

66

Self introduction

Name : Osamu Iwasaki

Role : Network / Cloud Eng & Mgr

Vice Group Manager, Server Platform Group / Network Administration Group

Global Infrastructure Development Department

And committee member of JANOG (JApan Network Operators' Group)

Twitter @osamuiwasaki
Skype osamu.iwasaki

67

Our traffic history

[Chart: traffic over time]

Peak traffic at Victory Sale was over 140Gbps, which was more than 5% of Japan's Internet traffic

68

Our traffic history

[Chart: traffic over time, with the Super Sales and Victory Sale peaks marked]

69

Network traffic trend from 2012/Jan(SS traffic focus)

SuperSale    2012June  2012Dec  2012Mar  2013June  2013Sep  2013Oct(VS)
CDN          60G       78.9G    69.1G    75.8G     73.7G    127.6G
RakutenDC    12.7G     14.2G    12.8G    12.5G     11.7G    12.9G
Total        72.7G     93.1G    81.9G    88.3G     85.4G    140.5G

[Chart: network traffic per sale, CDN and RakutenDC]
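The table can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures from the slide, read left to right, confirms that each Total is CDN plus RakutenDC, and derives the CDN share of each peak. The variable names are for illustration.

```python
# Peak traffic per sale in Gbps, in the column order shown on the slide.
cdn = [60.0, 78.9, 69.1, 75.8, 73.7, 127.6]
rakuten_dc = [12.7, 14.2, 12.8, 12.5, 11.7, 12.9]
total = [72.7, 93.1, 81.9, 88.3, 85.4, 140.5]

# Each Total should equal CDN + RakutenDC within rounding.
totals_consistent = all(
    abs((c + d) - t) < 0.05 for c, d, t in zip(cdn, rakuten_dc, total)
)

# Share of each peak that was served by the CDN, in percent.
cdn_share = [round(100 * c / t, 1) for c, t in zip(cdn, total)]

# Spread of the data-center traffic across all sales.
dc_min, dc_max = min(rakuten_dc), max(rakuten_dc)
```

Read this way, the data-center traffic stays between 11.7G and 14.2G across all sales, while the CDN absorbs the growth and carries about 91% of the Victory Sale peak.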

70

PC/FeaturePhone/Smartphone/Tablet share by Sales

Mobile traffic is increasing rapidly!! Almost 50%

71

Our private cloud history

About 1 year ago, we started with 300 VMs. Now around 10,000 VMs are running for Rakuten Ichiba services. That is over 30 times as many as last year!!!

72

Victory Sale

Only 3 weeks to prepare for this event..

73

But !!

74

Our Load Balancers went down……

75

What happened at peak time

[Charts: LoadBalancer-A CPU utilization and LoadBalancer-B CPU utilization, with Peak Time marked on each]

Due to heavy traffic at the Victory Sale start time, the CPU load of the LoadBalancers grew rapidly……

76

Result after the re-allocation operation

[Charts: LoadBalancer-A CPU utilization and LoadBalancer-B CPU utilization]

After the VIP re-allocation, we could move the heavy traffic to the other LoadBalancer

77

Our counter action for next Victory sale

Regular time: 3 times peak capable
[Diagram: Internet -> Active SLB (Target CPU under 30%) serving VIP Group A and VIP Group B; Standby SLB (CPU 0%)]

Big Sale time: 6 times peak capable
[Diagram: Internet -> Active SLB (Target CPU under 15%) serving VIP Group A; Standby SLB (Target CPU under 15%) serving VIP Group B]

3 times is not enough for us; we need 6 times for the Super/Victory sales.
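The relation between the CPU targets and the "3 times" and "6 times" figures above can be sketched with simple arithmetic. This assumes, as a simplification not stated on the slide, that load balancer CPU grows linearly with traffic; `peak_multiple` is a name used only for this example.

```python
def peak_multiple(target_cpu_percent: float) -> float:
    """How many times the regular load fits before CPU reaches 100%.

    Assumes CPU utilization scales linearly with traffic, which is a simplification.
    """
    return 100.0 / target_cpu_percent


regular_time = peak_multiple(30)   # one active SLB kept under 30% CPU
big_sale_time = peak_multiple(15)  # load split so that each SLB stays under 15% CPU

# Rounded down, these match the slide's figures.
regular_capable = int(regular_time)
big_sale_capable = int(big_sale_time)
```

Halving the per-device CPU target by spreading the VIP groups over both load balancers is what doubles the headroom.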

78

Next Victory sale ready?

79

Next Victory sale ready?

Yes, we are ready !!!

80

Wrap UP

81

Wrap Up

• Traffic : 5% of entire Japan.
• Sales : Over 15B yen/day
• Continuously Tuning/Improvement
• Cloud environment

82

Global Expansion - Super Sale

83

Worldwide Rakuten Super Sale

In future

84

And…

85

86

87

Thank you for listening.

Yusuke Kobayashi

@okoba23

Makito Hashiyama

@capyogu

Osamu Iwasaki

@osamuiwasaki