Ceph Object Storage at Spreadshirt (July 2015, Ceph Berlin Meetup)
Ceph Object Storage at Spreadshirt
How we started
July 2015
Jens Hadlich, Chief Architect
Ansgar Jazdzewski, System Engineer
Ceph Berlin Meetup
About Spreadshirt
2
Spread it with Spreadshirt
A global e-commerce platform for everyone to create, sell and buy ideas on clothing and accessories across many points of sale.
• 12 languages, 11 currencies
• 19 markets
• 150+ shipping regions
• community of >70,000 active sellers
• € 72M revenue (2014)
• >3.3M items shipped (2014)
Object Storage at Spreadshirt
• Our main use case
  – Store and read primarily user-generated content, mostly images
• Some tens of terabytes (TB) of data
• 2 typical sizes:
  – a few dozen KB
  – a few MB
• Up to 50,000 uploads per day
• Read > Write
3
Object Storage at Spreadshirt
• "Never change a running system"?
  – Current solution (from our early days):
    • Big storage, well-branded vendor
    • Lots of files / directories / sharding
  – Problems:
    • Regular UNIX tools are unusable in practice
    • Not designed for "the cloud" (e.g. replication is an issue)
    • Performance bottlenecks
  – Challenges:
    • Growing number of users → more content
    • Build a truly global platform (multiple regions and data centers)
4
Ceph
• Why Ceph?
  – Vendor independent
  – Open source
  – Runs on commodity hardware
  – Local installation for minimal latency
  – Existing knowledge and experience
  – S3 API (see the access sketch after this list)
    • Simple bucket-to-bucket replication
  – A good fit also for < 1 petabyte
  – Easy to add more storage
  – (Can be used later for block storage)
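A minimal sketch of what S3-API access to RadosGW could look like from a client, assuming boto3; the endpoint URL, credentials and bucket/key names are placeholders, not values from the deck:

    # Minimal sketch: S3-style access to a RadosGW endpoint via boto3.
    # Endpoint, credentials, bucket and key are illustrative placeholders.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://radosgw.example.internal",  # e.g. HAProxy in front of RadosGW
        aws_access_key_id="RGW_ACCESS_KEY",
        aws_secret_access_key="RGW_SECRET_KEY",
    )

    s3.create_bucket(Bucket="designs")
    s3.put_object(Bucket="designs", Key="user-123/motif.png", Body=b"...image bytes...")
    obj = s3.get_object(Bucket="designs", Key="user-123/motif.png")
    print(obj["ContentLength"])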
5
Ceph Object Storage Architecture
6
Overview
[Overview diagram] Clients speak HTTP (S3 or Swift API) to the Ceph Object Gateway. Three monitors and a lot of nodes and disks (OSDs) form RADOS (reliable autonomic distributed object store), connected through a public network and a separate cluster network.
Ceph Object Storage Architecture
7
A little more detailed
[Diagram] Clients reach RadosGW (the Ceph Object Gateway, built on librados) over HTTP (S3 or Swift API). An odd number of monitors forms the quorum. Each OSD node has some SSDs (for journals) and more HDDs as JBOD (no RAID). Public network: 1G; cluster network: 10G (the more the better). Together the OSD nodes form RADOS (reliable autonomic distributed object store).
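Since RadosGW is itself a librados client, objects can also be written and read directly against RADOS. A minimal sketch using the python-rados bindings; the pool name and object key are made-up examples:

    # Minimal sketch: talking to RADOS directly via the python-rados (librados) bindings.
    # Pool name and object key are illustrative only; the pool must already exist.
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx("mypool")
        ioctx.write_full("hello-object", b"stored directly in RADOS")
        print(ioctx.read("hello-object"))
        ioctx.close()
    finally:
        cluster.shutdown()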
Ceph Object Storage at Spreadshirt
8
Initial Setup
[Diagram] Clients send HTTP (S3 or Swift API) requests to HAProxy, which balances across a RadosGW instance running on each cluster node. 3 monitors. Each cluster node: 3 x SSD (journal / index), 9 x HDD (data), xfs. Networks: 2 x 1G public, 2 x 10G cluster (OSD replication).
Ceph Object Storage at Spreadshirt
9
Initial Setup
• Hardware configuration
  – 5 x Dell PowerEdge R730xd
    • Intel Xeon E5-2630v3, 2.4 GHz, 8C/16T
    • 64 GB RAM
    • 9 x 4 TB NLSAS HDD, 7.2K
    • 3 x 200 GB SSD Mixed Use
    • 2 x 120 GB SSD for boot & Ceph monitors (LevelDB)
    • 2 x 1 Gbit + 4 x 10 Gbit network
10
Performance – First smoke tests
Ceph Object Storage Performance
11
First smoke tests
• How fast is RadosGW?
  – Response times (read / write)
    • Average?
    • Percentiles (P99)?
  – Throughput?
  – Compared to AWS S3?
• A first (very minimalistic) test setup
  – 3 VMs (KVM), all with RadosGW, monitor and 1 OSD
    • 2 cores, 4 GB RAM, 1 OSD each (15 GB + 5 GB), SSD, 10G network between nodes, HAProxy (round-robin), LAN, HTTP
  – No further optimizations (a rough benchmark sketch follows below)
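A rough sketch of the kind of smoke test described here: N parallel threads issuing random 4 KB reads against the S3 endpoint and reporting average and P99 latency. boto3, the endpoint and the key layout are assumptions, not the tool the team actually used:

    # Rough smoke-test sketch (not the team's actual tool): random reads with
    # N parallel threads, reporting average and P99 response times in ms.
    import random, time
    from concurrent.futures import ThreadPoolExecutor
    import boto3

    ENDPOINT = "http://radosgw.example.internal"   # placeholder
    BUCKET, OBJECTS, THREADS, REQUESTS = "bench", 1000, 16, 5000

    s3 = boto3.client("s3", endpoint_url=ENDPOINT,
                      aws_access_key_id="KEY", aws_secret_access_key="SECRET")

    def one_read(_):
        key = "obj-%05d" % random.randrange(OBJECTS)   # objects uploaded beforehand
        start = time.time()
        s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        return (time.time() - start) * 1000.0

    with ThreadPoolExecutor(max_workers=THREADS) as pool:
        latencies = sorted(pool.map(one_read, range(REQUESTS)))

    print("avg %.1f ms" % (sum(latencies) / len(latencies)))
    print("p99 %.1f ms" % latencies[int(len(latencies) * 0.99) - 1])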
Ceph Object Storage Performance
12
First smoke tests
• How fast is RadosGW?
  – Random read and write
  – Object size: 4 KB
• Results: pretty promising!
  – E.g. 16 parallel threads, read:
    • Avg: 9 ms
    • P99: 49 ms
    • > 1,300 requests/s
Ceph Object Storage Performance
13
First smoke tests
• Compared to Amazon S3?
  – Comparing apples and oranges (unfair, but interesting)
    • http vs. https, LAN vs. WAN etc.
• Response times
  – Random read, object size: 4 KB, 4 parallel threads, client location: Leipzig

                 Ceph S3 (Test)   AWS S3 eu-central-1   AWS S3 eu-west-1
    Location     Leipzig          Frankfurt             Ireland
    Avg          6 ms             25 ms                 56 ms
    P99          47 ms            128 ms                374 ms
    Requests/s   405              143                   62
14
Performance – Now with the final hardware
Ceph Object Storage Performance
15
Now with the final hardware
• How fast is RadosGW? – Random read and write – Object size: 4 KB
• Results: – E.g. 16 parallel threads, read:
• Avg 4 ms • P99 43 ms • > 2.800 requests/s
Ceph Object Storage Performance
16
Now with the final hardware
[Chart] Average response times in ms (4 KB object size) for read and write, plotted over 1 to 32 client threads.
Ceph Object Storage Performance
17
Now with the final hardware
[Chart] Read response times in ms (4 KB object size), average and P99, plotted over 1 to 32 client threads plus a 32+32 (two clients) run.
Ceph Object Storage Performance
18
Now with the final hardware
[Chart] Read requests/s for 4 KB and 128 KB object sizes, plotted over 1 to 32 client threads plus a 32+32 (two clients) run.
• 1 client / 8 threads: 1G network almost saturated at ~115 MB/s
• 2 clients: 1G network saturated again, but scale-out works :)
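As a back-of-the-envelope check on that saturation point (my arithmetic, not from the deck): a 1 Gbit/s link carries at most 125 MB/s, so ~115 MB/s of payload is essentially line rate after protocol overhead, and at 128 KB per object that corresponds to roughly 900 requests/s per client link:

    # Back-of-the-envelope check (not from the deck): 1 Gbit/s link vs. 128 KB objects.
    link_mb_per_s = 1_000_000_000 / 8 / 1_000_000   # ~125 MB/s theoretical maximum
    observed_mb_per_s = 115                          # reported saturation point
    req_per_s = observed_mb_per_s * 1000 / 128       # 128 KB objects -> ~900 requests/s
    print(link_mb_per_s, req_per_s)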
19
Monitoring
Monitoring
20
Grafana rulez :)
21
Global availability
Global Availability
22
• 1 Ceph cluster per data center
• S3 bucket-to-bucket replication
• Multiple regions, local delivery
23
Currently open issues / operational tasks
Open issues / operational tasks
24
• Backup
  – s3fs-fuse too slow
  – Set up another Ceph cluster?
• Security
  – Users
  – ACLs
• Migration of old data
  – Upload all existing files via script (a rough sketch follows below)
  – Use the old system as fallback / in parallel
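A rough idea of what such a migration upload script could look like, assuming the old system is reachable as a mounted filesystem and boto3 talks to the RadosGW endpoint; paths, bucket name and credentials are placeholders:

    # Rough migration sketch (placeholders throughout): walk the old storage tree
    # and upload every file to the new RadosGW bucket, keeping the relative path as key.
    import os
    import boto3

    SOURCE_ROOT = "/mnt/old-storage"            # old system mounted read-only
    BUCKET = "legacy-images"

    s3 = boto3.client("s3", endpoint_url="http://radosgw.example.internal",
                      aws_access_key_id="KEY", aws_secret_access_key="SECRET")

    for dirpath, _dirnames, filenames in os.walk(SOURCE_ROOT):
        for name in filenames:
            path = os.path.join(dirpath, name)
            key = os.path.relpath(path, SOURCE_ROOT)
            s3.upload_file(path, BUCKET, key)   # no retry / resume handling here
            print("uploaded", key)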
Open issues / operational tasks
25
• Replication
  – Test-drive radosgw-agent
  – s3cmd? A custom tool? (a naive copy sketch follows below)
  – Metadata (users)
  – Data
• Performance?
• Bucket notification
  – Currently unsupported by RadosGW
  – Build a custom solution?
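A naive sketch of what a custom bucket-to-bucket copy could look like with boto3: list the source bucket, fetch each object and put it into the destination cluster. Endpoints, credentials and bucket names are placeholders, and this ignores deletes, metadata and incremental sync:

    # Naive bucket-to-bucket copy sketch (placeholders throughout): full copy of
    # all objects; no deletes, no metadata sync, no incremental logic.
    import boto3

    src = boto3.client("s3", endpoint_url="http://ceph-eu.example.internal",
                       aws_access_key_id="KEY", aws_secret_access_key="SECRET")
    dst = boto3.client("s3", endpoint_url="http://ceph-us.example.internal",
                       aws_access_key_id="KEY", aws_secret_access_key="SECRET")

    SRC_BUCKET = DST_BUCKET = "designs"

    paginator = src.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SRC_BUCKET):
        for entry in page.get("Contents", []):
            body = src.get_object(Bucket=SRC_BUCKET, Key=entry["Key"])["Body"].read()
            dst.put_object(Bucket=DST_BUCKET, Key=entry["Key"], Body=body)
            print("copied", entry["Key"])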
Open issues / operational tasks
26
• Scrubbing
• Rebuild
To be continued ...
Thank You! [email protected] [email protected]