Post on 21-Dec-2015
Building Network-Centric Systems
Liviu Iftode

Before WWW, people were happy...
- Mostly local computing
- Occasional TCP/IP networking with low expectations and mostly non-interactive traffic
  - local area networks: file server (NFS)
  - wide area networks (Internet): E-mail, Telnet, FTP
- Networking was not a major concern for the OS
[Diagram: CS.rutgers.EDU and CS.umd.EDU hosts connected over TCP/IP; NFS, E-mail, Telnet and Emacs as typical applications]
One Exception: Cluster Computing
- Cost-effective solution for high-performance distributed computing
- TCP/IP networking was the headache: large software overheads
- Software DSM: not a network-centric system :-(
[Diagram: multicomputers vs. clusters of computers]
The Great WWW Challenge
- World Wide Web made access over the Internet easy
- Internet became commercial
- Dramatic increase of interactive traffic
- WWW networking creates a network-centric system: the Internet server
  - performance: service more network clients
  - availability: be accessible all the time over the network
  - security: protect resources against network attacks
[Diagram: web browsing of http://www.Bank.com over TCP/IP to the Bank.com server]
Network-Centric Systems
Networking dominates the operating system
- Mobile Systems: mobility-aware TCP/IP (Mobile IP, I-TCP, etc.), disconnected file systems (Coda), adaptation-aware applications for mobility (Odyssey), etc.
- Internet Servers: resource allocation (Lazy Receiver Processing, Resource Containers), OS shortcuts (Scout, IO-Lite), etc.
- Pervasive/Ubiquitous Systems: TinyOS, sensor networks (Directed Diffusion, etc.), programmability (One.world, etc.)
- Storage Networking: network-attached storage (NASD, etc.), peer-to-peer systems (OceanStore, etc.), secure file systems (SFS, Farsite), etc.
Big Picture
- Research sparked by various OS-networking tensions
- Shift of focus from performance to availability and manageability
- Networking and storage I/O convergence
- Server-based and serverless systems
- TCP/IP and non-TCP/IP protocols
- Local-area, wide-area, ad-hoc and application/overlay networks
- Significant interest from industry
Outline
- TCP Servers
- Migratory TCP and Service Continuations
- Cooperative Computing, Smart Messages and Spatial Programming
- Federated File Systems
- Talk Highlights and Conclusions
Problem 1: TCP/IP is too Expensive
Breakdown of CPU time for Apache (uniprocessor-based web server):
- Network processing: 71%
- User space: 20%
- Other system calls: 9%
Traditional Send/Receive Communication
[Diagram: sender app calls send(a); the OS does copy(a, send_buf) and DMA(send_buf, NIC); send_buf is transferred over the network; the receiver NIC raises an interrupt, and the OS does DMA(NIC, recv_buf) and copy(recv_buf, b) to complete receive(b)]
A Closer Look
Apache CPU time breakdown:
- TCP send: 45%
- User space: 20%
- Software interrupt processing: 11%
- Other system calls: 9%
- Hardware interrupt processing: 8%
- TCP receive: 7%
- IP send: ~0%
- IP receive: ~0%
Multiprocessor Server Performance Does not Scale
[Chart: throughput (requests/s) vs. offered load (connections/s) for uniprocessor and dual-processor configurations]
Apache web server 1.3.20 on 1-way and 2-way 300 MHz Pentium II SMP, with clients repeatedly accessing a static 16 KB file
TCP/IP-Application Co-Habitation
- TCP/IP "steals" compute cycles and memory from applications
- TCP/IP executes in kernel mode: mode-switching overhead
- TCP/IP executes asynchronously
  - interrupt-processing overhead
  - internal synchronization on multiprocessor servers causes execution serialization
  - cache pollution
- Hidden "service work"
  - TCP packet retransmission
  - TCP ACK processing
  - ARP request service
- Extreme cases can compromise server performance
  - receive livelocks
  - denial-of-service (DoS) attacks
Two Solutions
- Replace TCP/IP with a lightweight transport protocol
- Offload some/all of the TCP processing from the host to a dedicated computing unit (processor, computer or "intelligent" network interface)
Industry: high-performance, expensive solutions
- Memory-to-memory communication: InfiniBand
- "Intelligent" network interface: TCP Offload Engine (TOE)
Cost-effective and flexible solution: TCP Servers
Memory-to-Memory (M-M) Communication
[Diagram: sender and receiver memory buffers connected by Remote DMA between the NICs, bypassing the OS; contrasted with send/receive through TCP/IP in the OS]
Memory-to-Memory Communication is Non-Intrusive
Sender: low overhead. Receiver: zero overhead.
[Diagram: RDMA_Write(a,b) transfers buffer a directly into buffer b through the NICs; b is updated without involving the receiver's CPU]
TCP Server at a Glance
- A software offloading architecture using existing hardware
- Basic idea: dedicate one or more computing units exclusively to TCP/IP
- Compared to TOE:
  - tracks technology better: latest processors
  - flexible: adapts to changing load conditions
  - cost-effective: no extra hardware
- Isolates application computation from network processing
- Eliminates network interrupts and context switches
- Efficient resource allocation
- Additional performance gains (zero-copy) with an extended socket API
- Related work
  - very preliminary offloading solutions: Piglet, CSP
  - Sockets Direct Protocol, zero-copy TCP
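The isolation idea can be illustrated with a minimal sketch (Python for brevity): an application thread enqueues send requests into a shared queue, and a dedicated thread plays the role of the TCP Server processor, so application code never executes network processing. All names here are illustrative, not part of the actual TCP Server implementation.

```python
# Illustrative sketch: application thread hands work to a dedicated
# "network processing" thread through a shared queue (stands in for
# the shared memory between application and TCP Server processors).
import queue
import threading

def run_offload_demo(requests):
    q = queue.Queue()
    processed = []

    def tcp_server():                # plays the dedicated TCP Server core
        while True:
            item = q.get()
            if item is None:         # shutdown sentinel
                return
            processed.append(item)   # real system: TCP/IP send processing

    t = threading.Thread(target=tcp_server)
    t.start()
    for r in requests:               # application side: enqueue and move on
        q.put(r)
    q.put(None)
    t.join()
    return processed
```

The point of the split is that the application thread never blocks on protocol work; only the dedicated thread touches it.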
Two TCP Server Architectures
[Diagram 1: TCP Servers for multiprocessor servers - TCP Server and server application run on separate CPUs over shared memory, with TCP/IP executed by the TCP Server]
[Diagram 2: TCP Servers for cluster-based servers - a TCP Server node and a server application node connected by M-M communication]
Where to Split TCP/IP Processing? (How much to offload?)
SEND path: copy_from_application_buffers -> TCP_send -> IP_send -> packet_scheduler -> setup_DMA -> packet_out
RECEIVE path: packet_in -> interrupt_handler -> software_interrupt_handler -> IP_receive -> TCP_receive -> copy_to_application_buffers
The split point divides the work between the application processors (application code, system calls) and the TCP Servers.
Evaluation Testbed
- Multiprocessor server: 4-way 550 MHz Intel Pentium II system running the Apache 1.3.20 web server on Linux 2.4.9
- NIC: 3Com 996-BT Gigabit Ethernet
- Used sclients as the client program [Banga 97]
Comparative Throughput
[Chart: throughput (requests/s) for uniprocessor, 4-processor SMP, SMP with 1 TCP Server, and SMP with 2 TCP Servers]
Clients issue file requests according to a web server trace
Adaptive TCP Servers
- Static TCP Server configuration
  - too few TCP Servers can make network processing the bottleneck
  - too many TCP Servers degrade the performance of CPU-intensive applications
- Dynamic TCP Server configuration
  - monitor the TCP Server queue lengths and system load
  - dynamically add or remove TCP Server processors
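The dynamic policy above can be sketched as a small control loop. The queue-length and load thresholds below are invented for illustration; the real system monitors actual TCP Server queues and system load.

```python
# Illustrative sketch of the adaptive policy: grow the set of dedicated
# TCP Server processors when their queues back up, shrink it when
# application load dominates. Thresholds are made up for illustration.
def adapt_tcp_servers(n_servers, queue_len, app_load,
                      n_cpus=4, high_q=32, low_q=4, high_load=0.8):
    if queue_len > high_q and n_servers < n_cpus - 1:
        return n_servers + 1      # network processing is the bottleneck
    if queue_len < low_q and app_load > high_load and n_servers > 1:
        return n_servers - 1      # give a processor back to the application
    return n_servers              # configuration is fine; leave it alone
```

Keeping at least one application processor and at least one TCP Server processor avoids oscillating between the two failure modes the slide describes.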
Next Target: Storage Networking
The storage networking dilemma: TCP or not TCP?
- non-TCP/IP solutions (M-M communication such as InfiniBand, DAFS: Direct Access File Systems) require new wiring or tunneling over IP-based Ethernet networks
- TCP/IP solutions (iSCSI: SCSI over IP) require TCP offloading
Future Work: TCP Servers & iSCSI
Use TCP Servers to connect to SCSI storage using the iSCSI protocol over TCP/IP networks
[Diagram: server application CPU and a TCP Server & iSCSI CPU over shared memory; the TCP Server runs TCP/IP and iSCSI to reach the SCSI storage]
Problem 2: TCP/IP is too Rigid
Server vs. Service Availability
- The client is interested in service availability
- Adverse conditions may affect service availability
  - internetwork congestion or failure
  - servers overloaded, failed or under DoS attack
- TCP has one response: network delays => packet loss => retransmission
- TCP limits the OS solutions for service availability
  - early binding of the service to a server
  - the client cannot switch to another server for sustained service after the connection is established
Service Availability through Migration
[Diagram: a client's live connection migrates from Server 1 to Server 2]
Migratory TCP at a Glance
- Migratory TCP migrates live connections among cooperating servers
- The migration mechanism is generic (not application specific), lightweight (fine-grained migration) and low-latency
- Migration triggered by client or server
- Servers can be geographically distributed (different IP addresses)
- Requires changes to the server application
- Totally transparent to the client application
- Interoperates with existing TCP
- Migration policies decoupled from the migration mechanism
Basic Idea: Fine-Grained State Migration
[Diagram: the Server 1 process holds application state and connection state for connections C1-C6; only connection C2's state migrates to the Server 2 process]
Migratory TCP (Lazy) Protocol
[Diagram: the client connects to Server 1 (0); the client sends a Migration Request to Server 2 (1); Server 2 sends a <State Request> for connection C' to Server 1 (2); Server 1 returns a <State Reply> (3); Server 2 sends a Migration Accept to the client (4)]
Non-Intrusive Migration
- Migrate state without involving the old server application (only the old server OS)
- The old server exports per-connection state periodically
  - connection state and application state can go out of sync
- Upon migration, the new server imports the last exported state of the migrated connection
  - the OS uses the connection state to synchronize with the application
- Non-intrusive migration with M-M communication
  - uses RDMA read to extract state from the old server with zero overhead
  - works even when the old server is overloaded or frozen
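The lazy export/import cycle can be sketched as follows. This is illustrative only: `ConnectionState` and its methods are hypothetical, not the M-TCP API, and the real protocol moves the snapshot over TCP or RDMA rather than in memory.

```python
# Illustrative model of lazy state export/import: the old server
# periodically snapshots per-connection state; on migration the new
# server imports the most recent snapshot, which may lag the live
# connection (hence the synchronization step in the slide above).
class ConnectionState:
    def __init__(self, conn_id):
        self.conn_id = conn_id
        self.snapshots = []          # exported snapshots, oldest first

    def export_state(self, seq, app_state):
        """Old server: export a snapshot tagged with a sequence number."""
        self.snapshots.append({"seq": seq, "app": app_state})

    def import_latest(self):
        """New server: resume from the last exported snapshot."""
        if not self.snapshots:
            raise RuntimeError("no exported state for connection")
        return self.snapshots[-1]
```

Because export is periodic rather than per-byte, the imported snapshot can be older than the live connection, which is exactly why the OS must re-synchronize connection state with application state after migration.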
Service Continuation (SC)
[Diagram: a front-end server process connected to a client socket and to two back-end server processes over pipes; the SC groups the connection state, the pipe states and the per-process exported states]

SC API usage:

Front-end server process:
    sc = create_cont(C1);
    p1 = pipe();
    associate(sc, p1);
    fork_exec(Process1);
    ...
    export(sc, state);

Back-end server process 1:
    sc = open_cont(p1);
    ...
    export(sc, state);

Back-end server process 2:
    sc = open_cont(p2);
    ...
    export(sc, state);
Related Work
- Process migration: Sprite [Douglis '91], Locus [Walker '83], MOSIX [Barak '98], etc.
- VM migration [Rosenblum '02, Nieh '02]
- Migration in web server clusters [Snoeren '00, Luo '01]
- Fault-tolerant TCP [Alvisi '00]
- TCP extensions for host mobility: I-TCP [Bakre '95], Snoop TCP [Balakrishnan '95], end-to-end approaches [Snoeren '00], MSOCKS [Maltz '98]
- SCTP (RFC 2960)
Evaluation
- Implemented SC and M-TCP in the FreeBSD kernel
- Integrated SC in real Internet servers: web, media streaming, transactional DB
- Microbenchmark: impact of migration on client-perceived throughput for a two-process server using TTCP
- Real applications: sustain web server throughput under load produced by an increasing number of client connections
Impact of Migration on Throughput
[Chart: effective throughput (KB/s, roughly 7,300-8,000) vs. migration period (no migration, 2 s, 5 s, 10 s) for SC sizes of 1 KB, 5 KB and 10 KB]
Web Server Throughput
[Chart: throughput (replies/s) and number of migrated connections vs. offered load (300-1,700 connections/s) for M-Apache and unmodified Apache]
Future Research: Use SC to Build Self-Healing Cluster-based Systems
[Diagram: service continuations SC2 and SC3 migrating among cluster nodes]
Problem 3: Computer Systems Move Outdoors
[Diagram: Linux car, sensors, Linux camera, Linux watch]
- Massive numbers of computers will be embedded everywhere in the physical world
- Dynamic ad-hoc networking
- How to execute user-defined applications over these networks?
Outdoor Distributed Computing
Traditional (indoor) distributed computing:
- target: performance and/or fault tolerance
- stable configuration, robust networking (TCP/IP or M-M)
- relatively small scale
- functionally equivalent nodes
- message passing or shared memory programming
Outdoor distributed computing:
- target: collect/disseminate distributed data and/or perform collective tasks
- volatile nodes and links
- node equivalence determined by physical properties (content-based naming)
- data migration is not good
  - expensive to perform end-to-end transfer control
  - too rigid for such a dynamic network
Cooperative Computing at a Glance
- Distributed computing with execution migration
- Smart Message: carries the execution state (and possibly the code) in addition to the payload
  - execution state assumed to be small (explicit migration)
  - code usually cached (few applications)
- Nodes "cooperate" by allowing Smart Messages to execute on them and to use their memory to store "persistent" data (tags)
- Nodes do not provide routing: a Smart Message executes on each node of its path
  - application code executes on target nodes (nodes of interest)
  - routing code executes on each node of the path (self-routing)
- During its lifetime, an application generates at least one, possibly multiple, Smart Messages
Smart vs. "Dumb" Messages
[Diagram: data migration delivers a fixed payload ("Mary's lunch: appetizer, entree, dessert"), while execution migration moves the computation itself from node to node]
Smart Messages
Application code (water the first three hot nodes):

    do
        migrate(Hot_tag, timeout);
        Water_tag = ON;
        N = N + 1;
    until (N == 3 or timeout);

Routing code (self-routing toward the next hot node):

    migrate(tag, timeout) {
        do
            if (NextHot_tag)
                sys_migrate(NextHot_tag, timeout);
            else {
                spawn_SM(Route_Discovery, Hot);
                block_SM(NextHot_tag, timeout);
            }
        until (Hot_tag or timeout);
    }

[Diagram: SM execution hops across nodes 0-3, executing the application on the hot nodes along its path]
Cooperative Node Architecture
[Diagram: SM arrival -> admission manager -> scheduling -> virtual machine, with a tag space sitting above the OS & I/O; SM migration on departure]
- Admission control for resource security
- Non-preemptive scheduling with timeout-kill
- Tags created by SMs (limited lifetime) or I/O tags (permanent)
  - global tag name space: {hash(SM code), tag name}
  - five protection domains defined using hash(SM code), SM source node ID, and SM starting time
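The global tag name space can be sketched as a map keyed by {hash(SM code), tag name}, so Smart Messages built from different code cannot collide on tag names. The class and the choice of SHA-1 below are stand-ins for illustration, not the prototype's actual data structures.

```python
# Illustrative sketch of the global tag name space: tags are keyed by
# (hash of the SM code, tag name), so two SMs with different code can
# each have a "Hot" tag without interfering.
import hashlib

class TagSpace:
    def __init__(self):
        self.tags = {}

    def _key(self, sm_code, name):
        # hash(SM code) from the slide; SHA-1 is an arbitrary choice here
        return (hashlib.sha1(sm_code.encode()).hexdigest(), name)

    def write(self, sm_code, name, value):
        self.tags[self._key(sm_code, name)] = value

    def read(self, sm_code, name):
        return self.tags.get(self._key(sm_code, name))
```

A real node would additionally enforce the protection domains and tag lifetimes listed above; this sketch only shows the naming scheme.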
Related Work
- Mobile agents (D'Agents, Ajanta)
- Active networks (ANTS, SNAP)
- Sensor networks (Diffusion, TinyOS, TAG)
- Pervasive computing (One.world)
Prototype Implementation
- 8 HP iPAQs running Linux
- 802.11 wireless communication
- Sun Java K Virtual Machine
- Geographic (simplified GPSR) and on-demand (AODV) routing
[Diagram: user node, intermediate nodes, node of interest]

Completion time:
Routing algorithm     Code not cached (ms)   Code cached (ms)
Geographic (GPSR)     415.6                  126.6
On-demand (AODV)      506.6                  314.7
Self-Routing
- There is no best routing outdoors: it depends on application and node-property dynamics
- Application-controlled routing is possible with Smart Messages (execution state carried in the message)
- When migration times out, the application is upcalled on the current node to decide what to do next
- Example: geographic routing to reach target regions, on-demand routing within a region; the application decides when to switch between the two
[Diagram: starting node, nodes of interest, other nodes]
Self-Routing Effectiveness (simulation)
Next Target: Spatial Programming
- Smart Messages: too low-level a programming model
- How to describe distributed computing over dynamic outdoor networks of embedded systems with limited knowledge about resource number, location, etc.?
- Spatial Programming (SP) design guidelines:
  - space is a first-order programming concept
  - resources are named by their expected location and properties (spatial reference)
  - reference consistency: spatial reference-to-resource mappings are consistent throughout the program
  - programs must tolerate resource dynamics
- SP can be implemented using Smart Messages (the spatial reference mapping table carried as payload)
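Reference consistency can be sketched as a mapping table carried with the program: the first successful binding of a spatial reference is recorded, and every later use of the same reference resolves to the same physical node. This is a hypothetical sketch; `discover` stands in for whatever node-discovery mechanism the SM runtime provides.

```python
# Illustrative sketch of the spatial reference mapping table: once
# {space:tag}[i] is bound to a node, later uses of the same index
# return the same node (reference consistency).
class SpatialRefTable:
    def __init__(self):
        self.bindings = {}                    # (space, tag, index) -> node id

    def resolve(self, space, tag, index, discover):
        key = (space, tag, index)
        if key not in self.bindings:          # first use: discover a node
            self.bindings[key] = discover(space, tag, index)
        return self.bindings[key]             # later uses: same node
```

Carrying this table as Smart Message payload is what lets the mapping survive as the message migrates from node to node.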
Spatial Programming Example
Program sprinklers to water the hottest spot of the Left Hill
[Diagram: mobile sprinklers with temperature sensors on the Left Hill and Right Hill; a hot spot on the Left Hill]

    for (i = 0; i < 10; i++)          // what if there are fewer than 10 hot spots?
        if ({Left_Hill:Hot}[i].temp > Max_temp) {
            Max_temp = {Left_Hill:Hot}[i].temp;
            id = i;
        }
    {Left_Hill:Hot}[id].water = ON;   // relies on spatial reference consistency

{Left_Hill:Hot}[i] is a spatial reference for the hot spots on the Left Hill.
Problem 4: Manageable Distributed File Systems
- Most distributed file servers use TCP/IP both for client-server and intra-server communication
- Strong file consistency, file locking and load balancing are difficult to provide
- File servers require significant human effort to manage: add storage, move directories, etc.
- Cluster-based file servers are cost-effective
  - scalable performance requires load balancing
  - load balancing may require file migration
  - file migration is limited if file naming is location-dependent
- We need a scalable, location-independent and easy-to-manage cluster-based distributed file system
Federated File System at a Glance
Global file name space over a cluster of autonomous local file systems interconnected by an M-M network
[Diagram: FedFS layered over local file systems (LocalFS) on nodes A1-A3, connected by an M-M interconnect]
Location-Independent Global File Naming
- Virtual Directory (VD): union of local directories
  - volatile, created on demand (dirmerge)
  - contains information about files, including location (homes of files)
  - assigned dynamically to nodes (managers)
  - supports location-independent file naming and file migration
- Directory Tables (DT): local caches of VD entries (~TLB)
[Diagram: /usr/file1 on local file system 1 and /usr/file2 on local file system 2 merge into a virtual directory /usr containing file1 and file2]
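dirmerge can be sketched as a union over the same-named local directories, with each virtual-directory entry recording the home node of its file. This is a simplified model of the idea; the real FedFS virtual directory holds richer per-file metadata and is assigned to a manager node.

```python
# Illustrative sketch of dirmerge: build a virtual directory on demand
# as the union of the same-named directories on each local file system.
# Each entry remembers its home node, keeping naming location independent.
def dirmerge(path, local_fss):
    """local_fss: {node_id: {dir_path: [file names]}}"""
    vd = {}
    for node, fs in local_fss.items():
        for fname in fs.get(path, []):
            vd.setdefault(fname, node)   # entry records the file's home node
    return vd
```

With the example from the diagram (file1 on node 1, file2 on node 2), the merged /usr names both files while still knowing where each one lives, which is what makes file migration possible without renaming.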
Direct Access File Systems (DAFS) and Federated DAFS
[Diagram: Distributed NFS over FedFS - applications with NFS clients reach NFS servers over TCP/IP; each server runs FedFS over its local FS, and the servers are interconnected by M-M]
[Diagram: DAFS - an application with a DAFS client reaches a DAFS server over M-M; the server runs the local FS directly]
[Diagram: Federated DAFS - applications with DAFS clients reach DAFS servers over M-M; each server runs FedFS over its local FS, and the servers are interconnected by M-M]
Related Work
- Cluster-based file systems: Frangipani [Thekkath '97], PVFS [Carns '00], GFS, Archipelago [Ji '00], Trapeze (Duke)
- DAFS [NetApp '03, Magoutis '01-'03]
- User-level communication in cluster-based network servers [Carrera '02]
Experimental Platform
- Eight-node server cluster: 800 MHz PIII, 512 MB SDRAM, 9 GB 10K RPM SCSI
- Client: dual-processor (300 MHz PII), 512 MB SDRAM
- Linux 2.4
- Servers and clients equipped with Emulex cLAN adapters (M-M network)
Workload I
- Postmark: synthetic benchmark
  - short-lived small files
  - mix of metadata-intensive operations
- Postmark outline
  - create a pool of files
  - perform transactions: READ/WRITE paired with CREATE/DELETE
  - delete the created files
- Each Postmark client performs 30,000 transactions
- Clients distribute requests to servers using a hash function on pathnames
- Files are physically placed on the node that receives the client requests
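The distribution rule above can be sketched as a stable hash on the pathname: every client maps the same path to the same server, so the file is created on, and later served by, that node. CRC32 is an arbitrary stand-in here; the talk does not specify which hash function was used.

```python
# Illustrative sketch of hash-based request distribution: a stable
# hash of the pathname picks the server, so all clients agree on
# where a given file lives (CRC32 is an arbitrary choice).
import zlib

def server_for(pathname, n_servers):
    return zlib.crc32(pathname.encode()) % n_servers
```

Because the mapping is deterministic, no directory lookup is needed on the request path; the trade-off is that it only balances load well when pathnames hash evenly across servers.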
Postmark Throughput
[Chart: Postmark throughput (transactions/s) vs. number of servers (1-8) for file sizes of 2 KB, 4 KB, 8 KB and 16 KB]
Workload II
- Postmark performs only READ transactions (no create/delete operations)
- Federated DAFS does not control file placement: client requests are not necessarily sent to the file's correct location
Postmark Read Throughput
[Chart: Postmark read throughput (transactions/s) vs. number of servers (2, 4) for PostmarkRead and PostmarkRead-NoCache]
Next Target: Federated DAFS over the Internet
[Diagram: applications with DAFS clients connect over M-M to DAFS servers running FedFS over local file systems; the servers federate across the Internet over TCP/IP]
Outline
- TCP Servers
- Migratory TCP and Service Continuations
- Cooperative Computing, Smart Messages and Spatial Programming
- Federated File Systems
- Talk Highlights and Conclusions
Talk Highlights
- Back to migration
  - Service Continuations: service availability and self-healing clusters
  - Smart Messages: programming dynamic networks of embedded systems
- Exploit non-intrusive M-M communication
  - TCP offloading
  - state migration
  - federated file systems
- Network and storage I/O convergence
  - TCP Servers & iSCSI
  - Federated File Systems & M-M
- Programmability
  - Smart Messages and Spatial Programming
  - extended server API: Service Continuations, TCP Servers, federated file system
Conclusions
- Network-centric systems: a very promising, border-crossing systems research area
- Common issues across a large spectrum of systems and networks
- Tremendous potential to impact industry
Acknowledgements
- UMD students: Andrzej Kochut, Chunyuan Liao, Tamer Nadeem, Iulian Neamtiu and Jihwang Yeo.
- Rutgers students: Ashok Arumugam, Kalpana Banerjee, Aniruddha Bohra, Cristian Borcea, Suresh Gopalakrisnan, Deepa Iyer, Porlin Kang, Vivek Pathak, Murali Rangarajan, Rabita Sarker, Akhilesh Saxena, Steve Smaldone, Kiran Srinivasan, Florin Sultan and Gang Xu.
- Post-doc: Chalermek Intanagonwiwat
- Collaborations at Rutgers: EEL (Ulrich Kremer), DARK (Ricardo Bianchini), PANIC (Rich Martin and Thu Nguyen)
- Support: NSF ITR ANI-0121416 and CAREER CCR-013366