
New J. Phys. 17 (2015) 083023 doi:10.1088/1367-2630/17/8/083023

PAPER

Efficient decision-making by volume-conserving physical object

Song-Ju Kim^1, Masashi Aono^2 and Etsushi Nameda^3

1 WPI Center for Materials Nanoarchitectonics, National Institute for Materials Science, Tsukuba, Ibaraki 305-0044, Japan
2 Earth-Life Science Institute, Tokyo Institute of Technology, Tokyo 152-8550, and PRESTO, JST, Japan
3 RIKEN, Wako, Saitama 351-0198, Japan

E-mail: [email protected]

Keywords: random walk, natural computing, decision-making

Abstract
Decision-making is one of the most important intellectual abilities of not only humans but also other biological organisms, helping their survival. This ability, however, may not be limited to biological systems and may be exhibited by physical systems. Here we demonstrate that any physical object, as long as its volume is conserved when coupled with suitable operations, provides a sophisticated decision-making capability. We consider the multi-armed bandit problem (MBP), the problem of finding, as accurately and quickly as possible, the most profitable option from a set of options that gives stochastic rewards. Efficient MBP solvers are useful for many practical applications, because the MBP abstracts a variety of decision-making problems in real-world situations in which efficient trial-and-error is required. These decisions are made as dictated by a physical object, which is moved in a manner similar to the fluctuations of a rigid body in a tug-of-war (TOW) game. This method, called 'TOW dynamics', exhibits higher efficiency than conventional reinforcement learning algorithms. We show analytical calculations that validate the statistical reasons for TOW dynamics to produce this high performance despite its simplicity. These results imply that various physical systems in which some conservation law holds can be used to implement an efficient 'decision-making object'. The proposed scheme will provide a new perspective for opening up a physics-based analog computing paradigm and for understanding the biological information-processing principles that exploit their underlying physics.

OPEN ACCESS

RECEIVED: 27 February 2015
REVISED: 27 May 2015
ACCEPTED FOR PUBLICATION: 17 June 2015
PUBLISHED: 11 August 2015

Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

© 2015 IOP Publishing Ltd and Deutsche Physikalische Gesellschaft

1. Introduction

The computing principles in modern digital paradigms have been designed to be dissociated from the underlying physics of natural phenomena [1]. In the construction of CMOS devices, wide-band-gap materials have been employed so that physical fluctuations such as thermal noise, which often violate logically-valid behavior, can be neglected [2]. Since electron dynamics constrained by physical laws cannot be controlled when only parameters of the same degree of freedom as those of logical input–output responses are modulated, considerably complicated circuits are required for implementing even relatively simple logic gates such as NAND and NOR [3]. However, these efforts to maintain the division between physics and computation are costly in terms of energy consumption and manufacturing resources. On the other hand, when we look at the natural world, information processing in biological systems is elegantly coupled with their underlying physical laws and fluctuations [4–6]. This suggests a potential for establishing a new physics-based analog-computing paradigm. In this paper, we show that a physical constraint, the conservation law for the volume of a rigid body, allows for efficient solving of decision-making problems when subjected to suitable operations involving fluctuations.

Suppose there are M slot machines, each of which returns a reward, for example a coin, with a certain probability that is unknown to the player. Let us consider a minimal case: two machines A and B give rewards with individual probabilities P_A and P_B, respectively. The player makes a decision on which machine to play at each trial, trying to maximize the total reward obtained after repeating several trials. The multi-armed bandit problem (MBP) is used to determine the optimal strategy for finding the machine with the highest reward probability as accurately and quickly as possible by referring to past experiences.

The MBP is formulated as a mathematical problem without loss of generality and so is related to various stochastic phenomena. In fact, many application problems in diverse fields, such as communications (cognitive networks [7, 8]), commerce (advertising on the web [9]), and entertainment (Monte Carlo tree search, which is used for computer games [10, 11]), can be reduced to MBPs. In particular, the 'upper confidence bound 1 (UCB1) algorithm' for solving MBPs is used worldwide in many practical applications [17].

In the context of reinforcement learning, the MBP was originally described by Robbins [12], though the essence of the problem had been studied earlier by Thompson [13]. The optimal strategy, called the 'Gittins index', is known only for a limited class of problems in which the reward distributions are assumed to be known to the players [14, 15]. Even in this limited class, computing the Gittins index in practice becomes intractable in many cases. For the algorithms proposed by Agrawal and Auer et al, another index was expressed as a simple function of the reward sums obtained from the machines [16, 17].

2. Methods

2.1. Tug-of-war (TOW) dynamics
Kim et al proposed an MBP solution using a dynamical system, called 'TOW dynamics'; this algorithm was inspired by the spatiotemporal dynamics of a single-celled amoeboid organism (the true slime mold P. polycephalum) [18–23], which maintains a constant intracellular-resource volume while collecting environmental information by concurrently expanding and shrinking its pseudopod-like terminal parts. In this nature-inspired algorithm, the decision-making function is derived from its underlying physics, resembling that of a TOW game. The physical constraint in TOW dynamics, the conservation law for the volume of the amoeboid body, entails a non-local correlation among the terminal parts: a volume increment in one part is immediately compensated by volume decrement(s) in the other part(s). In our previous studies [18–23], we showed that, owing to the non-local correlation derived from the volume-conservation law, TOW dynamics exhibit higher performance than other well-known algorithms such as the modified ε-greedy algorithm and the modified softmax algorithm, and performance comparable to that of the UCB1-tuned algorithm, regarded as the best choice among parameter-free algorithms [17]. These observations suggest that efficient decision-making devices could be implemented using any physical object, as long as it holds some common physical attribute such as a conservation law. In fact, Kim et al demonstrated that optical energy-transfer dynamics between quantum dots, in which energy is conserved, can be exploited for the implementation of TOW dynamics [24, 25].

In this paper, we extract the most essential ingredients of TOW dynamics and formulate a simplified version, called the 'TOW principle', in order to summarize the mechanism of its high performance by showing analytical calculations of the 'regret', a measure that characterizes the performance of multi-armed bandit algorithms. Consider a volume-conserving physical object, for example, a rigid body such as an iron bar (the slot-machine's handle), as shown in figure 1. Here, the variable X_k represents the displacement of terminal k from an initial position, where k ∈ {A, B}. If X_k is the maximum, we assume that the body makes a decision to play machine k. In TOW dynamics, the MBP is represented in its inverse form: instead of 'rewarding' the player when machine k produces a coin with probability P_k, we 'punish' the player when the machine gives no coin, with probability

Figure 1. TOW dynamics. If machine k (k ∈ {A, B}) is played at each time t, +1 and −ω are added to X_k(t) for rewarding (a) and non-rewarding cases (b), respectively.


1 − P_k. In this respect, the displacement X_A (= −X_B) is determined by the following equations:

$X_A(t+1) = Q_A(t) - Q_B(t) + \delta(t)$,   (1)

$Q_k(t) = N_k(t) - (1+\omega)\,L_k(t)$.   (2)

Here, Q_k(t) (k ∈ {A, B}) is an 'estimate' of the information on past experiences accumulated from the initial time 1 to the current time t, N_k(t) counts the number of times that machine k has been played until time t, L_k(t) counts the number of punishments received when playing machine k until time t, δ(t) is an arbitrary fluctuation to which the body is subjected, and ω is a weighting parameter to be described in detail later in this paper. Equation (2) is called the 'learning rule'. Consequently, TOW dynamics evolve according to a particularly simple rule: in addition to the fluctuation, if machine k is played at each time t, +1 and −ω are added to X_k(t) when rewarded and non-rewarded, respectively (figure 1).
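To make the update rule concrete, the following is a minimal sketch in Python (ours, not the authors' reference code) of TOW dynamics for a two-machine Bernoulli bandit, implementing equations (1) and (2). The fluctuation δ(t) is modeled here as uniform noise, which is an assumption of ours, and ω is set to the value ω_0 = γ/(2 − γ) derived later in equation (9), which presumes that γ = P_A + P_B is known (see section 3.2).

```python
import random

def tow_play(p, omega, steps, noise=1.0, seed=0):
    """Sketch of TOW dynamics on a two-armed Bernoulli bandit."""
    rng = random.Random(seed)
    N = {'A': 0, 'B': 0}  # N_k: times machine k has been played
    L = {'A': 0, 'B': 0}  # L_k: punishments (non-rewards) at machine k
    total_reward = 0
    for t in range(steps):
        # Learning rule (2): Q_k = N_k - (1 + omega) * L_k.
        Q = {k: N[k] - (1 + omega) * L[k] for k in 'AB'}
        # Displacement (1): X_A = Q_A - Q_B + delta(t); play A if X_A > 0.
        x_a = Q['A'] - Q['B'] + rng.uniform(-noise, noise)
        k = 'A' if x_a > 0 else 'B'
        N[k] += 1
        if rng.random() < p[k]:
            total_reward += 1  # rewarded: +1 added to X_k
        else:
            L[k] += 1          # non-rewarded: -omega added to X_k
    return total_reward

p = {'A': 0.7, 'B': 0.3}
omega0 = (p['A'] + p['B']) / (2 - (p['A'] + p['B']))  # equation (9)
print(tow_play(p, omega0, steps=1000))  # approaches 0.7 * 1000 once it locks onto A
```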

3. Results

3.1. Solvability of the MBP
To explore the MBP solvability of TOW dynamics, let us consider a random-walk model. As shown in figure 2(a), α (right flight when rewarded) and β (left flight when non-rewarded) are the parameters. We assume that P_A > P_B for simplicity. After time step t, the displacement R_k(t) (k ∈ {A, B}) can be described by

$R_k(t) = \alpha (N_k - L_k) - \beta L_k = \alpha N_k - (\alpha + \beta) L_k$.   (3)

The expected value of R_k can be obtained from the following equation:

$E[R_k(t)] = \{\alpha P_k - \beta (1 - P_k)\}\, N_k$.   (4)

In the overlapping area between the two distributions shown in figure 2(b), we cannot accurately estimate which is larger. The overlapping area should decrease as N_k increases so as to avoid incorrect judgments. This requirement can be expressed in the following forms:

$\alpha P_A - \beta (1 - P_A) > 0$,   (5)

$\alpha P_B - \beta (1 - P_B) < 0$.   (6)

These expressions can be rearranged into the form

$P_B < \dfrac{\beta}{\alpha + \beta} < P_A$.   (7)

In other words, the parameters α and β must satisfy the above condition so that the random walks correctly identify the machine with the larger reward probability.

Figure 2. (a) Random walk: flight α when rewarded with probability P_k, or flight −β when non-rewarded with probability 1 − P_k. (b) Probability distributions of the two random walks.


We can easily confirm that the following form satisfies the above conditions:

$\dfrac{\beta}{\alpha + \beta} = \dfrac{P_A + P_B}{2}$.   (8)

From $R_k(t)/\alpha = N_k - (1 + \beta/\alpha) L_k = Q_k(t)$, we obtain ω = β/α. From this and equation (8), we obtain

$\omega_0 = \dfrac{\gamma}{2 - \gamma}$,   (9)

$\gamma = P_A + P_B$.   (10)

Here, we have set the parameter ω to ω_0. Therefore, we can conclude that the algorithm using the learning rule Q_k with the parameter ω_0 can solve the MBP correctly.
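As a concrete illustration with our own numbers (not taken from the paper), consider P_A = 0.7 and P_B = 0.3:

```latex
% Worked example (illustrative numbers, ours): P_A = 0.7, P_B = 0.3.
\gamma   = P_A + P_B = 1.0, \qquad
\omega_0 = \frac{\gamma}{2-\gamma} = \frac{1.0}{2 - 1.0} = 1.0.
% Equation (8) then fixes \beta/(\alpha+\beta) = \gamma/2 = 0.5,
% i.e. \alpha = \beta, and condition (7) is satisfied:
% P_B = 0.3 < 0.5 < 0.7 = P_A.
```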

3.2. TOW principle: origin of the high performance
In many popular algorithms, such as the ε-greedy algorithm, at each time t an estimate of the reward probability is updated only for the machine that was played. On the other hand, in an imaginary circumstance in which the sum of the reward probabilities γ = P_A + P_B is known to the player, we can update both estimates simultaneously, even though only one of the machines was played.

The top and bottom rows of table 1 give the estimates based on the knowledge that machine A was played N_A times and that machine B was played N_B times, respectively. Note that we can also update the estimate of the machine that was not played, owing to the given γ.

From the above estimates, each expected reward $Q'_k$ (k ∈ {A, B}) is given as follows:

$Q'_A = N_A \dfrac{N_A - L_A}{N_A} + N_B \left(\gamma - \dfrac{N_B - L_B}{N_B}\right) = N_A - L_A + (\gamma - 1) N_B + L_B$,   (11)

$Q'_B = N_A \left(\gamma - \dfrac{N_A - L_A}{N_A}\right) + N_B \dfrac{N_B - L_B}{N_B} = N_B - L_B + (\gamma - 1) N_A + L_A$.   (12)

These expected rewards $Q'_j$ are not the same as those given by the learning rule of TOW dynamics, the $Q_j$ in equation (2). However, what TOW dynamics substantially uses is the difference

$Q_A - Q_B = (N_A - N_B) - (1 + \omega)(L_A - L_B)$.   (13)

When we transform the expected rewards $Q'_j$ into $Q''_j = Q'_j/(2 - \gamma)$, we obtain the difference

$Q''_A - Q''_B = (N_A - N_B) - \dfrac{2}{2 - \gamma}(L_A - L_B)$.   (14)

Comparing the coefficients of equations (13) and (14), the two differences coincide exactly when $1 + \omega = 2/(2 - \gamma)$, that is, when ω = ω_0 (equation (9)) is satisfied. In this way, we obtain the nearly optimal weighting parameter ω_0 in terms of γ.

This derivation implies that the learning rule of TOW dynamics is equivalent to that of the imaginary system in which both estimates can be updated simultaneously. In other words, TOW dynamics imitates the imaginary system that determines its next move at time t + 1 by referring to the estimates of both machines, even though one of them was not actually played at time t. This unique feature of the learning rule, derived from the fact that the sum of the reward probabilities is given in advance, may be one of the origins of the high performance of TOW dynamics.
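This equivalence is easy to verify numerically. The following quick check (ours, with arbitrary illustrative counts) confirms that the TOW difference in equation (13), with ω = ω_0, equals the rescaled imaginary-system difference in equation (14):

```python
from fractions import Fraction

# Check that Q_A - Q_B (equation (13), omega = omega_0) equals
# (Q'_A - Q'_B) / (2 - gamma) (equation (14)) for arbitrary counts.
PA, PB = Fraction(7, 10), Fraction(3, 10)
gamma = PA + PB
omega0 = gamma / (2 - gamma)          # equation (9)

NA, LA, NB, LB = 13, 5, 8, 6          # arbitrary counts with L_k <= N_k

# Equation (13): TOW learning-rule difference with omega = omega_0.
tow_diff = (NA - NB) - (1 + omega0) * (LA - LB)

# Equations (11), (12), then (14): imaginary-system difference, rescaled.
QpA = NA - LA + (gamma - 1) * NB + LB
QpB = NB - LB + (gamma - 1) * NA + LA
imag_diff = (QpA - QpB) / (2 - gamma)

assert tow_diff == imag_diff
print(tow_diff)  # both differences are identical
```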

Monte Carlo simulations were performed, and it was verified that TOW dynamics with ω_0 exhibits exceptionally high performance, comparable to its peak performance achieved with the optimal parameter ω_opt.

Table 1. Estimates of each reward probability based on the knowledge that machine A was played N_A times and that machine B was played N_B times, on the assumption that the sum of the reward probabilities γ = P_A + P_B is known.

From N_A plays of A:   A: (N_A − L_A)/N_A        B: γ − (N_A − L_A)/N_A
From N_B plays of B:   A: γ − (N_B − L_B)/N_B    B: (N_B − L_B)/N_B


To derive the optimal value ω_opt accurately, we need to take into account the fluctuation and other dynamics of the terminals [21].

In addition, the essence of the process described here can be generalized to K-machine cases. To separate the distributions of the top Mth and top (M+1)th machines, as shown in figure 2(b), all we need is the following ω_0:

$\omega_0 = \dfrac{\gamma'}{2 - \gamma'}$,   (15)

$\gamma' = P_{(M)} + P_{(M+1)}$.   (16)

Here, P_(M) denotes the top Mth reward probability, and M is any integer from 1 to K − 1. The MBP is a special case where M = 1. In fact, for K-machine and M-player cases, we have designed a physical system that can determine the overall optimal state, called the 'social maximum', quickly and accurately [26].

3.3. Performance characteristics
To characterize the high performance of TOW dynamics, let us consider an imaginary model for solving the MBP, called the 'cheater algorithm'. The cheater algorithm selects a machine to play according to the following estimates S_k (k ∈ {A, B}):

$S_A = X_{A,1} + X_{A,2} + \cdots + X_{A,N}$,   (17)

$S_B = X_{B,1} + X_{B,2} + \cdots + X_{B,N}$.   (18)

Here, X_{k,i} is a random variable that takes either 1 (rewarded) or 0 (non-rewarded). If S_A > S_B at time t = N, machine A is played at time t = N + 1. If S_B > S_A at time t = N, machine B is played at time t = N + 1. If S_A = S_B at time t = N, a machine is chosen randomly at time t = N + 1. Note that the algorithm refers to the results of both machines at time t, without any attention to which machine was actually played at time t − 1. In other words, the algorithm 'cheats' in that it plays both machines and collects both results, while declaring that it plays only one machine at a time.
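The following minimal sketch (ours) illustrates the cheater algorithm: at every step both machines are secretly sampled and added to S_A and S_B per equations (17) and (18), while only one machine is declared as 'played':

```python
import random

def cheater(p, steps, seed=0):
    """Sketch of the cheater algorithm on a two-armed Bernoulli bandit."""
    rng = random.Random(seed)
    S = {'A': 0, 'B': 0}
    declared = []
    for t in range(steps):
        # Declare a play based on the current sums (ties broken randomly).
        if S['A'] == S['B']:
            declared.append(rng.choice('AB'))
        else:
            declared.append('A' if S['A'] > S['B'] else 'B')
        # 'Cheat': sample both machines and update both sums.
        for k in 'AB':
            S[k] += 1 if rng.random() < p[k] else 0
    return declared

plays = cheater({'A': 0.7, 'B': 0.3}, steps=1000)
print(plays.count('B'))  # times the lower-reward machine was declared
```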

The expected value and the variance of X_k are defined as E(X_k) = μ_k and V(X_k) = σ_k², where μ_k is the same as the P_k defined earlier. From the central-limit theorem, S_k has a Gaussian distribution with E(S_k) = μ_k N and V(S_k) = σ_k² N. If we define a new variable S = S_A − S_B, then S has a Gaussian distribution and carries the following values:

$E(S) = (\mu_A - \mu_B)\, N$,   (19)

$V(S) = (\sigma_A^2 + \sigma_B^2)\, N$,   (20)

$\sigma(S) = \sqrt{(\sigma_A^2 + \sigma_B^2)\, N}$.   (21)

From figure 3, the probability of playing machine B, which has the lower reward probability, can be described as Q(E(S)/σ(S)), where Q(x) is the Q-function (the tail probability of the standard normal distribution). We obtain

$P(t = N+1,\, B) = Q\left(\sqrt{N}\,\phi\right)$.   (22)

Here,

$\phi = \dfrac{\mu_A - \mu_B}{\sqrt{\sigma_A^2 + \sigma_B^2}}$.   (23)

Using the Chernoff bound $Q(x) \le \frac{1}{2}\exp\left(-\frac{x^2}{2}\right)$, we can calculate the upper bound of a measure, called the 'regret', which quantifies the accumulated losses of the cheater algorithm.

Figure 3. Q(E(S)/σ(S)): the probability of selecting the lower-reward machine using the cheater algorithm.
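For reference, the Q-function can be written via the complementary error function, Q(x) = erfc(x/√2)/2; the quick numeric check below (ours) confirms the Chernoff bound used in the derivation that follows:

```python
import math

# Numeric check: Q(x) = erfc(x / sqrt(2)) / 2, and the Chernoff bound
# Q(x) <= (1/2) exp(-x^2 / 2) holds for x >= 0.
for x in [0.0, 0.5, 1.0, 2.0, 3.0]:
    q = 0.5 * math.erfc(x / math.sqrt(2))
    bound = 0.5 * math.exp(-x * x / 2)
    assert q <= bound + 1e-12
    print(f"x={x:.1f}  Q(x)={q:.4f}  bound={bound:.4f}")
```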


$\mathrm{regret} = (\mu_A - \mu_B)\, E(N_B)$.   (24)

$E(N_B) = \sum_{t=0}^{N-1} Q\left(\sqrt{t}\,\phi\right)$
$\qquad \le \sum_{t=0}^{N-1} \frac{1}{2} \exp\left(-\frac{\phi^2}{2}\, t\right)$
$\qquad = \frac{1}{2} + \sum_{t=1}^{N-1} \frac{1}{2} \exp\left(-\frac{\phi^2}{2}\, t\right)$
$\qquad \le \frac{1}{2} + \int_0^{N-1} \frac{1}{2} \exp\left(-\frac{\phi^2}{2}\, t\right) \mathrm{d}t$
$\qquad = \frac{1}{2} - \frac{1}{\phi^2}\left(\exp\left(-\frac{\phi^2}{2}(N-1)\right) - 1\right)$   (25)

$\qquad \to \frac{1}{\phi^2} + \frac{1}{2}$.   (26)

Note that the regret becomes constant as N increases.

Using the 'cheated' results, we can also calculate the regret of TOW dynamics in the same way. In this case,

$S_A = X_{A,1} + X_{A,2} + \cdots + X_{A,N_A} - \omega L_A$,   (27)

$S_B = X_{B,1} + X_{B,2} + \cdots + X_{B,N_B} - \omega L_B$.   (28)

Here, X_{k,i} is again a random variable that takes either 1 (rewarded) or 0 (non-rewarded), and we use L_k = (1 − μ_k) N_k. Then, we obtain E(S_k) = {μ_k − ω(1 − μ_k)} N_k and V(S_k) = σ_k² N_k.

Using the new variables S = S_A − S_B, N = N_A + N_B, and D = N_A − N_B, we also obtain

$E(S) = \dfrac{(\mu_A - \mu_B)(1 + \omega)}{2}\, N + \left\{\dfrac{(\mu_A + \mu_B)(1 + \omega)}{2} - \omega\right\} D$,   (29)

$V(S) = \dfrac{\sigma_A^2 + \sigma_B^2}{2}\, N + \dfrac{\sigma_A^2 - \sigma_B^2}{2}\, D$.   (30)

If the conditions ω = ω_0 and σ_A = σ_B ≡ σ are satisfied, the coefficient of D in equation (29) vanishes, because (μ_A + μ_B)(1 + ω_0)/2 = γ/(2 − γ) = ω_0, and we then obtain

$E(S) = \dfrac{(\mu_A - \mu_B)(1 + \omega_0)}{2}\, N$,   (31)

$V(S) = \sigma^2 N$,   (32)

and

$P(t = N+1,\, B) = Q\left(\sqrt{N}\,\phi_T\right)$.   (33)

Here,

$\phi_T = \dfrac{(\mu_A - \mu_B)(1 + \omega_0)}{2\sigma}$.   (34)

We can then calculate the upper bound of the regret for TOW dynamics:

$E(N_B) = \sum_{t=0}^{N-1} Q\left(\sqrt{t}\,\phi_T\right) \le \frac{1}{2} - \frac{1}{\phi_T^2}\left(\exp\left(-\frac{\phi_T^2}{2}(N-1)\right) - 1\right)$   (35)

$\qquad \to \frac{1}{\phi_T^2} + \frac{1}{2}$.   (36)

Note that the regret for TOW dynamics also becomes constant as N increases. It is known that optimal algorithms for the MBP, as defined by Auer et al, have a regret proportional to log(N) [17]. That regret has no finite upper bound as N increases, because such algorithms must keep playing the lower-reward machine to ensure that the probability of incorrect judgment goes to zero. A constant regret means that the probability of incorrect judgment remains non-zero in TOW dynamics, although this probability is nearly equal to zero. However, reward probabilities change frequently in actual decision-making situations, and their long-term behavior is not crucial for many practical purposes. For this reason, TOW dynamics would be better suited to real-world applications.
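A Monte Carlo sanity check (our own sketch, using Bernoulli machines so that σ_k² = μ_k(1 − μ_k)) illustrates the constant regret bound (μ_A − μ_B)(1/φ² + 1/2) from equations (24)–(26):

```python
import math
import random

# Empirical regret of the cheater algorithm versus the constant bound
# regret <= (mu_A - mu_B) * (1/phi^2 + 1/2), equations (24)-(26).
mu = {'A': 0.7, 'B': 0.3}
var = {k: m * (1 - m) for k, m in mu.items()}      # Bernoulli variances
phi = (mu['A'] - mu['B']) / math.sqrt(var['A'] + var['B'])
bound = (mu['A'] - mu['B']) * (1 / phi**2 + 0.5)

def cheater_nb(steps, rng):
    """Count how often the lower-reward machine B is declared."""
    S = {'A': 0.0, 'B': 0.0}
    nb = 0
    for _ in range(steps):
        k = 'B' if S['B'] > S['A'] else ('A' if S['A'] > S['B'] else rng.choice('AB'))
        nb += (k == 'B')
        for m in 'AB':                  # cheat: sample both machines
            S[m] += rng.random() < mu[m]
    return nb

rng = random.Random(1)
runs = 500
regret = (mu['A'] - mu['B']) * sum(cheater_nb(1000, rng) for _ in range(runs)) / runs
print(f"empirical regret ~ {regret:.3f}, bound = {bound:.3f}")
```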


4. Conclusion and discussion

In this paper, we proposed TOW dynamics for solving the MBP and analytically validated that the high efficiency in making a series of decisions for maximizing the total sum of stochastically obtained rewards is embedded in any volume-conserving physical object when subjected to suitable operations involving fluctuations. In conventional decision-making algorithms for solving the MBP, a parameter for adjusting the 'exploration time' must be optimized. This exploration parameter often reflects the difference between the rewarded experiences, i.e., |P_A − P_B|. In contrast, TOW dynamics demonstrates that a higher performance can be achieved by introducing a weighting parameter ω_0 that refers to the sum of the rewarded experiences, i.e., P_A + P_B. Owing to this novelty, the high performance of TOW dynamics can be reproduced when implementing these dynamics with various volume-conserving physical objects. Thus, our proposed physics-based analog-computing paradigm would be useful for a variety of real-world applications and for understanding the biological information-processing principles that exploit their underlying physics.

Acknowledgments

This work was partially undertaken when the authors belonged to the RIKEN Advanced Science Institute, which was reorganized and integrated into RIKEN as of the end of March 2013. We thank Professor Masahiko Hara for valuable discussions.

References

[1] Herken R 1995 The Universal Turing Machine: A Half-Century Survey 2nd edn (Berlin: Springer)
[2] Sze S M 1981 Physics of Semiconductor Devices 2nd edn (New York: Wiley)
[3] Baker R J 2010 CMOS Circuit Design, Layout, and Simulation 3rd edn (New York: Wiley)
[4] Castro L N 2007 Phys. Life Rev. 4 1–36
[5] Kari L and Rozenberg G 2008 Commun. ACM 51 72–83
[6] Lee J, Imai K and Zhu Q 2012 Inf. Sci. 187 266–77
[7] Lai L, Jiang H and Poor H V 2008 Proc. IEEE 42nd Asilomar Conference on Signals, Systems and Computers pp 98–102
[8] Lai L, Gamal H E, Jiang H and Poor H V 2011 IEEE Trans. Mobile Comput. 10 239–53
[9] Agarwal D, Chen B C and Elango P 2009 Proc. ICDM 2009 pp 1–10
[10] Kocsis L and Szepesvári C 2006 ECML 2006, LNAI 4212 (Berlin: Springer) pp 282–93
[11] Gelly S, Wang Y, Munos R and Teytaud O 2006 RR-6062-INRIA pp 1–19
[12] Robbins H 1952 Bull. Am. Math. Soc. 58 527–36
[13] Thompson W 1933 Biometrika 25 285–94
[14] Gittins J and Jones D A 1974 Progress in Statistics (Amsterdam: North-Holland) pp 241–66
[15] Gittins J 1979 J. R. Stat. Soc. B 41 148–77
[16] Agrawal R 1995 Adv. Appl. Probab. 27 1054–78
[17] Auer P, Cesa-Bianchi N and Fischer P 2002 Mach. Learn. 47 235–56
[18] Kim S-J, Aono M and Hara M 2010 UC 2010, LNCS 6079 (Berlin: Springer) pp 69–80
[19] Kim S-J, Aono M and Hara M 2010 BioSystems 101 29–36
[20] Kim S-J, Nameda E, Aono M and Hara M 2011 Proc. NOLTA 2011 pp 176–9
[21] Kim S-J, Aono M, Nameda E and Hara M 2011 Technical Report of IEICE (CCS-2011-025) pp 36–41 (in Japanese)
[22] Aono M, Kim S-J, Hara M and Munakata T 2014 BioSystems 117 1–9
[23] Kim S-J and Aono M 2014 NOLTA 5 198–209
[24] Kim S-J, Naruse M, Aono M, Ohtsu M and Hara M 2013 Sci. Rep. 3 2370
[25] Naruse M et al 2014 J. Appl. Phys. 116 154303
[26] Kim S-J and Aono M 2015 Spec. Issue Adv. Sci. Technol. Environmentol. B11 41–5
