Reinforcement Learning-based NOMA Power Allocation in the Presence of Smart Jamming

Liang Xiao∗, Yanda Li∗, Canhuang Dai∗, Huaiyu Dai†, H. Vincent Poor‡

∗ Dept. of Communication Engineering, Xiamen University, China. Email: [email protected]

† Dept. of Electrical and Computer Engineering, NC State University, Raleigh, NC. Email: huaiyu [email protected]

‡ Dept. of Electrical Engineering, Princeton University, Princeton, NJ. Email: [email protected]

Abstract—Non-orthogonal multiple access (NOMA) systems are vulnerable to jamming attacks, especially by smart jammers that apply programmable radio devices, such as software defined radios, to flexibly control their jamming strategy according to the ongoing NOMA transmission and radio environment. In this paper, the power allocation of a base station in a NOMA system equipped with multiple antennas contending with a smart jammer is formulated as a zero-sum game, in which the base station as the leader first chooses the transmit power on multiple antennas, while the jammer as the follower selects the jamming power to interrupt the transmission of the users. A Stackelberg equilibrium of the anti-jamming NOMA transmission game is derived, and the conditions assuring its existence are provided to disclose the impact of multiple antennas and radio channel states. A reinforcement learning based power control scheme is proposed for the downlink NOMA transmission that operates without knowledge of the jamming and radio channel parameters. The Dyna architecture, which formulates a learned world model from real anti-jamming transmission experience, and the hotbooting technique, which exploits experiences in similar scenarios to initialize the quality values, are used to accelerate the learning speed of the Q-learning based power allocation and thus improve the communication efficiency of NOMA transmission in the presence of smart jammers. Simulation results show that the proposed scheme can significantly increase the sum data rate of the users, and thus their utilities, compared with the standard Q-learning based strategy.

Index Terms—NOMA, smart jamming, power allocation, game theory, reinforcement learning

I. INTRODUCTION

By using power-domain multiplexing with successive interference cancellation, non-orthogonal multiple access (NOMA), an important candidate for 5G cellular communications, can significantly improve both the outage performance and the user fairness compared with orthogonal multiple access systems [1]. Multiple-input multiple-output (MIMO) NOMA transmission can achieve even higher spectral efficiency. For instance, the MIMO NOMA system proposed in [2] applies zero-forcing based beamforming and user pairing to reduce the cluster size of the cellular system and the interference, and thus significantly increases the capacity. The MIMO NOMA scheme developed in [3] provides fast transmission of small-size packets subject to the required outage performance.

Copyright (c) 2015 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

This work was supported in part by the National Natural Science Foundation of China under Grant 61671396, and in part by the U.S. National Science Foundation under Grants CCF-1420575, CNS-1456793, ECCS-1307949 and EARS-1444009.

The MIMO-NOMA system implements successive interference cancellation to detect and remove the signals of the weaker users according to downlink cooperative transmission protocols such as those in [4] and [5].

NOMA transmission is vulnerable to jamming attacks, as an attacker can use programmable radio devices such as software defined radios to flexibly control the jamming power according to the ongoing communication [6]–[9]. In an extreme case, all the downlink users in a NOMA system can be blocked simultaneously if a jammer sends strong jamming power on the frequency carrying the user signals. By reducing the signal-to-interference-plus-noise ratio (SINR) of the user signals, jammers can significantly decrease the data rates of the NOMA transmission and even cause denial of service.

Game theoretic studies of anti-jamming communications in wireless networks, such as [9]–[13], have provided insights into the defense against smart jammers. In this paper, the anti-jamming transmissions of a base station (BS) in a MIMO NOMA system are formulated as a zero-sum Stackelberg game. In this game, the BS first determines the transmit power for each user on each antenna, and a smart jammer then chooses the jamming power in an attempt to interrupt the NOMA transmission at a low jamming cost. The Stackelberg equilibrium (SE) of the anti-jamming MIMO NOMA transmission game is derived, showing that the BS tends to allocate more power to the strong user but has to satisfy the minimum rate demand of the weak users under a full jamming-power attack by the smart jammer. Conditions under which the SE exists are provided to disclose the impact of the radio channel states and the number of antennas on the communication efficiency of NOMA transmissions.

Reinforcement learning techniques such as Q-learning can derive the optimal strategy with probability one if all feasible actions are repeatedly sampled in all states of the Markov decision process [14]. Therefore, we propose a Q-learning based power allocation strategy that chooses the transmit power based on the observed state of the radio environment and the jamming power, using a quality function or Q-function that describes the discounted long-term reward for each state-action pair. This scheme enables the BS to derive the optimal policy for multiple users in the dynamic anti-jamming MIMO NOMA game without being aware of the channel model and the jamming model. A hotbooting Q-learning based power allocation exploits the experiences in similar anti-jamming NOMA transmission scenarios to set the initial value of the quality function for each state-action pair, and thus accelerates the convergence of the standard Q-learning algorithm, which always uses zero as the initial Q-value. Meanwhile, the Dyna architecture, which formulates a learned world model from real experience, can be used to increase the learning speed toward the optimal policy [15]. The Dyna based power allocation scheme emulates planning and reactions from hypothetical experiences at the cost of extra computation for updating the Q-values. Simulation results show that the proposed power allocation schemes enable the BS to improve the sum data rate of the users in the downlink NOMA transmission against jamming.

The contributions of this work can be summarized as follows:

• We formulate an anti-jamming MIMO NOMA transmission game and derive the SEs of the game.

• We propose a hotbooting Q-learning based power allocation scheme for multiple users in the dynamic NOMA transmission game against a smart jammer, without knowing the jamming and channel models.

• The Dyna architecture and the hotbooting technique are applied to accelerate the learning speed and thus enhance the anti-jamming NOMA communication efficiency.

The remainder of the paper is organized as follows: related work is reviewed in Section II. The anti-jamming NOMA transmission game is formulated in Section III, and the SE of the game is derived in Section IV. Two reinforcement learning based power allocation schemes are proposed for the dynamic NOMA transmission game in Sections V and VI, respectively. Simulation results are provided in Section VII, and conclusions are drawn in Section VIII.

    II. RELATED WORK

Power allocation is critical for NOMA transmissions. For instance, the low-complexity scheme proposed in [16] improves the ergodic capacity for a given total transmit power constraint. The NOMA power allocation proposed in [17] increases the sum data rate by pairing users with distinctive channel conditions. The joint power allocation and user pairing method proposed in [4] employs the minorization-maximization algorithm to further improve the sum data rate in multiple-input single-output NOMA systems. The power allocation method for multicarrier NOMA transmission designed in [18] exploits the heterogeneity of the service quality demands to perform successive interference cancellation and reduce the power consumption.

User fairness is one of the key challenges in NOMA power allocation. The power allocation based on proportional fairness proposed in [19] improves both the communication efficiency and the user fairness in a two-user NOMA system. A bisection-based iterative NOMA power allocation developed in [20] enhances user fairness with partial channel state information. The cognitive radio inspired power allocation developed in [5] satisfies the quality-of-service requirements of all the users in the NOMA system. A NOMA system that pairs users with independent and identically distributed random channel states was investigated in [21] to reduce the outage probability of each user. However, most existing NOMA power allocation strategies do not provide jamming resistance, especially against smart jamming.

Game theoretic studies of anti-jamming communications provide insights into the design of the power allocation [9]–[11], [22]–[26]. For example, the power allocation game formulated in [9] provides a closed-form solution to improve the ergodic capacity of MIMO transmission under smart jamming. The MIMO transmission game formulated in [10] studies the power allocation strategy for artificial noise against a smart attacker who can choose between active eavesdropping and jamming. The anti-jamming transmission in a peer-to-peer network was formulated as a Bayesian game in [11], in which each jammer is identified based on its attacks in previous time slots. The stochastic communication game analyzed in [25] proposes an intrusion detection system to assist the transmitter and increase the network capacity against both eavesdropping and jamming.

Learning based attacker-type identification developed in [27] improves the transmission capacity under uncertain attack types in time-varying radio channels in a stochastic secure transmission game. The Q-learning based power control proposed in [28] resists smart jamming in cooperative cognitive radio networks. The hotbooting technique proposed in [29] exploits historical experiences to initialize the Q-values and thus accelerates the convergence of Q-learning. Model learning based on tree structures, as proposed in [30], further improves the sample efficiency in stochastic environments and thus the convergence rate of the learning process.

The Q-learning based power control strategy proposed in [31] improves the secure capacity of MIMO transmission against smart attacks. Compared with our previous work in [31], we consider anti-jamming MIMO NOMA transmission and improve the game model by incorporating the sum data rate and the minimum data rate required by each user in the utility function. In addition, a fast Q-based power allocation algorithm that combines the hotbooting technique and the Dyna architecture is proposed to improve the communication efficiency compared with the standard Q-learning based scheme developed in [31].

III. ANTI-JAMMING MIMO NOMA TRANSMISSION GAME

As shown in Fig. 1, we consider a time-slotted NOMA system consisting of M mobile users, each equipped with N_R antennas, that receive signals from a BS in the presence of a smart jammer. The BS uses N_T transmit antennas to send the signal denoted by x^k_m to User m, with m = 1, 2, ..., M and k = 1, 2, 3, .... A superimposed N_T-dimensional signal vector x^k = \sum_{n=1}^{M} x^k_n is sent at time k.

The smart jammer uses N_J antennas to send a jamming signal denoted by z_J with the power constraint p^k_J = \mathbb{E}[z_J^T z_J]. The jamming power p^k_J ∈ [0, P_J] is chosen based on the ongoing downlink transmit power, where P_J is the maximum jamming power.

[Fig. 1. NOMA transmission against a smart jammer: the BS with N_T antennas sends superimposed signals over channels H^k_{B,m} to Users 1, ..., M, each with N_R antennas; User 1 directly detects its signal, User 2 applies SIC to cancel the signal of User 1 before detection, and User M cancels the signals of the M − 1 weaker users; the smart jammer transmits over N_J antennas through channels H^k_{J,m}.]

Let g^k_{j,i,m} denote the channel power gain from the i-th antenna of the BS to the j-th antenna of User m, and let H^k_{B,m} = [g^k_{j,i,m}]_{1≤j≤N_R, 1≤i≤N_T} denote the channel matrix between the BS and User m. The time index k is omitted if no confusion occurs. Without loss of generality, the channel conditions of the users are sorted as ||H_{B,M}|| > ··· > ||H_{B,1}||, where ||X|| denotes the Frobenius norm of a matrix X; i.e., the user rank follows the order of the channel power gains, indicating that User M may stay in the cell-center area while User 1 is at the cell edge. More specifically, the MIMO channel can be viewed as a bundle of independent sub-channels, and thus the channel gain matrices of the users can be ordered by the squared Frobenius norm, according to [4] and [32]. Similarly, H^k_{J,m} denotes the channel matrix between the jammer and User m. Each channel matrix is assumed to have independent and identically distributed (i.i.d.) complex Gaussian elements, i.e., H_{μ,m} ∼ CN(0, σ_{μ,m} I), with μ = B, J and m = 1, 2, ..., M, and the i-th largest eigenvalue of H_{μ,m} H^H_{μ,m} is denoted by h^μ_{m,i}. Thus, the signal received by User m is given by

\[ \mathbf{y}_m = \mathbf{H}_{B,m} \sum_{n=1}^{M} \mathbf{x}_n + \mathbf{H}_{J,m} \mathbf{z}_J + \mathbf{n}_m, \quad m = 1, 2, \cdots, M, \tag{1} \]

where the noise vector n_m consists of N_R additive normalized zero-mean Gaussian noise samples for User m.
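For concreteness, the following minimal NumPy sketch draws a channel matrix with i.i.d. CN(0, σ) entries and extracts the ordered eigenvalues h^μ_{m,i} of H H^H used in the rate expressions below; the function names and array shapes are illustrative assumptions, not from the paper.

```python
import numpy as np

def sample_channel(n_rx, n_tx, sigma, rng):
    """Draw H with i.i.d. circularly symmetric CN(0, sigma) entries."""
    s = np.sqrt(sigma / 2.0)
    return s * (rng.standard_normal((n_rx, n_tx))
                + 1j * rng.standard_normal((n_rx, n_tx)))

def ordered_eigenvalues(H):
    """Eigenvalues h_{m,i} of H H^H, sorted so that i = 1 is the largest."""
    eig = np.linalg.eigvalsh(H @ H.conj().T)  # Hermitian -> real eigenvalues
    return np.sort(eig)[::-1]

rng = np.random.default_rng(0)
H_B1 = sample_channel(n_rx=3, n_tx=3, sigma=10.0, rng=rng)  # BS -> User 1
h_B1 = ordered_eigenvalues(H_B1)                            # h^B_{1,i}
```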

Let θ^k_m be the power allocation coefficient for User m, i.e., the allocated transmit power is P_T θ^k_m. A BS has difficulty obtaining accurate instantaneous channel state information in time in the time-varying wireless environment, and thus equally allocates the transmit power P_T over the N_T antennas, as mentioned in [16]. On the other hand, the BS can obtain statistics of the users' channel states via feedback from the users and decide the signal decoding order accordingly. Thus the transmit covariance matrix of User m, denoted by Q_m, is given by

\[ \mathbf{Q}_m = \mathbb{E}(\mathbf{x}_m \mathbf{x}_m^H) = \frac{\theta_m P_T}{N_T} \mathbf{I}_{N_T}, \tag{2} \]

where a superscript H denotes the conjugate transpose. In this game, the power allocation vector chosen by the BS for all M users is denoted by θ^k = [θ^k_m]_{1≤m≤M}, with the total power constraint \sum_{1≤m≤M} θ^k_m = 1. For simplicity, let Ω denote the space of all feasible power allocation vectors, with θ ∈ Ω.

On the other hand, without knowing the downlink channel power gains of the users, the jammer has to equally allocate the jamming power, with the jamming covariance matrix given by

\[ \mathbf{Q}_J = \mathbb{E}(\mathbf{z}_J \mathbf{z}_J^H) = \frac{p_J}{N_J} \mathbf{I}_{N_J}. \tag{3} \]

TABLE I
SUMMARY OF SYMBOLS AND NOTATION

M : Number of users
N_T / N_R : Number of BS transmit / user receive antennas
N_J : Number of jamming antennas
x^k : Superimposed transmit signal for the users
H^k_{μ,m} : Channel matrix
h^μ_{m,i} : i-th eigenvalue of the channel matrix
σ_{μ,m} : Channel matrix covariance
θ^k = [θ^k_m]_{1≤m≤M} : Power allocation coefficient vector for the users
Ω : Space of feasible power allocation vectors
P_J : Maximum jamming power
u : Utility of the BS
γ : Cost coefficient of the jammer
ξ : Space of feasible states (quantized SINR vectors)
R_0 : Minimum data rate required by a user
L : Number of SINR quantization levels
I : Number of emulated environments
T : Size of the training data
Φ : Observation counter
τ : Modeled reward
τ′ : Reward record
Ψ : State transition probability

The design of effective precoding and successive interference cancellation (SIC) for MIMO-NOMA systems has been studied in works such as [4] and [5]. More specifically, the downlink NOMA transmission scheme developed in [4] increases the sum data rate under a decodability constraint, using a simplified SIC method that orders the users by the norm of the channel gain vector. The signal alignment scheme developed in [5] decomposes the multi-user MIMO-NOMA scenario into multiple separate single-antenna NOMA channels and orders the users by the large-scale channel fading gain with user pairing.

In this work, we apply a successive interference cancellation scheme that orders the users according to their channel power gains under jamming, similar to [16] and [33]. The decoding order given by the Frobenius norms of the channel matrices is optimal if the jamming power is not strong enough to change the SIC order, i.e., if h^B_m / h^J_m ≥ h^B_j / h^J_j holds for any two users m > j. If this assumption does not hold, because the jamming power received by one user is much stronger than that received by the user who decodes afterwards, the NOMA system instead ranks the decoding order by the channel power gain normalized by the jamming power, i.e., User m decodes before User j if h^B_m / h^J_m ≥ h^B_j / h^J_j with m > j, unless otherwise specified.
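As a sketch of this ordering rule, suppose each user's channels are summarized by scalar gains h^B_m and h^J_m (a simplification for illustration; the per-eigenvalue case follows the same pattern). Users are then ranked by the jamming-normalized gain:

```python
import numpy as np

def sic_user_order(h_B, h_J):
    """Rank users by h^B_m / h^J_m in ascending order, so the last index
    plays the role of User M (decoded with all weaker signals cancelled)."""
    ratio = np.asarray(h_B, dtype=float) / np.asarray(h_J, dtype=float)
    return np.argsort(ratio)

# Example: ratios are [2.0, 2.0, 6.0], so the third user is ranked last.
order = sic_user_order(h_B=[2.0, 5.0, 9.0], h_J=[1.0, 2.5, 1.5])
print(order)  # [0, 1, 2]
```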

According to the singular value decomposition of the channel matrix H_{μ,m}, μ = B, J, and the assumption of Gaussian distributed signals, the achievable data rate of User m, denoted by R_m, depends on the SINR of the signal and is given by [16], [33], for all 1 ≤ m ≤ M − 1, as

\[
R_m = \log_2 \det\!\left( \mathbf{I} + \Big( \mathbf{I} + \mathbf{H}_{B,m} \sum_{n=m+1}^{M} \mathbf{Q}_n \mathbf{H}_{B,m}^H + \mathbf{H}_{J,m}\mathbf{Q}_J\mathbf{H}_{J,m}^H \Big)^{-1} \mathbf{H}_{B,m}\mathbf{Q}_m\mathbf{H}_{B,m}^H \right)
= \sum_{i=1}^{N_R} \log_2\!\left( 1 + \frac{N_J \theta_m P_T h^B_{m,i}}{N_T N_J + N_J \sum_{n=m+1}^{M} \theta_n P_T h^B_{m,i} + N_T p_J h^J_{m,i}} \right). \tag{4}
\]

Applying interference cancellation as illustrated in [33], User M, with the best channel condition, can subtract the signals of all the other M − 1 users from the superimposed signal x. Thus the data rate of User M, denoted by R_M, is given by

\[
R_M = \log_2 \det\!\left( \mathbf{I} + \big( \mathbf{I} + \mathbf{H}_{J,M}\mathbf{Q}_J\mathbf{H}_{J,M}^H \big)^{-1} \mathbf{H}_{B,M}\mathbf{Q}_M\mathbf{H}_{B,M}^H \right)
= \sum_{i=1}^{N_R} \log_2\!\left( 1 + \frac{N_J \theta_M P_T h^B_{M,i}}{N_T N_J + N_T p_J h^J_{M,i}} \right). \tag{5}
\]

Note that the received jamming signal is viewed as noise and does not change the decoding sequence under the successive interference cancellation strategy in [33].
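The closed forms in (4) and (5) are straightforward to evaluate numerically. Below is a sketch under the paper's notation, with users indexed weak to strong so that User M is the last row; the array shapes are assumptions made for illustration.

```python
import numpy as np

def noma_rates(theta, P_T, p_J, h_B, h_J, N_T, N_J):
    """Data rates R_1, ..., R_M from (4) and (5).

    theta    : (M,) power allocation coefficients summing to 1
    h_B, h_J : (M, N_R) eigenvalues h^B_{m,i} and h^J_{m,i}
    """
    theta = np.asarray(theta, dtype=float)
    h_B, h_J = np.asarray(h_B, float), np.asarray(h_J, float)
    M = len(theta)
    rates = np.zeros(M)
    for m in range(M):
        # interference from the not-yet-cancelled stronger users n > m;
        # this term is empty for User M, which recovers (5) from (4)
        residual = theta[m + 1:].sum() * P_T * h_B[m]
        sinr = (N_J * theta[m] * P_T * h_B[m]
                / (N_T * N_J + N_J * residual + N_T * p_J * h_J[m]))
        rates[m] = np.sum(np.log2(1.0 + sinr))
    return rates
```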

By exploiting the difference in channel power gains between the users, the BS usually increases the power allocation coefficients of the weak users to satisfy their quality of service (QoS) requirements, such as the minimum data rate demand denoted by R_0. Therefore, the NOMA transmission has to ensure min(R_1, ···, R_M) ≥ R_0.

The jammer aims to interrupt the NOMA transmission by making at least one user's data rate fall below the QoS requirement, i.e., R_m < R_0 for some 1 ≤ m ≤ M, as captured by the first term in (6). Failing that, the jammer seeks to reduce the overall data rate with less jamming power while avoiding detection, as indicated in (6). Let γ denote the unit jamming cost, which accounts for the jamming power consumption and the risk of being detected, and depends on the intrusion detection methods used by the BS.

The anti-jamming NOMA transmission is formulated as a zero-sum Stackelberg game denoted by G, in which the BS as the leader first chooses the power allocation coefficient vector θ = [θ_m]_{1≤m≤M} to improve the system throughput under the user rate constraint, and the smart jammer as the follower then chooses the jamming power in an attempt to interrupt the NOMA transmission at a low jamming cost.

The utility of the BS, denoted by u, depends on the sum data rate of the M users and the jamming cost, and is given by

\[ u = \mathbb{I}\Big( \min_{1\le m\le M} R_m \ge R_0 \Big) \left( \sum_{m=1}^{M} R_m + \gamma p_J \right), \tag{6} \]

where \mathbb{I}(·) is the indicator function that takes value 1 if the event is true and 0 otherwise; the transmission fails if the QoS is not satisfied. In summary, we have an anti-jamming NOMA transmission game given by G = ⟨{B, J}, {θ, p_J}, {u, −u}⟩. For ease of reference, the commonly used notation is summarized in Table I.
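A direct transcription of (6), reusing the hypothetical noma_rates helper above; the transmission is counted as failed (u = 0) whenever any user misses the rate demand R_0:

```python
import numpy as np

def bs_utility(theta, P_T, p_J, h_B, h_J, N_T, N_J, R0, gamma):
    """Zero-sum payoff u in (6); the jammer's payoff is -u."""
    rates = noma_rates(theta, P_T, p_J, h_B, h_J, N_T, N_J)
    if rates.min() < R0:      # indicator I(min_m R_m >= R0) evaluates to 0
        return 0.0
    return float(rates.sum() + gamma * p_J)
```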

IV. SE OF THE NOMA TRANSMISSION GAME

At a Stackelberg equilibrium of the anti-jamming NOMA transmission game, the BS as the leader first chooses its power allocation strategy for all downlink users to maximize its utility given by (6), taking into account the response of the smart jammer. The jammer as the follower then decides the jamming power to minimize the utility u based on the observed ongoing transmission. The SE strategies of the NOMA transmission game, denoted by (θ*, p*_J(θ*)), are given by definition as

\[ p_J^*(\boldsymbol{\theta}) = \arg\min_{0 \le p_J \le P_J} u(\boldsymbol{\theta}, p_J) \tag{7} \]

\[ \boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta} \in \Omega} u(\boldsymbol{\theta}, p_J^*(\boldsymbol{\theta})). \tag{8} \]

In the following, we first consider the transmission game for a BS with N_T × N_R MIMO serving M users, and then the specific case with 2 users, to evaluate the SE strategies under different anti-jamming NOMA transmission scenarios.
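On the quantized strategy sets used later in the paper, the SE in (7)-(8) can be approximated by backward induction over finite grids: the follower's best response (7) is computed for each candidate θ, and the leader then maximizes (8). A sketch, with utility(theta, p_J) standing for a two-argument closure over the hypothetical bs_utility above:

```python
import numpy as np

def stackelberg_equilibrium(theta_grid, pJ_grid, utility):
    """Approximate SE of the zero-sum game on finite action grids."""
    best = (None, None, -np.inf)
    for theta in theta_grid:
        # follower best response (7): the jammer minimizes u given theta
        p_star = min(pJ_grid, key=lambda p: utility(theta, p))
        u = utility(theta, p_star)
        if u > best[2]:            # leader step (8): maximize over theta
            best = (theta, p_star, u)
    return best                    # (theta*, p_J*(theta*), u at the SE)
```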

Lemma 1. The anti-jamming N_T × N_R NOMA transmission game with M users has a unique SE (θ*, p*_J) given by (9)-(10), if h^B_{m,i} h^J_{M,i} < h^B_{M,i} h^J_{m,i}, ∀1 ≤ m ≤ M − 1, ∀1 ≤ i ≤ N_R, and (11) holds, where

\[ \sum_{i=1}^{N_R} \log_2\!\left( 1 + \frac{N_J \theta_m^* P_T h^B_{m,i}}{N_T N_J + N_J \big(1 - \sum_{n=1}^{m} \theta_n^*\big) P_T h^B_{m,i} + N_T P_J h^J_{m,i}} \right) = R_0, \quad \forall 1 \le m \le M-1 \tag{9} \]

\[ \sum_{i=1}^{N_R} \left[ \sum_{m=1}^{M-1} \frac{N_T N_J \theta_m^* P_T h^J_{m,i} h^B_{m,i}}{\Big(N_T N_J + N_J \sum_{n=m+1}^{M} \theta_n^* P_T h^B_{m,i} + N_T p_J^* h^J_{m,i}\Big)\Big(N_T N_J + N_J \sum_{n=m}^{M} \theta_n^* P_T h^B_{m,i} + N_T p_J^* h^J_{m,i}\Big)} + \frac{N_J \big(1 - \sum_{n=1}^{M-1} \theta_n^*\big) P_T h^B_{M,i} h^J_{M,i}}{\big(N_J + p_J^* h^J_{M,i}\big)\Big(N_T N_J + N_J \big(1 - \sum_{n=1}^{M-1} \theta_n^*\big) P_T h^B_{M,i} + N_T p_J^* h^J_{M,i}\Big)} \right] = \gamma \ln 2 \tag{10} \]

\[ P_T \sum_{i=1}^{N_R} \left[ \sum_{m=1}^{M-1} \frac{N_T \theta_m^* h^J_{m,i} h^B_{m,i}}{\Big(N_T + \sum_{n=m+1}^{M} \theta_n^* P_T h^B_{m,i}\Big)\Big(N_T + \sum_{n=m}^{M} \theta_n^* P_T h^B_{m,i}\Big)} + \frac{\big(1 - \sum_{n=1}^{M-1} \theta_n^*\big) h^B_{M,i} h^J_{M,i}}{N_T + \big(1 - \sum_{n=1}^{M-1} \theta_n^*\big) P_T h^B_{M,i}} \right] > N_J \gamma \ln 2. \tag{11} \]

Proof. See Appendix A.

According to (9)-(10), the SE of the anti-jamming NOMA transmission game depends on the QoS requirement of the transmission (R_0), the numbers of antennas (N_T and N_R), the channel conditions of the weak users, and the maximum jamming power (P_J). If the jammer has good channel conditions to the weaker users, the BS allocates more power to the strongest user (i.e., User M) to improve the sum data rate, while meeting the QoS of the weaker users. If the total transmit power is too low for the transmission, the jammer applies full jamming power to block all M users in terms of the transmission QoS.

Corollary 1. The anti-jamming N_T × N_R NOMA transmission game G with M users has a unique SE (θ*, 0), where θ* is given by, ∀1 ≤ m ≤ M − 1,

\[ \sum_{i=1}^{N_R} \log_2\!\left( 1 + \frac{\theta_m^* P_T h^B_{m,i}}{N_T + \big(1 - \sum_{n=1}^{m} \theta_n^*\big) P_T h^B_{m,i}} \right) = R_0, \tag{18} \]

if h^B_{m,i} h^J_{M,i} < h^B_{M,i} h^J_{m,i}, ∀1 ≤ m ≤ M − 1, ∀1 ≤ i ≤ N_R, and

\[ P_T \sum_{i=1}^{N_R} \left[ \sum_{m=1}^{M-1} \frac{N_T \theta_m^* h^J_{m,i} h^B_{m,i}}{\Big(N_T + \sum_{n=m+1}^{M} \theta_n^* P_T h^B_{m,i}\Big)\Big(N_T + \sum_{n=m}^{M} \theta_n^* P_T h^B_{m,i}\Big)} + \frac{\big(1 - \sum_{n=1}^{M-1} \theta_n^*\big) h^B_{M,i} h^J_{M,i}}{N_T + \big(1 - \sum_{n=1}^{M-1} \theta_n^*\big) P_T h^B_{M,i}} \right] < N_J \gamma \ln 2. \tag{19} \]

Proof. See Appendix B.

If the jamming cost is high, or the jamming channel experiences severe fading as captured by (19), a smart jammer will keep silent to reduce its power consumption. In addition, as the jamming power at the SE also decreases with N_T, MIMO significantly improves the jamming resistance.

Corollary 2. The anti-jamming N_T × N_R NOMA transmission game G with M = 2 users has a unique SE (θ*_1, p*_J) given by (12)-(13), if h^B_{1,i} h^J_{2,i} < h^J_{1,i} h^B_{2,i}, ∀1 ≤ i ≤ N_R, and (14) holds, where

\[ \sum_{i=1}^{N_R} \log_2\!\left( 1 + \frac{N_J \theta_1^* P_T h^B_{1,i}}{N_T N_J + N_J (1-\theta_1^*) P_T h^B_{1,i} + N_T P_J h^J_{1,i}} \right) = R_0 \tag{12} \]

\[ \sum_{i=1}^{N_R} \left( \frac{N_T N_J \theta_1^* P_T h^J_{1,i} h^B_{1,i}}{\big(N_T N_J + N_J (1-\theta_1^*) P_T h^B_{1,i} + N_T p_J^* h^J_{1,i}\big)\big(N_T N_J + N_J P_T h^B_{1,i} + N_T p_J^* h^J_{1,i}\big)} + \frac{N_J (1-\theta_1^*) P_T h^B_{2,i} h^J_{2,i}}{\big(N_J + p_J^* h^J_{2,i}\big)\big(N_T N_J + N_J (1-\theta_1^*) P_T h^B_{2,i} + N_T p_J^* h^J_{2,i}\big)} \right) = \gamma \ln 2 \tag{13} \]

\[ P_T \sum_{i=1}^{N_R} \left( \frac{N_T \theta_1^* h^J_{1,i} h^B_{1,i}}{\big(N_T + (1-\theta_1^*) P_T h^B_{1,i}\big)\big(N_T + P_T h^B_{1,i}\big)} + \frac{(1-\theta_1^*) h^B_{2,i} h^J_{2,i}}{N_T + (1-\theta_1^*) P_T h^B_{2,i}} \right) > N_J \gamma \ln 2. \tag{14} \]

Proof. See Appendix C.

As a concrete example, the utility of the BS in Fig. 2 is maximized if the rate demand of the weak user is satisfied under full jamming power (i.e., (12)). Moreover, the feasible allocation coefficient for the strong user increases with the total transmit power P_T, while the jamming power at the SE decreases with it.

[Fig. 2. Utility of the BS versus the power allocation coefficient and the jamming power, with the SE marked: (a) P_T = 20 W; (b) P_T = 40 W. M = 2, N_T = N_R = 3, N_J = 2, σ_{B,1} = 10 dB, σ_{B,2} = 20 dB, σ_{J,1} = σ_{J,2} = 12 dB, P_J = 20 W, R_0 = 1 bps/Hz, γ = 0.5.]

Figure 3 presents the average SINR and the sum data rate of the 2 users at the SE versus the channel power gain of User 1, σ_{B,1}. More specifically, the average SINR of the users increases with σ_{B,1} and with the number of transmit antennas N_T, because the jammer fails to interrupt the user experience and decreases the jamming power to reduce its cost. Therefore, the sum data rate of the M users improves with σ_{B,1} and N_T.

[Fig. 3. Performance of the anti-jamming transmission game for the MIMO NOMA system at the SE versus the channel gain of User 1, σ_{B,1}: (a) average SINR of the users; (b) sum data rate of the users. M = 2, N_J = 3, σ_{J,1} = σ_{J,2} = 12 dB, P_T = P_J = 20 W, R_0 = 1 bps/Hz, γ = 0.5; curves shown for σ_{B,2} = 18, 19, 20 dB and N_T = N_R = 4, 6.]

Corollary 3. The anti-jamming N_T × N_R NOMA transmission game G with M = 2 users has a unique SE (θ*_1, p*_J) given by (15)-(16), if h^B_{1,i} h^J_{2,i} > h^J_{1,i} h^B_{2,i} + (h^B_{2,i} − h^B_{1,i}) N_J / p*_J, ∀1 ≤ i ≤ N_R, and (17) holds, where

\[ \sum_{i=1}^{N_R} \log_2\!\left( 1 + \frac{N_J (1-\theta_1^*) P_T h^B_{2,i}}{N_T N_J + N_T p_J^* h^J_{2,i}} \right) = R_0 \tag{15} \]

\[ \sum_{i=1}^{N_R} \left( \frac{N_T N_J \theta_1^* P_T h^J_{1,i} h^B_{1,i}}{\big(N_T N_J + N_J (1-\theta_1^*) P_T h^B_{1,i} + N_T p_J^* h^J_{1,i}\big)\big(N_T N_J + N_J P_T h^B_{1,i} + N_T p_J^* h^J_{1,i}\big)} + \frac{N_J (1-\theta_1^*) P_T h^B_{2,i} h^J_{2,i}}{\big(N_J + p_J^* h^J_{2,i}\big)\big(N_T N_J + N_J (1-\theta_1^*) P_T h^B_{2,i} + N_T p_J^* h^J_{2,i}\big)} \right) = \gamma \ln 2 \tag{16} \]

\[ P_T \sum_{i=1}^{N_R} \left( \frac{N_T \theta_1^* h^J_{1,i} h^B_{1,i}}{\big(N_T + (1-\theta_1^*) P_T h^B_{1,i}\big)\big(N_T + P_T h^B_{1,i}\big)} + \frac{(1-\theta_1^*) h^B_{2,i} h^J_{2,i}}{N_T + (1-\theta_1^*) P_T h^B_{2,i}} \right) > N_J \gamma \ln 2. \tag{17} \]

Proof. See Appendix D.

If the jamming channel condition to the stronger user (i.e., User 2) is good, as in the condition above, the sum data rate increases when more power is allocated to the weak user (i.e., User 1). On the other hand, the smart jammer, as a follower observing the ongoing transmission, will apply its full jamming power to interrupt the communication of the stronger user if the transmit power is not high enough, as in (15).

V. NOMA POWER ALLOCATION BASED ON HOTBOOTING Q-LEARNING

The repeated interactions between a BS and a smart jammer can be formulated as a dynamic anti-jamming NOMA transmission game. The optimal downlink power allocation depends on the locations of the users, the radio channel states, and the jamming parameters in each time slot, which are challenging for a BS to estimate accurately. In addition, the power allocation strategy of the BS affects the jamming strategy and the received SINR of the users, and thus the anti-jamming NOMA transmission process can be formulated as a finite Markov decision process. Therefore, the NOMA system can apply Q-learning, a widely used model-free reinforcement learning technique, to the power allocation without being aware of the instantaneous channel state information at the base station. Instead of relying on instantaneous channel state information, the NOMA system uses statistical channel state information and the transmission experience to determine the power allocation.

The proposed power allocation depends on the quality function or Q-function, i.e., the expected discounted reward for each state-action pair. The state at time k that reflects the environment dynamics, denoted by s^k, is chosen as the previously received SINR of each user, i.e., s^k = [SINR^{k−1}_m]_{1≤m≤M} ∈ ξ, where ξ is the space of all possible SINR vectors. More specifically, each user quantizes the received SINR into one of L levels for simplicity and sends the quantized value back to the BS as the observed state for the next time slot. Therefore, the BS uses this quantized M-dimensional vector as feedback information to update its Q-function in the Q-learning based NOMA system.
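A small sketch of this state construction; the SINR range and the clipping are assumptions made for illustration:

```python
import numpy as np

def quantize_state(sinr_db, L, sinr_max_db=30.0):
    """Quantize each user's received SINR into one of L levels; the tuple
    of levels is the state s^k fed back to the BS."""
    levels = np.floor(np.asarray(sinr_db, float) / sinr_max_db * L)
    return tuple(np.clip(levels, 0, L - 1).astype(int))

s = quantize_state([8.5, 14.2, 22.9], L=10)   # -> (2, 4, 7)
```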

For simplicity, the feasible action set of the BS in the power allocation is also quantized: each transmit power coefficient takes one of L + 1 levels, θ^k_m ∈ {l/L}_{0≤l≤L}, ∀1 ≤ m ≤ M, and the power allocation strategy of the BS is denoted by θ^k = [θ^k_m]_{1≤m≤M} ∈ Ω. The transmit signal for the M users is sent with the power allocation given by θ^k.

Note that the random exploration of standard Q-learning at the beginning of the game, due to the all-zero Q-value initialization, usually requires exponentially more data than the optimal policy [34]. Therefore, we propose a hotbooting technique, which initializes the Q-values based on training data obtained in advance from large-scale experiments in similar scenarios. As we will see, the hotbooting Q-learning based power allocation can decrease the useless random exploration in the initial stage of the game, and thus accelerate the learning speed of the Q-learning scheme in the dynamic game and improve the communication efficiency of the anti-jamming NOMA transmission.

Algorithm 1: Hotbooting preparation
1: Q(s, θ) = 0, V(s) = 0, ∀s ∈ ξ, ∀θ ∈ Ω, and s^0 = 0
2: for i = 1, 2, ..., I do
3:   Emulate a similar environment
4:   for k = 1, 2, ..., K do
5:     Choose action θ^k at random
6:     Obtain the utility u^k and the received SINR of each user, SINR^k_{1≤m≤M}
7:     s^{k+1} = [SINR^k_m]_{1≤m≤M}
8:     Q*(s^k, θ^k) ← (1 − α) Q*(s^k, θ^k) + α (u^k + δ V*(s^{k+1}))
9:     V*(s^k) = max_{θ∈Ω} Q*(s^k, θ)
10:  end for
11: end for
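A compact sketch of Algorithm 1, with the emulated environment abstracted behind a hypothetical interface (env_factory() returns an environment with reset() -> state and step(action) -> (utility, next_state)); everything else follows lines 1-11 above:

```python
import numpy as np
from collections import defaultdict

def hotboot(env_factory, actions, I=200, K=500, alpha=0.2, delta=0.7, seed=0):
    """Pre-train Q* by random exploration over I emulated jamming scenarios."""
    rng = np.random.default_rng(seed)
    Q = defaultdict(float)                 # Q*(s, theta), zero-initialized
    V = defaultdict(float)                 # V*(s)
    for _ in range(I):                     # line 2
        env = env_factory()                # line 3: emulate a similar scenario
        s = env.reset()
        for _ in range(K):                 # line 4
            a = actions[rng.integers(len(actions))]       # line 5
            u, s_next = env.step(a)        # lines 6-7
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (u + delta * V[s_next])
            V[s] = max(Q[s, b] for b in actions)          # line 9
            s = s_next
    return Q
```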

The preparation stage of the hotbooting technique, as presented in Algorithm 1, performs I anti-jamming MIMO NOMA transmission experiments before the start of the game. The number of experiments I is chosen as a tradeoff between the convergence rate and the risk of overfitting to a few special scenarios. In each experiment, the BS randomly selects the transmit power for the M users at a given state in similar radio transmission scenarios against jammers, and observes the resulting transmission status and the received utility. Let Q(s, θ) denote the quality or Q-function of the BS for system state s and action θ, which is the expected discounted long-term reward observed by the BS. The discount factor δ ∈ [0, 1] is chosen according to the uncertainty regarding future gains. As shown in Algorithm 2, at each time slot the BS updates both the Q-function and the value function, given by

\[ Q(\mathbf{s}^k, \boldsymbol{\theta}^k) \leftarrow (1-\alpha)\, Q(\mathbf{s}^k, \boldsymbol{\theta}^k) + \alpha \big( u(\mathbf{s}^k, \boldsymbol{\theta}^k) + \delta V(\mathbf{s}^{k+1}) \big), \tag{20} \]

where the learning rate α ∈ (0, 1] represents the weight of the current experience in the learning process. The value function V(s) is the maximum of Q(s, θ) over all available power allocation vectors, given by

\[ V(\mathbf{s}^k) = \max_{\boldsymbol{\theta} \in \Omega} Q(\mathbf{s}^k, \boldsymbol{\theta}). \tag{21} \]

The resulting Q-function, denoted by Q*, is used to initialize the Q-values in the hotbooting Q-learning based scheme.

During the learning process summarized in Algorithm 2, the tradeoff between exploitation and exploration has an important impact on the convergence performance. Therefore, according to the ε-greedy policy, the power allocation coefficient vector θ^k is chosen based on the system state s^k and the Q-function, and is given by

\[ \Pr\big(\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}\big) = \begin{cases} 1-\varepsilon, & \hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}' \in \Omega} Q(\mathbf{s}^k, \boldsymbol{\theta}') \\ \dfrac{\varepsilon}{|\Omega| - 1}, & \text{otherwise.} \end{cases} \tag{22} \]
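A one-function sketch of the ε-greedy rule (22), assuming the Q-table is a dict keyed by (state, action) pairs as in the snippets above:

```python
import numpy as np

def epsilon_greedy(Q, state, actions, eps, rng):
    """Sample an action per (22): the greedy action with probability
    1 - eps, otherwise one of the remaining |Omega| - 1 actions uniformly."""
    greedy = max(actions, key=lambda a: Q[state, a])
    if rng.random() < 1.0 - eps or len(actions) == 1:
        return greedy
    others = [a for a in actions if a != greedy]
    return others[rng.integers(len(others))]
```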

Algorithm 2: Hotbooting Q-learning based NOMA power allocation
1: Set Q(s, θ) = Q*(s, θ), ∀s ∈ ξ, ∀θ ∈ Ω, and s^0 = 0
2: for k = 1, 2, 3, ... do
3:   Choose θ^k via (22)
4:   for m = 1, 2, ..., M do
5:     Allocate power θ^k_m P_T to the signal for User m
6:   end for
7:   Send the superimposed signal x^k over the N_T antennas
8:   Observe SINR^k_{1≤m≤M} and the utility u^k
9:   s^{k+1} = [SINR^k_m]_{1≤m≤M}
10:  Update Q(s^k, θ^k) via (20)
11:  Update V(s^k) via (21)
12: end for
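Putting the pieces together, a sketch of the online loop of Algorithm 2, reusing the hypothetical hotboot and epsilon_greedy helpers and the same environment interface (actions are tuples of quantized coefficients so they can key the Q-table):

```python
import numpy as np
from collections import defaultdict

def run_hotbooted_q(env, Q, actions, alpha=0.2, delta=0.7, eps=0.9,
                    n_slots=5000, seed=1):
    """Algorithm 2: online NOMA power allocation from the hotbooted Q*."""
    rng = np.random.default_rng(seed)
    V = defaultdict(float)
    for (s, a), q in Q.items():            # rebuild V(s) = max_a Q(s, a)
        V[s] = max(V[s], q)
    s = env.reset()
    for _ in range(n_slots):
        theta = epsilon_greedy(Q, s, actions, eps, rng)   # line 3, via (22)
        u, s_next = env.step(theta)        # lines 4-9: transmit and observe
        Q[s, theta] = ((1 - alpha) * Q[s, theta]
                       + alpha * (u + delta * V[s_next]))  # line 10, via (20)
        V[s] = max(Q[s, a] for a in actions)               # line 11, via (21)
        s = s_next
    return Q, V
```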

With the proposed RL based power allocation scheme, the BS does not know the instantaneous channel state information, the jamming power, or the jamming channel gains, which are required by global optimization algorithms such as [35]. Instead, the BS learns the jamming strategy from the anti-jamming NOMA transmission history and derives the transmit power allocation strategy that improves the long-term anti-jamming communication efficiency.

However, the convergence time of Algorithm 2 depends on the size of the state-action space |ξ × Ω|, which increases exponentially with the number of users and the quantization levels of the power allocation coefficients. The channel time variation, on the other hand, requires a much faster learning speed for the power allocation. Therefore, in the following section we propose a fast Q-learning based power allocation scheme that combines the hotbooting technique and the Dyna architecture to improve the communication efficiency of Algorithm 2 in dynamic radio environments.

VI. NOMA POWER ALLOCATION BASED ON FAST Q-LEARNING

The Dyna architecture formulates a learned world model from the real anti-jamming NOMA transmission experiences to accelerate the learning speed of Q-learning in dynamic radio environments. More specifically, we apply the Dyna architecture, which emulates planning and reactions from hypothetical experiences, to improve the anti-jamming efficiency of the fast Q-learning based NOMA power allocation algorithm. The algorithm is summarized in Algorithm 3, and its main structure is illustrated in Fig. 4.

The fast Q-learning power allocation algorithm also applies Algorithm 1 to initialize the Q-values with the hotbooting technique, and updates the Q-function via (20) and (21) in each time slot according to the NOMA transmission result, similarly to Algorithm 2. The BS observes the received SINR of each user in the last time slot, SINR^{k−1}_{1≤m≤M}, as the state s^k, i.e., s^k = [SINR^{k−1}_m]_{1≤m≤M}. The power allocation strategy θ^k is also chosen according to the system state and the Q-function based on the ε-greedy policy, as shown in (22).

[Fig. 4. Illustration of the fast Q-based NOMA power allocation scheme: after training in similar NOMA transmission environments, the BS selects an action with the ε-greedy policy, sends the superimposed signal x^k into the wireless network with jammers, observes SINR^k_{1≤m≤M}, evaluates the reward u^k from the real experience, updates the Q-values, and records the real experience (s^k, s^{k+1}, u^k); the Dyna architecture builds (Φ, Φ′, τ, τ′, Ψ) and generates D hypothetical experiences for additional updates.]

Meanwhile, the Dyna-Q based algorithm uses the real experiences of the NOMA transmission at time k to build a hypothetical anti-jamming NOMA transmission model. More specifically, the real experience of the NOMA transmission at time k is saved as an experience record for the corresponding state-action pair (s^k, θ^k). The experience record stored at the BS consists of the occurrence counter of the state-action pairs, denoted by Φ, the occurrence counter of the next state, denoted by Φ′, the reward record τ′, and the modeled reward τ. Hypothetical experiences are generated according to the real-experience record (Φ′, Φ, τ′, τ, Ψ). As shown in Algorithm 3, the counter of the next state Φ′ is updated according to the anti-jamming NOMA transmission at time k, i.e.,

\[ \Phi'(\mathbf{s}^k, \boldsymbol{\theta}^k, \mathbf{s}^{k+1}) \leftarrow \Phi'(\mathbf{s}^k, \boldsymbol{\theta}^k, \mathbf{s}^{k+1}) + 1. \tag{23} \]

The occurrence counter Φ, defined as the sum of Φ′ over all feasible next states, is then updated as

\[ \Phi(\mathbf{s}^k, \boldsymbol{\theta}^k) = \sum_{\mathbf{s}^{k+1} \in \xi} \Phi'(\mathbf{s}^k, \boldsymbol{\theta}^k, \mathbf{s}^{k+1}). \tag{24} \]

The reward record from the real experience, τ′, in this time slot is the utility of the BS at time k, given by

\[ \tau'\big(\mathbf{s}^k, \boldsymbol{\theta}^k, \Phi(\mathbf{s}^k, \boldsymbol{\theta}^k)\big) = u(\mathbf{s}^k, \boldsymbol{\theta}^k). \tag{25} \]

Based on the reward record τ′ in (25) and the state transition from s^k to s^{k+1}, we formulate a hypothetical anti-jamming NOMA transmission model to generate hypothetical NOMA transmission experiences, in which the reward to the BS, denoted by τ, is defined as the utility of the BS averaged over all previous real experiences and is given by

\[ \tau(\mathbf{s}^k, \boldsymbol{\theta}^k) = \frac{1}{\Phi(\mathbf{s}^k, \boldsymbol{\theta}^k)} \sum_{n=1}^{\Phi(\mathbf{s}^k, \boldsymbol{\theta}^k)} \tau'\big(\mathbf{s}^k, \boldsymbol{\theta}^k, n\big). \tag{26} \]
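These bookkeeping rules map directly onto a small model class. A sketch with dictionaries standing in for Φ′, Φ, τ′, and τ, including the transition probability Ψ of (27) below as the normalized counter (the class layout is an illustrative assumption):

```python
from collections import defaultdict

class WorldModel:
    """Experience record (Phi', Phi, tau', tau, Psi) of the Dyna scheme."""
    def __init__(self):
        self.next_count = defaultdict(int)   # Phi'(s, a, s')
        self.count = defaultdict(int)        # Phi(s, a)
        self.rewards = defaultdict(list)     # tau'(s, a, n), n = 1..Phi

    def record(self, s, a, u, s_next):
        self.next_count[s, a, s_next] += 1   # (23)
        self.count[s, a] += 1                # (24), maintained incrementally
        self.rewards[s, a].append(u)         # (25)

    def tau(self, s, a):
        """Modeled reward (26): utility averaged over real experiences."""
        r = self.rewards[s, a]
        return sum(r) / len(r) if r else 0.0

    def psi(self, s, a, s_next):
        """Transition probability (27); zero for unvisited pairs."""
        n = self.count[s, a]
        return self.next_count[s, a, s_next] / n if n else 0.0
```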

Algorithm 3: NOMA power allocation with fast Q-learning
1: Set Q(s, θ) = Q*(s, θ), ∀s ∈ ξ, ∀θ ∈ Ω, and s^0 = 0
2: for k = 1, 2, 3, ... do
3:   Choose θ^k = [θ^k_m]_{1≤m≤M} via (22)
4:   for m = 1, 2, ..., M do
5:     Allocate power θ^k_m P_T to the signal for User m
6:   end for
7:   Send x^k according to θ^k over the N_T antennas
8:   Observe SINR^k_{1≤m≤M} and the utility u^k
9:   s^{k+1} = [SINR^k_m]_{1≤m≤M}
10:  Update Q(s^k, θ^k) via (20)
11:  Update V(s^k) via (21)
12:  Update Φ′(s^k, θ^k, s^{k+1}) via (23)
13:  Update Φ(s^k, θ^k) via (24)
14:  Update Ψ(s^k, θ^k, s^{k+1}) via (27)
15:  Update τ′(s^k, θ^k, Φ(s^k, θ^k)) via (25)
16:  Randomly select ŝ^0 ∈ ξ and θ̂^0 ∈ Ω
17:  for d = 1 to D do
18:    Select the next state ŝ^{d+1} according to (28)
19:    Obtain τ(ŝ^d, θ̂^d) via (26)
20:    Update Q(ŝ^d, θ̂^d) via (29)
21:    Update V(ŝ^d) via (30)
22:  end for
23: end for

The transition probability from s^k to s^{k+1} in the hypothetical experience, denoted by Ψ, follows from (23) and (24) as

\[ \Psi(\mathbf{s}^k, \boldsymbol{\theta}^k, \mathbf{s}^{k+1}) = \frac{\Phi'(\mathbf{s}^k, \boldsymbol{\theta}^k, \mathbf{s}^{k+1})}{\Phi(\mathbf{s}^k, \boldsymbol{\theta}^k)}. \tag{27} \]

The Q-function is updated D more times with the Dyna architecture, in which D hypothetical experiences are generated according to the world model Ψ learned from the previous real experiences (Φ′, Φ, τ′, τ, Ψ). More specifically, the system state ŝ^d and the hypothetical action θ̂^d in the d-th additional update evolve according to the transition probability Ψ(ŝ^d, θ̂^d, ŝ^{d+1}) given in (27), i.e.,

\[ \Pr(\hat{\mathbf{s}}^{d+1} \mid \hat{\mathbf{s}}^d, \hat{\boldsymbol{\theta}}^d) = \Psi(\hat{\mathbf{s}}^d, \hat{\boldsymbol{\theta}}^d, \hat{\mathbf{s}}^{d+1}), \tag{28} \]

where (ŝ^0, θ̂^0) is randomly chosen from ξ and Ω, respectively. The Q-function is updated in the d-th hypothetical experience, d = 1, 2, ..., D, by

\[ Q(\hat{\mathbf{s}}^d, \hat{\boldsymbol{\theta}}^d) \leftarrow (1-\alpha)\, Q(\hat{\mathbf{s}}^d, \hat{\boldsymbol{\theta}}^d) + \alpha \big( \tau(\hat{\mathbf{s}}^d, \hat{\boldsymbol{\theta}}^d) + \delta V(\hat{\mathbf{s}}^{d+1}) \big) \tag{29} \]

\[ V(\hat{\mathbf{s}}^d) = \max_{\hat{\boldsymbol{\theta}} \in \Omega} Q(\hat{\mathbf{s}}^d, \hat{\boldsymbol{\theta}}). \tag{30} \]
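A sketch of this planning stage (lines 17-22 of Algorithm 3): D hypothetical updates driven by the WorldModel above, resampling the hypothetical action after each step, which is one reasonable reading of the algorithm since the paper only fixes the initial random pair:

```python
import numpy as np

def dyna_planning(Q, V, model, states, actions, D, alpha, delta, rng):
    """D hypothetical Q-updates via (28)-(30) using the learned model."""
    s = states[rng.integers(len(states))]       # random hat{s}^0
    a = actions[rng.integers(len(actions))]     # random hat{theta}^0
    for _ in range(D):
        probs = np.array([model.psi(s, a, s2) for s2 in states])
        if probs.sum() == 0.0:                  # unvisited pair: resample
            s = states[rng.integers(len(states))]
            a = actions[rng.integers(len(actions))]
            continue
        s_next = states[rng.choice(len(states), p=probs / probs.sum())]  # (28)
        tau = model.tau(s, a)                                            # (26)
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (tau + delta * V[s_next])  # (29)
        V[s] = max(Q[s, b] for b in actions)                             # (30)
        s, a = s_next, actions[rng.integers(len(actions))]
    return Q, V
```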

As shown in Algorithm 3, the proposed NOMA power allocation algorithm requires additional memory in each time slot to store the experience record (Φ′, Φ, τ′, τ, Ψ). Compared with the standard Q-learning based power allocation in Algorithm 2, this scheme performs D more updates of the Q-function per time slot. In return, the fast Q-based power allocation algorithm applies the Dyna architecture to increase the convergence speed of the reinforcement learning process and thus improve the NOMA communication efficiency in the presence of smart jamming.

VII. SIMULATION RESULTS

In this section, we evaluate the performance of the MIMO NOMA power allocation scheme for M = 3 users in the dynamic anti-jamming communication game via simulations. If not specified otherwise, we set N_T = N_R = 5, σ_{B,1} = 10 dB, σ_{B,2} = 11 dB, σ_{B,3} = 20 dB, P_T = 20 W, R_0 = 1 bps/Hz, γ = 0.5, α = 0.2, δ = 0.7, ε = 0.9, and I = 200. In the simulations, a smart jammer applies Q-learning to determine the jamming power according to the current downlink transmit power, with N_J = 3, σ_{J,1} = σ_{J,2} = 11 dB, σ_{J,3} = 12 dB, and P_J = 20 W.

As shown in Fig. 5, the Q-learning based power allocation scheme significantly increases the jamming resistance of NOMA transmissions, which can be further improved by the hotbooting technique and the Dyna architecture. In the benchmark Q-learning based OMA system, the BS equally allocates the frequency and time resources to each user and chooses the transmit power according to the current state and the Q-function, similarly to Algorithm 2.

According to [36], NOMA transmission outperforms OMA with a lower outage probability, which is verified by the simulation results in Fig. 5. More specifically, the Q-learning based NOMA system improves the communication efficiency against jamming attacks compared with the OMA system. For instance, the Q-learning based anti-jamming NOMA system exceeds the Q-learning based OMA system with 8.7% higher SINR, 42.8% higher sum data rate, and 7.1% higher utility of the BS at the 1000-th time slot.

The NOMA transmission performance can be further improved by the hotbooting technique. For instance, the NOMA system with hotbooting Q-learning achieves 12% higher SINR, 10% higher sum data rate, and 9% higher utility at the 1000-th time slot. The reason is that the hotbooting technique exploits emulated experiences in dynamic radio environments to effectively reduce the exploration trials, and thus significantly improves the convergence speed of Q-learning and the jamming resistance. The power allocation performance can be further improved by the Dyna architecture with hypothetical experiences. For instance, the average SINR, the sum data rate, and the utility of the proposed fast Q-learning algorithm are 11%, 10%, and 7% higher at the 1000-th time slot, compared with the hotbooting Q-learning scheme.

The performance of the proposed power allocation scheme in the 5 × 5 MIMO NOMA system with 2 users is evaluated for varying channel gains of User 1 in Fig. 6. Both the sum data rate of the users and the utility of the BS increase with the channel gain σ_{B,1}. For instance, as the channel gain σ_{B,1} increases from 4 to 12 dB, the average SINR, the sum data rate, and the utility of the Q-learning based power allocation increase by 9%, 14%, and 17%, respectively. If σ_{B,1} = 12 dB, the fast Q-based power allocation achieves the best anti-jamming communication performance, exceeding the standard Q-learning based strategy with 9% higher SINR, 14% higher sum data rate, and 22% higher utility.

[Fig. 5. Performance of the power control schemes in the dynamic MIMO NOMA transmission game in the presence of smart jamming, versus time slot: (a) average SINR of the users; (b) sum data rates; (c) utility of the BS. NOMA with Q-learning, hotbooting Q, Dyna-Q, and fast Q are compared against OMA with Q-learning. M = 3, N_T = N_R = 5, N_J = 3, σ_{B,1} = 10 dB, σ_{B,2} = 11 dB, σ_{B,3} = 20 dB, σ_{J,1} = σ_{J,2} = 11 dB, σ_{J,3} = 12 dB, P_T = P_J = 20 W, R_0 = 1 bps/Hz, γ = 0.5.]

[Fig. 6. Performance of the power control schemes in the dynamic MIMO NOMA transmission game in the presence of smart jamming, versus the channel gain of User 1, σ_{B,1}, for σ_{B,2} = 20 and 24 dB: (a) average SINR of the users; (b) sum data rates; (c) utility of the BS. M = 2, N_T = N_R = 5, N_J = 3, σ_{J,1} = σ_{J,2} = 12 dB, P_T = P_J = 20 W, R_0 = 2 bps/Hz, γ = 0.5.]

[Fig. 7. Performance of the power control schemes in the dynamic MIMO NOMA transmission game in the presence of smart jamming, versus the number of RX antennas, N_R, for N_J = 2 and 6: (a) average SINR of the users; (b) sum data rates; (c) utility of the BS. M = 2, N_T = 10, σ_{B,1} = 12 dB, σ_{B,2} = 20 dB, σ_{J,1} = σ_{J,2} = 11 dB, P_T = P_J = 20 W, R_0 = 2 bps/Hz, γ = 0.5.]

Fig. 7 shows that the NOMA transmission efficiency improves with the number of receive antennas when N_T = 10 transmit antennas are used. The proposed power allocation schemes provide strong resistance against smart jamming even for a large number of jamming antennas N_J. For instance, our proposed scheme improves the average SINR, the sum data rate, and the user utility of the 10 × 8 NOMA system against a jammer with 2 antennas by 10%, 12%, and 15%, respectively, compared with the benchmark strategy.

    VIII. CONCLUSION

In this paper, we have formulated an anti-jamming MIMO NOMA transmission game, in which the BS determines the transmit power to improve its utility based on the sum data rate of the users, and the smart jammer as the follower in the Stackelberg game chooses the jamming power in an attempt to interrupt the ongoing transmission at a low jamming cost. The SE of the game has been derived, and the conditions assuring its existence have been provided, showing how the anti-jamming communication efficiency of NOMA systems increases with the number of antennas and the channel power gains. A fast Q-based NOMA power allocation scheme that combines the hotbooting technique and the Dyna architecture is proposed for the dynamic game to accelerate the learning and thus improve the communication efficiency against smart jamming. As shown in the simulation results, the proposed NOMA power allocation scheme can significantly improve the average SINR, the sum data rate, and the utility of the BS soon after the start of the game. For example, the sum data rate of 3 users in the 5 × 5 MIMO NOMA anti-jamming transmission game increases by 128% after 1000 time slots, which is 21% higher than that of the standard Q-learning based scheme.

We have analyzed a simplified jamming scenario for MIMO NOMA systems in this work. A future direction is to extend the theoretical analysis of the NOMA transmission to more practical scenarios with smart jamming, in which a jammer uses programmable radio devices to flexibly choose among multiple jamming policies.

    REFERENCES

[1] L. Dai, B. Wang, Y. Yuan, S. Han, I. Chih-Lin, and Z. Wang, "Non-orthogonal multiple access for 5G: Solutions, challenges, opportunities, and future research trends," IEEE Commun. Mag., vol. 53, no. 9, pp. 74–81, Sept. 2015.

[2] B. Kimy, S. Lim, H. Kim, et al., "Non-orthogonal multiple access in a downlink multiuser beamforming system," in Proc. IEEE Military Commun. Conf. (MILCOM), pp. 1278–1283, San Diego, CA, Nov. 2013.

[3] Z. Ding, L. Dai, and H. V. Poor, "MIMO-NOMA design for small packet transmission in the Internet of Things," IEEE Access, vol. 4, pp. 1393–1405, Apr. 2016.

[4] M. F. Hanif, Z. Ding, T. Ratnarajah, and G. K. Karagiannidis, "A minorization-maximization method for optimizing sum rate in the downlink of non-orthogonal multiple access systems," IEEE Trans. Signal Process., vol. 64, no. 1, pp. 76–88, Jan. 2016.

[5] Z. Ding, R. Schober, and H. V. Poor, "A general MIMO framework for NOMA downlink and uplink transmission based on signal alignment," IEEE Trans. Wireless Commun., vol. 15, no. 6, pp. 4438–4454, June 2016.

[6] K. Firouzbakht, G. Noubir, and M. Salehi, "On the performance of adaptive packetized wireless communication links under jamming," IEEE Trans. Wireless Commun., vol. 13, no. 7, pp. 3481–3495, July 2014.

[7] Q. Wang, K. Ren, P. Ning, and S. Hu, "Jamming-resistant multiradio multichannel opportunistic spectrum access in cognitive radio networks," IEEE Trans. Veh. Techn., vol. 65, no. 10, pp. 8331–8344, Oct. 2016.

[8] Q. Yan, H. Zeng, T. Jiang, et al., "Jamming resilient communication using MIMO interference cancellation," IEEE Trans. Inf. Forensics and Security, vol. 11, no. 7, pp. 1486–1499, July 2016.

[9] X. Zhou, D. Niyato, and A. Hjorungnes, "Optimizing training-based transmission against smart jamming," IEEE Trans. Veh. Techn., vol. 60, no. 6, pp. 2644–2655, July 2011.

[10] A. Mukherjee and A. L. Swindlehurst, "Jamming games in the MIMO wiretap channel with an active eavesdropper," IEEE Trans. Signal Process., vol. 61, no. 1, pp. 82–91, Jan. 2013.

[11] A. Garnaev, Y. Liu, and W. Trappe, "Anti-jamming strategy versus a low-power jamming attack when intelligence of adversary's attack type is unknown," IEEE Trans. Signal Inf. Process. Netw., vol. 2, no. 1, pp. 49–56, Mar. 2016.

[12] Y. Gwon, S. Dastangoo, C. Fossa, and H. T. Kung, "Competing mobile network game: Embracing antijamming and jamming strategies with reinforcement learning," in Proc. IEEE Conf. Commun. and Netw. Security (CNS), pp. 28–36, Washington, DC, Oct. 2013.

[13] F. Slimeni, B. Scheers, Z. Chtourou, and V. L. Nir, "Jamming mitigation in cognitive radio networks using a modified Q-learning algorithm," in Proc. IEEE Int'l Conf. Military Commun. and Inf. Systems (ICMCIS), pp. 1–7, Cracow, May 2015.

[14] E. R. Gomes and R. Kowalczyk, "Dynamic analysis of multiagent Q-learning with ε-greedy exploration," in Proc. ACM Annual Int'l Conf. Machine Learning (ICML), pp. 369–376, Montreal, June 2009.

[15] R. S. Sutton, "Dyna, an integrated architecture for learning, planning, and reacting," ACM SIGART Bulletin, vol. 2, no. 4, pp. 160–163, Aug. 1991.

[16] Q. Sun, S. Han, I. Chin-Lin, and Z. Pan, "On the ergodic capacity of MIMO NOMA systems," IEEE Wireless Commun. Lett., vol. 4, no. 4, pp. 405–408, Aug. 2015.

[17] Z. Ding, P. Fan, and H. V. Poor, "Impact of user pairing on 5G nonorthogonal multiple-access downlink transmissions," IEEE Trans. Veh. Techn., vol. 65, no. 8, pp. 6010–6023, Aug. 2016.

[18] Z. Wei, D. W. K. Ng, and J. Yuan, "Power-efficient resource allocation for MC-NOMA with statistical channel state information," in Proc. IEEE Global Commun. Conf. (GLOBECOM), Washington, DC, Dec. 2016.

[19] F. Liu, P. Mähönen, and M. Petrova, "Proportional fairness-based user pairing and power allocation for non-orthogonal multiple access," in Proc. IEEE Annual Int'l Symp. Personal, Indoor, and Mobile Radio Commun. (PIMRC), pp. 1127–1131, Hong Kong, Aug. 2015.

[20] S. Timotheou and I. Krikidis, "Fairness for non-orthogonal multiple access in 5G systems," IEEE Signal Process. Lett., vol. 22, no. 10, pp. 1647–1651, Oct. 2015.

[21] J. A. Oviedo and H. R. Sadjadpour, "A fair power allocation approach to NOMA in multi-user SISO systems," IEEE Trans. Veh. Techn., 2017, DOI: 10.1109/TVT.2017.2689000.

[22] L. Xiao, J. Liu, Q. Li, N. B. Mandayam, and H. V. Poor, "User-centric view of jamming games in cognitive radio networks," IEEE Trans. Inf. Forensics and Security, vol. 10, no. 12, pp. 2578–2590, Dec. 2015.

[23] X. Tang, P. Ren, Y. Wang, Q. Du, and L. Sun, "Securing wireless transmission against reactive jamming: A Stackelberg game framework," in Proc. IEEE Global Commun. Conf. (GLOBECOM), San Diego, CA, Dec. 2015.

[24] Y. Wu, B. Wang, K. J. R. Liu, and T. C. Clancy, "Anti-jamming games in multi-channel cognitive radio networks," IEEE Journal on Selected Areas in Commun., vol. 30, no. 1, pp. 4–15, Jan. 2012.

[25] A. Garnaev, M. Baykal-Gursoy, and H. V. Poor, "A game theoretic analysis of secret and reliable communication with active and passive adversarial modes," IEEE Trans. Wireless Commun., vol. 15, no. 3, pp. 2155–2163, Mar. 2016.

[26] L. Xiao, N. B. Mandayam, and H. V. Poor, "Prospect theoretic analysis of energy exchange among microgrids," IEEE Trans. Smart Grid, vol. 6, no. 1, pp. 63–72, Jan. 2015.

[27] X. He, H. Dai, P. Ning, and R. Dutta, "A stochastic multi-channel spectrum access game with incomplete information," in Proc. IEEE Int'l Conf. Commun. (ICC), pp. 4799–4804, London, June 2015.

[28] L. Xiao, Y. Li, J. Liu, and Y. Zhao, "Power control with reinforcement learning in cooperative cognitive radio networks against jamming," The Journal of Supercomputing, vol. 71, no. 9, pp. 3237–3257, Sept. 2015.


[29] X. Xiao, C. Dai, Y. Li, C. Zhou, and L. Xiao, "Energy trading game for microgrids using reinforcement learning," in Proc. EAI Int'l Conf. Game Theory for Networks, Tennessee, May 2017.

[30] K. S. Hwang, W. C. Jiang, Y. J. Chen, and W. H. Wang, "Model-based indirect learning method based on Dyna-Q architecture," in Proc. IEEE Int'l Conf. Systems, Man, and Cybernetics (SMC), pp. 2540–2544, Manchester, U.K., Oct. 2013.

[31] Y. Li, L. Xiao, H. Dai, and H. V. Poor, "Game theoretic study of protecting MIMO transmissions against smart attacks," in Proc. IEEE Int'l Conf. Commun. (ICC), Paris, May 2017.

[32] C. Wang, J. Chen, and Y. Chen, "Power allocation for a downlink non-orthogonal multiple access system," IEEE Wireless Commun. Lett., vol. 5, no. 5, pp. 532–535, Aug. 2016.

[33] Z. Ding, Z. Yang, P. Fan, and H. V. Poor, "On the performance of non-orthogonal multiple access in 5G systems with randomly deployed users," IEEE Signal Process. Lett., vol. 21, no. 12, pp. 1501–1505, Dec. 2014.

[34] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," Journal of Artificial Intelligence Research, vol. 4, pp. 237–285, 1996.

[35] H. H. Kha, H. D. Tuan, and H. H. Nguyen, "Fast global optimal power allocation in wireless networks by local DC programming," IEEE Trans. Wireless Commun., vol. 11, no. 2, pp. 510–515, Feb. 2012.

[36] J. Cui, Z. Ding, and P. Fan, "A novel power allocation scheme under outage constraints in NOMA systems," IEEE Signal Process. Lett., vol. 23, no. 9, pp. 1226–1230, Sept. 2016.

$$ u=\sum_{i=1}^{N_R}\Bigg[\sum_{m=1}^{M-1}\log_2\bigg(1+\frac{N_J\theta_m P_T h^B_{m,i}}{N_T N_J+N_J\big(1-\sum_{n=1}^{m}\theta_n\big)P_T h^B_{m,i}+N_T p_J h^J_{m,i}}\bigg)+\log_2\bigg(1+\frac{N_J\big(1-\sum_{n=1}^{M-1}\theta_n\big)P_T h^B_{M,i}}{N_T N_J+N_T p_J h^J_{M,i}}\bigg)\Bigg]+\gamma p_J, \tag{31} $$

$$ \frac{\partial u(\theta,p_J^*)}{\partial\theta_m}=\sum_{i=1}^{N_R}\frac{N_J P_T\Big(N_T p_J^*\big(h^B_{m,i}h^J_{M,i}-h^B_{M,i}h^J_{m,i}\big)-N_J P_T h^B_{m,i}h^B_{M,i}\sum_{n=m+1}^{M-1}\theta_n-N_T N_J\big(h^B_{M,i}-h^B_{m,i}\big)\Big)}{\ln 2\Big(N_T N_J+N_T p_J^* h^J_{m,i}+N_J\big(1-\sum_{n=1}^{m}\theta_n\big)P_T h^B_{m,i}\Big)\Big(N_T N_J+N_T p_J^* h^J_{M,i}+N_J\big(1-\sum_{n=1}^{M-1}\theta_n\big)P_T h^B_{M,i}\Big)}<0,\ \forall 1\le m\le M-1. \tag{32} $$

$$ u=\sum_{i=1}^{N_R}\Bigg[\log_2\bigg(1+\frac{N_J\theta_1 P_T h^B_{1,i}}{N_T N_J+N_J(1-\theta_1)P_T h^B_{1,i}+N_T p_J h^J_{1,i}}\bigg)+\log_2\bigg(1+\frac{N_J(1-\theta_1)P_T h^B_{2,i}}{N_T N_J+N_T p_J h^J_{2,i}}\bigg)\Bigg]+\gamma p_J, \tag{33} $$

$$ \frac{\partial u(\theta,p_J^*)}{\partial\theta_1}=\sum_{i=1}^{N_R}\frac{N_J P_T\Big(N_T p_J^*\big(h^B_{1,i}h^J_{2,i}-h^B_{2,i}h^J_{1,i}\big)-N_T N_J\big(h^B_{2,i}-h^B_{1,i}\big)\Big)}{\ln 2\Big(N_T N_J+N_T p_J^* h^J_{1,i}+N_J(1-\theta_1)P_T h^B_{1,i}\Big)\Big(N_T N_J+N_T p_J^* h^J_{2,i}+N_J(1-\theta_1)P_T h^B_{2,i}\Big)}<0. \tag{34} $$

APPENDIX A
PROOF OF LEMMA 1

According to (6), if $\min_{1\le m\le M} R_m \ge R_0$ for all $0 \le p_J \le P_J$, we have (31), and thus $\partial^2 u(\theta, p_J)/\partial p_J^2 > 0$, i.e., $u(\theta, p_J)$ is convex in $p_J$. By (11), we have $\partial u(\theta, p_J)/\partial p_J|_{p_J=0} < 0$. Thus, $u(\theta, p_J)$ is minimized at the point where $\partial u(\theta, p_J)/\partial p_J = 0$, and by (7), $p_J^*$ is given by (10). Otherwise, if there exists $p_J' \le P_J$ that satisfies $\min_{1\le m\le M} R_m < R_0$, then by (7), $p_J^* \in [p_J', P_J]$.

According to the assumption that the channel conditions of User $k$ are stronger than those of User $j$ ($k > j$), we have $h^B_{k,i} > h^B_{j,i}$, $\forall 1 \le j < k \le M$, $1 \le i \le N_R$. In addition, if $h^B_{m,i}h^J_{M,i} < h^B_{M,i}h^J_{m,i}$, $\forall 1 \le m \le M-1$, $\forall 1 \le i \le N_R$, then by (31) we have $\partial u(\theta, p_J^*)/\partial\theta_m < 0$, i.e., (32). Clearly, $u$ monotonically decreases with $\theta_m$, $\forall 1 \le m \le M-1$. Meanwhile, if $R_m(\theta, P_J) < R_0$ for any $1 \le m \le M-1$, then $p_J^* = P_J$ and $u(\theta, p_J^*) = 0$. Therefore, $u$ is maximized if $R_m(\theta, P_J) = R_0$, $\forall 1 \le m \le M-1$, i.e., $\theta^*$ is given by (9) and thus (8) holds. Therefore, the SE $(\theta^*, p_J^*)$ is given by (9)-(10).
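To make the convexity step used above explicit, note that for fixed $\theta$ every summand of (31) has the form $\log_2\big(1 + a/(b + c\,p_J)\big)$ with constants $a, b, c > 0$; for the $m$-th term, $a = N_J\theta_m P_T h^B_{m,i}$, $b = N_T N_J + N_J(1-\sum_{n=1}^{m}\theta_n)P_T h^B_{m,i}$ and $c = N_T h^J_{m,i}$. A direct computation gives

$$ \frac{\partial^2}{\partial p_J^2}\,\log_2\frac{a+b+c\,p_J}{b+c\,p_J} = \frac{c^2}{\ln 2}\left(\frac{1}{(b+c\,p_J)^2} - \frac{1}{(a+b+c\,p_J)^2}\right) > 0, $$

and the jamming-cost term $\gamma p_J$ is linear in $p_J$, so $u(\theta, p_J)$ is a sum of convex functions of $p_J$, as claimed.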

APPENDIX B
PROOF OF COROLLARY 1

Similar to the proof of Lemma 1, if $\min_{1\le m\le M} R_m \ge R_0$ for all $0 \le p_J \le P_J$, then by (31), $\partial^2 u(\theta, p_J)/\partial p_J^2 > 0$. By (19), we have $\partial u(\theta, p_J)/\partial p_J|_{p_J=0} > 0$. Thus $u(\theta, p_J)$ is minimized at $p_J = 0$, and we have $p_J^* = 0$ by (7). If $h^B_{m,i}h^J_{M,i} < h^B_{M,i}h^J_{m,i}$, $\forall 1 \le m \le M-1$, $\forall 1 \le i \le N_R$, we have (32) and thus $\theta^*$ is given by (18). Therefore, if $h^B_{m,i}h^J_{M,i} < h^B_{M,i}h^J_{m,i}$, $\forall 1 \le m \le M-1$, $\forall 1 \le i \le N_R$, and (19) holds, we have the SE $(\theta^*, 0)$.

APPENDIX C
PROOF OF COROLLARY 2

By (33), if $h^B_{1,i}h^J_{2,i} < h^J_{1,i}h^B_{2,i}$, $\forall 1 \le i \le N_R$, and (14) holds, we have $\partial u(\theta_1, p_J^*)/\partial\theta_1 < 0$, i.e., (34). Since $\theta_2 = 1 - \theta_1$ in the two-user case, $u$ monotonically increases with $\theta_2$, $0 \le \theta_2 \le 1$. On the other hand, it is clear that $R_1(\theta_1, p_J^*) = 0$ if $\theta_1 = 0$, and $R_1(\theta_1, p_J^*)$ monotonically increases with $\theta_1$. Thus $u$ is maximized if $R_1 = R_0$, i.e., $\theta_1^*$ is given by (12), and (8) is satisfied with $\theta_1^*$. Therefore, the SE $(\theta_1^*, p_J^*)$ is given by (12)-(13).

APPENDIX D
PROOF OF COROLLARY 3

By (34), if $h^B_{1,i}h^J_{2,i} > h^J_{1,i}h^B_{2,i} + \big(h^B_{2,i} - h^B_{1,i}\big)N_J/p_J^*$, $\forall 1 \le i \le N_R$, and (17) holds, we have $\partial u(\theta_1, p_J^*)/\partial\theta_1 > 0$. It is clear that $R_2(\theta_1, p_J^*) = 0$ if $\theta_1 = 1$, and $R_2(\theta_1, p_J^*)$ monotonically decreases with $\theta_1$. Thus $u$ is maximized if $R_2 = R_0$, i.e., $\theta_1^*$ is given by (15). Therefore, the SE $(\theta_1^*, p_J^*)$ is given by (15)-(16).
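The threshold in Corollary 3 is exactly the condition that the numerator of (34) be positive: rearranging $N_T p_J^*\big(h^B_{1,i}h^J_{2,i}-h^B_{2,i}h^J_{1,i}\big) > N_T N_J\big(h^B_{2,i}-h^B_{1,i}\big)$ gives $h^B_{1,i}h^J_{2,i} > h^J_{1,i}h^B_{2,i} + \big(h^B_{2,i}-h^B_{1,i}\big)N_J/p_J^*$. The following sketch checks this sign flip numerically for the two-user utility (33) with a single receive antenna; all constants and channel gains are illustrative values, not taken from the paper's simulations.

    import numpy as np

    NT, NJ, PT, GAMMA_J = 4.0, 4.0, 10.0, 0.5   # illustrative constants

    def u_two_user(theta1, pJ, hB1, hB2, hJ1, hJ2):
        """Two-user utility u in (33) with N_R = 1; pJ stands in for p_J^*."""
        r1 = np.log2(1 + NJ * theta1 * PT * hB1 /
                     (NT * NJ + NJ * (1 - theta1) * PT * hB1 + NT * pJ * hJ1))
        r2 = np.log2(1 + NJ * (1 - theta1) * PT * hB2 /
                     (NT * NJ + NT * pJ * hJ2))
        return r1 + r2 + GAMMA_J * pJ

    def du_dtheta1(theta1, pJ, hB1, hB2, hJ1, hJ2, eps=1e-7):
        """Forward finite difference of u with respect to theta_1."""
        return (u_two_user(theta1 + eps, pJ, hB1, hB2, hJ1, hJ2)
                - u_two_user(theta1, pJ, hB1, hB2, hJ1, hJ2)) / eps

    pJ, hB1, hB2 = 2.0, 0.3, 0.9                 # user 2 is the strong user
    # Corollary 2 regime: hB1*hJ2 < hJ1*hB2, so du/dtheta1 < 0, cf. (34).
    print(du_dtheta1(0.4, pJ, hB1, hB2, hJ1=0.8, hJ2=0.1))
    # Corollary 3 regime: hB1*hJ2 > hJ1*hB2 + (hB2-hB1)*NJ/pJ, so du/dtheta1 > 0.
    print(du_dtheta1(0.4, pJ, hB1, hB2, hJ1=0.01, hJ2=5.0))

With these values the first derivative evaluates to roughly -0.77 and the second to roughly +0.07, matching the signs predicted by Corollaries 2 and 3.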


Liang Xiao (M'09, SM'13) is currently a Professor in the Department of Communication Engineering, Xiamen University, Fujian, China. She has served as an associate editor of IEEE Transactions on Information Forensics and Security and guest editor of IEEE Journal of Selected Topics in Signal Processing. She is the recipient of the best paper awards of the 2016 INFOCOM Big Security Workshop and 2017 ICC. She received the B.S. degree in communication engineering from Nanjing University of Posts and Telecommunications, China, in 2000, the M.S. degree in electrical engineering from Tsinghua University, China, in 2003, and the Ph.D. degree in electrical engineering from Rutgers University, NJ, in 2009. She was a visiting professor with Princeton University, Virginia Tech, and the University of Maryland, College Park.

Yanda Li received the B.S. degree in communication engineering from Xiamen University, Xiamen, China, in 2015, where he is currently pursuing the M.S. degree with the Department of Communication Engineering. His research interests include network security and wireless communications.

Canhuang Dai received the B.S. degree in communication engineering from Xiamen University, Xiamen, China, in 2017, where he is currently pursuing the M.S. degree with the Department of Communication Engineering. His research interests include smart grid and wireless communications.

Huaiyu Dai (F'17) received the B.E. and M.S. degrees in electrical engineering from Tsinghua University, Beijing, China, in 1996 and 1998, respectively, and the Ph.D. degree in electrical engineering from Princeton University, Princeton, NJ, in 2002.

He was with Bell Labs, Lucent Technologies, Holmdel, NJ, in summer 2000, and with AT&T Labs-Research, Middletown, NJ, in summer 2001. He is currently a Professor of Electrical and Computer Engineering with NC State University, Raleigh. His research interests are in the general areas of communication systems and networks, advanced signal processing for digital communications, and communication theory and information theory. His current research focuses on networked information processing and cross-layer design in wireless networks, cognitive radio networks, network security, and associated information-theoretic and computation-theoretic analysis.

He has served as an editor of IEEE Transactions on Communications, IEEE Transactions on Signal Processing, and IEEE Transactions on Wireless Communications. Currently he is an Area Editor in charge of wireless communications for IEEE Transactions on Communications. He co-edited two special issues of EURASIP journals on distributed signal processing techniques for wireless sensor networks, and on multiuser information theory and related applications, respectively. He co-chaired the Signal Processing for Communications Symposium of IEEE Globecom 2013, the Communications Theory Symposium of IEEE ICC 2014, and the Wireless Communications Symposium of IEEE Globecom 2014.

H. Vincent Poor (S'72, M'77, SM'82, F'87) received the Ph.D. degree in electrical engineering and computer science from Princeton University in 1977. From 1977 until 1990, he was on the faculty of the University of Illinois at Urbana-Champaign. Since 1990 he has been on the faculty at Princeton, where he is the Michael Henry Strater University Professor of Electrical Engineering. During 2006 to 2016, he served as Dean of Princeton's School of Engineering and Applied Science. He has also held visiting appointments at several other institutions, most recently at Berkeley and Cambridge. His research interests are in the areas of information theory and signal processing, and their applications in wireless networks, energy systems and related fields. Among his publications in these areas is the recent book Information Theoretic Security and Privacy of Information Systems (Cambridge University Press, 2017).

Dr. Poor is a member of the National Academy of Engineering and the National Academy of Sciences, and is a foreign member of the Chinese Academy of Sciences, the Royal Society and other national and international academies. Recent recognition of his work includes the 2017 IEEE Alexander Graham Bell Medal, Honorary Professorships from Peking University and Tsinghua University, both conferred in 2017, and a D.Sc. honoris causa from Syracuse University awarded in 2017.