
Research Article
Hierarchical Q-Learning Based UAV Secure Communication against Multiple UAV Adaptive Eavesdroppers

Jue Liu,1,2 Nan Sha,1 Weiwei Yang,1 Jia Tu,3 and Lianxin Yang1

1College of Communications Engineering, Army Engineering University of PLA, Nanjing 210007, China
2School of Information Science and Engineering, Jinshen College of Nanjing Audit University, Nanjing 210023, China
3College of International Studies, National University of Defense Technology, Nanjing 210039, China

Correspondence should be addressed to Weiwei Yang; [email protected]

Received 16 June 2020; Revised 19 August 2020; Accepted 13 September 2020; Published 8 October 2020

Academic Editor: Ashok Kumar Das

Copyright © 2020 Jue Liu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In this paper, we investigate secure unmanned aerial vehicle (UAV) communication in the presence of multiple UAV adaptive eavesdroppers (AEs), where each AE can conduct eavesdropping or jamming adaptively by learning others' actions so as to degrade the secrecy rate more severely. The one-leader and multi-follower Stackelberg game is adopted to analyze the mutual interference among multiple AEs, and the optimal transmit powers are proven to exist under certain conditions. Following that, a mixed-strategy Stackelberg Equilibrium over a finite and discretized power set is also derived, and a hierarchical Q-learning based power allocation algorithm (HQLA) is proposed to obtain the optimal power allocation strategy of the transmitter. Numerical results show that the secrecy performance can be degraded severely by multiple AEs and verify the effectiveness of the optimal power allocation strategy. Finally, the effect of the eavesdropping cost on the AEs' attack mode strategies is also revealed.

1. Introduction

With its inherent advantages in mobility, flexibility, and adaptive altitude, unmanned aerial vehicle (UAV) wireless communication has experienced an upsurge of interest in both military and civilian applications [1–6]. However, both the broadcast nature of the wireless medium and malicious attackers make the electromagnetic environment of UAV communication hostile. Hence, the security of UAV communications is of paramount importance yet a significant challenge [7].

As an option, the physical layer security (PLS) technique, with its lower computational complexity, has in recent years been proven able to protect wireless communication networks from wiretapping and interference by exploiting the random characteristics of wireless channels [8–11]. Naturally, given the payload-limited characteristic of UAVs, many PLS approaches [12–23] that exploit the high altitude and mobility of UAVs have been widely applied in UAV-involved communications.

However, most approaches in the above work mainly focus on single-mode attackers and are not fully suitable for the novel attackers referred to as "adaptive eavesdroppers (AEs)," "active eavesdroppers," or "smart eavesdroppers." These attackers use programmable radio devices to flexibly choose their attack methods, such as eavesdropping, jamming, and spoofing, according to the ongoing transmission status and the radio channel states. For example, an AE sends spoofing signals if it has a channel state similar to Alice's, or sends jamming signals if it is very close to Bob. Compared with traditional single-mode attackers, each performing a single-mode attack, an AE can be more harmful to the UAV transmission by reducing the secrecy capacity. Therefore, it is urgent to investigate effective countermeasures against this type of eavesdropper.

Hindawi Wireless Communications and Mobile Computing, Volume 2020, Article ID 8825120, 15 pages. https://doi.org/10.1155/2020/8825120

In recent years, some literature has begun to investigate the AE. One form of the AE is achieved by means of multiantenna full-duplex (FD) technology [24–27], which assigns one part of the antennas to wiretap while the other antennas interfere simultaneously. Another type of AE emerges during the channel estimation phase in time-division duplex systems, where it causes pilot contamination by sending the same pilot sequence as the legitimate node [28–31]; in the data transmission phase, this AE reverts to a passive eavesdropper. Nevertheless, it should be noted that the attack modes of these two forms of AE are predefined, which means they cannot change the attack mode adaptively. The third form of AE, studied in [32–36], can adjust its attack strategies adaptively. However, several problems remain unsolved. Firstly, current work rarely considers the multiple-AE case and the mutual interference among the AEs themselves. Secondly, existing studies neglect the AE adaptivity supported by learning ability when searching for the optimal strategies of the transmitter, and do not reveal the impact of the AE's learning ability on the secrecy performance of the considered system. How to find the transmitter's optimal power strategies in the face of multiple AEs with learning ability, and how to handle the mutual interference from AEs to improve the secrecy capacity of the UAV communication system, both need to be considered.

In our work, we mainly concentrate on a secure UAV communication scenario in the presence of multiple UAV AEs, which can eavesdrop or jam adaptively by learning others' strategies as well as the dynamic environment. In the considered scenario, each AE's attack activity may affect the signal to interference plus noise ratio (SINR) of the others. This implies that each AE's decision-making is coupled not only with the actions of the transmitter but also with those of the other AEs. Considering these hierarchical interactions between the transmitter side and the AE side, the Stackelberg game [37–41] is a suitable framework to capture the sequential interactions between the transmitter and the AEs. The Stackelberg Equilibrium (SE) points of the formulated game are then feasible solutions to the transmit power allocation problem. However, the SE points solely provide theoretic solutions, and it is challenging to obtain them. In particular, an AE with learning ability makes decisions spontaneously and independently, which results in unpredictable attack modes across the whole AE set. In this context, it is not feasible to handle this problem by centralized means, because the number of AEs in each attack mode and their locations are unknown, which motivates applying reinforcement learning (RL). Hence, we incorporate RL into the proposed game, and a hierarchical Q-learning based power allocation algorithm is proposed to obtain the mixed-strategy equilibrium solution. The main contributions of this paper are summarized as follows:
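Since the paper's HQLA builds on tabular Q-learning, a minimal sketch of the underlying update rule may be helpful as background; the single dummy state, the two power-level actions, and all numeric values below are hypothetical placeholders for illustration, not the paper's algorithm or parameters.

```python
import random

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    # Bellman update: Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(s',a'))
    best_next = max(Q[next_state].values())
    Q[state][action] = (1 - alpha) * Q[state][action] + alpha * (reward + gamma * best_next)

def epsilon_greedy(Q, state, epsilon=0.1):
    # Explore with probability epsilon; otherwise exploit the best known action.
    if random.random() < epsilon:
        return random.choice(list(Q[state]))
    return max(Q[state], key=Q[state].get)

# One hypothetical state and two hypothetical power-level actions.
Q = {0: {"low": 0.0, "high": 0.0}}
q_update(Q, state=0, action="high", reward=1.0, next_state=0)
```

A learning agent alternates `epsilon_greedy` action selection with `q_update` on the observed reward; the hierarchical scheme in this paper layers such learners for the leader and the followers.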

(i) We propose a secure UAV communication model which consists of one transmitter-receiver pair and multiple UAV AEs. Each AE decides to eavesdrop or jam adaptively by learning the other nodes' strategies as well as the dynamic environment, so as to maximize its damage. The interference among AEs is also investigated.

(ii) We formulate the UAV secure transmission problem as a one-leader and multi-follower Stackelberg game where the transmitter acts as the leader and all AEs are followers. The optimal transmit power of the leader is obtained by analyzing the pure-strategy SEs under the existence conditions. Besides, the mixed-strategy SE is also derived for the finite and discretized power set. Then, we apply a hierarchical RL framework in which each player chooses its strategy based on a probability distribution, and a hierarchical Q-learning based power allocation algorithm is proposed to discover the mixed-strategy equilibrium of the formulated game. We also provide rigorous theoretical proof of the convergence of the proposed algorithm.

(iii) Numerical results show the effectiveness of the optimal power allocation strategy of the legitimate transmitter in this more hostile situation and reveal the impact of the AE's learning ability on the secrecy rate. Meanwhile, we show that the proposed algorithm has a significant convergence advantage over the single-agent RL algorithm. Finally, the effect of the eavesdropping cost on the AE's attack mode strategies is also revealed.

The rest of this paper is organized as follows. Section 2 presents the related work, and Section 3 the system model. Section 4 formulates the UAV secure transmission game, and Section 5 investigates the power allocation policy. Section 6 provides the simulation results, and Section 7 concludes the work.

2. Related Work

In UAV communication, there have been abundant approaches, such as 3D beamforming [12–14], trajectory optimization [5, 15–19], multi-UAV cooperation [17, 20], and resource management techniques [21–23], all concerned with a single attack mode. However, it is inappropriate to apply them directly against the novel attacker, which has multiple abilities such as eavesdropping, jamming, and spoofing.

As a novel attacker, the AE can eavesdrop and jam simultaneously via the FD capability [24–27]. Specifically, Tang et al. investigated the physical layer security issue in the presence of an FD AE within a hierarchical game framework in [24]. In [25], Mukherjee and Swindlehurst examined the design of an FD active eavesdropper in the 3-user MIMOME wiretap channel, where the adversary optimizes its transmit and receive sub-arrays and jamming signal parameters to minimize the MIMO secrecy rate of the main channel. In [26], the potential benefits of an FD receiver node in the presence of an active FD eavesdropper were studied. The optimal receive/transmit antenna allocation at the receiver against an active eavesdropper in an FD pattern is provided in [27]. The second AE scenario adopts time-division duplex technology: the adaptive eavesdropper sends the same pilot sequence as the legitimate user node in the training phase, leading to pilot contamination [28–31]. Zhou et al. discussed how an AE attacks the training phase in wireless communication to improve its eavesdropping performance in [28]. A simple protocol to determine whether an AE is present, using the channel properties of MMIMO, is proposed in [29]. A novel random-training-assisted (RTA) pilot spoofing detection algorithm and a zero-forcing based secure transmission scheme are proposed to protect the confidential information from the active eavesdropper in [30]. Unfortunately, the AEs in all of the above scenarios cannot adjust their attack mode adaptively. More recently, AEs that can determine the attack mode autonomously have been studied in [32–36]. To be specific, Li et al. studied the secure communication game under an AE from a UAV with imperfect channel estimation, but ignored the mobility of the UAV, in [32]. Li et al. formulated the MIMO transmission in the presence of an AE as a noncooperative game and obtained the power control strategy based on Q-learning in [33]. Zhu et al. proposed a noncooperative strategic game to model the complex decision between users that perform uplink transmission via relay and an active malicious node in [34]. In [35], Xiao et al. formulated a subjective smart attack game for UAV transmission and proposed deep Q-learning RL based UAV power allocation strategies. However, these researches did not address the multiple-AE scenario, and the mutual interference between AEs is hardly considered. Moreover, these AEs cannot learn from others' strategies and the dynamic environment. A summary of the literature on AEs is given in Table 1.

Our work differs from the above researches in that we focus on AEs with learning ability that can choose their attack mode independently, and we investigate the secure transmission problem of UAV communication in the presence of multiple AEs. Note that the approach of defending against multiple AEs using the Stackelberg game in UAV communication networks was presented in our previous work [37]; the main differences and new contributions here are that (i) aiming at practical UAV communication, we introduce mixed strategies over the discretized transmit power set, and (ii) we assume that each AE has learning ability and reveal the impact of the AE's learning ability on the secrecy rate. Besides, the most related work [32] shares with ours the Stackelberg game based power allocation problem in the secure transmission of UAV communication. The main differences are that (i) we consider the multiple-AE case, which is more realistic in UAV communication, while the work in [32] ignores it, and (ii) the mutual interference among the AEs themselves is considered.

3. System Model

As shown in Figure 1, we consider the downlink of a UAV communication system consisting of a transmitter (Alice), a receiver (Bob), and M UAV AEs randomly distributed around the transmitter-receiver pair, where all nodes are single-antenna and all UAVs are hovering. We adopt a 3D Cartesian coordinate system with Alice, Bob, and AE_m located at (x_a, y_a, h_a), (x_b, y_b, h_b), and (x_m, y_m, h_m), respectively. Alice communicates with Bob using transmit power P_s \in [0, P_{max}], where P_{max} is the maximum transmit power. Without loss of generality, being programmable radio devices, when Alice is transmitting a signal to Bob, some AEs act as passive eavesdroppers to overhear Alice's signals if they can derive enough information, while the rest of the AEs send jamming signals if they can effectively block Alice's signal to Bob. Each AE can either eavesdrop on Alice or jam Bob, under a half-duplex constraint. Here, q_m \in \{e, j\}, m \in [1, M], corresponding to eavesdropping and jamming, denotes the attack mode of AE_m. Hence, the sets of passive eavesdroppers and active jammers are denoted by \Phi_E and \Phi_J, respectively, where |\Phi_J| + |\Phi_E| = M.

Considering the low mobility of low-altitude UAVs, all channels are assumed to be quasi-static fading, i.e., the channel gains are constant within each transmission block. Besides, the channel gains between the UAVs follow the free-space path loss model, determined by the distance between the UAVs, i.e.,

g_{i,j} = \beta_0 d_{i,j}^{-\eta} = \frac{\beta_0}{\left(\sqrt{\left\|\zeta_i - \zeta_j\right\|^2}\right)^{\eta}}, \quad (1)

where \beta_0 is the channel power gain at the reference distance d_0 = 1 m, d_{i,j} is the distance from node i to node j, \zeta_i is the coordinate of node i, and \eta is the path loss exponent.
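As a quick numerical check of Eq. (1), the channel gain can be computed directly from the 3D node coordinates; the \beta_0 and \eta values below are illustrative, not the paper's simulation parameters.

```python
import math

def channel_gain(zeta_i, zeta_j, beta0=1e-3, eta=2.0):
    # Eq. (1): g_ij = beta0 * d_ij^(-eta), with d_ij = ||zeta_i - zeta_j||.
    d = math.dist(zeta_i, zeta_j)
    return beta0 * d ** (-eta)

# At the reference distance d0 = 1 m the gain equals beta0.
g_ref = channel_gain((0.0, 0.0, 100.0), (1.0, 0.0, 100.0))
```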

Remark 1. Since each AE can eavesdrop or interfere adaptively by learning the communication environment, Alice and the other AEs can monitor an AE's position when it chooses to jam. So, following the above analysis, we assume that the number and locations of all nodes (the legitimate communication pair and all AEs) are mutually available via a priori measurement. In addition, as the AE considered in this paper can only eavesdrop and interfere, at each time slot each node judges the others' actions by sensing jamming signals: if an AE does not jam, the other nodes consider that it chooses to eavesdrop.

At each time slot, Alice first sends a normalized signal x_a with transmit power P_s. Then, all AEs conduct different attack modes by learning others' strategies. The legitimate link and all passive eavesdroppers suffer interference from all active jammers. The interference to the legitimate link and to the kth passive eavesdropper (k \in \Phi_E) is given by \sum_{j\in\Phi_J} P_j g_{j,b} and \sum_{j\in\Phi_J} P_j g_{j,k}, respectively, where P_j is the jamming power.

The received signal at Bob can be expressed as

y_b = \sqrt{P_s g_{a,b}}\, x_a + \sum_{j\in\Phi_J} \sqrt{P_j g_{j,b}}\, x_j + n_b, \quad (2)

where n_b \sim \mathcal{CN}(0, \sigma_n^2) is the additive white Gaussian noise (AWGN) at Bob. The received SINR at Bob can be expressed as

r_{ab} = \frac{P_s \omega_0 d_{a,b}^{-\eta}}{I_{J,B} + 1}, \quad (3)

where \omega_0 = \beta_0/\sigma_n^2 and I_{J,B} = \sum_{j\in\Phi_J} P_j \omega_0 d_{j,b}^{-\eta} denotes the interference from all AEs that choose to jam. We can obtain the data rate of the Alice-Bob link as


Table 1: A summary of the literature on AEs.

| Ref. | FD/HD | Attacker's antennas | Attack mode and phase | Attacker number | Solution (theory/algorithm) | Advantages | Disadvantages |
|---|---|---|---|---|---|---|---|
| [24] (2016) | FD | Multiple | Eavesdrop and jam during data transmission | 1 | Stackelberg game; primal-dual interior-point method | Considers the physical layer security issue for multi-channel wireless communications in the presence of an FD active eavesdropper | Not adaptive |
| [25] (2011) | FD | Multiple | Eavesdrop and jam during data transmission | 1 | Gradient projection + fixed-point iteration algorithm | Solves the worst-case jamming covariance for arbitrary and Gaussian input signaling | Not adaptive |
| [26] (2017) | FD | Multiple | Eavesdrop and jam during data transmission | 1 | Convex optimization; primal decomposition iterative algorithm | Handles uncertain channel state information | Not adaptive |
| [27] (2017) | HD | Multiple | Eavesdrop and jam during data transmission | 1 | Constructing the precoding matrix pair at Alice and Bob | Optimal antenna allocation at the receiver to maximize the achievable secrecy degrees of freedom | Not adaptive |
| [28] (2012) | HD | Single | Jam during pilot training, eavesdrop during data transmission | 1 | Convex optimization | Introduces the active eavesdropper into the pilot contamination phenomenon | Not adaptive |
| [30] (2017) | HD | Single | Jam during pilot training, eavesdrop during data transmission | 1 | Random-training-assisted spoofing detection scheme | Adds a random training phase after the conventional pilot training phase | Not adaptive |
| [32] (2018) | HD | Single | Silence, spoof, eavesdrop, and jam during data transmission | 1 | Stackelberg game; single-agent Q-learning algorithm | Solves the physical layer security issue under imperfect channel estimation | Single-agent Q-learning |
| [33] (2017) | FD | Multiple | Eavesdrop, jam, spoof, or keep silent during data transmission | 1 | Noncooperative game; single-agent Q-learning algorithm | Solves the physical layer security issue under imperfect channel estimation | Not adaptive |
| [34] (2011) | HD | Single | Eavesdrop and jam during data transmission | 1 | Mixed-strategy based noncooperative nonzero-sum game; fictitious-play-based iterative algorithm | Reveals the impact of the presence of a malicious node on the multi-relay to multi-user choices | Adaptive |
| [35] (2017) | HD | Single | Eavesdrop, jam, and spoof during data transmission | 1 | PT-based dynamic smart attack game; Q-learning/WoLF-PHC/DQN algorithms | Applies prospect theory | Adaptive |


R_{ab} = \log_2\left(1 + r_{ab}\right). \quad (4)

Due to Remark 1, each AE can obtain the other AEs' actions. So, the signal received at the kth passive eavesdropper can be expressed as

y_e = \sqrt{P_s g_{a,k}}\, x_a + \sum_{j\in\Phi_J} \sqrt{P_j g_{j,k}}\, x_j + n_e, \quad (5)

where n_e \sim \mathcal{CN}(0, \sigma_n^2) is the AWGN at the kth passive eavesdropper. Similarly, the received SINR at the kth passive eavesdropper can be expressed as

r_k = \frac{P_s \omega_0 d_{a,k}^{-\eta}}{I_{J,E} + 1}, \quad (6)

where I_{J,E} = \sum_{j\in\Phi_J} P_j \omega_0 d_{j,k}^{-\eta}.

Assuming the maximal eavesdropped information is determined by the maximal SINR among all passive eavesdroppers, i.e., r_E = \max_{k\in\Phi_E} r_k, we obtain the maximal data rate of the Alice-AE links as

R_{ae} = \log_2\left(1 + r_E\right). \quad (7)

From (4) and (7), the secrecy rate of Alice can be written as

R_a = \left[R_{ab} - R_{ae}\right]^+, \quad (8)

where [X]^+ returns X if X is positive and 0 otherwise.
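Equations (3)-(8) chain together directly; the sketch below recomputes them for one channel realization, with all distances, powers, interference values, and \omega_0 chosen purely for illustration.

```python
import math

def rate(sinr):
    # Eqs. (4)/(7): Shannon rate log2(1 + SINR).
    return math.log2(1.0 + sinr)

def secrecy_rate(Ps, w0, eta, d_ab, I_JB, eaves):
    # Eq. (3): SINR at Bob; Eq. (6): SINR at each passive eavesdropper,
    # given as (distance, interference) pairs; Eq. (8): Ra = [Rab - Rae]^+.
    r_ab = Ps * w0 * d_ab ** (-eta) / (I_JB + 1.0)
    r_E = max(Ps * w0 * d ** (-eta) / (I + 1.0) for d, I in eaves)
    return max(rate(r_ab) - rate(r_E), 0.0)

# Bob closer to Alice than any eavesdropper -> positive secrecy rate.
Ra = secrecy_rate(Ps=1.0, w0=1e4, eta=2.0, d_ab=100.0, I_JB=0.0,
                  eaves=[(200.0, 0.0), (300.0, 0.0)])
```

Note how jamming enters only through the interference terms: raising I_JB lowers r_ab, while raising an eavesdropper's interference lowers r_E and thus helps Alice.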

4. Secure Transmission Game

In this section, we investigate the secure transmission problem with multiple UAV AEs. The interactions between the transmitter and the multiple UAV AEs are formulated under the Stackelberg game framework. The optimal power allocation and secrecy rate of Alice and the best attack modes of all AEs are derived by analyzing the equilibrium of the game.

4.1. Secure Transmission Game Formulation. The secure transmission problem of the proposed system can be formulated as a two-stage Stackelberg game. Specifically, Alice is the leader and all AEs are followers. Alice decides its transmit power first, and all AEs then take their actions adaptively based on the observation of the leader's action. The secure transmission game is formulated as

\mathcal{G} = \left\{\mathcal{N}, \mathcal{P}, \mathcal{Q}, U_a, U_m\right\}. \quad (9)

Here, \mathcal{N} = \{\text{Alice}, AE_1, \dots, AE_m, \dots, AE_M\} is the set of players, \mathcal{P} = [0, P_{max}] and \mathcal{Q} = \{e, j\} are the strategy spaces of Alice and each AE, respectively, and U_a and U_m are the utilities of Alice and AE_m, respectively.

In this system, Alice wants to send a confidential message and thus naturally intends to maximize its secrecy rate. Meanwhile, transmission cost is inevitable during the transmission. Therefore, the utility of the leader is the trade-off between the secrecy rate and the transmission cost, which can be formulated as

Figure 1: System model.


U_a = R_a \ln 2 - C_a P_s, \quad (10)

where C_a denotes the cost of the unit transmit power of Alice. For computational convenience, we multiply the data rate by the coefficient \ln 2.

The objective of the leader is to solve the following problem to obtain the optimal power allocation:

P_s^* = \arg\max_{P_s \in [0, P_{max}]} U_a\left(P_s, q_m^*, q_{-m}^*\right), \quad (11)

where q_1^*, \dots, q_M^* denote the optimal actions of all AEs.

On the other hand, each AE attempts to minimize the secrecy rate of Alice by changing its attack mode adaptively according to Alice's transmit power. Therefore, we formulate the utility of AE_m as the trade-off between the secrecy rate and its attack cost:

U_m = -R_a - \theta_{q_m}, \quad q_m \in \{e, j\}, \quad (12)

where \theta_e and \theta_j denote the cost for an AE to act as a passive eavesdropper and as an active jammer, respectively. We assume that \theta_e is related to R_{ae}, i.e., \theta_e = C_e R_{ae}, where C_e denotes the cost per unit rate of R_{ae}, and \theta_j = C_j P_j, where C_j denotes the cost of the unit transmit power of the jammer.

To calculate the utility of a single AE accurately, at each time slot, when Alice is transmitting a signal to Bob, we divide all AEs into three parts, denoted as \Phi_E^{-m}, \Phi_J^{-m}, and AE_m, with |\Phi_E^{-m}| + |\Phi_J^{-m}| + |AE_m| = M. Here, \Phi_E^{-m} is the set of passive eavesdroppers except AE_m, and \Phi_J^{-m} is the set of active jammers except AE_m.

If AE_m decides to act as a passive eavesdropper, R_a can be expressed as

R_a = \left[R_{ab} - R_{ae}\right]^+ = \left[\log_2\left(1 + r_{ab}\right) - \log_2\left(1 + r_E\right)\right]^+
    = \left[\log_2\left(1 + \frac{P_s \omega_0 d_{a,b}^{-\eta}}{I_{J,B}^{-m} + 1}\right) - \log_2\left(1 + \max\left(\max_{k\in\Phi_E^{-m}} r_k,\; r_m\right)\right)\right]^+
    = \left[\log_2\left(1 + \frac{P_s \omega_0 d_{a,b}^{-\eta}}{I_{J,B}^{-m} + 1}\right) - \log_2\left(1 + \max\left(r_E^{-m}, r_m\right)\right)\right]^+, \quad (13)

where I_{J,B}^{-m} is the interference received at Bob from \Phi_J^{-m}, and r_m is the SINR of AE_m. Similarly, if AE_m selects to jam, R_a can be expressed as

R_a = \left[R_{ab} - R_{ae}\right]^+ = \left[\log_2\left(1 + r_{ab}\right) - \log_2\left(1 + r_E\right)\right]^+
    = \left[\log_2\left(1 + \frac{P_s \omega_0 d_{a,b}^{-\eta}}{I_{J,B}^{-m} + I_m + 1}\right) - \log_2\left(1 + \max_{k\in\Phi_E^{-m}} r_k\right)\right]^+, \quad (14)

where I_m is the jamming power of AE_m, and r_E^{-m} is the maximal SINR among all passive eavesdroppers in \Phi_E^{-m}. In conclusion, U_m can be expressed as

U_m = \begin{cases}
\left[\log_2\left(1 + \dfrac{P_s \omega_0 d_{a,b}^{-\eta}}{I_{J,B}^{-m} + 1}\right) - \log_2\left(1 + \max\left(r_E^{-m}, r_m\right)\right)\right]^+ - C_e R_{ae}, & q_m = e \quad (a),\\[2ex]
\left[\log_2\left(1 + \dfrac{P_s \omega_0 d_{a,b}^{-\eta}}{I_{J,B}^{-m} + I_m + 1}\right) - \log_2\left(1 + \max_{k\in\Phi_E^{-m}} r_k\right)\right]^+ - C_j P_j, & q_m = j \quad (b).
\end{cases} \quad (15)

Similarly, the objective of AE_m is to solve the following problem:

q_m^* = \arg\max_{q_m \in \{e, j\}} U_m\left(P_s^*, q_m, q_{-m}^*\right), \quad (16)

where q_{-m}^* denotes the optimal actions of all AEs except AE_m.

4.2. Analysis of Strategy Equilibrium. Now, we analyze the proposed Stackelberg game model and solve the optimization subproblems (11) and (16). As a follower, each AE adjusts its attack mode after sensing Alice's strategy. Therefore, the subgame of the followers is analyzed first.

Proposition 1. Given the strategy of Alice, the optimal attack mode strategy of AE_m is expressed as (17), provided (17(a)) and (17(b)) hold:

q_m^*(P_s) = \begin{cases}
e, & \text{if } C_j P_j \ge \log_2\left[\dfrac{\left(1 + P_s^*\omega_0 d_{a,b}^{-\eta} + I_{J,B}^{-m*}\right)\left(I_{J,B}^{-m*} + I_m + 1\right)\left(1 + r_E^{-m*}\right)}{\left(1 + P_s^*\omega_0 d_{a,b}^{-\eta} + I_{J,B}^{-m*} + I_m\right)\left(I_{J,B}^{-m*} + 1\right)\left(1 + \max\left(r_E^{-m*}, r_m\right)\right)^{1-C_e}}\right] \quad (a),\\[2ex]
j, & \text{if } C_j P_j \le \log_2\left[\dfrac{\left(1 + P_s^*\omega_0 d_{a,b}^{-\eta} + I_{J,B}^{-m*}\right)\left(I_{J,B}^{-m*} + I_m + 1\right)\left(1 + r_E^{-m*}\right)}{\left(1 + P_s^*\omega_0 d_{a,b}^{-\eta} + I_{J,B}^{-m*} + I_m\right)\left(I_{J,B}^{-m*} + 1\right)\left(1 + \max\left(r_E^{-m*}, r_m\right)\right)^{1-C_e}}\right] \quad (b),
\end{cases} \quad (17)


where P_s^* is the optimal power allocation, I_{J,B}^{-m*} = \sum_{j\in\Phi_J^{-m*}} P_j \omega_0 d_{j,b}^{-\eta} denotes the interference from \Phi_J^{-m*}, the set of AEs that choose to jam as their optimal strategy, and r_E^{-m*} = \max_{k\in\Phi_E^{-m*}} r_k denotes the maximal SINR among the AEs in \Phi_E^{-m*}, the set of AEs that choose to overhear as their optimal strategy.
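Proposition 1 amounts to a direct utility comparison between the two modes of Eq. (15). The sketch below performs that check; the rate and cost numbers are invented for illustration, and the R_a terms are assumed to have been computed upstream via Eqs. (13)-(14).

```python
def ae_best_response(Ra_if_eave, Rae_if_eave, Ra_if_jam, Ce, Cj, Pj):
    # Eq. (15a): utility when eavesdropping, with cost Ce * Rae.
    U_e = -Ra_if_eave - Ce * Rae_if_eave
    # Eq. (15b): utility when jamming, with cost Cj * Pj.
    U_j = -Ra_if_jam - Cj * Pj
    # Eq. (16): AE_m picks the mode with the larger utility.
    return "e" if U_e >= U_j else "j"

# Jamming sharply cuts Alice's secrecy rate at moderate cost -> jam.
mode = ae_best_response(Ra_if_eave=1.2, Rae_if_eave=0.8,
                        Ra_if_jam=0.3, Ce=0.1, Cj=0.2, Pj=1.0)
```

Raising Cj (or lowering Ce) tips the comparison back toward eavesdropping, which is exactly the eavesdropping-cost effect examined in the numerical results.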

Proof. If (17(a)) holds, from (12), we have

If (17(b)) holds, from (12), we have

Um P∗s , j, q∗−mð Þ −Um P∗

s , e, q∗−mð Þ

= log2 1 + P∗s ω0d

−ηa,b

I−m∗J ,B + 1

!− log2 1 +max r−m∗

E , rmð Þð Þ" #

− log2 1 + Psω0d−ηa,b

I−m∗J ,B + Im + 1

!− log2 1 + r−m∗

Eð Þ" #

+ Ce log2 1 + max r−m∗E , rmð Þð Þ − CjPj

= log21 + P∗

s ω0d−ηa,b + I−m∗

J ,B� �

I−m∗J ,B + Im + 1

� �1 + r−m∗

Eð Þ1 + P∗

s ω0d−ηa,b + I−m∗

J ,B + Im� �

I−m∗J ,B + 1

� �1 + max r−m∗

E , rmð Þð Þ1−Ce

" #

− CjPj ≥ 0:

ð19Þ

Thus (17) holds.As shown in Proposition 1, if passive eavesdropping can

bring worse secrecy rate and less cost than active jamming,the AE will select to overhear and vice versa.

As the leader of the game, Alice first chooses to transmitpower. The optimal power strategy of Alice can be derived bysolving (11), which is revealed in Proposition 2.

Proposition 2. The optimal power allocation is P∗s , which sat-

isfies the following equation:

ω0d−ηa,b − ω0 max

k∈Φ∗E

d−ηa,k/I∗J ,E + 1� �

I J ,B − ω0 maxk∈Φ∗

E

d−ηa,k/I∗J ,E + 1� �

1 + I∗J ,B + P∗s ω0d

−ηa,b

� �1 + P∗

s ω0 maxk∈Φ∗

E

d−ηa,k/I∗J ,E + 1� �� � = Ca  að Þ,

0 ≤ P∗s ≤ Pmax  bð Þ,

8>>>><>>>>:

ð20Þ

if (21(a)) and (21(b)) hold.

where I∗J ,B and I∗J ,E denotes the interference from Φ∗J in which

each AE chooses to jam as an optimal strategy to Bob and thekth passive eavesdropper, respectively.

Proof. The first-order derivative of Alice's utility with respect to $P_s$ is

$$
\frac{\partial U_a}{\partial P_s}=\frac{\omega_0 d_{a,b}^{-\eta}-\omega_0\max_{k\in\Phi_E^*}\left(d_{a,k}^{-\eta}/\left(I_{J,E}^*+1\right)\right)\left(I_{J,B}^*+1\right)}{\left(1+I_{J,B}^*+P_s\omega_0 d_{a,b}^{-\eta}\right)\left(1+P_s\omega_0\max_{k\in\Phi_E^*}\left(d_{a,k}^{-\eta}/\left(I_{J,E}^*+1\right)\right)\right)}-C_a,
\tag{22}
$$

$$
\begin{aligned}
U_m\!\left(P_s^*, e, q_{-m}^*\right)-U_m\!\left(P_s^*, j, q_{-m}^*\right)
&=-\left[\log_2\!\left(1+\frac{P_s^*\omega_0 d_{a,b}^{-\eta}}{I_{J,B}^{-m*}+1}\right)-\log_2\!\left(1+\max\!\left(r_E^{-m*},r_m\right)\right)\right]\\
&\quad+\left[\log_2\!\left(1+\frac{P_s^*\omega_0 d_{a,b}^{-\eta}}{I_{J,B}^{-m*}+I_m+1}\right)-\log_2\!\left(1+r_E^{-m*}\right)\right]\\
&\quad-C_e\log_2\!\left(1+\max\!\left(r_E^{-m*},r_m\right)\right)+C_jP_j\\
&=-\log_2\!\left[\frac{\left(1+I_{J,B}^{-m*}+P_s^*\omega_0 d_{a,b}^{-\eta}\right)\left(1+I_{J,B}^{-m*}+I_m\right)\left(1+r_E^{-m*}\right)}{\left(1+I_{J,B}^{-m*}\right)\left(1+I_{J,B}^{-m*}+I_m+P_s^*\omega_0 d_{a,b}^{-\eta}\right)\left(1+\max\!\left(r_E^{-m*},r_m\right)\right)^{1-C_e}}\right]+C_jP_j\ge 0.
\end{aligned}
\tag{18}
$$

$$
\begin{cases}
\displaystyle\max_{k\in\Phi_E^*}\left(\frac{d_{a,k}^{-\eta}}{I_{J,E}^*+1}\right)<\frac{d_{a,b}^{-\eta}}{I_{J,B}^*+1}, &\text{(a)}\\[3mm]
\dfrac{\omega_0 d_{a,b}^{-\eta}-\omega_0\max_{k\in\Phi_E^*}\left(d_{a,k}^{-\eta}/\left(I_{J,E}^*+1\right)\right)\left(I_{J,B}^*+1\right)}{\left(1+I_{J,B}^*+P_{\max}\omega_0 d_{a,b}^{-\eta}\right)\left(1+P_{\max}\omega_0\max_{k\in\Phi_E^*}\left(d_{a,k}^{-\eta}/\left(I_{J,E}^*+1\right)\right)\right)}\le C_a\le\dfrac{\omega_0 d_{a,b}^{-\eta}}{1+I_{J,B}^*}-\omega_0\max_{k\in\Phi_E^*}\left(\frac{d_{a,k}^{-\eta}}{I_{J,E}^*+1}\right). &\text{(b)}
\end{cases}
\tag{21}
$$


$$
\frac{\partial^2 U_a}{\partial P_s^2}=\left[\frac{\omega_0 d_{a,b}^{-\eta}}{1+I_{J,B}^*+P_s\omega_0 d_{a,b}^{-\eta}}+\frac{\omega_0\max_{k\in\Phi_E^*}\left(d_{a,k}^{-\eta}/\left(I_{J,E}^*+1\right)\right)}{1+P_s\omega_0\max_{k\in\Phi_E^*}\left(d_{a,k}^{-\eta}/\left(I_{J,E}^*+1\right)\right)}\right]
\times\left[\frac{\omega_0\max_{k\in\Phi_E^*}\left(d_{a,k}^{-\eta}/\left(I_{J,E}^*+1\right)\right)}{1+P_s\omega_0\max_{k\in\Phi_E^*}\left(d_{a,k}^{-\eta}/\left(I_{J,E}^*+1\right)\right)}-\frac{\omega_0 d_{a,b}^{-\eta}}{1+I_{J,B}^*+P_s\omega_0 d_{a,b}^{-\eta}}\right].
\tag{23}
$$

If (21(a)) holds, (23) is less than zero. Thus, we have

$$
\frac{\partial^2 U_a}{\partial P_s^2}<0,
\tag{24}
$$

which indicates that $\partial U_a/\partial P_s$ decreases monotonically with $P_s$. Therefore, if (21(b)) holds, we have

$$
\left.\frac{\partial U_a}{\partial P_s}\right|_{P_s=0}=\frac{\omega_0 d_{a,b}^{-\eta}}{1+I_{J,B}^*}-\omega_0\max_{k\in\Phi_E^*}\left(\frac{d_{a,k}^{-\eta}}{I_{J,E}^*+1}\right)-C_a>0,
\tag{25}
$$

$$
\left.\frac{\partial U_a}{\partial P_s}\right|_{P_s=P_{\max}}=\frac{\omega_0 d_{a,b}^{-\eta}}{1+I_{J,B}^*+P_{\max}\omega_0 d_{a,b}^{-\eta}}-\frac{\omega_0\max_{k\in\Phi_E^*}\left(d_{a,k}^{-\eta}/\left(I_{J,E}^*+1\right)\right)}{1+P_{\max}\omega_0\max_{k\in\Phi_E^*}\left(d_{a,k}^{-\eta}/\left(I_{J,E}^*+1\right)\right)}-C_a<0,
\tag{26}
$$

indicating that there is a unique solution to $\partial U_a/\partial P_s=0$, given in (20(a)). From (22)–(24), we find that $U_a(P_s, q_m^*, q_{-m}^*)$ increases with $P_s$ if $P_s<P_s^*$ and decreases otherwise. Thus, (11) also holds and $(P_s^*, q_m^*, q_{-m}^*)$ is a Nash Equilibrium (NE) of the game. This completes the proof of Proposition 2.

As shown in Proposition 2, Alice stops the transmission when (21(b)) does not hold. In other words, Alice will stop transmitting when the radio channel degradation is so serious that security cannot be guaranteed.

Another NE, $(P_{\max}, q_m^*, q_{-m}^*)$, is revealed in Proposition 3.

Proposition 3. The secure game has the NE $(P_{\max}, q_m^*, q_{-m}^*)$ if (21(a)) and the following condition hold:

$$
\frac{\omega_0 d_{a,b}^{-\eta}}{1+I_{J,B}^*+P_{\max}\omega_0 d_{a,b}^{-\eta}}-\frac{\omega_0\max_{k\in\Phi_E^*}\left(d_{a,k}^{-\eta}/\left(I_{J,E}^*+1\right)\right)}{1+P_{\max}\omega_0\max_{k\in\Phi_E^*}\left(d_{a,k}^{-\eta}/\left(I_{J,E}^*+1\right)\right)}>C_a.
\tag{27}
$$

Proof. (21(a)) has been discussed above. Therefore, if (27) holds, we have

$$
\frac{\partial U_a}{\partial P_s}\ge\left.\frac{\partial U_a}{\partial P_s}\right|_{P_s=P_{\max}}\ge 0,\quad\forall\, 0\le P_s\le P_{\max},
\tag{28}
$$

which indicates that $U_a$ monotonically increases with $P_s$; hence, $(P_{\max}, q_m^*, q_{-m}^*)$ is also an NE of the game. This completes the proof of Proposition 3.

As shown in Proposition 3, a low transmission cost satisfying (27) will make Alice select the maximum transmit power to transmit the signals.

5. Hierarchical Reinforcement Learning Framework for the Secure Transmission Game

The proposed UAV secure communication problem with multiple AEs has been formulated as a Stackelberg game, which belongs to the category of two-stage dynamic games and has a distinct two-layer game structure. Alice and all AEs act as intelligent agents with the learning ability to automatically optimize their configurations. Besides, mixed strategies are applied by both sides of the communication to confuse each other. In this section, we apply a hierarchical RL framework to derive the mixed-strategy equilibrium and implement the UAV secure communication.

5.1. Analysis of Mixed-Strategy Equilibrium. Considering the actual wireless communication scenario, we assume that Alice has a finite and discretized power set. Specifically, a policy of Alice at time slot $t$ is defined to be a probability vector $\pi^t=(\pi_1^t,\pi_2^t,\cdots,\pi_L^t)$, where $\pi_l^t$ is the probability with which Alice chooses action (power level) $P_l$ from a finite discrete set $\mathcal{P}$, satisfying $\sum_{l=1}^{L}\pi_l^t=1$. Similarly, $\delta_m^t=(\delta_{m,1_m}^t,\delta_{m,2_m}^t)$ denotes the policy of AE$_m$ at time slot $t$, where $\delta_{m,i_m}^t$ is the probability with which AE$_m$ chooses action (attack mode) $Q_i$ from a finite discrete set $\mathcal{Q}$, satisfying $\sum_{i=1}^{2}\delta_{m,i_m}^t=1$.

Based on the above analysis, we have the following definition of an SE for the hierarchical RL framework based on Eqs. (10) and (12). Alice's objective is to maximize her revenue:

$$
\pi^*=\arg\max_{\pi} U_a\!\left(\pi,\delta_m^*,\delta_{-m}^*\right).
\tag{29}
$$

Similarly, each AE's objective is

$$
\delta_m^*=\arg\max_{\delta_m} U_m\!\left(\pi^*,\delta_m,\delta_{-m}^*\right).
\tag{30}
$$

Then, we define the SE in the hierarchical reinforcement learning framework.

Definition 1. A stationary policy profile $(\pi^*,\delta_m^*,\delta_{-m}^*)$ is the SE for the hierarchical RL framework if the following hold:

$$
\begin{cases}
U_a\!\left(\pi^*,\delta_m^*,\delta_{-m}^*\right)\ge U_a\!\left(\pi,\delta_m^*,\delta_{-m}^*\right), &\text{(a)}\\
U_m\!\left(\pi^*,\delta_m^*,\delta_{-m}^*\right)\ge U_m\!\left(\pi^*,\delta_m,\delta_{-m}^*\right). &\text{(b)}
\end{cases}
\tag{31}
$$

Proposition 4. For the proposed hierarchical RL framework, there exist a stationary policy of Alice and an NE policy of the AEs that together form an SE.


Proof. If Alice follows a stationary policy $\pi$, the Stackelberg game is simplified into an $M$-player hierarchical RL game. It has been shown in [42] that every finite strategic-form game has a mixed-policy equilibrium. As a result, there always exists an NE in our formulation of the discrete power allocation game given Alice's policy $\pi$. The rest of the proof follows directly from the definition of an SE and is thus omitted for brevity.

5.2. Hierarchical Q-Learning Based Power Allocation Algorithm. In the proposed UAV secure transmission game, since there is no information exchange between Alice and the AEs, both sides can only maximize their expected utilities through repeated interactions with each other. When an action taken by an agent (Alice or an AE) brings positive feedback, the agent reinforces that action; otherwise, it weakens the action. Agents constantly adjust their strategies based on the feedback to achieve optimal long-term returns. Thus, a hierarchical Q-learning based power allocation algorithm (HQLA) is adopted, where each agent's policy is parameterized through the Q-function that characterizes the relative expected utility of a particular action.

To be specific, for the follower's learning, let $Q_m^t(q_{m,i_m}^t)$ denote the Q-function of AE$_m$'s action $q_{m,i_m}^t$ under the current policy $\delta_{m,i_m}^t$ at time slot $t$. Then, after conducting the action $q_{m,i_m}^t$, the corresponding Q-value is updated as follows:

$$
Q_m^{t+1}\!\left(q_{m,i_m}\right)=Q_m^t\!\left(q_{m,i_m}^t\right)+\alpha\left[U_m\!\left(P_l^{t+1}, q_{m,i_m}^t, q_{-m,i_{-m}}^{t+1}\right)-Q_m^t\!\left(q_{m,i_m}^t\right)\right],
\tag{32}
$$

where $\alpha\in[0,1)$ is the learning rate and $U_m(P_l^{t+1}, q_{m,i_m}^t, q_{-m,i_{-m}}^{t+1})$ is the utility of AE$_m$ at time slot $t+1$. Each AE updates its policy based on the Boltzmann distribution:

$$
\delta_{m,i_m}^{t+1}\!\left(q_{m,i_m}\right)=\frac{\exp\!\left[Q_m^t\!\left(q_{m,i_m}^t\right)/\tau\right]}{\sum_{q_{s,i_s}\in\mathcal{Q}}\exp\!\left[Q_m^t\!\left(q_{s,i_s}^t\right)/\tau\right]},
\tag{33}
$$

where the temperature $\tau$ controls the trade-off between exploration and exploitation; i.e., for $\tau\to 0$, AE$_m$ greedily chooses the policy corresponding to the maximum Q-value, which means pure exploitation, whereas for $\tau\to\infty$, AE$_m$'s policy is completely random, which means pure exploration [43]. Accordingly, the Q-value of Alice is updated as follows:

$$
Q_a^{t+1}\!\left(P_l\right)=Q_a^t\!\left(P_l\right)+\alpha\left[U_a\!\left(P_l, q_{m,i_m}^{t+1}, q_{-m,i_{-m}}^{t+1}\right)-Q_a^t\!\left(P_l\right)\right],
\tag{34}
$$

where $U_a(P_l, q_{m,i_m}^{t+1}, q_{-m,i_{-m}}^{t+1})$ is the utility of Alice at time slot $t+1$. Then, Alice updates her policy based on the Boltzmann distribution:

$$
\pi_l^{t+1}\!\left(P_l\right)=\frac{\exp\!\left[Q_a^t\!\left(P_l\right)/\tau\right]}{\sum_{P_j\in\mathcal{P}}\exp\!\left[Q_a^t\!\left(P_j\right)/\tau\right]}.
\tag{35}
$$
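The Boltzmann updates (33) and (35) are softmax rules over the Q-values. A short Python sketch of this rule, illustrating the role of the temperature described above (the Q-values here are arbitrary example numbers, not results from the paper):

```python
import math

def boltzmann_policy(q_values, tau):
    """Softmax over Q-values with temperature tau, cf. (33) and (35)."""
    # Subtract the max before exponentiating for numerical stability;
    # this leaves the resulting distribution unchanged.
    m = max(q_values)
    exps = [math.exp((q - m) / tau) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

q = [0.2, 0.5, 0.1]                      # example Q-values, three power levels
greedy = boltzmann_policy(q, tau=1e-3)   # tau -> 0: pure exploitation
uniform = boltzmann_policy(q, tau=1e3)   # tau -> inf: pure exploration
print(greedy, uniform)
```

At low temperature nearly all probability mass sits on the action with the largest Q-value; at high temperature the policy approaches the uniform distribution, matching the discussion of $\tau$ above.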

Now, we present the detailed description of the Q-learning based hierarchical RL algorithm.

5.3. Convergence Analysis of Algorithm 1. The learning algorithm results in a stochastic process of choosing a power level, so we need to investigate the long-term behavior of the learning procedure. Following the discussion in [43], we obtain the following differential equations describing the evolution of the Q-values:

$$
\frac{dQ_a^{t+1}\!\left(P_l\right)}{dt}=\alpha\left[U_a\!\left(P_l, q_{m,i_m}^{t+1}, q_{-m,i_{-m}}^{t+1}\right)-Q_a^t\!\left(P_l\right)\right],
\tag{36}
$$

$$
\frac{dQ_m^{t+1}\!\left(q_{m,i_m}\right)}{dt}=\alpha\left[U_m\!\left(P_l^{t+1}, q_{m,i_m}^t, q_{-m,i_{-m}}^{t+1}\right)-Q_m^t\!\left(q_{m,i_m}^t\right)\right].
\tag{37}
$$

In the following, we would like to express the dynamics in terms of strategies rather than the Q-values. Toward this end, we differentiate (35) with respect to time $t$ and use (36); similarly, we differentiate (33) with respect to time $t$ and use (37).

Hierarchical Q-Learning Based Power Allocation Algorithm
1: Initialize $t=0$, $Q_a^t(P_l)=0$, $Q_m^t(q_{m,i_m}^t)=0$, $\pi_l^t(P_l)=1/L$, $\delta_{m,i_m}^t=1/2$, $m\in\mathcal{N}\setminus\{\text{Alice}\}$;
2: Loop:
3: $t=t+1$;
4: Update Alice's policy $\pi_l^t(P_l)$ and each AE$_m$'s policy $\delta_{m,i_m}^t(q_{m,i_m})$ according to (35) and (33), respectively;
5: Alice chooses the action $P_l$ with probability $\pi_l^t$;
6: Each AE$_m$ senses Alice's transmit power and selects $q_{m,i_m}$ with probability $\delta_{m,i_m}^t$;
7: Alice updates $U_a(P_l, q_{m,i_m}^{t+1}, q_{-m,i_{-m}}^{t+1})$ according to (10), and each AE$_m$ updates $U_m(P_l^{t+1}, q_{m,i_m}^t, q_{-m,i_{-m}}^{t+1})$ according to (12);
8: Alice and all AEs update their Q-values according to (34) and (32), respectively;
9: End Loop

Algorithm 1.
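Algorithm 1 can be sketched end to end as follows. This is a self-contained toy sketch, not the authors' code: the secrecy-rate and utility functions are simplified stand-ins for (10) and (12), a single follower is used, and all numeric values are assumptions (the unit costs echo the simulation section).

```python
import math
import random

random.seed(0)

POWERS = [0.2, 0.4, 0.6, 0.8, 1.0]   # finite discrete power set, stands in for P
MODES = ['e', 'j']                    # attack modes, stands in for Q
ALPHA, TAU = 0.1, 0.2                 # learning rate alpha and temperature tau
Ca, Ce, Cj = 0.1, 0.5, 0.1            # unit costs, as in the simulation section

def softmax(qs, tau):
    """Boltzmann policy over Q-values, cf. (33) and (35)."""
    m = max(qs)
    e = [math.exp((q - m) / tau) for q in qs]
    s = sum(e)
    return [x / s for x in e]

def sample(items, probs):
    """Draw one item according to the given probabilities."""
    r, acc = random.random(), 0.0
    for item, p in zip(items, probs):
        acc += p
        if r <= acc:
            return item
    return items[-1]

def secrecy_rate(Ps, mode):
    """Toy secrecy rate: jamming lowers Bob's SINR, eavesdropping taps Alice."""
    jam = 0.5 if mode == 'j' else 0.0
    rate_bob = math.log2(1 + 2.0 * Ps / (1 + jam))
    rate_eve = math.log2(1 + 0.5 * Ps) if mode == 'e' else 0.0
    return max(rate_bob - rate_eve, 0.0)

def leader_utility(Ps, mode):         # simplified stand-in for (10)
    return secrecy_rate(Ps, mode) - Ca * Ps

def follower_utility(Ps, mode):       # simplified stand-in for (12)
    cost = Cj * 0.5 if mode == 'j' else Ce * math.log2(1 + 0.5 * Ps)
    return -secrecy_rate(Ps, mode) - cost

Qa = [0.0] * len(POWERS)              # leader Q-values
Qm = [0.0] * len(MODES)               # single follower's Q-values
for t in range(3000):
    pi = softmax(Qa, TAU)             # (35): leader policy
    delta = softmax(Qm, TAU)          # (33): follower policy
    Ps = sample(POWERS, pi)           # leader acts first (Stackelberg)
    mode = sample(MODES, delta)       # follower senses the power and reacts
    l, i = POWERS.index(Ps), MODES.index(mode)
    Qa[l] += ALPHA * (leader_utility(Ps, mode) - Qa[l])    # (34)
    Qm[i] += ALPHA * (follower_utility(Ps, mode) - Qm[i])  # (32)

pi = softmax(Qa, TAU)
print('leader policy:', [round(p, 3) for p in pi])
```

The loop follows steps 2–9 of Algorithm 1: policies are refreshed from the Q-values, both sides act, and the Q-values are updated from the realized utilities.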


This yields

$$
\frac{d\pi_l^{t+1}\!\left(P_l\right)}{dt}=\pi_l^{t+1}\!\left(P_l\right)\frac{\alpha}{\tau}\left\{\left[U_a\!\left(P_l\right)-\sum_{j\in\mathcal{P}}\pi_j^{t+1}\!\left(P_j\right)U_a\!\left(P_j\right)\right]-\tau\sum_{j\in\mathcal{P}}\pi_j^{t+1}\!\left(P_j\right)\ln\!\left(\frac{\pi_l^{t+1}\!\left(P_l\right)}{\pi_j^{t+1}\!\left(P_j\right)}\right)\right\},
\tag{38}
$$

$$
\frac{d\delta_{m,i_m}^{t+1}\!\left(q_{m,i_m}\right)}{dt}=\delta_{m,i_m}^{t+1}\!\left(q_{m,i_m}\right)\frac{\alpha}{\tau}\left\{\left[U_m\!\left(q_{m,i_m}\right)-\sum_{i_s\in\mathcal{Q}}\delta_{s,i_s}^{t+1}\!\left(q_{s,i_s}\right)U_m\!\left(q_{s,i_s}\right)\right]-\tau\sum_{i_s\in\mathcal{Q}}\delta_{s,i_s}^{t+1}\!\left(q_{s,i_s}\right)\ln\!\left(\frac{\delta_{m,i_m}^{t+1}\!\left(q_{m,i_m}\right)}{\delta_{s,i_s}^{t+1}\!\left(q_{s,i_s}\right)}\right)\right\}.
\tag{39}
$$

The steady-state strategy profile $z^s=(\pi^s(P_l),\delta_{m,i_m}^s(q_{m,i_m}))$ can be obtained as [43]

$$
\pi^s\!\left(P_l\right)=\frac{\exp\!\left[Q_a^t\!\left(P_l\right)/\tau\right]}{\sum_{P_j\in\mathcal{P}}\exp\!\left[Q_a^t\!\left(P_j\right)/\tau\right]},
\tag{40}
$$

$$
\delta_{m,i_m}^s\!\left(q_{m,i_m}\right)=\frac{\exp\!\left[Q_m^t\!\left(q_{m,i_m}^t\right)/\tau\right]}{\sum_{q_{s,i_s}\in\mathcal{Q}}\exp\!\left[Q_m^t\!\left(q_{s,i_s}^t\right)/\tau\right]}.
\tag{41}
$$

Let $Z^t=(z_1^t,\cdots,z_N^t)$ denote the strategy profile of all players at time slot $t$. In the following analysis, we resort to an ordinary differential equation (ODE) whose solution approximates the convergence of $Z^t$. The right-hand sides of (38) and (39) can be represented by a function $f(Z^t)$ as $\alpha\to 0$. $Z^t$ converges weakly to $Z^*=(\pi_0^*,\delta_0^*)$, which is the solution to

$$
\frac{dZ}{dt}=f(Z),\quad Z^0=Z_0.
\tag{42}
$$
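The policy dynamics can be checked numerically: for fixed expected utilities, forward-Euler integration of the $\pi$-dynamics in (38) should settle at the Boltzmann distribution of those utilities, matching the steady state (40). A Python sketch; the utility values, step size, and iteration count are illustrative assumptions:

```python
import math

def step(x, U, alpha, tau, dt):
    """One forward-Euler step of the policy dynamics in (38)."""
    ubar = sum(p * u for p, u in zip(x, U))          # average utility
    ent = sum(p * math.log(p) for p in x)            # sum_j x_j ln x_j
    dx = [p * (alpha / tau) * ((u - ubar) - tau * (math.log(p) - ent))
          for p, u in zip(x, U)]
    x = [p + dt * d for p, d in zip(x, dx)]
    s = sum(x)
    return [p / s for p in x]    # renormalize against numerical drift

U = [0.3, 0.1, 0.2]              # assumed fixed expected utilities
tau, alpha, dt = 0.5, 1.0, 0.01
x = [1/3, 1/3, 1/3]              # start from the uniform policy
for _ in range(20000):
    x = step(x, U, alpha, tau, dt)

# Boltzmann steady state, cf. (40)
z = sum(math.exp(u / tau) for u in U)
target = [math.exp(u / tau) / z for u in U]
print([round(a, 4) for a in x], [round(b, 4) for b in target])
```

That the integrated trajectory lands on the softmax of the utilities is exactly the fixed-point property behind (40) and (41).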

Proposition 5. The HQLA can discover a mixed-strategy SE.

Proof. We prove this by contradiction. Suppose that the process generated by (33) and (35) converges to a point that is not an SE. The solutions of (42) are by definition stationary points, so HQLA can only converge to stationary points. This would imply that a stationary point that is not an SE is stable, which contradicts Proposition 4.

6. Simulation Results

Simulations are carried out to evaluate the performance of the proposed power allocation strategies against multiple UAV AEs. The scenario has one transmitter-receiver pair and three UAV AEs, denoted as Alice, Bob, AE1, AE2, and AE3, respectively. We set up a scenario network where all the UAVs are distributed in a 200 m × 200 m region. The system parameters are chosen for typical scenarios, including the costs of unit transmit power and jamming power, i.e., $C_a=C_j=0.1$ and $C_e=0.5$, the path loss exponent $\eta=2$, and $\omega_0=80$.

Figure 2 shows the expected utilities of the leader under different algorithms. The expected utility achieved by the proposed HQLA is significantly lower than that of the single-agent Q-learning algorithm (SAQL). This is because in SAQL, only Alice applies the reinforcement learning mechanism to maximize the secrecy rate, while the AEs' behaviors, which constitute joint actions, are treated as states in the Q-learning algorithm, so each AE cannot adaptively choose the optimal strategy to maximize its utility. In HQLA, by contrast, each AE with reinforcement learning ability can maximize its damage to the secrecy rate of the considered system through repeated interactions with Alice's and the other AEs' strategies. The comparison of the leader's expected utilities with SAQL implies that an agent's learning ability has a significant impact on its utility. Hence, the proposed HQLA provides an optimal power allocation strategy in the more hostile case of adaptive attacks from multiple AEs. On the other hand, the proposed HQLA is superior to the random selection algorithm (RS) because HQLA converges to a desirable solution, whereas RS is merely an instinctive approach.

Figure 3 shows the cumulative distribution function (CDF) of the convergence of HQLA and SAQL. As observed from Figure 3, the proposed algorithm converges at about 500 iterations, while the contrast algorithm converges at about 1000 iterations; that is, the convergence rate of HQLA is significantly better than that of SAQL. This is because all AEs in SAQL select actions randomly without learning ability, whereas in HQLA, taking the interactions between the two sides of the communication into account, all AEs make decisions according to the mixed strategy derived by RL, which can obtain the optimal strategy via trial and error. This also means that the learning ability has a significant positive impact on the convergence rate.

Figure 4 presents the evolution of the strategy selection probabilities of the leader's transmit power. At the very beginning, Alice randomly selects her transmit power according to a uniform distribution. As Algorithm 1 iterates, the strategy selection probabilities keep updating until convergence after about 500 iterations. It is worth noting that Algorithm 1 under this scenario converges to pure-strategy NE points, since the probability of selecting one power level approaches 1 while the probabilities of the other power levels decrease to 0 when the number of time slots is large enough. Thus, the theoretical prediction in Proposition 4 is verified under the existing conditions. Specifically, $P_{\max}$ in Figure 4(a) as the optimal transmit power is consistent with Proposition 3, and $P_s^*$ in Figure 4(b) is consistent with Proposition 2.

The leader's expected utility comparison under different $C_e$ is shown in Figure 5(a). It is noted that the steady value of the leader's expected utility increases as $C_e$ grows, because $C_e$ leads to changes in the AEs' attack strategies. Specifically, as rational agents, all AEs eventually choose to interfere with Bob in Figure 5(b) because they find that the utility of jamming is higher than that of eavesdropping according


to the learning process when $C_e=0.8$. Thus, the maximal data rate of the Alice-AE link is zero, which means the leader obtains the maximal secrecy rate and expected utility. Similarly, in Figure 5(d), all AEs find that the utility of eavesdropping is higher than that of jamming when $C_e=0.2$, and every AE chooses to eavesdrop on Alice. As a result, the maximal data rate of the Alice-AE link among all AEs is achieved and the leader suffers the lowest utility. When $C_e=0.5$ (Figure 5(c)), according to their own utilities, AE1 and AE3 always choose to interfere with Bob while AE2 prefers eavesdropping, which makes the leader's expected utility lie between the $C_e=0.8$ and $C_e=0.2$ cases. In addition, it is worth noting that the attack strategies of all AEs reach a pure-strategy equilibrium, since the probability of selecting one attack mode equals 1 while the probability of the other attack mode decreases to 0.

Figure 3: The CDF of convergence of HQLA and SAQL.

Figure 2: Expected utility of leader.


Figure 4: Power allocation probabilities of leader.


7. Conclusions and Future Work

In this paper, we have investigated the transmit power optimization problem of secure UAV communication in the presence of multiple UAV AEs. A secure transmission game is formulated, and the existence of the NE is proven by analyzing the interactions between the legitimate user and the AEs. Within a hierarchical game framework, we obtain the optimal transmit power solutions for the legitimate transmissions. Numerical results verify the theoretical analysis and show that the secrecy performance can be degraded severely by the AEs' learning ability. Moreover, the outperformance of the

Figure 5: Expected utility of leader and attack mode probabilities of all AEs under different $C_e$: (a) expected utility of leader under different $C_e$; (b) $C_e=0.8$; (c) $C_e=0.5$; (d) $C_e=0.2$.


HQLA's convergence and the impact of the eavesdropping cost on the AEs' attack-mode decisions are also demonstrated. To take advantage of the UAV's mobility, which can bring potential performance enhancement, in future work we will devote our efforts to jointly optimizing the UAV's trajectory and resource allocation against multiple AEs.

Data Availability

The data (figures) used to support the findings of this study are included within the article. Further details can be provided upon request.

Conflicts of Interest

The authors declare no conflict of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (no. 61771487 and no. 61471393) and the National Key R&D Program of China under Grant 2018YFB1801103.

References

[1] V. Mayor, R. Estepa, A. Estepa, and G. Madinabeitia, “Deploying a reliable UAV-aided communication service in disaster areas,” Wireless Communications and Mobile Computing, vol. 2019, Article ID 7521513, 20 pages, 2019.

[2] X. Fan, C. Huang, B. Fu, S. Wen, and X. Chen, “UAV-assisted data dissemination in delay-constrained VANETs,” Mobile Information Systems, vol. 2018, Article ID 8548301, 12 pages, 2018.

[3] M. Mozaffari, W. Saad, M. Bennis, and M. Debbah, “Unmanned aerial vehicle with underlaid device-to-device communications: performance and trade-offs,” IEEE Transactions on Wireless Communications, vol. 15, no. 6, pp. 3949–3963, 2016.

[4] J. Lyu, Y. Zeng, R. Zhang, and T. J. Lim, “Placement optimization of UAV-mounted mobile base stations,” IEEE Communications Letters, vol. 21, no. 3, pp. 604–607, 2017.

[5] Q. Wu, Y. Zeng, and R. Zhang, “Joint trajectory and communication design for multi-UAV enabled wireless networks,” IEEE Transactions on Wireless Communications, vol. 17, no. 3, pp. 2109–2121, 2018.

[6] N. H. Motlagh, M. Bagaa, and T. Taleb, “UAV-based IoT platform: a crowd surveillance use case,” IEEE Communications Magazine, vol. 55, no. 2, pp. 128–134, 2017.

[7] Y. Liang, H. V. Poor, and S. Shamai, “Secure communication over fading channels,” IEEE Transactions on Information Theory, vol. 54, no. 6, pp. 2470–2492, 2008.

[8] W. Yang, L. Tao, X. Sun, R. Ma, Y. Cai, and T. Zhang, “Secure on-off transmission in mmWave systems with randomly distributed eavesdroppers,” IEEE Access, vol. 7, pp. 32681–32692, 2019.

[9] Z. Xiang, W. Yang, G. Pan, Y. Cai, and Y. Song, “Physical layer security in cognitive radio inspired NOMA network,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 3, pp. 700–714, 2019.

[10] X. Sun, W. Yang, Y. Cai, L. Tao, Y. Liu, and Y. Huang, “Secure transmissions in wireless information and power transfer millimeter-wave ultra-dense networks,” IEEE Transactions on Information Forensics and Security, vol. 14, no. 7, pp. 1817–1829, 2019.

[11] R. Ma, W. Yang, X. Sun, L. Tao, and T. Zhang, “Secure communication in millimeter wave relaying networks,” IEEE Access, vol. 7, pp. 31218–31232, 2019.

[12] J. Huang and A. L. Swindlehurst, “Robust secure transmission in MISO channels based on worst-case optimization,” IEEE Transactions on Signal Processing, vol. 60, no. 4, pp. 1696–1707, 2012.

[13] Q. Yuan, Y. Hu, C. Wang, and Y. Li, “Joint 3D beamforming and trajectory design for UAV-enabled mobile relaying system,” IEEE Access, vol. 7, pp. 26488–26496, 2019.

[14] L. Zhu, J. Zhang, Z. Xiao, X. Cao, D. O. Wu, and X.-G. Xia, “3-D beamforming for flexible coverage in millimeter-wave UAV communications,” IEEE Wireless Communications Letters, vol. 8, no. 3, pp. 837–840, 2019.

[15] G. Zhang, Q. Wu, M. Cui, and R. Zhang, “Securing UAV communications via trajectory optimization,” in GLOBECOM 2017 - 2017 IEEE Global Communications Conference, pp. 1–6, Singapore, Jan. 2017.

[16] G. Zhang, Q. Wu, M. Cui, and R. Zhang, “Securing UAV communications via joint trajectory and power control,” IEEE Transactions on Wireless Communications, vol. 18, no. 2, pp. 1376–1389, 2019.

[17] C. Zhong, J. Yao, and J. Xu, “Secure UAV communication with cooperative jamming and trajectory control,” IEEE Communications Letters, vol. 23, no. 2, pp. 286–289, 2019.

[18] Q. Wang, Z. Chen, H. Li, and S. Li, “Joint power and trajectory design for physical-layer secrecy in the UAV-aided mobile relaying system,” IEEE Access, vol. 6, pp. 62849–62855, 2018.

[19] X. Sun, C. Shen, D. W. K. Ng, and Z. Zhong, “Robust trajectory and resource allocation design for secure UAV-aided communications,” in 2019 IEEE International Conference on Communications Workshops (ICC Workshops), pp. 1–6, Shanghai, China, 2019.

[20] A. Li, Q. Wu, and R. Zhang, “UAV-enabled cooperative jamming for improving secrecy of ground wiretap channel,” IEEE Wireless Communications Letters, vol. 8, no. 1, pp. 181–184, 2019.

[21] X. Sun, C. Shen, T.-H. Chang, and Z. Zhong, “Joint resource allocation and trajectory design for UAV-aided wireless physical layer security,” in 2018 IEEE Globecom Workshops (GC Wkshps), pp. 1–6, Abu Dhabi, United Arab Emirates, 2018.

[22] X. Tang, P. Ren, Y. Wang, Q. Du, and L. Sun, “Securing wireless transmission against reactive jamming: a Stackelberg game framework,” in 2015 IEEE Global Communications Conference (GLOBECOM), pp. 1–6, San Diego, CA, USA, 2015.

[23] J. Zheng, Y. Cai, and A. Anpalagan, “A stochastic game-theoretic approach for interference mitigation in small cell networks,” IEEE Communications Letters, vol. 19, no. 2, pp. 251–254, 2015.

[24] X. Tang, P. Ren, Y. Wang, and Z. Han, “Combating full-duplex active eavesdropper: a hierarchical game perspective,” IEEE Transactions on Communications, vol. 65, no. 3, pp. 1379–1395, 2017.

[25] A. Mukherjee and A. L. Swindlehurst, “A full-duplex active eavesdropper in MIMO wiretap channels: construction and


countermeasures,” in 2011 Conference Record of the Forty Fifth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), pp. 265–269, Pacific Grove, CA, USA, 2011.

[26] M. R. Abedi, N. Mokari, H. Saeedi, and H. Yanikomeroglu, “Robust resource allocation to enhance physical layer security in systems with full-duplex receivers: active adversary,” IEEE Transactions on Wireless Communications, vol. 16, no. 2, pp. 885–899, 2017.

[27] L. Li, A. P. Petropulu, and Z. Chen, “MIMO secret communications against an active eavesdropper,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 10, pp. 2387–2401, 2017.

[28] X. Zhou, B. Maham, and A. Hjorungnes, “Pilot contamination for active eavesdropping,” IEEE Transactions on Wireless Communications, vol. 11, no. 3, pp. 903–907, 2012.

[29] A. Al-nahari, “Physical layer security using massive multiple-input and multiple-output: passive and active eavesdroppers,” IET Communications, vol. 10, no. 1, pp. 50–56, 2016.

[30] X. Tian, M. Li, and Q. Liu, “Random-training-assisted pilot spoofing detection and security enhancement,” IEEE Access, vol. 5, pp. 27384–27399, 2017.

[31] Y. Wu, R. Schober, D. W. K. Ng, C. Xiao, and G. Caire, “Secure massive MIMO transmission with an active eavesdropper,” IEEE Transactions on Information Theory, vol. 62, no. 7, pp. 3880–3900, 2016.

[32] C. Li, Y. Xu, J. Xia, and J. Zhao, “Protecting secure communication under UAV smart attack with imperfect channel estimation,” IEEE Access, vol. 6, pp. 76395–76401, 2018.

[33] Y. Li, L. Xiao, H. Dai, and H. V. Poor, “Game theoretic study of protecting MIMO transmissions against smart attacks,” in 2017 IEEE International Conference on Communications (ICC), pp. 1–6, Paris, France, May 2017.

[34] Q. Zhu, W. Saad, Z. Han, H. V. Poor, and T. Basar, “Eavesdropping and jamming in next-generation wireless networks: a game-theoretic approach,” in 2011 - MILCOM 2011 Military Communications Conference, pp. 119–124, Baltimore, MD, USA, 2011.

[35] L. Xiao, C. Xie, M. Min, and W. Zhuang, “User-centric view of unmanned aerial vehicle transmission against smart attacks,” IEEE Transactions on Vehicular Technology, vol. 67, no. 4, pp. 3420–3430, 2018.

[36] L. Yang, J. Chen, H. Jiang, S. A. Vorobyov, and H. Zhang, “Optimal relay selection for secure cooperative communications with an adaptive eavesdropper,” IEEE Transactions on Wireless Communications, vol. 16, no. 1, pp. 26–42, 2017.

[37] J. Liu, W. Yang, S. Xu, J. Liu, and Q. Zhang, “Q-learning based UAV secure communication in presence of multiple UAV active eavesdroppers,” in 2019 11th International Conference on Wireless Communications and Signal Processing (WCSP), pp. 1–6, Xi'an, China, 2019.

[38] D. Yang, G. Xue, J. Zhang, A. Richa, and X. Fang, “Coping with a smart jammer in wireless networks: a Stackelberg game approach,” IEEE Transactions on Wireless Communications, vol. 12, no. 8, pp. 4038–4047, 2013.

[39] L. Jia, Y. Xu, Y. Sun, S. Feng, and A. Anpalagan, “Stackelberg game approaches for anti-jamming defence in wireless networks,” IEEE Wireless Communications, vol. 25, no. 6, pp. 120–128, 2018.

[40] H. Fang, L. Xu, Y. Zou, X. Wang, and K.-K. R. Choo, “Three-stage Stackelberg game for defending against full-duplex active eavesdropping attacks in cooperative communication,” IEEE Transactions on Vehicular Technology, vol. 67, no. 11, pp. 10788–10799, 2018.

[41] C. Cheng, Z. Zhu, B. Xin, and C. Chen, “A multi-agent reinforcement learning algorithm based on Stackelberg game,” in 2017 6th Data Driven Control and Learning Systems (DDCLS), pp. 727–732, Chongqing, China, 2017.

[42] D. Fudenberg and J. Tirole, Game Theory, MIT Press, 1991.

[43] A. Kianercy and A. Galstyan, “Dynamics of Boltzmann Q learning in two-player two-action games,” Physical Review E, vol. 85, no. 4, article 041145, 2012.
