Overview of Nesting in the NMM-B
Tom Black
18 February 2014
● Nests in operational NEMS/NAM
● The grid
● Motion
● 2-way exchange
● User specification of nest-related variables
● Sequence of execution
● MPI task usage
  - 1-way exchange
  - 2-way exchange
  - The communicators
General Characteristics of NMM-B Nests
● Static / moving
● 1-way / 2-way interactive
● Multiple nests run simultaneously
● Telescoping domains
● Bit restartable ( static / moving / 1-way / 2-way* )
● Parent-oriented
NEMS Structure
[Figure: the NEMS component hierarchy. Box labels: MAIN; NEMS; Earth Ensemble Mediator; EARTH(1:NM); Atm-Ocn Mediator; Ocean, Atm, Ice; Chem; NMM, GSM, FIM; Domain(1:ND), the parents and children; Solver; Wrt, Dyn, Phy.]
All boxes represent ESMF components.
Current Operational NAM with 1-Way Static Nests
• Parent runs at 12 km to 84 hr
• Four static nests run to 60 hr
  – 4 km CONUS nest (3-to-1)
  – 6 km Alaska nest (2-to-1)
  – 3 km HI & PR nests (4-to-1)
• Single relocatable 1.33 km or 1.5 km FireWeather grandchild run to 36 hr (3-to-1 or 4-to-1)
Task Usage for NMM-B 1-Way Nesting
The user distributes available compute tasks among all the various domains and fine-tunes those assignments (along with those of quilt tasks) so that parents and their children proceed in the forecast at virtually the same rate as all domains integrate concurrently. This gives the user the ability to optimize the work load balance.
Relative Compute Resources used by NAM/Nests
3 km Puerto Rico nest: 4%
1.33 km CONUS FireWx nest: 17%
4 km CONUS nest: 57%
3 km Hawaii nest: 5%
12 km parent: 10%
6 km Alaska nest: 7%
1-Way Integration for Three Generations
[Figure: timelines for Δt par, Δt child, and Δt grandchild. Parent updates child BCs.]
All generations integrate concurrently.
NMM-B with 1-Way Nesting using 72 Compute Tasks
[Figure: 72 compute tasks divided among three generations.]
generation #1: tasks 0-7
generation #2: tasks 8-47
generation #3: tasks 48-71
Two Key Timers
cpl1_recv_tim: Child wait time to recv BC data
Appears as 'cpl recv = ' in stdout file
cpl2_wait_tim: Parent wait time for BC send to finish
Appears as 'cpl wait = ' in stdout file
If child wait time is large => child is too fast relative to parent.
=> Reduce child tasks, increase parent tasks.
If parent wait time is large => parent is too fast relative to child.
=> Reduce parent tasks, increase child tasks.
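The tuning rule for the two timers can be sketched in a few lines of Python. This is only an illustration: the function name, the `run_time` argument, and the 5% tolerance are assumptions, not part of NEMS; the timer values would be read by hand from the 'cpl recv = ' and 'cpl wait = ' lines in stdout.

```python
def rebalance_hint(cpl_recv, cpl_wait, run_time, tolerance=0.05):
    """Suggest a task shift from the two coupler wait timers (all in seconds).

    cpl_recv: child wait time to receive BC data (cpl1_recv_tim)
    cpl_wait: parent wait time for the BC send to finish (cpl2_wait_tim)
    tolerance: hypothetical fraction of run_time below which a wait is ignored
    """
    child_frac = cpl_recv / run_time
    parent_frac = cpl_wait / run_time
    # A large child wait means the child is too fast relative to its parent.
    if child_frac > tolerance and child_frac >= parent_frac:
        return "reduce child tasks, increase parent tasks"
    # A large parent wait means the parent is too fast relative to its child.
    if parent_frac > tolerance:
        return "reduce parent tasks, increase child tasks"
    return "balanced"


print(rebalance_hint(cpl_recv=120.0, cpl_wait=5.0, run_time=1000.0))
```

In practice the adjustment is iterative: shift a few tasks, rerun, and read the timers again.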
2-Way Integration for Three Generations
[Figure: timelines for Δt par, Δt child, and Δt grandchild. Parent updates child BCs; child updates parent.]
Only one generation can be active at a given time.
Use 1-Way Task Assignment Strategy in 2-Way Nests?
NO – Too many tasks can sit idle since domains are active in only one generation at a time.
Therefore use a different approach based on the generations of domains.
NMM-B with 1-Way Nesting using 72 Compute Tasks
generation #1: tasks 0-7
generation #2: tasks 8-47
generation #3: tasks 48-71
Only 40 of 72 tasks would be working in the busiest generation if this method were used for 2-way.
Basic Strategy for Task Use by Generations
‣ Generations must wait on each other in 2-way mode.
‣ Not all domains can execute concurrently, so maximize the amount of work that can be done at any given time by assigning ALL compute tasks to the most expensive generation and distributing them among its domains for optimal efficiency.
‣ Then reassign only as many compute tasks to the domains in each remaining generation as is beneficial in minimizing the clock times of those generations, avoiding subdomains that are too small and halo exchanges that are too costly.
Rules for ‘Generational’ Task Usage
‣ A compute task can be in more than one generation but cannot be on more than one domain per generation.
‣ Generations execute sequentially.
‣ ALL compute tasks are assigned to the most expensive generation.
‣ All domains in each generation execute concurrently.
‣ Each quilt task must still be uniquely assigned to a single domain to retain asynchronous writing of output.
The user is now able to optimize speed in 2-way nesting while never imposing large imbalances. Some tasks might be idle in some generations but all generations are running as fast as possible.
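The rules for compute tasks can be checked mechanically. The sketch below is a hypothetical helper, not NMM-B code: it takes one dictionary per generation mapping a domain ID to its set of task ranks, rejects any task placed on two domains of the same generation, and requires that the largest generation use every compute task. The per-domain splits in the example are made up; only the generation totals (12, 72, 42) follow the 72-task layout in these slides.

```python
def check_generational_layout(generations, n_compute):
    """Validate a generational task layout.

    generations: list with one dict per generation, {domain_id: set_of_task_ranks}
    n_compute:   total number of compute tasks
    Returns the number of distinct tasks used by each generation.
    """
    sizes = []
    for domains in generations:
        seen = set()
        for dom, tasks in domains.items():
            # A task may not sit on more than one domain within a generation.
            if seen & tasks:
                raise ValueError(f"domain {dom} shares a task within its generation")
            seen |= tasks
        sizes.append(len(seen))
    # The most expensive generation is expected to hold ALL compute tasks.
    if max(sizes) != n_compute:
        raise ValueError("no generation uses all compute tasks")
    return sizes


layout = [
    {1: set(range(0, 12))},                             # generation #1: tasks 0-11
    {2: set(range(0, 36)), 3: set(range(36, 72))},      # generation #2: tasks 0-71 (split is hypothetical)
    {4: set(range(12, 33)), 5: set(range(33, 54))},     # generation #3: tasks 12-53 (split is hypothetical)
]
print(check_generational_layout(layout, 72))
```

Tasks 12-35 appear in all three generations here, which is allowed because generations execute sequentially.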
NMM-B with 2-Way Nesting using 72 Compute Tasks ‘Generational’ task usage
generation #1: tasks 0-11
generation #2: tasks 0-71
generation #3: tasks 12-53
All 72 of 72 tasks working in the busiest generation.
Preliminary Estimate of 1-Way Compute Task Assignments
There are N compute tasks available.
There are 3 generations with 1 domain, 2 domains, and 2 domains, respectively.

Domain #1: IM1, JM1, DT1 => Work1 = IM1 x JM1
Domain #2: IM2, JM2, DT2 => Work2 = IM2 x JM2 x ( DT1 / DT2 )
Domain #3: IM3, JM3, DT3 => Work3 = IM3 x JM3 x ( DT1 / DT3 )
Domain #4: IM4, JM4, DT4 => Work4 = IM4 x JM4 x ( DT1 / DT4 )
Domain #5: IM5, JM5, DT5 => Work5 = IM5 x JM5 x ( DT1 / DT5 )

Total Work = TW = Work1 + Work2 + Work3 + Work4 + Work5

Domain #1 compute tasks: Work1 / TW x N
Domain #2 compute tasks: Work2 / TW x N
Domain #3 compute tasks: Work3 / TW x N
Domain #4 compute tasks: Work4 / TW x N
Domain #5 compute tasks: Work5 / TW x N
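The 1-way estimate is a proportional share of the work. A minimal Python sketch follows; the function name and the domain sizes in the example are hypothetical, and the result is a fractional share that the user still has to round and fine-tune.

```python
def one_way_tasks(domains, n_tasks):
    """Preliminary 1-way compute-task shares.

    domains: list of (IM, JM, DT) tuples; the first entry is domain #1,
             whose timestep DT1 scales the work of every other domain.
    n_tasks: N, the number of available compute tasks
    """
    dt1 = domains[0][2]
    # Work_k = IM_k x JM_k x (DT1 / DT_k); the factor is 1 for domain #1.
    work = [im * jm * (dt1 / dt) for im, jm, dt in domains]
    total = sum(work)
    return [w / total * n_tasks for w in work]


# Hypothetical example: a parent and one nest with the same dimensions,
# the nest taking three timesteps per parent timestep.
print(one_way_tasks([(100, 100, 30.0), (100, 100, 10.0)], 80))
```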
Preliminary Estimate of 2-Way Compute Task Assignments
Same setup as the 1-way case.
Assume 2nd generation is the most expensive.
Distribute tasks in 2nd generation as done for all 1-way domains previously.
Assign as many of the N tasks to generations 1 and 3 as possible without slowing down the run.

gen #1:
  Domain #1 compute tasks: <= N
gen #2: Total Work = TW2 = Work2 + Work3
  Domain #2 compute tasks: Work2 / TW2 x N
  Domain #3 compute tasks: Work3 / TW2 x N
gen #3: Total Work = TW3 = Work4 + Work5
  Domain #4 compute tasks: <= Work4 / TW3 x N
  Domain #5 compute tasks: <= Work5 / TW3 x N
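The 2-way estimate applies the same proportional rule within each generation. In the sketch below (a hypothetical helper with made-up work values), the shares returned for the most expensive generation are the assignments, while the shares for the other generations are upper bounds that would be reduced wherever extra tasks no longer help.

```python
def two_way_tasks(work_by_generation, n_tasks):
    """Preliminary 2-way compute-task shares, generation by generation.

    work_by_generation: list of lists; entry g holds the Work values of the
                        domains in generation g
    n_tasks:            N, the number of available compute tasks
    Returns (index of the most expensive generation, per-domain shares).
    """
    totals = [sum(gen) for gen in work_by_generation]
    busiest = totals.index(max(totals))
    # Within each generation, split N in proportion to each domain's work.
    shares = [[w / total * n_tasks for w in gen]
              for gen, total in zip(work_by_generation, totals)]
    return busiest, shares


# Hypothetical work values for 5 domains in 3 generations (1, 2, and 2 domains).
busiest, shares = two_way_tasks([[10.0], [30.0, 50.0], [20.0, 20.0]], 72)
print(busiest, shares)
```

For the single domain of generation #1 the share is N itself, which matches the "<= N" bound above.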
Example of 2-way Task Assignments
‣ You have 128 available tasks: 112 compute and 16 write.
‣ Five domains; 3 generations; 3rd is most expensive.

          Domain    Compute   Write
gen #1    Dom #1    5x8       1x2
gen #2    Dom #2    6x6       1x3
          Dom #3    6x6       1x3
gen #3    Dom #4    7x8       1x4
          Dom #5    7x8       1x4

Compute tasks (gen #3) = 112; write tasks = 16; total = 128.
One-Way Communication Between a Parent and Child
‣ MPI intercommunicators are very convenient for this.
‣ The lead tasks on both domains have rank 0.
‣ MPI sends/recvs use simple target and sender task ranks.
Example of an Intercommunicator
The global task ranks (unique task assignments to domains):
Parent – 25, 26, 27
Child – 52, 53, 54, 55
The intercommunicator task ranks:
Parent – 0, 1, 2
Child – 0, 1, 2, 3
Parent and Child Communications w/ Generations
‣ MPI intercommunicators cannot be used because parent and child may share some of the same tasks. The two groups of an MPI intercommunicator must be disjoint, so a global task rank cannot appear on both sides.
‣ Therefore we use MPI intracommunicators.
‣ Parent/child task ranks may repeat but will lie in a single non-repeating sequence in the communicator.
Example of an Intracommunicator
The global task ranks (tasks can be in more than 1 generation):
Parent – 3, 4, 5, 6
Child – 1, 2, 3, 4, 5, 6, 7
The intracommunicator task ranks (parent first):
Union – 3, 4, 5, 6, 1, 2, 7 -> 0, 1, 2, 3, 4, 5, 6
Parent – 0, 1, 2, 3
Child – 4, 5, 0, 1, 2, 3, 6
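The rank translation in this example can be reproduced with a short Python sketch. It is not MPI code: it only illustrates the ordering rule stated above (parent tasks first, then the child tasks not already listed), and the function name is hypothetical.

```python
def intracomm_ranks(parent, child):
    """Map global task ranks to ranks in the parent-child intracommunicator.

    parent, child: lists of global task ranks (they may overlap)
    Returns (union in communicator order, parent ranks, child ranks).
    """
    # Parent tasks come first; child tasks already present are not repeated.
    union = list(parent) + [t for t in child if t not in parent]
    rank = {global_rank: r for r, global_rank in enumerate(union)}
    return union, [rank[t] for t in parent], [rank[t] for t in child]


union, parent_ranks, child_ranks = intracomm_ranks([3, 4, 5, 6], [1, 2, 3, 4, 5, 6, 7])
print(union)         # [3, 4, 5, 6, 1, 2, 7]
print(parent_ranks)  # [0, 1, 2, 3]
print(child_ranks)   # [4, 5, 0, 1, 2, 3, 6]
```

Because the child ranks are no longer a contiguous 0..n-1 sequence, each send and receive has to look up its source or target.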
More bookkeeping for the Init step.
Variable sources/targets in MPI sends/recvs.
B-grid vs. E-grid
The B-grid is just a rotated E-grid.
[Figure: staggered H (mass) and v (velocity) points on the B-grid and on the E-grid, with dx and dy marked for each grid.]
Parent-Oriented Nests
The southwest H point of the nest domain coincides with a parent H point.
[Figure: portion of the parent domain, showing the parent task subdomains with the nest task subdomains overlaid.]
Summary of Parent-Child Gridpoint Relationships
[Figure: parent H and V points overlaid with child h and v points for an odd and an even space ratio.]
ODD space ratio
Child h points lie on parent H points.
Child v points lie on parent V points.
EVEN space ratio
Child h points lie on parent H and V points.
Child v points do not coincide with parent points.
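These coincidence rules follow from B-grid geometry and can be checked along one axis with exact fractions. The sketch assumes positions measured in parent grid spacings from a parent H point that coincides with the nest's southwest h point: parent H points sit at integers, parent V points at half-integers, child h points at k/ratio, and child v points at (k + 1/2)/ratio. The function name is hypothetical.

```python
from fractions import Fraction


def coincidences(ratio, n=24):
    """Report which child points can land on parent points along one axis."""
    half = Fraction(1, 2)
    child_h = [Fraction(k, ratio) for k in range(n)]
    child_v = [(k + half) / ratio for k in range(n)]

    def on_parent_h(x):
        return x.denominator == 1            # integer position

    def on_parent_v(x):
        return (x - half).denominator == 1   # half-integer position

    return {
        "h_on_H": any(on_parent_h(x) for x in child_h),
        "h_on_V": any(on_parent_v(x) for x in child_h),
        "v_on_H": any(on_parent_h(x) for x in child_v),
        "v_on_V": any(on_parent_v(x) for x in child_v),
    }


print(coincidences(3))  # odd ratio
print(coincidences(2))  # even ratio
```

For a ratio of 3 only `h_on_H` and `v_on_V` are true; for a ratio of 2 only `h_on_H` and `h_on_V` are true, in agreement with the summary above.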
Parent and Child H Gridpoints for 3:1 Ratio
[Figure: one row of parent H points with nest h points at a 3:1 ratio, so that every third nest point coincides with a parent point (Hh h h Hh ...).]
I_PARENT_SW=3: SW corner of nest, where the nest index is I=IDS=1
ITS_PARENT=1: 1st point on parent task; ITS_PARENT_ON_CHILD=-5
ITE_PARENT=5: last point on parent task; ITE_PARENT_ON_CHILD=9
The figure also marks a gap.
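The index values in this figure are consistent with a simple linear mapping, sketched below in Python. The formulas are inferred from the numbers on the slide rather than taken from the NMM-B source, and the extension of the upper limit by `ratio - 1` points (to cover the marked gap) is an assumption.

```python
def parent_to_child_index(i_parent, i_parent_sw, ratio):
    """Nest I index of a parent point, given that the nest's I=1 sits on I_PARENT_SW."""
    return (i_parent - i_parent_sw) * ratio + 1


def parent_task_limits_on_child(its_parent, ite_parent, i_parent_sw, ratio):
    """Parent task limits expressed in nest indices (may lie outside the nest)."""
    its_on_child = parent_to_child_index(its_parent, i_parent_sw, ratio)
    # Assumption: the upper limit also covers the nest points in the gap
    # that follows the last parent point on the task.
    ite_on_child = parent_to_child_index(ite_parent, i_parent_sw, ratio) + (ratio - 1)
    return its_on_child, ite_on_child


# Values from the slide: ITS_PARENT=1, ITE_PARENT=5, I_PARENT_SW=3, ratio 3:1.
print(parent_task_limits_on_child(1, 5, 3, 3))  # (-5, 9)
```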
Parent and Nest V Gridpoints for 3:1 Ratio
[Figure: parent H and V rows with nest h and v points at a 3:1 ratio. Parent point locations: H, V. Nest point locations: h, v.]
The same parent indices are marked on both the H row and the V row: ITS_PARENT=1, I_PARENT_SW=3, ITE_PARENT=5.
On the nest: I=IDS=1, ITS_PARENT_ON_CHILD=-5, ITE_PARENT_ON_CHILD=9.
The figure also marks a gap.
NMM-B Moving Nests
● 1-way or 2-way interactive.
● Forecast can contain multiple nests.
● Telescoping domains.
Three Types of Data Motion Needed to Satisfy a Nest’s Shift
[Figure: nest domain before shift and nest domain after shift. Labeled regions: parent updates, intra-task update, inter-task update, and the part that occupies the pre-move 'footprint'.]
Shift onto a Corner
[Figure: nest domain before shift and nest domain after shift onto a corner. Labeled regions: parent updates, and the intra-task update that occupies the pre-move 'footprint'.]
Simplest Parent Update Over the SW Corner
[Figure: a nest task subdomain lying over the SW corner of the pre-move footprint.]
Here one parent task updates the entire parent update region of this nest task subdomain.
Four Parent Tasks Update Over the SW Corner
[Figure: a nest task subdomain lying over the SW corner of the pre-move footprint, with its parent update region divided among four parent tasks.]
1st parent task's update region
2nd parent task's 1st update region
2nd parent task's 2nd update region
3rd parent task's update region
4th parent task's update region
Child’s Bookkeeping for Relative Motion
The child tasks determine which of their points are updated by each of the three processes.
‣ Intra-task updating is the simplest (a shift in memory).
‣ Inter-task updating is more complex. Child tasks determine which of their subdomain points inside the pre-move footprint will be updated by which other child tasks and vice versa.
‣ Updates from the parent are the most complicated. Child tasks determine which of their subdomain points outside of the pre-move footprint will be updated by which parent tasks.
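The three cases can be illustrated with a small classifier. This is a sketch under stated assumptions, not the NMM-B bookkeeping: the nest shifts by a whole number of nest gridpoints, the task decomposition is fixed in index space, and `task_of` is a hypothetical function returning the task that owns a given (i, j).

```python
def classify_point(i, j, shift, domain, task_of):
    """Classify how nest point (i, j) gets its data after a shift.

    shift:   (di, dj), the nest's move in nest gridpoints (east, north)
    domain:  (ids, ide, jds, jde), the nest index limits
    task_of: function (i, j) -> owning task ID (fixed decomposition)
    """
    di, dj = shift
    # The same geographic location carried this index before the move.
    i_old, j_old = i + di, j + dj
    ids, ide, jds, jde = domain
    inside_footprint = ids <= i_old <= ide and jds <= j_old <= jde
    if not inside_footprint:
        return "parent"      # outside the pre-move footprint
    if task_of(i_old, j_old) == task_of(i, j):
        return "intra-task"  # a shift within the task's own memory
    return "inter-task"      # data comes from another nest task


# Hypothetical 10x10 nest split between two tasks at I=5, shifted 2 points east.
owner = lambda i, j: 0 if i <= 5 else 1
for i in (1, 5, 9):
    print(i, classify_point(i, 1, (2, 0), (1, 10, 1, 10), owner))
```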
Parent’s Bookkeeping for Relative Child Motion
The parent tasks perform bookkeeping to determine which nest points are updated by the parent outside of the pre-move footprint.
Because of the complexity involved, both the parent and child tasks perform this bookkeeping from their own perspectives. The two results serve as checks on each other and eliminate some additional communication.
The Parent Stores Its Bookkeeping Results
Child task subdomains and those points on them that are updated by a given parent task change with each shift of the nest. Use arrays of linked lists to deal with this continual change.
[Figure: parent array of moving nest update specifications. Element 1: moving child #1; element 2: moving child #2; element 3: moving child #3. Each element heads a linked list of the nest tasks to be updated.]
Each link holds parent task update specifications for each relevant task of a moving child following a shift.
The Child Stores Its Bookkeeping Results
There is no need for linked list arrays in storing the bookkeeping results from the child's perspective since the number of parent tasks providing update data is always between 0 and 4.
=> Allocate a derived datatype array (1:4) and store appropriately.
This assumes the geographical area of parent task subdomains is always larger than that of child task subdomains.
Surface Data
‣ Eight invariant surface fields from NPS cover the uppermost parent domain at each different resolution of all moving nests.
‣ Among these are topography, land/sea mask, soil type, vegetation type, and vegetation fraction.
‣ Each nest task with a parent update region reads the external files to update those variables rather than receiving them from the parent so as not to lose the higher resolution information.
‣ For sfc variables NOT among those eight:
  (a) Generate a search list of I,J increments from near to far.
  (b) If parent update sfc data is from a different surface type then the nest searches for its own nearest point with the same sfc type (e.g. soil T or SST).
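Steps (a) and (b) can be sketched as follows. This is an illustration only: the array layout (`field[j][i]`), the `has_data` mask marking nest points that already hold valid values, the search radius, and the function names are all assumptions.

```python
def search_offsets(radius):
    """(a) All (di, dj) increments within a square, ordered from near to far."""
    offsets = [(di, dj)
               for di in range(-radius, radius + 1)
               for dj in range(-radius, radius + 1)
               if (di, dj) != (0, 0)]
    return sorted(offsets, key=lambda d: d[0] ** 2 + d[1] ** 2)


def nearest_same_type(values, sea_mask, has_data, i, j, radius=3):
    """(b) Value at the nearest nest point with the same surface type as (i, j)."""
    for di, dj in search_offsets(radius):
        ii, jj = i + di, j + dj
        inside = 0 <= jj < len(values) and 0 <= ii < len(values[0])
        if inside and has_data[jj][ii] and sea_mask[jj][ii] == sea_mask[j][i]:
            return values[jj][ii]
    return None  # nothing suitable within the search radius


# Hypothetical 3x3 patch: the center is water (1) surrounded by land (0) on its sides.
sst = [[280.0, 281.0, 282.0], [283.0, None, 285.0], [286.0, 287.0, 288.0]]
mask = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
valid = [[True, True, True], [True, False, True], [True, True, True]]
print(nearest_same_type(sst, mask, valid, 1, 1))
```

Building the offset list once and reusing it for every point keeps the near-to-far ordering consistent across the update region.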
2-Way Exchange
As is done for motion, both the child and the parent compute which parent tasks will receive which upscale data from which child tasks. This eliminates some communication and serves as a check.
2-Way Exchange - Child
(1) Is the child at the end of a parent timestep?
(2) If so, determine which points on which parent tasks it will update.
(3) Loop through the appropriate parent tasks.
    - Loop through the specified 2-way variables.
      -- Generate upscale values using the mean of child values within the stencil region.
    - Send upscale data for all variables to the given parent task.
Generate Upscale Values – Odd Space Ratio
[Figure: odd space ratio. Two panels, one for H-pt variables and one for V-pt variables, each showing the stencil of child points centered on the coincident parent point.]
Average over these stencils.
Generate Upscale Values – Even Space Ratio
[Figure: even space ratio. Two panels, one for H-pt variables and one for V-pt variables, each showing the stencil of child points around the parent point.]
Average over these stencils.
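A stencil average of this kind can be sketched in a few lines. The exact stencil shapes are defined by the figures; here a square stencil of `half_width` child points on each side of a center point is assumed, which for `half_width=1` gives the 3x3 block of h points around a parent H point in the odd-ratio case. The array layout and the function name are hypothetical.

```python
def upscale_mean(child, ic, jc, half_width):
    """Mean of child values in a square stencil centered on child point (ic, jc).

    child:      2-D list indexed as child[j][i]
    half_width: assumed number of child points on each side of the center
    """
    points = [child[j][i]
              for j in range(jc - half_width, jc + half_width + 1)
              for i in range(ic - half_width, ic + half_width + 1)]
    return sum(points) / len(points)


# Hypothetical 5x5 child field with a linear variation in both directions.
field = [[float(i + 10 * j) for i in range(5)] for j in range(5)]
print(upscale_mean(field, 2, 2, 1))
```

For a linear field the stencil mean equals the center value, which makes a convenient sanity check.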
2-Way Exchange - Parent
(1) Determine which of its points are updated by which child tasks. Save each child task's specs as a link in a linked list (since we do not know ahead of time how many child tasks will send data after each shift of moving nests).
(2) Loop through the appropriate child tasks.
    - Recv data for all specified 2-way variables.
    - If the parent's sfc elevation differs from the child's then adjust the data using a spline interpolation.
    - Update the parent values applying the user-specified child weight from the configure file.
    - Incorporate data if the current timestep does not immediately follow a restart output time (for bit identical restarts).
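The last two steps might look like the sketch below. The slides do not give the weighting formula, so the linear blend is an assumption, as are the function and argument names; the elevation adjustment by spline interpolation is not shown.

```python
def update_parent_value(parent_val, upscale_val, child_weight, follows_restart_output=False):
    """Blend one upscaled child value into the parent.

    child_weight:           user-specified weight in [0, 1] (assumed linear blend)
    follows_restart_output: True if this timestep immediately follows a
                            restart output time, in which case the data is skipped
    """
    if follows_restart_output:
        return parent_val  # skip the update to keep restarts bit identical
    return (1.0 - child_weight) * parent_val + child_weight * upscale_val


print(update_parent_value(280.0, 284.0, 0.5))
print(update_parent_value(280.0, 284.0, 0.5, follows_restart_output=True))
```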
Specify Update Variables for Motion and 2-Way Exchange
● Use the nests.txt file which (like solver_state.txt) lists desired variables from the Solver internal state.
KEY for moving vbls:
  H – mass pt
  V – velocity pt
  L – land sfc
  W – water sfc
  F – read external file in parent update region
  x – parent must update halo when child moves
KEY for 2-way vbls:
  H – mass pt
  V – velocity pt
Example of ‘nests.txt’ specifications
Columns: variable name, Moving flag, 2-way flag, description.

### 2-D Integer
'ISLTYP'  F   -  'Soil type'
### 2-D Real
'FIS'     F   -  'Sfc geopotential (m2 s-2)'
'CMC'     Lx  -  'Canopy moisture (m)'
'SST'     Wx  -  'Sea surface temperature (K)'
### 3-D Real
'T'       H   H  'Sensible temperature (K)'
'U'       V   V  'U component of wind (m s-1)'
'STC'     Lx  -  'Soil temperature (K)'
###
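To make the column layout concrete, here is a small Python reader for entries of this form. It is a sketch, not the NEMS reader: it assumes one entry per line with a quoted name, a Moving flag, a 2-way flag, and a quoted description, treats lines starting with '#' as comments, and accepts either straight or typographic single quotes.

```python
import re

Q = "['‘’]"  # straight or typographic single quotes
ENTRY = re.compile(rf"^{Q}([^'‘’]+){Q}\s+(\S+)\s+(\S+)\s+{Q}(.*?){Q}\s*$")


def parse_nests_txt(text):
    """Return one dict per variable entry; a '-' flag becomes an empty string."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or '###' section marker
        match = ENTRY.match(line)
        if match:
            name, moving, two_way, desc = match.groups()
            entries.append({
                "name": name,
                "moving": "" if moving == "-" else moving,
                "two_way": "" if two_way == "-" else two_way,
                "description": desc,
            })
    return entries


sample = """### 3-D Real
'T'    H   H  'Sensible temperature (K)'
'STC'  Lx  -  'Soil temperature (K)'
"""
for entry in parse_nests_txt(sample):
    print(entry)
```

Here 'T' is both moved and exchanged 2-way as a mass-point variable, while 'STC' is a land-surface variable whose halo the parent must update when the child moves, with no 2-way exchange.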
High Level Order of Execution
► Children recv BC updates from parents from the end of the current parent timestep.
► Parents recv upscale data from children from the end of the previous parent timestep.
► Domain integrates.
► Parents send BC updates to children who are at the beginning of the current parent timestep.
► Children send upscale data to parents who recv it at the beginning of the next parent timestep.
Timestepping loop in subroutine NMM_INTEGRATE
NMM_RUN
  DO Loop over generations (a single iteration for 1-way interaction)
    DO Loop over all (1-way) or some (2-way) forecast timesteps
      CALL phase 1 Parent-Child Coupler Run ( check 2-way signals )
      CALL phase 2 Parent-Child Coupler Run ( children recv BCs from parents )
      CALL phase 3 Parent-Child Coupler Run ( parents recv upscale from children )
      CALL phase 1 Domain Run ( integrate the forecast one timestep )
      CALL phase 2 Domain Run ( digital filter )
      Advance the Clock
      CALL phase 4 Parent-Child Coupler Run ( parents send BCs to children )
      CALL phase 5 Parent-Child Coupler Run ( children send upscale to parents )
      CALL phase 3 Domain Run ( write history/restart )
    ENDDO Timestep loop
  ENDDO Generations loop
Example of erratic nest motions due to weak storm(s) interacting with complex terrain.
High Priority Development Items
● Finish the user selection of nest boundary variables.
● Construct capability for self-oriented (not parent-oriented) nests.
Additional Slides
The Composite Object
● A derived datatype to hold assorted variables used throughout the Parent-Child coupler component.
● Allows tasks lying on multiple domains to easily reference such variables generically when they have different values on different domains.
Composite Object – Defined / Allocated
! Top of module before CONTAINS
TYPE COMPOSITE
  INTEGER(kind=KINT),DIMENSION(1:3) :: PARENT_SHIFT
END TYPE COMPOSITE

TYPE(COMPOSITE),DIMENSION(:),POINTER,SAVE :: CPL_COMPOSITE  ! one element per domain
INTEGER(kind=KINT),DIMENSION(:),POINTER :: PARENT_SHIFT     ! generic name used by the coupler

SUBROUTINE PARENT_CHILD_COUPLER_SETUP
  ALLOCATE(CPL_COMPOSITE(1:NUM_DOMAINS),stat=ISTAT)
END SUBROUTINE PARENT_CHILD_COUPLER_SETUP
Composite Object - Used
SUBROUTINE CHILDREN_RECV_PARENT_DATA
  CALL POINT_TO_COMPOSITE(MY_DOMAIN_ID)   ! aim the generic pointers at this domain
  CALL MPI_RECV( PARENT_SHIFT , 3 , MPI_INTEGER, …….
END SUBROUTINE CHILDREN_RECV_PARENT_DATA

SUBROUTINE POINT_TO_COMPOSITE(MY_DOMAIN_ID)
  TYPE(COMPOSITE), POINTER :: CC
  CC => CPL_COMPOSITE(MY_DOMAIN_ID)
  PARENT_SHIFT => CC%PARENT_SHIFT
END SUBROUTINE POINT_TO_COMPOSITE