ALMA Integrated Computing Team Coordination & Planning Meeting #1 Santiago, 17-19 April 2013


Page 1

ALMA Integrated Computing Team

Coordination & Planning Meeting #1, Santiago, 17-19 April 2013

ACA plan

Manabu Watanabe

National Astronomical Observatory of Japan

Page 2

ICT-CPM1 17-19 April 2013

ACA-involved failures in Q1 2013

24% of the failures (13 of 54) originate in bugs in the ACA software (from a simple JIRA ticket analysis).

Shared memory trouble and bulk data trouble account for 10 of those 13.

Origin categories: ACA, Operator, Other system, Rest

Failure cause | # Occurrence
Hardware failure | 14
Bug in software | 13
Unclear | 4
Misconfiguration in CAI connectivity | 5
Other | 4
Mistakes in observing scripts | 3
Other | 6
(cause not given) | 5
SUM | 54

Source | Most popular failure cause | # Occurrence
ACA software bug | Failure in attaching the shared memory | 6
ACA software bug | sendData() method sometimes takes very long to return | 4
ACA hardware failure | Incorrect total number of samples in histogram | 9
Operation mistake | Misconfiguration of the CAI connectivity | 5
Operation mistake | Insensitivity to the data rate limitation | 3
Operation mistake | Mistakes in observing script | 3
Other subsystem | Notification channel not responding | 2
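
The percentages quoted on this slide can be reproduced from the table counts. The sketch below is only a cross-check; it reads the trailing figures of the first table as an unlabeled count of 5 and a sum of 54, which is an interpretation of the flattened table rather than something the slide states explicitly.

```python
# Counts transcribed from the Q1 2013 failure table on this slide.
# The last entry has no failure cause in the transcript (assumed reading).
failure_counts = [
    ("Hardware failure", 14),
    ("Bug in software", 13),
    ("Unclear", 4),
    ("Misconfiguration in CAI connectivity", 5),
    ("Other", 4),
    ("Mistakes in observing scripts", 3),
    ("Other", 6),
    ("(cause not given)", 5),
]

total_failures = sum(count for _, count in failure_counts)
software_bugs = dict(failure_counts)["Bug in software"]

# Share of all failures that originate in ACA software bugs.
software_share_percent = round(100 * software_bugs / total_failures)

# The two most frequent ACA software bugs from the second table.
top_bug_counts = {"shared memory attach": 6, "sendData() slow": 4}
top_bug_total = sum(top_bug_counts.values())

print(f"{software_bugs}/{total_failures} = {software_share_percent}%")
print(f"shared memory + bulk data: {top_bug_total}/{software_bugs}")
```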

Page 3

Planning items and time frame (1)

ACA planning items with rough time frames

Item | JIRA | In charge | Required time in periods within 2013 APR, Q2, Q3, Q4 and 2014 Q1, Q2 (*1) (*2) | Total
Failure in attaching the shared memory | COMP-8930 | ACACORR | 0.5 0.5 | 1
sendData() method sometimes takes very long to return | COMP-9213 | ACACORR+ | 1 1 | 2
XP delay should be effective for the single-dish observation as well | COMP-7640 | ACACORR | 0.5 | 0.5
1.907 kHz shift in the center frequency of channels | CSV-2227 | DataCapture | ?? | 
Phase of the Walsh function should be changed every subscan | - | ACACORR | 0.5 | 0.5
Relax a health check of 3bit histogram | - | ACACORR | 0.5 | 0.5
Increasing time interval for getting 3bit histogram | - | ACACORR | 0.5 | 0.5
Remove the check of FFT overflow flag in CDP nodes | - | ACACORR | 0.25 | 0.25
Suppress warning for the FFT overflow and the delta sigma overflow | - | ACACORR | 0.25 | 0.25
Parallelize the monitor commands for all quadrants | - | ACACORR | 0.5 | 0.5
New ACACorrGUI | - | ACACORR | 1 2 2 1 | 6
Alarm based on the analysis on container log files | - | ACACORR | 1 2 1 0.5 0.5 | 5
ACA specific delay read from TMCDB | - | ACACORR | 0.5 | 0.5
Window function read from TMCDB | - | ACACORR | 0.5 | 0.5

Page 4

Planning items and time frame (2)

ACA planning items with rough time frames (continued)

Item | JIRA | In charge | Required time in periods within 2013 APR, Q2, Q3, Q4 and 2014 Q1, Q2 (*1) (*2) | Total
Finite dead time in the bin switching | - | ACACORR | 1 | 1
Increasing the number of bins (3 or more) | - | ACACORR | 1 | 1
WVR coefficients | - | ACACORR | 0.25 0.25 0.5 1 | 2
ACACORR porting to 64bit OS | COMP-8045 | ACACORR | 1 0.5 | 1.5
BDNT configuration read from TMCDB | COMP-8339 | ACACORR | 0.5 | 0.5
TCP connection in BDNT | - | ACACORR | 0.5 | 0.5
Increasing efficiency (1) in Tsys measurement | CSV-2704 | ACACORR/CCL | 0.5 | 0.5+??
Increasing efficiency (2) | - | ACACORR | 0.5 0.5 0.5 | 1.5
Special data rate calculation in AUTO_ONLY mode | CSV-2692 | ACACORR | 0.5 | 0.5
Reduce unnecessary warning messages | - | ACACORR | 0.5 | 0.5
Updating 3bit linearity correction every integration | - | ACACORR | 1 | 1
Updating delta requantization correction every integration | - | ACACORR | 1 | 1
Automatic self-test of ACA correlator when ACACORR gets started | - | ACACORR | 1 | 1
Digitizer quantization correction | - | ACACORR | 1 1 | 2
Subarraying in an SB | - | Software | ?? | 
3LO in interferometry | CSV-1249 | ACA correlator | ?? | 
Phase-up mode | - | ACA correlator | ?? | 
Test | - | - | 0.5 0.5 0.5 0.5 0.5 0.5 | 3
Total time | - | - | 2 6.75 8.25 7.5 7 4 | 35.5

(*1) Both the time frame and the required time are my optimistic guesstimates and are subject to change.
(*2) Required time is a period in units of months, which is different from effort in units of man-months.
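
The bottom rows of the planning table can be cross-checked against the per-item totals. The sketch below assumes the trailing figures are a Test row (0.5 per period, total 3) and a Total time row (2, 6.75, 8.25, 7.5, 7, 4, total 35.5); that reading, and treating "??" entries as zero, are interpretations of the flattened table.

```python
# "Total time" row, one value per period (assumed reading of the transcript).
total_time_by_period = {
    "2013 APR": 2, "2013 Q2": 6.75, "2013 Q3": 8.25,
    "2013 Q4": 7.5, "2014 Q1": 7, "2014 Q2": 4,
}

# Per-item totals in months, in table order; "??" entries are left out
# and "0.5+??" is counted as 0.5.
item_totals_part1 = [1, 2, 0.5, 0.5, 0.5, 0.5, 0.25, 0.25, 0.5, 6, 5, 0.5, 0.5]
item_totals_part2 = [1, 1, 2, 1.5, 0.5, 0.5, 0.5, 1.5, 0.5, 0.5, 1, 1, 1, 2]
test_total = 6 * 0.5  # 0.5 month of testing in each of the six periods

grand_total_from_periods = sum(total_time_by_period.values())
grand_total_from_items = sum(item_totals_part1) + sum(item_totals_part2) + test_total

print(grand_total_from_periods, grand_total_from_items)
```

Both sums agree, which suggests that the per-item totals and the period totals were transcribed consistently.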

Page 5

Bug fix

Failure in attaching the shared memory
If this problem occurs, the observation fails, and the container involved must then be shut down and restarted.

sendData() method sometimes takes very long to return
If this problem occurs, the observation fails because the CDP master fails to send data, and the observation must then be run again. The root cause is still unclear; it could be the network or the software (ACACORR, BDS, bulk data receivers),…

XP delay should be effective for the single-dish observation as well
The cross-polarization delay does not work for single-dish cross-polarization observations. It does work properly for interferometry.

1.907 kHz shift in the center frequency of channels
The center frequencies of the channels are always shifted by about 1.907 kHz between the frequency label and the actual spectra.
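
As a purely arithmetic aside, 1.907 kHz is what results from dividing 2 GHz by 2^20, that is, half of a 3.815 kHz step. The sketch below only checks that arithmetic. The 2 GHz bandwidth and the power-of-two division are assumptions, not figures from this presentation, and nothing here establishes the cause of the shift.

```python
# Assumed figures (not from the slides): a 2 GHz band split into 2**19 channels.
BANDWIDTH_HZ = 2_000_000_000
N_CHANNELS = 2 ** 19

channel_width_hz = BANDWIDTH_HZ / N_CHANNELS   # about 3.815 kHz
half_channel_hz = channel_width_hz / 2         # about 1.907 kHz

print(f"channel width: {channel_width_hz / 1e3:.3f} kHz")
print(f"half channel:  {half_channel_hz / 1e3:.3f} kHz")
```

If the shift does turn out to be half a channel, a label that refers to the channel edge rather than the channel center would be one thing to check.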

Phase of the Walsh function should be changed every subscan
ACACORR has assumed that the 90-degree phase switching starts at the beginning of each subscan. In fact, the LO starts the switching from 1970-01-01T00:00:00, so the phase assumed by ACACORR and the phase actually applied by the LO can differ. ACACORR plans to change the starting phase of the 90-degree phase switching for each subscan.
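
A minimal sketch of how a starting phase could be derived when the switching sequence is counted from 1970-01-01T00:00:00. The step duration and the sequence length are hypothetical placeholders, the function name is invented for illustration, and leap seconds and the actual array time scale are ignored.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
STEP_US = 16_000   # hypothetical duration of one switching step, in microseconds
SEQ_LEN = 128      # hypothetical number of steps in one Walsh sequence


def walsh_start_index(subscan_start, step_us=STEP_US, seq_len=SEQ_LEN):
    """Index into the Walsh sequence at subscan_start, counting steps from the epoch."""
    # Integer microseconds since the epoch; avoids floating-point rounding.
    elapsed_us = (subscan_start - EPOCH) // timedelta(microseconds=1)
    return (elapsed_us // step_us) % seq_len


start = datetime(2013, 4, 17, 12, 0, 0, tzinfo=timezone.utc)
print(walsh_start_index(start))
```

The point of the sketch is that the index depends on the absolute start time of the subscan, so it generally differs from zero and from one subscan to the next.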

Page 6

Adjustment to ACA correlator (1)

Relax a health check of 3bit histogram
The observation fails when the total number of samples in the histogram is NOT equal to 3886632960 at the correlator calibration. 3886632960 samples correspond to 960 ms, which is the sampling period of the histogram. This health check has been failing frequently in recent observations, and Fujitsu assures us that the histogram is sound even when its total number of samples differs from 3886632960. We plan to relax the health check of the histogram.
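
One way to relax such a check is to accept totals within a tolerance of the expected value instead of requiring exact equality. This is a sketch only: the function name and the tolerance value are hypothetical, and the slides do not say how the check will actually be relaxed.

```python
EXPECTED_TOTAL = 3_886_632_960  # samples per 960 ms histogram period (from the slide)


def histogram_total_ok(bins, rel_tol=0.0):
    """Return True if the histogram total is within rel_tol of the expected total.

    rel_tol=0.0 reproduces the strict equality check; a small positive
    value (hypothetical) gives a relaxed check.
    """
    total = sum(bins)
    return abs(total - EXPECTED_TOTAL) <= rel_tol * EXPECTED_TOTAL


# A 3-bit sampler has 8 levels; spread the expected total evenly for the demo.
nominal = [EXPECTED_TOTAL // 8] * 8
slightly_short = nominal[:-1] + [nominal[-1] - 1000]

print(histogram_total_ok(nominal))                        # strict check on a nominal histogram
print(histogram_total_ok(slightly_short))                 # strict check on a short histogram
print(histogram_total_ok(slightly_short, rel_tol=1e-3))   # relaxed check on the same histogram
```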

Increasing time interval for getting 3bit histogram
The ACA correlator sometimes fails in inter-module communication, and such a failure may cause the observation to fail. We suspect that frequent retrieval of the 3bit histogram disturbs the inter-module communication of the ACA correlator. This change should be available in April or May, after further investigation of the problem.

Remove the check of FFT overflow flag in CDP nodes
CDP nodes print FFT overflow messages when they detect the FFT overflow flag in the data header received from the ACA correlator. However, the ACA correlator has since been changed: the FFT overflow flag is still there, but it is no longer trustworthy. It would be good to remove the untrustworthy FFT overflow messages from the container log of the CDP nodes.

Page 7

Adjustment to ACA correlator (2)

Suppress warning for the FFT overflow and the delta sigma overflow
The CCC prints messages about FFT overflows and delta sigma overflows when the ACA correlator detects them. This is useful information. The problem is that FFT overflows and delta sigma overflows happen continuously during the interval between observations. During that interval, the input signals from the antennas may NOT be reliable (e.g., missing frames, broken frames, zero signal levels), so these overflow messages are useless and very annoying. It would be good to suppress these overflow messages during the interval between observations.
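
The idea of keeping overflow messages during observations and dropping them in between can be sketched with a log filter. This uses Python's standard logging module purely for illustration; the class name, the `observing` flag, and the message wording are assumptions and do not describe how CCC logging actually works.

```python
import logging


class OverflowBetweenObservationsFilter(logging.Filter):
    """Drop overflow messages unless an observation is in progress."""

    def __init__(self):
        super().__init__()
        self.observing = False  # hypothetical flag, set by the observation control code

    def filter(self, record):
        message = record.getMessage().lower()
        is_overflow = "fft overflow" in message or "delta sigma overflow" in message
        # Keep everything during an observation; otherwise keep only non-overflow messages.
        return self.observing or not is_overflow


logger = logging.getLogger("ccc.sketch")
overflow_filter = OverflowBetweenObservationsFilter()
logger.addFilter(overflow_filter)

logger.warning("FFT overflow detected")   # suppressed: no observation running
overflow_filter.observing = True
logger.warning("FFT overflow detected")   # passed on to the handlers
```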

Parallelize the monitor commands for all quadrants
The CCC monitors the status (temperature, fan speed, voltage) of the ACA correlator. The monitoring will be parallelized across the 4 quadrants. The "get hardware failure" command will be parallelized as well.
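
A minimal sketch of issuing the same monitor command to all four quadrants concurrently rather than one after another. `read_quadrant_status` is a hypothetical stand-in for the real monitor command, and a thread pool is only one possible way to parallelize it.

```python
from concurrent.futures import ThreadPoolExecutor

QUADRANTS = (0, 1, 2, 3)


def read_quadrant_status(quadrant):
    """Hypothetical stand-in for one quadrant's monitor command."""
    return {"quadrant": quadrant, "temperature": None, "fan_speed": None, "voltage": None}


def monitor_all_quadrants(read=read_quadrant_status):
    """Query all quadrants concurrently; results come back in quadrant order."""
    with ThreadPoolExecutor(max_workers=len(QUADRANTS)) as pool:
        return list(pool.map(read, QUADRANTS))


statuses = monitor_all_quadrants()
print([status["quadrant"] for status in statuses])
```

With blocking I/O per quadrant, the total monitoring time then approaches that of the slowest quadrant instead of the sum over all four.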

Page 8

New features (1)

New ACACorrGUI
We have several requests to improve ACACorrGUI. Some of them motivate a totally new ACACorrGUI. The new ACACorrGUI will receive the BDF transmitted from the CDP master and display the spectra for all baselines at one time.

Alarm based on the analysis on container log files
Failures recur at a certain rate in observations with the ACA correlator, and it takes a long time to identify the root cause each time. We plan to implement a simple log inspection program that raises alarms by identifying some of the familiar failures.
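
Such a log inspection program could start as a table of known failure signatures matched against each log line. The sketch below is illustrative only: the regular expressions and the sample log lines are invented, and real container log messages may be worded differently.

```python
import re

# Hypothetical signatures for failures named elsewhere in this presentation.
KNOWN_FAILURES = {
    "shared-memory-attach": re.compile(r"attach.*shared memory", re.IGNORECASE),
    "senddata-slow": re.compile(r"sendData.*(timeout|took too long)", re.IGNORECASE),
    "histogram-total": re.compile(r"histogram.*total number of samples", re.IGNORECASE),
}


def scan_log(lines):
    """Return (line number, failure name) for every known signature found."""
    alarms = []
    for lineno, line in enumerate(lines, start=1):
        for name, pattern in KNOWN_FAILURES.items():
            if pattern.search(line):
                alarms.append((lineno, name))
    return alarms


sample_log = [
    "INFO  subscan started",
    "ERROR Failed to attach the shared memory segment",
    "WARN  sendData() took too long to return",
]
for lineno, name in scan_log(sample_log):
    print(f"ALARM line {lineno}: {name}")
```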

ACA specific delay read from TMCDB
The ACA correlator needs its own specific delay compensation. Takeshi Kamazaki requests that the specific delay be stored in the TMCDB so that it can be changed when necessary.

Window function read from TMCDB
The ACA correlator applies a window function as a weighted running mean. Takeshi requests that the weight function be stored in the TMCDB so that it can be changed when necessary.
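
A weighted running mean over spectral channels can be sketched as follows. The weights (0.25, 0.5, 0.25) are a common three-point Hanning smoothing used here only as an example; the actual ACA weights, and the edge handling chosen below, are assumptions.

```python
def weighted_running_mean(spectrum, weights):
    """Smooth a spectrum with a centered weighted running mean.

    At the band edges only the weights that overlap the spectrum are used,
    and the result is renormalized by their sum (an assumed convention).
    """
    if len(weights) % 2 == 0:
        raise ValueError("weights must have odd length so the window is centered")
    half = len(weights) // 2
    smoothed = []
    for i in range(len(spectrum)):
        acc = 0.0
        weight_sum = 0.0
        for k, w in enumerate(weights):
            j = i + k - half
            if 0 <= j < len(spectrum):
                acc += w * spectrum[j]
                weight_sum += w
        smoothed.append(acc / weight_sum)
    return smoothed


HANNING_3 = (0.25, 0.5, 0.25)  # example weights, not taken from the slides
print(weighted_running_mean([0, 0, 4, 0, 0], HANNING_3))
```

Reading the weights from a database, as requested above, would amount to replacing the hard-coded tuple with a value loaded at start-up.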

Page 9

New features (2)

Finite dead time in the bin switching
ACACORR supports bin switching. It assumes that the dwell time is given in advance and that the dead time is zero. These assumptions are justified for frequency switching but not for nutator switching.
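
To illustrate why a finite dead time matters, the sketch below computes the fraction of a switching cycle that yields usable data. It assumes that each bin consists of a dwell period followed by a dead period of fixed length; this timing model, the function name, and the example numbers are hypothetical.

```python
def bin_switching_efficiency(dwell_ms, dead_ms, n_bins=2):
    """Fraction of one switching cycle spent dwelling on a bin (usable data)."""
    if dwell_ms <= 0 or dead_ms < 0:
        raise ValueError("dwell_ms must be positive and dead_ms non-negative")
    cycle_ms = n_bins * (dwell_ms + dead_ms)
    return n_bins * dwell_ms / cycle_ms


print(bin_switching_efficiency(100, 0))    # zero dead time, as currently assumed
print(bin_switching_efficiency(100, 25))   # finite dead time, e.g. a moving nutator
```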

Increasing the number of bins (3 or more)
ACACORR supports bin switching for the 2-bin use case. The bins of ACACORR should be extended if 3 or more bins are needed.

WVR coefficients
ACACORR takes the validity period of the WVR coefficients into account, but CORR does not. CORR can hold multiple WVR coefficients for each spectral window, whereas ACACORR can hold only one set of WVR coefficients for the receiver band at a time. ACACORR should (or should not?) follow CORR.

ACACORR porting to 64bit OS
ACACORR should be ported to 64bit RH6.4 or so.

Page 10

New features (3)

BDNT configuration read from TMCDB
Bogdan requests that CORR and ACACORR read the BDNT configuration from the TMCDB.

TCP connection in BDNT
Bogdan requests that CORR and ACACORR use TCP instead of UDP for the data transmission from the CDP nodes to the CDP master.
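
One practical difference when moving from UDP to TCP is that TCP delivers a byte stream without message boundaries, so the sender and receiver need their own framing. The sketch below shows a simple length-prefixed framing; it is not the BDNT protocol, the function names are invented, and `socket.socketpair()` merely stands in for a real TCP connection.

```python
import socket
import struct

HEADER = struct.Struct("!I")  # 4-byte big-endian payload length


def send_block(sock, payload):
    """Send one data block prefixed with its length."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes; a stream socket may return fewer per recv() call."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("peer closed the connection mid-block")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_block(sock):
    """Receive one length-prefixed data block."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)


node, master = socket.socketpair()  # stand-in for a CDP node to CDP master link
send_block(node, b"spectra for subscan 1")
send_block(node, b"spectra for subscan 2")
print(recv_block(master), recv_block(master))
node.close()
master.close()
```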

Page 11

Request for improvement (1)

Increasing efficiency (1) in Tsys measurement
Stuart requests that ACA reduce the Tsys measurement time from the 2 minutes it currently takes to 30 seconds. We think we can reduce it to about 1 minute by taking advantage of the subscan sequence with “delta requantization correction”.

Increasing efficiency (2)
Takeshi requests a reduction of the overhead (lead time and processing time), which is about 20 seconds for the correlator calibration and about 15 seconds for the real observation. The slow response of the ACA correlator accounts for the major part of the lead time, so there is only a limited amount of time that software can save.

Special data rate calculation in AUTO_ONLY mode
Takeshi requests that, in AUTO_ONLY mode, the data rate be calculated as for the TP array when the array has 4 or fewer antennas, regardless of their CAIs.
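
The requested rule can be written as a small predicate. This is a sketch of the rule as stated above; the function name and parameters are hypothetical, and the CAI list is accepted only to make explicit that it does not influence the decision.

```python
def use_tp_array_data_rate(correlation_mode, n_antennas, cais=None):
    """Decide whether the data rate should be calculated as for the TP array.

    cais is deliberately ignored: the request says the rule applies
    regardless of the CAIs of the antennas in the array.
    """
    return correlation_mode == "AUTO_ONLY" and n_antennas <= 4


print(use_tp_array_data_rate("AUTO_ONLY", 4, cais=[0, 1, 5, 9]))
print(use_tp_array_data_rate("AUTO_ONLY", 5))
```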

Reduce unnecessary warning messages
It would be good to reduce the annoying log messages where practical.

Page 12

Request for improvement (2)

Updating 3bit linearity correction every integration
Takeshi requests an enhancement of the 3bit linearity correction.

Updating delta requantization correction every integration
Takeshi requests an enhancement of the delta requantization correction.

Automatic self-test of ACA correlator when ACACORR gets started
Takeshi requests that ACACORR run a self-test of the ACA correlator (mci_st) automatically whenever ACACORR starts. This will help the operator.

Page 13

New features unspecified yet

Digitizer quantization correction
Takeshi should provide the algorithm of the digitizer quantization correction for the ACA correlator. ACACORR will then implement it.

Subarraying in an SB
ACA phase calibration may require subarraying during the execution of an SB. Science should clarify the calibration plan first; Computing should then discuss its implementation in detail. Scheduling, CONTROL, DataCapture, ASDM, OT, and ACACORR will probably need to be involved.

Page 14

No plan yet

3LO in interferometry
3LO is available for single-dish observations, but 3LO of ACA does not work as planned for interferometry. Takeshi explains the root cause of the problem in the ticket; please refer to the ticket for details. Note that 2LO should work properly, and 90-degree phase switching is another alternative.

Phase-up mode
The ACA phase-up mode for VLBI has never been considered seriously. If the phase-up mode is necessary, the ACA correlator will need further development work, which naturally requires further work on ACACORR as well.