
Jenkins World

#JenkinsWorld

Jenkins and Load Sharing Facility (LSF) Enables Rapid Delivery of Device Driver Software

Brian Vandegriend

Product Verification Manager, Microsemi

Twitter: @BVandegriend

Overview

How to dynamically allocate hardware evaluation boards across software developers/testers and Jenkins in order to simplify resource sharing and increase automated regression throughput

Agenda:
•  Our Build/Test Environment
•  Old/New Methods for Allocating Boards to Users/Jenkins
•  LSF Concepts and Short Overview of the Tool
•  Integrating LSF with Jenkins
•  4 Tips for Increasing Reliability and Throughput
•  Benefits Realized by the Team

Our Build and Test Environment

•  Our group develops, tests and supports device driver software/firmware for Optical/Ethernet networking SoCs with 100 Gbps ports
   –  Device driver written in C (150 KLOC)
   –  Subversion used for revision control

•  CloudBees Jenkins Platform is used to continuously build and test the device driver
   –  1 master with 2 slave nodes (VM/bare-metal)
   –  Production releases are shipped every 2 to 3 weeks
   –  Continuous Delivery: release process is automated except for posting to the web portal (manual approval)

•  Driver testing is done on lab-based boards
   –  Over 500 automated system-level tests that have a runtime of 200+ board-hours

[Figure: test setup showing a Packet Generator/Monitor FPGA, packets, the SoC evaluation/test board, and an Intel-based COM Express module running Linux]

Old Method: Manual sharing of boards

Problems with approach:
•  If an engineer is not using their board, it’s hard for other engineers to use it.
•  If an engineer’s assigned board is used by someone else, they typically hunt around trying to find a free board.
•  Software regressions can’t take advantage of idle boards and run in parallel.

[Figure: SW Lab (Vancouver). Boards are manually assigned to engineers (4 sites, 50 engineers); regressions are statically assigned (25 jobs).]

New Method: Use IBM’s Load Sharing Facility (LSF) tool to dynamically allocate boards

[Figure: 25 Jenkins jobs and users submit tests to LSF, which dispatches them to boards in the SW Lab]
1. User submits test to queue (FIFO)
2. Tool dispatches tests to free boards
3. Board becomes free when test finishes
4. Jenkins uses a lower-priority queue to make use of idle boards

Advantages of using a queue-based solution, such as LSF: • Enhanced productivity − users do not have to find/reserve boards. The system will grant a free board to the user based upon their needs.

• Higher reliability − "problematic" boards are taken offline and the system will direct jobs to the other boards. Individual users are not impacted.

• Scalability − as users/boards are added, no re-adjustment of board assignments is necessary.

• Shorter automated testing cycles and higher efficiency of boards

(A board can only run one test at a time.)
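The queue-based flow on this slide can be illustrated with a small toy model. This is only a sketch of the idea, not LSF's scheduler: the `BoardPool` class and the job names are hypothetical, and the priorities 60 and 10 are borrowed from the short and long queues shown later in the deck. Jobs with the same priority are served first-in, first-out, and each board runs one job at a time.

```python
import heapq
from itertools import count


class BoardPool:
    """Toy model of queue-based board sharing (hypothetical, not LSF itself)."""

    def __init__(self, boards):
        self.free = list(boards)   # boards that are currently idle
        self.pending = []          # heap of (-priority, sequence, job_name)
        self.running = {}          # board -> job_name (one job per board)
        self._seq = count()        # keeps FIFO order within one priority

    def submit(self, job_name, priority):
        heapq.heappush(self.pending, (-priority, next(self._seq), job_name))
        self._dispatch()

    def finish(self, board):
        # The board becomes free when its test finishes.
        del self.running[board]
        self.free.append(board)
        self._dispatch()

    def _dispatch(self):
        # Hand the highest-priority pending job to each free board.
        while self.free and self.pending:
            _, _, job_name = heapq.heappop(self.pending)
            self.running[self.free.pop(0)] = job_name


pool = BoardPool(["board-105", "board-106"])
for name in ("jenkins-test-1", "jenkins-test-2", "jenkins-test-3"):
    pool.submit(name, priority=10)       # Jenkins uses the low-priority queue
pool.submit("user-debug", priority=60)   # an engineer asks for a board
pool.finish("board-105")                 # first Jenkins test completes
print(pool.running)
```

In this run the user's job overtakes the waiting Jenkins test as soon as a board frees up, while Jenkins still soaks up any board nobody else wants.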

Comparing 2 Solutions: Running LSF versus using Jenkins Slaves on Boards

Running LSF on Boards

•  Board resources are treated as one large resource pool

•  If a board crashes, only 1 test result is lost

•  Test balancing across jobs is not required as tests are dynamically allocated to boards by LSF

•  LSF can allocate all boards to automated tests when users are not using them

Using boards as Jenkins Slaves

•  Boards are divided into 2 groups: 1 for Jenkins slaves and 1 for users

•  If a board crashes, all test results are lost

•  Tests need to be equally partitioned across jobs to maximize throughput

•  Jenkins can’t take advantage of free boards that users are not using
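The point about test balancing can be made concrete with a rough calculation. The sketch below compares a static split of tests across boards with dispatching each test to whichever board frees up first. The test durations are hypothetical, and the dynamic version is a simple greedy rule used for illustration, not LSF's actual dispatch algorithm.

```python
import heapq


def static_makespan(durations, n_boards):
    """Tests are pre-split into equal-sized groups, one group per board."""
    groups = [durations[i::n_boards] for i in range(n_boards)]
    return max(sum(group) for group in groups)


def dynamic_makespan(durations, n_boards):
    """Each test goes to whichever board becomes idle first."""
    free_at = [0] * n_boards            # time at which each board is next idle
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available board
        heapq.heappush(free_at, start + duration)
    return max(free_at)


# Hypothetical test runtimes in minutes: two long tests among short ones.
durations = [60, 5, 5, 5, 60, 5, 5, 5]
print(static_makespan(durations, 4))    # both long tests land on one board
print(dynamic_makespan(durations, 4))
```

With these numbers the static split takes 120 minutes because both long tests fall into the same group, while dynamic dispatch finishes in 65 minutes without anyone tuning the partition.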

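[Figure: an LSF pool of 25 boards compared with boards split into groups of 10 and 15 (Jenkins slaves / LSF hosts); the Jenkins scheduler runs Job 1 through Job 10 with 40 tests each (400 tests)]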

Prerequisites for using LSF for Sharing Boards

•  Boards have a version of Linux installed (CentOS, Red Hat, Fedora, and so on) → LSF sees each board as a Linux server

•  Boards are fairly homogeneous in their configuration/hardware
   –  Many different board types will lead to a fragmented pool
      o  LSF can easily handle multiple resource types and allocate jobs based on resource requests
   –  For our project, we ensured we had chip fuse overrides that were controllable through SW

•  Requires users to close their debug sessions when finished to allow the board to be allocated to the next user
•  Timeouts are enforced by LSF to ensure boards are returned to the pool

•  Successful adoption by the team relies on individuals using the system and not circumventing it by logging directly into boards

LSF Concepts

•  Each queue can enforce user limits and run times
   –  Short queue typically has a job limit of 1, run time of 1 hour and highest priority
   –  Long queue typically allows multiple jobs per user and has the lowest priority

•  LSF uses a priority-based, fair-share algorithm to dispatch jobs to hosts
•  Each host has a number of attributes which can be requested

[Figure: an LSF cluster with hosts grouped by project (Proj-1, Proj-2)]
•  Queues: short, normal, long
•  Resource attributes:
   –  Proj-1 host: atom_cpu, Num_devices=3, Greenhills, Fedora_Linux
   –  Proj-2 host: i5_cpu, Num_devices=2, Fedora_Linux
•  LSF daemons run on each host (sbatchd/res/lim)

Submitting Jobs using LSF

Running an interactive command on a board through LSF:
> bsub -Ip -q short -R Proj1 echo "hello world!"
Job <623549> is submitted to queue <short>.
<<Starting on board-105>>
hello world!

Request specific resource constraints:
> bsub -R "Proj1 && i5_cpu" -R "num_devices>=2" xterm
> bsub -q long -m board-105 test_cmd ; # requests a specific board

To view job status:
> bjobs
JOBID   USER     STAT  QUEUE  FROM_HOST  EXEC_HOST  JOB_NAME
549269  vandegr  RUN   short  swbuild01  board-105  xterm
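A script that drives LSF usually needs to build the `bsub` command line and capture the job ID from the confirmation message shown above. The following is a minimal sketch of one way to do that, not the presenter's script. The helper names are hypothetical, and the parser assumes the confirmation line has exactly the format printed on this slide.

```python
import re
import subprocess

# Matches the confirmation line shown on the slide, e.g.
# "Job <623549> is submitted to queue <short>."
JOB_ID_RE = re.compile(r"Job <(\d+)> is submitted to queue <([^>]+)>")


def build_bsub_command(test_cmd, queue, resources=()):
    """Assemble a bsub argument list from a queue and -R requirements."""
    cmd = ["bsub", "-q", queue]
    for requirement in resources:
        cmd += ["-R", requirement]
    return cmd + list(test_cmd)


def parse_job_id(bsub_output):
    """Extract (job_id, queue) from bsub's confirmation message."""
    match = JOB_ID_RE.search(bsub_output)
    if match is None:
        raise ValueError(f"unrecognized bsub output: {bsub_output!r}")
    return int(match.group(1)), match.group(2)


def submit(test_cmd, queue, resources=()):
    """Run bsub and return the job ID (needs LSF installed on this host)."""
    result = subprocess.run(build_bsub_command(test_cmd, queue, resources),
                            capture_output=True, text=True, check=True)
    # Search both streams, since the message location is an assumption here.
    return parse_job_id(result.stdout + result.stderr)[0]


print(build_bsub_command(["test_cmd"], "long",
                         ["Proj1 && i5_cpu", "num_devices>=2"]))
print(parse_job_id("Job <623549> is submitted to queue <short>."))
```

Passing the command as a list avoids shell quoting problems with resource strings such as `Proj1 && i5_cpu`.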

Monitoring Jobs/Queues/Hosts using LSF

To view the list of queues and their status:
> bqueues
QUEUE_NAME  PRIO  STATUS       JL/U  NJOBS  PEND  RUN
short       60    Open:Active  1     1      0     1
normal      25    Open:Active  5     7      3     4
long        10    Open:Active  10    33     18    15

To view the host status:
> bhosts
HOST_NAME  STATUS   JL/U  MAX  NJOBS  RUN
swbuild01  ok       -     4    2      2
board-105  closed   -     1    1      1
board-106  unavail  -     1    0      0
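Status output like this is easy to consume from a script. The sketch below parses whitespace-separated tables into dictionaries. The sample text reproduces only the columns shown on this slide, so a real installation may print additional columns, and the helper names are hypothetical.

```python
def parse_table(text):
    """Turn whitespace-separated command output into a list of row dicts."""
    lines = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


def hosts_with_status(rows, status):
    """Names of hosts whose STATUS column equals the given value."""
    return [row["HOST_NAME"] for row in rows if row["STATUS"] == status]


# Sample output copied from the slide (abbreviated column set).
BQUEUES_SAMPLE = """
QUEUE_NAME  PRIO  STATUS       JL/U  NJOBS  PEND  RUN
short       60    Open:Active  1     1      0     1
normal      25    Open:Active  5     7      3     4
long        10    Open:Active  10    33     18    15
"""

BHOSTS_SAMPLE = """
HOST_NAME  STATUS   JL/U  MAX  NJOBS  RUN
swbuild01  ok       -     4    2      2
board-105  closed   -     1    1      1
board-106  unavail  -     1    0      0
"""

queues = parse_table(BQUEUES_SAMPLE)
hosts = parse_table(BHOSTS_SAMPLE)
total_pending = sum(int(queue["PEND"]) for queue in queues)
print(total_pending)
print(hosts_with_status(hosts, "unavail"))
```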

Boards/Jobs Administration using LSF

# Print out boards in LSF cluster:
> lshosts
HOST_NAME  model     cpuf  ncpus  maxmem  maxswp  RESOURCES
Board-73   Atom      10.0  1      1.8G    3.9G    (proj1 atom)
Board-38   Intel_i5  40.0  2      5.6G    3.9G    (proj2 i5)

# Take board offline:
> badmin hclose -C "hardware issue" board-38

# Kill/suspend/resume a job running in LSF:
> bkill/bstop/bresume <job_id>

Integrating LSF with Jenkins for Automated Testing

[Figure: a Jenkins job drives a regression script that submits tests through LSF to boards in the SW Lab]
1. Jenkins job starts the run_regression Python script, which creates the test list
2. Script dispatches tests to LSF as individual jobs with resource constraints
3. Script scans LSF status and log files to print real-time status to the console log
4. Script waits until all tests are finished
5. Script converts test log files into JUnit format, which is summarized by Jenkins
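The final step, converting results into JUnit format, might look roughly like the sketch below. It is not the presenter's script: the function name and the `(name, status, message)` tuple layout are assumptions, and only the widely used `testsuite`, `testcase`, `failure`, and `skipped` elements are emitted.

```python
import xml.etree.ElementTree as ET


def to_junit_xml(suite_name, results):
    """Build a JUnit-style XML string.

    results: iterable of (test_name, status, message) tuples, where status
    is "passed", "failed", or "skipped".
    """
    results = list(results)
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(results)),
        failures=str(sum(1 for _, status, _ in results if status == "failed")),
        skipped=str(sum(1 for _, status, _ in results if status == "skipped")),
    )
    for name, status, message in results:
        case = ET.SubElement(suite, "testcase", classname=suite_name, name=name)
        if status == "failed":
            ET.SubElement(case, "failure", message=message)
        elif status == "skipped":
            ET.SubElement(case, "skipped", message=message)
        # A passing test is an empty <testcase> element.
    return ET.tostring(suite, encoding="unicode")


# Hypothetical test names and messages, for illustration only.
report = to_junit_xml("regression", [
    ("port_init", "passed", ""),
    ("loopback_100g", "failed", "timeout waiting for packets"),
    ("link_flap", "skipped", "tagged flaky; re-runs passed"),
])
print(report)
```

Writing the returned string to a file in the job's workspace lets a Jenkins JUnit report step pick it up and summarize pass, fail, and skip counts.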

Enabling User-based Automated Testing through Jenkins

•  Regression script is also used in Jenkins for users to run regression tests against their SW/FW changes

•  Users can submit their project workspaces to Jenkins for testing

•  Helps ensure that the software trunk remains “green”

•  Parameterized build is used to pass variables to the Python script

•  LSF runs user regressions in parallel with the main (trunk) SW regressions

Making QA regressions available to all developers!

Tip: Using LSF to prevent corrupted boards from killing your automated regression

Problem: Boards in a bad state will cause tests to fail in rapid succession

Solution: Automatically take boards offline that rapidly fail tests:

•  LSF monitors the job exit rate for boards, and closes the board if the rate exceeds a configurable threshold
   –  For example, LSF takes a board offline if 5 tests exit abnormally in less than 30 sec

•  Run a board initialization script prior to running the test
   –  LSF executes a pre-execution script that puts the board into a known good state
   –  If the board initialization script returns an error code, LSF takes the board offline and finds a new board for the test

•  Re-queue tests that fail because of a board issue
   –  LSF will re-run a test if it exits with a special error code (such as 99)

This technique has increased overall reliability of the system!
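In LSF the exit-rate check is a matter of configuration rather than code, but the rule itself is simple to state. The sketch below illustrates the sliding-window logic using the slide's example of 5 abnormal exits in less than 30 seconds. The `ExitRateMonitor` class is hypothetical and is not how LSF implements the feature.

```python
from collections import defaultdict, deque


class ExitRateMonitor:
    """Flag a board once too many jobs exit abnormally in a short window."""

    def __init__(self, max_failures=5, window_sec=30.0):
        self.max_failures = max_failures
        self.window_sec = window_sec
        self.failures = defaultdict(deque)  # board -> abnormal-exit timestamps
        self.closed = set()                 # boards flagged for closing

    def record_exit(self, board, exit_code, timestamp):
        """Record a job exit; return True if the board should be closed."""
        if exit_code == 0:
            return False
        window = self.failures[board]
        window.append(timestamp)
        # Discard failures that fall outside the sliding window.
        while timestamp - window[0] >= self.window_sec:
            window.popleft()
        if len(window) >= self.max_failures:
            self.closed.add(board)
            return True
        return False


monitor = ExitRateMonitor()
# Five abnormal exits within 20 seconds on the same board.
decisions = [monitor.record_exit("board-38", 1, t) for t in (0, 5, 10, 15, 20)]
print(decisions)
print(monitor.closed)
```

Spreading the same five failures over 40 seconds would never trip the threshold, which is what separates a genuinely bad board from a run of ordinary test failures.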

Tip: Automatically Reboot/Power-cycle Bad Boards

Problem:
•  Boards taken offline by LSF need to be quickly brought back online to maximize test throughput

Solution:

•  A board monitoring script, which is run periodically by Jenkins, will reset/power-cycle offline boards and bring them back online
   –  If a soft reset fails, then an Ethernet-based remote power switch (from Digital Loggers) is used to power-cycle the board
   –  If power-cycling fails, the Jenkins job fails with an error and an e-mail notification is sent to the administrator
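The escalation described above can be sketched as a small function. This is an illustration only: `soft_reset`, `power_cycle`, and `is_healthy` are hypothetical callables standing in for site-specific tooling (such as the remote power switch), and the stubs at the bottom exist purely to show the control flow.

```python
def recover_board(board, soft_reset, power_cycle, is_healthy):
    """Try a soft reset, then a power-cycle; return the step that worked.

    Raises RuntimeError if the board is still unhealthy afterwards, so that
    a calling Jenkins job exits non-zero and its e-mail notification fires.
    """
    for step_name, action in (("soft_reset", soft_reset),
                              ("power_cycle", power_cycle)):
        action(board)
        if is_healthy(board):
            return step_name
    raise RuntimeError(f"{board}: still offline after power-cycle")


# Stub actions: this board only recovers after a power-cycle.
log = []
state = {"healthy": False}


def fake_soft_reset(board):
    log.append(("soft_reset", board))


def fake_power_cycle(board):
    log.append(("power_cycle", board))
    state["healthy"] = True


outcome = recover_board("board-38", fake_soft_reset, fake_power_cycle,
                        lambda board: state["healthy"])
print(outcome, log)
```

Passing the actions in as callables keeps the escalation order testable without touching real hardware.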

Tip: Qualify Boards Prior to using Them in Automated Regression Testing

Problem:
•  A small percentage of our boards will have manufacturing defects (primarily due to the SoC devices being removed/re-soldered)
•  Boards with hardware faults will cause a small number of tests to fail intermittently/consistently, which takes a lot of time to track down

Solution:

•  As boards are added, run a battery of working tests with a stable SW release on each board with 5 to 10 iterations

•  Weed out problematic/failing boards and use the rest for automated testing –  Use LSF groups to create a group of “golden” boards
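A qualification pass of this kind could be scripted along the following lines. This is a sketch under assumptions: `run_test` is a hypothetical callable that runs one known-good test on one board and returns True on a pass, and the board and test names in the example are made up.

```python
def qualify_boards(boards, tests, run_test, iterations=5):
    """Split boards into golden and rejected.

    A board is golden only if every known-good test passes on every
    iteration; otherwise its failing (test, iteration) pairs are recorded.
    """
    golden, rejected = [], {}
    for board in boards:
        failures = [(test, iteration)
                    for iteration in range(iterations)
                    for test in tests
                    if not run_test(board, test)]
        if failures:
            rejected[board] = failures
        else:
            golden.append(board)
    return golden, rejected


# Stub runner: "board-B" has a fault that breaks one test every time.
def fake_run_test(board, test):
    return not (board == "board-B" and test == "loopback")


golden, rejected = qualify_boards(["board-A", "board-B"],
                                  ["init", "loopback"], fake_run_test)
print(golden)
print(rejected)
```

The golden list is what would then be registered as a host group for automated regressions.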

Tip: Handling Flaky Tests

Problem:
•  Flaky tests (those that can fail or pass with the same software code) make it difficult for automated regressions to be “green” (no failures)
•  Our system suffers from flaky tests, similar to what is experienced by Google
   –  Google Testing Blog article: “We see a continual rate of about 1.5% of all test runs reporting a "flaky" result. There are many root causes why tests return flaky results, including concurrency, relying on non-deterministic or undefined behaviors, flaky third party code, infrastructure problems, etc.”

Workaround:

•  Regression script re-runs test failures 3 times and if all re-runs have passed, then the test result is changed from “failed” to “skipped”
   –  Test result is only modified if the test has been tagged as a flaky test
   –  This method also helps to quickly identify new test failures as consistent or intermittent failures.
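The workaround's decision rule is compact enough to write down. The sketch below assumes a hypothetical `run_test` callable that returns True on a pass and a set of test names tagged as flaky. It follows the rule on this slide: a failure is downgraded to "skipped" only when all three re-runs pass and the test carries the flaky tag.

```python
def resolve_failure(test_name, run_test, flaky_tests, reruns=3):
    """Classify a test that has just failed once.

    Returns (status, rerun_results). The status becomes "skipped" only if
    every re-run passes AND the test is tagged flaky; otherwise "failed".
    """
    rerun_results = [run_test(test_name) for _ in range(reruns)]
    if all(rerun_results) and test_name in flaky_tests:
        return "skipped", rerun_results
    return "failed", rerun_results


def failure_kind(rerun_results):
    """Label a failure as consistent (never passes) or intermittent."""
    return "consistent" if not any(rerun_results) else "intermittent"


# Hypothetical test names; the stub runner passes everything but "broken".
flaky_tests = {"link_flap"}


def fake_run_test(name):
    return name != "broken"


print(resolve_failure("link_flap", fake_run_test, flaky_tests))
print(resolve_failure("new_test", fake_run_test, flaky_tests))
status, reruns = resolve_failure("broken", fake_run_test, flaky_tests)
print(status, failure_kind(reruns))
```

An untagged test that passes on re-run stays "failed", which keeps new intermittent failures visible until someone decides they deserve the flaky tag.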

Tip: Handling Flaky Tests (continued)

•  Flaky tests in Jenkins can be represented by using the “Skip” column
   –  It doesn’t seem possible to have Jenkins recognize new values for the TEST_RESULT property: <property name="TEST_RESULT" value="SKIP"/>

Benefits realized by adopting LSF

•  Our solution using LSF and Jenkins for automated regression testing has:
   –  Eliminated manual sharing of boards, which was tedious and inefficient
   –  Simplified the user’s experience of finding a free board
   –  Increased reliability of the system through automatic power-cycling of corrupted boards and running a board initialization script
   –  Improved quality of code commits as users run automated regression tests on their changes through Jenkins/LSF before committing
   –  Shortened release cycles as automated regressions complete faster by gaining access to more boards


©2016 Microsemi Corporation. All rights reserved. Microsemi and the Microsemi logo are registered trademarks of Microsemi Corporation. All other trademarks and service marks are the property of their respective owners.  


Questions?