Un monde où 1 ms vaut 100 M€ - Devoxx France 2015

Transcript of Un monde où 1 ms vaut 100 M€ - Devoxx France 2015

  • @Alex_Victoor @ThierryAbalea #sginsideit

    A world where 1 ms is worth 100 million euros

  • Speakers

    Alexandre Victoor @Alex_Victoor

    Thierry Abaléa @ThierryAbalea

  • 5K status updates / sec

    6K tweets / sec

    1.6M emails / sec

    40K searches / sec

    740K messages / sec

    1.1M US options trades & quotes / sec

    Big Data != Web

  • Faster than light

  • Latency at every level

    Application

    JVM, OS

    Network

    Disk

    CPU, memory

  • A quiz to warm up

  • int SIZE = 1000000;
    int NB_ARRAY = 50;
    long[][] longs = new long[NB_ARRAY][SIZE];

    long result = 0;

    for (int j=0; j

  • Memory Layout

    The first program is the fastest!

    47 ms

    796 ms

    Measure!
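    The two traversal orders behind this quiz can be sketched as follows. This is a reconstruction under assumptions: the loop bodies are truncated in the transcript, so the method names and bodies below are illustrative, and the arrays are smaller than on the slide to keep the run short. The printed timings depend on the hardware and on JIT warm-up; a JMH benchmark is the right tool for real numbers.

```java
final class MemoryLayoutSketch {
    static final int SIZE = 100_000; // assumption: smaller than the slide's 1,000,000
    static final int NB_ARRAY = 50;

    // Walks each inner array from start to end: sequential, cache-friendly access.
    static long sumRowByRow(long[][] longs) {
        long result = 0;
        for (int i = 0; i < NB_ARRAY; i++) {
            for (int j = 0; j < SIZE; j++) {
                result += longs[i][j];
            }
        }
        return result;
    }

    // Switches to a different inner array on every access: poor locality.
    static long sumColumnByColumn(long[][] longs) {
        long result = 0;
        for (int j = 0; j < SIZE; j++) {
            for (int i = 0; i < NB_ARRAY; i++) {
                result += longs[i][j];
            }
        }
        return result;
    }

    // Builds the test data; the values themselves are arbitrary.
    static long[][] filled() {
        long[][] longs = new long[NB_ARRAY][SIZE];
        for (int i = 0; i < NB_ARRAY; i++) {
            for (int j = 0; j < SIZE; j++) {
                longs[i][j] = i + j;
            }
        }
        return longs;
    }

    public static void main(String[] args) {
        long[][] longs = filled();
        long t0 = System.nanoTime();
        long a = sumRowByRow(longs);
        long t1 = System.nanoTime();
        long b = sumColumnByColumn(longs);
        long t2 = System.nanoTime();
        System.out.println("row by row:       " + (t1 - t0) / 1_000_000 + " ms, sum=" + a);
        System.out.println("column by column: " + (t2 - t1) / 1_000_000 + " ms, sum=" + b);
    }
}
```

    Both methods compute the same sum; only the order in which memory is touched differs.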

  • Processor diagram: two cores (Core 1, Core 2), each with its own registers, execution units, L1 cache and L2 cache, sharing one L3 cache

    Registers: < 1 ns

    L1 cache: ~ 1 ns

    L2 cache: ~ 3 ns

    L3 cache: ~ 12 ns

  • int SIZE = 1000000;
    int NB_ARRAY = 50;
    long[][] longs = new long[NB_ARRAY][SIZE];

    long result = 0;

    for (int i=0; i

  • int SIZE = 1000000;
    int NB_ARRAY = 50;
    long[][] longs = new long[NB_ARRAY][SIZE];

    long result = 0;

    for (int j = 0; j

  • Measuring: micro-benchmarks

  • OpenJDK JMH

    $ mvn archetype:generate \
        -DinteractiveMode=false \
        -DarchetypeGroupId=org.openjdk.jmh \
        -DarchetypeArtifactId=jmh-java-benchmark-archetype \
        -DgroupId=org.sample \
        -DartifactId=devoxx-bench \
        -Dversion=1.0

  • Setting the scene

  • Selling financial products (before)

  • RFQ: Request For Quote

  • API - live

    Client

    Bank

    Web

    Markets

    Sales + Trading

    Risk (controls)

    Booking

  • Java?

  • Simple to get started?

  • Random rand = new Random();
    IntStream stream = rand.ints(10 * 1024 * 1024, 0, 2);

    int sum = stream.sum();

    int sum = stream.parallel().sum();

    113 ms

  • Random rand = new Random();
    IntStream stream = rand.ints(10 * 1024 * 1024, 0, 2);

    int sum = stream.sum();

    int sum = stream.parallel().sum(); // 1167 ms

    157 ms
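    A minimal sketch of the comparison, assuming the two `sum()` calls on the slide are alternatives (a stream can only be consumed once, so each variant below builds its own). The class and method names are illustrative. One plausible explanation for a slow parallel run is that all worker threads share a single `java.util.Random`, whose documentation warns about contention under concurrent use; treat that as a hypothesis to measure, not a conclusion.

```java
import java.util.Random;
import java.util.stream.IntStream;

final class StreamSumSketch {
    static final int COUNT = 10 * 1024 * 1024;

    // Sequential sum of COUNT random values in {0, 1}.
    static int sequentialSum() {
        IntStream stream = new Random().ints(COUNT, 0, 2);
        return stream.sum();
    }

    // Same computation, but the pipeline is switched to parallel execution.
    static int parallelSum() {
        IntStream stream = new Random().ints(COUNT, 0, 2);
        return stream.parallel().sum();
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        int sequential = sequentialSum();
        long t1 = System.nanoTime();
        int parallel = parallelSum();
        long t2 = System.nanoTime();
        System.out.println("sequential: " + (t1 - t0) / 1_000_000 + " ms, sum=" + sequential);
        System.out.println("parallel:   " + (t2 - t1) / 1_000_000 + " ms, sum=" + parallel);
    }
}
```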

  • First architecture

    Event loop

    I/O, I/O

  • Queue

  • Lock

  • Our peak hours: market open & close

  • Cost of the lock: system calls

  • Cost of the lock: context switches

  • Processor diagram: two cores (Core 1, Core 2), each with its own registers, execution units, L1 cache and L2 cache, sharing one L3 cache

    Registers: < 1 ns

    L1 cache: ~ 1 ns

    L2 cache: ~ 3 ns

    L3 cache: ~ 12 ns

  • Non-blocking algorithms: at least one thread makes progress

  • Non-blocking algorithms: no critical sections, locks, mutexes, spin-locks, ...

  • j.u.c.ConcurrentLinkedQueue
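    A minimal usage sketch of `java.util.concurrent.ConcurrentLinkedQueue` with three producers and one consumer. The class name, the `transfer` helper and the busy-polling consumer are illustrative assumptions, not code from the talk.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

final class NonBlockingQueueSketch {
    static final int PRODUCERS = 3;

    // Each producer offers 1..perProducer; the calling thread consumes everything
    // and returns the sum of the received values.
    static long transfer(int perProducer) {
        ConcurrentLinkedQueue<Integer> queue = new ConcurrentLinkedQueue<>();
        Thread[] producers = new Thread[PRODUCERS];
        for (int p = 0; p < PRODUCERS; p++) {
            producers[p] = new Thread(() -> {
                for (int i = 1; i <= perProducer; i++) {
                    queue.offer(i); // never blocks: the queue is unbounded and lock-free
                }
            });
            producers[p].start();
        }
        long sum = 0;
        int received = 0;
        while (received < PRODUCERS * perProducer) {
            Integer value = queue.poll(); // returns null when the queue is empty
            if (value != null) {
                sum += value;
                received++;
            }
        }
        try {
            for (Thread producer : producers) {
                producer.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println("sum = " + transfer(1000));
    }
}
```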

  • MPSC queue (multiple producers, single consumer)

    Producers P1, P2, P3 -> one shared queue -> consumer C

  • SPSC queues (single producer, single consumer)

    Producers P1, P2, P3 -> one queue per producer -> consumer C

    Single Writer Principle

  • Second architecture

    I/O, I/O

  • Lamport Queue

    Concurrent Reading and Writing, Leslie Lamport, 1977

    buffer: E E null null null null null null null E

    producerIndex = 42, offset = 2

    consumerIndex = 39, offset = 9

  • import java.util.AbstractQueue;

    public final class LamportQueue1<E> extends AbstractQueue<E> {
        private final E[] buffer;
        private volatile long producerIndex = 0;
        private volatile long consumerIndex = 0;

        @SuppressWarnings("unchecked")
        public LamportQueue1(int capacity) {
            buffer = (E[]) new Object[capacity];
        }

        @Override
        public int size() {
            return (int) (producerIndex - consumerIndex);
        }

  • @Override
    public boolean offer(final E e) {
        if (size() == buffer.length) {
            return false;
        }
        final int offset = (int) (producerIndex % buffer.length);
        buffer[offset] = e;
        producerIndex++;
        return true;
    }

  • @Override
    public E poll() {
        if (consumerIndex == producerIndex) {
            return null;
        }
        final int offset = (int) (consumerIndex % buffer.length);
        final E e = buffer[offset];
        buffer[offset] = null;
        consumerIndex++;
        return e;
    }

  • Performance (MOps/s)

    BlockingQueue (BQ): 4

    Lamport v1 (LQ1): 16

    x 4

  • Correct?

  • private final E[] buffer;
    private volatile long producerIndex = 0;
    private volatile long consumerIndex = 0;

  • Producer Thread

    @Override
    public boolean offer(final E e) {
        if (size() == buffer.length) {
            // queue is full
            return false;
        }
        final int offset = (int) (producerIndex % buffer.length);
        buffer[offset] = e;
        producerIndex++;
        return true;
    }

    Consumer Thread

    @Override
    public E poll() {
        if (consumerIndex == producerIndex) {
            // queue is empty
            return null;
        }
        final int offset = (int) (consumerIndex % buffer.length);
        final E e = buffer[offset];
        buffer[offset] = null;
        consumerIndex++;
        return e;
    }

  • Java 1.4

    Producer Thread

    // consumerIndex = 2
    // producerIndex = 2
    // e = 777
    final int offset = (int) (producerIndex % buffer.length); // 2
    buffer[offset] = e; // buffer[2] = 777
    producerIndex++; // 3

    Consumer Thread

    if (consumerIndex == producerIndex) { // 2 != 3
        // queue is empty
        return null;
    }
    final int offset = (int) (consumerIndex % buffer.length); // 2
    final E e = buffer[offset]; // null (buffer[2])
    return e; // return null

    Under the pre-Java 5 memory model, a volatile write did not order the plain writes that precede it, so the consumer could see the new producerIndex and still read null from the buffer.

  • Java 1.5 & +

    Producer Thread

    // consumerIndex = 2
    // producerIndex = 2
    // e = 777
    final int offset = (int) (producerIndex % buffer.length); // 2
    buffer[offset] = e; // buffer[2] = 777
    producerIndex++; // 3

    Consumer Thread

    if (consumerIndex == producerIndex) { // 2 != 3
        // queue is empty
        return null;
    }
    final int offset = (int) (consumerIndex % buffer.length); // 2
    final E e = buffer[offset]; // 777 (buffer[2])
    return e; // return 777

    Happens-Before: since Java 5, the volatile write to producerIndex happens-before the consumer's volatile read of it, so the earlier write to buffer[2] is visible.
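    To see the happens-before guarantee at work, here is a self-contained sketch of the volatile-index queue with one producer thread and one consumer thread. It follows the slide code, but `peek()`, `iterator()` and the `transfer` helper are additions made here so that the class compiles and can be exercised; they are assumptions, not part of the talk. It is only safe with exactly one producer and one consumer.

```java
import java.util.AbstractQueue;
import java.util.Iterator;

final class LamportQueueSketch<E> extends AbstractQueue<E> {
    private final E[] buffer;
    private volatile long producerIndex = 0; // written only by the producer thread
    private volatile long consumerIndex = 0; // written only by the consumer thread

    @SuppressWarnings("unchecked")
    LamportQueueSketch(int capacity) {
        buffer = (E[]) new Object[capacity];
    }

    @Override
    public int size() {
        return (int) (producerIndex - consumerIndex);
    }

    @Override
    public boolean offer(E e) {
        if (size() == buffer.length) {
            return false; // queue is full
        }
        buffer[(int) (producerIndex % buffer.length)] = e;
        producerIndex++; // volatile write: publishes the buffer write above
        return true;
    }

    @Override
    public E poll() {
        if (consumerIndex == producerIndex) {
            return null; // queue is empty
        }
        int offset = (int) (consumerIndex % buffer.length);
        E e = buffer[offset];
        buffer[offset] = null;
        consumerIndex++; // volatile write: frees the slot for the producer
        return e;
    }

    @Override
    public E peek() {
        if (consumerIndex == producerIndex) {
            return null;
        }
        return buffer[(int) (consumerIndex % buffer.length)];
    }

    @Override
    public Iterator<E> iterator() {
        throw new UnsupportedOperationException("not needed for this sketch");
    }

    // Sends 1..count from a producer thread to the calling (consumer) thread
    // and returns the sum of the values received.
    static long transfer(int count) {
        LamportQueueSketch<Integer> queue = new LamportQueueSketch<>(1024);
        Thread producer = new Thread(() -> {
            for (int i = 1; i <= count; i++) {
                while (!queue.offer(i)) {
                    // queue full: spin until the consumer frees a slot
                }
            }
        });
        producer.start();
        long sum = 0;
        int received = 0;
        while (received < count) {
            Integer value = queue.poll();
            if (value != null) {
                sum += value;
                received++;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println("sum = " + transfer(100_000));
    }
}
```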

  • Another quiz!

  • Which one is faster?

    public static final int SIZE = 256 * 1024;
    private int[] data = new int[SIZE];

    @Setup
    public void init() {
        Random rand = new Random();
        for (int i = 0; i < SIZE; i++) {
            data[i] = rand.nextInt(100) - 50;
        }
    }

  • Which one is faster?

    1

    @Benchmark
    public int mathAbs() {
        int sum = 0;
        for (int x : data) {
            sum += Math.abs(x);
        }
        return sum;
    }

    2

    @Benchmark
    public int customAbs() {
        int sum = 0;
        for (int x : data) {
            if (x < 0) {
                sum -= x;
            } else {
                sum += x;
            }
        }
        return sum;
    }

  • Instruction Pipeline

    The first program is the fastest!

    273 us

    1180 us

    Measure!
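    The two variants can be tried outside JMH with the plain Java sketch below. It is only an illustration: the class name, the fixed seed and the naive `System.nanoTime()` timing are assumptions made here, and the timing is far less reliable than the JMH benchmark used in the talk. On random data the sign test in `customAbs` is hard for the branch predictor to guess, while `Math.abs` is typically compiled without a branch.

```java
import java.util.Random;

final class AbsSketch {
    static final int SIZE = 256 * 1024;

    // Values in [-50, 49], so the sign is essentially unpredictable.
    static int[] randomData(long seed) {
        Random rand = new Random(seed);
        int[] data = new int[SIZE];
        for (int i = 0; i < SIZE; i++) {
            data[i] = rand.nextInt(100) - 50;
        }
        return data;
    }

    static int mathAbs(int[] data) {
        int sum = 0;
        for (int x : data) {
            sum += Math.abs(x);
        }
        return sum;
    }

    static int customAbs(int[] data) {
        int sum = 0;
        for (int x : data) {
            if (x < 0) { // data-dependent branch
                sum -= x;
            } else {
                sum += x;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = randomData(42L);
        for (int round = 0; round < 5; round++) { // repeat so the JIT can warm up
            long t0 = System.nanoTime();
            int a = mathAbs(data);
            long t1 = System.nanoTime();
            int b = customAbs(data);
            long t2 = System.nanoTime();
            System.out.println("mathAbs " + (t1 - t0) / 1000 + " us, customAbs "
                    + (t2 - t1) / 1000 + " us, same result: " + (a == b));
        }
    }
}
```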

  • Pipeline diagram (program order, clock cycles 1 to 6): four instructions move through the Fetch, Decode, Execute and Write-back stages, one stage per clock cycle, while the following instructions wait

  • Pipeline diagram (program order, clock cycles 1 to 4): a "jump if sign" instruction moves through the Fetch, Decode, Execute and Write-back stages

    Bad prediction: the instructions already in the pipeline go to the trash

  • Measuring latency

  • Chart: latency over time, with the mean and the percentiles (centiles)

    Mean: 4

    50%: 3.5

    90%: 6

  • HdrHistogram

    Histogram histo = new Histogram(5);
    histo.recordValue(end - start);
    histo.getValueAtPercentile(99.0); // the percentile is given on a 0-100 scale
    histo.outputPercentileDistribution(os, 1D);
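    To make the difference between a mean and a percentile concrete, here is a small plain Java sketch. It does not use HdrHistogram: it sorts the samples and applies the nearest-rank definition, which is fine for a handful of values but is not how HdrHistogram works internally. The class name, method names and sample values are illustrative assumptions.

```java
import java.util.Arrays;

final class PercentileSketch {
    // Nearest-rank percentile on a 0-100 scale: the smallest sample such that
    // at least `percentile` percent of the samples are less than or equal to it.
    static long valueAtPercentile(long[] samples, double percentile) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    static double mean(long[] samples) {
        long total = 0;
        for (long sample : samples) {
            total += sample;
        }
        return (double) total / samples.length;
    }

    public static void main(String[] args) {
        // Nine fast requests and one slow outlier (made-up latencies).
        long[] latencies = {1, 2, 3, 4, 5, 6, 7, 8, 9, 100};
        System.out.println("mean = " + mean(latencies));
        System.out.println("p50  = " + valueAtPercentile(latencies, 50.0));
        System.out.println("p90  = " + valueAtPercentile(latencies, 90.0));
        System.out.println("p99  = " + valueAtPercentile(latencies, 99.0));
    }
}
```

    With these samples the single outlier pulls the mean well above the median, which is why the talk reports percentiles rather than an average.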

  • Queue v2

  • Cost of volatile

  • Core diagram: execution units, registers, store buffer, L1 cache, L2 cache

    Volatile store: the pending stores S1 to S5 are queued in the store buffer

  • import java.util.AbstractQueue;
    import java.util.concurrent.atomic.AtomicLong;

    public final class LamportQueue2<E> extends AbstractQueue<E> {
        private final E[] buffer;
        private final AtomicLong producerIndex = new AtomicLong();
        private final AtomicLong consumerIndex = new AtomicLong();

        @SuppressWarnings("unchecked")
        public LamportQueue2(int capacity) {
            buffer = (E[]) new Object[capacity];
        }

        @Override
        public int size() {
            return (int) (producerIndex.get() - consumerIndex.get());
        }

  • @Override
    public boolean offer(final E e) {
        if (size() == buffer.length) {
            return false;
        }
        // producerIndex is an AtomicLong, so its value is read with get()
        final int offset = (int) (producerIndex.get() % buffer.length);
        buffer[offset] = e;
        producerIndex.lazySet(producerIndex.get() + 1);
        return true;
    }

  • @Override
    public E poll() {
        // compare the index values, not the AtomicLong references
        if (consumerIndex.get() == producerIndex.get()) {
            return null;
        }
        final int offset = (int) (consumerIndex.get() % buffer.length);
        final E e = buffer[offset];
        buffer[offset] = null;
        consumerIndex.lazySet(consumerIndex.get() + 1);
        return e;
    }
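    A compact, self-contained sketch of the same idea: the indices are published with `AtomicLong.lazySet`, an ordered store that does not pay for the full fence of a volatile write. To stay short it does not extend `AbstractQueue`; the class name and the `transfer` helper are assumptions made here. As before, it is only safe with one producer thread and one consumer thread.

```java
import java.util.concurrent.atomic.AtomicLong;

final class LazySetQueueSketch<E> {
    private final E[] buffer;
    private final AtomicLong producerIndex = new AtomicLong(); // written by the producer only
    private final AtomicLong consumerIndex = new AtomicLong(); // written by the consumer only

    @SuppressWarnings("unchecked")
    LazySetQueueSketch(int capacity) {
        buffer = (E[]) new Object[capacity];
    }

    boolean offer(E e) {
        long p = producerIndex.get();
        if (p - consumerIndex.get() == buffer.length) {
            return false; // queue is full
        }
        buffer[(int) (p % buffer.length)] = e;
        producerIndex.lazySet(p + 1); // ordered after the buffer write above
        return true;
    }

    E poll() {
        long c = consumerIndex.get();
        if (c == producerIndex.get()) {
            return null; // queue is empty
        }
        int offset = (int) (c % buffer.length);
        E e = buffer[offset];
        buffer[offset] = null;
        consumerIndex.lazySet(c + 1); // ordered after the slot has been cleared
        return e;
    }

    // Sends 1..count from a producer thread to the calling (consumer) thread
    // and returns the sum of the values received.
    static long transfer(int count) {
        LazySetQueueSketch<Integer> queue = new LazySetQueueSketch<>(1024);
        Thread producer = new Thread(() -> {
            for (int i = 1; i <= count; i++) {
                while (!queue.offer(i)) {
                    // queue full: spin until the consumer frees a slot
                }
            }
        });
        producer.start();
        long sum = 0;
        int received = 0;
        while (received < count) {
            Integer value = queue.poll();
            if (value != null) {
                sum += value;
                received++;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println("sum = " + transfer(100_000));
    }
}
```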

  • Performance (MOps/s)

    BlockingQueue (BQ): 4

    Lamport v1 (LQ1): 16

    Lamport v2 (LQ2): 43

    x 10

  • Logs

  • Logs, several approaches

    Standard file appender, blocking writes

    Buffered appender

    Asynchronous appender

    Memory-mapped file appender

  • Memory-mapped file

    Virtual memory

    Physical memory

    Disk

  • Logs & mmap

    Memory (off heap)

    Buffer

    File

    Data, empty

    System call
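    The idea of logging through a memory-mapped file can be sketched with the JDK's `FileChannel.map`. This is not how the Chronicle appender is implemented; it only shows the mechanism, and the class name, helper names and the 4096-byte mapping size are assumptions made here. Once the region is mapped, writing a message is a copy into off-heap memory, and the operating system writes the dirty pages to the file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MmapLogSketch {
    // Maps the first `size` bytes of the file and copies the message at offset 0.
    static void write(Path file, String message, int size) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, size);
            buffer.put(message.getBytes(StandardCharsets.UTF_8)); // memory copy into the mapping
            buffer.force(); // optional: ask the OS to flush the mapped pages now
        }
    }

    // Writes the message to a temporary file through the mapping and reads it back.
    static String roundTrip(String message) {
        try {
            Path file = Files.createTempFile("mmap-log", ".bin");
            file.toFile().deleteOnExit();
            write(file, message, 4096);
            byte[] bytes = Files.readAllBytes(file);
            int length = message.getBytes(StandardCharsets.UTF_8).length;
            return new String(bytes, 0, length, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("New quote received"));
    }
}
```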

  • perf-test-txt-chronicle

    Chronicle Logback Appender

  • Text vs binary: market data

    A 12-character ISIN code

    A bid price

    An ask price

    Isin=FR0000120271 Bid=46.575 Ask=46.590

  • Text vs binary

    Isin=FR0000120271 Bid=46.575 Ask=46.590

    Text (UTF-8)

    49 73 69 6E 3D 46 52 30 30 30 30 31 32 30 32 37 31 20 42 69 64 3D 34 36 2E 35 37 35 20 41 73 6B 3D 34 36 2E 35 39 30

    39 bytes

    Binary

    46 52 30 30 30 30 31 32 30 32 37 31 40 47 49 99 99 99 99 9A 40 47 4B 85 1E B8 51 EC

    28 bytes (20 with ints)

  • Serialization vs toString()

    public class Quote {
        String code;
        double bid;
        double ask;
    }

    return "Quote {" +
        "code='" + code + '\'' +
        ", bid=" + bid +
        ", ask=" + ask +
        '}';

  • Serialization vs toString()

    public class Quote {
        String code;
        double bid;
        double ask;
    }

    ByteBuffer buffer = ByteBuffer.allocate(28);

    buffer.put(code.getBytes());
    buffer.putDouble(bid);
    buffer.putDouble(ask);
    buffer.flip();
    out.write(buffer.array());
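    The byte counts from the text vs binary slide can be checked with the sketch below. The class and method names are illustrative, and the three-decimal text format is an assumption chosen to match the slide's sample line. `ByteBuffer` is big-endian by default, which gives the byte order shown on the slide.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class QuoteEncodingSketch {
    // Text form, e.g. "Isin=FR0000120271 Bid=46.575 Ask=46.590".
    static byte[] toText(String isin, double bid, double ask) {
        String text = String.format(Locale.ROOT, "Isin=%s Bid=%.3f Ask=%.3f", isin, bid, ask);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Binary form: 12 ASCII bytes for the ISIN, then two 8-byte doubles.
    static byte[] toBinary(String isin, double bid, double ask) {
        ByteBuffer buffer = ByteBuffer.allocate(12 + 8 + 8);
        buffer.put(isin.getBytes(StandardCharsets.US_ASCII));
        buffer.putDouble(bid);
        buffer.putDouble(ask);
        return buffer.array();
    }

    // Space-separated upper-case hex dump, as on the slide.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] text = toText("FR0000120271", 46.575, 46.590);
        byte[] binary = toBinary("FR0000120271", 46.575, 46.590);
        System.out.println(text.length + " bytes: " + hex(text));
        System.out.println(binary.length + " bytes: " + hex(binary));
    }
}
```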

  • perf-test-txt-chronicle

    perf-test-bin-chronicle

    Logging in binary with Chronicle

  • logger.info("New quote received - {}", quote);

  • Logger benchmark

    Logging 100,000 quote messages

    Measuring the latency induced by the call to the logger

    System Under Test

    CPU: 2x (Xeon E5-2630, 2.3 GHz to 2.8 GHz, 6 cores)

    RAM: 64 GB DDR3 PC10600

    HDD: 2x 300 GB in RAID 1, 10K rpm, SAS 6 Gb/s, 1 GB cache

    Network: 1 GbE

    OS: RHEL 6.4 with a 2.6.32 kernel

  • FileAppender vs BinaryIndexedChronicleAppender

    1us

    5us

    2us

    10us

  • Conclusion

  • Last quiz: what to remember

    The first program is always the fastest

    Be wary of your intuition: when it comes to performance, you have to measure!!!

    Be curious, hardware is not dirty

  • Questions?

  • References

    Talk "Lock-free Algorithms for Ultimate Performance", Martin Thompson: https://yow.eventer.com/yow-2012-1012/lock-free-algorithms-for-ultimate-performance-by-martin-thompson-1250

    Talk "Queue evolution: from 10M to 470M ops/sec", Nitsan Wakart: https://vimeo.com/100197431 / https://github.com/nitsanw/QueueEvolution

    Talk "How NOT to Measure Latency", Gil Tene: http://www.infoq.com/presentations/latency-pitfalls

    JMH, THE Micro Benchmark Tool for Java: http://openjdk.java.net/projects/code-tools/jmh/

    HdrHistogram, THE Latency Measurement & Plotting Tool: https://github.com/HdrHistogram/HdrHistogram

    Blog posts related to Mechanical Sympathy, Martin Thompson: http://mechanical-sympathy.blogspot.com/

    Blog posts related to Java Performance, Nitsan Wakart: http://psy-lob-saw.blogspot.com/

    Blog posts related to Java Performance, Peter Lawrey: http://vanillajava.blogspot.com/

    Chronicle, OpenHFT's tools: http://openhft.net/

    The Mechanical Sympathy Forum: https://groups.google.com/forum/#!forum/mechanical-sympathy

    Code for this talk: https://github.com/ThierryAbalea/high-performance-2015-talk

    Slides for this talk: http://www.slideshare.net/ThierryAbalea/un-monde-o-1-ms-vaut-100-m-devoxx-france-2015
