Themen für Projekt-, Bachelor- und Master-Arbeiten


https://www.cs12.tf.fau.de/lehre

Design and Implementation of a Data-Specific DBMS for IoT Edge Storage

[Figure: sensors → base station (in-memory processing) → database → application]

In the AMMOD project, data files coming from sensors are stored and later uploaded to a cloud by an autonomous base station powered by a variable renewable energy source. As the available energy fluctuates, the amount of data stored locally can grow considerably when transmission is delayed to save energy. This calls for a power-efficient database system. We currently use SQLite, which is designed for embedded systems. However, its lack of required features (such as in-memory processing, data compression, and a recycle bin) and its data structure (relational tables) do not match our application. On top of that, its flexibility (i.e., its query language) is not required in our case, where only a small subset of operations is actually needed. This project involves designing and implementing a custom DBMS and comparing its performance with state-of-the-art databases. A later extension could be the implementation of the resulting DBMS in the programmable logic of the FPGA deployed in the base station's processing unit for further performance increases.
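A minimal sketch of the kind of data-specific store the project targets — append-only batches with compression and a recycle bin, and no query language. All names and the batching scheme are illustrative assumptions, not the project's actual design:

```python
import zlib

class SensorLog:
    """Minimal append-only store sketch: records are staged in memory,
    compressed in batches, and can be scanned back -- roughly the small
    operation subset the AMMOD use case needs, with no SQL layer."""

    def __init__(self, batch_size=4):
        self.batch_size = batch_size
        self.buffer = []      # in-memory staging area (cheap writes)
        self.segments = []    # compressed, immutable segments
        self.trash = []       # recycle bin: dropped segments kept until purged

    def append(self, record: bytes):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Compress the staged records into one immutable segment."""
        if self.buffer:
            self.segments.append(zlib.compress(b"\n".join(self.buffer)))
            self.buffer = []

    def scan(self):
        """Iterate over all stored records, oldest first."""
        for seg in self.segments:
            yield from zlib.decompress(seg).split(b"\n")
        yield from self.buffer

    def drop_oldest(self):
        """Move the oldest segment to the recycle bin instead of deleting it."""
        if self.segments:
            self.trash.append(self.segments.pop(0))
```

Fixed-function operations like these avoid SQLite's query-planning overhead entirely, which is the point of a data-specific design.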

Prerequisites: Basic knowledge of a low-level language (e.g., C, C++, Rust)
Type of Work: Theory (30%), Design (40%), Implementation (30%)
Supervisors: Pierre-Louis Sixdenier ([email protected]), Stefan Wildermann ([email protected])

Lehrstuhl für Informatik 12, Hardware-Software-Co-Design
Cauerstraße 11, 91058 Erlangen

Analyzing GPU Tensor Cores on Different Architectures

[Figure: tensor-core fragment computing a 3×3 output matrix o0,0..o2,2 from inputs i0..i2 and f0..f2]

Modern NVIDIA GPUs are equipped with a number of so-called tensor cores, an emerging architecture that allows calculating a matrix-multiply-accumulate

D = A · B + C

in a single GPU cycle, where A, B, C, and D are small matrices. They were initially designed for deep learning applications. But as new generations of GPUs also come equipped with hundreds of tensor cores, their performance becomes interesting in other domains.
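As a reference for what a single tensor-core operation computes, the matrix-multiply-accumulate can be written out in scalar form; this is useful as a correctness baseline when validating tensor-core kernels (the fragment size here is illustrative):

```python
def mma(A, B, C):
    """Scalar reference for the matrix-multiply-accumulate D = A*B + C
    that a tensor core performs on small matrix fragments."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) + C[i][j]
             for j in range(m)] for i in range(n)]
```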

In this work, your task is to analyze several different applications (e.g., matrix-matrix multiplication as a baseline, image convolution, and reduction) using tensor cores and to evaluate potential performance benefits on different GPU architectures. This work is suitable for a bachelor or master project.

Required skills: C++ and CUDA
Nature of work: Theory (10%), Conception (10%), Implementation (60%)
Contact: Stefan Groth ([email protected])


HiWi: Attacking a Secure Boot process on an FPGA board

Side-channel analysis (SCA) of embedded hardware devices can be of great help to attackers or to forensic investigators at a crime scene. SCA can be used to retrieve data and assess the state of a device. Furthermore, it can be used to break encryption algorithms running on said device. Correlation Power Analysis is one possible attack that can be used to break an AES encryption. Several algorithms and tools have been developed at the chair to make this analysis feasible for forensic investigations. Now we want to show that these tools can be used to attack a secure boot process on an FPGA in order to retrieve the decrypted firmware. In the trace below, a secure boot process with 500 AES decryptions is shown. However, for our example the problem is associating the ciphertexts used in the decryption with the recorded operations.
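The core idea of a Correlation Power Analysis can be sketched as follows: for each key-byte guess, a hypothetical leakage value is computed per ciphertext and correlated with the measured samples; the guess that correlates best wins. The leakage model and all names are illustrative (a common variant ranks by the absolute value of the correlation):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient; 0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def best_key_guess(samples, ciphertext_bytes, leak_model):
    """Rank all 256 byte guesses by how well the hypothetical leakage
    (e.g. Hamming weight of an AES intermediate) correlates with the
    measured samples at one trace point."""
    def score(guess):
        hypo = [leak_model(c, guess) for c in ciphertext_bytes]
        return pearson(hypo, samples)
    return max(range(256), key=score)
```

The open problem stated above — associating each ciphertext with the right segment of the recorded trace — is exactly what determines which `samples` vector this correlation is run against.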

[Figure: recorded power trace (amplitude over ~2·10⁸ samples) showing hashing, an ECDSA signature check, and 500 AES-128 CBC decryptions]

Prerequisites: Motivation, programming skills in Python, and some experience with FPGAs
Type of Work: HiWi
Supervisor: Jens Trautmann ([email protected])


Simulation of Processors with Non-Volatile Memory Hierarchies in gem5

[Figure: processor with register file, volatile L1 cache, hybrid L2 cache, non-volatile L3 cache, non-volatile main memory, and a register-file copy; volatile technologies: DRAM, SRAM; non-volatile: PCM, STT, FeFET]

Energy harvesting is one of the most promising options for powering embedded systems in the Internet of Things (IoT). However, an unstable power supply can then lead to frequent interruptions of execution. New non-volatile memory technologies are therefore increasingly being used in IoT devices to carry program execution across power interruptions. For this purpose, the cache contents, together with the register contents, are transferred to non-volatile memory on power failure (so-called checkpointing).

In this work, gem5 (www.gem5.org), a simulator for computer architectures, is to be coupled with an existing simulation model for non-volatile memories. A memory hierarchy with non-volatile main memory and a mix of volatile and non-volatile caches shall be built, and a basic checkpointing strategy shall be implemented. Its effectiveness and overhead shall then be investigated by simulating various benchmarks.
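The checkpointing strategy to be simulated can be illustrated with a toy model — on a power-loss signal, the register file and the volatile cache contents are copied into non-volatile memory and restored on power-up. The structure is purely hypothetical; the thesis works inside gem5's memory system, not on a model like this:

```python
class Machine:
    """Toy model of checkpointing across power failures."""

    def __init__(self):
        self.regs = {}             # register file (volatile)
        self.volatile_cache = {}   # volatile cache contents
        self.nvm = {}              # non-volatile memory (survives outages)

    def checkpoint(self):
        """Copy all volatile state into non-volatile memory."""
        self.nvm["ckpt"] = (dict(self.regs), dict(self.volatile_cache))

    def power_loss(self):
        self.checkpoint()
        self.regs, self.volatile_cache = {}, {}  # volatile state is lost

    def power_up(self):
        """Restore volatile state from the last checkpoint, if any."""
        if "ckpt" in self.nvm:
            self.regs, self.volatile_cache = map(dict, self.nvm["ckpt"])
```

The overhead to be measured in the thesis corresponds to the cost of the `checkpoint` copy, which depends on how much of the hierarchy is volatile.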

Prerequisites: Programming skills in C/C++
Type of Work: Theory (20%), Conception (30%), Implementation (50%)
Contact: Stefan Wildermann ([email protected])


Accelerating Design Space Exploration with Machine Learning

[Figure: exploration loop between search space (Suchraum), synthesis (Synthese), and training]

When realizing and evaluating hardware implementations from a given specification, the size of the search space frequently poses a problem. To master the multitude of possible implementations, design space exploration is therefore employed. Here, the search space is traversed by generating and evaluating many solutions, with selected implementations being steadily improved. The evaluation requires a synthesis, whose high complexity leads to very long runtimes. This work investigates an approach that reduces this runtime by actually synthesizing and evaluating only a subset of the solutions. The evaluation of the remaining solutions is replaced by a method that uses machine learning to learn the relationship between characteristic properties of the synthesized solutions and their evaluation results. This knowledge is then used to predict the quality of other solutions.
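The idea of replacing most synthesis runs with a learned predictor can be sketched with the simplest possible surrogate, a nearest-neighbour lookup over characteristic feature vectors. The feature space, cost function, and model choice are placeholders for the actual learned method:

```python
def predict(features, samples):
    """1-nearest-neighbour surrogate: estimate a solution's quality from
    the evaluated (synthesized) sample closest in feature space."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(samples, key=lambda s: dist(features, s[0]))[1]

def explore(candidates, synthesize, budget):
    """Evaluate only `budget` candidates with the costly synthesis;
    estimate the quality of the rest with the surrogate."""
    evaluated = [(c, synthesize(c)) for c in candidates[:budget]]
    estimates = [(c, predict(c, evaluated)) for c in candidates[budget:]]
    return evaluated + estimates
```

The runtime saving is the ratio of skipped to total synthesis runs; the thesis question is how much prediction error this introduces into the exploration.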

Prerequisites: Programming skills in C/C++ and Java
Type of Work: Theory (40%), Conception (40%), Implementation (20%)
Contact: Peter Brand, Joachim Falk ({peter.brand, joachim.falk}@fau.de)


Comparing Simulated and Real Communication Patterns with Machine Learning

[Figure: machine-learning flow — recognize patterns, form classes, train a model, then compare real and simulated data streams]

When developing new communication algorithms and methods, reliable evaluations are necessary. Where an analytical evaluation is not possible, a multitude of test scenarios must be assessed, which should cover as many characteristics of real-world communication as possible. Since scenarios in real systems are usually neither traceable nor repeatable, and unexpected behavior can in the worst case have catastrophic consequences, development mostly relies on simulations that reproduce relevant real system behavior. This makes it possible to generate traceable and repeatable test scenarios that react dynamically to the behavior of the communication algorithm. The validity of a simulative evaluation depends mainly on how faithfully the simulated behavior matches reality. This work investigates how this fidelity can be compared. Machine-learning methods are applied to recognize behavioral patterns in real communication streams and then to compare them with simulated communication streams.
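As a baseline for such a comparison, one can already measure how far the pattern statistics of two streams diverge without any learned model; the n-gram profile below is an illustrative stand-in for the learned behavioral patterns:

```python
from collections import Counter

def ngram_profile(stream, n=2):
    """Relative frequencies of length-n event patterns in a stream."""
    grams = [tuple(stream[i:i + n]) for i in range(len(stream) - n + 1)]
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).items()}

def profile_distance(real, sim, n=2):
    """Total-variation distance between the pattern profiles of a real
    and a simulated stream: 0 means identical pattern statistics,
    1 means maximal disagreement."""
    p, q = ngram_profile(real, n), ngram_profile(sim, n)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)
```

A learned model, as proposed in the topic, would replace the fixed n-gram features with patterns it discovers itself.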

Prerequisites: Programming skills in C/C++ and Java
Type of Work: Theory (45%), Conception (45%), Implementation (10%)
Contact: Peter Brand, Joachim Falk ({peter.brand, joachim.falk}@fau.de)


10G Ethernet Controller for Accelerated Stream Processing

With the advent of IoT and Industry 4.0, not only is the amount of data increasing, but also the real-time requirements for its processing. The ReProVide project aims to overcome these new challenges with the help of FPGAs. These can be used in data centers as intelligent switches and smart NICs (Network Interface Cards) to deploy accelerators for downstream operations directly in the data stream.

To enable such a deployment, it is necessary to process the incoming network packets directly on the FPGA. For this purpose, a UDP/IP stack is to be built in hardware. Based on the UDP protocol, a lightweight protocol for communication with the FPGA shall be developed that supports essential functions of the TCP protocol (detection of losses, packet ordering, etc.) without covering the full complexity of TCP.
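A sketch of what such a lightweight protocol could look like on the software side: a minimal header carrying a sequence number lets the receiver restore packet order and detect losses without TCP's connection state. The header layout is a made-up example, not the protocol to be designed in this work:

```python
import struct

HEADER = struct.Struct("!IH")  # 32-bit sequence number, 16-bit length

def make_packet(seq, payload: bytes):
    """Prepend the hypothetical lightweight header to a UDP payload."""
    return HEADER.pack(seq, len(payload)) + payload

def reassemble(packets):
    """Sort received packets by sequence number and report any gaps
    (lost packets) between the lowest and highest sequence seen."""
    parsed = []
    for pkt in packets:
        seq, length = HEADER.unpack(pkt[:HEADER.size])
        parsed.append((seq, pkt[HEADER.size:HEADER.size + length]))
    parsed.sort()
    seqs = [s for s, _ in parsed]
    lost = sorted(set(range(seqs[0], seqs[-1] + 1)) - set(seqs))
    return b"".join(p for _, p in parsed), lost
```

In the actual project this logic would live in the FPGA's data path; detected losses could then trigger a retransmission request instead of TCP's full machinery.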

Prerequisites: Basic knowledge of C++, Python, VHDL, and ideally Verilog
Type of Work: Theory (20%), Conception (30%), Implementation (50%)
Supervisor: Tobias Hahn ([email protected])


Unified Design of Compilers for Machine Learning on Specialized Hardware Targets

Due to the relatively recent explosion in practical machine learning techniques, the necessary tools had to mature at a matching speed. This led to many design decisions and competing concepts that, in hindsight, could be unified or set up differently.

MLIR is a compiler intermediate representation that aims to provide a reusable software layer for various compilers and thereby unify their designs. Especially in machine learning frameworks, the adoption of MLIR or similar techniques is interesting as well as ongoing. For instance, TensorFlow uses MLIR internally. However, there are many other compilers, like Google's XLA, TVM, Glow, or Microsoft's Brainwave compiler. These compilers perform similar work but must, in part, target radically different hardware: CPUs, FPGAs, GPUs, and TPUs. This and their development history make them so distinct.

In this work, the differences between the approaches used in these compilers, the usefulness of their parts in other compilers, and especially the applicability of MLIR in their stead shall be explored.

Required Skills: Programming skills in C++, knowledge of compiler design and machine learning
Type of Work: Theory (50%), Conception (20%), Implementation (30%)
Contact: Patrick Plagwitz ([email protected])


Optimization of FSM-based Property Enforcers using DSE

The goal of this thesis is to optimize the construction of an automaton that enforces execution-time as well as energy requirements for application programs running on an MPSoC by controlling the frequency/voltage settings and the degree of parallelism at runtime, depending on changing workloads. Periodic workloads can be characterized by a Markov chain.

Through a proper encoding of an enforcement automaton, multi-objective evolutionary algorithms (MOEAs) shall be investigated to explore the search space while optimizing the estimated a) energy, b) number of deadline violations, and c) automaton complexity in terms of the number of states and transitions.
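One possible, purely illustrative genome encoding and objective evaluation for such an enforcement automaton is sketched below; the actual encoding and cost models are part of the thesis:

```python
import random

def random_genome(n_states, n_inputs, freq_levels, rng):
    """Hypothetical MOEA genome: per state, one frequency/voltage level
    plus one successor state per observed workload class."""
    return {
        "freq": [rng.randrange(freq_levels) for _ in range(n_states)],
        "next": [[rng.randrange(n_states) for _ in range(n_inputs)]
                 for _ in range(n_states)],
    }

def objectives(genome, workload, energy_per_level, deadline_level):
    """Toy evaluation of the three optimization goals: estimated energy,
    number of deadline violations (frequency below the required level),
    and automaton complexity (#states + #transitions)."""
    state, energy, violations = 0, 0.0, 0
    for event in workload:
        level = genome["freq"][state]
        energy += energy_per_level[level]
        if level < deadline_level:
            violations += 1
        state = genome["next"][state][event]
    n = len(genome["freq"])
    complexity = n + n * len(genome["next"][0])
    return energy, violations, complexity
```

An MOEA such as NSGA-II would mutate and recombine such genomes while keeping a Pareto front over the three objective values.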

Required skills: Programming knowledge in Java, interest in optimization techniques
Nature of work: Theory (30%), Conception (35%), Implementation (35%)
Contact: Khalil Esper ([email protected]), Stefan Wildermann ([email protected]), Jürgen Teich ([email protected])


Error Propagation in Complex Systems

[Figure: dataflow example with x ∈ [0,10] ± 0, y ∈ [2,7] ± 0, z ∈ [0,5] ± 1; f(x, y) with operator error e₊ = 2 yields [2,17] ± 2; g(w, z) with operator error e∗ = 3 yields [0,85] ± 34]

In approximate computing, the correctness of computations is sacrificed in favor of other, non-functional metrics, e.g., a shorter computation time or reduced energy consumption. When approximating general functionality or complex systems, however, determining the introduced error is very difficult and costly. To cope with this complexity, abstract number ranges (e.g., affine arithmetic, Dempster-Shafer intervals) are frequently used. These number ranges are then used to propagate a system's value and error ranges along the abstract syntax tree. In this work, a generic C++ framework for value/error propagation with support for several number ranges shall be developed. The goals of this work are:

• Understanding abstract number ranges and their propagation
• Implementation of a C++ error-propagation framework
• Evaluation of the framework and comparison with existing error-propagation software
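A minimal sketch of such value/error propagation: an interval with an attached error bound and simplified first-order propagation rules. The real framework would support several abstract number ranges (e.g., affine arithmetic) and is to be written in C++; Python is used here only for illustration, and the propagation rules are an assumption, not the framework's specification:

```python
class ErrInterval:
    """A value interval [lo, hi] with an accumulated error bound."""

    def __init__(self, lo, hi, err=0.0):
        self.lo, self.hi, self.err = lo, hi, err

    def __add__(self, other):
        # Interval addition; error bounds add up.
        return ErrInterval(self.lo + other.lo, self.hi + other.hi,
                           self.err + other.err)

    def __mul__(self, other):
        prods = [self.lo * other.lo, self.lo * other.hi,
                 self.hi * other.lo, self.hi * other.hi]
        mag_s = max(abs(self.lo), abs(self.hi))
        mag_o = max(abs(other.lo), abs(other.hi))
        # First-order error propagation: |a| * e_b + |b| * e_a.
        return ErrInterval(min(prods), max(prods),
                           mag_s * other.err + mag_o * self.err)

    def with_op_error(self, e):
        """Add the error introduced by an approximated operator itself."""
        return ErrInterval(self.lo, self.hi, self.err + e)
```

Running the figure's example through these rules reproduces the value intervals [2,17] and [0,85]; the exact error bound depends on the propagation rule chosen, which is precisely what the framework must make pluggable.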

Prerequisites: Good programming skills in C++ (at least C++17, preferably C++20)
Type of Work: Conception (40%), Implementation (60%)
Supervisor: Oliver Keszöcze ([email protected])


Co-Design of Neural Architecture and Hardware Accelerator

Recently, in the area of deep learning, platform-aware Neural Architecture Search (NAS) has become one of the major research topics, with objectives such as reducing the error rate or execution time of Deep Neural Networks (DNNs) deployed on a fixed accelerator. With this, the DNNs are optimized regarding, e.g., the number of layers or the number of neurons per layer.

Usually, the accelerator parameters (e.g., number of processing elements (PEs), number of functional units, ...) are considered fixed, restricting the design space for potentially better hardware accelerator configurations. Tightly Coupled Processor Arrays (TCPAs), which are developed at our chair, are templates that can be parameterized at design time. They consist of a reconfigurable 2D grid of processors and enable a co-design of neural architecture and hardware accelerator. In this work, existing NAS benchmarks (e.g., NATS-Bench) and a TCPA compilation environment shall be utilized to set up a co-exploration tool. Finally, the improvements compared to a fixed accelerator shall be evaluated in terms of execution time, the DNN's error rate, and the area cost of the respective TCPA.

Prerequisites: Knowledge about DNNs; basic programming skills in Python, C++, and Java
Type of Work: Theory (30%), Conception (25%), Implementation (45%)
Supervisor: Christian Heidorn ([email protected])


CNN Acceleration on Processor Arrays for a Prosthetic Hand

Convolutional Neural Networks (CNNs) frequently appear in mobile and medical applications. As an example, the CNN for this work is used for object detection to control the grasping force of a prosthetic hand, which has an integrated camera. For the energy-efficient and real-time-capable acceleration of these compute-intensive CNNs, the Tightly Coupled Processor Arrays (TCPAs), developed at our chair, have been designed. TCPAs are processor arrays consisting of a reconfigurable 2D grid of processors, enabling fast acceleration of the compute-intensive layers of a CNN. In this work, the TCPA will be prototyped on a Field Programmable Gate Array (FPGA), which is integrated in the prosthetic hand.

The work consists of the following tasks:

• Design space exploration to find a feasible TCPA architecture fitting on the FPGA in the prosthetic hand. Here, different TCPA architectures are designed and compared regarding hardware cost and performance (FLOPs).

• Exploration of feasible mappings of the CNN layers onto the found TCPA architecture. Here, different possible layer mappings are compared regarding the required processing time.
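A first-order model of how candidate mappings could be compared is sketched below; the cost model is a deliberately crude placeholder (real processing times come from the TCPA compilation flow), and all parameter names are illustrative:

```python
def conv_flops(out_h, out_w, k, c_in, c_out):
    """FLOPs of one convolutional layer (one MAC = 2 FLOPs)."""
    return 2 * out_h * out_w * k * k * c_in * c_out

def mapping_time(flops, pes, flops_per_cycle_per_pe, freq_hz, utilization):
    """First-order processing-time estimate for a layer mapped onto a
    processor array with `pes` PEs; the utilization factor stands in
    for the quality of the chosen mapping."""
    peak = pes * flops_per_cycle_per_pe * freq_hz
    return flops / (peak * utilization)
```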

Prerequisites: Basic knowledge about CNNs, Python, and VHDL
Type of Work: Theory (30%), Conception (30%), Implementation (40%)
Supervisor: Christian Heidorn ([email protected])


On-Chip Memory Design of Many-Core Loop Accelerators

Tightly Coupled Processor Arrays (TCPAs) are a class of highly customizable coarse-grained reconfigurable architectures used as hardware accelerators. They are perfectly suited to accelerate loop applications such as FIR filters, matrix-matrix multiplications, and the popular Convolutional Neural Network (CNN) applications used in machine learning. A TCPA consists of an array of processors, where each processor can communicate data directly to its neighbors. This design enables fast and low-power execution of such loop applications.

In computer architectures, a huge fraction of energy is spent transporting data from and to off-chip memories. Hence, keeping data local, i.e., close to the functional units, is key to increasing energy efficiency. Hardware means to improve data locality include register files, caches, and scratchpad memories. TCPAs are surrounded by many such scratchpad buffers to enable massively parallel data access and processing.

This work aims to investigate and explore the best way to size, place, access, and synthesize the memory of an existing TCPA architecture in terms of area, timing, and energy consumption. There will be the opportunity to work with state-of-the-art 22 nm FDSOI technologies as well as memory compilers such as OpenRAM.

Requirements: Knowledge of VHDL programming; computer engineering or chip design helpful
Type of Thesis: Theory (20%), Conception (40%), Implementation (40%)
Supervisors: Marcel Brand, Dominik Walter ({marcel.brand, dominik.walter}@fau.de)


Data-Dependent Runtime Conditions in a Symbolic Loop Compiler

Tightly Coupled Processor Arrays (TCPAs) are energy-efficient loop accelerators that are programmed by means of a dedicated loop compiler. This compiler translates symbolically, i.e., it generates configuration data parameterized in the loop size and the number of processing elements. Before execution, these sizes are substituted and the actual configuration data is instantiated. Since many loops exhibit data-dependent runtime conditions, the compiler shall be extended accordingly in this work: First, the intermediate representation, a special dependence graph, shall be extended with suitable node types. Building on this, scheduling and resource allocation, as well as the generation of symbolic configuration data, shall be extended. Finally, the instantiation of data-dependent runtime conditions shall be implemented. The entire implementation shall be evaluated on benchmarks.

par (i < N and j < M) {
    a[i,j] = ifrt(c[i,j] > 255, 255, c[i,j]);
}

[Figure: dependence graph of the loop body — the comparison c[i,j] > 255 feeding a merge node]
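The scalar semantics of the loop above can be sketched as follows: `ifrt` evaluates the runtime condition and merges the two branch values, which is what the merge node in the dependence graph expresses (Python is used for illustration only; the compiler itself is written in C++):

```python
def ifrt(cond, then_val, else_val):
    """Scalar semantics of the data-dependent ifrt construct: select
    between two already-computed values based on a runtime condition
    (predicated execution / merge)."""
    return then_val if cond else else_val

def saturate(c, n, m):
    """Reference for the example loop: clamp every element to 255."""
    return [[ifrt(c[i][j] > 255, 255, c[i][j]) for j in range(m)]
            for i in range(n)]
```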

Prerequisites: Good knowledge of C++
Type of Work: Theory (20%), Conception (40%), Implementation (40%)
Contact: Michael Witterauf ([email protected]), Dominik Walter ([email protected])


Energy and Performance Improvements of CNNs on Anytime Processing Elements by Exploiting SIMD

Tightly Coupled Processor Arrays (TCPAs) are a class of highly customizable hardware accelerators that are perfectly suited to accelerate loop applications such as FIR filters, matrix-matrix multiplications, and the popular Convolutional Neural Network (CNN) applications used in machine learning. A TCPA consists of an array of processors, where each processor can communicate data directly to its neighbors. This design enables fast and low-power execution of such loop applications.

The goal of this work is to analyze the benefits of combining the novel anytime processing elements with SIMD vector processing for floating-point computations. Anytime processing is a type of approximate computing that trades off computation accuracy against performance and energy efficiency. In anytime processing, one can even control the accuracy of the instructions at the bit level. SIMD vector processing, on the other hand, enables high parallelism with reduced control flow while executing applications.

We want to analyze whether the floating-point anytime functional units that are already integrated in the TCPA architecture can leverage the hardware structures of the integrated vector processing and may further reduce the runtime of, e.g., the convolutions of a CNN application when computing at low accuracies.
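A software model of such an accuracy knob can be sketched by truncating mantissa bits of a float32 result, emulating a lower-accuracy anytime operation. This models only the numerics — not the chair's actual anytime functional units, whose accuracy control works in hardware:

```python
import struct

def truncate_mantissa(x, kept_bits):
    """Keep only the `kept_bits` most significant of the 23 mantissa
    bits of a float32 value, zeroing the rest -- a software stand-in
    for computing at reduced accuracy."""
    bits = struct.unpack("!I", struct.pack("!f", x))[0]
    mask = (0xFFFFFFFF << (23 - kept_bits)) & 0xFFFFFFFF
    return struct.unpack("!f", struct.pack("!I", bits & mask))[0]

def approx_dot(a, b, kept_bits):
    """Dot product (the core of a convolution) with every
    multiplication result truncated to the requested accuracy."""
    return sum(truncate_mantissa(x * y, kept_bits) for x, y in zip(a, b))
```

Sweeping `kept_bits` over a CNN layer gives a first impression of the accuracy/precision trade-off that the SIMD anytime units would exploit in hardware.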

Requirements: Basic knowledge of hardware description languages (e.g., VHDL)
Type of Thesis: Theory (30%), Conception (30%), Implementation (40%)
Supervisor: Marcel Brand ([email protected])
