# Towards Scalable Machine Learning

Janis Keuper

itwm.fraunhofer.de/ml

Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern, Germany

Fraunhofer Center Machine Learning



### Outline

- **I** Introduction / Definitions
- **II** Is Machine Learning a HPC Problem?
- **III** Case Study: Scaling the Training of Deep Neural Networks
- **IV** Towards Scalable ML Solutions [current Projects]
- **V** A look at the (near) future of ML Problems and HPC



# Introduction

### **Machine Learning @CC-HPC**

- Scalable distributed ML Algorithms
- Distributed Optimization Methods
- **Communication Protocols**

#### **Distributed DL Frameworks**

#### "Automatic" ML

- DL Meta-Parameter Learning
- DL Topology Learning

### HPC-Systems for Scalable ML Distributed I/O

Novel ML Hardware Low Cost ML Systems

#### **DL** Methods:

- Semi- and Unsupervised DL
- **Generative Models**
- ND CNNs



#### **Industry Applications**

- DL Software optimization for Hardware / Clusters
- DL for Seismic Analysis
- DL Chemical Reaction Prediction
- DL for autonomous driving



# Setting the Stage | Definitions

### **Scalable ML**

VS

- Large model size (implies large data as well)
- Extreme compute effort
- Goals:
  - Larger models
  - (linear) strong and weak scaling through (distributed) parallelization

Large Scale ML

- Very large data sets (online stream)
- "normal" model size and compute effort (traditional ML methods)
- Goals:
  - Make training feasible
  - Often online training
    - → Big Data



 $\rightarrow \ HPC$ 

# **Scaling DNNs**

Simple strategy in DL if it does not work: scale it!

Scaling in two dimensions:

1. Add more layers = more matrix multiplications, more convolutions

2. Add more units = larger matrix multiplications, more convolutions

**Don't forget: in both cases** 

MORE DATA! -> more iterations





# **Scaling DNNs**

#### OUTRAGEOUSLY LARGE NEURAL NETWORKS: THE SPARSELY-GATED MIXTURE-OF-EXPERTS LAYER

Noam Shazeer<sup>1</sup>, Azalia Mirhoseini<sup>\*†1</sup>, Krzysztof Maziarz<sup>\*2</sup>, Andy Davis<sup>1</sup>, Quoc Le<sup>1</sup>, Geoffrey Hinton<sup>1</sup> and Jeff Dean<sup>1</sup>

<sup>1</sup>Google Brain, {noam,azalia,andydavis,qvl,geoffhinton,jeff}@google.com <sup>2</sup>Jagiellonian University, Cracow, krzysztof.maziarz@student.uj.edu.pl

#### ABSTRACT

The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.
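The gating idea in the abstract can be illustrated with a minimal NumPy sketch. It is a simplification under stated assumptions: the gate is a plain top-k softmax (the noise term and load-balancing losses of the paper are omitted), the experts are single `tanh` layers, and all names (`top_k_gating`, `moe_forward`) and sizes are hypothetical.

```python
import numpy as np

def top_k_gating(x, w_gate, k):
    """Softmax over the k largest gate logits, zero weight for all other experts."""
    logits = x @ w_gate                        # one logit per expert
    top = np.argsort(logits)[-k:]              # indices of the k largest logits
    exp = np.exp(logits[top] - logits[top].max())
    gates = np.zeros_like(logits)
    gates[top] = exp / exp.sum()
    return gates, top

def moe_forward(x, w_gate, experts, k=2):
    """Evaluate only the k selected experts and mix their outputs."""
    gates, top = top_k_gating(x, w_gate, k)
    y = sum(gates[i] * np.tanh(x @ experts[i]) for i in top)
    return y, gates, top

rng = np.random.default_rng(0)
d, n_experts = 8, 16                           # toy sizes, chosen for illustration
x = rng.normal(size=d)
w_gate = rng.normal(size=(d, n_experts))
experts = rng.normal(size=(n_experts, d, d))   # one weight matrix per expert

y, gates, active = moe_forward(x, w_gate, experts, k=2)
print("active experts:", sorted(active.tolist()), "of", n_experts)
```

Only 2 of the 16 expert matrices are touched per example, which is why parameter count can grow much faster than compute.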



#### **Network of Networks**

**137 billion free parameters !** 



### Is Scalable ML a HPC Problem?

- In terms of compute needed (YES)
- Typical HPC problem setting: communication bound = non-trivial parallelization (YES)
- I/O bound (New to HPC)



# Success in Deep Learning is driven by compute power:

 $\rightarrow$  # FLOP needed to compute the leading model is

- ~ doubling every 3.5 months !
- $\rightarrow$  increase since 2012: factor ~300,000 !
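As a quick consistency check of the two numbers on this slide, the following sketch converts the quoted factor into a number of doublings and the time span this implies. Only the two slide values are used; everything else is arithmetic.

```python
import math

factor = 300_000         # compute increase quoted on the slide
doubling_months = 3.5    # doubling period quoted on the slide

doublings = math.log2(factor)
months = doublings * doubling_months
print(f"{doublings:.1f} doublings, {months:.0f} months (~{months / 12:.1f} years)")
```

Roughly 18 doublings, or a little over five years, which matches the span from 2012 named on the slide.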



AlexNet to AlphaGo Zero: A 300,000x Increase in Compute



# Impact on HPC (Systems)

- New HPC Systems
  - Like ORNL "Summit"
  - Power 9
  - ~30k NVIDIA Volta GPUs
  - New storage hierarchies
- New Users (=new demands)
- Still limited resources



https://www.nextplatform.com/2018/03/28/a-first-look-at-summit-supercomputer-application-performance/



# Case Study: Training DNNs

- **Overview: distributed parallel training of DNNs**
- Limits of Scalability
  - **Limitation I:** Communication Bounds
  - Limitation II: Skinny Matrix Multiplication
  - Limitation III: Data I/O





### **Deep Neural Networks** In a Nutshell



#### At an abstract level, DNNs are:

- directed (acyclic) graphs
  - *Nodes* are compute entities (=Layers)
  - *Edges* define the data flow through the graph

#### Inference / Training

Forward feed of data through the network
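The graph view can be made concrete with a small sketch: layers are nodes, edges list each node's inputs, and a forward pass evaluates the nodes in topological order. The toy graph, layer names, and weights are hypothetical; `graphlib` requires Python 3.9 or newer.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
import numpy as np

# Hypothetical toy graph: each node maps to the nodes it reads from (its input edges).
graph = {"input": [], "fc1": ["input"], "fc2": ["input"], "add": ["fc1", "fc2"]}

rng = np.random.default_rng(0)
w1, w2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))

# Nodes are compute entities (layers); here simple callables.
layers = {
    "input": lambda: np.ones(4),
    "fc1": lambda x: np.maximum(x @ w1, 0.0),   # linear layer + ReLU
    "fc2": lambda x: np.maximum(x @ w2, 0.0),
    "add": lambda a, b: a + b,                  # merge of two branches
}

def forward(graph, layers):
    """Forward feed: evaluate every node once all of its inputs are available."""
    outputs = {}
    for node in TopologicalSorter(graph).static_order():
        inputs = [outputs[p] for p in graph[node]]
        outputs[node] = layers[node](*inputs)
    return outputs

out = forward(graph, layers)
print(out["add"])
```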



### **Deep Neural Networks** In a Nutshell

Common intuition






### **Training Deep Neural Networks** The Underlying Optimization Problem

Computed via **Back Propagation** Algorithm:

1. Feed forward and compute activations
2. Compute error by layer
3. Compute derivative by layer

$$\begin{aligned} &\delta_i^{(n_l)} = \frac{\partial}{\partial z_i^{(n_l)}} \;\; \frac{1}{2} \|y - h_{W,b}(x)\|^2 = -(y_i - a_i^{(n_l)}) \cdot f'(z_i^{(n_l)}) \\ &\frac{\partial}{\partial W_{ij}^{(l)}} J(W,b;x,y) = a_j^{(l)} \delta_i^{(l+1)} \end{aligned}$$

Minimize Loss-Function via gradient descent (high dimensional and NON CONVEX!)

$$J(W,b) = \left[\frac{1}{m}\sum_{i=1}^m J(W,b;x^{(i)},y^{(i)})\right] + \frac{\lambda}{2}\sum_{l=1}^{n_l-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}} \left(W_{ji}^{(l)}\right)^2$$
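The three back propagation steps can be sketched for a tiny two-layer network and a single example, using the squared-error loss from the formulas on this slide. Assumptions not on the slide: a sigmoid activation for $f$, the hidden-layer recursion $\delta^{(l)} = (W^{(l)})^T \delta^{(l+1)} \cdot f'(z^{(l)})$, no weight decay ($\lambda = 0$), and toy layer sizes. A finite-difference check confirms the analytic gradient.

```python
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))        # sigmoid activation (assumed)

def f_prime(z):
    s = f(z)
    return s * (1.0 - s)

def loss(W1, b1, W2, b2, x, y):
    a2 = f(W1 @ x + b1)
    a3 = f(W2 @ a2 + b2)
    return 0.5 * np.sum((y - a3) ** 2)     # 1/2 * ||y - h(x)||^2

def backprop(W1, b1, W2, b2, x, y):
    # 1. feed forward and compute activations
    z2 = W1 @ x + b1
    a2 = f(z2)
    z3 = W2 @ a2 + b2
    a3 = f(z3)
    # 2. error term by layer, output layer first
    delta3 = -(y - a3) * f_prime(z3)
    delta2 = (W2.T @ delta3) * f_prime(z2)
    # 3. derivative by layer: dJ/dW_ij^(l) = a_j^(l) * delta_i^(l+1)
    return np.outer(delta2, x), np.outer(delta3, a2)

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.uniform(size=2)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
gW1, gW2 = backprop(W1, b1, W2, b2, x, y)

# Central finite difference on one weight as a sanity check.
eps = 1e-6
W2p, W2m = W2.copy(), W2.copy()
W2p[0, 1] += eps
W2m[0, 1] -= eps
numeric = (loss(W1, b1, W2p, b2, x, y) - loss(W1, b1, W2m, b2, x, y)) / (2 * eps)
print(abs(numeric - gW2[0, 1]))
```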



# **Optimization Problem**

### **By Stochastic Gradient Descent (SGD)**

1. Initialize weights W at random
2. Take small random subset X (=batch) of the train data
3. Run X through network (forward feed)
4. Compute Loss
5. Compute Gradient
6. Propagate backwards through the network
7. Update W

Repeat 2-7 until convergence
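The SGD loop can be sketched in a few lines of NumPy. To keep it self-contained, a single linear layer on synthetic, noise-free data stands in for the DNN, so the "propagate backwards" step collapses into one gradient expression. Learning rate, batch size, and the fixed step budget are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))
w_true = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y_train = X_train @ w_true                     # synthetic, noise-free targets

w = rng.normal(size=5)                         # initialize weights at random
lr, batch_size = 0.1, 32                       # assumed hyper-parameters

for step in range(500):                        # fixed budget instead of a convergence test
    idx = rng.choice(len(X_train), size=batch_size, replace=False)
    X, y = X_train[idx], y_train[idx]          # small random subset (= batch)
    pred = X @ w                               # forward feed
    loss = 0.5 * np.mean((pred - y) ** 2)      # compute loss
    grad = X.T @ (pred - y) / batch_size       # gradient (backward pass of a linear layer)
    w -= lr * grad                             # update W

print("final loss:", loss)
```

Each update depends on the weights produced by the previous one, which is the sequential dependency that makes parallelization hard.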





### **Common approaches to parallelize SGD for DL**

#### Parallelization of SGD is very hard: it is an inherently sequential algorithm

- 1. Start at some state **t** (point in a billion dimensional space)
- 2. Introduce **t** to data batch **d1**
- 3. Compute an update (based on the objective function)
- 4. Apply the update  $\rightarrow$  **t+1**

How to gain Speedup ?

- Make faster updates
- Make larger updates
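One common way to "make larger updates" is synchronous data-parallel SGD: the batch is split across workers, each computes a local gradient, and the average is applied as one update. The sketch below simulates the workers in a single process and is a generic illustration, not necessarily the scheme shown on the original slides. It checks that the averaged gradient equals the full-batch gradient when all shards have equal size.

```python
import numpy as np

def gradient(w, X, y):
    """Mean squared-error gradient of a linear model (stand-in for a DNN)."""
    return X.T @ (X @ w - y) / len(X)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(64, 5)), rng.normal(size=64)
w = rng.normal(size=5)

n_workers = 4                                   # simulated workers
shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
local_grads = [gradient(w, Xs, ys) for Xs, ys in shards]

# In a real cluster this average would be an allreduce over the network.
avg_grad = np.mean(local_grads, axis=0)
w_next = w - 0.1 * avg_grad

print(np.allclose(avg_grad, gradient(w, X, y)))
```

Every step therefore moves one gradient, which has the size of the model, across the network.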










### **Limitation I** Distributed SGD is heavily Communication Bound

# Gradients have the same size as the model

- Model size can be hundreds of MB
- Iteration time (GPU) <1s
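A back-of-the-envelope sketch shows why these two numbers collide. It assumes a ring allreduce, where each worker sends about 2(N-1)/N times the gradient size per iteration, a 250 MB model, and a 0.5 s iteration. The model size and iteration time are illustrative picks within the ranges on the slide, not measurements.

```python
def ring_allreduce_bytes_per_worker(model_bytes, n_workers):
    """Bytes each worker sends in a standard ring allreduce of one gradient."""
    return 2 * (n_workers - 1) / n_workers * model_bytes

model_bytes = 250e6     # assumed: 250 MB model ("hundreds of MB")
iteration_s = 0.5       # assumed: iteration time below 1 s

for n in (2, 8, 64):
    sent = ring_allreduce_bytes_per_worker(model_bytes, n)
    gb_per_s = sent / iteration_s / 1e9
    print(f"{n:3d} workers: {gb_per_s:.2f} GB/s sustained per worker")
```

Under these assumptions every worker needs close to 1 GB/s of sustained bandwidth in each direction just to keep up with the compute.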





# **Communication Bound**

**Experimental Evaluation** 





# **Solving the Communication Bottleneck**

**Solutions in Hardware: example NVLink** 



#### NVIDIA DGX-1 / HGX-1

- 8x P100
- DGX-1: NVLink between all GPUs

NVLink spec: ~40GB/s (single direction)

NVLink II (Volta): ~75GB/s (single direction)

PCIe v3 16x: ~15GB/s



Diagram DGX-1 by NVIDIA

### **Solving the Communication Bottleneck Solutions in Hardware: NVLink Benchmark**



#### **IBM Minsky**

- 4x P100
- 2x10 core Power8 (160 hw threads)
- NVLink between all components

NVLink spec: ~40GB/s (single direction)



### **Limitation I** Distributed SGD is heavily Communication Bound

#### How to solve this in distributed environments?



### **Limitation II** Mathematical Problems – aka the "Large Batch Size Problem"

#### **Recall:**

speedup comes from pure data-parallelism

 $\rightarrow$  splitting the batch over the workers





### "Small Batch Size Problem"

#### **Data Parallelization over the Batch Size**

Problem: Batch size decreasing with distributed scaling

Hard Theoretic Limit: b > 0

- → GoogLeNet: No Scaling beyond 32 Nodes
- → AlexNet: Limit at 256 Nodes

External parallelization hurts the internal (BLAS / cuBLAS) parallelization even earlier.

In a nutshell: for skinny matrices there is simply not enough work for efficient internal parallelization over many threads.
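The hard limit follows directly from splitting a fixed global batch over the nodes. The sketch below assumes global batch sizes of 32 for GoogLeNet and 256 for AlexNet. These values are an assumption chosen because they are consistent with the node limits on this slide; the slide itself does not state them.

```python
def local_batch_size(global_batch, n_nodes):
    """Samples per node when a fixed global batch is split evenly."""
    return global_batch // n_nodes

assumed_global_batch = {"GoogLeNet": 32, "AlexNet": 256}   # assumption, see lead-in

for name, global_batch in assumed_global_batch.items():
    for n_nodes in (8, 32, 256, 512):
        b = local_batch_size(global_batch, n_nodes)
        status = "ok" if b > 0 else "no work left (b = 0)"
        print(f"{name:9s} nodes={n_nodes:3d} local batch={b:3d} {status}")
```

Long before b reaches 0, the per-node matrices become so skinny that the GPU's internal parallelism is underused.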



# "Small Batch Size Problem"

### **Data Parallelization over the Batch Size**

#### **Computing Fully Connected Layers:**

#### Single dense Matrix Multiplication

| Layer           | # operations | matrix sizes              |
|-----------------|--------------|---------------------------|
| Fully Connected | 1            | $b \times I * I \times O$ |
| Convolutional   | b            | $C \times I * I \times Z$ |
| Softmax         | b            | $I \times 1 * 1 \times 1$ |

Definitions:

- I: Input size from top layer
- O: Output size of this layer
- b: local Batch size (train or validation)
- C: Number of filters
- c: Number of input channels (RGB image: $c = 3$)
- P: Patch size (i.e. pixel)
- k: kernel size
- Z: Effective size after kernel application. For convolution $Z := \left(\sqrt{P} - \lfloor k/2 \rfloor\right)^2$

TABLE III: Size and number of the matrix multiplications (sgemm) per forward pass for selected layers.
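For the fully connected row of the table, the effect of a shrinking local batch b can be estimated by the arithmetic intensity of the $b \times I * I \times O$ product: floating point operations per byte moved. The sketch assumes single precision, counts each matrix once, and uses I = O = 4096 as an illustrative layer size. None of these values come from the slides.

```python
def fc_gemm_intensity(b, I, O, bytes_per_value=4):
    """FLOPs per byte for the (b x I) * (I x O) product of a fully connected layer."""
    flops = 2 * b * I * O                                   # one multiply + one add per term
    bytes_moved = bytes_per_value * (b * I + I * O + b * O)  # inputs, weights, outputs
    return flops / bytes_moved

I, O = 4096, 4096        # assumed layer size
for b in (256, 32, 4, 1):
    print(f"b={b:3d}: {fc_gemm_intensity(b, I, O):7.2f} FLOP/byte")
```

With b = 1 the product degenerates to a matrix-vector multiplication at roughly 0.5 FLOP/byte, which is memory bound on any current GPU.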





# **Experimental Evaluation**

**Increasing the Batch Size** 

Solution proposed in literature:

**Increase Batch size** 

**But:** 

Linear speedup against the original problem only if we can reduce the number of iterations accordingly

This leads to loss of accuracy






# The "large batch" Problem

**Central Questions** 

Why is this happening?

Is it dependent on the topology / other parameters?

How can it be solved?

 $\rightarrow$  large batch size SGD would solve most scalability problems!






# The "large batch" Problem

What is causing this effect? [theoretical and not so theoretical explanations]

- $\rightarrow$  The "bad minimum"
- $\rightarrow$  gradient variance / coupling of learning rate and batch size



Figure 2: Training and testing accuracy for SB and LB methods as a function of epochs.



### The "bad minimum"

Theory: a larger batch causes a decrease in gradient variance, causing convergence to local minima...



Figure 1: A Conceptual Sketch of Flat and Sharp Minimizers (Y-axis indicates value of the loss function and X-axis indicates the weights)



 $\rightarrow$  empirical evaluation shows high correlation of sharp minima and weak generalization
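The first half of this argument, that larger batches reduce gradient variance, is easy to check empirically. The sketch below measures the variance of mini-batch gradients for a linear model on synthetic data. Model, data, and batch sizes are illustrative assumptions, and the sketch says nothing about sharp versus flat minima.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ np.ones(5) + rng.normal(size=10_000)    # noisy synthetic targets
w = np.zeros(5)                                 # fixed point at which gradients are sampled

def batch_gradient(idx):
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

def gradient_variance(batch_size, trials=500):
    """Total variance of the mini-batch gradient over many random batches."""
    grads = np.array([
        batch_gradient(rng.choice(len(X), size=batch_size, replace=False))
        for _ in range(trials)
    ])
    return grads.var(axis=0).sum()

variances = {b: gradient_variance(b) for b in (8, 64, 512)}
for b, v in variances.items():
    print(f"batch size {b:3d}: gradient variance {v:.4f}")
```

The variance falls roughly in proportion to 1/b, so a large-batch update carries far less of the noise that small-batch SGD injects into each step.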



### **Limitation II** Mathematical Problems – aka the "Large Batch Size Problem"

**Problem not fully understood** 

No general solutions (yet) !

Do we need novel optimization methods ?!



### Limitation III Data I/O

#### Huge amounts of training data need to be streamed to the GPUs

- Usually 50x to 100x the training data set
- Random sampling (!)
- Latency + bandwidth competing with optimization communication





# **Distributed I/O**

**Distributed File Systems are another Bottleneck !** 

- Network bandwidth is already exceeded by the SGD communication
- Worst possible file access pattern:
  - Access many small files at random

This problem already has effects on local multi-GPU computations

E.g. on DGX-1 or Minsky, a single SSD (~0.5 GB/s) is too slow to feed >= 4 GPUs

-> solution: RAID 0 with 4 SSDs
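The mismatch can be estimated with simple arithmetic: required read bandwidth is the number of GPUs times samples per second per GPU times bytes per sample. Only the ~0.5 GB/s SSD figure comes from the slide. The throughput of 1000 samples/s per GPU and 150 kB per sample are illustrative assumptions, and the RAID 0 figure assumes ideal striping.

```python
def required_read_gb_per_s(n_gpus, samples_per_s_per_gpu, bytes_per_sample):
    """Sustained read bandwidth needed to keep all GPUs busy."""
    return n_gpus * samples_per_s_per_gpu * bytes_per_sample / 1e9

ssd_gb_per_s = 0.5                              # single SSD, from the slide
demand = required_read_gb_per_s(
    n_gpus=4,
    samples_per_s_per_gpu=1000,                 # assumed training throughput
    bytes_per_sample=150e3,                     # assumed average sample size
)
raid0_gb_per_s = 4 * ssd_gb_per_s               # ideal striping over 4 SSDs

print(f"demand {demand:.2f} GB/s, single SSD {ssd_gb_per_s} GB/s, RAID 0 {raid0_gb_per_s} GB/s")
```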



# **Distributed I/O**

#### **Distributed File Systems are another Bottleneck !**



Figures: Compute time by layer for AlexNet (GPU + cuDNN).

- Results shown for SINGLE node access to a Lustre working directory (HPC Cluster, FDR-Infiniband)
- Results shown for SINGLE node, data on local SSD



## **IV** Towards Scalable Solutions

### **Distributed Parallel Deep Learning with HPC tools + Mathematics**







## CaffeGPI:



## Distributed Synchronous SGD Parallelization With asynchronous communication overlay

Better scalability using asynchronous PGAS programming of optimization algorithms with GPI-2.

Direct RDMA access to main and GPU memory instead of message passing

Optimized data-layers for distributed File systems





## **CC-HPC Current Projects**

## **CaffeGPI: Approaching DNN Communication**

### CaffeGPI: Distributed Synchronous SGD Parallelization



- Communication Reduction Tree
- Communication Quantization
- Communication Overlay
- Direct RDMA GPU → GPU
  - Based on GPI
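The idea behind a communication reduction tree can be sketched generically: gradients are summed pairwise, so N workers need only about log2(N) communication rounds instead of N - 1 sequential ones. The following single-process simulation is a textbook binary tree and makes no claim about how CaffeGPI implements its tree. The function name `tree_reduce` is hypothetical.

```python
import numpy as np

def tree_reduce(grads):
    """Sum per-worker gradients pairwise; returns (total, number of rounds)."""
    bufs = [g.copy() for g in grads]
    n, stride, rounds = len(bufs), 1, 0
    while stride < n:
        # In each round, worker i receives the partial sum of worker i + stride.
        for i in range(0, n - stride, 2 * stride):
            bufs[i] += bufs[i + stride]
        stride *= 2
        rounds += 1
    return bufs[0], rounds

rng = np.random.default_rng(0)
grads = [rng.normal(size=10) for _ in range(8)]  # one gradient per simulated worker
total, rounds = tree_reduce(grads)
print("rounds:", rounds, "correct:", np.allclose(total, np.sum(grads, axis=0)))
```

All transfers within one round are independent, so on a real cluster they can run concurrently.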



Optimized distributed data-layer





## CaffeGPI: Open Source



https://github.com/cc-hpc-itwm/CaffeGPI





## Projects: Low Cost Deep Learning Setup: Build on CaffeGPI

**Price: <8000 EUR standard components** 



#### Specs:

- 32 GB GPU Mem
- 64 GB PGAS Mem
- 2TB BeeGFS for Train Data
- GPU interconnect: PCIe



## CaffeGPI: Benchmarks: Single Node



Specs:

4x K80 GPU (PCIe Interconnect), Cuda 8, cuDNN 5.1

Topology: AlexNet, Batch Size (global): 1024



## Low Cost Deep Learning Setup: Build on CaffeGPI



Specs: Topology: GoogLeNet, Cuda 8, cuDNN 5.1, CaffeNV 16.4, Batch Size/Node: 64



## CaffeGPI: Benchmarks



#### Specs:

1x K80 GPU (per node), Cuda 8, cuDNN 5.1

Topology: GoogLeNet, Batch Size: 64 per node



### **New Project: Low Cost Deep Learning Cluster**

GPU Cluster for Fraunhofer Consortium @ITWM

- Low cost Hardware
  - Consumer GPUs
  - Novel AMD architecture
  - Hosting Cost per GPU ~ 1.25 k EUR. Compared to DGX-1 ~ 10k
- Fast Interconnect and Data I/O
  - Parallel FS with local NVMe
- Open Source Multi-User Management
  - Reservation system

  - Scheduling
  - Custom Containers
  - Web Interface










## **Project Carme**



### An open source software stack for multi-user GPU clusters

**Carme** (/ˈkɑːrmiː/ KAR-mee; Greek: Κάρμη) is a **Jupiter** moon, also giving the name for a **Cluster** of Jupiter moons (the carme group).

Or in our case:

an open source framework to manage resources for multiple users running **Jupyter** notebooks on a **Cluster** of compute nodes.



## **Project Carme**

### An open source software stack for multi-user GPU clusters

**Common problems in GPU-Cluster operation:** 

- Interactive, secure multi user environment
  - ML and Data Science users want interactive GUI access to compute resources
- Resource Management
  - How to assign (GPU) resources to competing users?
    - User management
    - Accounting
    - Job scheduling
    - Resource reservation
- Data I/O
  - Get user data to compute nodes (I/O Bottleneck)
- Maintenance
  - Meet (fast changing and diverse) software demands of users




## **Project Carme**

An open source software stack for multi-user GPU clusters

#### Carme core idea:

- Combine established open source ML and DS tools with HPC back-ends
  - Use containers
    - (for now) Docker
  - Use Jupyter Notebooks as main web based GUI-Frontend
    - All web front-end (OS independent, no installation on user side needed)
  - Use HPC job management and scheduler
    - SLURM
  - Use HPC data I/O technology
    - ITWM's BeeGFS
  - Use HPC maintenance and monitoring tools












### **HP-DLF**

## **High Performance Deep Learning Framework**



- Transparent
- Generic
- Auto-parallel
- Elastic
- Automatic data flow
- Automatic Hardware selection
- Portable
- Monitoring
- Simulation
- New optimization methods



Bundesministerium für Bildung und Forschung











### **Optimization for HP-DLF**

### Our asynchronous optimization algorithm



- Sparse communication for multi model optimization
- Lower demands on the communication bandwidth
- Superior convergence

Janis Keuper and Franz-Josef Pfreundt: Asynchronous Parallel Stochastic Gradient Descent: A Numeric Core for Scalable Distributed Machine Learning Algorithms. Fraunhofer ITWM, Competence Center High Performance Computing, Kaiserslautern, Germany.

Figure 1 (from the paper): Convergence speed of different gradient descent methods used to solve K-Means clustering with K = 100 on a 10-dimensional target space parallelized over 1024 cores on a cluster. Our novel ASGD method outperforms communication free SGD [17] and map-reduce based BATCH [5] optimization by the order of magnitudes (See section 5.5 for the details of the experimental setup).



## **V** A Look at the Near Future

#### For HPC:

- New Hardware Accelerators
- New Interconnects
- New Architectures

#### From ML

- Models are still growing!
- Learning to learn



# **Automatic Design**

**Of Deep Neural Networks** 



**Basically, a graph optimization problem :** 

### Select Node Types

- And their Meta-Parameters
- **Connect Edges** to define the data flow through the graph

### **Optimization target: minimize test error**

### **Problems:**

- Huge and difficult search space
- Each iteration requires training of a DNN



# Automatic Design

### **Current Approaches: Reinforcement Learning**

| Model                           | Error rate | # params ( $\times 10^6$ ) |
|---------------------------------|------------|----------------------------|
| Maxout [7]                      | 9.38       | -                          |
| Network in Network [19]         | 8.81       | -                          |
| VGG [27] <sup>1</sup>           | 7.94       | 15.2                       |
| ResNet [10]                     | 6.61       | 1.7                        |
| MetaQNN [1] <sup>2</sup>        | 9.09       | 3.7                        |
| Neural Architecture Search [39] | 3.84       | 32.0                       |
| CGP-CNN (ConvSet)               | 6.75       | 1.52                       |
| CGP-CNN (ResSet)                | 5.98       | 1.68                       |

#### **Evaluation on CIFAR-10:**

Better than "State of the Art" (hand designed) performance

- After ~12500 iterations
- Compute time: ~ 10000 GPU days

Barret Zoph, Quoc V. Le (Google Brain): Neural Architecture Search with Reinforcement Learning. Under review as a conference paper at ICLR 2017.

Figure 1: An overview of Neural Architecture Search.
|                                                        | rigote i. An overview of Acutal Alcinecture search.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| tures                                                  | aper presents Neural Architecture Search, a gradient-based method for finding good architec-<br>see Figure[]). Our work is based on the observation that the structure and connectivity of a<br>network can be typically specified by a variable-length string. It is therefore possible to use                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| *W                                                     | ork done as a member of the Google Brain Residency program (g.co/brainresidency)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |

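The search loop described in the excerpt above (sample an architecture with some probability, score it, then scale the gradient of that probability by the score) can be sketched in a few lines. This is only an illustration under loud assumptions, not the paper's implementation: a plain table of per-layer logits stands in for the RNN controller, `toy_reward` is a hypothetical replacement for training a child network, and `CHOICES`, `NUM_LAYERS`, and all function names are invented for this sketch.

```python
import math
import random

CHOICES = [16, 32, 64]  # hypothetical layer widths the controller can pick
NUM_LAYERS = 3          # fixed length here; the excerpt allows variable-length strings


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_architecture(logits, rng):
    """Sample one choice index per layer from the controller's distribution."""
    arch = []
    for layer_logits in logits:
        probs = softmax(layer_logits)
        arch.append(rng.choices(range(len(CHOICES)), weights=probs)[0])
    return arch


def toy_reward(arch):
    """Assumed stand-in for 'train a child network and measure its accuracy'.

    Returns the fraction of layers that picked width 64.
    """
    return sum(1.0 for i in arch if CHOICES[i] == 64) / len(arch)


def reinforce_step(logits, arch, reward, baseline, lr):
    """Policy-gradient update: scale grad log p(arch) by (reward - baseline)."""
    advantage = reward - baseline
    for layer_logits, chosen in zip(logits, arch):
        probs = softmax(layer_logits)
        for k in range(len(layer_logits)):
            # d/d logit_k of log softmax(logits)[chosen]
            grad_log_p = (1.0 if k == chosen else 0.0) - probs[k]
            layer_logits[k] += lr * advantage * grad_log_p


def search(steps=3000, lr=0.1, seed=0):
    """Run the sample / score / update loop and return the final logits."""
    rng = random.Random(seed)
    logits = [[0.0] * len(CHOICES) for _ in range(NUM_LAYERS)]
    baseline = 0.0
    for _ in range(steps):
        arch = sample_architecture(logits, rng)
        reward = toy_reward(arch)
        reinforce_step(logits, arch, reward, baseline, lr)
        baseline = 0.9 * baseline + 0.1 * reward  # moving-average baseline
    return logits
```

Calling `search()` and applying `softmax` to each row of the result shows the probability mass shifting towards the width that the toy reward favours. In a real setting the reward call is the expensive part, since each sample means training a network, which is why this kind of search is a natural fit for distributed compute.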


## DeToL **Deep Topology Learning**



- Genetic Algorithms
- **Reinforcement Learning**
- Early Stopping
- Graph Embedding
- Meta-Learning
- Pruning



Bundesministerium für Bildung und Forschung

On application size problems

Data basis generation





**PSIORI**  $\Pr(E|H_p) \times \Pr(H_p)$



## Discussion



