[P] NNSplit: fast and robust sentence splitting[reddit]/r/MachineLearning

I just released NNSplit, a library for fast and robust sentence splitting. NNSplit is:

robust: does not depend on proper punctuation and casing to split text into sentences.

portable: can be used from JavaScript, Rust, and Python.

small: uses a character-level LSTM, so the weights are very small (~350 kB), which makes it easy to run in the browser.

fast: can split 100k paragraphs from Wikipedia in ~50 seconds (on an RTX 2080 Ti and an i5-8700K).

E. g. "This is a test This is another test." is split correctly into ["This is a test", "This is another test."] although punctuation is missing.

You can interactively try the model in your browser via the Browser demo.

Sentence splitting is obviously kind of a niche problem. I developed NNSplit for another project where I need sentence-level input, but the data is unstructured text. It could, however, also be used for feature engineering, or for making sure that the output of a neural net is properly punctuated.
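To make the idea concrete, here is a minimal sketch of how a character-level splitter operates: a model assigns each position a probability of being a sentence boundary, and the text is cut wherever that probability crosses a threshold. The scoring function below is a hypothetical hand-written stand-in, not NNSplit's actual LSTM.

```python
# Sketch of character-level sentence splitting. A real model (e.g. the
# LSTM in NNSplit) would emit a learned boundary probability per
# character; here a dummy heuristic stands in for it.

def boundary_scores(text):
    """Hypothetical stand-in for a per-character boundary model.

    Scores the position *after* each character: high after
    sentence-final punctuation, medium at a space followed by a
    capital letter (which covers missing punctuation).
    """
    scores = []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in ".!?":
            scores.append(0.95)
        elif ch == " " and nxt.isupper():
            scores.append(0.6)
        else:
            scores.append(0.05)
    return scores

def split_sentences(text, threshold=0.5):
    """Cut the text wherever the boundary score exceeds the threshold."""
    scores = boundary_scores(text)
    sentences, start = [], 0
    for i, s in enumerate(scores):
        if s > threshold:
            chunk = text[start:i + 1].strip()
            if chunk:
                sentences.append(chunk)
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("This is a test This is another test."))
# → ['This is a test', 'This is another test.']
```

The learned model replaces the heuristic with per-character probabilities, which is what makes the approach robust to missing punctuation and casing.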

2020-02-25: Freeze Discriminator: A Simple Baseline for Fine-tuning GANs https://arxiv.org/abs/2002.10964v1

We demonstrate that this simple baseline outperforms the previous methods under various architectures and datasets.

Generative adversarial networks (GANs) have shown outstanding performance on
a broad range of computer vision problems, but often require enormous training
data and computational resources. Several works propose a transfer learning
scheme to handle this issue, but they are prone to overfitting or too
restrictive to learn the distribution shift. In this paper, we find that simply
fine-tuning the networks while freezing the lower layers of the discriminator
surprisingly works well. The simple baseline, freeze $D$, significantly
outperforms the prior methods in both unconditional and conditional GANs, under
StyleGAN and SNGAN-projection architectures and Animal Face, Anime Face, Oxford
Flower, CUB-200-2011, and Caltech-256 datasets. Code and results are available
in https://github.com/sangwoomo/freezeD.
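The recipe is simple enough to sketch framework-agnostically: order the discriminator's layers from input to output, freeze the first k, and fine-tune only the rest. The layer names below are illustrative, not the paper's actual architecture.

```python
# Framework-agnostic sketch of the freeze-D recipe: during fine-tuning,
# only the discriminator's upper layers receive gradient updates.
# Layer names and the trainable flag are illustrative, not the paper's API.

def freeze_lower_layers(layers, n_freeze):
    """Mark the first `n_freeze` layers as frozen, the rest as trainable.

    `layers` is an ordered list of layer names, lowest (closest to the
    input image) first. Returns a dict: name -> trainable?
    """
    return {name: i >= n_freeze for i, name in enumerate(layers)}

# A StyleGAN-like discriminator, input-side blocks first.
disc_layers = ["from_rgb", "block_64", "block_32", "block_16",
               "block_8", "block_4", "final_fc"]

trainable = freeze_lower_layers(disc_layers, n_freeze=4)
# Only these layers would be handed to the optimizer:
to_optimize = [name for name, t in trainable.items() if t]
print(to_optimize)
# → ['block_8', 'block_4', 'final_fc']
```

In PyTorch this would amount to setting `requires_grad = False` on the frozen layers' parameters and passing only the remaining parameters to the optimizer.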

[R] The Break-Even Point on Optimization Trajectories of Deep Neural Networks[reddit]/r/MachineLearning

We (finally) released the camera-ready version of our ICLR spotlight paper https://arxiv.org/abs/2002.09572. We argue that there is a "break-even point" on the optimization trajectory, which is encountered early in training.

Abstract: The early phase of training of deep neural networks is critical for their final performance. In this work, we study how the hyperparameters of stochastic gradient descent (SGD) used in the early phase of training affect the rest of the optimization trajectory. We argue for the existence of the "break-even" point on this trajectory, beyond which the curvature of the loss surface and noise in the gradient are implicitly regularized by SGD. In particular, we demonstrate on multiple classification tasks that using a large learning rate in the initial phase of training reduces the variance of the gradient, and improves the conditioning of the covariance of gradients. These effects are beneficial from the optimization perspective and become visible after the break-even point. Complementing prior work, we also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers. In short, our work shows that key properties of the loss surface are strongly influenced by SGD in the early phase of training. We argue that studying the impact of the identified effects on generalization is a promising future direction.
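One of the quantities the abstract refers to, the conditioning of the covariance of gradients, can be measured directly. The toy below does this for a 2-parameter linear model with badly scaled features; it is only an illustration of the measurement, not the paper's deep-network setup.

```python
import random

def per_example_grads(w, data):
    """Per-example gradients of the squared loss 0.5*(w.x - y)^2
    for a 2-parameter linear model."""
    grads = []
    for (x0, x1), y in data:
        err = w[0] * x0 + w[1] * x1 - y
        grads.append((err * x0, err * x1))
    return grads

def gradient_covariance_condition(grads):
    """Condition number of the 2x2 uncentered second-moment matrix of
    the per-example gradients, one quantity tracked along the trajectory."""
    n = len(grads)
    a = sum(g0 * g0 for g0, _ in grads) / n
    b = sum(g0 * g1 for g0, g1 in grads) / n
    d = sum(g1 * g1 for _, g1 in grads) / n
    # Closed-form eigenvalues of the symmetric 2x2 matrix [[a, b], [b, d]].
    mid = (a + d) / 2.0
    rad = ((a - d) ** 2 / 4.0 + b * b) ** 0.5
    return (mid + rad) / (mid - rad)

rng = random.Random(0)
# Badly scaled inputs: the second feature is 5x larger, which makes
# the gradient covariance ill-conditioned.
data = [((rng.gauss(0, 1), rng.gauss(0, 5)), rng.gauss(0, 1)) for _ in range(500)]
cond = gradient_covariance_condition(per_example_grads((0.3, -0.1), data))
print(round(cond, 1))
```

The paper's claim is that the hyperparameters used before the break-even point determine how well-conditioned this matrix (and the loss-surface curvature) ends up being for the rest of training.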

Symbolic math without explicit symbol manipulation. As I've said: replace symbols by vectors and logic by continuous (or differentiable) functions. [twitter]

Hey everyone! For any python & pandas users out there, here's a free tool to visualize your dataframes[reddit]/r/datasets

- Here's a demo of all the functionality: demo video
- GitHub is located here
- Live interactive demo here

This is D-Tale ("details of your data"), a free visualizer for pandas data structures. It is currently supported in the following environments:

Python terminals

Jupyter notebooks

hosted notebooks like Kaggle & Google Colab

R terminals, through the use of reticulate

Hope it helps spawn some ideas around visualizations for your data sets! In the end, building your own visualizations is still king, but this gives you a quick way to see whether your data might produce something beautiful before you spend time fully building it out.

[R] Questions implementing "Deep Learning with Importance Sampling"[reddit]/r/MachineLearning

I am implementing the paper "Not All Samples Are Created Equal: Deep Learning with Importance Sampling" in PyTorch (arxiv, Keras implementation).

The paper

The idea of the paper is as follows:

Instead of selecting samples at random, you can train more efficiently by selecting the "hard" samples with higher priority

You can achieve this either by prioritizing proportionally to the loss, which is not very accurate, or by using the gradient norm, which is very expensive to compute

In the paper, they come up with a relatively tight upper bound of the gradient norm, which can be computed fast enough and results in a nice training speed-up. Cool!
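The sampling mechanics behind this idea can be sketched independently of the bound itself: draw samples proportionally to a nonnegative per-sample score, and correct for the bias with weights w_i = 1/(N·p_i) so the weighted gradient estimate stays unbiased. The scores below are made-up numbers standing in for the paper's upper bound on the per-sample gradient norm.

```python
import random

def sampling_probs(scores):
    """Turn nonnegative per-sample scores (e.g. an upper bound on each
    sample's gradient norm) into sampling probabilities."""
    total = sum(scores)
    return [s / total for s in scores]

def sample_batch(scores, batch_size, rng):
    """Draw indices proportionally to `scores` and return the
    bias-correcting weights w_i = 1 / (N * p_i), which keep the
    weighted gradient estimate unbiased for the uniform-average loss."""
    probs = sampling_probs(scores)
    n = len(scores)
    idx = rng.choices(range(n), weights=probs, k=batch_size)
    weights = [1.0 / (n * probs[i]) for i in idx]
    return idx, weights

rng = random.Random(0)
# Hypothetical per-sample scores: "hard" samples get large scores.
scores = [0.1, 0.1, 5.0, 0.1, 2.0, 0.1]
idx, w = sample_batch(scores, batch_size=4, rng=rng)
print(idx, [round(x, 3) for x in w])
```

Note that for any score vector, sum_i p_i * w_i = 1, which is exactly the unbiasedness condition; the paper's contribution is choosing scores that approximate the gradient norm cheaply.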

The problem

The paper is very well written in general, and a full implementation is provided. I used to think this was the clearest paper ever, before I started working on a PyTorch implementation.

Now I realise that:

Some of the details in the paper are not clear (or maybe I'm just not familiar enough with the math behind ML)

The Keras implementation is super heavy, with many levels of inheritance, which makes it hard to follow the flow of the algorithm and actually understand what's going on

[R] Network Randomization: A Simple Technique for Generalization in Deep Reinforcement Learning. By introducing some noise in the feature space rather than in the input space, as is typically done for visual inputs, their agents can generalize better to unseen tasks. CoinRun and DMLab experiments.[reddit]/r/MachineLearning https://arxiv.org/abs/1910.05396

We provide more detailed discussion of an extension to dynamics generalization and of the failure cases of our method in the Appendix.

Deep reinforcement learning (RL) agents often fail to generalize to unseen environments (yet semantically similar to trained agents), particularly when they are trained on high-dimensional state spaces, such as images. In this paper, we propose a simple technique to improve a generalization ability of deep RL agents by introducing a randomized (convolutional) neural network that randomly perturbs input observations. It enables trained agents to adapt to new domains by learning robust features invariant across varied and randomized environments. Furthermore, we consider an inference method based on the Monte Carlo approximation to reduce the variance induced by this randomization. We demonstrate the superiority of our method across 2D CoinRun, 3D DeepMind Lab exploration and 3D robotics control tasks: it significantly outperforms various regularization and data augmentation methods for the same purpose. Code is available at github.com/pokaxpoka/netrand.
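The two ingredients, a freshly sampled random network perturbing each observation and Monte Carlo averaging at inference, can be sketched with a toy random linear map in place of the paper's randomized convolutional layer. Everything here (the perturbation strength, the fixed policy) is illustrative, not the paper's architecture.

```python
import random

def random_perturbation(obs, rng, strength=0.1):
    """Stand-in for the paper's randomized (convolutional) layer: apply a
    freshly sampled near-identity random linear map to the observation.
    A real implementation would re-initialize a conv layer's weights."""
    n = len(obs)
    w = [[(1.0 if i == j else 0.0) + rng.gauss(0.0, strength)
          for j in range(n)] for i in range(n)]
    return [sum(w[i][j] * obs[j] for j in range(n)) for i in range(n)]

def policy(obs):
    """Hypothetical fixed policy head: a score per action."""
    return [sum(obs), sum(x * x for x in obs)]

def mc_inference(obs, m, rng):
    """Monte Carlo inference: average action scores over `m` independently
    randomized copies of the observation, reducing the variance that the
    randomization itself introduces."""
    n_actions = len(policy(obs))
    acc = [0.0] * n_actions
    for _ in range(m):
        out = policy(random_perturbation(obs, rng))
        acc = [a + o for a, o in zip(acc, out)]
    return [a / m for a in acc]

rng = random.Random(0)
obs = [0.5, -0.2, 0.8]
print(mc_inference(obs, m=32, rng=rng))
```

During training the agent sees a different random perturbation every iteration, which is what pushes it toward features that are invariant across randomized environments.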

[R] When Two-Layer ReLU Networks Provably Fail With High Probability[reddit]/r/MachineLearning

Title: Training Two-Layer ReLU Networks with Gradient Descent is Inconsistent

(First author here.) We prove that for certain types of datasets, wide two-layer (Leaky)ReLU networks trained with Kaiming (He et al.) initialization and Gradient Descent on a Least-Squares loss converge to a bad local minimum with high probability. A more detailed explanation with video and plots is provided in this Twitter thread: https://twitter.com/DHolzmueller/status/1232280955731181574

Abstract: We prove that two-layer (Leaky)ReLU networks initialized by e.g. the widely used method proposed by He et al. (2015) and trained using gradient descent on a least-squares loss are not universally consistent. Specifically, we describe a large class of data-generating distributions for which, with high probability, gradient descent only finds a bad local minimum of the optimization landscape. It turns out that in these cases, the found network essentially performs linear regression even if the target function is non-linear. We further provide numerical evidence that this happens in practical situations and that stochastic gradient descent exhibits similar behavior.

How difficult will it be to train an RL agent to play Atari on a virtual game console within a physically realistic simulation VR environment where it also has to learn motor skills to operate a physical game controller and move itself to view the screen from different locations? [twitter]