Highlights From NIPS 2017 And What It Means For Your AI Education

This post is aimed at engineers, AI scientists and other technical readers, and gives an overview of the AI and ML trends I observed at NIPS 2017.

Last week, I attended the 31st Conference on Neural Information Processing Systems (NIPS) in Long Beach, California. It was an event that reflected the hype around AI & ML in every possible way: more than 7,000 attendees, a conference sold out in less than two weeks, overbooked hotels in and around Long Beach, and event venues as big as soccer stadiums. Somebody told me that Google alone was represented by more than 700 employees; I am not totally sure about that number, but it does not sound far off. And yes, as this Bloomberg article points out, NIPS was a huge recruiting event and has most likely changed its character since its early days. But to anyone complaining about the scale, I would simply reply: just be happy that your discipline is receiving so much attention!

[GIF via GIPHY: “Google going to NIPS” - discovered on Facebook]

Top themes

A number of themes ran throughout the conference as the key topics in AI & ML of 2017 in the academic community. I am fully aware, though, that the selection here is likely biased by my personal interests. The following diagram shows the key themes of this article and their “relatedness”. It simplifies heavily but should provide a rough overview.

[Figure: overview diagram of the key themes and their relatedness]

Bayesian Deep Learning and Deep Bayesian Learning

The interplay between Bayesian methods and deep learning is ubiquitous. Clean, well-formulated Bayesian models use deep learning components as “universal function approximators” (Deep Bayesian Learning), and deep neural networks are augmented with elements from probability theory (Bayesian Deep Learning). At NIPS, it is fashionable to claim to “be Bayesian”. In general, Bayesian methods give us full distributions over all possible states and parameters, and thus better uncertainty estimates (compared to the mere “point estimates” of standard DNNs), by using a unified and principled mathematical framework for model building, learning and inference. But from a handful of discussions and panels, it also became clear that we really need to get the model right (and this is hard! see below), otherwise we are really screwed. Further drawbacks of Bayesian approaches are that not all prior knowledge can easily be encoded as joint distributions, and that we have only a limited set of analytical forms to represent conditional distributions.

[Image: Hinton]

Another interesting observation is that some of the aspects bridging deep learning and Bayesian methods are already pretty well established: I thought it was very cool to see that variational inference, for instance, is no longer treated as a fancy, cutting-edge approach for approximating intractable distributions, but rather as a generally established tool.
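
To make the idea concrete, here is a minimal sketch of variational inference, my own toy illustration rather than anything shown at the conference: we fit a Gaussian approximate posterior q(w) over the slope of a one-parameter linear model by maximizing the ELBO with the reparameterization trick (the data, prior and noise level are all made up).

```python
# Minimal mean-field variational inference sketch in PyTorch (toy example).
import torch

torch.manual_seed(0)
x = torch.linspace(-1, 1, 100).unsqueeze(1)
y = 3.0 * x + 0.3 * torch.randn_like(x)                # synthetic data, true slope = 3

# Variational parameters of q(w) = Normal(mu, sigma^2)
mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)
prior = torch.distributions.Normal(0.0, 1.0)           # p(w)

for step in range(2000):
    opt.zero_grad()
    q = torch.distributions.Normal(mu, log_sigma.exp())
    w = q.rsample()                                    # reparameterized sample
    log_lik = torch.distributions.Normal(x * w, 0.3).log_prob(y).sum()
    kl = torch.distributions.kl_divergence(q, prior).sum()
    loss = kl - log_lik                                # negative ELBO
    loss.backward()
    opt.step()

print(f"q(w): mean={mu.item():.2f}, std={log_sigma.exp().item():.2f}")
```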

Disentangled embeddings and representations

... are everywhere. Despite its simplicity, I still like Yoshua Bengio’s mnemonic for explaining disentangling: it is all about finding the right encoder/decoder structure between a “tangled-knot-like”, high-dimensional, complex manifold in data space and the “unrolled”, abstract, disentangled embedding (see figure below).

[Figure: Bengio’s sketch of disentangling]

At the same time, the (competing) goals are to respect information completeness, invariance to things “I don’t care about”, and sufficiency properties. An interesting discussion arose when somebody argued that finding these disentangled factors of variation is not that different from finding statistically independent components (e.g. via Independent Component Analysis, ICA). Even though these approaches are definitely related, independence is not a necessary condition for disentangling; it just makes things nicer. Also, using DNNs or weak-supervision techniques for disentangling captures a much broader range of problems than conventional ICA. Disentangled representations can help facilitate debugging, understanding and interpretability. The focus on disentangled representations really makes me very happy, as this is part of my own research at Uber. More resources in this field can be found on the workshop site.
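
As one concrete (and by now fairly standard) way to nudge a model toward disentangled latents, here is a sketch of a beta-VAE-style objective; this is my own illustration of the general idea, not a method from the workshop, and the function name and beta value are arbitrary.

```python
# Sketch of a beta-VAE style objective: reconstruction term plus a
# beta-weighted KL that pushes the latent posterior toward an isotropic prior.
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, log_var, beta=4.0):
    """Return recon + beta * KL(q(z|x) || N(0, I)) for a Gaussian encoder."""
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + beta * kl
```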

Interpretability and causality

A dedicated symposium was devoted to interpretability in ML, even though “interpretability” is not a well-defined technical term in any sense. From a technical perspective, there are multiple approaches to achieving some level of interpretability. Examples include establishing weak submodularity over image regions (meaning, e.g., that the network can tell us which image regions cause certain activations) or using attention mechanisms that teach the network where to focus. Using disentangled representations (see the previous paragraph) or learning a simpler, linear (and more interpretable) approximate model are other approaches that were discussed. Even though interpretability is considered important, the main consensus is that a better non-interpretable model will generally still be preferred over a worse interpretable one. Being capable of running a thorough sensitivity analysis on input variations is actually much more important and meaningful than being interpretable. Interpretability might still help us debug and tweak our DNNs.
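
A very crude form of such a sensitivity analysis is a gradient-based saliency map; the sketch below is my own minimal illustration (with a throwaway linear model and random input standing in for a real network and image), not a technique presented at the symposium.

```python
# Gradient saliency sketch: how sensitive is the top class score to each pixel?
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).eval()
x = torch.randn(1, 3, 32, 32, requires_grad=True)   # stand-in for a real image
score = model(x)[0].max()                            # score of the top class
score.backward()                                     # d(score)/d(input)
saliency = x.grad.abs().max(dim=1)[0]                # per-pixel sensitivity map
print(saliency.shape)                                # torch.Size([1, 32, 32])
```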

Fairness and bias

A number of talks had a close relationship to the social sciences. Kate Crawford gave an excellent talk called “The Trouble with Bias”. In a lot of domains (health care, ads, access to insurance services, criminal justice), bias can have huge effects on individuals (just think of the example of women being less likely to be shown high-paying job ads [source], or of people being denied insurance due to their risk profiles). Here we need to distinguish between harms of allocation (i.e. of resources; this is basically outcome-focused, or “downstream”) and harms of representation (i.e. how we represent minorities, population groups, …; this is significantly harder to define, “upstream”). Kate claims that, in order to tackle both forms of harm, we need to involve social scientists and domain experts in the construction of our datasets and machine learning models.

One “upstream” example is how, by default, every classification task already has socio-economic bias intrinsically encoded in its underlying dataset (e.g. the 17th-century natural philosopher John Wilkins defined 40 “classes” of objects in the universe; ImageNet defines exactly 1000 image classes for its main classification task; Facebook, four years ago, offered exactly two genders to select from, male and female, while today it lets you select from 56...). That means there is no absolute ground truth in this data. Such datasets always include assumptions that are influenced by the social context and factors of their time.
This means we need to start tracking the life cycle (and changes) of datasets. But we also need to understand that each of these “design” decisions about how to represent our world (in the form of datasets and the models trained on them) has immediate consequences and social implications for humans.

A lot of research is already going into the area of “downstream” harms of allocation. Still, given, for example, Trump’s latest ideas about automated terrorist-screening procedures and the risk of introducing huge biases, with severe implications for certain groups of people, into such systems, Kate concludes that we need to ask many more questions about what harmful things can happen if we build these systems.

Neuroscience research

I have personally never liked throwing neural networks and true human intelligence into one bucket, as many people let themselves be fooled here. But more advanced and thorough methods are actually producing quite promising research, and studying human brain activity could inspire further breakthroughs.

Personally, I thought the paper by Schuck et al. was a very neat example here. They investigated how internal latent states are represented in certain brain regions (in particular in the orbitofrontal cortex, OFC). By having humans perform perception and decision tasks that require capturing both the current state of the world and past (memorized) information, they could train an SVM on top of fMRI data and classify these latent states with decent accuracy.
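
Schematically, that kind of decoding analysis looks like the snippet below; note that the array shapes, number of latent states and random data here are purely hypothetical placeholders, whereas the actual study used real fMRI voxel patterns from the OFC.

```python
# Rough sketch of decoding latent task states from voxel activations with an SVM.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 500))   # 200 trials x 500 voxel activations (fake)
y = rng.integers(0, 4, size=200)      # 4 hypothetical latent task states

clf = SVC(kernel="linear")
scores = cross_val_score(clf, X, y, cv=5)
print("decoding accuracy:", scores.mean())
```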

They concluded that the OFC must play a specific role in human decision-making, maintaining an up-to-date representation of all task-related information (i.e. the state). I would hope that further research will explore the more specific relationship to reinforcement learning here.

Learning with limited amounts of data

The field is not new, but it is still an interesting one because it is so diverse, and it was well represented at NIPS. It involves concepts such as transfer learning (learning on one task but applying the gained knowledge to a different task), active learning (a sub-domain of semi-supervised learning that assumes “expensive labels”, where the learning algorithm queries the user in order to maximize information gain quickly) and weakly supervised learning (where labels can be noisy and/or incomplete). It became clear once more that generative models and Bayesian methods (e.g. variational inference) can be very powerful in providing structure for these tasks.
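
For instance, the core active-learning loop can be sketched in a few lines; this is my own toy example with uncertainty (least-confidence) sampling on synthetic data, where the “oracle” is simply the known label array.

```python
# Toy active learning loop: query the points the model is least confident about.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
labelled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])  # seed set
pool = [i for i in range(len(X)) if i not in labelled]

for _ in range(20):                              # 20 query rounds
    clf = LogisticRegression(max_iter=1000).fit(X[labelled], y[labelled])
    probs = clf.predict_proba(X[pool])
    uncertainty = 1 - probs.max(axis=1)          # least-confident sampling
    query = pool[int(np.argmax(uncertainty))]    # "expensive" label request
    labelled.append(query)
    pool.remove(query)

print("labelled examples after querying:", len(labelled))
```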

Meta learning

This seems like the logical next step taking off in the ML/DL community: how do we learn how to learn? However, I am still skeptical about how much “science” is behind some of these approaches. I am by no means an expert here, but some of it looks like just a fancier version of hyperparameter optimization or random grid search.
A more specific example: Pieter Abbeel’s talk covered model-agnostic meta-learning (MAML), which is about finding a parameter vector that serves as a good initialization for many different tasks. Training proceeds by sampling from a distribution of tasks; the learned parameters are then adapted and fine-tuned for each specific task.
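
In code, the MAML idea boils down to an inner adaptation step nested inside an outer meta-update; the following is a heavily simplified sketch of my own (one scalar parameter, toy linear-regression tasks y = a·x, one inner gradient step), not the setup from the talk.

```python
# Simplified MAML sketch: learn an initialization that adapts in one gradient step.
import torch

w = torch.zeros(1, requires_grad=True)              # shared initialization
meta_opt = torch.optim.Adam([w], lr=1e-2)
inner_lr = 0.1

for step in range(1000):
    meta_opt.zero_grad()
    meta_loss = 0.0
    for _ in range(8):                               # a batch of sampled tasks
        a = torch.rand(1) * 4 - 2                    # task: y = a * x, a ~ U(-2, 2)
        x_support, x_query = torch.randn(10, 1), torch.randn(10, 1)
        # Inner step: adapt w on the support set of this task.
        inner_loss = ((x_support * w - a * x_support) ** 2).mean()
        (g,) = torch.autograd.grad(inner_loss, w, create_graph=True)
        w_adapted = w - inner_lr * g
        # Outer loss: evaluate the adapted parameters on the query set.
        meta_loss = meta_loss + ((x_query * w_adapted - a * x_query) ** 2).mean()
    meta_loss.backward()                             # backprop through the inner step
    meta_opt.step()
```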

Deep reinforcement learning

… is not hyped out yet; it feels like it is just taking off. Just a few directions here: attention as a way to focus on key subsets of the input and to learn sequentially even on non-sequential data; imitation learning, i.e. inferring a policy from just being shown a number of demonstrations, with promising new directions such as one-shot imitation learning; and hierarchical RL, an exciting area in which we learn sub-policies within a master policy. The workshop on HRL is a good resource for exploring this topic further. RL for robotics and meta reinforcement learning were well explained in Pieter Abbeel’s keynote talk.
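
To give the imitation-learning direction at least a minimal flavor, here is a behavioral-cloning sketch, the simplest possible form of imitation learning and purely my own toy example (random “expert” states and a made-up labeling rule), not one of the one-shot methods mentioned above.

```python
# Behavioral cloning sketch: fit a policy to (state, expert action) pairs.
import torch
import torch.nn as nn

states = torch.randn(1000, 4)                      # fake expert states
expert_actions = (states.sum(dim=1) > 0).long()    # fake expert decision rule

policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(200):                           # plain supervised learning
    opt.zero_grad()
    loss = loss_fn(policy(states), expert_actions)
    loss.backward()
    opt.step()
```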

Implications

Apart from these academic trends, what does all of this mean for startups and industry?

I hypothesize that it is just a matter of time until more of the above-mentioned concepts and directions manifest themselves in ready-to-use tooling and frameworks: tooling for embeddings, or for playing around with features and their sensitivity, for example. Frameworks such as edward or pyro (disclaimer: yes, the latter was recently open-sourced by Uber) are already paving the way in the Bayesian Deep Learning field.
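
As a taste of what these frameworks buy you, here is a minimal Pyro sketch, my own toy example rather than anything from the Pyro documentation: it infers the mean of some synthetic data with stochastic variational inference, and the data, prior and hyperparameters are all arbitrary.

```python
# Minimal Pyro sketch: stochastic variational inference for the mean of some data.
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoNormal
from pyro.optim import Adam

data = torch.randn(100) + 3.0                        # synthetic observations around 3

def model(data):
    mu = pyro.sample("mu", dist.Normal(0.0, 10.0))   # prior over the unknown mean
    with pyro.plate("data", len(data)):
        pyro.sample("obs", dist.Normal(mu, 1.0), obs=data)

guide = AutoNormal(model)                            # mean-field variational posterior
svi = SVI(model, guide, Adam({"lr": 0.02}), loss=Trace_ELBO())

for step in range(1000):
    svi.step(data)

print(guide.median())                                # approximate posterior mean of mu
```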

However, and this is an even bigger statement: the Bayesian domain in particular can only be supported by out-of-the-box methods to a limited degree. Here, what will remain true is that pure talent is the real asset. We all know that this holds in AI and ML right now, but it might become even more pronounced.

Why?

Being “Bayesian” is just really hard. Knowing the basic concepts of Probabilistic Graphical Models is not sufficient. As Zoubin Ghahramani pointed out during one workshop, most current ML courses [besides his own ones, of course :-)] do not teach proper model-based thinking. But Bayesian thinking requires exactly that: deep and thorough model-based thinking. Being Bayesian means thinking carefully about how to represent the world and designing meaningful modeling/learning/inference relationships in a principled mathematical way. This, most of the time, is a lot harder than just using the combination of [deep learning framework of choice] + [pretrained model of choice] + [optimization procedure of choice]. I am not saying these “common” deep learning approaches are trivial (they are definitely not), but combining Bayesian thinking with DNNs really demands a different, new level of skill.

For those new to this field, I recommend browsing through Zoubin Ghahramani’s paper for an introduction, and the paper by Yarin Gal and Zoubin for a more advanced discussion. Full disclosure: yes, Zoubin is Uber’s Chief Scientist, which makes me all the happier to see him continue to be one of the great pioneers in this field.

What do you think? Do you agree with the statements here? What are the best ways for researchers and engineers to educate themselves on the above-mentioned topics? Should we, intellify.us, do a deep dive into Bayesian Deep Learning here?