Impressions from ICML 2013

Last month I’ve been to the International Conference on Machine Learning, in Atlanta, to present my work on Modeling Musical Influence.
Following are some impressions about what went on in the conference:

1. GraphLab

The first keynote was by Carlos Guestrin, describing the work they’ve been doing on GraphLab. I always think of it as “Matlab” where you replace the “Mat”(rix) with a graph. That is, a programming language where the primary objects are graphs. It’s supposed to be very fast, very programmer friendly (except for version 2…), and easily parellalized. There a version called GraphChi which is specifically suited for single disk use, and seems to have some remarkable results in terms of speed/performance.

2. Spectral methods for HMMLDANMF and more

There were a lot of talks about these new methods. The basic promise is “we now have an algorithm which is proven to give the one correct solution” for all these highly non-convex problems, where previously one had to run EM with multiple restarts to get a result and typically had no theoretical guarantees.
What seems to be the case though, is that the theory guaranteeing a solution hasn’t yet met up with the practice. This makes sense since EM-type algorithms have decades of experience behind them, and are currently able to find better results with fewer samples.
The current state of the art seems to be initializing the relevant EM variant with the parameters obtained from running the spectral algorithm.
A specifically interesting paper in this domain, by Huang & Schneider, was one where they used both dynamic data and static data drawn from the same process. What they said was that in many cases you have only a few dynamic time series for training, but a lot of static data relating to the same process, and they use both sources to better estimate the HMM.
They give two examples of this situation:

In some situations, the dynamic process of interest may evolve slowly over time, such as the progression of Alzheimer’s disease, and it may take months or even years to obtain enough time series data for analysis.

(while presumably collecting individual samples from Alzheimer’s patients is easy).

In other situations, it may be very difficult to measure the dynamic process of interest repetitively, due to the destructive nature of the measurement technique. One such example is gene expression time series. Although obtaining reliable time series, or dynamic data, can be difficult, it is often easier to collect static, orderless snapshots of the dynamic process of interest.

3. Anchor words for NMF and LDA

There were a few papers about this. The basic idea is that if we add a certain “separability” assumption to LDA or NMF models, they become efficiently and exactly solvable.
The assumption (in the LDA case) is this:
“For each topic there is one word that appears only in this topic”.

4. Sparse PCA

There were 2 papers offering fast and provably good sparse PCA algorithms. Some can be used simply as good feature selection algorithms. One paper, by Volodymyr Kuleshov, uses a simple Rayleigh power method iteration on the sparse subset and gets very fast convergence.
In the other paper, by Papailiopoulos et al., the algorithm chooses in an interesting way a small subset of good candidate features, and then they run a combinatorial procedure on top of those. The interesting part is that they can guarantee that they provide a good approximation to the optimal sparse PCA solution, with the quality of the approximation dependent on the decay of the matrix eigenvalues. They then show some very nice looking results on a twitter data with 64,000 features.

5. The Return of Frank-Wolfe

There were several papers, including one by Martin Jaggi from ETH, using an old optimization method called Frank-Wolfe, a.k.a. conditional gradient. The problem they solve is this: if you want to do gradient descent over a constrained set (for example, positive semidefinite matrices, or the L1-ball), then you perform a gradient step and then project back to the set. In many cases the projection step is the computationally expensive part.
The Frank-Wolfe algorithm gives a way to perform a variant of gradient descent over a compact set without the projection step. They show many cases in which this is more efficient computationally, and also has other advantages such as knowing exactly how far you are from the optimum by means of the duality gap.

6. Deep Nets

There were A LOT of talks about deep nets. In order to keep my model complexity low I avoided all of them… I did manage to catch up on a new thing they all seem very excited about, called dropout networks. I think the idea is to somehow zero-out some of the hidden units over training. That’s all I know for now.
7. And just a cool audio application
Finally, just a cool application I saw there, from Nick Bryan and Gautham Mysore: an interactive tool for separating sound sources, such as taking the vocals out of a song. Turns out there was no real way of doing this until now, and they used some nice ML techniques in the process (Bayesian NMF with an added constraint in the M-step of the EM). The tool is open source and available for download.
Here’s a video of a demo of the method:
and the paper: