Scaling ferns

Not so recently, rFerns reached v3; the most notable improvement was the introduction of parallel training, powered by OpenMP. For those who don’t know what rFerns is: it is a random forest-like classifier with a simpler structure and no optimisation, which is consequently much faster to train (unfortunately also less accurate in general, though usually not dramatically enough to make it useless). From my perspective, it is a very good source of attribute importance, and this is mostly why I play with it. There is no room here to dive deeper; luckily, there is a paper about it.

What is important is that this method is in principle trivially parallel; the model is just a huge, flat collection of small models (ferns), which are all exactly the same size and mostly equally hard to produce. There are two modes of the aforementioned importance, simple and the more expensive shadow (which also estimates the significance of the importance); their algorithms are trivially parallel as well, but the current implementation involves a critical section for storing the results, to conserve RAM. Enough said, time for some funny benchmarks; I have run rFerns on the Madelon dataset, using an old 4-core i7, an 8-core first-gen Ryzen and our shiny new server with two 24-core Epycs.
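
To make that structure concrete, here is a minimal OpenMP sketch of the shape described above; it is purely illustrative (the real rFerns internals differ), and build_fern with its dummy scoring is made up:

```c
/* Illustrative sketch, not the actual rFerns source: ferns are built
 * independently, and the only shared state is a single importance
 * accumulator, guarded by a critical section so no per-thread copies
 * are needed (trading a short serial section for lower RAM usage). */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define FERNS      100000
#define ATTRIBUTES 500      /* Madelon has 500 attributes */

/* Tiny LCG so the sketch is self-contained and thread-safe. */
static double cheap_rand(unsigned int *state) {
    *state = *state * 1664525u + 1013904223u;
    return (double)(*state) / 4294967296.0;
}

/* Hypothetical stand-in for building one fern and scoring which
 * attributes it found useful; here it just produces dummy values. */
static void build_fern(int fern, double *local_imp) {
    unsigned int state = (unsigned int)fern + 1u;
    for (int a = 0; a < ATTRIBUTES; a++)
        local_imp[a] = cheap_rand(&state);
}

int main(void) {
    double *importance = calloc(ATTRIBUTES, sizeof(double));

    #pragma omp parallel
    {
        double local_imp[ATTRIBUTES];

        /* Each fern is an independent task of roughly equal cost. */
        #pragma omp for schedule(static)
        for (int f = 0; f < FERNS; f++) {
            build_fern(f, local_imp);

            /* One shared accumulator instead of a copy per thread. */
            #pragma omp critical
            for (int a = 0; a < ATTRIBUTES; a++)
                importance[a] += local_imp[a];
        }
    }

    printf("mean dummy importance of attribute 0: %g\n",
           importance[0] / FERNS);
    free(importance);
    return 0;
}
```

Compile with something like gcc -O2 -fopenmp; the critical section here is exactly the kind of serialisation the benchmarks below stress.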

Here is a basic picture, clipped to 16 threads for clarity:

[Figure: speed-up over a single thread, for the three computers and the three variants of importance calculation.]

All of the machines have 2-way SMT, which is why the plots for Ryzen and i7 are ‘broken in half’: as soon as a thread has to share a core with another one, performance drops, and the scaling afterwards is substantially poorer; still, for this problem, SMT is certainly beneficial. For Epyc, this effect should surface after 48 threads; I’ll get back to it later. The scaling is generally ok, and depends noticeably on the importance mode. This is somewhat expected, as more work in the parallel part generally hides scaling imperfections, both in the spirit of Amdahl’s law and in the sharing of other resources, like cache and RAM bandwidth. Some details seem perplexing to me, especially why this difference is so much smaller on Intel; this surely calls for further experiments with simpler code.
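
For reference, this is the Amdahl’s-law arithmetic behind that remark: the smaller the serial (or otherwise non-scaling) fraction, the better the speed-up holds up. The serial fractions in the snippet are made-up illustrations, not measurements of rFerns:

```c
/* Amdahl's law: with a serial fraction s, the speed-up on n threads
 * is 1 / (s + (1 - s) / n); more work in the parallel part means a
 * smaller s, hence better scaling. */
#include <stdio.h>

static double amdahl(double serial_fraction, int threads) {
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads);
}

int main(void) {
    const double s[] = {0.10, 0.02};   /* assumed, not measured */
    for (int i = 0; i < 2; i++)
        printf("serial fraction %.2f -> speed-up on 16 threads: %.1f\n",
               s[i], amdahl(s[i], 16));
    return 0;
}
```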

Moving on, this plot shows the same data on an absolute scale, as the number of runs per second:

[Figure: execution speed, in runs per second, for the three computers and the three variants of importance calculation.]

This allows for a comparison of the speed of the CPU cores on their own; one can see that Epyc is the slowest (due to its low clock), while Ryzen is only slightly slower than the i7 (naturally, only because the tested i7 chip is about 4 years behind). This difference disappears beyond 4 threads, where Ryzen’s real cores compete with the i7’s SMT; a fully loaded i7 is equivalent to 4-6 threads on the Ryzen, while, amusingly, a fully loaded 1700X is roughly equivalent to 16 real threads on the Epyc. This plot also explicitly shows the cost of importance calculation: simple importance makes rFerns, on average, 2.5× slower, while shadow makes it 16× slower; still, this seems pretty good considering that fern training is basically just random jumping through the dataset, while both importance modes are permutation tests and involve locks.
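
As an aside, the permutation part is conceptually simple; a rough sketch of the kind of extra work involved (names and details are mine, not the rFerns code) is shuffling an attribute column before re-scoring a fern on the shuffled data:

```c
/* Illustrative only: the importance modes pay for an extra pass that
 * permutes attribute values and re-scores the fern on them, which is
 * why they cost noticeably more than plain training. */
#include <stdio.h>

#define OBJECTS 2000   /* Madelon training set size */

static double cheap_rand(unsigned int *state) {
    *state = *state * 1664525u + 1013904223u;
    return (double)(*state) / 4294967296.0;
}

/* Fisher-Yates shuffle of one attribute column, with a caller-owned
 * RNG state so it stays thread-safe inside a parallel loop. */
static void shuffle_column(double *column, int n, unsigned int *state) {
    for (int i = n - 1; i > 0; i--) {
        int j = (int)(cheap_rand(state) * (i + 1));
        double tmp = column[i];
        column[i] = column[j];
        column[j] = tmp;
    }
}

int main(void) {
    unsigned int state = 42;
    double column[OBJECTS];
    for (int i = 0; i < OBJECTS; i++) column[i] = (double)i;
    shuffle_column(column, OBJECTS, &state);
    printf("first value after shuffling: %g\n", column[0]);
    return 0;
}
```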

Finally, the full picture for the Epyc machine, up to 96 threads:

[Figure: speed-up plot for the Epyc machine.]

Here, I have also added a Gustafson’s-law scenario, that is, I made the number of ferns (and so the number of concurrent tasks) depend on the number of threads; namely, 10k ferns per thread, as opposed to 100k ferns per whole run in the original set-up. This noticeably improves the scaling of vanilla training and of shadow importance (even allowing the SMT break after 48 threads to show up clearly), and smooths out the fluctuations caused by the very rapid execution of the whole task in the high-thread-count limit. On the other hand, simple importance exhibits a complete breakdown beyond 27 threads, with a plateau well below its best-case result; this happens in both scenarios, and I can only suspect it is a case of lock contention, which in the case of shadow importance is masked by the additional work.
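
To put the two set-ups side by side, here is a small sketch of the Gustafson’s-law arithmetic and of how much work a single thread actually gets in each case; the serial fraction is again an assumed value, not a measurement:

```c
/* Gustafson's law: when the problem grows with the thread count, the
 * scaled speed-up is S(n) = n - s*(n - 1) for a serial fraction s.
 * The printout also contrasts the per-thread work in the fixed run
 * (100k ferns total) with the scaled run (10k ferns per thread). */
#include <stdio.h>

int main(void) {
    const double s = 0.02;                  /* assumed serial fraction */
    const int threads[] = {1, 16, 48, 96};

    for (int i = 0; i < 4; i++) {
        int n = threads[i];
        printf("%2d threads: fixed run %6d ferns/thread, "
               "scaled run 10000 ferns/thread, "
               "Gustafson speed-up %5.1f\n",
               n, 100000 / n, n - s * (n - 1));
    }
    return 0;
}
```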

So, to sum up, rFerns seems to work pretty well on the seriously multicore machines AMD is bringing to the masses; the main problem seems to be the insufficient amount of load provided by the benchmark problem; luckily, this kind of issue tends to sort itself out quickly in the real world.


CC-BY mbq, written 11-10-2018.