In my experiments, linear regression with extended (addition of squared values) attributes is very much competitive in accuracy terms with reported MLP accuracy.
The paper [1] referenced in your link follows the lagacy of the paper on the HIGGS dataset, and does not operate with quantities like accuracy and/or perplexity. HIGGS dataset paper provided area under ROC, from which one had to approximate accuracy. I used accuracy from the ADMM paper [2] to compare my results with. As I checked later, area under ROC in [1] mostly agrees with [2] SGD training results on HIGGS.
I think that perplexity measure is appropriate there in [1] because we need to discern between three outcomes. This calls for softmax and for perplexity as a standard measure.
So, my questions are: 1) what perplexity should I target when dealing with "mc-flavtag-ttbar-small" dataset? And 2) what is the split of train/validate/test ratio there?
For better or worse the people working on this don't really use perplexity or accuracy to evaluate models. The target is whatever you'd get for those metrics if you used the discriminants that were provided in the dataset (i.e. the GN2v01 values).
As for why accuracy and perplexity aren't reported: the experiments generally choose a threshold to consider something a "b-hadron" (basically picking a point along the ROC curve) and quantify the TPR and FPR at that point. There are reasons for this, mostly that picking a standard point lets them verify that the simulation actually reflects data. See, for example, the FPR [1] and TPR [2] "calibrations".
It's a good point, though, the physicists should probably try harder to report standard metrics that the rest of the ML community uses.
Perplexity, aka measuring how much a network is sure about its answer. Which might be wrong. It would not pass the pier review of any particle physics journal. (Real) science is about being right, not about being sure about itself.
And this problem is a joke compared to a real problem. We are talking about going from 40 MHz to 100 kHz incoming data stream, after which a second layer of real-time selection reduces the data to 1 kHz which is processed, cleaned, elaborated into high level features that you have in that dataset. But if you think you can do better, apply for a CERN job, come here and enlighten us!
[1] https://archive.ics.uci.edu/ml/datasets/HIGGS
In my experiments, linear regression with extended (addition of squared values) attributes is very much competitive in accuracy terms with reported MLP accuracy.