> what differences the trees within the random forest have - as I understand it, they are all slightly different, and this gives them greater accuracy?
I can give you an example from my own work. We have a random forest on a 400+ attribute input (ie: 400 variables). All we want at the end is a probability from 0.0 to 1.0.
Our random forest model will build around 500 trees. Each tree randomly selected a small subset of those 400+ input attributes and says "what's the best I can do using only these attributes?". Generally, it does okay. But when you average the 500 trees, the accuracy is pretty darned good.
Edit later: To be clear, each new tree is generated using the random subset of variables. The point is that each tree may glean some insight about that small combination of variables.
I can give you an example from my own work. We have a random forest on a 400+ attribute input (ie: 400 variables). All we want at the end is a probability from 0.0 to 1.0.
Our random forest model will build around 500 trees. Each tree randomly selected a small subset of those 400+ input attributes and says "what's the best I can do using only these attributes?". Generally, it does okay. But when you average the 500 trees, the accuracy is pretty darned good.
Edit later: To be clear, each new tree is generated using the random subset of variables. The point is that each tree may glean some insight about that small combination of variables.