Policy Gradients Grow Up: Teaching RL to Think in Domains
The problem is not that RL cannot plan. It is that it keeps learning the wrong object. A warehouse robot can learn to pick up box A from shelf B and move it to station C. Very impressive, until tomorrow’s warehouse has different boxes, different shelves, and a new station name. The action label changed. The task structure did not. ...