As I continue to implement and experiment with a context representation based on Bayesian inference, I have considered various ways to implement curiosity. The main component I need is a prediction error: a measure of the difference between the system's predictions and reality. Given the tools available in the Bayesian framework, the most obvious method is to use the prior and posterior distributions. The prior represents the system's predictions, and the posterior represents reality (approximately). So the prediction error would be the difference (possibly the KL divergence) between the prior and posterior distributions. (One implication is that if the incoming evidence (the likelihood) contains no new information, the prior and posterior will be equal, resulting in zero prediction error.) The prediction error can then be used to generate a curiosity reward, either for "surprising" events or for those that yield high learning progress (a reduction in surprise over time).
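To make the idea concrete, here is a minimal sketch for a discrete set of hypotheses. It is an illustration under stated assumptions, not my actual implementation: the function names (`bayes_update`, `kl_divergence`, `prediction_error`, `learning_progress`) are hypothetical, the divergence is taken in the direction KL(posterior || prior), logarithms are base 2, and learning progress is approximated as the drop in mean prediction error between two adjacent windows.

```python
import math


def bayes_update(prior, likelihood):
    """Return the posterior over discrete hypotheses via Bayes' rule."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(unnormalized)
    if evidence == 0:
        raise ValueError("evidence has zero probability under the prior")
    return [u / evidence for u in unnormalized]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log2(pi / qi)
    return total


def prediction_error(prior, likelihood):
    """Assumed definition: KL(posterior || prior) after one observation."""
    posterior = bayes_update(prior, likelihood)
    return kl_divergence(posterior, prior)


def learning_progress(errors, window):
    """Assumed definition: mean error of the older window minus the recent one."""
    if len(errors) < 2 * window:
        return 0.0  # not enough history to compare two windows
    older = errors[-2 * window:-window]
    recent = errors[-window:]
    return sum(older) / window - sum(recent) / window


prior = [0.5, 0.5]

# Evidence that favors neither hypothesis leaves the prior unchanged,
# so the prediction error is zero.
print(prediction_error(prior, [0.3, 0.3]))

# Evidence that strongly favors the first hypothesis shifts the posterior
# and produces a positive prediction error.
print(prediction_error(prior, [0.9, 0.1]))

# A falling sequence of prediction errors gives positive learning progress.
print(learning_progress([0.8, 0.6, 0.4, 0.2], window=2))
```

Either quantity could serve as the curiosity reward: `prediction_error` rewards surprising events directly, while `learning_progress` rewards situations in which surprise is shrinking over time.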
After I went through all the trouble of figuring this out, I came across this mathematical model of surprise based on Bayesian statistics (in which surprise is measured in units of "wow").