Abstract

Reinforcement learning aims to adapt an agent to an unknown environment based on rewards. Two issues must be handled: delayed reward and uncertainty. Q-learning is a representative reinforcement learning method, used in many studies because it can learn an optimal policy. However, Q-learning requires numerous trials to converge to an optimal policy. If the target environment can be described as a Markov decision process, we can identify it from statistics of sensor-action pairs. Once a correct environment model has been built, we can derive an optimal policy with the Policy Iteration Algorithm. Therefore, we can construct an optimal policy efficiently by identifying the environment. We separate the learning process into two ph...
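The Policy Iteration Algorithm mentioned above can be sketched as follows. This is a minimal illustration on a toy two-state, two-action MDP; the transition matrix, rewards, and discount factor are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Toy MDP (assumed for illustration): P[s, a, s'] is the transition
# probability, R[s, a] the expected immediate reward, gamma the discount.
gamma = 0.9
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # transitions from state 0 under actions 0, 1
    [[0.5, 0.5], [0.0, 1.0]],   # transitions from state 1 under actions 0, 1
])
R = np.array([
    [1.0, 0.0],                 # rewards in state 0
    [0.0, 2.0],                 # rewards in state 1
])

def policy_iteration(P, R, gamma):
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[np.arange(n_states), policy]
        R_pi = R[np.arange(n_states), policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to the current V.
        Q = R + gamma * (P @ V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V   # converged to a fixed point: optimal policy
        policy = new_policy

policy, V = policy_iteration(P, R, gamma)
```

Because each evaluation step solves the Bellman equations exactly and each improvement step is greedy, the loop terminates at an optimal policy after finitely many iterations, which is what makes the model-based route attractive once the environment has been identified.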