Preference-driven demonstrations ranking for inverse reinforcement learning

van der Wijden, R.

Title

Preference-driven demonstrations ranking for inverse reinforcement learning

Author

van der Wijden, R.

Contributor

Kober, J. (mentor)

Faculty

Mechanical, Maritime and Materials Engineering

Department

Biomechanical Engineering

Date

2016-07-04

Abstract

New, flexible teaching methods for robots are needed to automate repetitive tasks that are still performed by humans. For limited batch sizes, teaching a robot a new task is too expensive (Smith & Anderson, 2014). Ideally, such flexible robots can be taught a new task by a non-expert: a person who knows the task the robot should perform but has no experience in programming a robot. A powerful method that would allow for flexible robotics without the involvement of an expert is inverse reinforcement learning (IRL). IRL aims to learn a cost function from demonstrations; this cost function is subsequently used to learn a policy that realizes the desired task. Current implementations focus on the IRL algorithm itself and assume that enough demonstrations are available and that their quality is close to optimal behaviour (Doerr et al., 2015). In practice, however, demonstrations are expensive to obtain and non-optimal. This thesis investigates the effect of the quality of the input demonstrations on the performance of the learned trajectory, and how imperfect demonstrations can still be used without lowering that performance. The first hypothesis is that the performance of the resulting trajectory depends on the average performance of the input demonstrations, and that the number of input demonstrations has less of an effect. The second hypothesis is that by adding a ranking of the demonstrations, derived from the preferences of non-experts, the performance of the learned trajectory will exceed the average performance of the input demonstrations. These preferences are collected through a crowdsourcing experiment and are used to create an overall performance measure.
This overall performance measure is used both to determine the sequential order of the input demonstrations and to evaluate the final learned trajectories. The results confirm the first hypothesis: the average performance of the input demonstrations determines the performance of the learned trajectory. The second hypothesis could not be confirmed: the results did not show any improvement in the performance of the learned trajectory when the ranking based on non-expert preferences was added. A possible explanation is that the input demonstrations were too similar, or that the cost features used in IRL are not specific enough to produce different cost functions and therefore differently performing trajectories.

Subject

robotics
reinforcement learning
preference learning
inverse reinforcement learning

To reference this document use:

http://resolver.tudelft.nl/uuid:4a85d32d-79da-4983-97d7-530c7bb1da98

Part of collection

Student theses

Document type

master thesis

Rights

Files

PDF

Draft_Thesis_ReneevdWijden2.pdf

3.2 MB
