Object tracking is an important problem in computer vision, the field that aims to replicate the abilities of human vision by automatically analyzing and understanding the content of digital images or videos. Tracking has applications in a wide range of domains. For instance, tracking techniques may be used in systems that remove camera shake or motion from a video, systems that automatically focus a camera on a target object, systems for driving assistance, systems for activity and action recognition, or systems that perform 3D reconstruction of objects based on images from a single moving camera.

A tracking system generally consists of three components: an appearance model that measures whether an image location may contain the target object, a location model that models the prior probability of the target object being at a particular location, and a search strategy that aims to identify the target object based on these two models. Trackers can be grouped into two main types: model-based trackers and model-free trackers. In model-based tracking, the appearance of the target object is usually modeled offline, for instance, by training a machine-learning algorithm on a collection of annotated images. In model-free tracking, the appearance of the target object is modeled online, without exploiting prior knowledge of the object's appearance.

Both the generic nature of the appearance models used by model-based trackers and the limited discriminative power of the appearance models used by model-free trackers become particularly problematic when multiple objects with similar visual appearance are present. This situation frequently occurs, for instance, when tracking people, faces, or cars in complex environments. The key problem is that the tracker may switch from the target object to another, visually similar object. This is especially hard for model-free tracking, in which three important problems remain unsolved: (1) information on the visual appearance of the target object is ambiguous, in the sense that the initial bounding box only approximately distinguishes the object of interest from the background; (2) the object appearance may change drastically over time, in particular when the object is deformable; and (3) only simple appearance models can be used if the exhaustive sliding-window search is to run in real time.

In this thesis, we propose a new model-free tracker that can track multiple objects with similar visual appearance by incorporating spatial constraints between the objects, which are learned online along with the appearance models. We show that this novel structure-preserving object tracker (SPOT) achieves substantial performance improvements in multi-object tracking; a schematic sketch of such a joint scoring function is given below.

Model-based trackers are powerful and robust because they have been trained offline on (potentially extremely) large collections of data, yet they are not adapted to the object at hand. This thesis therefore also explores the possibility of adapting the parameters of model-based trackers online, using concepts from model-free tracking. In particular, we adapt the parameters of a model-based tracker to the visual appearance of the particular target object using an online structured SVM, essentially applying our techniques for model-free tracking to model-based tracking. We further improve the performance of the resulting tracker by learning a prior distribution over object sizes online.
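Returning to the structure-preserving idea behind SPOT, the following is a minimal sketch of a joint scoring function for two tracked objects, assuming linear appearance templates and a quadratic, spring-like penalty on deviations from a learned relative position. The function names, the two-object restriction, and the fixed stiffness are illustrative assumptions only; the thesis learns these quantities with an online structured SVM and optimizes the placement of all objects jointly.

```python
# Minimal sketch (illustrative, not the thesis implementation): score a joint
# placement of two tracked objects as the sum of per-object appearance scores
# and a spring-like structural term on their relative position.
import numpy as np

def appearance_score(template, features):
    # Linear appearance model: response of a learned template to the features
    # extracted at a candidate bounding box.
    return float(np.dot(template, features))

def structure_score(pos_a, pos_b, expected_offset, stiffness=0.01):
    # Quadratic penalty on the deviation of the observed relative position
    # from the relative position learned online (the "spring" between objects).
    deviation = (np.asarray(pos_a, float) - np.asarray(pos_b, float)
                 - np.asarray(expected_offset, float))
    return -stiffness * float(deviation @ deviation)

def joint_score(candidate_a, candidate_b, templates, expected_offset):
    # candidate_* = (center_position, feature_vector) of one candidate box.
    (pos_a, feat_a), (pos_b, feat_b) = candidate_a, candidate_b
    return (appearance_score(templates[0], feat_a)
            + appearance_score(templates[1], feat_b)
            + structure_score(pos_a, pos_b, expected_offset))
```

The structural term is what prevents identity switches between visually similar objects: a configuration in which one tracker jumps to a nearby look-alike may have a high appearance score, but it violates the learned spatial layout and is therefore penalized.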
The experimental evaluation of the resulting tracker demonstrates its effectiveness in pedestrian and car tracking.

In many practical applications of trackers, it is essential that the tracker can operate in real time on a variety of computational platforms. Approximate search strategies such as particle filters have the advantage that they allow the user to trade off speed for accuracy by adapting the number of particles. However, the localization of target objects by approximate search is usually somewhat inaccurate, which may cause the tracker to drift away from the true location, in particular when the target object has been temporarily occluded or when fast camera motions are present in the image sequence. Exhaustive search strategies such as sliding-window search are generally more accurate than approximate search strategies, but they tend to be slow for large search spaces and, more importantly, they do not allow the user to trade off speed for accuracy based on the available computational budget.

In this thesis, we present a new approach that reduces the computational costs of trackers by ignoring features in image regions that, after inspecting only a few features, are unlikely to contain the target object. To this end, we derive an upper bound on the probability that a location is the most likely one to contain the target object, and we ignore (features in) locations for which this upper bound is small; a schematic illustration of this type of pruning is given below. With this upper bound, we realize a formal trade-off between accuracy and computational burden within trackers. We demonstrate the effectiveness of our approach in experimental evaluations, which show that the average number of inspected features can be reduced by up to 90% without affecting the accuracy of the tracker.

Taken together, this thesis makes major contributions to model-free tracking, both in performance and in efficiency, and shows the power of online learning for model-based trackers as well. With these steps, we move closer to realizing computers that can see.
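As an illustration of the pruning idea described above, the following is a minimal sketch that discards candidate locations after inspecting only a few of their features, assuming per-feature score contributions bounded in [0, 1] and using a generic Hoeffding-style confidence radius. The exact upper bound derived in the thesis, as well as the names and the scheduling of feature inspections below, are assumptions made for this sketch.

```python
# Minimal sketch (illustrative): eliminate candidate locations whose score is,
# with high confidence, not going to be the maximum, after inspecting only a
# few of their features. The Hoeffding-style radius stands in for the upper
# bound derived in the thesis, which may differ.
import numpy as np

def prune_locations(contributions, delta=0.05, rng=None):
    # contributions: (n_locations, n_features) array of per-feature score
    # contributions, assumed to lie in [0, 1]; features are inspected in a
    # random order, one feature (column) at a time for all surviving locations.
    rng = np.random.default_rng() if rng is None else rng
    n_loc, n_feat = contributions.shape
    order = rng.permutation(n_feat)
    alive = np.ones(n_loc, dtype=bool)
    partial = np.zeros(n_loc)
    inspected = 0
    for t, f in enumerate(order, start=1):
        partial[alive] += contributions[alive, f]
        inspected = t
        mean = partial / t
        # With high probability, the true mean contribution of a location lies
        # within this radius of its running average.
        radius = np.sqrt(np.log(2.0 * n_loc * n_feat / delta) / (2.0 * t))
        best_pessimistic = np.max(mean[alive]) - radius
        # Keep only locations whose optimistic score can still beat the best
        # pessimistic score; all other locations are ignored from now on.
        alive &= (mean + radius >= best_pessimistic)
        if alive.sum() == 1:
            break
    return np.flatnonzero(alive), inspected
```

In this sketch, the parameter delta controls the trade-off between accuracy and computation: a larger delta shrinks the confidence radius and prunes locations more aggressively, at the risk of occasionally discarding the best-scoring location.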