Imaging modalities and clinical measurements, together with their progression over time, can be seen as heterogeneous observations of the same underlying disease process. The analysis of sequences of multi-modal observations, in which not all modalities are present at each visit, is a challenging task. In this paper, we propose a multi-modal autoencoder for longitudinal data. For each modality, the sequence of observations is encoded by a recurrent network into a latent variable. The latent variables of the different modalities are then fused into a common variable that describes a linear trajectory in a low-dimensional latent space. This latent space is mapped back to the multi-modal observation space by a separate decoder for each modality. We first ill...