In this paper, we propose a deep variational and structural hashing (DVStH) method to learn compact binary feature representation inside the network. Then, we design a struct layer rather than a bottleneck hash layer, to obtain binary codes through a simple encoding procedure. By doing these, we are able to obtain binary codes discriminatively and generatively. To make it applicable to cross-modal scalable multimedia retrieval, we extend our method to a cross-modal deep variational and structural hashing (CM-DVStH). We design a deep fusion network with a struct layer to maximize the correlation between image-text input pairs during the training stage so that a unified binary vector can be obtained. We then design modality-specific hashing n...