We introduce a model and algorithm for following high-level navigation instructions by mapping directly from images, instructions and pose estimates to continuous low-level velocity commands for real-time control. The Grounded Semantic Mapping Network (GSMN) is a fully differentiable neural network architecture composed of interpretable perception, grounding, mapping and planning modules. It builds an explicit semantic map in the world reference frame. The information stored in the map is learned from experience, while the local-to-world transformation used for grid cell lookup is computed explicitly within the network. We train the model using a variant of DAGGER optimized for speed and memory efficiency. We test GSMN in rich virtual environments on a realistic quadcopter simulator powered by Microsoft AirSim and show that our model outperforms strong neural baselines and almost reaches the performance of its teacher expert policy. We attribute its success to the spatial transformation and mapping modules, which also provide highly interpretable maps that reveal the reasoning of the model.
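To make the idea of an explicitly computed local-to-world transformation for grid cell lookup more concrete, the following is a minimal NumPy sketch, not the GSMN implementation. It assumes a 2D pose of the form (x, y, yaw), a square grid with a fixed cell size, and hard nearest-cell assignment; a differentiable network would more plausibly use an interpolating sampler. All function names and parameters here are hypothetical.

```python
import numpy as np


def local_to_world(points_local, pose):
    """Rotate and translate 2D points from the agent frame to the world frame.

    pose is an assumed (x, y, yaw) tuple, with yaw in radians, counter-clockwise.
    """
    x, y, yaw = pose
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s], [s, c]])
    # Row i of the result is rot @ points_local[i], shifted by the position.
    return points_local @ rot.T + np.array([x, y])


def world_to_cell(points_world, cell_size, origin=(0.0, 0.0)):
    """Convert world coordinates to integer (column, row) grid indices."""
    scaled = (points_world - np.asarray(origin)) / cell_size
    return np.floor(scaled).astype(int)


def write_features(world_map, points_local, features, pose, cell_size):
    """Write per-point feature vectors into the world-frame map.

    world_map has shape (rows, cols, channels). Points that fall outside
    the map are skipped. This is a hard assignment and is not differentiable
    with respect to the pose.
    """
    cells = world_to_cell(local_to_world(points_local, pose), cell_size)
    rows_total, cols_total = world_map.shape[:2]
    inside = (
        (cells[:, 0] >= 0) & (cells[:, 0] < cols_total)
        & (cells[:, 1] >= 0) & (cells[:, 1] < rows_total)
    )
    cols, rows = cells[inside, 0], cells[inside, 1]
    world_map[rows, cols] = features[inside]
    return world_map
```

In this sketch only the feature values would be learned; the geometry is fixed by the pose estimate, which mirrors the split the abstract describes between learned map content and an explicitly computed transformation.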