In this work, we show how to learn a visual walking policy that only uses a monocular RGB camera and proprioception. Since simulating RGB is hard, we necessarily have to learn vision in the real world. We start with a blind walking policy trained in simulation. This policy can traverse some terrains in the real world but often struggles since it lacks knowledge of the upcoming geometry. This can be resolved with the use of vision. We train a visual module in the real world to predict the upcoming terrain with our proposed algorithm Cross-Modal Supervision (CMS). CMS uses time-shifted proprioception to supervise vision and allows the policy to continually improve with more real-world experience. We evaluate our vision-based walking policy over a diverse set of terrains including stairs (up to 19cm high), slippery slopes (inclination of 35 degrees), curbs and tall steps (up to 20cm), and complex discrete terrains. We achieve this performance with less than 30 minutes of real-world data. Finally, we show that our policy can adapt to shifts in the visual field with a limited amount of real-world experience. Video results and code at https://antonilo.github.io/vision_locomotion/.
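To make the cross-modal supervision idea concrete, the sketch below illustrates one possible training step: because the terrain the camera sees now is the terrain the legs touch a moment later, proprioception at time t + delta can serve as a free label for the image at time t. This is only a minimal illustration of the supervision signal described above; the module architecture, the label extraction from proprioception, and all names (VisionModule, cms_training_step, delta) are assumptions for exposition, not the authors' implementation.

```python
# Illustrative sketch of cross-modal supervision (CMS): images at time t are
# supervised by labels derived from proprioception at a later time t + delta.
# All names and shapes here are hypothetical, not taken from the released code.
import torch
import torch.nn as nn

class VisionModule(nn.Module):
    """Small CNN that predicts upcoming-terrain features from one RGB frame."""
    def __init__(self, out_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, out_dim),
        )

    def forward(self, img):
        return self.net(img)

def cms_training_step(vision, optimizer, images, proprio_labels, delta):
    """One supervised step: the image at t predicts the proprioceptive label at t + delta.

    images:         (T, 3, H, W) sequence of onboard camera frames
    proprio_labels: (T, D) terrain features estimated from proprioception
    delta:          time shift (in steps) between seeing terrain and stepping on it
    """
    inputs = images[:-delta]            # frames at time t
    targets = proprio_labels[delta:]    # what proprioception revealed at t + delta
    loss = nn.functional.mse_loss(vision(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the labels come from the robot's own proprioception rather than human annotation, such a loop can in principle keep running as the robot collects more real-world experience, which is how the policy continues to improve over time.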