TranViT: An Integrated Vision Transformer Framework for Discrete Transit Travel Time Range Prediction
Accurate travel time estimation is paramount for providing transit users with reliable schedules and dependable real-time information. This paper proposes and evaluates a novel end-to-end framework for transit and roadside image data acquisition, labeling, and model training to predict transit travel times across a segment of interest. General Transit Feed Specification (GTFS) real-time data is used as an activation mechanism for a roadside camera unit monitoring a segment of Massachusetts Avenue in Cambridge, MA. Ground truth labels are generated for the acquired images based on the observed travel time percentiles across the monitored segment obtained from Automated Vehicle Location (AVL) data. The generated labeled image dataset is then used to train and evaluate a Vision Transformer (ViT) model to predict a discrete transit travel time range (band). The results of this exploratory study illustrate that the ViT model is able to learn the image features and contents that best help it deduce the expected travel time range, with an average validation accuracy above 80%. The discrete travel time range prediction can subsequently be utilized to improve continuous transit travel time estimation. The workflow and results presented in this study provide an end-to-end, scalable, automated, and highly efficient approach for integrating traditional transit data sources and roadside imagery to improve the estimation of transit travel duration. This work also demonstrates the value of incorporating real-time information from computer-vision sources, which are becoming increasingly accessible and can have major implications for improving operations and passenger real-time information.
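The percentile-based labeling step described above can be sketched as follows. This is a minimal illustrative implementation, not the paper's code: the number of bands and the use of equally populated empirical quantiles are assumptions, and the sample travel times are invented for demonstration.

```python
from bisect import bisect_right

def travel_time_band_boundaries(travel_times, n_bands=4):
    """Derive band boundaries as empirical quantiles of observed AVL
    travel times (seconds), splitting the observations into n_bands
    roughly equally populated discrete ranges (an assumed band scheme)."""
    ordered = sorted(travel_times)
    return [ordered[len(ordered) * k // n_bands] for k in range(1, n_bands)]

def label_band(travel_time, boundaries):
    """Map one observed segment travel time to a discrete band index
    in 0..len(boundaries); this index serves as the image's class label."""
    return bisect_right(boundaries, travel_time)

# Illustrative AVL travel time observations for the monitored segment.
observed = [120, 135, 150, 160, 180, 200, 240, 300]
bounds = travel_time_band_boundaries(observed)        # e.g. three cut points
labels = [label_band(t, bounds) for t in observed]    # one label per image
```

Each captured roadside image would be paired with the AVL-derived travel time of the transit vehicle traversing the segment, and the resulting band index becomes the classification target for ViT training.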