Since its launch, all functionality in FarPlano has relied exclusively on the HAFAS API provided by public transport companies: finding the nearest location, searching for a location by query, retrieving the stationboard, and so on.
The generic architecture of the whole ecosystem is depicted in the image below:
Using the HAFAS API has obvious advantages, such as low implementation and maintenance costs: one only needs to implement a proxy interface that talks to the API. However, the proxy architecture also has a number of disadvantages, including the inability to alter the location search process, longer response times, and various peculiarities of the HAFAS API (which deserve a separate post).
The inability to change HAFAS's location search algorithm became the most pressing problem, as we believe other approaches can deliver better user satisfaction. For example, the HAFAS search algorithm does not use information about the user's current location to rank the stations closest to them first. So, as our service continued to grow, we decided to implement our own station search algorithm. Luckily, many public transport companies provide their data in the standard General Transit Feed Specification (GTFS) format.
As a backend database, we decided to use MongoDB for the following reasons:
- the data is only meant to be consumed by the service and updated by a script once a week, so we don't really need transaction support, which would only slow things down;
- MongoDB natively supports geospatial indexes, which allow efficient nearest-location searches;
- as in any document-oriented DB, the data schema is flexible, meaning we can add new fields or completely change the document structure without hassle if necessary.
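As a concrete illustration, a station document in such a schema might look like the sketch below. The field names and values are our own illustration, not the exact production schema; the GeoJSON `Point` is the shape MongoDB's `2dsphere` index operates on:

```python
# A sketch of what a station document could look like in MongoDB.
# Field names and values here are illustrative, not the production schema.
station = {
    "name": "Fribourg Gare",
    "location": {  # GeoJSON Point, indexable with a 2dsphere index
        "type": "Point",
        "coordinates": [7.1512, 46.8031],  # [longitude, latitude]
    },
    "weight": 420,  # illustrative station importance, used to rank big stations
}

# With pymongo, the geo index and a nearest-station query would look like:
#   collection.create_index([("location", "2dsphere")])
#   collection.find({"location": {"$near": {
#       "$geometry": user_point, "$maxDistance": 10_000}}})
```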
Now, an efficient location search requires efficient autocomplete over the set of location names. The standard solution to this problem is a prefix tree (trie) or a similar data structure. MongoDB does not support prefix trees natively, but we can emulate one by generating the prefixes ourselves. At the core of the search algorithm are the following three indexes:
- Full name prefixes (`full_index`), e.g. for "Fribourg Gare": [f, fr, fri, ..., fribourg g, ..., fribourg gare];
- Prefixes of non-first words (`second_index`): [g, ga, gar, gare];
- Prefixes of individual words (`word_index`): [f, fr, fri, ..., fribourg, g, ga, gar, gare]. This index handles word skips.
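Generating the three prefix lists is straightforward. Here is a minimal sketch, assuming names are lower-cased and split on whitespace (our actual normalization may differ):

```python
def prefixes(s):
    """All prefixes of a string: 'gare' -> ['g', 'ga', 'gar', 'gare']."""
    return [s[:i] for i in range(1, len(s) + 1)]

def build_indexes(name):
    """Build the three prefix lists for a location name."""
    norm = name.lower()
    words = norm.split()
    full_index = prefixes(norm)                                  # full name
    second_index = [p for w in words[1:] for p in prefixes(w)]   # non-first words
    word_index = [p for w in words for p in prefixes(w)]         # every word
    return full_index, second_index, word_index

full_index, second_index, word_index = build_indexes("Fribourg Gare")
# second_index is ['g', 'ga', 'gar', 'gare']
```

Each list is stored as an array field on the station document, so a normal (non-prefix-aware) MongoDB index on that field answers autocomplete queries with a plain equality match.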
Given the indexes, the search algorithm proceeds as follows:
- Nearest locations (10 km max distance) are searched for matches, first on `second_index`, then on `full_index`. The rationale behind this order is that most location names start with the city name, so the distinctive part a nearby user types is usually a later word. Also, we limit this query to 1 result, since a user searching for a local station usually wants a specific one.
- Large-weight stations are searched for matches on `full_index`. Large-weight stations are the big, important stations in cities.
- Everything else is queried on `word_index` to handle word skips.
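Put together, the three stages can be sketched as follows. This is an in-memory mock of the staged lookup, not the production code: the real service issues the equivalent MongoDB `find` queries with geo and weight filters, and the parameter names here are ours:

```python
def search(query, stations, nearby_ids, heavy_ids):
    """Staged lookup sketch. `stations` maps id -> dict holding the three
    prefix lists; `nearby_ids` are stations within 10 km of the user,
    `heavy_ids` are the large-weight stations."""
    q = query.lower()

    # Stage 1: nearby stations, second_index then full_index, 1 result max.
    for field in ("second_index", "full_index"):
        hits = [sid for sid in nearby_ids if q in stations[sid][field]]
        if hits:
            return hits[:1]

    # Stage 2: large-weight stations on full_index.
    hits = [sid for sid in heavy_ids if q in stations[sid]["full_index"]]
    if hits:
        return hits

    # Stage 3: everything else on word_index (handles word skips).
    return [sid for sid in stations if q in stations[sid]["word_index"]]
```

Because a prefix match is just membership in a precomputed array, every stage maps onto a single indexed equality query in MongoDB.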
After the new search algorithm was deployed, we analyzed the changes in response times and users' typing patterns, e.g. whether users now type fewer characters to find locations. The graphs for response times are presented below.
As can be seen, both location- and query-based requests are now processed ~100ms faster on average.
To analyze users' typing patterns, we gathered the lengths of the search queries within a single search session that were followed by either a stationboard or a connection request (i.e. the user completed their search task). The distributions of lengths before and after deploying the new search algorithm are shown in the figure below:
As the distribution shows, after deploying the new search algorithm, which takes the user's position into account, the average search query length decreased by approximately one character. Furthermore, queries of length 3 and 4 now constitute ~32% of all queries, compared to merely 22% before!
To conclude, by implementing the search algorithm ourselves, we have improved both response times and user satisfaction (by reducing query length). We will continue monitoring the results and will report updates on further search algorithm improvements.