homelocator | Qingqing Chen

Identifying meaningful locations, such as home or work, from human mobility data has become an increasingly common prerequisite for geographic research. Although location-based services (LBS) and other mobile technology have rapidly grown in recent years, it can be challenging to infer meaningful places from such data, which – compared to conventional datasets – can be devoid of context. Existing approaches are often developed ad-hoc and can lack transparency and reproducibility. To address this, we introduce an R package for inferring home locations from LBS data. The package implements pre-existing algorithms and provides building blocks to make writing algorithmic ‘recipes’ more convenient. We evaluate this approach by analyzing a de-identified LBS dataset from Singapore that aims to balance ethics and privacy with the research goal of identifying meaningful locations. We show that ensemble approaches, combining multiple algorithms, can be especially valuable in this regard as the resulting patterns of inferred home locations closely correlate with the distribution of residential population. We hope this package, and others like it, will contribute to an increase in use and sharing of comparable algorithms, research code and data. This will increase transparency and reproducibility in mobility analyses and further the ongoing discourse around ethical big data research.

The package can be installed in R using the following command:

# install.packages("remotes")  # only needed if remotes is not yet installed
remotes::install_github("spatialnetworkslab/homelocator")

Specifically, the package has four built-in “recipes”. The default “recipe” is called HMLC, which weights data points across multiple time frames to “score” potentially meaningful locations. The other three “recipes” are named OSNA, APDM and FREQ.

Taking the HMLC recipe as an example, a list of pre-conditions is used to filter “meaningful” locations (see Fig. 1, left); the thresholds for each condition are tunable to suit specific research questions. After this initial filtering phase, the algorithm uses a “scoring” phase to determine the most likely home location (see Fig. 1, right).

Figure 1. Left: Default pre-conditions used for selecting "meaningful" locations in the "HMLC" recipe. Right: Default scoring standards used in the "HMLC" recipe. The weight for each activity is tunable according to the importance a researcher assigns to it.
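The two phases described above can be illustrated with a minimal base R sketch. It does not use the homelocator API, and the column names, thresholds and weights are all assumptions chosen for illustration; they are not the package's defaults from Fig. 1.

```r
# Toy per-user, per-location summary; all values and column names are made up.
locs <- data.frame(
  user      = c("u1", "u1", "u1", "u2", "u2"),
  location  = c("A", "B", "C", "A", "D"),
  n_points  = c(40, 25, 3, 12, 30),
  n_days    = c(20, 10, 2, 6, 15),
  p_night   = c(0.60, 0.10, 0.50, 0.20, 0.70),  # share of points sent at night
  p_weekend = c(0.40, 0.05, 0.30, 0.50, 0.35),  # share of points sent on weekends
  stringsAsFactors = FALSE
)

# Phase 1: filtering. Hypothetical thresholds, meant to be tuned per study.
min_points <- 10
min_days   <- 5
candidates <- locs[locs$n_points >= min_points & locs$n_days >= min_days, ]

# Phase 2: scoring. Hypothetical weights for each activity measure.
weights <- c(p_night = 0.6, p_weekend = 0.4)
candidates$score <- as.vector(as.matrix(candidates[, names(weights)]) %*% weights)

# Keep the highest-scoring candidate location for each user.
ranked <- candidates[order(candidates$user, -candidates$score), ]
home <- ranked[!duplicated(ranked$user), c("user", "location", "score")]
print(home)
```

Keeping thresholds and weights in named variables makes it easy to rerun the same filter with different settings and compare the inferred locations.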

In addition, the package allows users to customize “recipes” or to write them completely from scratch. Many of the existing algorithms in the literature can be expressed through a set of basic building blocks, implemented as the following key functions:

validate_dataset: validates that the required variables are present in the input dataset. This is a useful first step, especially in the context of reproducible workflows with different input datasets.
enrich_timestamp: derives additional variables from the timestamp column (e.g., ‘weekday’, ‘hour of the day’) that are often used as intermediate variables in home location algorithms.
A set of functions that make it easier to operate on “nested” data. Nesting data (i.e., storing an entire table inside a column) is a common procedure in the tidyverse approach to data manipulation. For location inference, it is often necessary to operate on doubly nested data, as analysis is needed per person, per location. To aid the mental gymnastics required for this, we provide nested versions of common tidyverse functions such as mutate and summarise, as well as functions specific to location inference such as score_nested, which allows the user to perform a scoring phase after the initial filtering. Nesting has an additional advantage: it allows for subsequent parallel processing of the data, which can dramatically speed up computation.
Finally, a function named extract_location extracts the most likely home location for each user based on the filtering and scoring in the previous steps.
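To make these building blocks more concrete, the following base R sketch mimics the same sequence of steps on a toy dataset: validating the input, deriving timestamp variables, grouping per user and per location, and extracting one location per user. It is a conceptual illustration only. It does not call the package's functions, the column names are assumptions, and counting data points stands in for a real scoring phase.

```r
# Toy input; column names are illustrative only.
pings <- data.frame(
  user = c("u1", "u1", "u1", "u1", "u2", "u2", "u2"),
  location = c("A", "A", "B", "A", "C", "D", "D"),
  created_at = as.POSIXct(
    c("2021-03-01 23:10:00", "2021-03-02 07:30:00", "2021-03-02 13:00:00",
      "2021-03-06 22:45:00", "2021-03-03 12:15:00", "2021-03-03 21:00:00",
      "2021-03-07 08:20:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Step 1: validate that the required variables are present.
required <- c("user", "location", "created_at")
missing_vars <- setdiff(required, names(pings))
if (length(missing_vars) > 0) {
  stop("Missing variables: ", paste(missing_vars, collapse = ", "))
}

# Step 2: derive intermediate variables from the timestamp.
pings$hour <- as.integer(format(pings$created_at, "%H"))
pings$wday <- as.integer(format(pings$created_at, "%u"))  # 1 = Monday, 7 = Sunday
pings$is_weekend <- pings$wday >= 6

# Step 3: "nest" per user, then per location within each user.
by_user <- split(pings, pings$user)
by_user_location <- lapply(by_user, function(d) split(d, d$location))

# Step 4: summarise each inner table, then extract one location per user.
# Counting points is a deliberately simple stand-in for a scoring phase.
home <- vapply(
  by_user_location,
  function(locs) names(which.max(vapply(locs, nrow, integer(1)))),
  character(1)
)
print(home)
```

Because each element of by_user_location is independent of the others, the per-user step could in principle be run in parallel, for example by swapping lapply for parallel::mclapply on platforms that support forking.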

More details about the package and its source code are available in its GitHub repository: https://github.com/spatialnetworkslab/homelocator. Documentation for the package's functions and a tutorial on its use can be found on its documentation website: https://homelocator-website.netlify.app. In addition, we take Singapore as a test case to illustrate and evaluate the package and the proposed approach. The dataset used is de-identified following the workflow shown in Figure 2, and it can be downloaded from https://figshare.com/articles/dataset/De-identified_location-based_services_dataset/13394102.

Figure 2. Workflow of de-identification approach.

Chen, Q. and Poorthuis, A. (2021) Identifying home locations in human mobility data: an open-source R package for comparison and reproducibility. International Journal of Geographical Information Science, Pages 1425-1448, https://doi.org/10.1080/13658816.2021.1887489