When I met my boyfriend, I sometimes felt a tinge of regret that we never got an adorable meet-cute story, despite living in the same city for five years and attending the same university. Our story isn’t unlike that of many couples in 2020: in a trough between COVID waves, we started speaking online, on Bumble, a dating app.
Thanks to Google’s Location History quietly ticking away in the background collecting GPS data, I thought I could find an answer to how fatefully our paths criss-crossed before we met. Just how close did we come to an eyes-meet-across-the-room moment?
Scroll down to ‘The Answer’ in the table of contents on the left if you’re not interested in the technical stuff!
You can request and download all your recorded location history through Google Takeout (also reachable from the Google Maps settings on desktop). The data comes as a zipped folder full of JSON files. My partner kindly agreed to download his data and send it to me.
packages <-
c("tidyverse",
"lubridate",
"rjson",
"ggplot2",
"ggmap",
"wesanderson")
install.packages(setdiff(packages, rownames(installed.packages())))
library(tidyverse)
library(lubridate)
library(rjson)
library(ggplot2)
library(ggmap)
library(wesanderson)
#Names of two people
x <- "Chan"
y <- "Dan"
#Records file addresses (reference the Records.json file)
data_x <-
"~/Crossing paths/TakeoutChannon/Location History/Records.json"
data_y <-
"~/Crossing paths/TakeoutDandre/Location History/Records.json"
#Function using fromJSON (from the rjson package) to read the JSON files into R list objects
extract_maps_data <- function(fileName, saveAs) {
rd <- fromJSON(file = fileName)
records_data <- unlist(rd, recursive = FALSE)
save(records_data, file = saveAs)
records_data
}
#Running the extract function on the data if it hasn't been run before, otherwise loading it in
if (file.exists("extracted_data_x.RData")) {
records_data_x <- get(load("extracted_data_x.RData"))
} else {
records_data_x <- extract_maps_data(fileName = data_x,
saveAs = "extracted_data_x.RData")
}
if (file.exists("extracted_data_y.RData")) {
records_data_y <- get(load("extracted_data_y.RData"))
} else {
records_data_y <- extract_maps_data(fileName = data_y,
saveAs = "extracted_data_y.RData")
}
Before we jump into transforming and cleaning the data, it’s necessary to set some boundaries around time frames and accuracy so that we don’t spend time processing unnecessary records.
This table displays how the distance accuracy measures relate to each other.
accuracy.level | 1 | 2 | 3 | 4 |
---|---|---|---|---|
decimal places | 3 | 4 | 5 | 6 |
accuracy (m) | 111 | 11.1 | 1.11 | 0.111 |
coordinates | +/- 00.000 | +/- 00.0000 | +/- 00.00000 | +/- 00.000000 |
The data has two types of accuracy indicators. The GPS coordinates have differing numbers of decimal places depending on the specificity of the location. There is also a reported accuracy measurement, in meters, based on the strength of the GPS signal at the time.
These accuracy measures define the margin of error in the distance for any of the conclusions that we draw. That is, when we potentially conclude that he and I did cross paths, did we come within tens, hundreds, or thousands of meters of each other? I initially chose 11.1 m, corresponding to coordinates with 4 decimal places (+/- 00.0000), because even in dense urban areas two people separated by roughly the length of a telephone pole have a tantalizingly high probability of being within sight of each other. However, the quality of the data (see below) limited the records too much, forcing me to use a distance accuracy of 111 m instead.
I’ve paired the two types of accuracy measures so that the distance accuracy is standardised for all our insights. Coordinates with more than the stated number of decimal places are rounded. This means that we’re dividing the world into bigger blocks and moving any point in the middle of a block to its nearest corner. Then, any record whose reported accuracy is worse (larger) than the related distance in meters is filtered out of the data set.
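To make the rounding concrete, here is a tiny illustration with a made-up coordinate (note that, away from the equator, the east-west size of a block is somewhat smaller than 111 m):
#Rounding to 3 decimal places snaps a point onto a grid of roughly 111 m, since one degree of latitude is ~111 km
lat <- -25.747512 #an arbitrary example coordinate
round(lat, 3) #-25.748, the nearest grid "corner"
111000 * 10^-3 #approximate north-south block size in meters at accuracy level 1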
#Distance margins: decimal places and the matching reported-accuracy threshold, per the table above
accuracy_level <- 1
distance_accuracy <- c(3, 4, 5, 6)[accuracy_level] #decimal places to round coordinates to
google_reported_accuracy <- c(111, 11.1, 1.11, 0.111)[accuracy_level] #largest reported accuracy radius (m) to keep
We also need to define the time boundaries of the two data sets. The start date is the earliest date on which we had any real probability of coming into contact. This could also be set to the minimum date in either of the two sets if there’s no clear earliest possible contact point, like moving to the same city.
## [1] "Chan 's earliest record is 2014-10-11"
## [1] "Dan 's earliest record is 2015-01-13"
## [1] "Use 2015-01-13 or first_intersect_date to start at earliest intersecting dates."
start_looking_from <- as.POSIXct(first_intersect_date) #Date to start analysing the data from
There is also an end date: the date and time at which we know for certain that our GPS coordinates matched up for the first time.
best_day <- as.POSIXct("2020/08/08 1:00:00") #The confirmed first meeting <3
As with the distance margins, it’s helpful to set time intervals to group coordinates into. On a day-to-day basis people simply don’t move fast enough across the surface of the earth for it to be necessary, at a scale of meters, to record their location many times a second, yet the data is stored with millisecond time stamps.
I chose 10 minutes because it was a margin I was happy to have on either side of my insights, i.e., we may have sat on the same bench 10 minutes apart.
#Time margins
time_accuracy <- "10 mins" #The granularity of time buckets. Use ?round_date to see time categories available.
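For example, this is what the 10-minute bucketing does to a single timestamp:
#lubridate's round_date rounds to the nearest 10-minute mark
round_date(as.POSIXct("2017-08-16 12:07:31"), unit = "10 mins") #rounds to 2017-08-16 12:10:00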
The tree-style lists I’ve converted the JSON files into are still unrefined. Each data set is made up of hundreds of thousands of observations, each stored as a list with the structure below:
## List of 7
## $ latitudeE7 : num -2.59e+08
## $ longitudeE7: num 2.92e+08
## $ accuracy : num 2459
## $ activity :List of 2
## $ source : chr "CELL"
## $ deviceTag : num 1.44e+09
## $ timestamp : chr "2014-10-11T14:39:52.645Z"
By the time we’re done transforming the data we want two data frames, one for each person, with the following structure:
index | timestamp | latitude | longitude | accuracy | person | coordinates | source |
---|---|---|---|---|---|---|---|
1:nrows | POSIXct time | dd coords | dd coords | m radius | name | lat, long | signal source |
To do that we’ll run the following function on both data sets:
transform_maps_data <-
function(x,
person,
time_accuracy,
distance_accuracy) {
# Extracting the variables that we're interested in and dropping all the information related to whether Google thinks we were in a vehicle or sitting still.
timestamp <-
lapply(x, pluck, "timestamp") %>% unlist() %>% data.frame()
latitude <-
lapply(x, pluck, "latitudeE7") %>% unlist() %>% data.frame()
longitude <-
lapply(x, pluck, "longitudeE7") %>% unlist() %>% data.frame()
accuracy <-
lapply(x, pluck, "accuracy") %>% unlist() %>% data.frame()
source <- lapply(x, pluck, "source") %>% unlist() %>% data.frame()
# Setting an index
index <- 1:nrow(timestamp)
# Combining each of the variables into a data frame as columns with clear names.
records_df <-
cbind(timestamp, latitude, longitude, accuracy, source)
rownames(records_df) <- index
colnames(records_df) <-
c("timestamp", "latitude", "longitude", "accuracy", "source")
# Adjusting the formatting and adding some key columns.
output <- records_df %>%
# Formatting the time stamps so that they are POSIXct objects in the system timezone and rounding them off into the given accuracy limits
mutate(timestamp_utc = as.POSIXct(timestamp, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")) %>%
mutate(timestamp = with_tz(timestamp_utc, tzone = Sys.timezone())) %>%
select(-timestamp_utc) %>%
mutate(timestamp = round_date(timestamp, unit = time_accuracy)) %>%
# Adding a column with the name of the individual who the records belong to
mutate(person = person) %>%
# Converting latitude and longitude to decimal degrees (easy to copy and paste into Google Maps) and rounding them to distance_accuracy decimal places
mutate(latitude = round((latitude / 10000000), distance_accuracy)) %>%
mutate(longitude = round((longitude / 10000000), distance_accuracy)) %>%
# Adding a column for combined coordinates
mutate(coordinates = paste(latitude, longitude, sep = ", ")) %>%
# Filtering out any records with a reported accuracy worse than the threshold
filter(accuracy <= google_reported_accuracy) %>%
filter(accuracy >= 0) %>%
# Filtering out any records before the start date or after the end date
filter(timestamp >= start_looking_from) %>%
filter(timestamp <= best_day) %>%
# Removing unknown sources for integrity
filter(source != "UNKNOWN") %>%
# Removing duplicates
unique()
}
#Running the function on the two data sets or otherwise loading data we've already saved before.
if (file.exists("X_Google_Maps_history.RData")) {
load("X_Google_Maps_history.RData")
} else{
x_records <-
transform_maps_data(
records_data_x,
person = x,
time_accuracy = time_accuracy,
distance_accuracy = distance_accuracy
)
}
if (file.exists("Y_Google_Maps_history.RData")) {
load("Y_Google_Maps_history.RData")
} else{
y_records <-
transform_maps_data(
records_data_y,
person = y,
time_accuracy = time_accuracy,
distance_accuracy = distance_accuracy
)
}
At the end of our transform step we have two data frames with the following structure:
## timestamp latitude longitude accuracy source person coordinates
## 1 2015-01-13 02:00:00 -25.89 29.252 12 CELL Chan -25.89, 29.252
## 3 2015-01-13 02:04:00 -25.89 29.252 12 CELL Chan -25.89, 29.252
## 4 2015-01-13 02:06:00 -25.89 29.252 12 CELL Chan -25.89, 29.252
## 5 2015-01-13 02:10:00 -25.89 29.252 12 CELL Chan -25.89, 29.252
## 7 2015-01-13 02:12:00 -25.89 29.252 27 CELL Chan -25.89, 29.252
## 8 2015-01-13 02:14:00 -25.89 29.252 27 CELL Chan -25.89, 29.252
Saving the files.
save(x_records, file = "X_Google_Maps_history.RData")
save(y_records, file = "Y_Google_Maps_history.RData")
Working with one data set is easier than working with two in this case, so we’ll combine all the records into one long data frame for analysis.
# Combining the records together
if(file.exists("combined_records.RData")) {
combined <- get(load("combined_records.RData"))
} else{
columns <-
c(
"person",
"timestamp",
"coordinates",
"latitude",
"longitude",
"accuracy",
"source"
)
combined <- full_join(x_records, y_records, by = columns) %>%
group_by(timestamp, coordinates, person) %>%
mutate(max_accuracy = min(accuracy), .keep = "unused") %>%
unique()
}
## # A tibble: 418,929 × 7
## # Groups: timestamp, coordinates, person [408,904]
## timestamp latitude longitude source person coordinates max_accuracy
## <dttm> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 2015-01-13 02:00:00 -25.9 29.3 CELL Chan -25.89, 29… 12
## 2 2015-01-13 02:10:00 -25.9 29.3 CELL Chan -25.89, 29… 12
## 3 2015-01-13 02:20:00 -25.9 29.3 CELL Chan -25.89, 29… 27
## 4 2015-01-13 02:30:00 -25.9 29.3 CELL Chan -25.89, 29… 27
## 5 2015-01-13 02:40:00 -25.9 29.3 CELL Chan -25.89, 29… 27
## 6 2015-01-13 02:50:00 -25.9 29.3 CELL Chan -25.89, 29… 27
## 7 2015-01-13 03:00:00 -25.9 29.3 CELL Chan -25.89, 29… 27
## 8 2015-01-13 03:10:00 -25.9 29.3 CELL Chan -25.89, 29… 25
## 9 2015-01-13 03:20:00 -25.9 29.3 CELL Chan -25.89, 29… 25
## 10 2015-01-13 03:30:00 -25.9 29.3 CELL Chan -25.89, 29… 25
## # … with 418,919 more rows
save(combined, file = "combined_records.RData")
arrange(combined, coordinates) %>% head(5)
## # A tibble: 5 × 7
## # Groups: timestamp, coordinates, person [5]
## timestamp latitude longitude source person coordinates max_accuracy
## <dttm> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 2018-10-20 16:00:00 -25.3 31.0 WIFI Dan -25.344, 31… 17
## 2 2018-10-20 16:10:00 -25.3 31.0 WIFI Dan -25.344, 31… 17
## 3 2015-12-29 09:50:00 -25.4 29.4 WIFI Chan -25.395, 29… 96
## 4 2015-12-29 10:00:00 -25.4 29.4 WIFI Chan -25.395, 29… 51
## 5 2015-12-29 10:20:00 -25.4 29.4 WIFI Chan -25.395, 29… 58
An analysis of how many records are available for whom, and for which periods, shows that 98% of Chan’s periods had records that met the accuracy standards. In comparison, 45% of Dan’s periods contained records.
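The code for that coverage analysis isn’t shown here; a minimal sketch of how the figures could be computed is the share of 10-minute intervals between the start and end dates that contain at least one qualifying record:
#Sketch: percentage of 10-minute intervals with at least one record per person
n_intervals <- length(seq(from = start_looking_from, to = best_day, by = "10 min"))
chan_coverage <- n_distinct(x_records$timestamp) / n_intervals * 100
dan_coverage <- n_distinct(y_records$timestamp) / n_intervals * 100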
Gaps in the records could be due to a lack of mobile data (which is notably expensive in South Africa) or our cell phones being turned off. There may also be gaps where privacy settings disallowed the collection of location history.
These factors mean that, no matter the result, there will still be some uncertainty: during the unrecorded periods we may have crossed paths without any evidence of it existing.
Analysing the source of the records shows that most of the collected records came from instances where the cell phone was connected to Wi-Fi.
However, when considering the number of unique coordinates per source, it’s evident that most of the unique locations came from GPS, and that Wi-Fi and cell records were tied to a handful of locations. Only 6747 Wi-Fi routers accounted for all 356825 Wi-Fi records, and just 256 cell towers with unique coordinates made up the 3239 cell records.
We can understand this better by looking at the number of records per unique location by source. Wi-Fi had the greatest number of records per location because many records within range of a Wi-Fi router were centralised to the router’s location while a phone was connected to it. The same is true for cell towers, though to a lesser extent. GPS records were the most likely to be recorded with higher accuracy.
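That summary isn’t produced by code shown here; one way it could be computed from the combined data frame is:
#Sketch: unique locations, records, and records per location for each source
combined %>%
  group_by(source) %>%
  summarise(unique_locations = n_distinct(coordinates),
            unique_records = n(),
            records_per_location = round(n() / n_distinct(coordinates), 1))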
## source unique_locations unique_records records_per_location
## 1 CELL 256 3239 12.7
## 2 GPS 16303 58865 3.6
## 3 WIFI 6747 356825 52.9
Finally, it’s important to consider the differences in the accuracy of different sources. Along with being the most likely to record coordinates away from centralised Wi-Fi routers or cell towers, GPS records are also on average the most accurate.
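Again, a sketch of how that comparison could be made from the combined data frame, using the best reported accuracy kept per record:
#Sketch: average reported accuracy (in meters) per source; lower is better
combined %>%
  group_by(source) %>%
  summarise(mean_accuracy_m = mean(max_accuracy))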
This is an initial view of some of the most common places we both visited.
#Group the data together by places
places <- combined %>%
group_by(coordinates, person) %>%
summarise(latitude, longitude, timestamp, source, records = n()) %>%
arrange(desc(records))
#Find all the places x has been
x_places <- places %>%
filter(person == x)
#Find all places y has been
y_places <- places %>%
filter(person == y)
#Look for the intersection of places that both x and y had been
if (file.exists("both_places.RData")) {
both_places <- get(load("both_places.RData"))
} else {
both_places <-
full_join(x_places,
y_places,
by = c("latitude", "longitude", "coordinates", "source")) %>%
filter(!is.na(records.x)) %>%
filter(!is.na(records.y)) %>%
mutate(both_records = records.x + records.y) %>%
mutate(x_per = (records.x / both_records) * 100) %>%
mutate(y_per = (records.y / both_records) * 100) %>%
mutate(dif_per = x_per - y_per) %>%
# mutate(time_dif = abs(difftime(timestamp.x, timestamp.y, units = "auto"))) %>%
mutate(time_dif = abs(timestamp.x - timestamp.y)) %>%
unique() %>%
arrange(desc(both_records))
}
save(both_places, file = "both_places.RData")
# Get Map data from Google API
# Note: you'll need to get a key on the Google Maps platform to access the API.
# Go to https://mapsplatform.google.com/ -> get started -> credentials -> API keys
# Find Maps API Key and SHOW KEY, then copy and paste your API key below:
# register_google(key = "[your key]", write = TRUE)
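The colour vectors two_colors and three_colors used in the plots below aren’t defined in this excerpt; any short colour vectors will do. A stand-in using the already-loaded wesanderson package could be:
#Hypothetical palette definitions (not part of the original post)
two_colors <- wes_palette("Darjeeling1", n = 2)
three_colors <- wes_palette("Zissou1", n = 3, type = "continuous")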
#plot the map, scaling by number of records, and coloring by who visited most frequently
map_most_common_places <- get_map(
location = c(
lon = median(both_places$longitude),
lat = median(both_places$latitude)
),
zoom = 12,
scale = "auto",
maptype = "toner-lite"
)
common_places_map <- ggmap(map_most_common_places) +
geom_point(
aes(
x = longitude,
y = latitude,
size = both_records,
color = dif_per
),
data = both_places,
alpha = 0.8,
position = "jitter"
) +
scale_radius(range = c(1, 10),
name = "Number of records") +
scale_color_gradientn(
colors = three_colors,
name = "Most visited by person",
breaks = c(min(both_places$dif_per), 0, max(both_places$dif_per)),
labels = c(y, "both", x)
) +
theme(
text = element_text(family = "sans"),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank()
) +
ggtitle(label = "Common locations")
common_places_map
There were 2729 coordinates at which we were both recorded at some point during the five-year time frame. Most of the common locations are the main roads in and around the city. I lived and worked in the northern part of the city, which is visible in all the yellow. My partner spent most of his time living and studying around the university, visible in the big red markers.
Now that I’ve whittled down the potential locations where we could have met, we need to measure how close in time we came to being at those locations.
We’ve already calculated the time difference (time_dif) between visits in the both_places data frame, because the join pairs every timestamp of person x’s visits to a set of coordinates with every timestamp of person y’s visits to those same coordinates.
#To minimise the time it takes to run we can begin by filtering out the values with time differences of more than 12 hours (the vast majority of the records).
both_places_under_12_hours <- both_places %>%
filter(time_dif < (12 * 60 * 60))
#Then we can group by coordinate and include the instances with the minimum time differences between visits per coordinate
grouped_places <- both_places_under_12_hours %>%
group_by(coordinates) %>%
mutate(closest_visit = min(time_dif))
closest_timestamps <- grouped_places %>%
unique() %>%
filter(time_dif == closest_visit) %>%
mutate(closest_visit = as.numeric(closest_visit) / 60)
crossed_paths <- closest_timestamps %>% filter(time_dif == 0)
map_close_brushes <- get_map(
location = c(
lon = median(closest_timestamps$longitude),
lat = median(closest_timestamps$latitude)
),
zoom = 15,
scale = "auto",
maptype = "toner-lite"
)
close_brushes_map <- ggmap(map_close_brushes) +
geom_point(
aes(x = longitude, y = latitude, color = closest_visit),
data = closest_timestamps,
size = 2,
alpha = 0.7,
position = "jitter"
) +
scale_color_gradientn(colors = three_colors, name = "Minutes apart (+/- 10 mins)") +
theme(
text = element_text(family = "sans"),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank()
) +
ggtitle(label = "Close encounters") +
geom_text(
data = closest_timestamps,
aes(longitude, latitude, label = closest_visit),
color = "white",
check_overlap = TRUE,
size = 3,
hjust = 0.5,
vjust = -1.2
)
close_brushes_map
Oh, how exciting! There were 33 instances (largely around campus) where we came within 111 meters of one another, 10 minutes or less apart.
So when did these close brushes occur? Let’s plot them over time.
all_intervals <-
seq(from = start_looking_from, to = best_day, by = "10 min") %>%
as.data.frame()
colnames(all_intervals) <- "timestamp"
whole_time_x <-
full_join(all_intervals, x_records, by = "timestamp") %>%
mutate(person = x)
whole_time_y <-
full_join(all_intervals, y_records, by = "timestamp") %>%
mutate(person = y)
#I'm using the haversine formula to calculate the distance between two points on the earth's spherical surface, because it remains accurate at the scale of meters.
haversine_calc <- function(lat.x, lat.y, lon.x, lon.y) {
#radius of the earth in km
r <- 6371
#converting degrees to radians
r.lat.x <- (lat.x * pi) / 180
r.lat.y <- (lat.y * pi) / 180
r.lon.x <- (lon.x * pi) / 180
r.lon.y <- (lon.y * pi) / 180
dlat <- r.lat.y - r.lat.x
dlon <- r.lon.y - r.lon.x
a <- sin(dlat / 2) ** 2 + cos(r.lat.x) * cos(r.lat.y) * sin(dlon / 2) ** 2
c <- 2 * asin(sqrt(a))
d <- c * r
}
whole_time_both <-
full_join(whole_time_x, whole_time_y, by = "timestamp") %>%
filter(!is.na(coordinates.x)) %>%
filter(!is.na(coordinates.y)) %>%
mutate(
distance = haversine_calc(
lat.x = latitude.x,
lat.y = latitude.y,
lon.x = longitude.x,
lon.y = longitude.y
) * 1000
) %>%
arrange(distance)
whole_time_both$accuracy <-
whole_time_both$accuracy.x + whole_time_both$accuracy.y
whole_time_simple <- whole_time_both %>%
select(timestamp, distance, accuracy) %>%
mutate(day = as.Date(timestamp)) %>%
group_by(day) %>%
transmute(min_dist = min(distance), min_acc = min(accuracy)) %>%
mutate(closest_brush = (min_dist == 0)) %>%
unique()
ggplot(whole_time_simple,
aes(x = day, y = min_dist, color = closest_brush)) +
geom_point(
alpha = 0.5,
position = "jitter",
fill = "white",
size = 3
) +
scale_color_manual(values = two_colors, name = "Closest brush?") +
theme_classic() +
labs(x = "Date",
y = "Distance from each other (meters)",
title = "Distance apart over time") +
ylim(0, 2500)
Between 2016 and 2017, while I was still studying, the distance between us on any given day tended to be smaller than in 2018, after I stopped going to campus. This makes me think that our best chance of meeting in the wild would have been on campus, and that our chances declined after that.
#Finding the dates that we were at 0m(+/- 111m) from each other
dates_crossed_paths <- whole_time_both %>%
filter(distance == 0) %>%
mutate(day = as.Date(timestamp))
# %>%
# filter(!day %in% problem_dates)
#Visualising it
map_close_distances <- get_map(
location = c(lon = median(
c(
dates_crossed_paths$longitude.x,
dates_crossed_paths$longitude.y
)
),
lat = median(
c(
dates_crossed_paths$latitude.x,
dates_crossed_paths$latitude.y
)
)),
zoom = 17,
scale = "auto",
maptype = "toner-lite"
)
ggmap(map_close_distances) +
geom_point(
aes(x = longitude.y, y = latitude.y),
data = dates_crossed_paths,
color = two_colors[2],
size = 1,
alpha = 0.5,
position = "jitter"
) +
geom_point(
aes(x = longitude.x, y = latitude.x),
data = dates_crossed_paths,
color = two_colors[1],
size = 1,
alpha = 0.5,
position = "jitter"
) +
theme_void() +
labs(title = "Crossing paths on campus") +
facet_wrap( ~ day, ncol = 4, nrow = 8)
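The extrapolation below scales the recorded encounters by the share of 10-minute intervals, up to the start of 2018 (when I left campus), that actually have records for both of us. Two of its inputs aren’t defined in this excerpt; plausible definitions, given how they’re used, would be:
#total_time and crossed_paths_n aren't defined above; these are assumed stand-ins
total_time <- nrow(all_intervals) #every 10-minute interval between start_looking_from and best_day
crossed_paths_n <- nrow(crossed_paths) #recorded 0 m (+/- 111 m) encounters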
total_time_in_uni <- ((total_time/144) - as.numeric(as.Date(best_day) - as.Date("2018/01/01")))*(24*6) %>%
as.numeric()
both_records_time <- whole_time_both$timestamp %>% unique() %>% length()
per_time_with_records <- both_records_time/total_time_in_uni
extrapolated <- crossed_paths_n / per_time_with_records %>% round(digits = 1)
There were 33 recorded occasions on which we may have crossed paths, mostly on campus. A whole bunch of times we may have been within arm’s reach of each other, and we never even registered it. Considering that only about 79% of our time (a proportion of 0.7851364) had records that met the given accuracy standards, extrapolating out to the start of 2018, when I stopped going to the university campus (and making a few iffy assumptions about the records representing a random sample), we could have crossed paths 41.25 times!
It makes me feel very philosophical. We likely walked past each other every other week for three years with our heads down, or in the clouds, paying little attention to the people around us. Past Channon had not the foggiest clue what impact an unnoticed stranger would come to have on her. Since learning this, I can’t help but look at the passers-by in my day-to-day life in a different light.
I’m glad to have gone down this rabbit hole, because now I think about the 41.25 near-misses we had, where an audience watching might have been on the edge of their seats. On the 16th of August 2017, exactly one week short of three years before we would meet, we were recorded at the same GPS coordinates in this spot where I frequently had lunch, near a coffee booth in the social sciences building. Did he just trot past on the stairs? Did he take a seat in the shade under the trees? Did we queue together in line for coffee? I guess Google doesn’t have the answer to everything, but it’s pleasant to imagine, as we have our coffee together every morning, that years before we may have silently enjoyed coffee near each other.
D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. URL http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf
A. Macarulla Rodriguez, C. Tiberius, R. Bree and Z. Geradts (2018). Google timeline accuracy assessment and error prediction. Forensic Sciences Research, 3, 240-255. DOI: 10.1080/20961790.2018.1509187
https://www.movable-type.co.uk/scripts/latlong.html
https://www.geeksforgeeks.org/program-distance-two-points-earth/