Find All Coordinates Within A Circle In Geographic Data In Python

July 30, 2023 Post a Comment

I've got millions of geographic points. For each one of these, I want to find all 'neighboring points,' i.e., all other points within some radius, say a few hundred meters. There

Solution 1:

scipy

First things first: there are preexisting algorithms to do things kind of thing, such as the k-d tree. Scipy has a python implementation cKDtree that can find all points in a given range.

Binary Search

Depending on what you're doing however, implementing something like that may be nontrivial. Furthermore, creating a tree is fairly complex (potentially quite a bit of overhead), and you may be able to get away with a simple hack I've used before:

Compute the PCA of the dataset. You want to rotate the dataset such that the most significant direction is first, and the orthogonal (less large) second direction is, well, second. You can skip this and just choose X or Y, but it's computationally cheap and usually easy to implement. If you do just choose X or Y, choose the direction with greater variance.
Sort the points by the major direction (call this direction X).
To find the nearest neighbor of a given point, find the index of the point nearest in X by binary search (if the point is already in your collection, you may already know this index and don't need the search). Iteratively look to the next and previous points, maintaining the best match so far and its distance from your search point. You can stop looking when the difference in X is greater than or equal to the distance to the best match so far (in practice, usually very few points).
To find all points within a given range, do the same as step 3, except don't stop until the difference in X exceeds the range.

Effectively, you're doing O(N log(N)) preprocessing, and for each point roughly o(sqrt(N)) - or more, if the distribution of your points is poor. If the points are roughly uniformly distributed, the number of points nearer in X than the nearest neighbor will be on the order of the square root of N. It's less efficient if many points are within your range, but never much worse than brute force.

One advantage of this method is that's it all executable in very few memory allocations, and can mostly be done with very good memory locality, which means that it performs quite well despite the obvious limitations.

Delauney triangulation

Another idea: a Delauney triangulation could work. For the Delauney triangulation, it's given that any point's nearest neighbor is an adjacent node. The intuition is that during a search, you can maintain a heap (priority queue) based on absolute distance from query point. Pick the nearest point, check that it's in range, and if so add all its neighbors. I suspect that it's impossible to miss any points like this, but you'd need to look at it more carefully to be sure...

Solution 2:

Tipped off by Eamon, I've come up with a simple solution using btrees implemented in SciPy.

from scipy.spatial import cKDTree
from scipy import inf

max_distance = 0.0001# Assuming lats and longs are in decimal degrees, this corresponds to 11.1 meters
points = [(lat1, long1), (lat2, long2) ... ]
tree = cKDTree(points)

point_neighbors_list = [] # Put the neighbors of each point herefor point in points:
    distances, indices = tree.query(point, len(points), p=2, distance_upper_bound=max_distance)
    point_neighbors = []
    for index, distance inzip(indices, distances):
        if distance == inf:
            break
        point_neighbors.append(points[index])
    point_neighbors_list.append(point_neighbors)

Python Playground