The project was a collaboration between the Boston Planning & Development Agency (BPDA) and Boston University Spark! to create a comprehensive database of businesses in the Boston area.

The BPDA wanted a database of Boston businesses in order to understand the profiles of businesses in the city and to see when they closed during the COVID-19 lockdowns. We drew from public API sources such as Bing, Yelp, and Yellowpages, querying each one with the list of all Boston-area addresses provided by the City of Boston's SAM Live Address database.

The pipeline for creating the database was as follows:
1) A dataset was scraped from each public API source.
2) Each dataset was geocoded for zip codes using the Google geocoder API.
3) Each dataset was cleaned by standardizing the data types of its numeric and string fields.
4) The datasets were merged two at a time, in batches of zip codes. For each batch, the two datasets were first merged on zip code using Dask. The haversine distance between all possible business combinations in that batch was then calculated, and only combinations less than 18 meters apart were kept. Finally, a fuzzy similarity ratio was calculated for the remaining combinations, and only entries with over 85% string similarity were kept. These steps greatly sped up the matching of entries that carried different information in two sources (for example, Yelp and Bing) but were actually the same business.
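
The matching logic in step 4 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's actual code: a dictionary grouped by zip code stands in for the Dask merge, `difflib.SequenceMatcher` stands in for whichever fuzzy matching library was used, the similarity is assumed to be computed on business names, and the field names (`name`, `zipcode`, `lat`, `lon`), function names, and sample records are all made up for the example.

```python
import math
from collections import defaultdict
from difflib import SequenceMatcher

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def name_similarity(a, b):
    """Similarity ratio in [0, 1]; a stand-in for the fuzzy ratio (assumption)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_businesses(left, right, max_dist_m=18.0, min_similarity=0.85):
    """Pair records that share a zip code, lie within max_dist_m, and have similar names."""
    # Group the right-hand records by zip code so that only records in the
    # same zip code are ever compared (the role the Dask merge plays above).
    by_zip = defaultdict(list)
    for rec in right:
        by_zip[rec["zipcode"]].append(rec)

    matches = []
    for l_rec in left:
        for r_rec in by_zip.get(l_rec["zipcode"], []):
            dist = haversine_m(l_rec["lat"], l_rec["lon"], r_rec["lat"], r_rec["lon"])
            if dist >= max_dist_m:
                continue  # distance filter runs first
            sim = name_similarity(l_rec["name"], r_rec["name"])
            if sim > min_similarity:
                matches.append((l_rec["name"], r_rec["name"], dist, sim))
    return matches


# Made-up sample records, for illustration only.
yelp_sample = [
    {"name": "Harbor Street Coffee Roasters", "zipcode": "02116", "lat": 42.35010, "lon": -71.07660},
]
bing_sample = [
    # Same business, slightly different name and coordinates: should match.
    {"name": "Harbor Street Coffee Roaster", "zipcode": "02116", "lat": 42.35015, "lon": -71.07655},
    # Same location, unrelated name: fails the similarity filter.
    {"name": "Beacon Hardware", "zipcode": "02116", "lat": 42.35010, "lon": -71.07660},
    # Same name, roughly 110 m away: fails the distance filter.
    {"name": "Harbor Street Coffee Roasters", "zipcode": "02116", "lat": 42.35110, "lon": -71.07660},
    # Same name and location, different zip code: never compared.
    {"name": "Harbor Street Coffee Roasters", "zipcode": "02118", "lat": 42.35010, "lon": -71.07660},
]

matches = match_businesses(yelp_sample, bing_sample)
for left_name, right_name, dist, sim in matches:
    print(f"{left_name!r} ~ {right_name!r}: {dist:.1f} m apart, similarity {sim:.2f}")
```

Running the distance filter before the string comparison keeps the more expensive fuzzy matching limited to businesses that are already close together, and restricting comparisons to a single zip code avoids comparing every record in one source against every record in the other.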