The project is a collaboration between the Metropolitan Area Planning Council (MAPC) and the Boston University Spark lab. The MAPC wanted to de-duplicate its Craigslist rental listings data efficiently and then classify multi-unit listings to separate them from duplicate listings.

I've been leading a team of students in using K-means location clustering combined with string similarity to de-duplicate the listings. To distinguish multi-unit listings from duplicates, we use a combination of string similarity, rent price, and location similarity. In the current iteration, we are building a labeling assistant that helps the MAPC's volunteers and interns label the data as multi-unit or duplicate using certain attribute rules.
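
To make the approach concrete, here is a minimal sketch of the idea, not the project's actual code. It groups listings by location with a small hand-written K-means, then compares titles only within each cluster. The field names (`id`, `lat`, `lon`, `title`, `rent`), the sample listings, the 0.85 similarity threshold, and the rule "same rent means duplicate, different rent means multi-unit" are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def kmeans(points, k, iterations=20):
    """Plain Lloyd's K-means on (lat, lon) tuples; returns one cluster index per point."""
    # Seed with the first k points (a simplification; real code would seed more carefully).
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels


def classify_pairs(listings, k=2, text_threshold=0.85):
    """Return (id_a, id_b, kind) for similar listings that share a location cluster."""
    labels = kmeans([(item["lat"], item["lon"]) for item in listings], k)
    results = []
    for i, j in combinations(range(len(listings)), 2):
        if labels[i] != labels[j]:
            continue  # different location clusters: never compared
        a, b = listings[i], listings[j]
        similarity = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
        if similarity < text_threshold:
            continue  # titles too different to be related
        # Assumed rule: matching text and rent is a duplicate; matching text
        # with a different rent is treated as another unit in the same building.
        kind = "duplicate" if a["rent"] == b["rent"] else "multi-unit"
        results.append((a["id"], b["id"], kind))
    return results


# Hypothetical sample data, invented for this sketch.
SAMPLE_LISTINGS = [
    {"id": 1, "lat": 42.3505, "lon": -71.1054, "title": "Sunny 2BR near BU, heat included", "rent": 2400},
    {"id": 2, "lat": 42.3506, "lon": -71.1055, "title": "Sunny 2BR near BU, heat included!", "rent": 2400},
    {"id": 3, "lat": 42.3505, "lon": -71.1054, "title": "Sunny 2BR near BU, heat included", "rent": 2650},
    {"id": 4, "lat": 42.3960, "lon": -71.1220, "title": "Spacious studio in Davis Square", "rent": 1800},
]

if __name__ == "__main__":
    for pair in classify_pairs(SAMPLE_LISTINGS, k=2):
        print(pair)
```

Clustering first keeps the pairwise string comparison restricted to listings that are near each other, which is what makes the de-duplication tractable on a large dataset. A labeling assistant could surface pairs like these to a volunteer for confirmation rather than applying the rule automatically.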

View Source Code