Incremental Web Crawler

Jan 1, 2025 · 1 min read

Project Overview

During my time at JLL, I developed a robust data pipeline to monitor commercial asset data in real-time.

Key Achievements

  • High-Volume Data Collection: Developed Python-based web crawlers to collect over 20,000 data points from various real estate sources.
  • Real-time Monitoring: Implemented Redis for incremental weekly updates and real-time monitoring, ensuring data freshness.
  • Data Quality Control: Built an automated cleaning program that identified limitations in token-based classification and proposed a geo-coordinate cross-verification improvement.
  • Impact: Corrected over 700 inaccuracies in the company database of 13,000+ entries.
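The incremental-update idea can be sketched as follows: keep a content hash per record in Redis, and on each weekly run re-crawl only the records whose hash has changed. This is a minimal illustration, not the production pipeline; the record fields, key prefix, and `DictStore` stub are hypothetical (in production the `store` would be a `redis.Redis` client).

```python
import hashlib

def filter_changed(records, store, prefix="listing:"):
    """Return only new or changed records, updating stored content hashes.

    `store` is any object exposing get/set, e.g. a redis.Redis client;
    a dict-backed stub (below) works for demonstration.
    """
    changed = []
    for rec in records:
        key = prefix + rec["id"]
        # Hash a canonical representation of the record's fields.
        digest = hashlib.sha256(repr(sorted(rec.items())).encode()).hexdigest()
        if store.get(key) != digest:
            store.set(key, digest)   # remember the new state
            changed.append(rec)      # only this record needs reprocessing
    return changed

class DictStore:
    """In-memory stand-in for Redis GET/SET, for testing without a server."""
    def __init__(self):
        self._d = {}
    def get(self, key):
        return self._d.get(key)
    def set(self, key, value):
        self._d[key] = value
```

On the first pass every record is "changed"; on subsequent passes only records whose content actually differs are returned, which is what keeps the weekly refresh cheap.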
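The geo-coordinate cross-verification can be illustrated with a haversine check: if an entry's recorded coordinates sit too far from a reference geocode of its address, flag it for review. The field names and the 1 km threshold are assumptions for the sketch.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km: mean Earth radius

def flag_mismatches(entries, max_km=1.0):
    """Return entries whose recorded coordinates disagree with a reference
    geocode by more than max_km (field names are hypothetical)."""
    return [e for e in entries
            if haversine_km(e["lat"], e["lon"], e["ref_lat"], e["ref_lon"]) > max_km]
```

Unlike token-based classification, this catches entries whose text fields look plausible but whose location is simply wrong.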

Technologies Used

  • Python (Scrapy, Selenium, Pandas)
  • Redis
  • SQL