I'm looking to buy a house. I've got some requirements: a garage, more than one bedroom, a price range, and so on. Recently I set up Huginn, a system that automates online tasks. For this project, I wanted it to monitor for new real estate listings, look up a commute time for each one, and send me an email every half hour if it found any.

Step 1: Pick a real estate site. I chose Zillow for no reason other than that it was big; I think they all pull from the same database anyway. Once I picked one, I set up a search: I selected all my criteria, including drawing a border around where I wanted houses. All of these criteria get encoded into the URL of the search page. To check this, copy and paste the URL into an incognito/private window. If it still shows all your criteria, you're golden.

Step 2: Figure out where the listings are. To do this, I used a Chrome extension called Selector Gadget to find where each listing sits in the page's HTML. It gives you a CSS selector, which Huginn then uses to scrape the page.

(Image: Zillow search results, showing the .list-card_not-saved class on each listing card.)

Step 3: Start building up the scraper. I didn't find the documentation for the Website Agent to be that great, but I did figure it out for this project. I had problems with the URL at first, because Huginn uses Liquid formatting and the URL contained curly braces; basically, I had to wrap the URL in {% raw %}URL{% endraw %} tags. The extract section eluded me for a bit. If you use a CSS selector, the path needs to be in a certain order. I first keyed on the class .list-card_not-saved, and from there you can go deeper on IDs, classes, or tag names; classes are prefixed with a period, IDs with a hash, and tags are just the name. The value can be a number of different things: if you want an attribute of the tag, use @name, so @href for a link or @src for an image source; if you want what's inside the tag, say the link text, use string(.). With that, I got the scraper below. It pulls the address, the image, the link, and the price.

(Image: The details of my scraper; every CSS selector starts with .list-card_not-saved, then adds further specificity.)
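For reference, a minimal Website Agent configuration along these lines might look like the sketch below. The {% raw %} tags keep Liquid from trying to interpret the curly braces in the search URL, and every extract entry starts from the .list-card_not-saved class. The URL and the inner selectors (.list-card-addr, .list-card-price) are stand-ins; use your own search URL from Step 1 and whatever Selector Gadget reports for your page.

{
  "expected_update_period_in_days": "1",
  "type": "html",
  "mode": "on_change",
  "url": "{% raw %}https://www.zillow.com/homes/for_sale/?searchQueryState={...your criteria...}{% endraw %}",
  "extract": {
    "title": { "css": ".list-card_not-saved .list-card-addr", "value": "string(.)" },
    "price": { "css": ".list-card_not-saved .list-card-price", "value": "string(.)" },
    "url":   { "css": ".list-card_not-saved a", "value": "@href" },
    "img":   { "css": ".list-card_not-saved img", "value": "@src" }
  }
}

With mode set to on_change, the agent skips results it has already emitted, so only new listings flow downstream.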

Step 4: Optionally find commute time. With the listings in hand, I wanted to get a commute time, so I passed the listings into another Website Agent. The thing about these scrapers is that they can't execute JavaScript, which means Google Maps is out. Bing Maps didn't work either. Bing's regular search functioned without JavaScript, but I couldn't find the travel time with the scraper. A regular Google search worked, though the class name on the result was a series of seemingly random letters, and I didn't know if it would be the same on my Huginn box as on my desktop. It turned out it was. To reference the passed-in address, you use Liquid formatting: in the Google search URL, I added {{title | url_encode}}, which takes the address (stored as title) and URL-encodes it. The query was "NewHouseAddress to WorkAddress". Make sure the agent's mode is merge, or you'll lose the passed-in information.
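A rough sketch of that second Website Agent, assuming a Google search scrape: the work address is a placeholder, and .XyzAbc stands in for whatever random-looking class name Selector Gadget finds on the travel-time element.

{
  "expected_update_period_in_days": "1",
  "type": "html",
  "mode": "merge",
  "url": "https://www.google.com/search?q={{ title | url_encode }}+to+123+Work+St+Anytown",
  "extract": {
    "time": { "css": ".XyzAbc", "value": "string(.)" }
  }
}

Because the mode is merge, the event that comes out still carries title, price, url, and img from the Zillow agent, with time added on top.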

Step 5: Formatting. I made a quick formatter that gives me the basic information. It's not pretty, but it works. "message": "<a href=\"{{url}}\"><h2>{{title}}</h2><h3><br>{{price}} {{time}}<br></h3><img src=\"{{img}}\" alt=\"house\" /></a><br><br>"
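That message template lives inside an Event Formatting Agent. A minimal sketch of its options, assuming clean mode (which replaces the incoming payload with just the instructed fields), might look like this:

{
  "instructions": {
    "message": "<a href=\"{{url}}\"><h2>{{title}}</h2><h3><br>{{price}} {{time}}<br></h3><img src=\"{{img}}\" alt=\"house\" /></a><br><br>"
  },
  "mode": "clean"
}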

Step 6: Sending the email. This is the easiest part. Just set up an Email Digest Agent that runs every 30 minutes with the formatter as its source, then fill in the subject and headline. Easy peasy.
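The digest agent's options are about as small as they get, something like the sketch below, with the agent's schedule set to every_30m and the formatter selected as its event source. The subject and headline text are just examples.

{
  "subject": "New house listings",
  "headline": "Fresh listings in the last half hour:",
  "expected_receive_period_in_days": "2"
}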

(Diagram: event flow of the finished setup: Zillow source to travel time to formatter to digest.)