There’s a reason we hire realtors

Christine Lloyd
7 min read · Sep 8, 2022

or: we don’t have a number for “curb appeal”

“Scrape some data and build a linear regressor.” As project specifications go, that’s especially sparse. But on the upside, it leaves a lot of room to build an interesting project!

The project idea:

House prices in Seattle are borderline absurd; we’ve supposedly gained $100k in assets by just living in our house for two years. (This is actually ridiculous, to be clear.) So, what makes a house price?

Why does it matter? Won’t the market determine the right price? Not necessarily. There are a lot of costs to listing a house, as well as costs that pile up while it sits unsold (which could include paying two mortgages if you’ve already closed on your new house). Price it too high, and it will sit around for days or weeks or months; you might even accept a mediocre offer out of desperation. Price it too low, and you’ve anchored its worth in buyers’ minds; even with a bidding war, it might not reach the price it could have.

So, can I build a regressor that predicts housing prices, allowing people to set a better price for their initial listing, maximizing their gains while minimizing time on the market? (Narrator: she couldn’t.)

Scraping Redfin: maybe just don’t.

I probably shouldn’t have been surprised to discover that web scraping is a contentious legal space and can be a rather adversarial process with the site you’re scraping — after all, you’re using their pages in ways they didn’t intend and taking their data without giving them the upside they’d planned on. I was warned that real estate sites were particularly tricky, but I had a question I actually cared about and I was in bootcamp to challenge myself!

Long story short, scraping the links to individual house listings from the search page wasn’t much of a hassle, but scraping the individual listing pages was a total pain. The listing pages are dynamically generated, with much of the information showing up only once you’ve scrolled down to it. Scrolling straight to the bottom of the page didn’t seem to populate most of the intervening content, and the length of pages varied considerably depending on things like how effusive the listing realtor was and whether certain features applied. In the end, I had selenium navigate a browser to the page, scroll, pause, repeat the scroll and pause a dozen times, wait a few seconds for things to finish loading, get the page source, hang on for 15–30 seconds, and then do it again, stopping for around 5–10 minutes after every so many pages.
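For the curious, here’s a minimal sketch of that scroll-pause-repeat loop. The timings mirror what I just described, but the function name, the page count for the longer breaks, and the assumption that you already have a list of listing URLs are all illustrative rather than my exact code.

```python
import random
import time

from selenium import webdriver


def scrape_listing_pages(listing_urls, scrolls=12, scroll_pause=2.0):
    """Return {url: page_source} for each listing, scrolling as we go."""
    driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
    sources = {}
    try:
        for count, url in enumerate(listing_urls, start=1):
            driver.get(url)
            # Scroll in increments so lazily-loaded sections populate.
            for _ in range(scrolls):
                driver.execute_script("window.scrollBy(0, window.innerHeight);")
                time.sleep(scroll_pause)
            time.sleep(5)  # let the last sections finish loading
            sources[url] = driver.page_source
            time.sleep(random.uniform(15, 30))  # hang on between pages
            if count % 100 == 0:
                time.sleep(random.uniform(300, 600))  # longer break every so often
    finally:
        driver.quit()
    return sources
```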

3000-ish pages took a bit more than a day. But it worked! No CAPTCHAs, no errors, no hassle! Except . . .

Checking your code is more than skimming for obvious flaws.

After the test code worked, I copied it to a new cell and changed all the appropriate variable names and pointers. (Narrator: or so she thought.) I hit shift-enter, waited a few minutes, and then walked away, confident that I’d have my data next day.

[Image: a table of house data consisting of the same house’s information repeated over and over]

Any guesses about what I did?

Yup, when I’d been running test code, a variable named “scrape” had been pointing to that particular webpage’s BeautifulSoup item in memory. Either that variable had been defined in a prior cell or I’d deleted the line defining “scrape” instead of re-defining it. So for 26 hours, my computer had been navigating to new pages, scraping them, and then throwing the data away and parsing the same item in memory over and over again.
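Here’s a toy reconstruction of the mistake; the variable names and HTML are made up, but the shape of the bug is the same. The loop fetched fresh source every time and then parsed the stale `scrape` object left over from testing.

```python
from bs4 import BeautifulSoup


def parse_listing(soup):
    # Stand-in parser; the real one pulled beds, baths, sqft, and so on.
    return soup.title.string


# Left over from testing a single page:
scrape = BeautifulSoup("<html><title>123 Test St</title></html>", "html.parser")

page_sources = {
    "listing_1": "<html><title>456 Real Ave</title></html>",
    "listing_2": "<html><title>789 Other Rd</title></html>",
}

buggy = [parse_listing(scrape) for url, source in page_sources.items()]
print(buggy)  # ['123 Test St', '123 Test St'] -- the same house, twice

fixed = []
for url, source in page_sources.items():
    scrape = BeautifulSoup(source, "html.parser")  # re-parse the fresh source
    fixed.append(parse_listing(scrape))
print(fixed)  # ['456 Real Ave', '789 Other Rd']
```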

On the upside, I started pointing and calling when going through my code, as I wrote about previously.

And after another 20-odd hours of scraping, I actually had data! Also, I had saved files containing the source code for all the pages I’d scraped so that if I needed to parse something new or check the prior parsing, it was available.
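Saving the raw HTML is only a couple of lines once you have the page sources in hand (the folder and file names below are placeholders, and `sources` is the dict from the scraping sketch above):

```python
from pathlib import Path

raw_dir = Path("raw_listing_pages")
raw_dir.mkdir(exist_ok=True)

for i, (url, source) in enumerate(sources.items()):
    # Keep the raw HTML so re-parsing later doesn't mean re-scraping.
    (raw_dir / f"listing_{i:04d}.html").write_text(source, encoding="utf-8")
```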

Munge, munge, munge the data!

Whenever there’s a human involved in data input, there are going to be interesting mistakes in the data.

Initially, I was looking at number of bedrooms, number of bathrooms, house type, square footage, year built, lot size, parking available (a major concern in some neighborhoods), location, when it sold, and the brokerage fee.

Parking available was way too varied for me to want to tackle manually and we hadn’t yet covered NLP, so that was dropped immediately. Brokerage fee is something sellers might be able to adjust, but it was so overwhelmingly set to 3% that there wasn’t likely to be much I could glean from looking at the very few cases at 2.5%.

House type unexpectedly turned out to be rather a beast.

[Image: there were on the order of 100 different types of houses in the raw listings]

At first, I was going to separate houses out by number of stories, but that seemed a little overwhelming, and I ended up assuming that square footage plus location would together give approximately the same information. In the end, I put together a dict that would separate listings into houses, townhouses, condos, and other. The various cabins, duplexes, triplexes, more-plexes, and houseboats were discarded, as there were too few data points in any category for me to feel comfortable drawing conclusions.
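The mapping itself was nothing fancy, roughly this shape (the raw type strings below are examples, not the full list of ~100):

```python
import pandas as pd

HOUSE_TYPE_MAP = {
    "Single Family Residential": "house",
    "Single-Family Home": "house",
    "Townhouse": "townhouse",
    "Condo/Co-op": "condo",
    "Condominium": "condo",
    # ...plus the rest of the raw type strings
}

raw_types = pd.Series(["Single Family Residential", "Townhouse",
                       "Condo/Co-op", "Houseboat"])  # toy examples
house_type = raw_types.map(HOUSE_TYPE_MAP).fillna("other")
print(house_type.tolist())  # ['house', 'townhouse', 'condo', 'other']
# Listings that land in "other" (cabins, plexes, houseboats) get dropped.
```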

Lot size had some very weird outliers. 0 square feet seemed totally reasonable for a condo, but 0.31? 0.51? Most of the sub-1 sq ft listings were under “Residence, &lt;1 acre” and included photos of a house on a rolling green lawn. Ah. Data entry error strikes again! Well, at least it’s a relatively easy fix: houses with lots below a certain size have the lot size multiplied by the number of square feet per acre (43,560).

I set the cut-off at lot < 1 sq ft, as I didn’t want to balloon some lot sizes to absurd amounts. If I were doing it now, knowing what I know and with more time on my hands, I’d probably set a different cut-off, like lot size less than 1/3 the house square footage (even a tall house with tiny setbacks will have a lot larger than that). However, it seems like it worked well enough.
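The conversion itself is a one-liner in pandas, with 43,560 square feet to the acre (toy numbers below):

```python
import pandas as pd

SQFT_PER_ACRE = 43_560
lots = pd.DataFrame({"lot_size": [0.31, 0.51, 0.0, 5_000.0]})  # toy values

# Sub-1 "square foot" lots were almost certainly entered in acres;
# genuine 0 sq ft condo lots are left alone.
mask = lots["lot_size"].between(0, 1, inclusive="neither")
lots.loc[mask, "lot_size"] *= SQFT_PER_ACRE

print(lots["lot_size"].tolist())  # [13503.6, 22215.6, 0.0, 5000.0]
```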

Huge variance-explaining powers!

Plugging in a vanilla linear regression on the available data sans ZIP code gave a fabulous R-squared of 0.72, without any feature engineering required beyond one-hot encoding house type. I was chuffed, to say the least. If this is how good it is out of the box, what could we manage with some feature engineering?
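A minimal version of that kind of model looks like the sketch below. It assumes a cleaned-up DataFrame `df` of listings; the column names are assumptions, not my exact schema.

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

numeric = ["beds", "baths", "sqft", "lot_size", "year_built"]  # assumed names
categorical = ["house_type"]

X = df[numeric + categorical]
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

vanilla = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
        remainder="passthrough",
    )),
    ("regress", LinearRegression()),
])
vanilla.fit(X_train, y_train)
print(f"R-squared on held-out listings: {vanilla.score(X_test, y_test):.2f}")
```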

I added polynomial features en bloc; nothing in the pair plot cried out for nonlinear feature addition, but it’s often a good way to increase model performance. I then ran LASSO at various alpha strengths to find the best model. Since LASSO needs the features standardized first, the coefficients aren’t easily interpretable on the original scale, so I experimented with adding the most important polynomial features back into the simple linear model. I stratified ZIP codes into price-per-square-foot quintiles that were approximately balanced by number of listings and tried adding that. There were some obvious outliers that seemed very mansion-like and were priced far higher than beds+baths would suggest; I eliminated some houses with extreme values. Iterate and iterate, massage and massage.
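The polynomial-plus-LASSO step looked roughly like this sketch (continuing with the assumed columns from above): expand the numeric features, standardize them so the L1 penalty treats them all the same, and let cross-validation pick the alpha. The coefficients that come out are on the standardized scale, which is exactly why they’re awkward to read directly.

```python
from sklearn.linear_model import LassoCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

lasso_model = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
    ("lasso", LassoCV(cv=5, max_iter=50_000)),  # CV chooses the alpha
])
lasso_model.fit(X_train[numeric], y_train)

# Which engineered features survived the L1 penalty?
names = lasso_model["poly"].get_feature_names_out(numeric)
coefs = lasso_model["lasso"].coef_
survivors = sorted(zip(names, coefs), key=lambda nc: abs(nc[1]), reverse=True)
for name, coef in survivors[:10]:
    print(f"{name:>20s}  {coef:>12,.0f}")  # standardized-scale coefficients
```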

On one hand, the R-squared got as high as 0.94. On the other hand, the mean absolute error, which measures how far off the predictions are on average, never went below $100k. Most predictions were closer, but some were enormous underestimates of the sales price.
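Both numbers come from the same predictions; a high R-squared just means the model explains most of the overall spread, not that individual errors are small. Continuing with the sketch above:

```python
from sklearn.metrics import mean_absolute_error, r2_score

preds = lasso_model.predict(X_test[numeric])
print(f"R-squared: {r2_score(y_test, preds):.2f}")
print(f"Mean absolute error: ${mean_absolute_error(y_test, preds):,.0f}")
```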

Itty bitty price prediction usefulness.

Figuring out the best price for a house plus or minus $100,000 isn’t really going to help anyone.

So what’s wrong? Well, the data I gathered isn’t capturing all of the house price variance, and when we step back, it’s obvious — if curb appeal, spiffy new paint, and hip staging didn’t affect house prices, it seems like people would stop spending so much money to spruce a place up before putting it on the market. I can’t capture curb appeal, the niceness of the neighborhood (which is far more granular than “Greenlake” or “98104”), the street noise or traffic problems in the immediate area, how good the schools are, or any of these other squishy elements that make a house or a location “good.”

With a sufficiently large dataset, I might be able to develop algorithms to capture some of that, and I’m sure that’s part of what’s going on under the hood with Redfin’s price estimator or Zillow’s Zestimate. But without something telling me how appealing pictures are or which block is appealing because it’s convenient for thoroughfares and which block is unappealing because it’s directly on the thoroughfare, it’s impossible for me to capture all the house price variability.

Pivot to predicting ROI on house improvements

Back to the drawing board with a fresh train-test split and a completely vanilla linear regression on the available features; let’s just look at the contribution of various elements to house price.

All other things being equal, adding a bedroom without increasing the square footage at all is worth about -$20k. Why? That’s probably going to make the remaining space feel a lot more cramped.

An extra bathroom, on the other hand, is golden. Approximately $90k golden!

Basement space is much less valuable than ground floor space. All other things being equal, having a basement means -$80k on the price. (Anyone who’s lived in a garden-level apartment will probably agree with the market on this one.)

And each extra square foot is worth, on average, somewhere in the neighborhood of $300.
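Those dollar figures are just the coefficients of an un-scaled linear regression read off directly; each one is “dollars per unit, holding the other features fixed.” A sketch, again using the assumed columns from earlier:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Fit on unscaled features so each coefficient reads as dollars per unit,
# holding the other columns fixed. A 0/1 basement indicator would slot in
# the same way if your DataFrame has one.
reg = LinearRegression().fit(X_train[numeric], y_train)

effects = pd.Series(reg.coef_, index=numeric).sort_values()
print(effects.round(0))  # e.g. beds around -20k, baths around +90k, sqft ~300
```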

So, if you want to add a bedroom, make sure you’re also adding on to the house if you’re concerned about the subsequent sales price. If you think the place really needs another bathroom and you can afford it, hey, you might be able to recoup that cost on sale. And only finish the basement if you want to finish the basement, because future buyers aren’t going to be as thrilled with it as you are, in all likelihood.

In conclusion: there’s a reason people pay the big bucks to realtors.

All the intangibles and things like “curb appeal” mean that a professional probably has a better idea of what a given house is worth than any current algorithm, and certainly better than my algorithm. Maybe algorithms will replace realtors someday, but not yet.

