Coding a web scraper on a mainframe (metaphorically speaking)

Christine Lloyd
5 min read · Oct 6, 2020

or: slow down so you can get it finished faster

When we get into computer discussions, my dad likes to throw in stories from learning programming on his school’s mainframe, running cards overnight when idle time was available. Mishandled, spindled, and accidentally shuffled cards were all causes of frustration, but the most maddening was a bug in the code. Due to inattention, haste, or an insufficient understanding of CS, there was something wrong in your code, and so you got the cards back the next morning with a terse “compile error” or similar from the technician. So you revised it, checked it over and over, and then handed the cards off again to get run… sometime. When they had the time. Maybe in a couple of days.

I’m now learning web scraping and linear regression on a tiny machine many orders of magnitude more powerful than that hulking mainframe, but as computers have gotten more powerful, the problems we tackle with them have become more complex. I don’t have to worry about spindled cards or garbage collection, but I do struggle with library compatibility and some serious data sets. At the moment, I’m using a suite of Python libraries to drive an instance of my browser, scrolling down in a human-like fashion until the dynamically generated page loads the content I’m looking for, scraping the page source, dropping that into a list, and occasionally converting the list and pickling it so I don’t lose all my data if something crashes. When that’s done, I get to unpickle it, convert it back, and then run it through a function to parse out the information I think is most important.
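The scrape-and-checkpoint loop above could be sketched roughly like this (a minimal sketch, not my actual code: the function and file names are invented, and a stub stands in for the Selenium-driven browser):

```python
import pickle

def scrape_page(url):
    """Stub for the real work: drive the browser to the page,
    scroll until the dynamic content loads, grab the page source."""
    return f"<html>listing from {url}</html>"

def scrape_all(urls, checkpoint_path="scraped.pkl", every=2):
    results = []
    for i, url in enumerate(urls, start=1):
        results.append(scrape_page(url))
        # Pickle periodically so a crash 18 hours in loses little.
        if i % every == 0:
            with open(checkpoint_path, "wb") as f:
                pickle.dump(results, f)
    # Final checkpoint once the run completes.
    with open(checkpoint_path, "wb") as f:
        pickle.dump(results, f)
    return results

urls = [f"https://example.com/listing/{n}" for n in range(5)]
data = scrape_all(urls)

# Later: unpickle and run the results through the parser.
with open("scraped.pkl", "rb") as f:
    restored = pickle.load(f)
assert restored == data
```

The point of the pattern is just that the expensive, crash-prone part (the browser driving) and the cheap part (saving what you’ve got so far) are interleaved, so a failure costs you minutes rather than the whole run.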

All that elaborate work to get past the various bot-detection mechanisms on the website I’m scraping means that the process is SLOW. I’m sure I could theoretically find ways to parallelize this so that I could scrape several thousand pages in less than 24 hours, but right now I’m so swamped that learning how to do that and still presenting a linear regression model in a few days just isn’t possible. So it’s rush-rush-rush to start a job that’s just good enough, then wait until the job finishes running to see the results.

Doing it the fast way at the expense of doing it well is actually a theme of my journey as a coder. In grad school, I was trying to jump in as fast as possible (my advisor had little patience for downtime while learning, do the work now!). That meant I was trying to solve relatively complex problems without understanding the fundamentals of programming, just hacking together other people’s code. The longer it took me to figure something out, the more I flailed, throwing code at the wall until something, anything, ran without throwing an error. Sometimes I even got to the point that the code threw out the appropriate response at the end. Except for a few stabs at the MITx intro to CS course until I finally finished it, I didn’t put any serious time into learning to code well. I wasn’t a computer scientist, I just needed to {insert task here}.

Still, I managed to pick up a few good habits. Never use single-letter variables outside of lambda functions; not after a colleague spent months tracking down why a program mostly returned correct answers and occasionally returned plausible but incorrect values, eventually traced to the variable ‘b’ being used in two different places. Comment your code so your future self has any idea what is going on; one project had to be restarted from scratch when I picked it up six months later and couldn’t tell what any of the code was doing. Cite your sources when you copy something you aren’t 100% certain you understand; you want to be able to go back and figure out where things went wrong, or at least to know who on Stack Overflow to curse loudly and roundly.

But the first couple of weeks of bootcamp only amplified my “type furiously into Jupyter Lab and then hit shift-enter and then figure out why it’s throwing that error” style of coding. With the time pressures, it’s no wonder that I feel rushed! And until web scraping, it worked well enough. Yes, I sometimes went off on the wrong tangent because I didn’t take the time to clearly identify and explain the problem. Yes, there was a lot of cursing. But it was fine, right?

Eh, about that…

After a lot of iteration, trial and error and error and trial, I managed to scrape a few pages. My code sent Chrome to pages, scrolled, scraped, and parsed the values I wanted into a dict. I just checked that the output had a reasonable length and first value, which it did, so I set up my scraper to run and walked away. Eighteen hours later, it was still running. Page after page of home listings scrolled by, and I started to get really excited to dive into the data. I’d done it! I was really scraping a site! This was amazing!

When I finally got a chance to dig into the data, I broke out into a cold sweat. This is not a summary chart you ever want to see.

[Image: summary table of the scraped listing data — every value for price, beds, etc. is identical across all 3,111 entries]
Did you know that 3111 identical houses were sold in Seattle over the last three years?

I had failed to change a variable in the code from the test version to the production version, and so I navigated to 3000+ web pages just to scrape and parse the same object in memory 3000+ times. And save it. And turn it into a dataframe. A dataframe of 3000+ identical objects.
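The shape of the mistake was something like the following (a hypothetical reconstruction with invented names, not my actual code): the loop dutifully visited every URL, but the parser kept getting handed the one page source saved during testing instead of the freshly loaded one.

```python
def parse_listing(page_source):
    """Stand-in for the real parser that pulled price, beds, etc."""
    return {"listing": page_source}

def scrape(pages, test_source):
    records = []
    for url, fresh_source in pages.items():
        # ...navigate to url, scroll, wait for content to load...
        # BUG: parses the object left over from testing, not fresh_source,
        # so every trip through the loop produces the same record.
        records.append(parse_listing(test_source))
    return records

pages = {"url1": "house A", "url2": "house B", "url3": "house C"}
records = scrape(pages, test_source="house A")

# The length check passes and the first value looks right, which is
# exactly why my quick sanity check missed it. Only the number of
# *distinct* records gives the game away.
print(len(records))                                 # 3
print(len({r["listing"] for r in records}))         # 1
```

Checking the length and the first value wasn’t enough; a check on the number of unique records would have caught this before the 18-hour run instead of after.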

There was much wailing and gnashing of teeth. But I did learn a valuable lesson — slow down, check your code, and thoroughly test its outputs before you scale it up.

My coding style is still a lot of furious typing and debugging-via-error-messages, especially when it’s late at night and I’m feeling the time crunch. However, at least for the moment, I’m making a point of stepping through my code before I run it, pointing at each variable and making sure I know what it is, where it was defined, and what it’s for. My version of pointing and calling leads to some seriously strange-sounding muttering while I walk through my code, but it has also cut down drastically on errors, so it seems well worth it.

We don’t always learn the lessons we want, but we do tend to learn the lessons we need.

