Exploring Different Web Scraping Strategies

Authors
  • Steven Yu

Background

Due to Cornell's relatively isolated location, finding and buying bus tickets is a common shared experience among Cornell students. Surprisingly, though, there are many different bus services of varying quality and price, and finding the right ticket is a hassle, as you have to compare tickets across all those services and across various days. Recently, I've been working on a project that aggregates data from the different bus providers so that students can easily find and compare bus tickets.

Understandably, a crucial part of this project was web scraping. Most sites were very easy to crack, as most of them had publicly accessible APIs that returned nicely formatted data or had the JSON embedded inside the page's HTML. However, I soon hit a roadblock when trying to scrape data from Cornell's official bus service, Campus-to-Campus (the best bus service imo). This roadblock was a bit confusing at first, but later proved to be a fun detour into the various technologies of web scraping!

Introduction to the C2C Website

The C2C website is an ASP.NET web application, which makes getting data from it a bit tricky: the site loads data through POST requests whose form data includes parameters that track state across requests (e.g. __EVENTVALIDATION). Therefore, it is not possible to just send a single GET request and get the necessary data back.

You can see this process yourself by navigating to the website here. As you fill out the form, you can see the different requests pop up in the Network tab of Chrome DevTools.

After playing around with the website, I figured out that there are two main approaches to scraping data from this website:

  1. Sending multiple fetch network requests to simulate the process of running the C2C website
  2. Using a headless browser (driven by Selenium or Puppeteer) to simulate and automate the steps of a user

The repo containing all of the different approaches can be found here.

Scraping Strategies

Fetch Request Simulation - ❌

GitHub Code

This was my original approach. After opening Chrome DevTools and looking at the Network tab, I discovered that the C2C website sends a POST request with form data (application/x-www-form-urlencoded) every time the user updates a field on the form. After further research, I concluded that, to get to the seat data at the end, a scraping script would need to replay every step the user takes (and, therefore, every network request those actions generate). The cookies set by the website would also need to be saved and sent with subsequent requests. That gave me the following steps (a minimal sketch of the request loop follows the list):

  1. Go to the main page (https://c2cbus.ipp.cornell.edu/mobile/?a=mobile)
    1. Get the necessary fields for the next request's form data and save them (i.e. __EVENTVALIDATION, vs_gid)
    2. Get the cookies (from getSetCookie()) and save them
  2. Set ddlTripType to 'One Way' and ddlQty to '1'
    1. Send the necessary POST request with these two changes and the other form data (i.e. __EVENTVALIDATION, vs_gid, etc.)
    2. After getting the response from the POST request, save the updated __EVENTVALIDATION for the next step (note that vs_gid doesn't update from step 1)
  3. Select ddlDepPickLocation
    1. Send the necessary POST request with this change and the other form data (i.e. __EVENTVALIDATION, vs_gid, etc.)
    2. After getting the response from the POST request, save the updated __EVENTVALIDATION for the next step
  4. Select ddlDepDropLocation
    1. Send the necessary POST request with this change and the other form data (i.e. __EVENTVALIDATION, vs_gid, etc.)
    2. After getting the response from the POST request, save the updated __EVENTVALIDATION for the next step
  5. Select btnDepCal
    1. Send the necessary POST request with this change and the other form data (i.e. __EVENTVALIDATION, vs_gid, etc.)
    2. After getting the response from the POST request, save the updated __EVENTVALIDATION for the next step
  6. Parse the results
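
To make this concrete, here is a minimal sketch of that request loop. It assumes a recent Node release (for the built-in fetch and Headers.getSetCookie()) and pulls the state fields out of the HTML with a quick regex; the exact set of form fields and their values are assumptions and should be copied from the requests you see in DevTools.

```js
// Rough sketch of the request loop, assuming Node 18.14+ (built-in fetch and
// Headers.getSetCookie()). Field names come from the C2C form, but the full set of
// fields and their exact values should be copied from the requests shown in DevTools.
const BASE_URL = 'https://c2cbus.ipp.cornell.edu/mobile/?a=mobile';

// Pull a hidden input's value out of the returned HTML (a real script might use an HTML parser).
const hiddenValue = (html, name) =>
  html.match(new RegExp(`name="${name}"[^>]*value="([^"]*)"`))?.[1] ?? '';

// POST one simulated form change, forwarding the saved cookies and state fields.
async function postStep(cookies, fields) {
  const res = await fetch(BASE_URL, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/x-www-form-urlencoded',
      Cookie: cookies.join('; '),
    },
    body: new URLSearchParams(fields).toString(),
  });
  return res.text();
}

async function run() {
  // Step 1: GET the form, then save the cookies and the initial state fields.
  const first = await fetch(BASE_URL);
  const cookies = first.headers.getSetCookie().map((c) => c.split(';')[0]);
  let html = await first.text();
  let eventValidation = hiddenValue(html, '__EVENTVALIDATION');

  // Step 2: post the trip type / quantity change, then keep the refreshed token.
  html = await postStep(cookies, {
    __EVENTVALIDATION: eventValidation,
    ddlTripType: 'One Way', // illustrative values (check DevTools for the real ones)
    ddlQty: '1',
  });
  eventValidation = hiddenValue(html, '__EVENTVALIDATION');

  // Steps 3-5: repeat the same pattern for ddlDepPickLocation, ddlDepDropLocation,
  // and btnDepCal, then parse the seat data out of the final response.
}

run();
```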

In theory, this approach should work, but for whatever reason the script is treated differently from a real user (maybe because of cookies?). This led me to try another approach.

(P.S. if any readers have an idea for a solution, feel free to reach out to me!)

Selenium - ✅

GitHub Code

I decided to use Selenium because I knew it would definitely work: it simulates the actions of a real user on the web page. To make Selenium (and eventually Puppeteer as well) work, I needed to locate elements and carry out actions on them. Most of the time this was pretty easy, as most elements had unique ID attributes. However, some elements, like the date picker's button for advancing the date, had some weird HTML, which forced me to use an XPath to find and interact with them. In addition, as the C2C website is dynamic, it is important to tell Selenium to wait for elements to appear with the setTimeouts() method. In the end, Selenium worked perfectly, as expected. It was also not as slow as I expected, with an average runtime of ~2 seconds on my machine (M1 MacBook Air, 16GB).
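
For reference, here is a rough sketch of what that flow looks like with the selenium-webdriver Node package. The element IDs are the ones the C2C form exposes; the date-picker XPath is a placeholder, since the real one has to be copied from DevTools.

```js
// Sketch of the Selenium flow, assuming selenium-webdriver and a local Chrome/chromedriver.
const { Builder, By } = require('selenium-webdriver');
const chrome = require('selenium-webdriver/chrome');

async function scrapeWithSelenium() {
  const driver = await new Builder()
    .forBrowser('chrome')
    .setChromeOptions(new chrome.Options().addArguments('--headless=new'))
    .build();
  try {
    // One-time implicit wait: Selenium polls for up to 10s before giving up on an element.
    await driver.manage().setTimeouts({ implicit: 10000 });
    await driver.get('https://c2cbus.ipp.cornell.edu/mobile/?a=mobile');

    // Most fields have stable IDs, so picking options is straightforward.
    await driver.findElement(By.xpath("//select[@id='ddlTripType']/option[. = 'One Way']")).click();
    await driver.findElement(By.xpath("//select[@id='ddlQty']/option[. = '1']")).click();

    // Elements without clean markup (e.g. the date picker's "next" button) need an XPath.
    // The XPath below is a placeholder, not the real one.
    await driver.findElement(By.xpath("//a[contains(@id, 'NextDay')]")).click();

    return driver.getPageSource(); // parse the seat data out of the final HTML
  } finally {
    await driver.quit();
  }
}
```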

Puppeteer - ✅

GitHub Code

Although the Selenium approach worked, I decided to translate my Selenium code to Puppeteer, another library that lets you control a headless browser (i.e. Chrome) through JavaScript. The reason for this move was that there didn't appear to be much support for Selenium on Vercel's serverless functions platform, whereas people have managed to run Puppeteer on Vercel, albeit with solutions that were a bit finicky.

I found Puppeteer a bit harder to use than Selenium. For example, in Puppeteer you need to explicitly call the waitForSelector() method before interacting with an element that hasn't appeared yet (as opposed to Selenium's one-time invocation of the setTimeouts() method). In addition, a frustrating gotcha was that you have to explicitly tell Puppeteer to wait for an element to be visible before you interact with it; otherwise, you are only checking whether that element exists in the DOM. After working through those issues, I got Puppeteer running, and it was indeed faster than Selenium, with an average runtime of ~1-1.5 seconds on my machine (M1 MacBook Air, 16GB).
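
Here is a condensed sketch of the same flow in Puppeteer. Note the repeated waitForSelector() calls with visible: true; the option values passed to page.select() are assumptions, since page.select() matches an option's value attribute rather than its text.

```js
// Sketch of the Puppeteer flow; selectors mirror the Selenium version.
const puppeteer = require('puppeteer');

async function scrapeWithPuppeteer() {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  try {
    await page.goto('https://c2cbus.ipp.cornell.edu/mobile/?a=mobile');

    // Each dynamic element needs its own waitForSelector call, with `visible: true`
    // so we wait for it to be rendered, not just present in the DOM.
    await page.waitForSelector('#ddlTripType', { visible: true });
    await page.select('#ddlTripType', 'One Way'); // value assumed; read it off the real <option>

    await page.waitForSelector('#ddlQty', { visible: true });
    await page.select('#ddlQty', '1');

    await page.waitForSelector('#btnDepCal', { visible: true });
    await page.click('#btnDepCal');

    return page.content(); // final HTML containing the seat data
  } finally {
    await browser.close();
  }
}
```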

Now that the Puppeteer script works, I aim to deploy this solution for the project.
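
For context, the finicky part of those Vercel setups is fitting a Chromium binary into a serverless function. One commonly used pattern (my assumption of what such a deployment might look like, not necessarily what this project will end up using) pairs puppeteer-core with a slimmed-down Chromium package:

```js
// Sketch of a serverless-friendly browser launch. The @sparticuz/chromium package is an
// assumption about one common approach, not what the original project documents.
const chromium = require('@sparticuz/chromium');
const puppeteer = require('puppeteer-core');

// Launch Chromium with the flags and binary path provided by the slim build.
async function launchServerlessBrowser() {
  return puppeteer.launch({
    args: chromium.args,
    defaultViewport: chromium.defaultViewport,
    executablePath: await chromium.executablePath(),
    headless: chromium.headless,
  });
}

module.exports = { launchServerlessBrowser };
```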

Conclusion

Although I thought that scraping C2C would be an easy process that would take a day or two at most, it instead led me down a week of learning about how ASP.NET applications work, headless browser automation, XPaths, and more. I'm really grateful that scraping this web page was a challenge, as I wouldn't have learned as much if there had just been a simple API to fetch data from.

I hope this article taught you something about the quirks of web scraping as well! And, on a more metaphorical level, I hope it also left you with one of my dad's favorite phrases:

办法总比问题多 (there are always more solutions than problems)

Happy coding!