How to Extract Tables from PDF using Python | NAR Housing Data
Need an easy way to extract tables from a PDF document? In this video, we'll gather housing statistics from the National Association of Realtors by scraping PDF documents with only a single line of code in Python.
Ariel Herrera 0:00
One of the main challenges in real estate is that we get so much data, but it's not actually organized. In this video, I'm going to show you how you can take a PDF extract tables from it using Python, so that you can actually analyze the data within PDF documents. My name is Ariel Herrera with the analytics area channel where we bridge the gap between real estate and technology. I love being able to make data usable so that you can make informed decisions for your real estate business. If you enjoy these types of insights, then please like this content, so I want to make more of it, as well subscribe to this channel so that you get the latest content. Alright, let's get started.
What we're going to be doing is using NARS data There are pdf document so that we can actually take the data from the PDF document and output it into a CSV file. That way when we want to analyze data that comes back from NAR, we don't have to actually copy and paste the information from a PDF document. Now I'd be asking what is NAR? Nar is the National Association of Realtors. And they post housing statistics on a monthly basis. And it's really relevant to look at NARS data because they have access to the entire MLS. So all homes that are purchased and sold. Now if we go to the website housing statistics, we can see that there's several different paths that we could go towards. There's information on existing home sales, pending home sales, housing affordability, index and more. For our first use case, we're going to look at existing home sales. Let's imagine that we want to understand how the market is performing. because interest rates are rising, and we're on the brink of a possible recession, we're concerned if home sales are going to decline. If that happens, there could be a lot of people out of work, including mortgage lenders, inspectors, agents, because there's not enough houses that are being bought and sold. It doesn't mean the prices are going to crash, it just means there's not enough inventory as well. So for example, buyers are less likely to purchase when interest rates are high. That's because their mortgage may go from 2000 a month to 3000 a month depending on the change. This makes it less affordable. Now for the seller side, if sellers are looking to move up, maybe they're looking to purchase a larger home, their mortgage is now maybe double the price. And if they want to keep the mortgage the same, they'll maybe have to downsize to a smaller property. Why would anyone wants to do that. And that's exactly why we expect to see a lot less people selling their homes because the next option isn't really that good of an option. So if we go to existing home sales, we could expand this and actually dive into the home sales data. Here, we could see that for June of 2022, there was a 5.4% decline in the number of transactions. And if we go down to average mean price, we can look at the actual dataset.
Now the data set, as I warned you is not clean. What we have here is a PDF document. And this PDF document has mean sales price of existing homes from 2020 month over month, and they recently released as of July 20, the June stats down here. Now what if we want to actually see this in an Excel or CSV file? Well, we can't really do that right now because of the way it's formatted. But using Python we can. So the next step here is using the link below to open up the Google collab notebook. What is Google collab it's basically a free environment for you to be able to code in Python, as long as you have a Google account. Now if you want to actually run this notebook yourself, you can open up the file, go to File and then save a copy to your drive. I'm going to be using tabula tabula is a library Python that allows us to scrape the information from PDF documents. So the first step is installing this library. And afterwards, we're going to be importing all of our relevant libraries. We're going to be using tabula as well as Plotly express that we can actually visualize what our values look like month over month. Now if we go down here, we have the URL of our PDF document. So I simply just went to our PDF and I copied it In the URL from up here, paste it within here. And now using tabula, you can use a method Read PDF, and we pass in our URL. And there's only one single page. So we say state here that it's one page. Once we run this tabula is able to actually get the data from the table within the PDF document. Here, we could see we have two columns year, month and US mean price. We have here two years 2020 2021, the mean price, and then we have the month over month, breakout all the way down to June of 2022, which has a P in front of it. This is super useful, it only took a couple seconds, and we got the table out. But it's still not usable in this format. So we need to do a little bit of data wrangling and Python. So we do next is we normalize our data by establishing a variable for our table. So we're going to call this DF for data frame means sales price. Now, we don't care about the year stats up here, we only want the month over month. So in this case, we're going to select only from row three and on. Now the other two things that we have to do is right now our values are strings. When these are actually numbers, we want the number 47900. So in order to get it into a number, we're going to replace commas and just take them out, then we're going to take each of our variables and wrap them in integer so that we have them as a number. Then in order for us to normalize the first column, year and month, what we need to do is remove this p. So whenever we have a new month, it's going to have the P in front of it. So what we're doing in this next part is only selecting the first seven characters in the string. So what we're doing here is we're specifying for our column year over month. Our format is year period, and month. And that's going to transform into a daytime object. Now, if you're new to Python, don't worry, this is only a few lines of code. And in essence, we're basically making changes to our columns within our data frame. So now we can run this. Now if we look at the last five rows of our table, we could see that we've transformed our date column to actually be a date. So now it's no longer 20 22.05. We now see it adds an actual year, month and day. And then for us mean price, it's now a number. So we could even stop at this point if we'd like and download this file, and work with our file within Excel, which we have right here. But what fun is that we
want to create visualizations. And some of the most eye popping visualizations are ones that you can create in Python with only a single line of code. So using Plotly Express, which is px, right here, we're going to do a line chart. And if we run this, we're passing in our data frame, our x axis going across is going to be your month column, then our y axis up and down is going to be the US mean price. And we stay our title here. So the national single family mean price month over month, we can see that from January of 2020. This has pretty much risen, it's only gone down and are going a little bit flat in certain periods. But overall, we've seen an increase from 350 to have a median price all the way up to 536 median price, which is a really large increase in only a two year span. And now that we have this setup, every time that NAR publishes their new data every month, we could just run the script. And we could see how that has changed. Great. So we've been able to actually take a PDF document, grab the table, normalize it and chart the results. Now what if you have a table that's a little bit more complicated? So NAR also has pending home sales. And pending home sales is kind of a table within a table. Pending Home Sales are interesting, because we can see how many homes are currently under contract we can get an understanding afterwards. are a lot of these Pending Home Sales actually going to be sold or are they cancelled contracts too. So up top we have a year over year view. And then we also have a month over month view. So let's imagine we just want to get the year over year view and we want to look at the US versus In the different sections, so northeast, Midwest, South, and West. In order for us to do this, we're going to also copy this URL, and we paste it down below here. So now we run this and tabulous going to read the PDF document. Again, we're seeing that it's only one single page. And if we look at the table that we labeled DF pending, it is very messy. We see that the data that we want is up top and these three rows, but there's numerous columns, there's columns that don't really make sense, like we have here unnamed and a bunch of nulls. How do we actually get the data we want and make sense of this? Well, the first step for data wrangling here is we want to first change the header. So all of our headers are in the first row, we need to make that the actual column names not unnamed. So the next cell, if we run this, what we do here is we make that first row the header. And now we actually have first row header. But we still have an issue where we have these na n, so no value columns that don't really mean anything. So our next step that we're going to do is remove these from our results, we can see that we've removed the null values, but we still have issues and we have duplicate columns as well. So for our next step, we're going to just select the first three rows, because we only care about the information in the PDF document that's highlighted in yellow up top right here your rear. Now if we just select the first three rows, and label it, DF one, we now have just those first three and several columns. A lot of them, though, have stars associated for the valleys, which we don't want. So if we run the next cell, what we're going to do is if the column has already appeared in the data frame, then add an underscore and x to it so that we can remove it afterwards. Now we could see that we have several columns that have underscore x, because these are duplicates. So let's remove those. And now we only have the five columns that we care about. But we have one column here, that's a combination of two, it's northeast Midwest, and we have the values put together, which we don't want. And if we look deeper look here, we can see that these values are all coming within the same cell. Now by using Pandas, we're going to take the column here and split it by space, then label it as Northeast and Midwest and drop that icky column that we have.
And once we do that, we have a beautiful table. We have year over year, what Pending Home Sales look like for United States as a whole as well as broken out by region, Southwest, Northeast and Midwest. And it's going to be the same exact table that's in this upper left hand corner. I hope this has been useful. And if you want to see more data wrangling projects and working with PDF documents, and please let me know in the comments below, so I can make more videos like it. Thanks so much. And if you haven't already, please subscribe.
Transcribed by https://otter.ai