How to Get Redfin Housing Market Data | Part 2

Looking to get real estate market data at your fingertips that you can analyze your specific market and understand, is a crash coming? Is it becoming a seller's or a buyers market? These are all questions we can answer with Redfin's free data source and I'm going to show you how to extract that data and filter down to your specific city using Python.

Video Block
Double-click here to add a video by URL or embed code. Learn more

Ariel Herrera 0:00

Looking to get real estate market data at your fingertips that you can analyze your specific market and understand, is a crash coming? Is it becoming a seller's or a buyers market? These are all questions we can answer with Redfin's free data source and I'm going to show you how to extract that data and filter down to your specific city using Python. My name is Ariel Herrera with the analytics erielle channel where we bridge the gap between real estate and technology. If you enjoy data driven solutions, then please subscribe to this channel as well. Stay tuned because in the third video, I'll be showing you how to create the same exact visualization so that you can be ahead of your competition and understand how to analyze your market. Alright, let's get started. Right now I'm on red fins data center page. And Redfin has information about housing market data they make available publicly. So this data, they have some Tableau dashboards already, where you can filter on different areas, and look to see how new listing pending sales home sold and more have changed over time. This is useful information to see the status of a real estate market if it's cooling off, or if it's heating up in favor of sellers or buyers. So down below, we actually have downloadable information. And you could go all the way from the national level down to the neighborhood level. And this is critical because real estate is hyper localized. For example, say if you are investing in the city of Miami, well in some pockets of Miami and may be more affordable, there may be more population coming. Or maybe there's people that are driving away. So you can just look at city as a whole begin to really focus on zip code as well. And this data not only helps to see how our market is currently performing, but it helps us to analyze what next markets may be emerging and we want to invest in. So I'm going to be opening right now a Google collab notebook. And this allows us to basically programmed in Python on the cloud using Google services without having to have Python installed on our actual machines, and will be able to run the cells. In order to start you're gonna go to the left hand corner file and save a copy to your own drive. This is free, you just need your own Google account. Now for the first step, we're going to be using some libraries in order to download our data, as well as extract our data and see how long it takes to run. So here, you can either press play or hit Shift Enter in order to run this cell. Now for the next cell. If we go over two Redfin's data source, and we go to zip code, we can copy the link address. And this is going to allow us to read the data. And if we look at the link address, it's going to match the same one that I already have within the URL. And you might be thinking, why don't we just download the file and then read it in Excel or some other application? Well, the reason is, because this file is massive, we have data for over two years across Pending Home Sales listings month over month changes. And it's all zip codes throughout the United States. So let's run this cell here. So that we can actually read in the data from the URL, we're specifying with this pandas function that it's G zip compression, and it's separated by a tab. What we're going to be doing next is understand how long this takes to run, and then show us the first five rows within our dataset. So let's click run here. Here, we

could see that it took 2.29 minutes in order to actually read in this dataset. And that's because there's over 5 million rows, almost 6 million with 58 columns. So it's a pretty wide and lengthy dataset. If we look at our first five rows, we could see we have information on period begins. So our data should go back to at least 2013. as well. We have information of what zip code is the state state code, and then more information on what's going on in the market, including median sale price, median list price, as well as pending sales, home sold inventory, and more. So let's actually get this data down to a subset that we could actually work with. So I'm going to filter specifically on Florida since that's the state that I'm currently in. Now my dataset is down to 380,000 rows, which is doable. This is great to work with. And we could see here that the duration of the dataset is 90, throughout across and we could look at some other mean standard deviation Max stats by using the describe function. I'm going to skip forward to our next step where we're going to sample this even further. So our goal is to be able to visualize all of the data within a dashboard and Tableau, it's free to create, I always suggest that it's a good practice to start with the sample set, and then expand forward. So maybe start with just a city that you're local to. And then if you want to look at the state or across the whole United States, go for it, just be wary that it is going to be a lengthier data set. So it may take more time in order for the data to load in the visual tool. So next, what I'm doing is I'm going to get data for a select amount of counties. And I go to this GitHub repository that already has data for the US, each state, all counties and all zip codes. So if we look at the first five rows, we could see we have Alabama related zip codes, counties and cities. So for my purpose, instead of looking at all of Florida, I'm going to sample this down even further. And I'm going to look at specific counties on the left hand side of Florida. So by Tampa, Tampa is within Hillsborough County, and you have St Pete and Pinellas County, you have Pasco Hernando and citrus up north that are still about one hour from the city. So these are the ones that I'm going to concentrate on in order to reduce my sample set. So on the next cell, what I'm doing is I'm filtering on Florida, and I'm filtering even further to only look at these five specific counties, I want to check that the zip codes are correct, because I've noticed that some of the zip codes in this file, they don't actually assign to a real city. So I'm checking that all the zip codes should be fully numeric, meaning that each element within the five digits should be all digits. So I look here, and I could see that there are a few invalid zip codes. And this is what they look like. So I'm going to be removing those from my dataset in the next hour. So if I run this, we have now filtered, we're only getting valid zip codes. And we have 119 zip codes we're going to be focusing on now for our next spot where we're going to do filter, so that we get region from region we get zip code. So if we go back to the top where we have our dataset originally, we could see we have this region fields, the zip code is nested inside of it, we just need to get that zip code itself. So we can choose all of our relevance of coats. So what I'm doing over here is I'm splitting by the colon, and then grabbing the first element and stripping it so that any white spaces or blank spaces beginning to end are removed. So now we have a ZIP Code column. And I'm merging this on an inner join with our valid county zip codes. So that now our dataset is going to be reduced from the 300k, just to those zip codes in those five counties. So as we run this, we should see that now we only have 56,000 rows. So we've reduced our dataset. Now we want to actually export this so we can work with the end Tableau. However, as I was going through this, I noticed there were some issues with the data. So let's look at an example County, and an example zip code and only single family homes. So what this is telling us is that from February of 2022, to March of 2022, median days on market went from six to seven days. However, median days on market month over month, it says there was an increase, this should be six to 12. But it's not, that means that this is incorrect. And we need to recalculate median days on market. just by eyeballing a data set, I only saw this issue with this specific fields. But I highly suggest if you're going to use a data set for your own city and maybe do a for loop, and recalculate all month over month calculations. And the way to do that is by sorting your values by the period. So the date from the earliest date to the latest date in the last row. And then for median days on market, which is this column, we want to get the percentage change, which helps for us to gauge if the market is increasing or decreasing month over month. And I ran this on this test dataset. And we could see this should be the actual values that we're getting, we should have point 166 as the median days on market, month over month increase, not one. So this is going to replace this column.

So in the features part, I'm going to run this here, there's three different features we're creating, we're gonna have a latest period column that way we could identify right away. It's the most recent data. So in this case, if it's for May 31 of 2022. We could see all the way at the end that we have our latest period true or false. Then we are going to be removing the relevant city and keeping the geo city so we have to two tables that we had merged. One of the cities isn't as detailed as the other. So I'm keeping the city that came from that GitHub CSV file. And then the last step that I'm doing is removing that bad data by replacing the median days on market field with the percentage change that I'm manually calculating here. Now we have our data set. And the last step is just to export it. So here I'm getting today's current year and month, which is 20 2206. And I'm exporting this into a CSV file. Our next step here is once we have our data set, and next video, we'll be actually creating this Tableau dashboard that you'll be able to use to analyze your own market. If you haven't already, please subscribe. Thanks

Previous
Previous

How to Build a Real Estate Tableau Dashboard | Part 3

Next
Next

07-08-22 Tech in Real Estate | Don't Count on a Housing Market Crash