How to Get Zillow Property Data in Python 2022 | Easy Tutorial
Want to analyze real estate properties, but not sure how to get the data? Check out this video for a step-by-step tutorial on how to get Zillow data in Python for on-market and off-market properties.
Ariel Herrera 00:00
Hey everyone, are you a real estate investor looking to get property details at scale, but you're not sure where to start? Maybe you don't have access to the MLS, and you want to get information for 100, 1,000, or even 10,000-plus properties. How do you do it? Well, in this video, I'm going to show you how to extract property detail information in Python for both on-market and off-market properties. You'll be able to get information like property estimates, the Zestimate and rent estimate, as well as lot information and zoning. My name is Ariel Herrera with the Analytics Ariel channel, where we bridge the gap between real estate and technology. I love being able to do tutorials that are easy to follow and allow you to become an even more data-driven investor. If this is the kind of content that you enjoy, then please subscribe, as well as like this video, so I know to make more of it. All right, let's get started. In this tutorial,
Ariel Herrera 01:06
we're going to go through three sections, the first talking about the data that we're going to be able to extract, the second discussing the API that allows us to get this data, and then the third being able to extract this data using Python. So right now I'm on Zillow's homepage. Zillow is a data aggregator that gets information across the United States for properties that are on the MLS, that's the multiple listing service, which agents have access to. So say if you wanted to purchase a property, you could go to Zillow's site and search your neighborhood. When I search Riverview, Florida, which is locally around Tampa, I can see on the right-hand side a list of properties, and I can also see a map view of those properties. Let's dive into one of these properties, in particular 11622 Pure Pebble Drive, a property that's for sale. It's going for 480k; it's four bed, two bath, about 2,000 square feet. And some interesting information that we'd really like to get to know is the facts and features of the house. We want to see these details in a singular spreadsheet so that we can analyze maybe all properties for sale, or off-market properties, in one single spreadsheet. Now, what's really neat is that Zillow provides on some of its properties a Zestimate score. And the Zestimate score is an estimate of what they believe the property's current value is. As well, we also get a rent estimate. And the rent estimate allows us to gauge, if we were to be investors and actually rent out the property, are we going to be positive or not? Are we going to lose money or gain? Ultimately, what will our cash flow be? There's a lot more information that Zillow provides on the URL, including, like I said, home value with the Zestimate score, price and tax history, estimated monthly cost, and the rent estimate as well. So all this information can be taken from the web.
Now imagine if we wanted to get this data for all properties within Riverview, which are in the thousands; it would be a very manual process to write this down into an Excel sheet in separate columns. Therefore, there's been a big push to have web scrapers. Web scrapers get this data from web pages; they basically take the elements that we're looking for, and they make it into a consumable form that we as consumers can use. Now you might be thinking, why doesn't Zillow just provide an Excel spreadsheet that we could download or an API that we can access? They're not really catering to real estate investors or researchers; their main thing is to be able to help first-time homebuyers, move-up buyers, or even those that are downsizing to be able to purchase a property. So the whole idea of being able to make metrics on multiple properties at scale really isn't in their niche. This is why we're going to use the zillow.com API that's created by apimaker. This can be found on rapidapi.com. And what RapidAPI is is basically a marketplace of APIs, and an API allows us to receive structured data. So in this case, for this property, 11622 Pure Pebble Drive, instead of us just getting a text file of a bunch of different values but not knowing which one is associated with the Zestimate, bedrooms, or bathrooms, an API actually provides it to us in a structured format, and we can make requests to it. So we could say we want data for this specific property or this specific city, and the API goes to its own database and retrieves it. The way RapidAPI works is that each single API is in a structured format, so you're always going to see the same three options in the layout, as well as information on endpoints, tutorials, discussions, and pricing. As a heads up, what's really awesome is that you can actually utilize this API for free; you can make 20 requests a month.
So in this case, that could be 20 different cities, getting all the for-sale properties, or it could be 20 different properties, getting the specific details for things like the Zestimate. And if you were to scale up, the cost is relatively low in comparison to platforms and other competitors that have this information. So if we go back to endpoints, we can see on the left-hand side that there are numerous types of endpoints we can get data for. When you think of endpoints, just think of folders with data. So in the first case, we can get information on for-sale as well as sold properties, which we've done in previous videos, so go check that out if you want to do that. But in this specific case, we want to focus on any property, whether it's on market, so it's being sold on the MLS, or it's off market, because we may want to target sellers that are motivated; maybe they're behind on payments, a recent divorce, etc. So in this case, we want to target the property details endpoint. Property details is going to give us all the information that we saw on Zillow. So here are a couple of things to note: we have a request URL, which is the same (it's RapidAPI), and we have some header parameters, which don't change. Once you sign up, when you subscribe up here, you'll be able to get your own API key. Again, it is free to sign up; it is only if you go over the 20-request limit, which they'll let you know about (they'll send you an email if you're about to go over), that you incur any costs, and I thought the costs were very low. On the right-hand side, we can see that when this request is made, we can actually look at Python specifically; this is how we're going to make a request, and we can see example responses. Let's look at our example. So we need to enter a ZPID, which, if we go up here, we have the ZPID. So let's copy this.
Once we enter our ZPID within the parameters and then hit Test Endpoint, we can see all the information for that property that we have pulled up. We can see the URL, which is 11622 Pure Pebble; we can see that it's a single-family home, 2,000 square feet; we can also see zoning information, HOA information, the annual tax amount, the rent estimate, the description, how many people have viewed the page, current mortgage rates, as well as the date that it was posted. So there's a ton of information here. And what we're going to do now is actually open up our Python notebook so that we can get this data. So right now I'm on Google Colab. This is a free site that lets you run Python code without having to install anything at all on your machine; it's all on the cloud. All you need is a free Google account, and you can clone this notebook. So in this example, let's say that I'm working with an off-market property list. So maybe I'm looking at pre-foreclosures or I'm looking at auctions, and I'm using PropStream to get that list of sellers and their properties. Once I make my search in PropStream, this is an example of what my spreadsheet looks like. I have information on the address, the city, the zip code, the owner's first and last name, as well as some other useful information like the bedrooms, bathrooms, total assessment, and estimated value. However, I want to get more. I want to be able to assess what the Zestimate and rent estimate are so I can calculate metrics like cash flow automatically. I also want to know zoning so that if I ever want to build a mother-in-law suite, like an ADU, I'd be able to do so. Something else that I'm interested in is lot size; again, on the same exact topic of can I build an ADU in the future, a tiny home, or throw something else on the lot where I'd be able to make more money. Well, if the house is relatively small in comparison to the lot, if it's a really big lot, that's something I might want to investigate further.
So once you clone this notebook, your next step will be creating a free RapidAPI key account. Store that key somewhere safe, and then also subscribe to the zillow.com API on RapidAPI. If you're not running this on Google Colab, then make sure that you do install googlesearch, and I'll explain why in a moment. The first step we want to do is import all of the libraries that we're going to need. So what I'm going to do is click within the cell, and you can either click play here, or you can do Ctrl+Enter. Once the cell finishes running, we can see a checkmark on the left-hand side and how many seconds it took. In this case, it only took one second. We're going to skip functions for now and go straight into our locals and constants. These are variables we're going to refer to later on, specific to Google Colab. So if you'd like, you can just skip this and enter your API key right here. I'm going to run these both, and it's going to connect me to my Google Drive so that I can read a file that has all my API keys. Once I have my RapidAPI key set, the next step is to get data for a single property. And after that, we're going to get it for a list of properties, which is more realistic. But first, let's start off with one property to make it simple. So in this case, we have our 11622 Pure Pebble Drive, which is currently for sale. But this will work whether it's for sale or not, so on or off market. We need to take three simple steps. One, we have either a list of properties or a single property. Second, we need to be able to search that property in Google so that we can get that ZPID. In order for us to be able to get the property details, the API needs to know what property we are looking for. Unfortunately, we can't search by address with the API. So what we need to do is get the URL of the property and extract this last piece, which is the ZPID.
The ZPID is basically a unique identifier that Zillow associates with each property in its database. So once we have our ZPID, which we get from the Google search, we'll be able to feed that into the API in order to get our data. So we have our property address in a string, and we want to append "zillow home details"; we're going to emulate the same exact search. If we run this, we can copy what our query is going to be. When we paste it in Google, we can see that the first returned URL is the one coming from Zillow, so we want to grab this; we don't want to grab the Coldwell Banker realtor one or any of the others. So what we do is we use googlesearch, which is a super useful Python library. And here, we're basically saying we want to get a return of URLs that have .com at the end and are in English, and we only want to get three results. So if we run this, we can see that we get three results. Here, we get the zillow.com one, which is the one we want, Trulia, and then Coldwell Banker Homes. Now, since we know that the zillow.com result is usually always going to be the first result, we can select play in order to index this list and only grab that first URL. So that first URL is now in a string, and we want to parse it out to only get the ZPID. So we're going to run this, and what we're doing is splitting the URL by that forward slash. And if we see the ZPID within that, then we want to grab it and split it by the underscore here. That's how we're obtaining the ZPID. Next, we're actually going to get the data from RapidAPI. In RapidAPI, if we go back to code snippets, making sure we're on property details, they give us the code already that we can copy right here, and we can bring it over into our notebook. I've already done this over here. And just make sure that you replace the RapidAPI key with your own; mine is coming from a file, which is why it's in a variable.
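The search-and-parse step described above boils down to plain string handling. A minimal sketch (the notebook feeds the query to the googlesearch library; the helper names and the example URL below are hypothetical):

```python
from typing import Optional

def build_search_query(address: str) -> str:
    # Emulate the manual search: the address plus "zillow home details"
    return f"{address} zillow home details"

def extract_zpid(zillow_url: str) -> Optional[str]:
    # Zillow detail URLs end with "<zpid>_zpid/", so split on "/" and
    # look for the segment carrying the "_zpid" suffix
    for segment in zillow_url.split("/"):
        if segment.endswith("_zpid"):
            return segment.split("_")[0]
    return None  # not a Zillow detail URL
```

With the first URL returned by the Google search, `extract_zpid` hands back just the numeric identifier to feed into the API.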
Once we run the cell, we can see that we got a response back, and our status code was 200, which means it was a success. Now if we look at our response in JSON format, we can see that we have all the information that we wanted. So we get the school, which is Riverview; we can see solar potential, the state, tax history, and the street address, so we can confirm the street address is the one we're looking for, 11622 Pure Pebble, and more. Now, the issue here is that the JSON format is great, but if you're someone like me, you want to see this as an Excel sheet or something that you can actually upload into your CRM to put some metrics on this, like cash flow. So in order to do this, we're going to skip down to the next cell. We're going to use pandas, which is a really, really useful library in Python that you can use to manipulate data. You can basically do any formula that you want in Excel within pandas. So here, we're using pandas, and we're going to normalize this JSON file. Here, we can run the cell, and we get one single row, since we only got data for one property. And we can see that we have all 259 columns filled with data. If we scroll to the right a bit, we can see that the Zestimate score is right there. Now let's just take a few of these variables, since there are a lot there, over 200, and let's grab each of these values. So we have bedrooms, bathrooms, and then some estimates. If we press play, we can see that we have all the property details extracted from that table. We have bedrooms and bathrooms, and we also have lot size, lot dimensions, zoning, and property type. And these values all map to the same exact elements that are on that webpage. If we want to download this file, select play, and it's going to download a CSV file that is basically our entire data frame, but now we can actually work with the data. Great. So we're able to get data for a single property.
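The request and the JSON-flattening step can be sketched as below. The endpoint URL and host header follow the shape of the RapidAPI code snippet and may differ from the current API; `YOUR_RAPIDAPI_KEY` is a placeholder for your own key.

```python
import pandas as pd
import requests

RAPIDAPI_KEY = "YOUR_RAPIDAPI_KEY"  # placeholder: substitute your own key

def get_property_detail(zpid: str) -> dict:
    # Endpoint and host names assumed from the RapidAPI code snippet
    url = "https://zillow-com1.p.rapidapi.com/property"
    headers = {
        "X-RapidAPI-Key": RAPIDAPI_KEY,
        "X-RapidAPI-Host": "zillow-com1.p.rapidapi.com",
    }
    resp = requests.get(url, headers=headers, params={"zpid": zpid})
    resp.raise_for_status()  # stop on anything other than a 2xx status
    return resp.json()

def detail_to_frame(detail: dict) -> pd.DataFrame:
    # Flatten the nested JSON into a one-row table;
    # nested keys become "parent.child" column names
    return pd.json_normalize(detail)
```

`detail_to_frame(get_property_detail("12345678"))` would yield the one-row, many-column table described above, ready to save as a CSV.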
Now that we understand the methodology, let's actually get this data for a list of properties. Remember, our end goal is that if we're looking for motivated sellers on or off market, then we should be able to just upload a file, grab the ZPIDs, and append the property data that we want. So in this case, I'm going to work with that file that I showed you before, which is Woodbridge tax liens. And Google Colab has a really useful function here where you can upload a file. So let's run this. Once you select your file, it's going to be stored temporarily within your Google Drive. Here, I wrote up some functions just to quickly get the file name so that it could dynamically upload it and then put it into a data frame called df_upload. So in our case, we only have five rows in this file. And these five rows have different addresses in Woodbridge, New Jersey. We have information that's exactly the same, so we have the address, owner information, and mailing information, as well as some information about the property like bedrooms and total bathrooms, then some things that are about the loan itself, like the estimated balance of open loans and total open loans. So say, like this property here that has zeroes, if we wanted to try to do seller financing, and basically get financing through the seller and not through a bank, we'd want to find someone that has 100% equity in their home; these values help us find that. But back to our main focus: we want to add columns on the right-hand side that come from our zillow.com API. So now that we have our file, this is the main code that's going to allow us to get this data for every single street. I'm going to walk through it briefly and then run it. The first thing that we want to do is create two different lists. We're going to have a list of data frames; every single address is going to return data for us, and we're going to transform that into a data frame.
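The Colab upload step above can be sketched like this. In Colab, `google.colab.files.upload()` returns a dict mapping each uploaded filename to its raw bytes; the helper name below is hypothetical:

```python
import io
import pandas as pd

def uploaded_to_frame(uploaded: dict) -> pd.DataFrame:
    # `uploaded` has the shape returned by google.colab.files.upload():
    # {filename: file_bytes}. Grab the first (usually only) file.
    name, data = next(iter(uploaded.items()))
    print(f"Loading {name}")
    return pd.read_csv(io.BytesIO(data))
```

In the notebook this would be used as `df_upload = uploaded_to_frame(files.upload())`, after `from google.colab import files`.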
But ultimately, we want to get just one single file; we don't want five different files, we want one. So this list is going to allow us to bridge this information together later on. Same thing for the ZPID list; we're going to need a way to associate the addresses in our file with the new addresses that we get back. In order for us to actually get data for each and every property, we want to iterate through each of these rows. So each row is going to have the street, city, state, and zip code. My file was a little bit weird, so I had to add a zero to the front of the zip code because it got cut off when it was read as a number; you can eliminate that if it's not an issue for you. But we want to associate what columns we're going to need. So in this case, my street column is labeled as Address, so that's why here I'm associating it as Address. Now once I have that information, I can put that into my function. Functions allow us to repeat code, and my function up top basically does the same steps that we did previously. So it takes the street, city, state, and zip code and puts it into a single string that we're going to enter into Google to search for our property. Once it searches for our property over here, we're going to extract the ZPID from it; we're going to get the first URL, then get that ZPID and return it. Our next function is getting that property detail from RapidAPI. All we need are two different parameters, our RapidAPI key and that ZPID. Once we get that, we're going to return the data that is received from the API. Now, in between these two functions, I do have this part that says sleep 1.5. In order for us to make these Google searches, we can't overload the system. This is specific to the googlesearch library. If you want ways around this, check out my solutions using the Bing Web Search API. But in this case, if we're running it within our notebook, it's fine.
What we're going to do is wait one and a half seconds each time we use the Google search and then go get the data. When we go to get the data, that's going to be this line right here. We're passing in our RapidAPI key and our ZPID. We then want to get this into a data frame, so a table, using this, and then we want to append it to that original list that we had, so we're going to have all of our responses in one spot. The reason why I have a try and except is that sometimes, it's rare, but I have seen where the ZPID does not come up. So in this case, we don't want our entire script to stop; we want it to continue. So it's going to throw an error that says, hey, couldn't get the data for this street, but it's going to still continue. Let's now run our two functions up top. Next, when we run this main cell, it's going to take a little bit of time. We're going to see right here that it takes that 1.5-second break, which makes sense since we programmed it that way, and then it goes to get the data. Once that's done, we're going to have this complete, so the spinner stops; it took 14 seconds for five properties. And then we're going to click play here in order to see all of our data. So we concatenated: we took our list and merged it together. And now we have five different rows. These five rows represent each of the five properties that we got data for. We see that we have information on the Zestimate for each one, as well as living area and more. For our next step, let's actually merge these datasets together. Remember, we have our original file, and now we want to append the Zillow data. So here, what I'm doing is making sure that each of our tables has a column called ZPID and that the ZPID is a string. After that, I am merging these two tables. So we have our df_upload, which was our original file; then we have the df_property_detail, which is what we got back from Zillow, and we're merging them on ZPID.
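The main loop described above (iterate over the rows, pause between Google searches, keep going on errors, then concatenate) can be sketched as follows. `get_zpid` and `get_detail` stand in for the two helper functions in the notebook, and the column names are assumptions about the uploaded file:

```python
import time
import pandas as pd

def collect_property_details(df_upload, get_zpid, get_detail, pause=1.5):
    frames, zpids = [], []
    for _, row in df_upload.iterrows():
        try:
            # Search Google for the Zillow URL and pull out the ZPID
            zpid = get_zpid(row["Address"], row["City"], row["State"], row["Zip"])
            time.sleep(pause)  # don't overload the search endpoint
            detail = get_detail(zpid)  # RapidAPI property-details call
            frames.append(pd.json_normalize(detail))
            zpids.append(zpid)
        except Exception as exc:
            # Keep going: one bad address shouldn't stop the whole run
            print(f"Could not get data for {row['Address']}: {exc}")
    combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    return combined, zpids
```

The try/except mirrors the behavior in the video: a property whose ZPID cannot be found is reported and skipped, and the remaining rows still get processed.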
We're doing so by making sure that we preserve our original file first; if for some reason the API could not get information on your property, it's still going to show us that property, just with the columns blank for the Zillow API. When we merge this, we can see that we now have our original file with all of the columns that came back from Zillow. However, there are 313 columns now in total for our Excel spreadsheet, which is very overwhelming. So let's cut this down a bit. In this cell, we can redo the merge, following that same process, but instead, we're just going to specify a couple of columns to get back from the zillow.com return data. So now we see we have only 44 columns, a lot less. And if we go all the way to the right, we can see that we now have the ZPID, Zestimate, and rent estimate for each of these properties. When we download this file, we can see that we now have all the original information, and it also maps the Zestimate and rent estimate at the end. Lastly, we can do some plotting as well. If we want to use Plotly Express, we can enter in our new data frame, which is the modified output, and specify that we want to see all the Zestimate values in this box plot. We can see that our Zestimate values range from 200k to about 560k with a median of 380k. Maybe if our strategy is to bring forward properties to flippers, we want to focus on Zestimates that are lower, so they can add a lot more value. As a recap, we've been able to successfully get data for a single property, whether it was on market or off market, and we were able to upload our own file, no matter where it came from, and then append more information to it. In this case, we appended property estimates. But as you saw, a lot of data came back from the API, so just research what you want to pull back, and you can also add that in. If you're looking for a no-code solution, then check out my property bulk upload tool.
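The merge step can be sketched like this. The Zillow column names (`zestimate`, `rentZestimate`) are assumptions based on what the API returned in the video, and the helper name is hypothetical:

```python
import pandas as pd

def merge_zillow_columns(df_upload, df_detail, keep=("zestimate", "rentZestimate")):
    # Cast the join key to string on both sides so the dtypes match
    left = df_upload.assign(zpid=df_upload["zpid"].astype(str))
    right = df_detail.assign(zpid=df_detail["zpid"].astype(str))
    cols = ["zpid"] + [c for c in keep if c in right.columns]
    # A left merge preserves every row of the original file, even when the
    # API returned nothing for that property (the Zillow columns stay blank)
    return left.merge(right[cols], on="zpid", how="left")
```

Passing only a handful of `keep` columns is what trims the output from 313 columns down to something manageable before downloading the CSV.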
This allows you to upload your files, select the columns that are relevant to a street address. And then my tool goes to get the data files a similar process to this but a little bit more intricate since I use the Bing search API and everything is getting stored in AWS, in addition checkout to my other videos related to this topic. Another is automation workflow. So being able to get the Zillow property data and add it to a Google spreadsheet, it allows you to also add it to a CRM if you'd like to. So if you're looking to use Zapier, this will be the video for you. And lastly, I'm trying to create a CRM where I'm able to enter in a single property, get all the property details, skip trace the owner automatically get property estimates and ultimately contact them. If you want to follow my journey into building out my CRM then definitely check this out. I'm going to show you how to get the Zillow data similar to here but in Zoho, which uses a programming language called dilute if you haven't already, please subscribe and if you have any questions, please join the tech and real estate Facebook group and post them there. Thanks