How to Web Scrape County Data Using Python with Browse AI | Part 2
How do we web scrape county data for a list of properties? In this video, we continue part 1 (building a web scraping bot with no code) by getting owner information and property assessment values with Python.
Ariel Herrera 00:00
Are you looking to get owner information for a list of properties from your county? Perhaps you're looking to find motivated sellers, whether through divorce, tax liens, or foreclosure, and you have the properties, but you need the owner information so you can contact them. Well, in this video, I'm going to show you how to programmatically get a list of owners from your county. My name is Ariel Herrera with the Analytics Ariel channel, where we bridge the gap between real estate and technology. If you enjoy data-driven solutions to automate and scale your real estate business, then please subscribe to this channel. And if you want to see more web scraping tutorials, then please like this content, so I know to make more of it. Alright, let's get started.
Ariel Herrera 00:55
This tutorial is part two of a two-part series. In part one, we went through the exercise of using Browse AI to build a no-code solution for getting information from our county. In our example, we were looking at foreclosures within the Jacksonville area, and we wanted to see who the homeowner is so that we can contact them before the property actually goes to auction. In our scenario, we went to Jacksonville's Property Appraiser site, and we used Browse AI to automatically enter in a street number and street name so that we can get information back on the property. If you didn't watch part one, it's essential for you to do so, so you can set up your bot for your county. It only takes several minutes, and it's really easy to do. If you have multiple counties that you're searching and you don't have the bandwidth to create bots for every single one, then I highly suggest you use PropStream, and check out the link below for that. But in this case, let's imagine that we went through step one and we've created our bot to go to our county website, perform a basic search, and gather information. Within your Browse AI dashboard, you'll be able to see your tasks, and you can see how many of your credits you've used for the month; for free, you get 200 credits. So we're going to click the task that we previously created. Here we can see a history of all of our previous tasks that we've run, and we can see what the inputs are. In our case, we only have three inputs: one, the basic search URL, such as going to the website that we have here, and then two parameters, street number and street name. So fairly simple. Once we run this task, we're able to get information, including owner information, building value, taxable value, descriptions, zoning, year built, bathrooms, and bedrooms. We also get a final screenshot of how our bot performed, and if there are any errors, we can troubleshoot them here. This is really useful, and we can schedule this. But what if we have a list of properties we want to get information for? It's not going to be useful to manually input the properties here. That's why you want to use integrations. So if you want to stick to a no-code solution, then you can use something like integrating with Google Sheets or Zapier. Zapier is super useful, because you can integrate it with many different applications. So say you wanted a workflow where, every time you dropped a file into Google Drive, you want Zapier to then start this exact task; you'd be able to do that and automatically push the results back to Google Sheets or into your CRM, like Podio or Zoho. But in our use case, we're going to stick to Python, and we're going to use REST APIs to get the data. So once you're within your task, you're going to click REST API, which is going to allow us to programmatically call out the data. We can see here at the bottom that we have a team ID, task ID, and our variables, which we'll get to in a moment. Now for our API keys, you'll need to create one in the account API page. I already have one here called Test key; you'll just need to create an API key here. You can name it whatever you'd like. Once you do, you want to store your secret API key in a handy location. I have mine shown because I'm going to likely delete it right after this tutorial. Once you have this, copy it and then open up the notebook that's linked below.
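As a minimal sketch of that setup step, the credentials block in the notebook looks something like this. Both values below are placeholders; copy the real key from your account API page and the real task ID from the task's REST API page.

```python
# Sketch of the credential setup; both values are placeholders
# copied from your own Browse AI dashboard.
BROWSE_AI_API_KEY = "your-secret-api-key"  # created on the account API page
TASK_ID = "your-task-id"                   # shown on the task's REST API page

# Browse AI's REST API authenticates each call with a Bearer token header.
HEADERS = {"Authorization": f"Bearer {BROWSE_AI_API_KEY}"}
```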
So for this exercise, what we're going to do is have a list of properties that we want to get owner information for from our specific county. In this case, it's going to be the City of Jacksonville. So let's go to Redfin.
Ariel Herrera 04:52
In Redfin, I have a search set up; in this example, I'm looking for motivated sellers. What I've focused in on is that I only want three bed, two bath minimums, because I want to be able to find renters for the property in the future, and I likely want to have a larger size to command larger rents. I also have a filter here that I want to see properties that have been on the market for a while, so the sellers for some reason were unable to sell; maybe they're even more motivated now because they want to get the property off their books, whether they're looking to move or they're an investor looking to sell. So I have it highlighted here that I want to look at properties that have been on Redfin for more than 60 days. I also have a filter here that I only want to look at properties that were built before 2020, and, it being 2022 today, I don't want to see new construction builds, because I've seen those are properties that stay on the market for a long time, but in actuality they haven't been built yet. So I have that filtered as well. And then towards the bottom, I also have price reduced; I want to see properties that were on the market for a while, where the seller has had to reduce the price, but they still can't sell the property. This signals a motivated seller to me. So if I click See 86 homes, I'll be able to see all these properties on the right-hand side. By clicking on a property, I'm able to see photos and some descriptions, but ultimately I'm not able to get the detail that I want. And the detail that I want is the owner information; I want to contact the owner directly, without any agents in between, and maybe do some creative financing deals here. In order to get this information, I would need to input this into our county website. So this is 111 East 54th Street. If we bring this over to our county website, we can type it in.
Ariel Herrera 06:47
And when we search, we get our first result. And we see this is actually owned by a corporation, To Z LLC. I could further dive into this by going to Sunbiz, which is Florida's search for LLCs, to see who owns this LLC, but for simplicity, we'll just stop right here. And we get the owner, as well as other information on the property. But we want to get this programmatically, in our hands, with a list of properties. So now we can go back to our Colab notebook. Colab is a notebook environment created by Google, so you can actually run Python code without needing to have Python installed on your machine; it's all in the cloud and doesn't interfere with your system. If you want to make a copy of this, you would go to File and then save a copy within your Drive. To get started, I'm going to import the libraries that we're going to need down the line: requests, in order to get data programmatically; pandas, to manipulate the data; and time, so that we can slow down our code to let our bot process. The next step is setting up our Browse AI API key and our task ID. Our Browse AI API key is the key that we created right over here; you want to copy the string and then place it within the same string where I have mine. Next, to get your task ID, you're going to go back to the task, select your task, go to Integrate, then REST API, and you'll see your task ID down below; you're going to copy that as well. Now that you have those two pasted in, run your cells. To generate our data, let's just go through our first example that we had from the original video, 10991 Lastmore. In this case, we are going to pass in the URL and what task ID we're going to run, which we specified up top. Then we're going to pass in our payload: the start URL and the same exact parameters that we have right here, the origin URL, input 1, and input 2. If you need a refresher on what those are, go to your Run Task tab, and you'll see the examples from when you originally created your web scraper. So I have the Jacksonville property search, 10991, which is the street number, and then Lastmore, which is the street name. Next, I'm passing in my API key with Bearer in front, and I'm requesting the data. And I'm printing out the response as text. So let's run this here. What we get back is a whole text of information, but let's actually normalize this into a pandas DataFrame so we can understand it a little bit better. So we have the status code, which we want to be 200; 200 means it was a success starting up our web scraper. We have the timestamp of when it was created, and we have some other information, including the variables that we input. But ultimately, what we want to extract is the ID, the ID for this job. So if we go into our JSON response and look at the result, specifically the ID field right here, we'll be able to get the job ID. Now the job ID is what we're going to pass back into Browse AI, to say: hey, Browse AI, we sent a job; now we want to get the results for it. So this next part is where we get the results. We pass in our original task ID (our web scraper) and the specific job, so this is for 10991 Lastmore. Then we pass in our Browse AI API key once again, and then we get a response. So let's run this here.
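Condensed into one sketch, those two calls (start a job, then fetch its results) look roughly like this. The endpoint paths, parameter names, and response fields here are assumptions modeled on the pattern shown in the video; copy the exact URLs and variable names from your own task's REST API page.

```python
import requests
import pandas as pd

API_KEY = "your-secret-api-key"        # from the account API page
TASK_ID = "your-task-id"               # from the task's REST API page
BASE_URL = "https://api.browse.ai/v2"  # illustrative; use the URL your dashboard shows
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# 1. Start a job, passing the same inputs the bot was recorded with.
payload = {
    "inputParameters": {
        "originUrl": "https://your-county-appraiser.example/basic-search",  # placeholder
        "input": "10991",       # street number
        "input2": "Lastmore",   # street name
    }
}
start = requests.post(f"{BASE_URL}/robots/{TASK_ID}/tasks",
                      json=payload, headers=HEADERS)
print(start.status_code)  # 200 means the web scraper started successfully

# 2. Extract the job ID from the JSON response so we can fetch results later.
job_id = start.json()["result"]["id"]

# 3. Ask Browse AI for that job's results and flatten them into a DataFrame.
result = requests.get(f"{BASE_URL}/robots/{TASK_ID}/tasks/{job_id}",
                      headers=HEADERS)
df = pd.json_normalize(result.json()["result"])
print(df.columns)  # owner name, building value, taxable value, etc.
```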
Ariel Herrera 10:36
And the second part is just data transformation, for us to be able to get these results, the captured texts, within a DataFrame. So run this here, and we can see we do have results: we are able to get the owner name, which in this case is Secretary of Veterans Affairs, the owner address, building value, land value, taxable value, and some other fields that we had requested. And this is awesome: we were able to actually use our bot, not just in Browse AI's interface, but programmatically. This allows us to really scale if we want to analyze multiple properties, and we can automate more of our system. So now let's get to that point of working with multiple properties. We previously had our search of potential motivated sellers, because these are properties that have been on the market for a long period of time and they have price reductions. I copied some of these street addresses into the notebook. Here we have five different addresses; these were all taken from Redfin. And what we want to do is get the owner information for each of these properties. This is where our for loop comes in. For loops are pretty simple in Python and necessary to understand if you want to get proficient in the language. What we're doing here is going through every single row in our table, in our DataFrame; that's why we have this part, which says: for the DataFrame, iterate through the rows and get the index and the row. Our Browse AI bot requires two parameters that need to consistently change, which are street number and street name. So here we say: for every row, get the street number and get the street name. We're going to print a string saying "processing", so we know that our code is working; we also want to see updates on when we're getting county data and when we're getting the results back. We have wrapped our previous code into functions. So let's look at that up top: we have two functions here. One is getting the county data, where we pass in our task ID, street number, street name, and Browse AI API key; it's all the same steps as before, and it returns a response. Ultimately, once this works, we want to be able to get that job ID. The job ID then gets passed into our second function, which is going to take the URL, header, and job, get the response, and then transform it into a DataFrame so that we can read it in a table format and, ultimately, export it into a CSV file and read it in Excel. Great. So jumping back to our for loop: we're getting our county data, we get the job ID, and we're going to append the job ID to a list, so that if we have any errors down the line, we can see which job ID may have failed. Then we're going to sleep for one minute, which means we're going to allow Browse AI one full minute to go get our data, return it to us, and then make it available so that we can query it using Python. Typically, the scraper does not take this long, but I'm going to give it 60 seconds; you can always play around with this number and decide what you need for your specific county. Once we get this, we're going to get the results. The reason why I have this wrapped in try and except is that if our script fails at all, maybe there was an error of some sort that we weren't able to catch, I don't want it to stop; I want to keep going through all the properties in our list. That's why I have the except clause here, and whenever there's a failure, I want to know that it was unable to get results. So once we sleep for one minute, we want to get our county results.
And we want to append this into a list called df_list, which we will then join afterwards. So let's run this now. Once our script is complete, we'll see a checkmark on the left-hand side. We can then see that our script iterated through each of these properties and was able to get county data and results. If we go down to view our actual output, we can concatenate our list and then look at all of our listings. We see here that we're able to get the owner name, owner address, and information on the property, including zoning, year built, building type, and more. Of the three properties that we were able to get data for, we can see that two of the three are actually owned by LLCs. This is interesting: potentially the LLC is owned by one person, or it could just be a large corporation, and because they have a large portfolio, they don't really care that the property hasn't sold within 60 days; maybe that's why they're steadily doing price decreases. So we may want to actually reach out to the one individual owner whose name is not an LLC; he is not living in the current property, so it's not owner-occupied.
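Put together, a sketch of that loop might look like the following. The helper bodies reuse the calls from the earlier sketch; the endpoint paths and input parameter names remain assumptions to check against your own REST API page, and the sample addresses are illustrative.

```python
import time
import requests
import pandas as pd

API_KEY = "your-secret-api-key"
TASK_ID = "your-task-id"
BASE_URL = "https://api.browse.ai/v2"  # illustrative; use your dashboard's URL
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def get_county_data(task_id, street_number, street_name):
    """Start one scrape job for an address and return the HTTP response."""
    payload = {"inputParameters": {"input": street_number, "input2": street_name}}
    return requests.post(f"{BASE_URL}/robots/{task_id}/tasks",
                         json=payload, headers=HEADERS)

def get_county_results(task_id, job_id):
    """Fetch a finished job and flatten its captured texts into a DataFrame."""
    resp = requests.get(f"{BASE_URL}/robots/{task_id}/tasks/{job_id}",
                        headers=HEADERS)
    return pd.json_normalize(resp.json()["result"])

# Addresses copied from the Redfin search (illustrative rows).
properties_df = pd.DataFrame({
    "street_number": ["111", "10991"],
    "street_name": ["East 54th", "Lastmore"],
})

job_id_list, df_list = [], []
for index, row in properties_df.iterrows():
    print(f"Processing {row['street_number']} {row['street_name']} ...")
    response = get_county_data(TASK_ID, row["street_number"], row["street_name"])
    job_id = response.json()["result"]["id"]
    job_id_list.append(job_id)  # keep IDs so failed jobs can be traced later
    time.sleep(60)              # give Browse AI a full minute to run the bot
    try:
        df_list.append(get_county_results(TASK_ID, job_id))
    except Exception as err:    # don't let one bad property stop the loop
        print(f"Unable to get results for job {job_id}: {err}")

# Join all per-property results into one table.
all_results = pd.concat(df_list, ignore_index=True)
```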
Ariel Herrera 15:32
If we want to see why we're missing some of the other addresses, we can look at our df_list, and we can see here that we do have some empty DataFrames. We could investigate further to see: is it because we put in the address in the wrong format, or do we need to build in some more checks? Lastly, we can output this data into a CSV file so that we can view it; here we have the same exact information that was in our spreadsheet. Awesome, so we were able to get a list of properties. In our first video, we created a web scraping tool to get owner information from our county, and then in this scenario, we were able to take a list of properties and get the owner information and other data from our county. I hope this tutorial has been useful and has expanded your mindset to understand that there can be so much automation done in the real estate space to find motivated sellers and to extract data. If you have other tools that you enjoy using, please comment with them below so I can make future videos on them as well. And subscribe to this channel if you haven't already. Thanks.
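As a closing sketch, continuing from the loop above (it reuses job_id_list and df_list from there), the empty-result check and CSV export described in this last section might look like this; the output filename is a placeholder.

```python
import pandas as pd

# Flag any properties that came back empty so the input format can be checked.
for job_id, df in zip(job_id_list, df_list):
    if df.empty:
        print(f"No results for job {job_id} -- check the address format")

# Export the combined results for review in Excel or Google Sheets.
pd.concat(df_list, ignore_index=True).to_csv("county_owner_data.csv", index=False)
```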