Real Estate Data Science Project | Find Fixer Uppers using NLP in Python | Part 1
In this mini series, I'm going to walk you through how we can use machine learning to identify which properties are distressed based on the property description. We will walk through how to create our training dataset and build both an unsupervised and a supervised model. Even if you're unfamiliar with the world of AI and machine learning, this will still be a fun series to learn the possibilities.
Ariel Herrera 0:00
For investors, finding properties that are distressed or need to be repaired is one of the best strategies for making a great return. With competition high among fix-and-flippers and BRRRR investors, it is critical to use speed to find listings first.
But if there are thousands of properties in a city, how do we find the ones that are distressed? Do we read each listing description line by line? No, we can use machine learning to automate the process. In this mini series, I'm going to walk you through how we can use machine learning to identify which properties are distressed based on the property description. We will walk through how to create our training dataset and build both an unsupervised and a supervised model. Even if you're unfamiliar with the world of AI and machine learning, this will still be a fun series to learn the possibilities. My name is Ariel Herrera, data scientist with the Analytics Ariel channel, where we bridge the gap between real estate and technology. Please like this video and subscribe to help us reach a wider audience, and let me know if you enjoy these ML mini series. All right, let's get started. Back in early 2020, when I was first starting this channel, my intention was to show how to use data science, like machine learning models, forecasting, NLP, and even AI, in real estate. However, as I was on that journey, I realized that just retrieving data was a huge obstacle for a lot of investors as well. Many out there weren't thinking much yet about how to incorporate AI. However, with the introduction of ChatGPT to the public, there's a lot more interest. Therefore, since there is now a buzz and interest, I am showing some past projects that I've worked on over the last several years that have to do with ML. Back in 2020, I collaborated with two other data scientists that I met on a BiggerPockets forum. Together, we tried to solve a problem. Our goal was to identify real estate properties with rehab potential for investors using machine learning.
The thought was, if you're trying to do, say, the BRRRR strategy, which is buy, rehab, rent, refinance, repeat, or you're a house flipper, one of the main things you're looking for is properties that you can add value to, since you can get a higher return. However, finding these properties can be tricky. You can read property descriptions that may say the house needs some work or TLC, but it's not easy to parse that information without a manual process. Let's take a look at an example of a property that is rehab versus non-rehab. On the left-hand side, we see a gorgeous property, and the classification is non-rehab; it doesn't need any work. If you're an investor that's looking for a turnkey property, this would be perfect, because you're not looking to do any type of fixes before the tenant arrives. We can see the description for this property: wide open, beautifully updated four bed, two full bath with office, two-car garage, modern touches, HGTV. There are a bunch of words in here that help our minds as humans quickly know that this is likely either a new or updated property. Whereas on the right-hand side, we can see this property says short sale, Carrollwood Estates, some description of the property, and that the home needs TLC. TLC means tender loving care, so it needs work. And we can see here the cabinets are old and the appliances outdated as well; the lime green color isn't really what fits today's modern look for houses. The question is, just by looking at this text, can we identify which one is a property that needs rehab? And yes, we can; it is the image on the right. The process that I took back in 2020 to obtain this information was using a Realtor API to get property data.
If you're also looking to get listing information, then please see the links below so that you can watch the series I have on getting data using the Zillow Scrapeak API. The methodology I had was to extract data for 80-plus cities and filter on properties that are below the median price of the city. I was trying to find a lot of examples of properties that were distressed; if they are priced lower, there is likely some reason, maybe the roof isn't done, or the property needs some type of work. So that was the methodology for finding these rehab properties. Then I would extract the property description and start to create a training dataset. When I first thought about this problem, I realized, oh my gosh, the data is just all over the place. There's no documentation to know how to read this data in and manipulate it with common libraries like pandas. That's one of the reasons why I decided to create the channel: to make it easier for others to gather data and do projects like these. So now let's dive into a notebook to see how you create training data for this particular problem. Right now, I'm on Google Colab. It is free to use, and you can clone the notebook that I have here by going to File, Save a copy in Drive, and run it on your own. I'm going to walk through the steps to actually create a training dataset that will have properties that are either distressed or not distressed. Let's scroll down. The first step that I have here is imports. I'm going to be using NLTK, an NLP library; NLP stands for natural language processing. Natural language processing is a segment within machine learning where we're able to parse information from text, so we can actually interpret text and create models off of it. Here, I'm also going to download punkt, stopwords, and wordnet, which I'll dive into a bit further.
For functions, I have some NLTK functions that allow us to clean our dataset, which I'll describe. Jumping down to our data, the first step is loading a dataset. In this case, I had a whole setup for extracting data from a large list of cities and going through a large list of properties. For this example, I'm just going to grab a small dataset, in particular one that I extracted with Coffee Clozers that has information on properties and descriptions. I'm going to choose File and go to my spreadsheet. This is now loading as a CSV file, and in total I have 329 rows and 33 columns. I have here information on the address, including the price of the listing, year built, square footage, and description. In particular, I want a range of properties that need work, and these are actually kind of hard to find, because most properties that go on market are usually those that are ready to be sold. The description is the detail that I want to parse out. So the first step that I'm taking here is normalizing my description. As you can see from our descriptions, we have dashes, we have commas, and we have some text that is lowercase and some that is uppercase. In order for us to train our model, we want our text to be cleansed. If we go down here, there are a couple of steps that we're taking: we're removing punctuation, we're tokenizing our text, meaning separating the text on spaces, we're lowercasing, and we're removing stop words. Stop words are basically words that we don't care about and that don't really carry meaning, like "it", "on", "you". You can also add additional stop words. Lemmatization deals with reducing a word to its base form. For example, if we had different words like fixes, fixer, or fix, they are all essentially the same word. Lemmatizing allows us to collapse these down instead of having a larger corpus. Then lastly, we tokenize our sentences. So if we see here, the description column was our original listing description.
And on the right-hand side is the normalized description. If we look at this row here, that "star star space" is now removed, and "investors" is lowercase as well. We could go into more detail on how this has been cleansed, but once that was done, my next step was to analyze the data a little more. Looking at the word count, I could see that most descriptions had about 20 to 40 words, with some being a little more explanatory and some not having that many words. For those with a low number of words, we likely can't train on much. We can also view the top words. Here, the top words in the descriptions include home, new, property, room, bath, great, opportunity (which is interesting), tenant, kitchen, and more. After exploring the dataset, I realized I wanted to remove any rows where the cleansed description ended up null, so I set this to true. I also wanted to remove duplicates. And based on the histogram that we showed above, I decided I only wanted descriptions with at least 10 words, so I filtered on that criteria. After reducing my dataset, the percentage of records retained was about 90%. Then, after tediously going through some of these descriptions and looking at the ones with the highest counts, we realized that there were some words that were strongly tied to distressed properties or to remodeled properties. In order to speed up human labeling, we wanted to try to tag some of these descriptions automatically, and then make it a little easier on the human side to check: yes, this is correct, or no, this is not correct and maybe needs to be rehab or non-rehab. So here I have some terms that are very distressed-like: TLC (tender loving care), as is, repairs, fixer upper, handyman, must sell, tenant. Then there are very clear words that a property is remodeled, like new, gorgeous, quartz, move ready, charm.
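The filtering step described above (drop null cleansed descriptions, drop duplicates, keep only rows with at least 10 words, and report the retention rate) can be sketched with pandas. The DataFrame column name and function name here are illustrative assumptions, not the notebook's exact identifiers.

```python
# A sketch of the dataset-reduction step: remove nulls, duplicates, and
# short descriptions, then report the percentage of records retained.
import pandas as pd

def filter_descriptions(df: pd.DataFrame, min_words: int = 10) -> pd.DataFrame:
    """Drop null/duplicate normalized descriptions and those with too few words."""
    start = len(df)
    out = df.dropna(subset=["normalized_description"])
    out = out.drop_duplicates(subset=["normalized_description"])
    # Keep only descriptions with at least `min_words` whitespace-separated words.
    out = out[out["normalized_description"].str.split().str.len() >= min_words]
    print(f"Records retained: {100 * len(out) / start:.0f}%")
    return out
```

Applied to the full dataset, a filter like this is what left roughly 90% of the records in the walkthrough above.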
By using these two lists, I then had a function called hard_code_labels that takes my data frame, the distressed keywords, and the remodeled keywords, and applies a label in the column all the way to the right. This label was threefold: the property was labeled as distressed, as none, or as not distressed if it's a remodeled property. Let's take a look at a couple of these examples. By doing this hard coding to try to label distressed and non-distressed, we can see the first group was labeled as distressed due to terms like "needs renovations"; the second one, "investment property, fix, flip, buy and hold, great rental"; and this one, "opportunity for investors". This hardcoding isn't perfect, but for the most part, these descriptions all look like ones that point toward a property that needs some type of work, that is distressed. Then for the next set, we have non-distressed: properties that are newer and don't need any rehab. We see here charming home, lovely home, hardwood, attractive words that are associated with a property that likely doesn't need any work. Again, this can be useful even for an investor that doesn't want a property that needs work and wants a turnkey property that they can quickly rent out to a tenant, even if that means possibly a lower return. Then there were some descriptions that didn't match any of these labels. From my point of view, what I was doing iteratively was looking at the ones labeled none, reading some of those descriptions, seeing if there were any words I could tag, putting them back into the lists, and then continuing this process. What's really interesting in the machine learning space is the data exploration part: being able to clean your dataset, especially with NLP using libraries like spaCy or NLTK, and being able to start labeling your dataset, which sometimes has to be a manual process.
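The keyword-based labeling described above can be sketched as a small function. The keyword lists come from the walkthrough, but the function name and the exact matching logic here are illustrative assumptions; the notebook's version operates on a whole DataFrame.

```python
# A minimal sketch of the "hard coding" of labels: check a normalized
# description against two keyword lists and assign one of three labels.
# hard_code_label is an illustrative stand-in for the notebook's function.
DISTRESSED_KEYWORDS = ["tlc", "as is", "repairs", "fixer upper",
                       "handyman", "must sell", "tenant"]
REMODELED_KEYWORDS = ["new", "gorgeous", "quartz", "move ready", "charm"]

def hard_code_label(description: str) -> str:
    """Label a description as distressed, not_distressed, or none."""
    text = description.lower()
    # Distressed terms win first; then remodeled terms; otherwise no label.
    if any(kw in text for kw in DISTRESSED_KEYWORDS):
        return "distressed"
    if any(kw in text for kw in REMODELED_KEYWORDS):
        return "not_distressed"
    return "none"
```

To label a whole DataFrame, this could be applied row-wise, for example `df["label"] = df["normalized_description"].apply(hard_code_label)`. Note that plain substring matching is crude ("new" would also match "renewed"); matching on word boundaries would be a natural refinement.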
We tried to speed that up by having those two lists. In the next video, I'm going to show you the unsupervised model that we first created, which gives some insight into how these different descriptions were grouped. As mentioned earlier, this was a project that I worked on a while back, in 2020. Therefore, please know there are probably better methods today. However, I have not touched this code since, and this is strictly to show you what type of problems you can solve with machine learning in Python using NLP. I would love to hear in the comments below some of the latest research that you may have done or ideas that you have. As well, if you're thinking, wow, where did this project go, how can I potentially get my hands on the data, or on a machine learning model that already has this problem solved, being able to tag objects or identify properties that are distressed, then I would highly suggest looking at FoxyAI. FoxyAI is a leader in the space of using computer vision artificial intelligence for real estate, able to classify images as well as descriptions for properties. All right, in the next video, we'll get into the models. See you there!