Real Estate Data Science Project | Find Fixer Uppers using NLP in Python | Part 2

In this video, I'm going to show you how to apply topic modeling to real estate descriptions. We'll be able to identify which properties are fixer-uppers and which ones may be recently remodeled.

Ariel Herrera 0:00

Have you been curious about how you can utilize machine learning with real estate data?

Well, in this video I'm going to show you how to apply topic modeling to real estate descriptions. We'll be able to identify which properties are fixer-uppers and which ones may be recently remodeled. My name is Ariel Herrera, data scientist with the Analytics Ariel channel, where we bridge the gap between real estate and technology. Now, if you're not a Python geek like I am, no worries, you can still learn a lot from this video about what's possible with machine learning. Please like this video and subscribe to help us reach a wider audience. All right, let's get started.

When I started the Analytics Ariel channel several years ago, my original intention was to make more videos related to data science with real estate data, such as forecasting models, natural language processing, and developing AI models. However, I soon discovered that the first step when you want to actually create models is retrieving data, and real estate data had no documentation on how to retrieve it and how to utilize it. This led me down a different path of creating a series of content on how to utilize real estate APIs to gather information and automate processes for real estate investors. However, with ChatGPT now being a main topic in the public eye, even among those who have no experience with machine learning, I realized this is a great opportunity to show some previous work that I had done on machine learning, in the hopes that I can shift back to the original intention and start creating more videos about AI/ML within real estate. So if you haven't already, check out part one, where we detail what the project was all about and how we were able to obtain training data using NLP (natural language processing). I go through a series of steps in the notebook that I created, and you can replicate it for your own dataset as well.

As a recap, the problem that we're looking to solve is: how do we identify if a property needs work, if it's distressed, if maybe the kitchen is outdated or the flooring needs work? And why would we even care to begin with? Well, a lot of real estate investors who employ the BRRRR method or flip fixer-uppers look for these properties because they have a higher return: you can usually fix these properties up at a certain cost and sell at a higher price than if you were to buy a turnkey property. So what is the challenge in this space? You would think, okay, just go to the MLS, or to Zillow or Redfin, and just search "fixer upper," right? Not quite. Sometimes this data isn't as apparent as to which properties need fixing, and also which ones are ready to go, which ones are turnkey.

So, collaborating with two other data scientists that I met on the BiggerPockets forum, we had a fun little project where we were looking to see if we could classify whether a property needed rehab or not, whether it was distressed or not, based on the listing description, since a lot of the time the listing description will tell you if a property needs remodeling or needs TLC (tender loving care). We worked for several months on this fun project, where we got to learn a lot about using different techniques in data science. So even though this project was a while back, a lot of the parts are still relevant today, especially if you're looking to start a machine learning or AI based project.
So what I'm going to do is walk you through the notebook on how to create an unsupervised topic model, so that you can identify or group the properties that need work versus the ones that don't. If you're brand new to Python and want to do something similar, I highly suggest you check out my course, Introduction to Real Estate Data Analytics. I go through Python step by step, we cover web scraping using different APIs like FRED and Census, and we even create Tableau dashboards to analyze markets.

Great. So to start, I'm using Google Colab. Google Colab is a free notebook environment where you can code in Python without having to have anything installed on your machine. When we were first working on this project, we found it useful to use Google Colab because we could easily share notebooks with one another without the overhead of managing them in GitHub. In the first step here, we are using LDA (latent Dirichlet allocation), which is a topic modeling method. I have links below if you'd like to learn a little bit more about what this model is. But let's go through the code, and I'm going to show you a really neat visualization that we were able to create off of it.

Here we have our imports, and we're also downloading stop words so that we can remove them from our dataset. I have the training datasets that we had stored in GitHub. Again, this is a project that I worked on a while back, so there are definitely more methods that can be applied, but I'd love to show you some of this work now that there's more interest in the AI and ML space, particularly for real estate. Jumping down, we're going to reference these functions later.

The steps that I took overall started with reading in the datasets. In total, we had 12,000 rows of training data; you can watch the prior video for an example of how the training data was obtained: first applying some hard labels, then having humans review, and ultimately ending up with a set of labels that included distressed, not distressed, removed, undecided, and unknown. As you can see, most of these classifications were unknown. As we were going through this process, we realized, oh man, doing human labeling is so tedious. Imagine reading property descriptions line by line and labeling them as distressed or not distressed. There were a lot of complications, because we realized some properties seemed like both: they seemed kind of distressed, but not so much that you couldn't move in. So what clearly defines either of these two? This is where we thought maybe we could use unsupervised methods to classify or start to group these descriptions into different buckets, and then better identify what the right labels should be without going through human labeling.

Moving down, let's just look at examples of what's in this training dataset. We have the original descriptions, as you can see here. But of course, there's a lot of junk: we have dollar signs, exclamation marks, and a mix of lowercase and uppercase. If we were to tokenize this as is, it would be such a wide corpus; think of a bunch of different columns, if you like to think of things in Excel or table terms. We want to cleanse our dataset. So one of the steps here was to remove stop words, and the next step was building bigrams and trigrams. What does this mean? Basically, there are some words that don't really have that much meaning as one single word, but when you pair them, say in a string of three, they do have a lot of meaning.
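As a rough illustration of these cleaning steps, here is a minimal sketch using NLTK and gensim. The file name, column names, and thresholds are hypothetical placeholders, not the original notebook's exact values:

import nltk
import gensim
import pandas as pd
from nltk.corpus import stopwords
from gensim.models.phrases import Phrases, Phraser

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

# Hypothetical file name; our real training data was stored in GitHub.
df = pd.read_csv("training_data.csv")  # assumed columns: description, label

def clean(text):
    # Lowercase, strip punctuation and digits, then drop stop words.
    tokens = gensim.utils.simple_preprocess(str(text), deacc=True)
    return [t for t in tokens if t not in stop_words]

docs = [clean(d) for d in df["description"]]

# Pair tokens that frequently co-occur, e.g. "fixer upper" -> "fixer_upper",
# then chain a second pass to capture trigrams like "sold_as_is".
bigram = Phraser(Phrases(docs, min_count=5, threshold=100))
trigram = Phraser(Phrases(bigram[docs], threshold=100))
docs = [trigram[bigram[d]] for d in docs]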
So for example, when I was removing stop words, I believe "as" or "is" kept getting removed. But look at "sold as is": when you pair all three of those words together, it actually means something. "Sold as is" usually means that the seller will not do any work on the property; they know that maybe the roof needs to be redone, or the flooring. That's a prime phrase to help us know if the property is distressed. The same goes for "great investment." Now, if it just said "great," or if it just said "investment," it might not mean too much; we don't know the context. But by using a bigram, where we pair these two words together, we now have a more informative phrase that allows us to better describe our data.

After that came some more pre-processing, and a lot of it was an iterative process of looking at the data afterwards and then making modifications. Here you can see an example where we have a description that says: "Green Acres Central Park area, great price for this fixer upper, easy to view, all information should be independently verified." We remove stop words and cleanse it a bit, where you see now everything is lowercase and we've removed punctuation. Then we also implement bigrams, and now we can see "fixer upper" has been identified as a term that we want to have in our corpus.

After creating our corpus, we can now train our model. Here we are building our topic model using an LDA model. We pass in the corpus as well as the number of topics, and I'll detail how we found that number below. We can also print the keywords for our topics. After building our model, we can see the weight placed on each type of text in our corpus. The way I was evaluating my model was looking at the log perplexity as well as the coherence score. If we look a little bit below, here's where we're computing the coherence values; that takes a bit of time, in case you're replicating this as well. This helps us choose the optimal number of topics, and if we plot it here, we can see the optimal number is four, which is why I'm telling the LDA model to try to split up my data into four topics based off of the corpus. Going down, this shows the coherence values zipped with the topic counts, so you can see what those values are for each number of topics.

Skipping down a little bit, here we select the model that we'd like to set as our main model, save it to a temp file, and then actually visualize the topics. This is my favorite part about using the LDA model: not only do we get an unsupervised model that we can utilize, we can actually see it visually. On the left-hand side we have our topics, and on the right-hand side we have the words that contribute to each of these bubbles. So let's start with number one. The 30 most relevant terms for topic one included words like room, bedroom, large, space, living, feature, backyard, bathroom, family, and fireplace. As you can see, nothing here states that the property is remodeled or brand new, but it also doesn't fit into the category of needing rehab.
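Continuing from the preprocessing sketch above, here is roughly what that training, coherence search, and visualization flow can look like with gensim and pyLDAvis. The hyperparameters and topic range are illustrative assumptions, not the notebook's exact settings:

from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# Build the dictionary and bag-of-words corpus from the cleaned tokens.
id2word = Dictionary(docs)
corpus = [id2word.doc2bow(d) for d in docs]

def train_lda(k):
    return LdaModel(corpus=corpus, id2word=id2word, num_topics=k,
                    random_state=42, passes=10)

# Score a range of topic counts by coherence; this loop takes a while.
scores = {}
for k in range(2, 9):
    cm = CoherenceModel(model=train_lda(k), texts=docs,
                        dictionary=id2word, coherence="c_v")
    scores[k] = cm.get_coherence()

best_k = max(scores, key=scores.get)  # in our case the optimum was 4
lda = train_lda(best_k)
print(lda.print_topics())
print("log perplexity:", lda.log_perplexity(corpus))

# Interactive visualization: topic bubbles on the left, term bars on the right.
import pyLDAvis.gensim_models
vis = pyLDAvis.gensim_models.prepare(lda, corpus, id2word)
pyLDAvis.save_html(vis, "lda_topics.html")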

So this is great: our LDA model has been able to put these types of properties into their own bucket. Then for number two, we see a smaller subset of properties that fell here, where it talks about downtown, neighborhood, shopping, bungalow, great, home. This makes me think there is a subtopic of properties that may not be brand new or remodeled, but are pretty much ready to go and happen to be in some sort of downtown area, and location is what's talked about the most in the listing description.

Now, three is where we start to get a little more variance and a little more extreme as to whether the property is distressed or not distressed. Here we see words like new, update, window, appliance, cabinet, remodel, stainless steel, and even, towards the bottom, fresh paint. Now, these words, what do you think? Yes, they are very close to words that would describe a property that is recently remodeled or new. Then if we go down to number four, we see different words including buyer, seller, offer, property, obtain; we see building, zone, estimate; "sold as," possibly "sold as is"; and "highest bid b," which probably means highest bidder and may be auction related. There are some words missing here, like "fixer upper" in particular, which makes me think that I need to go back to my corpus and possibly retrain on this data. However, my initial impression is that by splitting into four topics, LDA does a pretty good job: there is one topic that discusses not fully rehabbed properties that may be sold as is or foreclosures, since it surfaces things like "highest bidder"; on the far end, we have a very clear picture of those that are remodeled; and then a not-as-clear picture of properties that fall in the middle.

What I take away from this is that when I was first labeling distressed or not distressed, it was in fact not as clear as just those two; there are likely more categorizations needed when classifying a property based on its listing. Towards the bottom, I have some more output from this model, including keywords alongside the original texts, and I believe when I first went through this I was also iterating: reading this output and then making modifications to the corpus to further improve the unsupervised model. The great takeaways that I got from this were using an unsupervised model to really understand the separation of my dataset, as well as being able to visualize the data to clearly pick out some key terms.

Now you might be asking, where did this project go? Is there an API that I can use if I want to classify a property as distressed or not distressed? Well, we never took our pet project that far. However, I highly encourage you to either utilize some of this code, or share in the comments down below some more recent methods that you may have tried. And if you're looking to use an API right away that helps classify properties based on images as well as listings, I highly suggest you check out FoxyAI. FoxyAI is a company that uses computer vision and artificial intelligence for real estate. In the next video, I'm going to conclude this series by going over how we ultimately created a supervised model for this project based off of this training data. See you there!
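If you're replicating this and want to review the model output alongside the original texts, as described above, one simple way (continuing the same sketch, with the same assumed column names) is to tag each description with its dominant topic:

def dominant_topic(bow):
    # Pick the topic with the highest probability for this document.
    topics = lda.get_document_topics(bow)
    return max(topics, key=lambda t: t[1])[0]

df["topic"] = [dominant_topic(bow) for bow in corpus]

# Spot-check a raw description per topic to sanity-check the groupings.
for topic_id, group in df.groupby("topic"):
    print(topic_id, group["description"].iloc[0][:120])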

Transcribed by https://otter.ai
