Analyze Short-Term Rental Airbnb Data in Python | Part 2
Want to analyze short-term rental Airbnb listings in your market? Check out this video for Part 2, where we analyze the Airbnb scraper data in Python.
Ariel Herrera 00:00
Are you looking to get and analyze real estate data related to short-term rentals for your market? Well, you're in luck, because in this video I'm going to detail how to get information from Airbnb and visualize this data within Python to understand market trends. My name is Ariel Herrera, with the Analytics Ariel channel, where we bridge the gap between real estate and technology. I love data-driven solutions, and if you do too, then please subscribe to this channel. If you want to see more short-term rental content, then please like this video so I know to make more of it. All right, let's get started.
Ariel Herrera 00:46
This video is part two of a two-part series. In part one, we created an Airbnb scraper on Apify for free to extract data for our market. This was a no-code solution that only took a few minutes to set up, and less than five minutes to run our bot to get this data. Once our bot ran, we were able to export this data in different formats, including Excel, JSON, CSV, XML, HTML, and more. For the HTML table, we were able to view this data right away. Our market in this case is Siesta Key, the number two beach in the United States and in the vicinity of the Tampa Bay region. We were able to get this data and view it in HTML here, and we could see that we get the city, the latitude and longitude of the listing, the name of the listing, pricing, number of guests, as well as room type, stars, and some more fields. This was a basic data retrieval, because there's so much more information we can get. We can get more detail within the listing, including the calendar, how far out the listing is booked, host information, and more. So what we're going to be doing in this tutorial is taking the data that we got from our first step and analyzing it in Python. That way we can actually assess what the market looks like. In our use case, we are investors who are thinking about investing in Siesta Key. But what type of property are we supposed to get? Do people want entire homes? Do they want private rooms? Do they want condos? And on top of that, how many people usually stay in these properties? Is there a cap? Do most people only come as one single family, so that in this case maybe space for just four people is enough? These are all questions that we need to answer, and by analyzing this data, we'll be able to do so. So for the first step, you're going to use the link below to follow along with the Google Colab.
So if you're brand new to Python, this is completely okay, because Google Colab allows us to run Python scripts without needing Python installed on our machines. It's all running in the cloud, which is one of the benefits of using Google. If you open this file, you'll be able to save a copy to your own Drive so that you can follow the same steps as well. For our first step, we want to make sure that we import the libraries that we're going to need. We are going to be using some Google-specific imports, such as drive and files, so that we can upload our CSV file directly and don't need to have it stored anywhere. We are also going to use Plotly Express for visualizations, which is amazing because a chart only takes one single line of code. And then pandas, my favorite, for data manipulation. In order to run the cell, we can either click Run here, click Run up top, or press Ctrl+Enter, which is what I'm going to do next. This is very specific to Google Colab. So we're going to plot our data on a map, and the map that we're going to use is a scatter plot on Mapbox. Mapbox is free, but we do need to get an access token to be able to use it and visualize all of our listings within one single area. Now, for my Mapbox API key, I have it stored in a CSV file. So what I'm doing in these two steps is just saying, hey, Google, go to that file that I have called API keys and retrieve the Mapbox key, that string, so that I can use it down the line. And if you want to just input your string right here, you can do so as well. So what I'm going to do next is run these two cells, which is just going to ask me to connect to Google Drive.
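The key lookup described here can be sketched roughly as follows. This is only an illustration, not the notebook's actual code: the column names `service` and `key`, the row label `mapbox`, and the placeholder token are all assumptions, and an in-memory string stands in for the API keys file on Google Drive. In Colab, the Drive would first be mounted with `drive.mount('/content/drive')` from `google.colab`, and the CSV would be read from a path under that folder.

```python
import io

import pandas as pd

# Stand-in for the CSV of API keys stored on Google Drive.
# The layout (columns "service" and "key") is a hypothetical example.
api_keys_csv = io.StringIO(
    "service,key\n"
    "mapbox,pk.example-token\n"
    "other,abc123\n"
)

df_keys = pd.read_csv(api_keys_csv)

# Select the row for Mapbox and pull out the key as a plain string.
mapbox_api_key = df_keys.loc[df_keys["service"] == "mapbox", "key"].iloc[0]
print(mapbox_api_key)
```

Keeping the token in a separate file rather than pasting it into the notebook means the notebook can be shared without exposing the key.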
Ariel Herrera 04:40
Here, I select my name and allow Google to go into my files. Once that's activated, we're going to be able to read in our file and get our Mapbox API key. Once that's complete, our next step is to load in our data. Our data set was here in the HTML format, but for this tutorial, you're going to want to select CSV and download it. When I select Run, Google Colab provides the option to choose a file. Once it loads in the file, it saves the data set. In the next step, I don't want to type in the whole name of the file, so I'm just going to take uploaded, which was my variable here, and say, grab me the first file that was uploaded. That should be the correct file name. And once we have that file name, we read the CSV into a data frame using pandas, which is what I do in the next step. We can see here that we now have a data frame that has 500 rows and 17 columns, which should exactly mirror the table that we have in our browser. Let's view the contents of the first row. We do this by indexing zero for our first row. We can see that we have information on the city, so not the exact address. But we do have latitude and longitude, which is super useful, and we could actually backtrack the address based off of this. For the use case of this tutorial, though, we won't be doing that. We also get the name, the number of guests, and pricing in both formats, so in an integer format as well as a string. We also get information on the room type, which for this first listing is a private room in a home, the stars, which is five stars, and the URL if you want to take a deeper look at what this property looks like. We can see that this heading, the name, does match, and $7,000 a night is true. I can guarantee you that they only priced it this way because they are no longer active on Airbnb. They're keeping the listing active, but they don't want any guests, because this is definitely not worth $7,000 a night.
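A minimal sketch of this loading step is below. It assumes the Colab behavior that `files.upload()` returns a dictionary mapping each uploaded file name to its raw bytes; here a hand-built dictionary stands in for that result so the snippet runs outside Colab. The file name, the column names, and the two sample rows are made up for illustration and are not the scraper's real output.

```python
import io

import pandas as pd

# Stand-in for: uploaded = files.upload()  (from google.colab import files)
# Keys are file names, values are the file contents as bytes.
uploaded = {
    "airbnb_listings.csv": (
        b"address,lat,lng,name,numberOfGuests,price,roomType,stars,url\n"
        b"Siesta Key,27.27,-82.55,Beach Condo,6,437,Entire condo,5.0,"
        b"https://www.airbnb.com/rooms/1001\n"
        b"Sarasota,27.33,-82.53,Downtown Room,2,120,Private room,4.5,"
        b"https://www.airbnb.com/rooms/1002\n"
    )
}

# Grab the first (and only) uploaded file name instead of typing it out.
file_name = list(uploaded.keys())[0]

# Read the CSV bytes into a data frame.
df = pd.read_csv(io.BytesIO(uploaded[file_name]))

print(df.shape)    # (rows, columns)
print(df.iloc[0])  # contents of the first row
```

In Colab the uploaded file is also written to the working directory, so `pd.read_csv(file_name)` would work there as well.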
So how do we actually get rid of some of these outliers? Well, to start, I'm not really interested in Bradenton, Florida. I really want Siesta Key Beach. I want to see properties that are right by the water. If we go back to our map here, Siesta Key is a little strip off of the main body of land, and this is the only information that I want. So we're going to filter on that within a few steps. But first, we can also just view all of the columns that we have. Now for transformations. In this case, I want to see how often we have Siesta Key, and what percentage it makes up of all the 500 listings that were returned. So we're going to do a group by, which is a pandas function, and we're going to group by address. I want to count the number of names by address: for Bradenton, count the number of listings, for Siesta Key, count the number of listings. Then we're going to reset our index and rename this name column to count. Once we have a column called count, we want to sort by it from highest to lowest and get the percent. So let's run this. You'll see that we now have address, which is what we grouped by. We have Siesta Key as the first address, with 302 total listings in our data set, and that makes up about 60% of our data set. Coming in second is Sarasota, which makes sense, because Sarasota is actually the town bordering Siesta Key, so Siesta Key is technically in Sarasota. Our next step is to filter only on addresses that state Siesta Key. We do this here, and now we have 302 rows, which matches what we saw up top. We can preview the first five rows of our data set where the address is Siesta Key, and we have the name of the property and the other relevant information. Next, let's look at room type. I'm doing the same type of aggregation here, just switching it out to room type in the middle. We can see that 34% of the time these listings are an entire home, and in general, "entire" comes up about 95% of the time.
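The count-and-percent aggregation and the Siesta Key filter can be sketched like this. The five-row data frame, the column names, and the helper `count_by` are assumptions for illustration; the real data set has 500 rows and its address values may be longer strings.

```python
import pandas as pd

# Toy data standing in for the scraped listings (made-up values).
df = pd.DataFrame({
    "address": ["Siesta Key", "Siesta Key", "Siesta Key", "Sarasota", "Bradenton"],
    "name": ["A", "B", "C", "D", "E"],
    "roomType": ["Entire home", "Entire condo", "Private room",
                 "Entire home", "Private room"],
})


def count_by(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Count listings per value of `column` and add the share of the total."""
    out = (
        frame.groupby(column)["name"]
        .count()                            # number of names per group
        .reset_index()                      # turn the group key back into a column
        .rename(columns={"name": "count"})
        .sort_values("count", ascending=False)
        .reset_index(drop=True)
    )
    out["percent"] = (out["count"] / out["count"].sum() * 100).round(1)
    return out


by_address = count_by(df, "address")
print(by_address)

# Keep only the Siesta Key listings, then reuse the same aggregation.
df_siesta = df[df["address"] == "Siesta Key"]
by_room_type = count_by(df_siesta, "roomType")
print(by_room_type)
```

If the address column holds longer strings such as a city plus state, a substring match like `df["address"].str.contains("Siesta Key")` would be the equivalent filter.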
So if we're going to be targeting Siesta Key as our next destination for short-term rental properties, then we want to make sure that we're not just renting by the room, but actually allowing guests to have the entire home available, since that's what's mostly in demand. Next is more of a hypothesis. In my mind, I'm thinking the lower the stars, so the lower the ratings that guests give a listing, the lower the price likely is, because hosts can't command a high price when their rating is down. So let's view this here. We can see that 175 listings, so over half the listings, actually have five stars, which is really good for this area. There are only two listings at the bottom with 3.5 stars. In our third column, we have the average pricing, which we were able to get up here by aggregating by the mean on our price column. On average, five-star rated properties make at least $100 more than properties with lower stars. So once we get our vacation rental and start renting it out, we want to make sure that we are on top of getting five stars, because from what we can tell here, it does make an impact on our pricing and potentially revenue. Next, we want to look at visualizations to see the distribution of our data. In the first one, we're going to be using a histogram in Plotly. Plotly is an amazing, amazing graphing library. I've been using it for probably over five years at this point, and I've seen it go from a 20-line code solution all the way down to a one-line solution using Plotly Express. With Plotly Express, we're able to view histograms, and we can also download our histograms to put in, say, a PDF if we're presenting this to someone. So let's run this on the number of guests, and we're going to set our bins to 10. Here, we can see that most of our data lies within six to seven guests. About one third of our data set falls here.
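The stars-versus-price table could be built along these lines. The five listings, their prices, and the column names are invented for illustration; only the shape of the aggregation (count of listings and mean price per star rating) follows what is described.

```python
import pandas as pd

# Made-up Siesta Key listings to illustrate the aggregation.
df_siesta = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "stars": [5.0, 5.0, 4.5, 4.5, 3.5],
    "price": [600, 500, 400, 300, 200],
})

# One row per star rating: how many listings, and their average nightly price.
by_stars = (
    df_siesta.groupby("stars")
    .agg(listings=("name", "count"), avg_price=("price", "mean"))
    .reset_index()
    .sort_values("stars", ascending=False)
    .reset_index(drop=True)
)
print(by_stars)
```

A table like this shows a relationship between rating and price, but it does not control for size or location, so it is better read as a hint than as proof that ratings drive price.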
So this is giving us a good indication that if we want to have a rental here, we need at least a house that can accommodate six guests. There are some cities where you'll see two to three guests being the most common, because they're usually geared towards couples. But in this case, the market is more geared towards family vacations with a larger group of people. Next, we will look at daily price, and we use a box plot to do this. This also lets us see the distribution of our data split into four quartiles. If we hover over it, we can see that the median price for our listings is $437. 75% of our listings go for $800 or less for the nightly rate. We can look at our upper fence as well as the minimum and maximum. The minimum and maximum here are pretty disparate. One listing only goes for $68 a night, and our highest listing goes for $4,828 a night. So that's a big difference. Maybe we have some outliers. How do we check this?
Ariel Herrera 12:09
We could actually use a scatterplot to view pricing versus number of guests. Let's do that in the next cell. Again, it's only one line of code: we specify our data frame and the columns that we want to plot on our x and y axes. So now on our x axis we have the pricing, and on our y axis, going up and down, we have the number of guests. We generally have a correlation, as we'd expect: the larger the house and the more guests it can accommodate, the higher the price is, with most of our data falling in this range. If we had seen something like that earlier example, where it was a private room for only two people but was going for $7,000 a night, that would raise red flags about outliers in our data set that we probably want to remove. But in this case, looking at these visualizations, the data makes sense. Now we can take the next step of visualizing all of our addresses within one single chart. To map all of our Airbnb listings, we are going to use the Plotly Mapbox scatter plot. Make sure, again, that you do get a Mapbox access token. It is free, and then you'll be able to generate these types of charts. What I typically do is find a chart in the examples that is as close as possible to what I want, copy the code, and then tweak the code based on my needs. So let's walk through this code here. The first thing we want to do is set up the hover. We see we have a hover here, and we want to be able to see the listing ID, which is the unique ID. So we're going to test this by grabbing the first URL in our data frame and splitting it to get the listing ID. Let's run this. We can see here that we printed our test URL, which is https, airbnb.com, and then at the end is the listing ID, which is what we want. What we did was split this string by the forward slash and grab the last element, which should be this. And once we did that, we can see that we did get our listing ID.
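The split test on a single URL comes down to a couple of lines. The URL below is an assumed shape (a listing path ending in the numeric ID); the real scraper output may differ, for example by carrying query parameters after the ID.

```python
# Example URL in an assumed format, ending with the listing ID.
test_url = "https://www.airbnb.com/rooms/3851053"
print(test_url)

# Split on "/" and take the last piece, which is the listing ID.
listing_id = test_url.split("/")[-1]
print(listing_id)
```

If a URL did include a query string, an extra `.split("?")[0]` on the last piece would strip it before using the value as an ID.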
Now that we've tested this for one single example, we can apply it to all rows of our column, which is what we do in the next cell. When you create plots that need modified or added features, I highly suggest creating a separate data frame. I usually create a data frame called df_plot, and if I have multiple, I'll just number them df_plot one, two, and three. So here I copy the original data frame and set it to a new data frame called df_plot, and I create two different features. What we just did up top, splitting to get the listing ID, I do again down here, but now I apply it to all rows within the URL column, and I use apply and lambda to do that. Next, we want to combine the listing ID with the actual name. So in this case here, where we have one of the listings, we want to see the listing ID, which is 3851053, and we also want to see its name so that we can locate this property down the road. That's why I have these two strings being joined over here. If we run this and look at our new column called header, we can see that we do have the listing ID and the name of the listing itself. Now for the fun part: creating the Mapbox scatterplot. What we do here is first set our Mapbox access token. In this case, I already created that variable for the Mapbox API key, so I'm just referencing it here. Next, we create our scatter Mapbox by passing in our df_plot and stating the columns that relate to latitude and longitude. Then for color, I want to see the color change based on pricing, so I can quickly see where the high-priced nightly rentals are versus the low-priced ones. And for size, we're going to have the bubbles increase or decrease based on the number of guests. So if a listing that we are plotting has only two guests, it's going to be a smaller dot than one that has 14 guests.
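Building the two helper columns on a copy of the data frame might look like the sketch below. The listing names, the second URL, the column name `listing_id`, and the " - " separator are assumptions for illustration; the listing ID 3851053 is the one mentioned here.

```python
import pandas as pd

# Made-up listings with URLs in an assumed format.
df = pd.DataFrame({
    "name": ["Gulf View Villa", "Beach Condo"],
    "url": [
        "https://www.airbnb.com/rooms/3851053",
        "https://www.airbnb.com/rooms/1001",
    ],
})

# Work on a copy so the original data frame stays untouched.
df_plot = df.copy()

# Feature 1: listing ID, taken from the last segment of each URL.
df_plot["listing_id"] = df_plot["url"].apply(lambda u: u.split("/")[-1])

# Feature 2: hover header joining the listing ID and the listing name.
df_plot["header"] = df_plot["listing_id"] + " - " + df_plot["name"]

print(df_plot[["listing_id", "header"]])
```

Because `df_plot` is a copy, columns added only for plotting never leak back into the data frame used for the earlier aggregations.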
Ariel Herrera 16:30
Here I specify the color scale that I want, and you can select another color scale if you'd like. Then for the hover, we're going to use the header, which was the new column that we created. For zoom, I had to play around with this a little bit, but ultimately, for Siesta Key, 13 was the right zoom, and then I had to stretch out the layout so that I could see the full view. Now let's press play, and our chart is generated. We can see that this map resembles the same map that we saw within Google. This was the map here, this was Siesta Key, and we can see the same type of map as well. If we look at the legend on the right, it shows us where in the color scale the listing prices fall. From our view, we can see that a lot of the listings here are purple or indigo, which makes sense, because this represents properties that are less than $1,000 a night. When we went back to our box plot and looked at the distribution of our data, we saw that 75% of our listings were under $800 a night, so this makes complete sense. But it also allows us to quickly gauge the high-priced properties, the ones that are the outliers here. Where are they actually located? And could we potentially get a property in one of these areas? Most of the higher-priced properties have a higher number of guests, from what we saw on our scatterplot. Most of these are right by the beach, and a lot of them are centered within this area, because this is actually a downtown within Siesta Key, with restaurants and bars. Now, if we want to view one specific listing, we can look at this one. We have the listing ID and we also have the name. So if we want to view the single listing, we can input the listing ID right here, return the URL, and view the data frame. If we click this Airbnb link, we can see the listing, and the name matches up. This is an absolutely gorgeous, gorgeous listing, which hopefully one day I'll be able to stay at.
And this one goes for $2,800. If we want to look at check-in dates, we can see this is pretty booked for several months out. We did not get that information initially in our scraper, but we do have the ability to add it into our input if we'd like in the future. I hope this tutorial has been super useful for you to get an understanding of, first, how to grab data using the Apify web scraper, which we did in the previous video; second, how to load your data into Google Colab or whatever Python notebook you're using; and third, how to generate visuals so that we can understand the distribution of our data set and then dive further into some of these properties. If you enjoy this content and you want to see more short-term rental tutorials with data, then please make sure that you add that in the comments below, like this video, and subscribe. Thanks so much.