Python DataFrames | Real Estate Analytics

Oct 6

Video Block

Double-click here to add a video by URL or embed code. Learn more

Ariel Herrera 0:00

In this video we will cover a panda's data frames. These are tables with rows and columns similar to Excel or Google Sheets. You will create a panda's data frame based on property addresses and learn how to manipulate data. Let's jump in to Lesson Six notebook from our table of contents. What is a panda's data frame? PanDa's data frame is a two dimensional size mutable, potentially heterogeneous tabular data structure with labelled axes rows and columns. Put simply, it looks like a CSV or Excel file. If we look here, at this image, we have rows and then we have columns. And this example we're looking at players from the Boston Celtics basketball team, and the first row labeled as zero we have the name Avery Bradley team. He's on Boston Celtics, his numbers zero position, point guard and age 25. pandas data frames allow us to save data and then be able to manipulate it later on. Based on our purposes, we are going to use pandas data frames excessively in this course. And if you're looking to get in data analytics as a data analyst, data scientist and even a data engineer, this will definitely come up in your career. To get started, let's install pandas, we could run the first couple of cells here. Now I have some resources up top, a lot of them because there's so much information on pandas in a good way of tutorials for beginners. We'll starting off with the first link, we could look directly at pandas documentation. Pandas has a lot of different methods of things that we could do, like aggregating items, looking at box plots, which will cover copying your data frame and so much more. So you could always come back to the documentation if you need to reference one of the methods. Now going back into our notebook, we ran the first three cells. The first one is installing pandas. Luckily for us with Google collab. It already has pandas in our environment, but I want to show you what it looks like if you were using another notebook environment. Next, we're doing our first import. We are telling Python that we want to use functions created by pandas. So we import our pandas library. And we don't want to write pandas each and every time we want to use this package. So we re label it and nickname it PD. Every time we reference PD, it is going to be representing the pandas package. Now our first step is to create a data frame. If you're unsure how to create a data frame or anything in pandas, the best thing to do is go on Google type Python, pandas, and then the question that you have. So if you were to type in Python pandas create a data frame, you would likely come to the docks, where it states how to create a data frame, let's briefly go over what these stocks mean, because you will use this in the real world a lot to solve your problems. So here in this documentation, we have top the panda's data frame function. And then we have different parameters inside. So we can state our data, our index, our columns, D type, and copy, which there's explanation of each of these parameters here below columns is intuitive. That's going to be our headers. Now, if we scroll down a bit more, pandas also provides in documentation, similar functions that we could use to the current one that we're looking at. We will use read CSV pretty often in following lessons. Then down below, we could look at specific examples. In this example, they're creating a data frame, it starts off as a dictionary, there's a key column one and other key column two, then each key has a list. This dictionary is inputted into the data frame in the parameter data. And then we have here the table that is created, these keys represented our columns, and the information inside of them represented the data coming from those columns. There are more examples down below. And I highly encourage you to look at those as well. But the best part in general is that anything related to pandas has likely been done before. And there's lots of documentation online and google for you to be able to reference. Now let's go back to our notebook. And we could see here we are creating a data frame. I'm going to press play and then explain the contents. Let's look at the data frame. Here we have a table down below. It has addresses as one column, second column is income, third column is expenses. Then we have different rows going from zero to nine, which in total is 10 different rows. The way we created this data frame is some Similar to how it was done in the documentation by using a dictionary. And if we go back, we could see we had three different keys, which represented each column. So address, we had the address key, and then we had a list of different street addresses. Then the next column was income. So we have an income key. And then we have a list of all income, and expenses. The same is a key here, and it's associated to valleys, which is a list of expenses. What's really neat with Google collab is it allows you to explore the data in more detail, we could perform filters, and you could dive in deeper if you want to learn more about the interactivity of Google collab with these data frames. Now let's jump into summary stats. If we view the type of our data frame, we would see that we're going to get data frame returned pandas dot core dot frame that data frame. And if we want to view the info of our data frame, we can type dot info with open parentheses.

Here we could see we have three different columns. There's no null values, which this will be explained later on and handling missing data. But what we see here is that all of our streets have an income and an expense, which is good. Now if we want to understand our dataset in more detail, we could write df dot describe for each of our numeric fields, we can see that we have some summary stats, and this will be covered more in our stats lecture. The Count shows that there's 10 different rows, the mean value for income is 2050. And for expenses, 1760, we have standard deviation, minimum, and then we could see the quartile split, as well as maximum. This gives us a good picture up front of our data set and what the distribution might look like. Next, let's go into selecting data. If we want to get the length of our data frame, so the number of rows, it's the same that we've been doing previously, we wrap our object or our variable in Lehn. Now we can run our code. And we could see here that we have 10 different rows. Now if we want to look at columns, it's going to be the same except we're going to add dot columns at the end. And now we could see we have three different columns. What is is dF dot columns, if we copy this code and run it, we could see this allows us to view the names or the headers of our column, which is address income and expenses. This is a pretty small data frame about 10 rows, you will likely work with data frames that are a lot larger than this. I've worked with data frames, even in the billions of rows using Spark. What if we want to just view the top five rows in our data frame, we could do this easily in pandas, we type df, and then head this will grab us the first five rows, we could see here, it starts with 123 Main Street and ends at 302. Pine Street. And if we look up top when we created our data frame, that is accurate. Now what if we wanted to get the last five rows, we could do this by setting DF dot tail, and this will get the last five rows in our data frame. This is super useful if your data is sorted. Maybe you have the properties with the highest income first. So if you just grab head, the first five, then you know those are the top income properties. What if you want to view data, just the first three rows? Well, you could also put d f dot head and put three in the middle, and you'll just get the first three rows. There's another way to do it too. By indexing our data frame, we could select DF dot I lock, then you square brackets. And just like we did with lists, we can index it with a colon three and get the first three rows. This is really useful because pandas plays on the same type of logic that lists have when it comes to indexing. So it's already prior knowledge that you have next let's select income column as a series type. If we use the same logic that we just did previously of indexing, we know that income is going to be number one when it comes to columns, since Python always starts with zero. So we can change this, we want to get all rows so we could leave that colon there. And then we can add a comma and select only the column income. And now we see all of the values. And this is actually a series that's why it looks different than data frame. It's not in a data frame structure. If we want to get it in a data frame structure, we can copy this and get everything between one and two, which is going to allow us to see just this column in a data frame. Another way to select a single column is by wrapping it in square brackets. So in data frame, if we want it to get just income, we'd see that again as a series. If we want to see it as a data frame, let's actually grab expenses as well, we will have two square brackets, and then we will type the second column, which is expenses. And we could see here we've now narrowed down our data frame from three different columns to two selecting columns is definitely useful. But one of the main things we're going to be doing is adding columns, adding information based on the previous columns that we have. Let's add a cell here and be reminded again of what our data frame looks like. We could type data frame, and then just get the first row. That way we could see all of our columns and example row to work with. Now what if we want to calculate a new column called cashflow? Well, we already have income as a column and expenses. cashflow is income minus expenses. So in this case, we're going to say our new column, it will be called cashflow.

Then we're going to subtract the column income and the column expenses. Now let's view the top five rows of our data frame. We could see now for our data frame that we have a new column called cashflow, and a correctly calculated the subtraction of income and expenses. What if we want to see cash flow for all of our properties that are in Tampa, we can't do that right now, because this is stuck within this single column. Let's add different columns to split street, city and state, the best way to start is always selecting one single example. So I'm going to set a new variable called address. And it's going to be the first address that we see within our data frame. I'm printing our sample address, which is the first one in our data frame. That's why I selected the column and then located the first element. Now what if we want to get just Street? Well, if you remember from our previous lessons, that we can take an object like a string, and we could split it based on certain criteria. In this case, we could split it by a comma. So let's add dot split, and then enter in the comma here. Now we could see we broken out each of these into different sections, we only want to grab the first one for street, so we can grab the first element by indexing it. And now we have our street as 123 Main Street. Let's copy this logic and add it to the next two rows to get the city and state. So this will be indexing the next element which would be labeled one, and then to run our cell. Now we have 123 Main Street, and then the city and state. But if you see here is a bit of a gap. Because we're splitting on comma, it's not going to take away the spaces that are apparent in between the city and the state, we can use one of pythons built in functions called strip in order to remove any whitespace from the beginning or end of our strings. So I'm going to add strip here to both city and state. And now we have a cleaner view separating our address into street, city and state. Now that we've created these examples, how do we add them as columns. So we're going to follow a similar process where we first establish our column name. So let me copy this here, and paste sit down below. Now let's rename this to St. Since we want to get a column called Street. In Python, in order for us to add new columns, we can use a lambda function in order to do so here's the syntax for a lambda function, we have our data frame, then we do dot apply, and an open parenthesis, we established lambda, what are we going to call each row in our data frame, in this case, I'm just going to have it as x. And for each item, we're going to do something. So in this case, we want to take Street from address, so I'm going to copy address, paste it, and I'm going to print the first five rows. We could see here, we just made a carbon copy of our original field address, but we want to get street. So we saw up top and we did it for a single example, we were able to split and select an element. We can copy this here, apply it, run the code, and now we see we only get st which is perfect. It's exactly what we want. It's simplified of one single line of code. Now let's do the same for the next two. I'm going to copy this line, answer two more. replace this with city and then with state. Now our logic is a little bit different where we're stripping our values. So I'm going to copy the logic that we had up top, replace it with both city and for state.

Let's run this cell. And now we have address separated as street, city and state, we're starting to really get a data frame that has a lot of rich information. Now we have cash flow. But what if we want to actually see it bucketed into different groups. For example, we want to have all properties with cash flow between zero and 250 as group one, well, in pandas, we can actually do binning pretty easily. Let's first go back to DF dot describe. So we could see the min and max of our values. For our cash flow, our minimum is negative 100. And our maximum value is 650. That way, when we create our bin here, we know what the range is. So here, I'm setting a range of values between negative 250. And they go up incrementally to 1000. I could have created a range itself, but I just manually put these numbers for simplicity. Now we create a new column and it's going to be called cashflow Ben. In order to bin our information, we're going to write pandas, so PD dot cut, put in our column that we want to cut, which is going to be cash flow, and then comma bins, since we labeled our bins already, and let's view the first five rows here, pandas takes in our cashflow column, reviews of bins that we've established, and then associates bins to each and every one. If we look at this line right here, number three, our cash flow is 650. So it falls into our bin of between 500 to 750. Perfect. Now let's write our last new column here or new feature, we're going to call this DF cashflow rank, we want to see the worst performing and best performing properties. In order to do rank, we will first establish our column name, which is going to be kashflow. Then we're going to add the method rank. Now let's look at the first five rows of our data frame. This doesn't really mean much to us, let's sort these values so we could see it from highest to lowest. So let's remove that head and put sort values by and this will be cashflow rank our new column. And let's view this information. We could see here that negative cash flow the worst performing one is showing up as one, we don't want to see that we want to see the highest cash flow as number one, since it's going to be the best performing by going back to our pandas documentation. If we look at the rank function here, it takes in different parameters, one of them being ascending. Whether or not the elements should be ranked in ascending order. We want this to be false, since the default is true. So here we will write ascending and establish it as false. Now we could see our top performing property is ranked at number one. But we also have 3.5 odd values for rank. How do we just keep tiebreakers as different numbers? Well, if we look at pandas, we could see there's different methods of how we rank. The default is the average. If we want we could set it to dense, dense always increases groups by one. So if there's ties, it just goes from three to four instead of it being 3.5. That's going to be more easily readable for our data frame. So let's add here method as a parameter. And now change it to dense. We could see for tiebreakers, we don't have 3.5, it's actually the same value. There are other ways that we could force it to increase the value even if it's the same. But that's going to be out of scope for this lesson. In order to sort values by cash flow, we actually did it up top to make this a bit cleaner. We can move this code over into the next cell.

And if we want to look at tail instead of head, we could see our worst performing properties if one of them actually having negative cashflow. Now we have our data frame with features. We have metrics like cashflow, more detail locations split into street city state cashflow bin as well as rank we can start to answer questions that we have in our analysis. First, how many properties do we have in each date? Or we're answering questions like this where we're going to be grouping information by state we will use the group by function in Python pandas. Pandas has documentation on the group by function, it takes in different parameters and there has plenty of examples down below as well that you will find for free online, I'm going to show you how we get the properties for each state. So we're going to start for data frame and then add dot groupbuy. And we want to group by the column state. So we will wrap this in square brackets. And what do we want to count, we want to count the number of properties. So let's enter here address and then state count. Now if we view this, we could see in Florida, we have six properties in our data frame. And we have four properties in New Jersey. This is right now a series if we want to have it as a data frame once more, we can add dot reset index. And now our return value is in a tabular format where we can query it as a data frame. In our next part, how many properties do we have in each city? Now we want to look not just that state, but city. So we can copy over this code. And we can add a comma that we want to group by city as well. Now we have it split our account by the cities, we could see the city with the most addresses is Toms River in New Jersey, if this group by looks pretty familiar. You are right, because in Excel, this is the same thing as pivot tables, or summarizing our data by grouping information. Let's answer the next two questions. What is the average cash flow per city? Let's again copier group busts, we're going to be getting the information by city. But this is going to change. We don't want the count of addresses, we want the average cash flow. So we can input here change address to cash flow, which is our column. And instead of count here, we're going to put mean since we're going to get the average we can see here that our best performing city is Miami and Florida. Our worst performing city is Jacksonville in New Jersey. Let's take a step back. And overall, let's look how many properties fall into each cashflow bin. Are there any outliers that are making our cities look like they're performing super well. But in reality, it's just one property doing well in a bundle of them. Let's copy this up top. And instead of grouping by city and state, we want to group by our cash flow bin. And in this case, we want to count properties once more. So we can copy our code or recounted addresses and replace this down below. Now we can run this cell. And we see here that most of our properties fall between zero to 250 for cashflow, then 250 to 500. And we have two outliers and the other ends. This gives us a great picture of what our current portfolio looks like and areas for improvement or further investment.

Now how do we filter our dataset. In this case, we saw that our best performing city was Miami followed by Tampa, maybe we just want to look at those cities. And we want to remove any city that's from New Jersey, the way to filter and pandas is very simple. We add DF. And now we're going to locate DF here and state. What is our column that we want to filter on this case, we want to filter on state. And we don't want a state that equals New Jersey. So we can have an exclamation mark equals and then type New Jersey. Now we're going to see properties for our other cities it filtered out in New Jersey, we could also look at just New Jersey by keeping an equal sign. And now we could see just New Jersey properties. Another way for us to filter is by using lambda similar to how we created new columns, I copied previous code. And you could see it follows a similar structure where we establish our data frame dot locate, instead of having the column here, we set each value as x and then we do something to x, we state that we only want to look at properties in New Jersey, or let's actually change this to not New Jersey, and it returns the same values that we had previously. So it's just two different ways that we can filter our data set. If you're already mind blown of how amazing pandas is and getting information from tables. It takes it even a step further by allowing us to quickly plot our data. So what if we want to visualize our cash flow before we were able to view our data in bins, and we were able to see where most of our data lives, but it's going to be a lot easier to visualize this, especially for communicating it with other stakeholders. In order to visualize your data. We're going to set what column we want to visualize, which is going to be the cashflow column within our data frame. Then we're going to state the method dot plot. And the kind of plot that we want is a histogram. The title of our plot will be cashflow histogram. We could see from this chart that we have frequency on the Y axis. On the x axis, we have cashflow, we can see that most of our cash flow lies in this higher region with one single property that has negative cashflow at negative 100. We will detail this information more in our statistics lesson as well in our visualization part coming up next. Now for our last plot, we're going to create another way to view our data. This is going to be through a box plot. So we can set our data frame box plot, and state that we want to look at the column which is cashflow, I need to add an underscore here. And now we could see where our distribution lies. This is a box plot and I have an explanation up top of what it is basically splits our data so that we could quickly see any outliers. And as I mentioned previously, we have one single property that's a cashflow of 650 and one that's negative 100. Our median is around three, a little bit less than 300. This is a quick way for us to understand where our distribution is. And again, we'll be covered more in statistics section. This is one of my favorite lessons, we created our first Panda's data frame. We did summary stats, selecting data, adding columns or features, which is super useful if you're going to data science, sorting values, grouping information similar to pivot tables in Excel. We filtered our data and we plotted it. And if you noticed throughout this whole lesson, we use a minimal amount of code, which is why Plotly is so amazing of a package to use. We also briefly touch on that plot live visuals. So let's explore Plotly visualizations in the next lesson. See you there want full access the introduction to real estate Data Analytics course. Then sign up for the course in the link below. You will learn Python programming all with real estate related examples. This includes web scraping, retrieving data from sources like Zillow, realtor, Redfin, Yahoo Finance, US census and more. If you haven't already, check out the introduction video to the course on YouTube to get a full understanding of what the course has to offer. Also members of the free tech and real estate group on Facebook receive a 20% off the course. See in the next lesson

Ariel Herrera

Python DataFrames | Real Estate Analytics

Python Visualizations Plotly | Real Estate Analytics (Copy)

Python Functions | Real Estate Analytics

AnalyticsAriel