How to Create a Streamlit App to Merge Files using Python
Working across multiple data sources in spreadsheets? Check out how I combine files in an easy to use interface using Streamlit in Python.
Transcription
Ariel Herrera 0:00
Hey everyone, do you find yourself working across multiple spreadsheets and you just want to combine everything into one? Well, I'm going to show you how to do that using streamlet, and Python to create an easy app. In order to merge different datasets together. The tool that I've created uses prop stream data, which is a real estate software that takes leads of potential people to buy properties from, and then takes their contact information so that you can reach out to them merges those together automatically. My name is Ariel Herrera with the analytics aerial team, we bridge the gap between real estate and technology. I love love, love being able to simplify data, merge it together and develop insights. If that's the content that you enjoy, then please subscribe to this channel. As well like this contents, I need to make more of it. Alright, let's get started. As a quick overview, or we're looking to be able to do is combine two Excel files into one single file. And we want to be able to create a streamlined app so that any user whether they know Python or not, can go drop in those files and get that merge file back. So the way that I started working on this was because I was using a software called prop string, which is a real estate software, and information on properties as well as the contact information for those owners. So what those two spreadsheets look like, are as follows. So I have one Excel spreadsheet here, which basically gets information for a specific set of properties within an area called Woodbridge in New Jersey, I have information on the property, everything about the square footage, number of bedrooms, bathrooms, estimate and so on. Then on the other hand, I have a spreadsheet where I get the contact information of those homeowners reason being so I can reach out to them and try to possibly put an offer on their home. This is a technique used in wholesaling for off market properties. But aside from the real estate technicalities, the main thing is that we want to bridge these two data sources together, we don't want to have property owner property information, property owner contact information in two separate files. Reason is once we work with these files and other systems, in this case, I'm using Podio as a CRM system, it's going to be difficult to upload two separate files each time I have a new list. So just to play around with this, I first decided to see if I can merge these files successfully within a notebook environment. So right now you're looking at Google collab, which if you have a Google account to run, this is completely free. And within this, I'm looking at two different lists. So I have my marketing list, and I have my contact list. To start, I have my imports at the top. This is what allows me to read files that are inside my Google Drive. So here, I then specify where my files are, which I have within a folder called leads, I have my files here, the two that I want to be able to extract. So then I go over to the bottom, and I have a really useful function just to read Google Sheets. So I read both of these files in. And we could see this first one has the dress unit city state zip, which is the equivalent of that Excel file that I showed you beforehand, that's the property export. Then we have a second file, which is the actual contact information for these homeowners. This includes information like the landline, the cell phone number, if they're on a Do Not Call list, as well as email, if that is found. Next, I have an output that I want to be able to merge these files. So using Python pandas, really easy library to use. And here I have both my tables, I decide that from the contacts table, I don't want to grab everything. So I write here, I actually specify what columns in a table I want. Then in this next field here, I actually start changing the names of some of these columns. The reason is, is that the first file has first name listed as owner one first name. So instead of us having two separate columns of first names, let's just change one of the files so that they both can match up. So that's what's being done here. Then, with a simple PD dot merge function, I'm able to pass in both of my data frames and say what columns I want to join them on. And then I can print out the first three lines of my new data frame which has again the property address information, and then all the way at the end it has that contact data as well. And then output this into a file down here, which I'm able to download straight from Google collab. Now this is great and super easy. Only a few lines of code. But what if you're working within a team, and you want to be able to let business users have the ability to quickly drag and drop files, they're not going to want to work on a Google collab environment. So this is where creating apps within streamlet are super useful. So as a background, I'm a huge Plotly dash person, I've actually gotten really deep into the documentation of Plotly over the last several years. So I was a little bit apprehensive of using streamlet. This because learning a whole new tool does have its learning curve and can have complications. But streamlet is amazing. It actually takes so much less code to get started. And you could spin up the web apps directly on stream let's cloud for free. So what am I even talking about? Let's look at Streamlight. Quickly, it's the fastest way to build and share data apps today using Python and not having to know all that front end code. This is an example of one of their streamlet apps that they have in their gallery. It's New York City Uber ride sharing data. So say if you wanted to see where most Ubers are called from in New York City at a given time, you could quickly use this app, this tool is super dynamic. We have all the data loaded into these maps here. And we can select different hours for pickup. So if we wanted to know where most people are being picked up at 2pm, in New York City, we could see that based on the data, that it's mostly at the JFK Airport, and following LaGuardia Airport as well, which which makes sense because they're popular times to be traveling. Now let's get back into creating our own streamlined app that will merge different datasets. Remember, even if you have different datasets than I do, you can follow the same process for other CSV files or Excel files that you may be using. Right now I'm working within VS code Visual Studio code. It's an easy to use IDE to get started and write some Python script. So from here, I created a folder called streamline app merge files, and I created an app.py file. So what I'm going to do is just import what I'm going to be needing. So I need streamlet as one of my packages as well as pandas, then let's just test to make sure that we're able to get our app up and running.
So here, we're going to write what I want my header to be, which is prop stream, verge leads and contacts. So now if I save this file, I'm going to create a virtual environment for us to work in. So I'm going to call virtual. Net next, I'm going to go into my virtual environment. So I'll always type out source and then my virtual environment name, which has been then activate. And we want to install streamlet. So let's install streamlet. One Streema is installed, we want to be able to run our app to test. So we're going to type into the terminal streamlet run app.pi. And when we do we may be met with an error. So in this case, I have this terminal decoder error, which actually happens pretty frequently. So in this other thread, we can just copy to the CD. So change directory into streamlet. Faceless, then go back and remove this config to ml file, which will need to set that back when we tried to upload this. Or for now it's fine. So now let's go back to our directory. Now that we're back into our directory, we can then again run streamlet Run app that py. And when we do, we should see our app pop up, and we have our header right there. So our app works perfect. Now we can start adding on to some code. So the next thing I want to do is to actually allow user to upload some files. So in this case, I'm going to create these two variables. And one of them is marketing list Upload Files that's more specific for me, you could just call your file upload file one and two. And I use for streamlet. This file uploader function, which allows you to ingest many different files. So in this case, I'm expecting an XLSX file as well as a CSV file. So if I save that, and then go back to my app, refresh, which is being hosted locally, we can now see those two pop up, which is amazing because as a Python developer on the back end working with data, I don't want to be figuring out all the front end stuff so it's already have this icon in place. The button in place is just amazing and helps us really spin up these apps quickly. So the next thing we want to do is to be able to handle whether the button has been clicked or not. And to see if these files are uploaded. So first, let's see if the files are uploaded, we are going to add this in here. And if the file is not none, meaning there's some objects there and the file has been uploaded, then the first one, in my case, it's going to be an Excel file. So I want to read it into a panda's data frame. Then for the second one, I know it's a CSV file. So I want to read it in as a panda's data frame as well. And next, I have these two lines of code, which are more specific for the files that I know are being uploaded, I want to be able to take a part of that file name so that later when I merge the files, and I produce it back to the user, I'm able to say the specific city or state that they were actually searching for within their files. So now if we save this, we're not going to see anything change, because everything is still happening on the back end. But how are we actually going to get things moving? Once the files are dragged and dropped, we need to create a button to do so. So I'm going to copy several lines of code, which is all within a GitHub repository that's linked below if you want to take a look. And in this case, I have a button that I'm naming called streamlet button merge. So that's going to be the ability for the user to merge the files. I save this, I now see that merge button. Now let's understand what's happening here. Because I have two files that I'm uploading, I want to be able to handle different scenarios. So in the first scenario, if I have both files ready, then I want to start merging them. But in the other cases, maybe the user only uploaded one file, and they still click Merge. Well, we can't do anything with that we need both files. So we want to be able to handle the errors here. So here, if a user inputs one file, and not the other than I asked them to please upload that second file. So let's try that out here. Let's say that I browse files, and I select a file here.
Well, this is actually because the code hasn't been completed. So actually, let's finish out the code that we have. So let's go back to VS code here. And if you're using VS code has some really useful functionality like this, where it shows, hey, you have a function being called merge files, you're calling it, but we don't see it here. So let's actually put in the function that we have. So if you remember back in Google collab, we use pandas to merge the files. And this is a piece of code that's doing it, we're dropping columns we don't need we're renaming columns. And then we do a merge and return that. And the last piece that we also have, is specific to streamlet. So when streamlet handles downloads, because eventually we want our users to download this file, we need to be able to transform our data frame to encode it, and UTF eight format so that it can be processed by the web, and a user can have it in their hands. Now he's tried to download the file, we do still get an error, and it says, missing a dependency open pie XL. So because one of the files that I have is an Excel file, I need to be able to handle that. So I'm going to copy that and actually come out of my app. And I'm going to do pip install open pie XL. This is going to allow us to work with the xls. X file. So now I'm going to go back into running our streamline app. And if I go back to the code here, we can browse the file, which we want to get the property information, then we want to get the CSV information, which is this for contacts. And now we click Merge. Now when we click Merge, we had some things happen on the backend. So what actually happened. So we said once both files are populated, merge those files, which was that function above, then create three separate columns, which streamlet makes it super easy to basically separate out your, your different elements on your webpage. So I wanted to quickly summarize the number of leads, which is basically number of rows in my file, how many had certain fields populated? In this case, I had a column for cell phone numbers and a column for emails. So I want to see how often those are populated. So that logic is being done. And these cells right here, sorry, these rows. So I'm basically saying count the number of rows in a data frame, return that, and then also handle some logic to get the percentage of how often that column is present, that data is present in that column within my data frame. And the last piece here is I want to set up that download buttons that a user can then click this file. It's useful right now to have a preview over here, what we need them to have it in their hands. Next, we do a download file, and we specify that button is going to be called download. And for the CSV file, we customize the name based on what the user previously uploaded, which was back up top. I had some logic here to make sure we retain some of that file name. But you could just call it file dot CSV if you'd like. And the last bit was that output where we show that first 10 rows. So now we'll user can click download. And they're able to see that file that has the property information as well as the contact information. Overall, I hope this has been super useful to see how you can quickly spin up web apps using streamlet. And how you could use theme light to be able to merge different data frames together and provide that as a downloadable CSV file to potentially business users or other applicants. If this is the kind of content that you enjoy them, please be sure to subscribe so you can see more of it. Thanks