Pandas, AWS, and JSON
Reading data with Pandas, from AWS, in JSON format.
Getting data out of AWS S3 and into an organized form has been a fun problem to take a stab at. Fetching files from S3 by date, building a DataFrame from them, and then doing some parsing is what this is all about.
Finding files and making a DataFrame
The following code finds S3 files whose last-modified time falls between a begin and end date and builds one large DataFrame from all the results. Below I am using the read_json function, since all my files hold JSON records separated by newlines (which requires the lines=True flag).
```python
import boto3
import pytz
import awswrangler as wr
from datetime import datetime

# define your times
begin = datetime.strptime("20-01-19 01:58", "%y-%m-%d %H:%M")
end = datetime.strptime("20-01-19 02:05", "%y-%m-%d %H:%M")

# define your timezone
timezone = pytz.timezone("America/Los_Angeles")

# optional: localize to Los Angeles time
begin_Los_Angeles = timezone.localize(begin)
end_Los_Angeles = timezone.localize(end)

# or to UTC
begin_utc = pytz.utc.localize(begin)
end_utc = pytz.utc.localize(end)

# create a session (access_key, secret_key, and security_token are your AWS credentials)
bSession = boto3.session.Session(
    aws_access_key_id=access_key,
    aws_secret_access_key=secret_key,
    aws_session_token=security_token,
)

# use AWS Data Wrangler to read every JSON-lines file modified between the two dates
df = wr.s3.read_json(
    "s3://bucket/sub/path/",
    lines=True,
    last_modified_begin=begin_utc,
    last_modified_end=end_utc,
    boto3_session=bSession,
)

# show the results
df
```
JSON allows for nested records, and I had to pull these out of each row to use them for reporting. The best way to pull nested data out of a column like nested_data_field is to create a new DataFrame from the values of that column:
newdf = pd.DataFrame(index=df['index-field'], data=df.nested_data_field.values.tolist())
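As a quick sketch of what that line does (the column names and nested records here are made up for illustration):

```python
import pandas as pd

# hypothetical rows where each record carries a nested dict
df = pd.DataFrame({
    "index-field": ["a", "b"],
    "nested_data_field": [{"x": 1, "y": 2}, {"x": 3, "y": 4}],
})

# build a new frame from the nested values, keyed by the original index field
newdf = pd.DataFrame(index=df["index-field"], data=df["nested_data_field"].values.tolist())
# newdf now has the nested keys "x" and "y" as regular columns
```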
When you have a large dataset and are looking for a specific string in a column, I found the following to be most helpful:
df['column_holding_value'].value_counts().to_frame().filter(like = 'string-to-look-for', axis=0)
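Here is a small self-contained sketch of that pattern; the data and search string are invented for illustration:

```python
import pandas as pd

# hypothetical log values
df = pd.DataFrame({"column_holding_value": ["error-500", "ok-200", "error-500", "error-404"]})

# count every distinct value, then keep only the rows whose value contains "error"
matches = df["column_holding_value"].value_counts().to_frame().filter(like="error", axis=0)
print(matches)
```

The `filter(like=..., axis=0)` call matches against the index of the counts frame, which holds the distinct values themselves.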
This was super helpful for summarizing a specific pattern across thousands of rows for that quick answer to the question: Is that even in here?
What? There are duplicates? Let's count them just to be sure they exist, using col1 and then col2 as the criteria for determining a duplicate:
df_duplicates = df[df.duplicated(subset=['col1','col2'])]
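A minimal sketch with invented data shows what gets flagged:

```python
import pandas as pd

# hypothetical data; the third row repeats the (col1, col2) pair of the first
df = pd.DataFrame({
    "col1": ["a", "b", "a"],
    "col2": [1, 2, 1],
    "other": ["x", "y", "z"],
})

# flag rows whose (col1, col2) pair has been seen before;
# first occurrences are not flagged
df_duplicates = df[df.duplicated(subset=["col1", "col2"])]
# only the third row ("a", 1, "z") is returned
```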
I really like .nunique() for a quick summary of the columns over all rows of the DataFrame, answering questions like "How many different values do we have in each column?" or "Which columns contain large ranges of data?"
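For example (data invented for illustration):

```python
import pandas as pd

# hypothetical data with one low-cardinality and one high-cardinality column
df = pd.DataFrame({
    "status": ["ok", "ok", "error"],
    "request_id": ["r1", "r2", "r3"],
})

# one distinct-value count per column
unique_counts = df.nunique()
# status -> 2, request_id -> 3
```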
Now comes the best thing of all with Pandas: running each row through a function and making a new column, or updating an existing column, based on a set of actions you have determined.
df['new-column'] = df.apply(lambda row: myFunction(row), axis=1)
The above code takes the DataFrame df and applies myFunction() to each row, passing the row as a parameter. The function has full access to row and all of its columns. After myFunction() has done its set of actions with row, it returns a value, and that value becomes the new column's entry for that row. The above example uses the current row to do some logic and makes a new column for that DataFrame.
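Here is a concrete sketch; my_function and the columns are hypothetical stand-ins for whatever logic you need:

```python
import pandas as pd

# hypothetical data; column names are made up for illustration
df = pd.DataFrame({"status": [200, 404], "bytes": [512, 0]})

# the row function sees every column and returns the value for the new column
def my_function(row):
    return "ok" if row["status"] == 200 else "failed"

df["label"] = df.apply(lambda row: my_function(row), axis=1)
# df["label"] is now ["ok", "failed"]
```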
Grouping a DataFrame by some columns really starts to reduce the data into a summary of greatness. The following code groups by col1 and then col2, collecting the values of to-group that share the same (col1, col2) pair. The result is a new DataFrame with a column ending-group-col that contains the list of all grouped values. While this might not seem useful at first, read on to the next section, where we summarize this column even more.
df1 = df.groupby(['col1','col2'])['to-group'].apply(lambda group_series: group_series.tolist()).reset_index(name='ending-group-col')
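With some invented data matching those column names, the grouping looks like this:

```python
import pandas as pd

# hypothetical data; two rows share the ("a", 1) pair
df = pd.DataFrame({
    "col1": ["a", "a", "b"],
    "col2": [1, 1, 2],
    "to-group": ["x", "y", "z"],
})

# collect the to-group values shared by each (col1, col2) pair into a list
df1 = (
    df.groupby(["col1", "col2"])["to-group"]
      .apply(lambda group_series: group_series.tolist())
      .reset_index(name="ending-group-col")
)
# ("a", 1) -> ["x", "y"] and ("b", 2) -> ["z"]
```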
Grouping values into a new column as a list is great, but now we want to count each of the values. Here is a repeat of the custom-function method, this time using Python's zip and a list comprehension:
```python
def split_status_codes(row):
    # the list of status codes collected for this row
    codes_list = list(row['status_codes'])
    # map each code to the number of times it appears in the list
    return dict(zip(codes_list, [codes_list.count(i) for i in codes_list]))
```
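Applied row by row with apply, it turns each list into a count-per-code dict; the status_codes column here is a hypothetical example of a grouped list column like the one built above:

```python
import pandas as pd

def split_status_codes(row):
    codes_list = list(row["status_codes"])
    return dict(zip(codes_list, [codes_list.count(i) for i in codes_list]))

# hypothetical column of grouped status codes
df = pd.DataFrame({"status_codes": [[200, 200, 404], [500]]})
df["code_counts"] = df.apply(split_status_codes, axis=1)
# first row -> {200: 2, 404: 1}, second row -> {500: 1}
```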
This is my first real dive into Pandas, and it's just absolutely a joy to work with. I hope you found something useful in this article, and if you're new to Pandas, perhaps you are inspired to use it on your next project.