Pandas, Aws, and Json

Reading data with Pandas, from AWS, in Json format.

Last updated on Mar 4, 2021 4 min read

Summary

Getting data from AWS s3 and organized has been sort of fun for me to take a stab of time. Fetching files from s3 by date and creating a frame, then doing some parsing is what this is all about.

Finding files and making a dataframe

The following code finds s3 files between a begin and end date modified and makes a huge dataframe of all the results. Below I am using the read_json feature as all my lines are in json and separated by a new line (which requires lines=True).

# define your times
begin = datetime.strptime("20-01-19 01:58", "%y-%m-%d %H:%M")
end = datetime.strptime("20-01-19 02:05", "%y-%m-%d %H:%M")
# define your timezone
timezone = pytz.timezone("America/Los_Angeles")
# optional locally
begin_Los_Angeles = timezone.localize(begin)
end_Los_Angeles = timezone.localize(end)
# or utc
begin_utc = pytz.utc.localize(begin)
end_utc = pytz.utc.localize(end)
# create a session
bSession = boto3.session.Session(aws_access_key_id=access_key, aws_secret_access_key=secret_key, aws_session_token=security_token)
# create a datawranger s3 to find files between two dates
df = wr.s3.read_json(f"s3://bucket/sub/path/", lines=True, last_modified_begin=begin_utc, last_modified_end=end_utc, boto3_session=bSession)
# show the results
df

Nested Records

Json allows for nested records. I had to pull these out of each row to be used for reporting. Best way to pull nested data out of nested_data_field is to create a new dataframe from the values of the column and send them tolist():

newdf = pd.DataFrame(index=df['index-field'], data=df.nested_data_field.values.tolist())

Filtering

When you have a large dataset and are looking for a specific string in a column, I found the following to be most helpful:

df['column_holding_value'].value_counts().to_frame().filter(like = 'string-to-look-for', axis=0)

This was super helpful for summarizing a specific pattern across thousands of rows for that quick answer to the question: Is that even in here?

Duplicates

What? There are duplicates? Lets count them just to be sure they exist or not by using col1 and then col2 as the criteria for determining a duplicate:

df_duplicates = df[df.duplicated(subset=['col1','col2'])]

Unique values

I really like .nunique() to give a quick summary of the columns over all rows of the dataframe to answer questions like “How many different values do we have in each column?” or “What columns contains large ranges of data?”

df.nunique()

Custom Functions

Now comes the best of all things with Pandas. Running each row through a function and making a new column or updating an existing column with data based on a set of actions you have determined.

df['new-column'] = df.apply(lambda row: myFunction(row), axis=1)

The above code takes the dataframe of df and applies the myFuction() with a parameter of row. Now the function has full access to row and all of the columns. After myFunction() has done its set of actions with the row it must return the row and then the changes are applied! The above example uses the current row to do some logic and makes a new column for that dataframe.

Grouping

Grouping a dataframe by some columns really starts to reduce the data into a summary of greatness. The following code shows how to set the priority of col1 and then col2 to be collected by unique values and grouped together by shared values of to-group. Then its formed into a new dataframe which has the column ending-group-col which contains the list of all values. While this might not seem useful at first, readon to the next section where we summarized this column even more.

df1 = df.groupby(['col1','col2'])['to-group'].apply(lambda group_series: group_series.tolist()).reset_index(name='ending-group-col')

Counting lists

Grouping values into a new column as a list is great, but now we want to count each of the values. Here is a repeat of the custom functions method but using some python iterators and zip:

def split_status_codes(row):
    codes_list = row['status_codes']
    return dict(zip(list(codes_list),[list(codes_list).count(i) for i in list(codes_list)]))

Conclusion

This is my first real dive into Pandas and its just absolutly a joy to work with. I hope you found something useful in this article and if your new to Pandas, perhaps you are inspiried to use it on your next project.