Categories
AWS Python

How to read spreadsheet in S3 using python

Last updated on June 2nd, 2022 at 08:03 am

Click to rate this tutorial!
[Total: 1 Average: 4]

In this tutorial we will focus on how to read a spreadsheet (excel) in an AWS S3 bucket using Python.

Before you jump on to the script please make sure that the below pre-requisites are met

  • IAM user created with sufficient permission to read the S3 bucket.
  • ACCESS_KEY and SECRET_KEY of that IAM user generated to connect to the bucket.
  • Python environment (Preferably python3)

Once you have all the above information handy, then let’s go to the next step.

Make sure to install boto3 and openpyxl using pip / pip3 (depending on your setup). Assuming you are using Python3 or latest

pip3 install boto3 
pip3 install openpyxl

This python script will connect to the bucket and read a spreadsheet (data.xlsx) . Then display details in that spreadsheet as the output.

Let’s break down the code in to chunks to see how each logic works

How to connect to S3 ?

Section of the code that initiate a connection to s3 bucket is shown below, you need to provide Access_key, Secret and the bucket names. The spreadsheet (object) name we are reading is data.xlsx. Feel free to modify it as per your requirement.

aws_id = ''
aws_secret = ''
bucket_name='YOUR_BUCKET'
object_key='data.xlsx'
s3 = boto3.client('s3', aws_access_key_id=aws_id, aws_secret_access_key=aws_secret)
try:
    obj = s3.get_object(Bucket=bucket_name, Key=object_key)
except:
    print("That key '"+object_key+"' doesn't exist or unable to connect to s3 for some reason")

If you are experiencing error while connecting to S3 bucket and would like to print the exact error message. Try replacing try / except statement with the below code along with an exit() function to quit processing lines after that.

try:
    obj = s3.get_object(Bucket=bucket_name, Key=object_key)
except Exception as e:
    print(e)
    exit()

Note: I am by default connecting to US-EAST-1 region. If the object resides in a different region make sure you modify boto3.client function with the attribute region_name=”<your_region>”, for example if you want to connect to eu-west-1 then the complete connection statement will be like,

s3 = boto3.client("s3", aws_access_key_id=aws_id, aws_secret_access_key=aws_secret, region_name="eu-west-1")

How to select the sheet (if spreadsheet has multiple tabs)?

Once the connection is successful variable body will have all the details of data read from S3 object (data.xlsx). Here we are printing obj and also getting the sheet name. I intentionally printed the obj variable for you to see the values returned. It returns a whole bunch response metadata like HTTPStatusCode / HTTPHeaders / Bytes/ LastModified time of the excel file being read etc.,

body = obj['Body'].read()
print(obj)
workbook = load_workbook(io.BytesIO(body),data_only=True)
get_sheet_names = workbook.sheetnames
print("Sheet Name "+str(get_sheet_names))
sheet = workbook.active

Current sheet the script process will depend on number of Sheets you have in the original excel spreadsheet.

For example if you have 2 sheets named “First” and “Second” , sheet=workbook.active will read “Second” sheet data by default unless you select the 1st sheet. To select sheet named “First” instead of the default “Second” sheet then,

get_sheet_names = workbook.sheetnames
sheet_index = get_sheet_names.index(get_sheet_names[0])
workbook.active=sheet_index
sheet = workbook.active
print("Sheet Names Available "+str(get_sheet_names))
print("Sheet Active "+str(sheet))

As you can see above we are getting the sheet names and then passing first value in the list get_sheet_names[0]. You can also explicitly pass sheet name instead like below or even loop through the list(not shown)

sheet_index = get_sheet_names.index("First")

How to read data from spreadsheet ?

Next step is to read data from the excel spreadsheet. We need to find maximum number of columns and rows in the spreadsheet and then use for loop to read it eventually

max_row = sheet.max_row
max_col = sheet.max_column
print("Max Rows "+str(max_row))
print("Max Columns "+str(max_col))
for row in sheet.iter_rows(min_row=1,max_row=max_row,min_col=1,max_col=sheet.max_column,values_only=True):
    print(row)

values_only=True parameter return only data instead of cell names like Cell A1, Cell B1, Cell B2 etc.,

How to ignore the first row while reading ?

This is going to be very simple, here we are just ignoring the first row sometimes we don’t need the column names for data collection.

for row in sheet.iter_rows(min_row=2,max_row=max_row,min_col=1,max_col=sheet.max_column,values_only=True):
    print(row)

If you give min_row=2, first row is ignored

How to get the complete Python script to read excel file in S3?

Here is the complete code.

Note: Code below reads 1st sheet named “First”. You may need to modify this script as per your requirement, following appropriate sections above.

#!/usr/local/bin/python3
import json
import boto3
import io
import tempfile
from openpyxl import Workbook,load_workbook
from io import BytesIO

aws_id = ''
aws_secret = ''
bucket_name='YOUR_BUCKET_S3'
object_key='data.xlsx'
s3 = boto3.client('s3', aws_access_key_id=aws_id, aws_secret_access_key=aws_secret)
try:
    obj = s3.get_object(Bucket=bucket_name, Key=object_key)
except:
    print("That key '"+object_key+"' doesn't exist or unable to connect to s3 for some reason")
 
body = obj['Body'].read()
workbook = load_workbook(io.BytesIO(body),data_only=True)

get_sheet_names = workbook.sheetnames
sheet_index = get_sheet_names.index(get_sheet_names[0])
workbook.active=sheet_index
sheet = workbook.active
print("Sheet Names Available "+str(get_sheet_names))
print("Sheet Active "+str(sheet))
max_row = sheet.max_row
max_col = sheet.max_column
print("Max Rows "+str(max_row))
print("Max Columns "+str(max_col))
print("===============values_only = FALSE=========")
for row in sheet.iter_rows(min_row=1,max_row=max_row,min_col=1,max_col=sheet.max_column,values_only=False):
    print(row)
print("===============values_only = TRUE=========")
for row in sheet.iter_rows(min_row=1,max_row=max_row,min_col=1,max_col=sheet.max_column,values_only=True):
    print(row)
print("=========Ignoring the first row=============")
for row in sheet.iter_rows(min_row=2,max_row=max_row,min_col=1,max_col=sheet.max_column,values_only=True):
    print(row)

Sample Output (Saved the above script as connect_s3.py )

$./connect_s3.py
Sheet Names Available ['First', 'Second']
Sheet Active <Worksheet "First">
Max Rows 5
Max Columns 3
===============values_only = FALSE=========
(<Cell 'First'.A1>, <Cell 'First'.B1>, <Cell 'First'.C1>)
(<Cell 'First'.A2>, <Cell 'First'.B2>, <Cell 'First'.C2>)
(<Cell 'First'.A3>, <Cell 'First'.B3>, <Cell 'First'.C3>)
(<Cell 'First'.A4>, <Cell 'First'.B4>, <Cell 'First'.C4>)
(<Cell 'First'.A5>, <Cell 'First'.B5>, <Cell 'First'.C5>)
===============values_only = TRUE=========
('Name', 'WebServer', 'Linux Flavor')
('Shane', 'Apache', 'Ubuntu')
('Scott', 'Nginx', 'RedHat')
('Peter', 'uwsgi', 'AmazonLinux')
('Kim', 'IIS', 'CentOS')
=========Ignoring the first row=============
('Shane', 'Apache', 'Ubuntu')
('Scott', 'Nginx', 'RedHat')
('Peter', 'uwsgi', 'AmazonLinux')
('Kim', 'IIS', 'CentOS')
Click to rate this tutorial!
[Total: 1 Average: 4]

Leave a Reply

Your email address will not be published.