So import Pandas,
import pandas as pd
Reading data from an Excel file
df = pd.read_excel('path/file.xls')
To read .xlsx files, we need to install the dependency, xlrd
pip3 install xlrd
df = pd.read_excel('path/file.xlsx')
Manipulating data
# Loop through the dataframe
for index, row in df.iterrows():
print(row) # We now have access to each row with their index
# Get value of a column under name 'Price'
print(row['Price'])
# Update df data with a specific value, eg, 200 at index 20
df.loc[20, 'Price'] = 200
# Append new data to the df
df = pd.DataFrame(columns=['email', 'name'])
new_df = pd.DataFrame({'email':['foo@example.com', 'bar@example.com'], 'name':['Foo', 'Bar']})
df = df.append(new_df, ignore_index=True)
# Search rows by certain conditions
# AND condition => &
result = df[(df['name'] == 'Foo') & (df['email'] = 'foo@example.com')]
# OR => |
result = df[(df['name'] == 'Foo') | (df['email'] = 'foo@example.com')]
# Check the result shape to determine if any result is found
if (result.shape[0] > 0):
print('At least one row is returned.')
else:
print('Not found')
# Locate a value by index and column name
name = df.at[index, 'name']
# Get all column names
col_names = df.columns
# Filter out the first 8 row, index starts with 0
df = df[7:]
# Selects only a range of rows
df = df[7:100]
# Rename columns
df = df.rename(columns={'old':'new', 'another_old': 'another_new'})
- In the above code snippet, row only has a copy of the data from the dataframe, so to make sure the new value is set to the df, refer to the df.loc[index, column_name] instead of row[‘column_name’]. This ensures the data is set to the df.
- df.append return a new dataframe instead of modifying the existing one so we need to assign the new value to df.
- ignore_index is set to True to append the index to n+1
- result.shape is a list storing rows and columns returned.
- It seems df.at and df.loc can both be used for updating certain values at certain position in a dataframe but there are some differences as mentioned in this Stackoverflow post.
- More about iterrows from pandas.org
- Because
iterrowsreturns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). For example,>>> df = pd.DataFrame([[1, 1.5]], columns=[‘int’, ‘float’]) >>> row = next(df.iterrows())[1] >>> row int 1.0 float 1.5 Name: 0, dtype: float64 >>> print(row[‘int’].dtype) float64 >>> print(df[‘int’].dtype) int64 To preserve dtypes while iterating over the rows, it is better to useitertuples()which returns namedtuples of the values and which is generally faster thaniterrows. - You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
Writing data
# Output to an excel file
df.to_excel('path/filename.xls')
Writing to .xlsx files requires xlwt
pip3 install xlwt
# Write to multiple sheets in a single Excel file
writer = pd.ExcelWriter('path/file.xls')
sheets = ['A', 'B', 'C']
for sheet in sheets:
df = pd.DataFrame([[1,2,3], [4,5,6]], columns=['x','y','z'])
df.to_excel(writer, sheet) # This will append all the sheet to the writer
writer.save() # This writes the sheets to a single file