python

Some Basic Pandas APIs

Pandas library is a very powerful and popular library for manipulating tabular data and carrying out data analysis. However, I personally find that some APIs are somewhat weird and not consistent with other APIs. I end up asking Google from time to time when I want to use them. Hence here I put them down as a memo collection.

So import Pandas,

import pandas as pd

Reading data from an Excel file

df = pd.read_excel('path/file.xls')

To read .xlsx files, we need to install the dependency, xlrd

pip3 install xlrd
df = pd.read_excel('path/file.xlsx')

Manipulating data

# Loop through the dataframe
for index, row in df.iterrows():
  print(row) # We now have access to each row with their index
  # Get value of a column under name 'Price'
  print(row['Price'])
  # Update df data with a specific value, eg, 200 at index 20
  df.loc[20, 'Price'] = 200
  
# Append new data to the df
df = pd.DataFrame(columns=['email', 'name'])
new_df = pd.DataFrame({'email':['foo@example.com', 'bar@example.com'], 'name':['Foo', 'Bar']})
df = df.append(new_df, ignore_index=True)
  
# Search rows by certain conditions
# AND condition => &
result = df[(df['name'] == 'Foo') & (df['email'] = 'foo@example.com')]
# OR => |
result = df[(df['name'] == 'Foo') | (df['email'] = 'foo@example.com')]

# Check the result shape to determine if any result is found
if (result.shape[0] > 0):
  print('At least one row is returned.')
else:
  print('Not found')

# Locate a value by index and column name
name = df.at[index, 'name']

# Get all column names
col_names = df.columns

# Filter out the first 8 row, index starts with 0
df = df[7:]

# Selects only a range of rows
df = df[7:100]

# Rename columns
df = df.rename(columns={'old':'new', 'another_old': 'another_new'})


  • In the above code snippet, row only has a copy of the data from the dataframe, so to make sure the new value is set to the df, refer to the df.loc[index, column_name] instead of row[‘column_name’]. This ensures the data is set to the df.
  • df.append return a new dataframe instead of modifying the existing one so we need to assign the new value to df.
  • ignore_index is set to True to append the index to n+1
  • result.shape is a list storing rows and columns returned.
  • It seems df.at and df.loc can both be used for updating certain values at certain position in a dataframe but there are some differences as mentioned in this Stackoverflow post.
  • More about iterrows from pandas.org
  1. Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). For example,>>> df = pd.DataFrame([[1, 1.5]], columns=[‘int’, ‘float’]) >>> row = next(df.iterrows())[1] >>> row int 1.0 float 1.5 Name: 0, dtype: float64 >>> print(row[‘int’].dtype) float64 >>> print(df[‘int’].dtype) int64 To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally faster than iterrows.
  2. You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.

Writing data

# Output to an excel file
df.to_excel('path/filename.xls')

Writing to .xlsx files requires xlwt

pip3 install xlwt
# Write to multiple sheets in a single Excel file
writer = pd.ExcelWriter('path/file.xls')
sheets = ['A', 'B', 'C']
for sheet in sheets:
  df = pd.DataFrame([[1,2,3], [4,5,6]], columns=['x','y','z'])
  df.to_excel(writer, sheet) # This will append all the sheet to the writer
writer.save() # This writes the sheets to a single file

Done

Leave a comment