Pandas: Essential Data Manipulation Guide

In modern data work, pandas plays a key role in data manipulation. It helps us manipulate, analyze, and extract insights from data. This article covers data manipulation from a basic understanding of the data through to extracting exactly the data we need. It is part of the course Machine Learning and its privacy implications.
Feel free to play around with the code and the dataset to get more hands-on practice with data manipulation. Kaggle Notebook Link.

Part 1 (Data Structure Info)

1. Importing libraries & loading the dataset
2. Understanding the DataFrame
3. Basic functions to get information

Part 2 (Manipulating Data)

1. Sorting values
2. Subsetting
3. Filtering data based on conditions
4. isin()
5. Additional functions (unique, groupby, sum)

Part 1: Understanding Data Structures

1. Importing Libraries and Loading the Dataset:

To begin, we need to import the necessary libraries and load our dataset into a Pandas DataFrame.

import pandas as pd

# Load dataset
df = pd.read_csv('your_dataset.csv')
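If you do not have a CSV file at hand, here is a minimal sketch of the same loading step. The CSV text and its column names (name, city, age, salary) are made up for illustration; read_csv accepts a file path or any file-like object, so an in-memory string buffer works as a stand-in for 'your_dataset.csv'.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for 'your_dataset.csv'
csv_text = """name,city,age,salary
Asha,Delhi,31,52000
Ben,London,45,61000
Chen,Beijing,28,48000
"""

# read_csv accepts a path or a file-like object such as StringIO
df = pd.read_csv(io.StringIO(csv_text))

# Look at the first rows to confirm the load worked as expected
print(df.head())
```

Calling head() right after loading is a quick way to check that the header row and the separator were picked up correctly.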

2. Understanding DataFrame:

A DataFrame is mainly made up of three parts: values, columns, and index.

data_values = df.values      # cell values as a 2-D NumPy array
column_labels = df.columns   # column labels only
row_index = df.index         # row labels
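To see the three parts side by side, here is a small sketch on a hand-built DataFrame. The data and the row labels "a", "b", "c" are assumptions chosen only so that the index is easy to tell apart from the values.

```python
import pandas as pd

# Small illustrative DataFrame with explicit row labels (made-up data)
df = pd.DataFrame(
    {"name": ["Asha", "Ben", "Chen"], "age": [31, 45, 28]},
    index=["a", "b", "c"],
)

data_values = df.values      # 2-D NumPy array holding the cell values
column_labels = df.columns   # Index object with the column names
row_index = df.index         # Index object with the row labels

print(data_values)
print(column_labels)
print(row_index)
```

The pandas documentation recommends df.to_numpy() over df.values for new code; both return the underlying data as an array.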

3. Basic Functions to Get Information

We have some basic functions to get an understanding of the data.

1. The shape attribute returns the number of rows and columns as a tuple.

2. To check the type of the DataFrame, use Python's built-in type function.

3. The describe function generates descriptive statistics for numerical columns.

4. The info function prints a concise summary of the DataFrame, including the data types and non-null counts.

print(df.shape)
print(type(df))
print(df.describe())
df.info()  # info() prints its summary itself and returns None
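As a rough illustration of what these functions report, the sketch below uses a made-up DataFrame with one missing age. Note that describe() skips the text column by default and that its count row ignores missing values.

```python
import pandas as pd

# Made-up data; the None in "age" becomes a missing value (NaN)
df = pd.DataFrame(
    {
        "name": ["Asha", "Ben", "Chen"],
        "age": [31, 45, None],
        "salary": [52000, 61000, 48000],
    }
)

n_rows, n_cols = df.shape   # shape is an attribute, not a method
stats = df.describe()       # numeric columns only by default

print(n_rows, n_cols)
print(type(df))
print(stats)
df.info()                   # prints directly; no print() needed
```

Comparing the count row of describe() with the number of rows from shape is a quick way to spot columns with missing data.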

Part 2: Manipulating Data

1. Sorting Values:

Sorting values is a common operation to organize data. We can sort values in ascending or descending order.

i) Ascending Order

sorted_df = df.sort_values(by='column_name', ascending=True)
print(sorted_df)

ii) Descending Order

sorted_df_desc = df.sort_values(by='column_name', ascending=False)
print(sorted_df_desc)
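Sorting also works on several columns at once. The following sketch, on made-up data, passes lists to by and ascending so that rows are ordered by city first and then by age in descending order within each city. sort_values returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame(
    {
        "city": ["Delhi", "London", "Delhi", "Beijing"],
        "age": [31, 45, 28, 28],
    }
)

# Single column, ascending (the default)
by_age = df.sort_values(by="age")

# Two columns: city A-Z, then age from highest to lowest within a city
by_city_then_age = df.sort_values(by=["city", "age"], ascending=[True, False])

print(by_age)
print(by_city_then_age)
```

If you want the result to get a fresh 0..n-1 index, sort_values also accepts ignore_index=True.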

2. Subsetting

Subsetting allows us to select specific columns from the DataFrame.

i) Single Column

subset_single = df['column_name']
print(subset_single)

ii) Multiple Columns

subset_multiple = df[['column1', 'column2']]
print(subset_multiple)
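One detail worth knowing is that the two bracket styles return different types. In this sketch (made-up data), single brackets give a Series while double brackets always give a DataFrame, even for one column.

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame(
    {
        "name": ["Asha", "Ben", "Chen"],
        "age": [31, 45, 28],
        "city": ["Delhi", "London", "Beijing"],
    }
)

ages = df["age"]                    # single brackets -> Series
ages_frame = df[["age"]]            # double brackets -> one-column DataFrame
name_and_age = df[["name", "age"]]  # list of labels -> DataFrame

print(type(ages))
print(type(ages_frame))
print(name_and_age)
```

The columns in the result follow the order of the list you pass, so this is also a simple way to reorder columns.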

3. Filtering Data Based on Conditions

Filtering helps in extracting rows that meet specific conditions.

i) Single Condition on a Single Column

filtered_single = df[df['column_name'] > value]
print(filtered_single)

ii) Multiple Conditions on Multiple Columns

filtered_multiple = df[(df['column1'] > value1) & (df['column2'] < value2)]
print(filtered_multiple)
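Here is a small runnable sketch of these filters on made-up data. Conditions are combined with & (and), | (or), and ~ (not), and each condition needs its own parentheses because these operators bind more tightly than the comparisons.

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame(
    {
        "name": ["Asha", "Ben", "Chen", "Dana"],
        "age": [31, 45, 28, 52],
        "salary": [52000, 61000, 48000, 75000],
    }
)

older = df[df["age"] > 30]                                    # one condition
older_and_modest = df[(df["age"] > 30) & (df["salary"] < 70000)]  # and
young_or_high = df[(df["age"] < 30) | (df["salary"] > 70000)]     # or
not_older = df[~(df["age"] > 30)]                             # negation

print(older)
print(older_and_modest)
print(young_or_high)
print(not_older)
```

Python's own and, or, and not keywords do not work element by element on a column, which is why the symbol operators are used here.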

4. Using `isin()` Function

The isin function keeps the rows whose column value matches any entry in a list of values.

filtered_isin = df[df['column_name'].isin([value1, value2, value3])]
print(filtered_isin)
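A short sketch of isin on made-up data follows. It also shows the opposite filter, which keeps the rows whose value is not in the list, by putting ~ in front of the condition.

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame(
    {
        "name": ["Asha", "Ben", "Chen", "Dana"],
        "city": ["Delhi", "London", "Beijing", "Delhi"],
    }
)

wanted = ["Delhi", "Beijing"]

in_wanted = df[df["city"].isin(wanted)]       # city is one of the listed values
not_in_wanted = df[~df["city"].isin(wanted)]  # city is none of the listed values

print(in_wanted)
print(not_in_wanted)
```

Using isin is usually easier to read than chaining many == comparisons with |.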

5. Additional Functions

Unique: The unique function returns the unique values in a column.

unique_values = df['column_name'].unique()
print(unique_values)
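To make this concrete, the sketch below runs unique on a made-up city column. It also shows two closely related Series methods, nunique and value_counts, which answer "how many distinct values?" and "how often does each value occur?".

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({"city": ["Delhi", "London", "Delhi", "Beijing"]})

unique_cities = df["city"].unique()   # distinct values, in order of appearance
n_unique = df["city"].nunique()       # number of distinct values
counts = df["city"].value_counts()    # occurrences of each value

print(unique_cities)
print(n_unique)
print(counts)
```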

GroupBy: The groupby function is used to group data based on one or more columns and perform aggregate operations.

grouped = df.groupby('column_name').sum(numeric_only=True)  # sum only the numeric columns
print(grouped)
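As a sketch of grouping on made-up data, the example below selects a single column after groupby so that only that column is aggregated, and then uses agg to compute several statistics per group in one call.

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame(
    {
        "city": ["Delhi", "London", "Delhi", "Beijing"],
        "salary": [52000, 61000, 48000, 75000],
    }
)

# Total salary per city; group labels become the index of the result
salary_by_city = df.groupby("city")["salary"].sum()

# Several aggregates per group at once
summary = df.groupby("city")["salary"].agg(["sum", "mean", "count"])

print(salary_by_city)
print(summary)
```

By default the groups in the result are sorted by the group key, so the cities come back in alphabetical order here.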

Sum: The sum function calculates the sum of values in a column or along an axis.

column_sum = df['column_name'].sum()
print(column_sum)
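The phrase "along an axis" can be illustrated with a small sketch. The columns q1 and q2 are made up; on a DataFrame, sum() adds down each column by default (axis=0), while axis=1 adds across each row.

```python
import pandas as pd

# Made-up numeric data for illustration
df = pd.DataFrame({"q1": [10, 20, 30], "q2": [1, 2, 3]})

q1_total = df["q1"].sum()      # sum of one column -> a single number
column_totals = df.sum()       # axis=0 (default): one total per column
row_totals = df.sum(axis=1)    # axis=1: one total per row

print(q1_total)
print(column_totals)
print(row_totals)
```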

Final Remarks

Mastering data manipulation techniques in Pandas is essential for any data scientist or analyst. These fundamental operations enable efficient data processing, making it easier to extract meaningful insights from datasets. Whether you’re sorting values, subsetting columns, or performing aggregate functions, Pandas provides the tools you need to work effectively with data.
