Pandas plays a key role in modern data manipulation. It helps us manipulate and analyze data and extract insights from it. This article walks through data manipulation, from a basic understanding of a dataset to extracting exactly the data we need. It is part of the course Machine Learning and its privacy implications.
Feel free to play around with the code and the dataset to get more hands-on practice with data manipulation. Kaggle Notebook Link.
Part 1 (Data structure info)
1. Importing libraries & loading the dataset
2. Understanding the DataFrame
3. Basic functions to get information
Part 2 (Manipulating Data)
1. Sorting values
2. Subsetting
3. Filtering data based on conditions
4. isin()
5. Additional functions (unique, groupby, sum)
Part 1: Understanding Data Structures
1. Importing Libraries and Loading the Dataset:
To begin, we need to import the necessary libraries and load our dataset into a Pandas DataFrame.
import pandas as pd
# Load dataset
df = pd.read_csv('your_dataset.csv')
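If no CSV file is at hand, the loading step can be tried on an in-memory string instead. This is a minimal sketch, assuming a made-up three-row dataset; the column names `name`, `age`, and `city` are illustrative and do not come from the article's dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for 'your_dataset.csv'
csv_text = """name,age,city
Asha,29,Pune
Ben,34,Leeds
Chen,41,Osaka
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first rows (5 by default) for a quick look
print(df.head())
```

Calling `head()` right after loading is a quick way to confirm that the header row and the separator were parsed as expected.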
2. Understanding DataFrame:
A DataFrame has three main components: values, columns, and index.
data_values = df.values  # the underlying data as a NumPy array
column_labels = df.columns  # the column labels only
row_index = df.index  # the row labels
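To see what each of these three components holds, the sketch below builds a tiny DataFrame by hand. The data and the row labels `a`, `b`, `c` are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame(
    {"name": ["Asha", "Ben", "Chen"], "age": [29, 34, 41]},
    index=["a", "b", "c"],
)

data_values = df.values      # 2-D NumPy array of the cell contents
column_labels = df.columns   # Index object holding the column names
row_index = df.index         # Index object holding the row labels

print(data_values.shape)     # (3, 2)
print(list(column_labels))   # ['name', 'age']
print(list(row_index))       # ['a', 'b', 'c']
```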
3. Basic Functions to Get Information
We have some basic functions to get an understanding of the data.
1. shape: an attribute that returns the number of rows and columns.
2. type: Python's built-in function, used here to check the type of the DataFrame object.
3. describe: a function that generates descriptive statistics for numerical columns.
4. info: a function that prints a concise summary of the DataFrame, including the data types and non-null counts.
print(df.shape)
print(type(df))
print(df.describe())
df.info()  # info() prints its summary itself and returns None
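The sketch below shows how these functions behave on a small DataFrame with one missing value. The data is hypothetical and chosen only to make the non-null count visible.

```python
import pandas as pd

# Hypothetical data; the None in "age" is a deliberate missing value
df = pd.DataFrame(
    {
        "name": ["Asha", "Ben", "Chen", "Dana"],
        "age": [29, 34, 41, None],
    }
)

n_rows, n_cols = df.shape          # shape is an attribute, not a method
print(n_rows, n_cols)              # 4 2

stats = df.describe()              # covers numeric columns only by default
print(stats.loc["count", "age"])   # 3.0, the missing value is not counted
print(stats.loc["mean", "age"])

df.info()                          # prints the summary directly, returns None
```

Comparing the `count` row from `describe` with the row count from `shape` is a quick way to spot columns that contain missing values.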
Part 2: Manipulating Data
1. Sorting Values:
Sorting values is a common operation to organize data. We can sort values in ascending or descending order.
i) Ascending Order
sorted_df = df.sort_values(by='column_name', ascending=True)
print(sorted_df)
ii) Descending Order
sorted_df_desc = df.sort_values(by='column_name', ascending=False)
print(sorted_df_desc)
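A concrete run of both sort orders might look like the sketch below. The `city` and `sales` columns are made up for illustration, and sorting by two columns at once is shown as an optional extension of the same `sort_values` call.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame(
    {"city": ["Pune", "Leeds", "Osaka", "Leeds"], "sales": [250, 400, 150, 300]}
)

# Highest sales first; sort_values returns a new DataFrame and leaves df unchanged
by_sales = df.sort_values(by="sales", ascending=False)
print(by_sales["sales"].tolist())   # [400, 300, 250, 150]

# Sort by city A-Z, then by sales high-to-low within each city
by_city_then_sales = df.sort_values(by=["city", "sales"], ascending=[True, False])
print(by_city_then_sales)

# reset_index(drop=True) renumbers the rows 0..n-1 after sorting
print(by_sales.reset_index(drop=True))
```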
2. Subsetting
Subsetting allows us to select specific columns from the DataFrame.
i) Single Column
subset_single = df['column_name']
print(subset_single)
ii) Multiple Columns
subset_multiple = df[['column1', 'column2']]
print(subset_multiple)
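One detail worth seeing in practice is that single brackets return a Series while a list of column names returns a DataFrame. The sketch below uses hypothetical columns to show the difference.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame(
    {"name": ["Asha", "Ben"], "age": [29, 34], "city": ["Pune", "Leeds"]}
)

single = df["age"]               # single brackets give a Series
multiple = df[["name", "city"]]  # a list of names gives a DataFrame
one_col_frame = df[["age"]]      # a one-item list still gives a DataFrame

print(type(single).__name__)         # Series
print(type(multiple).__name__)       # DataFrame
print(type(one_col_frame).__name__)  # DataFrame
```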
3. Filtering Data Based on Conditions
Filtering helps in extracting rows that meet specific conditions.
i) Single Condition on a Single Column
filtered_single = df[df['column_name'] > value]
print(filtered_single)
ii) Multiple Conditions on Multiple Columns
filtered_multiple = df[(df['column1'] > value1) & (df['column2'] < value2)]
print(filtered_multiple)
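Under the hood, each condition produces a boolean Series (a mask) that selects the rows to keep. The sketch below, on made-up `age` and `score` columns, shows the mask itself and how conditions combine with `&` (and) and `|` (or).

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame(
    {
        "name": ["Asha", "Ben", "Chen", "Dana"],
        "age": [29, 34, 41, 23],
        "score": [88, 72, 95, 60],
    }
)

mask = df["age"] > 30     # boolean Series, one True/False per row
print(mask.tolist())      # [False, True, True, False]

over_30 = df[mask]

# & is AND, | is OR; each comparison needs its own parentheses
over_30_and_high = df[(df["age"] > 30) & (df["score"] > 80)]
young_or_high = df[(df["age"] < 25) | (df["score"] > 90)]

print(over_30_and_high["name"].tolist())  # ['Chen']
print(young_or_high["name"].tolist())     # ['Chen', 'Dana']
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python, so leaving them out changes the meaning of the expression.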
4. Using the isin() Function
The isin function keeps the rows whose value in a column appears in a given list of values.
filtered_isin = df[df['column_name'].isin([value1, value2, value3])]
print(filtered_isin)
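As a small illustration, the sketch below filters a hypothetical `city` column against a list of wanted values and also inverts the mask with `~` to get the remaining rows.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame(
    {
        "name": ["Asha", "Ben", "Chen", "Dana"],
        "city": ["Pune", "Leeds", "Osaka", "Leeds"],
    }
)

wanted = ["Leeds", "Osaka"]
in_wanted = df[df["city"].isin(wanted)]
not_in_wanted = df[~df["city"].isin(wanted)]   # ~ inverts the boolean mask

print(in_wanted["name"].tolist())      # ['Ben', 'Chen', 'Dana']
print(not_in_wanted["name"].tolist())  # ['Asha']
```

This is usually easier to read than chaining several `==` comparisons with `|`.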
5. Additional Functions
Unique: The unique function returns the unique values in a column.
unique_values = df['column_name'].unique()
print(unique_values)
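The sketch below runs `unique` on a hypothetical `city` column. It also shows two closely related Series methods, `nunique` and `value_counts`, which are not covered in this article but often answer the follow-up questions "how many?" and "how often?".

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({"city": ["Pune", "Leeds", "Osaka", "Leeds", "Pune"]})

unique_cities = df["city"].unique()   # distinct values, in order of first appearance
n_unique = df["city"].nunique()       # number of distinct values
counts = df["city"].value_counts()    # how often each value occurs

print(list(unique_cities))  # ['Pune', 'Leeds', 'Osaka']
print(n_unique)             # 3
print(counts["Leeds"])      # 2
```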
GroupBy: The groupby function groups data by one or more columns so that aggregate operations can be performed on each group.
grouped = df.groupby('column_name').sum(numeric_only=True)  # sum only the numeric columns
print(grouped)
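To make the grouping concrete, the sketch below totals hypothetical `sales` and `units` columns per city. Selecting the columns to aggregate explicitly is one way to keep non-numeric columns out of the sum, and the `agg` call at the end is an optional variation that applies a different aggregation to each column.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame(
    {
        "city": ["Pune", "Leeds", "Osaka", "Leeds"],
        "sales": [250, 400, 150, 300],
        "units": [5, 8, 3, 6],
    }
)

# Select the numeric columns explicitly before aggregating
totals = df.groupby("city")[["sales", "units"]].sum()
print(totals)

# The grouping column becomes the index of the result
print(totals.loc["Leeds", "sales"])   # 700

# Named aggregation: a different function per output column
summary = df.groupby("city").agg(
    total_sales=("sales", "sum"),
    avg_units=("units", "mean"),
)
print(summary)
```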
Sum: The sum function calculates the sum of values in a column or along an axis.
column_sum = df['column_name'].sum()
print(column_sum)
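The phrase "along an axis" is easiest to see on a small example. The sketch below, using made-up numeric columns, compares the sum of one column, the per-column totals, and the per-row totals.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({"sales": [250, 400, 150], "units": [5, 8, 3]})

column_sum = df["sales"].sum()   # a single number for one column
per_column = df.sum()            # axis=0 (the default): one total per column
per_row = df.sum(axis=1)         # axis=1: one total per row

print(column_sum)            # 800
print(per_column.tolist())   # [800, 16]
print(per_row.tolist())      # [255, 408, 153]
```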
Final Remarks
Mastering data manipulation techniques in Pandas is essential for any data scientist or analyst. These fundamental operations enable efficient data processing, making it easier to extract meaningful insights from datasets. Whether you’re sorting values, subsetting columns, or performing aggregate functions, Pandas provides the tools you need to work effectively with data.