NumPy and Data Handling using Pandas

NumPy

NumPy is a Python library for working with numbers and arrays. It lets you create and use multi-dimensional arrays, which are like lists but can have many rows and columns. NumPy is fast and well suited to handling large amounts of data.
  • You can do many math operations with NumPy, such as multiplying matrices, adding or subtracting arrays element by element, and applying functions like squares or logarithms to every element.
  • It also has functions for linear algebra and random number generation.
  • NumPy can interoperate with code written in C and C++, and its array operations are usually much faster than equivalent loops over Python lists because they run in precompiled code.
  • To start using it, install it with pip install numpy and then import it in your Python program.
So, NumPy is a powerful and easy-to-use package for mathematical and data tasks in Python.
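As a minimal sketch of the operations listed above (the array values and the random seed are illustrative choices, not taken from the text):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])

# Element-wise arithmetic: no explicit loop is needed
squares = a ** 2          # [ 1.  4.  9. 16.]
shifted = a + 10          # [11. 12. 13. 14.]
logs = np.log(a)          # natural logarithm of each element

# Matrix multiplication with the @ operator
m = np.array([[1, 2], [3, 4]])
identity = np.eye(2, dtype=int)
product = m @ identity    # multiplying by the identity returns m unchanged

# Linear algebra and random number generation
det = np.linalg.det(m)                # determinant, about -2.0
rng = np.random.default_rng(seed=0)   # seed chosen only for repeatability
samples = rng.random(3)               # three floats in [0, 1)

print(squares, shifted, product, round(det, 6), samples.shape)
```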

Array

In Python, an array is a collection of elements that all have the same data type, such as all integers, all floats, or all characters. Arrays are different from Python lists because lists can hold elements of mixed types (for example, an integer, a string, and a float in the same list), while arrays require all elements to be of the same type.

Arrays are useful when working with large amounts of data of the same type, such as numbers or characters, because they are more memory-efficient and can perform operations faster than lists. One of the benefits of arrays is that they can grow or shrink dynamically, meaning you can add or remove elements without needing to define the array’s size ahead of time.

To work with arrays in Python, you need to import the built-in array module. You can do this using either import array (and then refer to the type as array.array) or from array import *, which makes the name array available directly.
Once the module is imported, you can create an array using the syntax:

array_name = array(type_code, [elements]) 

In this syntax, type_code is a single-character string that specifies the type of data the array will hold. For example:
  • 'i' is used for signed integers.
  • 'f' is used for floating-point numbers.
  • 'u' is used for Unicode characters (this type code is deprecated in recent Python versions).
Here is an example of creating an array of integers: 

from array import array
numbers = array('i', [1, 2, 3, 4])
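
The same-type rule can be seen in a short sketch. The variable names here are illustrative; the point is that an array rejects a value that does not match its type code, while a list accepts anything:

```python
from array import array

numbers = array('i', [1, 2, 3, 4])
print(numbers.typecode)   # 'i'
print(numbers.itemsize)   # bytes per element; platform dependent

# Appending a float to an integer array raises TypeError
try:
    numbers.append(2.5)
    rejected = False
except TypeError as exc:
    rejected = True
    print("Rejected:", exc)

# A list accepts mixed types without complaint
mixed = [1, "two", 3.0]
print(mixed)
```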

NumPy Arrays 

NumPy arrays are a more powerful and flexible type of array provided by the NumPy library, which is widely used in data science and scientific computing. 
  • A NumPy array is an N-dimensional array object, which means it can represent data not only in one dimension (like a simple list) but also in two dimensions (like a table), three dimensions, or even more.
  • NumPy arrays can be created from Python lists (nested lists for two or more dimensions), and all elements in the array must have the same data type.
  • Each NumPy array has a data-type object, known as dtype, which defines the type and size of its elements.
  • The elements are accessed using zero-based indexing, meaning that the first element is at index 0.

To use NumPy arrays, you must first install and import the NumPy library. You can install it with pip install numpy, and then import it with:
    
 import numpy as np
  
Once NumPy is imported, you can create an array like this: 

import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])

In this example, arr is a two-dimensional array with two rows and three columns. 
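
To see the dtype and shape properties described above, here is a small sketch. The extra arrays mixed and small are assumptions added for illustration:

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.ndim)    # 2
print(arr.shape)   # (2, 3)
print(arr.dtype)   # an integer dtype; the exact width depends on the platform

# Mixing ints and floats upcasts every element to float
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64

# The dtype can also be requested explicitly
small = np.array([1, 2, 3], dtype=np.int8)
print(small.itemsize)  # 1 byte per element
```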

One-Dimensional Array

A one-dimensional array is the simplest form of an array. It is a linear collection of elements arranged in a single row or column. 

Example:
import numpy as np

arr1d = np.array([10, 20, 30, 40, 50])

print("Full 1D array:")
print(arr1d)

# Accessing a specific element
print("\nAccess arr1d[2] (Third element):")
print(arr1d[2])  # Output: 30

# Slicing examples
print("\nSlicing arr1d[1:4] (Elements from index 1 to 3):")
print(arr1d[1:4])  # Output: [20 30 40]

print("\nSlicing arr1d[:3] (Elements from start to index 2):")
print(arr1d[:3])  # Output: [10 20 30]

print("\nSlicing arr1d[2:] (Elements from index 2 to end):")
print(arr1d[2:])  # Output: [30 40 50]


Output:
Full 1D array:
[10 20 30 40 50]

Access arr1d[2] (Third element):
30

Slicing arr1d[1:4] (Elements from index 1 to 3):
[20 30 40]

Slicing arr1d[:3] (Elements from start to index 2):
[10 20 30]

Slicing arr1d[2:] (Elements from index 2 to end):
[30 40 50]

Two-Dimensional Array

A two-dimensional array has both rows and columns, similar to a table or matrix. Each row can contain multiple columns of data, and the data is still of the same type throughout the array.

Example:
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6]])

print("Full 2D array:")
print(arr2d)

# Accessing a specific element
print("\nAccess arr2d[1][2] (Second row, third element):")
print(arr2d[1][2])  # Output: 6

# Slicing examples
print("\nSlicing arr2d[0] (First row):")
print(arr2d[0])

print("\nSlicing arr2d[1, :2] (First two elements of second row):")
print(arr2d[1, :2])

print("\nSlicing arr2d[:, 0] (First column of all rows):")
print(arr2d[:, 0])

Output:
Full 2D array:
[[1 2 3]
 [4 5 6]]

Access arr2d[1][2] (Second row, third element):
6

Slicing arr2d[0] (First row):
[1 2 3]

Slicing arr2d[1, :2] (First two elements of second row):
[4 5]

Slicing arr2d[:, 0] (First column of all rows):
[1 4]
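
Two details of 2D indexing are worth a quick sketch: the comma form arr2d[1, 2] selects the same element as arr2d[1][2], and a basic slice is a view onto the original data rather than a copy. The variable names first_col and safe are illustrative:

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6]])

# Both forms select the same element; the comma form
# does it in a single indexing step
same = arr2d[1, 2] == arr2d[1][2]
print(same)          # True

# A basic slice is a view: writing through it changes the original
first_col = arr2d[:, 0]
first_col[0] = 100
print(arr2d[0, 0])   # 100

# Use .copy() when an independent array is needed
safe = arr2d[:, 1].copy()
safe[0] = -1
print(arr2d[0, 1])   # still 2
```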
 

Three-Dimensional Array

A three-dimensional array consists of multiple 2D arrays stacked together. You can think of it as a cube or a collection of tables. It has depth (or layers), rows, and columns.

Example:
import numpy as np

arr3d = np.array([
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]]
])

print("Full 3D array:")
print(arr3d)

# Accessing a specific element
print("\nAccess arr3d[1][0][2] (Second block, first row, third element):")
print(arr3d[1][0][2])  # Output: 9

# Slicing examples
print("\nSlicing arr3d[0] (First block):")
print(arr3d[0])

print("\nSlicing arr3d[1, 1] (Second block, second row):")
print(arr3d[1, 1])

print("\nSlicing arr3d[:, :, 0] (First element of each row in all blocks):")
print(arr3d[:, :, 0])
 
   
Output:
Full 3D array:
[[[ 1  2  3]
  [ 4  5  6]]

 [[ 7  8  9]
  [10 11 12]]]

Access arr3d[1][0][2] (Second block, first row, third element):
9

Slicing arr3d[0] (First block):
[[1 2 3]
 [4 5 6]]

Slicing arr3d[1, 1] (Second block, second row):
[10 11 12]

Slicing arr3d[:, :, 0] (First element of each row in all blocks):
[[ 1  4]
 [ 7 10]]


This array has:
  • 2 blocks (depth layers)
  • Each block has 2 rows
  • Each row has 3 columns

Operations on Arrays

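The built-in array module provides several methods for modifying and inspecting arrays. Note that these are methods of array.array, not NumPy functions. First, import the module:
from array import array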

Common Methods:

Method             Description                                                        Example             Output
.append(x)         Adds an element x to the end of the array                          arr.append(6)       [1, 2, 3, 6]
.insert(i, x)      Inserts element x at position i                                    arr.insert(1, 9)    [1, 9, 2, 3]
.remove(x)         Removes the first occurrence of element x                          arr.remove(2)       [1, 3]
.pop([i])          Removes and returns the element at index i (last if i not given)   arr.pop()           returns 3
.index(x)          Returns the index of the first occurrence of x                     arr.index(3)        2
.reverse()         Reverses the order of elements in the array in-place               arr.reverse()       [3, 2, 1]
.count(x)          Counts how many times x appears in the array                       arr.count(2)        1
.extend(iterable)  Adds elements from an iterable (e.g., list or array) to the array  arr.extend([7, 8])  [1, 2, 3, 7, 8]


Program:

from array import array

# Create an integer array
arr = array('i', [1, 2, 3])
print("Original array:")
print(arr)

# Append an element at the end
arr.append(4)
print("\nAfter append(4):")
print(arr)

# Insert an element at index 1
arr.insert(1, 9)
print("\nAfter insert(1, 9):")
print(arr)

# Remove the first occurrence of value 2
arr.remove(2)
print("\nAfter remove(2):")
print(arr)

# Pop the last element
popped = arr.pop()
print("\nAfter pop():")
print("Popped element:", popped)
print("Array now:", arr)

# Find the index of element 3
index_of_3 = arr.index(3)
print("\nIndex of element 3:")
print(index_of_3)

# Reverse the array
arr.reverse()
print("\nAfter reverse():")
print(arr)

# Count how many times 1 appears
count_1 = arr.count(1)
print("\nCount of 1 in array:")
print(count_1)

# Extend the array with another list
arr.extend([7, 8])
print("\nAfter extend([7, 8]):")
print(arr)

Output:

Original array:
array('i', [1, 2, 3])

After append(4):
array('i', [1, 2, 3, 4])

After insert(1, 9):
array('i', [1, 9, 2, 3, 4])

After remove(2):
array('i', [1, 9, 3, 4])

After pop():
Popped element: 4
Array now: array('i', [1, 9, 3])

Index of element 3:
2

After reverse():
array('i', [3, 9, 1])

Count of 1 in array:
1

After extend([7, 8]):
array('i', [3, 9, 1, 7, 8])
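
The methods above belong to the built-in array module. NumPy arrays do not have append or remove methods, so here is a rough sketch of comparable NumPy calls. The pairing is an illustrative mapping rather than something stated in the text above. Each call returns a new array and leaves the original unchanged, and np.delete removes by index, not by value:

```python
import numpy as np

arr = np.array([1, 2, 3])

appended = np.append(arr, 4)            # [1 2 3 4]
inserted = np.insert(arr, 1, 9)         # [1 9 2 3]
removed = np.delete(arr, 1)             # drops index 1 -> [1 3]
reversed_arr = arr[::-1]                # [3 2 1]
count_2 = np.count_nonzero(arr == 2)    # 1

print(arr)   # the original is unchanged: [1 2 3]
```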

Concatenating Arrays 

Concatenation means joining two or more arrays into a single array. This is very common when you're combining datasets, rows of data, or features in machine learning and data analysis.

Concatenation can happen in:
  • Row-wise direction (adding more rows): axis=0
  • Column-wise direction (adding more columns): axis=1
You can use:
  • np.concatenate() → more general
  • np.vstack() → vertical stack (row-wise)
  • np.hstack() → horizontal stack (column-wise)
Example 1: Concatenating 1D Arrays

import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

result = np.concatenate((a, b))
print(result)

Output:
[1 2 3 4 5 6]

Example 2: 2D Arrays – Row-wise Concatenation

a = np.array([[1, 2],[3, 4]])
b = np.array([[5, 6],[7, 8]])

row_concat = np.concatenate((a, b), axis=0)
print(row_concat)
 
Output:
[[1 2]
 [3 4]
 [5 6]
 [7 8]]

Example 3: 2D Arrays – Column-wise Concatenation

a = np.array([[1, 2],[3, 4]])
b = np.array([[5, 6],[7, 8]])

col_concat = np.concatenate((a, b), axis=1)
print(col_concat)

Output:
[[1 2 5 6]
 [3 4 7 8]]

Example 4: 2D Arrays – Vertical Stack (Row-wise, like axis=0)
It is the same as np.concatenate((a, b), axis=0).

a = np.array([[1, 2],[3, 4]])
b = np.array([[5, 6],[7, 8]])

print(np.vstack((a, b)))

Output:

[[1 2]
 [3 4]
 [5 6]
 [7 8]]

Example 5: 2D Arrays – Horizontal Stack (Column-wise, like axis=1)
It is the same as np.concatenate((a, b), axis=1).

a = np.array([[1, 2],[3, 4]])
b = np.array([[5, 6],[7, 8]])

print(np.hstack((a, b)))

Output:
[[1 2 5 6]
 [3 4 7 8]]

Example 6: 2D Arrays – Depth Stack 
It joins arrays along the 3rd dimension (depth).

a = np.array([[1, 2],[3, 4]])
b = np.array([[5, 6],[7, 8]])

print(np.dstack((a, b)))

Output:
[[[1 5]
  [2 6]]

 [[3 7]
  [4 8]]]
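
Comparing the result shapes side by side may make the three stacking directions easier to tell apart. This sketch also shows what happens when shapes do not line up; the array c is a hypothetical mismatched input:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

print(np.vstack((a, b)).shape)   # (4, 2): rows added
print(np.hstack((a, b)).shape)   # (2, 4): columns added
print(np.dstack((a, b)).shape)   # (2, 2, 2): new third axis

# Hypothetical mismatched array: 3 columns, while a has 2
c = np.array([[9, 10, 11]])
try:
    np.concatenate((a, c), axis=0)
    mismatch_rejected = False
except ValueError as exc:
    mismatch_rejected = True
    print("Cannot concatenate:", exc)
```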

Reshaping Arrays

Reshaping means changing the shape of an array (how many elements it has along each dimension) without changing its data. The new shape must hold exactly the same number of elements as the original.
It is useful when you want to reorganize data for analysis, machine learning, or matrix operations.

Function:
numpy.reshape(array, newshape)

Example 1: 1D → 2D Reshaping 

import numpy as np

arr = np.arange(12)   # Creates array from 0 to 11 (12 elements)
print("Original Array:", arr)

reshaped = arr.reshape(3, 4)    # Reshape into 3 rows and 4 columns
print("Reshaped to 3x4:\n", reshaped)

Output:
Original Array: [ 0  1  2  3  4  5  6  7  8  9 10 11]

Reshaped to 3x4:
 [[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]

Example 2: Using -1 for Auto Dimension

arr = np.arange(12)
reshaped = arr.reshape(4, -1)   # NumPy figures out the number of columns
print("Reshaped with -1:\n", reshaped)

Output:
Reshaped with -1:
 [[ 0  1  2]
  [ 3  4  5]
  [ 6  7  8]
  [ 9 10 11]]

Example 3: 1D → 3D Reshaping

arr = np.arange(8)
reshaped = arr.reshape(2, 2, 2)
print("Reshaped to 2x2x2:\n", reshaped)

Output:
Reshaped to 2x2x2:
 [[[0 1]
   [2 3]]

  [[4 5]
   [6 7]]]

Example 4: Flattening (3D/2D → 1D)

arr = np.arange(12).reshape(3, 4)
flat = arr.reshape(-1)
print("Flattened Array:", flat)

Output:

Flattened Array: [ 0  1  2  3  4  5  6  7  8  9 10 11]
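
Two points about reshaping can be checked with a short sketch: a shape that does not match the element count is rejected, and ravel() can return a view of the same data while flatten() always copies. The names flat_copy and flat_view are illustrative:

```python
import numpy as np

arr = np.arange(12)

# 12 elements cannot fill a 5 x 3 grid, so NumPy raises ValueError
try:
    arr.reshape(5, 3)
    bad_shape_rejected = False
except ValueError as exc:
    bad_shape_rejected = True
    print("Invalid shape:", exc)

grid = arr.reshape(3, 4)
flat_copy = grid.flatten()   # always a new array
flat_view = grid.ravel()     # a view when possible (it is here)

flat_view[0] = 99
print(grid[0, 0])        # 99: the view shares data with grid
print(flat_copy[0])      # 0: the copy is unaffected
```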

Splitting Arrays

Splitting means dividing one array into multiple sub-arrays.
This is useful when you want to:
  • Separate data into chunks
  • Divide features and labels in machine learning
  • Organize large arrays into smaller sections
NumPy Functions for Splitting:

Function          Description
np.split()        Split an array into equal parts
np.array_split()  Split into unequal parts if needed
np.hsplit()       Split along columns (axis=1)
np.vsplit()       Split along rows (axis=0)

Example 1: Using np.split() – Equal Split (1D Array)

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])
split_arr = np.split(arr, 3)  # Split into 3 equal parts

print("Original Array:", arr)
print("Split Arrays:", split_arr)

Output:
Original Array: [1 2 3 4 5 6]
Split Arrays: [array([1, 2]), array([3, 4]), array([5, 6])]

Example 2: np.array_split() – Unequal Split Allowed

arr = np.array([1, 2, 3, 4, 5, 6, 7])
split_arr = np.array_split(arr, 3)

print("Split into 3 (unequal):", split_arr)

Output:
Split into 3 (unequal): [array([1, 2, 3]), array([4, 5]), array([6, 7])]

Example 3: Splitting a 2D Array (Row-wise)

arr2d = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
split_rows = np.vsplit(arr2d, 2)

print("Original Array:\n", arr2d)
print("Row-wise Split:\n", split_rows)

Output:
Original Array:
 [[1 2]
  [3 4]
  [5 6]
  [7 8]]
Row-wise Split:
 [array([[1, 2],
        [3, 4]]), 
  array([[5, 6],
        [7, 8]])]

Example 4: Splitting a 2D Array (Column-wise)

split_cols = np.hsplit(arr2d, 2)

print("Column-wise Split:\n", split_cols)

Output:
Column-wise Split:
 [array([[1],
        [3],
        [5],
        [7]]), 
  array([[2],
        [4],
        [6],
        [8]])]
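
The split functions also accept a list of indices instead of a count, which is one way to separate feature columns from a label column. The sketch below uses a small made-up dataset (the values and the column layout are assumptions) and also shows np.split rejecting an unequal division:

```python
import numpy as np

# Hypothetical rows: two feature columns, then a label column
data = np.array([
    [5.1, 3.5, 0],
    [4.9, 3.0, 0],
    [6.2, 2.9, 1],
])

# A list of indices splits before each index: columns 0-1, then column 2
features, labels = np.hsplit(data, [2])
print(features.shape)   # (3, 2)
print(labels.shape)     # (3, 1)

# np.split insists on equal parts and raises ValueError otherwise
try:
    np.split(np.arange(7), 3)
    unequal_rejected = False
except ValueError as exc:
    unequal_rejected = True
    print("Unequal split:", exc)
```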

Statistical Operations on Arrays

Function / Attribute  Description
ndim                  Returns the number of array dimensions
shape                 Returns a tuple of array dimensions
size                  Returns the total number of elements
transpose()           Transposes the array (rows ↔ columns)
ravel()               Flattens the array into 1D
np.zeros()            Creates an array of all zeros
np.ones()             Creates an array of all ones
np.linspace()         Creates evenly spaced values over a range
np.max()              Returns the maximum element
np.min()              Returns the minimum element
np.sum()              Returns the sum of all elements
np.sqrt()             Returns the square root of each element
np.std()              Returns the standard deviation
np.sort()             Returns a sorted copy of the array

Program

import numpy as np

# Create a 2D array
arr = np.array([[10, 20, 30], [40, 50, 60]])

# 1D array for math ops
arr1d = np.array([4, 1, 9, 3, 5])

# Zeros, Ones, Linspace arrays
zeros_arr = np.zeros((2, 2))
ones_arr = np.ones((2, 2))
lin_arr = np.linspace(1, 5, 5)

print("Original 2D Array:\n", arr)

# Properties
print("\nNumber of Dimensions (ndim):", arr.ndim)
print("Shape:", arr.shape)
print("Size:", arr.size)

# Transformations
print("\nTranspose:\n", arr.transpose())
print("Ravel (flattened):", arr.ravel())

# Array creation
print("\nZeros Array:\n", zeros_arr)
print("Ones Array:\n", ones_arr)
print("Linspace Array (1 to 5, 5 parts):", lin_arr)

# Statistical and Math Operations
print("\n1D Array for math:", arr1d)
print("Maximum:", np.max(arr1d))
print("Minimum:", np.min(arr1d))
print("Sum:", np.sum(arr1d))
print("Square Roots:", np.sqrt(arr1d))
print("Standard Deviation:", np.std(arr1d))
print("Sorted Array:", np.sort(arr1d))

Output:

Original 2D Array:
 [[10 20 30]
  [40 50 60]]

Number of Dimensions (ndim): 2
Shape: (2, 3)
Size: 6

Transpose:
 [[10 40]
  [20 50]
  [30 60]]

Ravel (flattened): [10 20 30 40 50 60]

Zeros Array:
 [[0. 0.]
  [0. 0.]]
Ones Array:
 [[1. 1.]
  [1. 1.]]
Linspace Array (1 to 5, 5 parts): [1. 2. 3. 4. 5.]

1D Array for math: [4 1 9 3 5]
Maximum: 9
Minimum: 1
Sum: 22
Square Roots: [2.         1.         3.         1.73205081 2.23606798]
Standard Deviation: 2.6532998322843198
Sorted Array: [1 3 4 5 9]
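
Most of these functions also take an axis argument, which the program above does not use. A brief sketch on the same 2D array shows the difference between column-wise and row-wise results, along with the ddof option of np.std:

```python
import numpy as np

arr = np.array([[10, 20, 30], [40, 50, 60]])

print(np.sum(arr))            # 210, all elements
print(np.sum(arr, axis=0))    # [50 70 90], one total per column
print(np.sum(arr, axis=1))    # [ 60 150], one total per row
print(np.mean(arr, axis=1))   # [20. 50.]
print(np.max(arr, axis=0))    # [40 50 60]

# np.std divides by N by default; ddof=1 divides by N - 1 (sample estimate)
values = np.array([4, 1, 9, 3, 5])
population_std = np.std(values)
sample_std = np.std(values, ddof=1)
print(population_std < sample_std)   # True
```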

Data Handling using Pandas

Pandas is a powerful open-source data manipulation and analysis library for Python.
  • It provides two core data structures, Series and DataFrame, for working with structured data.
  • These structures are fast and flexible, which makes it easy to handle and analyze large datasets in much the same way as tables or spreadsheets.
  • It is used to analyze, clean, explore, and manipulate data in a structured and efficient way.
It’s widely used in:
  • Data science
  • Machine learning
  • Real-world data processing
  • CSV/Excel file handling
Key Features

1. Data Structures
  • Series → A one-dimensional array-like object. It’s like a single column from a DataFrame.
  • DataFrame → This is the main structure in Pandas, similar to a table or spreadsheet. It consists of rows and columns, and you can easily manipulate, filter, and analyze data within it.

2. Data Manipulation
  • Handling missing data easily.
  • Data filtering, sorting, grouping, and aggregation.
  • Merging and joining datasets.
  • Time-series functionality (date parsing, resampling, etc.).
3. Performance
  • Built on NumPy, so it’s optimized for performance.
  • Vectorized operations for fast computation.
4. Integration
  • Works well with libraries like NumPy, Matplotlib, and Scikit-learn.
5. Indexing and Selection
  • Selecting rows and columns using labels or indices
  • Conditional selection using boolean indexing
6. Data handling
  • You can import/export data from various formats like CSV, Excel, SQL databases, JSON, HTML, etc.
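
A few of the features listed above can be sketched in a handful of lines. The data, column names, and state codes below are made up for illustration:

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    'City': ['Vizag', 'Pune', 'Vizag', 'Pune'],
    'Sales': [100, 250, 300, 50],
})

# Conditional selection with boolean indexing
high = df[df['Sales'] > 90]

# Sorting, then grouping with aggregation
ranked = df.sort_values('Sales', ascending=False)
totals = df.groupby('City')['Sales'].sum()

# Merging with a second table on a shared column
regions = pd.DataFrame({'City': ['Vizag', 'Pune'], 'State': ['AP', 'MH']})
merged = df.merge(regions, on='City')

print(high)
print(totals)
print(merged)
```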

Advantages
1. Efficient Data Handling
  • Pandas provides fast, flexible, and expressive data structures (DataFrames and Series), which are optimized for performance. These structures allow you to perform complex data manipulation quickly and efficiently.
2. Easy to Use
  • The syntax is intuitive, making it easy to learn for beginners and efficient for experienced users. The API is highly user-friendly, allowing you to perform data manipulations with minimal code.
3. Handling Missing Data
  • Pandas provides built-in methods for dealing with missing data (NaN), such as filling, dropping, or forward/backward filling, which is critical for real-world data analysis.
4. Support for Various File Formats
  • Pandas can read and write data from a variety of formats: CSV, Excel, JSON, HDF5, SQL databases, and more. This makes it a great choice for integrating different data sources.
5. Integration with Other Libraries
  • Pandas integrates seamlessly with other Python libraries like NumPy (for numerical computations), Matplotlib/Seaborn (for visualization), and Scikit-learn (for machine learning), allowing for a full data science workflow.
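
The missing-data methods mentioned above can be tried on a tiny Series. This is a minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])

missing = s.isna().sum()
print(missing)                # 2 missing values
print(s.dropna().tolist())    # [1.0, 3.0]
print(s.fillna(0).tolist())   # [1.0, 0.0, 3.0, 0.0]
print(s.ffill().tolist())     # [1.0, 1.0, 3.0, 3.0] (forward fill)
```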

Disadvantages

1. Memory Consumption
  • While Pandas is fast and powerful, it can be memory-intensive, especially when working with large datasets. The DataFrame structure can consume a lot of RAM, which can lead to performance issues on very large datasets (more than a few GB).
2. Learning Curve
  • While basic operations in Pandas are easy to learn, more advanced features (like multi-indexing, pivoting, or complex aggregation) can be tricky for newcomers to grasp.
3. Inconsistent Performance on Specific Operations
  • While Pandas is highly optimized, some operations (like string manipulations, or using apply() with custom functions) can be slower than vectorized operations with NumPy or more specialized libraries.
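
To illustrate the apply() point, here is a sketch that computes the same column two ways. The column names are hypothetical, and the sketch only shows that the results match; the speed difference becomes noticeable on much larger tables:

```python
import pandas as pd

df = pd.DataFrame({'price': [100.0, 250.0, 80.0], 'qty': [2, 1, 5]})

# Row by row: calls a Python function once per row
slow = df.apply(lambda row: row['price'] * row['qty'], axis=1)

# Vectorized: one operation over whole columns
fast = df['price'] * df['qty']

print((slow == fast).all())   # True: same values, different cost
```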
Applications

1. Data Analysis
  • Exploratory Data Analysis (EDA): Pandas is commonly used for EDA to analyze datasets, calculate statistics (mean, median, mode), and understand the distribution of data.
  • Data Summarization: It’s used for calculating summary statistics like average, standard deviation, etc., as well as grouping data by categories for aggregation.
2. Data Cleaning
  • Handling Missing Data: Removing, filling, or interpolating missing values in datasets.
  • Data Transformation: Renaming columns, changing data types, applying functions across columns/rows.
  • Dealing with Outliers: Detecting and handling outliers in datasets.
3. Machine Learning Data Preprocessing
  • Data Preprocessing: Before applying machine learning algorithms, data often needs to be cleaned, transformed, and standardized. Pandas is a key tool in preparing data for machine learning models.
  • Feature Engineering: Creating new features from existing data (e.g., time-based features, aggregating data) is done with Pandas.
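
A small sketch of the cleaning and feature-engineering steps described above, using invented order data (the column names and values are assumptions):

```python
import pandas as pd

# Hypothetical raw data: an awkward column name and numbers stored as text
df = pd.DataFrame({
    'order date': ['2024-01-15', '2024-02-03'],
    'amount': ['120', '80'],
})

# Data transformation: rename a column and change data types
df = df.rename(columns={'order date': 'order_date'})
df['amount'] = df['amount'].astype(int)
df['order_date'] = pd.to_datetime(df['order_date'])

# Feature engineering: derive a time-based feature from the date
df['order_month'] = df['order_date'].dt.month

print(df)
```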

Pandas Data Structures

Pandas provides two main data structures:

1. Series

Series is a one-dimensional labeled array that can hold data of any type (integers, strings, floats, Python objects, etc.).

Key Features:
  • Homogeneous data (a Series has a single data type; when the values are of mixed types, they are stored with the general object type)
  • Associated labels (index)
  • Can be thought of like a column in a spreadsheet or SQL table

Creation of Series

Series are generally created from:
  1.  Arrays 
  2.  Lists 
  3.  Dictionaries
From Arrays:

import pandas as pd
import numpy as np

# Create a pandas Series from a NumPy array
arr= np.array([10, 20, 30, 40])
s = pd.Series(arr)
print(s)

Output:

0    10
1    20
2    30
3    40
dtype: int64
  • The left column is the index
  • The right column is the data
From Lists:

import pandas as pd

# Creating a Series from a list
s = pd.Series([10, 20, 30, 40])
print(s)

Output:

0    10
1    20
2    30
3    40
dtype: int64

From Dictionary:

# Create a pandas Series from a dictionary
dict_data = {'a': 100, 'b': 200, 'c': 300}
series_from_dict = pd.Series(dict_data)

print(series_from_dict)

Output:

a    100
b    200
c    300
dtype: int64
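
Labels can also be set explicitly when creating a Series. The sketch below (index labels and values are illustrative) shows label-based and position-based access, and what happens to the data type when values are mixed:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
print(s['y'])        # 20, looked up by label
print(s.iloc[0])     # 10, looked up by position
print(s.dtype)       # int64

# Mixed values fall back to the general object dtype
mixed = pd.Series([1, 'two', 3.0])
print(mixed.dtype)   # object
```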

2. DataFrames

DataFrame is a two-dimensional labeled data structure with columns (like a spreadsheet or SQL table).

Key Features:
  • Heterogeneous data (each column can have a different data type)
  • Labeled axes (rows and columns)
  • Flexible indexing and powerful data manipulation
DataFrames in pandas can be created from:
  1. List 
  2. List of tuples 
  3. Dictionary  
  4. Excel Spreadsheet files 
  5. CSV (comma-separated values) files
From a list:

import pandas as pd

# List of lists
data = [['Swathi', 'Vizag'], ['Surya', 'Hyderabad'], ['Chinnu', 'Pune']]
df = pd.DataFrame(data, columns=['Name', 'City'])
print(df)

Output:

     Name       City
0  Swathi      Vizag
1   Surya  Hyderabad
2  Chinnu       Pune


From a list of tuples:

import pandas as pd
# List of tuples
data = [('Swathi', 'Vizag'), ('Surya', 'Hyderabad'), ('Chinnu', 'Pune')]
df = pd.DataFrame(data, columns=['Name', 'City'])
print(df)

Output:

     Name       City
0  Swathi      Vizag
1   Surya  Hyderabad
2  Chinnu       Pune

From a Dictionary

import pandas as pd

# Creating a DataFrame with your data
data = {
    'Name': ['Swathi', 'Surya', 'Chinnu'],
    'City': ['Vizag', 'Hyderabad', 'Pune'],
    'Age': [28, 32, 26],
    'Salary': [60000, 75000, 52000]
}

df = pd.DataFrame(data)
print(df)

Output:

     Name       City  Age  Salary
0  Swathi      Vizag   28   60000
1   Surya  Hyderabad   32   75000
2  Chinnu       Pune   26   52000
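
Once a DataFrame exists, columns and rows can be selected by label, by position, or by condition. A short sketch using the same data as above:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Swathi', 'Surya', 'Chinnu'],
    'City': ['Vizag', 'Hyderabad', 'Pune'],
    'Age': [28, 32, 26],
    'Salary': [60000, 75000, 52000],
})

print(df.dtypes)               # each column has its own data type
print(df['Name'].tolist())     # one column, as a Series turned into a list
print(df.loc[1, 'City'])       # 'Hyderabad', by row label and column name
print(df.iloc[0, 2])           # 28, by integer position
older = df[df['Age'] > 27]['Name'].tolist()
print(older)                   # ['Swathi', 'Surya']
```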

From an Excel File (.xlsx)

df = pd.read_excel('data.xlsx')  # install openpyxl if not already: pip install openpyxl
print(df)

You need to have the Excel file (data.xlsx) in your directory, or provide the full path.

From a CSV File (.csv)

df = pd.read_csv('data.csv')
print(df)

You need to have the CSV file (data.csv) in your directory, or provide the full path.
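
If no CSV file is at hand, the same call can be tried on in-memory text. This is a sketch that assumes a small made-up table; io.StringIO stands in for the file on disk:

```python
import io
import pandas as pd

# Hypothetical CSV content, standing in for data.csv
csv_text = """Name,City,Age
Swathi,Vizag,28
Surya,Hyderabad,32
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)   # (2, 3)

# Write back to CSV text; index=False leaves out the row labels
out = df.to_csv(index=False)
print(out)
```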


