Python Data Analysis Ecosystem
Python offers a suite of libraries for data analysis, manipulation, and visualization. NumPy, pandas, Matplotlib, and seaborn together cover most steps of a typical data science workflow.
# Core data analysis libraries:
import numpy as np # Numerical computing
import pandas as pd # Data manipulation
import matplotlib.pyplot as plt # Visualization
import seaborn as sns # Statistical visualization
# Typical workflow:
# 1. Load data → 2. Clean data → 3. Explore data
# 4. Analyze data → 5. Visualize results
These libraries interoperate: pandas is built on NumPy arrays, and both Matplotlib and seaborn accept pandas objects directly, so one DataFrame can move through every step from quick exploration to statistical analysis.
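The workflow steps listed above can be sketched end to end on a tiny in-memory dataset. This is only an illustration: the CSV text, column names, and values are invented here, and a real project would read from a file or database instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
raw_csv = io.StringIO(
    "name,age,city\n"
    "Alice,25,New York\n"
    "Bob,,London\n"
    "Charlie,35,Paris\n"
    "Alice,25,New York\n"
)

# 1. Load
people = pd.read_csv(raw_csv)

# 2. Clean: drop exact duplicate rows, then fill the missing age with the median
people = people.drop_duplicates()
people["age"] = people["age"].fillna(people["age"].median())

# 3. Explore: summary statistics for the numeric column
print(people.describe())

# 4. Analyze: mean age per city
mean_age_by_city = people.groupby("city")["age"].mean()
print(mean_age_by_city)

# 5. Visualize (requires matplotlib): mean_age_by_city.plot(kind="bar")
```

Filling with the median is one choice among several; dropping the incomplete row or using a group-specific value may suit other datasets better.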
Pandas Fundamentals
Pandas provides the DataFrame, a labeled two-dimensional table whose rows and columns can be selected by label or by position.
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
# Basic operations
df.head() # First 5 rows
df.info() # Column dtypes and non-null counts
df.describe() # Summary statistics for numeric columns
# Selecting data
df['Name'] # Single column
df.loc[0] # Row by label
df.iloc[0] # Row by position
# Filtering
df[df['Age'] > 30] # People older than 30
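With the default integer index, loc[0] and iloc[0] return the same row, which hides the difference between them. The sketch below reuses the sample data but sets the names as the index (an assumption made only for illustration) so that labels and positions differ, and it also shows how to combine two filter conditions.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie"],
        "Age": [25, 30, 35],
        "City": ["New York", "London", "Paris"],
    }
)

# Use names as the index so that labels and positions are no longer the same.
by_name = df.set_index("Name")

bob_by_label = by_name.loc["Bob"]    # lookup by index label
bob_by_position = by_name.iloc[1]    # lookup by integer position

# Combine conditions with & (and) or | (or); each condition needs parentheses.
older_non_londoners = df[(df["Age"] >= 30) & (df["City"] != "London")]
print(older_non_londoners)
```

Python's `and` and `or` keywords do not work element-wise on a Series, which is why the bitwise operators are used here.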
Data Cleaning with Pandas
Real-world data is often messy. Pandas provides tools to handle missing data, duplicates, and inconsistencies.
# Handling missing data
df.isna().sum() # Count missing values
df.dropna() # Drop rows with missing values
df.fillna(0) # Fill missing with 0
df.fillna(df.mean(numeric_only=True)) # Fill numeric columns with their mean
# Removing duplicates
df.drop_duplicates()
# Data type conversion
df['Age'] = df['Age'].astype('float')
# String operations
df['Name'].str.upper() # Convert to uppercase
df['Name'].str.contains('Ali') # Find names containing 'Ali'
# DateTime handling (assumes df has a 'Date' column of date strings)
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
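The sample DataFrame from the previous section has no missing values, duplicates, or dates, so the calls above have nothing to act on. A minimal sketch with invented messy data shows them working together; note that methods such as drop_duplicates and fillna return new objects instead of modifying the original, so the results must be assigned.

```python
import numpy as np
import pandas as pd

# Invented messy data: one missing age, one duplicated row, dates stored as text.
messy = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Bob", "Charlie"],
        "Age": [25.0, 30.0, 30.0, np.nan],
        "Date": ["2023-01-15", "2023-02-20", "2023-02-20", "2024-03-05"],
    }
)

# Remove the repeated Bob row; copy() so later assignments act on an independent frame.
clean = messy.drop_duplicates().copy()

# Fill only the numeric column; mean() skips NaN by default.
clean["Age"] = clean["Age"].fillna(clean["Age"].mean())

# Parse the text dates, then extract the year.
clean["Date"] = pd.to_datetime(clean["Date"])
clean["Year"] = clean["Date"].dt.year

print(clean)
```

Removing duplicates before computing the mean matters here: otherwise the repeated row would be counted twice and shift the fill value.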
Data Aggregation & Grouping
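Grouping splits a DataFrame by the values of one or more columns, applies an aggregation to each group, and combines the results into a new table. Merging and concatenation combine separate tables.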
# Grouping data
grouped = df.groupby('City')
grouped.mean(numeric_only=True) # Mean of each numeric column by city
# Multiple aggregations
df.groupby('City')['Age'].agg(['mean', 'min', 'max', 'count'])
# Pivot tables
pd.pivot_table(df, values='Age', index='City', aggfunc='mean')
# Cross tabulation
pd.crosstab(df['City'], df['Age'] > 30) # Count per city of ages above 30 or not
# Merging DataFrames (df1 and df2 are two DataFrames that share a 'key' column)
pd.merge(df1, df2, on='key') # SQL-style join
pd.concat([df1, df2]) # Stack vertically
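The merge and concat calls above refer to df1 and df2 without defining them. The following sketch uses small invented tables to show named aggregation and how the join type changes which rows survive a merge; the column names and values are hypothetical.

```python
import pandas as pd

people = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "Dana"],
        "Age": [25, 30, 35, 40],
        "City": ["New York", "London", "Paris", "London"],
    }
)

# Named aggregation gives each output column an explicit name.
age_stats = people.groupby("City").agg(
    mean_age=("Age", "mean"),
    n_people=("Age", "count"),
)
print(age_stats)

# Invented tables that share a 'key' column.
df1 = pd.DataFrame({"key": ["a", "b", "c"], "left_value": [1, 2, 3]})
df2 = pd.DataFrame({"key": ["b", "c", "d"], "right_value": [20, 30, 40]})

# The default merge is an inner join: only keys present in both tables survive.
inner = pd.merge(df1, df2, on="key")

# how="left" keeps every row of df1 and fills unmatched columns with NaN.
left = pd.merge(df1, df2, on="key", how="left")

# concat stacks rows; ignore_index=True renumbers the result from 0.
stacked = pd.concat([df1, df1], ignore_index=True)

print(inner)
print(left)
```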
Data Visualization
Visualizations help uncover patterns and communicate findings effectively.
# Matplotlib basics
plt.plot(df['Age'])
plt.title('Age by Row Index')
plt.xlabel('Index')
plt.ylabel('Age')
plt.show()
# Seaborn for statistical plots
sns.histplot(df['Age'])
sns.boxplot(x='City', y='Age', data=df)
sns.scatterplot(x='Age', y='Income', hue='City', data=df) # Assumes df has an 'Income' column
# Pandas built-in plotting
df.plot(kind='bar', x='Name', y='Age')
df['Age'].plot(kind='hist')
# Advanced visualizations
sns.pairplot(df) # Scatter matrix
sns.heatmap(df.corr(numeric_only=True), annot=True) # Correlation matrix of numeric columns
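The snippets above use the implicit pyplot interface. For figures with more than one panel, Matplotlib's object-oriented interface is often easier to control. The sketch below is one possible layout, using only Matplotlib and pandas; the Income values are invented for illustration, and the Agg backend is selected so the script also runs on machines without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line for on-screen plots

import matplotlib.pyplot as plt
import pandas as pd

# Sample data; the Income column is hypothetical.
df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie"],
        "Age": [25, 30, 35],
        "Income": [40_000, 52_000, 61_000],
    }
)

# One figure with two side-by-side axes.
fig, (ax_bar, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

# Left panel: one bar per person.
ax_bar.bar(df["Name"], df["Age"])
ax_bar.set_title("Age by person")
ax_bar.set_ylabel("Age")

# Right panel: relationship between two numeric columns.
ax_scatter.scatter(df["Age"], df["Income"])
ax_scatter.set_title("Income vs. age")
ax_scatter.set_xlabel("Age")
ax_scatter.set_ylabel("Income")

fig.tight_layout()
```

Calling fig.savefig("figure.png") afterwards writes the figure to disk; the file name is arbitrary.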
Advanced Analysis Techniques
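SciPy, statsmodels, and scikit-learn extend the same DataFrame workflow to hypothesis tests, regression, and machine learning. The snippets in this section assume df also has a numeric 'Income' column.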
# Statistical analysis with scipy
from scipy import stats
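# Two-sample t-test: do mean ages differ between the two cities?
# (each group needs at least two rows for a meaningful result)
stats.ttest_ind(df[df['City'] == 'New York']['Age'],
                df[df['City'] == 'London']['Age'])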
# Linear regression with statsmodels
import statsmodels.api as sm
X = sm.add_constant(df['Age'])
model = sm.OLS(df['Income'], X).fit()
print(model.summary())
# Machine learning with scikit-learn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
X = df[['Age']]
y = df['Income']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42) # Fixed seed for a reproducible split
model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)
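The scikit-learn snippet above stops at generating predictions. A natural next step is to score them on the held-out data. The sketch below does this on synthetic data, since the sample DataFrame has no Income column: the relationship between age and income, the noise level, and the sample size are all invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: income rises by about 1,000 per year of age, plus random noise.
rng = np.random.default_rng(0)
age = rng.integers(20, 65, size=200)
income = 20_000 + 1_000 * age + rng.normal(0, 5_000, size=200)
df = pd.DataFrame({"Age": age, "Income": income})

X = df[["Age"]]   # features must be 2-D, hence the double brackets
y = df["Income"]

# Hold out 25% of the rows; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

# Score on rows the model never saw during fitting.
test_r2 = r2_score(y_test, predictions)
print("estimated slope:", model.coef_[0])
print("R^2 on held-out data:", test_r2)
```

Scoring on the test split instead of the training split gives a less optimistic estimate of how the model would perform on new data.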