Data Science

aditya
9 minute read


Excel Formula :

 =VLOOKUP("M", 'Product Table'!A:B, 2, FALSE) 
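
This formula looks up the exact value "M" in column A of the 'Product Table' sheet and returns the matching entry from column B (the 2nd column of the A:B range; FALSE forces an exact match). A rough pandas equivalent, assuming the lookup sheet were loaded into a DataFrame with hypothetical 'Key' and 'Value' columns:

import pandas as pd

# Hypothetical two-column lookup table standing in for 'Product Table'!A:B
product_table = pd.DataFrame({'Key': ['S', 'M', 'L'],
                              'Value': ['Small', 'Medium', 'Large']})

# Exact-match lookup, analogous to VLOOKUP("M", ..., 2, FALSE)
match = product_table.loc[product_table['Key'] == 'M', 'Value']
print(match.iloc[0] if not match.empty else '#N/A')  # prints 'Medium' for this sample table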

-----------------------------------------------------------------------------------------------------------------------------

Links :

DATA_SET.csv : https://drive.google.com/file/d/1nl8s8nSTn8MrsuCDhiwfV6z2s1gGB8Ns/view?usp=drive_link  

                                              

DATA_SET.xlsx (Excel File) : https://docs.google.com/spreadsheets/d/1AgF6cjg5VH2_UxaMYN-8FTU62TdYLfHw/edit?usp=sharing&ouid=116484988672302914957&rtpof=true&sd=true

 

diabetes(2).csv : https://drive.google.com/file/d/1RxamvTmOUVumxvdwYOScnClOFO3fwRPd/view?usp=drive_link


ds.json : https://drive.google.com/file/d/1Wii4GNsQ2k414X-3otzNydns30MTlt0A/view?usp=drive_link


iris.csv : https://drive.google.com/file/d/1RJIt4Npphq7M8FFia_pL0Q5FhfCGei8S/view?usp=drive_link

-----------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------


Practical 2 

Aim : Data Frames and Basic Data Pre-processing 

 -->Read data from CSV and JSON files into a data frame. 

 -->Perform basic data pre-processing tasks such as handling missing values and outliers. 

 -->Manipulate and transform data using functions like filtering, sorting, and grouping

Code :

import pandas as pd

# Read data from CSV file into a data frame
csv_file_path = 'DATA_SET.csv'
df_csv = pd.read_csv(csv_file_path)

# Read data from JSON file into a data frame
json_file_path = 'ds.json'
df_json = pd.read_json(json_file_path)

# Display the first few rows of each data frame to inspect the data
print("CSV Data:")
print(df_csv.head())
print("\nJSON Data:")
print(df_json.head())

# Handling missing values
# Drop rows with missing values
df_csv_cleaned = df_csv.dropna()
# Fill missing values with a specific value (e.g., 0)
df_json_filled = df_json.fillna(0)

# Handling outliers
# Assume 'Sales' is the column with outliers
# Replace outliers with the median
median_value = df_csv['Sales'].median()
upper_threshold = df_csv['Sales'].mean() + 2 * df_csv['Sales'].std()
lower_threshold = df_csv['Sales'].mean() - 2 * df_csv['Sales'].std()
df_csv['Sales'] = df_csv['Sales'].apply(
    lambda x: median_value if x > upper_threshold or x < lower_threshold else x
)

# Manipulate and transform data
# Filtering
filtered_data = df_csv[df_csv['Sales'] > 10]
# Sorting
sorted_data = df_csv.sort_values(by='Sales', ascending=False)
# Grouping and calculating mean for numeric columns
numeric_columns = ['Sales', 'Cost', 'Profit']
grouped_data = df_csv.groupby('Category')[numeric_columns].mean()

# Display the results
print("\nCleaned CSV Data:")
print(df_csv_cleaned.head())
print("\nFilled JSON Data:")
print(df_json_filled.head())
print("\nFiltered Data:")
print(filtered_data.head())
print("\nSorted Data:")
print(sorted_data.head())
print("\nGrouped Data:")
print(grouped_data.head())
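
An alternative to the mean ± 2·std rule above is an IQR-based cutoff, which is less sensitive to the outliers themselves; a minimal sketch on the same 'Sales' column (clipping to the bounds instead of replacing with the median):

# IQR-based outlier handling (alternative to the mean +/- 2*std rule above)
q1 = df_csv['Sales'].quantile(0.25)
q3 = df_csv['Sales'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
df_csv['Sales'] = df_csv['Sales'].clip(lower=lower_bound, upper=upper_bound)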




Output :












-------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------
Practical 3
Aim : Feature Scaling and Dummification
-->Apply feature-scaling techniques like
standardization and normalization to
numerical features.

-->Perform feature dummification to convert
categorical variables into numerical
representations.

Code:


import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Define the data
data = {
    'Product': ['Apple_Juice', 'Banana_Smoothie', 'Orange_Jam', 'Grape_Jelly', 'Kiwi_Parfait',
                'Mango_Chutney', 'Pineapple_Sorbet', 'Strawberry_Yogurt', 'Blueberry_Pie', 'Cherry_Salsa'],
    'Category': ['Apple', 'Banana', 'Orange', 'Grape', 'Kiwi',
                 'Mango', 'Pineapple', 'Strawberry', 'Blueberry', 'Cherry'],
    'Sales': [1200, 1700, 2200, 1400, 2000, 1000, 1500, 1800, 1300, 1600],
    'Cost': [600, 850, 1100, 700, 1000, 500, 750, 900, 650, 800],
    'Profit': [600, 850, 1100, 700, 1000, 500, 750, 900, 650, 800]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Display the original dataset
print("Original Dataset:")
print(df)

# Step 1: Feature Scaling (Standardization and Normalization)
numeric_columns = ['Sales', 'Cost', 'Profit']
scaler_standardization = StandardScaler()
scaler_normalization = MinMaxScaler()

# Apply standardization
df_scaled_standardized = pd.DataFrame(
    scaler_standardization.fit_transform(df[numeric_columns]),
    columns=numeric_columns
)

# Apply normalization
df_scaled_normalized = pd.DataFrame(
    scaler_normalization.fit_transform(df[numeric_columns]),
    columns=numeric_columns
)

# Combine the scaled (standardized) numeric features with the original categorical features
df_scaled = pd.concat([df_scaled_standardized, df.drop(numeric_columns, axis=1)], axis=1)

# Display the dataset after feature scaling
print("\nDataset after Feature Scaling (Standardization):")
print(df_scaled)

# Step 2: Feature Dummification (One-Hot Encoding)
categorical_columns = ['Product', 'Category']

# Create a column transformer for dummification
preprocessor = ColumnTransformer(
    transformers=[
        ('categorical', OneHotEncoder(), categorical_columns)
    ],
    remainder='passthrough'
)

# Apply the transformer
df_dummified = pd.DataFrame(preprocessor.fit_transform(df).toarray())

# Display the dataset after feature dummification
print("\nDataset after Feature Dummification (One-Hot Encoding):")
print(df_dummified)
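
The dummified DataFrame above ends up with plain integer column names; on recent scikit-learn versions, ColumnTransformer.get_feature_names_out() can restore readable names:

# Attach readable column names to the one-hot encoded output
df_dummified.columns = preprocessor.get_feature_names_out()
print(df_dummified.head())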





--------------------------------------------------------------------------
--------------------------------------------------------------------------
Practical 4
Aim: Hypothesis Testing
-->Formulate null and alternative
hypotheses for a given problem.
-->Conduct a hypothesis test using appropriate
statistical tests (e.g., t-test, chi-square test).
-->Interpret the results and draw conclusions
based on the test outcomes.

Code :

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Generate two samples for demonstration purposes
np.random.seed(42)
sample1 = np.random.normal(loc=10, scale=2, size=30)
sample2 = np.random.normal(loc=12, scale=2, size=30)

# Perform a two-sample t-test
t_statistic, p_value = stats.ttest_ind(sample1, sample2)

# Set the significance level
alpha = 0.05

# Display the results
print("Results of Two-Sample t-test:")
print(f"t-statistic: {t_statistic}")
print(f"p-value: {p_value}")
print(f"Degrees of Freedom: {len(sample1) + len(sample2) - 2}")

# Plot the distributions
plt.figure(figsize=(10, 6))
plt.hist(sample1, alpha=0.5, label='Sample 1', color='blue')
plt.hist(sample2, alpha=0.5, label='Sample 2', color='orange')
plt.axvline(np.mean(sample1), color='blue', linestyle='dashed', linewidth=2)
plt.axvline(np.mean(sample2), color='orange', linestyle='dashed', linewidth=2)
plt.title('Distributions of Sample 1 and Sample 2')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.legend()

# Highlight the critical region if null hypothesis is rejected
if p_value < alpha:
    critical_region = np.linspace(min(sample1.min(), sample2.min()), max(sample1.max(), sample2.max()), 1000)
    plt.fill_between(critical_region, 0, 5, color='red', alpha=0.3, label='Critical Region')

# Show the observed t-statistic
plt.text(11, 5, f'T-statistic: {t_statistic:.2f}', ha='center', va='center',
         color='black', backgroundcolor='white')

# Show the plot
plt.show()

# Draw Conclusions
if p_value < alpha:
    if np.mean(sample1) > np.mean(sample2):
        print("Conclusion: There is significant evidence to reject the null hypothesis.")
        print("Interpretation: The mean value of Sample 1 is significantly higher than that of Sample 2.")
    else:
        print("Conclusion: There is significant evidence to reject the null hypothesis.")
        print("Interpretation: The mean value of Sample 2 is significantly higher than that of Sample 1.")
else:
    print("Conclusion: Fail to reject the null hypothesis.")
    print("Interpretation: There is not enough evidence to claim a significant difference between the means.")



Output :
















---------------------------------------------------
---------------------------------------------------
Practical 5
Aim : ANOVA (Analysis of Variance)

--> Perform one-way ANOVA to compare means across
multiple groups
--> Conduct post-hoc tests to identify significant
differences between group means
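
The plotting snippets under "Code :" below are matplotlib warm-ups and do not carry out the ANOVA itself; a minimal sketch of a one-way ANOVA followed by a Tukey HSD post-hoc test, on three illustrative (randomly generated) groups:

import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Three illustrative groups (e.g., scores under three different treatments)
np.random.seed(42)
group_a = np.random.normal(loc=100, scale=10, size=30)
group_b = np.random.normal(loc=110, scale=10, size=30)
group_c = np.random.normal(loc=105, scale=10, size=30)

# One-way ANOVA: H0 = all group means are equal
f_statistic, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F-statistic: {f_statistic:.3f}, p-value: {p_value:.4f}")

# Post-hoc Tukey HSD to see which pairs of group means differ
values = np.concatenate([group_a, group_b, group_c])
labels = ['A'] * 30 + ['B'] * 30 + ['C'] * 30
print(pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05).summary())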



Code :


from matplotlib import pyplot as plt
movies = ["golmaal", "annabelle", "bhoot-uncle", "bhoothnath", "de dana dan"]
num_oscars = [5, 10, 3, 6, 8]
plt.bar(range(len(movies)), num_oscars)
plt.title("Horror Movies")
plt.ylabel("oscar award 2024")
plt.xticks(range(len(movies)), movies)
plt.show()




Output :

Code :



from matplotlib import pyplot as plt
years = [2020, 2021, 2022, 2023, 2024]
failurepercentrates = [60, 70, 50, 10, 0]
plt.plot(years, failurepercentrates, color="green", marker="o", linestyle="solid")
plt.title("corona times success rates")
plt.ylabel("percentage rates")
plt.show()



Output :






Code :



from matplotlib import pyplot as plt
from collections import Counter
totalnumber = [83, 95, 91, 67, 70, 100]
histogram = Counter(min(score // 10 * 10, 90) for score in totalnumber)
plt.bar([x + 5 for x in histogram.keys()], histogram.values(), 10, edgecolor=(0, 0, 0))
plt.axis([-5, 105, 0, 5])
plt.xticks([10 * i for i in range(11)])
plt.xlabel("total_score")
plt.ylabel("Number of students")
plt.title("Distribution of exam 1 marks")
plt.show()


Output :















--------------------------------------------------
--------------------------------------------------
Practical 6
Aim : Regression and Its Types
--> Implement simple linear regression using a
dataset.
-->Explore and interpret the regression model
coefficients and goodness-of-fit measures.
-->Extend the analysis to multiple linear
regression and assess the impact of additional
predictors




Code :



#Practical 6
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
# Load the dataset
df = pd.read_csv('/content/sample_data/diabetes (3).csv')
# Simple Linear Regression
X_simple = df[['Age']]
y_simple = df['Pregnancies']
X_train_simple, X_test_simple, y_train_simple, y_test_simple = train_test_split(X_simple,y_simple, test_size=0.2, random_state=0)
regressor_simple = LinearRegression()
regressor_simple.fit(X_train_simple, y_train_simple)
# Predictions on the test set
y_pred_simple = regressor_simple.predict(X_test_simple)
# Model evaluation
print('Simple Linear Regression:')
print('Intercept:', regressor_simple.intercept_)
print('Coefficient:', regressor_simple.coef_)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test_simple, y_pred_simple))
print('Mean Squared Error:', metrics.mean_squared_error(y_test_simple, y_pred_simple))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test_simple, y_pred_simple)))
print('R-squared:', metrics.r2_score(y_test_simple, y_pred_simple))
# Visualization for Simple Linear Regression
plt.scatter(X_simple, y_simple, color='gray')
plt.plot(X_simple, regressor_simple.predict(X_simple), color='red', linewidth=2)
plt.title('Simple Linear Regression')
plt.xlabel('Age')
plt.ylabel('Pregnancies')
plt.show()
# Multiple Linear Regression
X_multi = df[['Glucose', 'BloodPressure', 'Insulin']]
y_multi = df['Outcome']
X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(X_multi, y_multi,test_size=0.2, random_state=0)
regressor_multi = LinearRegression()
regressor_multi.fit(X_train_multi, y_train_multi)
# Predictions on the test set
y_pred_multi = regressor_multi.predict(X_test_multi)
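
The multiple model above stops at the predictions; the same goodness-of-fit measures used for the simple model can be applied to assess how much the additional predictors help:

# Model evaluation for Multiple Linear Regression
print('\nMultiple Linear Regression:')
print('Intercept:', regressor_multi.intercept_)
print('Coefficients:', regressor_multi.coef_)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test_multi, y_pred_multi))
print('Mean Squared Error:', metrics.mean_squared_error(y_test_multi, y_pred_multi))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test_multi, y_pred_multi)))
print('R-squared:', metrics.r2_score(y_test_multi, y_pred_multi))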






Output:













--------------------------------------------------
--------------------------------------------------
Practical 7


Aim : Logistic Regression and Decision Tree
--> Build a logistic regression model to predict a binary outcome.
--> Evaluate the model's performance using classification metrics (e.g., accuracy, precision, recall).
--> Construct a decision tree model and interpret the decision rules for classification.
Code :

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

# Load the dataset
df = pd.read_csv('/content/sample_data/diabetes (2).csv')

# Use 'Outcome' as the binary target variable and the remaining columns as features
X = df.drop(columns=['Outcome'])
y = df['Outcome']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Logistic Regression model
log_reg_model = LogisticRegression(max_iter=1000)
log_reg_model.fit(X_train_scaled, y_train)
y_pred_log_reg = log_reg_model.predict(X_test_scaled)

# Decision Tree model
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train_scaled, y_train)
y_pred_dt = dt_model.predict(X_test_scaled)

# Evaluation - Logistic Regression
print("Logistic Regression:")
print("Accuracy:", accuracy_score(y_test, y_pred_log_reg))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_log_reg))
print("Classification Report:")
print(classification_report(y_test, y_pred_log_reg, zero_division=1))

# Evaluation - Decision Tree
print("\nDecision Tree:")
print("Accuracy:", accuracy_score(y_test, y_pred_dt))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_dt))
print("Classification Report:")
print(classification_report(y_test, y_pred_dt, zero_division=1))

# Plot confusion matrices
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.heatmap(confusion_matrix(y_test, y_pred_log_reg), annot=True, cmap='Blues', fmt='g')
plt.title('Confusion Matrix - Logistic Regression')
plt.xlabel('Predicted')
plt.ylabel('True')

plt.subplot(1, 2, 2)
sns.heatmap(confusion_matrix(y_test, y_pred_dt), annot=True, cmap='Blues', fmt='g')
plt.title('Confusion Matrix - Decision Tree')
plt.xlabel('Predicted')
plt.ylabel('True')

plt.tight_layout()
plt.show()
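
The aim also asks for the decision rules; a short sketch using sklearn's export_text on the fitted tree (note that the split thresholds are in standardized units because the model was trained on scaled features):

from sklearn.tree import export_text

# Print the learned decision rules, truncated to depth 3 for readability
print(export_text(dt_model, feature_names=list(X.columns), max_depth=3))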











Output :











---------------------------------------------------------------------------
---------------------------------------------------------------------------
Practical 8
Aim : K-Means Clustering

--> Apply the K-Means algorithm to group similar data points into
clusters.
--> Determine the optimal number of clusters using elbow method or
silhouette analysis.
--> Visualize the clustering results and analyze the cluster
characteristics



Code :


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Generate sample data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

# Determine the optimal number of clusters using silhouette analysis
silhouette_scores = []
for n_clusters in range(2, 11):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)  # Explicitly setting n_init
    cluster_labels = kmeans.fit_predict(X)
    silhouette_avg = silhouette_score(X, cluster_labels)
    silhouette_scores.append(silhouette_avg)

# Plot silhouette scores to determine the optimal number of clusters
plt.plot(range(2, 11), silhouette_scores, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Analysis')
plt.show()

# Choose the optimal number of clusters based on the silhouette score
n_clusters = silhouette_scores.index(max(silhouette_scores)) + 2  # Adjusted for 0-based indexing

# Apply K-Means clustering with the optimal number of clusters
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
kmeans.fit(X)
cluster_labels = kmeans.labels_

# Visualize the clustering results
plt.scatter(X[:, 0], X[:, 1], c=cluster_labels, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='*', s=300, alpha=0.5)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('K-Means Clustering')
plt.show()

# Analyze the characteristics of each cluster
cluster_df = pd.DataFrame(X, columns=['Feature 1', 'Feature 2'])
cluster_df['Cluster'] = cluster_labels
cluster_summary = cluster_df.groupby('Cluster').mean()
print(cluster_summary)
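
The aim mentions the elbow method as an alternative to silhouette analysis; a minimal sketch that plots inertia (within-cluster sum of squares) against the number of clusters on the same X:

# Elbow method: look for the "bend" in the inertia curve
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia (Within-Cluster Sum of Squares)')
plt.title('Elbow Method')
plt.show()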




Output :







--------------------------------------------------
--------------------------------------------------
Practical 9
Aim : Principal Component Analysis (PCA)
--> Perform PCA on a dataset to reduce
dimensionality.
--> Evaluate the explained variance and
select the appropriate number of principal
components.
--> Visualize the data in the reduced-dimensional
space.





Code:




import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Perform PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Calculate explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_
print("Explained Variance Ratio:", explained_variance_ratio)

# Visualize the data in the reduced-dimensional space
plt.figure(figsize=(8, 6))
for i in range(len(iris.target_names)):
    plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1], label=iris.target_names[i])

plt.title('PCA of Iris Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()
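
To actually choose the number of components rather than fixing it at 2, a common approach is to look at the cumulative explained variance across all components; a small sketch reusing X:

# Fit PCA with all components and inspect cumulative explained variance
pca_full = PCA().fit(X)
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)
print("Cumulative explained variance:", cumulative_variance)

# Smallest number of components covering, say, 95% of the variance
n_components_95 = int(np.argmax(cumulative_variance >= 0.95)) + 1
print("Components needed for 95% variance:", n_components_95)

plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Choosing the Number of Principal Components')
plt.show()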




Output :



---------------------------------------------------------------------------
---------------------------------------------------------------------------
Practical No. 10

Aim : Data Visualization and Storytelling
-->Create meaningful visualizations using data
visualization tools

-->Combine multiple visualizations to tell a
compelling data story.

-->Present the findings and insights in a clear
and concise manner.

Code :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load sample dataset (you can replace this with your own dataset)
df = sns.load_dataset('tips')

# Explore the dataset
print(df.head())

# Visualization 1: Distribution of Total Bill Amount
plt.figure(figsize=(10, 6))
sns.histplot(df['total_bill'], kde=True)
plt.title('Distribution of Total Bill Amount')
plt.xlabel('Total Bill Amount')
plt.ylabel('Frequency')
plt.show()

# Visualization 2: Relationship between Total Bill and Tip Amount
plt.figure(figsize=(10, 6))
sns.scatterplot(x='total_bill', y='tip', data=df, hue='sex')
plt.title('Relationship between Total Bill and Tip Amount')
plt.xlabel('Total Bill Amount')
plt.ylabel('Tip Amount')
plt.legend(title='Sex')
plt.show()

# Visualization 3: Box plot of Total Bill Amount by Day
plt.figure(figsize=(10, 6))
sns.boxplot(x='day', y='total_bill', data=df)
plt.title('Box plot of Total Bill Amount by Day')
plt.xlabel('Day')
plt.ylabel('Total Bill Amount')
plt.show()

# Visualization 4: Count of Customers by Day and Time
plt.figure(figsize=(10, 6))
sns.countplot(x='day', hue='time', data=df)
plt.title('Count of Customers by Day and Time')
plt.xlabel('Day')
plt.ylabel('Count of Customers')
plt.legend(title='Time')
plt.show()

# Data Storytelling
print("\nInsights:")
print("1. The distribution of total bill amounts is right-skewed, with most bills falling between $10 and $20.")
print("2. There is a positive relationship between total bill amount and tip amount, with some variations based on gender.")
print("3. Total bill amounts tend to be higher on Saturdays compared to other days.")
print("4. The count of customers is higher during dinner time compared to lunchtime on all days.")





Output :













