Using CV2 techniques for image transformation

(Source: By Author)

What is OpenCV?

OpenCV is an open-source library for computer vision and machine learning. It is mainly aimed at real-time computer vision and image processing, and it provides a wide range of operations for transforming images.

It supports multiple languages, including Python, C++, and Java, and runs on platforms such as Android. It is easy to use and in demand due to its features, and it is widely used for building image-processing and rendering applications in different languages.

In this article, we will try to perform some image transformations using cv2, OpenCV's Python module. …


Performing Data Visualization using PySpark

Photo by William Iven on Unsplash

Data Visualization

Data visualization plays an important role in data analysis because, as soon as the human eye sees a chart or graph, it starts looking for patterns in it.

Data visualization means visually representing data using different plots, graphs, and charts to find patterns, outliers, and relations between the different attributes of a dataset. In short, it is the graphical representation of data.

Data Visualization Using PySpark

We can perform data visualization using PySpark, but before that, we need to set it up on our local machine. …


Performing SQL operations on Datasets using PySpark

Photo by Franki Chamaki on Unsplash

What is SQL (Structured Query Language)?

SQL is a language used to perform different operations on data, such as storing, manipulating, and retrieving it. It works on relational databases, in which data is stored in the form of rows and columns.

SQL commands can be classified into three types according to their properties:

1. DDL (Data Definition Language)

As the name suggests, DDL commands are used to define the structure of the data. The commands included in DDL are CREATE, ALTER, TRUNCATE, DROP, etc.

2. DML (Data Manipulation Language)

Data manipulation commands are used to alter and update the data according to user requirements. …


Exploratory Data Analysis using PySpark

Photo by Markus Spiske on Unsplash

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is the most crucial first step whenever we are working with a dataset. It allows us to analyze the data and explore initial findings, such as how many rows and columns there are, what the different columns mean, etc. EDA is an approach in which we summarize the main characteristics of the data using different methods, mainly visualization.

Let’s start EDA using PySpark. If you have not yet installed PySpark, kindly visit the link below and get it configured on your local machine.

Importing Required Libraries and Dataset

Once we have configured PySpark on our machine, we can use a Jupyter Notebook to start exploring it. In this article, we will perform EDA operations using PySpark; for this, we will be using the Boston housing dataset, which can be downloaded from Kaggle. Let’s start by importing the required libraries and loading the dataset. …


PySpark Installation on Windows from scratch

Photo by Markus Spiske on Unsplash

Apache Spark

Apache Spark is a platform for Big Data processing that provides the capability of processing huge amounts of data at scale. It is a data analytics engine for big data, with built-in modules for SQL, streaming, machine learning, and graph processing.

PySpark was released to provide an interface between Spark and Python, so that both can be used together for fast data processing. PySpark is the Python API for Spark.

In simpler terms, Spark is a general-purpose distributed computation engine that can run across multiple servers in a coherent way: it can read distributed datasets and process that data based on the code you have written to run within the Spark ("engine"). …


Understanding different ways of finding Feature Importance

Photo by Franki Chamaki on Unsplash

Machine learning model performance is the most important factor in selecting a particular model. In order to select a machine learning model, we can look at certain metrics that help us pick the model with the highest accuracy and minimum error.

Feature variables play an important role in creating predictive models, whether regression or classification models. Having a large number of features is not good because it may lead to overfitting, which makes our model fit only the data it was trained on. A large number of features also causes the curse of dimensionality, i.e. …
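One of the simplest ways to measure feature importance, and a likely starting point for the article, is the impurity-based importances a fitted tree ensemble exposes in scikit-learn: one non-negative weight per feature, summing to 1. A minimal sketch on the iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
X, y = data.data, data.target

# After fitting, the ensemble exposes one importance weight per feature
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = model.feature_importances_

for name, imp in zip(data.feature_names, importances):
    print(f"{name}: {imp:.3f}")
```

Features with near-zero weight are candidates for removal, which directly addresses the overfitting and dimensionality concerns above.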


Using Yellowbrick Visualizations for analyzing Model Performance

Photo by Hunter Harritt on Unsplash

Machine learning is the study of computer algorithms that improve automatically through experience. There are a large number of machine learning algorithms to choose from, depending on the problem and the dataset we are dealing with.

Machine learning model performance is the most important factor in selecting a particular model. In order to select a machine learning model, we can look at certain metrics that help us pick the model with the highest accuracy and minimum error. Beyond these metrics, the factors that best show model performance are different types of visualizations. …


GUI for Analyzing Pandas Dataframe

Photo by William Iven on Unsplash

Exploratory Data Analysis (EDA) is the most crucial first step whenever we are working with a dataset. It allows us to analyze the data and explore initial findings, such as how many rows and columns there are, what the different columns mean, etc. EDA is an approach in which we summarize the main characteristics of the data using different methods, mainly visualization.

EDA is an important and crucial step whenever you are working with data; exploring the data and finding out what it is all about can take up almost 30% of the total project time. EDA also tells us how to preprocess the data before modeling. …
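A GUI such as the one the article introduces essentially wraps the standard pandas inspection calls. A minimal sketch of those underlying calls, with an illustrative DataFrame, shows what the GUI surfaces automatically:

```python
import pandas as pd

# An illustrative DataFrame standing in for a real dataset
df = pd.DataFrame({
    "age": [34, 45, 29, 41],
    "income": [48000, 61000, 39000, 52000],
})

print(df.shape)         # (rows, columns)
print(df.dtypes)        # column types
print(df.describe())    # summary statistics
print(df.isna().sum())  # missing values per column
```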


Automatically Selecting the best machine learning model for any given dataset

Photo by Fatos Bytyqi on Unsplash

Machine learning provides the advantage of algorithms that improve automatically with experience. There are numerous machine learning algorithms and techniques, and generally we need to test most of them in order to find the best prediction model for our dataset, i.e. the one with the highest accuracy.

Most machine learning methods, such as regression and classification techniques, are defined in Sklearn, but in order to select which technique is best for our problem statement or dataset, we need to try out all these models, along with hyperparameter tuning, and find the best-performing one. …
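The "try them all" loop described above can be sketched directly in scikit-learn: cross-validate each candidate and keep the best score. This is a minimal hand-rolled version, not the specific library the article covers:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "tree": DecisionTreeClassifier(random_state=0),
}

# Score every candidate with 5-fold cross-validation and keep the best
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Automated model-selection packages add hyperparameter search on top of the same loop.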


Python Package for speed-starting Machine Learning Projects

Photo by Chris Liverani on Unsplash

Machine learning gives a system the ability to learn automatically and improve from experience. There are a large number of machine learning algorithms available, and it is really difficult to test them all in order to find the best model for your dataset or problem statement. Besides this, we need to prepare the data before feeding it to the model: we need to analyze it and find the patterns, anomalies, etc. that it contains. …

About

Himanshu Sharma

An aspiring data scientist, passionate about data visualization, with an interest in the finance domain.
