Please note! This essay has been submitted by a student.
A regression algorithm can basically be used to identify the relationship between two different features in the dataset. This relationship between these two variables is purely on the arithmetical association between them and hence it is non-deterministic. While using this algorithm, we need to identify the correct set of attributes from the dataset that will help us correctly predict the outcome for a given test data. Another important thing that we need to consider is to identify and understand the attributes that adds significant weight to the prediction. An interesting thing is to understand the impact of changes when we add or remove an attribute from the predictor set and hence it can be used to predict the trend. There are various regression techniques like the simple linear regression, multiple linear regression, logistic regression, etc.
We can use Naive Bayes algorithm when the dimensionality of the predictor set is more. We will have a set of labels and a training data set. As new data come for prediction, it is compared with the training data and the probability of the match is estimated. The label with the highest probability is returned as the prediction. Naïve Bayes works based on the concept of probability. It can help us in predicting the occurrence of a peak usage given a probability of occurrence of another peak electricity usage. This is one of the simple classification models. For the complexity of the data set we are working with and the number of attributes in the data, Naïve Bayes would take too long for the model to get trained.
The logistic regression is an algorithm that is suitable for classification when the number of predicting variables is two. It is a predictive analysis algorithm that can explain the relation between one dependent variable and more than one independent attributes. The dependent variable is said to be binary in nature. When we have two predictors, it is required that these two attributes should not have any relationship or in other words, the correlation between these two predicting attributes should be less. This will add to the accuracy of the prediction. It is certainly important to take into account the number of independent variables we add to the prediction list. For example, when we add more and more attributes to the prediction list, the model becomes overfit and when the test data is not present in the training data, the model will be able to predict the outcome. So, before we work on the logistic regression process, we need to identify the right variables that are independent of each other.
KNN is an algorithm for prediction. We need to be careful in choosing the value of K. The K is number of distinct clusters or boundaries with in the data. The distance between the test and the data is calculated using the Euclidian or Manhattan distance and centroids are calculated. The process goes through a number of iterations and this helps exact classification of the boundaries. When an outcome has to be predicted, we need to calculate the distance and identify the cluster in which the data falls which is returned as the predicted outcome.
Random Forest is a hybrid technique that can be seen as an ensemble of a few techniques. For example, we could run a logistic regression, KNN and a decision tree and then take vote for the majority to predict it as outcome. This algorithm can use multiple decision trees and merge the output for better accuracy. We plan to select a few significant parameters like date, time range and construct multiple decision trees to predict an outcome quickly.
As of the improvement in the technology, the companies and research institutes are able to process massive amount of data in a few minutes using GPUs and cloud storage. We are able to process more complex data sets and are able to achieve better accuracy as we have no dearth for the processing and computing power. There has been a substantial increase in the scale of the data. Hence, it becomes very demanding for the process to extract the meaningful information from the data. This process is done with the objective of finding the methodical relationships between the successions of the events and association of the arrangement in the dataset. So, if a new event is said to occur, it can be matched with all of the prior events and useful results can be obtained. A trivial change in the specimen of the data could make radical difference in the result of the data mining models.
This process starts with the key part of understanding the data set, classifying the characteristics and their importance. The investigation of the data is said to be an important phase as it is important for us to eliminate the repetition in the data. This phase of exploration is an iterative but a systematic and a logical step in decluttering of the data which may involve cleaning the data, making transformations in the data. All data in the data store should be understandable by the machine and before we build a model, we need to transform the data. Once we complete the conversion and the exploration of the data, we need to build the data mining model. The most apparent objective any data mining-oriented research will be to predict the occurrence or the characteristics of an event. This model would help in identifying the pattern or the trend that is hidden in the data. This has to be verified against the test data for the correctness and the accuracy of the model. The final step is to deploy the model in production setup so that the model can be fully functional.
While there are many methods that are considered to be effective in this field of data mining, it is very imperative to choose the algorithm wisely. We need to clearly determine the motivation of the project before we work with the dataset and the analysis upon it. While there are several motivations, we definitely wanted to work with a dataset associated with problems whose solutions could have immense real-life impacts but would still allow for meaningful work on our end given the time constraint and scope of our project. Second, we discovered existing literature in the field that would help us hit the ground running when tackling those problems, enabling this project to be a great learning experience for us. Third, we found that the dataset struck a good balance between having enough complexity in its size and composition and being small enough to being feasible to work with on our machines.