Before understanding what “data mining” is, we need to understand what “data” is. Data can be defined as basic values or facts collected from individuals or organizations.
To see why data mining is required, consider the scale involved: about 700 million tweets are generated every day. A volume like this is very difficult to track and analyze by hand, which is why we need data mining and its algorithms to find patterns and make predictions.
But with so much data in circulation, fraud and other illegal practices become easier to commit. That is why data mining draws on statistical techniques, artificial intelligence, forensic analytics, link analysis, Bayesian networks, decision theory, and sequence matching to check whether data is consistent and verified.
Data mining in fraud detection covers lie detection, intrusion detection, criminal investigation, and the detection of other fraudulent transactions.
Data mining is the process of extracting usable information from a large set of raw data by examining patterns in it with algorithms or software.
It lets us analyze datasets from different perspectives, categorize them, and summarize the relationships they contain so that the data can be used effectively.
It also involves the efficient collection, storage, and processing of data, and it uses complex mathematical algorithms to categorize data and to estimate the likelihood of future events. Data mining is also known as Knowledge Discovery in Data (KDD). It helps bring different aspects of the data together in one place for examination.
What can it do?
- Helps identify shopping patterns.
- Improves website optimization.
- Benefits marketing campaigns.
- Determines customer groups.
- Helps measure profitability factors.
- Increases brand loyalty.
The essential purpose of the data mining process is to find the relevant information in a given database and present it in a simpler form that others can understand easily. Its main advantages are:
- Increases clients’ trust.
- Identifies hidden profitability.
- Minimizes clients’ involvement.
- Improves customer satisfaction.
- Can predict future trends.
- Reveals customer habits.
- Helps in decision making.
- Increases company revenue.
- Builds on market-based analysis.
- Enables quick fraud detection.
Data mining also has disadvantages:
- It can violate user privacy.
- It can surface additional, irrelevant information.
- The information can be misused.
- The accuracy of the data has limits.
PROCESS AND ALGORITHMS
A data mining process goes through the following phases.
- Selection and Sampling
This phase requires a good understanding of the raw data in order to identify the goals of the exercise, create a target dataset, and select the subset of data that will actually be used.
- Pre-processing and Cleaning
This phase cleans and pre-processes the raw data by removing irregularities and noise.
- Transformation and Reduction
The data is transformed and reduced so that the useful information is kept and the volume that is not needed is discarded.
- Mining, Visualization and Evaluation
This phase includes choosing a mining task that matches the goal of the exercise, such as classifying data into subsets, making predictions, or analyzing trends, then fitting a suitable model and evaluating the results.
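The phases above can be sketched in a few lines of Python. This is only an illustration under assumptions, not a prescribed implementation: the record fields (`age`, `spend`, `notes`), the cleaning rule, the age bands, and the "average spend per band" mining step are all hypothetical and chosen to keep the example small.

```python
from statistics import mean

# Hypothetical raw records; None and negative values stand in for noise.
raw = [
    {"id": 1, "age": 34, "spend": 120.0, "notes": "call back"},
    {"id": 2, "age": None, "spend": 80.0, "notes": ""},
    {"id": 3, "age": 51, "spend": -5.0, "notes": "vip"},
    {"id": 4, "age": 29, "spend": 300.0, "notes": ""},
    {"id": 5, "age": 62, "spend": 40.0, "notes": ""},
]

def select(records, fields):
    """Selection: keep only the attributes relevant to the goal."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Pre-processing and cleaning: drop rows with missing or impossible values."""
    return [r for r in records if r["age"] is not None and r["spend"] >= 0]

def transform(records):
    """Transformation and reduction: replace the exact age with a coarse age band."""
    return [
        {"band": "under_40" if r["age"] < 40 else "40_plus", "spend": r["spend"]}
        for r in records
    ]

def mine(records):
    """Mining: a very simple pattern, the average spend per age band."""
    bands = {}
    for r in records:
        bands.setdefault(r["band"], []).append(r["spend"])
    return {band: mean(values) for band, values in bands.items()}

selected = select(raw, ["age", "spend"])
cleaned = clean(selected)
reduced = transform(cleaned)
summary = mine(reduced)
print(summary)
```

Each function maps to one phase, so a real project could swap any step (for example a different cleaning rule) without touching the others.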
An algorithm solves a problem using logical or mathematical techniques. Many predefined algorithms are available for different data mining tasks.
- C4.5 – It builds a classifier in the form of a decision tree, a tool that takes in data and attempts to predict which class it belongs to. To do this, it is given a set of data that is already classified. Example – a database of patients. For each patient we have details such as age, past diseases, blood group, and family history, which are called characteristics or attributes. Given these, we want to predict whether a patient will get a specific disease. The patient can fall into 1 of 2 classes: will get it or won’t. C4.5 is first told the class of each patient. Using this data, it constructs a decision tree that can predict the class of new patients from their attributes.
- K-means – It creates a number of groups (say k) from a set of data so that the members of one group are similar to each other. It is a cluster analysis technique, that is, one of a family of algorithms that form groups whose members are more similar to each other than to non-members. Example – We have a dataset of patients; in cluster analysis, these are called observations. We know several things about each patient, such as age, blood group, pulse, blood pressure, and cholesterol, which together form a vector representing the patient. We tell k-means how many clusters we want, and it does the rest by repeatedly assigning each observation to the nearest cluster center and recomputing the centers.
- Support vector machines – This algorithm learns a hyperplane, a subspace with one dimension fewer than the space around it, to classify data into 2 classes. An SVM performs a task similar to C4.5 but does not use decision trees. When the classes cannot be separated directly, it projects the data into a higher dimension and then finds the best hyperplane that splits the data into the 2 classes. Example – A bunch of big and small rocks are placed on a floor. If the rocks aren’t too mixed together, we can take a rod and, without moving the rocks, separate them with it. When a new rock is added to the floor, we can predict its size by checking which side of the rod it is on. The rocks represent data points, the big and small sizes represent the 2 classes, and the rod represents the hyperplane.
- Apriori – This algorithm learns association rules and is applied to databases containing a large number of transactions. Association rule learning is a data mining technique for finding relationships and links among the items in a database. Example – We have a database of pharmaceutical purchases. It can be viewed as a large table where each row is a patient’s purchase and each column represents a different medicine. By applying the algorithm, we can learn which medicines are bought together. The basic Apriori algorithm is a 3-step process: join, prune, and repeat.
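The join, prune, and repeat steps of Apriori can be sketched as follows. This is a minimal illustration for small in-memory data, not an optimized implementation. The medicine names, the `apriori` function name, and the choice of expressing minimum support as a raw count of purchases are assumptions made for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least min_support rows."""
    rows = [frozenset(t) for t in transactions]

    def frequent_only(candidates):
        # Count how many rows contain each candidate, keep the frequent ones.
        counts = {c: sum(1 for row in rows if c <= row) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    # Start with frequent single items.
    current = frequent_only({frozenset([item]) for row in rows for item in row})
    frequent = dict(current)
    k = 2
    while current:
        previous = set(current)
        # Join: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # Prune: discard candidates that contain an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, k - 1))
        }
        # Repeat: count the survivors and move on to larger itemsets.
        current = frequent_only(candidates)
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical pharmacy purchases, one set of medicines per purchase.
purchases = [
    {"aspirin", "antacid"},
    {"aspirin", "antacid", "vitamin_c"},
    {"aspirin", "vitamin_c"},
    {"antacid", "cough_syrup"},
    {"aspirin", "antacid", "cough_syrup"},
]
result = apriori(purchases, min_support=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With a minimum support of 3 purchases, only aspirin, antacid, and the pair of both remain, which is the "bought together" pattern the example is after.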
Apart from these, other well-known data mining algorithms include EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART, which address tasks such as classification, clustering, and ranking.
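To make the k-means description above more concrete, here is a small sketch of its assign-and-update loop. It is an illustration under assumptions: the patient vectors (age, systolic blood pressure) are invented, the features are not scaled, and the starting centers are simply the first k points, whereas practical implementations usually choose starting centers more carefully.

```python
from math import dist  # Euclidean distance, available from Python 3.8

def kmeans(points, k, iterations=100):
    """Group points into k clusters; returns (centroids, clusters)."""
    centroids = [tuple(p) for p in points[:k]]  # naive start: first k points
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved, so we have converged
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical observations: (age, systolic blood pressure) per patient.
patients = [(25, 110), (62, 150), (30, 115), (28, 112), (65, 155), (58, 148)]
centroids, clusters = kmeans(patients, k=2)
for center, members in zip(centroids, clusters):
    print(center, members)
```

On this data the loop separates the three younger patients with lower readings from the three older patients with higher readings.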
Data mining is mainly used by companies and organizations with a strong customer focus, such as financial, retail, and marketing organizations, so that they can concentrate on clients’ needs and preferences. With it, an organization can use a customer’s previous transactions to promote the products and offers that appeal to specific consumer groups.
Significant sectors where data mining is extensively used:
- Future Healthcare – Data analysis is used to work out best practices that reduce costs and improve care, using approaches such as multi-dimensional databases, artificial intelligence, and data visualization. It can also forecast the future volume of patients and help health insurance agencies detect fraud and abuse.
- Market Basket Analysis – It is based on the idea that if we purchase a specific group of items, we are more likely to purchase another group of items. With this, a seller can understand a consumer’s buying behavior, get to know the consumer’s needs better, and change the store’s aisle layout accordingly.
- Education – It is used to predict students’ future learning behavior. An educational institution can also use it to make informed decisions and anticipate a student’s results, so that it can adapt the way the student is taught.
- Customer Relationship Management – It is about gaining clients while also improving their loyalty by implementing customer-focused strategies. To maintain a good relationship with a client, an organization collects and analyzes information and learns where to focus in order to retain the client.
- Apart from these, data mining is also used in customer segmentation, banking, corporate surveillance, research analysis, and bioinformatics, along with fraud detection.
DATA MINING AND FRAUD DETECTION
As the amount of data increases, it becomes easier for data to be tampered with and used to cause harm. Data mining helps prevent this through its algorithms and thorough processing. Millions of dollars have already been lost to fraud. Fraud detection is relevant to numerous sectors, including banking and finance, insurance agencies, government organizations, and even the police.
Example and Prevention
- In the banking sector, fraud involves using stolen credit cards, forged checks, misleading accounting practices, etc. Fraud in the insurance sector ranges from claims for false losses to intentionally causing an accident in order to claim damages.
- Older methods of fraud detection were time-consuming and complicated. Data mining has helped by uncovering meaningful patterns and turning raw data into useful information. Data mining and statistics help to anticipate and quickly detect fraud so that action can be taken to reduce losses. With data mining tools, billions of transactions and purchases can be analyzed to spot irregularities and detect fraudulent purchases.
- An ideal fraud detection system should protect the data of every user. To train a fraud detection model, a collection of sample transactions is gathered and each one is labeled as fraudulent or non-fraudulent. A model is then built from these records, and the software uses it to classify whether new incoming data is fraudulent or not.
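The label, train, and classify workflow described in the last point can be sketched as follows. This is a toy illustration, not a production fraud model. The labeled transactions are invented, the single feature `amount` is hypothetical, and the learner is a one-level decision tree (a decision stump) that picks its threshold by information gain. C4.5 rests on the same entropy idea but uses gain ratio and grows full trees.

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum(
        (labels.count(v) / total) * log2(labels.count(v) / total)
        for v in set(labels)
    )

def majority(labels):
    """Most common label; sorting makes tie-breaking deterministic."""
    return max(sorted(set(labels)), key=labels.count)

def train_stump(amounts, labels):
    """Pick the amount threshold with the highest information gain.

    Assumes at least two distinct amounts are present.
    """
    base = entropy(labels)
    best_gain, best = -1.0, None
    values = sorted(set(amounts))
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        below = [l for a, l in zip(amounts, labels) if a <= threshold]
        above = [l for a, l in zip(amounts, labels) if a > threshold]
        remainder = (len(below) * entropy(below)
                     + len(above) * entropy(above)) / len(labels)
        gain = base - remainder
        if gain > best_gain:
            best_gain = gain
            best = {"threshold": threshold,
                    "below": majority(below),
                    "above": majority(above)}
    return best

def classify(model, amount):
    """Label a new transaction using the trained stump."""
    return model["below"] if amount <= model["threshold"] else model["above"]

# Hypothetical labeled sample: transaction amounts and their known class.
amounts = [12, 25, 40, 60, 75, 900, 1200, 1500]
labels = ["ok", "ok", "ok", "ok", "ok", "fraud", "fraud", "fraud"]

model = train_stump(amounts, labels)
print(model)
print(classify(model, 50), classify(model, 1000))
```

A real system would use many features and a far larger labeled sample, but the shape is the same: labeled records in, a model out, and new records classified by that model.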
Common Frauds and How Data Mining Helps
- Intrusion Detection – Any operation that endangers the integrity and security of data is an intrusion. Measures such as user verification, avoiding programming errors, and protecting data should be taken to prevent it. Data mining can make intrusion detection more efficient by focusing on anomaly detection, which helps an analyst distinguish fraudulent activity from legitimate activity.
- Lie Detection – Catching a criminal is the easy part, whereas getting the truth out of them is hard. Police use data mining to investigate offences and to monitor the communications of suspected terrorists. They also use text mining to identify and apprehend criminals. This method involves finding useful patterns in raw data, which is typically unstructured text. The data samples collected from earlier investigations are compared and a model for lie detection is built. With this model, procedures can be designed accordingly.
- Criminal Investigation – This comprises the exploration and detection of crimes and their relationships with criminals. Because of the large volume of crime data and the complexity of the relationships between crimes and criminals, criminology is a very suitable field for data mining.
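As a small illustration of the inconsistency detection mentioned under intrusion detection, the sketch below flags values that sit far from a baseline of normal activity. It is a simplified example under assumptions: the metric (failed logins per hour), the history values, and the cut-off of 3 standard deviations are all hypothetical, and real intrusion detection systems combine many more signals.

```python
from statistics import mean, stdev

def fit_baseline(history):
    """Summarize normal behavior as (mean, sample standard deviation)."""
    return mean(history), stdev(history)

def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value whose z-score against the baseline exceeds the threshold."""
    mu, sigma = baseline
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical history of failed logins per hour during normal operation.
history = [4, 6, 5, 7, 5, 6, 4, 5]
baseline = fit_baseline(history)

# New observations to check against the baseline.
flags = {value: is_anomalous(value, baseline) for value in [5, 8, 40]}
print(flags)
```

Here a reading of 8 stays within the normal range while 40 is flagged, which is the kind of case an analyst would then review.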
Further work on data mining and its analysis techniques could drive rapid growth and give rise to many new kinds of industry. Just as the industrial revolution brought immense change and accelerated the development of countries and technology, perfecting the ideas behind data mining could transform the world by enabling a more efficient lifestyle and lowering the amount of fraud committed each year. However, present-day technology still has many deficiencies when it comes to safeguarding, analyzing, storing, and processing the data generated from different sources. The big data analysis industry also faces a shortage of data scientists and analysts who have mastered, or have solid experience with, data mining in distributed, open source, or licensed environments. This is due to reluctance to adopt big data mining and a lack of adequate knowledge about it and the technologies used to process it.
As the importance and applications of data mining grow, the methods and techniques used to examine and analyze data will carry ever greater implications and are likely to grow rapidly as well.