MapReduce is an extensively used programming model for large-scale data-parallel applications in the cloud. Hadoop, the widely adopted open-source implementation of the MapReduce framework, is used on both homogeneous and heterogeneous clusters, and its performance is closely tied to the cluster nodes. One major snag of running Hadoop in a heterogeneous environment is the appearance of stragglers. Task stragglers dramatically impede parallel job execution for data-intensive computing in cloud datacentres, owing to the uneven distribution of input data that results from heterogeneous data nodes, resource contention and network configuration. The outcome is low performance and efficiency of the framework.
In this research paper, we derive a framework to address these challenges that identifies straggler nodes in a timely manner, using a resource-awareness mechanism for the job scheduler combined with a machine learning algorithm. We also discuss various other mitigation techniques that can be incorporated in data-intensive cloud computing to identify straggling MapReduce tasks. Our strategy focuses on providing a framework that delivers high performance and efficiency with effective load balancing by accurately identifying the slower straggler nodes.
Modern society is in an era of data explosion. With new technologies emerging at an ever faster pace, users are actively participating in the Internet world, and as a result abundant data is generated each day. However, how to utilise and analyse this large volume of data effectively is not well understood, and this has become a major research problem. The sheer volume of data that Internet services work with has led to interest in parallel processing on commodity clusters. One example is Google's use of the MapReduce framework, which processes petabytes of data daily according to Jeffrey et al. (2008). Commonly available services such as web indexing, data mining, social networking websites such as Facebook and Instagram, and scientific simulation also generate massive volumes of data. Analysing this data enables operators and developers to find patterns in system logs and thereby increase revenue through proper diagnosis in production. One of the greatest challenges faced by the computing world is to store this abundant data securely for effective querying, analysis and utilisation. One major solution proposed was to use commodity machines to build a distributed datacentre that meets huge storage and processing demands at low cost.
MapReduce, a distributed programming model developed by Google for processing large datasets, is now widely used in the computing world. The framework offers the attractive feature of handling tasks automatically by distributing them simultaneously across machines, with the job scheduler managing inter-machine communication and making efficient use of the network and disks. The MapReduce programming model contains two core functions: the Map function and the Reduce function. The framework operates in two phases, map and reduce. A map task is divided into two stages, map and sort, each with a weight of 1/2 in the progress score, while a reduce task is divided into three stages: shuffle, sort and reduce, each of which accounts for 1/3 of the score.
Figure 1 shows the computation flow of the MapReduce framework in detail.
Once a job is submitted to the job scheduler, the MapReduce framework processes it on the cluster nodes in two stages. The distributed nodes process the tasks continuously and progressively report their status. The Map function extracts key-value pairs from the data store and passes them to the user-defined Map and Combine functions to generate the corresponding intermediate output. This output is sent to the reduce phase, which sorts and groups the intermediate key-value pairs to generate the final output, completing the allotted task.
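The map, combine, shuffle and reduce flow described above can be illustrated with a small single-process sketch. This is not Hadoop code: the word-count job and the function names (map_fn, combine_fn, shuffle, reduce_fn, run_job) are assumptions chosen only for illustration, and a real cluster would run each mapper and reducer on a different node.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit an intermediate (word, 1) pair for each word in one record."""
    for word in line.lower().split():
        yield word, 1


def combine_fn(pairs):
    """Combine: pre-aggregate the output of a single mapper locally."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def shuffle(mapper_outputs):
    """Shuffle and sort: group intermediate values by key, ordered by key."""
    groups = defaultdict(list)
    for pairs in mapper_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(key, values):
    """Reduce: collapse all values of one key into the final output pair."""
    return key, sum(values)


def run_job(splits):
    """Run the whole job in-process; each input split stands in for one mapper."""
    mapper_outputs = [combine_fn(map_fn(i, line)) for i, line in enumerate(splits)]
    return dict(reduce_fn(key, values) for key, values in shuffle(mapper_outputs))


if __name__ == "__main__":
    print(run_job(["big data big cluster", "data node"]))
```

If one of the mappers in such a job runs on a slow node, the shuffle step cannot finish until that mapper's output arrives, which is how a single straggler delays the entire job.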
Hadoop, a Java-based open-source implementation of the MapReduce framework, is developed by the Apache Software Foundation for processing abundant data using MapReduce and the Hadoop Distributed File System (HDFS), as stated by Gobioff et al. (2003). It makes MapReduce more widely available to users in a fault-tolerant fashion. Hadoop is a successful implementation of the MapReduce programming model, comprising the two entities mentioned above. Figure 2 shows the YARN architecture of Hadoop in detail. The framework follows a master-slave architecture: the master node hosts the Resource Manager, while the Node Manager and the Application Master belong to the slave nodes. From the hardware perspective, the MapReduce system and HDFS run on various nodes that provide both computing and storage, so computing tasks in the MapReduce system can be scheduled effectively on the storage nodes that hold the data.
Hadoop runs several map and reduce tasks concurrently on each slave node, assigned by the master node, to avoid congestion by overlapping computation and I/O. Once a slave node completes a task, it reports the empty slot to the master through a heartbeat request. The job scheduler then assigns the next task to the empty slot. Reduce tasks can be assigned without waiting for all map tasks to complete, since each reduce task takes its own time to fetch the map outputs it needs as input for the next phase.
In a heterogeneous environment, the overall performance of the framework suffers and jobs complete slowly. The major causes include faulty hardware, slow performance of a node and misconfiguration. Slow-performing slave nodes are identified as stragglers. By default, the built-in scheduler in Hadoop tries to recognise these straggler nodes to speed up execution. Our research finds that, along with slow nodes and other physical shortcomings, the main cause of stragglers is inappropriate task scheduling across nodes, which results in resource contention. In this work, we identify different straggler issues and propose an improved mitigation technique that uses a resource-mechanism framework combined with an effective machine learning algorithm to identify straggler nodes and avoid sending tasks to them. This paper is organised as follows. Section 1 introduces MapReduce and HDFS along with the major straggler issue. Section 2 presents related work on mitigating the straggler issue using various other techniques. Section 4 describes the strategy used in this research work, along with the research questions addressed, for improving MapReduce performance by identifying straggler nodes so that tasks can be scheduled effectively.
The scope of this research work is to provide an effective straggler identification framework that offers a mitigation scheme for the cloud MapReduce environment. The degradation of MapReduce job performance due to stragglers has been widely discussed in recent years. Numerous straggler identification and mitigation techniques have been proposed and implemented to reduce execution time, improving job response with effective resource utilisation. Several techniques have been proposed for identifying slow-running tasks and the redundant copies that accompany straggler nodes. Existing straggler mitigation techniques, whether proactive or reactive, fall short of providing a complete or unique solution to straggler problems.
Reactive Straggler Identification Mechanisms
The most popular approach to the straggler problem is speculative execution, which monitors each node for slow-running tasks and launches duplicate copies of those tasks on other nodes so that the scheduled jobs complete on time. Although the task is completed, the wait-and-speculate scheme uses time inefficiently, and the multiple copies increase resource utilisation, which degrades performance. An improved design of speculative execution, the Longest Approximate Time to End (LATE) algorithm, was later proposed by Zaharia et al. (2008).
The algorithm focuses on the approximate time left for a slow-performing task rather than on its progress rate alone. It was initially developed to address performance degradation in Hadoop environments with heterogeneous nodes. Two heuristics are maintained to minimise resource utilisation: (1) a Speculative Cap, which limits the number of speculative tasks running at once, and (2) a Slow Task Threshold, against which a task's progress rate is compared to decide whether it is slow. Although LATE maintains better backup strategies, its estimate of the time remaining for a running task can be inaccurate, so some straggler nodes go unidentified, resulting in resource wastage.
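A minimal sketch of the LATE selection rule follows. It uses the estimate from Zaharia et al. (2008), where the progress rate is the progress score divided by the elapsed time and the time left is (1 - progress score) / progress rate. The Task class, the pick_speculative function and the 25th-percentile default are illustrative assumptions, and the sketch omits the node-level checks of the original algorithm.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    progress_score: float  # fraction of work done, in [0, 1]
    elapsed_s: float       # seconds since the task started

    @property
    def progress_rate(self):
        return self.progress_score / self.elapsed_s

    @property
    def time_left(self):
        # LATE estimate: remaining work divided by the observed rate
        return (1.0 - self.progress_score) / self.progress_rate


def pick_speculative(tasks, running_speculative, speculative_cap,
                     slow_task_percentile=0.25):
    """Return the task to back up next, or None if nothing qualifies."""
    # Speculative Cap: never exceed the allowed number of backup copies.
    if running_speculative >= speculative_cap:
        return None
    # Only running tasks with measurable progress can be ranked.
    candidates = [t for t in tasks
                  if t.elapsed_s > 0 and 0 < t.progress_score < 1]
    if not candidates:
        return None
    # Slow Task Threshold: a low percentile of the observed progress rates.
    rates = sorted(t.progress_rate for t in candidates)
    threshold = rates[int(slow_task_percentile * (len(rates) - 1))]
    slow = [t for t in candidates if t.progress_rate <= threshold]
    # Among slow tasks, back up the one expected to finish last.
    return max(slow, key=lambda t: t.time_left)


if __name__ == "__main__":
    running = [Task("a", 0.9, 100), Task("b", 0.2, 100),
               Task("c", 0.5, 50), Task("d", 0.8, 100)]
    chosen = pick_speculative(running, running_speculative=0, speculative_cap=2)
    print(chosen.task_id, chosen.time_left)
```

Because the estimate assumes a constant progress rate, a task whose stages run at different speeds can be ranked wrongly, which is the inaccuracy noted above.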
Enhanced Reactive Straggler Approaches
Using historical data acquired from each map stage, Q. Chen et al. (2010) suggested the Self-Adaptive MapReduce Scheduling Algorithm (SAMR), another variant of the LATE algorithm, for detecting slow tasks dynamically. To obtain a more accurate progress score during task execution, SAMR uses the historical data stored on each node in the cluster to adjust the weights of the corresponding map and reduce stages, thereby helping to identify slow nodes. Because SAMR relies on historical data stored on each node, it applies the same stage weights to map and reduce tasks even when the corresponding job size changes, which is one major drawback. To overcome this limitation, Enhanced Self-Adaptive MapReduce Scheduling (ESAMR) was designed by Sun et al. (2012). ESAMR categorises the historical data using the K-means clustering machine learning algorithm, which helps in dynamically fine-tuning the parameters and in identifying slow tasks quickly.
Current straggler identification techniques focus mainly on the best utilisation of application resources. However, node performance is one vital cause of the straggler issue. Studies of widely used cloud platforms such as Amazon EC2 have revealed that poor response time is caused by weak node performance rather than by the network. It is also understood that this characteristic is pervasive and that its impact persists over time. Hence, speculative scheduling techniques will suffer a major loss of performance. To understand and correlate with the straggler problem, an offline analysis mechanism is used to rank node performance, which follows a 3-parameter log-logistic distribution as mentioned by Bailey et al. (2013), so that the weak nodes can be identified. On a Google cluster, the mechanism was able to identify the straggler nodes within the cluster, yielding better performance.
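The clustering idea behind ESAMR can be sketched as follows, assuming each historical record is a tuple of reduce-stage weights (shuffle, sort, reduce) that sum to 1. The kmeans and nearest_weights functions, the sample history and the deterministic initialisation are illustrative assumptions and do not reproduce the procedure of Sun et al. (2012).

```python
import math


def kmeans(points, k, iterations=20):
    """Plain K-means; the first k points serve as the initial centroids."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each record to its closest centroid.
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(c) / len(members) for c in zip(*members)]
    return centroids


def nearest_weights(centroids, observed):
    """Pick the cluster average closest to the weights observed so far."""
    return min(centroids, key=lambda c: math.dist(c, observed))


if __name__ == "__main__":
    # Hypothetical history: (shuffle, sort, reduce) weights of finished tasks.
    history = [(0.60, 0.10, 0.30), (0.20, 0.20, 0.60),
               (0.62, 0.08, 0.30), (0.18, 0.22, 0.60)]
    centroids = kmeans(history, k=2)
    print(nearest_weights(centroids, (0.58, 0.12, 0.30)))
```

The weights of the selected cluster would then replace the fixed 1/3 stage weights when computing the progress score of a running task.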
Straggler Mitigation by Cloning
For effective straggler mitigation that targets mainly smaller jobs and improved resource performance in large datacentres, Ananthanarayanan et al. (2013) proposed Dolly. This straggler mitigation technique reduces the occurrence of stragglers by cloning small tasks and removing the lagging copies. Although the cost of cloning was kept low, the approach still incurs evident resource wastage.
Previously, researchers such as Zhang et al. (2000) claimed that addressing resource contention and utilisation efficiently improves performance. Their mechanism is deployed on distributed systems and is based on CPU utilisation and memory usage, enabling CPU load sharing. This supports a link between the identification of straggler nodes and resource utilisation. Similarly, Zaharia et al. (2011) proposed allocating resources equally in a heterogeneous environment, leaving the cluster nodes with a fair allocation. For dynamic provisioning of CPU capacity, Zhang et al. (2013) presented a resource-aware management scheme called HARMONY, which schedules tasks only on valid nodes and shuts down poor or idle nodes to save resources.
Replication-based techniques such as the cloning mechanism suggested by Ananthanarayanan et al. (2013) also waste resources, because scheduled tasks are relaunched whenever a task lags, is delayed or does not complete. Most of the frameworks discussed share these characteristics, and there is also a chance that the re-scheduled tasks themselves become stragglers.
Straggler Identification: Based on Coding Theory and Data Parallelism Techniques
According to Cadambe et al. (2016), another recent approach to straggler identification is based on coding-theory techniques and works exclusively for linear operations in distributed systems. The techniques suggested by Halbawi et al. (2017) and Lei et al. (2017) apply coding to gradient computation. However, this mitigation approach demands a redundancy factor of (r + 1) for ‘r’ stragglers.
J. Lin et al. (2009) and Huang et al. (2011) proposed techniques categorised under the data parallelism and model parallelism approaches, which are in line with randomised linear algebra for large-scale optimisation using dimensionality reduction. Further, these frameworks also work with asynchronous strategies, which can result in unbounded delays unless hard bounds are imposed on the basis of the delay distribution.
ML-based Straggler Mitigation
Machine learning and distributed (multi-node) machine learning techniques have also been designed for straggler mitigation. Using an ML algorithm, Yadwadkar et al. (2012) suggested an automated approach that learns the relationship between node-level attributes and the execution time of each task using a decision tree algorithm. The method formulates a pre-defined set of rules by training on historical information, identifying straggler nodes with small overhead. Distributed ML algorithms perform better on larger datasets, significantly reducing the error rate and benefiting more complex calculations. One such technique, the “Batched Coupon’s Collector” (BCC) introduced by Kalan et al. (2017), uses a distributed ML approach that partitions the entire training dataset into batches to identify straggler nodes precisely.
Proactive Straggler Identification Mechanism
Proactive approaches are more time-effective and use fewer resources while maximising performance, because they avoid replicating job tasks. Wrangler, proposed by Yadwadkar et al. (2014), is one prominent proactive approach that identifies straggler nodes using linear modelling. Figure 7 shows the architecture of the Wrangler approach in detail. The master allots tasks to the slaves depending on the heartbeat requests sent by the workers. A model builder, which is one major component, is generated from the logs of previously scheduled jobs and snapshots of the resource usage counters. Using the model builder's output, the scheduler can predict whether a node is about to become a straggler and avoid node congestion by not overloading the nodes that would otherwise turn into stragglers.
The existing techniques are intended to reduce the impact of stragglers as far as possible. Table 1 summarises the various straggler techniques analysed above. From this brief analysis, the proactive straggler mechanism is currently considered one of the best approaches. However, the technique is not effective at eliminating straggler nodes altogether.
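The proactive idea can be sketched as a heartbeat handler that consults a prediction model before handing out work. The sketch assumes a logistic-style linear scorer over three resource counters. The NodeSnapshot fields, the WEIGHTS and BIAS values and the on_heartbeat function are all hypothetical, standing in for a model that would in practice be trained on job logs; they are not taken from Wrangler.

```python
import math
from dataclasses import dataclass


@dataclass
class NodeSnapshot:
    node_id: str
    cpu_util: float  # fraction of CPU in use, 0..1
    mem_util: float  # fraction of memory in use, 0..1
    io_wait: float   # fraction of time spent waiting on I/O, 0..1


# Hypothetical coefficients; a real model builder would learn these from logs.
WEIGHTS = {"cpu_util": 3.0, "mem_util": 2.0, "io_wait": 4.0}
BIAS = -5.0


def straggler_probability(snapshot):
    """Linear score over the resource counters, squashed into (0, 1)."""
    z = BIAS + sum(w * getattr(snapshot, name) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def on_heartbeat(snapshot, pending_tasks, threshold=0.5):
    """Assign the next task unless the node is predicted to straggle."""
    if not pending_tasks:
        return None
    if straggler_probability(snapshot) >= threshold:
        return None  # withhold work so the node is not overloaded further
    return pending_tasks.pop(0)


if __name__ == "__main__":
    queue = ["task-1", "task-2"]
    print(on_heartbeat(NodeSnapshot("n1", 0.20, 0.30, 0.05), queue))
    print(on_heartbeat(NodeSnapshot("n2", 0.95, 0.90, 0.60), queue))
```

Withholding a task costs only one scheduling round, whereas a wrongly placed task can delay the whole job, so the threshold would normally be tuned towards caution.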
Research Method and Specifications
Current straggler techniques react only once stragglers have appeared, generating replicas or speculative copies to mitigate them. However, the root cause of straggler nodes remains unaddressed. Hence, a smarter scheduling technique with a suitable algorithm is needed to share tasks evenly among the nodes and to give better insight into scheduling performance. Stragglers are commonly rooted in CPU utilisation, unmanaged I/O leading to queuing, and congestion of network resources. We therefore propose a solution that predicts straggler nodes proactively, rather than identifying and mitigating them after job scheduling has taken place. The architecture predicts straggler nodes ahead of time using a predictive machine learning algorithm.
The overall design of the proposed method is illustrated in Figure 8. The proposed design uses a predictive machine learning algorithm to identify straggler nodes proactively and to allocate jobs effectively to each node using timely counter functions in MapReduce. The key component of the proposed design is the predictive machine learning algorithm, which helps in