This paper discusses predicting basketball match results by combining machine learning algorithms applied to match statistics with sentiment analysis of the Twitter crowd. Many studies in the sports industry have predicted match results from match-related metrics and teams’ history logs, while others have used only sentiment analysis of the crowd. This paper proposes a model that combines both approaches. Extending the previous research, it attempts result prediction through sentiment analysis in the basketball domain and also tries to improve the performance of existing statistics-based methods.
Sports analytics has gained considerable attention in recent years. Although the area is not new and has been researched before, the continuous development of statistical and mathematical models has raised the awareness of stakeholders in the sports industry about the opportunities it offers.
The success story of the Oakland Athletics baseball team, published in 2003 in the book Moneyball, drew attention to the use of statistical models. At that time it was believed that the only way to build a good team was to pay high salaries to buy the best players, but the Oakland team was able to compensate for that by relying on statistical analysis.
One of the areas in sports analytics that has attracted the attention of many researchers is match results prediction. Its importance stems from the huge amount of money invested in sports betting, which is why fans, bookmakers and bettors are all interested in accurate predictions. It also helps team managers evaluate their strategies, assess players’ performance and discover potential talents. Many papers have reviewed previous research and developed models for match results prediction in different sport domains such as football [1, 3, 9, 13], tennis [7, 8, 10, 13], American football [1, 5, 13] and basketball [14, 5].
This paper reviews some of the previous literature to discuss the various models used for match results prediction, but focuses on the application to basketball using sentiment analysis of Twitter data and players’ statistics.
One of the earliest result prediction models in football is the 1982 model introduced by Maher. The model represents the goals scored and the goals conceded by a team as two Poisson variables: goals scored depend on the team’s attacking parameter and the opponent’s defensive parameter, and goals conceded depend on the opponent’s attack and the team’s own defence. It also includes a home advantage effect, which represents the advantage of the hosting team playing on its own field.
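The idea behind this kind of model can be sketched in a few lines of Python. This is only an illustration, not Maher’s original formulation. The function names, the multiplicative form of the home advantage and all parameter values are assumptions made for the example. The two goal counts are treated as independent Poisson variables.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing exactly k goals when the expected number is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def outcome_probabilities(home_attack, home_defence, away_attack, away_defence,
                          home_advantage=1.0, max_goals=15):
    """Return (P(home win), P(draw), P(away win)) under two independent Poissons.

    The *_defence values act as defensive weakness here: a larger value means
    the team concedes more. This parameterisation is an assumption.
    """
    # Expected goals: own attack strength times the opponent's defensive weakness.
    lam_home = home_attack * away_defence * home_advantage
    lam_away = away_attack * home_defence

    p_home = p_draw = p_away = 0.0
    # Sum the joint probability of every scoreline up to max_goals per team.
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
            if h > a:
                p_home += p
            elif h == a:
                p_draw += p
            else:
                p_away += p
    return p_home, p_draw, p_away


# Two equally strong teams (made-up values); only the home advantage differs.
print(outcome_probabilities(1.5, 1.0, 1.5, 1.0, home_advantage=1.2))
```

With equal team parameters, a home advantage above 1 shifts probability mass from the away win to the home win, which is the effect the home advantage variable is meant to capture.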
A study by  proposed a new idea: although many previous studies had found a strong association between results prediction and betting odds, the two had never been combined in a single model. The main idea was to use a hierarchical Bayesian Poisson model whose inputs were historical performance parameters of the team and bookmakers’ betting odds converted into probabilities. It was shown to improve result prediction performance.
A different methodology was suggested by , which, instead of depending on traditional match statistics, history and environmental factors for prediction, proposes using sentiment analysis, an approach that had proved a great success in the business domain. Tweets are collected 4 days before the match, and tone and polarity are extracted from them. The resulting predictions are then compared with a baseline prediction model built on betting odds.
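As an illustration of how betting odds can be turned into probabilities, the sketch below converts decimal odds into implied probabilities. The reviewed paper’s exact conversion is not described here, so the proportional normalisation, the function name and the example odds are all assumptions.

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds for mutually exclusive outcomes into probabilities."""
    # The inverse of a decimal odd is the bookmaker's raw implied probability.
    raw = [1.0 / odd for odd in decimal_odds]
    # Raw values sum to more than 1 because of the bookmaker's margin,
    # so rescale them proportionally to sum to exactly 1 (assumed method).
    total = sum(raw)
    return [value / total for value in raw]


# Made-up odds for home win, draw and away win.
print(implied_probabilities([2.0, 3.5, 4.0]))
```

Proportional rescaling is the simplest way to remove the margin. Other schemes exist, and a different one may fit a particular bookmaker better.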
The paper states that the baseline models outperformed all of their sentiment models in match result prediction, a shortcoming this paper will try to overcome; on the other hand, their models were better predictors of betting payouts.
2.2 Tennis
In the tennis domain, the paper of  investigated tennis match results prediction. It discussed the problem with rankings produced by official sports associations, in this case the Association of Tennis Professionals (ATP), and why they cannot be used alone for reliable result predictions. Instead, the authors used a Bradley-Terry model, introduced in 1952, which performs pairwise comparisons. The model represented each player with an ability parameter based on statistics collected over the previous nine seasons and used a decaying function to give higher weights to performance in recent seasons than in older ones. It also considered a tennis-specific factor, the effect of the court surface on results, because players do not perform equally on different court surfaces.
The paper by  likewise focused on combining statistical and environmental data for better prediction. The authors built a multi-layer perceptron neural network that was fed both feature sets, and it was found to outperform previous models that depended only on historical statistical data.
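The Bradley-Terry comparison with time-decayed weights described above can be sketched as follows. This is a minimal illustration, not the reviewed paper’s implementation. The exponential decay with a half-life, the function names and the ability values are assumptions, and the court surface effect is left out.

```python
import math


def win_probability(ability_a: float, ability_b: float) -> float:
    """Bradley-Terry probability that player A beats player B."""
    return ability_a / (ability_a + ability_b)


def decay_weight(seasons_ago: float, half_life: float = 2.0) -> float:
    """Weight of a past match; it halves every `half_life` seasons (assumed form)."""
    return 0.5 ** (seasons_ago / half_life)


def weighted_log_likelihood(matches, abilities, half_life: float = 2.0) -> float:
    """Time-weighted log-likelihood of the observed results.

    `matches` is a list of (winner, loser, seasons_ago) tuples. Fitting the
    model means choosing the abilities that maximise this value.
    """
    total = 0.0
    for winner, loser, seasons_ago in matches:
        weight = decay_weight(seasons_ago, half_life)
        total += weight * math.log(win_probability(abilities[winner], abilities[loser]))
    return total


# Made-up abilities and match history for two hypothetical players.
abilities = {"player_a": 2.0, "player_b": 1.0}
matches = [("player_a", "player_b", 0), ("player_b", "player_a", 4)]
print(win_probability(abilities["player_a"], abilities["player_b"]))
print(weighted_log_likelihood(matches, abilities))
```

Because older matches receive smaller weights, a loss from four seasons ago lowers the likelihood less than a loss in the current season would.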
The paper by  predicted NBA basketball match results by building a generalized model that predicts the result of a match between two teams from team statistics, opponent statistics, player statistics and game history data. The model also considered the home advantage factor and based its predictions on the averaged statistics of the last 10 games. Different machine learning algorithms, such as Simple Logistic Classifier, Artificial Neural Networks, SVM and Naive Bayes, were then applied. Prediction accuracy reached up to 69.76%.
3 Methodology
3.1 Motivation
As mentioned in the introduction, the problem of sports prediction is of interest to many parties in the sports industry. It can help team managers assess their team strategies and help players improve their performance. Because of the huge investments in sports betting, it is also of great importance to bookmakers and fans.
The approach proposed in this paper combines the statistical analysis of match-related data by , where different machine learning techniques were used to predict basketball matches, with the approach of , where predictions were inferred from sentiment analysis of football tweets from Twitter. Machine learning models have empirically demonstrated their power to predict match outcomes by exploiting match-related data, historical data and even external factors. On the other hand, as  mentioned, many previous studies concluded that collective crowd opinion in sports, even though it is naive rather than scientific, was a better forecaster of results than expert schemes based on expertise and previously recognized patterns.
We believe that combining statistical analysis with crowd-sourced opinion analysis can lead to better prediction results. The contribution of this paper is therefore to extend the sentiment analysis of  to the basketball domain and to try to enhance the performance of the basketball model by .
The model will work on two different datasets: the basketball statistics data and the tweets collected about the match. For the basketball statistics, we will use the same website used by , Basketball-reference.com. The website offers extensive information on players, starting line-ups, game schedules, team statistics and the history of games played. For the tweets, we will follow the methodology described by . Tweets will be collected using the Twitter API and filtered by hashtags related to the desired teams (e.g. chicagobulls, BullsNation, Rockets, etc.), and sentiment analysis will then be performed with the OpinionFinder tool. The tool analyzes the given tweets and provides two outputs: subjectivity (Subjective or Objective) and polarity (Positive, Negative or Neutral).
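The hashtag filtering step could look like the sketch below. It assumes the tweets have already been downloaded as plain text, so no Twitter API call is shown. The team-to-hashtag mapping and the function name are hypothetical, and only the hashtags themselves come from the examples above.

```python
import re

# Hypothetical mapping from a team key to the hashtags that identify it.
TEAM_HASHTAGS = {
    "bulls": {"chicagobulls", "bullsnation"},
    "rockets": {"rockets"},
}


def teams_mentioned(tweet_text: str) -> set:
    """Return the teams whose hashtags appear in the tweet (case-insensitive)."""
    # Extract every hashtag without the leading '#' and lowercase it.
    tags = {tag.lower() for tag in re.findall(r"#(\w+)", tweet_text)}
    return {team for team, team_tags in TEAM_HASHTAGS.items() if tags & team_tags}


print(teams_mentioned("Great win tonight #BullsNation!"))
print(teams_mentioned("Watching the game with friends"))
```

Tweets that match no team would be dropped before sentiment analysis. How to treat tweets that mention both teams is a design choice the methodology still has to settle.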
For result prediction, a total of three models will be built: the first predicts from statistical data only, the second from sentiment analysis only, and the third from both statistical data and sentiment analysis.
The first model will be the same as the one mentioned in . It will train different algorithms, such as Naive Bayes, SVM and neural networks, using the average statistics of the last 10 games played by each team, statistics of recent games between the two teams and statistics of the teams’ overall performance in the last season.
The second model will use the same criteria as . Based on the different models generated by the OpinionFinder tool, the prediction is made by normalizing the negative and positive crowd tweets for both teams and then deciding in favour of the higher normalized value.
The final model will work under the assumption that both models contribute equally to the final decision.
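The three building blocks described above can be sketched as follows. All names are hypothetical. Taking the share of positive tweets as the normalized sentiment value and averaging the two probabilities to express equal contribution are assumptions, since the text does not fix either formula.

```python
from statistics import mean


def last_n_average(values, n=10):
    """Average of a statistic over a team's most recent n games."""
    return mean(values[-n:])


def sentiment_score(positive: int, negative: int) -> float:
    """Share of positive tweets among polar tweets (assumed normalisation)."""
    total = positive + negative
    return 0.5 if total == 0 else positive / total


def sentiment_home_win_probability(home_counts, away_counts) -> float:
    """Turn both teams' (positive, negative) counts into a home win probability."""
    home = sentiment_score(*home_counts)
    away = sentiment_score(*away_counts)
    return 0.5 if home + away == 0 else home / (home + away)


def combined_prediction(stat_prob: float, sent_prob: float) -> str:
    """Equal-weight combination of the statistical and sentiment models."""
    combined = 0.5 * stat_prob + 0.5 * sent_prob
    return "home" if combined >= 0.5 else "away"


# Made-up inputs: points per game, tweet counts and a statistical model output.
print(last_n_average([98, 102, 110, 95, 101, 99, 104, 107, 100, 96, 103, 105]))
sent_prob = sentiment_home_win_probability((30, 10), (10, 30))
print(sent_prob, combined_prediction(0.4, sent_prob))
```

In this example the statistical model slightly favours the away team (0.4), but strongly positive crowd sentiment for the home team moves the combined decision to a home win.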
Since the final output of the model is a prediction of either a win or a loss, and since match results prediction is a supervised learning problem, the predictions will be compared with the results of real matches. Accuracy is the ratio of correctly predicted results to the total number of predicted matches. Both papers used accuracy as the main evaluation metric. It will also allow a comparison between the three designed models, to see whether statistical prediction and crowd opinion are good predictors when used alone, or whether the combination of both can lead to better prediction.
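The accuracy metric defined above amounts to the following small sketch. The function name and the example labels are made up for illustration.

```python
def accuracy(predicted, actual) -> float:
    """Ratio of correctly predicted results to the number of predicted matches."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("at least one match is required")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(predicted)


# Made-up predictions and real outcomes for four matches.
predicted = ["home", "away", "home", "home"]
actual = ["home", "home", "home", "away"]
print(accuracy(predicted, actual))
```

Computing the same value for each of the three models on the same set of matches gives the comparison this section describes.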
This paper has proposed a model that predicts match results between basketball teams in a supervised learning setting by combining two strengths: the power of sentiment analysis of the non-expert crowd on Twitter and the power of informed prediction by machine learning algorithms on sports statistics.