Connect-4 Game Outcome Analysis using Supervised and Unsupervised Machine Learning
The code is available in my GitHub repository.
Introduction
Connect-4 is a two-player board game with a simple objective: connect four discs of the same color in a row, column, or diagonal. This project investigates the Connect-4 game dataset, which comprises legal game positions in which neither player has won yet and the next move is not forced.
Goals
- Examine the effectiveness of various analytical techniques.
- Utilize supervised algorithms for outcome prediction.
- Employ unsupervised learning for pattern discovery.
Theoretical Background
A detailed discussion of each algorithm’s theory and application is provided in the report, encompassing topics like Logistic Regression, Decision Trees, SVMs, Neural Networks, PCA, SVD, and Clustering.
Methodology
The project includes data visualization, preprocessing with one-hot encoding, training various models, and evaluating them: accuracy for the supervised models, and silhouette and completeness scores for the clustering.
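As an illustration of the one-hot encoding step, here is a minimal sketch using NumPy. It assumes each of the 42 board cells holds one of three symbols (`x`, `o`, or `b` for blank); the symbol set and the helper name `one_hot_board` are assumptions made for this example, not taken from the project code.

```python
import numpy as np

# Assumed cell symbols: first player, second player, blank.
STATES = ("x", "o", "b")

def one_hot_board(cells):
    """Encode a sequence of cell symbols as a flat 0/1 vector, 3 columns per cell."""
    index = {symbol: i for i, symbol in enumerate(STATES)}
    encoded = np.zeros((len(cells), len(STATES)), dtype=np.int8)
    for row, symbol in enumerate(cells):
        encoded[row, index[symbol]] = 1
    return encoded.ravel()

# Toy position: an empty 42-cell board with one disc from each player.
board = ["b"] * 42
board[0] = "x"
board[1] = "o"
vector = one_hot_board(board)
```

Each cell contributes exactly one 1, so a 42-cell board becomes a 126-element vector with 42 ones. Encoding the symbols this way avoids implying an order between them, which an integer code such as 0/1/2 would.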
Results and Discussions
The table summarizes the performance of the different models and the clustering algorithm on the dataset. Accuracy is the percentage of correctly classified samples. The silhouette score measures how compact and well separated the clusters are, while the completeness score measures how well the clusters capture the original classes, that is, whether samples of the same class end up in the same cluster. The random forest model achieved the highest accuracy, indicating its effectiveness in predicting the target variable. The linear SVM and the clustering algorithm performed relatively poorly compared to the other models, suggesting that the classes are not linearly separable and that the data does not form distinct clusters.
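To make two of these metrics concrete, the sketch below computes accuracy and the completeness score by hand with NumPy on made-up labels. It is an illustration under stated assumptions, not the project's evaluation code: the function names are hypothetical, and completeness is taken as 1 - H(cluster | class) / H(cluster). The silhouette score is left out because it needs the feature vectors as well as the cluster labels.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def entropy(counts):
    """Shannon entropy (natural log) of a vector of counts."""
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

def completeness(classes, clusters):
    """1 - H(cluster | class) / H(cluster); 1.0 when each class sits in one cluster."""
    classes, clusters = np.asarray(classes), np.asarray(clusters)
    _, cluster_counts = np.unique(clusters, return_counts=True)
    h_cluster = entropy(cluster_counts)
    if h_cluster == 0.0:
        return 1.0  # a single cluster is trivially complete
    h_conditional = 0.0
    for c in np.unique(classes):
        mask = classes == c
        _, counts = np.unique(clusters[mask], return_counts=True)
        h_conditional += mask.mean() * entropy(counts)
    return 1.0 - h_conditional / h_cluster

# Toy labels, not results from the Connect-4 dataset.
acc = accuracy(["win", "loss", "draw", "win"], ["win", "loss", "win", "win"])
perfect = completeness(["win", "win", "loss", "loss"], [0, 0, 1, 1])
mixed = completeness(["win", "win", "loss", "loss"], [0, 1, 0, 1])
```

In the `mixed` case every class is split evenly across both clusters, so knowing the class says nothing about the cluster and completeness drops to zero. scikit-learn offers `accuracy_score`, `completeness_score`, and `silhouette_score` in `sklearn.metrics` for the same purposes.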
The figure shows the outcome distribution for the Connect-4 dataset, revealing an imbalance between the classes. However, every outcome category still contains an ample number of data points, so the analysis can proceed on the dataset as it is.
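A quick way to quantify such an imbalance is to compute the share of each outcome and the accuracy of always predicting the most common one. The sketch below uses made-up labels and a hypothetical helper name; the label values `win`, `loss`, and `draw` are assumed for illustration.

```python
from collections import Counter

def outcome_distribution(labels):
    """Return the share of each outcome label as a fraction of all samples."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Toy labels, not the real dataset.
labels = ["win"] * 6 + ["loss"] * 3 + ["draw"] * 1
shares = outcome_distribution(labels)

# Accuracy obtained by always predicting the most frequent outcome.
majority_baseline = max(shares.values())
```

With imbalanced classes, a model's accuracy is most informative when read against this majority baseline rather than against chance.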
Random Forest Results:
Random Forest is the best model according to the accuracy values.
Analyzing feature importances is essential for understanding the underlying factors that drive the model’s predictions. It helps identify which features have the most influence on the target variable and can aid in feature selection or feature engineering processes.
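With one-hot encoded boards, each cell is spread over several feature columns, so it can help to sum the importances back onto the board. The sketch below does this with NumPy under two assumptions that may not match the project code: three one-hot columns per cell, and cells ordered column by column from the bottom row up (a1..a6, b1..b6, and so on). The importance values are made up; with scikit-learn they would typically come from a fitted random forest's `feature_importances_` attribute.

```python
import numpy as np

ROWS, COLS, STATES = 6, 7, 3  # board size and assumed one-hot columns per cell

def importance_per_cell(importances):
    """Sum one-hot importances per cell and arrange them as a board grid."""
    imp = np.asarray(importances, dtype=float)
    per_cell = imp.reshape(COLS * ROWS, STATES).sum(axis=1)
    # Assumed column-major cell order, with row 1 at the bottom of the board.
    grid = per_cell.reshape(COLS, ROWS).T  # shape (ROWS, COLS), row 0 = bottom
    return grid[::-1]                      # flip so the top row comes first

# Made-up importance vector for illustration only.
toy = np.zeros(ROWS * COLS * STATES)
toy[0:3] = [0.1, 0.2, 0.3]  # cell a1 (bottom-left under the assumed order)
toy[18] = 0.4               # first column of cell b1
grid = importance_per_cell(toy)
```

The resulting 6x7 array can be shown as a heatmap to see which board positions the model relies on most.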
The confusion matrix plot gives an overview of the random forest model's performance by comparing the predicted and actual values of the target variable. The diagonal elements are the instances that were classified correctly, while the off-diagonal elements are the instances that were misclassified.
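The structure of a confusion matrix can be sketched in a few lines of NumPy. The example below uses made-up labels rather than the model's actual predictions, and the function is a hand-written stand-in for library routines such as scikit-learn's `sklearn.metrics.confusion_matrix`. Rows are actual classes and columns are predicted classes.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, labels):
    """Count samples per (actual, predicted) pair; rows are actual classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = np.zeros((len(labels), len(labels)), dtype=int)
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual], index[predicted]] += 1
    return matrix

# Toy labels for illustration only.
labels = ["win", "loss", "draw"]
y_true = ["win", "win", "loss", "draw", "draw"]
y_pred = ["win", "loss", "loss", "win", "draw"]

matrix = confusion_matrix(y_true, y_pred, labels)
overall_accuracy = matrix.trace() / matrix.sum()    # correct / total
recall = matrix.diagonal() / matrix.sum(axis=1)     # per actual class
```

The diagonal sum divided by the total gives the accuracy, and dividing each diagonal entry by its row sum shows how often each actual outcome is recognized, which is useful when the classes are imbalanced.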
Neural Network Performance Visualization:
The trained neural network achieved a test accuracy of 79.55%. This could likely be improved by modifying the existing model or by exploring different network architectures, which could also support more advanced analyses of the Connect-4 game in future work.
Singular Value Decomposition:
The scree plot shows the eigenvalues of the principal components, that is, the amount of variance explained by each component. For centered data, these eigenvalues are proportional to the squared singular values obtained from the SVD. A clear elbow in the plot, where the eigenvalues drop sharply after a certain number of components, indicates that only a few components account for most of the variance. Here, the first eight components explain the larger share of the total variance, while the remaining components contribute little. This is useful for dimensionality reduction: retaining only the first eight components reduces the dimensionality of the data while preserving most of the relevant information.
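The values behind a scree plot can be computed directly from the singular values. The following sketch uses NumPy on synthetic data (not the Connect-4 features) and assumes the data is centered before the decomposition; the function names and the 95% threshold are choices made for this example.

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of total variance per component, from the SVD of centered data."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2  # proportional to covariance eigenvalues
    return variances / variances.sum()

def components_for(ratio, threshold):
    """Smallest number of leading components whose cumulative share reaches threshold."""
    cumulative = np.cumsum(ratio)
    count = int(np.searchsorted(cumulative, threshold)) + 1
    return min(count, len(ratio))

# Synthetic data: two high-variance directions and three low-variance ones.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 0.1, 0.1, 0.1])

ratio = explained_variance_ratio(X)
n_keep = components_for(ratio, 0.95)
```

Plotting `ratio` against the component index gives the scree plot, and the elbow corresponds to the point where the cumulative share stops growing noticeably.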
Conclusion:
In summary, this project analyzed the Connect-4 dataset using machine learning techniques such as logistic regression, decision trees, random forest, boosting, SVMs, neural networks, and K-means clustering. The key findings include:
- The random forest model was the most accurate, with an 81.94% success rate in predicting game outcomes.
- The neural network also performed well, achieving 79.55% accuracy.
- Lower accuracies were observed with linear SVM and the clustering algorithm, highlighting challenges in linear class separation and cluster identification.
Visualizations were used to illustrate data distribution and model performances, particularly identifying crucial game board positions for winning. Despite some limitations, this study provides valuable insights into the predictive capabilities of various machine learning models in classifying outcomes in Connect-4 and similar tasks. Future work could involve refining these models or exploring alternative approaches.