DSL Final Project — Spring 2021


Kush Desai, Trey Boehm, Sahil Vaidya, Santhosh Saravanan, Seth Sehon, Jordan Pamatmat

Motivation

Everyone on our team has played chess at some level before. However, while we were brainstorming project ideas, we realized that we all approached the game with a different perspective due to our highly varied skill levels. We decided to apply our knowledge of data science to this domain and explore, quantitatively, how differently ranked players approach the game, as well as delve into interesting features in past games.

Dataset Generation

Tools

Lichess makes all of their games since 2013 available for download in the standard Portable Game Notation (PGN) format. These files include game metadata, such as the players’ ELO ratings, the opening that they played, and the type of game (classical, bullet, etc.). They also include the list of moves in Standard Algebraic Notation (SAN).

We used python-chess to parse games from the PGN files. Conveniently, python-chess also provides a useful API to display boards, view legal moves, query what pieces are on specific board positions, and generally create a usable abstraction of the game.
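
For illustration, a minimal sketch of this parsing loop might look like the following (the file name is illustrative):

```python
import chess.pgn

# Read games from a Lichess PGN dump (path is illustrative).
with open("lichess_db_standard_rated_2013-01.pgn") as pgn:
    while True:
        game = chess.pgn.read_game(pgn)
        if game is None:
            break  # end of file
        # Metadata lives in the PGN headers.
        white_elo = game.headers.get("WhiteElo")
        opening = game.headers.get("Opening")
        # Replay the moves to visit every board state.
        board = game.board()
        for move in game.mainline_moves():
            board.push(move)
```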

The final chess-specific tool we used was Stockfish, a cutting-edge chess engine. Stockfish is a capable chess bot that also exposes some interesting functionality. We focused on its evaluation function, which, given a board state, provides a score that indicates which player has the advantage. Evaluating the board involves searching through the tree of possible move sequences and computing some score at those points, as well. This means that the score depends not only on structures and positions clearly visible at a given point in time, but also on how the game is likely to play out after that point.
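
A minimal sketch of querying Stockfish through python-chess (the binary path is system-specific; `mate_score` maps forced mates to large finite values):

```python
import chess
import chess.engine

# Path to the Stockfish binary is an assumption; adjust for your system.
engine = chess.engine.SimpleEngine.popen_uci("/usr/bin/stockfish")

board = chess.Board()  # starting position
info = engine.analyse(board, chess.engine.Limit(depth=16))
# The score is reported from the side to move's perspective;
# .white() converts it to White's point of view.
score = info["score"].white().score(mate_score=30000)
print(score)

engine.quit()
```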

Features

Human players evaluate the positions on chess boards using some combination of heuristics and intuition gained with experience. Common features that players look at include material, pawn structure, piece development, and control of key squares on the board. The first task we set for ourselves was codifying these numerically. Some, like material, were quite straightforward: simply count the number of each piece type and color on the board and multiply by a set weight. Others were trivial, such as checking whether the current player was in check using a built-in python-chess function. Several involved a bit more work, though, including counting the pins, skewers, and forks. In the end, we wrote functions (or used the python-chess API) to compute the following features (a sketch of a few of these functions appears after the lists below):

Per-color:

  • The weighted material count.
  • How many unique piece types (pawn, bishop, etc.) the player has moved. This is a proxy for development.

Black has moved 2 unique piece types (pawn and knight), and White has moved 3 (pawn, knight, and bishop).

  • The number of pawn islands.
  • The number of single pawns.

White and Black each have one pawn island (marked). The rest of the pawns are single.

  • The king’s mobility for its “top 3” squares (the three adjacent squares nearest the other player’s side).
  • Whether the king can castle.
  • Whether the king has castled.

White cannot castle (since it already has), but Black can. The marked squares represent the king’s “top 3” mobility.

  • How many legal moves each piece has (mobility).
  • The rank and file for each piece.
  • How many promoted pieces there are. We just count extra pieces, so the original black queen and a black pawn promoted to a queen look identical if they are the only black queen on the board.
  • The number of pieces attacking/defending the center squares (d4, e4, d5, and e5).
  • The number of pieces attacking/defending every square.
  • The lowest-valued attacker/defender for each square.

White has three pieces attacking/defending d4, and Black only has one. For both sides, the lowest-valued piece attacking is a knight. The overall control of the center squares goes to White, as well.

  • Number of pins, forks, and skewers (both total and the number targeting the king).

Black has two pins and both sides have one skewer and one fork.

Color-independent features:

  • The next player to move.
  • The move number.
  • The next piece that was moved (we omit this from our models).
  • Whether the current player is in check.
  • Whether the game has reached checkmate.
  • Whether the game has reached stalemate.
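
As promised above, here is a minimal sketch of a few of these feature functions using python-chess. The material weights shown are the classical values, an assumption; our exact weights may differ:

```python
import chess

# Classical material weights (an assumption; our exact weights may differ).
PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9}

def material(board: chess.Board, color: chess.Color) -> int:
    """Weighted material count for one side."""
    return sum(len(board.pieces(piece, color)) * value
               for piece, value in PIECE_VALUES.items())

def pawn_islands(board: chess.Board, color: chess.Color) -> int:
    """Count groups of pawns on adjacent files."""
    files = sorted({chess.square_file(sq)
                    for sq in board.pieces(chess.PAWN, color)})
    # A new island starts whenever there is a gap between pawn files.
    return sum(1 for i, f in enumerate(files)
               if i == 0 or f - files[i - 1] > 1)

def center_control(board: chess.Board, color: chess.Color) -> int:
    """Number of attackers a side has on the four center squares."""
    return sum(len(board.attackers(color, sq))
               for sq in [chess.D4, chess.E4, chess.D5, chess.E5])

board = chess.Board()
print(material(board, chess.WHITE))        # 39 at the start
print(pawn_islands(board, chess.BLACK))    # 1
print(center_control(board, chess.WHITE))  # 0 at the start; rises with development
```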

Extracting Features

To generate our dataset, we processed roughly 12,000 games from the January 2013 Lichess database, which translated to nearly 800,000 board states. We ran our feature extraction functions on each board state and also computed the Stockfish evaluation (with the engine’s search depth limited to 16). With four parallel processes, we were able to generate the dataset in about 20 hours. Using a larger depth would have yielded better evaluations, but we tried to strike a balance between having plenty of data and having accurate scores.
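
A simplified sketch of this kind of pipeline (our actual script differed; the path, the per-game engine spawn, and the `extract_features` helper are illustrative):

```python
from multiprocessing import Pool

import chess.engine
import chess.pgn

PGN_PATH = "lichess_db_standard_rated_2013-01.pgn"  # illustrative path

def evaluate_game(offset):
    """Worker: replay one game and score every board state."""
    # Spawning one engine per game is simple but wasteful; a real
    # pipeline would keep one engine alive per worker process.
    engine = chess.engine.SimpleEngine.popen_uci("/usr/bin/stockfish")
    rows = []
    with open(PGN_PATH) as pgn:
        pgn.seek(offset)
        game = chess.pgn.read_game(pgn)
        board = game.board()
        for move in game.mainline_moves():
            board.push(move)
            info = engine.analyse(board, chess.engine.Limit(depth=16))
            score = info["score"].white().score(mate_score=30000)
            rows.append((board.fen(), score))  # plus our feature vector
    engine.quit()
    return rows

if __name__ == "__main__":
    # First pass: record the byte offset where each game starts.
    offsets = []
    with open(PGN_PATH) as pgn:
        while True:
            offset = pgn.tell()
            if chess.pgn.read_headers(pgn) is None:
                break
            offsets.append(offset)

    with Pool(processes=4) as pool:
        results = pool.map(evaluate_game, offsets)
```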

After generating the dataset, we realized we would have liked to have the game metadata (ELO ratings, etc.) associated with each board state. We did another pass and simply appended these features to each row. Subsequent models that we looked at for predicting the Stockfish evaluation would ignore these columns and stick to features that are visible on the board.
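
Conceptually, this second pass amounts to a join on a game identifier; a minimal sketch with hypothetical file and column names:

```python
import pandas as pd

# Hypothetical names: each board-state row carries its game's ID.
boards = pd.read_csv("board_features.csv")
meta = pd.read_csv("game_metadata.csv")  # one row per game: ELOs, opening, ...
boards = boards.merge(meta, on="game_id", how="left")
boards.to_csv("board_features_with_metadata.csv", index=False)
```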

Analyses performed

We were interested in answering questions in two categories: 1) the differences between good players and novices, and 2) the impact of visible board features on the board’s evaluation. The former relies mainly (but not entirely) on metadata found in the PGNs, while the latter uses the extracted board features exclusively.

Metadata in the dataset

We compiled data from 10 million different games from Lichess.org, scraping events, results, player ratings, openings, and the number of turns. Since Lichess can recognize common openings, it also labels each game with the specific name (e.g., the Queen’s Gambit). As one might expect, we found that White tends to win more often than Black (with 4.0% of games ending in a draw), as shown in the chart below. We also show the nearly-identical histograms for White and Black ELO ratings below, which range from 700 to 2800 and are mostly centered around 1700.

We wanted to understand whether there was a correlation between rating and White/Black win rates. We hypothesized that as ratings increased, White would win more often and the draw rate would rise, since Black would play for a draw more often. We found that the White win rate stays close to 50% across ratings, while the Black win rate decreases and the draw rate steadily increases, which supports the part of our hypothesis about draws at higher ratings.

When looking at ratings, we also checked whether games in different rating groups share similar features. We expected that higher-rated players move their pieces into similar, stronger positions that our extracted features would pick up. The top and bottom 10 features correlated with rating, for White and Black, are shown below.

Here, White’s rating correlates most strongly with White center control, indicating that higher-ranked players exert more control over the center, something we would expect to see since the center is a commanding position. The lowest correlation is with White’s can-castle flag, which may mean that lower-ranked players retain the ability to castle for longer while higher-ranked players prioritize castling early. Black’s rating has its highest coefficient for the king’s file and its lowest for control of h8 (the top-right square), which could be due to higher-level players preferring center control over the outermost squares. Boxplots for two of these features are below. There is a large spread at 0 for each feature, since all board states start at 0 regardless of rating.

We also studied the relationship between rating and the length of the game. Our hypothesis was that higher player ratings would result in shorter games, since players would forfeit more often upon realizing they were losing. Surprisingly, our hypothesis was completely incorrect: there was a very strong, almost perfectly linear upward trend as ratings increased, with high-level players taking far more turns than their lower-level counterparts. This may indicate that better players play more defensively or for a draw.

Popular Openings

We also looked into popular openings in our dataset. There were a total of 2,975 openings, including the different variations and lines. Overall, the most popular openings (counting their several variations) are the Sicilian Defense, the French Defense, the Queen’s Pawn Game, the Scandinavian Defense, and the King’s Pawn Game. However, these five openings only account for about 33% of overall play. The huge number of openings and variations shows how expressive players can be in their approach to the game.

Naturally, differently-skilled players have their own styles, so we investigated the most popular openings among varying rating levels. For example, the King’s Pawn Game is the most popular opening in low-level play, but sees very little play in mid-level (2%) and high-level (0.24%) games. The King’s Pawn Game is a generic “e4” opening that does not go down a typical line; lower-skilled players have a weaker handle on opening theory and hence are less inclined to play a typical opening.

Next, we looked into the performance of two popular openings across skill levels: the Sicilian Defense and the Nimzo-Larsen Attack. Based on their names, we expected the Sicilian to be more defensive, with a higher draw rate, compared to the Nimzo-Larsen Attack, which is implied to be more aggressive, with a lower draw rate.

Sicilian Defense

The Sicilian Defense is the overall most popular opening; it responds to White’s initial “e4” with “c5”. Based on the opening’s performance across ratings, the Sicilian Defense remains a strong option for Black for most players. At the top level of play, there is a very high win and draw rate for Black, indicating that although the Sicilian is widely used across skill levels, the very best players can make the most out of the opening to generate an advantage. Although there appears to be a slight advantage for White in the 2250–2550 skill range, there are also more draws at this level. At the master level (>2200), where players seldom make blatant mistakes, the increased draw rate coupled with the diminished Black win rate displays the defensive nature of the Sicilian.

Nimzo-Larsen Attack

Prior to looking into the data, we hypothesized that there would be openings that can only be fully utilized at the highest level of play. Based on the opening’s performance, the Nimzo-Larsen Attack is one such opening. At the grandmaster level of play (>2500), the Nimzo-Larsen Attack is the 2nd most popular opening and has an impressive 70% win rate for White. At lower skill levels, the opening appears to only slightly favor White; at the grandmaster level, however, it heavily favors White, indicating that only strong players can make the most of the opening. Compared to the Sicilian Defense, the draw rate for the Nimzo-Larsen Attack does not swing significantly upward at higher levels. As its name implies, the Nimzo-Larsen Attack is more aggressive and pushes for wins compared to the Sicilian Defense.

Board states and evaluations

We tested a few different regression models to see which might best predict Stockfish evaluations from the features we generated. Linear and Ridge Regression performed quite poorly, as expected, due to their inherent limitations, and our dataset proved too large for Polynomial Regression to be practical on our computers. Next came boosting, starting with XGBoost. On an initial small training set of around 2,000 board states, XGBoost didn’t perform particularly well, but after tuning parameters through a grid search it still greatly outperformed the previous two models, with a training score (R²) of 0.99 and a testing score of 0.58 on a 67/33 train/test split.

This showed promise, so we trained the model on a larger sample of the data, using 10,000 board states instead of 67% of 2,000. This lowered the training score to 0.945 but raised the testing score (still testing on the 33% split of the initial 2,000 board states) from 0.58 to 0.90. With the model still fit on the new training set of 10,000 board states, we tested its predictions on a separate set of 10,000 more board states, which produced a testing score of around 0.53. Given these mixed results, we tried CatBoost next; after tuning with a grid search, it gave a testing score of 0.565 when trained and tested on those same sets of 10,000 board states, a marginal improvement over XGBoost. Lastly, we trained a LightGBM regressor on the same training set of 10,000 board states, but this model produced a training R² score of -0.69, which was abysmal in comparison to XGBoost and CatBoost.
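
A sketch of this kind of tuning loop (the parameter grid shown is illustrative, not the exact grid we searched; `X` and `y` stand for our feature matrix and Stockfish scores):

```python
import xgboost as xgb
from sklearn.model_selection import GridSearchCV, train_test_split

# X: extracted board features, y: Stockfish evaluations.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

# Illustrative grid; the exact one we searched is not recorded here.
grid = GridSearchCV(
    xgb.XGBRegressor(objective="reg:squarederror"),
    param_grid={"max_depth": [4, 6, 8],
                "n_estimators": [100, 300],
                "learning_rate": [0.05, 0.1]},
    scoring="r2", cv=3)
grid.fit(X_train, y_train)

print("train R^2:", grid.score(X_train, y_train))
print("test R^2:", grid.score(X_test, y_test))
```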

Given that XGBoost and CatBoost were the two best regression models we were able to test, and both scored pretty close to each other, we looked at their feature importances. XGBoost gave its top five features as the number of full moves completed so far, Black’s total material, White’s total material, the last type of piece moved (pawn, bishop, etc.), and the white queen’s mobility. CatBoost gave its top five features as Black’s total material, White’s total material, which rank the white queen is on, which rank the rightmost white rook is on (given that both white rooks are still on the board), and the black queen’s mobility. Both models notably ranked black material higher than white material, which could be due to White’s inherent advantage from going first. The importance of White’s rightmost rook’s rank is arguably the most unexpected and interesting takeaway from either of these models.
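
For reference, a small sketch for pulling these rankings out of a fitted model (assuming the grid-searched XGBoost model and feature DataFrame `X` from above):

```python
import pandas as pd

# Rank features by the fitted booster's importance scores.
importances = pd.Series(grid.best_estimator_.feature_importances_,
                        index=X.columns)
print(importances.sort_values(ascending=False).head(5))
```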

Next, we hoped to take advantage of CatBoost’s separation of categorical and numerical features. We used the same training and testing sets as before, but this time we marked many of the features as categorical when training the model, including the last piece type moved, which sides could castle or had castled already, whose turn it was, whether there was a checkmate or stalemate in place, whether the remaining pieces were sufficient to create a future checkmate, and which color had more pieces attacking each square of the board. After running a grid search on this new model, it gave an R² testing score of 0.56, which unfortunately was 0.01 worse than the CatBoost model that marked no features as categorical.
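
A minimal sketch of this variant, with hypothetical column names standing in for our categorical features:

```python
from catboost import CatBoostRegressor

# Hypothetical names for the columns we marked as categorical;
# CatBoost expects these columns to be integer- or string-typed.
cat_cols = ["last_piece_moved", "white_can_castle", "black_can_castle",
            "white_has_castled", "black_has_castled", "turn",
            "is_checkmate", "is_stalemate"]

model = CatBoostRegressor(loss_function="RMSE", verbose=0)
model.fit(X_train, y_train, cat_features=cat_cols)
print("test R^2:", model.score(X_test, y_test))
```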

Giving linear models a second chance

As mentioned above, a vanilla linear model performs pretty poorly on the dataset. Below is a scatter plot showing the actual Stockfish evaluation on the horizontal axis and the linear model’s predicted score on the vertical axis. Ideally, this would be a straight line.
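
A minimal sketch of the fit behind this plot (reusing `X_train` and the other splits from above):

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Fit a vanilla linear model and compare its predictions to Stockfish.
lin = LinearRegression().fit(X_train, y_train)
y_pred = lin.predict(X_test)

plt.scatter(y_test, y_pred, s=2, alpha=0.3)
plt.xlabel("Stockfish evaluation (centipawns)")
plt.ylabel("linear model prediction")
plt.show()
```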

It looks like there are three distinct regions: a roughly-linear part in the center, a spread on each extreme reminiscent of bifurcation diagrams, and some blobs just outside of the linear region in the center. K-means for k=6 was able to identify these regions, as shown below. Using the centers to compute the bounds, we can see that the extreme points have scores with magnitude greater than about 9,600 and the central points have scores whose absolute value is under about 3,250. The centers are slightly asymmetric, but that is not surprising given the slightly skewed nature of the dataset towards White.
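
A sketch of this clustering step, reusing the (actual, predicted) pairs from the plot above:

```python
import numpy as np
from sklearn.cluster import KMeans

# Cluster the (actual, predicted) points into the six visible regions.
points = np.column_stack([y_test, y_pred])
km = KMeans(n_clusters=6, random_state=0).fit(points)
print(km.cluster_centers_)  # used to choose the region boundaries
```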

With the dataset split up into three different regions (low, middle, and high evaluations), we can re-train linear models and make some interesting observations. All of these had a better R² than the model fit to the full dataset. Intuitively, this is unsurprising. A score in the tens of thousands means that Stockfish identified an opportunity to reach checkmate in some number of moves. Each move from mate subtracts another 1,000 “centipawns” (the units of the evaluation) from 30,000 (or -30,000, for Black). Thus, a score of 29,000 means White is one move away from mate, and a score of -22,000 means Stockfish found an 8-move path to checkmate for Black.

Since Stockfish is actually performing a search over possible move sequences, it has much more information than one could reasonably deduce from statically observing the features visible on the board. A linear model does no such searching. This explains why our model can usually see which side has an advantage, but cannot compute a reasonable evaluation for a board that is many moves away from checkmate. We suspect that these three regions roughly correspond to the beginning of the game (board evaluations are relatively low), the middle game (one player has an advantage, but not necessarily a clear path to mate), and the endgame (high board evaluations mean mate is imminent). These aren’t perfect categories, though, since players are not guaranteed to see the sequence of moves that would lead them to victory, nor are they guaranteed to avoid grave mistakes.
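
Under this scoring convention, converting an extreme evaluation back into a distance-to-mate is simple arithmetic; a small sketch (the 9,600 threshold comes from the clustering above):

```python
from typing import Optional

MATE_BASE = 30000  # centipawns at mate, per the convention above

def moves_to_mate(score: int) -> Optional[int]:
    """Each move away from mate costs 1,000 centipawns."""
    if abs(score) < 9600:  # not in the extreme (forced-mate) region
        return None
    return (MATE_BASE - abs(score)) // 1000

print(moves_to_mate(29000))   # 1: White mates on the next move
print(moves_to_mate(-22000))  # 8: Black has an 8-move path to mate
```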

One great feature of linear models is their explainability. The coefficients tell us how important each feature is for the overall evaluation. This allows us to explore how different board characteristics matter and compare them to our intuitions.

The plots above show how some important features can differ between the three subregions. The heatmaps in the first column show the magnitude of the coefficient for square control. Recall that this feature just counts how many attackers and defenders each square has from each side. If there are more black attackers, the number will be negative, and vice versa. Dark red squares indicate those that are particularly important to control, whereas blue suggests that controlling a square might be a waste of resources. Notice that the darkest squares for the endgame heat map are next to where a king could castle. These squares are especially important to control at the end of the game because players are likely to checkmate in these areas.
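
As a rough illustration of how such a heatmap can be produced (assuming, hypothetically, that the 64 square-control coefficients occupy contiguous columns starting at `control_start` in the fitted model from above):

```python
import matplotlib.pyplot as plt
import numpy as np

# Reshape the 64 square-control coefficients into an 8x8 board.
control = np.asarray(lin.coef_[control_start:control_start + 64]).reshape(8, 8)
plt.imshow(control, cmap="RdBu_r", origin="lower")
plt.colorbar(label="coefficient")
plt.title("square-control coefficients")
plt.show()
```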

The middle heatmaps show the coefficients associated with White’s lowest valued attacker/defender for each square. The features associated with these squares are semi-categorical, where a pawn is a 1 and a king is a 6. However, the categories are still ordered by piece importance, so using a linear model is still somewhat reasonable here. The dark red squares denote those that are somehow important for White to be attacking. Similarly, the third column of heatmaps shows the coefficients for Black’s attacker/defender maps. We have flipped the sign of the coefficient so that here, too, red squares denote more important squares to control. These two heatmaps are never quite mirror images of each other, but there are some features that stand out. For the linear model fit to all of the data (the top row), both sides have the darkest red squares in the center, suggesting that controlling those squares is quite valuable in general. Similarly, the second row shows that attempting to control the other player’s queen square is a detriment.

The final column of plots shows the importance of piece mobility to a linear model for different regions of evaluations. The blue bars are for White and the green bars are for Black. There are some mystifying results here: the second plot from the bottom shows that king, knight, and pawn mobility actually hurts the model’s evaluation for Black, and we’re not sure why this is the case. If Black and White have both castled kingside, White likes to pressure Black by advancing his own kingside pawns to further his attack, which increases the white king’s mobility. This aggressive tactic appears more for White than for Black because White openings tend to be more aggressive, since White starts with an extra tempo. The same idea of White’s aggression may explain the disparity for the pawns as well: White wants highly mobile pawns to further his attack, while Black wants to keep his pawns stagnant to fend off attacks. Perhaps more interesting are the flipped coefficients for queen mobility in the second row. If we accept that scores in that range belong to the beginning of the game, this might imply that developing the queen too early is dangerous. Finally, the last row shows that king (and, to a lesser extent, pawn) mobility becomes exceptionally important for the extreme-scoring boards. If this region does generally correspond to the endgame, this is again unsurprising and confirms the conventional wisdom about this phase of the game: in typical endgames, the king and the pawns are among the few pieces still on the board.

Conclusion

The most interesting part of this project was understanding how different features evolve across various skill levels. As players get better at the game, they start using a different subset of openings and prioritizing different game features. Analyzing the game dataset with linear models also led to interesting discussions about how different board features contribute to the overall game result. However, this is still a vast dataset with many features left to explore, and it is ripe for further investigation! Check out our GitHub repository here: https://github.com/kdesai2018/DSL-final-project.
