PoE AI Part 5: Real-Time Obstacle and Enemy Detection using CNNs in TensorFlow

This post is the fifth part of a series on creating an AI for the game Path of Exile © (PoE).

  1. A Deep Learning Based AI for Path of Exile: A Series
  2. Calibrating a Projection Matrix for Path of Exile
  3. PoE AI Part 3: Movement and Navigation
  4. PoE AI Part 4: Real-Time Screen Capture and Plumbing
  5. AI Plays Path of Exile Part 5: Real-Time Obstacle and Enemy Detection using CNNs in TensorFlow

As discussed in the first post of this series, the AI program takes a screenshot of the game and uses it to form predictions that are then used to update its internal state. In this post, methods for classifying and organizing information from visual input of the game screen is discussed. I have made the source code for this project available on my GitHub. Enjoy!

Architecture of the Classification System

Figure 1: Flowchart of AI Logic

Recall from part 3 that the movement map maintains a dictionary of 3D points to labels. For example, at a given time, the bot might have the data shown in Table 1 in its internal map.

World Point Projected Point
(0,0,0) Open
(1,0,0) Open
(0,-1,0) Obstacle
(1,1,0) Obstacle
(-1,-1,0) Open

Table 1: The Internal Map

Also recall from part 2 that the projection map class allows for any pixel on the screen to be mapped to a 3D coordinate (assuming the player is always on the xy plane. This 3D coordinate is then quantized to some arbitrary precision so that the AI’s map of the world consists of an evenly spaced grid of points.

Thus, all that is needed is a method which can identify if a given pixel on the screen is part of an obstacle, enemy, item, etc. This task is, in essence, object detection. Real-time object detection is a difficult and computationally expensive problem. A simplified scheme which allows for a good trade-off between performance and accuracy is presented.

To simplify the object detection task, the game screen is divided up into equally sized rectangular regions. For a resolution of 800 by 600, a grid consisting of m = 7 rows and n = 9 columns is chosen. Twelve, four, and four pixels are removed from the bottom, left, and right edge of the screen so that the resulting sizes (792 and 588) are divisible by 9 and 7 respectively. Thus, each rectangle in the grid of the screen has a width and height of 88 and 84 pixels respectively. Figure 2 shows an image of the game screen divided using the above scheme.

Figure 2: Division of the Game Screen

One convolutional neural network (CNN) is used to classify if a cell on the screen contains an obstacle or is open. An obstacle means that there is something occupying the cell such that the player cannot stand there (for instance a boulder). Examples of open and closed cells are shown in Figure 3.

Figure 3: Labeling of Image Cells

A second CNN is used to identify items and enemies. Given a cell on the screen the CNN classifies the cell as either containing an enemy, an item, or nothing.

In order to only target living enemies a third CNN is used as a binary classifier for movement. Given a cell on the screen, the 3rd CNN determines if movement is occurring in the cell. Only cells that contain movement are passed into the second CNN. This CNN then predicts if these cells contain either items or enemies. Items labels are detected as movement by toggling item highlighting in successive screenshots.

Image data for movement detection is created by taking 2 images of the screen in rapid succession and only preserving the regions in the images that are significantly different. This is implemented using the numpy.where function (16 is chosen arbitrarily).

I1 = #... Image 1
I2 = #... Image 2
R = np.where(np.abs(I1 - I2) >= 16, I2, 0)

To summarize, screenshots are captured from the game screen and input to each of the 3 CNNs. The first CNN detects obstacles in the screen cells. The 3D grid points within each cell on the screen are then labeled accordingly in the movement map. The internal map keeps a tally of the predictions for each cell and reports the most frequent prediction class when a cell is queried. The second and third CNNs are used in conjunction to detect enemies and items.

The Dataset

Using screenshots taken with the ScreenViewer class, a training data set is manually constructed. Currently, the dataset only consists of data from the level Dried Lake in act 4 of the game. The dataset consists of over 14,000 files across 11 folders and is roughly 164MB in size. Screenshots of the dataset are shown in Figure 4.

Figure 4: The Training Dataset

In the dataset, images in Closed are cells containing obstacles. The first CNN uses the folders Closed, Open, and Enemy. The second CNN uses the folders Open, Enemy, and Item. The third CNN uses the folders Move and NoMove.


A somewhat modest CNN architecture is employed for the AI; two sequences of convolutional and pooling layers are followed by 3 fully-connected layers. The architecture is shown below in Figure 5.

Figure 5: CNN Architecture

Cross-validation accuracy in the mid to high 90s is acheived in roughly 20 to 30 epochs through the entire dataset. Epochs are performed by randomly sampling batches of size 32 from the training data until the appropriate number of samples are drawn. Training on a NVidia GTX 970 takes roughly 5 to 10 minutes.

Using Concurrency for Better Performance

To improve the performance of the AI, the CNN detection is performed concurrently. This allows for a speed-up as numpy and TensorFlow code avoids the global interpreter lock issue from which normal Python code suffers. Code to launch the classification thread for enemy targeting follows.

    #Gets a current list of enemy positions in a thread safe manner
    def GetEnemyPositions(self):
        lut = self.tlut         #Last update time for enemy cell positions
        rv = self.ecp[:]        #Make copy of current list to prevent race conditions
        if self.tllct == lut:
            return []           #No update since last time
        self.tllct = lut        #Note time for current results
        return rv

    def StartTargetLoop(self):
        self.ctl = True
        thrd = Thread(target = self.TargetLoop)
        return True
    def TargetLoop(self):
        while self.ctl:     #Loop while continue targeting loop flag is set
            self.ts.DetectMovement(self.GetScreenDifference())  #Find cells that have movement
            self.ts.ClassifyDInput(self.sv.GetScreen())         #Determine which cells contain enemies or items
            tecp = self.ts.EnemyPositionsToTargets()            #Obtain target locations
            self.ecp = tecp                                     #Update shared enemy position list
            self.tlut = time.time()                             #Update time
    def StopTargetLoop(self):
        self.ctl = False        #Turn off continue targeting loop flag

Figure 6: Thread Logical Organization

Thus, classification is performed concurrently and data members containing the predictions are provided to the main thread in a thread-safe manner using mutex locks. Figure 6 illustrates the logical organization of the threads and mutex locks. In the figure, ecp and pct are the data members of the Bot class that contain the enemy cell positions and predicted cell types respectively.

The Result

The following video summarizes the project and contains over four minutes of the AI playing Path of Exile.

Figure 7: PoE AI Footage

More footage of the latest version of the bot is available on my YouTube channel.