Figure 1 shows the number of records and unique names per year. There are spikes in the number of records around 1910 and 1940. The total number of unique names increases progressively more rapidly into the 20th century. In the most recent several years, both trends reverse to an extent.

**Figure 1: Total Number of Names and Records**

Despite spikes in the number of records, name diversity increases more steadily. The result is that a large number of people born in the middle 20th century have common names. The most popular names for each year are shown in Figure 2.

**Figure 2: Most Popular Names Over Time**

The number of occurrences and ranking of a particular name can be visualized over time. The distributions for the names “Nicholas”, “Alice”, and “Dwight” are shown in Figure 3. The green bar and red dot highlight the ranking at a specific birth year. A lower ranking indicates a more popular name, with 0 being the most popular name.

**Figure 3: Occurrences and Popularity of Several Names**

The graph for the name “Nicholas” shows an interesting trend. The number of occurrences of the name sharply declines from around the year 2000. The ranking of the name declines as well, but at a slower rate.

To investigate this trend, the number of occurrences of the 1st and 5th most popular names are computed for each year. The result is plotted in Figure 4 alongside the total number of records for each year.

**Figure 4: Occurrences of Top Names**

The total number of records and occurrences follow a similar pattern initially. The lines diverge around 1980. After this point, it requires fewer occurrences to be one of the top names. This is why the ranking of the name “Nicholas” declines more slowly than its number of occurrences around that time.

The percentage of records accounted for by the top *n* names is computed for various values of *n*. The result is shown in Figure 5.
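This share computation reduces to summing the counts of the *n* most common names for a year and dividing by the year's total. A minimal pure-Python sketch; the name counts below are hypothetical, not values from the SSA data:

```python
from collections import Counter

def top_n_share(name_counts, n):
    """Fraction of all records accounted for by the n most common names."""
    total = sum(name_counts.values())
    top = sum(count for _, count in Counter(name_counts).most_common(n))
    return top / total

# Hypothetical counts for a single birth year.
counts = {"Mary": 500, "John": 400, "Linda": 300, "Gary": 200, "Zelda": 100}
share = top_n_share(counts, 2)  # (500 + 400) / 1500 = 0.6
```

Running this for every year and several values of *n* yields the curves of Figure 5.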

**Figure 5: Percentage of Records with Common Names**

A small percentage of popular names represents the majority of records. The rapid increase in the number of names around 1910 reduces the percentage accounted for by the most popular names. The top names regain popularity in the middle of the century before declining into the 21st century.

**Figure 6: Records with Common Names in 1950 and 2010**

This trend is examined further by arranging the top four hundred names into groups of 100. The result is displayed in a pie chart. In 1950, the top 100 names represent almost 60% of records. In 2010, they represent less than half of that.

Next, the number of names in several of the top popularity percentiles is computed. The result is shown in Figure 7. The rapid increase starting around 1990 indicates more diversity in the top percentiles.

**Figure 7: Number of Names in Top Popularity Percentiles**

The sharp increase in records around 1940 is not accompanied by a sharp increase in names. In fact, the opposite is true. The result is a large number of records with common names and few with rare ones.

Over the remainder of the century, preferences shift to more uncommon names as diversity increases. Many of the most common names of the 1950s are still popular, but they share an increasing amount of their popularity with less traditional ones. In the 21st century, it requires fewer occurrences to be one of the top names and the top percentiles of popularity contain more names than in the past.

The numbers of male and female records are computed for each year. The result is shown in Figure 8. The chart shows a sizable imbalance between male and female records.

**Figure 8: Count of Male and Female Records**

There are several factors that may contribute to this imbalance. The first is that records with names that occur fewer than five times in a year are omitted. The second is that the data includes immigrants as well as natural-born citizens. The effect of the first can be approximated, while that of the second is more difficult to quantify.

Next, the percentage of records with common names is computed for each year and gender. A common name is defined as one of the top 100 names. The charts reveal that men are typically given more common names than women after 1910.

**Figure 9: Percentage of Records with Common Names by Gender**

Unfortunately, the dataset does not contain names that occur fewer than 5 times in the nation per year. However, the SSA also maintains statewide data. Names are excluded from the statewide data if they occur fewer than 5 times in a given state. By taking the difference between the national and state data, the number of records with these rare names can be approximated. The number of records missing from the state data for each gender is shown in Figure 10.
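A minimal sketch of this difference computation, assuming the national and statewide counts have already been aggregated by year and gender (the totals below are hypothetical, chosen only to illustrate the shape of the calculation):

```python
def missing_from_state_data(national_totals, state_totals):
    """Approximate the number of rare-name records per (year, gender):
    records present in the national counts but absent from the state counts
    because the name occurs fewer than 5 times in any single state."""
    return {key: national_totals[key] - state_totals.get(key, 0)
            for key in national_totals}

# Hypothetical totals keyed by (year, gender).
national = {(2016, "F"): 1_000_000, (2016, "M"): 1_050_000}
statewide = {(2016, "F"): 940_000, (2016, "M"): 1_020_000}
missing = missing_from_state_data(national, statewide)
# In this toy example, more female records are missing from the state data.
```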

**Figure 10: Records Missing in State Data**

As can be seen, there are more women with names that are excluded from the state counts but included in the national counts due to their rarity. It seems reasonable that the imbalance in male and female records is largely a result of the relative prevalence of very rare female names.

Next, the statewide data is considered. The ratio of records to unique names is computed and displayed in a choropleth map. States with higher name diversity are shown in lighter colors.

**Figure 11: Name Diversity by State in 2016**

By making successive maps, the popularity of a name can be visualized over time. The popularity of the name Mary is shown in Figure 12. The name peaks in popularity in the middle 20th century before entering a decline.

**Figure 12: Name Popularity over Time**

The gender imbalance in each state is computed for records with a birth year of 2016. The imbalance is computed as the difference between the number of male and female records divided by the total number of records. The result is shown in a choropleth map in Figure 13. The map is similar to the map for name diversity.
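The imbalance metric described above is simple to compute per state; a small sketch with hypothetical record counts:

```python
def gender_imbalance(male, female):
    """Signed imbalance: (male - female) / total.
    Positive values indicate more male records."""
    return (male - female) / (male + female)

# Hypothetical record counts for one state in 2016.
imbalance = gender_imbalance(55_000, 45_000)  # 10_000 / 100_000 = 0.1
```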

**Figure 13: Gender Imbalance Among Records by State in 2016**

To investigate the similarity further, a scatter plot is constructed which plots gender imbalance against state name diversity. Next, a trend line is fit to the data. The R² of the fit is 0.894 and the F-statistic is 419.9 with a corresponding p-value of 2.8e-26.

**Figure 14: Relationship Between Gender Imbalance and Name Diversity**

There is clear evidence that states with more diverse names have more male records in the dataset. Further, there are more male records with common names than female records overall. Together, these facts support the notion that the gender imbalance is due to the omission of more female records than male records. It appears that especially rare names are more common among women than among men.

*An intermediate activation volume produced by a convolutional neural network predicting the attractiveness of a person.*

Does beauty truly lie in the eye of its beholder? This chapter explores the complex array of factors that influence facial attractiveness to answer that question or at least to understand it better.

The Chicago Face Database is a dataset created at the University of Chicago by Debbie S. Ma, Joshua Correll, and Bernd Wittenbrink [1]. It consists of portrait photos of several hundred participants along with corresponding measurement and questionnaire data. Each of the participants had several of their facial features measured. For example, the measurements include:

- The distance between the outer edge of the subject’s nostrils at the widest point.
- The average distance between the inner and outer corner of the subject’s eye.
- The distance from the subject’s pupil center to the hairline.

**Figure 1: Chicago Face Dataset Subject AM 238**

Subject | Attractive | Feminine | Masculine | Face Width Cheek | Average Eye Height |
---|---|---|---|---|---|
AM-238 | 3.120 | 1.769 | 4.292 | 634 | 46.250 |
AF-200 | 4.111 | 5.630 | 1.357 | 676 | 65.250 |
LM-243 | 2.778 | 1.179 | 4.857 | 653 | 48.750 |

**Table 1: Chicago Face Dataset Sample Data**

In addition to the numerical data, a questionnaire was given to another set of participants along with the images. The participants were asked to rate several qualities of each subject on a scale from 1 to 7. These qualities include:

- How attractive the subject is with respect to people of the same race and gender.
- How feminine the subject is with respect to people of the same race and gender.
- How masculine the subject is with respect to people of the same race and gender.

An example of an image from the dataset is shown in Figure 1. A sample of several rows and columns of the dataset is listed in Table 1.

The literature often cites facial symmetry, averageness, and secondary sexual characteristics as influences on attractiveness. In an often-cited paper, Little et al. describe plausible reasons why these particular features are considered attractive from an evolutionary perspective [2]. Further, they claim that despite cultural differences in perceptions of facial attractiveness, favorable cross-cultural traits exist.

Symmetry is stated to be attractive as it represents the ideal outcome of development. Asymmetry is asserted to arise from both genetic and environmental issues. The issues cited include genetic defects, mutations, nutrition, and infections. Humans are claimed to have evolved to find these features less attractive as they suggest less than ideal fitness.

Averageness is stated to be found attractive as an indicator of genetic diversity. Genetic diversity improves fitness in regards to immunity and reduced risk of genetic issues in offspring. Again, humans are claimed to have evolved to find averageness to be more attractive.

Secondary sexual characteristics represent the features that distinguish male faces from female faces. These traits develop in part due to hormonal differences between men and women. Again, these differences are declared to be found attractive in that they indicate genetic robustness. Little et al. state that there is solid evidence for a positive relationship between femininity and attractiveness in women. The link between masculinity and attractiveness in men is claimed to be less pronounced.

The effect of averageness is examined in the Chicago Face Dataset. The matrix of measurement data is normalized so that each column has mean 0 and variance 1. Next, the L2 norm of each sample is computed. The result is a single value representing the total irregularity of a subject with respect to all facial measurements. Next, linear regression is used to analyze the relationship between attractiveness and irregularity.
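A pure-Python sketch of this irregularity computation (in practice it would be done with a numerical library; the sample rows below are hypothetical measurements, one list per subject):

```python
import math

def irregularity_scores(rows):
    """Standardize each column to mean 0 and variance 1, then take the
    L2 norm of each standardized row as a total-irregularity score."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c))
            for c, m in zip(cols, means)]
    scores = []
    for row in rows:
        z = [(x - m) / s for x, m, s in zip(row, means, stds)]
        scores.append(math.sqrt(sum(v * v for v in z)))
    return scores

# Hypothetical two-feature measurements; the third subject is exactly average.
rows = [[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]]
scores = irregularity_scores(rows)
```

A score of 0 means the subject sits exactly at the average of every measurement; larger scores mean greater total deviation from average.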

The purpose of linear regression is to find a vector of coefficients together with a bias that minimize the prediction error of the model. The formulas for linear regression and prediction error are shown in Equations 1 and 2. In these equations, *X* is the matrix of sample data, *y* is the vector of targets, *ŷ* is the vector of predictions, and ‖·‖₂ denotes the L2 norm.

**Equation 1: Linear Regression**

**Equation 2: Linear Regression Error Term**
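The equation images themselves are not reproduced here; standard formulations consistent with the surrounding description (*X* the sample matrix, *y* the targets, *ŷ* the predictions) would be:

```latex
% Equation 1: least-squares estimate of the coefficients \beta and bias b
\hat{\beta},\, \hat{b} \;=\; \operatorname*{arg\,min}_{\beta,\, b} \; \bigl\| y - (X\beta + b) \bigr\|_2^2

% Equation 2: the prediction error being minimized
E \;=\; \bigl\| y - \hat{y} \bigr\|_2^2, \qquad \hat{y} = X\hat{\beta} + \hat{b}
```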

A linear regression model is fit which regresses the attractiveness rating against irregularity. The result is shown in Figure 2.

**Figure 2: Irregularity Scatter Plot and Trend Line**

The *coefficient of determination*, R², measures the percentage of variation in the target variable that is explained by the explanatory variable(s). Its formula is shown in Equation 3, with *n* being the number of data points, *yᵢ* the *i*-th target value, and *ȳ* the mean of the targets.

**Equation 3: The Coefficient of Determination**
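A standard form of the formula, using the symbols defined above (the original equation image is not reproduced):

```latex
% Equation 3: the coefficient of determination
R^2 \;=\; 1 \;-\; \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```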

As can be seen, there is a minor negative relationship between irregularity and attractiveness. However, an R² of 0.052 does not provide substantial evidence for a relationship between the two variables. This number implies that only 5% of the variation in attractiveness can be explained by averageness. It is important to note that the relationship is inverted here, as the *x*-axis represents distance from average, or irregularity. A negative relationship shows that attractiveness increases as the feature measurements move closer to average.

Two sorted lists are constructed from the sample data. In one list, the samples are sorted by their averageness. In the other list, the samples are sorted by their attractiveness. The average absolute difference between these two orderings is 175.7 with a standard deviation of 126.9. If the subjects stand in a line ordered by averageness, each subject would move past 175 people on average to re-order the line by attractiveness. In a dataset of roughly 600 people, this is anything but impressive. Figure 3 shows the distance each subject needs to move to reorder the line.

**Figure 3: Averageness Ordering to Attractiveness Ordering**

Next, the effect of symmetry is evaluated. The dataset contains several separate measurements for the left and right portions of the face. The absolute differences between the left and right measurements are computed. The result is 6 features measuring facial asymmetry. A multiple regression model is constructed which predicts attractiveness from these 6 derived features. Figure 4 shows a scatter plot of the target values against the predictions.

**Figure 4: Scatter Plot of Predictions for Symmetry Model**

The plot is labeled with the adjusted R² of the fit. The adjustment to R² accounts for the fact that models with more explanatory variables spuriously obtain higher values. The formula is shown in Equation 4, where *p* is the number of parameters in the model. In this case, the model has 6 parameters. Adjusted R² is a more robust estimate of model performance when multiple explanatory variables are involved.

**Equation 4: The Adjusted Coefficient of Determination**
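A standard form of the adjustment, consistent with the definitions above (*n* data points, *p* model parameters):

```latex
% Equation 4: the adjusted coefficient of determination
\bar{R}^2 \;=\; 1 \;-\; (1 - R^2)\,\frac{n - 1}{n - p - 1}
```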

The scatter plot does not show a clear relationship between the predictions and attractiveness. This is reflected in the negative adjusted R². An adjusted R² of -0.006 implies that the model has no explanatory power in predicting attractiveness.

This model is compared to a baseline model that always predicts the mean standardized attractiveness: 0. The general linear *F-statistic* is used to perform the comparison. The F-statistic and associated *p-value* quantify the additional variance in attractiveness explained by a model beyond what is explained by a baseline model. The formula for the F-statistic is shown in Equation 5, with *n* the number of samples and *p* the degrees of freedom of the model. The computed F-statistic is 0.42 with an associated *p-value* of 0.87. There is no evidence that the model provides a better fit than one that merely predicts the mean standardized attractiveness.

**Equation 5: The F-Statistic for Multiple Regression**
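A standard form of the general linear F-statistic for comparing a full model against a reduced (baseline) model, where SSE is the sum of squared errors and df the residual degrees of freedom of each model:

```latex
% Equation 5: general linear F-statistic
F \;=\; \frac{\bigl(\mathrm{SSE}_{\mathrm{reduced}} - \mathrm{SSE}_{\mathrm{full}}\bigr) \,/\, \bigl(df_{\mathrm{reduced}} - df_{\mathrm{full}}\bigr)}{\mathrm{SSE}_{\mathrm{full}} \,/\, df_{\mathrm{full}}}
```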

The lack of a significant relationship here does not prove that symmetry is useless in predicting attractiveness. There are many other possible explanations, including poor features, poor data, or simple random variation. Regardless, the lack of strong relationships in the above models illustrates that there are many aspects to facial attractiveness, and relying too heavily on any one of them limits model performance. The reality is that real-world data is often noisy and full of complex, unintuitive relationships.

**Insight:** The effects of symmetry and averageness appear overstated.

A multiple regression model is constructed which predicts the attractiveness of a sample given the 40 facial measurement features. This model is subsequently analyzed to gain insight about the facial features which affect attractiveness. In order to fairly compare the effects of each variable, each measurement is first standardized. If this is not performed, then measurements with larger values may appear to be weighted less heavily. For example, the measurements for facial width at the cheek are much larger than those for eye height. Thus, each column is centered by subtracting its mean and then scaled by its standard deviation. The result is that each entry is now a z-score for its column.

A baseline model is first constructed for the sake of comparison. The baseline model always predicts the mean standardized attractiveness: 0. The baseline model achieves a *root mean squared error* (RMSE) of 0.77. The formula for the RMSE is shown in Equation 6.

**Equation 6: The Root Mean Squared Error**
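A standard form of the metric, consistent with the symbols used above:

```latex
% Equation 6: root mean squared error over n samples
\mathrm{RMSE} \;=\; \sqrt{\frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2}
```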

*Cross-validation* is used to further assess the performance of the model. Figure 5 shows a depiction of a train/test cross-validation split on a hypothetical dataset of 100 samples.

**Figure 5: Cross-Validation Split**

In Figure 5, each cell represents an entry in the dataset. By dividing the dataset into training and testing sets, the performance of the model can be evaluated on samples with which it has not been trained. This validation is needed to ensure the model is not simply memorizing the target values and has the ability to generalize. When performing both standardization and cross-validation, care must be taken to prevent *data leakage*. Data leakage occurs when the model is given information about the held-out testing data. To avoid it, the column means and standard deviations must be computed only on the training data.
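A minimal sketch of a leakage-free standardization step, computing the column statistics on the training split only and then applying the same transform to both splits (the toy dataset is hypothetical):

```python
def standardize_split(train, test):
    """Fit column means and standard deviations on the training rows only.
    Computing these statistics on the full dataset would leak information
    about the test set into the model."""
    cols = list(zip(*train))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((x - m) ** 2 for x in c) / len(c)) ** 0.5
            for c, m in zip(cols, means)]

    def transform(rows):
        return [[(x - m) / s for x, m, s in zip(row, means, stds)]
                for row in rows]

    return transform(train), transform(test)

# Hypothetical 80/20 split of a toy two-feature dataset.
data = [[float(i), float(2 * i + 1)] for i in range(10)]
train, test = data[:8], data[8:]
train_std, test_std = standardize_split(train, test)
```

Note that only the training columns end up with mean 0; the test rows are transformed with the training statistics, so they may land outside the training range.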

The dataset is repeatedly divided into training and testing sets and then standardized. The model is fit using the training set and then scored using the testing set. Scoring is accomplished by comparing the predicted values of the model to the real values from the dataset. Performing training and testing 512 times, the average cross-validation loss is 0.691 with a standard deviation of 0.054. An additional model is fit to the entire dataset. The target values are plotted against the predictions of this model and the result is shown in Figure 6. The R² of the fit is 0.247.

**Figure 6: Scatter Plot of Predictions for Facial Measurement Model**

The coefficient vectors of each of the 512 linear regression models are recorded and analyzed. The average and standard deviation for each coefficient value is computed and the results of the top 6 most influential positive and negative features are listed in Table 2.

Positive Feature Weights | | | Negative Feature Weights | | |
---|---|---|---|---|---|
Name | Avg. | Std. | Name | Avg. | Std. |
L Eye H | +8.09% | +2.93% | Avg Eye Height | -14.48% | +5.67% |
R Eye H | +7.69% | +2.88% | Lip Fullness | -4.89% | +2.15% |
Lip Thickness | +4.42% | +2.13% | Chin Length | -4.27% | +2.02% |
Cheekbone Height | +3.82% | +1.55% | Forehead | -4.18% | +2.30% |
Midface Length | +3.76% | +1.88% | Pupil Lip L | -3.47% | +1.29% |
Upper Head Length | +2.98% | +1.60% | Faceshape | -2.79% | +1.75% |

**Table 2: Most Influential Linear Regression Coefficients**

Features with negative coefficients decrease attractiveness as they increase in value; those with positive coefficients do the opposite. The complicated relationship amongst the variables is illustrated in the table. Individual eye height measurements positively affect attractiveness while average eye height negatively affects it. It appears that the effects of these two coefficients cancel out at least somewhat. A similar paradox is apparent with lip fullness and lip thickness. Due to this, it is difficult to determine the true importance of the various features.

This situation arises from *multicollinearity* in the input data. Multicollinearity exists when there are strong linear relationships between the explanatory variables. When this occurs, the related variables do not change in isolation. Change in one variable typically results in a proportional change in the related variables. These proportional changes can have additive or subtractive effects to the change induced by the original variable. It is this behavior that makes interpretation difficult.

With facial measurements, the cause of this behavior is intuitive. If the height of the left eye increases, then the measurement for the average eye height also increases. Lip fullness and thickness are similarly related. A table of the top positive and negative correlations among facial measurements is shown in Table 3.
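A table like this can be produced by computing the Pearson correlation for every feature pair and sorting by magnitude. A pure-Python sketch on hypothetical columns (the feature names echo the dataset's, but the values are made up):

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlated_pairs(columns):
    """Return (r, feature_i, feature_j) for every pair, sorted by |r|."""
    pairs = [(pearson(columns[a], columns[b]), a, b)
             for a, b in combinations(columns, 2)]
    return sorted(pairs, key=lambda t: -abs(t[0]))

# Hypothetical measurement columns; the first two are perfectly collinear.
cols = {"L Eye H": [1.0, 2.0, 3.0, 4.0],
        "Avg Eye Height": [2.0, 4.0, 6.0, 8.0],
        "Chin Length": [4.0, 1.0, 3.0, 2.0]}
top = correlated_pairs(cols)[0]
```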

Positive Correlations | | | Negative Correlations | | |
---|---|---|---|---|---|
R | Feature i | Feature j | R | Feature i | Feature j |
+0.985 | Midcheek Chin R | Cheeks avg | -0.825 | Face Width Mouth | Heart Shapeness |
+0.984 | Midcheek Chin L | Cheeks avg | -0.811 | Cheekbone Prominence | Face Roundness |
+0.977 | L Eye H | Avg Eye Height | -0.783 | Face Length | Faceshape |
+0.976 | R Eye H | Avg Eye Height | -0.761 | Face Width Mouth | Cheekbone Prominence |
+0.975 | R Eye W | Avg Eye Width | -0.752 | Heart Shapeness | Face Roundness |
+0.973 | L Eye W | Avg Eye Width | -0.731 | Nose Length | Noseshape |
+0.969 | Pupil Lip R | Pupil Lip L | -0.697 | Pupil Lip L | fWHR |
+0.954 | Lip Thickness | Lip Fullness | -0.695 | Pupil Lip R | fWHR |

**Table 3: Most Correlated Measurement Features**

A *lasso regression* model is used to address the multicollinearity. The term lasso is an abbreviation for “least absolute shrinkage and selection operator.” Lasso regression penalizes the absolute value of the regression coefficients to help prevent situations where one coefficient cancels the effect of another. The error term for lasso regression is shown in Equation 7. The error term is the same as that for linear regression with the addition of an L1 regularization term.

**Equation 7: Lasso Regression Error Term**
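A standard form of the penalized error term, consistent with the linear regression error above plus an L1 term weighted by the regularization strength Γ (the symbol used in the tables below):

```latex
% Equation 7: lasso regression error term -- least-squares loss plus an
% L1 penalty on the coefficients, weighted by \Gamma
E \;=\; \bigl\| y - (X\beta + b) \bigr\|_2^2 \;+\; \Gamma \, \| \beta \|_1
```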

As Γ increases, the coefficients of the model are forced toward 0. An appropriate value of Γ can remove collinear variables from the model while maximizing model performance. A large number of models are created as Γ is varied from 0 to 1. Several of the coefficient values are plotted against Γ and the result is shown in Figure 6.

**Figure 6: Lasso Coefficient Shrinkage**

The number of non-zero coefficients in the model is shown in Figure 7 for various values of Γ. The color represents the R² of the fit. As the number of non-zero coefficients decreases, the prediction power of the model steadily worsens.

**Figure 7: Number of Nonzero Lasso Regression Coefficients**

Γ is chosen to be 0.02. Performing training and testing 512 times, the average cross-validation loss is 0.689 with a standard deviation of 0.054. Lower values of Γ exist that further improve performance, but the introduction of more collinear terms makes interpretation more difficult. By further tuning Γ, an average cross-validation loss of 0.680 with a standard deviation of 0.053 is achieved.

One additional model is fit to the entire dataset. The target values are plotted against the predicted values and the result is shown in Figure 8. The R² of the fit is 0.246.

**Figure 8: Predictions of the Facial Measurement Model**

The cross-validation loss suggests the model is able to predict the attractiveness score of a subject to roughly ± 0.68 of its true value (from 1-7). If a subject had an attractiveness score of 6.2, for instance, the model might predict a value between 5.52 and 6.88. An individual prediction might well fall outside of this range, but the number is descriptive of the overall performance of the model. The ability of a relatively simple linear model to predict attractiveness based on facial measurements is suggestive that objective measures of facial attractiveness may exist.

Again, the coefficient vectors of each of the 512 lasso regression models are recorded and analyzed. The average and standard deviation of each coefficient value are computed. In addition, the intervals between the minimum and maximum values of each coefficient are computed. The coefficients whose intervals do not contain 0 are listed in Table 4.

Feature | Avg. | Std. | Min | Max |
---|---|---|---|---|
Pupil Lip L | -0.092 | +0.017 | -0.137 | -0.035 |
Noseshape | -0.072 | +0.012 | -0.116 | -0.036 |
Chin Length | -0.055 | +0.009 | -0.081 | -0.002 |
Lip Fullness | -0.047 | +0.025 | -0.141 | -0.012 |
Midbrow Hairline L | -0.037 | +0.014 | -0.089 | -0.003 |
Asymmetry Pupil Lip | -0.015 | +0.002 | -0.021 | -0.010 |
Luminance Median | +0.021 | +0.003 | +0.012 | +0.031 |
Cheekbone Prominence | +0.021 | +0.005 | +0.006 | +0.040 |
Pupil Top L | +0.028 | +0.007 | +0.007 | +0.046 |
Nose Width | +0.054 | +0.009 | +0.027 | +0.086 |
Midface Length | +0.077 | +0.027 | +0.001 | +0.142 |

**Table 4: Non-Zero Lasso Regression Coefficients for Γ = 0.0012**

Even using lasso regression, the effects of multicollinearity can still be seen. The correlation between pupil to lip length and midface length is 0.638, yet both features appear in the model with opposite signs. Γ can be further increased to remove these counteracting effects, though model performance begins to suffer.

Feature | Avg. | Std. | Min | Max |
---|---|---|---|---|
Asymmetry Pupil Lip | -0.062 | +0.006 | -0.082 | -0.044 |
Pupil Lip L | -0.050 | +0.014 | -0.080 | -0.001 |
Face Width Cheeks | -0.033 | +0.008 | -0.053 | -0.010 |
Luminance Median | +0.082 | +0.008 | +0.055 | +0.106 |
Cheekbone Height | +0.104 | +0.022 | +0.035 | +0.142 |
Nose Length | +0.151 | +0.011 | +0.119 | +0.193 |

**Table 5: Non-Zero Lasso Regression Coefficients for Γ = 0.02**

As seen in Table 5, the larger value of Γ forces more coefficients toward 0, resulting in a simpler model. The model appears to rate subjects with wider faces and longer pupil-to-lip lengths as less attractive. Interestingly, the asymmetry measurement for the pupil-to-lip length has a significant negative effect. This provides some support for the influence of symmetry, though its effect is overshadowed by other variables. In the positive direction, the model appears to favor high cheekbones, longer noses, and more luminous faces. The distributions of these coefficients for each of the 512 lasso regression models are shown in Figures 9 and 10 along with the intervals containing their values.

**Figure 9: Significant Positive Measurement Features**

**Figure 10: Significant Negative Measurement Features**

In this case, there is a tradeoff between a higher-performance model and one that is easy to interpret. Though few of the individual coefficients are significant, the model is able to achieve modest performance by combining a larger number of features. If only coefficients that are significant at the 95% confidence level are used, the R² of the fit decreases to 0.110. An intuitive explanation for this may be that facial attractiveness results from the combination of a wide variety of facial features.

**Insight:** Objective measures of attractiveness appear to exist.

Regardless of the features that are most important, the lasso regression model makes a substantial improvement over the baseline model. This is reflected in the F-statistic for the model. For Γ = 0.02, the F-statistic is 6.48 with a correspondingly small p-value. This suggests that facial measurements provide useful information in predicting attractiveness.

Next, the subjective features of the dataset are analyzed. These features are scores from 1 to 7 on a variety of perceived qualities such as attractiveness, masculinity, and femininity. Scores represent averages over a number of participants evaluating the images of the dataset. For instance, subject AF-200 has a masculinity score of 1.357 which is her average score given by 28 evaluators for that quality.

A lasso regression model is constructed which predicts attractiveness based on all other subjective features. Attractiveness is excluded for obvious reasons. Care must again be taken to account for multicollinearity. The top positive and negative correlations among the subjective features are shown in Table 6.

Positive Correlations | | | Negative Correlations | | |
---|---|---|---|---|---|
R | Feature i | Feature j | R | Feature i | Feature j |
+0.843 | Angry | Disgusted | -0.952 | Feminine | Masculine |
+0.834 | Angry | Threatening | -0.683 | Age | Babyface |
+0.734 | Dominant | Threatening | -0.631 | Threatening | Trustworthy |
+0.725 | Afraid | Sad | -0.606 | Angry | Happy |
+0.687 | Disgusted | Threatening | -0.587 | Angry | Trustworthy |
+0.683 | Happy | Trustworthy | -0.573 | Happy | Sad |

**Table 6: Most Correlated Subjective Features**

Thankfully, and intuitively, the correlations among these variables are weaker than those among the measurement variables.

Using the subjective features alone, the lasso regression model achieves an average cross-validation loss of 0.459 with a standard deviation of 0.040. An additional model is fit to the entire dataset. The target values are plotted against the model predictions and the plot is labeled with the R² of the fit: 0.653. The result is shown in Figure 11.

**Figure 11: Predictions of the Subjective Feature Model**

Interestingly, this is a substantial improvement over the performance of the regression model based on the facial measurements. This implies that subjective features are more useful overall in predicting attractiveness.

**Insight:** Subjective features are better predictors of attractiveness than facial measurements.

Next, the coefficient vector of the regression model is analyzed in the same fashion as earlier. 512 lasso regression models are fit using separate cross-validation splits and the average and standard deviation of each feature coefficient is computed. The weights of the top positive and negative subjective features are shown in Table 7 along with their sign.

Positive Feature Weights | | | Negative Feature Weights | | |
---|---|---|---|---|---|
Name | Avg. | Std. | Name | Avg. | Std. |
Feminine | +34.34% | +0.64% | Age | -7.37% | +0.31% |
Masculine | +26.34% | +0.65% | Sad | -5.54% | +0.33% |
Trustworthy | +5.99% | +0.48% | Threatening | -3.84% | +0.50% |
Dominant | +3.87% | +0.47% | Unusual | -3.06% | +0.24% |
Afraid | +3.04% | +0.36% | Babyface | -2.41% | +0.30% |
Angry | +0.35% | +0.47% | Surprised | -1.66% | +0.24% |

**Table 7: Most Influential Subjective Features**

The model scores people who appear old, sad, and threatening as being less attractive. It is important to note that the age variable represents the average age estimate made by the participant evaluators and not the true age of the subject. This implies that people perceived as youthful are also perceived as attractive. Somewhat paradoxically, subjects are rated as being less attractive for having a “babyface.” Nevertheless, there is a subtle distinction between the appearance of youth and having a babyface.

The model scores people who appear more feminine, masculine, and trustworthy as more attractive. The relationships between femininity, masculinity, and attractiveness are more easily visualized using scatter plots. Figure 12 shows femininity plotted against attractiveness with separate trend lines for men and women.

**Figure 12: Relationship Between Attractive and Feminine**

There is a large difference between the femininity scores of men and women. There is also a large difference in the relationship between femininity and attractiveness for men versus women. Attractiveness in women is very highly correlated with femininity. This intuitively makes sense, though deeper interpretation is somewhat ambiguous. Depending on the evaluator, femininity might be perceived as being attractive, or attractiveness might be perceived as a quality of femininity. The subjective nature of these features makes interpretation more difficult. For men, femininity has little effect on attractiveness.

**Figure 13: Relationship Between Attractive and Masculine**

Figure 13 shows masculinity plotted against attractiveness with separate trend lines for men and women. Interestingly, masculinity has a stronger negative effect on attractiveness in women than it has a positive effect on men. From the coefficients seen earlier, the regression model appears to miss this effect. The above plots are combined into a 3D scatter plot which shows the interactions between the 3 variables and age.

**Figure 14: Relationship Between Attractive, Masculine, and Feminine**

Another important aspect of these figures is that fewer men are rated as being attractive. This is despite the fact that the number of male and female subjects is nearly equal, with 290 male and 307 female samples. The majority of the data points for men are clustered in the lower half of the range of attractiveness. This effect confounds the relationships presented in Table 7.

**Figure 15: Distribution of Attractiveness Scores for Men and Women**

The distributions of attractiveness scores for men and women are shown in Figure 15 along with their corresponding sampling distributions. Due to the large number of participants, there is almost no overlap between the sampling distributions. A Welch's t-test comparing the two means confirms that the difference is statistically significant. It appears that despite being asked to control for gender, the evaluators still rated men as being less attractive on average.

**Insight:** Men are rated as being less attractive than women on average.
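The comparison behind this insight can be reproduced with a short calculation. The sketch below implements Welch's t-statistic directly in NumPy on synthetic score vectors; the group sizes match the study (290 men, 307 women), but the means and spreads are illustrative placeholders, not the actual data.

```python
import numpy as np

def welch_t(a, b):
    """Welch's t-statistic for two samples with unequal variances."""
    m1, m2 = a.mean(), b.mean()
    v1, v2 = a.var(ddof=1), b.var(ddof=1)
    n1, n2 = len(a), len(b)
    return (m1 - m2) / np.sqrt(v1 / n1 + v2 / n2)

rng = np.random.default_rng(0)
# Illustrative attractiveness scores: men centered lower than women
men = rng.normal(2.8, 0.8, 290)
women = rng.normal(3.4, 0.9, 307)
t = welch_t(men, women)
```

With nearly 600 samples, even a modest difference in means produces a large-magnitude t-statistic, which is consistent with the non-overlapping sampling distributions in Figure 15.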

Statistically significant differences between men and women exist in several of the other subjective features. Men are more likely to be perceived as being masculine, threatening, or dominant. Women are more likely to be perceived as being feminine, trustworthy, unusual, or sad. The distributions of several of these other features are shown in Figure 16.

**Figure 16: Distribution Differences for Men and Women**

Several of the other subjective features have effects that hold irrespective of gender. The appearance of trustworthiness, for example, is correlated with attractiveness in both genders. Figure 17 presents a scatter plot of attractiveness and trustworthiness with a single trend line for both men and women.

**Figure 17: Relationship Between Attractive and Trustworthy**

It appears there is a modest positive relationship between appearing trustworthy and appearing attractive. This effect ties together the two observations that men are more likely to be rated both less attractive and less trustworthy than women.

A lasso regression model is constructed using both the facial measurements and the subjective features. The regularization constant is chosen to be 0.0075. The model achieves an average cross-validation loss of 0.427 with a standard deviation of 0.035. The adjusted R² of the fit is 0.672. If the dummy variables for race and gender and the subjective race and gender estimates are both introduced, model accuracy improves further.
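The source code for the lasso fit is not shown here; as an illustration of the technique, the sketch below fits a lasso model by proximal gradient descent (ISTA) on synthetic data. The data shapes and the regularization constant are placeholders, not the study's 0.0075 or its actual features.

```python
import numpy as np

def lasso_fit(X, y, lam, lr=0.1, iters=2000):
    """Minimize (1/2n)||Xw - y||^2 + lam * ||w||_1 by proximal gradient (ISTA)."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n                            # gradient of smooth part
        w = w - lr * grad                                       # gradient step
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft threshold
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
w_true = np.zeros(10)
w_true[:2] = [2.0, -3.0]                       # only two informative features
y = X @ w_true + rng.normal(scale=0.1, size=200)
w = lasso_fit(X, y, lam=0.05)
```

The soft-thresholding step drives the weights of uninformative features exactly to zero, which is why lasso lends itself to ranking feature importance as in the coefficient analyses that follow.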

To further improve performance, separate models for men and women are constructed and their results are combined. Finally, cubic polynomial interaction terms for the measurement data are introduced. Cubic terms are used due to the cubic relationship between measurement length and 3D facial structure. The target values are plotted against the predictions of the final model in Figure 18.

**Figure 18: Final Model Prediction Scatter Plot**

Again, cross-validation splits are repeatedly formed and the performance of each model is evaluated. This process is repeated 512 times. In addition, one model is fit to the entire dataset to evaluate the overall goodness of fit. The results for each of the feature sets are shown in Table 8. The minimum cross-validation loss is 0.373 with a standard deviation of 0.036. The corresponding adjusted R² is 0.823.

| Feature Set | Avg. Loss | Std. Loss | Adj. R² |
|---|---|---|---|
| Baseline | 0.770 | 0.060 | 0.000 |
| Measurements | 0.680 | 0.053 | 0.246 |
| Subjective | 0.459 | 0.040 | 0.626 |
| Measurement + Subjective | 0.427 | 0.035 | 0.672 |
| All | 0.397 | 0.040 | 0.718 |
| All, Separate Gender | 0.380 | 0.036 | 0.802 |
| All, Separate Gender + Cubic | 0.373 | 0.035 | 0.823 |

**Table 8: Lasso Regression Cross-Validation Performance**

As can be seen, the addition of each feature set provides more power in predicting attractiveness. This shows that the feature sets complement each other, at least partially. For example, by using the facial measurements in addition to the subjective features, the model achieves a substantial improvement in performance. This suggests that while the majority of attractiveness is subjective, there are anatomical characteristics which are perceived as being attractive: although most of the variation in attractiveness is explained by the subjective features, the facial measurements contribute additional useful information.

This explanation intuitively makes sense, though some caution should be exercised. There are correlations amongst the subjective features and facial measurements. For example, the correlation between the subjective feature “Surprised” and eye size is 0.348. This is less of a profound insight and more a description of a biological function. Nevertheless, the effect of eye size on attractiveness may be overshadowed by the effect of “Surprised,” though the measurement seems to be closer to the root cause.

The differences between attractiveness in men and women are explored further. The subjects are divided into two groups and 512 pairs of regression models are created with one model fit to men and one to women. The mean and standard deviation of each coefficient is computed separately for each group. The mean coefficient weights for men and women are shown in Figure 19 in two pie charts. Features with a positive effect are shown in blue while those with a negative effect are shown in red.

**Figure 19: Coefficient Effect Weights for Men and Women**

For women, the influence of femininity dominates. This result agrees with the scatter plot seen earlier comparing femininity and attractiveness. The most influential effect for men is the appearance of trustworthiness. Masculinity is also important, but its effect is weaker than that of femininity in women. Also of note is that the most important features are all subjective. This reinforces the notion that the subjective features are better predictors of attractiveness than the facial measurements. In order to explore the differences among the measurement features, separate models are fit only to the measurement data. The results are shown in Figure 20.

**Figure 20: Measurement Feature Differences Between Men and Women**

From the plot, nose width is more important in determining attractiveness in men than in women. The converse is true with facial luminance. It is important to note that the bar plot only shows the magnitude of the effect and not the sign. The top 10 most influential features are listed for men and women in Table 9 along with their sign.

| Name (Men) | Avg. | Std. | Name (Women) | Avg. | Std. |
|---|---|---|---|---|---|
| Cheekbone Height | +16.12% | 3.82% | Nose Length | +17.34% | 2.58% |
| Nose Width | +14.86% | 3.16% | Bottom Lip Chin | -12.97% | 5.46% |
| Nose Length | +11.69% | 2.93% | Luminance Median | +11.54% | 1.86% |
| Bottom Lip Chin | -9.88% | 5.93% | Cheekbone Height | +7.99% | 3.38% |
| Midcheek Chin L | +7.70% | 4.08% | Pupil Top R | +7.63% | 3.43% |
| Forehead Height | +5.20% | 2.33% | Face Width Cheeks | -6.85% | 1.72% |
| Lip Fullness | -5.01% | 2.70% | L Eye W | +6.21% | 2.33% |
| Asymmetry Pupil Lip | -4.99% | 1.59% | Asymmetry Pupil Lip | -5.93% | 1.33% |
| Chin Length | -4.32% | 5.36% | Chin Length | -4.38% | 5.36% |
| Asymmetry Pupil Top | -3.79% | 1.66% | L Eye H | +3.00% | 2.55% |

**Table 9: Signed Feature Weights for Men and Women**

The table clarifies the directions of the relationships for several of the values. A number of the features have similar effects between men and women. Exceptions include nose width, facial width at the cheeks, forehead height, and lip fullness. Facial luminance has an important positive effect on attractiveness in women that is not present in men.

Next, the image data from the data set is analyzed. A regression model is constructed which predicts an image given a vector of values. Each sample vector contains the subjective scores along with the race and gender variables. Each target vector is a flattened vector of the pixels in the image.

A *ridge regression* model is constructed. Ridge regression fits a linear function to the data using least-squares criterion along with an L2 regularization term. By regularizing the coefficients of the model, ridge regression can help to achieve better model performance when there are large numbers of correlated variables [3]. The error term for ridge regression is given in Equation 6. Ridge regression is used instead of lasso regression for performance reasons due to the high dimensionality of the data.

$$L(\beta) = \sum_{i=1}^{n} \left( y_i - x_i^T \beta \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

**Equation 6: Ridge Regression Error Term**

The error function is the same as that for linear regression with the addition of an L2 regularization term. After trying a large number of candidate values, the regularization constant $\lambda$ is chosen to be 1.0.

The regression model fits an *m* × *p* matrix of coefficients, where *m* is the number of features in the sample vector and *p* is the number of pixels in the image. Each row of the matrix represents facial features which are related to the corresponding column in the input vectors. For example, if the first column of the input vector contains the age, then the first row in the matrix contains regression coefficients related to age. In this way, the regression model synthesizes a weighted average of different facial features to arrive at a final image.
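A minimal sketch of this construction follows, using the closed-form ridge solution on tiny synthetic "images" (8×8 pixels, 5 features per sample; all shapes and data are illustrative, not the study's).

```python
import numpy as np

rng = np.random.default_rng(2)
m, h, w = 5, 8, 8                    # features per sample, image height/width
n = 100                              # number of subjects
X = rng.normal(size=(n, m))          # sample vectors (subjective scores, etc.)
Y = rng.normal(size=(n, h * w))      # flattened pixel targets

lam = 1.0                            # L2 regularization constant
# Closed-form ridge solution: W = (X^T X + lam*I)^(-1) X^T Y
W = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ Y)

pred = X[:1] @ W                     # predicted (flattened) image for one subject
feature_map = W[0].reshape(h, w)     # coefficients for feature 0, as an image
```

Each row of `W` is a pixel-space pattern tied to one input feature, which is exactly the structure exploited later when the coefficient rows are reshaped and visualized.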

The model’s predictions for two of the input samples are shown in Figure 21. The original image is shown on the left and the predicted image is shown on the right. The shadowing around the face is an artifact of the wide variety of hairstyles among the subjects.

**Figure 21: Image Predictions**

The rows of the coefficient matrix can be analyzed to determine the portions of the face that are strongly related to a given feature. Each row of coefficients is reshaped into an image and results for several of the features are shown in Figure 22.

**Figure 22: Feature Activations**

In the image, the regions of the face that contribute most to a feature are shown in lighter yellow. It appears that the shapes and positions of the eyes, nose, mouth, chin, and forehead are most important in determining attractiveness. Elongated, curved eyebrows and lips appear to be more attractive. The definition and size of the base of the nose are influential as well. Also of interest is the definition in the chin and jowl region. There are also regions of activation on the forehead, implying that forehead shape is important. However, interpretation of this result is made difficult by the wide variety of hairstyles in the dataset.

The regression model is also capable of providing several interesting functions. Since the model predicts the appearance of a person based on a vector of coefficients, semi-random faces can be created by generating random vectors of coefficients. Random standardized sample vectors are generated using a standard normal distribution. The dummy variables for race and gender must be handled carefully to ensure they are mutually exclusive. Several such randomly generated images are shown in Figure 23.
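One way to generate such input vectors is sketched below. The feature layout is hypothetical (12 continuous scores, 4 mutually exclusive race dummies, 1 gender dummy); the point is the handling of the dummy variables so that exactly one race is active per sample.

```python
import numpy as np

rng = np.random.default_rng(3)
NS, NF, NR = 6, 12, 4                # samples, continuous features, race dummies

Z = rng.standard_normal((NS, NF))    # standardized subjective scores
R = np.zeros((NS, NR))
R[np.arange(NS), rng.integers(0, NR, NS)] = 1.0  # exactly one race per sample
G = rng.integers(0, 2, (NS, 1)).astype(float)    # single gender dummy
V = np.hstack([Z, R, G])             # final random input vectors
# faces = V @ W would then yield NS random flattened images
```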

**Figure 23: Randomly Generated Faces with Low Variance**

By increasing the standard deviation of the distribution used to generate samples, more irregular images can be generated. Randomly generated images with a standard deviation of 3 are shown in Figure 24.

**Figure 24: Randomly Generated Faces with High Variance**

The model can also be used to manipulate images via transformations to the input vectors. A subject may be aged by increasing the age score in the corresponding vector. Or a subject may be made to look happier by modifying the appropriate value. Even the gender or race of a subject can be changed. Several examples follow.

Subject WF-022 is a white female evaluated to be roughly 20 years old. By modifying the age in the sample vector, the subject is aged by roughly 55 years. The result is shown in Figure 25.

**Figure 25: Age Modification**

Subject AF-242 is an Asian female with a happiness score of 1.93 on a scale of 1 to 7. The subject is made to look happier by setting her happiness and sadness scores to the maximum and minimum values attained in the dataset respectively. The subject is also made to look sadder by setting her happiness and sadness scores to the minimum and maximum values respectively. The results are shown in Figures 26 and 27.

**Figure 26: Happiness Modification**

**Figure 27: Sadness Modification**

Subject WF-022 is a white female. By modifying the race variables, the subject is transformed into a Latino female. The result is shown in Figure 28.

**Figure 28: Race Modification**

Subject LM-224 is a Latino male. By modifying the gender variables, the subject is transformed into a Latino female. The result is shown in Figure 29.

**Figure 29: Gender Modification**

Subject WM-220 is a white male with the lowest trustworthiness score observed in the study. Again, by manipulating the relevant variables, the subject is made to look more trustworthy and happy. The result is shown in Figure 30.

**Figure 30: Trustworthiness Modification**

The above functionality has several applications, including the simulated aging seen on missing person reports. In addition, it can be used to visually evaluate the performance of the model. Image transformations that are less convincing indicate that the model has more difficulty determining what is influential for the given feature. For example, if modification of the masculinity feature does not produce a convincing image transformation, it may indicate that the model has difficulty determining the features that make a person look masculine.

Despite their simplicity, linear models frequently perform well in practice and their results are relatively easy to interpret. More complicated models like kernel support vector machines and artificial neural networks can achieve better performance but perform *black box* prediction. The term black box implies that the model is used like an appliance: samples go in one end of the black box and predictions come out the other. The inner workings of the model are opaque.

Linear models obviously break down beyond some point. For instance, a subject with an impossibly wide nose might be rated as arbitrarily attractive by one of the models seen earlier; human intuition disagrees. Using a model to predict values outside the range of its training data is known as *extrapolation*. It is necessary to understand the shortcomings of a model to properly interpret its results.

The linear models in this chapter provide several key insights.

- The effects of symmetry and averageness appear overstated.
- Objective measures of attractiveness appear to exist.
- Subjective features are better predictors of attractiveness than facial measurements.
- Men are rated as being less attractive than women on average.

Returning to the prompt for this topic, it appears that facial attractiveness has both a subjective and objective component. The relationships between these components are complex, but certain physical characteristics appear to predispose attractiveness.

[1] Ma, D. S., Correll, J., & Wittenbrink, B. (2015). The Chicago face database: A free stimulus set of faces and norming data. Behavior Research Methods, 47(4), 1122-1135.

[2] Little, A. C., Jones, B. C., & DeBruine, L. M. (2011). Facial attractiveness: evolutionary based research. Philosophical Transactions of the Royal Society B: Biological Sciences, 366(1571), 1638-1659. doi:10.1098/rstb.2010.0404

[3] Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning (Vol. 1, pp. 241-249). New York: Springer Series in Statistics.

All addresses on the Bitcoin network are queried. The number of addresses with at least one satoshi is 24,473,765 at the time of the query. The resulting addresses are sorted by the amount of Bitcoin they contain. The list is divided into quantiles and the wealth of each quantile is plotted in a bar plot.
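The grouping step can be sketched with NumPy on synthetic balances; the heavy-tailed log-normal distribution below is illustrative, not actual blockchain data.

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic address balances in satoshi; a log-normal gives a heavy tail
balances = np.sort(rng.lognormal(mean=10, sigma=3, size=100_000))

# Divide the sorted addresses into 10 quantiles and total the wealth in each
quantiles = np.array_split(balances, 10)
wealth = np.array([q.sum() for q in quantiles])
share = wealth / balances.sum()      # fraction of all wealth per decile
```

The same grouping logic extends to the top-percent and at-least-*n*-satoshi breakdowns described next.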

Next, the wealth of the top *k* percent is computed for several values of *k*. The result is shown in a bar plot, along with the number of addresses in each group.

Then, the number of addresses with at least *n* satoshi is computed for several values of *n*. The result is shown in a bar plot along with the corresponding percentiles.

Finally, the addresses are grouped by their first character and the result is displayed in a pie chart.

As can be seen, there is a large imbalance of wealth among Bitcoin addresses. However, the true balance of wealth is obscured by the fact that a single wallet can generate multiple addresses.


First, the required modules are installed and imported. This code requires *numpy*, *tensorflow*, and *TFANN*.

```python
import numpy as np
from TFANN import ANNC
```

*TFANN* can be installed via *pip* or by copying the source code into a file named *TFANN.py*.

```shell
pip install TFANN
```

Next, a 1024 × 2 matrix of random data points is generated using *numpy*. Class labels are created following a polynomial inequality. The polynomial used is

$$f(x, y) = -x^2 + 0.1x - 0.6y + 0.2.$$

The inequality used to generate class labels is

$$f(x, y) > 0.$$

Setting the polynomial to zero gives a downward-facing parabola nearly centered on the y-axis. Points below the parabola satisfy the inequality and are labeled as 1. Points above the curve are labeled as 0. Code to generate the data matrix and class labels follows.

```python
def F(x, y):
    return -np.square(x) + .1 * x - .6 * y + .2

#Training data
A1 = np.random.uniform(-1.0, 1.0, size = (1024, 2))
Y1 = (F(A1[:, 0], A1[:, 1]) > 0).astype(int)
#Testing data
A2 = np.random.uniform(-1.0, 1.0, size = (1024, 2))
Y2 = (F(A2[:, 0], A2[:, 1]) > 0).astype(int)
```

The function curve is shown in Figure 1 along with a scatter plot of the generated data matrix.

**Figure 1: The Generated Data**

The color indicates the value of *F(x, y)* and the curve is *F(x, y) = 0*. The same plot colored instead with class labels is shown in Figure 2.

**Figure 2: Generated Data with Class Labels**

As can be seen above, the data is divided into two classes: *0* and *1*. The goal is to create a model which can determine if a data point belongs to class *0* or to class *1*. This is known as *binary classification* as there are two class labels.

Next, a multi-layer perceptron (MLP) network is fit to the data generated earlier. In this example, the function used to generate class labels is known. This is typically not the case. Instead, the model iteratively updates its parameters so as to reduce the value of a *loss function*.

A two layer MLP is constructed. The activation function *tanh* is used after the first hidden layer and the output layer uses linear activation (no activation function). The architecture of the network is illustrated in Figure 3.

**Figure 3: MLP Network Architecture**

The green dots on the neurons in the hidden layer indicate *tanh* activation. Next, this network architecture is specified in a format that TFANN accepts and an ANN classifier is constructed.

```python
NA = [('F', 4), ('AF', 'tanh'), ('F', 2)]
```

The list of tuples is the network architecture. *F* indicates a fully-connected layer and the following number indicates the number of neurons in the layer. *AF* indicates an activation function and the following string indicates the name of the function. As can be seen, the network architecture specifies a fully-connected layer with 4 neurons, followed by *tanh* activation, followed by another fully-connected layer with 2 neurons. The final layer is the output layer.

The docstring for the *_CreateANN* function provides detailed information on the types of network operations that are currently supported by *TFANN*.

```
In [109]: help(TFANN._CreateANN)
Help on function _CreateANN in module TFANN:

_CreateANN(PM, NA, X)
    Sets up the graph for a convolutional neural network from
    a list of operation specifications like:
    [('C', [5, 5, 3, 64], [1, 1, 1, 1]), ('AF', 'tanh'),
     ('P', [1, 3, 3, 1], [1, 2, 2, 1]), ('F', 10)]
    Operation Types:
        AF:     ('AF', <name>)                      Activation Function 'relu', 'tanh', etc
        C:      ('C', [Filter Shape], [Stride])     2d Convolution
        CT:                                         2d Convolution Transpose
        C1d:    ('C1d', [Filter Shape], Stride)     1d Convolution
        C3d:    ('C3d', [Filter Shape], [Stride])   3d Convolution
        D:      ('D', Probability)                  Dropout Layer
        F:      ('F', Output Neurons)               Fully-connected
        LRN:    ('LRN')
        M:      ('M', Dims)                         Average over Dims
        P:      ('P', [Filter Shape], [Stride])     Max Pool
        P1d:    ('P1d', [Filter Shape], Stride)     1d Max Pooling
        R:      ('R', shape)                        Reshape
        S:      ('S', Dims)                         Sum over Dims
    [Filter Shape]: (Height, Width, In Channels, Out Channels)
    [Stride]:       (Batch, Height, Width, Channel)
    Stride:         1-D Stride (single value)
    PM: The padding method
    NA: The network architecture
    X:  Input tensor
```

The final layer of a classification network requires that class labels be encoded as 1-hot vectors along the final axis of the output. Since the network predicts a single binary class label for each sample, the final layer should have 2 neurons. In this way, the final layer outputs a matrix of dimension *n* × 2, where *n* is the number of input samples. The function *argmax* is applied along the final dimension of the output matrix to obtain the index of the class label.
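The encoding and decoding steps can be sketched as follows, assuming binary labels and a small batch of illustrative network scores:

```python
import numpy as np

labels = np.array([0, 1, 1, 0])
# 1-hot encode: column j is 1 where the label equals j
onehot = np.eye(2)[labels]            # shape (n, 2)

# The network outputs an (n, 2) matrix of scores; argmax recovers the label
scores = np.array([[2.1, -0.3],
                   [0.2,  1.7],
                   [-1.0, 0.4],
                   [3.0,  0.1]])
decoded = np.argmax(scores, axis=1)
```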

Next the network architecture is passed to the constructor of the *ANNC* class, along with the input shape and other parameters. ANNC is an abbreviation for *Artificial Neural Network for Classification*.

```python
annc = ANNC(A1.shape[1:], NA, batchSize = 1024, maxIter = 4096,
            learnRate = 1e-3, verbose = True)
```

The first argument to the *ANNC* constructor is the shape of a single input sample. In this case, the shape is a vector of length *2*. The *batchSize* argument indicates the number of samples to use at a time when training the network. The batch indices are selected randomly for each training iteration. The *learnRate* parameter specifies the learning rate used by the training method (Adam by default). The *maxIter* argument limits the number of training iterations to some fixed amount. Finally, *verbose* controls whether the loss is displayed after each iteration of training. Detailed descriptions of the constructor arguments are available via *help(ANNC)*.

*TFANN* follows the *fit*, *predict*, *score* interface used by scikit-learn. Thus, fitting and scoring the network can be accomplished as follows.

```python
annc.fit(A1, Y1)        #Fit using training data only
s1 = annc.score(A1, Y1) #Performance on training data
s2 = annc.score(A2, Y2) #Testing data
print('Train: {:04f}\tTest: {:04f}'.format(s1, s2))
YH = annc.predict(A2)   #Predicted labels
```

The *score* method uses *accuracy* as the metric for classification models. This is the number of samples labeled correctly divided by the number of samples. Some care should be used with this metric in problems where class labels are imbalanced.
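Accuracy is simple to compute directly, and the imbalance caveat is easy to demonstrate: a trivial majority-class predictor scores well on skewed labels despite learning nothing. The data below is a contrived illustration.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of samples labeled correctly."""
    return np.mean(y_true == y_pred)

# Imbalanced labels: 90% of samples belong to class 0
y = np.array([0] * 90 + [1] * 10)
always_zero = np.zeros(100, dtype=int)  # predictor that ignores its input
acc = accuracy(y, always_zero)          # high accuracy, useless model
```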

Due to the simple nature of the problem, the network is able to achieve very high accuracy on the cross-validation data. After *4096* iterations, the network achieves roughly *98%* accuracy. The predictions on the testing data are plotted below in Figure 4.

**Figure 4: Model Cross-Validation Predictions (Accuracy = 98.4%)**

The reader is encouraged to modify the data, network architecture, and parameters to explore the features provided by *TFANN*.

To facilitate rapid prediction, pricing information is queried using the web API of Poloniex. A URL is provided to the API and a JSON containing the historical price information of a specified cryptocurrency is returned.

```python
import json
import numpy as np
import os
import pandas as pd
import urllib.request

def JSONDictToDF(d):
    '''
    Converts a dictionary created from json.loads to a pandas dataframe
    d:      The dictionary
    '''
    n = len(d)
    cols = []
    if n > 0:   #Place the columns in sorted order
        cols = sorted(list(d[0].keys()))
    df = pd.DataFrame(columns = cols, index = range(n))
    for i in range(n):
        for coli in cols:
            df.at[i, coli] = d[i][coli] #.at replaces the deprecated set_value
    return df

def GetAPIUrl(cur):
    '''
    Makes a URL for querying historical prices of a crypto from Poloniex
    cur:    3 letter abbreviation for cryptocurrency (BTC, LTC, etc)
    '''
    u = ('https://poloniex.com/public?command=returnChartData&currencyPair=USDT_'
         + cur + '&start=1420070400&end=9999999999&period=7200')
    return u

def GetCurDF(cur, fp):
    '''
    cur:    3 letter abbreviation for cryptocurrency (BTC, LTC, etc)
    fp:     File path (to save price data to CSV)
    '''
    openUrl = urllib.request.urlopen(GetAPIUrl(cur))
    r = openUrl.read()
    openUrl.close()
    d = json.loads(r.decode())
    df = JSONDictToDF(d)
    df.to_csv(fp, sep = ',')
    return df

#%%Path to store cached currency data
datPath = 'CurDat/'
if not os.path.exists(datPath):
    os.mkdir(datPath)
#Different cryptocurrency types
cl = ['BTC', 'LTC', 'ETH', 'XMR']
#Columns of price data to use
CN = ['close', 'high', 'low', 'open', 'volume']
#Store data frames for each of above types
D = []
for ci in cl:
    dfp = os.path.join(datPath, ci + '.csv')
    try:
        df = pd.read_csv(dfp, sep = ',')
    except FileNotFoundError:
        df = GetCurDF(ci, dfp)
    D.append(df)
#%%Only keep range of data that is common to all currency types
cr = min(Di.shape[0] for Di in D)
for i in range(len(cl)):
    D[i] = D[i][(D[i].shape[0] - cr):]
```

After execution, *D[i]* is a pandas DataFrame containing historical price data for the cryptocurrency *cl[i]*.

New samples are constructed that pair sequences of past samples with the samples that follow them. In this way, a regression model can be fit which predicts *K* time periods into the future given *N* periods of data from the past. A helper class which accomplishes this follows.

```python
import numpy as np

class PastSampler:
    '''
    Forms training samples for predicting future values from past values
    '''
    def __init__(self, N, K):
        '''
        Predict K future samples using N previous samples
        '''
        self.K = K
        self.N = N

    def transform(self, A, Y = None):
        M = self.N + self.K     #Number of samples per row (sample + target)
        #Matrix of sample indices like: {{1, 2, ..., M}, {2, 3, ..., M + 1}}
        I = np.arange(M) + np.arange(A.shape[0] - M + 1).reshape(-1, 1)
        B = A[I].reshape(-1, M * A.shape[1], *A.shape[2:])
        ci = self.N * A.shape[1]    #Number of features per sample
        return B[:, :ci], B[:, ci:] #Sample matrix, Target matrix
```

The above class is applied to the original time sequence data to obtain the desired sample and target matrices.

```python
from PastSampler import PastSampler

#%%Features are channels
C = np.hstack([Di[CN] for Di in D])[:, None, :]
HP = 16                 #Holdout period
A = C[0:-HP]
SV = A.mean(axis = 0)   #Scale vector
C /= SV                 #Basic scaling of data
#%%Make samples of temporal sequences of pricing data (channel)
NPS, NFS = 256, 16      #Number of past and future samples
ps = PastSampler(NPS, NFS)
B, Y = ps.transform(A)
```

In the above code, the shapes of *B* and *Y* are (samples, *NPS*, channels) and (samples, *NFS*, channels), respectively. A holdout period is maintained to assess the performance of the network. The number of time units in the period is controlled by *HP*.

The *TFANN* module is used to create an artificial neural network. *TFANN* can be installed using pip with the following command.

```shell
pip install TFANN
```

A 1D convolutional neural network is constructed which transforms the input volume of historical data into predictions. The past *NPS* samples are transformed into a prediction about the next *NFS* samples. The *C1d* option in the network architecture specification indicates 1-dimensional convolution.

```python
#%%Architecture of the neural network
from TFANN import ANNR

NC = B.shape[2]
#2 1-D conv layers with relu followed by 1-d conv output layer
ns = [('C1d', [8, NC, NC * 2], 4), ('AF', 'relu'),
      ('C1d', [8, NC * 2, NC * 2], 2), ('AF', 'relu'),
      ('C1d', [8, NC * 2, NC], 2)]
#Create the neural network in TensorFlow
cnnr = ANNR(B[0].shape, ns, batchSize = 32, learnRate = 2e-5,
            maxIter = 64, reg = 1e-5, tol = 1e-2, verbose = True)
cnnr.fit(B, Y)
```

The architecture of the CNN is shown below in Figure 1. The top set of parenthesized values indicate the filter dimension while the bottom denote the stride.

**Figure 1: 1D CNN Architecture**

More information and the source code for the *ANNR* class are available on GitHub.

Using the above network, the next *NFS* time steps can be predicted. These predictions can in turn be used for subsequent predictions so that prediction can be made an arbitrary amount into the future. Code to accomplish this follows.

```python
PTS = []    #Predicted time sequences
P = B[[-1]] #Most recent time sequence
for i in range(HP // NFS + 1):  #Repeat prediction
    YH = cnnr.predict(P)
    P = np.concatenate([P[:, NFS:], YH], axis = 1)
    PTS.append(YH)
PTS = np.hstack(PTS).transpose((1, 0, 2))
A = np.vstack([A, PTS]) #Combine predictions with original data
A = np.squeeze(A) * SV  #Remove unit time dimension and rescale
C = np.squeeze(C) * SV
```

Using *PredictFull*, the outputs of intermediate layers in the network can be visualized. Figure 2 shows an input sample as it is transformed by subsequent layers of the network.

**Figure 2: Intermediate Layer Outputs**

```python
import matplotlib.pyplot as mpl

nt = 4
PF = cnnr.PredictFull(B[:nt])
for i in range(nt):
    fig, ax = mpl.subplots(1, 4, figsize = (16 / 1.24, 10 / 1.25))
    ax[0].plot(PF[0][i])
    ax[0].set_title('Input')
    ax[1].plot(PF[2][i])
    ax[1].set_title('Layer 1')
    ax[2].plot(PF[4][i])
    ax[2].set_title('Layer 2')
    ax[3].plot(PF[5][i])
    ax[3].set_title('Output')
    fig.text(0.5, 0.06, 'Time', ha = 'center')
    fig.text(0.06, 0.5, 'Activation', va = 'center', rotation = 'vertical')
    mpl.show()
```

Notice how in subsequent layers the input data is reduced from *NPS* to *NFS* time units.

The result of the predictions can be visualized using matplotlib.

```python
CI = list(range(C.shape[0]))
AI = list(range(C.shape[0] + PTS.shape[0] - HP))
NDP = PTS.shape[0]  #Number of days predicted
for i, cli in enumerate(cl):
    fig, ax = mpl.subplots(figsize = (16 / 1.5, 10 / 1.5))
    hind = i * len(CN) + CN.index('high')
    ax.plot(CI[-4 * HP:], C[-4 * HP:, hind], label = 'Actual')
    ax.plot(AI[-(NDP + 1):], A[-(NDP + 1):, hind], '--', label = 'Prediction')
    ax.legend(loc = 'upper left')
    ax.set_title(cli + ' (High)')
    ax.set_ylabel('USD')
    ax.set_xlabel('Time')
    ax.axes.xaxis.set_ticklabels([])
    mpl.show()
```

The resulting plot is shown below in Figure 3.

**Figure 3: Cryptocurrency Predictions**

The network predicts a dip in the prices of each cryptocurrency followed by a rally. The predicted behavior is similar to Bitcoin’s price over the past few days. More up-to-date predictions are available from this Twitter bot.

By definition, an economic bubble is a situation in which an asset is traded within a price range that far exceeds its intrinsic value. So, the question is: what *is* the intrinsic value of Bitcoin? The purpose of this post is to explain some of the technical details of Bitcoin so as to gain a better idea of its value.

Bitcoin is an electronic currency that is maintained by a peer-to-peer software application known as *Bitcoin Core*. Bitcoin transactions are broadcast over the peer-to-peer network and Bitcoin Core records every transaction ever issued in a ledger that is maintained by each node running the software.

These transactions are sealed with a cryptographic hash inside blocks which are linked together in a chain known as the blockchain. Each block in the chain contains the hash of the previous block making it infeasible to tamper with transactions once they are a part of the blockchain. Figure 1, taken from the original paper, illustrates the blockchain [1].

**Figure 1: The Blockchain**

The blockchain maintains the integrity of the transaction record and prevents double-spending and the repudiation of transactions.

The price of Bitcoin has historically been very erratic. Its frequent spikes and crashes have resulted in some interesting trends. Figure 2 shows the relationship of the price of Bitcoin in USD (taken from blockchain.info) to the popularity of two search trends (taken from Google Trends): *“Bitcoin Bubble”* and *“Bitcoin Scam”*. The correlations for each plot are 0.7704 and 0.9445, respectively.

**Figure 2: Bitcoin Search Trends**

The above trends hint that there is a lot of uncertainty and speculation about the present and future value of Bitcoin.

According to the pseudonymous creator of Bitcoin, Satoshi Nakamoto, the intrinsic value of Bitcoin stems from the resources that are expended to mine it: CPU time and electricity [1].

Bitcoin mining is the process of cryptographically sealing blocks and adding them to the blockchain in a way that prevents tampering. Mining is accomplished by searching for *nonces* (numbers) that cause the SHA256 hash of the current block to have a certain number of leading 0s. The number of leading zeros as of Block #491713 is 18, and the hexadecimal representation of the hash digest is

00000000000000000048fddd20e468a0c9fab27c81ccade0cfd4c91e857c74e3.

Now, all SHA256 digests have 64 hexadecimal digits. Further, the nature of the SHA256 function makes it so that the best presently known way to find a nonce which produces the required number of leading 0s is trial and error. Thus, the probability of finding a nonce that creates a valid block decreases exponentially with the number of 0s required.

If it is assumed that the appearance of digits in the digest is random, then the probability of finding an appropriate nonce is presently

1/16^18 ≈ 2.1 × 10^-22.

This means that the expected number of nonces that will need to be tried before an appropriate hash is found is

16^18 ≈ 4.7 × 10^21.

This difficulty is what prevents malicious users from tampering with the transaction record. If a block is modified, a new nonce will need to be found for that block and for all subsequent blocks in the chain!
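The nonce search can be sketched in a few lines of Python. This is only a toy illustration: real Bitcoin hashes an 80-byte block header twice with SHA256 and compares the result against a 256-bit target rather than counting hex digits, and the difficulty here is kept tiny so the loop terminates quickly.

```python
import hashlib

def LeadingZeros(h):
    #Number of leading zero hex digits in a digest string
    return len(h) - len(h.lstrip('0'))

def FindNonce(data, nz):
    #Try nonces until the SHA256 digest has at least nz leading zero hex digits
    nonce = 0
    while True:
        h = hashlib.sha256((data + str(nonce)).encode()).hexdigest()
        if LeadingZeros(h) >= nz:
            return nonce, h
        nonce += 1

nonce, h = FindNonce('block data', 2)
print(nonce, h)     #A digest starting with at least two 0s
```

With 2 required zeros, roughly 16^2 = 256 attempts are expected; at 18 zeros, the expected count grows to 16^18.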

The present hash rate of the Bitcoin network is 9,935,312.21 tera (10^12) hashes per second (TH/s). The hash rate is the number of these nonces that are being tried every second across the network. Given this hash rate, a nonce can be expected to be found every

16^18 / (9.9353 × 10^18) ≈ 475

seconds, or just under 8 minutes. The Bitcoin software adjusts the difficulty automatically so that blocks are produced roughly every 10 minutes.

Mining is currently performed on application-specific integrated circuits (ASICs) that are designed to search for nonces quickly and efficiently. Current ASIC miners are capable of 14 TH/s at an efficiency of ~0.098 W per GH/s. The expected amount of time for a single miner to find a good nonce at this difficulty is

16^18 / (14 × 10^12) ≈ 3.4 × 10^8 seconds, or roughly 93,700

hours.

Thus, assuming a power supply with 90% efficiency, running a top-of-the-line miner until a nonce is found consumes roughly

(14,000 GH/s × 0.098 W/GH/s ÷ 0.9) × 93,700 h ≈ 1.4 × 10^5

kWh.

The average cost of electricity in the US is 12 cents per kilowatt-hour. Thus, the cost of mining 1 block is roughly

$0.12/kWh × 1.4 × 10^5 kWh ≈ $17,000.

Using the above hash rate and power consumption numbers, the entire Bitcoin network currently consumes roughly

9.9353 × 10^9 GH/s × 0.098 W/GH/s ≈ 9.7 × 10^5

kW

of electricity. Some, such as Vitalik Buterin, the creator of Ethereum, have raised concerns about the environmental impact of Bitcoin mining.
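The figures above can be reproduced with a few lines of arithmetic. The constants below restate the assumptions from the text (18 leading zero hex digits, the quoted network hash rate, the ASIC specifications, PSU efficiency, and electricity price); the results are back-of-the-envelope estimates, not authoritative measurements.

```python
#Back-of-the-envelope mining estimates from the assumptions in the text
E_HASHES = 16.0 ** 18                   #Expected hashes to find one block
NET_RATE = 9935312.21e12                #Network hash rate (H/s)
ASIC_RATE = 14e12                       #Single miner hash rate (H/s)
ASIC_KW = 14e3 * 0.098 / 0.9 / 1e3      #Wall power of one miner (kW), 90% PSU
tBlock = E_HASHES / NET_RATE            #Seconds per block for the whole network
tMiner = E_HASHES / ASIC_RATE / 3600    #Hours per block for a single miner
eMiner = ASIC_KW * tMiner               #Energy for one block (kWh)
cBlock = 0.12 * eMiner                  #Electricity cost per block (USD)
cBTC = cBlock / 12.5                    #Electricity cost per BTC (USD)
print(round(tBlock), round(tMiner), round(eMiner), round(cBlock), round(cBTC))
```

Running this gives roughly 475 seconds per network block, ~93,700 hours for a lone miner, and an electricity cost on the order of $17,000 per block.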

As the difficulty in mining increases, the cost of electricity can outweigh the reward for mining. Due to this, Bitcoin is primarily mined in locations with abundant and cheap electricity (next to hydro-electric dams, for example).

Presently, 12.5 BTC is mined for every new block that is created. Thus, the current reward in USD for mining 1 block is roughly 12.5 times the prevailing BTC price, and the current electricity cost of mining a single bitcoin is roughly

$17,000 / 12.5 ≈ $1,400.

According to the estimated cost of electricity, Bitcoin is presently being traded at a large multiple of its alleged intrinsic value (the prevailing price divided by roughly $1,400 per coin), which may indicate the presence of a bubble.

As an aside, the reward for mining a block began at 50 BTC. The reward is set within the Bitcoin Core software to be cut in half every 210,000 blocks. Figure 3 below shows the reward that is given for a specific block number.

**Figure 3: Bitcoin Mining Reward**

Now, the smallest fraction of a Bitcoin is 1 satoshi, or 10^-8 BTC. The reward for mining one block will fall below one satoshi, and thus be 0, after being halved 33 times. This means that the maximum number of bitcoins that can be mined is

approximately 21 million (more precisely, 20,999,999.9769)

BTC. Given the present value of 1 BTC, the maximum supply amounts to roughly 21 million times the prevailing BTC price in USD.

Figure 4 shows the total supply of BTC after a given number of blocks have been mined.

**Figure 4: Bitcoin Supply**
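The halving schedule and supply cap can be checked with a short script. Working in integer satoshi mimics the truncation that drives the reward to 0 after 33 halvings; `Reward` is a hypothetical helper for illustration, not Bitcoin Core's implementation.

```python
#Supply cap check: the block reward starts at 50 BTC (5e9 satoshi) and is
#halved (truncated, via integer shift) every 210,000 blocks
def Reward(blk):
    #Reward in satoshi for block number blk
    halvings = blk // 210000
    if halvings >= 33:      #Reward falls below 1 satoshi and becomes 0
        return 0
    return (50 * 10 ** 8) >> halvings

total = sum(210000 * Reward(i * 210000) for i in range(33)) / 10 ** 8
print(total)    #Just under 21 million BTC
```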

If the intrinsic value of Bitcoin is indeed the cost of the electricity required to mine it, as Satoshi claimed, then it appears that Bitcoin may be overvalued at present. The above calculations do not factor in the value of CPU time, which is more challenging to estimate, but may further increase the intrinsic value.

Regardless, the idea of a currency backed by CPU time and electricity is a relatively new concept, and it remains to be seen how Satoshi’s claim holds up over time.

**References**

[1] Nakamoto, Satoshi. “Bitcoin: A peer-to-peer electronic cash system.” (2008): 28.


OCR is the transformation of images of text into machine encoded text. A simple API to an OCR library might provide a function which takes as input an image and outputs a string. The following pseudo-code illustrates how this might be used.

```python
#...
img = GetImage()
text = ImageToString(img)
ProcessText(text)
#...
```

Figure 1 illustrates the OCR transformation.

**Figure 1: The OCR Process**

The left and right sides depict an image of text and a string representation of the text respectively.

Traditional OCR techniques are typically multi-stage processes. For example, first the image may be divided into smaller regions that contain the individual characters, second the individual characters are recognized, and finally the result is pieced back together. A difficulty with this approach is obtaining a good division of the original image. This blog post explores using deep learning to provide a simplified OCR system.

A fully convolutional network is presented which transforms the input volume into a sequence of character predictions. These character predictions can then be transformed into a string. The architecture of the network is shown below in Figure 2.

**Figure 2: Deep OCR Architecture**

where *NC* is the number of possible characters. In this example, there are 63 possible characters: uppercase and lowercase letters, digits, and a blank character. The parenthesized values in the convolutional layers are the filter sizes and stride values from top to bottom respectively. The values in the reshape layer are the reshaped dimensions.

The input volume is a rectangular RGB image. The height and width of this volume are reduced across the convolutional layers using striding. The 3rd dimension of this volume increases from 3 channels (RGB) to one channel for each possible character. Thus, the volume is transformed from an RGB image into a sequence of vectors. Applying *argmax* across the channel dimension gives a sequence of 1-hot encoded vectors which can be transformed into a string.
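As a minimal sketch of the decoding step, suppose the network emits a (sequence length × NC) array of scores for a toy 4-character alphabet; *argmax* across the channel dimension selects one character per position:

```python
import numpy as np

CS = list('abc ')                       #Toy character set including a blank
P = np.array([[0.1, 0.7, 0.1,  0.1 ],   #Fake network output: (sequence, channels)
              [0.8, 0.1, 0.05, 0.05],
              [0.1, 0.1, 0.1,  0.7 ]])
idx = np.argmax(P, axis = -1)           #Most probable character index per position
s = ''.join(CS[i] for i in idx)
print(s)    #-> 'ba '
```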

To facilitate training this network, a dataset is generated using the Python Imaging Library (PIL). Random strings consisting of alphanumeric characters are generated. Using PIL, images are generated for each random string. A CSV file is also generated which contains the file name and the associated random string. Some examples from the generated dataset are shown below in Figure 3.

**Figure 3: Generated Dataset Images**

Code to generate the dataset follows.

```python
import numpy as np
import string
from PIL import Image, ImageFont, ImageDraw

def MakeImg(t, f, fn, s = (100, 100), o = (16, 8)):
    '''
    Generate an image of text
    t:  The text to display in the image
    f:  The font to use
    fn: The file name
    s:  The image size
    o:  The offset of the text in the image
    '''
    img = Image.new('RGB', s, "black")
    draw = ImageDraw.Draw(img)
    draw.text(o, t, (255, 255, 255), font = f)
    img.save(fn)

#The possible characters to use
CS = list(string.ascii_letters) + list(string.digits)
#The random string lengths
RTS = list(np.random.randint(10, 64, size = 8192)) + [64]
#The random strings
S = [''.join(np.random.choice(CS, i)) for i in RTS]
#Get the font
font = ImageFont.truetype("LiberationMono-Regular.ttf", 16)
#The largest size needed
MS = max(font.getsize(Si) for Si in S)
#Computed offset to center the text
OFS = ((640 - MS[0]) // 2, (32 - MS[1]) // 2)
#Image size
MS = (640, 32)
Y = []
for i, Si in enumerate(S):
    MakeImg(Si, font, str(i) + '.png', MS, OFS)
    Y.append(str(i) + '.png,' + Si)
#Write CSV file
with open('Train.csv', 'w') as F:
    F.write('\n'.join(Y))
```

To train the network, the CSV file is parsed and the images are loaded into memory. Each target value for the training data is a sequence of 1-hot vectors. Thus the target matrix is a 3D matrix with the three dimensions corresponding to sample, character, and 1-hot encoding respectively.

```python
import numpy as np
import os
import string
import sys
from skimage.io import imread
from sklearn.model_selection import ShuffleSplit
from TFANN import ANNC

def LoadData(FP = '.'):
    '''
    Loads the OCR dataset. A is matrix of images (NIMG, Height, Width, Channel).
    Y is matrix of characters (NIMG, MAX_CHAR)
    FP:     Path to OCR data folder
    return: Data Matrix, Target Matrix, Target Strings
    '''
    TFP = os.path.join(FP, 'Train.csv')
    A, Y, T, FN = [], [], [], []
    with open(TFP) as F:
        for Li in F:
            FNi, Yi = Li.strip().split(',')                     #filename,string
            T.append(Yi)
            A.append(imread(os.path.join(FP, 'Out', FNi)))
            Y.append(list(Yi) + [' '] * (MAX_CHAR - len(Yi)))   #Pad strings with spaces
            FN.append(FNi)
    return np.stack(A), np.stack(Y), np.stack(T), np.stack(FN)
```

Next the neural network is constructed using the artificial neural network classifier (ANNC) class from TFANN. The architecture described above is represented in the following lines of code using ANNC.

```python
#Architecture of the neural network
NC = len(string.ascii_letters + string.digits + ' ')
MAX_CHAR = 64
IS = (14, 640, 3)       #Image size for CNN
ws = [('C', [4, 4, 3, NC // 2],      [1, 2, 2, 1]), ('AF', 'relu'),
      ('C', [4, 4, NC // 2, NC],     [1, 2, 1, 1]), ('AF', 'relu'),
      ('C', [8, 5, NC, NC],          [1, 8, 5, 1]), ('AF', 'relu'),
      ('R', [-1, 64, NC])]
#Create the neural network in TensorFlow
cnnc = ANNC(IS, ws, batchSize = 64, learnRate = 5e-5, maxIter = 32,
            reg = 1e-5, tol = 1e-2, verbose = True)
if not cnnc.RestoreModel('TFModel/', 'ocrnet'):
    #...
```

Softmax cross-entropy is used as the loss function which is performed over the 3rd dimension of the output.

Fitting the network and performing predictions is simple using the ANNC class. The prediction is split up using *array_split* from numpy to prevent out-of-memory errors.

```python
#Fit the network
cnnc.fit(A, Y)
#The predictions as sequences of character indices
YH = np.zeros((Y.shape[0], Y.shape[1]), dtype = np.int)
for i in np.array_split(np.arange(A.shape[0]), 32):
    YH[i] = np.argmax(cnnc.predict(A[i]), axis = 2)
#Convert from sequence of char indices to strings
PS = [''.join(CS[j] for j in YHi) for YHi in YH]
for PSi, Ti in zip(PS, T):
    print(Ti + '\t->\t' + PSi)
```

Training and cross-validation results are shown below in Figure 4 on the left and right respectively. The graphs are shown separately as the plots nearly coincide.

**Figure 4: Network Training Performance**

Figure 5 shows the performance of the network for several images from the dataset.

**Figure 5: Sample OCR Results**

The text beneath each image is the predicted text produced by the network. This code was run on a laptop with integrated graphics and so the amount of data and size of the network was constrained for performance reasons. Further improvements can likely be made on the performance with a larger dataset and network.

More videos are available on my YouTube channel.

- A Deep Learning Based AI for Path of Exile: A Series
- Calibrating a Projection Matrix for Path of Exile
- PoE AI Part 3: Movement and Navigation
- PoE AI Part 4: Real-Time Screen Capture and Plumbing
**AI Plays Path of Exile Part 5: Real-Time Obstacle and Enemy Detection using CNNs in TensorFlow**

As discussed in the first post of this series, the AI program takes a screenshot of the game and uses it to form predictions that are then used to update its internal state. In this post, methods for classifying and organizing information from visual input of the game screen are discussed. I have made the source code for this project available on my GitHub. Enjoy!

**Figure 1: Flowchart of AI Logic**

Recall from part 3 that the movement map maintains a dictionary of 3D points to labels. For example, at a given time, the bot might have the data shown in Table 1 in its internal map.

World Point | Projected Point | Label
---|---|---
… | … | Open
… | … | Open
… | … | Obstacle
… | … | Obstacle
… | … | Open

**Table 1: The Internal Map**

Also recall from part 2 that the projection map class allows any pixel on the screen to be mapped to a 3D coordinate (assuming the player is always on the *xy* plane). This 3D coordinate is then quantized to some arbitrary precision so that the AI’s map of the world consists of an evenly spaced grid of points.
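The quantization step might look like the following sketch, where `QuantizePoint` and the grid spacing are hypothetical names chosen for illustration rather than the project's actual identifiers.

```python
import numpy as np

def QuantizePoint(p, prec = 0.5):
    #Snap a 3D world coordinate onto an evenly spaced grid with spacing prec
    return tuple(float(v) for v in np.round(np.asarray(p) / prec) * prec)

print(QuantizePoint((1.27, -0.61, 0.0)))    #-> (1.5, -0.5, 0.0)
```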

Thus, all that is needed is a method which can identify if a given pixel on the screen is part of an obstacle, enemy, item, etc. This task is, in essence, object detection. Real-time object detection is a difficult and computationally expensive problem. A simplified scheme which allows for a good trade-off between performance and accuracy is presented.

To simplify the object detection task, the game screen is divided up into equally sized rectangular regions. For a resolution of 800 by 600, a grid consisting of 7 rows and 9 columns is chosen. Twelve, four, and four pixels are removed from the bottom, left, and right edges of the screen so that the resulting sizes (792 and 588) are divisible by 9 and 7 respectively. Thus, each rectangle in the grid of the screen has a width and height of 88 and 84 pixels respectively. Figure 2 shows an image of the game screen divided using the above scheme.
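The arithmetic for this division can be sketched as follows; the trim and grid constants restate the values from the text, and the cell list is a simple illustration of how the rectangles could be enumerated.

```python
#Screen division arithmetic: 800x600 trimmed to 792x588, then split
#into a 7x9 grid of 88x84 pixel cells
W, H = 800, 600
TRIM_B, TRIM_LR = 12, 4
GW, GH = 9, 7                           #Grid columns and rows
cw = (W - 2 * TRIM_LR) // GW            #Cell width:  792 // 9 = 88
ch = (H - TRIM_B) // GH                 #Cell height: 588 // 7 = 84
cells = [(TRIM_LR + j * cw, i * ch, TRIM_LR + (j + 1) * cw, (i + 1) * ch)
         for i in range(GH) for j in range(GW)]
print(len(cells), cw, ch)   #-> 63 88 84
```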

**Figure 2: Division of the Game Screen**

One convolutional neural network (CNN) is used to classify if a cell on the screen contains an obstacle or is open. An obstacle means that there is something occupying the cell such that the player cannot stand there (for instance a boulder). Examples of open and closed cells are shown in Figure 3.

**Figure 3: Labeling of Image Cells**

A second CNN is used to identify items and enemies. Given a cell on the screen the CNN classifies the cell as either containing an enemy, an item, or nothing.

In order to target only living enemies, a third CNN is used as a binary classifier for movement. Given a cell on the screen, the third CNN determines whether movement is occurring in that cell. Only cells that contain movement are passed into the second CNN, which then predicts whether these cells contain items or enemies. Item labels are made to register as movement by toggling item highlighting between successive screenshots.

Image data for movement detection is created by taking 2 images of the screen in rapid succession and only preserving the regions in the images that are significantly different. This is implemented using the *numpy.where* function (16 is chosen arbitrarily).

```python
I1 = #... Image 1
I2 = #... Image 2
R = np.where(np.abs(I1 - I2) >= 16, I2, 0)
```

To summarize, screenshots are captured from the game screen and input to each of the 3 CNNs. The first CNN detects obstacles in the screen cells. The 3D grid points within each cell on the screen are then labeled accordingly in the movement map. The internal map keeps a tally of the predictions for each cell and reports the most frequent prediction class when a cell is queried. The second and third CNNs are used in conjunction to detect enemies and items.

Using screenshots taken with the *ScreenViewer* class, a training data set is manually constructed. Currently, the dataset only consists of data from the level *Dried Lake* in act 4 of the game. The dataset consists of over 14,000 files across 11 folders and is roughly 164MB in size. Screenshots of the dataset are shown in Figure 4.

**Figure 4: The Training Dataset**

In the dataset, images in *Closed* are cells containing obstacles. The first CNN uses the folders *Closed*, *Open*, and *Enemy*. The second CNN uses the folders *Open*, *Enemy*, and *Item*. The third CNN uses the folders *Move* and *NoMove*.

A somewhat modest CNN architecture is employed for the AI; two sequences of convolutional and pooling layers are followed by 3 fully-connected layers. The architecture is shown below in Figure 5.

**Figure 5: CNN Architecture**

Cross-validation accuracy in the mid to high 90s is achieved in roughly 20 to 30 epochs through the entire dataset. Epochs are performed by randomly sampling batches of size 32 from the training data until the appropriate number of samples is drawn. Training on an NVIDIA GTX 970 takes roughly 5 to 10 minutes.

To improve the performance of the AI, the CNN detection is performed concurrently. This allows for a speed-up as numpy and TensorFlow code avoids the global interpreter lock issue from which normal Python code suffers. Code to launch the classification thread for enemy targeting follows.

```python
#Gets a current list of enemy positions in a thread safe manner
def GetEnemyPositions(self):
    self.tlMut.acquire()
    lut = self.tlut     #Last update time for enemy cell positions
    rv = self.ecp[:]    #Make copy of current list to prevent race conditions
    self.tlMut.release()
    if self.tllct == lut:
        return []       #No update since last time
    self.tllct = lut    #Note time for current results
    return rv

def StartTargetLoop(self):
    self.ctl = True
    thrd = Thread(target = self.TargetLoop)
    thrd.start()
    return True

def TargetLoop(self):
    while self.ctl:     #Loop while continue targeting loop flag is set
        self.ts.DetectMovement(self.GetScreenDifference())  #Find cells that have movement
        self.ts.ClassifyDInput(self.sv.GetScreen())         #Determine which cells contain enemies or items
        tecp = self.ts.EnemyPositionsToTargets()            #Obtain target locations
        self.tlMut.acquire()
        self.ecp = tecp             #Update shared enemy position list
        self.tlut = time.time()     #Update time
        self.tlMut.release()
        time.sleep(Bot.ENT_WT)

def StopTargetLoop(self):
    self.ctl = False    #Turn off continue targeting loop flag
```

**Figure 6: Thread Logical Organization**

Thus, classification is performed concurrently and data members containing the predictions are provided to the main thread in a thread-safe manner using mutex locks. Figure 6 illustrates the logical organization of the threads and mutex locks. In the figure, *ecp* and *pct* are the data members of the *Bot* class that contain the enemy cell positions and predicted cell types respectively.

The following video summarizes the project and contains over four minutes of the AI playing Path of Exile.

**Figure 7: PoE AI Footage**

More footage of the latest version of the bot is available on my YouTube channel.

Project Gutenberg offers a large number of freely available works from many famous authors. The dataset for this post consists of books, taken from Project Gutenberg, written by each of the following authors:

- Austen
- Dickens
- Dostoyevsky
- Doyle
- Dumas
- Stevenson
- Stoker
- Tolstoy
- Twain
- Wells

The main idea of this post is to do the following:

- Use vectors of word frequencies as features
- Fit a random forest classifier for each author
- Analyze each random forest to determine important features
- Obtain words that correspond to important features
- Create a word cloud with word size determined by importance

A number of libraries are necessary for this post.

```python
import numpy as np
import matplotlib as mpl
import os
import re
import roman
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from nltk.corpus import wordnet as wn
from PIL import Image
from wordcloud import WordCloud
from treeinterpreter import treeinterpreter as ti
```

The books for each author are organized in the following structure:

```
Austen/
    Book1.txt
    Book2.txt
    Book3.txt
    Book4.txt
Dickens/
    Book1.txt
...
```

The following code loads the books into memory. A number of strings are randomly generated from each book to increase the total number of samples. This prevents trees in the random forest from having height 0, a case which is not handled correctly in the *treeinterpreter* library. A potential fix for this which eliminates the need for sampling can be found here.

```python
#%% Generate samples from each book
NWORDS = 2 ** 15
A = ['Austen', 'Dickens', 'Dostoyevsky', 'Doyle', 'Dumas',
     'Stevenson', 'Stoker', 'Tolstoy', 'Twain', 'Wells']
tp = r"(?u)\b\w{2,}\b"  #Regex for words
ctp = re.compile(tp)
SW = GetSW()            #Stop words
BT = []                 #List of randomly generated strings
AL = []                 #Author labels
W = set()               #Big set of all words
for i, AFi in enumerate(A):
    for b in os.listdir(AFi):
        with open(os.path.join(AFi, b)) as F:
            ST = ctp.findall(F.read().lower())              #Tokenize book
        nSamp = np.ceil(len(ST) / NWORDS).astype(int)       #Number of samples
        for Ri in np.random.choice(ST, (nSamp, NWORDS)):    #Generate samples from book
            BT.append(' '.join(Ri))                         #Form string from sample
            W.update(Ri)                                    #Add any new words to vocabulary
            AL.append(AFi)                                  #Class label for this sample

#%% Form the vocabulary for Tfidf by removing invalid words/names/etc
def WordFilter(W):
    return (W in SW) or (len(wn.synsets(W)) == 0)

W = frozenset(Wi for Wi in W if not WordFilter(Wi))
AL = np.array(AL)
```

The *WordFilter* function utilizes wordnet to filter any invalid words. Wordnet is a very powerful library that can be used to analyze the meaning of natural language. For this code, wordnet is being used like a dictionary; if no entry is found for word *W* then *W* is filtered.

The next step is to extract numerical features from the strings. For this, a *TfidfVectorizer* is used. *Tfidf* stands for: **T**erm **F**requency (times) **I**nverse **D**ocument **F**requency. In math notation, Tfidf features are computed as

tfidf(t, d) = tf(t, d) × idf(t),

where *t* is the term (a word in this case), *d* is the document (a randomly generated string), and *tf(t, d)* is a function which counts the number of occurrences of *t* in *d*. The function *idf(t)* is computed as

idf(t) = log(n / df(t)),

where *n* is the total number of documents and *df(t)* is the number of documents that contain *t*.

The *idf(t)* element aims to reduce the weight of terms which are common to all or most documents. With Tfidf features, *idf(t)* typically reduces the weight of terms like *a*, *an*, *he*, and *she* to prevent these terms from dwarfing more useful ones.
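A tiny worked example of tf-idf may help. Note that sklearn's *TfidfVectorizer* actually uses a smoothed variant, log((1 + n) / (1 + df(t))) + 1, and L2-normalizes each row by default, so its numbers differ from the plain tf(t, d) × log(n / df(t)) form used here.

```python
import math

#Three tiny "documents"
docs = [['cat', 'sat'], ['cat', 'ran'], ['dog', 'ran']]
n = len(docs)

def tf(t, d):
    #Number of occurrences of term t in document d
    return d.count(t)

def idf(t):
    #Plain inverse document frequency
    df = sum(t in d for d in docs)
    return math.log(n / df)

print(tf('cat', docs[0]) * idf('cat'))  #'cat' appears in 2 of 3 documents
print(tf('dog', docs[2]) * idf('dog'))  #'dog' is rarer, so it scores higher
```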

The following code computes tfidf features using sklearn.

```python
#%% Transform text into numerical features
cv = TfidfVectorizer(token_pattern = tp, vocabulary = W)
TM = cv.fit_transform(BT)
#Lookup word by column index
IV = np.zeros((len(cv.vocabulary_)), dtype = object)
for k, v in cv.vocabulary_.items():
    IV[v] = k
```

Next, a binary random forest classifier is fit for each author; each classifier determines if a document is sampled from the works of a certain author or it is not. A random forest consists of a collection of decision trees (hence forest). An example of a random forest having 2 decision trees is shown below in Figure 1.

**Figure 1: A Random Forest with 2 Trees**

In the above figure, *x_i* denotes the *i*-th element of the feature vector. The random forest depicted in Figure 1 predicts one of 3 class labels: *A*, *B*, or *C*.

A random forest with *n* trees is created by forming *n* random samples of rows from the training data; in this case, each set of random samples consists of randomly selected rows of the data matrix *TM*. For each set, a decision tree is fit. A random forest performs classification by using a voting method: each tree in the forest produces a class label, and the class label that appears most often is taken as the final prediction.
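The voting step can be sketched in a few lines; the constant classifiers below are stand-ins for fitted decision trees.

```python
from collections import Counter

def ForestPredict(trees, x):
    #Each tree votes on a label; the most common vote wins
    votes = [t(x) for t in trees]
    return Counter(votes).most_common(1)[0][0]

#Constant classifiers standing in for fitted decision trees
trees = [lambda x: 'A', lambda x: 'B', lambda x: 'A']
print(ForestPredict(trees, None))   #-> A
```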

With a random forest fit to the training data, *treeinterpreter* is used to obtain the contribution of each feature in determining the predicted class label for each sample. The following code fits a random forest for each author and produces a dictionary which maps words to their importance. **Note:** It seems *treeinterpreter* does not use sparse features, so the amount of memory used in this snippet increases greatly with the number of trees in the random forest (*n_estimators*). Based on the amount of memory available, the number of trees and the number of rows to predict per iteration can be adjusted.

```python
#%%
from scipy.sparse import csr_matrix

def SplitArr(n, m):
    #Yield index pairs splitting range(n) into chunks of size m
    i1, i2 = 0, n % m if (n % m != 0) else m
    while i1 < n:
        yield i1, i2
        i1, i2 = i2, i2 + m

AW = []
for i, Ai in enumerate(A):
    #Use boolean class labels
    L = AL == Ai
    #Use an RFC to find most important words for each author
    C = RandomForestClassifier(n_estimators = 256, n_jobs = -1)
    C.fit(TM, L)
    S1 = C.score(TM, L)
    print('Accuracy {:10s}: {:.4f}'.format(Ai, S1))
    #Index of author class
    ACI = C.classes_.nonzero()[0][0]
    FR = csr_matrix((1, TM.shape[1]))
    #Iterate over this author's rows in chunks to prevent out of memory
    NZR = L.nonzero()[0]
    for j1, j2 in SplitArr(len(NZR), 2):
        _, _, F = ti.predict(C, TM[NZR[j1:j2]])
        FR += csr_matrix(F[:, :, ACI].sum(axis = 0))
    FR /= L.sum()
    AW.append({IV[j]: FR[0, j] for j in FR.nonzero()[1]})
```

Good results can also be obtained by using other feature selection techniques. *SelectKBest* from the *feature_selection* module of sklearn, in particular, gives good results and is very efficient on sparse features. Code to use *SelectKBest* follows.

```python
AW = []
for i, Ai in enumerate(A):
    L = AL == Ai
    C = SelectKBest(k = 1024)
    C.fit(TM, L)
    FR = C.scores_
    AW.append({IV[j]: FR[j] for j in FR.nonzero()[0]})
```

*AW* now contains a dictionary of words mapped to their importance for each author. Next, a wordcloud is constructed using the wordcloud library. The *generate_from_frequencies* function allows a wordcloud to be constructed from a dictionary. Transparency masks are also used so that the words are filled into a pattern specified by the *mask* parameter.

```python
cmi = 0
def MakeCF():
    #Return a color function cycling through several colormaps
    global cmi
    CM = [mpl.cm.Greens, mpl.cm.Oranges, mpl.cm.Blues][cmi % 3]
    cmi += 1
    def color_func(word, font_size, position, orientation, random_state = None, **kwargs):
        return tuple(int(255 * j) for j in CM(np.random.rand() + .4))
    return color_func

#%% Create wordclouds
fontPath = 'C:\\Windows\\Fonts\\SpecialElite.ttf'
for i, Ai in enumerate(A):
    icon = Image.open(os.path.join('Masks', Ai + '.png'))
    mask = Image.new("RGB", icon.size, (255, 255, 255))
    mask.paste(icon, icon)
    mask = np.array(mask)
    wc = WordCloud(background_color = None, color_func = MakeCF(), font_path = fontPath,
                   mask = mask, max_font_size = 250, mode = 'RGBA')
    wc.generate_from_frequencies(AW[i])
    wc.to_file(os.path.join('Wordclouds', Ai + '.png'))
```

In the above code, the word color is randomly chosen using colormaps from MatPlotLib. The font used is a Google Font called *Special Elite*. The font is available for download here. The resulting wordclouds are shown below in Figures 2 and 3.

**Figure 2: Generated Wordclouds**

In order to prevent the wordclouds from being dominated by character names, a wide variety of books and authors should be used. The wordcloud for Doyle shows an example of this. Since Doyle writes almost exclusively about Sherlock Holmes, the classifier identifies words like *Sherlock*, *Holmes*, and *Watson* as strong features. Authors that have different character names in their books do not suffer (or benefit) from this.

**Figure 3: Full Version of Featured Image**

A takeaway from the above is that gathering works from a wide variety of authors can improve results by making these types of uninteresting relationships more rare. For example, if only one author has a character named Tom, then the classifier may identify Tom as a strong feature for the author. This is technically correct, but probably not very interesting if the goal is to gain a deeper understanding of an author’s style. Introducing more authors and books makes it more likely that multiple authors have a character named Tom, thus decreasing the importance of Tom as a feature for any one author.
