Traditional virus scanners use file *signatures* to detect malware. When a file is scanned, its signature is computed and this signature is compared to a database of known malicious file signatures. If there is a match, the file is flagged as being malicious. However, this approach may fail to detect newly created malware; it depends upon having an up-to-date database of signatures. By using machine learning, malicious applications can be detected without the need for a database of signatures.

Applications use system calls to interact with the operating system, and these calls reveal what an application is doing. Operating system hooks allow an application to intercept system calls, and the intercepted calls can be passively logged. Process Monitor, a tool from Windows Sysinternals, does exactly this: it intercepts and logs the system calls being made on the system. A screenshot of Process Monitor is shown in Figure 1. Process Monitor is available for download from the Sysinternals website.

**Figure 1: Windows Sysinternals Process Monitor**

The volume of information produced by Process Monitor can be overwhelming, so machine learning is applied to analyze the logs and detect anomalies. In this post, Process Monitor is used to generate data for the model; however, the process can be automated by installing hooks on the system and sending the data directly to the classifier.

A virtual machine running Windows 7 32-bit is created to collect training data, and a keylogger is placed on the system. The keylogger intercepts keystrokes using a hook installed with the system call *SetWindowsHookEx* and logs the VK code of each keystroke to a file. Logging samples are taken from the infected machine using Process Monitor, and these logs are exported to a CSV file.

**Figure 2: Keylogger Process Monitor Logs**

Inspecting the log file reveals the calls to *CreateFile* and *WriteFile* made by the keylogging application. These logs are shown in Figure 2.

This training log file is read into pandas using *read_csv*. Next, the logs are grouped by process name and the individual fields are concatenated together. The following columns are used from the Process Monitor logs:

- Operation
- Path
- Result
- Detail
- Integrity
- Parent PID
- PID
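
A sketch of this loading and grouping step, assuming the default Process Monitor CSV headers (the exact column names may differ between Procmon versions and export settings):

```python
import io
import pandas as pd

# Column names below are assumptions based on a default Procmon CSV export.
FIELDS = ["Operation", "Path", "Result", "Detail", "Integrity", "Parent PID", "PID"]

def load_and_group(csv_source):
    """Read a Procmon CSV and build one text document per process."""
    logs = pd.read_csv(csv_source)
    # Concatenate the selected fields of each call into a single string...
    combined = logs[FIELDS].astype(str).apply(" ".join, axis=1)
    # ...then join every call made by a process into one document.
    return combined.groupby(logs["Process Name"]).apply(" ".join)

# Two toy log rows standing in for a real export.
sample = io.StringIO(
    "Process Name,Operation,Path,Result,Detail,Integrity,Parent PID,PID\n"
    "KeyLogger.exe,CreateFile,C:\\log.txt,SUCCESS,Write,High,4,8\n"
    "KeyLogger.exe,WriteFile,C:\\log.txt,SUCCESS,Length: 2,High,4,8\n"
)
documents = load_and_group(sample)
```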

Next, the logging calls are grouped into blocks and the resulting strings are vectorized using a TfidfVectorizer. These vectors encode information about the number, type, and details of the system calls made on the computer. TF-IDF vectorization is explained in further detail in this earlier blog post.

The rationale behind this approach is that malware must use system calls to operate on a system: accessing or modifying files or registry keys, and sending or receiving data over the network. By analyzing the frequencies of these actions, malicious behavior can be detected.

Next the vectors are assigned target values. A value of 1 indicates the application is malicious. A value of 0 indicates the application is not.

A binary label is chosen, though, with sufficient data, programs could be grouped into types. Labels for common types of malware could be constructed: keylogger, Trojan, virus, etc.

Next, a logistic regression model is fit to the data. Logistic regression is a method of performing regression on a dataset that has categorical target values. The logistic function is used to transform linear combinations of the explanatory variables into probabilities. The definition of the logistic function is as follows:
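
In standard notation, a form consistent with Equation 1 is:

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}
```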

**Equation 1: The Logistic Function**

This function is used to transform the typical linear regression formula:

**Equation 2: Linear Regression Equation**

The resulting equation is shown in Equation 3. In this formula, *p* represents the probability that an input sample belongs to the target class 1; that is, the probability that an application is malicious given that it is making the observed system calls.

**Equation 3: Logistic Regression**

In this model, the target samples are the rows of a sparse matrix with a large number of columns. These columns encode the counts of all of the possible system calls, operations, paths, and results that can occur. Using sparse matrices improves the performance and memory consumption of the model. With sparse matrices, the runtime and memory usage is bound by the number of non-zero entries of the matrix and not the dimensions of the matrix.

Because there is only one malicious application on the computer, classes 0 and 1 are imbalanced. To improve the *recall* of the classifier, the classes are assigned the following weights:

- Class 0: 0.1
- Class 1: 0.9

This weighting scheme increases the importance of samples in the target class 1.
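
In scikit-learn, this weighting is passed directly to the model; the toy data below stands in for the real TF-IDF matrix (LogisticRegression also accepts SciPy sparse input directly):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy dense features standing in for the sparse TF-IDF vectors (assumption).
X_train = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.0], [0.0, 1.0]])
y_train = np.array([0, 0, 0, 1])  # imbalanced: a single malicious sample

# The class weights above make errors on the malicious class costlier,
# trading precision for recall.
model = LogisticRegression(class_weight={0: 0.1, 1: 0.9})
model.fit(X_train, y_train)
```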

**Figure 3: Malware Detection System Flowchart**

To summarize, system calls are intercepted and passed to a pre-processor. The pre-processed data is vectorized and the resulting matrix is provided as input to a logistic regression model. The model decides if the process making the system calls is malicious or not. Figure 3 describes the system at a high level using a flowchart.

To test the model, another virtual machine is set up. A second keylogger is created which functions similarly to the first, with two main differences: it writes to a different log file, and it has a different name.

Process Monitor logs are collected on the test machine. The logs are vectorized and target labels are predicted for the resulting samples. The accuracy of the model is evaluated. Samples belonging to the process *KeyLogger.exe* should be assigned a label of 1, while other processes should be assigned a label of 0.

The *confusion matrix* for the cross-validation data is shown below in Equation 4. The confusion matrix shows the true negatives, false positives, false negatives, and true positives from left to right and top to bottom.

**Equation 4: Classification Model Confusion Matrix**
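
A confusion matrix in this layout can be computed with scikit-learn; the labels below are toy values for illustration:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = malicious, 0 = benign.
y_true = [0, 0, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0]

# Rows are true labels and columns are predictions, matching the layout
# described above: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
```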

The overall accuracy of the model on this test data is:

.

This post presents a proof-of-concept way to detect and classify malware using vectorization and logistic regression. This process can be improved by gathering Process Monitor logs for a large amount of malware. Malware can be assigned more detailed labels based on its category. Further, hooks can be installed which directly forward system call information to the model to remove the need for Process Monitor.

Process Monitor is the property of Windows Sysinternals. The author is not in any way affiliated with Windows Sysinternals. The views of the author do not reflect those of Windows Sysinternals.


The number of questions per week regarding Python has been steadily growing over the past decade, from about 10 per week in 2008 to about 3,000 at present.

**Figure 1: Number of Questions per Week**

A histogram of the question scores is constructed. As seen in Figure 2, the vast majority of questions receive little to no upvotes. If you have ever spent hours typing up an in-depth question only to watch it fall flat, fear not! You are not alone.

**Figure 2: Histogram of Question Scores**

If you primarily access Stack Overflow questions from search engine results, this result may seem counter-intuitive. This seeming discrepancy is known as *survivorship bias*. Unpopular questions do not appear in search results and so it is difficult to estimate their number.

So how then to have a successful post? One potentially good approach is to go back in time!

**Figure 3: Average Question Score over Time**

Taking the scores and grouping them by week shows that the average weekly score for Python questions declines over time. Older questions have had more opportunity to receive upvotes. But the same can be said about downvotes!
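
A weekly grouping like this can be sketched in pandas; the column names ("CreationDate", "Score") are assumptions about the question export, not the post's actual schema:

```python
import pandas as pd

# Three toy questions standing in for the real data.
questions = pd.DataFrame({
    "CreationDate": pd.to_datetime(["2009-01-05", "2009-01-06", "2017-06-01"]),
    "Score": [10, 4, 1],
})

# Resample by week and average; weeks with no questions are dropped.
weekly = (questions.set_index("CreationDate")["Score"]
          .resample("W").mean().dropna())
```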

Another way to improve your chances is to craft a title with a specific length. Figure 4 shows average score plotted against title length.

**Figure 4: Question Score and Title Length**

It appears that questions having a title with roughly 30 to 40 characters perform better on average. Longer titles perform worse on average. The longest title has 172 characters and the corresponding question has a score of 1 (no upvotes or downvotes).

The distribution for the length of question bodies is shown in Figure 5. The distribution is left skewed.

**Figure 5: Distribution of Question Body Length**

The length of the body has less of a correspondence with score. Oftentimes short questions contain large blocks of code or data tables. The question with the longest body has 48,242 characters; however, the majority of those characters come from two large data tables.

**Figure 6: Python Question Wordcloud**

Finally, a wordcloud is constructed from the text. The result is shown in Figure 6.


Figure 1 shows the distribution of age at death for all records. The plot shows a right-skewed distribution as expected. The leftmost bar stands out somewhat. This bar enumerates infant mortality.

**Figure 1: Distribution of Age at Death**

Considering only the data in this first bar, another histogram is constructed. This histogram shows the age in months for these records. Figure 2 shows that infant deaths occur most frequently after birth and sharply decline thereafter.

**Figure 2: Infant Mortality**

Next, the rightmost bars of the histogram are considered. These bars contain records for those older than 100 years old. The records are grouped by gender and race and displayed in a bar plot. The *y*-axis represents the percentage of centenarians for each race.

**Figure 3: Percentage of Centenarians by Race**

Figure 3 shows several things. The first is that the majority of people who live at least 100 years are women; in fact, females account for roughly 82% of this group. The second is that people of some races are more likely to survive their first century. Japanese and Chinese are significantly more likely to do so.

Next, records of all ages are grouped by gender. The distributions for men and women are plotted in both a line and a bar chart. Figure 4 shows that men die earlier than women. This trend begins in the late teenage years and continues into adulthood. The count of female records outpaces that of male records only near the end of the human lifespan.

**Figure 4: Age Distribution at Death by Gender**

The average lifespans of men and women are compared; men are found to live roughly 6.6 years shorter than women. A Welch's t-test for the difference of means shows this difference to be highly significant.
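
As a sketch, Welch's test can be run with SciPy; the ages below are simulated for illustration only, with means echoing the reported 6.6-year gap:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic ages standing in for the male and female records (assumption).
men = rng.normal(73.4, 15.0, size=5000)
women = rng.normal(80.0, 15.0, size=5000)

# equal_var=False selects Welch's t-test, which does not assume equal variances.
t_stat, p_value = stats.ttest_ind(men, women, equal_var=False)
```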

Next, the average and standard deviation age at death is computed for each race. The result is shown in a bar plot.

**Figure 5: Age at Death by Race**

Figure 5 shows that white and Asian people live longer than other races on average. Japanese people have the longest average lifespan together with the lowest standard deviation. The low standard deviation suggests fewer Japanese people die early in life.

**Figure 6: Distribution of Age at Death by Race**

This is confirmed by plotting the distribution of several races side by side. Figure 6 shows that relatively fewer Japanese people die before reaching the end of the human lifespan.

Next, the manner of death is explored. The dataset classifies the manner of death into 7 categories. The categories and their counts are listed in Table 1.

Manner of Death | Count |
---|---|
Natural | 2212118 |
Unspecified | 294239 |
Accident | 160768 |
Suicide | 45155 |
Homicide | 20544 |
Unknown | 12467 |
TBD | 4573 |

**Table 1: Manner of Death Categories and Counts**

The average age at death for each category is shown in Figure 7. Deaths from natural causes have the greatest average age. Homicides have the least.

**Figure 7: Average Age at Death by Manner of Death**

Next, the records are grouped by manner of death and race. Bar charts for accidents, suicides, and homicides are constructed. The *y*-axis represents the percentage of all deaths for each race accounted for by a specific manner.

**Figure 8: Manner of Death by Race**

The chart for homicides shows that Japanese and Chinese have the lowest homicide rates among all races. This factor contributes to the longevity of these groups, as death by homicide typically occurs earlier in life. Conversely, homicide rates are highest among blacks, which contributes to the relatively shorter average lifespan of this group.

Next, underlying cause of death is explored. Each record is labeled with an ICD-10 code indicating the underlying cause of death. The cumulative percentage of records accounted for by top diseases is computed and the result is shown in Figure 9. As can be seen, a small number of causes are responsible for a large number of deaths; well over 60% of all deaths result from fewer than 50 causes of mortality.
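
The cumulative-percentage computation can be sketched in pandas with toy codes (the real dataset has millions of records):

```python
import pandas as pd

# Toy ICD-10 codes, one per death record.
codes = pd.Series(["I251", "I251", "C349", "I251", "J449", "C349"])

counts = codes.value_counts()                  # deaths per cause, descending
cum_pct = 100.0 * counts.cumsum() / counts.sum()  # cumulative % of all deaths
```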

**Figure 9: Cumulative Percentage of Deaths by Top Diseases**

Next, the records are grouped by ICD-10 code and the counts of each are computed. The result is shown in a bar chart in Figure 10. The corresponding ICD-10 codes are listed in Table 2.

**Figure 10: Leading Causes of Death**

ICD-10 | Age | Std. Age | Count | Description |
---|---|---|---|---|
I251 | 79.9 | 12.9 | 161079 | Atherosclerotic Heart Disease |
C349 | 71.7 | 11.0 | 146786 | Malignant Neoplasm: Bronchus or Lung |
J449 | 77.2 | 11.0 | 116117 | Chronic Obstructive Pulmonary Disease |
G309 | 86.9 | 7.7 | 113096 | Alzheimer Disease |
I219 | 74.5 | 14.0 | 107594 | Acute Myocardial Infarction |
F03 | 87.4 | 7.9 | 100901 | Dementia |
I500 | 83.6 | 11.6 | 64439 | Congestive Heart Failure |
I250 | 71.9 | 14.8 | 62909 | Atherosclerotic Cardiovascular Disease |
I64 | 81.2 | 12.1 | 61818 | Stroke |
J189 | 79.7 | 14.3 | 42189 | Pneumonia |

**Table 2: Leading Causes of Death**

Heart disease accounts for the largest number of deaths. Atherosclerosis, the build-up of plaque on the arterial walls, is involved in several of the leading causes of death. Lung cancer and COPD are also responsible for a sizable portion of the records. Both of these pulmonary conditions are strongly associated with smoking.

Next, causes of death in those under the age of 50 are explored. A similar bar chart and table are constructed from these records.

**Figure 11: Leading Causes of Death under 50**

ICD-10 | Age | Std. Age | Count | Description |
---|---|---|---|---|
X42 | 40.8 | 13.0 | 19167 | Accidental Poisoning by and Exposure to Narcotics |
X44 | 42.2 | 13.5 | 16872 | Accidental Poisoning by and Exposure to Unspecified Drugs |
X95 | 32.3 | 13.3 | 11466 | Assault by Unspecified Firearm Discharge |
X70 | 40.0 | 16.7 | 8425 | Intentional Self-Harm by Hanging, Strangulation, and Suffocation |
V892 | 43.0 | 21.5 | 7900 | Person Injured in a Motor-Vehicle Accident |
X74 | 50.1 | 19.6 | 6826 | Intentional Self-Harm by Unspecified Firearm Discharge |
I219 | 74.5 | 14.0 | 5375 | Acute Myocardial Infarction |
C509 | 68.7 | 14.9 | 4880 | Malignant Neoplasm: Breast |
R99 | 55.9 | 28.8 | 4873 | Other Ill-Defined and Unspecified Causes of Mortality |
I250 | 71.9 | 14.8 | 4284 | Atherosclerotic Cardiovascular Disease |

**Table 3: Leading Causes of Death Under 50**

The leading causes of death in those under 50 are not due to disease processes. Drug overdose, homicide, and suicide lead. The only diseases present in the top 10 leading causes of death are breast cancer and heart disease.

Next, deaths caused by cancer are considered for all ages. Lung cancer accounts for a clear majority of deaths due to cancer. The large number of deaths due to pancreatic cancer is presumed to reflect the present difficulty in treating it. The prognosis for breast cancer is better, though it is a more common disease.

**Figure 12: Leading Causes of Death by Cancer**

ICD-10 | Age | Std. Age | Count | Description |
---|---|---|---|---|
C349 | 71.7 | 11.0 | 146786 | Malignant Neoplasm: Bronchus or Lung |
C259 | 71.8 | 11.9 | 42121 | Malignant Neoplasm: Pancreas |
C509 | 68.7 | 14.9 | 41913 | Malignant Neoplasm: Breast |
C189 | 72.0 | 14.0 | 39249 | Malignant Neoplasm: Colon |
C61 | 78.6 | 10.5 | 30396 | Malignant Neoplasm: Prostate |
C80 | 72.4 | 13.4 | 27845 | Malignant Neoplasm: Unspecified Site |
C679 | 77.7 | 11.6 | 16586 | Malignant Neoplasm: Bladder |
C719 | 64.1 | 16.1 | 15303 | Malignant Neoplasm: Brain |
C159 | 69.5 | 11.9 | 15285 | Malignant Neoplasm: Esophagus |
C56 | 69.8 | 13.0 | 14242 | Malignant Neoplasm: Ovary |

**Table 4: Leading Causes of Death by Cancer**

Next, methods of suicide are considered. A similar table and bar chart are constructed only from records due to suicide.

**Figure 13: Most Common Methods of Suicide**

ICD-10 | Age | Std. Age | Count | Description |
---|---|---|---|---|
X74 | 50.1 | 19.6 | 13948 | Intentional Self-Harm by Unspecified Firearm Discharge |
X70 | 40.0 | 16.7 | 11682 | Intentional Self-Harm by Hanging, Strangulation, and Suffocation |
X72 | 50.3 | 19.8 | 6116 | Intentional Self-Harm by Handgun Discharge |
X64 | 50.3 | 15.1 | 3241 | Intentional Self-Poisoning by and Exposure to Unspecified Drugs |
X73 | 47.7 | 19.5 | 2892 | Intentional Self-Harm by Rifle, Shotgun, and Larger Firearm Discharge |
X67 | 47.9 | 16.7 | 1369 | Intentional Self-Poisoning by Exposure to Gases (CO2, Helium, etc.) |
X80 | 43.3 | 18.0 | 1123 | Intentional Self-Harm by Jumping from a High Place |
X61 | 49.1 | 15.5 | 1064 | Intentional Self-Poisoning by Exposure to Sedatives |

**Table 5: Most Common Methods of Suicide**

The most common method of suicide, by a significant margin, is via firearm. Hanging is also prevalent. Intentional poisoning is a distant third.

Finally, records are grouped by education level. Education level is recorded as a categorical variable with 9 categories based on different educational milestones. The descriptions for each of the categories are shown in Table 6.

Category | Description |
---|---|
1 | 8th Grade or Less |
2 | 9th – 12th Grade, No Diploma |
3 | High School Graduate or GED Completed |
4 | Some College Credit, but No Degree |
5 | Associate Degree |
6 | Bachelor's Degree |
7 | Master's Degree |
8 | Doctorate or Professional Degree |
9 | Unknown |

**Table 6: Education Levels with Categorical Labels**

The categories increase with level of education. Records with unknown education level are discarded; they account for less than 2% of all records.

A bar chart is constructed of the average age of each group. The result is shown in Figure 14. People who complete at least a bachelor’s degree live longer on average.

**Figure 14: Average Age at Death by Education Level**

The bar chart also suggests a modest increasing trend with education. To further explore this trend, a scatter plot is constructed from the data points. A trend line is fit to the data and the coefficient of determination is computed.

**Figure 15: Relationship Between Education and Lifespan**

The R² of the fit is 0.641, and the general F-statistic of the model is 12.512 with a corresponding p-value of 0.008. The coefficient for education level is 0.907 and is significant, indicating that average lifespan increases by roughly one year for each educational milestone completed.


CMoerae:

- Supports 20 major cryptocurrencies
- Obtains up-to-date market information from the internet
- Dynamically makes and displays predictions
- Features a simple user interface
- Supports Mac, Windows, and Linux and all major browsers

Figure 1 shows a screenshot of the application as rendered in the Firefox browser.

**Figure 1: Screenshot of CMoerae**

Currently, CMoerae supports predictions for Augur, Bitcoin, Bitcoin Cash, Bitcoin Dark, Dash, DogeCoin, Ethereum, Ethereum Classic, Gas, Litecoin, Monero, NxtCoin, Ripple, Stellar, ZCash and more. A trial version of the software is available on the software page of this site.


Figure 1 shows the number of records and unique names per year. There are spikes in the number of records around 1910 and 1940. The total number of unique names increases progressively more rapidly into the 20th century. In the most recent several years, both trends reverse to an extent.

**Figure 1: Total Number of Names and Records**

Despite spikes in the number of records, name diversity increases more steadily. The result is that a large number of people born in the middle 20th century have common names. The most popular names for each year are shown in Figure 2.

**Figure 2: Most Popular Names Over Time**

The number of occurrences and ranking of a particular name can be visualized over time. The distributions for the names “Nicholas”, “Alice”, and “Dwight” are shown in Figure 3. The green bar and red dot highlight the ranking at a specific birth year. A lower ranking indicates a more popular name, with 0 being the most popular name.

**Figure 3: Occurrences and Popularity of Several Names**

The graph for the name “Nicholas” shows an interesting trend. The number of occurrences of the name sharply declines from around the year 2000. The ranking of the name declines as well, but at a slower rate.

To investigate this trend, the number of occurrences of the 1st and 5th most popular names are computed for each year. The result is plotted in Figure 4 alongside the total number of records for each year.

**Figure 4: Occurrences of Top Names**

The total number of records and occurrences follow a similar pattern initially. The lines diverge around 1980. After this point, it requires fewer occurrences to be one of the top names. This is the reason that the ranking of the name “Nicholas” declines more slowly than its number of occurrences around the time.

The percentage of records accounted for by the top *k* names is computed for various values of *k*. The result is shown in Figure 5.
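
One way to compute the percentage covered by the top names, sketched with toy counts (the column names are assumptions about the SSA data layout):

```python
import pandas as pd

# Toy single-year name counts.
names = pd.DataFrame({"name": ["Mary", "John", "Ida", "Zed"],
                      "count": [400, 300, 200, 100]})

def top_k_pct(df, k):
    """Percentage of all records accounted for by the k most common names."""
    return 100.0 * df.nlargest(k, "count")["count"].sum() / df["count"].sum()
```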

**Figure 5: Percentage of Records with Common Names**

A small percentage of popular names represent the majority of records. The rapid increase in the number of names around 1910 reduces the percentage accounted for by the most popular names. The top names regain popularity in the middle century before declining into the 21st century.

**Figure 6: Records with Common Names in 1950 and 2010**

This trend is examined further by arranging the top four hundred names into groups of 100. The result is displayed in a pie chart. In 1950, the top 100 names represent almost 60% of names. In 2010, they represent less than half of that.

Next, the number of names in several of the top percentiles of popularity are computed. The result is shown in Figure 7. The rapid increase starting around 1990 indicates more diversity in the top percentiles.

**Figure 7: Number of Names in Top Popularity Percentiles**

The sharp increase in records around 1940 is not accompanied by a sharp increase in names. In fact, the opposite is true. The result is a large number of records with common names and few with rare ones.

Over the remainder of the century, preferences shift to more uncommon names as diversity increases. Many of the most common names of the 1950s are still popular, but they share an increasing amount of their popularity with less traditional ones. In the 21st century, it requires fewer occurrences to be one of the top names and the top percentiles of popularity contain more names than in the past.

The numbers of male and female records are computed for each year. The result is shown in Figure 8. The chart shows a sizable imbalance between male and female records.

**Figure 8: Count of Male and Female Records**

There are several factors that may contribute to this imbalance. The first is that records with names that occur fewer than five times in a year are omitted. The second is that the data includes immigrants as well as natural-born citizens. The effect of the first can be approximated, while that of the second is more difficult to estimate.

Next, the percentage of records with common names is computed for each year and gender. A common name is defined as one of the top 100 names. The charts reveal that men are typically given more common names than women after 1910.

**Figure 9: Percentage of Records with Common Names by Gender**

Unfortunately, the dataset does not contain names that occur fewer than 5 times in the nation per year. However, the SSA also maintains statewide data. Names are excluded from the statewide data if they occur fewer than 5 times in each state. By taking the difference between the national and state data, the number of these rare names is approximated. The number of records missing from the state data for each gender is shown in Figure 10.

**Figure 10: Records Missing in State Data**

As can be seen, there are more women with names that are excluded from the state counts but included in the national counts due to their rarity. It seems reasonable that the imbalance in male and female records is largely a result of the relative prevalence of very rare female names.

Next, the statewide data is considered. The ratio of records to unique names is computed and displayed in a choropleth map. States with higher name diversity are shown in lighter colors.

**Figure 11: Name Diversity by State in 2016**

By making successive maps, the popularity of a name can be visualized over time. The popularity of the name Mary is shown in Figure 12. The name peaks in popularity in the middle 20th century before entering a decline.

**Figure 12: Name Popularity over Time**

The gender imbalance in each state is computed for records with a birth year of 2016. The imbalance is computed as the difference between the number of male and female records divided by the total number of records. The result is shown in a choropleth map in Figure 13. The map is similar to the map for name diversity.
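
The imbalance measure described above is simple to express; a minimal sketch:

```python
def gender_imbalance(male_count, female_count):
    """Signed imbalance: (male - female) / total, as defined above."""
    return (male_count - female_count) / (male_count + female_count)
```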

**Figure 13: Gender Imbalance Among Records by State in 2016**

To investigate the similarity further, a scatter plot is constructed which plots gender imbalance against state name diversity. Next, a trend line is fit to the data. The R² of the fit is 0.894 and the F-statistic is 419.9 with a corresponding p-value of 2.8e-26.

**Figure 14: Relationship Between Gender Imbalance and Name Diversity**

There is clear evidence showing that states with more diverse names have more male records in the dataset. Further, there are more male records with common names than female records overall. Together, these facts support the notion that the gender imbalance is due to the omission of more female records than male records. It appears that more women have especially rare names than men.

*An intermediate activation volume produced by a convolutional neural network predicting the attractiveness of a person.*

Does beauty truly lie in the eye of the beholder? This chapter explores the complex array of factors that influence facial attractiveness, to answer that question or at least to understand it better.

The Chicago Face Database is a dataset created at the University of Chicago by Debbie S. Ma, Joshua Correll, and Bernd Wittenbrink [1]. It consists of portrait photos of several hundred participants along with corresponding measurement and questionnaire data. Each of the participants had several of their facial features measured. For example, the measurements include:

- The distance between the outer edge of the subject’s nostrils at the widest point.
- The average distance between the inner and outer corner of the subject’s eye.
- The distance between the subject’s pupil center to hairline.

**Figure 1: Chicago Face Dataset Subject AM 238**

Subject | Attractive | Feminine | Masculine | Face Width Cheek | Average Eye Height |
---|---|---|---|---|---|
AM-238 | 3.120 | 1.769 | 4.292 | 634 | 46.250 |
AF-200 | 4.111 | 5.630 | 1.357 | 676 | 65.250 |
LM-243 | 2.778 | 1.179 | 4.857 | 653 | 48.750 |

**Table 1: Chicago Face Dataset Sample Data**

In addition to the numerical data, a questionnaire was given to another set of participants along with the images. The participants were asked to rate several qualities of each subject on a scale from 1 to 7. These qualities include:

- How attractive the subject is with respect to people of the same race and gender.
- How feminine the subject is with respect to people of the same race and gender.
- How masculine the subject is with respect to people of the same race and gender.

An example of an image from the dataset is shown in Figure 1. A sample of several rows and columns of the dataset is listed in Table 1.

The literature often cites facial symmetry, averageness, and secondary sexual characteristics as influencing attractiveness. In an often-cited paper, Little et al. describe plausible reasons why these particular features are considered attractive from an evolutionary perspective [2]. Further, they claim that despite cultural differences in facial attractiveness, favorable cross-cultural traits exist.

Symmetry is stated to be attractive as it represents the ideal outcome of development. Asymmetry is asserted to arise from both genetic and environmental issues. The issues cited include genetic defects, mutations, nutrition, and infections. Humans are claimed to have evolved to find these features less attractive as they suggest less than ideal fitness.

Averageness is stated to be found attractive as an indicator of genetic diversity. Genetic diversity improves fitness in regards to immunity and reduced risk of genetic issues in offspring. Again, humans are claimed to have evolved to find averageness to be more attractive.

Secondary sexual characteristics represent the features that distinguish male faces from female faces. These traits develop in part due to hormonal differences between men and women. Again, these differences are declared to be found attractive in that they indicate genetic robustness. Little et al. state that there is solid evidence for a positive relationship between femininity and attractiveness in women. The link between masculinity and attractiveness in men is claimed to be less pronounced.

The effect of averageness is examined in the Chicago Face Dataset. The matrix of measurement data is normalized so that each column has mean 0 and variance 1. Next, the L2 norm of each sample is computed. The result is a single value representing the total irregularity of a subject with respect to all facial measurements. Next, linear regression is used to analyze the relationship between attractiveness and irregularity.
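
The normalization and L2-norm steps can be sketched as follows; the toy matrix stands in for the real measurement data, which has dozens of columns:

```python
import numpy as np

def irregularity(measurements):
    """Standardize each column to mean 0 / variance 1, then take each
    row's L2 norm as a single irregularity score per subject."""
    z = (measurements - measurements.mean(axis=0)) / measurements.std(axis=0)
    return np.linalg.norm(z, axis=1)

# Three toy subjects with two measurements each; the middle row is exactly
# average, so its irregularity is zero.
scores = irregularity(np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))
```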

The purpose of linear regression is to find a vector of coefficients together with a bias that minimize the prediction error of the model. The formulas for linear regression and prediction error are shown in Equations 1 and 2. In these equations, *X* is the matrix of sample data, *y* is the vector of targets, *ŷ* denotes the predictions, and ‖·‖ denotes the L2 norm.

**Equation 1: Linear Regression**

**Equation 2: Linear Regression Error Term**

A linear regression model is fit which regresses the attractiveness rating against irregularity. The result is shown in Figure 2.

**Figure 2: Irregularity Scatter Plot and Trend Line**

The *coefficient of determination*, *R²*, measures the proportion of the variance in the target variable that is explained by the explanatory variable(s). Its formula is shown in Equation 3, with *n* being the number of data points, *yᵢ* the *i*-th target value, and *ȳ* the mean of the targets.

**Equation 3: The Coefficient of Determination**

As can be seen, there is a minor negative relationship between irregularity and attractiveness. However, an R² of 0.052 does not provide substantial evidence for a relationship between the two variables; it implies that only 5% of the variation in attractiveness can be explained by averageness. It is important to note that the relationship is inverted here, as the *x*-axis represents distance from average, or irregularity. A negative relationship shows that attractiveness increases as the feature measurements move closer to average.

Two sorted lists are constructed from the sample data. In one list, the samples are sorted by their averageness. In the other list, the samples are sorted by their attractiveness. The average absolute difference between these two orderings is 175.7 with a standard deviation of 126.9. If the subjects stand in a line ordered by averageness, each subject would move past 175 people on average to re-order the line by attractiveness. In a dataset of roughly 600 people, this is anything but impressive. Figure 3 shows the distance each subject needs to move to reorder the line.

**Figure 3: Averageness Ordering to Attractiveness Ordering**

Next, the effect of symmetry is evaluated. The dataset contains several separate measurements for the left and right portions of the face. The absolute differences between the left and right measurements are computed. The result is 6 features measuring facial asymmetry. A multiple regression model is constructed which predicts attractiveness from these 6 derived features. Figure 4 shows a scatter plot of the target values against the predictions.
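The derivation of the asymmetry features can be sketched like so (the column layout here is hypothetical, since the actual dataset schema is not shown):

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical paired measurements: six left-side and six right-side columns.
left = rng.normal(loc=30.0, scale=3.0, size=(600, 6))
right = left + rng.normal(scale=0.5, size=(600, 6))  # near-symmetric faces

# Six derived asymmetry features: |left - right| for each measurement pair.
asymmetry = np.abs(left - right)
```

These six columns then serve as the explanatory variables in the multiple regression model.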

**Figure 4: Scatter Plot of Predictions for Symmetry Model**

The plot is labeled with the adjusted R² of the fit. The adjustment to R² accounts for the fact that models with more explanatory variables spuriously obtain higher R² values. The formula is shown in Equation 4, where *p* is the number of parameters in the model. In this case, the model has 6 parameters. Adjusted R² is a more robust estimate of model performance when multiple explanatory variables are involved.

**Equation 4: The Adjusted Coefficient of Determination**

The scatter plot does not show a clear relationship between the predictions and attractiveness. This is reflected in the negative adjusted R². An adjusted R² of -0.006 implies that the model has no explanatory power in predicting attractiveness.

This model is compared to a baseline model that always predicts the mean standardized attractiveness: 0. The general linear *F-statistic* is used to perform the comparison. The F-statistic and associated *p-value* quantify the amount of additional variance in attractiveness that is explained by a model but not explained by a baseline model. The formula for the F-statistic is shown in Equation 5, with *n* the number of samples and *p* the degrees of freedom of the model. The computed F-statistic is 0.42 with an associated *p-value* of 0.87. There is no evidence that the model provides a better fit than one that merely predicts the mean standardized attractiveness.

**Equation 5: The F-Statistic for Multiple Regression**
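Assuming the standard general linear F-test against a mean-only baseline, the statistic and *p*-value can be computed as below. The data here is synthetic and unrelated to the target, so a non-significant result is expected:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p = 600, 6
X = rng.normal(size=(n, p))
y = rng.normal(size=n)  # standardized target, unrelated to X

# Full model: least squares with a bias column.
A = np.hstack([X, np.ones((n, 1))])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
sse_full = np.sum((y - A @ beta) ** 2)

# Reduced (baseline) model: always predict the mean.
sse_reduced = np.sum((y - y.mean()) ** 2)

df_full = n - p - 1  # residual degrees of freedom of the full model
f_stat = ((sse_reduced - sse_full) / p) / (sse_full / df_full)
p_value = stats.f.sf(f_stat, p, df_full)  # upper-tail probability
```

A large F-statistic (small *p*-value) would indicate the six features explain variance beyond the baseline.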

The lack of a significant relationship here does not prove that symmetry is useless in predicting attractiveness. There are many other possible explanations, including poor features, poor data, or even random variation. Regardless, the lack of strong relationships in the above models demonstrates that there are many aspects to facial attractiveness. Relying too heavily on any one aspect can hurt model performance. The reality is that real-world data is often noisy and full of complex, unintuitive relationships.

**Insight:** The effects of symmetry and averageness appear overstated.

A multiple regression model is constructed which predicts the attractiveness of a sample given the 40 facial measurement features. This model is subsequently analyzed to gain insight about the facial features which affect attractiveness. In order to fairly compare the effects of each variable, each measurement is first standardized. If this is not performed, then measurements with larger values may appear to be weighted less heavily. For example, the measurements for facial width at the cheek are much larger than those for eye height. Thus, each column is centered by subtracting its mean and then scaled by its standard deviation. The result is that each entry is now a z-score for that column.

A baseline model is first constructed for the sake of comparison. The baseline model always predicts the mean standardized attractiveness: 0. The baseline model achieves a *root mean squared error* (RMSE) of 0.77. The formula for the RMSE is shown in Equation 6.

**Equation 6: The Root Mean Squared Error**
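The RMSE and the baseline's behavior can be sketched directly (the target values here are illustrative, not from the dataset):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the mean squared residual."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# For a standardized target, a baseline that always predicts the mean predicts 0.
y = np.array([0.5, -1.0, 1.5, -0.5, -0.5])
baseline_pred = np.zeros_like(y)
baseline_rmse = rmse(y, baseline_pred)
```

For a standardized target, the baseline RMSE equals the standard deviation of the targets, which is why the baseline loss in Table 8 is close to the target's spread.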

*Cross-validation* is used to further assess the performance of the model. Figure 5 shows a depiction of a train/test cross-validation split on a hypothetical dataset of 100 samples.

**Figure 5: Cross-Validation Split**

In Figure 5, each cell represents an entry in the dataset. By dividing the dataset into training and testing sets, the performance of the model can be evaluated on samples with which it has not been trained. This validation is needed to ensure the model is not simply memorizing the target values and has the ability to generalize. When performing both standardization and cross-validation, care must be taken to prevent *data leakage*. Data leakage occurs when the model is given information about the held-out cross-validation data. To avoid this, the column means and standard deviations must be computed using only the training data.
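One way to enforce this, assuming scikit-learn is used (the text does not name the library), is to put the scaler inside a pipeline so its means and standard deviations are re-fit on each training fold only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = rng.normal(size=(600, 40))                   # synthetic measurement matrix
y = 0.5 * X[:, 0] + rng.normal(size=600)         # synthetic target

# The scaler is fit inside each split, so test-fold statistics never leak
# into training.
model = make_pipeline(StandardScaler(), LinearRegression())
splits = ShuffleSplit(n_splits=32, test_size=0.25, random_state=0)
scores = cross_val_score(model, X, y, cv=splits,
                         scoring="neg_root_mean_squared_error")
cv_loss = -scores.mean()  # average RMSE across the splits
```

Standardizing the full dataset before splitting would quietly leak the test fold's statistics into training, inflating the apparent performance.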

The dataset is repeatedly divided into training and testing sets and then standardized. The model is fit using the training set and then scored using the testing set. Scoring is accomplished by taking the predicted value of the model and comparing it to the real value from the dataset. Performing training and testing 512 times, the average cross-validation loss is 0.691 with a standard deviation of 0.054. An additional model is fit to the entire data set. The target values are plotted against the predictions of this model and the result is shown in Figure 6. The R² of the fit is 0.247.

**Figure 6: Scatter Plot of Predictions for Facial Measurement Model**

The coefficient vectors of each of the 512 linear regression models are recorded and analyzed. The average and standard deviation for each coefficient value is computed and the results of the top 6 most influential positive and negative features are listed in Table 2.

| Positive Feature | Avg. | Std. | Negative Feature | Avg. | Std. |
|---|---|---|---|---|---|
| L Eye H | +8.09% | 2.93% | Avg Eye Height | -14.48% | 5.67% |
| R Eye H | +7.69% | 2.88% | Lip Fullness | -4.89% | 2.15% |
| Lip Thickness | +4.42% | 2.13% | Chin Length | -4.27% | 2.02% |
| Cheekbone Height | +3.82% | 1.55% | Forehead | -4.18% | 2.30% |
| Midface Length | +3.76% | 1.88% | Pupil Lip L | -3.47% | 1.29% |
| Upper Head Length | +2.98% | 1.60% | Faceshape | -2.79% | 1.75% |

**Table 2: Most Influential Linear Regression Coefficients**

Features with negative coefficients decrease attractiveness as they increase in value; those with positive coefficients do the opposite. The complicated relationship amongst the variables is illustrated in the table. Individual eye height measurements positively affect attractiveness while average eye height negatively affects it. It appears that the effects of these two coefficients cancel out at least somewhat. A similar paradox is apparent with lip fullness and lip thickness. Due to this, it is difficult to determine the true importance of the various features.

This situation arises from *multicollinearity* in the input data. Multicollinearity exists when there are strong linear relationships between the explanatory variables. When this occurs, the related variables do not change in isolation. Change in one variable typically results in a proportional change in the related variables. These proportional changes can have additive or subtractive effects to the change induced by the original variable. It is this behavior that makes interpretation difficult.

With facial measurements, the cause of this behavior is intuitive. If the height of the left eye increases, then the measurement for average eye height also increases. Lip fullness and thickness are similarly defined. A table of the top negative and positive correlations among facial measurements is shown in Table 3.
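Correlation pairs like those reported in Table 3 can be computed with `np.corrcoef`. The feature construction below is synthetic, built to mimic how an averaged measurement is collinear with its components:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 600
left_eye_h = rng.normal(size=n)
right_eye_h = left_eye_h + rng.normal(scale=0.2, size=n)
avg_eye_h = (left_eye_h + right_eye_h) / 2     # defined from the other two
noise = rng.normal(size=n)                     # an unrelated control column
X = np.column_stack([left_eye_h, right_eye_h, avg_eye_h, noise])
names = ["L Eye H", "R Eye H", "Avg Eye Height", "Noise"]

# Pairwise Pearson correlations, listed by descending |r|.
R = np.corrcoef(X, rowvar=False)
pairs = sorted(
    ((R[i, j], names[i], names[j])
     for i in range(len(names)) for j in range(i + 1, len(names))),
    key=lambda t: -abs(t[0]),
)
```

As expected, the derived average correlates almost perfectly with its component measurements, which is exactly the multicollinearity pattern seen in the real data.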

| R | Feature i | Feature j | R | Feature i | Feature j |
|---|---|---|---|---|---|
| +0.985 | Midcheek Chin R | Cheeks avg | -0.825 | Face Width Mouth | Heart Shapeness |
| +0.984 | Midcheek Chin L | Cheeks avg | -0.811 | Cheekbone Prominence | Face Roundness |
| +0.977 | L Eye H | Avg Eye Height | -0.783 | Face Length | Faceshape |
| +0.976 | R Eye H | Avg Eye Height | -0.761 | Face Width Mouth | Cheekbone Prominence |
| +0.975 | R Eye W | Avg Eye Width | -0.752 | Heart Shapeness | Face Roundness |
| +0.973 | L Eye W | Avg Eye Width | -0.731 | Nose Length | Noseshape |
| +0.969 | Pupil Lip R | Pupil Lip L | -0.697 | Pupil Lip L | fWHR |
| +0.954 | Lip Thickness | Lip Fullness | -0.695 | Pupil Lip R | fWHR |

**Table 3: Most Correlated Measurement Features**

A *lasso regression* model is used to address the multicollinearity. The term lasso is an abbreviation for “least absolute shrinkage and selection operator.” Lasso regression penalizes the absolute value of the regression coefficients to help prevent situations where one coefficient cancels the effect of another. The error term for lasso regression is shown in Equation 7. The error term is the same as that for linear regression with the addition of an L1 regularization term.

**Equation 7: Lasso Regression Error Term**

As Γ increases, the coefficients of the model are forced toward 0. An appropriate value of Γ can remove collinear variables from the model while maximizing model performance. A large number of models are created as Γ is varied from 0 to 1. Several of the coefficient values are plotted against Γ and the result is shown in Figure 6.
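This coefficient-shrinkage sweep can be sketched with scikit-learn's `Lasso` (an assumption; the text does not name the library). The data below is synthetic, with only the first two columns truly related to the target:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = rng.normal(size=(600, 10))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=600)
Xs = StandardScaler().fit_transform(X)

# Fit one lasso model per regularization strength, recording the coefficients.
gammas = np.linspace(0.001, 1.0, 50)
coef_path = np.array([Lasso(alpha=g).fit(Xs, y).coef_ for g in gammas])

# Stronger regularization drives more coefficients to exactly zero.
nonzero = (np.abs(coef_path) > 1e-10).sum(axis=1)
```

Plotting each column of `coef_path` against `gammas` reproduces the shrinkage curves shown in the figure: irrelevant features drop out first, the true predictors last.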

**Figure 6: Lasso Coefficient Shrinkage**

The number of non-zero coefficients in the model is shown in Figure 7 for various values of Γ. The color represents the R² of the fit. As the number of non-zero coefficients decreases, the predictive power of the model steadily worsens.

**Figure 7: Number of Nonzero Lasso Regression Coefficients**

Γ is chosen to be 0.02. Performing training and testing 512 times, the average cross-validation loss is 0.689 with a standard deviation of 0.054. Lower values of Γ exist that further improve performance, but the introduction of more collinear terms makes interpretation more difficult. By further tuning Γ, an average cross-validation loss of 0.680 with a standard deviation of 0.053 is achieved.

One additional model is fit to the entire dataset. The target values are plotted against the predicted values and the result is shown in Figure 8. The R² of the fit is 0.246.

**Figure 8: Predictions of the Facial Measurement Model**

The cross-validation loss suggests the model is able to predict the attractiveness score of a subject to within roughly ±0.68 of its true value (on the 1-7 scale). If a subject had an attractiveness score of 6.2, for instance, the model might predict a value between 5.52 and 6.88. An individual prediction might well fall outside of this range, but the number is descriptive of the overall performance of the model. The ability of a relatively simple linear model to predict attractiveness from facial measurements suggests that objective measures of facial attractiveness may exist.

Again, the coefficient vectors of each of the 512 lasso regression models are recorded and analyzed. The average and standard deviation for each coefficient value are computed. In addition, the intervals between the minimum and maximum value for each coefficient are computed. The coefficients whose intervals do not contain 0 are listed in Table 4.

| Feature | Avg. | Std. | Min | Max |
|---|---|---|---|---|
| Pupil Lip L | -0.092 | 0.017 | -0.137 | -0.035 |
| Noseshape | -0.072 | 0.012 | -0.116 | -0.036 |
| Chin Length | -0.055 | 0.009 | -0.081 | -0.002 |
| Lip Fullness | -0.047 | 0.025 | -0.141 | -0.012 |
| Midbrow Hairline L | -0.037 | 0.014 | -0.089 | -0.003 |
| Asymmetry Pupil Lip | -0.015 | 0.002 | -0.021 | -0.010 |
| Luminance Median | +0.021 | 0.003 | +0.012 | +0.031 |
| Cheekbone Prominence | +0.021 | 0.005 | +0.006 | +0.040 |
| Pupil Top L | +0.028 | 0.007 | +0.007 | +0.046 |
| Nose Width | +0.054 | 0.009 | +0.027 | +0.086 |
| Midface Length | +0.077 | 0.027 | +0.001 | +0.142 |

**Table 4: Non-Zero Lasso Regression Coefficients for Γ = 0.0012**

Even using lasso regression, the effects of multicollinearity can still be seen. The correlation between pupil to lip length and midface length is 0.638, yet both features appear in the model with opposite signs. Γ can be further increased to remove these counteracting effects, though model performance begins to suffer.

| Feature | Avg. | Std. | Min | Max |
|---|---|---|---|---|
| Asymmetry Pupil Lip | -0.062 | 0.006 | -0.082 | -0.044 |
| Pupil Lip L | -0.050 | 0.014 | -0.080 | -0.001 |
| Face Width Cheeks | -0.033 | 0.008 | -0.053 | -0.010 |
| Luminance Median | +0.082 | 0.008 | +0.055 | +0.106 |
| Cheekbone Height | +0.104 | 0.022 | +0.035 | +0.142 |
| Nose Length | +0.151 | 0.011 | +0.119 | +0.193 |

**Table 5: Non-Zero Lasso Regression Coefficients for Γ = 0.02**

As seen in Table 5, the larger value of Γ forces more coefficients to 0, resulting in a simpler model. The model appears to rate subjects with wider faces and longer pupil to lip length as being less attractive. Interestingly, the asymmetry measurement for the pupil to lip length has a significant negative effect. This provides some support for the influence of symmetry, though its effect is overshadowed by other variables. In the positive direction, the model appears to favor high cheekbones, longer noses, and more luminous faces. The distributions of these coefficients for each of the 512 lasso regression models are shown in Figures 9 and 10 along with the intervals containing their values.

**Figure 9: Significant Positive Measurement Features**

**Figure 10: Significant Negative Measurement Features**

In this case, there is a tradeoff between a higher-performance model and one that is easy to interpret. Though few of the individual coefficients are significant, the model is able to achieve modest performance by combining a larger number of features. If only coefficients that are significant at the 95% confidence level are used, the R² of the fit decreases to 0.110. An intuitive explanation for this may be that facial attractiveness results from the combination of a wide variety of facial features.

**Insight:** Objective measures of attractiveness appear to exist.

Regardless of which features are most important, the lasso regression model makes a substantial improvement over the baseline model. This is reflected in the F-statistic for the model: 6.48, with a correspondingly small *p-value*. This suggests that facial measurements provide useful information in predicting attractiveness.

Next, the subjective features of the dataset are analyzed. These features are scores from 1 to 7 on a variety of perceived qualities such as attractiveness, masculinity, and femininity. Scores represent averages over a number of participants evaluating the images of the dataset. For instance, subject AF-200 has a masculinity score of 1.357 which is her average score given by 28 evaluators for that quality.

A lasso regression model is constructed which predicts attractiveness based on all other subjective features. Attractiveness is excluded for obvious reasons. Care must again be taken to account for multicollinearity. The top positive and negative correlations among the subjective features are shown in Table 6.

| R | Feature i | Feature j | R | Feature i | Feature j |
|---|---|---|---|---|---|
| +0.843 | Angry | Disgusted | -0.952 | Feminine | Masculine |
| +0.834 | Angry | Threatening | -0.683 | Age | Babyface |
| +0.734 | Dominant | Threatening | -0.631 | Threatening | Trustworthy |
| +0.725 | Afraid | Sad | -0.606 | Angry | Happy |
| +0.687 | Disgusted | Threatening | -0.587 | Angry | Trustworthy |
| +0.683 | Happy | Trustworthy | -0.573 | Happy | Sad |

**Table 6: Most Correlated Subjective Features**

Thankfully, and intuitively, the correlations among the subjective features are weaker than those among the measurement variables.

Using the subjective features alone, the lasso regression model achieves an average cross-validation loss of 0.459 with a standard deviation of 0.040. An additional model is fit to the entire dataset. The target values are plotted against the model predictions and the plot is labeled with the R² of the fit: 0.653. The result is shown in Figure 11.

**Figure 11: Predictions of the Subjective Feature Model**

Interestingly, this is a substantial improvement over the performance of the regression model based on the facial measurements. This implies that subjective features are more useful overall in predicting attractiveness.

**Insight:** Subjective features are better predictors of attractiveness than facial measurements.

Next, the coefficient vector of the regression model is analyzed in the same fashion as earlier. 512 lasso regression models are fit using separate cross-validation splits and the average and standard deviation of each feature coefficient are computed. The weights of the top positive and negative subjective features are shown in Table 7.

| Positive Feature | Avg. | Std. | Negative Feature | Avg. | Std. |
|---|---|---|---|---|---|
| Feminine | +34.34% | 0.64% | Age | -7.37% | 0.31% |
| Masculine | +26.34% | 0.65% | Sad | -5.54% | 0.33% |
| Trustworthy | +5.99% | 0.48% | Threatening | -3.84% | 0.50% |
| Dominant | +3.87% | 0.47% | Unusual | -3.06% | 0.24% |
| Afraid | +3.04% | 0.36% | Babyface | -2.41% | 0.30% |
| Angry | +0.35% | 0.47% | Surprised | -1.66% | 0.24% |

**Table 7: Most Influential Subjective Features**

The model scores people who appear old, sad, and threatening as being less attractive. It is important to note that the age variable represents the average age estimate made by the participant evaluators and not the true age of the subject. This implies that people perceived as youthful are also perceived as attractive. Somewhat paradoxically, subjects are rated as being less attractive for having a “babyface.” Nevertheless, there is a subtle distinction between the appearance of youth and having a babyface.

The model scores people who appear more feminine, masculine, and trustworthy as appearing more attractive. The relationships between femininity, masculinity, and attractiveness are more easily visualized using scatter plots. Figure 12 shows femininity plotted against attractiveness with separate trend lines for men and women.

**Figure 12: Relationship Between Attractive and Feminine**

There is a large difference between the femininity scores of men and women. There is also a large difference in the relationship between femininity and attractiveness for men versus women. Attractiveness in women is very highly correlated with femininity. This intuitively makes sense, though deeper interpretation is somewhat ambiguous. Depending on the evaluator, femininity might be perceived as being attractive, or attractiveness might be perceived as a quality of femininity. The subjective nature of these features makes interpretation more difficult. For men, femininity has little effect on attractiveness.

**Figure 13: Relationship Between Attractive and Masculine**

Figure 13 shows masculinity plotted against attractiveness with separate trend lines for men and women. Interestingly, masculinity has a stronger negative effect on attractiveness in women than it has a positive effect in men. From the coefficients seen earlier, the regression model appears to miss this effect. The above plots are combined into a 3D scatter plot which shows the interactions between the three variables along with age.

**Figure 14: Relationship Between Attractive, Masculine, and Feminine**

Another important aspect of these figures is that fewer men are rated as being attractive. This is despite the fact that the number of male and female subjects is nearly equal, with 290 male and 307 female samples. The majority of the data points for men are clustered in the first half of the range of attractiveness. This effect confounds the relationships presented in Table 7.

**Figure 15: Distribution of Attractiveness Scores for Men and Women**

The distribution of attractiveness scores for men and women are shown in Figure 15 along with their corresponding sampling distributions. Due to the large number of participants, there is almost no overlap between the sampling distributions. Performing a Welch's t-test to compare the two means confirms that the difference is statistically significant. It appears that despite being asked to control for gender, the evaluators still rated men as being less attractive on average.
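A Welch's t-test of this kind is available in SciPy; the sketch below uses synthetic scores with the group sizes from the dataset (the means and spreads are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
# Hypothetical attractiveness scores on the 1-7 scale.
men = rng.normal(loc=2.8, scale=0.8, size=290)
women = rng.normal(loc=3.4, scale=0.9, size=307)

# equal_var=False selects Welch's t-test, which does not assume
# equal variances between the two groups.
t_stat, p_value = stats.ttest_ind(women, men, equal_var=False)
```

With group means this far apart relative to their spread, the test rejects the hypothesis that the two populations share a mean.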

**Insight:** Men are rated as being less attractive than women on average.

Statistically significant differences between men and women exist in several of the other subjective features. Men are more likely to be perceived as being masculine, threatening or dominant. Women are more likely to be perceived as being feminine, trustworthy, unusual, or sad. The distributions for several other of these features are shown in Figure 16.

**Figure 16: Distribution Differences for Men and Women**

Several of the other subjective features have effects that hold irrespective of gender. The appearance of trustworthiness, for example, is correlated with attractiveness in both genders. Figure 17 presents a scatter plot of attractiveness and trustworthiness with a single trend line for men and women.

**Figure 17: Relationship Between Attractive and Trustworthy**

It appears there is a modest positive relationship between appearing trustworthy and appearing attractive. This effect ties together the two observations that men are more likely to be rated both less attractive and less trustworthy than women.

A lasso regression model is constructed using both the facial measurements and the subjective features. Γ is chosen to be 0.0075. The model achieves an average cross-validation loss of 0.427 with a standard deviation of 0.035. The R² of the fit is 0.672. If both dummy variables for race and gender and the subjective race and gender estimates are introduced, model performance is again improved.

To further improve performance, separate models for men and women are constructed and the results of each are synthesized together. Finally, cubic polynomial interaction terms for the measurement data are introduced. Cubic terms are used due to the cubic relationship between measurement length and 3D facial structure. The target values are plotted against the predictions of the final model and the result is shown in Figure 18.
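The exact construction of the cubic interaction terms is not shown; one common approach, assumed here, is scikit-learn's `PolynomialFeatures`, which expands the inputs into all products of up to three columns:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(9)
X = rng.normal(size=(600, 4))  # a few measurement columns for illustration

# Degree-3 expansion: x1, x2, ..., x1*x2, x1^2*x3, x1^3, etc.
cubic = PolynomialFeatures(degree=3, include_bias=False)
X_cubic = cubic.fit_transform(X)
```

For 4 input columns this yields 34 features; with 40 measurement columns the expansion is much larger, which is one reason regularization remains important in the final model.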

**Figure 18: Final Model Prediction Scatter Plot**

Again, cross-validation splits are repeatedly formed and the performance of each model is evaluated. This process is repeated 512 times. In addition, one model is fit to the entire dataset to evaluate the overall goodness of fit. The results for each of the feature sets are shown in Table 8. The minimum cross-validation loss is 0.373 with a standard deviation of 0.036. The corresponding adjusted R² is 0.823.

| Feature Set | Avg. Loss | Std. Loss | Adj. R² |
|---|---|---|---|
| Baseline | 0.770 | 0.060 | 0.000 |
| Measurements | 0.680 | 0.053 | 0.246 |
| Subjective | 0.459 | 0.040 | 0.626 |
| Measurement + Subjective | 0.427 | 0.035 | 0.672 |
| All | 0.397 | 0.040 | 0.718 |
| All, Separate Gender | 0.380 | 0.036 | 0.802 |
| All, Separate Gender + Cubic | 0.373 | 0.035 | 0.823 |

**Table 8: Lasso Regression Cross-Validation Performance**

As can be seen, the addition of each feature set provides more power in predicting attractiveness. This shows that the feature sets complement each other, at least partially. For example, by using the facial measurements in addition to the subjective features, the model achieves a substantial improvement in performance. Although the majority of the variation in attractiveness is explained by the subjective features, the facial measurements contribute additional useful information. This may suggest that while much of attractiveness is subjective, there are anatomical characteristics which are perceived as being attractive.

This explanation intuitively makes sense, though some caution should be exercised. There are correlations amongst the subjective features and facial measurements. For example, the correlation between the subjective feature “Surprised” and eye size is 0.348. This is less of a profound insight and more a description of a biological function. Nevertheless, the effect of eye size on attractiveness may be overshadowed by the effect of “Surprised,” though the measurement seems to be closer to the root cause.

The differences between attractiveness in men and women are explored further. The subjects are divided into two groups and 512 pairs of regression models are created with one model fit to men and one to women. The mean and standard deviation of each coefficient is computed separately for each group. The mean coefficient weights for men and women are shown in Figure 19 in two pie charts. Features with a positive effect are shown in blue while those with a negative effect are shown in red.

**Figure 19: Coefficient Effect Weights for Men and Women**

The influence of femininity dominates for women. This result agrees with the scatter plot seen earlier comparing femininity and attractiveness. The most influential effect for men is the appearance of trustworthiness. Masculinity is also important, but its effect on men is weaker than that of femininity on women. Also of note is that the most important features are all subjective. This reinforces the notion that the subjective features are better predictors of attractiveness than the facial measurements. In order to explore the differences among the measurement features, separate models are fit only to the measurement data. The results are shown in Figure 20.

**Figure 20: Measurement Feature Differences Between Men and Women **

From the plot, nose width is more important in determining attractiveness in men than in women. The converse is true with facial luminance. It is important to note that the bar plot only shows the magnitude of the effect and not the sign. The top 10 most influential features are listed for men and women in Table 9 along with their sign.

| Feature (Men) | Avg. | Std. | Feature (Women) | Avg. | Std. |
|---|---|---|---|---|---|
| Cheekbone Height | +16.12% | 3.82% | Nose Length | +17.34% | 2.58% |
| Nose Width | +14.86% | 3.16% | Bottom Lip Chin | -12.97% | 5.46% |
| Nose Length | +11.69% | 2.93% | Luminance Median | +11.54% | 1.86% |
| Bottom Lip Chin | -9.88% | 5.93% | Cheekbone Height | +7.99% | 3.38% |
| Midcheek Chin L | +7.70% | 4.08% | Pupil Top R | +7.63% | 3.43% |
| Forehead Height | +5.20% | 2.33% | Face Width Cheeks | -6.85% | 1.72% |
| Lip Fullness | -5.01% | 2.70% | L Eye W | +6.21% | 2.33% |
| Asymmetry Pupil Lip | -4.99% | 1.59% | Asymmetry Pupil Lip | -5.93% | 1.33% |
| Chin Length | -4.32% | 5.36% | Chin Length | -4.38% | 5.36% |
| Asymmetry Pupil Top | -3.79% | 1.66% | L Eye H | +3.00% | 2.55% |

**Table 9: Signed Feature Weights for Men and Women**

The table clarifies the directions of the relationships for several of the values. A number of the features have similar effects for men and women. Several exceptions are nose width, facial width at the cheeks, forehead height, and lip fullness. Facial luminance has an important positive effect on attractiveness in women that is not present in men.

Next, the image data from the data set is analyzed. A regression model is constructed which predicts an image given a vector of values. Each sample vector contains the subjective scores along with the race and gender variables. Each target vector is a flattened vector of the pixels in the image.

A *ridge regression* model is constructed. Ridge regression fits a linear function to the data using the least-squares criterion along with an L2 regularization term. By regularizing the coefficients of the model, ridge regression can help to achieve better model performance when there are large numbers of correlated variables [3]. The error term for ridge regression is given in Equation 6. Ridge regression is used instead of lasso regression for performance reasons, due to the high dimensionality of the data.

**Equation 6: Ridge Regression Error Term**

The error term is the same as that for linear regression with the addition of an L2 regularization term. By trying a large number of values, the regularization constant is chosen to be 1.0.

The regression model fits an *m* × *p* matrix of coefficients, where *m* is the number of features in the sample vector and *p* is the number of pixels in the image. Each row of the matrix represents facial features related to the corresponding column of the input vectors. For example, if the first column of the input vector contains the age, then the first row of the matrix contains regression coefficients related to age. In this way, the regression model synthesizes a weighted average of different facial features to arrive at a final image.
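This multi-output fit can be sketched with scikit-learn's `Ridge`, which handles a matrix of targets directly. The dimensions and data below are stand-ins, since the actual image resolution and feature count are not stated:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(10)
n_samples, n_features, n_pixels = 200, 18, 32 * 32
X = rng.normal(size=(n_samples, n_features))     # subjective scores + dummies
images = rng.normal(size=(n_samples, n_pixels))  # flattened pixel targets

# One ridge model predicts every pixel at once;
# coef_ has shape (n_pixels, n_features).
model = Ridge(alpha=1.0).fit(X, images)

# Column k of coef_ (row k of its transpose) maps feature k to pixel space
# and can be reshaped into an image to visualize that feature's "activation".
first_feature_activation = model.coef_.T[0].reshape(32, 32)
predicted_image = model.predict(X[:1]).reshape(32, 32)
```

Reshaping a coefficient row back into image dimensions is exactly the operation behind the activation maps shown later in Figure 22.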

The model’s predictions for two of the input samples are shown in Figure 21. The original image is shown on the left and the predicted image is shown on the right. The shadowing around the face is an artifact of the wide variety of hairstyles among the subjects.

**Figure 21: Image Predictions**

The rows of the coefficient matrix can be analyzed to determine the portions of the face that are strongly related to a given feature. Each row of coefficients is reshaped into an image and results for several of the features are shown in Figure 22.

**Figure 22: Feature Activations**

In the image, the regions of the face that contribute most to a feature are shown in lighter yellow. It appears that the shapes and positions of the eyes, nose, mouth, chin, and forehead are most important in determining attractiveness. Elongated, curved eyebrows and lips appear to be more attractive. The definition and size of the base of the nose is influential as well. Also of interest is the definition in the chin and jowl region. There are also regions of activation on the forehead, implying that forehead shape is important. However, interpretation of this result is made difficult by the wide variety of hairstyles in the dataset.

The regression model also supports several interesting applications. Since the model predicts the appearance of a person from a vector of feature values, semi-random faces can be created by generating random input vectors. Random standardized sample vectors are generated using a standard normal distribution. The dummy variables for race and gender must be handled carefully to ensure they are mutually exclusive. Several such randomly generated images are shown in Figure 23.
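The generation step can be sketched as below. The coefficient matrix, feature counts, and dummy-variable layout are illustrative assumptions standing in for the fitted model:

```python
import numpy as np

rng = np.random.default_rng(11)
n_scores, n_races, n_pixels = 14, 4, 32 * 32
n_features = n_scores + n_races + 1  # subjective scores, race dummies, gender

# Stand-in coefficient matrix from a previously fit image-regression model.
W = rng.normal(size=(n_features, n_pixels))

def random_face(std=1.0):
    """Draw a random standardized sample vector and map it to pixel space."""
    v = rng.normal(scale=std, size=n_features)
    # Race dummies must be mutually exclusive: one-hot a random race.
    v[n_scores:n_scores + n_races] = 0.0
    v[n_scores + rng.integers(n_races)] = 1.0
    v[-1] = rng.integers(2)  # binary gender dummy
    return v @ W             # flattened image

face = random_face(std=3.0)  # larger std -> more irregular faces
```

Raising `std` widens the distribution the score entries are drawn from, which is what produces the more irregular faces in Figure 24.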

**Figure 23: Randomly Generated Faces with Low Variance**

By increasing the standard deviation of the distribution used to generate samples, more irregular images can be generated. Randomly generated images with a standard deviation of 3 are shown in Figure 24.

**Figure 24: Randomly Generated Faces with High Variance**

The model can also be used to manipulate images via transformations to the input vectors. A subject may be aged by increasing the age score in the corresponding vector. Or a subject may be made to look happier by modifying the appropriate value. Even the gender or race of a subject can be changed. Several examples follow.

Subject WF-022 is a white female evaluated to be roughly 20 years old. By modifying the age in the sample vector, the subject is aged by roughly 55 years. The result is shown in Figure 25.

**Figure 25: Age Modification**

Subject AF-242 is an Asian female with a happiness score of 1.93 on a scale of 1 to 7. The subject is made to look happier by setting her happiness and sadness scores to the maximum and minimum values attained in the dataset respectively. The subject is also made to look sadder by setting her happiness and sadness scores to the minimum and maximum values respectively. The results are shown in Figures 26 and 27.

**Figure 26: Happiness Modification**

**Figure 27: Sadness Modification**

Subject WF-022 is a white female. By modifying the race variables, the subject is transformed into a Latino female. The result is shown in Figure 28.

**Figure 28: Race Modification**

Subject LM-224 is a Latino male. By modifying the gender variables, the subject is transformed into a Latino female. The result is shown in Figure 29.

**Figure 29: Gender Modification**

Subject WM-220 is a white male with the lowest trustworthiness score observed in the study. Again, by manipulating the relevant variables, the subject is made to look more trustworthy and happy. The result is shown in Figure 30.

**Figure 30: Trustworthiness Modification**

The above functionality has several applications, including the simulated aging seen on missing person reports. In addition, it can be used to visually evaluate the performance of the model. Image transformations that are less convincing indicate that the model has more difficulty determining what is influential for the given feature. For example, if modification of the masculinity feature does not produce a convincing image transformation, it may indicate that the model has difficulty determining the features that make a person look masculine.

Despite their simplicity, linear models frequently perform well in practice, and their results are relatively easy to interpret. More complicated models, such as kernel support vector machines and artificial neural networks, can achieve better performance but perform *black box* prediction. The term black box implies that the model is used like an appliance: samples are provided to one end of the black box and predictions come out the other end. The inner workings of the model are opaque.

Linear models obviously break down beyond some point. For instance, one of the models seen earlier might rate a subject with an impossibly wide nose as arbitrarily attractive; human intuition disagrees. Using a model to predict values outside the range of its training data is known as *extrapolation*. It is necessary to understand the shortcomings of a model to properly interpret its results.

The linear models in this chapter provide several key insights.

- The effects of symmetry and averageness appear overstated.
- Objective measures of attractiveness appear to exist.
- Subjective features are better predictors of attractiveness than facial measurements.
- Men are rated as being less attractive than women on average.

Returning to the prompt for this topic, it appears that facial attractiveness has both a subjective and objective component. The relationships between these components are complex, but certain physical characteristics appear to predispose attractiveness.

[1] Ma, D. S., Correll, J., & Wittenbrink, B. (2015). The Chicago face database: A free stimulus set of faces and norming data. Behavior Research Methods, 47(4), 1122-1135.

[2] Little, A. C., Jones, B. C., & DeBruine, L. M. (2011). Facial attractiveness: evolutionary based research. Philosophical Transactions of the Royal Society B: Biological Sciences, 366(1571), 1638-1659. doi:10.1098/rstb.2010.0404.

[3] Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning (Vol. 1, pp. 241-249). New York: Springer Series in Statistics.

All addresses on the Bitcoin network are queried. The number of addresses with at least one satoshi is 24,473,765 at the time of the query. The resulting addresses are sorted by the amount of Bitcoin they contain. The list is divided into quantiles and the wealth of each quantile is plotted in a bar plot.

Next, the wealth of the top *p* percent of addresses is computed for several values of *p*. The result is shown in a bar plot, along with the number of addresses in each group.

Then, the number of addresses with at least *n* satoshi is computed for several values of *n*. The result is shown in a bar plot along with the corresponding percentiles.

Finally, the addresses are grouped by their first character and the result is displayed in a pie chart.
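The quantile and top-percent computations above can be sketched with *numpy*; the balances below are synthetic stand-ins for the real query results, not actual blockchain data:

```python
import numpy as np

# Hypothetical balances (in satoshi), one per address, with a heavy tail
rng = np.random.default_rng(1)
balances = np.sort(rng.pareto(1.2, size=100_000) * 1e4 + 1)

# Wealth held by each decile of addresses (poorest to richest)
decile_wealth = np.array([d.sum() for d in np.array_split(balances, 10)])

# Share of total wealth held by the top p percent of addresses
def top_share(p):
    k = max(1, int(len(balances) * p / 100))
    return balances[-k:].sum() / balances.sum()

shares = {p: top_share(p) for p in (10, 1, 0.1)}

# Number of addresses holding at least n satoshi
counts = {n: int((balances >= n).sum()) for n in (10, 1_000, 100_000)}
```

Each of these summaries feeds directly into the corresponding bar plot.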

As can be seen, there is a large imbalance of wealth among Bitcoin addresses. However, the true balance of wealth is obscured by the fact that a single wallet can generate multiple addresses.


First, the required modules are installed and imported. This code requires *numpy*, *tensorflow*, and *TFANN*.

```python
import numpy as np
from TFANN import ANNC
```

*TFANN* can be installed via *pip* or by copying the source code into a file named *TFANN.py*.

```
pip install TFANN
```

Next, a 1024×2 matrix of random data points is generated using *numpy*. Class labels are created following a polynomial inequality. The polynomial used is

F(x, y) = -x^2 + 0.1x - 0.6y + 0.2.

The inequality used to generate class labels is

F(x, y) > 0.

The curve F(x, y) = 0 is a downwards-facing parabola centered near the y axis. Points below the parabola satisfy the inequality and are labeled as 1. Points above the curve are labeled as 0. Code to generate the data matrix and class labels follows.

```python
def F(x, y):
    return -np.square(x) + .1 * x - .6 * y + .2

#Training data
A1 = np.random.uniform(-1.0, 1.0, size = (1024, 2))
Y1 = (F(A1[:, 0], A1[:, 1]) > 0).astype(int)
#Testing data
A2 = np.random.uniform(-1.0, 1.0, size = (1024, 2))
Y2 = (F(A2[:, 0], A2[:, 1]) > 0).astype(int)
```

The function curve is shown in Figure 1 along with a scatter plot of the generated data matrix.

**Figure 1: The Generated Data**

The color indicates the value of F(x, y) and the curve is F(x, y) = 0. The same plot, colored instead with class labels, is shown in Figure 2.

**Figure 2: Generated Data with Class Labels**

As can be seen above, the data is divided into two classes: *0* and *1*. The goal is to create a model which can determine if a data point belongs to class *0* or to class *1*. This is known as *binary classification* as there are two class labels.

Next, a multi-layer perceptron (MLP) network is fit to the data generated earlier. In this example, the function used to generate class labels is known. This is typically not the case. Instead, the model iteratively updates its parameters so as to reduce the value of a *loss function*.

A two-layer MLP is constructed. The activation function *tanh* is used after the first hidden layer and the output layer uses linear activation (no activation function). The architecture of the network is illustrated in Figure 3.

**Figure 3: MLP Network Architecture**

The green dots on the neurons in the hidden layer indicate *tanh* activation. Next, this network architecture is specified in a format that TFANN accepts and an ANN classifier is constructed.

```python
NA = [('F', 4), ('AF', 'tanh'), ('F', 2)]
```

The list of tuples is the network architecture. *F* indicates a fully-connected layer and the following number indicates the number of neurons in the layer. *AF* indicates an activation function and the following string indicates the name of the function. As can be seen, the network architecture specifies a fully-connected layer with 4 neurons which is followed by *tanh* which is followed by another fully-connected layer with 2 neurons. The final layer is the output layer.

The docstring for the *_CreateANN* function provides detailed information on the types of network operations that are currently supported by *TFANN*.

```
In [109]: help(TFANN._CreateANN)
Help on function _CreateANN in module TFANN:

_CreateANN(PM, NA, X)
    Sets up the graph for a convolutional neural network from
    a list of operation specifications like:
    [('C', [5, 5, 3, 64], [1, 1, 1, 1]), ('AF', 'tanh'),
     ('P', [1, 3, 3, 1], [1, 2, 2, 1]), ('F', 10)]
    Operation Types:
        AF:  ('AF', <name>)                     Activation Function 'relu', 'tanh', etc
        C:   ('C', [Filter Shape], [Stride])    2d Convolution
        CT:                                     2d Convolution Transpose
        C1d: ('C1d', [Filter Shape], Stride)    1d Convolution
        C3d: ('C3d', [Filter Shape], [Stride])  3d Convolution
        D:   ('D', Probability)                 Dropout Layer
        F:   ('F', Output Neurons)              Fully-connected
        LRN: ('LRN')
        M:   ('M', Dims)                        Average over Dims
        P:   ('P', [Filter Shape], [Stride])    Max Pool
        P1d: ('P1d', [Filter Shape], Stride)    1d Max Pooling
        R:   ('R', shape)                       Reshape
        S:   ('S', Dims)                        Sum over Dims
    [Filter Shape]: (Height, Width, In Channels, Out Channels)
    [Stride]:       (Batch, Height, Width, Channel)
    Stride:         1-D Stride (single value)
    PM: The padding method
    NA: The network architecture
    X:  Input tensor
```

The final layer of a classification network requires that class labels be encoded as one-hot vectors along the final axis of the output. Since the network predicts a single binary class label for each sample, the final layer should have 2 neurons. In this way, the final layer outputs a matrix with one row per sample and one column per class. The function *argmax* is applied along the final dimension of the output matrix to obtain the index of the predicted class label.
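A small illustration of the one-hot encoding and the *argmax* step, using toy scores rather than actual network output:

```python
import numpy as np

# The final layer outputs one score per class; argmax along the last
# axis recovers the predicted class label for each sample.
logits = np.array([[0.2, 1.4],    # -> class 1
                   [2.0, -0.5],   # -> class 0
                   [0.1, 0.3]])   # -> class 1
pred = np.argmax(logits, axis=-1)

# One-hot encoding of the true labels, as required for training
labels = np.array([1, 0, 1])
one_hot = np.eye(2)[labels]
```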

Next the network architecture is passed to the constructor of the *ANNC* class, along with the input shape and other parameters. ANNC is an abbreviation for *Artificial Neural Network for Classification*.

```python
annc = ANNC(A1.shape[1:], NA, batchSize = 1024, maxIter = 4096,
            learnRate = 1e-3, verbose = True)
```

The first argument to the *ANNC* constructor is the shape of a single input sample; in this case, a vector of length *2*. The *batchSize* argument indicates the number of samples to use at a time when training the network. The batch indices are selected randomly for each training iteration. The *learnRate* parameter specifies the learning rate used by the training method (the Adam method by default). The *maxIter* argument limits the number of training iterations to some fixed amount. Finally, *verbose* controls whether the loss is displayed after each iteration of training. Detailed descriptions of the constructor arguments are available via *help(ANNC)*.

*TFANN* follows the *fit*, *predict*, *score* interface used by scikit-learn. Thus, fitting and scoring the network can be accomplished as follows.

```python
annc.fit(A1, Y1)            #Fit using training data only
s1 = annc.score(A1, Y1)     #Performance on training data
s2 = annc.score(A2, Y2)     #Testing data
print('Train: {:04f}\tTest: {:04f}'.format(s1, s2))
YH = annc.predict(A2)       #Predicted labels
```

The *score* method uses *accuracy* as the metric for classification models. This is the number of samples labeled correctly divided by the number of samples. Some care should be used with this metric in problems where class labels are imbalanced.
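A toy example of why accuracy can mislead on imbalanced data: a classifier that always predicts the majority class still scores well while never detecting the minority class.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of samples labeled correctly."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# With 95% of samples in class 0, always predicting 0 scores 95%
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)
acc = accuracy(y_true, y_pred)  # 0.95 despite ignoring class 1 entirely
```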

Due to the simple nature of the problem, the network is able to achieve very high accuracy on the cross-validation data. After *4096* iterations, the network achieves roughly *98%* accuracy. The predictions on the testing data are plotted below in Figure 4.

**Figure 4: Model Cross-Validation Predictions (Accuracy = 98.4%)**

The reader is encouraged to modify the data, network architecture, and parameters to explore the features provided by *TFANN*.

To facilitate rapid prediction, pricing information is queried using the web API of Poloniex. A URL is provided to the API and a JSON containing the historical price information of a specified cryptocurrency is returned.

```python
import numpy as np
import os
import pandas as pd
import urllib.request

def GetAPIUrl(cur, sts = 1420070400):
    '''
    Makes a URL for querying historical prices of a crypto from Poloniex
    cur: 3 letter abbreviation for cryptocurrency (BTC, LTC, etc)
    '''
    return 'https://poloniex.com/public?command=returnChartData&currencyPair=USDT_{:s}&start={:d}&end=9999999999&period=7200'.format(cur, sts)

def GetCurDF(cur, fp):
    '''
    cur: 3 letter abbreviation for cryptocurrency (BTC, LTC, etc)
    fp:  File path (to save price data to CSV)
    '''
    openUrl = urllib.request.urlopen(GetAPIUrl(cur))
    r = openUrl.read()
    openUrl.close()
    df = pd.read_json(r.decode())
    df['date'] = df['date'].astype(np.int64) // 1000000000
    df.to_csv(fp, sep = ',', index = False)     #Cache the data locally
    return df

#%%Path to store cached currency data
datPath = 'CurDat/'
if not os.path.exists(datPath):
    os.mkdir(datPath)
#Different cryptocurrency types
cl = ['BTC', 'LTC', 'ETH', 'XMR']
#Columns of price data to use
CN = ['close', 'high', 'low', 'open', 'volume']
#Store data frames for each of above types
D = []
for ci in cl:
    dfp = os.path.join(datPath, ci + '.csv')
    try:
        df = pd.read_csv(dfp, sep = ',')
    except FileNotFoundError:
        df = GetCurDF(ci, dfp)
    D.append(df)
#%%Only keep range of data that is common to all currency types
cr = min(Di.shape[0] for Di in D)
for i in range(len(cl)):
    D[i] = D[i][(D[i].shape[0] - cr):]
```

After execution, *D[i]* is a pandas Dataframe containing historical price data for the cryptocurrency *cl[i]*.

New samples are constructed that pair sequences of past samples with the samples that immediately follow them. In this way, a regression model can be fit which predicts *K* time periods into the future given data from the past *N* periods. A helper class which accomplishes this follows.

```python
import numpy as np

class PastSampler:
    '''
    Forms training samples for predicting future values from past values
    '''

    def __init__(self, N, K):
        '''
        Predict K future samples using N previous samples
        '''
        self.K = K
        self.N = N

    def transform(self, A, Y = None):
        M = self.N + self.K     #Number of samples per row (sample + target)
        #Matrix of sample indices like: {{1, 2, ..., M}, {2, 3, ..., M + 1}}
        I = np.arange(M) + np.arange(A.shape[0] - M + 1).reshape(-1, 1)
        B = A[I].reshape(-1, M * A.shape[1], *A.shape[2:])
        ci = self.N * A.shape[1]    #Number of features per sample
        return B[:, :ci], B[:, ci:] #Sample matrix, Target matrix
```

The above class is applied to the original time sequence data to obtain the desired sample and target matrices.

```python
from PastSampler import PastSampler

#%%Features are channels
C = np.hstack([Di[CN] for Di in D])[:, None, :]
HP = 16                 #Holdout period
A = C[0:-HP]
SV = A.mean(axis = 0)   #Scale vector
C /= SV                 #Basic scaling of data
#%%Make samples of temporal sequences of pricing data (channel)
NPS, NFS = 256, 16      #Number of past and future samples
ps = PastSampler(NPS, NFS)
B, Y = ps.transform(A)
```

In the above code, *B* contains windows of *NPS* past samples and *Y* contains the corresponding *NFS* future samples. A holdout period is maintained to assess the performance of the network. The number of time units in the period is controlled by *HP*.

The *TFANN* module is used to create an artificial neural network. *TFANN* can be installed using pip with the following command.

```
pip install TFANN
```

A 1D convolution neural network is constructed which transforms the input volume of historical data into predictions. The past *NPS* samples are transformed into a prediction about the next *NFS* samples. The *C1d* option in the network architecture specification indicates 1-dimensional convolution.

```python
#%%Architecture of the neural network
from TFANN import ANNR

NC = B.shape[2]
#2 1-D conv layers with relu followed by 1-d conv output layer
ns = [('C1d', [8, NC, NC * 2], 4), ('AF', 'relu'),
      ('C1d', [8, NC * 2, NC * 2], 2), ('AF', 'relu'),
      ('C1d', [8, NC * 2, NC], 2)]
#Create the neural network in TensorFlow
cnnr = ANNR(B[0].shape, ns, batchSize = 32, learnRate = 2e-5,
            maxIter = 64, reg = 1e-5, tol = 1e-2, verbose = True)
cnnr.fit(B, Y)
```

The architecture of the CNN is shown below in Figure 1. The top set of parenthesized values indicate the filter dimension while the bottom denote the stride.

**Figure 1: 1D CNN Architecture**

More information and the source code for the *ANNR* class are available on GitHub.

Using the above network, the next *NFS* time steps can be predicted. These predictions can in turn be used for subsequent predictions so that prediction can be made an arbitrary amount into the future. Code to accomplish this follows.

```python
PTS = []                    #Predicted time sequences
P, YH = B[[-1]], Y[[-1]]    #Most recent time sequence
for i in range(HP // NFS):  #Repeat prediction
    P = np.concatenate([P[:, NFS:], YH], axis = 1)
    YH = cnnr.predict(P)
    PTS.append(YH)
PTS = np.hstack(PTS).transpose((1, 0, 2))
A = np.vstack([A, PTS])     #Combine predictions with original data
A = np.squeeze(A) * SV      #Remove unit channel dimension and rescale
C = np.squeeze(C) * SV
```

Using *PredictFull*, the outputs of intermediate layers in the network can be visualized. Figure 2 shows an input sample as it is transformed by subsequent layers of the network.

**Figure 2: Intermediate Layer Outputs**

```python
import matplotlib.pyplot as mpl

nt = 4
PF = cnnr.PredictFull(B[:nt])
for i in range(nt):
    fig, ax = mpl.subplots(1, 4, figsize = (16 / 1.24, 10 / 1.25))
    ax[0].plot(PF[0][i])
    ax[0].set_title('Input')
    ax[1].plot(PF[2][i])
    ax[1].set_title('Layer 1')
    ax[2].plot(PF[4][i])
    ax[2].set_title('Layer 2')
    ax[3].plot(PF[5][i])
    ax[3].set_title('Output')
    fig.text(0.5, 0.06, 'Time', ha = 'center')
    fig.text(0.06, 0.5, 'Activation', va = 'center', rotation = 'vertical')
    mpl.show()
```

Notice how in subsequent layers the input data is reduced from *NPS* to *NFS* time units.

The result of the predictions can be visualized using matplotlib.

```python
CI = list(range(C.shape[0]))
AI = list(range(C.shape[0] + PTS.shape[0] - HP))
NDP = PTS.shape[0]  #Number of days predicted
for i, cli in enumerate(cl):
    fig, ax = mpl.subplots(figsize = (16 / 1.5, 10 / 1.5))
    hind = i * len(CN) + CN.index('high')
    ax.plot(CI[-4 * HP:], C[-4 * HP:, hind], label = 'Actual')
    ax.plot(AI[-(NDP + 1):], A[-(NDP + 1):, hind], '--', label = 'Prediction')
    ax.legend(loc = 'upper left')
    ax.set_title(cli + ' (High)')
    ax.set_ylabel('USD')
    ax.set_xlabel('Time')
    ax.axes.xaxis.set_ticklabels([])
    mpl.show()
```

The resulting plot is shown below in Figure 3.

**Figure 3: Cryptocurrency Predictions**

The network predicts a dip in the prices of each cryptocurrency followed by a rally. The predicted behavior is similar to Bitcoin’s price over the past few days.


By definition, an economic bubble is a situation in which an asset is traded within a price range that far exceeds its intrinsic value. So, the question is: what *is* the intrinsic value of Bitcoin? The purpose of this post is to explain some of the technical details of Bitcoin so as to gain a better idea of its value.

Bitcoin is an electronic currency that is maintained by a peer-to-peer software application known as *Bitcoin Core*. Bitcoin transactions are broadcast over the peer-to-peer network and Bitcoin Core records every transaction ever issued in a ledger that is maintained by each node running the software.

These transactions are sealed with a cryptographic hash inside blocks which are linked together in a chain known as the blockchain. Each block in the chain contains the hash of the previous block making it infeasible to tamper with transactions once they are a part of the blockchain. Figure 1, taken from the original paper, illustrates the blockchain [1].

**Figure 1: The Blockchain**

The blockchain maintains the integrity of the transaction record and prevents double-spending and the repudiation of transactions.
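A toy illustration of how hash-linked blocks resist tampering follows; this is a sketch of the chaining principle, not Bitcoin's actual block format:

```python
import hashlib
import json

def block_hash(block):
    """SHA256 of the block's canonical JSON serialization."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def make_block(prev_hash, transactions):
    """Each block records the hash of the previous block."""
    return {'prev': prev_hash, 'tx': transactions}

genesis = make_block('0' * 64, ['coinbase'])
block1 = make_block(block_hash(genesis), ['alice->bob: 1 BTC'])
block2 = make_block(block_hash(block1), ['bob->carol: 0.5 BTC'])

# Tampering with an early block changes its hash, breaking every later link
genesis['tx'] = ['coinbase', 'forged']
assert block1['prev'] != block_hash(genesis)
```

Because each link depends on the previous block's exact contents, rewriting history requires recomputing every subsequent block.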

The price of Bitcoin has historically been very erratic. Its frequent spikes and crashes have resulted in some interesting trends. Figure 2 shows the relationship of the price of Bitcoin in USD (taken from blockchain.info) to the popularity of two search trends (taken from Google Trends): *“Bitcoin Bubble”* and *“Bitcoin Scam”*. The correlations for each plot are 0.7704 and 0.9445, respectively.

**Figure 2: Bitcoin Search Trends**

The above trends hint that there is a lot of uncertainty and speculation about the present and future value of Bitcoin.

According to the pseudonymous creator of Bitcoin, Satoshi Nakamoto, the intrinsic value of Bitcoin stems from the resources that are expended to mine it: CPU time and electricity [1].

Bitcoin mining is the process of cryptographically sealing blocks and adding them to the blockchain in a way that prevents tampering. Mining is accomplished by searching for *nonces* (numbers) that cause the SHA256 hash of the current block to have a certain number of leading 0s. The number of leading zeros as of Block #491713 is 18, and the hexadecimal representation of the hash digest is

00000000000000000048fddd20e468a0c9fab27c81ccade0cfd4c91e857c74e3.

Now, all SHA256 digests have 64 hexadecimal digits. Further, the nature of the SHA256 function makes it so that the best presently known way to find a nonce which produces the required number of leading 0s is by trial and error. Thus, the probability of finding a number that creates a valid block decreases exponentially with the number of 0s required.
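A toy version of this trial-and-error search follows, using far fewer leading zeros than the real network so that it completes quickly:

```python
import hashlib

def mine(data, difficulty):
    """Search for a nonce such that SHA256(data + nonce) has `difficulty`
    leading zero hex digits (a toy version of Bitcoin's proof of work)."""
    nonce = 0
    while True:
        digest = hashlib.sha256('{}{}'.format(data, nonce).encode()).hexdigest()
        if digest.startswith('0' * difficulty):
            return nonce, digest
        nonce += 1

# Difficulty 4 requires ~16^4 = 65,536 attempts on average
nonce, digest = mine('block data', 4)
```

Each additional required zero multiplies the expected number of attempts by 16.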

If it is assumed that the appearance of digits in the digest is random, then the probability of finding an appropriate nonce is presently

16^-18 ≈ 2.1 × 10^-22.

This means that the expected number of nonces that will need to be tried before an appropriate hash is found is

16^18 ≈ 4.7 × 10^21.

This difficulty is what prevents malicious users from tampering with the transaction record. If a block is modified, a new nonce will need to be found for that block and for all subsequent blocks in the chain!

The present hash rate of the Bitcoin network is 9,935,312.21 tera (10^12) hashes per second (TH/s). The hash rate is the number of these nonces that are being tried every second across the network. Given this hash rate, a nonce can be expected to be found every

16^18 / (9,935,312.21 × 10^12) ≈ 475

seconds, or just under 8 minutes. The Bitcoin software adjusts the difficulty automatically so that blocks are produced roughly every 10 minutes.

Mining is currently performed on application-specific integrated circuits (ASICs) that are designed to search for nonces quickly and efficiently. Current ASIC miners are capable of 14 TH/s at an efficiency of ~0.098 W per GH/s. The expected amount of time for a single miner to find a good nonce at this difficulty is

16^18 / (14 × 10^12) ≈ 3.37 × 10^8 seconds ≈ 93,700

hours.

Thus, assuming a power supply with 90% efficiency, running a top of the line miner until a nonce is found takes roughly

93,700 h × (14,000 GH/s × 0.098 W per GH/s) / 0.90 ≈ 143,000

kWh.

The average cost of electricity in the US is 12 cents per kilowatt-hour. Thus, the cost of mining 1 block is roughly

143,000 kWh × $0.12/kWh ≈ $17,100.
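The arithmetic above can be reproduced in a few lines; every constant is a figure quoted in the text:

```python
HASHES_PER_BLOCK = 16 ** 18                      # expected trials for 18 leading zeros
MINER_RATE = 14e12                               # hashes per second (14 TH/s)
MINER_POWER_KW = 14_000 * 0.098 / 1000 / 0.90    # 0.098 W per GH/s, 90% PSU efficiency
PRICE_PER_KWH = 0.12                             # USD per kilowatt-hour

hours = HASHES_PER_BLOCK / MINER_RATE / 3600     # expected mining time per block
kwh = hours * MINER_POWER_KW                     # energy consumed per block
cost = kwh * PRICE_PER_KWH                       # electricity cost per block, USD
```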

Using the above hash rate and power consumption numbers, the entire Bitcoin network currently consumes roughly

(9,935,312.21 TH/s ÷ 14 TH/s per miner) × 1.52 kW per miner ≈ 1.1 × 10^6

kW

of electricity. Some, such as Vitalik Buterin the creator of Ethereum, have raised concerns about the environmental impact of Bitcoin mining.

As the difficulty in mining increases, the cost of electricity can outweigh the reward for mining. Due to this, Bitcoin is primarily mined in locations with abundant and cheap electricity (next to hydro-electric dams, for example).

Presently, 12.5 BTC is mined for every new block that is created. Thus, the current reward in USD for mining 1 block is roughly 12.5 times the market price of Bitcoin, and the current electricity cost of mining a single Bitcoin is roughly

$17,100 / 12.5 BTC ≈ $1,370 per BTC.

According to the estimated cost of electricity, Bitcoin is presently being traded at many times its alleged intrinsic value, which may indicate the presence of a bubble.

As an aside, the reward for mining a block began at 50 BTC. The reward is set within the Bitcoin Core software to be cut in half every 210,000 blocks. Figure 3 below shows the reward that is given for a specific block number.

**Figure 3: Bitcoin Mining Reward**

Now, the smallest fraction of a Bitcoin is 1 satoshi, or 10^-8 BTC. The reward for mining one block will fall below one satoshi, and thus be 0, after being halved 33 times. This means that the maximum amount of Bitcoin that can be mined is

210,000 × 50 × (2 - 2^-32) ≈ 21,000,000

BTC.
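The halving schedule and total supply can be computed directly. This is a sketch of the schedule as described above (halving every 210,000 blocks until the reward falls below one satoshi), not Bitcoin Core's exact integer-satoshi arithmetic:

```python
def block_reward(height, initial=50.0, interval=210_000):
    """Reward in BTC for a block at the given height (halved every interval)."""
    halvings = height // interval
    if halvings >= 33:  # reward falls below 1 satoshi after 33 halvings
        return 0.0
    return initial / (2 ** halvings)

# Total supply: each of the 33 reward eras contributes interval * reward BTC
total = 210_000 * sum(block_reward(h) for h in range(0, 33 * 210_000, 210_000))
```

The result is just under 21 million BTC, matching the well-known supply cap.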

Given the present value of 1 BTC, this maximum supply amounts to roughly 21 million times the current market price in USD.

Figure 4 shows the total supply of BTC after a given number of blocks have been mined.

**Figure 4: Bitcoin Supply**

If the intrinsic value of Bitcoin is indeed the cost of the electricity required to mine it, as Satoshi claimed, then it appears that Bitcoin may be overvalued at present. The above calculations do not factor in the value of CPU time, which is more challenging to estimate, but may further increase the intrinsic value.

Regardless, the idea of a currency backed by CPU time and electricity is a relatively new concept, and it remains to be seen how Satoshi’s claim holds up over time.

**References**

[1] Nakamoto, Satoshi. “Bitcoin: A peer-to-peer electronic cash system.” (2008): 28.
