Increases in Circulatory Death During the Coronavirus Pandemic

This post takes a closer look at Multiple Cause-of-Death records from the first year of the Coronavirus Pandemic. Changes in mortality records involving the circulatory system (i.e. ICD-10 codes starting with I) are analyzed in more detail. These codes cover commonly occurring conditions such as heart attacks, strokes, and other diseases of the cardiovascular and, more broadly, the circulatory system.
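
As a rough illustration of the selection rule mentioned above, the sketch below flags a record as circulatory when any of its listed ICD-10 codes begins with the letter I. The record layout (one list of code strings per death record) and the helper name `involves_circulatory` are assumptions made for this example, not the format of the actual Multiple Cause-of-Death files.

```python
def involves_circulatory(record_codes):
    """Return True if any listed ICD-10 code starts with 'I' (circulatory chapter)."""
    return any(code.strip().upper().startswith("I") for code in record_codes)


# Hypothetical records: each list holds the causes named on one death record.
records = [
    ["I21.9", "E11.9"],  # heart attack, type 2 diabetes
    ["U07.1", "J18.9"],  # COVID-19, pneumonia
    ["U07.1", "I64"],    # COVID-19, stroke
    ["C34.9"],           # lung cancer
]

# Count the records that mention at least one circulatory code.
circulatory_count = sum(involves_circulatory(codes) for codes in records)
print(circulatory_count)
```

Because a multiple-cause record can list several conditions, a record is counted once no matter how many circulatory codes it contains.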

Embedding Recipes using Kernel PCA

The previous post discusses Kernel PCA and recipes, or formulae, for deriving new kernels from known good kernels. This post applies these approaches to generate vector embeddings in a specific domain: culinary recipes. The idea is to find a low-dimensional representation of recipes such that similar recipes lie close to one another in the embedding space.
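
The idea can be sketched with a handful of toy recipes. The example below treats each recipe as a set of ingredients, uses the Jaccard index as the kernel, and runs a plain NumPy version of Kernel PCA (center the Gram matrix, take the leading eigenvectors). The recipes, the choice of kernel, and the function names are assumptions made for illustration; the post itself may combine kernels differently.

```python
import numpy as np


def jaccard(a, b):
    """Jaccard index of two sets; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def kernel_pca(gram, n_components=2):
    """Embed samples from a precomputed Gram matrix using Kernel PCA."""
    n = gram.shape[0]
    # Center the kernel in feature space: Kc = H K H with H = I - (1/n) 11^T.
    centering = np.eye(n) - np.full((n, n), 1.0 / n)
    centered = centering @ gram @ centering
    eigvals, eigvecs = np.linalg.eigh(centered)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip tiny negative eigenvalues caused by floating-point error.
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


# Toy ingredient sets (hypothetical data).
recipes = {
    "pancakes": {"flour", "egg", "milk", "butter", "sugar"},
    "waffles": {"flour", "egg", "milk", "butter", "sugar", "vanilla"},
    "crepes": {"flour", "egg", "milk", "butter"},
    "salsa": {"tomato", "onion", "cilantro", "lime", "chili"},
    "pico": {"tomato", "onion", "cilantro", "lime", "jalapeno"},
    "guacamole": {"avocado", "onion", "cilantro", "lime", "tomato"},
}

names = list(recipes)
gram = np.array([[jaccard(recipes[a], recipes[b]) for b in names] for a in names])
embedding = kernel_pca(gram, n_components=2)

for name, point in zip(names, embedding):
    print(f"{name:10s} {point.round(3)}")
```

In this toy data the batter-based recipes and the tomato-based recipes share no ingredients, so the leading component mostly separates the two groups.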

Kernel Recipes and Kernel PCA

One strength of kernel methods is their ability to operate directly on non-numerical objects like sets. As seen in the previous post, the Jaccard index on sets satisfies Mercer’s condition and is therefore a valid kernel. Proving that a similarity measure is a valid kernel is somewhat involved, but thankfully several theorems can be employed to get more mileage out of the set of known good kernels. This post outlines some recipes for producing new valid kernels and introduces a method for obtaining numerical representations of samples using kernel methods.
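
A few of the standard closure properties can be sketched in plain Python: a valid kernel scaled by a positive constant, the sum of two valid kernels, and the product of two valid kernels are all valid kernels. The helper names below (`scale`, `add`, `multiply`) and the sample sets are assumptions made for this example, and the random quadratic-form check is only a numerical spot check, not a proof of Mercer's condition.

```python
import random


def jaccard(a, b):
    """Jaccard index of two sets; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def intersection_size(a, b):
    """|a & b|: the inner product of the sets' indicator vectors."""
    return float(len(a & b))


def scale(kernel, c):
    """c * k is a valid kernel for any c > 0."""
    return lambda x, y: c * kernel(x, y)


def add(k1, k2):
    """The sum of two valid kernels is a valid kernel."""
    return lambda x, y: k1(x, y) + k2(x, y)


def multiply(k1, k2):
    """The product of two valid kernels is a valid kernel."""
    return lambda x, y: k1(x, y) * k2(x, y)


def gram(kernel, samples):
    """Gram matrix of a kernel over a list of samples."""
    return [[kernel(x, y) for y in samples] for x in samples]


def quadratic_form(matrix, z):
    """Compute z^T K z, which is non-negative for a positive semidefinite K."""
    n = len(z)
    return sum(z[i] * z[j] * matrix[i][j] for i in range(n) for j in range(n))


samples = [{"a", "b"}, {"b", "c", "d"}, {"a", "d"}, {"c"}, {"a", "b", "c", "d"}]
combined = add(jaccard, scale(multiply(jaccard, intersection_size), 0.5))
gram_matrix = gram(combined, samples)

rng = random.Random(0)
min_form = min(
    quadratic_form(gram_matrix, [rng.uniform(-1, 1) for _ in samples])
    for _ in range(200)
)
print(min_form)
```

A non-negative minimum over many random vectors is consistent with the combined kernel being positive semidefinite on these samples.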

ICP In Practice

This post explores the iterative constrained pathways rule ensemble (ICPRE) method introduced in an earlier post using the Titanic dataset popularized by Kaggle [1]. The purpose of the text is to introduce the features and explore the behavior of the library.

Some of the code snippets in this post are shortened for brevity's sake. To obtain the full source and data, please see the ICPExamples GitHub page [2].

Simulated Genetic Drift in Populations

This video shows a simulation of genetic drift in a synthetically generated population as a result of sexual reproduction. A population of fixed size is divided into distinct sub-populations with differing allele frequencies. Starting from this initial population, 80 epochs are simulated; in each epoch, every population member is replaced by the offspring of two randomly selected parents. Pairs of parents are chosen such that the probability that they come from the same sub-population is higher than the probability that they come from different sub-populations.

Figure 1: Simulated Genetic Drift in Populations

In the animation, the high-dimensional allele features for each population member are represented in two dimensions using PCA. The size and color of each marker encodes information about sub-population makeup and may be used to help distinguish highly mixed samples.

Although the data is entirely synthetic, it demonstrates the averaging effect that cross-breeding has on the genetic makeup of the overall population. As the number of sub-populations grows while the overall population size remains fixed, the impact of cross-breeding intensifies and the population as a whole converges more rapidly.
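
The simulation described above can be approximated with a short, simplified sketch. It is not the code behind the video: it assumes one binary allele per locus for each member, keeps every sub-population at a fixed size, draws the first parent from the member's own sub-population, and uses an assumed same-group mating probability `p_same` of 0.8. The spread of allele frequencies across sub-populations is tracked as a rough measure of how far the groups have converged.

```python
import random
from statistics import pstdev


def make_population(rng, n_groups, group_size, n_loci):
    """Create sub-populations, each with its own random allele frequencies."""
    groups = []
    for _ in range(n_groups):
        freqs = [rng.random() for _ in range(n_loci)]
        members = [[1 if rng.random() < f else 0 for f in freqs]
                   for _ in range(group_size)]
        groups.append(members)
    return groups


def step(rng, groups, p_same):
    """Replace every member with the child of two randomly chosen parents."""
    new_groups = []
    for g, members in enumerate(groups):
        children = []
        for _ in members:
            if len(groups) == 1 or rng.random() < p_same:
                # Both parents come from the same sub-population.
                mother, father = rng.sample(members, 2)
            else:
                # The second parent comes from a different sub-population.
                mother = rng.choice(members)
                other = rng.choice([i for i in range(len(groups)) if i != g])
                father = rng.choice(groups[other])
            # Each locus is inherited from one parent with equal probability.
            children.append([rng.choice(pair) for pair in zip(mother, father)])
        new_groups.append(children)
    return new_groups


def between_group_spread(groups):
    """Average over loci of the std. dev. of allele frequency across groups."""
    n_loci = len(groups[0][0])
    spreads = []
    for locus in range(n_loci):
        freqs = [sum(m[locus] for m in members) / len(members)
                 for members in groups]
        spreads.append(pstdev(freqs))
    return sum(spreads) / n_loci


def simulate(seed=0, n_groups=3, group_size=40, n_loci=50, epochs=80, p_same=0.8):
    """Run the simulation and return the spread recorded after every epoch."""
    rng = random.Random(seed)
    groups = make_population(rng, n_groups, group_size, n_loci)
    history = [between_group_spread(groups)]
    for _ in range(epochs):
        groups = step(rng, groups, p_same)
        history.append(between_group_spread(groups))
    return history


history = simulate()
print(f"initial spread: {history[0]:.3f}, final spread: {history[-1]:.3f}")
```

Lowering `p_same` or raising `n_groups` while keeping the total population fixed is one way to explore how quickly the sub-populations blend under these assumptions.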

CoV-2 Genome Explorer

At the time of this writing, NCBI Virus has over 40,000 complete CoV-2 sequences available. I made an interactive webpage to help visualize a random subset of these genomes using principal component analysis.

Figure 1: CoV-2 Explorer Interface

The x, y, size, and color axes represent the first four principal components, respectively. The plot markers identify viruses with unique nucleotide sequences in the GU280_gp02 (spike glycoprotein) gene region. The range sliders can be used to visualize shifts in the viral genomes over time. The bar chart in the bottom right shows the frequency of SNP mutations among the sequences that fall within the selected date range.
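
The bar chart's tally can be sketched as follows. This is a minimal illustration, not the explorer's actual code: it assumes the sequences are already aligned to a reference of equal length, and the reference string, sample dates, and function name `count_snps` are made up for the example.

```python
from collections import Counter
from datetime import date


def count_snps(reference, dated_sequences, start, end):
    """Count single-nucleotide differences from the reference in a date range.

    Keys are (position, reference_base, observed_base) with 0-based positions.
    """
    counts = Counter()
    for collected, seq in dated_sequences:
        if not (start <= collected <= end):
            continue  # outside the selected date range
        if len(seq) != len(reference):
            raise ValueError("sequences must be aligned to the reference")
        for pos, (ref_base, base) in enumerate(zip(reference, seq)):
            if base != ref_base:
                counts[(pos, ref_base, base)] += 1
    return counts


# Toy data (hypothetical): a short reference and three dated samples.
reference = "ATGCTA"
samples = [
    (date(2020, 1, 15), "ATGCTA"),
    (date(2020, 3, 2), "ATGTTA"),
    (date(2020, 6, 20), "ACGTTA"),
]

snp_counts = count_snps(reference, samples, date(2020, 3, 1), date(2020, 12, 31))
for (pos, ref_base, base), n in snp_counts.most_common():
    print(f"{ref_base}{pos}{base}: {n}")
```

Narrowing the date range changes which samples contribute, which is the behavior the range sliders expose.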

Ancestry Determination Part 3: openSNP Data Evaluation

The website openSNP allows users to share and discuss genetic information [1]. The purpose of the site is perhaps best summarized by its bio on Twitter: “crowdsourcing genome wide association studies” [2]. This post uses the machine learning and statistical techniques developed in previous blog entries to analyze the ancestry of openSNP members who have genotyping data associated with their accounts. By comparing the derived results with the self-reported answers from openSNP users, the accuracy of these methods on real-world data is assessed.