Data Visualization – nicholastsmith

Posted on December 23, 2022 by Nicholas T Smith

Cause of Death Records in 2021

This post explores Multiple Cause-of-Death records for the year 2021, taken from the U.S. Division of Vital Statistics. Specifically, the impact of the second full year of the coronavirus pandemic is explored in more detail.
Read more

Posted on December 19, 2022 by Nicholas T Smith

Increases in Circulatory Death During the Coronavirus Pandemic

This post takes a closer look at Multiple Cause-of-Death records during the first year of the Coronavirus Pandemic. In this post, changes in mortality records involving the circulatory system (i.e. ICD-10 codes starting with I) are analyzed in more detail. These codes cover commonly occurring diseases like heart attacks, strokes, and other disease of the cardiovascular and, more broadly, the circulatory system.

Posted on December 6, 2022December 6, 2022 by Nicholas T Smith

Cause of Death in the USA: 1959-2020

This post takes another look at Mortality Multiple Cause-of-Death records from the U.S. Division of Vital Statistics. Previous posts have analyzed records for the year 2016 and also deaths due to Influenza & Pneumonia in 2017. In this post, records from 1959 to the 2020, the latest year currently available, are analyzed.
Read more

Posted on November 3, 2022December 11, 2022 by Nicholas T Smith

Embedding Recipes using Kernel PCA

The previous post discusses Kernel PCA and recipes, or formulae, for deriving new kernels from known good kernels. This post applies these approaches to generate vector embeddings in a specific domain: culinary recipes. The idea is to find a low-dimensional representation of recipes such that points in the embedding space are neighbors to similar recipes.
Read more

Posted on October 31, 2022November 1, 2022 by Nicholas T Smith

Kernel Recipes and Kernel PCA

One strength of kernel methods is their ability to operate directly on non-numerical objects like sets. As seen in the previous post, the Jaccard index on sets satisfies Mercer’s condition and thus is a valid kernel. The process of proving a similarity measure is a valid kernel is somewhat involved, but thankfully several theorems can be employed to get more mileage out of the set of known good kernels. This post outlines some recipes for producing new valid kernels and introduces a method for obtaining numerical representations of samples using kernel methods.
Read more

Posted on May 29, 2021December 11, 2022 by Nicholas T Smith

ICP In Practice

This post explores the iterative constrained pathways rule ensemble (ICPRE) method introduced in an earlier post using the Titanic dataset popularized by Kaggle [1]. The purpose of the text is to introduce the features and explore the behavior of the library.

Some of the code snippets in this post are shortened for brevity sake. To obtain the full source and data, please see the ICPExamples GitHub page [2].
Read more

Posted on February 19, 2021February 19, 2021 by Nicholas T Smith

Simulated Genetic Drift in Populations

This video shows a simulation of genetic drift in a synthetically generated population as a result of sexual reproduction. A fixed population size is divided into different distinct sub-populations with differing allele frequencies. Starting with an initial population, 80 simulated epochs are passed in which each population member is replaced via sexual reproduction from two randomly selected parents. Pairs of parents are chosen such that the probability they come from the same sub-population is higher than the probability that they from come different sub-populations.

Figure 1: Simulated Genetic Drift in Populations

In the animation, the high-dimensional allele features for each population member are represented in two dimensions using PCA. The size and color of each marker encodes information about sub-population makeup and may be used to help distinguish highly mixed samples.

Although the data is entirely synthetic, it demonstrates the middling affect that cross-breeding has on the genetic makeup of the overall population. As the number of sub-populations grows and the overall population size remains fixed, the impact of cross-breeding is intensified and the population as a whole converges more rapidly.

Posted on February 5, 2021December 11, 2022 by Nicholas T Smith

CoV-2 Genome Explorer

At the time of this writing, NCBI Virus has over 40,000 complete CoV-2 sequences available. I made an interactive webpage to help visualize a random subset of these genomes using principal component analysis.

Figure 1: CoV-2 Explorer Interface

The x, y, size, and color axes respectively represent the first four principal axes. The plot markers identify viruses with unique nucleotide sequences in the GU280_gp02 (spike glycoprotein) gene region. The range sliders can be used to visualize shift in the viral genomes over time. The bar chart in the bottom right counts the frequency of SNP mutations in the sequences present in the specified date range.

Posted on January 26, 2021January 26, 2021 by Nicholas T Smith

Ancestry Determination Part 3: openSNP Data Evaluation

The website openSNP allows users to share and discuss genetic information [1]. The purpose of the site is perhaps best summarized by its bio on Twitter: “crowdsourcing genome wide association studies” [2]. This post uses the machine learning and statistical techniques developed in previous blog entries to analyze the ancestry of openSNP members who have genotyping data associated to their accounts. By comparing the derived results with the self-reported answers from openSNP users, the accuracy of these methods on real-world data is assessed.
Read more

Posted on January 2, 2021January 2, 2021 by Nicholas T Smith

Identifying Family Relationships using Genetic Similarity Measures

In general, are parents more genetically similar to their children or to their siblings? Why do some siblings look alike while others look completely different? How can the genetic similarity between two individuals be computed? This post uses biostatistics to explore genetic similarity and tries to answer these and other questions.