The previous post discusses Kernel PCA and recipes, or formulae, for deriving new kernels from known good kernels. This post applies those approaches to generate vector embeddings in a specific domain: culinary recipes. The idea is to find a low-dimensional representation of recipes such that nearby points in the embedding space correspond to similar recipes.
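As a minimal sketch of the approach, assuming recipes are represented as Python sets of ingredient strings (the toy recipes and the `jaccard` helper below are illustrative, not the actual data used in this post):

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Toy recipes represented as sets of ingredients (illustrative only).
recipes = [
    {"flour", "sugar", "butter", "egg"},
    {"flour", "sugar", "butter", "vanilla"},
    {"tomato", "basil", "garlic", "olive oil"},
    {"tomato", "mozzarella", "basil", "olive oil"},
]

def jaccard(a, b):
    """Jaccard index: |intersection| / |union|."""
    return len(a & b) / len(a | b)

# Gram matrix of pairwise similarities, then kernel PCA on it.
K = np.array([[jaccard(a, b) for b in recipes] for a in recipes])
embedding = KernelPCA(n_components=2, kernel="precomputed").fit_transform(K)

# Rows of `embedding` are 2-D coordinates; similar recipes land nearby.
print(embedding)
```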
Kernel Recipes and Kernel PCA
One strength of kernel methods is their ability to operate directly on non-numerical objects like sets. As seen in the previous post, the Jaccard index on sets satisfies Mercer's condition and is thus a valid kernel. Proving that a similarity measure is a valid kernel is somewhat involved, but thankfully several theorems can be employed to get more mileage out of the set of known good kernels. This post outlines some recipes for producing new valid kernels and introduces a method for obtaining numerical representations of samples using kernel methods.
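For a flavor of what such recipes look like, the sketch below combines two kernels on sets using three standard closure properties: non-negative weighted sums, products, and exponentials of valid kernels are again valid kernels. The helper functions are illustrative, not part of any library.

```python
import math

# Two known good kernels on sets: the Jaccard index and the
# intersection-size kernel (both satisfy Mercer's condition).
def jaccard(a, b):
    return len(a & b) / len(a | b)

def intersection(a, b):
    return float(len(a & b))

# Closure recipes: each of the following is again a valid kernel.
def k_sum(a, b, alpha=1.0, beta=1.0):    # non-negative weighted sum
    return alpha * jaccard(a, b) + beta * intersection(a, b)

def k_prod(a, b):                        # pointwise product
    return jaccard(a, b) * intersection(a, b)

def k_exp(a, b, gamma=1.0):              # exponential of a kernel
    return math.exp(gamma * jaccard(a, b))
```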
The Jaccard Kernel and its Implied Feature Space
Kernel methods leverage the kernel trick to implicitly perform inner products in high- and even infinite-dimensional feature spaces. For instance, the radial basis function (RBF) kernel can be shown to map to an infinite-dimensional feature space. In general, if a similarity function can be shown to satisfy Mercer's condition, one may operate on the finite-dimensional Gram matrix induced by the function while receiving the benefit of mathematical guarantees about the implied transformation.
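A quick numerical way to build intuition, though of course not a proof, is to form the Gram matrix induced by the Jaccard index on a handful of sets and confirm it is positive semi-definite; the sample sets here are arbitrary.

```python
import numpy as np

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Arbitrary sample sets.
samples = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {1, 5}]

# Finite-dimensional Gram matrix induced by the kernel.
K = np.array([[jaccard(a, b) for b in samples] for a in samples])

# A Mercer kernel always yields a positive semi-definite Gram matrix,
# i.e. all of its eigenvalues are (numerically) non-negative.
print(np.linalg.eigvalsh(K))
```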
On The Importance of Centering in PCA
The previous post presents methods for efficiently performing principal component analysis (PCA) on certain rectangular sparse matrices. Since routines for performing the singular value decomposition (SVD) on sparse matrices are readily available (e.g. scipy's svds and scikit-learn's TruncatedSVD), it is reasonable to investigate the influence centering has on the resulting transformation.
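As a quick illustration of the question, the snippet below fits both routines to the same random sparse matrix. Since TruncatedSVD factors the matrix as-is while PCA subtracts the column means first, their leading singular values differ; the random data is only a stand-in.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import PCA, TruncatedSVD

# Random sparse matrix standing in for real data.
X = sparse_random(1000, 50, density=0.05, format="csr", random_state=0)

# TruncatedSVD decomposes X directly; PCA centers the columns first.
svd = TruncatedSVD(n_components=5, random_state=0).fit(X)
pca = PCA(n_components=5).fit(X.toarray())

# Without centering, the first component largely tracks the column
# means, so the leading singular values disagree.
print(svd.singular_values_)
print(pca.singular_values_)
```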
Weighted Sparse PCA for Rectangular Matrices
Consider a sparse $m \times n$ rectangular matrix $A$, where either $m \gg n$ or $n \gg m$. Performing a principal component analysis (PCA) on $A$ involves computing the eigenvectors of its covariance matrix. This is often accomplished using the singular value decomposition (SVD) of the centered matrix $A - \mathbf{1}\mu^\top$, where $\mu$ is the vector of column means. But with large sparse matrices, this centering step is frequently intractable, since subtracting the nonzero column means destroys sparsity. If it is tractable, however, to compute the eigendecomposition of either an $m \times m$ or an $n \times n$ dense matrix in memory, other approaches are possible.
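Here is a sketch of one such approach, under the assumption that $m \gg n$ so that an $n \times n$ dense matrix fits in memory: the covariance is expanded algebraically so the sparse matrix is never densified. The variable names are illustrative.

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Tall sparse matrix: m >> n, so an n x n dense matrix is affordable.
m, n = 100_000, 200
A = sparse_random(m, n, density=0.01, format="csr", random_state=0)
mu = np.asarray(A.mean(axis=0)).ravel()  # column means

# (A - 1 mu^T)^T (A - 1 mu^T) = A^T A - m * mu mu^T, so the covariance
# of the centered data is available without forming a dense A.
C = ((A.T @ A).toarray() - m * np.outer(mu, mu)) / (m - 1)

# Eigenvectors of the n x n covariance are the principal axes.
eigvals, eigvecs = np.linalg.eigh(C)
top = eigvecs[:, ::-1][:, :5]            # leading five components
scores = A @ top - mu @ top              # centered projections
```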
Weighted PCA
Consider an $m \times n$ matrix $A$ and an $m \times 1$ vector of integers $w$. Now, consider the matrix $B$ formed by taking $a_i$, that is the $i$-th row of $A$, and repeating it $w_i$ times. This new matrix has dimension $s \times n$, where

$$s = \sum_{i=1}^{m} w_i.$$

If $s \gg m$, then it is undesirable to compute the full matrix $B$ and perform principal component analysis (PCA) on it. Instead, the highly structured form of $B$ suggests there should be a more efficient method to perform this decomposition. This method is known as weighted PCA and is the topic of this post.
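A minimal sketch of the idea, with illustrative data: the weighted mean and covariance of $A$ match those of the expanded matrix $B$, so the decomposition never requires forming $B$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))              # m x n data
w = np.array([1, 3, 2, 5, 1, 4])         # repeat counts
s = w.sum()

# Weighted mean and covariance -- equal to those of B, in which
# row i of A appears w_i times, but computed without forming B.
mu = w @ A / s
Ac = A - mu
C = (Ac * w[:, None]).T @ Ac / s

# The principal components are the eigenvectors of C.
eigvals, eigvecs = np.linalg.eigh(C)

# Sanity check against the explicit expansion.
B = np.repeat(A, w, axis=0)
print(np.allclose(C, np.cov(B, rowvar=False, bias=True)))  # True
```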
ICP In Practice
This post explores the iterative constrained pathways rule ensemble (ICPRE) method, introduced in an earlier post, using the Titanic dataset popularized by Kaggle [1]. The purpose of this post is to introduce the features of the library and explore its behavior.
Some of the code snippets in this post are shortened for brevity's sake. To obtain the full source and data, please see the ICPExamples GitHub page [2].
Ancestry Determination Part 3: openSNP Data Evaluation
The website openSNP allows users to share and discuss genetic information [1]. The purpose of the site is perhaps best summarized by its bio on Twitter: “crowdsourcing genome wide association studies” [2]. This post uses the machine learning and statistical techniques developed in previous blog entries to analyze the ancestry of openSNP members who have genotyping data associated with their accounts. By comparing the derived results with the self-reported answers from openSNP users, the accuracy of these methods on real-world data is assessed.
Identifying Family Relationships using Genetic Similarity Measures
In general, are parents more genetically similar to their children or to their siblings? Why do some siblings look alike while others look completely different? How can the genetic similarity between two individuals be computed? This post uses biostatistics to explore genetic similarity and tries to answer these and other questions.
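To make the last question concrete, one simple measure averages, per SNP, the number of alleles two genotypes share (identity by state). The sketch below assumes genotypes coded as minor-allele counts; the arrays are illustrative stand-ins for real genotyping data, not the measure necessarily used in this post.

```python
import numpy as np

# Genotypes coded as minor-allele counts (0, 1, or 2) per SNP.
person_a = np.array([0, 1, 2, 1, 0, 2, 1, 1])
person_b = np.array([0, 1, 1, 1, 0, 2, 2, 1])

def ibs_similarity(g1, g2):
    """Mean fraction of shared alleles per SNP (identity by state)."""
    return np.mean((2 - np.abs(g1 - g2)) / 2)

print(ibs_similarity(person_a, person_b))  # 1.0 means identical genotypes
```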
Decorrelation Redux
Consider the typical statement of the least squares problem:

$$\hat{\beta} = \underset{\beta}{\arg\min} \, \lVert X\beta - y \rVert_2^2,$$

where $X$ is the $m \times n$ data matrix, $\beta$ is the $n \times 1$ vector of regression coefficients, and $y$ is the $m \times 1$ vector of target values.
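As a numerical reference point, a problem of this form can be solved with an SVD-based routine; the synthetic data below is illustrative.

```python
import numpy as np

# Synthetic instance: X is m x n, y is m x 1 (shapes mirror the text).
rng = np.random.default_rng(0)
m, n = 100, 5
X = rng.normal(size=(m, n))
beta_true = rng.normal(size=n)
y = X @ beta_true + 0.1 * rng.normal(size=m)

# Solve min_beta ||X beta - y||_2^2 via the SVD-based lstsq routine.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_true, atol=0.1))  # recovers beta_true
```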