{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Simulating data and power analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tom Ellis, August 2017" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Created using FAPS version 2.6.6.\n" ] } ], "source": [ "import numpy as np\n", "import faps as fp\n", "import matplotlib.pylab as plt\n", "import pandas as pd\n", "from time import time, localtime, asctime\n", "\n", "print(\"Created using FAPS version {}.\".format(fp.__version__))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before committing to the time and cost of genotyping samples for a paternity study, it is always sensible to run simulations to test the likely statistical power of your data set. This can help with important questions regaridng study design, such as finding an appropriate balance between the number of families vs offspring per family, or identifying a minimum number of loci to type. Simulated data can also be useful in verifying the results of an analysis.\n", "\n", "FAPS provides tools to run such simulations. In this notebook we look look at:\n", "\n", "1. Basic tools for simulating genotype data.\n", "2. Automated tools for power analysis.\n", "3. Crafting custom simulations for specialised purposes.\n", "4. Simulations using emprical datasets (under construction).\n", "\n", "It is worth noting that I relied on loops for a lot of these tools, for the purely selfish reason that it was easy to code. Loops are of course slow, so if you work with these tools a lot there is ample scope for speeding things up (see especially the functions `make_offspring`, `make_sibships` and `make_power`)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Simulation building blocks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating `genotypeArray` objects" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Simulations are built using `genotypeArrays`. See the section on these [here](http://localhost:8889/notebooks/docs/02%20Genotype%20data.ipynb) for more information.\n", "\n", "`make_parents` generates a population of reproductive adults from population allele frequencies.\n", "This example creates ten individuals.\n", "Note that this population will be in Hardy-Weinberg equilibrium, but yours may not." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "np.random.seed(37)\n", "allele_freqs = np.random.uniform(0.2, 0.5, 50)\n", "adults = fp.make_parents(10, allele_freqs, family_name='adult')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are multiple ways to mate adults to generate offspring. If you supply a set of adults and an integer number of offspring, `make_offspring` mates adults at random." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['adult_7/adult_2', 'adult_1/adult_6', 'adult_8/adult_3',\n", " 'adult_8/adult_0', 'adult_0/adult_7'], dtype='" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "eventab, evenclusters = fp.make_power(\n", " replicates = r, \n", " nloci = nloci,\n", " allele_freqs = allele_freqs,\n", " candidates = nadults,\n", " sires = sires,\n", " offspring = offspring, \n", " missing_loci=0,\n", " mu_real = mu, \n", " unsampled_input=0.01,\n", " return_clusters=True,\n", " verbose=False\n", ")\n", "even_famsizes = np.array([evenclusters[i].family_size() for i in range(len(evenclusters))])\n", "\n", "plt.plot(even_famsizes.mean(0))\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Custom simulations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once you are familiar with the basic building blocks for generating data and running analysis, creating your own simulations if largely a case of setting up combinations of parameters, and looping over them.\n", "Given the vast array of possible scenarios you could want to simulate, it is impossible to be comprehensive here, so it must suffice to given a couple of examples for inspiration." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Likelihood for missing sires" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example is was interested in the performance of the likelihood estimator for a sire being absent.\n", "This is the likelihood of generating the offspring genotype if paternal alleles come from population allele frequencies.\n", "This is what the attribute `lik_abset` in a `paternityArray` tells you.\n", "\n", "Ideally this likelihood should be below the likelihood of paternity for the true sire, but higher than that of the other candidates. I suspected this would not be the case when minor allele frequency is low and there are many candidates.\n", "\n", "This cell sets up the simulation. I'm considering 50 loci, and mu=0.0015, but varying sample size and allele frequency." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "# Common simulation parameters\n", "nreps = 10 # number of replicates\n", "nloci = [50] # number of loci\n", "allele_freqs = [0.1, 0.2, 0.3, 0.4, 0.5] # draw allele frequencies \n", "nadults = [10, 100, 250, 500, 750, 1000] # size of the adults population\n", "mu_list = [0.0015] #genotype error rates\n", "nsims = nreps * len(nloci) * len(allele_freqs) * len(nadults) * len(mu_list) # total number of simulations to run\n", "dt = np.zeros([nsims, 7]) # empty array to store data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This cell simulates genotype data and clusters the offspring into full sibships.\n", "The code pulls out the mean probability that each sire is absent, and the rank of the likelihood for a missing sire among the likelihoods of paternity for the candidates." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Beginning simulations on Wed Aug 18 11:12:20 2021.\n", "Completed in 0.03 hours.\n" ] } ], "source": [ "t0 = time()\n", "counter = 0\n", "\n", "print(\"Beginning simulations on {}.\".format(asctime(localtime(time()) )))\n", "\n", "for r in range(nreps):\n", " for l in range(len(nloci)):\n", " for a in range(len(allele_freqs)):\n", " for n in range(len(nadults)):\n", " for m in range(len(mu_list)):\n", " af = np.repeat(allele_freqs[a], nloci[l])\n", " adults = fp.make_parents(nadults[n], af)\n", " progeny = fp.make_offspring(adults, 100)\n", " mi = progeny.parent_index('m', adults.names) # maternal index\n", " mothers = adults.subset(mi)\n", " patlik = fp.paternity_array(progeny, mothers, adults, mu_list[m], missing_parents=0.01)\n", " # Find the rank of the missing term within the array.\n", " rank = [np.where(np.sort(patlik.prob_array()[i]) == patlik.prob_array()[i,-1])[0][0] for i in range(progeny.size)]\n", " rank = np.array(rank).mean() / nadults[n]\n", " # get the posterior probabilty fir the missing term.\n", " prob_misisng = np.exp(patlik.prob_array()[:, -1]).mean()\n", " #export data\n", " dt[counter] = np.array([r, nloci[l], allele_freqs[a], nadults[n], mu_list[m], rank, prob_misisng])\n", " # update counters\n", " counter += 1\n", "\n", "print(\"Completed in {} hours.\".format(round((time() - t0)/3600,2)))\n", "\n", "head = ['rep', 'nloci', 'allele_freqs', 'nadults', 'mu', 'rank', 'prob_missing']\n", "dt = pd.DataFrame(dt, columns=head)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 1 }