{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dealing with multiple half-sib families" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tom Ellis, March 2018, updated June 2020" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Created using FAPS version 2.6.6.\n" ] } ], "source": [ "import numpy as np\n", "import faps as fp\n", "import matplotlib.pyplot as plt\n", "\n", "print(\"Created using FAPS version {}.\".format(fp.__version__))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the previous sections on [genotype arrays](https://fractional-analysis-of-paternity-and-sibships.readthedocs.io/en/latest/tutorials/02_genotype_data.html), [paternity arrays](https://fractional-analysis-of-paternity-and-sibships.readthedocs.io/en/latest/tutorials/03_paternity_arrays.html) and [sibship clustering](https://fractional-analysis-of-paternity-and-sibships.readthedocs.io/en/latest/tutorials/04_sibship_clustering.html) we considered only a single half-sibling array. In most real-world situations, you would probably have multiple half-sibling arrays from multiple mothers.\n", "\n", "FAPS assumes that these families are independent, which seems a reasonable assumption for most application, so dealing with multiple families boils down to performing the same operation on these families through a loop. This guide outlines some tricks to automate this.\n", "\n", "This notebook will examine how to:\n", "\n", "1. Divide a dataset into multiple families\n", "2. Perform sibship clustering on those families\n", "3. Extract information from objects for multiple families\n", "\n", "To illustrate this we will use data from wild-pollinated seed capsules of the snapdragon *Antirrhinum majus*. Each capsule represents a single maternal family, which may contain mixtures of full- and half-siblings. Each maternal family can be treated as independent.\n", "\n", "These are the raw data described in Ellis *et al.* (2018), and are available from the [IST Austria data repository](https://datarep.app.ist.ac.at/id/eprint/95) (DOI:10.15479/AT:ISTA:95). For the analysis presented in that paper we did extensive data cleaning and checking, which is given as a [case study](https://fractional-analysis-of-paternity-and-sibships.readthedocs.io/en/latest/tutorials/08_data_cleaning_in_Amajus.html) later in this guide. Here, we will skip this process, since it primarily concerns accuracy of results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Divide genotype data into families" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are two ways to divide data into families: by splitting up a `genotypeArray` into families, and making a `paternityArray` for each, or create a single `paternityArray` and split up that." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Frequently offspring have been genotyped from multiple half-sibling arrays, and it is convenient to store these data together in a single file on disk. However, it (usually) only makes sense to look for sibling relationships *within* known half-sib families, so we need to split those data up into half-sibling famililes.\n", "\n", "First, import the required packages and data for the sample of candidate fathers and the progeny." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Populating the interactive namespace from numpy and matplotlib\n" ] } ], "source": [ "%pylab inline\n", "\n", "adults = fp.read_genotypes('../../data/parents_2012_genotypes.csv', genotype_col=1)\n", "progeny = fp.read_genotypes('../../data/offspring_2012_genotypes.csv', genotype_col=2, mothers_col=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For simplicity, let's restrict the progeny to those offspring belonging to three maternal families." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "ix = np.isin(progeny.mothers, ['J1246', 'K1809', 'L1872'])\n", "progeny = progeny.subset(individuals=ix)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also need to define an array of genotypes for the mothers, and a genotyping error rate." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "mothers = adults.subset(individuals=np.unique(progeny.mothers))\n", "mu= 0.0015" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pull out the numbers of adults and progeny in the dataset, as well as the number of maternal families." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2124\n", "76\n", "3\n" ] } ], "source": [ "print(adults.size)\n", "print(progeny.size)\n", "print(len(np.unique(progeny.mothers)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most maternal families are between 20 and 30, with some either side." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split up the `genotypeArray`" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "In the data import we specified that the ID of the mother of each offspring individual was given in column 2 of the data file (i.e. column 1 for Python, which starts counting from zero). Currently this information is contained in `progeny.mothers`.\n", "\n", "To separate a `genotypeArray` into separate families you can use `split`, and the vector of maternal names. This returns a **dictionary** of `genotypeArray` objects for each maternal family." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "progeny2 = progeny.split(progeny.mothers)\n", "mothers2 = mothers.split(mothers.names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we inspect `progeny2` we can see the structure of the dictionary. Python dictionaries are indexed by a **key**, which in this case is the maternal family name. Each key refers to some **values**, which in this case is a `genotypeArray` object for each maternal family." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'J1246': ,\n", " 'K1809': ,\n", " 'L1872': }" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "progeny2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can pull attributes about an individual family by indexing the key like you would for any other python dictionary." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "25" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "progeny2[\"J1246\"].size" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To do this for all families you can iterate with a **dictionary comprehension**, or loop over the dictionary. Here are three ways to get the number of offspring in each maternal family:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'J1246': 25, 'K1809': 25, 'L1872': 26}" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "{k: v.size for k,v in progeny2.items()} # the .items() suffix needed to separate keys and values" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'J1246': 25, 'K1809': 25, 'L1872': 26}" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "{k : progeny2[k].size for k in progeny2.keys()} # using only the keys." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "J1246 25\n", "K1809 25\n", "L1872 26\n" ] } ], "source": [ "# Using a for loop.\n", "for k,v in progeny2.items():\n", " print(k, v.size)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `paternityArray` objects with multiple families" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Paternity from a dictionary of `genotypeArray` objects" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The previous section divided up a `genotypeArray` containing data for offspring from multiple mothers and split that up into maternal families. You can then pass this dictionary of `genotypeArray` objects to `paternity_array` directly, just as if they were single objects. `paternity_array` detects that these are dictionaries, and returns a dictionary of `paternityArray` objects." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.04 s, sys: 77.9 ms, total: 1.12 s\n", "Wall time: 1.12 s\n" ] } ], "source": [ "%time patlik1 = fp.paternity_array(progeny2, mothers2, adults, mu, missing_parents=0.01)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split up an existing paternity array" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The alternative way to do this is to pass the entire arrays for progeny and mothers to \n", "`paternity_array`. A word of caution is needed here, because `paternity_array` is quite memory hungry, and for large datasets there is a very real chance you could exhaust the RAM on your computer and the machine will grind to a halt. By splitting up the genotype data first you can deal with small chunks at a time." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 980 ms, sys: 344 ms, total: 1.32 s\n", "Wall time: 1.32 s\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mothers_full = adults.subset(progeny.mothers)\n", "\n", "%time patlik2 = fp.paternity_array(progeny, mothers_full, adults, mu, missing_parents=0.01)\n", "patlik2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There doesn't seem to be any difference in speed the two methods, although in other cases I have found that creating a single `paternityArray` is slower. Your mileage may vary.\n", "\n", "We split up the `paternity_array` in the same way as a `genotype_array`. It returns a list of `paternityArray` objects." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'J1246': ,\n", " 'K1809': ,\n", " 'L1872': }" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "patlik3 = patlik2.split(progeny.mothers)\n", "patlik3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We would hope that `patlik` and `patlik3` are identical lists of `paternityArray` objects. We can inspect family J1246 to check:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['J1246_221', 'J1246_222', 'J1246_223', 'J1246_224', 'J1246_225',\n", " 'J1246_226', 'J1246_227', 'J1246_228', 'J1246_229', 'J1246_230',\n", " 'J1246_231', 'J1246_232', 'J1246_233', 'J1246_241', 'J1246_615',\n", " 'J1246_616', 'J1246_617', 'J1246_618', 'J1246_619', 'J1246_620',\n", " 'J1246_621', 'J1246_622', 'J1246_623', 'J1246_624', 'J1246_625'],\n", " dtype=',\n", " 'K1809': ,\n", " 'L1872': }" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "sc = fp.sibship_clustering(patlik1)\n", "sc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This time there is quite a substantial speed advantage to performing sibship clustering on each maternal family separately rather than on all individuals together. This advanatge is modest here, but gets substantial quickly as you add more families and offspring, because the number of *pairs* of relationships to consider scales quadratically. " ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 637 ms, sys: 12 µs, total: 638 ms\n", "Wall time: 636 ms\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%time fp.sibship_clustering(patlik2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can index any single family to extract information about it in the same way as was explained in the section on [sibship clustering](https://fractional-analysis-of-paternity-and-sibships.readthedocs.io/en/latest/tutorials/04_sibship_clustering.html). For example, the posterior distribution of full-sibship sizes for the first maternal family." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([4.98363699e-01, 0.00000000e+00, 2.50818150e-01, 0.00000000e+00,\n", " 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,\n", " 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,\n", " 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,\n", " 0.00000000e+00, 0.00000000e+00, 1.09643257e-03, 2.45353454e-01,\n", " 4.36749302e-03, 7.70743245e-07, 0.00000000e+00, 0.00000000e+00,\n", " 1.58450339e-56])" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sc['J1246'].family_size()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As with `genotypeArray` objects, to extract information about each `sibshipCluster` object it is straightforward to set up a list comprehension. For example, this cell pulls out the number of partition structures for each maternal family." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'J1246': 15, 'K1809': 15, 'L1872': 11}" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "{k : v.npartitions for k,v in sc.items()}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Paternity for many arrays" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since paternity is likely to be a common aim, there is a handy function for calling the `sires` method for [individual `sibshipCluster` objects](https://fractional-analysis-of-paternity-and-sibships.readthedocs.io/en/latest/tutorials/04_sibship_clustering.html#mating-events) on an entire dictionary of `sibshipCluster` objects." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
motherfatherlog_probproboffspring
0J1246M0551-1.319064e-020.9868960.973894
1J1246M1103-1.541488e-060.9999980.999881
2J1246M0147-5.206256e+000.0054820.000030
3J1246M00250.000000e+001.0000003.000000
4J1246nan0.000000e+001.00000020.026195
5K1809M0551-3.966770e-030.9960410.992776
6K1809M2047-3.327946e+000.0358670.016896
7K1809M1971-9.120326e-010.4017070.189237
8K1809L06622.220446e-161.0000001.000000
9K1809M0392-3.327946e+000.0358670.016896
10K1809nan2.220446e-161.00000022.784195
11L1872L01240.000000e+001.0000002.000000
12L1872M0107-3.686856e+000.0250510.000628
13L1872M08190.000000e+001.0000009.000000
14L1872nan0.000000e+001.00000014.999372
\n", "
" ], "text/plain": [ " mother father log_prob prob offspring\n", "0 J1246 M0551 -1.319064e-02 0.986896 0.973894\n", "1 J1246 M1103 -1.541488e-06 0.999998 0.999881\n", "2 J1246 M0147 -5.206256e+00 0.005482 0.000030\n", "3 J1246 M0025 0.000000e+00 1.000000 3.000000\n", "4 J1246 nan 0.000000e+00 1.000000 20.026195\n", "5 K1809 M0551 -3.966770e-03 0.996041 0.992776\n", "6 K1809 M2047 -3.327946e+00 0.035867 0.016896\n", "7 K1809 M1971 -9.120326e-01 0.401707 0.189237\n", "8 K1809 L0662 2.220446e-16 1.000000 1.000000\n", "9 K1809 M0392 -3.327946e+00 0.035867 0.016896\n", "10 K1809 nan 2.220446e-16 1.000000 22.784195\n", "11 L1872 L0124 0.000000e+00 1.000000 2.000000\n", "12 L1872 M0107 -3.686856e+00 0.025051 0.000628\n", "13 L1872 M0819 0.000000e+00 1.000000 9.000000\n", "14 L1872 nan 0.000000e+00 1.000000 14.999372" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fp.summarise_sires(sc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output is similar to that of the `sires()` method, except that it gives labels for mother and father separately, replacing the `label` column. The `prob` and `offspring` columns have the same interpretation as for single `sibshipCluster` objects." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 2 }