Sibship clustering

Tom Ellis, March 2017, updated June 2020

[34]:
import faps as fp
import numpy as np
import pandas as pd

print("Created using FAPS version {}.".format(fp.__version__))

Created using FAPS version 2.6.6.

FAPS uses information in a paternityArray to generate plausible full-sibship configurations. This information is stored as a sibshipCluster object, and is the basis for almost all inference about biological processes in FAPS.

This notebook will examine how to:

  1. Use a paternityArray to cluster offspring into plausible full-sibships.

  2. Compare the relative support for different partitions structures

  3. Make some basic inferences about the size and number of full sibships, and who is related to whom.

Note that this tutorial only deals with the case where you have a single maternal family. If you have multiple families, you can apply what is here to each one, but you’ll have to iterate over those families. See the specific tutorial on that.

Generating a sibshipCluster object

We will begin by generating a population of 100 adults with 50 loci.

[3]:
np.random.seed(867)
allele_freqs = np.random.uniform(0.3,0.5,50)
adults = fp.make_parents(100, allele_freqs, family_name='a')

We take the first individal as the mother and mate her to four males, to create three full sibships of five offspring. We then generate a paternityArray based on the genotype data.

[4]:
progeny = fp.make_sibships(adults, 0, [1,2,3], 5, 'x')
mothers = adults.subset(progeny.mothers)
patlik  = fp.paternity_array(progeny, mothers, adults, mu = 0.0015, missing_parents=0.01)

It is straightforward to cluster offspring into full sibships. For now we’ll stick with the default number of Monte Carlo draws.

[5]:
sc = fp.sibship_clustering(patlik)

The default number of Monte Carlo draws is 1000, which seems to work in most cases. I have dropped it to 100 in cases where I wanted to call sibship_clustering many times, such as in an MCMC loop, when finding every possible candidate wasn’t a priority. You could also use more draws if you really wanted to be sure you had completely sampled the space of compatible candidate fathers. Speeds are unlikely to increase linearly with number of draws:

[6]:
%timeit fp.sibship_clustering(patlik, ndraws=100)
%timeit fp.sibship_clustering(patlik, ndraws=1000)
%timeit fp.sibship_clustering(patlik, ndraws=10000)
17.5 ms ± 1.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
35.5 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
312 ms ± 2.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

We discussed this in figure 5 of the FAPS paper should you be interested in more on this.

Sibling relationships

Sibship clustering calculates likelihoods that each pair of offspring are full siblings, then builds a dendrogram from this. We can visualise this dendrogram if we so wish, although the output is not pretty.

[7]:
from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt

dendrogram(sc.linkage_matrix)
plt.show()
../_images/tutorials_04_sibship_clustering_16_0.png

Offspring individuals are labelled by their index in the array. Since full sibships are of size five we should expect to see clusters of {0,1,2,3,4}, {5,6,7,8,9} and {10,11,12,13,14}. This is indeed what we do see. What is difficult to see on the dendrogram are the branches connecting full siblings at the very bottom of the plot. If we bisect this dendrogram at different places on the y-axis we can infer different ways to partition the offspring into full siblings.

sc is an object of class sibshipCluster that contains various information about the array. Of primary interest are the set of partition structures inferred from the dendrogram. There are sixteen partitions - one for each individual in the array (i.e. one for each bifurcation in the dendrogram).

[8]:
sc.partitions
[8]:
array([[ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
       [ 2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  1,  1,  1,  1,  1],
       [ 2,  2,  2,  2,  2,  3,  3,  3,  3,  3,  1,  1,  1,  1,  1],
       [ 3,  3,  3,  3,  3,  4,  4,  4,  4,  4,  1,  2,  1,  1,  1],
       [ 3,  3,  3,  3,  3,  4,  4,  4,  5,  4,  1,  2,  1,  1,  1],
       [ 3,  3,  4,  3,  3,  5,  5,  5,  6,  5,  1,  2,  1,  1,  1],
       [ 3,  3,  4,  3,  3,  5,  5,  5,  7,  6,  1,  2,  1,  1,  1],
       [ 3,  3,  4,  3,  3,  5,  5,  6,  8,  7,  1,  2,  1,  1,  1],
       [ 3,  4,  5,  3,  3,  6,  6,  7,  9,  8,  1,  2,  1,  1,  1],
       [ 3,  5,  6,  4,  3,  7,  7,  8, 10,  9,  1,  2,  1,  1,  1],
       [ 4,  6,  7,  5,  4,  8,  8,  9, 11, 10,  2,  3,  1,  1,  1],
       [ 4,  7,  8,  6,  5,  9, 10, 11, 13, 12,  2,  3,  1,  1,  1],
       [ 5,  8,  9,  7,  6, 10, 11, 12, 14, 13,  3,  4,  1,  1,  2],
       [ 6,  9, 10,  8,  7, 11, 12, 13, 15, 14,  4,  5,  1,  2,  3]],
      dtype=int32)

What is key about partition structures is that each symbol represents a unique but arbitrary family identifier. For example in the third row we see the true partition structure, with individuals grouped into three groups of five individuals.

[9]:
sc.partitions[2]
[9]:
array([2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1], dtype=int32)

Beyond denoting who is in a family with whom, the labels are arbitrary, with no meaningful order. This partition would be identical to [0,0,0,0,0,1,1,1,1,1,2,2,2,2,2] or [10,10,10,10,10,7,7,7,7,7,22,22,22,22,22] for example.

Each partition is associated with a log likelihood and equivalent log probability. We can see from both scores that the third partition is most consistent with the data. This is of course the true partition.

[10]:
print(sc.lik_partitions) # log likelihood of each partition
print(np.exp(sc.prob_partitions)) # probabilities of each partition
[-4.71859462e+02 -3.07811673e+02 -1.54187774e-12            -inf
            -inf            -inf            -inf            -inf
            -inf            -inf            -inf            -inf
            -inf            -inf]
[1.18587580e-205 2.08491863e-134 1.00000000e+000 0.00000000e+000
 0.00000000e+000 0.00000000e+000 0.00000000e+000 0.00000000e+000
 0.00000000e+000 0.00000000e+000 0.00000000e+000 0.00000000e+000
 0.00000000e+000 0.00000000e+000]

We also see that the first and second partitions have non-zero, but small likelihoods. Parititons 5-8 have negative infinity log likelihood - they are incompatible with the data. These partitions split up true full siblings, and there is no way to reconcile this with the data. In real world situations such partitions might have non-zero likelihoods if they were an unrelated candidate male compatible with one or more offspring through chance alone.

In some cases there can be rounding error when log probabilities are exponentiated and probabilities do not sum to one. This is classic machine error, and the reason it is good to work with log values wherever possible. We can check:

[11]:
np.exp(sc.prob_partitions).sum()
[11]:
1.0

You can directly call the most likely partition. This is somewhat against the spirit of fractional analyses though…

[12]:
sc.mlpartition
[12]:
array([2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1], dtype=int32)

For information about fine scale relationships, sc.full_sib_matrix() returns an \(n*n\) matrix, where \(n\) is the number of offspring. Each element describes the (log) probability that a pair of individuals are full siblings, averaged over partition structures and paternity configurations. If we plot this using a heatmap you can clearly see the five full sibships jump out as blocks of yellow (>90% probability of being full siblings) against a sea of purple (near zero probability of being full siblings).

[13]:
sibmat = sc.full_sib_matrix()
plt.pcolor(np.exp(sibmat))
plt.colorbar()
plt.show()
../_images/tutorials_04_sibship_clustering_31_0.png

Note that real datasets seldom look this tidy!

Inferring family structure

For this section we will simulate a slightly more interesting family structure. This block of code creates a half-sib array of 15 offspring from five fathers, where each father contributes five, four, three, two and one offspring respectively. It then performs sibship clustering on the array. We use 1000 candidate males and 50 loci.

[14]:
# Lists indexing sires and dams
sires = [1]*5 + [2]*4 + [3]*3 + [4]*2 +[5]*1
dam   = [0] * len(sires)

np.random.seed(542)
allele_freqs = np.random.uniform(0.3,0.5,30)
adults  = fp.make_parents(1000, allele_freqs, family_name='a')
progeny = fp.make_offspring(adults, dam_list=dam, sire_list=sires)
mothers = adults.subset(progeny.mothers)

patlik  = fp.paternity_array(progeny, mothers, adults, mu= 0.0015, missing_parents=0.01)
sc = fp.sibship_clustering(patlik)

Number of families

We saw before that we could call a list of valid partitions for sc using sc.partitions. The output is not terribly enlightening on its own, however. We could instead ask how probable it is that there are x full sibships in the array, integrating over all partition structures. Here each number is the probability that there are 1, 2, …, 15 families.

[15]:
sc.nfamilies()
[15]:
array([9.34086609e-146, 3.27492546e-105, 1.31380725e-086, 3.81521349e-035,
       7.66449138e-001, 2.09575936e-001, 2.36188753e-002, 3.35259124e-004,
       2.07912635e-005, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
       0.00000000e+000, 0.00000000e+000, 0.00000000e+000])

We could show the same information graphically. Its clear that almost all the probability denisty is around \(x=5\) families.

[16]:
%matplotlib inline
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)
#ax.bar(np.arange(0.5, len(sc.nfamilies())+0.5), sc.nfamilies())
ax.bar(np.arange(1,16), sc.nfamilies())
ax.set_xlabel('Number of full sibships')
ax.set_ylabel('Posterior probability')
plt.show()
../_images/tutorials_04_sibship_clustering_40_0.png

Family size

We can also get the distribution of family sizes within the array, averaged over all partitions. This returns a vector of the same length as the number of offspring in the array. family_size returns the posterior probability of observing one or more families of size 1, 2, … , 15. It will be clear that we are unable to distinguish a single sibship with high probability from multiple families of the same size, each with low probability; this is the price we pay for being able to integrate out uncertainty in partition structure.

[17]:
sc.family_size()
[17]:
array([2.33452339e-001, 1.95053525e-001, 1.88263368e-001, 2.29940941e-001,
       1.53289828e-001, 0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
       4.37935750e-087, 0.00000000e+000, 1.63746273e-105, 0.00000000e+000,
       0.00000000e+000, 0.00000000e+000, 9.34086609e-146])

Plotting this shows that we are roughly equally likely to observe a family of sizes one, two, three, four and five.

[18]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.bar(np.arange(len(sires))+0.5, sc.family_size())

plt.show()
../_images/tutorials_04_sibship_clustering_45_0.png

Identifying fathers

In the section on paternityArray objects we used the method prob_array() to create an matrix of probabilities that each candidate is the father of each offspring, or that the father was missing from a paternityArray object. That was based only on comparing alleles shared between the mother, the candidate father and individual offspring.

We can get a similar matrix of probabilities of paternity after accounting for information about paternity that is shared between siblings by calling posterior_paternity_matrix on the sibshipCluster object sc. For example, father ‘a_1’ is the true sire of the furst five individuals in this toy dataset. We can compare his probabilities of paternity for his real offspring to the other unrelated progeny. We see that there is an increase in support for him as the father of the first five individuals after accounting for information shared between siblings, and a decrease for the other progeny.

[43]:
np.exp(
    pd.DataFrame({
        "Before clustering" : patlik.prob_array()[:, 1],
        "After clustering"  : sc.posterior_paternity_matrix()[:, 1]
    })
)
[43]:
Before clustering After clustering
0 8.842832e-01 1.000000e+00
1 1.000000e+00 1.000000e+00
2 8.630426e-01 9.999512e-01
3 9.696970e-01 1.000000e+00
4 7.852761e-01 9.498510e-01
5 3.906250e-72 4.088879e-273
6 1.367187e-83 4.088879e-273
7 8.750000e-60 4.088879e-273
8 8.750000e-60 4.088879e-273
9 1.070804e-83 2.226337e-88
10 7.567568e-83 1.814319e-84
11 1.296296e-71 2.695164e-76
12 1.869202e-72 7.826369e-132
13 3.926282e-60 7.826369e-132
14 1.361111e-82 1.361111e-82

Since posterior_paternity_matrix returns a matrix, some wrangling is required to get the output into a format that is useful for a human to read. The following sections describe two helper functions that use posterior_paternity_matrix to create (1) a summary of probable mating events and (2) a summary of the paternity of each offspring.

Mating events

We very frequently want to know who the fathers of the offspring were to say something about mating events. There are several levels of complexity. Firstly, you can use the sires method to return a list of all the males who could possibly have mated with the mother. This is essentially identifying mating events, but doesn’t say anything about the paternity of individual offspring. For many applications, that may be all you need because it’s the mating events that are the unit of interest, not the number of offspring per se.

Once you have a sibshipCluster object, doing this is easy:

[18]:
sc.sires()
[18]:
position label log_prob prob offspring
0 1 a_1 0.000000e+00 1.000000e+00 4.949802e+00
1 2 a_2 0.000000e+00 1.000000e+00 4.000000e+00
2 3 a_3 0.000000e+00 1.000000e+00 2.996754e+00
3 4 a_4 -1.110223e-16 1.000000e+00 2.000000e+00
4 5 a_5 -1.151638e-01 8.912201e-01 8.888889e-01
5 87 a_87 -2.218429e+00 1.087799e-01 1.111111e-01
6 152 a_152 -1.078098e+01 2.079126e-05 1.163148e-06
7 254 a_254 -2.209107e+00 1.097987e-01 2.292585e-02
8 257 a_257 -3.677553e+00 2.528478e-02 5.731309e-03
9 376 a_376 -2.908523e+00 5.455628e-02 1.146262e-02
10 388 a_388 -3.677656e+00 2.528218e-02 5.731309e-03
11 631 a_631 -5.118017e+00 5.987884e-03 1.432977e-03
12 639 a_639 -4.750292e+00 8.649166e-03 1.295942e-03
13 668 a_668 -7.947764e+00 3.534515e-04 3.841083e-05
14 680 a_680 -4.542409e+00 1.064773e-02 1.295942e-03
15 693 a_693 -4.370803e+00 1.264109e-02 2.865655e-03
16 852 a_852 -1.286042e+01 2.598908e-06 6.001692e-07
17 963 a_963 -5.364877e+00 4.678034e-03 6.479710e-04
18 1000 nan -1.977494e+02 1.313807e-86 2.826406e-10

The columns in the output tell you several bits of information. The most interesting of these are:

  1. label is the name of the candidate father

  2. prob is the probability that the male sired at least one offspring with the mother, as a weighted average over partition structures. For example,

  3. offspring shows the expected number of offspring sired by the male, as a weighted average over partition structures. Specifically, it’s the sum over rows from posterior_paternity_matrix; see below.

Note that if you have multiple maternal families the output will look a bit different. See the tutorial on multiple maternal families for more.

We can check this table makes sense by reviewing who the real fathers really are. This snippet gives a list of the names of the five true fathers, followed by the number of offspring sired by each.

[19]:
np.unique(patlik.fathers, return_counts=True)
[19]:
(array(['a_1', 'a_2', 'a_3', 'a_4', 'a_5'], dtype='<U5'),
 array([5, 4, 3, 2, 1]))

The first five rows of the table above show that these fathers have posterior probabilities of paternity of one or close to one, and seem to have sired the correct numbers of offspring each. Of note is that although a_1 to a_4 have posterior probabilities of exactly one, the posterior probability for a_5 is slightly less than one. This is because the first four fathers sired multiple offspring, and there is shared information between siblings about the father, but this is not the case for father a_5.

After the true fathers there are a long list of extra candidates with very low posterior probabilities of paternity. In this case we know they are not true fathers, but in real life we would not, and we would like to account for this uncertainty.

Paternity of individuals

If you are interested in the paternity of individual offspring we can look at the output of the paternity method of sibshipCluster objects. This returns a data frame with a row for each offspring, and the name and log posterior probability of paternity for the four most likely candidate fathers.

[21]:
sc.paternity()
[21]:
progeny candidate_1 logprob_1 candidate_2 logprob_2 candidate_3 logprob_3 candidate_4 logprob_4
0 offs_0 a_1 -1.229779e-01 a_839 -2.202419 a_852 -5.668155 a_631 -6.361302
1 offs_1 a_1 -1.199290e-10 missing -22.852463 a_14 -28.101025 a_773 -29.710463
2 offs_2 a_1 -1.472913e-01 a_668 -2.226733 a_839 -3.613027 a_852 -6.385616
3 offs_3 a_1 -3.077166e-02 a_343 -3.496508 missing -21.443367 a_605 -27.661793
4 offs_4 a_1 -2.417199e-01 a_254 -2.321161 a_376 -3.014309 a_388 -3.707456
5 offs_5 a_2 -1.355254e-10 missing -22.732634 a_329 -28.457700 a_951 -29.017315
6 offs_6 a_2 -6.884449e-11 missing -23.405092 a_378 -28.794172 a_368 -30.403610
7 offs_7 a_2 -4.156320e-11 missing -23.909696 a_813 -30.180466 a_712 -30.403610
8 offs_8 a_2 -7.621637e-11 missing -23.310463 a_951 -29.017315 a_792 -29.150847
9 offs_9 a_3 -1.108144e-01 a_152 -2.883403 a_159 -3.576550 a_368 -4.269697
10 offs_10 a_3 -1.451820e-01 a_680 -2.917771 a_639 -2.917771 a_963 -3.610918
11 offs_11 a_3 -5.324451e-02 a_254 -3.518980 a_711 -4.212128 a_950 -4.905275
12 offs_12 a_4 -6.866347e-10 missing -21.107519 a_39 -27.071405 a_382 -27.407878
13 offs_13 a_4 -6.429435e-02 a_604 -2.836883 a_27 -5.609472 missing -21.792526
14 offs_14 a_5 -1.177830e-01 a_87 -2.197225 missing -23.714176 a_852 -32.377691

In this case, the data are simulated, so we know who the real fathers are, and can print them with progeny.fathers. We see that the top candidate for all offspring is indeed the true father, with strong support (log probabilitites close to zero).

[25]:
progeny.fathers
[25]:
array(['a_1', 'a_1', 'a_1', 'a_1', 'a_1', 'a_2', 'a_2', 'a_2', 'a_2',
       'a_3', 'a_3', 'a_3', 'a_4', 'a_4', 'a_5'], dtype='<U5')

We also see that in most cases the second most-probable suggestion is that the father is not in the samples of candidates. That’s a good sign, because it means that if the real father had been missed for some reason, FAPS would have recognised this.

By default, paternity will return information on the top four candidates, but this can be changed with the n_candidates option (for example, sc.paternity(n_candidates = 5)). It goes somewhat against the spirit of fractional paternity to consider only the top candidate fathers, because we are interested in the uncertainty in paternity. In practice however, I have found that when paternity is done with SNP data, top candidates usually have very strong support, but I would be interested to learn if users have different experiences.