Genetic architecture and evolution of the S locus supergene in Primula vulgaris

Darwin's studies on heterostyly in Primula described two floral morphs, pin and thrum, with reciprocal anther and stigma heights that promote insect-mediated cross-pollination. This key innovation evolved independently in several angiosperm families. Subsequent studies on heterostyly in Primula contributed to the foundation of modern genetic theory and the neo-Darwinian synthesis. The established genetic model for Primula heterostyly involves a diallelic S locus comprising several genes, with rare recombination events that result in self-fertile homostyle flowers with anthers and stigma at the same height. Here we reveal the S locus supergene as a tightly linked cluster of thrum-specific genes that are absent in pins. We show that thrums are hemizygous not heterozygous for the S locus, which suggests that homostyles do not arise by recombination between S locus haplotypes as previously proposed. Duplication of a floral homeotic gene 51.7 million years (Myr) ago, followed by its neofunctionalization, created the current S locus assemblage which led to floral heteromorphy in Primula. Our findings provide new insights into the structure, function and evolution of this archetypal supergene.

H eterostyly evolved independently in at least 28 families of animal-pollinated angiosperms 1 . In the Primulaceae the majority of species 2 produce dimorphic flowers 3 , a characteristic inherited as a simple Mendelian trait; alleles are defined as S (Short style) and s (long style) 4 . The two floral forms are known as pin and thrum; thrums behave as heterozygous S/s and pins homozygous s/s 5 . Classical genetic studies on mutation, linkage and recombination by Bateson and Gregory 5 , Bridges 6 , Ernst 7,8 , De Winton and Haldane 9 , Darlington 10

and others established
Primula as an early genetic model, and led to the definition of a co-adapted linkage group of three genes at the S locus, G (Griffel (style) length), P (Pollen size) and A (Antheren (anther) position) 7 , which control distinct aspects of heteromorphic flower development; this locus defined the archetypal supergene 11 . Studies of heterostyly in Primula contributed significantly to the foundation of modern genetic theory and the neo-Darwinian synthesis. Supergenes have subsequently been shown to control other multitrait complex phenotypes in plants, animals and fungi 12 .
Pin flowers have a long style and low anthers, and thrum flowers have a short style and high anthers (Fig. 1a). This reciprocal herkogamy promotes insect-mediated cross-pollination between floral morphs, which actively enhances efficiency of reciprocal pollen transfer 13 ; such biotic pollination is associated with an elevated speciation rate in angiosperms 14 . Differences in stigma shape, papillae length, pollen size and corolla mouth diameter characterize dimorphic Primula flowers 3,15 ; a sporophytic self-incompatibility system 16 inhibits intramorph pollination 1 , with different efficacy in different Primula species 17 . Self-fertile homostyle flowers (Fig. 1a) occasionally occur 13 ; although originally considered mutants 8 , later studies led to the widely accepted view that self-fertile homostyles arise by recombination in heterozygous thrums between dominant (GPA) and recessive ( gpa) haplotypes, with associated disruption of coupling between male and female self-incompatibility functions (for example, gPA and Gpa) 18,19 . This interpretation defined the order of genes at the Primula S locus, and has formed the backdrop to the last 60 years of research into the S locus supergene, including models on the evolution of heterostyly 20,21 and population genetic analyses [22][23][24] in natural homostyle populations 25,26 .
More recent studies aimed at identifying S locus genes involved examination of flower development 27 , analysis of differentially expressed floral genes 28 , characterization of S locus-linked sequences 29,30 , molecular genetic analysis of S-linked mutant phenotypes [31][32][33] , creation of genetic and physical maps 34,35 , assembly of a partial genome sequence 36 and construction of bacterial artificial chromosome (BAC) contigs spanning the S locus 35 . Despite these extensive investigations, the genetic architecture of the S locus has, until now, been an unresolved enigma. Here we compare the S haplotype sequences from pin, thrum, long homostyle and short homostyle plants. The s haplotype lacks a 278 kb sequence containing five thrum-specific genes present in thrum and homostyles; thrums are therefore hemizygous not heterozygous for the S locus. We demonstrate that this 278 kb region is the only thrum-specific genomic region transcribed in flowers, and by genetic and natural population analyses demonstrate complete linkage to the S locus; our data indicate that homostyles cannot occur by recombination as proposed. We also provide an estimate of the evolutionary age of assembly for the S locus supergene.

Identification and assembly of the S locus
We previously used four S-linked probes to assemble two BAC contigs flanking the S locus; these were integrated into a genetic map with the gap between contigs predicted to contain some, or all, of the S locus genes 35 . A fifth S-linked probe 33 , GLO T , also identified a BAC clone that we could not position relative to our S locus map 35 . In parallel, we initiated the de novo assembly of a P. vulgaris reference genome using a self-fertile homozygous long homostyle (S LH1 /S LH1 ) from the Somerset population identified previously 25 . We also generated genomic sequencing data from individual pin (s/s) and thrum (S/s) plants, pools of their pin and thrum progeny and a short homostyle (S SH1 /s) 35 (Supplementary Table 1a). Figure 1a shows relevant floral phenotypes and genotypes.
Using the GLO T BAC (BAC70F11) we searched a long homostyle genome assembly (Supplementary Table 1b) to identify and link two genome sequence contigs. This step initiated the assembly of a contiguous 455,881 bp sequence encompassing the entire S LH1 haplotype from this highly homozygous inbred line ( Supplementary Fig. 1a). This assembly contains a 278,470 bp sequence which is absent from pins and flanked by a ∼3 kb tandem repeat that is present only as a single copy in the s haplotype (Fig. 1); each repeat contains a Cyclin-like F box (CFB) gene. We therefore focused on this region as the presumptive S locus. Sequences flanking the S LH1 278,470 bp region on the left (75,084 bp) and right (96,327 bp) share extensive similarity to the s haplotype ( Fig. 1b and Supplementary Fig. 1b).
Next, we designed polymerase chain reaction (PCR) primers for left-and right-border regions of S LH1 , and separate s haplotypespecific primers (Supplementary Table 2). Analyses with pin, thrum, long and short homostyle genomic DNA confirmed pin as s/s, and long homostyle as homozygous S LH1 /S LH1 (Fig. 1c).
Supported by sequence alignment (Supplementary Sequence Analyses 1 and 2), these data also show that thrum and the short homostyle share the same left-and right-border sequences as the long homostyle, and that they are both heterozygous for the s and S LHI flanking markers (Fig. 1c). The established model defines homostyles as recombinants between S and s haplotypes; if this is the case, long and short homostyles should possess reciprocal combinations of s and S haplotype left-and right-border sequences, but they do not (Fig. 1c).

Comparative analysis of S haplotypes
We then focused on the 278 kb region from S LH1 that is absent from the s haplotype. This region contains five predicted gene models, CCM T , GLO T , CYP T , PUM T and KFB T , which were manually curated and are supported by RNA-Seq data as thrum-specific in expression; four other models identify transposon sequences which were discounted as functional S locus genes and excluded from further analysis ( Supplementary Fig. 2a,b). CCM T (Conserved Cysteine Motif ) encodes a protein with a carboxy (C)-terminal domain that is conserved in monocots and dicots (Supplementary Sequence Analysis 3); proteins containing this novel domain are rich in either proline or negatively charged amino acids. One of these, PIG93 from Petunia x hybrida, is a partner of PSK8, a protein involved in brassinolide signalling 37 . A second CCM-like gene with 90% sequence similarity is found in both pin and thrum genomes. GLO T was originally defined as a thrum-specific allele 33 of P. vulgaris GLO, a floral homeotic gene responsible for the S locus-linked mutant phenotype Hose in Hose 32,35 . These data show GLO and GLO T as distinct loci; the encoded proteins share 82% sequence identity but the Hose in Hose mutation, in which GLO is dominantly upregulated, does not affect heterostyly 32 . CYP T encodes a cytochrome P450 similar to Arabidopsis CYP72B1, a brassinolide 26-hydroxylase 38 . CYP T is one of four CYP72 class genes in the P. vulgaris thrum genome. The other three are present in both pin and thrum; the closest encodes a protein with 65% sequence identity to CYP T . PUM T encodes a Pumilio-like 39 RNA-binding protein, and KFB T encodes a protein with similarity to the Arabidopsis Kiss-Me-Deadly Kelch repeat F Box protein involved in regulating cytokinin activity 40 . Both PUM T and KFB T are unique to the 278 kb region with no homologues found in our pin genome assembly. The tandemly duplicated sequences flanking the S locus contain Cyclin-like F Box genes, CFB TL and CFB TR ( Supplementary Fig. 2a); in pin, a single CFB P exists. Gene model predictions also identified seven genes in the 75 kb to the left of CFB TL , and eight genes in the 96 kb to the right of CFB TR , designated S Flanking Gene Left (SFG L ) and Right (SFG R ) ( Supplementary Fig. 2a); these genes are present in both pin and thrum.
To investigate S haplotype differences further, we aligned thrum and the short homostyle genome contigs to the 455 kb S LH1 region ( Supplementary Fig. 3a,b); although S and S SH1 assemblies are not contiguous, they show homology across the 278 kb region. We also aligned genomic sequencing reads from pin, thrum, long 25 and short 35 homostyle to the S LH1 assembly and plotted sequencing read depth across the 455 kb region (Fig. 2a). Sequences flanking the 278 kb insertion show a read depth of ∼60 in all four genomes. However, between CFB TL and CFB TR we see differences; the long homostyle (S LH1 /S LH1 ) behaves as a homozygote, but both thrum (S/s) and the short homostyle (S SH1 /s) have half the associated read depth (Fig. 2a); they behave genetically as heterozygotes, but our data show they are hemizygous for a region that is absent in pin (s/s). Alignment of all four genomes over this region further shows that thrum, long and short homostyles share the same boundary regions (Figs 1c, 2a and Supplementary Figs 2, 3). This detail, coupled with the presence of all five S locus genes in thrum, long and short homostyle (Fig. 2a) shows that these homostyles did not arise by recombination as proposed 18,19 , and that S, S LH1 and S SH1 haplotypes all reside within an equivalent region that is absent from pins.
To determine whether the 278 kb region is the only thrumspecific region in the genome, we searched for additional thrumspecific genomic sequences encoding genes. In two parallel analyses we identified transcripts that were only expressed in thrums, and also mapped pin genomic sequencing reads to a thrum genome assembly. We then examined the depth and breadth of pin genomic reads mapped to the thrum genome in the regions defined by the thrum-specific transcripts; k-means clustering analysis resolved the transcribed regions into two clusters (Fig. 2b); deep and broad read coverage defined the presence of the region in both pin and thrum genomes; low read depth or low coverage identified a region as thrum-specific, with pin sequence alignments representing erroneously mapped sequencing reads (Supplementary Table 3). Nine thrum-specific regions were thus identified; these define four of the five thrum-specific genes from the 278 kb region; GLO T , CYP T , PUM T and KFB T (Fig. 2b). CCM T is expressed at a low level (see below) and is the only gene from the cluster not represented. Identification of three contigs for GLO T and KFB T , and two for CYP T , is due to the use of a non-scaffolded thrum genome assembly (Supplementary Table 1b) and the length of GLO T and CYP T (see below). We conclude that there are no other flower-expressed genes unique to thrums and that the 278 kb sequence is the only thrum-specific genomic region. Significantly, these data show that the thrum S haplotype does not contain any additional genes compared with the long homostyle S LH1 haplotype. These analyses revealed 391 gene models that are uniquely expressed in thrums, and 270 gene models that are uniquely expressed in pins, but present in both pin and thrum genomes; these are candidates for direct or indirect targets of the S locus genes that control pin and thrum flower development.
Next we investigated whether the sequences flanking the thrumspecific 278 kb region could also be part of the S locus that contained pin-and thrum-specific alleles of genes involved in the control of heterostyly. If sequences flanking the thrum-specific region contain genes that also contribute to S locus function, restriction of recombination between pin and thrum alleles would be required to maintain integrity and functionality of the locus. However, if these flanking regions are freely recombining this would indicate that the thrum-specific region alone contains the entire S locus gene cluster. We therefore undertook a recombination analysis investigating the pattern of single nucleotide polymorphisms across the flanking sequences, comparing the alleles present in a pin and a thrum plant. These data (Supplementary Fig. 4) reveal that sequences flanking the thrum-specific region contain blocks of significantly reduced polymorphism, which is consistent with recent recombination events. The sequences flanking the thrumspecific 278 kb region thus seem to be homogenized by recombination between pin and thrum alleles, suggesting they are not involved in the control of the heterostyly phenotype.
Linkage of GLO T and the S locus GLO T was initially identified as thrum-specific in a small segregating population 33 . To demonstrate unequivocal linkage of GLO T , and therefore the S LH1 assembly, to the S locus, we revisited a three-point cross with 2,075 progeny 35 Table 3). Transcript regions in contigs to which pin progeny pool genomic sequencing reads map with high depth and breadth (blue).   35 . The S haplotype 278 kb region (red) between duplicated ∼3 kb CFB loci (yellow) is shown relative to sequenced BAC contigs 35 with 75 kb left flanking (blue) and 96 kb right flanking (green) sequences (Fig. 1b). b, PCR analysis of GLO T linkage using pin (P) and thrum (T) plants and 100 pooled non-recombinant (NR) progeny from a three-point cross 35 (Supplementary Table 4) compared with non-S locus GLO as control. Two thrum plants (T1 and T2) from doublerecombination events (DR) 35 Table 4). This cross also yielded the short homostyle Hose in Hose plant 35 used here (Fig. 1a). We analysed DNA from pin and thrum parents, pools of pin and thrum non-recombinant progeny and two doublerecombinant (Oakleaf-S-Hose in Hose) thrum progeny by PCR analysis with GLO T -and GLO-specific primers ( Fig. 3 and Supplementary Table 2); GLO is present in both pin and thrum 32 . The parent plants show the original linkage profiles (Fig. 3b); we found no linkage disruption between GLO T and thrum phenotype using pools of 100 non-recombinant progeny. Furthermore, double recombinants show that recombination between Oakleaf and S, or S and Hose in Hose, does not disrupt linkage between GLO T and thrum phenotype (Fig. 3b). These data place the 455 kb assembly between Oakleaf and Hose in Hose within the S locus BAC assembly 35 (Fig. 3a); previous studies did not identify any BACs that link the 455 kb region to BAC contigs S-left and S-right 35 . To increase mapping resolution, we analysed natural populations of P. vulgaris and P. veris. Pooled genomic DNA from 200 pin plants of each species was analysed by PCR using GLO Tand GLO-specific primers (Fig. 3c). A single thrum plant was used as control because loss of a dominant marker in one individual would not be detected in a thrum pool. In total, 500 pin plants were analysed (Fig. 3b,c); none showed recombination. These data demonstrate that GLO T and the surrounding region are in tight thrum-specific linkage (<0.2 cM) with the S locus in both P. vulgaris and P. veris (Fig. 3a).

S locus gene expression and function
Having shown that homostyles did not occur by recombination, we sought to determine their molecular basis by comparing gene expression across the four haplotypes. Expression analysis of genes within, and flanking, the S locus was undertaken by mapping four replicate RNA-Seq datasets from pin and thrum flowers to the S LH1 assembly (Fig. 4a and Supplementary Table 5). GLO T , CYP T , PUM T , KFB T and CCM T all show thrum-specific expression (Figs 2b,  4a). CFB TL is expressed at a low level in both pin and thrum flowers; CFB TR is not expressed. Genes flanking the S locus are expressed in both pin and thrum flowers, except SFG R 6, which has low expression in thrum and is not detected in pin; SFG L 1 is expressed at a low level in both pin and thrum ( Fig. 4a and Supplementary Table 5). These analyses reveal the 278 kb region as an island of thrum-specific gene expression.
Gene model predictions ( Supplementary Fig. 2a) for CCM T , GLO T , CYP T , PUM T , KFB T and three CFB alleles were confirmed by alignment to RNA-Seq data to define intron-exon boundaries. Two S locus genes are surprisingly large: GLO T spans 25 kb with two introns over 10 kb; CYP T spans 68 kb with 10, 20 and 30 kb introns ( Supplementary Fig. 2c). Interestingly, the GLO T S SH1 allele contains a 2.5 kb retrotransposon in exon 2 which disrupts and severely truncates the encoded protein; mutation of GLO T in the short homostyle is associated with loss of anther elevation; style length and pollen size are unaffected. The long homostyle (S LH1 ) CYP T allele has a single base insertion in exon 2 that introduces a disruptive premature stop codon, and is associated with loss of style length suppression; anther height and pollen size are unaffected. We also sequenced CYP T cDNA from an independent long homostyle (S LH2 ) from the Chiltern Hills 25 which represents a second CYP T mutant allele with a G-C transversion in exon 2 that results in an Asp 126 His substitution. CFB TR has an 11 bp deletion compared to CFB TL and CFB P that introduces a premature stop codon. The architecture of S and s haplotypes is summarized in Fig. 4b. Comparison of alleles is presented in Supplementary Sequence Analysis 3.

Date of the GLO T duplication
The first indication that GLO T was a discrete locus from GLO came when we identified distinct BAC clones for each gene, together with insight from other studies of B function MADS box genes, which suggested duplication could underpin diversification of novel floral morphologies 42 . The short homostyle GLO T mutation is not complemented by GLO, or by ectopic expression of GLO; the short homostyle is in the Hose in Hose background 32 . The recent report of a partial P. veris genome sequence 36 noted the duplication of GLO and referred to the genes as GLO1 and GLO2 but could not show linkage of GLO T (GLO2) to the S locus. Demonstration that these genes represent distinct loci with GLO T at the S locus provides the opportunity to date the duplication event associated with assembly of the S locus supergene. To determine the age of duplication we isolated GLO and GLO T sequences from six Primula species, and used these with sequences from other species to conduct a Bayesian relaxed-clock phylogenetic analysis with a combination of secondary calibrations (Fig. 5,  Supplementary Table 6a,b). The index of substitution (Iss) saturation value 43 for GLO and GLO T sequences (0.1187) was significantly lower than the Iss critical value (0.7318, P < 0.0001), indicating low saturation between these sequences. These analyses yielded a mean (5-95% highest posterior density) age estimate of 51.7 (33.1-72.1) Myr for the duplication leading to the divergence of the GLO and GLO T lineages.   Figure 4 | Expression and genomic organization of S locus genes. a, Gene expression from the S (red) and s (blue) haplotypes using pin and thrum RNA-Seq data represented as log 10 of the number of fragments per kb of transcript per million fragments mapped (FPKM) +1; gene models as defined in Supplementary Fig. 2a The duplication and neofunctionalization of GLO T represents a landmark evolutionary event at the S locus, and precedes estimates for the Primula-Androsace divergence; estimates for this node are 32 (20-51) Myr 44 and 44 (33-54) Myr 44 with fossil priors being set with a log normal distribution, and 40 (30-51) Myr with fossils modelled as exponential priors 45 . The Androsace were predicted to be the first taxon within the Primulaceae to exhibit heteromorphy 46 . Our data indicate that the GLO-GLO T duplication predates this divergence, which implies heterostyly evolved following a single duplication event in the Primulaceae. Two models have been proposed for the evolution of Primula heterostyly: the first postulates a long homostyle 21 and the other an approach herkogamous pin-form flower 20 as the original floral form. The duplication and neofunctionalization of GLO T would be consistent with both models if this was the first gene at the proto-S locus (Fig. 5). The S locus sequence and structure, timing of the GLO T duplication, and analysis of other genes at the S locus, will inform further evolutionary genetic studies of heterostyly and homostyly in Primula, and help to determine the sequence of events leading to the establishment of the S locus gene cluster.

Conclusion
We show that the S locus supergene is a tightly linked cluster of five thrum-specific genes, spanning a 278 kb sequence that is absent in pins (Fig. 2a). This finding defines the basis for Bateson and Gregory's S-haplotype dominance 5 . The annotation S/s and s/s for thrum and pin could be represented by S/-and -/-, but we suggest retention of the traditional nomenclature with recognition of s as a null haplotype. Floral heteromorphy in Primula has evolved after duplication of a floral homeotic gene 51.7 Myr ago, followed by its neofunctionalization, creating the current S locus assemblage. This insight has profound implications for our understanding of a key evolutionary innovation of flowering plants. The molecular basis of the Primula S locus supergene appears to be different from those proposed for the control of butterfly mimicry, and avian and insect social behaviour [47][48][49] . It is also unlike the mating-type locus in ascomycete fungi, which comprises two distinct idiomorphs 50 . Ernst 8 originally proposed that Primula homostyles arose by mutation. He was correct, and mutations in CYP T and GLO T homostyle alleles earmark these genes as candidates for the style length suppression (G), and anther elevation (A) functions 7 respectively. Darwin suggested the primary function of heterostyly evolved to promote out-crossing 13 , generating novel variation that is the substrate of natural selection. The parallel evolution of heterostyly in diverse angiosperm families 1 has exploited insect-mediated pollination, which in turn is associated with an accelerated rate of speciation in angiosperms 14 .
Deciphering the genetic architecture of the Primula S locus as the first heterostyly supergene provides a blueprint for the comparative evolutionary genetic analysis of this key adaptation in other angiosperm families, as well as for the molecular characterization of other pollination syndromes underpinning both biodiversity and food security.

Methods
Plant material. The long homostyle plant (S LH1 ) used for DNA sequencing was a homozygote derived from a population originally described by Crosby in 1940 25 at Wyke Champflower, Somerset, UK, which had undergone several generations of selfing to generate a homozygous line which greatly facilitated assembly of the genome sequence. The independent long homostyle population in the Chiltern Hills discovered by Crosby in 1944 26 provided our second long homostyle (S LH2 ) from Hawridge, Buckinghamshire, UK. Pin and thrum P. vulgaris were grown from seed (http://www.wildseed.co.uk) as described previously 27 . Pin and thrum plants selected for genome sequencing were crossed to generate an F 1 population. The short homostyle was originally identified in a mapping population of P. vulgaris plants 35 P. veris for genome sequencing were grown from seed collected at the Durham University Mountjoy site. P. elatior leaf material was collected from Bull's Wood (http://www.suffolkwildlifetrust.org) with permission of Suffolk Wildlife Trust. P. farinosa were obtained from Kevock Garden Plants (http://www.kevockgarden.co. uk/). P. vialii and P. denticulata were from the laboratory collection. The population of P. veris used for S locus linkage analysis was sampled from Lolly Moor (http://www.norfolkwildlifetrust.org.uk) with the permission of the Norfolk Wildlife Trust. P. vulgaris used for S locus linkage analysis was sampled with permission of the Norfolk County Council from the B1135 roadside verge between Ketteringham and Browick near Wymondham, Norfolk. Plants from the three-point mapping cross have been described previously 35 .
Preparation of sequencing libraries. DNA and RNA preparation was as described previously 30,33 . All genomic DNA and RNA-Seq libraries for sequencing were prepared at the Genome Analysis Centre using standard Illumina protocols.  Genomic paired-end libraries. An Illumina TruSeq library was prepared using a protocol optimized for 1 µg of input genomic DNA (Illumina 15026486 Rev. C).
Mate-pair libraries. The protocol was optimized for 4-10 µg of high molecular weight DNA. Following fragmentation, samples were size fractionated to enable generation of mate-pair libraries of 5, 7 and 9 kb (Illumina 15035209 Rev. D).
RNA paired-end libraries. Libraries were constructed using the Illumina TruSeq RNA protocol (Illumina 15026495 Rev. B). The S locus region was assembled as outlined in Supplementary Methods and Supplementary Fig. 1. The assembly was validated by comparison of independent assemblies from different Illumina pairedend and mate-paired sequencing libraries of thrum, long and short homostyle individuals, gaps between contigs and regions of Ns were resolved by PCR amplification and Sanger sequencing of the products.
Genomic DNA PCR analysis. Genomic DNA was isolated as described previously 30 . Primers are shown in Supplementary