
Recombination Detection Program version 4 (RDP4) Manual

RDP4

Instruction Manual

Contents

1 INTRODUCTION

2 OPENING ALIGNMENTS AND OTHER FILES

3 ANALYSIS OPTIONS

3.1 General Settings

3.2 RDP Method Settings

3.3 GENECONV Settings

3.4 BOOTSCAN/RECSCAN Settings

3.5 MAXCHI Settings

3.6 CHIMAERA Settings

3.7 SISCAN Settings

3.8 LARD Settings

3.9 PHYLPRO Settings

3.10 DNA Distance Plot Settings

3.11 TOPAL Settings

3.12 VisRD Setting

3.13 Breakpoint Distribution Plot Settings

3.14 Recombination Rate Settings

3.15 Matrix Settings

3.16 Tree Settings

3.17 SCHEMA Settings

4 FINDING EVIDENCE OF RECOMBINATION

4.1 Automated Exploratory Recombination Analysis

4.2 Manual Query vs Reference Analyses

5 EXAMINING AUTOMATED ANALYSIS RESULTS

5.1 The Schematic Sequence Display

5.2 The Recombination Information Display

5.3 The Plot Display

5.4 The Sequence Display

5.5 The Tree Displays

5.6 The Matrix Display

6 SAVING RESULTS AND RECOMBINATION-FREE DATASETS

7 SUPPLEMENTARY ANALYSES

8 RECOMBINATION SIGNAL DETECTION METHODS

8.1 The RDP Method

8.2 GENECONV

8.3 BOOTSCAN/RECSCAN

8.4 MAXCHI

8.5 CHIMAERA

8.6 SISCAN

8.7 3SEQ

8.8 PHYLPRO

8.9 VISRD

8.10 LARD

8.12 DNA Distance Plots

8.13 TOPAL/DSS

8.14 BURT

9 SUPPLEMENTARY METHODS

9.1 Breakpoint Distribution Plots

9.2 Association Tests

9.3 Recombination Rate Plots (Using LDHat)

9.4 Matrices

9.5 SCHEMA Protein Folding Disruption Test

9.6 SCHEMA Nucleic Acid Folding Disruption Test

10 A STEP-BY-STEP GUIDE TO USING RDP4

10.1 Compiling a Good Dataset

10.2 Making a Good Alignment

10.3 Setting Up a Preliminary Scan for Recombination

10.4 Testing and Refining Preliminary Hypotheses

11 RUNNING RDP4 FROM A COMMAND LINE

12 POSSIBLE PROBLEMS WITH USING RDP4

12.1 Poor Alignments

12.2 Recombinants of Recombinants

12.3 Over-Grouping of Recombinants

12.4 Degeneracies

12.5 Software Crashes/File Incompatibilities

12.6 Crashes When Using Windows VISTA

12.7 Crashes When Pressing the “Options” Button

13 ACKNOWLEDGEMENTS

14 APPENDIX

15 REFERENCES

1 INTRODUCTION

RDP4 (Recombination Detection Program version 4) is a Windows XP/VISTA/7/8 program for detecting and analysing recombination and/or genomic reassortment signals in a set of aligned DNA sequences. While a number of other programs have been written to carry out the same task (see Martin et al., 2011, and the website https://www.wendangku.net/doc/5910908807.html,/recombination/programs.shtml), my motivation for writing RDP4 has been to produce an analysis tool that is both accessible to users who are uncomfortable with the use of UNIX/DOS command lines and permits a more interactive role in the analysis of recombination. I have particularly focused on making the program run with a minimum of fuss. This means that it should be usable with most multiple nucleotide sequence alignments (unfortunately RDP4 cannot align your sequences for you, although the programs IMPALE, MUSCLE and CLUSTALW that are distributed with the RDP4 download can be used for this purpose) and should be able to give a detailed and reasonably accurate breakdown of the recombination events that have occurred during the evolutionary histories of the sequences being analysed.

The main strength of RDP4 is that it simultaneously uses a range of different recombination detection methods to both detect and characterise the recombination events that are evident within a sequence alignment without any prior user indication of a non-recombinant set of reference sequences. Besides the original RDP method, it includes the BOOTSCANning method (Salminen et al., 1995; Martin et al., 2005b), the GENECONV method (Padidam et al., 1999), the Maximum Chi Square method (MAXCHI; Maynard Smith, 1992; Posada and Crandall, 2001), the CHIMAERA method (Posada and Crandall, 2001), the Sister Scanning method (SISCAN; Gibbs et al., 2000), the 3SEQ method (Boni et al., 2007), the VisRD method (Lemey et al., 2009) and the BURT method.

If you are impatient and want to start analysing your sequences without reading the manual it is strongly recommended that you go straight to the step-by-step guide in section 10. This guide will help you use the program in the way it was intended to work. Also, if you want to run the program under Windows VISTA you will need to give RDP4 administrator rights. Find out how to do this in section 12.5.

2 OPENING ALIGNMENTS AND OTHER FILES

A number of different alignment file formats are recognized by RDP4 including PHYLIP, GDE, FASTA, CLUSTAL, GCG, NEXUS, MEGA and DNAMAN. To open a file press the “Open” button (Fig 1 in the command button panel) and select the file to be opened. The directory from which files are loaded is “remembered” by RDP4 when it is shut down. Once loaded the aligned sequences and their names are displayed in the “sequence display panel” (Fig 1). Also displayed are the degrees of nucleotide identity in different regions of the aligned sequences in an “identity display panel” (Fig 1). When analysing datasets where sequences have been obtained either from different genomic components (in the case of viruses) or different genomic loci (in the case of bacteria), and these sequences have been concatenated for analysis, RDP4 can be made aware of the concatenation points by denoting them in an alignment using “!” symbols inserted at appropriate points within the first sequence of the alignment. When inserting these symbols make sure not to knock the first sequence out of alignment.

Besides alignment files, RDP4 project files (with a “.rdp” extension) may also be loaded. In addition to aligned sequences these files also contain information on possible recombination events detected in previous analysis sessions.

RDP4 can read ORF positions and names from either GenBank files or flat text “ORFMap” files. ORFMap files can be manually made in a text editor such as WordPad. The first line of an ORFMap file should have the text “[ORF]” and each subsequent line should have three comma separated values in the following order: , , . For GenBank and ORFMap files the program requires that the files be opened after an alignment file. In the case of GenBank files, one of the sequences within the multiple alignment must be the same as the sequence in the GenBank file. RDP4 will automatically scan the sequences in the alignment to check whether any match the sequence in the GenBank file. For ORFMap files the coordinates in the file must map either to the alignment or to one of the sequences within the alignment: the program will ask you how to interpret the coordinates and, if necessary, ask you to indicate the sequence to which the coordinates refer. If ORF information is supplied to RDP4 and breakpoint distribution analyses are performed, it will automatically test for variations in recombination breakpoint distributions relative to ORF boundaries as described in Lefeuvre et al. (2009). If you are unable to load a particular GenBank or ORFMap file successfully, send me the file together with your alignment and I’ll fix the problem for you.

RDP4 can also read protein structure information from .pdb files. If the genome regions being analysed encode proteins with associated structures any number of different .pdb files can be loaded. These .pdb files can include those containing multiple interacting proteins and RDP4 will automatically extract all information on the potential interactions of all amino acids encoded in the analysed alignments. Once .pdb files are loaded atomic coordinate positions can be used in protein SCHEMA analyses (See section 9.4; Voigt et al., 2002). Such analyses are described in Lefeuvre et al. (2007), and can be used to determine whether detectable recombination breakpoint distributions are influenced by natural selection acting against recombinants with disrupted intra- and/or inter-protein amino acid interactions (such as those that are respectively required for proper folding and optimal inter-protein binding).

3 SETTING ANALYSIS OPTIONS

Pressing the “Options” button in the command button panel will allow you to modify RDP4’s settings. For casual users of RDP4, the program’s default settings should work fine for most datasets. The only settings that you should ever need to change are italicised in blue below but should usually (unless you really know what you are doing) include only the (1) list of methods that should be used for automated recombination analyses (2) the window size settings of the various individual methods (3) the tree settings (where you can change substitution models and bootstrap replicates) and (4) the recombination rate settings. Unless you are particularly interested in exploring the influences of the various other settings it is OK to skip to section 4 of the manual.

3.1 General Settings

3.1.1 General recombination detection options. The various recombination detection methods can be set to perceive sequences as being either linear or circular. Note that even linear sequences can be analysed as though they are circular and this will in no way invalidate the analysis results unless an analysis of recombination breakpoint distributions is intended (see section 9.1). If linear sequences are analysed as though they are circular and some recombination is detected in an alignment, a strong recombination hotspot might be identified which spans the beginning and ending of the analysed sequences. While this will correctly indicate that the ends of recombinants tend to be inherited from different parental sequences, it should not be interpreted as the ends of the analysed sequences being genuine recombination hotspots. If recombination breakpoint distributions are of interest it would almost always be best to tell the program whether the sequences being analysed are linear or circular.

The highest acceptable P-value setting is the highest acceptable probability that sequences could share high identities in potentially recombinant regions due to chance alone (the calculation of P-values differs for the different methods and will be discussed in section 8). The optimal highest P-value setting varies depending on the number of sequences in the alignment being analysed, the recombination methods being used to examine the alignment, the size of the sliding windows that are used (for RDP, BOOTSCAN, MAXCHI, CHIMAERA and SISCAN), and on whether the multiple comparison correction setting is on or off.

The default setting for multiple comparison correction is “on” as this makes the calculated P-values “experiment-wide” (or global) rather than “currently selected sequence triplet/pair wide” (or local) estimates of probability. Note that there are two multiple comparison correction “on” settings. The default is “Bonferroni correction” but a modification of this called “step-down correction” is also offered. These corrections act as P-value modifiers that decrease the P-value cutoff according to the size of the dataset being examined. For a highest acceptable P-value setting of 0.05 with multiple comparison correction “off” you would expect that approximately 5% of P-values that are calculated would make the P-value cutoff by chance alone (i.e. without the need to invoke recombination). For a large dataset you would therefore expect many false positive results. For the same P-value cutoff but with multiple comparison correction set to “on” you would expect to only encounter one false positive in ~5% of the datasets that are examined. In most situations (<100 sequences with analysed sequences sharing >70% identity) a highest acceptable P-value setting of 0.05 when multiple comparison correction is on and 0.0001 when it is off should give few false positives but still enable the identification of most detectable recombination events. If the correction setting is off the P-value cutoff must be very carefully selected based on the number of false positives you are prepared to tolerate. When a large dataset containing sequences with low diversity (e.g. 100 sequences all sharing >95% identity) is analysed it may in fact be impossible to detect any of the recombination that is present if one of the multiple comparison correction settings is on. In these cases it may be best to analyse the dataset using the permutation tests offered (see section 3.1.2) with the multiple comparison correction setting off and a P-value cutoff of 0.001; this will give you some idea of the expected false positive rate for each identified recombination signal. Be warned, however, that the permutation test should be used with extreme caution.
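The effect of the two correction settings on a P-value cutoff can be illustrated with a short Python sketch. This is not RDP4's code: the manual does not say which step-down procedure RDP4 uses, so Holm's step-down method is assumed here, and the function names and example P-values are hypothetical.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-test P-value cutoff under a Bonferroni correction."""
    return alpha / n_tests


def holm_step_down(p_values, alpha=0.05):
    """Return a list of booleans, True where a test survives Holm's
    step-down correction (assumed stand-in for RDP4's step-down setting)."""
    m = len(p_values)
    # Visit the tests from the smallest P-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    accepted = [False] * m
    for rank, idx in enumerate(order):
        # The cutoff relaxes as more tests have already been accepted.
        if p_values[idx] <= alpha / (m - rank):
            accepted[idx] = True
        else:
            break  # every larger P-value is rejected as well
    return accepted


# Hypothetical uncorrected P-values for four candidate signals.
p_values = [0.00001, 0.012, 0.02, 0.3]
cutoff = bonferroni_cutoff(0.05, len(p_values))
bonferroni_hits = [p <= cutoff for p in p_values]
holm_hits = holm_step_down(p_values, 0.05)
```

With these made-up values the Bonferroni cutoff accepts the first two signals, while the step-down procedure also accepts the third, which shows why a step-down correction is slightly less conservative.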

3.1.2 Permutation options: Unless you really know what you are doing leave the “number of permutations” setting at 0. In almost all cases the analysis results you will get without running permutations will be more credible than those that you obtain if you use this permutation test. If this setting is set to anything other than 0, RDP4 will run its automated recombination detection analyses in permutation mode. This involves generating a group of simulated recombination-free datasets (the number that are simulated is specified by you in the space provided), which are then analysed by the program using the exact same settings that it uses to analyse a real dataset. There are several ways in which the results from such an analysis can be interpreted. Firstly, if RDP4 identifies more recombination events in the real dataset than it does in 95% or more of the simulated datasets then this is equivalent to a P-value <= 0.05 for the hypothesis that there is no recombination evident in the dataset, i.e. you can be more than 95% sure that there is some evidence of recombination in the dataset. This result does not, however, tell you which of the detected recombination events are actual recombination events; the result simply tells you that some of them are probably real. Secondly, if RDP4 detects a single recombination signal in the real dataset that has a better associated P-value than the best signals in 95% or more of the simulated datasets then this is the equivalent of saying that this signal has an associated P-value <= 0.05, i.e. that you can be 95% confident that the recombination event associated with this recombination signal is a real event and not a false positive.
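The first interpretation above amounts to an empirical P-value: the fraction of simulated datasets in which at least as many events are detected as in the real dataset. A minimal Python sketch follows; the function name and the event counts are hypothetical and are not taken from RDP4.

```python
def permutation_p_value(real_value, simulated_values):
    """Fraction of simulated datasets whose statistic matches or exceeds
    the statistic observed in the real dataset."""
    as_extreme = sum(1 for value in simulated_values if value >= real_value)
    return as_extreme / len(simulated_values)


# Hypothetical counts of detected events: one real dataset and
# twenty simulated recombination-free datasets.
real_events = 9
simulated_events = [0, 1, 0, 2, 1, 3, 0, 1, 2, 0,
                    1, 4, 0, 2, 1, 0, 3, 1, 10, 2]
p_value = permutation_p_value(real_events, simulated_events)
```

Here only one of the twenty simulated datasets yields nine or more events, so the real dataset exceeds 95% of the simulations. The same function can be applied to the second interpretation by comparing a suitably transformed best P-value (for example its negative logarithm) instead of an event count.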

RDP4 can use two different approaches to simulate the sequences used in the permutation test. The simplest involves shuffling alignment columns to destroy most of the recombination signals evident in an alignment. While this has the pleasing effect of maintaining most of the properties of the sequences in the alignment (such as their phylogenetic relatedness and nucleotide composition), it does not maintain in the shuffled alignments the same spatial distribution of variable sites found in the original alignment. Maintaining the distribution of polymorphic sites in an alignment can, however, be important when evolutionary rates vary widely in different regions of the sequences being analysed. This is important for two reasons. The first is that it is generally easier to detect recombination in parts of an alignment where there are many polymorphic sites than it is in parts of an alignment with few polymorphic sites. If the distribution of detectable recombination breakpoints along an alignment is significant then so too will be maintenance of the spatial distribution of polymorphic sites in the simulated alignments. The second reason that spatial distribution of sites is important is that in very diverse parts of an alignment sequences are often poorly aligned. All recombination detection methods in RDP4 are particularly sensitive to sequence misalignment and whereas false positive signals due to misalignment of highly diverged sequence tracts in the real alignment will be detected as recombination events with significant P-values, these false positive signals will likely be undetectable in the shuffled alignments.
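The column-shuffling approach can be sketched in a few lines of Python. This is an illustration of the idea only, assuming the alignment is held as a list of equal-length strings; the function name and the seed parameter are hypothetical.

```python
import random


def shuffle_columns(alignment, seed=None):
    """Return a copy of the alignment with its columns in random order.

    Every sequence receives the same column permutation, so each column
    (and therefore nucleotide composition and pairwise distances) is kept,
    while the positional order that carries recombination signal is lost.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")
    order = list(range(length))
    random.Random(seed).shuffle(order)
    return ["".join(seq[i] for i in order) for seq in alignment]


# Tiny made-up alignment for illustration.
alignment = ["ACGTACGTAC", "ACGTTCGTAC", "ACCTACGAAC"]
shuffled = shuffle_columns(alignment, seed=1)
```

Because whole columns are moved together, variable sites that were clustered in the original alignment end up scattered at random, which is the loss of spatial distribution described above.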

To solve this problem, the second (and default) method that RDP4 uses to simulate datasets employs the program SEQ-GEN to generate alignments with approximately the same spatial distribution of polymorphic sites as the real dataset (the “Use SEQGEN parametric simulations” setting). To obtain an appropriate spatial distribution of polymorphic sites in different parts of the alignment, different groups of columns in the alignment are separately simulated by SEQ-GEN where the input tree is scaled to reflect the degree of nucleotide diversity of the particular set of alignment columns being simulated.

Be very careful when using the permutation settings. Besides the program running very slowly, it may also crash unexpectedly. If you are sure that this kind of analysis is what you need and experience problems with it please e-mail me and I’ll do my best to help.

3.1.3 Data processing options: Once RDP4 has scanned an alignment and enumerated all detectable recombination signals, it begins the (often quite time consuming) task of trying to distil all the detectable recombination signals down to a minimal set of unique recombination events that could account for the signals. This process is necessary if you are hoping to make sense of the program’s results because a single actual recombination event will almost always be detectable using multiple combinations of sequences in an alignment.

Figure 1. The main components of the RDP4 interface. Once sequence files in any of a variety of formats are loaded with the “Open” button in the command button panel, pressing the “X-Over” button will begin a scan for recombination with whatever analysis options are currently set (these can be viewed by pressing the “Options” button). Various phylogenetic tree and matrix-based visualisations of the dataset can be accessed via the arrows beside the “Trees” and “Matrices” buttons. It is possible to swap between small tree, matrix and recombination information displays using the buttons above them.

The “require topological evidence” setting allows you to specify whether or not you want the program to discard recombination signals that have no phylogenetic support. While this might seem an obvious thing to do, you should realise that many of the recombination detection methods implemented in RDP4 are fully capable of detecting real recombination events that do not result in any detectable change in phylogenetic tree topologies along an alignment. The default setting is that topological evidence is required but this is simply because most users (for good or bad reasons) would find this setting most desirable.

During automated scans the different detection methods will identify regions of sequence that are recombinant. The boundaries of these regions, called breakpoints, will often be obviously suboptimal and selecting the “polish breakpoints” setting will prompt RDP4 to look for better breakpoints using the BURT method (see section 8.13) in the immediate vicinity of those identified. Even if this setting is used you should realise that the program will still potentially identify the wrong breakpoint position. Read section 10 on how to correct the obvious breakpoint detection errors that the program makes.

As mentioned earlier, misalignment of sequences is a major cause of false recombination signals. RDP4 is able to automatically assess whether the recombination signals it has detected are the product of misalignment. While it is possible to tell the program to not bother checking the consistency of alignments in the areas where it detects recombination signals (it makes the program a little faster), this is not advisable unless you are examining recombination in very good alignments with either no or very few inserted gap characters.

When it is trying to piece together a plausible set of recombination events that explain the recombination signals it has detected in an alignment, RDP4 can be told to disallow detection of recombination events in which one or both of the inferred parental sequences are themselves recombinant. This “disentangle recombination signals” setting should, however, only be used for datasets in which recombination is relatively rare. If it is used for complex datasets where most of the sequences are recombinant, it can cause the program to get stuck in a never-ending analysis loop whenever it cannot find a viable set of recombination events that does not involve recombination between recombinant sequences. You should also be aware that there is no natural law that prevents recombinant sequences from recombining with one another (i.e. the actual parental sequences of some recombinants might in fact also be recombinants).

When RDP4 attempts to determine whether similar recombination signals that are detected in two or more different sequences might mean that these sequences all descended from the same recombinant ancestral sequence it is possible to make the rigor with which RDP4 does this more or less conservative with the “Group recombinants realistically/conservatively” setting. The “realistic” version of this setting will ensure that groups of two or more sequences that are listed as having descended from the same recombinant ancestor could all plausibly cluster together within phylogenetic trees that are constructed from a portion of the analysed alignment that spans one or the other of the detected recombination breakpoints. The “conservative” version of this setting will identify sequences that have similar breakpoint patterns and similar degrees of genetic relatedness to the identified parental sequences, as having descended from the same recombinant ancestor even when there is no strong phylogenetic evidence that these sequences all share a more recent common ancestor with one another than they do with the remainder of sequences in the analysed dataset. The conservative setting is called conservative because it will result in fewer unique recombination events being identified than the realistic setting.

When more than one recombination signal detection method is used to scan an alignment, the “list events” setting can be altered so that RDP4 will only display evidence detected by greater than a certain number of methods. If, for example, six methods are used during the primary screen for recombination (the difference between a primary and a secondary screen is explained below) and the “list events detected by >2 methods” setting is used, RDP4 will only display recombination results that could be confirmed by between three and six different methods. If, after an analysis is completed, you would like to either relax this setting or make it stricter, you can do so and the list of detected events will then be instantly updated (i.e. unlike all the other settings described here, this setting can also be meaningfully changed even after the initial recombination screen is completed).

3.1.4 Analyse sequences using: RDP4 allows you to automatically analyse sequences for recombination using seven different recombination detection methods (see section 8 for a detailed description of the methods). These are the original RDP method, the BOOTSCAN/RECSCAN method (Salminen et al., 1995; Martin et al., 2005b), the method applied in the program GENECONV (Padidam et al., 1999; Sawyer, 1989), the MAXCHI method (Maynard Smith, 1992; Posada and Crandall, 2001), the CHIMAERA method (Posada and Crandall, 2001), the SISCAN method (Gibbs et al., 2000) and the 3SEQ method (Boni et al., 2007). It is possible to use the different methods either alone or in combination with one another. An indicator of the relative execution times of the different methods and an estimate of total execution time is given. Be warned that (1) estimates of relative and total execution times may be inaccurate and (2) the different methods may have vastly different speeds, so take note when you are told that the analysis you are proposing will take a number of days or weeks. Also notice that BOOTSCAN and SISCAN have two associated selection boxes. If the left boxes are selected the methods will be used to explore for new recombination signals. If the right boxes are selected the methods will only be used to examine sequences in which recombination signals are detectable by other “primary scanning” methods that have been selected. This “secondary” scanning mode is also available for the LARD method. The reason these methods may be selected so that they will only run in this secondary scanning mode is that they are a lot slower than the other automated recombination signal detection methods implemented in RDP4. When analysing large datasets, therefore, it will often be desirable to explore for recombination signals using the fast methods and then use the slower methods to verify these results. Note that regardless of whether the 3SEQ, RDP, GENECONV, MAXCHI or CHIMAERA methods are selected for primary scans, these methods are so quick that they will always all be used in secondary scans of recombination signals detected by other methods.

3.2 RDP Method Settings

3.2.1 Reference sequence selection. Reference sequences used for identifying phylogenetically informative sites during analyses can be selected in five different ways. The default setting is to “use no reference” which means that all sites will be examined irrespective of whether they are phylogenetically informative or not. Whereas I have found that this setting provides the greatest power for recombination detection, it does tend to identify some false positive signals if very divergent sequences are being examined (i.e. if there are sequences sharing <60% identity in the alignment). This is not a problem if only recombination signals detected by multiple methods are to be accepted as genuine evidence of recombination. If the RDP method is to be used alone for an analysis of medium-large datasets (>30 sequences) containing both closely related and highly diverged sequences, I have found that the “using internal references only” setting provides the best unambiguous estimates of recombination breakpoints. If small datasets are being examined (<30 sequences) the “use internal and external references” setting is recommended. For very small datasets (<5 sequences) the “use no reference sequences” setting is always recommended as long as all the sequences in the dataset are >70% identical. If you are examining a dataset containing a group of closely related sequences and you have access to a not too distantly related outlier sequence then the outlier can be used as the “user defined reference sequence.” This setting is, however, not recommended. Note that while the “use internal and external references” setting is meaningful for small datasets, as datasets become larger, the behaviour of an analysis with this setting will begin to approach that of the “use no references” setting. If accurate identification of breakpoints is desired it is not recommended that the “use external references” or “user defined reference” settings be used.

3.2.2 Recombination detection options. The window size used by the RDP method when scanning for evidence of recombination may be set. Note that the RDP method only examines polymorphic sites within triplets of sequences sampled from the alignment and the window size here refers to the number of these sites included in every window. Larger window sizes will increase signal:noise ratios but decrease the sensitivity of the analysis, whereas smaller window sizes will increase the sensitivity but also increase the possibility of false positives.

Because some of the reference sequence settings can lead to a higher than desirable false positive rate when divergent sequences are being analysed, there is also a setting that will restrict RDP analysis to sequences that share identities that fall within a given range. This is also useful if, for example, an analysis of inter-species recombination within a genus is desired. If it has been determined that members of a virus species share greater than 90% identity whereas members of a genus share greater than 80% identity, only inter-species recombination within a genus will be detected if the “only detect recombination” values are set to 80 and 90.
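The idea of restricting an analysis to sequences whose identities fall within a range can be sketched as follows. The manual does not describe how RDP4 computes identity or applies the range to sequence triplets, so this Python example makes its own assumptions: gapped columns are skipped, the range is inclusive, and it is applied to sequence pairs. All names and sequences are hypothetical.

```python
from itertools import combinations


def percent_identity(seq_a, seq_b):
    """Percentage of identical sites, skipping columns with a gap in either sequence."""
    compared = matches = 0
    for a, b in zip(seq_a, seq_b):
        if a in "-." or b in "-.":
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def pairs_in_identity_range(alignment, low, high):
    """List (name1, name2, identity) for pairs with low <= identity <= high."""
    selected = []
    for (name_a, seq_a), (name_b, seq_b) in combinations(alignment.items(), 2):
        identity = percent_identity(seq_a, seq_b)
        if low <= identity <= high:
            selected.append((name_a, name_b, identity))
    return selected


# Made-up sequences: A and B are 95% identical (same "species"),
# while C is 85% identical to A and 80% identical to B.
alignment = {
    "A": "ACGTACGTACGTACGTACGT",
    "B": "TCGTACGTACGTACGTACGT",
    "C": "ACGTACGTACGTACGTCAAT",
}
inter_species_pairs = pairs_in_identity_range(alignment, 80, 90)
```

With the 80 and 90 values from the example above, the A/B pair is excluded because its members are too similar, and only the pairs involving C remain.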

3.3 GENECONV Settings

For additional information on GENECONV settings please consult the GENECONV manual. It can be obtained online from: https://www.wendangku.net/doc/5910908807.html,/~sawyer/geneconv/

3.3.1 Sequence options. In RDP2 GENECONV could be set to screen sequences in an alignment in either pairs or triplets. In RDP4 only the triplet scan can be used for automated recombination signal detection with GENECONV and the “scan sequence pairs” setting can only be used during manual recombination detection. When the “scan sequence pairs” setting is used GENECONV will identify variable alignment positions as polymorphic sites and then check every possible sequence pair for evidence of recombination. If the “scan sequence triplets” setting is chosen the program will treat every possible sequence triplet in an alignment as independent alignments and screen them as it would if it were using the “scan sequence pairs” setting. Because there are many more possible sequence triplets in an alignment than there are sequence pairs, the triplet setting will have a more stringent multiple comparison correction than the pair setting. See section 8.2 for a detailed account of how screening triplets differs from screening pairs. I personally prefer the triplet setting as it yields results which are more consistent with the other automated recombination signal detection methods that are implemented in RDP4. This consistency greatly simplifies the task RDP4 faces when trying to reconcile all the recombination signals various methods have detected during its formulation of a feasible scenario of recombination events at the end of an automated analysis. Note, however, that the enforced triplet setting prevents the use of many standard GENECONV settings. The reason for this is that triplet scans are performed directly by RDP4, whereas RDP4 uses GENECONV.exe to do pairwise scans.
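The difference in stringency between the pair and triplet settings follows from simple counting. The Python sketch below (Python 3.8 or later for math.comb) counts the possible pairs and triplets and applies a plain Bonferroni division as an illustration; the exact number of tests that GENECONV or RDP4 corrects for is not given here, so treating each pair or triplet as one test is an assumption.

```python
from math import comb


def comparison_counts(n_sequences):
    """Number of distinct sequence pairs and triplets in an alignment."""
    return comb(n_sequences, 2), comb(n_sequences, 3)


# Example: an alignment of 30 sequences and a P-value setting of 0.05.
pairs, triplets = comparison_counts(30)
pair_cutoff = 0.05 / pairs        # illustrative Bonferroni cutoff for pair scans
triplet_cutoff = 0.05 / triplets  # illustrative Bonferroni cutoff for triplet scans
```

For 30 sequences there are 435 pairs but 4060 triplets, so under this simple model the per-comparison cutoff for triplet scans is roughly nine times smaller than for pair scans.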

The way in which gaps (or indels: “-” or “.” insertion symbols which are used to align sequences optimally) are handled can also be altered. A group of consecutive “-” insertions that correspond with nucleotides in another sequence can be treated as a single polymorphism, each individual insertion can be treated as an individual polymorphism, or gaps can simply be ignored. The best setting will depend on the alignment being analysed. If the sequences in the alignment have diverged somewhat and the alignment process has inserted a large number of gaps, it is probably best that each run of gaps be considered a single polymorphism. When gaps are ignored the program performs similarly to when runs of gaps are treated as a single polymorphism, except that occasionally the latter setting increases the number of polymorphisms. An increase in the number of polymorphisms may enable the identification of more difficult to detect recombinant regions. Stanley Sawyer (the author of GENECONV) recommends that the “treat each indel site as an individual polymorphism” setting never be used.

3.3.2 Fragment list options. The G-scale setting will influence how GENECONV handles nucleotide mismatches. Setting the G-scale to 0 will not allow mismatches within a fragment (see section 8.2 for information on what a fragment is). Setting the G-scale to 0 is a special case that sets an infinitely high mismatch penalty. Setting G-scale to 1, however, sets the lowest possible mismatch penalty. Increasing the G-scale above 1 increases the mismatch penalty; at very high values the mismatch penalty will approach that used when the G-scale is set to 0. There is no optimal G-scale setting and it should be adjusted according to the dataset being examined. For detecting recent recombination events a G-scale of 0 or a G-scale with a high value (5+) would probably be best. For detecting older recombination events a G-scale value of 1 or 2 would probably be best. I personally only ever use a G-scale of 1 (the default).

During its execution, GENECONV can be set to ignore potential recombinant regions that (1) are shorter than a certain length (the “Min. aligned fragment length” setting), (2) have fewer than a certain number of polymorphic sites (the “Min. polymorphisms” setting, which is useful for differentiating between sequence conservation and recombination), and (3) have pair-wise scores that are below a particular cutoff (the “Min. pairwise frag score” setting). The program can also be set to ignore fragments with higher P-values that overlap with fragments that have lower P-values. By changing the “Max. overlapping frags” setting to >0 the program will report a specified number of potential recombinant regions that overlap with regions that have smaller P-values.

3.4 BOOTSCAN/RECSCAN Settings

3.4.1 Scan options. The window and step sizes used during BOOTSCANning should be carefully selected based on the length of the sequences being analysed, their relatedness and the sizes of recombinant regions that are anticipated. Note that the duration of a BOOTSCAN is affected far more by step size and number of bootstrap replicates than it is by the window size. The step size used must be smaller than the window size and should ideally be set to less than 50% of the window size. Window sizes should be selected so that, on average, there will always be more than ~10 variable nucleotide positions within every window examined. Whereas larger window sizes will increase signal:noise ratios, you should understand that obvious recombinant regions that are only slightly smaller than the window size may not be detected.
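The window guideline above can be checked before a scan is started. The following is a minimal Python sketch, not RDP4 code: it assumes the alignment is held as a list of equal-length strings, treats any column containing more than one distinct character (gaps included) as variable, and uses hypothetical function names.

```python
def variable_columns(alignment):
    """Return indices of alignment columns where the sequences differ."""
    length = len(alignment[0])
    return [i for i in range(length)
            if len({seq[i] for seq in alignment}) > 1]


def variable_sites_per_window(alignment, window, step):
    """Count variable columns in each window of a sliding-window scan."""
    if step >= window:
        raise ValueError("step size must be smaller than the window size")
    variable = set(variable_columns(alignment))
    length = len(alignment[0])
    counts = []
    for start in range(0, max(length - window, 0) + 1, step):
        counts.append(sum(1 for i in range(start, start + window)
                          if i in variable))
    return counts


# Toy alignment (hypothetical data); real windows would be far larger.
alignment = [
    "ACGTACGTACGTACGTACGT",
    "ACGTTCGTACGAACGTACGT",
    "ACGTACGAACGTACGTTCGT",
]
counts = variable_sites_per_window(alignment, window=10, step=4)
print(counts, sum(counts) / len(counts))
```

If the average count reported for a real alignment is well below ~10, a larger window is probably needed.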

There are three different settings that determine how sequence relationships are measured during a BOOTSCAN. The “Use distances” setting will permit the quickest BOOTSCANs because, with it, pair-wise distance measurements without the construction of trees will be used to infer sequence relationships. The “Use UPGMA” and “Use NJ trees” settings determine relationships between sequences based on the positions of the sequences within trees. I would recommend that you use either the NJ tree or distance settings. Unless there are sequences in your alignment that are evolving at very different rates the distance method will give nearly identical results to the tree drawing methods and should always be tried first. Remember that the automated scan is just the first stage of the analysis and that once it is complete you will have the opportunity to scan any potential recombinants using more accurate (but slower) methods.

The number of bootstrap replicates that are used largely controls the significance of the recombination events that are detected using any particular percentage bootstrap cutoff (see below). It is strongly recommended that, for any dataset containing more than ~20 sequences of 2 kb or longer, the number of replicates be kept under 1000 and that the significance of results be controlled by increasing the percentage cutoff value. As a general rule 200 replicates with a 95% cutoff percentage seems to yield similar results to those obtained with the other methods when using a 0.05 P-value cutoff with multiple comparison correction on.

Using the same random number seed in two separate analyses will ensure that bootstrapped datasets remain the same for both analyses and that results are repeatable.
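The effect of a fixed seed can be illustrated with a short Python sketch. This is not how RDP4 generates its bootstrap replicates internally (its random number generator is not described here); the function name and the toy alignment are hypothetical, and the sketch only shows that resampling alignment columns with the same seed yields the same replicate.

```python
import random


def bootstrap_alignment(alignment, seed):
    """Resample alignment columns with replacement using a seeded generator."""
    rng = random.Random(seed)  # local generator, global state is untouched
    length = len(alignment[0])
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[i] for i in columns) for seq in alignment]


alignment = ["ACGTACGTAC", "ACGTTCGTAC", "ACGAACGTAC"]
first = bootstrap_alignment(alignment, seed=3)
second = bootstrap_alignment(alignment, seed=3)
print(first == second)  # same seed, same bootstrap replicate
```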

The cut-off percentage refers to the percentage bootstrap support that is required before any altered relationships between three sequences within an alignment are interpreted as evidence of potential recombination. Setting this value higher (it could be set as high as 100%) will increase the probability that any regions detected are recombinant. This value is only meaningful in the context of the number of bootstrap replicates selected. It should be noted that a value of 95% does not equate with a P-value cutoff of 5% (i.e. 0.05). The value (together with the number of bootstrap replicates) is simply proportional to the confidence that you have in the recombinant regions that the program detects – i.e. you could have more confidence in the recombinant nature of regions detected using 1000 replicates and a 100% cutoff percentage than regions detected with 50 replicates and a 70% cutoff percentage.

While it is possible to simply use bootstrap values as P-values during a scan (with any region exceeding the bootstrap cut-off being reported as possibly recombinant), it is strongly recommended that either the “calculate binomial P-value” or “calculate Chi Square P-value” settings be used. If either of these settings is selected a statistical test will be used to determine the probability that regions exceeding the bootstrap cut-off are recombinant. Using simulations I have found that the “calculate binomial P-value” setting is by far the most powerful and this is the setting I strongly recommend you use.

3.4.2 Model options. Four different nucleotide substitution models may be used when calculating distance matrices from bootstrap replicated alignments. With all the models other than the Jukes-Cantor, 1969 model it is possible to score transitions and transversions differently during pair-wise distance calculations. The Jukes-Cantor model is identical to the Kimura, 1980 model with a transition:transversion ratio set to 0.5. The Kimura model is in turn identical to the Felsenstein, 1984 model when equilibrium frequencies of all four bases are equal. The Felsenstein, 1984 model allows for differences in equilibrium base frequencies that may be either supplied by you or inferred from the alignment. The Jin-Nei, 1990 model is similar to the Kimura model except that it assumes that different rates of substitution occur at different sites. The Jin-Nei model determines site-specific substitution rates from a gamma distribution, the shape of which is determined by the coefficient of variation. Low values mean sites are expected to evolve at similar rates and high values mean rates are expected to vary more widely. RDP4 utilises code from the PHYLIP component DNADIST to calculate distances and additional information on this program can be obtained online from:

https://www.wendangku.net/doc/5910908807.html,/phylip/doc/dnadist.html
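The relationship between the Jukes-Cantor and Kimura models can be illustrated numerically. The Python sketch below uses the textbook closed-form distance estimators rather than the DNADIST code that RDP4 actually calls, so it is an illustration under that assumption only. Under the Jukes-Cantor model one of the three possible changes at a site is a transition, so a proportion p of differing sites is expected to split into p/3 transitions and 2p/3 transversions (a 0.5 transition:transversion ratio), and with that split the two formulas give the same distance.

```python
import math


def jukes_cantor_distance(p):
    """Jukes-Cantor (1969) distance from the proportion of differing sites p."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def kimura_distance(transitions, transversions):
    """Kimura (1980) distance from the proportions of sites showing
    transitional and transversional differences."""
    P, Q = transitions, transversions
    return -0.5 * math.log((1.0 - 2.0 * P - Q) * math.sqrt(1.0 - 2.0 * Q))


p = 0.15  # hypothetical proportion of differing sites
print(jukes_cantor_distance(p))
print(kimura_distance(p / 3.0, 2.0 * p / 3.0))
```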

3.5 MAXCHI Settings

3.5.1 Scan options. Whereas in RDP2 it was possible to use the MAXCHI method to automatically screen an alignment either three sequences at a time or two sequences at a time, in RDP4 only triplet scans can be performed during automated recombination detection. Doublet scans are, however, still possible when using MAXCHI to manually screen sequences for evidence of recombination. The major difference between the triplet and doublet scans is that the doublet scans do not allow proper identification of parental and recombinant sequences.

As with other scanning window settings the optimal window size that should be selected for a MAXCHI analysis will depend on the sequences being analysed and the size of recombinant regions that must be detected. As is the case with the original RDP, CHIMAERA, GENECONV and 3SEQ methods, MAXCHI only examines variable nucleotide positions – i.e. the window size refers to the number of variable sites and not the number of nucleotide positions. The optimal window size for detecting recombinant regions with 20 variable nucleotide sites will be 40, i.e. twice the number of variable sites in the region. The reason for this is that the MAXCHI scanning window is split into two with the halves being compared to one another (see section 8.4 for details on the MAXCHI method).
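The split-window comparison can be sketched in a few lines of Python. This is an illustrative simplification and not the RDP4 implementation: it assumes two sequences have already been reduced to a 0/1 vector over variable sites (1 where they agree, 0 where they differ), and it uses a plain Pearson chi-square on the 2 x 2 table of agreements and disagreements in the two window halves. The function names and data are hypothetical.

```python
def window_chi_square(left_matches, left_size, right_matches, right_size):
    """Pearson chi-square (1 d.f., no continuity correction) for a 2 x 2 table
    of matching/mismatching variable sites in the two halves of a window."""
    a, b = left_matches, left_size - left_matches
    c, d = right_matches, right_size - right_matches
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # no contrast between the halves (e.g. all sites match)
    return n * (a * d - b * c) ** 2 / denominator


def scan_pair(matches, window):
    """Slide a window over a 0/1 match vector (variable sites only) and
    return the chi-square value at each central partition."""
    half = window // 2
    values = []
    for centre in range(half, len(matches) - half + 1):
        left = matches[centre - half:centre]
        right = matches[centre:centre + half]
        values.append(window_chi_square(sum(left), half, sum(right), half))
    return values


# A region of 20 mismatching variable sites flanked by matching sites.
matches = [1] * 30 + [0] * 20 + [1] * 30
values = scan_pair(matches, window=40)
peak = max(values)
print(peak, values.index(peak) + 20)  # peak value and partition position
```

With a window of 40 one half of the window can sit entirely inside the 20-site region while the other half sits entirely outside it, which is why the chi-square value peaks at the region boundary.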

Because the χ² statistic is only calculated within individual windows, a situation can arise where it is impossible to achieve a significant χ² P-value even with a fairly lax P-value cut-off. For example, with a window size of 20 it is impossible to achieve a P-value lower than ~1 × 10⁻⁵. This isn’t too much of a problem if the multiple comparison correction setting is set to off (a setting that is not recommended). However, with an alignment containing 20 sequences, multiple comparison correction on, a window size of 20 and a highest acceptable P-value cutoff of 0.01 it will be impossible to achieve a P-value below the cutoff (i.e. no recombination will be detected). Always remember this when selecting the window size.
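The lower bound on the P-value can be approximated with a short Python sketch. It rests on an assumption that is not taken from the RDP4 code: for a 2 x 2 table without continuity correction, the largest possible chi-square value equals the total number of sites in the window, and the P-value is the upper tail of a chi-square distribution with one degree of freedom. The function name is hypothetical.

```python
import math


def smallest_possible_p_value(window_size):
    """Smallest chi-square P-value (1 d.f.) reachable with a given window size.

    Assumes a 2 x 2 table without continuity correction, for which the
    largest possible chi-square value equals the number of sites in the window.
    """
    max_chi_square = float(window_size)
    # Upper tail of a chi-square distribution with one degree of freedom.
    return math.erfc(math.sqrt(max_chi_square / 2.0))


for size in (20, 40, 60):
    print(size, smallest_possible_p_value(size))
```

For a window size of 20 this gives a value a little below 1 × 10⁻⁵, in line with the approximate figure quoted above, and the bound drops quickly as the window grows.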

Variable or set window sizes can also be used. Changing this setting to “variable” lets you specify which proportion of variable sites should be included in a window. If variable window sizes are used, windows will get larger for sequence triplets containing quite diverged sequences and smaller for triplets containing more closely related sequences. Note that if a sequence triplet has fewer variable sites than 1.5 times the specified window size, the window size will automatically be set to 0.75 times the number of variable sites. If the window size thus derived is smaller than 10, then the sequence triplet in question will not be examined.
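The automatic adjustment described above can be written out as a small Python sketch. The function name is hypothetical, and rounding the derived window size down to a whole number is an assumption; the manual does not say how RDP4 rounds.

```python
def effective_window_size(variable_sites, window_size, minimum=10):
    """Apply the window adjustment described above to one sequence triplet.

    Returns the window size to use, or None when the triplet is skipped.
    Rounding down with int() is an assumption.
    """
    if variable_sites < 1.5 * window_size:
        derived = int(0.75 * variable_sites)
        return derived if derived >= minimum else None
    return window_size


for sites in (200, 50, 12):
    print(sites, effective_window_size(sites, window_size=40))
```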

It is always advisable to use the “strip gaps” setting for MAXCHI. If the “use gaps” setting is selected you should realise that each individual gap character (“-” or “.”) will be treated as a fifth nucleotide. This may cause problems if, for example, one of the sequences in a triplet has a run of gaps in a particular region because the other two sequences in the triplet will appear much more similar to one another in that region than they should and recombination will be inferred.

3.6 CHIMAERA Settings

3.6.1 Scan options. As with other scanning window settings the optimal window size that should be selected for a CHIMAERA analysis will depend on the sequences being analysed and the size of recombinant regions that must be detected. As is the case with the original RDP, GENECONV, 3SEQ and MAXCHI methods, CHIMAERA only examines variable nucleotide positions – i.e. the window size refers to the number of variable sites and not the number of nucleotide positions. The optimal window size for detecting recombinant regions with 20 variable nucleotide sites will be 40. The reason for this is that, as with the MAXCHI method, the CHIMAERA scanning window is split into two with the halves being compared to one another (see section 8.5 for details on the CHIMAERA method).

For information on setting window sizes refer to the previous section on appropriate window sizes for the MAXCHI method.

As with the MAXCHI method a variable window size setting may also be used with the CHIMAERA method, which allows you to specify the proportion of variable sites that should be included in a window. If variable window sizes are used, windows will get larger for sequence triplets containing quite diverged sequences and smaller for triplets containing more closely related sequences. Note that if a sequence triplet has fewer variable sites than 1.5 times the specified window size, the window size will automatically be set to 0.75 times the number of variable sites. If the window size thus derived is smaller than 10, the sequence triplet in question will not be examined.

3.7 SISCAN Settings

3.7.1 Scan options. The window and step sizes used during a SISCAN should be carefully selected based on the length of the sequences being analysed, their relatedness and the sizes of recombinant regions that are anticipated. The step size used must be smaller than the window size and should ideally be set to less than 50% of the window size. Window sizes should be selected so that, on average, there are more than ~10 variable nucleotide positions within every window examined. Whereas larger window sizes will increase signal:noise ratios, you should understand that obvious recombinant regions that are only slightly smaller than the window size may not be detected.

It is strongly recommended that the “strip gaps” setting be used. If gaps are used, each individual gap character (“-” or “.”) will be treated as a fifth nucleotide.

It is also strongly recommended that the “use 1/2/3 variable positions” setting be used. This setting will focus the analysis on sites that differ between the sequences in a triplet. Whereas the “use 1/2/3/4 variable positions” setting will focus the analysis on sites that vary between the sequences in a triplet and/or the sequences in a triplet and an outlyer sequence (see 3.7.2 for information on outlyer sequences), the “use all positions” setting will examine all sites both variable and constant. The “use 1/2/3 variable positions” setting is recommended because the other settings tend to “dilute” recombination signals by including a lot of irrelevant sites in the analysis.

3.7.2 Fourth sequence selection. During a “SISCAN” sequence triplets are examined together with a fourth outlyer sequence (see section 8.6 for details of the SISCAN method). The outlyer can either be another sequence in the alignment or a randomised sequence constructed from the sequences in the triplet. With the “use nearest outlyer” setting, for every sequence triplet examined, RDP4 will scan an alignment for an outlyer sequence that most closely resembles the three sequences in the triplet. With the “use most divergent sequence” setting, RDP4 will always use the most divergent sequence in the alignment as an outlyer. The “use randomised sequence” setting will, for every window analysed in every sequence triplet, require construction of a new randomised sequence. It is recommended that the “use nearest outlyer” setting be used because this is both the quickest setting and, unlike the other settings, it yields results that are usually well supported by other recombination signal detection methods.

3.7.3 Permutation options. When determining the significance of potential recombination signals SISCAN uses a permutation test (for details of the calculation of P-values see section 8.6). Because the test can be quite time consuming RDP4 can be set to use fewer permutations during an exploratory scanning phase (the scan permutation number) and, when a possible recombination signal is detected, use more permutations to accurately determine P-values for likely recombinant regions (the P-value permutation number).

Because SISCANning involves the generation of randomised sequences (see section 8.6 for details) there is the option to provide a random number seed. Using the same random number seed in repeated analyses will ensure that SISCAN results are reproducible.

If the “do fast scan” setting is used RDP4 will only use permutation tests to analyse windows in which the pair-wise relationships between the sequences in a triplet differ relative to the relationships of the sequences over their entire lengths (these are the only windows within which a recombination signal is likely to be found). The “do exhaustive scan” setting will perform permutation tests on every window, regardless of how unlikely it is that a recombination signal will be detected in windows where sequence relationships are the same as they are over the entire length of the sequences.

3.8 LARD Settings

For additional information on LARD settings please consult the LARD manual. It can be downloaded from:

https://www.wendangku.net/doc/5910908807.html,/software/Lard/main.html

3.8.1 Model options. LARD offers the option of using three different nucleotide substitution models for the maximum likelihood reconstruction of three-sequence phylogenies. (1) The Hasegawa, Kishino and Yano, 1985 (HKY) model allows different transition and transversion rates and unequal nucleotide frequencies. The Kimura, 1980 and Jukes-Cantor, 1969 models are specific cases of this model. (2) The Felsenstein, 1984 model is similar to the HKY model but allows nucleotide frequencies to be estimated from the alignment and handles transition/transversion rates differently. (3) The reversible process model allows different rates for all six different types of substitution and assumes, for example, that the frequency of T to C substitutions will be the same as the frequency of C to T substitutions.

Besides the different nucleotide substitution models, LARD also offers the option of using two different models that allow for site-specific variation in substitution rates. (1) A codon-based model allows different substitution rates at each codon position (this is obviously only applicable to coding regions). In general the last codon position should have the highest substitution rate, the middle position the lowest rate and the first position an intermediate rate. (2) A model that assigns different substitution rates to sites based on a gamma distribution. Whereas the gamma distribution is scaled so that the average rate is equal to 1, it is possible to specify the shape of the distribution using the “gamma shape for site rate heterogeneity” setting. A low value (<1) will mean that sites vary greatly in their evolution rate whereas higher numbers for this setting will specify that sites evolve at more similar rates. Setting “# categs for gamma rate heterogeneity” to 0 will give all sites the same substitution rate. Setting this number to a positive integer (N) will assign each site to each of the N specified substitution rate categories with a different probability.

3.8.2 Scan options. LARD examines three aligned sequences at a time. It can be set to scan sequences in three different ways. The first and quickest way involves moving a partition along the alignment and determining the likelihood that trees constructed from sequences on either side of the partition have the same branch lengths (the “test one breakpoint” setting; for a detailed description of what LARD does see section 8.10). The second way is to move a window along the alignment with a partition in the centre (this is similar to that used for the MAXCHI and CHIMAERA methods; the “sliding windows scan”). The third, and by far the slowest, way to scan the alignment is to search for two optimal breakpoint partitions (the “test two breakpoints” setting). This could involve evaluating every possible pair of partitions of the alignment.

The “step size” setting will specify how many nucleotides along the alignment the partition(s)/window will move at each step of the analysis. While setting the step size to 1 will ensure the highest possible scan resolution, the scan will most likely be quite slow. Increasing the step size will speed up the analysis but decreases the scan resolution. A step size of 10 nucleotides should be a good compromise.
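The trade-off between step size and run time can be made concrete by counting how many partitions would have to be evaluated. The Python sketch below is only a rough illustration: it assumes partitions are placed at every multiple of the step size and that a two-breakpoint search considers every pair of those positions, which may overstate what LARD actually evaluates. The function names and the alignment length are hypothetical.

```python
def partition_positions(alignment_length, step):
    """Candidate partition positions, excluding the alignment ends."""
    return list(range(step, alignment_length, step))


def evaluations(alignment_length, step):
    """Number of partitions (one breakpoint) and of partition pairs
    (two breakpoints) for a given step size."""
    n = len(partition_positions(alignment_length, step))
    return n, n * (n - 1) // 2


for step in (1, 10, 50):
    print(step, evaluations(3000, step))
```

Under these assumptions, going from a step size of 1 to a step size of 10 on a 3000-nucleotide alignment reduces the number of partition pairs roughly a hundredfold.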

If a sliding window scan is chosen, you can specify the window size that is used – remember though that the window has a partition in the centre so that a window size of 400 indicates that the 200 nucleotides on the left of the window get compared with the 200 on the right. The LARD method examines both conserved and variable alignment positions and the window size setting should be large enough that every window examined contains at least 20 variable nucleotide sites.

3.9 PHYLPRO Settings

For additional information on PHYLPRO settings and how PHYLPRO works please consult either section 8.8 or the PHYLPRO manual. It can be downloaded from:

https://www.wendangku.net/doc/5910908807.html,.au/ResearchGroups/GIG/Products/phylpro/

3.9.1 Scan options. PHYLPRO is another recombination detection method (like the LARD, BOOTSCAN and SISCAN methods) that examines both variable and conserved alignment positions. The window size setting should be large enough that all examined windows contain 20 or more variable alignment columns. As with the LARD method this number is twice that recommended for the BOOTSCAN and SISCAN methods because the PHYLPRO method involves moving a window with a partition in its centre along the length of an alignment with each half of the window being compared to the other. See section 3.4.1 of this manual for advice on selecting window sizes.

During pair-wise distance calculations (see section 8.8) the PHYLPRO method can be set to handle gaps in two different ways: alignment positions with any gap characters can either be completely ignored (the “strip gaps” setting), or these positions can be considered as long as both of the sequences compared have a nucleotide in the relevant position (the “ignore gaps” setting).

When calculating correlation coefficients for sets of pair-wise distances on either side of the moving window (see section 8.8) the PHYLPRO method can be set to either use or not use the zero distance values obtained when sequences are compared with themselves. The permutation test is not currently implemented and the permutation options will have no influence on the analysis results.

3.10 DNA Distance Plot Settings

3.10.1 Scan options. The window and step sizes used during the construction of distance plots may be set. You should set window sizes based on the relatedness of parents that are being examined. Ideally each window in the scan should contain at least 5 variable positions. The optimal step size is also dependent on the relatedness of the sequences being examined and should be smaller than 20% of the window size.

3.10.2 Model options. RDP4 uses code from the PHYLIP component, DNADIST, to construct distance plots and the model options on offer are those available in that program. For additional information on the DNA distance models used by DNADIST please consult the online manual at:

https://www.wendangku.net/doc/5910908807.html,/phylip/doc/dnadist.html

Consult section 3.4.2 of this manual for a brief description of the model options.

3.11 TOPAL Settings

For additional information on TOPAL settings please consult the TOPAL manual. It can be obtained online from:

https://www.wendangku.net/doc/5910908807.html,/~frank/Genetics/manual.html

3.11.1 Scan Options. As with the PHYLPRO, BOOTSCAN and SISCAN methods (see sections 3.9, 3.4 and 3.7 respectively) the optimal window and step sizes used during a TOPAL scan are dependent on the relatedness of the sequences being examined. Note, however, that the TOPAL method is similar to the PHYLPRO method in that the windows examined are split in two and have an optimal size that is twice that of the BOOTSCAN and SISCAN methods. You should attempt to set the window size so that each window will cover more than ~20 variable nucleotide positions. See section 3.4.1 of this manual for advice on selecting window sizes.

When drawing a difference in sum of squares (DSS) plot you can opt to smooth it by averaging DSS values over a “smoothing window” that is moved across the plot one DSS value at a time.
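The smoothing step amounts to a moving average. The Python sketch below is a generic illustration rather than the RDP4 routine: the function name and the DSS values are hypothetical, and it simply drops the positions at the ends of the plot where a full window is not available, which may differ from how RDP4 treats the plot ends.

```python
def smooth(values, window):
    """Average values over a window moved along the list one value at a time."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    smoothed = [running / window]
    for i in range(window, len(values)):
        # Slide the window: add the new value, drop the oldest one.
        running += values[i] - values[i - window]
        smoothed.append(running / window)
    return smoothed


dss = [0.0, 0.25, 0.5, 2.0, 0.5, 0.25, 0.0]  # hypothetical DSS values
print(smooth(dss, window=3))
```

Larger smoothing windows flatten isolated spikes but also lower and broaden genuine peaks.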

3.11.2 Tree options. During a TOPAL scan RDP4 uses the PHYLIP components NEIGHBOR and FITCH to calculate neighbour joining (NJ) and least squares (LS) trees, respectively. Although the “calculate NJ and LS trees” setting is substantially faster than the “use only LS trees” setting, according to the people who developed the method, it should only be used during manual TOPAL analyses of >20 sequences. I’m not sure if I agree with this though as both settings seem pretty similar in practice – except of course that the one is much quicker than the other.

The “Power” setting will influence the magnitude of the DSS values that are calculated – if DSS values are very small (e.g. 0.002) increasing the Power setting will increase them to values that are easier to compare.

A random number seed can be provided. It is used both when generating simulated sequences and when randomising the input order of sequences in FITCH and NEIGHBOR. Using the same seed will result in identical DSS plots in repeated analyses.

3.11.3 Parametric bootstrapping options. If the number of permutations is set to a number greater than 10, RDP4 will perform a permutation test called a parametric bootstrap to determine the significance of any detected DSS peaks. The parametric bootstrap alignments are simulated using SEQ-GEN (Rambaut and Grassly, 1997) and the DSS plots generated from these alignments are presented together with plots from the real data for comparison purposes.

3.11.4 Model options. RDP4 uses the PHYLIP component DNADIST to construct distance matrices and the model options on offer are those available in that program. For additional information on the DNA distance models used by DNADIST please consult the online manual at:

https://www.wendangku.net/doc/5910908807.html,/phylip/doc/dnadist.html

Consult section 3.4.2 of this manual for a brief description of the model options.

3.12 VisRD Setting

VisRD (like the PHYLPRO, LARD, BOOTSCAN and SISCAN methods) is a recombination analysis method that examines both variable and conserved alignment positions. The scanning window size is the only setting that can be changed and should be made large enough that all examined windows contain 10 or more variable alignment columns. See section 3.4.1 of this manual for advice on selecting window sizes.

3.13 Breakpoint Distribution Plot Settings

Breakpoint distribution plots are a useful way of analysing alignments for evidence of recombination hot and cold spots (see section 9.1; Heath et al., 2006). The test used to detect breakpoint hot and cold spots is based on permutations. The number of permutations used in this test can be specified. The number should be 100 or greater. The size of breakpoint clusters that you wish to examine can be specified with the “window size” setting. Note that small window sizes (≤50 nt) are useful for detecting unusually tight clusters of breakpoints (i.e. highly focused recombination hotspots) but are not very good for detecting either recombination cold spots or dispersed recombination hotspots. Window sizes between 100 and 200 nt are generally a good compromise between detecting hot and cold spots but might miss evidence of unusually tight clusters of breakpoints within regions smaller than the specified window size. It is therefore advisable to try a range of window size settings.
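The influence of window size on what kind of breakpoint cluster stands out can be seen with a simple count. The Python sketch below only counts breakpoints per window; it does not implement the permutation test used by RDP4, and the function name, breakpoint positions and alignment length are hypothetical.

```python
def breakpoints_per_window(breakpoints, alignment_length, window):
    """Count breakpoints falling in a window centred on each alignment position."""
    half = window // 2
    counts = []
    for centre in range(alignment_length):
        lo, hi = centre - half, centre + half
        counts.append(sum(1 for b in breakpoints if lo <= b <= hi))
    return counts


# One tight cluster (100-110) and one dispersed cluster (400-560).
breakpoints = [100, 105, 110, 400, 480, 560]  # hypothetical positions
for window in (20, 200):
    counts = breakpoints_per_window(breakpoints, 1000, window)
    print(window, max(counts), counts[105], counts[480])
```

In this toy example the 20 nt window picks out only the tight cluster, whereas the 200 nt window gives the tight and the dispersed cluster the same count.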

3.14 Recombination Rate Settings

RDP4 uses the programs CONVERT and INTERVAL from the LDHAT package (McVean et al., 2002; McVean et al., 2004) to construct plots of varying recombination rates across sequences. For additional information on the settings used by these programs consult the LDHAT manual at:

https://www.wendangku.net/doc/5910908807.html,/~mcvean/LDhat/instructions.html

The INTERVAL program that RDP4 uses to draw recombination rate plots estimates variations in recombination rates along an alignment using a penalised approximate likelihood approach within a Bayesian reversible-jump Markov chain Monte Carlo (RJMCMC) scheme. INTERVAL requires an initial estimate of the alignment-wide population scaled recombination rate (rho) as a starting point. The “starting rho” value should be a number between 0 and 100 that should ideally be an actual estimate of the alignment-wide population scaled recombination rate. An estimate of this can be obtained by firstly drawing a plot with an arbitrary starting rho value (say 10) which, apart from giving you a plot of recombination rates along your alignment, will also give you an estimate of the alignment-wide population scaled recombination rate. This value, displayed in the recombination information panel, can then be used as a better starting value when you redo the plot.

INTERVAL allows you to specify a “block penalty” to prevent the RJMCMC invoking the existence of too many changes in recombination rate across a region of sequence – i.e. you can set the block penalty to prevent INTERVAL from over-fitting a complex variable recombination rate model to the data. I cannot give any really good advice on what constitutes an appropriate penalty other than that you should try constructing plots with a range of penalties between 0 and 50. Lower penalties will enable the analysis to detect smaller, more subtle variations in recombination rates but could also result in over-fitting of the inferred changes to the data. Conversely, higher block penalties will sacrifice sensitivity in return for greater confidence in the recombination rate changes that are detected. Gil McVean advises the use of simulations with sequences resembling those you are analysing to determine the most appropriate block penalty. As this approach will probably be beyond most RDP4 users, I’d recommend that you settle on a penalty somewhere in the range 5-30 and don’t over-interpret the peaks and valleys in the plots that you get.

The “minor allele frequency cutoff” setting determines which variable alignment columns INTERVAL will examine. Having a cutoff that excludes rare polymorphisms focuses the analysis on the most reliable and least noisy evidence of recombination – i.e. recombination events that have left a mark on the distributions of the older, most phylogenetically informative nucleotide polymorphisms in the dataset. It is strongly advisable that a cutoff is chosen which excludes alignment columns that contain a single sequence with a site that is different from all the rest in the alignment. I recommend that the cut-off is chosen so that only sites carried by three or more sequences are included in the analysis. The value of this setting will therefore need to be changed with every analysis you do. For example, with an alignment containing 100 sequences, a minor allele frequency cut-off of 0.05 will exclude all variable alignment positions where fewer than six sequences share one of the two alternative nucleotides at that position.
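Because the appropriate cutoff depends on the number of sequences, it can help to work it out explicitly. The Python sketch below is an illustration only: the function names are hypothetical, and it assumes (from the 100-sequence example above, not from the INTERVAL code) that a site is kept when its minor allele frequency is strictly greater than the cutoff.

```python
def cutoff_for_minimum_carriers(n_sequences, min_carriers):
    """Cutoff that keeps only sites whose minor variant is carried by at least
    min_carriers sequences, assuming a site is kept when its minor allele
    frequency is strictly greater than the cutoff."""
    return (min_carriers - 1) / n_sequences


def site_is_kept(minor_count, n_sequences, cutoff):
    """True when the minor allele frequency exceeds the cutoff."""
    return minor_count / n_sequences > cutoff


print(cutoff_for_minimum_carriers(100, 6))  # the 100-sequence example above
print(cutoff_for_minimum_carriers(40, 3))   # three or more carriers, 40 sequences
```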

The “gap frequency cutoff” can be used to exclude from an analysis any alignment columns with more than a certain amount of missing data.

The number of MCMC updates performed during the analysis can be set. The first 10% of updates will always be discarded as burn-in and the number of updates must always be greater than 10⁵. It is strongly recommended that you never use fewer than 10⁶ updates.

3.15 Matrix Settings

RDP4 draws several different types of matrices. Many of the different matrices share settings such as their colour scales, permutation numbers and window sizes. Although it is not a matrix, the recombination breakpoint plot (see section 3.13) shares various matrix settings (window size, permutation number and type sequence).

Note that the Rmin(HK), Rmin(HK)/D and LD matrices that RDP4 presents are drawn by the program PAIRWISE (a component of the LDHAT package) using minor allele frequency, gap frequency, gene conversion model and average tract length settings that are specified in the recombination rate options section (see 3.14). See sections 9.3.6 – 9.3.8 for what is being plotted in these matrices.

3.15.1 Ingrid Jakobsen (IJ) compatibility matrix. The IJ compatibility matrix in RDP4 is only a partial implementation of that implemented in the program Reticulate, in that a statistical test using Ingrid Jakobsen’s neighbour similarity scores is not available in RDP4 (it is, however, implemented in RDP2, which is available from the RDP web-page). See section 9.3.1 for details of what is being plotted in a compatibility matrix. For additional information on compatibility matrices and the program Reticulate please consult the manual:

https://www.wendangku.net/doc/5910908807.html,.au/dmm/humgen/ingrid/ftp/reticulate/instructions

3.15.2 Robinson-Foulds (RF) compatibility matrix. The “window size” setting refers to the number of nucleotides that are used to construct the various phylogenetic trees that are to be compared with one another and the “step size” refers to the number of nucleotides that are skipped between consecutive windows. As with the SH compatibility matrix, if the step size is set to larger than half the window size, the window size will be automatically adjusted so that it is twice the step size. While decreasing the step size will increase the resolution of RF matrices, it will also exponentially increase the amount of time it takes to construct the matrix (i.e. it can take a very long time to construct RF matrices if the step size is small). If the step size is set to smaller than 1/2000 the length of the analysed sequences it will be increased so that it is 1/2000 the length of the analysed sequences. See section 9.3.2 for details of what is being plotted in a RF compatibility matrix.

3.15.3 Shimodaira-Hasegawa (SH) compatibility matrix. The “window size” setting refers to the number of nucleotides that are used to construct the various phylogenetic trees that are to be compared with one another and the “step size” refers to the number of nucleotides that are skipped between consecutive windows. As with the RF compatibility matrix, if the step size is set to larger than half the window size, the window size will be automatically adjusted so that it is twice the step size. While decreasing the step size will increase the resolution of SH matrices, it will also exponentially increase the amount of time it takes to construct the matrix (i.e. it can take a very long time to construct SH matrices if the step size is small). If the step size is set to smaller than 1/2000 of the length of the analysed sequences it will be increased so that it is 1/2000 of the length of the analysed sequences. See section 9.3.3 for details of what is being plotted in an SH compatibility matrix.
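The two automatic adjustments described for the RF and SH matrices can be sketched as follows. This is only an illustration of the stated rules, not RDP4's own code: the function names, the rounding up of the minimum step size and the order in which the two rules are applied are all assumptions.

```python
import math


def adjust_window_and_step(window, step, alignment_length):
    """Apply the two adjustment rules described for RF and SH matrices.

    Assumptions (not taken from RDP4 itself): the minimum step is rounded
    up to a whole nucleotide, and the step rule is applied before the
    window rule.
    """
    # The step may not be smaller than 1/2000 of the sequence length.
    min_step = max(1, math.ceil(alignment_length / 2000))
    step = max(step, min_step)
    # The step may not exceed half the window; if it does, widen the window.
    if step > window / 2:
        window = 2 * step
    return window, step


def number_of_windows(window, step, alignment_length):
    """Count the windows (and therefore trees) a sliding scan would produce."""
    if alignment_length < window:
        return 0
    return (alignment_length - window) // step + 1


if __name__ == "__main__":
    for requested in [(200, 150), (400, 1), (200, 20)]:
        w, s = adjust_window_and_step(*requested, alignment_length=10000)
        print(requested, "->", (w, s), "windows:", number_of_windows(w, s, 10000))
```

Because every window yields a tree that must be compared with the trees from every other window, halving the step size roughly doubles the number of trees and increases the number of pairwise comparisons by considerably more than that.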

3.15.4 Recombination matrix. The “type sequence” setting can be used to specify the sequence in an alignment that will be used as a reference when numbering the nucleotide coordinates that are plotted. See section 9.3.4 for details of what is being plotted in a recombination matrix.

3.15.5 Modularity matrix. See 3.15.4 for what the “type sequence” setting means. The “window size” setting refers to the number of nucleotides that are examined when comparing how closely the parental sequences of detected recombinants resemble one another. See section 9.3.5 for details of what is being plotted in a modularity matrix.

3.15.6 Recombination region count matrix. See 3.15.4 for what the “type sequence” setting means. The “window size” setting here refers to the diameter of the circle drawn around every recombination breakpoint pair plotted on a breakpoint pair matrix. See section 9.3.6 for details of what is being plotted on a recombination region count matrix.

3.15.7 Breakpoint distribution plot. See 3.15.4 for what the “type sequence” setting means. See 3.13 for what the other settings mean and section 9.3.7 for details on what is plotted.

3.16 Tree Settings

You are able to draw UPGMA, neighbor joining (NJ), fast neighbor joining (FastNJ or approximate least squares; LS), maximum likelihood (ML) or Bayesian trees from within RDP4. To set tree options for a specific tree construction method you must first select the type of tree you’d like to set options for. Note, however, that you are unable to change the way RDP4 makes UPGMA and FastNJ trees.

3.16.1 Neighbor joining trees

3.16.1.1 Tree drawing options. RDP4 utilises the PHYLIP component NEIGHBOR to construct NJ trees and additional information on this program and its settings can be obtained online from:

https://www.wendangku.net/doc/5910908807.html,/phylip/doc/

It is possible to specify whether or not negative branch lengths are to be permitted in the finished tree. Negative branch lengths are possible when constructing trees with the NJ method. By not allowing negative branch lengths you will force RDP4 to report negative branch lengths as having zero length.

Randomising (or jumbling) the order in which sequences are added to NJ trees will influence the way NEIGHBOR produces the final tree (if ties are obtained in any of the iterative rounds of branch addition the first sequence in the order will win the tie with possible consequences for the topology of the finished tree). To test the influence of sequence input order on the topology of a tree, use the “randomise input order” setting, set the bootstrap number to 0 and then construct trees with a range of different random number seeds. If the tree topology changes with different random number seeds then the input order has had an influence on the tree’s topology.

3.16.1.2 Model options. RDP4 uses the PHYLIP component DNADIST to calculate distance matrices for NJ tree construction. The model options on offer are those available in DNADIST. For additional information on the DNA distance models used by DNADIST please consult the online manual at:

https://www.wendangku.net/doc/5910908807.html,/phylip/doc/dnadist.html

Consult section 3.4.3 of this manual for a brief description of the model options.

3.16.1.3 Branch support tests. The number of bootstrap replicates used during the construction of NJ trees can be set. A random number seed used during generation of bootstrapped alignments can be provided. Using the same seed will result in identical bootstrapped alignments in repeated analyses.
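The effect of the random number seed can be illustrated with a minimal sketch of bootstrap resampling. This is not the PHYLIP implementation (its random number generator and file handling differ); the function name and the dictionary representation of an alignment are assumptions made for the example.

```python
import random


def bootstrap_alignment(alignment, seed):
    """Resample alignment columns with replacement using a seeded generator.

    `alignment` maps sequence names to equal-length aligned strings
    (a hypothetical in-memory representation).
    """
    rng = random.Random(seed)
    n_sites = len(next(iter(alignment.values())))
    # Draw as many column indices as there are sites, with replacement.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


if __name__ == "__main__":
    aln = {"seqA": "ACGTACGTAC", "seqB": "ACGTTCGTAA", "seqC": "ACCTACGGAC"}
    first = bootstrap_alignment(aln, seed=3)
    second = bootstrap_alignment(aln, seed=3)
    print(first == second)  # the same seed reproduces the same replicate
```

Running the function twice with the same seed reproduces the same pseudo-replicate, which is why repeated analyses with an unchanged seed give identical bootstrap support values.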

3.16.3 Maximum likelihood trees

3.16.3.1 Model options. RDP4 can use the programs PHYML (versions 1 and 3; Guindon and Gascuel, 2003; Guindon et al., 2010), RAxML (version 8; Stamatakis, 2014) and FastTree (version 2; Price et al., 2010) to construct maximum likelihood (ML) trees. Model options can, however, only be set for PHYML. For additional information on the models that are applied by these programs please consult their online manuals at:

http://bioweb2.pasteur.fr/docs/modules/phyml/3.0.1/phyml_manual_2008.pdf (PHYML);
https://www.wendangku.net/doc/5910908807.html,/exelixis/resource/download/NewManual.pdf (RAxML);
https://www.wendangku.net/doc/5910908807.html,/fasttree/ (FastTree).

Eight different nucleotide substitution models are available for PHYML. These include the Jukes Cantor-1969 (JC69), Kimura-1980 (K80), Felsenstein-1981 (F81), Felsenstein-1984 (F84), Tamura and Nei-1993 (TN93), General time reversible (GTR; Lanave et al. 1984, Tavaré 1986, Rodriguez et al. 1990) and Hasegawa, Kishino and Yano-1985 (HKY85) models. While PHYML allows users to specify their own GTR rate matrix this option is not implemented in RDP4. RDP4 will also automatically select a best fit model using an Akaike information criterion (AIC) test such as that described in Posada and Crandall (1998). This test compares the likelihoods of trees constructed with various standard nucleotide substitution models (including or excluding extra parameters permitting site-to-site variations in substitution rates) and, accounting for the number of parameters the different models contain, selects the model that fits the data best.
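As a rough illustration of how an AIC comparison trades likelihood against parameter count, the sketch below scores each model as AIC = 2k - 2 lnL and picks the lowest score. The log-likelihoods are invented for the example, the parameter counts cover only the substitution model (branch lengths are left out because they would be the same for every model on a fixed topology), and the function names are hypothetical; RDP4's own procedure may differ in detail.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: lower values indicate a better trade-off."""
    return 2 * n_params - 2 * log_likelihood


def best_model(fits):
    """Return the name of the model with the lowest AIC.

    `fits` maps a model name to (log-likelihood, number of free parameters).
    """
    return min(fits, key=lambda name: aic(*fits[name]))


if __name__ == "__main__":
    # Invented log-likelihoods, for illustration only.
    fits = {
        "JC69": (-5000.0, 0),
        "HKY85": (-4900.0, 4),
        "GTR": (-4899.0, 8),
    }
    for name, (lnl, k) in fits.items():
        print(name, aic(lnl, k))
    print("selected:", best_model(fits))
```

In this made-up case GTR has the highest likelihood, but its four extra parameters improve the fit by too little to justify them, so HKY85 is selected.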

Depending on the model selected you may be able to specify the transition:transversion rate ratio (note that, to keep things consistent with PHYLIP components used elsewhere, this is the “rate ratio” and not the “ratio” normally used in PHYML; the number that will be passed to PHYML for phylogeny construction will be twice the number specified here). If a value of 0 is specified PHYML will determine the maximum likelihood value of this parameter during tree construction (doing this will make tree construction slower).

The proportion of invariable sites can be set to any number between 0 and 1 inclusive. Setting this value to 1 will prompt PHYML to find the maximum likelihood value of this parameter during tree construction.

Depending on the model selected, equilibrium base frequencies may be estimated either empirically from the data, or by maximum likelihood during tree reconstruction (with the latter making tree construction slower).

PHYML allows specification of multiple substitution rate categories, i.e. it can take into account that different sites along an alignment may evolve at different rates. The value of each substitution rate category is drawn from a discrete gamma distribution of possible categories. The greater the number of categories specified, the more accurate will be the fit of actual substitution rates to the rate categories chosen. However, the program should take four times longer to construct a tree using four rate categories than it will take to construct a tree using one. Whereas allowing fewer than four rate categories can be unrealistic, allowing more than eight does not really improve the accuracy of tree construction but seriously slows the tree construction process down.

If trees are to be constructed using more than one substitution rate category, the exact shape of the gamma distribution from which the categories are drawn can be changed using the gamma distribution parameter. Values of this parameter below 0.7 correspond with high variations between the evolution rates of sites in the sequences being examined. Values between 0.7 and 1.5 correspond with moderate variation and values larger than 1.5 correspond with low variation. If a value of 0 is specified the shape parameter will be inferred by maximum likelihood during tree construction (again, this will increase the tree construction time).
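To give a feel for how the gamma distribution parameter shapes the rate categories, the sketch below approximates equal-probability category rates by simulation: it draws many rates from a gamma distribution with mean 1, sorts them, cuts them into equally sized bins and averages each bin. This Monte Carlo shortcut is an assumption made to keep the example dependency-free; tree-building programs calculate the category rates analytically, and the function name is hypothetical.

```python
import random


def discrete_gamma_rates(alpha, n_categories, n_draws=100_000, seed=1):
    """Approximate mean rates of equal-probability gamma rate categories.

    The gamma distribution is given shape `alpha` and scale 1/alpha so that
    its mean is 1. Any draws left over after equal binning are ignored.
    """
    rng = random.Random(seed)
    draws = sorted(rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_draws))
    size = n_draws // n_categories
    return [sum(draws[i * size:(i + 1) * size]) / size
            for i in range(n_categories)]


if __name__ == "__main__":
    for alpha in (0.5, 1.0, 2.0):
        rates = discrete_gamma_rates(alpha, 4)
        print(alpha, [round(r, 2) for r in rates])
```

With small values of the parameter most sites fall into very slow categories while a few evolve quickly; with large values the category rates move closer together, which is the "low variation" case described above.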

3.16.3.2 Branch support tests. For small datasets PHYML is fast enough to perform standard bootstrap tests of branch support. The number of bootstrap replicates used during the construction of ML trees can be set. Unlike with the NJ and LS trees, however, the random number seed will automatically change for each tree constructed.

3.16.3.3 Tree search strategy. Various computer-program dependent strategies can be used to search for the ML tree. In order of fastest to slowest these are: FastTree (the fastest and the default), RAxML, PHYML1 tree search, PHYML3 tree search with NNI, PHYML3 tree search with SPR, and PHYML3 tree search with NNI + SPR. The relative accuracies of these different tree searching methods are disputed. FastTree seems to balance speed and relatively high accuracy very well, but overall RAxML or PHYML3 may be slightly more accurate. RAxML is, however, definitely more accurate than both PHYML and FastTree when it comes to analysing alignments with large amounts of missing data.

3.16.4 Bayesian trees

RDP4 uses the program MrBayes 3.2 (Ronquist et al., 2012) to draw Bayesian trees. The options on offer in RDP4 are only a very small subset of those available in MrBayes. For additional information on these options please consult the MrBayes online manual at:

https://www.wendangku.net/doc/5910908807.html,/wiki/index.php/Manual

3.16.4.1 Model options. Three different nucleotide substitution models are available. You will notice that the model names do not correspond to those of any of the other three drawing methods in RDP4. However, MrBayes run with the “all 6 substitution types are equally likely” and “no rate variation across sites” settings corresponds with the Jukes Cantor, 1969 model. Similarly, MrBayes run with the “all 6 substitution types are equally likely” and “gamma distributed rate variation” settings corresponds with the Felsenstein, 1981 model. You should be able to find a suitable mixture of the three model settings to recreate most of the common nucleotide substitution models. The “transitions and transversions can be unequally likely” setting will result in the transition:transversion rate ratio being approximated along with the phylogeny. The “all six substitution types can be unequally likely” setting can be used to approximate the GTR model with Bayesian probabilities of the six different substitution types being inferred during tree construction.

You may also specify whether trees are to be inferred assuming gamma distributed rate variation across sites. Only three of the five types of rate variation (including no variation) on offer in MrBayes are offered in RDP4 (the options with invariable sites are not included). See section 3.16.3.1 for details on what gamma distributed rate variation means. The “auto-correlated” rate distribution setting will allow you to specify that the rates of adjacent sites are not chosen independently of one another. Although tree construction with the auto-correlated gamma distribution setting is always slower than that with the plain gamma distribution setting, the difference in construction times decreases with increasing dataset size.

See section 3.16.3.1 for advice on selecting the number of rate categories that are to be used during tree construction.

3.16.4.2 MCMC options. Use the “number of generations” setting to indicate the maximum number of MCMC generations that should occur during tree construction. RDP4 is incapable of providing you access to an interactive use of MrBayes, which means that you will not have the MrBayes option of simply continuing with the tree construction process until sufficient convergence is reached. Therefore RDP4 uses the “average standard deviation of split frequencies” convergence diagnostic to tell MrBayes when it should stop trying to find better trees. It will stop MrBayes when the average standard deviation of split frequencies is smaller than or equal to 0.1. If this degree of convergence is never reached then the trees should either be examined keeping this in mind, or another run with more generations should be started from scratch. Note that with MrBayes you could simply continue a run, which means it will sometimes be a better idea for you to simply construct these trees using MrBayes directly. In any case, the number of MCMC generations should probably never be set below 10^6. If convergence doesn’t happen in this number of generations, the generation number could be set as high as 10^10. Remember that the “stop rule” is in place so that as soon as the stop condition is reached (even if it is reached after only 10^5 generations) the run will terminate and your tree will be displayed.
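The stop rule can be pictured with the following sketch, which computes an average standard deviation of split frequencies across independent runs and compares it with the 0.1 threshold mentioned above. It is a simplified stand-in rather than the MrBayes code: the dictionary representation of split frequencies, the use of the sample standard deviation and the minimum-frequency cutoff `min_freq` are all assumptions.

```python
from statistics import stdev


def average_sd_of_split_frequencies(runs, min_freq=0.1):
    """Average, over splits, of the standard deviation of their frequencies.

    `runs` is a list (length >= 2) of dicts mapping a split label to the
    fraction of sampled trees in that run containing the split. Splits that
    never reach `min_freq` in any run are ignored (hypothetical cutoff).
    """
    if len(runs) < 2:
        raise ValueError("at least two independent runs are needed")
    splits = set().union(*runs)
    deviations = []
    for split in splits:
        freqs = [run.get(split, 0.0) for run in runs]
        if max(freqs) < min_freq:
            continue
        deviations.append(stdev(freqs))
    return sum(deviations) / len(deviations) if deviations else 0.0


def should_stop(runs, threshold=0.1):
    """Apply the stop rule: stop once the diagnostic is at or below threshold."""
    return average_sd_of_split_frequencies(runs) <= threshold


if __name__ == "__main__":
    early = [{"AB|CD": 0.9, "AC|BD": 0.1}, {"AB|CD": 0.7, "AC|BD": 0.3}]
    late = [{"AB|CD": 0.82, "AC|BD": 0.18}, {"AB|CD": 0.80, "AC|BD": 0.20}]
    print(should_stop(early), should_stop(late))
```

When the independent runs sample splits at nearly the same frequencies the diagnostic approaches zero, which is taken as evidence that the runs have converged on the same distribution of trees.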

The sampling frequency setting should be used to specify how many generations should pass between samples drawn from the Markov chain. The number should never be less than 10 or greater than one hundredth of the number of MCMC generations expected before convergence. 100 is a safe number to choose for this setting.

If the number of chains is set higher than 1 MrBayes will run multiple MCMC chains in parallel which it uses for something called “Metropolis coupling” to improve its sampling of potentially good trees. It will always run one “cold” chain and any extra chains specified will be “heated”. Running heated chains in parallel to the cold chain may be absolutely essential to achieve a good tree for alignments containing more than ~50 sequences. Basically, the more chains you specify the better will be your chances of obtaining a good tree. However, the time taken for the program to create and examine a specified number of MCMC generations will increase in proportion with the number of chains specified. Also, if your computer does not have enough RAM for MrBayes to store all the chains you ask it to analyse, the program can start running really slowly.

Another parameter influencing the Metropolis coupling behaviour of MrBayes is the “temperature”. The temperature parameter controls the rate at which the heated chains get hotter. The whole rationale behind heating of the chains is to reduce the penalisation of potential trees that are relatively less probable than the best trees sampled at any given point in the program’s execution. These less probable trees might more closely resemble, and therefore provide access to, some really good trees that the MCMC sampler would never otherwise find without the heating process. Low temperature values will heat the heated chains more slowly than high values. I’m not sure how high the temperature setting might be set without there being a complete collapse in the sampling scheme but the default value in MrBayes is 0.2 (corresponding with a 20% increase in temperature at every heating step).
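A small sketch may help to show what the temperature setting does. It assumes the incremental heating scheme in which chain i (with the cold chain as i = 0) raises the posterior to the power 1 / (1 + T x i), and it uses the standard Metropolis-coupled acceptance probability for a proposed state swap. Treat both as assumptions about the general technique rather than a description of MrBayes internals; the function names are hypothetical.

```python
import math


def chain_heats(n_chains, temperature=0.2):
    """Power applied to the posterior for each chain; index 0 is the cold chain."""
    return [1.0 / (1.0 + temperature * i) for i in range(n_chains)]


def swap_acceptance(log_post_i, log_post_j, heat_i, heat_j):
    """Probability of accepting a swap of states between chains i and j.

    `log_post_i` and `log_post_j` are the unheated log posterior values of
    the states currently held by chains i and j.
    """
    log_ratio = (heat_i - heat_j) * (log_post_j - log_post_i)
    if log_ratio >= 0:
        return 1.0
    return math.exp(log_ratio)


if __name__ == "__main__":
    heats = chain_heats(4, temperature=0.2)
    print([round(h, 3) for h in heats])
    # The heated chain holds a better tree than the cold chain: always swap.
    print(swap_acceptance(-100.0, -90.0, heats[0], heats[1]))
    # The heated chain holds a worse tree: swap only occasionally.
    print(round(swap_acceptance(-90.0, -100.0, heats[0], heats[1]), 3))
```

Larger temperature values flatten the posterior seen by the heated chains more aggressively, so they wander further from the current best trees, but states found by very hot chains are also less likely to be accepted by the cold chain.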

The swap frequency and swap number determine the rate at which states (from hot to cold and vice versa) are swapped between the chains being analysed. The swap frequency setting specifies the number of generations that pass between attempted exchanges of states between a randomly picked hot chain and the cold chain. The swap number determines how many swaps are attempted between different hot chains and the cold chain at every “swapping generation”.

3.17 SCHEMA Settings

SCHEMA (see section 9.4; Voigt et al., 2002; Lefeuvre et al., 2007) is a method that takes protein atomic coordinates and estimates degrees of protein or single stranded nucleic acid fold disruption expected in recombinant proteins or single stranded DNA/RNA molecules. RDP4 uses a permutation test to determine whether natural recombinants are significantly less disruptive of protein/nucleic acid folding than randomly generated recombinants. The number of permutations used in this test can be specified with the permutation number setting.

3.17.1 Protein folding disruption. The SCHEMA method finds all amino acid pairs that are within a user defined distance of one another (which is usually between 2 and 20 angstroms) and identifies these as being potentially interacting within the folded protein. This distance can be defined with the interaction distance setting.
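The role of the interaction distance setting can be sketched as below. The example reduces each residue to a single coordinate and counts the contacts that straddle one crossover point; both simplifications are assumptions made for illustration (a real calculation works from the atomic coordinates, and a contact that spans a breakpoint is not necessarily disrupted). The function names are hypothetical.

```python
import math
from itertools import combinations


def interacting_pairs(coords, cutoff):
    """Return residue pairs whose representative points lie within `cutoff`.

    `coords` maps a residue number to an (x, y, z) tuple in angstroms.
    """
    pairs = []
    for (i, a), (j, b) in combinations(sorted(coords.items()), 2):
        if math.dist(a, b) <= cutoff:
            pairs.append((i, j))
    return pairs


def pairs_spanning_breakpoint(pairs, breakpoint):
    """Contacts with one residue before `breakpoint` and one at or after it."""
    return [(i, j) for i, j in pairs if i < breakpoint <= j]


if __name__ == "__main__":
    coords = {1: (0.0, 0.0, 0.0), 2: (3.0, 0.0, 0.0), 3: (10.0, 0.0, 0.0)}
    for cutoff in (4.5, 8.0):
        contacts = interacting_pairs(coords, cutoff)
        print(cutoff, contacts, pairs_spanning_breakpoint(contacts, 3))
```

Increasing the interaction distance adds more residue pairs to the contact list, so more potential disruptions are counted for any given breakpoint.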

3.17.2 Nucleic acid folding disruption. RDP4 uses the program hybrid-ss-min from the UNAFOLD package (Markham and Zuker, 2008) to infer the secondary structures of DNA and RNA molecules. The temperature at which this inference is carried out is important and should be set to the approximate physiological temperature at which the DNA/RNA being analysed occurs (e.g. 37 °C for human viruses and 20 °C for plant viruses). For accurate secondary structure inference it is also necessary to indicate whether the sequences being analysed are RNA or DNA.

4 FINDING EVIDENCE OF RECOMBINATION

4.1 Automated Exploratory Recombination Analysis

4.1.1 Masking and disabling sequences. When large numbers of sequences are to be analysed, certain sequences in an alignment can be either “masked” or completely removed from the analysis (“disabled”) by clicking (with the left mouse button) on the name of the sequence either in the sequence display panel or in the small tree display panel (Fig 1). Masking does not stop the sequence being used in either tree construction, BOOTSCANning or as a reference sequence in determining informative sites (for the original RDP method, SISCAN or VisRD). Masking of sequences is useful for both focusing the analysis on groups of sequences within an alignment and, because fewer sequence comparisons are made when some sequences are masked, increasing the power of recombination detection amongst a smaller subset of sequences within an alignment. Disabling sequences is useful for temporarily discarding sequences from an alignment.

RDP4 will, by default, automatically mask sequences to ensure optimal recombination detection. Auto masking will minimise the number of comparisons the program makes during an exploratory recombination screen. This will ensure that the multiple testing correction needed for P-values will be kept to a minimum and will therefore guarantee that at least as many (but probably more) recombination events will be detected as would have been detected if no sequences were masked.

4.1.2 Grouping sequences. Grouping of sequences provides an additional means of focusing analyses onto a specific group of sequences. To make a group, right click on the sequence names in the sequence display panel or the small tree display panel and select the “group” option that is offered. Then simply click on the names of the sequences (in either the sequence or small tree display) that you wish to form part of the group. When a group of sequences is selected and an exploratory scan for recombination is subsequently carried out, the only sequence triplets that will be examined will be those for which two or more of the sequences are within the selected group. As with masking, this minimises the number of tests that are performed and increases the program’s power to detect recombination events within the specified group of sequences.
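The reduction in the number of tests that grouping brings about can be illustrated with a short sketch that enumerates sequence triplets and keeps only those with two or more members in the selected group. The function name and the representation of sequences as a list of names are assumptions; this is not RDP4's own code.

```python
from itertools import combinations


def triplets_to_scan(sequences, group=None):
    """Yield the sequence triplets that would be examined.

    With no group selected every triplet is examined; otherwise only
    triplets with at least two members inside `group` are kept.
    """
    group = set(group or [])
    for triplet in combinations(sequences, 3):
        if not group or sum(name in group for name in triplet) >= 2:
            yield triplet


if __name__ == "__main__":
    names = ["A", "B", "C", "D", "E"]
    print(len(list(triplets_to_scan(names))))
    print(len(list(triplets_to_scan(names, group={"A", "B"}))))
    print(len(list(triplets_to_scan(names, group={"A", "B", "C"}))))
```

Fewer triplets mean fewer tests and therefore a less severe multiple testing correction for the events detected within the group.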

4.1.3 Running an automated (or exploratory) analysis. Once the appropriate settings have been selected, pressing the “X-Over” button in the command button panel (Fig 1) starts the analysis. A progress bar, the time taken, the number of unique events and the number of recombination signals detected are displayed for each of the different methods selected for the primary exploratory scan for recombination. It is recommended that the “Do not show plots” or “show overview during scan” option be selected in the “General Options” (see section 3.1). If the “show plots” setting is selected the program will display plots of raw data which could more than double the analysis time.

If the “show overview during scan” setting is selected the program will display plots during a scan indicating the positions in the alignment where recombination is being detected. Displayed are plots indicating the genetic distances between parental sequences involved in generating the detected recombination signals (PDist), the minimum probability values associated with detected events (P-Val), and the number of events detected in particular regions of the alignment (#Hits).

4.1.4 Identification of unique recombination events. You will notice that the program will sequentially scan the alignment with each of the methods you have selected, with the number of detected recombination signals being displayed as the scan progresses. The recombination detection methods implemented in RDP4 examine every possible triplet of sequences within an alignment for patterns of nucleotide variation indicative of recombination. Once identified, the characteristics of each “recombination signal” (sequences in the triplet, the approximate breakpoint positions, approximate probability of recombination and the method used to detect the recombination event) are stored until every recombination signal in every sequence triplet has been identified. It is important to note that not every recombination signal is indicative of a single unique recombination event. A recombination event between two nucleotide sequences produces a recombinant molecule that has two pieces, each of which is most closely related to one or the other of the two recombining sequences (also called the parental sequences). It is important to note that these “parental” sequences are not the actual parents of the recombinant sequence; they are instead simply sequences within the analysed dataset that were used to infer the existence of the actual parents.

When detecting recombination amongst a sample of aligned sequences, the recombination signal detection methods in RDP4 will be able to detect a recombination event if:

1. One or more descendents of the recombinant have been sampled.

2. One or more reasonably close relatives of at least one of the parental sequences have been sampled.

Once a preliminary account has been made of all the recombination signals detected by all the selected exploratory recombination signal detection methods, RDP4 will begin trying to determine how many unique recombination events are responsible for the recombination signals detected. If more than one descendent of a recombinant is sampled, or more than one close relative of either of the parental sequences has been sampled, then the recombination event will be detectable with more than just one combination of three sequences within the total sequence dataset being analysed. These multiple detections of the same event must be taken into consideration when RDP4 attempts to identify the set of unique recombination events responsible for the recombination signals in the alignment.

RDP4 handles multiple detections of the same events using repeated cycles of recombination signal detection. All detectable recombination signals in an alignment are identified, the strongest signal is chosen and a piece of sequence within the apparently recombinant sequence that is responsible for the signal is removed (how the recombinant sequence in a triplet is identified is outlined later in section 4.1.5). The alignment is then re-analysed and the process repeated until there are no longer any recombination signals detectable.
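The cycle described above can be sketched as a simple loop. The detection and removal steps are passed in as functions because their details are not given here; the toy versions below, in which an "alignment" is just a list of planted signals and removing a region deletes every signal arising from the same event, are hypothetical stand-ins. Taking the strongest signal to be the one with the lowest P-value is also an assumption.

```python
def find_unique_events(alignment, detect_signals, remove_region, max_cycles=1000):
    """Repeat detect -> pick strongest -> remove until no signal remains."""
    events = []
    for _ in range(max_cycles):
        signals = detect_signals(alignment)
        if not signals:
            break
        # Assumption: the strongest signal is the one with the lowest P-value.
        strongest = min(signals, key=lambda s: s["p_value"])
        events.append(strongest)
        alignment = remove_region(alignment, strongest)
    return events


# Hypothetical toy stand-ins used only to exercise the loop.
def toy_detect(alignment):
    return list(alignment)


def toy_remove(alignment, signal):
    return [s for s in alignment if s["event"] != signal["event"]]


planted = [
    {"event": 1, "p_value": 1e-9},  # one event detected with two triplets
    {"event": 1, "p_value": 1e-5},
    {"event": 2, "p_value": 1e-4},
]

if __name__ == "__main__":
    unique = find_unique_events(planted, toy_detect, toy_remove)
    print([e["event"] for e in unique])  # [1, 2]
```

In the toy run three signals collapse into two unique events, because removing the region responsible for the strongest signal also removes the weaker signal caused by the same event.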

During this second stage of an exploratory recombination analysis a second set of graphs may be displayed (if the “show overview” setting is selected). These graphs indicate the same statistics as before except that (1) the PDist plot is replaced by a plot of recombination breakpoint numbers (BPNum) and (2) the data plotted is only that from unique recombination events (previously the data plotted was a composite of all detected recombination signals).

The procedure used for detecting unique recombination events can become a little complicated when there are multiple descendents of a single recombinant in a sample of analysed sequences. It is important not to count each of the descendents as though they possess a unique recombination event. Therefore, when a recombination signal is detected, RDP4 uses a mixture of statistical and phylogenetic methods to identify multiple descendents of ancient recombinants. Note that whenever a sequence is referred to as the “presumed recombinant” in the following sections it does not mean it is the sequence that will ultimately be identified as the recombinant. In fact all three sequences used to detect the recombination signal are in turn analysed as if they are the recombinant and the other two sequences are parental. These various methods involve:

1. Making six “sub” alignments of the alignment being analysed. Two sub-alignments are taken from the regions 3’ and 5’ of each identified recombination breakpoint (i.e. four alignments in total) with the length of each sub-alignment corresponding to 20 variable nucleotide positions between the presumed recombinant and either of its presumed parental sequences. If there is only one breakpoint in a linear sequence the sequences are treated as if they are circular and the join between the two ends is treated as a second breakpoint. The final two sub-alignments are the bits of sequence bounded by the recombination breakpoints. Again, if there is only one breakpoint in a linear sequence then the sequence is treated as circular and the region 5’ of the 5’ breakpoint is “ligated” to the region 3’ of the 3’ breakpoint.

2. A Jukes Cantor distance matrix and a bootstrapped neighbor joining tree (with branches being collapsed if they have <50% support) are constructed for each of the six sub-alignments. The six distance matrices and six trees are divided into three pairs: one for the sub-alignments bounding the 3’ breakpoint, one for the sub-alignments bounding the 5’ breakpoint and one for sub-alignments obtained by partitioning the entire alignment into two pieces.

3. A “presumed recombinant” is selected from the three sequences used to detect the current event.

4. The trees are used to identify sequences that are “phylogenetically correlated” with the presumed recombinant, i.e. sequences that tend to move around in trees with the presumed recombinant. A set of sequences is identified that “move” with the presumed recombinant relative to the parental sequences between the trees. All of the sequences thus identified are included in a phylogenetic correlation set. Due to the lack of either a known statistical test for tree robustness, or multiple testing correction, the statistical meaning of grouping sequences into such sets is obscure. However, due to the multiple testing carried out, the groupings are expected to be reasonably unconservative and although a large number of false positives are expected, the number of false negatives will be correspondingly low.

5. Each sequence in the alignment is then compared with the presumed recombinant by correlating distances between each sequence and the parentals with those of the presumed recombinant and the parentals in the paired matrices, i.e. the distance between sequence X and parental 1 in matrix 1 is regressed against that of the presumed recombinant and parental 1 in matrix 1. Altogether the regression analysis of each sequence using each matrix pair involves the correlation of six distance measures (those of the selected sequence/presumed recombinant against both the parentals in both matrices and the distances between the parentals in both matrices). Significant correlation (Pearson’s correlation using a t-test and P < 0.05 cutoff) between the distances of a selected sequence to the parental sequences with those of a presumed recombinant sequence to the same parental sequences using any of the three matrix pairs, is used to identify the sequences that have potentially descended from the same ancestral recombinant as the presumed recombinant. This mechanism of grouping sequences into a distance correlation set is also extremely unconservative because P-values are not Bonferroni corrected and again one would therefore expect a large number of false positives and few if any false negatives.

6. The total pool of identified recombination signals in the entire alignment is then scanned for potential matches to the current recombination event under consideration. Potential matches are recombination signals (a) detected with two of the sequences in the triplet used to detect the event under consideration, and (b) where the amount of sequence bounded by the approximated recombination breakpoints overlaps that bounded by the breakpoints estimated for the current event by greater than 30%. Sequences identified in this way are placed into a detectable signal set.

7. Sequences occurring in at least two of the phylogenetic correlation, distance correlation and detectable signal sets are presumed to have descended from the same original recombinant sequence as the presumed recombinant currently under consideration. These sequences are grouped into a co-recombinant set.

8. Another, different, presumed recombinant is selected from the three sequences used to detect the current event and the process from (4) through (8) is repeated until all three sequences have been considered as the presumed recombinant. For every detectable recombination event this process conservatively identifies the sequences potentially carrying trace evidence of the same original recombination event.
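Three of the ingredients used in the steps above lend themselves to a compact sketch: the Jukes Cantor distance (step 2), the 30% overlap criterion (step 6) and the two-out-of-three rule for forming a co-recombinant set (step 7). The function names are hypothetical, gapped or ambiguous sites are simply skipped, and the overlap is expressed relative to the length of the current event's region with linear, non-wrapping coordinates; these are assumptions, since the text does not spell out such details.

```python
import math
from collections import Counter


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes Cantor distance d = -3/4 ln(1 - 4p/3), p = proportion of differences."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 0.75:
        return math.inf  # saturated: the distance is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def overlap_fraction(current, other):
    """Fraction of `current` (start, end) covered by `other`; ends are exclusive."""
    shared = max(0, min(current[1], other[1]) - max(current[0], other[0]))
    return shared / (current[1] - current[0])


def co_recombinant_set(phylo_set, distance_set, signal_set):
    """Sequences present in at least two of the three candidate sets."""
    votes = Counter()
    for candidates in (phylo_set, distance_set, signal_set):
        votes.update(set(candidates))
    return {name for name, n in votes.items() if n >= 2}


if __name__ == "__main__":
    print(round(jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAA"), 4))
    print(overlap_fraction((100, 200), (150, 400)) > 0.30)
    print(sorted(co_recombinant_set({"A", "B"}, {"B", "C"}, {"A", "C", "D"})))
```

Requiring agreement between two of the three sets tempers the deliberately unconservative behaviour of each individual set.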

4.1.5 Identification of recombinant sequences. Identification of the recombinant sequence in a sequence triplet used to detect a recombination signal is achieved using the consensus of various statistical and phylogenetic methods. These include:

1. PhPr: The phylogenetic profile or PHYLPRO method of Weiller (1998). Pair-wise Jukes Cantor distances between a query sequence and all the other sequences sampled are calculated using two portions of the multiple sequence alignment bounded by the approximate recombination breakpoints, and correlated with one another. The recombinant sequence is likely to be the sequence with the lowest correlation score of all three sequences in the triplet. However, it is possible that if a substantial proportion of the sequences in a sample are descended from the same recombinant, correlation of distances between the recombinant and the other sequences in the alignment (many of which share the same recombinant sequence mosaic as the recombinant) will be high and the PHYLPRO method may fail to identify the correct recombinant.

2. TreePhPr: A variation of the PHYLPRO method in which rooted tree topology distances between sequences within neighbor joining trees (constructed from the same distance matrices used for the PHYLPRO method) rather than genetic distances are used. Topology distances within a tree are calculated by midpoint rooting the tree and encoding the relatedness of sequences in the tree in a distance matrix.

3. SubPhPr & TreeSubPhPr: Other variations of the PHYLPRO method in which the sum of squares of differences in the distances between sequences in a triplet and the remainder of sequences in the two alignments is calculated. The difference in distances between the recombinant and the remainder of sequences in the alignment is expected to be greater than that of the parental sequences. The first variant (SubPhPr) uses distance matrices and the second (TreeSubPhPr) uses rooted tree topologies encoded in distance matrices.

4. SubDist & TreeSubDist: Yet more variations of the PHYLPRO method in which the average phylogenetic correlation between the two alignments is measured when each sequence in the triplet is in turn removed from the alignment. It is expected that removal of the recombinant sequence will result in the greatest increase in average phylogenetic correlation between the alignments. The first variant (SubDist) uses distance matrices and the second (TreeSubDist) uses rooted tree topologies encoded in distance matrices.

5. ParsimonyO & ParsimonyI: Modifications of the subtree pruning and re-grafting (SPR) methods of McLeod et al. (2005) and Beiko and Hamilton (2006). These methods involve using neighbor joining trees constructed from portions of the alignment bounded by the recombination breakpoints (as opposed to trees constructed using different genes as in McLeod et al., 2005 and Beiko and Hamilton, 2006), and determining the minimum number of SPR operations required to convert one tree into the other.

Other modifications of the McLeod et al./Beiko and Hamilton method are that, for each potentially recombinant sequence under consideration, (a) only the subtree containing the sequences in the co-recombinant set for that sequence is considered, (b) it is assumed that the co-recombinant set is monophyletic and (c) rather than comparing the two trees to one another, the number of SPR operations required to reconstitute the monophyletic co-recombinant subtree is determined separately for both trees and averaged. These modifications take into consideration the fact that in taxa where recombination is very frequent there will be many conflicting phylogenetic signals within and between both trees that have nothing to do with the recombination event currently under consideration.

6. O:E & O:EDist: Methods that compare observed recombination signals with those that would be expected if each of the sequences in the triplet were recombinant. As mentioned previously, whenever a recombination event occurs it will potentially be possible to detect it if there is at least (a) one close relative of at least one of the parental sequences and (b) one descendant of the recombinant in the alignment. Whenever a sample contains more than one descendant of the recombinant or more than one close relative of one of the parental sequences, the recombination event will be detectable with more than one combination of sequences.

Therefore, recombination signals (a) detected with close relatives of each of the sequences in the triplet used to identify the current event and (b) involving at least 30% sequence overlap between approximated breakpoints are identified and used to infer which of the sequences in the triplet is recombinant. This can be achieved because, depending on which sequence is recombinant, it would be expected that the recombination event should be detectable with different sets of sequence triplets. The sequence with the corresponding set of expected sequence triplets that has the greatest overlap with the set of observed triplets is most likely to be the recombinant.

7. dMax (VisRD): The recombinant identification statistic described by Lemey et al. (2009). dMax is a quartet mapping statistic that is calculated by constructing large numbers of four taxon maximum parsimony trees containing, in turn, each of the three sequences in the triplet used to detect recombination signals. Quartet map locations are determined using the fragment of the alignment between the recombination breakpoints and the remainder of the alignment. The difference between these map locations, d, is recorded for large numbers of quartets containing each of the sequences in the triplet used to detect the recombination signal. The triplet sequence that yields the greatest d across all examined quartets (i.e. dMax) is assumed to be the recombinant.

8. Conflict: Indicates the degree to which distances are smaller between the members of potential “co-recombinant” sets (see section 4.1.3 above) than they are with other sequences in the alignment. Whereas it is expected that the members of the potential co-recombinant set of the real recombinant sequence should all be more similar to one another than any is to any other sequence in the alignment (i.e. recombinants descended from the same recombinant ancestor should be monophyletic), this is not expected to be the case for the potential co-recombinant sets of the parental sequences.

9. OuCheck: Indicates the degree to which phylogenetic relationships between the triplet sequences and other individual sequences in the alignment are disturbed by recombination (similar to a doublet scanning version of the dMax statistic above). It is calculated by considering the topologies of rooted NJ trees constructed from the region of the alignment between the recombination breakpoints and the remainder of the alignment. For each of the triplet sequences, the number of times relationships are maintained between the individual triplet sequences and each other sequence in the alignment across both trees is counted. The recombinant can be identified as the sequence that maintains the fewest unchanged relationships relative to the other triplet sequences.

10. TrpScore: Measures the change in rooted NJ tree positions (without taking actual distances into account) for each sequence in the triplet between a phylogenetic tree constructed from the fragment of the alignment between the recombination breakpoints and the tree constructed from the remainder of the alignment (similar to a triplet scanning version of the dMax statistic above). Differences in tree positions between each triplet sequence relative to every other pair of sequences in the alignment are calculated. Using averaging over branches to account for sampling biases, the enumerated topology changes are expected to be highest for the recombinant sequence.

11. SetDistT & SetDistP: Focus on the three sequences within the triplet and compare the numbers of polymorphic sites found between the recombination breakpoints in these three sequences with those found in the remainder of the three sequence alignment. It is expected that if the polymorphic sites are evenly distributed between the two regions, the recombinant sequence will be the one that is alternately most closely related to the major and minor parents. If the polymorphic sites are sparser between the breakpoints then this implies both that there is an un-sampled major parental sequence and that it is the sequence that is most distantly related to the other two in the remainder of the alignment that is the recombinant. Conversely, if the polymorphic sites are denser between the breakpoints then this implies both that there is an un-sampled minor parental sequence and that it is the sequence that is most distantly related to the other two in the alignment region between the breakpoints that is the recombinant.
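The distance-correlation idea behind the basic PHYLPRO statistic in the list above can be illustrated with a minimal Python sketch. It is not RDP4's implementation: the function names, the dictionary-based alignment, and the half-open `[start, end)` breakpoint convention are all assumptions made for the example. It computes Jukes-Cantor distances from a query sequence to every other sequence inside and outside the breakpoints and correlates the two sets of distances. Under the description above, the triplet member with the lowest correlation would be the candidate recombinant.

```python
import math

def jc_distance(a, b):
    """Jukes-Cantor distance between two aligned sequences; gapped sites are skipped."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return float("nan")          # no comparable sites
    p = sum(x != y for x, y in pairs) / len(pairs)
    if p >= 0.75:
        return float("inf")          # distance is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def pearson(xs, ys):
    """Pearson correlation coefficient; returns nan if either list has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

def phylpro_correlation(alignment, query, start, end):
    """Correlate distances from `query` to all other sequences inside and
    outside the region [start, end). `alignment` maps names to sequences.
    Hypothetical helper: the name and signature are not taken from RDP4."""
    q = alignment[query]
    inside, outside = [], []
    for name, seq in alignment.items():
        if name == query:
            continue
        inside.append(jc_distance(q[start:end], seq[start:end]))
        outside.append(jc_distance(q[:start] + q[end:], seq[:start] + seq[end:]))
    return pearson(inside, outside)
```

Calling `phylpro_correlation` once for each of the three triplet sequences and taking the minimum gives the PhPr-style choice. Saturated or gap-only comparisons propagate as `inf` or `nan` in this sketch and would need explicit handling in real data.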

A weighted consensus of these methods is used to identify the recombinant from amongst the sequences in a triplet.
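The manual does not state the weights or how the individual statistics are combined, so the following Python sketch shows only one plausible reading of a weighted consensus: each method votes for its preferred triplet member and votes are summed by weight. The method names used as dictionary keys, the weights, and the convention that a higher score means "more likely recombinant" are all assumptions for illustration.

```python
def weighted_consensus(scores, weights):
    """Pick the most likely recombinant by weighted voting.

    scores  -- {method: {sequence: score}}, with scores oriented so that a
               higher value means "more likely to be the recombinant"
               (an assumption; e.g. a PHYLPRO correlation would be negated).
    weights -- {method: weight}; methods without a weight contribute nothing.
    Returns (best_sequence, {sequence: total_weight_of_votes}).
    """
    totals = {}
    for method, per_sequence in scores.items():
        weight = weights.get(method, 0.0)
        winner = max(per_sequence, key=per_sequence.get)  # this method's vote
        totals[winner] = totals.get(winner, 0.0) + weight
    best = max(totals, key=totals.get)
    return best, totals
```

A vote-based scheme is used here because the different statistics are on incomparable scales. Whether RDP4 combines votes, ranks or rescaled scores is not specified in this section.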

It is important to note that although all of these methods work very well in sequences where recombination has been relatively rare, they all suffer from an elevated failure rate when recombination is frequent. The main reason for this is that when recombination is frequent many of the clearest recombination signals will be achieved when either two or all three of the sequences in a triplet are recombinant. Another reason is that the accuracy of the trees and distance measures used to infer which sequences are recombinant decreases as the number of detectable recombination events in an alignment increases.

In analyses where large numbers of independent recombination events are detectable it can be very difficult, if not impossible, to properly resolve the origins of sequence fragments within the recombinant sequences. However, for purposes of identifying the number of unique recombination signals in an alignment neither incorrect identification of recombinants, nor multiple overlapping recombination signals, is a fatal problem. This is because when a recombination signal is detected, a recombinant sequence is chosen and the pieces of sequence between the estimated breakpoints in all the assumed descendants of the inferred ancestral recombinant are deleted. The signal originating from that event disappears and it is not counted again during the next round of analysis. This will be true even if the incorrect sequence is chosen as the recombinant.

4.1.6 Cyclical detection and erasing of recombination signals. The systematic detection and erasing of recombination signals from an alignment is specifically carried out in the following manner:

1. An alignment is screened for recombination signals using one or more of the exploratory recombination signal detection methods that have been selected (see section 8).

2. The total pool of detectable recombination signals is examined and the signal with the best approximated probability of being a real recombination event is selected.

3. All sequences in the alignment are compared with each sequence in the triplet used to detect the selected recombination event as described in section 4.1.3. Three groups of sequences, called co-recombinant sets, are identified as possibly having the same recombinant origin as each of the three sequences in the triplet.

4. One of the sequences in the triplet is identified as the most likely recombinant as outlined in section 4.1.4.

5. The tracts of sequence responsible for the recombination signals in the identified recombinant and all the sequences in the corresponding co-recombinant set are erased. This simply involves replacing the nucleotide characters (i.e. A, C, G and T) with gap characters (i.e. -) in the region bounded by the approximated recombination breakpoints in each of the sequences in the co-recombinant set. For every tract of sequence erased a new sequence is added to the alignment. Each new sequence contains a copy of the erased sequence tract and gap characters at all other un-copied sequence positions. What this in effect achieves is to uncouple from one another the two bits of sequence that have different evolutionary histories.

6. The cycle then resumes from step (1) and continues until no further recombination signals are detectable.
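The erasing operation in step 5 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than RDP4's own code: the alignment is held as a dictionary of equal-length strings, breakpoints are given as a half-open interval `[start, end)` with `0 <= start <= end <= len(sequence)`, and the `_tract` suffix used to name the added sequences is hypothetical.

```python
GAP = "-"

def erase_tract(alignment, co_recombinant_set, start, end):
    """Return a new alignment in which the region [start, end) is replaced by
    gap characters in every sequence of `co_recombinant_set`, and in which one
    extra sequence per erased tract is added. Each extra sequence carries the
    erased tract at its original position and gaps everywhere else."""
    result = dict(alignment)                 # leave the input untouched
    for name in co_recombinant_set:
        seq = alignment[name]
        tract = seq[start:end]
        # Erase the tract in the original sequence.
        result[name] = seq[:start] + GAP * len(tract) + seq[end:]
        # Add a new sequence holding only the erased tract.
        # The "_tract" naming convention is an assumption for this sketch.
        result[name + "_tract"] = GAP * start + tract + GAP * (len(seq) - end)
    return result

# Example: erase positions 2..4 of s1 and keep them as a separate sequence.
example = erase_tract({"s1": "ACGTACGT", "s2": "TTTTCCCC"}, ["s1"], 2, 5)
```

After this step the two parts of `s1` with different evolutionary histories sit in separate rows of the alignment, which is what allows the next screening round in step 6 to ignore the signal that has just been accounted for.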

It is important to note that once sequences have been erased from the alignment and the alignment is re-screened, the part of the detection procedure dealing with the identification of recombination breakpoint positions is altered slightly. When recombination events are determined to involve breakpoints that either bracket, or are predicted to be close to a portion of deleted sequence, then one or both of the breakpoint positions are marked as being “uncertain.” The number of variable nucleotide positions in the sequence triplet being examined that fall between the deleted region and the position identified as the likely breakpoint, and the recombination signal detection method estimating the breakpoint position, determine when a breakpoint position is identified as uncertain. For example, for the RDP method any breakpoint within one window length (i.e. in variable nucleotides) of a deleted region is labelled as “uncertain.” In cases where breakpoints bracket one or more deleted regions, detected signals are broken into two or more pieces, each corresponding to the portions of continuously uninterrupted sequence between the identified breakpoints. The recombination signals within these regions are reanalysed independently and breakpoints adjacent to deleted tracts of sequence are labelled as being uncertain.
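The RDP-method example above (a breakpoint within one window length, counted in variable nucleotides, of a deleted region is “uncertain”) can be sketched as follows. The function names, the half-open coordinates, the treatment of gapped columns, and the use of `<=` for "within one window length" are assumptions made for illustration and are not documented RDP4 behaviour.

```python
def variable_sites(triplet, lo, hi):
    """Count alignment columns in [lo, hi) at which the three sequences are
    not all identical. Columns containing a gap are skipped (an assumption)."""
    count = 0
    for column in zip(*(seq[lo:hi] for seq in triplet)):
        if "-" in column:
            continue
        if len(set(column)) > 1:
            count += 1
    return count

def is_uncertain(triplet, breakpoint, deleted, window):
    """Return True if `breakpoint` lies inside the deleted region or within
    `window` variable sites of it. `deleted` is a (start, end) pair,
    half-open, in alignment coordinates."""
    del_start, del_end = deleted
    if del_start <= breakpoint < del_end:
        return True                       # breakpoint falls in erased sequence
    if breakpoint < del_start:
        lo, hi = breakpoint, del_start    # breakpoint is left of the deletion
    else:
        lo, hi = del_end, breakpoint      # breakpoint is right of the deletion
    return variable_sites(triplet, lo, hi) <= window
```

Distance is measured in variable sites rather than alignment columns because, as the text notes, it is the number of informative positions between the breakpoint and the deleted region that determines how reliably the breakpoint can be placed.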

Identifying breakpoints that are uncertain (due mostly to overlapping recombination events within a sequence triplet used to identify a recombination event) is vital for the accurate determination of detectable breakpoint distributions within a set of aligned sequences. See section 10 of this manual for a step-by-step guide on how features in RDP4 should be used to formulate a recombination hypothesis and section 9.1 on how approximated breakpoint positions for unique events can be used to detect recombination hotspots.

4.2 Manual Query vs Reference Analyses

It is possible to use RDP4 to “manually” detect recombinant sequences in an alignment using a “query vs reference sequence” approach such as that used in programs like SIMPLOT (Lole et al., 1999) or cBrothers (Fang et al., 2007). Pressing the arrow button beside the “X-Over” button in the command button panel (Fig 1) will display a menu from which you can select any of eight manual recombination detection methods (GENECONV, BOOTSCAN, MAXCHI, SISCAN, LARD, 3SEQ, Distance Plot or TOPAL). You may be prompted to:

1. Select a potential recombinant sequence (GENECONV, BOOTSCAN, MAXCHI, and Distance Plot). You should choose the potential recombinant sequence against which you would like to scan potential parental sequences.

2. Select an Outlyer Sequence (SISCAN): Select a sequence that is more distantly related to the potential recombinant sequence than either of its parents.

3. Select parental and/or outlyer sequences (GENECONV, BOOTSCAN, MAXCHI and Distance Plot): Select the sequences against which you would like to screen the potential recombinant sequence by clicking on the names of sequences in the left panel. You can unselect sequences in the right panel by clicking on them. For Distance Plots you need only select one sequence, for MAXCHI and GENECONV scans you need to select at least two sequences and for BOOTSCANs you must select at least three (two potential parental sequences and an outlyer). If you are attempting to determine the origin of sequences in a recombinant you should always try to select the likely parents of the recombinant and a sequence that is more distantly related to the parental sequences than they are to one another. Note, however, that for manual MAXCHI and GENECONV scans a very divergent outlyer may decrease the power of the scan; you should try to select an outlyer that is as closely related to the parental sequences as possible. Also note that when selecting parental sequences for a manual BOOTSCAN you should avoid selecting potential parental sequences that are more closely related to one another than they are to the recombinant. If you are unable to avoid selecting parental sequences that are more closely related to one another than they are to the recombinant you should use the “closest relative scan” option (see below).

4. Select parental and recombinant sequences (SISCAN, LARD, 3SEQ). Select three sequences by clicking on sequence names in the left panel. Try to select one recombinant sequence and its two parental sequences. If one of the parental sequences is absent from the alignment, recombination could still be detectable using these methods if you select a “parental” sequence that is more distantly related to both the recombinant and the parental sequence that is in the alignment than these two sequences are to one another. This “parental” sequence should, however, still be more closely related to both the recombinant and the parent than either of these sequences is to the actual parent that has gone unsampled.

5. Select Sequences (TOPAL). Select four or more sequences by clicking on their names in the left panel. The sequences chosen should include a recombinant sequence, at least one parental sequence, and an outlyer sequence that is more distantly related to the parental and recombinant sequences than they are to one another.

6. Closest relative scan option (BOOTSCAN). If any of the parental and/or outlyer sequences used in a scan are more closely related to one another than they are to the potential recombinant, you should select this option. If you scan without this option, parts of the scan over which parental sequences are more closely related to one another than they are to the recombinant will contain no information on which of the parental sequences the recombinant most resembles.

If you have selected enough sequences, pressing the “OK” button will perform the analysis. Results of the manual scan will be displayed in the Plot Display (Fig 1). A key indicating the meaning of the different plotted lines is given in the Recombination Information Display (Fig 1). Clicking on the names or coloured boxes in this display will highlight the corresponding plot in the Plot Display.
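The sequence-count requirements listed in the prompts above can be summarised as a small lookup, sketched here in Python. The counts come from the text; treating SISCAN, LARD and 3SEQ as requiring exactly three sequences (the text says "select three sequences") and the function name `enough_sequences` are assumptions of this sketch, not part of RDP4.

```python
# Number of sequences to select in the left panel for each manual method.
# For the first four methods this is in addition to the potential recombinant.
MIN_SELECTED = {
    "Distance Plot": 1,
    "MAXCHI": 2,
    "GENECONV": 2,
    "BOOTSCAN": 3,   # two potential parents plus an outlyer
    "SISCAN": 3,
    "LARD": 3,
    "3SEQ": 3,
    "TOPAL": 4,      # "four or more"
}

# Methods for which the text asks for three sequences rather than "at least";
# interpreting this as "exactly three" is an assumption.
EXACT_COUNT = {"SISCAN", "LARD", "3SEQ"}

def enough_sequences(method, n_selected):
    """Return True if `n_selected` satisfies the requirement for `method`."""
    required = MIN_SELECTED[method]
    if method in EXACT_COUNT:
        return n_selected == required
    return n_selected >= required
```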

5 EXAMINING AUTOMATED ANALYSIS RESULTS

The basic RDP4 interface is broken up into six separate panels, four of which are displayed at any one time (see Fig 1). From top left, moving clockwise these are (1) the sequence display, (2) the recombination information display, (3) the dendrogram display, (4) the matrix display (you can toggle between (2), (3) and (4) but they are not all displayed together), (5) the schematic sequence display, and (6) the plot display. Each display has a battery of associated features, many of which are available through display-specific menus that open when you press the right mouse button while the mouse pointer is over the display. Whenever specific menu items are discussed below they will be identified with blue text. Because the examination of results proceeds via the schematic sequence display, it is this display that will be described first.

5.1 The Schematic Sequence Display

Once an automated analysis has concluded, schematic representations of the aligned sequences indicating positions of potential recombination events are presented in the “schematic sequence display” (Fig 2). This display gives a graphical overview of the recombination hypothesis that RDP4 has come up with. It is very important that you realise that the program is fallible and that it is very likely that its hypothesis can be improved with your guidance.

The program displays only the best evidence (i.e. the evidence with the best associated P-value) of recombination that it has detected. The unique recombination events that have been detected are presented in the form of coloured rectangles. Each of these rectangles represents a recombination signal. The left and right bounds of each rectangle mark the inferred breakpoints flanking a fragment of sequence transferred by recombination. Each rectangle is also labelled with the name of a sequence in the alignment that most closely resembles the presumed donor (or minor parent) of the depicted piece of sequence.

These representations of potential recombination events can be colour coded according to:

1. Their most likely parental origins (unique colours are given to every potential donor sequence in the alignment).

2. The recombination signal detection methods that identified them.

3. Their associated P-values.

4. The relatedness of their inferred parental sequences.

The colour coding can be changed by pressing the “cycle through display options” button (Fig 2) on the bottom of the display. A key to the currently selected colour coding can be viewed by clicking on the left mouse button when the mouse pointer is over any grey area of the schematic sequence display (note that a key is not available for the “unique sequences” display).

Menus that provide various analysis and data management options can be accessed by right clicking in the schematic sequence display. If the mouse pointer is over a rectangle representing a specific recombination event, a menu will appear with options that relate to that event. Right clicking on any other part of the display provides a menu with options relating to the recombination display as a whole.

5.1.1 Using the schematic sequence display. The recombination events that are depicted in the display are sensitive to the mouse pointer and when it is moved over a rectangle representing a recombination event, information relevant to that event is displayed in the “recombination information display” (see section 5.2 and Fig 3). Clicking on the left mouse button when the mouse pointer is over the rectangle will select that event for more in-depth analysis. Immediately on selecting the event, a plot of the raw data that was used to identify the event is drawn in the plot display (section 5.3 and Fig 4), the nucleotide sites used during the analysis are highlighted in the sequence display (section 5.4 and Fig 5) and UPGMA dendrograms useful for visually checking RDP4’s identification of parental and recombinant sequences are drawn in the tree display (section 5.5 and Fig 6).

5.1.2 Saving a graphic of the display. An enhanced metafile (.emf) graphic of this display can be saved to disk by clicking on the right mouse button when the mouse pointer is over any grey area of the schematic sequence display and then selecting the “Save to .emf file” menu option that is offered. Alternatively if you select the “Copy” menu option then the graphic will be copied to the clipboard and can be pasted into other programs that accept the .emf graphic format (e.g. Word and Powerpoint).

5.1.3 Navigating through data presented in the schematic sequence display. Evidence of recombination can be presented within the schematic sequence display in various different ways. Apart from changing the colour coding (see the beginning of section 5.1), you can change the types of event that are displayed. Click on the right mouse button when the mouse pointer is over any grey area of the schematic sequence display and a menu will be displayed with the following three options: (1) “Show all events for sequence X” (sequence X is the specific sequence whose “space” the mouse pointer is closest to), (2) “Show only best events for all sequences,” and (3) “Show all events for all sequences.” If you choose to show all events RDP4 will display, stacked one on top of the other, representations of all the “best” recombination signals associated with specific recombination events that have been detected by different recombination analysis methods. Whereas obvious recombination signals might be detectable with all seven or eight of the methods that RDP4 uses to automatically check signals, less obvious signals might only be detectable with one or two different methods. If you choose to show only the best events (the default) the stacked representations of recombination signals will be collapsed and only the “best” signals (i.e. those associated with the lowest P-values) will be displayed.
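The difference between the “show all events” and “show only best events” views can be illustrated with a short Python sketch that collapses a stack of signals to the one with the lowest P-value for each event in each sequence. The `Signal` record and its field names are hypothetical and are not RDP4 data structures.

```python
from collections import namedtuple

# Hypothetical record for one detected signal: the event number, the sequence
# in which it was found, the detection method, and the associated P-value.
Signal = namedtuple("Signal", "event sequence method p_value")

def best_signals(signals):
    """Collapse stacked signals: for every (event, sequence) pair keep only
    the signal with the lowest P-value, as in the default "best events" view."""
    best = {}
    for signal in signals:
        key = (signal.event, signal.sequence)
        if key not in best or signal.p_value < best[key].p_value:
            best[key] = signal
    return sorted(best.values(), key=lambda s: (s.event, s.sequence))

stacked = [
    Signal(1, "seqA", "RDP", 1e-5),
    Signal(1, "seqA", "MAXCHI", 1e-8),
    Signal(1, "seqB", "GENECONV", 1e-3),
    Signal(2, "seqA", "3SEQ", 0.01),
]
collapsed = best_signals(stacked)   # one signal per event per sequence
```

In this toy input the RDP and MAXCHI signals for event 1 in `seqA` are stacked in the "all events" view, and only the MAXCHI signal survives in the collapsed view because its P-value is lower.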

Although it is possible to query the evidence for any particular recombination signal represented in the schematic sequence display, it is strongly recommended that you use the tools RDP4 provides to navigate through the data in a structured way. If you select the “Go to event” menu option you will see that various alternatives are offered. You can opt to go to the “best unaccepted event,” the “previous event,” “accepted events” and “rejected events”; these will be explained later in section 5.1.4.

During the automated recombination detection scanning phase of an analysis, RDP4 attempts to formulate a consistent recombination hypothesis to explain the detected recombination signals in an alignment (see section 4.1 for some details of what the program does to formulate this hypothesis). The hypothesis is formulated in a step-wise fashion with the most obvious recombination signals being accounted for first and the least obvious last. Unfortunately the program is fallible and will make mistakes at some stages of this process. When it makes a mistake at a particular step it will be more likely to make a mistake in all subsequent steps and it is therefore advisable that you analyse the recombination signals in the same order that RDP4 dealt with them. This way, when you see the program has made a mistake you can tell it to re-evaluate only the recombination signals that it dealt with after the mistake was made. You can navigate through the events in the same order as RDP4 dealt with them by starting at the first event and moving forward. If, at the end of an automated scan, you select the “Go to next event” menu option you will be taken to event number 1. Alternatively you can press the left mouse button on a grey background section of the schematic sequence display and then press the “Pg Dn” button on the keyboard and you will also be taken to event 1. The event navigation buttons at the bottom of the schematic sequence display (Fig 2) can also be used to navigate through the events in a structured way. You can navigate backwards and forwards through the events using the menu options, the “Pg Up” and “Pg Dn” buttons or the navigation buttons.

5.1.4 Managing data presented in the schematic sequence display. Pressing the right mouse button when the mouse pointer is over a recombinant region will display an “editing” menu that will allow you to accept and reject evidence of recombination, and “correct” any mistakes that the program has made in its parental/recombinant designations. You should take care when using the parent/recombinant swapping options because: (1) there is no “undo” option; (2) correctly identifying parents and recombinants is often very difficult; and (3) the program is not infallible when identifying recombinant/parents but it is objective whereas you may not be. Make sure that you do not put too much faith in the identified (either by you or the computer) polarity of recombination events.

It is very important that you “Accept” or “Reject” evidence of recombination as you go along as this both helps you keep track of where you are when going through the results of an analysis, and tells RDP4 which events it should not reconsider when you tell it to reformulate an improved recombination hypothesis. As you move sequentially through the recombination events proposed you should “accept” evidence for which RDP4 has (1) correctly identified the recombinant sequence, (2) correctly identified the recombination breakpoints, and (3) neither over- nor under-grouped sequences that have similar evidence of recombination that may or may not indicate they are descendants of a common recombinant ancestor (for help making these decisions see section 10.4 of the step-by-step guide). RDP4 will make errors of all three types. You should be aware that if RDP4 has made any of these errors during its evaluation of a specific event, it will have become more error prone when analysing all subsequent events. You must therefore correct these errors (see section 5.1.5) when you find them, “Accept” your corrections and then tell the program to “Re-Identify recombinant sequences for all unaccepted events”. This is one of the menu options that appear whenever you press the right mouse button anywhere in the schematic sequence display. You can also do this by pressing the flashing red “Re-scan” button beneath the schematic sequence display (Fig 2).

When an event is “accepted” RDP4 draws a red rectangle around its representative coloured block in the schematic sequence display. The “Accept this event in all [number of sequences] sequences where it is found” option should be used when you are happy with the way that RDP4 has grouped both the recombination signals it has detected in different sequences, and the signals within individual sequences identified by different recombination detection methods. If you are not happy with how RDP4 has grouped the sequences you can opt to individually accept the event in specific sequences using the “Accept this event only in this sequence” option. When an event is accepted in a particular sequence RDP4 will not re-evaluate the event when you tell it to make an improved recombination hypothesis using either the Re-scan button or the “Re-Identify recombinant sequences for all unaccepted events” menu option.

5.1.5 Correcting RDP via the schematic sequence display. Two of the three main errors that RDP4 will make can be corrected via the menu options provided in the schematic sequence display.

Whereas the schematic sequence display can be used to identify possible inaccuracies in recombination breakpoint prediction, these must be corrected using the sequence display (see section 5.4). When you select the “show all evidence” menu option and representations of the signals detected by different methods are all displayed together, you can quickly assess whether there are differences in the breakpoint positions identified by different methods. If there are differences it will often be worthwhile to carefully check the identified breakpoint positions - even if this involves looking at the sequences by eye.

Conversely, inaccurate identification of recombinant sequences (i.e. when a sequence identified as parental is in fact the recombinant) cannot be determined using the schematic sequence display (see section 10.4 in the step-by-step guide on how such errors are identified) but it can be fixed using the menus. If you right click on the representation of a recombination event the last three menu items displayed give you the option of “swapping” the recombinant and parental sequences. For example, if the sequence identified as the “minor parent” is the sequence you think should have been identified as the recombinant select the “Swap recombinant and the minor parent” option.

Remember to “Accept all similar” if you are satisfied that all sequences in the alignment that carry traces of the current recombination event (i.e. those descended from the ancestral sequence in which the recombination event occurred) have been identified. If only some of the recombination signals have been correctly identified then individually “Accept” only the specific signals that you believe represent evidence of the recombination event. If you choose to discount some signals in this way (there is another way of doing this via the phylogenetic trees – see Section 5.5) make sure that you individually accept all of the appropriate signals. If, for example, you only select the best signal (the one that is always displayed) for a particular sequence, RDP4 will assume that all the other unselected signals (such as those detected by other methods and which are only displayed when you select the “show all evidence” menu option) are incorrect and should be discarded. If you leave some signals unaccepted but RDP4 has identified them as being evidence of the same event you are analysing, you will in effect be telling RDP4 that you think it has over-grouped evidence of recombination. When RDP4 re-evaluates the sequences and finds that, in a particular sequence, only the evidence of one recombination detection method has been accepted (even if other methods detected the same signal) it will not re-screen for the same recombination signal and all evidence of the signal being detectable by other methods will be discarded – this evidence can be partially recouped by selecting the “Re-check all identified events with all detection methods” menu option. For example, if RDP4 had identified a group of sequences as having descended from a common recombinant ancestor but only the evidence of recombination identified in one member of the group is accepted, then the program will re-screen the other sequences in the group for evidence of recombination when either the “Re-Identify recombinant sequences for all unaccepted events” menu option is next selected or the flashing red “Re-scan” button is pressed. If the unaccepted recombination signals are re-detected, RDP4 will interpret these as being evidence of a different recombination event.

Figure 3. The recombination information display. Each apparently unique recombination event is numbered according to the order in which RDP4 characterised the event. You should start checking results from event 1 and move through the potential recombination events in the same order that RDP4 characterised them. Breakpoint positions are specific for the recombinant (or potentially recombinant) sequence that has been selected (the one with the flashing yellow rectangle in the schematic sequence display – see Figure 2 and Section 5.1); positions in the alignment are given in parentheses. The “major parent” is usually (but not always) a sequence closely related to that from which the greater part of the recombinant’s sequence may have been derived. The “minor parent” is usually a sequence closely related to that from which sequences in the proposed recombinant region may have been derived. P-values that are displayed are either multiple comparison (MC) corrected or uncorrected. Also displayed in BOLD RED CAPITALS are various warnings. The confirmation table gives an overview of (1) [...]

Besides using different combinations of “accepts” and “rejects” to correct mistakes the program makes in over-grouping sequences, the menus of the schematic sequence display can also be used to correct under-grouping of events, i.e. cases where RDP4 has identified sequences descended from the same ancestral recombinant as carrying evidence of two different unique recombination events. The “Merge events” menu option gives you the opportunity to group signals from any two identified events as having originated from the same original recombination event. Grouping and ungrouping events can also be achieved using the tree displays (Section 5.5).

If you modify breakpoint positions, recombinant designations or groupings of detectable recombination signals, you must first accept your modifications and then select either the “Re-identify recombinant sequences for all unaccepted events” menu option or press the flashing red “Re-scan” button. If you evaluate recombination events in the same order that RDP4 identified them and accurately correct mistakes that the program has made then each new recombination hypothesis RDP4 formulates when you select this option will be an improvement on the last and eventually a good consistent story should emerge from the data.

5.2 The Recombination Information Display

When the mouse pointer is moved over a coloured rectangle representing a potential recombination signal in the Schematic Sequence Display (Fig 2), information on that region is printed in the Recombination Information Display (Fig 3). This information includes the method used to detect the recombination signal, the order in which the recombination event represented by the signal was added to the current recombination hypothesis (the event number), possible breakpoints (in the sequence and in the alignment), names of sequences that are closely related to likely parental sequences (major and minor parents) and the approximate probability that the recombinant sequence could have been more closely related to the “minor parent” than the “major parent” in the specified region by chance alone (i.e. without invoking recombination). For any particular recombination signal the meaning of the P-value that is displayed here will vary slightly according to the recombination detection method used to detect the signal. The P-values displayed for the different methods are described in Section 8.

The names of the recombinant, major parent and minor parent are sensitive to the mouse pointer and left clicking on these names will result in the schematic representation of these sequences being displayed in the schematic sequence display.

Also displayed are warnings if:

1. There is only a single suitable parent-like sequence in the set of aligned sequences.

2. There is a fair likelihood (an approximately 30% or greater chance) that the program has misidentified the recombinant sequence (i.e. the actual recombinant is one of the sequences identified as a parental sequence). If one or both of the identified parental sequences is almost as likely to be the recombinant then the name(s) of the sequences are given.

3. One or both breakpoints could not be identified.

4. One or both breakpoints may have been misplaced.

5. The signal represents only trace evidence (i.e. it is not statistically significant) of a recombination event detectable in one or more other sequences (i.e. it has an associated P-value greater than the cut-off).

6. The recombination signal is a possible/probable misalignment artefact.

These warnings are meant as a prompt for you to carefully examine the presented data and make a judgment on whether the program’s interpretations are correct or not. Even when no warning is given it is always advisable to properly examine results. There is always a fair chance that the methods implemented in RDP4 will inaccurately determine breakpoints, incorrectly identify parental and recombinant sequences and over- or under-group sequences believed to be descended from ancestral recombinants. For example, the original RDP method will misidentify recombinant sequences without giving a warning when a substantial proportion of the reference sequences being used are themselves recombinant. You should carefully examine all potential recombination events using the supplementary analyses that are offered by RDP4 (see the step-by-step guide in Section 10).

The “confirmation table” part of the recombination information display gives some indication of (1) the number of sequences in the alignment that the currently selected recombination event has been detected in and (2) the degree of agreement between different detection methods regarding the currently selected recombination event.

The histogram beneath the confirmation table summarises the results of various assays that the program uses to infer which of the sequences used to detect a recombination signal is the recombinant. The assays are briefly outlined in Section 4.1.4. For 99% of users the only really relevant part of this plot will be the top three bars, which represent the “consensus scores” of the three sequences indicated. The numbers next to these bars are the scores themselves. These scores have no real meaning other than that the higher the score, the more confident you should be in the program’s assessment of which sequence is recombinant. A score >60 indicates that the identified sequence is almost certainly the recombinant. A score <60 but >40 means that the program may have made a mistake (but probably didn’t). Anything lower than this indicates that the program is VERY unsure about which sequence is the recombinant. It is under these circumstances that your input can be most useful. You should realise though that your opinion may not be very valuable if, for example, you are not very good at interpreting phylogenetic trees.
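The score bands described above can be written down as a small lookup. The following is only an illustrative sketch and not part of RDP4: the function name is hypothetical, and because the text does not say how scores of exactly 60 or exactly 40 are treated, the sketch assumes they fall into the lower of the two neighbouring bands.

```python
def interpret_consensus_score(score: float) -> str:
    """Translate a consensus score into the confidence bands given in the text.

    Assumption: boundary values (exactly 60 or 40) are placed in the lower band,
    since the manual only specifies ">60" and "<60 but >40".
    """
    if score > 60:
        return "almost certainly the recombinant"
    if score > 40:
        return "probably the recombinant, but a mistake is possible"
    return "very uncertain: check trees and plots manually"


if __name__ == "__main__":
    for value in (75, 52, 30):
        print(value, "->", interpret_consensus_score(value))
```

In practice the lowest band is the one that matters: it flags the events where the trees and plots should be inspected by hand.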

The Information display can also be used to modify how RDP4 interprets breakpoints. You will notice if you left click on the “Beginning breakpoint” or “Ending breakpoint” fields within this display, that the breakpoints will be given an “Undetermined” label. This label is important because undetermined breakpoints will be ignored when RDP4 tests breakpoint distributions for evidence of recombination hot- and cold-spots.

5.3 The Plot Display

Left clicking on the coloured rectangles that represent recombination signals within the schematic sequence display (Fig 2) will produce a graphical plot of the actual signal (Fig 4). The whole plot is sensitive to the mouse pointer and:

1. Double clicking anywhere in this panel will take you to the corresponding region in the sequence display panel (Fig 5).

2. Moving the pointer around the plot will display a cross hair for which X and Y coordinate values are displayed (Fig 4).

3. When a SISCAN plot is being displayed left clicking will produce a key that describes the meaning of the various plotted lines. Clicking on any of the plots indicated in the key will highlight that plot in the Plot Display. For a key of what the different plots represent see Gibbs et al. (2000).

At the top of the plots is a graphical representation of the distribution of polymorphic/analytically relevant sites that were used to detect the recombination signal. In the MAXCHI, CHIMAERA, SISCAN and GENECONV plots, broken lines indicate the P-value cutoffs that were used to determine the significance of breakpoints (MAXCHI, CHIMAERA) or potentially recombinant fragments (GENECONV, SISCAN). See section 8 for specific descriptions of what is being plotted for the various methods.

When you right click on the plot display you will be given the option to (1) save a graphic of the plots (in either .emf or .bmp format), (2) save the actual raw data used to construct the plots (in comma separated value or .csv format) or (3) copy an image of the plots to the clipboard (so that the plots can be pasted into Word, PowerPoint or any other .emf viewer).

Figure 4. The plot display. Interpretation of plots varies between different checking methods (see section 8 for details on what is being plotted). Different coloured lines usually indicate different sequence pairs (the names are given in the key). Vertical lines above the plot indicate positions of the variable nucleotide sites that have yielded the signals being plotted (these sites can be individually colour coded in the sequence display).

Beside the plot display is a panel with the caption “Check using.” In this panel are two buttons with the words “Options” and “STOP” on them. There is also a “combo” box that should have the name of a recombination detection method displayed. This combo box can be used to test whether various other recombination analysis methods are also capable of detecting the current recombination signal. The “Options” button can be used to adjust parameter settings for the method currently selected in the combo box. The “STOP” button can be used to terminate a scan that is taking too long (as sometimes happens with the LARD or TOPAL methods).

Besides being used to cross-check different recombination detection methods, graphical overviews of the detected recombination events can also be accessed via this combo box. These include:

1. Overview: These plots are similar to those displayed during the automated recombination screening scan. The main additional feature in the overview plots is that the recombination signals being represented are broken down according to the methods used to detect the signals. You can see a colour key indicating the methods that detected the various signals by left clicking on the plot. The vertical lines in these plots indicate the estimated positions of breakpoints and the upper horizontal lines indicate either the genetic distance between parental sequences (PDist), the P-values associated with the detected recombination signals (PVal) or the number of times individual regions of the aligned sequences were inferred to have been transferred by recombination (#Hits).

2. Recombination event map: This plot is similar to the P-value portion of the overview plots described above, except that the colours that are displayed represent degrees of parental sequence relatedness. Cooler colours indicate that parental sequences were more distantly related, whereas warmer colours indicate that they were more closely related.

3. Breakpoint density: This is a sliding window plot indicating the clustering of detectable recombination breakpoints along the alignment and can be directly used to infer the existence of statistically supported recombination hot- and cold-spots. See Section 9.1 for a description of how this plot is produced and the underlying tests performed. The plotted line represents the number of breakpoints detectable within a moving window of user specified size (press the “Options” button to change the window size), and the grey and white areas around the line respectively indicate the 95% and 99% confidence intervals for the expected degrees of breakpoint clustering in the absence of recombination hot- and cold-spots. If the black line emerges above these shaded areas it indicates the existence of a recombination hot-spot; if it drops below the shaded areas, it indicates the existence of a recombination cold-spot. The upper and lower dotted lines respectively indicate “global” 99% and 95% confidence intervals of there being recombination hot-spots. Note that this test is extremely conservative. See Section 9.1 for a description of what the global confidence intervals mean.

4. Breakpoint P-density: This plot is a version of the breakpoint density plot described above in which the plotted values correspond to probabilities (rather than absolute breakpoint numbers) that breakpoints are not significantly clustered. It is essentially a transformed version of the breakpoint density plot in which the dimensions of the shaded areas are held constant and the black line is plotted relative to these.
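To make the idea behind the breakpoint density plot concrete, the sketch below counts breakpoints in a sliding window and compares each count with a band obtained from randomly placed breakpoints. It is a deliberately simplified illustration and not RDP4's implementation: the function names are hypothetical, breakpoints are assumed to fall uniformly at random along the alignment under the null model, and only a local 95% band is computed (the actual test, including the global intervals, is described in Section 9.1).

```python
import random


def window_counts(breakpoints, alignment_length, window):
    """Number of breakpoints inside each window [start, start + window)."""
    n_windows = alignment_length - window + 1
    return [sum(start <= b < start + window for b in breakpoints)
            for start in range(n_windows)]


def null_band(n_breakpoints, alignment_length, window, n_perm=200, seed=0):
    """Per-window 95% band assuming uniformly random breakpoint positions."""
    rng = random.Random(seed)
    n_windows = alignment_length - window + 1
    samples = [[] for _ in range(n_windows)]
    for _ in range(n_perm):
        fake = [rng.randrange(alignment_length) for _ in range(n_breakpoints)]
        for i, count in enumerate(window_counts(fake, alignment_length, window)):
            samples[i].append(count)
    band = []
    for values in samples:
        values.sort()
        band.append((values[int(0.025 * (n_perm - 1))],
                     values[int(0.975 * (n_perm - 1))]))
    return band


def classify(counts, band):
    """Label each window 'hot', 'cold' or 'none' relative to the null band."""
    labels = []
    for count, (low, high) in zip(counts, band):
        if count > high:
            labels.append("hot")
        elif count < low:
            labels.append("cold")
        else:
            labels.append("none")
    return labels
```

A window is labelled "hot" when its observed count lies above the band and "cold" when it lies below, mirroring how the black line is read against the shaded areas in the plot.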

5.4 The Sequence Display

The sequence display (Fig 5) can be cycled to show (1) the entire sequence alignment, (2) only the sequences involved in identifying the currently selected recombination signal, or (3) only the informative sites within the sequences involved in identifying the currently selected recombination signal. Left clicking in the sequence display will produce a key that describes the colour coding of the nucleotides in the display.

Holding the mouse pointer over any nucleotide in the sequence display will indicate the position of that nucleotide in its unaligned sequence.

You can also save alignments in various formats and with various pieces of sequence/whole sequences omitted using the menu that is accessed when you right click anywhere in the sequence display. The alignment saving options include:

1. Save entire alignment: Will save the full alignment in whatever format you specify.

2. Save alignment with recombinant sequences removed: Will save an alignment minus any of the recombinant sequences identified during an automated recombination scan. To tell which sequences will be included in the alignment, look at the schematic sequence display. Any sequence that is represented by an unbroken line will be included.

3. Save alignment with recombinant columns removed: All alignment positions that fall between pairs of identified recombination breakpoints in ANY sequence in the alignment will be removed for all sequences. If many recombinant regions have been detected within an alignment, this option could very easily yield an empty or nearly empty alignment.

4. Save alignment with recombinant regions removed: All nucleotide positions in any sequences that are between any identified recombination breakpoint pair will be removed and replaced with gap (“-” or “.”) characters.

5. Save alignment with recombinant regions separated: Recombinant sequences within the alignment will be split into two or more sequences. For every detected recombination event the sequence(s) carrying evidence of the event will be split into two parts: one part between the identified recombination breakpoints, and the other from the remainder of the sequence. Gap characters will be inserted into the two sequences to properly maintain their alignment positions. The resulting alignment should be free of detectable recombination events.

6. Split alignment into common mosaics: All sequences in the alignment that have either identical recombination mosaics (i.e. the same pattern of detected recombination events) or are non-recombinant will be split up into separate alignments.

7. Split alignment into recombination free sub-alignments: The alignment will be split into multiple sub-alignments each containing no detectable recombination signals.

8. Save only enabled sequences: Only sequences that are “enabled” (see section 4.1.1) will be saved. This is useful for manually splitting the sequences in the alignment up into related groups.

9. Save only disabled sequences: Only sequences that are either disabled or masked (see section 4.1.1) will be saved.
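The difference between options 3, 4 and 5 is easiest to see on a toy alignment. The sketch below is an illustration only and not RDP4 code: the function names, the `(sequence name, start, end)` event tuples with 0-based half-open coordinates, and the `_event1` naming of split-off parts are all assumptions made for the example.

```python
GAP = "-"


def remove_columns(alignment, events):
    """Option 3: drop every column lying between any breakpoint pair."""
    drop = set()
    for _name, start, end in events:
        drop.update(range(start, end))
    return {name: "".join(c for i, c in enumerate(seq) if i not in drop)
            for name, seq in alignment.items()}


def remove_regions(alignment, events):
    """Option 4: replace recombinant regions with gap characters."""
    result = dict(alignment)
    for name, start, end in events:
        seq = result[name]
        result[name] = seq[:start] + GAP * (end - start) + seq[end:]
    return result


def separate_regions(alignment, events):
    """Option 5: split each recombinant into gap-padded, still-aligned parts."""
    result = dict(alignment)
    for n, (name, start, end) in enumerate(events, 1):
        seq = result[name]
        # The remainder keeps its positions; the region becomes a new sequence.
        result[name] = seq[:start] + GAP * (end - start) + seq[end:]
        result[f"{name}_event{n}"] = (GAP * start + seq[start:end]
                                      + GAP * (len(seq) - end))
    return result


if __name__ == "__main__":
    aln = {"A": "AAAAAAAAAA", "B": "ACGTACGTAC"}
    ev = [("B", 2, 5)]  # one event in sequence B spanning columns 2..4
    print(remove_columns(aln, ev))
    print(remove_regions(aln, ev))
    print(separate_regions(aln, ev))
```

Note how option 3 shortens every sequence, whereas options 4 and 5 keep the alignment length unchanged and only affect the recombinant sequence.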

When you are saving modified alignments you will often be asked whether to consider all of the detected recombination signals or only those that you have accepted (see Section 5.1.4).

Figure 5. The sequence display. The sequence conservation display is a graphical overview of the sequence alignment that also indicates the portion of the alignment that is currently presented. Within the sequence part of the display, individual nucleotides are colour coded according to their degree of conservation. When a recombination event is selected (see Figures 1 and 2 or Section 5.1), the “toggle sequence display” button can be used to highlight nucleotide polymorphisms that contribute to the recombination signals depicted in the plot display (see Figures 1 and 4 or Section 5.3). Red, green and blue highlighted sequence names indicate recombinant, major parent and minor parent sequences, respectively.

Left clicking on the names of sequences to the right of the sequence display will cyclically mask, disable and enable the sequences in the alignment. See section 10.1 for reasons why you should sometimes mask or disable sequences. Masking or disabling some sequences in an alignment will reduce the number of recombination signal detection scans and thereby both speed up an analysis and reduce the severity of multiple testing correction needed during P-value calculation. Whereas masking a sequence will mean that RDP4 will avoid looking at the sequence during a primary automated recombination screen, the sequence will still be looked at during secondary screens and will also be used within the context of phylogenetic trees to determine which sequences are recombinants. Disabled sequences will not be examined at all for evidence of recombination (even during the secondary scanning phase) but will still be included within phylogenetic trees.
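The three sequence states can be summarised as a small table of where each kind of sequence takes part. The sketch below simply encodes the description above; it is not RDP4 code. The names are hypothetical, the click order (enabled, then masked, then disabled) is inferred from the wording “cyclically mask, disable and enable”, and full participation of enabled sequences is assumed because it is implied rather than stated.

```python
from enum import Enum


class SeqState(Enum):
    ENABLED = "enabled"
    MASKED = "masked"
    DISABLED = "disabled"


# Where a sequence in each state is used, as described in the text.
PARTICIPATION = {
    SeqState.ENABLED:  {"primary_scan": True,  "secondary_scan": True,  "trees": True},
    SeqState.MASKED:   {"primary_scan": False, "secondary_scan": True,  "trees": True},
    SeqState.DISABLED: {"primary_scan": False, "secondary_scan": False, "trees": True},
}

# Assumed click order: enabled -> masked -> disabled -> enabled.
_CYCLE = [SeqState.ENABLED, SeqState.MASKED, SeqState.DISABLED]


def next_state(state: SeqState) -> SeqState:
    """State reached after one left click on the sequence name."""
    return _CYCLE[(_CYCLE.index(state) + 1) % len(_CYCLE)]


def primary_scan_size(states) -> int:
    """How many sequences would be examined in the primary screen."""
    return sum(PARTICIPATION[s]["primary_scan"] for s in states)
```

Because the number of primary-screen comparisons grows with the number of sequences examined, reducing `primary_scan_size` is what yields both the speed-up and the milder multiple testing correction mentioned above.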

Right clicking over the sequence names will display a menu of options. You can “Mask all”, “Enable all”, “Disable all” or “Invert masking.” The most useful option for general recombination analysis is “Auto mask for optimal recombination detection.” This setting will focus the analysis on sequences where it is possible to detect recombination and will skip attempts to detect recombination between sequences that are too similar. This can substantially increase the power of RDP4 to detect recombination, particularly in large alignments containing mixtures of very similar sequences (sharing >99% identity) and more diverged sequences (<90% identical).

If you are interested in looking for recombinants in a specific group of sequences but would like RDP4 to check a larger set of sequences in case some of these are good candidate parents, you can designate a group using the “Select group” menu option. To select a group choose this option and then click on the names of sequences you would like to include as candidate recombinants. When you click on the sequence names they will turn blue. If you click on a blue name it will turn black again. Whereas names in blue denote candidate recombinants, those in black denote sequences against which these recombinants will be screened.

If you would like RDP4 to adjust the schematic sequence display to show a particular sequence in the sequence display, move the mouse pointer over the name of the sequence, right click and select the “Go to” option. The representation of the sequence that the mouse pointer is over will be indicated in the schematic sequence display (Fig 2).

5.5 The Tree Displays

If you press the “Trees” buttons (Figs 3 and 7), a number of different phylogenetic trees, constructed using various different parts of the alignment, will be displayed that express the relationships between the identified recombinant and other sequences in the alignment. If the “Trees” button at the top of the screen in the command button panel (Fig 1) is pressed, two trees will be displayed side-by-side. Alternatively, if you press the “Trees” button above the recombination information display (Fig 3), a tree (Fig 6) will be displayed in the same space as the recombination information display. Different trees constructed using different bits of the alignment can be viewed by pressing the “cycle through trees” button (Fig 6). These trees include those constructed using (1) all regions of recombinant sequences examined separately, (2) only the identified recombinant region (the region related to the “minor” parent in the selected sequence), (3) only the identified “non-recombinant” region (the region related to the “major” parent in the selected sequence) or (4) all regions ignoring recombination.

When the side-by-side trees are displayed in the separate window (i.e. when you press the “Trees” button in the command button panel indicated in Fig 1) it is possible to mark sequences in one tree and have the corresponding sequences in all other trees marked at the same time. This feature is very useful for tracking the “movement” of recombinant sequences around trees constructed from different parts of an alignment. Sequences can be marked/unmarked by left clicking on their names in the trees.

Right clicking in the side-by-side tree display gives you a number of options. Selecting the “Find sequence” option will allow you to search the tree for a specific sequence (which, if found, will be highlighted in the tree with a white background). The “Clear colour” option will remove all markings from the trees, the “Auto colour” option will colour all sequence names in the tree the same colours as sequences presented in the schematic sequence display, and the “Select colour” option will allow you to select a colour with which to mark sequences.

When the mouse pointer is moved over nodes within the displayed trees a blue spot appears. If the left mouse button is pressed then all the sequences represented on the right of the node will be marked with whatever the currently selected colour is. If the right mouse button is pressed a menu is displayed. Options on this menu include: “Mark/Unmark sequences above this node as having evidence of this recombination event” which can be used to correct mistakes that RDP4 has made in over- or under-grouping sequences it thinks have descended from a common ancestor; “Find best major/minor parent above this node” which can be used to identify the sequence above this node that, if swapped for the currently indicated major/minor parent, would yield the strongest signal of recombination; “Accept/Reject all recombination events above this node” which can be used to inform the program that you are happy/unhappy with the characterised recombination signals detectable in whole groups of sequences; and “Colour/Uncolour all sequences above this node” which can be used to simultaneously colour/uncolour large groups of sequence names within the tree. The last menu option, “Determine ancestral sequence at this node,” will prompt RDP4 to attempt to determine the ancestral sequence at this node using the maximum parsimony (with the DNAPARS component of PHYLIP; Felsenstein, 1989), maximum likelihood (with RAxML; Stamatakis, 2006) and/or Bayesian (with MRBAYES 3.2; Ronquist et al., 2012) approaches. Note that estimations of ancestral sequences using a Bayesian approach can take a very long time. When an ancestral sequence has been inferred it can be saved to a .csv file by right clicking on the ancestral sequence that is displayed.

Other options on offer in the standard tree menu (the menu that is shown when you press the mouse button while the pointer is over an empty grey area of the tree display) relate to saving either the tree image (the “Copy” and “Save to .emf file” options), or the Newick format encoding (the “Newick format” option) that will allow you to reload the tree in programs like Mega (Kumar et al., 2008), FigTree (an excellent tree viewer and annotation program by Andrew Rambaut that is available for free from http://tree.bio.ed.ac.uk/software/figtree) and TreeView (Page, 1996). Unlike with the tree display in the main RDP4 window, in the side-by-side tree display you are also given the option of changing the default trees that are constructed every time you select a new recombination event from UPGMA trees to FastNJ trees (with the “Make FastNJ the default tree” option). Individual UPGMA/FastNJ trees can be redrawn as neighbour joining, maximum likelihood, or Bayesian trees by selecting the “Change tree type” option. Be very careful when selecting the latter two tree types: they might take much longer to construct than you will be prepared to wait.

Figure 6. The tree display. Green and blue highlighted sequences indicate reasonably close relatives of major and minor parents. The red highlighted sequence is the currently selected recombinant sequence (i.e. the sequence with the flashing yellow rectangle beneath its name in the schematic sequence display; see Figures 1 and 2 or Section 5.1). Pink and purple sequences are sequences with similar (pink) or somewhat similar but notably different (purple) recombination signals to that observed in the sequence highlighted in red. These red, pink and purple sequences possibly evolved from a common ancestral recombinant sequence. Enlarge or reduce the tree using the “Zoom in” and “Zoom out” buttons. When using trees to test whether RDP4 has correctly identified recombinant sequences it will usually be best to look at the side-by-side tree display in a separate window: press the “View trees in a separate window” button to do this. The “cycle through different trees” button will change the fragment of the alignment that is used to construct the tree that is displayed. Note that only UPGMA trees are usually displayed here. To see trees drawn with other methods press the “View trees in a separate window” button, right click on the tree displays that are shown and select the “Change tree type” option from the menu that is displayed; this will allow you to draw neighbour joining (recommended for all datasets), maximum likelihood and Bayesian trees.

Further menu options are only accessible if the mouse pointer is over one of the sequences in the tree when the right mouse button is pressed. The “Mark [sequence name] as also having evidence of this event” option alternates with the “Mark [sequence name] as not having evidence of this event” option. These menu options can be used to correct mistakes that RDP4 has made in over- or under-grouping sequences it thinks have descended from a common ancestor.

The “Accept this event only in this sequence”, “Accept this event in all [number of sequences] sequences where it is found”, “Reject this event only in this sequence”, and “Reject this event in all [number of sequences] sequences where it is found” options are the same as those found in the schematic sequence display menu (see section 5.1.4). These should be used to inform RDP4 that you are satisfied with the description of particular recombination events within specific sequences or groups of sequences so that it does not re-evaluate these during subsequent rescans (See section 10.4 for how and why accepting and rejecting sequences is done).

The “Make [sequence name] the [major/minor] parent” options let you manually assign major or minor parental sequences. Use them if you feel you are able to identify better candidate parental sequences than those which were automatically identified by RDP4. You should, however, be very careful when manually choosing “better” parental sequences. In some cases, such as when recombination events are very old or have occurred between very closely related sequences, a recombination signal can completely disappear if the sequences you assume are parental are in fact not the best pair of sequences for identifying the recombination event. This could be due to many different factors but most commonly can be attributed to misleading inaccuracies in the trees that you used to identify the parental sequences.

Figure 7. The matrix display. Although many different matrices can be constructed with RDP4, most of the matrix types can only be accessed once an automated recombination analysis has been completed. Moving the mouse pointer over the matrix window and right clicking will provide a range of additional options, including those to change the matrix type, change its colour scheme and save the matrix to a graphics file. For large alignments it might be necessary to enlarge the matrix with the “Zoom in” button to see sufficient detail. The X and Y coordinates of the mouse pointer and the value depicted in the matrix beneath the mouse pointer are given in the panel beside the matrix.

Before you go ahead and select an alternative parental sequence or group/ungroup recombinant sequences, the ”Recheck plot with [sequence name] as recombinant/minor parent/major parent” option can be used to test what a recombination signal would look like if one of the sequences in the currently selected sequence triplet (i.e. either the red, green or blue highlighted sequences in the tree) were replaced with the sequence the mouse pointer is over. These options can also be particularly useful for determining whether RDP4 has over- or under-grouped sequences it thinks have descended from a common recombinant ancestor (See step 10 in section 10.4 of the step-by-step guide).

The “Go to [sequence name]” option will centre the graphical representation of the sequence that the mouse pointer is over in the schematic sequence display (Fig 2).

At the bottom of the side-by-side tree display is a button labelled “Run tests”. Pressing this button will run Shimodaira-Hasegawa and approximately unbiased tests that compare the topologies of the trees on the left and the right of the side-by-side tree display. P-values <0.05 for both of these tests should be interpreted to mean that the topologies of the trees are probably significantly different from one another. Note, however, that the trees in the different panels of the tree display are expected to almost always have significantly different topologies. Further, absence of evidence for significantly different tree topologies is not evidence that the tree topologies are the same, i.e. it is not evidence that recombination has not occurred. It simply means that there is an absence of phylogenetic support for a particular recombination event having occurred.

5.6 The Matrix Display

Pressing the “Matrix” button either above the recombination information display or in the command button panel at the top of the screen (Figs 1, 3 and 6) will result in the recombination information display being replaced by the matrix display.

A number of different matrix types can be drawn in this display. You may select the matrix type that you would like to view by either right clicking in the matrix display and selecting the “Change matrix type” option or by clicking on the small arrow beside the matrix button in the command button panel (Fig. 1). For a brief description of all the different matrix types see section 9.4.

Other options that are available on the menu are to “Copy” the matrix to the clipboard, “Save to .bmp file” and “Save to .csv file.” The latter option will save information on each cell within the matrix to a spreadsheet that can be opened in programs like Excel or Open Office. The “Change colour scheme” option allows you to change the scheme used to express the range of cell values presented in the matrix.

If a MAXCHI or LARD matrix is being displayed, two additional menu options, “Place breakpoint here” and “Place ancestral breakpoint here,” are offered whenever the right mouse button is pressed. If the former option is selected then the breakpoint positions of the recombinant being analysed will be changed to the X,Y coordinate positions at the tip of the mouse pointer (these coordinates are displayed to the right of the matrix display). If the latter option is selected then the breakpoint positions of every sequence carrying evidence of the same recombination event will be changed along with the currently selected recombinant (see points 1-4 in section 10.4 of the step-by-step guide to using RDP4 for information on when/why breakpoints should sometimes be adjusted).

6 SAVING RESULTS AND RECOMBINATION-FREE DATASETS

Besides the various save options that are provided when the right mouse button is clicked while the pointer is over particular display panels (which enables images of trees, matrices, plots and other graphics to be either saved in various formats or copied and pasted into other programs), RDP4 has two different classes of analysis outputs that can also be saved following a successfully completed automated scan for recombination:

(1) For people who are interested in recombination, analysis results

depicting the recombination events that are evident within a dataset can be saved in one of two different formats by pressing the “Save” button at the top of the program screen. Results saved in an RDP4 project file (a file with a “.rdp” extension) can be reloaded at a later date for further study using RDP4. Saving results to a .csv file (a text file that can be read with a spreadsheet program like Excel) will give you a tabulated summary of all of the unique recombination events that the program has detected. In order for different fields of the text file to be read correctly by a spreadsheet program (such as Excel) you may need to specify when loading the file that columns are delimited by commas. Note that for versions of RDP before 2.0 columns were delimited by TABS and for versions before 1.07 the columns were delimited by spaces.

(2) For people who are mostly interested in removing evidence of recombination from their analysed datasets, recombination-free alignments can be saved by right-clicking on the sequences in the schematic sequence display (Fig 1). Alignments can be saved in a variety of different formats: with recombinant sequences completely removed, with the recombinationally derived parts of sequences removed (these parts are replaced by the gap character, “-”), or with the recombinant sequences split into their constituent parts (the distributed alignment option). For this last option each recombinant sequence is “decomposed” into two or more different sequences (a sequence with one detected event will be split into two sequences, one with two detected events into three sequences, and so on), each with gap characters added to ensure that the nucleotides they retain remain aligned.
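As a rough illustration of the last two save options (this is not RDP4's actual implementation), the sketch below masks recombinationally derived regions with gap characters and decomposes a recombinant into aligned pieces. The function names, the 0-based half-open coordinates, and the assumption that each detected event corresponds to one contiguous, non-overlapping region are all made up for the example.

```python
GAP = "-"


def mask_region(seq, start, end):
    """Replace seq[start:end] with gap characters, keeping the length unchanged."""
    return seq[:start] + GAP * (end - start) + seq[end:]


def split_recombinant(seq, regions):
    """Decompose an aligned sequence into a backbone plus one piece per region.

    regions is a list of non-overlapping (start, end) pairs (0-based, half-open),
    one per detected event (an assumption of this sketch). Every returned
    sequence has the same length as seq, so retained nucleotides stay aligned.
    """
    backbone = seq
    pieces = []
    for start, end in regions:
        # Remove the recombinant region from the backbone ...
        backbone = mask_region(backbone, start, end)
        # ... and keep it alone in a new sequence padded with gaps.
        pieces.append(GAP * start + seq[start:end] + GAP * (len(seq) - end))
    return [backbone] + pieces


# Hypothetical 10-nucleotide aligned sequence with two detected events.
example = split_recombinant("ACGTACGTAC", [(2, 5), (7, 9)])
# example == ["AC---CG--C", "--GTA-----", "-------TA-"]
```

With one region the function returns two sequences, and with two regions it returns three, matching the pattern described for the distributed alignment option. The first element on its own corresponds to the option in which the recombinationally derived parts are simply replaced by gaps.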

7 SUPPLEMENTARY ANALYSES

RDP4 allows you to “check” results obtained with any particular method using the original RDP method, GENECONV, BOOTSCAN/RECSCAN, MAXCHI, CHIMAERA, SISCAN, LARD, 3SEQ, distance plots, VisRD and TOPAL/DSS. To select a method for checking results, press the button in the “Check using” section of the plot display (Fig 4). The list of methods that can be used to check a result will be displayed, and you can select whichever one you want.

Once a recombinant region has been identified and appears to represent evidence of a genuine recombination event (i.e. at least two different analysis methods indicate that a particular sequence has a recombinant origin), it is recommended that you do two things: carefully examine whether RDP4 has correctly identified the breakpoint positions in the recombinant sequence(s), and check that it has not over- or under-grouped recombination signals when working out how many unique events account for the recombination signals in the alignment. See section 10.4 for a detailed walk-through of how various supplementary analyses can be used to check the accuracy of automated RDP4 results.

Other supplementary analyses that you can do in RDP4 following an automated scan for recombination are the construction of recombination breakpoint distribution plots (these are useful for identifying recombination breakpoint hotspots; see section 9.1), recombination rate plots (parametric approximations of variation in recombination rates across an alignment that can also be used to identify recombination hotspots; see section 9.3), recombination event maps (a simple graphical overview of all the unique recombination events detected; see section 5.3), tests of recombination-induced protein/nucleic acid folding disruption (see sections 9.5 and 9.6), recombination region count matrices (a more complex overview of the unique events detected, indicating how often different parts of the analysed sequences are separated from one another by recombination; see section 9.4.4), recombination breakpoint matrices (useful for telling whether specific breakpoint pairs tend to occur together; see section 9.4.5), recombination matrices (an overview of recombination expressing the parts of sequence exchanged in terms of the relatedness of parental sequences; see section 9.3.2), and modularity matrices (useful for identifying parts of sequence that tend to be co-inherited from the same parental sequence; see section 9.4.3).

Table 1. The different recombination detection and analysis methods available in RDP4

Method | Implementation | Identifies recombinants | Estimates breakpoints | Estimates regions | P-value calculation | References
Original RDP method | RDP4 | + | + | + | Binomial distribution | Martin and Rybicki, 2000
GENECONV | RDP4 & GENECONV | + | + | + | Blast-like Karlin-Altschul & permutation | Padidam et al., 1999
BOOTSCAN | RDP4 & PHYLIP | + | + | + | Bootstrapping & binomial distribution & χ² | Salminen et al., 1995
Maximum χ² | RDP4 | + | + | +/- | χ² & permutation | Maynard Smith, 1992
CHIMAERA | RDP4 | + | + | +/- | χ² & permutation | Posada and Crandall, 2001
Sister Scan | RDP4 | + | + | + | Permutation and Z-test | Gibbs et al., 2000
3SEQ | RDP4 | + | + | + | Exact test | Boni et al., 2007
LARD | LARD | - | + | - | Likelihood ratio | Holmes et al., 1999
Distance Plots | RDP4 & PHYLIP | - | + | + | - | -
PhylPro | RDP4 | + | + | - | - | Weiller, 1998
DSS/TOPAL | RDP4, PHYLIP & SEQ-GEN | - | + | - | Parametric bootstrap | McGuire and Wright, 2000
VisRD | RDP4 | + | + | + | - | Lemey et al., 2009
BURT | RDP4 | - | + | + | - | -

