Work in progress (any suggestions for mods are welcome)
So you have RAD-seq data and now you want to generate some "simple" summary statistics. Here are some options, with pluses and minuses. My biased assumptions dictate these requirements (in order of importance):
- must work for RAD-seq style data (SNPs)
- must be Unix-based; command line is optimal (GUIs don't support automation)
- must not require obscure file/data formats
- easy to use (the rest of these are just a version of this)
- integrate with the rest of my tools without a bunch of patching or manual conversion (should not be in a different language than what I'm already using, or require any tedious manual tinkering)
Pros: Really good docs. Good walkthrough. Also does PCA and a few other nice plots.
Cons: Since it's a library, you have to write Python to interact with it. It's not super easy to use unless you know a fair amount of Python. Even then it's a little complicated, and some design decisions seem arbitrary.
TLDR: I use this, for better or for worse.
Pros: .vcf is output by many next-gen short-read pipelines (e.g. TASSEL, Stacks, pyRAD). Does most of the common popgen sumstats.
Cons: Interface is a bit goofy: you pass in flags and it creates an output file for each stat. Only allows one output flag at a time, so to get several stats you'd have to wrap it in a shell script.
TLDR: Super-automatable. If you already have VCF format this is a great way to go.
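The one-flag-per-run limitation is easy to script around. Here's a minimal sketch (in Python rather than shell) that builds one command per statistic. Everything named here is a placeholder I made up for illustration: the executable `sumstat-tool`, the `--vcf` and `--out` options, and the `--stat-a`/`--stat-b` flags are not the real tool's interface, so swap in whatever the tool's docs actually list.

```python
import shlex


def build_commands(tool, vcf_path, stat_flags, out_prefix):
    """Build one argv list per statistic flag.

    All names are placeholders: `tool` is the executable, and each entry
    in `stat_flags` is a single output flag that the tool understands.
    """
    commands = []
    for flag in stat_flags:
        # Give each stat its own output prefix so runs don't overwrite each other.
        suffix = flag.lstrip("-").replace("-", "_")
        commands.append(
            [tool, "--vcf", vcf_path, flag, "--out", f"{out_prefix}.{suffix}"]
        )
    return commands


if __name__ == "__main__":
    # Hypothetical tool name and flags, for illustration only.
    for argv in build_commands(
        "sumstat-tool", "snps.vcf", ["--stat-a", "--stat-b"], "results/run1"
    ):
        # Printed rather than executed; hand argv to
        # subprocess.run(argv, check=True) to actually run it.
        print(shlex.join(argv))
```

Building argv lists (instead of one big shell string) keeps odd file names from being mangled by quoting.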
Pros: Calculates lots of sumstats, seems simple to operate, no fanciness. Command line, reads in from stdin.
Cons: Input data format is biallelic SNPs in Hudson's ms format. Maybe this isn't a con. It operates one locus at a time, so output could get messy. Also, if you want sumstat means across all loci, you'd have to compute them yourself.
TLDR: Could be worth a look, but I haven't tried it.
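To give a feel for the "do it yourself" part, here's a rough sketch of averaging per-locus stats from ms-style input. It doesn't call the tool above; it parses ms-format text directly (loci separated by `//`, haplotypes as rows of 0/1) and computes two stats per locus: the number of segregating sites and the mean pairwise differences (pi). The function names and the tiny `EXAMPLE` dataset are mine, made up for illustration.

```python
from itertools import combinations
from statistics import mean

# Tiny made-up dataset in ms-style layout: 2 loci, 4 haplotypes each.
EXAMPLE = """ms 4 2 -t 2.0
1 2 3

//
segsites: 2
positions: 0.1 0.5
01
10
00
11

//
segsites: 1
positions: 0.3
0
1
1
0
"""


def parse_ms(text):
    """Return a list of loci; each locus is a list of 0/1 haplotype strings."""
    loci = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line == "//":
            # A new locus (replicate) starts here.
            current = []
            loci.append(current)
        elif current is not None and line and set(line) <= {"0", "1"}:
            # Haplotype rows contain only 0s and 1s.
            current.append(line)
    return loci


def segregating_sites(haps):
    """Number of variable columns (every column in ms output is variable)."""
    return len(haps[0]) if haps else 0


def pairwise_pi(haps):
    """Mean number of differences over all pairs of haplotypes."""
    if len(haps) < 2:
        return 0.0
    return mean(
        [sum(a != b for a, b in zip(x, y)) for x, y in combinations(haps, 2)]
    )


def summarize(text):
    """Per-locus stats plus their unweighted means across loci."""
    per_locus = [
        {"S": segregating_sites(h), "pi": pairwise_pi(h)} for h in parse_ms(text)
    ]
    means = {
        key: mean([row[key] for row in per_locus]) for key in ("S", "pi")
    } if per_locus else {}
    return per_locus, means


if __name__ == "__main__":
    per_locus, means = summarize(EXAMPLE)
    print(per_locus)
    print(means)
```

Note the means are unweighted across loci, which is only sensible if loci are roughly comparable in length and sample size.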
C++ & some R
Pros: Does lots of great stuff, sfs, Tajima, Watterson, etc, etc.
Cons: Complicated and wonky. People write wrappers around angsd to make it work better, e.g. ngsTools (https://github.com/mfumagalli/ngsTools), and even the wrappers are wonky. Kinda wants an outgroup. Seems more complicated than I need. Operates on .bam files, which are a pain to get if you don't already have them; it's hard to go straight from non-model reads to bam.
TLDR: This is more like population genomics. If you really want to wade into the sequence data, this is your tool.
Pros: Really comprehensive list of stats to run. Good docs including a tutorial.
Cons: R package. Overly complexified.
TLDR: This could be worth a look, even if you aren't already using R in your pipeline.
Pros: It's been around for a while.
Cons: Somewhat complicated. Wonky input files. Makes lots of wonky assumptions.
This is an R package that extends adegenet's genind object. Emphasis is on clonal or partially-clonal organisms. Really great docs. The tutorial will walk you through HWE, heterozygosity, Gst (and other kinds of genetic diversity), AMOVA, and DAPC, but not all these functions are native to poppr.
Pros: Awesome walkthrough. Update to Poppr 2.0 supports SNP data.
Cons: Seems focused on clonal/partially-clonal orgs. If you just want sumstats there might be too much overhead.
TLDR: Great if you're already using adegenet/R, otherwise there are better options.
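For a sense of scale, the heterozygosity part of that tutorial boils down to very little arithmetic per SNP. A minimal sketch in plain Python, not poppr or adegenet: the genotype encoding (alternate-allele counts 0/1/2 per diploid individual, `None` for missing) and the function name are my own assumptions for illustration.

```python
def heterozygosity(genotypes):
    """Observed and expected heterozygosity for one biallelic SNP.

    `genotypes` holds alternate-allele counts per diploid individual
    (0, 1, or 2); None marks a missing call and is skipped.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes")
    n = len(called)
    p = sum(called) / (2 * n)  # alternate allele frequency
    h_obs = sum(1 for g in called if g == 1) / n  # fraction of heterozygotes
    h_exp = 2 * p * (1 - p)  # Hardy-Weinberg expectation, no small-sample correction
    return h_obs, h_exp


if __name__ == "__main__":
    print(heterozygosity([0, 1, 1, 2, None]))
```

A big gap between the two values at many loci is the kind of thing the HWE tests in the tutorial formalize.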
Python program. Primarily for phylogenetics, but it has a sumstats module, so it seemed worth checking out.
Cons: Limited number of sumstats. Limited to "tree-ee" input formats: “nexus”, “newick”, “nexml”, “fasta”, or “phylip”.
TLDR: Probably works well for others, but doesn't seem like a good fit for me.
Biopython is great! It's a great idea, and I've used it for other stuff before, but not for next-gen sumstats (PopGen module).
Pros: It's written in Python and I'm hooked on ipython notebooks lately, so this is a +. Also has skel for fastsimcoal2.
Cons: Seems to only want to do multi-locus/microsatellites. No clear way to use it for SNPs. Requires Genepop to be installed, and requires the Genepop file format, which is esoteric. Genepop requires compilation on Linux, which is fine, but adds to the complication. Same goes for fastsimcoal2 (fsc2).
TLDR: Doesn't do SNPs. It's a wrapper, neglected and somewhat undocumented.
Scripts in various languages
Pros: Yeah, it does some of the stats we want.
Cons: Complicated operation (perl/java/R), not very straightforward. For pooled sequence data. Unfortunate name.
TLDR: Complicated. There are better options.
Cons: Distributes through mediafire? wat?
Pros: The developers have a sense of humor: "Do not leave trailing blank lines at the end of your data file, as this currently causes PyPop to terminate with an error message that takes experience to diagnose." Lol.
Cons: Most recent release 2008, not active development. Esoteric input format.
TLDR: Doesn't do SNPs. Old.
Pros: ... idk, it's been around for a while? People use it, I just didn't seriously consider it because of the "Cons".
Cons: "DnaSP is written in Visual Basic v. 6.0 (Microsoft), and it runs on an IBM-compatible PC under 32-bit Windows." ... D: