path: root/biology
Age Commit message Author Files Lines
2021-06-29py-numpy: "Python version >= 3.7 required."nia1-1/+3
2021-06-23Revbump for MySQL default changenia1-2/+2
2021-06-15biology/Makefile: Add peak-classifierbacon1-1/+2
2021-06-15biology/peak-classifier: import peak-classifier-0.1.1bacon4-0/+43
Classify ChIP/ATAC-Seq peaks based on features provided in a GFF. Peaks are provided in a BED file sorted by chromosome and position. The GFF must be sorted by chromosome and position, with gene-level features separated by ### tags and each gene organized into subfeatures such as transcripts and exons. This is the default for common data sources.
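The GFF layout described above can be read one gene at a time. The following is a minimal sketch, assuming each `###` tag sits on its own line after a gene block; `iter_gene_blocks` and the sample records are hypothetical and are not taken from peak-classifier itself.

```python
def iter_gene_blocks(lines):
    """Yield one list of GFF feature rows per gene; a '###' line closes a block."""
    block = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "###":
            if block:
                yield block
            block = []
        elif line and not line.startswith("#"):
            # Data row: nine tab-separated GFF columns.
            block.append(line.split("\t"))
    if block:
        # Tolerate a final gene that is not followed by '###'.
        yield block


# Made-up example input: two genes, the first with two subfeatures.
gff_text = (
    "##gff-version 3\n"
    "1\tsrc\tgene\t100\t900\t.\t+\t.\tID=gene1\n"
    "1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=tx1;Parent=gene1\n"
    "1\tsrc\texon\t100\t300\t.\t+\t.\tParent=tx1\n"
    "###\n"
    "1\tsrc\tgene\t2000\t2500\t.\t-\t.\tID=gene2\n"
    "###\n"
)

blocks = list(iter_gene_blocks(gff_text.splitlines()))
for block in blocks:
    gene = block[0]
    print(gene[2], gene[3], gene[4], "rows:", len(block))
```

Because both the peaks and the GFF are sorted by chromosome and position, a classifier could in principle walk the two inputs in step, one gene block at a time, rather than loading either file whole.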
2021-06-15biology/biolibc: Update to 0.1.3.2bacon4-9/+10
Add LDFLAGS to allow RELRO
2021-06-11biology/vcf-split: Update to 0.1.2bacon3-8/+9
Updates for new biolibc API. Upstream change log: https://github.com/auerlab/vcf-split/releases
2021-06-11biology/vcf2hap: Update to 0.1.3bacon3-8/+9
Updates for new biolibc API. Upstream change log: https://github.com/auerlab/vcf2hap/releases
2021-06-11biology/ad2vcf: Update to 0.1.3bacon3-8/+9
Updates for new biolibc API. Upstream change log: https://github.com/auerlab/ad2vcf/releases
2021-06-11biology/biolibc: Update to 0.1.3bacon4-16/+52
Import sam_buff_t class and VCF functions from ad2vcf. Add BED and GFF support. Isolate headers under include/biolibc. Numerous small enhancements and fixes. Upstream change log: https://github.com/auerlab/biolibc/releases
2021-06-11biology/ncbi-blast+: Update to 2.11.0bacon8-54/+171
Release notes: https://www.ncbi.nlm.nih.gov/books/NBK131777/
2021-06-01*: recursive PKGREVISION bump for sneaky gsl shared library version number changewiz1-3/+2
2021-05-29biology/minimap2: install minimap2 program instead of python bindingbrook2-19/+11
The distfile for minimap2 includes two different components: (i) the minimap2 sequence mapping program itself, and (ii) a python binding generally referred to as mappy. The initial version of this package included only the python binding. However, it is more appropriate that the minimap2 package should contain the program of the same name, and a new package be created with the name mappy for the python binding. Splitting these into two packages makes sense, because this allows users to install the minimap2 package without python dependencies.
2021-05-27biology/filter-fastq: add filter-fastq version 0.0.0.20210527brook1-1/+2
2021-05-27biology/filter-fastq: add filter-fastq version 0.0.0.20210527brook4-0/+56
Filter reads from a FASTQ file using a list of identifiers. Each entry in the input FASTQ file (or files) is checked against all entries in the identifier list. Matches are included by default, or excluded if the --invert flag is supplied. Paired-end files are kept consistent (in order).

This is almost certainly not the most efficient way to implement this filtering procedure. I tested a few different strategies and this one seemed the fastest. Current timing with 16 processes is about 10 minutes per 1M paired reads with gzip'd input and output, depending on the length of the identifier list to filter by.

usage: filter_fastq.py [-h] [-i INPUT] [-1 READ1] [-2 READ2] [-p NUM_THREADS] [-o OUTPUT] [-f FILTER_FILE] [-v] [--gzip]
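The filtering rule described above (keep matches by default, drop them when inverted) can be sketched as follows. This is a single-process illustration that assumes plain four-line FASTQ records and uses a set lookup; it is not the packaged script's implementation, and the function name `filter_fastq` and the sample reads are hypothetical.

```python
import io


def filter_fastq(handle, wanted_ids, invert=False):
    """Yield four-line FASTQ records whose ID is in wanted_ids (or is not, if invert)."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of input
        if not header.strip():
            continue  # tolerate blank lines between records
        record = [header] + [handle.readline() for _ in range(3)]
        # The identifier is the first word after the leading '@'.
        read_id = header[1:].split()[0]
        if (read_id in wanted_ids) != invert:
            yield "".join(record)


# Made-up example input with three reads.
fastq_text = (
    "@read1 sample=A\nACGT\n+\nIIII\n"
    "@read2\nGGCC\n+\nIIII\n"
    "@read3\nTTAA\n+\nIIII\n"
)
wanted = {"read1", "read3"}

kept = list(filter_fastq(io.StringIO(fastq_text), wanted))
dropped = list(filter_fastq(io.StringIO(fastq_text), wanted, invert=True))
print(len(kept), len(dropped))
```

Applying the same generator to both files of a paired-end run with the same identifier set would keep the two outputs in the same order, provided both inputs list their reads in the same order.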
2021-05-26Added biology/beagle version 5.2brook1-1/+2
2021-05-26biology/beagle: added beagle 5.2brook7-0/+143
Introduction

Beagle is a software package for phasing genotypes and for imputing ungenotyped markers. Beagle version 5.2 provides significantly faster genotype phasing than version 5.1.

Citation

If you use Beagle in a published analysis, please report the program version and cite the appropriate article.

The Beagle 5.2 genotype imputation method is described in: B L Browning, Y Zhou, and S R Browning (2018). A one-penny imputed genome from next generation reference panels. Am J Hum Genet 103(3):338-348. doi:10.1016/j.ajhg.2018.07.015

The most recent reference for Beagle's phasing method is: S R Browning and B L Browning (2007) Rapid and accurate haplotype phasing and missing data inference for whole genome association studies by use of localized haplotype clustering. Am J Hum Genet 81:1084-1097. doi:10.1086/521987

This reference will be updated when the Beagle version 5 phasing method is published.
2021-05-26Added biology/racon 1.4.3brook1-1/+2
2021-05-26biology/racon: add racon 1.4.3brook4-0/+53
## Description

Racon is intended as a standalone consensus module to correct raw contigs generated by rapid assembly methods which do not include a consensus step. The goal of Racon is to generate a genomic consensus of similar or better quality than the output of assembly methods which employ both error correction and consensus steps, while running several times faster than those methods. It supports data produced by both Pacific Biosciences and Oxford Nanopore Technologies.

Racon can be used as a polishing tool after the assembly with **either Illumina data or data produced by third-generation sequencing**. The type of input data is detected automatically.

Racon takes only three files as input: contigs in FASTA/FASTQ format, reads in FASTA/FASTQ format, and overlaps/alignments between the reads and the contigs in MHAP/PAF/SAM format. Output is a set of polished contigs in FASTA format printed to stdout. All input files **can be compressed with gzip** (which will affect parsing time).

Racon can also be used as a read error-correction tool. In this scenario, the MHAP/PAF/SAM file needs to contain pairwise overlaps between reads **including dual overlaps**.

A **wrapper script** is also available to make large datasets easier to handle. It has the same interface as racon but adds two features on top. Sequences can be **subsampled** to decrease the total execution time (accuracy might be lower), while target sequences can be **split** into smaller chunks and run sequentially to decrease memory consumption. Both features can be used at the same time.
2021-05-26Add biology/minimap2 2.18brook1-1/+2
2021-05-26biology/minimap2: add minimap 2.18brook4-0/+62
## Users' Guide

Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. Typical use cases include: (1) mapping PacBio or Oxford Nanopore genomic reads to the human genome; (2) finding overlaps between long reads with error rate up to ~15%; (3) splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or Direct RNA reads against a reference genome; (4) aligning Illumina single- or paired-end reads; (5) assembly-to-assembly alignment; (6) full-genome alignment between two closely related species with divergence below ~15%.

For ~10kb noisy read sequences, minimap2 is tens of times faster than mainstream long-read mappers such as BLASR, BWA-MEM, NGMLR and GMAP. It is more accurate on simulated long reads and produces biologically meaningful alignment ready for downstream analyses. For >100bp Illumina short reads, minimap2 is three times as fast as BWA-MEM and Bowtie2, and as accurate on simulated data. Detailed evaluations are available from the minimap2 paper or the preprint.

Release 2.18-r1015 (9 April 2021)
---------------------------------

This release fixes multiple rare bugs in minimap2 and adds additional functionality to paftools.js.

Changes to minimap2:

* Bugfix: a rare segfault caused by an off-by-one error (#489)
* Bugfix: minimap2 segfaulted due to an uninitialized variable (#622 and #625).
* Bugfix: minimap2 parsed spaces as field separators in BED (#721). This led to issues when the BED name column contains spaces.
* Bugfix: minimap2 `--split-prefix` did not work with long reference names (#394).
* Bugfix: option `--junc-bonus` didn't work (#513)
* Bugfix: minimap2 didn't return 1 on I/O errors (#532)
* Bugfix: the `de:f` tag (sequence divergence) could be negative if there were ambiguous bases
* Bugfix: fixed two undefined behaviors caused by calling memcpy() on zero-length blocks (#443)
* Bugfix: there were duplicated SAM @SQ lines if option `--split-prefix` was in use (#400 and #527)
* Bugfix: option -K had to be smaller than 2 billion (#491). This was caused by a 32-bit integer overflow.
* Improvement: optionally compile against SIMDe (#597). Minimap2 should work with IBM POWER CPUs, though this has not been tested. To compile with SIMDe, please use `make -f Makefile.simde`.
* Improvement: more informative error message for I/O errors (#454) and for FASTQ parsing errors (#510)
* Improvement: abort given a malformed RG line (#541)
* Improvement: better formula to estimate the `dv:f` tag (approximate sequence divergence). See DOI:10.1101/2021.01.15.426881.
* New feature: added the `--mask-len` option for fine control over the removal of redundant hits (#659). The default behavior is unchanged.

Changes to mappy:

* Bugfix: mappy caused a segmentation fault if the reference index is not present (#413).
* Bugfix: fixed a memory leak via 238b6bb3
* Change: always require Cython to compile the mappy module (#723). Older mappy packages at PyPI bundled the C source code generated by Cython such that end users did not need to install Cython to compile mappy. However, as Python 3.9 is breaking backward compatibility, older mappy does not work with Python 3.9 anymore. We have to add this Cython dependency as a workaround.

Changes to paftools.js:

* Bugfix: the "part10-" line from asmgene was wrong (#581)
* Improvement: compatibility with GTF files from GenBank (#422)
* New feature: asmgene also checks missing multi-copy genes
* New feature: added the misjoin command to evaluate large-scale misjoins and megabase-long inversions.

Despite the many bug fixes and minor improvements, the core algorithm stays the same. This version of minimap2 produces nearly identical alignments to v2.17 except in very rare corner cases.

Now unimap is recommended over minimap2 for aligning long contigs against a reference genome. It often takes less wall-clock time and is much more sensitive to long insertions and deletions.

(2.18: 9 April 2021, r1015)
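The BED parsing fix (#721) concerns a general pitfall that is easy to reproduce. The sketch below only illustrates that class of bug on a made-up BED line; it is not minimap2's C code.

```python
# A BED line whose name column (the 4th field) contains spaces.
bed_line = "chr1\t100\t200\tmy gene name\t0\t+"

# Splitting on any whitespace breaks the name into several fields.
by_whitespace = bed_line.split()

# Splitting on tabs only keeps the six BED columns intact.
by_tab = bed_line.split("\t")

print(len(by_whitespace), repr(by_whitespace[3]))  # 8 'my'
print(len(by_tab), repr(by_tab[3]))                # 6 'my gene name'
```

With whitespace splitting, every column after the name shifts position, so the score and strand fields would be read from the wrong place.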
2021-05-26Add biology/miniasm 0.3.brook1-1/+2
2021-05-26biology/miniasm: add miniasm 0.3brook4-0/+64
Miniasm is a very fast OLC-based *de novo* assembler for noisy long reads. It takes all-vs-all read self-mappings (typically by minimap) as input and outputs an assembly graph in the GFA format. Different from mainstream assemblers, miniasm does not have a consensus step. It simply concatenates pieces of read sequences to generate the final unitig sequences. Thus the per-base error rate is similar to the raw input reads.

Miniasm is still at an early development stage. It has only been tested on a dozen PacBio and Oxford Nanopore (ONT) bacterial data sets. Including the mapping step, it takes about 3 minutes to assemble a bacterial genome. Under the default setting, miniasm assembles 9 out of 12 PacBio datasets and 3 out of 4 ONT datasets into a single contig. The 12 PacBio data sets are [PacBio E. coli sample][PB-151103], [ERS473430][ERS473430], [ERS544009][ERS544009], [ERS554120][ERS554120], [ERS605484][ERS605484], [ERS617393][ERS617393], [ERS646601][ERS646601], [ERS659581][ERS659581], [ERS670327][ERS670327], [ERS685285][ERS685285], [ERS743109][ERS743109] and a deprecated PacBio E. coli data set. ONT data are acquired from the Loman Lab.

For a *C. elegans* PacBio data set (only 40X are used, not the whole dataset), miniasm finishes the assembly, including read overlapping, in ~10 minutes with 16 CPUs. The total assembly size is 105Mb; the N50 is 1.94Mb. In comparison, the HGAP3 produces a 104Mb assembly with N50 1.61Mb. This dotter plot gives a global view of the miniasm assembly (on the X axis) and the HGAP3 assembly (on Y). They are broadly comparable. Of course, the HGAP3 consensus sequences are much more accurate. In addition, on the whole data set (assembled in ~30 min), the miniasm N50 is reduced to 1.79Mb. Miniasm still needs improvements.

Miniasm confirms that at least for high-coverage bacterial genomes, it is possible to generate long contigs from raw PacBio or ONT reads without error correction. It also shows that minimap can be used as a read overlapper, even though it is probably not as sensitive as the more sophisticated overlappers such as MHAP and DALIGNER. Coupled with long-read error correctors and consensus tools, miniasm may also be useful to produce high-quality assemblies.

## Algorithm Overview

1. Crude read selection. For each read, find the longest contiguous region covered by three good mappings. Get an approximate estimate of read coverage.
2. Fine read selection. Use the coverage information to find the good regions again but with more stringent thresholds. Discard contained reads.
3. Generate a string graph. Prune tips, drop weak overlaps and collapse short bubbles. These procedures are similar to those implemented in short-read assemblers.
4. Merge unambiguous overlaps to produce unitig sequences.

## Limitations

1. Consensus base quality is similar to input reads (may be fixed with a consensus tool).
2. Only tested on a dozen high-coverage PacBio/ONT data sets (more testing needed).
3. Prone to collapse repeats or segmental duplications longer than input reads (hard to fix without error correction).
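Step 1 of the algorithm overview (the longest contiguous region of a read covered by three mappings) can be sketched with a coverage sweep. This is an illustration under stated assumptions, not miniasm's implementation: mappings are half-open `(start, end)` intervals on the read, they are assumed to have already passed whatever "good mapping" filter applies, and `longest_covered_region` is a hypothetical name.

```python
from collections import defaultdict


def longest_covered_region(mappings, min_depth=3):
    """Return the longest (start, end) region covered by >= min_depth mappings, or None."""
    # Net change in coverage at each coordinate.
    delta = defaultdict(int)
    for start, end in mappings:
        if end > start:
            delta[start] += 1
            delta[end] -= 1

    best = None
    depth = 0
    region_start = None
    for pos in sorted(delta):
        depth += delta[pos]
        if depth >= min_depth and region_start is None:
            region_start = pos  # coverage just reached the threshold
        elif depth < min_depth and region_start is not None:
            # Coverage just fell below the threshold: close the region.
            if best is None or pos - region_start > best[1] - best[0]:
                best = (region_start, pos)
            region_start = None
    return best


# Made-up mapping intervals on a single read.
mappings = [(0, 100), (10, 90), (20, 80), (50, 200), (60, 200), (70, 200)]
region = longest_covered_region(mappings)
print(region)  # (20, 200)
```

Summing the coverage changes per coordinate before sweeping means that a mapping ending exactly where another begins does not split an otherwise continuous region.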
2021-05-24*: recursive bump for perl 5.34wiz18-30/+36
2021-05-22py-dnaio: unbreak pkgsrc tree. revert removal of PYTHON_VERSIONS_INCOMPATIBLE.nia1-1/+5
2021-05-21py-dnaio: updated to 0.5.1adam3-25/+26
v0.5.1 Add py.typed and distribute .pyi files
2021-04-25various fixes for arm64 big endian support.mrg1-2/+2
most of these simply extend matching from "aarch64" to "aarch64eb" in various forms of code. most remaining uses in pkgsrc of "MACHINE_ARCH == aarch64" are because of missing aarch64eb support, such as most of the binary-bootstrap requiring languages like rust, go, and java. no pkg-bump because this shouldn't change packages on systems that could already build all of these.
2021-04-22py-dnaio: mark incompatible with python 2nia1-1/+4
2021-04-22py-cutadapt: add missing build dependencynia1-1/+3
2021-04-21revbump for boost-libsadam8-15/+16
2021-04-21revbump for textproc/icuadam7-13/+14
2021-04-21*: remove dead download locationswiz2-6/+5
2021-04-21*: remove dead download locationwiz2-6/+6
2021-04-04biology/molsketch: update to 0.7.2pin2-7/+7
This is just a small release to fix some issues with the (possibly) renamed *.so/*.dll files after removing Qt5 support. If you were using Molsketch prior to version 0.7.1, it will ask you to update the corresponding settings at startup. For Windows users, there will be an online installer, as in version 0.7.1, but this will now reside in a separate folder and not be updated as frequently as Molsketch itself. Updates will instead be made available in the online repository at GitHub, from which the installer will fetch them. Just start the installer and select the update option.
2021-03-31py-cutadapt: updated to 3.4adam2-7/+7
v3.4 (2021-03-30)
-----------------

* :issue:`481`: An experimental single-file Windows executable of Cutadapt is `available for download on the GitHub "releases" page <https://github.com/marcelm/cutadapt/releases>`_.
* :issue:`517`: Report correct sequence in info file if read was reverse complemented
* :issue:`517`: Added a column to the info file that shows whether the read was reverse-complemented (if ``--revcomp`` was used)
* :issue:`320`: Fix (again) "Too many open files" when demultiplexing
2021-03-24biology/Makefile: Add vcf2hapbacon1-1/+2
2021-03-24biology/vcf2hap: import vcf2hap-0.1.2bacon4-0/+32
vcf2hap is a simple tool for generating a .hap file from a VCF. The .hap file is required by haplohseq. vcf2hap is extremely fast and requires a trivial amount of memory regardless of the size of the VCF file.
2021-03-24biology/Makefile: Add ad2vcfbacon1-1/+2
2021-03-24biology/ad2vcf: import ad2vcf-0.1.2bacon4-0/+30
ad2vcf extracts allelic depth info from a SAM stream and adds it to a corresponding single-sample VCF file.
2021-03-24biology/Makefile: Add biolibc, vcf-splitbacon1-1/+3
2021-03-24biology/vcf-split: import vcf-split-0.1.1bacon4-0/+31
Vcf-split splits a multi-sample VCF into single-sample VCFs, writing thousands of output files simultaneously. Parsing the TOPMed human chromosome 1 BCF with bcftools takes two days, so extracting the 137,977 samples one at a time or using thousands of parallel readers of the same file is impractical. Vcf-split solves this by generating thousands of single-sample outputs during a single sweep through the multi-sample input.
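The single-sweep idea can be sketched as follows. This is an in-memory illustration on a made-up two-sample VCF; the real tool writes per-sample files, and `split_vcf` is a hypothetical name, not part of vcf-split.

```python
def split_vcf(lines):
    """Split multi-sample VCF lines into {sample: [single-sample VCF lines]} in one pass."""
    meta = []      # '##' meta-information lines, copied to every output
    samples = []   # sample names from the '#CHROM' header line
    outputs = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("##"):
            meta.append(line)
        elif line.startswith("#CHROM"):
            cols = line.split("\t")
            fixed_header, samples = cols[:9], cols[9:]
            for name in samples:
                outputs[name] = meta + ["\t".join(fixed_header + [name])]
        elif line:
            cols = line.split("\t")
            fixed = cols[:9]  # CHROM through FORMAT
            # Each input line is parsed once and fanned out to every sample.
            for name, call in zip(samples, cols[9:]):
                outputs[name].append("\t".join(fixed + [call]))
    return outputs


vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
    "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\t1/1\n"
    "1\t200\t.\tC\tT\t.\tPASS\t.\tGT\t0/0\t0/1\n"
)

per_sample = split_vcf(vcf_text.splitlines())
for name, out_lines in per_sample.items():
    print(name, len(out_lines), "lines")
```

The point of the design is that the expensive step, reading and parsing the multi-sample input, happens once regardless of how many samples are extracted.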
2021-03-24biology/biolibc: import biolibc-0.1.1bacon5-0/+43
Biolibc is a library of fast, memory-efficient, low-level functions for processing biological data. Like libc, it consists of numerous disparate, general-purpose functions which could be used by a wide variety of applications. These include functions for streaming common file formats such as SAM and VCF, string functions specific to bioinformatics, etc.
2021-03-21biology/Makefile: Add generandbacon1-1/+2
2021-03-21biology/generand: import generand-0.1.2bacon4-0/+24
Generate random genomic data in FASTA/FASTQ, SAM, or VCF format, suitable for small academic examples or test inputs of arbitrary size. Output can be piped directly to programs or redirected to a file and edited to taste.
2021-03-20biology/htslib: Update to 1.12bacon16-321/+53
biology/bcftools: Update to 1.12. biology/samtools: Update to 1.12. Numerous enhancements, performance improvements, and bug fixes since 1.10. Minimized pkgsrc patches in all three packages. Moved htslib to custom tarball since GitHub-generated distfiles are incomplete.
2021-03-17openbabel: make buildlink3.mk match Makefile and require eigen3prlw13-6/+6
The dependency is visible for packages using the LBFGS solver.
2021-03-09biology/molsketch: update to 0.7.1pin3-12/+15
Unfortunately, there were quite a few unintended bugs in the last version (some of them older than that, however), which are addressed by this version. Saving files and re-opening them sometimes led to crashes due to inconsistencies in the drawing's data. This should now be fixed in most, if not all, cases. Likewise, copying, cutting, and pasting are more robust now. The last version prematurely updated some code, leading to incompatibilities with older versions of Qt (especially pre-5.14). These older versions should now work again; support for Qt 4, on the other hand, is completely removed, as it is doubtful whether that still worked anyway. Translations should now really work throughout Molsketch (currently supported languages: Chinese, English, German, Greek). Finally, for Windows, an installer is provided, which will download from a repository hosted at GitHub.
2021-03-08py-cutadapt: updated to 3.3adam2-7/+7
v3.3:

* :issue:`504`: Fix a crash on Windows.
* :issue:`490`: When ``--rename`` is used with ``--revcomp``, disable adding the ``rc`` suffix to reads that were reverse-complemented.
* Also, there is now a ``{rc}`` template variable for the ``--rename`` option, which is replaced with "rc" if the read was reverse-complemented (and the empty string if not).
* :issue:`512`: Fix issue :issue:`128` once more (the “Reads written” figure in the report incorrectly included both trimmed and untrimmed reads if ``--untrimmed-output`` was used).
* :issue:`515`: The report is now sent to stderr if any output file is written to stdout.
2021-03-04(biology/py-dnaio) +BUILD_DEPENDS+= py-setuptools_scm (other than 27 are fixed)mef1-1/+3
2021-02-24biology/Makefile: + igvwiz1-1/+2
2021-02-24biology/igv: import igv-2.9.2bacon5-0/+343
The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations.