Specification for: Feature Distance Calculation Software

1. PURPOSE

Comparing the relative locations of two genomic features is an essential
task in a computational biology toolkit. The feature distance calculation
program is an optimized utility that calculates the numerical distance
(in bp) between two different segments.

2. SYSTEM STRATEGIES

The software evaluates each input DNA segment listed in the master BED
file, and outputs the signed distance to the nearest entry in the
comparison BED file. Assume the comparison BED file is in sorted order.

Various options modify the behavior of the program. It should be possible
to eliminate comparisons between identical segments. For example, if the
line:

    "chr1 100 200 id-1"

appears in both the master file and the comparison file, the line should
be ignored in the comparison file, and the distance to the next nearest
segment should be reported. This feature allows a user to quickly assess
the nearest data point in the same file.

This program should be highly optimized, and as such, should be written
in C or C++. Optionally, this program can be written in Python and link
to routines written in native code for resource-intensive calculations.

Several behavior-specific notes:

  o If the comparison file does not contain any values for the chromosome
    listed in the master file, output 'no-data-chrN', where 'N' is
    replaced with the chrom from the master file.
  o If the nearest segment is in the 3' direction of the target segment,
    the resulting value is negative.

This program is designed to be an improvement on the featDistance
utility, written by Bill Noble and Scott Kuehn. There is a reference copy
of that program in the doc directory.

3. DATA AND PROGRAM SPECIFICATIONS

The program requires the following as input:

  o A master BED file (optionally stdin) that is sorted.
  o A file of genomic DNA segments, one per line, sorted and formatted as
    UCSC BED. This is the comparison BED file.
    (optionally stdin)
  o A flag to indicate that only the same strand should be evaluated (-p).
  o A verbosity parameter. Default to a quiet setting.
    Did not include this parameter.
  o A flag to indicate that a segment should never be compared to
    itself (-n).

Calculation results are written to stdout. Verbose information is purely
diagnostic, and should be written to stderr, and consist of (at the most
verbose level) results from any underlying programs, periodic estimation
of progress, warnings, and errors.

--Did not include a verbosity flag. Errors are written to stderr along
  with a usage statement. Program is fast enough that progress estimation
  is not useful.

Shane Neph, Scott Kuehn
Thu Mar 23 11:33:21 EST 2006
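
Illustrative sketch (not part of the original specification). The
behavior described in section 2 can be sketched in Python as follows.
This is a minimal, unoptimized illustration, not the featDistance
implementation. All names (Segment, parse_bed_line, signed_distance,
nearest_distance) are hypothetical. It assumes that the distance is the
gap in bp between half-open BED intervals (0 when they overlap or abut),
that "3' direction" means higher coordinates as on the plus strand, that
"identical" means every parsed field matches, and that 'NA' is printed
when filtering leaves no candidate. The specification defines none of
these points.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Segment(NamedTuple):
    chrom: str
    start: int          # 0-based, inclusive (BED convention)
    end: int            # exclusive
    name: str = ""
    strand: str = "+"   # assumption: a missing strand column means '+'


def parse_bed_line(line: str) -> Segment:
    """Parse a whitespace-delimited BED line (columns 1-3 required)."""
    f = line.split()
    name = f[3] if len(f) > 3 else ""
    strand = f[5] if len(f) > 5 else "+"
    return Segment(f[0], int(f[1]), int(f[2]), name, strand)


def index_by_chrom(segments: Iterable[Segment]) -> Dict[str, List[Segment]]:
    """Group comparison segments by chromosome."""
    by_chrom: Dict[str, List[Segment]] = defaultdict(list)
    for seg in segments:
        by_chrom[seg.chrom].append(seg)
    return dict(by_chrom)


def signed_distance(target: Segment, other: Segment) -> int:
    """Gap in bp; negative when `other` lies at higher coordinates
    (assumed here to be the 3' direction)."""
    if other.start >= target.end:
        return -(other.start - target.end)
    if other.end <= target.start:
        return target.start - other.end
    return 0  # the segments overlap


def nearest_distance(target: Segment,
                     by_chrom: Dict[str, List[Segment]],
                     skip_self: bool = False,     # corresponds to -n
                     same_strand: bool = False    # corresponds to -p
                     ) -> str:
    candidates = by_chrom.get(target.chrom, [])
    if not candidates:
        # Chromosome names are assumed to carry their 'chr' prefix already.
        return f"no-data-{target.chrom}"
    best = None
    # Linear scan for clarity; an optimized version would exploit the
    # sorted order of the comparison file instead.
    for other in candidates:
        if skip_self and other == target:
            continue
        if same_strand and other.strand != target.strand:
            continue
        d = signed_distance(target, other)
        if best is None or abs(d) < abs(best):
            best = d
    # 'NA' is a placeholder of this sketch; the spec does not cover this case.
    return "NA" if best is None else str(best)
```

For example, with comparison entries "chr1 10 50 id-0",
"chr1 100 200 id-1" and "chr1 260 300 id-2", the master line
"chr1 100 200 id-1" yields 0 by default and 50 with skip_self=True,
because id-0 ends 50 bp before the target while id-2 starts 60 bp after
it. A master line on chr2 yields no-data-chr2.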