indexamajig - bulk indexing and data reduction program
------------------------------------------------------

The "indexamajig" program takes as input a list of diffraction image files,
currently in HDF5 format.  For each image, it attempts to find peaks and then
index the pattern.  If successful, it will measure the intensities of the peaks
at Bragg locations and produce a list in the form "h k l I", with some extra
information about the locations of the peaks.

For basic use, you need to provide the list of diffraction patterns, the method
which will be used to index, a file describing the geometry of the detector, a
PDB file which contains the unit cell which will be used for the indexing, and
an instruction that the program should output a list of intensities for each
successfully indexed pattern.  Here is what the minimal use looks like on the
command line, with each argument shown on a separate line.  In practice, you'd
put this all on one line:

indexamajig
   -i mypatternlist.lst
   --indexing=dirax
   --geometry mygeometry.geom
   -p mystructure.pdb
   --near-bragg
   -o myoutputfile.txt

More typical use includes all the above, but might also include a noise or
common mode filter (--filter-noise or --filter-cm respectively) if detector
noise causes problems for the peak detection.  The HDF5 files might be in some
folder a long way from the current directory, so you might want to specify a
full pathname to be added in front of each filename (--prefix).  You'll
probably want to run more than one indexing job at a time (-j <n>), and you
might want to correct the intensities of saturated peaks according to a list
stored elsewhere in the HDF5 file (--sat-corr):

indexamajig
   -i mypatternlist.lst
   --indexing=dirax
   --geometry mygeometry.geom
   -p mystructure.pdb
   --near-bragg
   --filter-noise
   --prefix=/some/horribly/long/pathname/ending/in/a/slash/
   -j 16
   --sat-corr
   -o myoutputfile.txt

The table of saturation values for --sat-corr should be located in the HDF5
file as follows: /processing/hitfinder/peakinfo_saturated.
It should be an n*3 two dimensional array, where the first two columns contain
x and y coordinates and the third contains the value which should belong in a
peak at location x,y.  The value will be divided by 5 and spread in a small
cross centred on that location.

See doc/geometry for information about how to create a geometry description
file.


Peak Detection
--------------

You can control the peak detection on the command line.  Firstly, you can choose
the peak detection method using "--peaks=<method>".  Currently, two possible
values for "method" are available.

"hdf5" will take the peak locations from the HDF5 file.  It expects a two
dimensional array at /processing/hitfinder/peakinfo where the size in the first
dimension is the number of peaks and the size in the second dimension is three.
The first two columns contain the x and y coordinates (see the "Note about data
orientation" in geometry.txt for details), and the third contains the
intensity.  However, the intensity will be ignored, since the pattern will
always be re-integrated using the unit cell provided by the indexer on the basis
of the peaks.

The "zaef" method uses a simple gradient search after Zaefferer (2000).  You can
control the overall threshold and minimum gradient for finding a peak using the
"--threshold" and "--min-gradient" options.  Both of these have units of "ADU"
(i.e. units of intensity according to the contents of the HDF5 file).

A minimum peak separation can also be provided in the geometry description file
(see geometry.txt for details).  This number serves two purposes.  Firstly, it
is the maximum distance allowed between the peak summit and the foot point
(where the gradient exceeds the minimum gradient).  Secondly, it is the minimum
distance allowed between one peak and another, before the later peak will be
rejected "by proximity".

You can suppress peak detection altogether for a panel in the geometry file by
specifying the "no_index" value for the panel as non-zero.
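
The "rejected by proximity" rule for the minimum peak separation can be
illustrated with a short Python sketch.  This is not the code used by
indexamajig; it only shows the idea under some assumptions: peaks arrive in the
order they were found, distances are measured in pixels, and a later peak is
compared against the peaks already accepted.  The function name
"reject_by_proximity" is made up for this example.

```python
import math


def reject_by_proximity(peaks, min_separation):
    """Return the peaks that survive a minimum-separation filter.

    peaks          : list of (x, y) tuples, in the order they were found
    min_separation : minimum allowed distance between two peaks (pixels)

    Assumption for this sketch: a later peak is dropped if it lies closer
    than min_separation to any peak that has already been accepted.
    """
    accepted = []
    for x, y in peaks:
        too_close = any(
            math.hypot(x - ax, y - ay) < min_separation
            for ax, ay in accepted
        )
        if not too_close:
            accepted.append((x, y))
    return accepted
```

With a minimum separation of 5 pixels, the second of two peaks found at (0,0)
and (1,1) would be dropped, while a third peak at (10,10) would be kept.
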


Indexing Methods
----------------

You can choose between a variety of indexing methods.  You can choose more than
one method, in which case each method will be tried in turn until the later cell
reduction step says that the cell is a "hit".  Choose from the following,
depending on what you have installed:

dirax     : invoke DirAx
mosflm    : invoke MOSFLM (DPS)
template  : index by template matching

For "dirax" and "mosflm", you need to have the dirax or ipmosflm binaries in
your PATH.  "template" is not ready for use at the moment, so don't choose that
option.

Example: --indexing=dirax,mosflm


Cell Reduction
--------------

You can choose from various options for cell reduction with the
"--cell-reduction=" option.  The choices are "none", "reduce" and "compare".
This choice is important because all autoindexing methods produce an "ab initio"
estimate of the unit cell (nine parameters), rather than just finding the
orientation of the target cell (three parameters).  It's clear that this is not
optimal, and will hopefully be fixed in future versions.

With "none", the raw cell from the autoindexer will be used.  The cell probably
won't match the target cell, but it'll still get used.  Use this option to test
whether the patterns are basically "indexable" or not, or if you don't know the
cell parameters.  In the latter case, you'll need to plot some kind of histogram
of the resulting parameters from the output stream to see which are the most
popular.  If you're lucky, this will reveal the true unit cell.

With "reduce", linear combinations of the raw cell will be checked against the
target cell.  If at least one candidate is found for each axis of the target
cell, the angles will be checked for correspondence.  If a match is found, this
cell will be used for further processing.  This option should generate the most
matches, but might produce spurious results in many cases.  The "--check-sanity"
option can help with this.

The "compare" method is like "reduce", but linear combinations are not taken.
That means that the cell must either match directly or match after a simple
permutation of the axes.  This is useful when the target cell is subject to
reticular twinning, such as if one cell axis length is close to twice another.
With "reduce", there is a possibility that the axes might be confused in this
situation.  This happens for lysozyme (1VDS), so watch out.

The tolerance for matching with "reduce" and "compare" is hardcoded as 5% in the
reciprocal axis lengths and 1.5 degrees in the (reciprocal) angles.  Cells from
these reduction routines are further constrained to be right-handed.  The
unmatched raw cell might be left-handed: CrystFEL doesn't check this for you.
Always using a right-handed cell means that the Bijvoet pairs can be told apart.


A Note about Unit Cell Settings
-------------------------------

CrystFEL's core symmetry module only knows about one setting for each unit cell.
You must use the same setting.  That means that the unique axis (for cells which
have one) must be "c".


Unconventional Use
------------------

There are some less often used options, for example "--dump-peaks" to dump the
peak locations found by the peak search (in turn presented to the indexer).
This might be useful if you want to check the performance of the peak finder.

If you run a large dataset with both --dump-peaks and --near-bragg enabled,
you'll generate a large amount of data.  To separate the peaks from the indexed
peaks, use scripts/stream-split as follows:

scripts/stream-split myoutputfile.txt indexed.txt peaks.txt

This generates both indexed.txt and peaks.txt.  One of the last two arguments
can be "/dev/null" if you're only interested in the other.


"Gotchas"
---------

Don't run more than one indexamajig job simultaneously in the same working
directory - they'll overwrite each other's DirAx files, causing subtle problems
which can't easily be detected.
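
One way to avoid this is to give every job its own working directory.  The
Python sketch below is only an illustration: the helper names "build_job" and
"run_job", the "jobs" directory and the file names are made up for this example,
and only options shown earlier in this document are used.  Input paths are made
absolute because the job no longer runs in the directory it was launched from.

```python
import subprocess
from pathlib import Path


def build_job(job_name, pattern_list, geometry, pdb, base_dir="jobs"):
    """Create a private working directory and return (workdir, argv).

    All names here are hypothetical; adjust them to your own layout.
    """
    workdir = Path(base_dir).resolve() / job_name
    workdir.mkdir(parents=True, exist_ok=True)
    argv = [
        "indexamajig",
        "-i", str(Path(pattern_list).resolve()),
        "--indexing=dirax",
        "--geometry", str(Path(geometry).resolve()),
        "-p", str(Path(pdb).resolve()),
        "--near-bragg",
        "-o", str(workdir / "myoutputfile.txt"),
    ]
    return workdir, argv


def run_job(workdir, argv):
    """Run one job with its working directory set to workdir.

    Files written to the working directory by this job then stay apart
    from those of any other job.
    """
    return subprocess.run(argv, cwd=workdir, check=True)
```

A job would then be started with something like
run_job(*build_job("run1", "mypatternlist.lst", "mygeometry.geom",
"mystructure.pdb")).  If the pattern list itself contains relative filenames,
an absolute --prefix is probably needed as well, since those names would
otherwise be looked up from the new working directory.
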