doc/indexamajig.txt


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219

indexamajig - bulk indexing and data reduction program
------------------------------------------------------

The "indexamajig" program takes as input a list of diffraction image files,
currently in HDF5 format.  For each image, it attempts to find peaks and then
index the pattern.  If successful, it will measure the intensities of the peaks
at Bragg locations and produce a list in the form "h k l I", with some extra
information about the locations of the peaks.

For minimal basic use, you need to provide the list of diffraction patterns,
the method which will be used to index, a file describing the geometry of the
detector, a PDB file which contains the unit cell which will be used for the
indexing, and that you'd like the program to output a list of intensities for
each successfully indexed pattern.  Here is what the minimal use might look like
on the command line:

indexamajig -i mypatterns.lst -j 10 \
            -g mygeometry.geom \
            --indexing=mosflm,dirax --peaks=hdf5 \
            --cell-reduction=reduce \
            -b myxfel..beam \
            -o test.stream -p mycell.pdb \
            --record=integrated

More typical use includes all the above, but might also include a noise or
common mode filter (--filter-noise or --filter-cm respectively) if detector
noise causes problems for the peak detection.  The HDF5 files might be in some
folder a long way from the current directory, so you might want to specify a
full pathname to be added in front of each filename.  You'll probably want to
run more than one indexing job at a time (-j <n>).

You can include a table of saturation values for in the HDF5 file, if you have
a method for estimating the intensities of saturated peaks.  It goes in
/processing/hitfinder/peakinfo_saturated, and should be an n*3 two dimensional
array, where the first two columns contain fast scan and slow scan coordinates
(in that order) and the third contains the value which should belong in a peak
at the given location.  The value will be divided by 5 and spread in a small
cross centred on that location.

See doc/geometry for information about how to create a geometry description
file.

You can control what information is included in the output stream using
' --record=<flags>'.  Possible flags are:

 pixels            Include a list of sums of pixel values within the
                    integration domain, correcting for individual pixel
                    solid angles.

 integrated        Include a list of reflection intensities, produced by
                    integrating around predicted peak locations.

 peaks             Include peak locations and intensities from the peak
                    search.

 peaksifindexed    As 'peaks', but only if the pattern could be indexed.

 peaksifnotindexed As 'peaks', but only if the pattern could NOT be indexed.

So, if you just want the integrated intensities of indexed peaks, use
"--record=integrated".  If you just want to check that the peak detection is
working, used "--record=peaks".  If you want the integrated peaks for the
indexable patterns, but also want to check the peak detection for the patterns
which could not be indexed, you might use
"--record=integrated,peaksifnotindexed" and then use "check-peak-detection" from
the "scripts" folder to visualise the results of the peak detection.


Peak Detection
--------------

You can control the peak detection on the command line.  Firstly, you can choose
the peak detection method using "--peaks=<method>".  Currently, two possible
values for "method" are available.  "hdf5" will take the peak locations from the
HDF5 file.  It expects a two dimensional array at /processing/hitfinder/peakinfo
where size in the first dimension is the number of peaks and the size in the
second dimension is three.  The first two columns contain the x and y
coordinate (see the "Note about data orientation" in geometry.txt for details),
the third contains the intensity.  However, the intensity will be ignored since
the pattern will always be re-integrated using the unit cell provided by the
indexer on the basis of the peaks.

The "zaef" method uses a simple gradient search after Zaefferer (2000).  You can
control the overall threshold and minimum gradient for finding a peak using the
"--threshold" and "--min-gradient" options.  Both of these have units of "ADU"
(i.e. units of intensity according to the contents of the HDF5 file).

A minimum peak separation can also be provided in the geometry description file
(see geometry.txt for details).  This number serves two purposes.  Firstly,
it is the maximum distance allowed between the peak summit and the foot point
(where the gradient exceeds the minimum gradient).  Secondly, it is the minimum
distance allowed between one peak and another, before the later peak will be
rejected "by proximity".

You can suppress peak detection altogether for a panel in the geometry file by
specifying the "no_index" value for the panel as non-zero.


Indexing Methods
----------------

You can choose between a variety of indexing methods.  You can choose more than
one method, in which case each method will be tried in turn until the later cell
reduction step says that the cell is a "hit".  Choose from:

 dirax    : invoke DirAx
 mosflm   : invoke MOSFLM (DPS)

Depending on what you have installed.  For "dirax" and "mosflm", you need to
have the dirax or ipmosflm binaries in your PATH.

Example:  --indexing=dirax,mosflm


Cell Reduction
--------------

You can choose from various options for cell reduction with the
"--cell-reduction=" option.  The choices are "none", "reduce" and "compare".
This choice is important because all autoindexing methods produce an "ab
initio" estimate of the unit cell (nine parameters), rather than just finding
the orientation of the target cell (three parameters).  It's clear that this is
not optimal, and will hopefully be fixed in future versions.

With "none", the raw cell from the autoindexer will be used.  The cell probably
won't match the target cell, but it'll still get used.  Use this option to test
whether the patterns are basically "indexable" or not, or if you don't know the
cell parameters.  In the latter case, you'll need to plot some kind of histogram
of the resulting parameters from the output stream to see which are the most
popular.  If you're lucky, this will reveal the true unit cell.

With "reduce", linear combinations of the raw cell will be checked against the
target cell.  If at least one candidate is found for each axis of the target
cell, the angles will be checked to correspondence.  If a match is found, this
cell will be used for further processing.  This option should generate the most
matches, but might produce spurious results in many cases.  The predicted peaks
are always checked to verify that at least 10% of the predicted peaks are close
to peaks located by the peak search.  If not, the next candidate unit cell is
tried until there are no more options.

The "compare" method is like "reduce", but linear combinations are not taken.
That means that the cell must either match or match after a simple permutation
of the axes.  This is useful when the target cell is subject to reticular
twinning, such as if one cell axis length is close to twice another.  With
"reduce", there is a possibility that the axes might be confused in this
situation.  This happens for lysozyme (1VDS), so watch out.

The tolerance for matching with "reduce" and "compare" is hardcoded as 5% in
the reciprocal axis lengths and 1.5 degrees in the (reciprocal) angles.  Cells
from these reduction routines are further constrained to be right-handed.  The
unmatched raw cell might be left-handed: CrystFEL doesn't check this for you.
Always using a right-handed cell means that the Bijvoet pairs can be told
apart.

If the unit cell is centered (i.e. if the space group begins with I, R, C, A or
F), you should be careful when using "compare" for the cell reduction, since
(for example) DirAx will always find a primitive unit cell, and this cell must
be converted to the non-primitive conventional cell from the PDB.


Tuning CPU affinities for NUMA hardware
---------------------------------------

If you are running indexamajig on a NUMA (non-uniform memory architecture)
machine, a performance gain can sometimes be made by preventing the kernel from
allowing a process or thread to run on a CPU which is distant from the one on
which it started.  Distance, in this context, might mean that the CPU is able to
access all the memory visible to the original CPU, but perhaps only relatively
slowly via a cable link.  In many cases a group of CPUs will have direct access
to a certain region of memory, and so the process may be scheduled on any CPU in
that group without any penalty.  However, scheduling the process to any CPU
outside the group may be slow.  When running under Linux, indexamajig is able to
avoid such sub-optimal process scheduling by setting CPU affinities for its
threads.  The CPU affinities are also inherited by subprocesses (e.g. MOSFLM or
DirAx).

To do this usefully, you need to give indexamajig some information about your
hardware's architecture.  Specify the size of the CPU groups using
"--cpugroup=<n>".  You also need to specify the overall number of CPUs, so that
the program knows when to 'wrap around'.  Using "--cpuoffset=<n>", where "n" is
a group number (not a CPU number), allows you to manually skip a few CPUs, which
may be useful if you do not want to use all the available CPUs but want to avoid
running all your jobs on the same ones.

Note that specifying the above options is NOT the same thing as giving the
number of analyses to run in parallel (the 'number of threads'), which is done
with "-j <n>".  The CPU tuning options provide information to indexamajig about
how to set the CPU affinities for its threads, but it does not specify how many
threads to use.

Example: 72-core Altix UV 100 machine at the author's institution

This machine consists of six blades, each containing two 6-core CPUs and some
local memory.  Any CPU on any blade can access the memory on any other blade,
but the access will be slow compared to accessing memory on the same blade.
When running two instances of indexamajig, a sensible choice of parameters might
be:

1: --cpus=72 --cpugroup=12 --cpuoffset=0 -j 36
2: --cpus=72 --cpugroup=12 --cpuoffset=36 -j 36

This would dedicate half of the CPUs to one instance, and the other half to the
other.


A Note about Unit Cell Settings
-------------------------------

CrystFEL's core symmetry module only knows about one setting for each unit cell.
You must use the same setting.  That means that the unique axis (for cells which
have one) must be "c".


"Gotchas"
---------

Don't run more than one indexamajig jobs simultaneously in the same working
directory - they'll overwrite each other's DirAx or MOSFLM files, causing subtle
problems which can't easily be detected.