Filter Support in netCDF-4 (Enhanced)
============================

<!-- double header is needed to workaround doxygen bug -->

Filter Support in netCDF-4 (Enhanced) {#compress}
=================================

Introduction {#compress_intro}
==================

The HDF5 library (1.8.11 and later)
supports a general filter mechanism to apply various
kinds of filters to datasets before reading or writing.
The netCDF enhanced (aka netCDF-4) library inherits this
capability since it depends on the HDF5 library.

Filters assume that a variable has chunking
defined and each chunk is filtered before
writing and "unfiltered" after reading and
before passing the data to the user.

The most common kind of filter is a compression-decompression
filter, and that is the focus of this document.

HDF5 supports dynamic loading of compression filters using the following
process for reading of compressed data.

1. Assume that we have a dataset with one or more variables that
were compressed using some algorithm. How the dataset was compressed
will be discussed subsequently.

2. Shared libraries or DLLs exist that implement the compress/decompress
algorithm. These libraries have a specific API so that the HDF5 library
can locate, load, and utilize the compressor.
These libraries are expected to be installed in a specific
directory.

Enabling A Compression Filter {#Enable}
=============================

In order to compress a variable, the netcdf-c library
must be given three pieces of information:
(1) some unique identifier for the filter to be used,
(2) a vector of parameters for
controlling the action of the compression filter, and
(3) a shared library implementation of the filter.

The meaning of the parameters is, of course,
completely filter dependent and the filter
description [3] needs to be consulted. For
bzip2, for example, a single parameter is provided
representing the compression level.
It is legal to provide a zero-length set of parameters.
Defaults are not provided, so this assumes that
the filter can operate with zero parameters.

Filter ids are assigned by the HDF group. See [4]
for a current list of assigned filter ids.
Note that ids above 32767 can be used for testing without
registration.

The first two pieces of information can be provided in one of three ways:
using __ncgen__, via an API call, or via command line parameters to __nccopy__.
In any case, remember that filtering also requires setting chunking, so the
variable must also be marked with chunking information.

The necessary API methods are included in __netcdf.h__ by default.
One API method is for setting the filter to be used
when writing a variable. The relevant signature is
as follows.

    int nc_def_var_filter(int ncid, int varid, unsigned int id, size_t nparams, const unsigned int* parms);

This must be invoked after the variable has been created and before
__nc_enddef__ is invoked.

A second API method makes it possible to query a variable to
obtain information about any associated filter using this signature.

    int nc_inq_var_filter(int ncid, int varid, unsigned int* idp, size_t* nparams, unsigned int* params);

The filter id will be returned in the __idp__ argument (if non-NULL),
the number of parameters in __nparams__ and the actual parameters in
__params__. As is usual with the netcdf API, one is expected to call
this function twice: the first time to get __nparams__ and the
second to get the parameters in client-allocated memory.

In a CDL file, compression of a variable can be specified
by annotating it with the following attribute:

* ''_Filter'' -- a string containing a comma separated list of
constants specifying (1) the filter id to apply, and (2)
a vector of constants representing the
parameters for controlling the operation of the specified filter.
See the section on the <a href="#Syntax">parameter encoding syntax</a>
for the details on the allowable kinds of constants.

This is a "special" attribute, which means that
it will normally be invisible when using
__ncdump__ unless the -s flag is specified.

Example CDL File (Data elided)
------------------------------

    dim0 = 4 ; dim1 = 4 ; dim2 = 4 ; dim3 = 4 ;
    float var(dim0, dim1, dim2, dim3) ;
    var:_Filter = "307,9" ;
    var:_Storage = "chunked" ;
    var:_ChunkSizes = 4, 4, 4, 4 ;

Using nccopy {#NCCOPY}
------------

When copying a netcdf file using __nccopy__ it is possible
to specify filter information for any output variable by
using the "-F" option on the command line; for example:

    nccopy -F "var,307,9" unfiltered.nc filtered.nc

Assume that __unfiltered.nc__ has a chunked but not bzip2 compressed
variable named "var". This command will create that variable in
the __filtered.nc__ output file but using the filter with id 307
(i.e. bzip2) and with parameter(s) 9 indicating the compression level.
See the section on the <a href="#Syntax">parameter encoding syntax</a>
for the details on the allowable kinds of constants.

The "-F" option can be used repeatedly as long as the variable name
part is different. A different filter id and parameters can be
specified for each occurrence.

Note that if the input file has compressed variables, that fact
will be invisible to nccopy because it is handled within the
netcdf-c/hdf5 library code. This is true for any program that calls
the netcdf-c library.

Parameter Encoding {#ParamEncode}
==================

The parameters passed to a filter are encoded internally as a vector
of 32-bit unsigned integers. It may be that the parameters
required by a filter can naturally be encoded as unsigned integers.
The bzip2 compression filter, for example, expects a single
integer value from zero through nine. This encodes naturally as a
single unsigned integer.

Note that signed integers and single-precision (32-bit) float values
also can easily be represented as 32-bit unsigned integers by
proper casting to an unsigned integer so that the bit pattern
is preserved. Simple integer values of type short or char
(or the unsigned versions) can also be mapped to an unsigned
integer by truncating to 16 or 8 bits respectively and then
zero extending to 32 bits.

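The casting described above can be sketched in C roughly as follows. The helper names (`param_from_float` and so on) are hypothetical and are not part of the netcdf-c API; the sketch also assumes IEEE 754 single-precision floats. It uses `memcpy` for the float case so that the bit pattern is copied rather than the value being numerically converted.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers for illustration; not part of netcdf-c. */

/* Reinterpret the bits of a 32-bit float as a 32-bit unsigned integer. */
static uint32_t param_from_float(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* copies the bit pattern unchanged */
    return u;
}

/* Inverse operation, as a filter might apply it to a received parameter. */
static float param_to_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Signed 32-bit integer: the conversion keeps the two's-complement bits. */
static uint32_t param_from_int(int32_t i)
{
    return (uint32_t)i;
}

/* Short: truncate to 16 bits, then zero extend to 32 bits. */
static uint32_t param_from_short(int16_t s)
{
    return (uint32_t)(uint16_t)s;
}
```

With these helpers, a float parameter survives a round trip through the unsigned representation, and a negative short occupies only the low 16 bits of the resulting parameter.
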
Machine byte order (aka endian-ness) is an issue for passing
some kinds of parameters. You might define the parameters when
compressing on a little endian machine, but later do the
decompression on a big endian machine. Byte order is not an
issue for 32-bit values because HDF5 takes care of converting
them between the local machine byte order and network byte
order.

Parameters whose size is larger than 32 bits present a byte order problem.
This typically includes double precision floats and (signed or unsigned)
64-bit integers. For these cases, the machine byte order must be
handled by the compression code. This is because HDF5 will treat,
for example, an unsigned long long as two 32-bit unsigned integers
and will convert each to network order separately. This means that
on a machine whose byte order differs from that of the machine on which
the parameters were initially created, the two integers are out of order
and must be swapped to get the correct unsigned long long value.
Consider this example. Suppose we have this little endian unsigned long long.

In network byte order, it will be stored as two 32-bit integers.

On a big endian machine, this will be given to the filter in that form.

But note that the proper big endian unsigned long long form is this.

So, the two words need to be swapped.

But consider the case when both original and final machines are big endian.

where #1 is the original number, #2 is the network order and
#3 is what is given to the filter. In this case we do not
want to swap words.

The solution is to forcibly encode the original number using some
specified endianness so that the filter always assumes it is getting
its parameters in that order and will always do swapping as needed.
This is irritating, but one needs to be aware of it. Since most
machines are little-endian, we choose to use that as the endianness
for handling 64-bit entities.

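As a sketch of the idea of fixing one word order for 64-bit parameters, the following C code splits a 64-bit value into two 32-bit parameters, low-order word first, and joins them again on the filter side. The helper names are hypothetical, and the low-word-first layout is an assumption made for illustration; it is not claimed to be the exact layout produced by the netcdf-c utilities. Because the split uses shifts and masks instead of copying memory, the result does not depend on the byte order of the host.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of netcdf-c. */

/* Split a 64-bit value into two 32-bit parameters, low-order word first. */
static void param_split64(uint64_t v, unsigned int words[2])
{
    words[0] = (unsigned int)(v & 0xFFFFFFFFu);  /* low 32 bits  */
    words[1] = (unsigned int)(v >> 32);          /* high 32 bits */
}

/* Rebuild the 64-bit value from the two parameters inside the filter. */
static uint64_t param_join64(const unsigned int words[2])
{
    return ((uint64_t)words[1] << 32) | (uint64_t)words[0];
}
```

Since each 32-bit word is converted individually by HDF5, a filter that always joins the words in this agreed order recovers the original value regardless of which kind of machine wrote the parameters.
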
Filter Specification Syntax {#Syntax}
===========================

Both of the utilities
<a href="#NCGEN">__ncgen__</a>
and
<a href="#NCCOPY">__nccopy__</a>
allow the specification of filter parameters.
These specifications consist of a sequence of comma
separated constants. The constants are converted
within the utility to a proper set of unsigned int
constants (see the <a href="#ParamEncode">parameter encoding section</a>).

To simplify things, various kinds of constants can be specified
rather than just simple unsigned integers. The utilities will encode
them properly using the rules specified in
the <a href="#ParamEncode">parameter encoding section</a>.

The currently supported constants are as follows.

<table>
<tr halign="center"><th>Example<th>Type<th>Format Tag<th>Notes
<tr><td>-17b<td>signed 8-bit byte<td>b|B<td>Truncated to 8 bits and zero extended to 32 bits
<tr><td>23ub<td>unsigned 8-bit byte<td>u|U b|B<td>Truncated to 8 bits and zero extended to 32 bits
<tr><td>-25S<td>signed 16-bit short<td>s|S<td>Truncated to 16 bits and zero extended to 32 bits
<tr><td>27US<td>unsigned 16-bit short<td>u|U s|S<td>Truncated to 16 bits and zero extended to 32 bits
<tr><td>-77<td>implicit signed 32-bit integer<td>Leading minus sign and no tag<td>
<tr><td>77<td>implicit unsigned 32-bit integer<td>No tag<td>
<tr><td>93U<td>explicit unsigned 32-bit integer<td>u|U<td>
<tr><td>789f<td>32-bit float<td>f|F<td>
<tr><td>12345678.12345678d<td>64-bit double<td>d|D<td>Network byte order
<tr><td>-9223372036854775807L<td>64-bit signed long long<td>l|L<td>Network byte order
<tr><td>18446744073709551615UL<td>64-bit unsigned long long<td>u|U l|L<td>Network byte order
</table>

1. In all cases, except for an untagged positive integer,
the format tag is required and determines how the constant
is converted to one or two unsigned int values.
The positive integer case is for backward compatibility.
2. For signed byte and short, the value is sign extended to 32 bits
and then treated as an unsigned int value.
3. For double, and signed|unsigned long long, they are converted
to network byte order and then treated as two unsigned int values.
This is consistent with the <a href="#ParamEncode">parameter encoding</a>.

Dynamic Loading Process {#Process}
=======================

The documentation[1,2] for the HDF5 dynamic loading was (at the time
this was written) out-of-date with respect to the actual HDF5 code
(see H5PL.c). So, the following discussion is largely derived
from looking at the actual code. This means that it is subject to change.

Plugin directory {#Plugindir}
----------------

The HDF5 loader expects plugins to be in a specified plugin directory.
The default directory is:

* "/usr/local/hdf5/lib/plugin" for Linux/Unix operating systems (including Cygwin)
* "%ALLUSERSPROFILE%\\hdf5\\lib\\plugin" for Windows systems, although the code
does not appear to explicitly use this path.

The default may be overridden using the environment variable
__HDF5_PLUGIN_PATH__.

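The override rule can be pictured with a small C sketch. The function name `plugin_search_path` and the macro `DEFAULT_PLUGIN_DIR` are hypothetical, and the code only illustrates the "environment variable wins, otherwise use the default" behavior described above for Linux/Unix; it is not the lookup code used inside HDF5.

```c
#include <stdlib.h>

/* Default location quoted above for Linux/Unix systems. */
#define DEFAULT_PLUGIN_DIR "/usr/local/hdf5/lib/plugin"

/* Hypothetical helper: return the directory that would be searched. */
static const char *plugin_search_path(void)
{
    const char *p = getenv("HDF5_PLUGIN_PATH");
    return (p != NULL) ? p : DEFAULT_PLUGIN_DIR;  /* env var overrides default */
}
```

Printing the result of such a helper before opening a file can be a quick way to check which directory a misbehaving run is likely to be searching.
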
Plugin Library Naming {#Pluginlib}
---------------------

Given a plugin directory, HDF5 examines every file in that
directory that conforms to a specified name pattern
as determined by the platform on which the library is being executed.

<table>
<tr halign="center"><th>Platform<th>Basename<th>Extension
<tr halign="left"><td>Linux<td>lib*<td>.so*
<tr halign="left"><td>OSX<td>lib*<td>.so*
<tr halign="left"><td>Cygwin<td>cyg*<td>.dll*
<tr halign="left"><td>Windows<td>*<td>.dll
</table>

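To check whether a given file name would fit one of these patterns, a shell-style glob match is enough. The sketch below uses the POSIX `fnmatch` function; the helper name `looks_like_plugin` is hypothetical, and this is only an illustration of the naming rule, not the matching code that HDF5 itself uses.

```c
#include <fnmatch.h>

/* Hypothetical helper: 1 if name matches the glob pattern, else 0.
   Example patterns from the table: "lib*.so*" (Linux), "cyg*.dll*" (Cygwin). */
static int looks_like_plugin(const char *name, const char *pattern)
{
    return fnmatch(pattern, name, 0) == 0;
}
```

For instance, `looks_like_plugin("libbzip2.so", "lib*.so*")` is true, while a file named `bzip2.dll` does not match the Linux pattern.
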
Plugin Verification {#Pluginverify}
-------------------

For each dynamic library located using the previous patterns,
HDF5 attempts to load the library and attempts to obtain information
from it. Specifically, it looks for two functions with the following
signatures.

1. __H5PL_type_t H5PLget_plugin_type(void)__ --
This function is expected to return the constant value
__H5PL_TYPE_FILTER__ to indicate that this is a filter library.
2. __const void* H5PLget_plugin_info(void)__ --
This function returns a pointer to a table of type __H5Z_class2_t__.
This table contains the necessary information needed to utilize the
filter both for reading and for writing. In particular, it specifies
the filter id implemented by the library and it must match that id
specified for the variable in __nc_def_var_filter__ in order to be used.

If plugin verification fails, then that plugin is ignored and
the search continues for another, matching plugin.

Debugging plugins can be very difficult. You will probably
need to use the old printf approach for debugging the filter itself.

One case worth mentioning is when you have a dataset that is
using an unknown filter. For this situation, you need to
identify what filter(s) are used in the dataset. This can
be accomplished using this command.

    ncdump -s -h <dataset filename>

Since ncdump is not being asked to access the data (the -h flag), it
can obtain the filter information without failures. Then it can print
out the filter id and the parameters (the -s flag).

Test Case {#TestCase}
=========

Within the netcdf-c source tree, the directory
__netcdf-c/nc_test4__ contains a test case (__test_filter.c__) for
testing dynamic filter writing and reading using
bzip2. Another test (__test_filter_misc.c__) validates
parameter passing. These tests are disabled if __--enable-shared__
is not set or if __--enable-netcdf-4__ is not set.

A slightly simplified version of the filter test case is also
available as an example within the netcdf-c source tree
directory __netcdf-c/examples/C__. The test is called __filter_example.c__
and it is executed as part of the __run_examples4.sh__ shell script.
The test case demonstrates dynamic filter writing and reading.

The files __examples/C/hdf5plugins/Makefile.am__
and __examples/C/hdf5plugins/CMakeLists.txt__
demonstrate how to build the hdf5 plugin for bzip2.

The current matrix of operating systems and build systems known to work is as follows.

<table>
<tr><th>Build System<th>Supported OS
<tr><td>Automake<td>Linux, Cygwin
<tr><td>CMake<td>Linux, Cygwin, Visual Studio
</table>

If you do not want to use Automake or CMake, the following
has been known to work.

    gcc -g -O0 -shared -o libbzip2.so <plugin source files> -L${HDF5LIBDIR} -lhdf5_hl -lhdf5 -L${ZLIBDIR} -lz

Appendix A. Byte Swap Code {#AppendixA}
==========================

Since in some cases it is necessary for a filter to
byte swap from little-endian to big-endian, this appendix
provides sample code for doing this. It also provides
a code snippet for testing the endianness of a machine.

Byte swap an 8-byte chunk of memory

    byteswap8(unsigned char* mem)
    register unsigned char c;

Test for Machine Endianness

    static const unsigned char b[4] = {0x0,0x0,0x0,0x1}; /* value 1 in big-endian*/
    int endianness = (1 == *(unsigned int*)b); /* 1=>big 0=>little endian */

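A self-contained variant of the two operations in this appendix might look like the following. It is a sketch rather than the code shipped with netcdf-c: the function names `swap8_in_place` and `host_is_big_endian` are hypothetical, and the endianness test inspects the first byte of a known value through `memcpy` instead of a pointer cast.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reverse the order of the 8 bytes starting at mem, in place. */
static void swap8_in_place(unsigned char *mem)
{
    for (size_t i = 0; i < 4; i++) {
        unsigned char c = mem[i];
        mem[i] = mem[7 - i];
        mem[7 - i] = c;
    }
}

/* Return 1 on a big-endian host and 0 on a little-endian host. */
static int host_is_big_endian(void)
{
    const uint32_t one = 1;
    unsigned char first;
    memcpy(&first, &one, 1);   /* lowest-addressed byte of the value 1 */
    return first == 0;         /* big-endian stores the zero byte first */
}
```

A filter could call `host_is_big_endian()` once and apply `swap8_in_place` to each 64-bit parameter only when the host order differs from the order the parameters were encoded in.
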
References {#References}
==========

1. https://support.hdfgroup.org/HDF5/doc/Advanced/DynamicallyLoadedFilters/HDF5DynamicallyLoadedFilters.pdf
2. https://support.hdfgroup.org/HDF5/doc/TechNotes/TechNote-HDF5-CompressionTroubleshooting.pdf
3. https://support.hdfgroup.org/services/filters.html
4. https://support.hdfgroup.org/services/contributions.html#filters