The authors of the original article note that applying transformers in 3D to a full voxel grid would blow up the number of attention values. With dense self-attention every voxel attends to every other voxel, so the total number of values is (N_x · N_y · N_z)^2, where N_x, N_y and N_z are the numbers of voxels along the x, y and z axes respectively. For a cubic grid with N_x = N_y = N_z = N this is N^6, which at the resolution considered comes to about one petabyte of data. The solution is to compute the attention for a given voxel using only the non-empty voxels located within a fixed window, as shown in the figure.
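To get a feel for the scale, the count can be worked out directly. The snippet below is only a back-of-the-envelope sketch: the grid size of 256 voxels per axis, the single attention head and the 4-byte float32 storage are assumptions chosen for illustration, not figures taken from the article, and the helper names are hypothetical.

```python
def dense_attention_values(nx: int, ny: int, nz: int) -> int:
    """Number of pairwise attention scores when every voxel attends to every voxel."""
    n_voxels = nx * ny * nz
    return n_voxels ** 2


def to_pebibytes(n_values: int, bytes_per_value: int = 4) -> float:
    """Convert a count of values to pebibytes (2**50 bytes); float32 by default."""
    return n_values * bytes_per_value / 2 ** 50


# Assumed grid size, for illustration only (not taken from the article).
N = 256
values = dense_attention_values(N, N, N)  # equals N**6 for a cubic grid
print(values)                # 281474976710656
print(to_pebibytes(values))  # 1.0
```

Under these assumptions a single dense attention map already takes about one pebibyte, which is why restricting attention to a small window of non-empty voxels matters so much.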

Attention is also computed for the neighbors of filled voxels (dilated voxels, shaded in yellow). This is necessary to preserve the completeness of the geometric structure of the reconstructed surface, which is usually lost after downsampling blocks with 3D convolutions.

Now we are ready to look at the complete diagram of one 3D SDF Former and finish the analysis of the model. What is going on here? Let's take a look. First, we take a volumetric feature map V as input and run it through a sparse 3D CNN, obtaining a reduced feature map, Down. Then we concatenate it with the attention computed for the dilated voxels and pass it all together to 3D Sparse Window Attention. We repeat the procedure described above several times in a row.

It might seem that we could now decode the extracted features into a TSDF, but it is not that simple. All the extracted features are quite local, limited to a given window. The result is a loss of global relationships within the scene, in other words a small receptive field.
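The locality problem is easy to see in a toy version of window attention. The sketch below is not the authors' implementation: it uses raw features as queries, keys and values (no learned projections, a single head), assumes the query voxel is itself non-empty, and all names (`window_neighbors`, `window_attention`, `radius`) are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Voxel = Tuple[int, int, int]
Features = Dict[Voxel, List[float]]  # sparse grid: only non-empty voxels are stored


def window_neighbors(center: Voxel, occupied: Features, radius: int) -> List[Voxel]:
    """Non-empty voxels inside the cubic window of half-width `radius` around `center`."""
    return [
        v for v in occupied
        if all(abs(a - b) <= radius for a, b in zip(v, center))
    ]


def window_attention(center: Voxel, occupied: Features, radius: int) -> List[float]:
    """Scaled dot-product attention of `center` over its in-window non-empty voxels."""
    query = occupied[center]
    neighbors = window_neighbors(center, occupied, radius)
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, occupied[v])) / scale for v in neighbors
    ]
    top = max(scores)  # subtract the maximum for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * occupied[v][i] for w, v in zip(weights, neighbors)) / total
        for i in range(len(query))
    ]


# Two nearby voxels and one far away; the far one lies outside the window.
grid: Features = {
    (0, 0, 0): [1.0, 0.0],
    (1, 0, 0): [0.0, 1.0],
    (10, 10, 10): [5.0, 5.0],
}
near_only = window_attention((0, 0, 0), grid, radius=2)

grid[(10, 10, 10)] = [-100.0, 100.0]  # change the distant voxel
after_change = window_attention((0, 0, 0), grid, radius=2)
print(near_only == after_change)  # True
```

Changing the distant voxel has no effect on the output at the origin, which is the small receptive field described above: information can only travel as far as the window allows.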

