在95％的情况下值为0或1时，是否可以对很大的数组进行随机访问优化？

是否有可能对非常大的数组进行随机访问进行任何优化（我目前使用uint8_t，并且我在询问更好的方法）

uint8_t MyArray[10000000]; 当数组中任意位置的值是

95％的情况下为0或1，在4％的情况下，有 2 ％在其他1％的情况下介于3到255之间？那么，有什么比uint8_t数组更好的方法了吗？应该尽可能快地以随机顺序遍历整个阵列，这在RAM带宽上非常繁重，因此当有多个线程同时对不同的阵列执行此操作时，当前整个RAM带宽很快就饱和了。

我要问的是，当实际上知道几乎所有值（除5％之外）几乎都是0或1时，拥有如此大的数组（10 MB）感觉效率很低。因此，当数组中所有值的95％实际上只需要1位而不是8位，这将使内存使用量减少近一个数量级。感觉必须要有一个内存效率更高的解决方案，该解决方案将大大减少为此所需的RAM带宽，因此也大大加快了随机访问的速度。

问题来源于stack overflow

我想到的一个简单可能性是，在常见情况下，每个值保留一个压缩数组，每个值2位，而每个值分隔一个4字节（原始元素索引为24位，实际值为8位，因此(idx << 8) | value)）其他的。

查找值时，首先要在2bpp数组（O（1））中进行查找。如果找到0、1或2，则为您想要的值；如果找到3，则意味着您必须在辅助阵列中查找它。在这里，您将执行二进制搜索以查找感兴趣的索引左移8（O（log（n）的n较小，因为它应该是1％），然后从4-中提取值字节的东西。

std::vector<uint8_t> main_arr; std::vector<uint32_t> sec_arr;

uint8_t lookup(unsigned idx) { // extract the 2 bits of our interest from the main array uint8_t v = (main_arr[idx>>2]>>(2*(idx&3)))&3; // usual (likely) case: value between 0 and 2 if(v != 3) return v; // bad case: lookup the index<<8 in the secondary array // lower_bound finds the first >=, so we don't need to mask out the value auto ptr = std::lower_bound(sec_arr.begin(), sec_arr.end(), idx<<8); #ifdef _DEBUG // some coherency checks if(ptr == sec_arr.end()) std::abort(); if((*ptr >> 8) != idx) std::abort(); #endif // extract our 8-bit value from the 32 bit (index, value) thingie return (*ptr) & 0xff; }

void populate(uint8_t *source, size_t size) { main_arr.clear(); sec_arr.clear(); // size the main storage (round up) main_arr.resize((size+3)/4); for(size_t idx = 0; idx < size; ++idx) { uint8_t in = source[idx]; uint8_t &target = main_arr[idx>>2]; // if the input doesn't fit, cap to 3 and put in secondary storage if(in >= 3) { // top 24 bits: index; low 8 bit: value sec_arr.push_back((idx << 8) | in); in = 3; } // store in the target according to the position target |= in << ((idx & 3)2); } } 对于您建议的数组，第一个数组应为10000000/4 = 2500000字节，第二个数组应为10000000 * 1％ 4 B = 400000字节；因此，有2900000个字节，即少于原始数组的三分之一，并且最常使用的部分全部保存在内存中，这对于缓存来说应该是不错的选择（它甚至可以容纳L3）。

如果您需要超过24位寻址，则必须调整“辅助存储”。扩展它的一种简单方法是使用256个元素的指针数组来切换索引的前8位，然后转发到如上所述的24位索引排序数组。

快速基准 #include #include #include <stdint.h> #include #include <stdio.h> #include <math.h>

using namespace std::chrono;

/// XorShift32 generator; extremely fast, 2^32-1 period, way better quality /// than LCG but fail some test suites struct XorShift32 { /// This stuff allows to use this class wherever a library function /// requires a UniformRandomBitGenerator (e.g. std::shuffle) typedef uint32_t result_type; static uint32_t min() { return 1; } static uint32_t max() { return uint32_t(-1); }

/// PRNG state
uint32_t y;

/// Initializes with seed
XorShift32(uint32_t seed = 0) : y(seed) {
    if(y == 0) y = 2463534242UL;
}

/// Returns a value in the range [1, 1<<32)
uint32_t operator()() {
    y ^= (y<<13);
    y ^= (y>>17);
    y ^= (y<<15);
    return y;
}

/// Returns a value in the range [0, limit); this conforms to the RandomFunc
/// requirements for std::random_shuffle
uint32_t operator()(uint32_t limit) {
    return (*this)()%limit;
}

};

struct mean_variance { double rmean = 0.; double rvariance = 0.; int count = 0;

void operator()(double x) {
    ++count;
    double ormean = rmean;
    rmean     += (x-rmean)/count;
    rvariance += (x-ormean)*(x-rmean);
}

double mean()     const { return rmean; }
double variance() const { return rvariance/(count-1); }
double stddev()   const { return std::sqrt(variance()); }

};

std::vector<uint8_t> main_arr; std::vector<uint32_t> sec_arr;

volatile unsigned out;

int main() { XorShift32 xs; std::vector<uint8_t> vec; int size = 10000000; for(int i = 0; i<size; ++i) { uint32_t v = xs(); if(v < 1825361101) v = 0; // 42.5% else if(v < 4080218931) v = 1; // 95.0% else if(v < 4252017623) v = 2; // 99.0% else { while((v & 0xff) < 3) v = xs(); } vec.push_back(v); } populate(vec.data(), vec.size()); mean_variance lk_t, arr_t; for(int i = 0; i<50; ++i) { { unsigned o = 0; auto beg = high_resolution_clock::now(); for(int i = 0; i < size; ++i) { o += lookup(xs() % size); } out += o; int dur = (high_resolution_clock::now()-beg)/microseconds(1); fprintf(stderr, "lookup: %10d µs\n", dur); lk_t(dur); } { unsigned o = 0; auto beg = high_resolution_clock::now(); for(int i = 0; i < size; ++i) { o += vec[xs() % size]; } out += o; int dur = (high_resolution_clock::now()-beg)/microseconds(1); fprintf(stderr, "array: %10d µs\n", dur); arr_t(dur); } }

fprintf(stderr, " lookup |   ±  |  array  |   ±  | speedup\n");
printf("%7.0f | %4.0f | %7.0f | %4.0f | %0.2f\n",
        lk_t.mean(), lk_t.stddev(),
        arr_t.mean(), arr_t.stddev(),
        arr_t.mean()/lk_t.mean());
return 0;

} （代码和数据始终在我的Bitbucket中更新）

上面的代码使用在其帖子中指定的OP分配的随机数据填充10M元素数组，初始化我的数据结构，然后：

使用我的数据结构执行10M元素的随机查找通过原始数组执行相同操作。（请注意，在顺序查找的情况下，数组总是在很大程度上赢得胜利，因为它是您可以进行的最便于缓存的查找）

最后两个块重复50次并定时；最后，计算并打印每种类型的查询的平均值和标准差，并显示提速率（lookup_mean / array_mean）。

我-O3 -static在Ubuntu 16.04 上使用g ++ 5.4.0（以及一些警告）编译了上面的代码，并在某些计算机上运行了该代码；其中大多数运行Ubuntu 16.04，一些较旧的Linux，一些较新的Linux。在这种情况下，我认为操作系统根本不重要。

        CPU           |  cache   |  lookup (µs)   |     array (µs)  | speedup (x)

Xeon E5-1650 v3 @ 3.50GHz | 15360 KB | 60011 ± 3667 | 29313 ± 2137 | 0.49 Xeon E5-2697 v3 @ 2.60GHz | 35840 KB | 66571 ± 7477 | 33197 ± 3619 | 0.50 Celeron G1610T @ 2.30GHz | 2048 KB | 172090 ± 629 | 162328 ± 326 | 0.94 Core i3-3220T @ 2.80GHz | 3072 KB | 111025 ± 5507 | 114415 ± 2528 | 1.03 Core i5-7200U @ 2.50GHz | 3072 KB | 92447 ± 1494 | 95249 ± 1134 | 1.03 Xeon X3430 @ 2.40GHz | 8192 KB | 111303 ± 936 | 127647 ± 1503 | 1.15 Core i7 920 @ 2.67GHz | 8192 KB | 123161 ± 35113 | 156068 ± 45355 | 1.27 Xeon X5650 @ 2.67GHz | 12288 KB | 106015 ± 5364 | 140335 ± 6739 | 1.32 Core i7 870 @ 2.93GHz | 8192 KB | 77986 ± 429 | 106040 ± 1043 | 1.36 Core i7-6700 @ 3.40GHz | 8192 KB | 47854 ± 573 | 66893 ± 1367 | 1.40 Core i3-4150 @ 3.50GHz | 3072 KB | 76162 ± 983 | 113265 ± 239 | 1.49 Xeon X5650 @ 2.67GHz | 12288 KB | 101384 ± 796 | 152720 ± 2440 | 1.51 Core i7-3770T @ 2.50GHz | 8192 KB | 69551 ± 1961 | 128929 ± 2631 | 1.85 结果好坏参半！

通常，在大多数这些机器上都有某种程度的加速，或者至少它们是同等的。在阵列真正胜过“智能结构”查找的两种情况下，它们是在具有大量缓存但并不特别忙的机器上：上面的Xeon E5-1650（15 MB缓存）是一台夜间构建机器，目前相当闲置；Xeon E5-2697（35 MB高速缓存）是一台用于高性能计算的计算机，它也处于空闲状态。确实有道理，原始数组完全适合其巨大的缓存，因此紧凑的数据结构只会增加复杂性。在“性能范围”的另一面-但是阵列又稍微快一点，有不起眼的Celeron为我的NAS提供动力。它的缓存非常少，以至于数组和“智能结构”都无法放入其中。缓存足够小的其他计算机也具有类似的性能。 Xeon X5650必须格外小心-它们是非常繁忙的双路虚拟机服务器上的虚拟机；可能是，尽管名义上它具有相当数量的缓存，但是在测试期间，它多次被完全不相关的虚拟机抢占了先机。

探索云世界

热门

云计算

大数据

云原生

人工智能

数据库

开发与运维

活动广场

任务中心

训练营

直播

乘风者计划

下载

镜像站

技术资料

在95％的情况下值为0或1时，是否可以对很大的数组进行随机访问优化？