Keccak implementation on GPU

Sunday 13 February 2011, by Cayrel Pierre-Louis, Gerhard Hoffman

This page describes our GPU implementation of Keccak, a family of cryptographic hash functions that has been submitted as a SHA-3 candidate.

Implementation of Keccak on the GPU

Keccak (pronounced [kεt∫ak], like "ketchak") is a family of hash functions that has been submitted as a candidate to NIST's hash algorithm competition (SHA-3). Keccak is based on the sponge construction and is therefore a sponge function family (see the Keccak Specs-Summary). Implementations are available in different languages and on different platforms. We add to this pool an implementation on the GPU (GTX 295) using CUDA.

As noted in the Keccak Specs-Summary, the underlying function is a permutation chosen from a set of seven Keccak-f permutations, denoted Keccak-f[b], where b ∈ {25, 50, 100, 200, 400, 800, 1600} is the width of the permutation. The width of the permutation is also the width of the state in the sponge construction. The state is organized as an array of 5×5 lanes, each of length w ∈ {1, 2, 4, 8, 16, 32, 64} (b=25w). When implemented on a 64-bit processor, a lane of Keccak-f[1600] can be represented as a 64-bit CPU word. We obtain the Keccak[r,c,d] sponge function, with parameters capacity c, bitrate r and diversifier d, by applying the sponge construction to Keccak-f[r+c] and a specific padding to the message input.
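The relations between width, lane length, bitrate and capacity can be checked with a few lines of C. This is a minimal sketch: the function names are hypothetical, and the round-count formula 12 + 2ℓ (with w = 2^ℓ) is taken from the Keccak specification rather than from this page; it yields the 24 rounds used for Keccak-f[1600].

```c
/* Returns the lane length w = b / 25 for a valid Keccak-f width b,
   or 0 if b is not one of the seven defined widths. */
unsigned keccak_lane_bits(unsigned b) {
    static const unsigned widths[7] = {25, 50, 100, 200, 400, 800, 1600};
    for (int i = 0; i < 7; i++)
        if (widths[i] == b) return b / 25;
    return 0;
}

/* Number of rounds 12 + 2*l, where w = 2^l (assumption: formula from
   the Keccak specification). Returns 0 for an invalid width. */
unsigned keccak_rounds(unsigned b) {
    unsigned w = keccak_lane_bits(b), l = 0;
    if (w == 0) return 0;
    while ((1u << l) < w) l++;
    return 12 + 2 * l;
}

/* A bitrate r and capacity c fit together when r + c is a valid width. */
int keccak_params_valid(unsigned r, unsigned c) {
    return r > 0 && keccak_lane_bits(r + c) != 0;
}
```

For example, keccak_params_valid(1024, 576) holds because 1024 + 576 = 1600, and keccak_lane_bits(800) gives 32-bit lanes.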

What Keccak actually does is this: Keccak-f[1600] works on an array of 1600 bits (hence its name). The array is called the state. A block of r message bits is added (XORed) into the first r bits of the state, and the Keccak permutation is then applied to the state. If more message bits are available, the procedure is repeated. If fewer than r bits remain, the block has to be padded before the addition step.

After all data has been fed to Keccak, it moves to the squeezing phase of the sponge. The user reads the first r bits from the state. If more bits are required, the user calls the Keccak permutation again on the state and reads the first r bits again. In this way the digest can be arbitrarily long.
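The absorb and squeeze steps described above can be sketched in C as follows. This is an illustration under stated assumptions, not our GPU code: the rate is given in bytes, the padding is the byte-level form of the multi-rate rule (0x01 ... 0x80) used by later Keccak versions, and toy_permutation is a hypothetical stand-in that must be replaced by Keccak-f[1600] for any real use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STATE_BYTES 200  /* 1600-bit state */

typedef void (*permutation_fn)(uint8_t state[STATE_BYTES]);

/* Placeholder permutation (hypothetical, NOT Keccak-f and not secure):
   it complements every state byte so the data flow is easy to follow. */
void toy_permutation(uint8_t state[STATE_BYTES]) {
    for (size_t i = 0; i < STATE_BYTES; i++) state[i] ^= 0xFF;
}

/* Generic sponge: absorb msg in blocks of `rate` bytes, then squeeze
   outlen bytes. The padding rule is an assumption (see lead-in). */
void sponge(permutation_fn f, size_t rate, const uint8_t *msg, size_t len,
            uint8_t *out, size_t outlen) {
    uint8_t state[STATE_BYTES] = {0};
    assert(rate > 0 && rate < STATE_BYTES);

    /* Absorbing phase: XOR each full block into the state and permute. */
    while (len >= rate) {
        for (size_t i = 0; i < rate; i++) state[i] ^= msg[i];
        f(state);
        msg += rate;
        len -= rate;
    }

    /* Last (possibly empty) block: XOR the rest, pad, permute. */
    for (size_t i = 0; i < len; i++) state[i] ^= msg[i];
    state[len] ^= 0x01;       /* first padding bit */
    state[rate - 1] ^= 0x80;  /* last padding bit  */
    f(state);

    /* Squeezing phase: read up to `rate` bytes, permute if more is needed. */
    while (outlen > 0) {
        size_t n = outlen < rate ? outlen : rate;
        memcpy(out, state, n);
        out += n;
        outlen -= n;
        if (outlen > 0) f(state);
    }
}
```

Calling sponge(toy_permutation, 8, (const uint8_t *)"abc", 3, out, 10) absorbs one padded block and calls the permutation a second time while squeezing, because ten output bytes exceed the eight-byte rate.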

Our implementation uses only Keccak-f[1600], which is also the permutation used by the available reference implementation. Depending on the hardware resources, one can choose a smaller width such as 800, which would halve the memory needs of Keccak. The smallest width defined in the family is 25, but such small widths leave too little capacity for practical security. It should also be noted that for widths other than 1600 it is up to the implementor to create the correct internal tables (round constant table, rho offset table).

At the heart of a Keccak permutation are five functions, named theta, rho, pi, chi and iota. One of the general difficulties on the GPU is to avoid branching as far as possible. Our implementation merges the five functions into five lines of code. Using suitable lookup tables, it runs branch-free. One problem remains, though: the implementation generates so-called bank conflicts. One reason is that Keccak-f[1600] works on 64-bit lanes; another is the state access pattern of Keccak. We tried a 32-bit implementation of Keccak-f[1600] with an interleaving technique, but the profiling results showed no advantage compared to the 64-bit reference implementation. Using the new Fermi architecture would be very helpful for reducing the bank conflicts. Another possibility might be to exploit texture memory.

We first give the code of the five functions mentioned above (taken from KeccakPermutationReference.c contained in the reference implementation). Then we show how to merge them into the code used by our implementation.

#define ROUNDS 24
#define index(x, y) (((x)%5)+5*((y)%5))
#define ROL64(a, offset) ((offset != 0) ? ((((UINT64)a) << offset) ^ (((UINT64)a) >> (64-offset))) : a)

UINT64 KeccakRoundConstants[ROUNDS] = {
    0x0000000000000001ULL, 0x0000000000008082ULL, 0x800000000000808AULL,
    0x8000000080008000ULL, 0x000000000000808BULL, 0x0000000080000001ULL,
    0x8000000080008081ULL, 0x8000000000008009ULL, 0x000000000000008AULL,
    0x0000000000000088ULL, 0x0000000080008009ULL, 0x000000008000000AULL,
    0x000000008000808BULL, 0x800000000000008BULL, 0x8000000000008089ULL,
    0x8000000000008003ULL, 0x8000000000008002ULL, 0x8000000000000080ULL,
    0x000000000000800AULL, 0x800000008000000AULL, 0x8000000080008081ULL,
    0x8000000000008080ULL, 0x0000000080000001ULL, 0x8000000080008008ULL};

int KeccakRhoOffsets[25] =  {
    0 , 1 , 62, 28, 27, 36, 44, 6, 55, 20, 3, 10,
    43, 25, 39, 41, 45, 15, 21, 8, 18, 2, 61, 56, 14};

void theta(UINT64 *A) {
    unsigned int x, y;
    UINT64 C[5], D[5];

    for(x=0; x<5; x++) {
        C[x] = 0; 
        for(y=0; y<5; y++) 
            C[x] ^= A[index(x, y)];
        D[x] = ROL64(C[x], 1);
    }
    for(x=0; x<5; x++)
        for(y=0; y<5; y++)
            A[index(x, y)] ^= D[(x+1)%5] ^ C[(x+4)%5];
}

void rho(UINT64 *A) {
    unsigned int x, y;

    for(x=0; x<5; x++) for(y=0; y<5; y++)
        A[index(x, y)] = ROL64(A[index(x, y)], KeccakRhoOffsets[index(x, y)]);
}

void pi(UINT64 *A) {
    unsigned int x, y;
    UINT64 tempA[25];

    for(x=0; x<5; x++) for(y=0; y<5; y++)
        tempA[index(x, y)] = A[index(x, y)];
    for(x=0; x<5; x++) for(y=0; y<5; y++)
        A[index(0*x+1*y, 2*x+3*y)] = tempA[index(x, y)];
}

void chi(UINT64 *A) {
    unsigned int x, y;
    UINT64 C[5];

    for(y=0; y<5; y++) { 
        for(x=0; x<5; x++)
            C[x] = A[index(x, y)] ^ ((~A[index(x+1, y)]) & A[index(x+2, y)]);
        for(x=0; x<5; x++)
            A[index(x, y)] = C[x];
    }
}

void iota(UINT64 *A, unsigned int indexRound) {
    A[index(0, 0)] ^= KeccakRoundConstants[indexRound];
}

The Keccak permutation of Keccak-f[1600] consists of 24 rounds, each round calling those five functions in sequence: theta, rho, pi, chi and iota.
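The round loop can be sketched as a self-contained C fragment. To compile on its own it restates the five step functions in condensed form; the names keccak_f1600, rol64, RC, RHO and IDX are chosen here for illustration and are not those of the reference implementation.

```c
#include <stdint.h>
#include <string.h>

#define ROUNDS 24
#define IDX(x, y) (((x) % 5) + 5 * ((y) % 5))

static const uint64_t RC[ROUNDS] = {
    0x0000000000000001ULL, 0x0000000000008082ULL, 0x800000000000808AULL,
    0x8000000080008000ULL, 0x000000000000808BULL, 0x0000000080000001ULL,
    0x8000000080008081ULL, 0x8000000000008009ULL, 0x000000000000008AULL,
    0x0000000000000088ULL, 0x0000000080008009ULL, 0x000000008000000AULL,
    0x000000008000808BULL, 0x800000000000008BULL, 0x8000000000008089ULL,
    0x8000000000008003ULL, 0x8000000000008002ULL, 0x8000000000000080ULL,
    0x000000000000800AULL, 0x800000008000000AULL, 0x8000000080008081ULL,
    0x8000000000008080ULL, 0x0000000080000001ULL, 0x8000000080008008ULL};

static const int RHO[25] = {
    0,  1,  62, 28, 27, 36, 44, 6,  55, 20, 3,  10,
    43, 25, 39, 41, 45, 15, 21, 8,  18, 2,  61, 56, 14};

/* Rotate left; the ternary avoids an undefined shift by 64 when n == 0. */
static uint64_t rol64(uint64_t a, int n) {
    return n ? (a << n) | (a >> (64 - n)) : a;
}

/* theta: XOR each lane with the parities of two neighbouring columns. */
static void theta(uint64_t *A) {
    uint64_t C[5];
    for (int x = 0; x < 5; x++)
        C[x] = A[x] ^ A[x + 5] ^ A[x + 10] ^ A[x + 15] ^ A[x + 20];
    for (int x = 0; x < 5; x++)
        for (int y = 0; y < 5; y++)
            A[IDX(x, y)] ^= C[(x + 4) % 5] ^ rol64(C[(x + 1) % 5], 1);
}

/* rho: rotate every lane by its fixed offset. */
static void rho(uint64_t *A) {
    for (int i = 0; i < 25; i++) A[i] = rol64(A[i], RHO[i]);
}

/* pi: move lane (x, y) to position (y, 2x + 3y). */
static void pi(uint64_t *A) {
    uint64_t T[25];
    memcpy(T, A, sizeof T);
    for (int x = 0; x < 5; x++)
        for (int y = 0; y < 5; y++)
            A[IDX(y, 2 * x + 3 * y)] = T[IDX(x, y)];
}

/* chi: nonlinear step, applied row by row. */
static void chi(uint64_t *A) {
    for (int y = 0; y < 5; y++) {
        uint64_t R[5];
        for (int x = 0; x < 5; x++) R[x] = A[IDX(x, y)];
        for (int x = 0; x < 5; x++)
            A[IDX(x, y)] = R[x] ^ (~R[(x + 1) % 5] & R[(x + 2) % 5]);
    }
}

/* iota: add the round constant to lane (0, 0). */
static void iota(uint64_t *A, int round) { A[0] ^= RC[round]; }

/* Keccak-f[1600]: 24 rounds of theta, rho, pi, chi, iota on 25 lanes. */
void keccak_f1600(uint64_t *A) {
    for (int round = 0; round < ROUNDS; round++) {
        theta(A);
        rho(A);
        pi(A);
        chi(A);
        iota(A, round);
    }
}
```

Applied to an all-zero state, lane 0 should come out as 0xF1258F7940E1DDE7, the first value of the published Keccak-f[1600] intermediate-value test vectors.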

The code above has to be transformed into a more GPU-friendly form so that it can be executed by multiple threads in parallel. Here we give the main parts of our implementation of the Keccak permutation.

CUDA threads in the same block communicate via so-called shared memory. We declare four arrays in shared memory: A denotes the state of Keccak, while B, C and D are used as temporary buffers.

We redefine ROL64 as a macro R64 that takes the left-shift amount b and the right-shift amount c as separate arguments. On the GPU it also works for b = 64 or c = 64, so we can get rid of the ternary operator in ROL64.
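A possible form of such a macro is sketched below. The name R64 and its three-argument use are taken from the round code further down; the macro body is an assumption. Standard C leaves a shift by 64 undefined, so this host-side sketch emulates the behaviour the text relies on (a shift by the full register width yielding zero) with two hypothetical helper functions. On the GPU the helpers would be plain shifts.

```c
#include <stdint.h>

/* Host-side emulation of shifts whose result is zero once the shift
   amount reaches the register width (assumed GPU behaviour). */
static uint64_t shl_clamped(uint64_t a, uint32_t n) {
    return n < 64 ? a << n : 0;
}
static uint64_t shr_clamped(uint64_t a, uint32_t n) {
    return n < 64 ? a >> n : 0;
}

/* Rotate a left by b bits; the caller passes c = 64 - b explicitly,
   so the macro itself needs no special case for b = 0. */
#define R64(a, b, c) (shl_clamped((a), (b)) ^ shr_clamped((a), (c)))
```

With this definition R64(v, 0, 64) returns v unchanged, and R64(v, 1, 63) rotates v left by one bit.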

As the Keccak state A consists of 5×5 lanes of 64 bits each, the Keccak permutation can be executed by 25 threads in parallel. A so-called warp in CUDA terminology denotes a group of 32 threads. A warp is the entity that is actually scheduled by the thread manager on the GPU. Hence the Keccak permutation is executed by one warp. The remaining seven threads cannot be used on the GTX 295, for this would mean crossing a warp boundary, creating serious thread synchronization problems.

As already mentioned, we have to aim for an implementation which is as branch-free as possible. In order to achieve that, we extended the standard tables and introduced some new ones. They are stored in the constant memory of the GPU, because constant memory on the GPU is cached. Keep in mind for the following tables that constant values cannot be initialized as shown below. The values have to be copied from the host CPU to the GPU at kernel call time. They are written here in this way for brevity.

We extend the round constants table by zeros. iota can then be executed by every thread without checking the index of A it is dealing with. Therefore, we avoid introducing a branch by using the following table.

__device__ __constant__ uint64_t rc[5][ROUNDS] = {
    {0x0000000000000001ULL, 0x0000000000008082ULL, 0x800000000000808AULL,
     0x8000000080008000ULL, 0x000000000000808BULL, 0x0000000080000001ULL,
     0x8000000080008081ULL, 0x8000000000008009ULL, 0x000000000000008AULL,
     0x0000000000000088ULL, 0x0000000080008009ULL, 0x000000008000000AULL,
     0x000000008000808BULL, 0x800000000000008BULL, 0x8000000000008089ULL,
     0x8000000000008003ULL, 0x8000000000008002ULL, 0x8000000000000080ULL,
     0x000000000000800AULL, 0x800000008000000AULL, 0x8000000080008081ULL,
     0x8000000000008080ULL, 0x0000000080000001ULL, 0x8000000080008008ULL},
    {0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL,
     0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL,
     0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
    {0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL,
     0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL,
     0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
    {0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL,
     0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL,
     0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
    {0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL,
     0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL,
     0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL}};

The next table holds the rho offsets. Note that the two entries of each pair sum to 64. Only the first entry of each pair is a rho offset; the second is the complementary right-shift amount used in the R64 macro. This way the rotation can be written without the ternary operator.

__device__ __constant__ uint32_t ro[25][2] = {
       /*y=0*/         /*y=1*/         /*y=2*/         /*y=3*/         /*y=4*/
/*x=0*/{ 0,64}, /*x=1*/{44,20}, /*x=2*/{43,21}, /*x=3*/{21,43}, /*x=4*/{14,50},
/*x=1*/{ 1,63}, /*x=2*/{ 6,58}, /*x=3*/{25,39}, /*x=4*/{ 8,56}, /*x=0*/{18,46},
/*x=2*/{62, 2}, /*x=3*/{55, 9}, /*x=4*/{39,25}, /*x=0*/{41,23}, /*x=1*/{ 2,62},
/*x=3*/{28,36}, /*x=4*/{20,44}, /*x=0*/{ 3,61}, /*x=1*/{45,19}, /*x=2*/{61, 3},
/*x=4*/{27,37}, /*x=0*/{36,28}, /*x=1*/{10,54}, /*x=2*/{15,49}, /*x=3*/{56, 8}};


/* a[t]: lane of A that thread t reads in the merged rho/pi step */
__device__ __constant__ uint32_t a[25] = {
    0,  6, 12, 18, 24,
    1,  7, 13, 19, 20,
    2,  8, 14, 15, 21,
    3,  9, 10, 16, 22,
    4,  5, 11, 17, 23};


/* b[t] = a[t] % 5: column of that lane, used to select the theta value in D */
__device__ __constant__ uint32_t b[25] = {
    0,  1,  2,  3, 4,
    1,  2,  3,  4, 0,
    2,  3,  4,  0, 1,
    3,  4,  0,  1, 2,
    4,  0,  1,  2, 3};

    
/* c[t]: the three lanes of C in the row of thread t that chi combines */
__device__ __constant__ uint32_t c[25][3] = {
    { 0, 1, 2}, { 1, 2, 3}, { 2, 3, 4}, { 3, 4, 0}, { 4, 0, 1},
    { 5, 6, 7}, { 6, 7, 8}, { 7, 8, 9}, { 8, 9, 5}, { 9, 5, 6},
    {10,11,12}, {11,12,13}, {12,13,14}, {13,14,10}, {14,10,11},
    {15,16,17}, {16,17,18}, {17,18,19}, {18,19,15}, {19,15,16},
    {20,21,22}, {21,22,23}, {22,23,24}, {23,24,20}, {24,20,21}};

    
/* d[t]: lane of A to which thread t writes its chi result */
__device__ __constant__ uint32_t d[25] = {
          0,  1,  2,  3,  4,
         10, 11, 12, 13, 14,
         20, 21, 22, 23, 24,
          5,  6,  7,  8,  9,
         15, 16, 17, 18, 19};
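The index tables above can be derived from the reference definitions of pi and rho instead of being typed in by hand. The following C sketch shows one way to do so; build_tables and RHO are hypothetical names, and the row order encoded in d (row r of the temporary buffer is written to row 2r mod 5 of the state) is read off the d table above rather than derived.

```c
#include <stdint.h>

/* rho offsets of the reference implementation, indexed by x + 5*y */
static const int RHO[25] = {
    0,  1,  62, 28, 27, 36, 44, 6,  55, 20, 3,  10,
    43, 25, 39, 41, 45, 15, 21, 8,  18, 2,  61, 56, 14};

uint32_t a[25], b[25], c[25][3], d[25], ro[25][2];

void build_tables(void) {
    /* d and c depend only on the slot t = 5*r + x of the buffer. */
    for (int r = 0; r < 5; r++)
        for (int x = 0; x < 5; x++) {
            int t = 5 * r + x;
            d[t] = (uint32_t)(x + 5 * ((2 * r) % 5));
            for (int k = 0; k < 3; k++)
                c[t][k] = (uint32_t)(5 * r + (x + k) % 5);
        }
    /* pi sends lane (x, y) to (y, 2x + 3y). The slot t whose chi result
       lands on that destination must read lane (x, y) as its input. */
    for (int x = 0; x < 5; x++)
        for (int y = 0; y < 5; y++) {
            int src = x + 5 * y;
            int dest = y + 5 * ((2 * x + 3 * y) % 5);
            int t = 0;
            while (d[t] != (uint32_t)dest) t++;  /* d is a permutation */
            a[t] = (uint32_t)src;
            b[t] = (uint32_t)x;                  /* column of the source */
            ro[t][0] = (uint32_t)RHO[src];       /* left shift  */
            ro[t][1] = (uint32_t)(64 - RHO[src]);/* right shift */
        }
}
```

After build_tables() has run, the arrays should agree entry by entry with the tables listed above, for example a[5] = 1, d[5] = 10 and ro[1] = {44, 20}.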


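/* Executed by each of the 25 threads; presumably t is the thread index
   (0..24) and s = t % 5 is the column of its lane. */
for(int i=0;i<24;++i) {
   /* theta: column parity, then the value to add to each column */
   C[t] = A[s]^A[s+5]^A[s+10]^A[s+15]^A[s+20];
   D[t] = C[b[20+s]] ^ R64(C[b[5+s]],1,63);
   /* theta, rho and pi merged: fetch the source lane, add D, rotate */
   C[t] = R64(A[a[t]]^D[b[t]], ro[t][0], ro[t][1]);
   /* chi on the row, written back to its final position */
   A[d[t]] = C[c[t][0]] ^ ((~C[c[t][1]]) & C[c[t][2]]);
   /* iota: only lane 0 receives a nonzero round constant */
   A[t] ^= rc[(t==0) ? 0 : 1][i];
}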
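The merged round can be checked on the host by emulating the 25 threads in lockstep: each of the five statements becomes a loop over t that finishes before the next statement starts. The sketch below does this in plain C under a few assumptions: t is the thread index and s = t % 5, the round constant table is condensed to two rows (constants and zeros), and R64 is emulated with helper functions because a shift by 64 is undefined in standard C. The function name keccak_f1600_lockstep is hypothetical.

```c
#include <stdint.h>

#define ROUNDS 24

/* Host-side stand-ins for GPU shifts (assumed to give 0 for a shift by 64). */
static uint64_t shl64(uint64_t v, uint32_t n) { return n < 64 ? v << n : 0; }
static uint64_t shr64(uint64_t v, uint32_t n) { return n < 64 ? v >> n : 0; }
#define R64(v, l, r) (shl64((v), (l)) ^ shr64((v), (r)))

/* Row 0: round constants; row 1 is implicitly zero-initialized. */
static const uint64_t rc[2][ROUNDS] = {{
    0x0000000000000001ULL, 0x0000000000008082ULL, 0x800000000000808AULL,
    0x8000000080008000ULL, 0x000000000000808BULL, 0x0000000080000001ULL,
    0x8000000080008081ULL, 0x8000000000008009ULL, 0x000000000000008AULL,
    0x0000000000000088ULL, 0x0000000080008009ULL, 0x000000008000000AULL,
    0x000000008000808BULL, 0x800000000000008BULL, 0x8000000000008089ULL,
    0x8000000000008003ULL, 0x8000000000008002ULL, 0x8000000000000080ULL,
    0x000000000000800AULL, 0x800000008000000AULL, 0x8000000080008081ULL,
    0x8000000000008080ULL, 0x0000000080000001ULL, 0x8000000080008008ULL}};

static const uint32_t ro[25][2] = {
    { 0,64}, {44,20}, {43,21}, {21,43}, {14,50},
    { 1,63}, { 6,58}, {25,39}, { 8,56}, {18,46},
    {62, 2}, {55, 9}, {39,25}, {41,23}, { 2,62},
    {28,36}, {20,44}, { 3,61}, {45,19}, {61, 3},
    {27,37}, {36,28}, {10,54}, {15,49}, {56, 8}};
static const uint32_t a[25] = {0, 6, 12, 18, 24, 1, 7, 13, 19, 20, 2, 8, 14,
                               15, 21, 3, 9, 10, 16, 22, 4, 5, 11, 17, 23};
static const uint32_t b[25] = {0, 1, 2, 3, 4, 1, 2, 3, 4, 0, 2, 3, 4,
                               0, 1, 3, 4, 0, 1, 2, 4, 0, 1, 2, 3};
static const uint32_t c[25][3] = {
    { 0, 1, 2}, { 1, 2, 3}, { 2, 3, 4}, { 3, 4, 0}, { 4, 0, 1},
    { 5, 6, 7}, { 6, 7, 8}, { 7, 8, 9}, { 8, 9, 5}, { 9, 5, 6},
    {10,11,12}, {11,12,13}, {12,13,14}, {13,14,10}, {14,10,11},
    {15,16,17}, {16,17,18}, {17,18,19}, {18,19,15}, {19,15,16},
    {20,21,22}, {21,22,23}, {22,23,24}, {23,24,20}, {24,20,21}};
static const uint32_t d[25] = {0, 1, 2, 3, 4, 10, 11, 12, 13, 14, 20, 21, 22,
                               23, 24, 5, 6, 7, 8, 9, 15, 16, 17, 18, 19};

/* Emulates 25 lockstep threads; each inner loop is one statement of the
   GPU round executed by all threads before the next one starts. */
void keccak_f1600_lockstep(uint64_t A[25]) {
    uint64_t C[25], D[25];
    for (int i = 0; i < ROUNDS; ++i) {
        int t, s;
        for (t = 0; t < 25; ++t) {
            s = t % 5;
            C[t] = A[s] ^ A[s + 5] ^ A[s + 10] ^ A[s + 15] ^ A[s + 20];
        }
        for (t = 0; t < 25; ++t) {
            s = t % 5;
            D[t] = C[b[20 + s]] ^ R64(C[b[5 + s]], 1, 63);
        }
        for (t = 0; t < 25; ++t)
            C[t] = R64(A[a[t]] ^ D[b[t]], ro[t][0], ro[t][1]);
        for (t = 0; t < 25; ++t)
            A[d[t]] = C[c[t][0]] ^ ((~C[c[t][1]]) & C[c[t][2]]);
        for (t = 0; t < 25; ++t)
            A[t] ^= rc[t != 0][i];
    }
}
```

Starting from an all-zero state, lane 0 should again be 0xF1258F7940E1DDE7 after the call, which matches the published Keccak-f[1600] test vector and gives a quick sanity check of the tables.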
