During my thesis on "Training Neural Networks on IoT Devices," I faced challenges building neural network models in plain C and deploying them for on-device training. This involved implementing operators that could compute both the forward and backward passes, handling inputs and gradients from the loss function. It was tricky, and everything needed to fit just right. Ever since, I've wanted to build a tool that automates this process. Recently I came across Andrej Karpathy's micrograd project on GitHub, which inspired me to start. After two weeks of work, I have a working version.
Operators implemented so far in C:
- Linear
- Conv2d
- MaxPool2d
- LogSoftmax
- NllLoss
- MSE
- Sigmoid
Example usage:
A simple neural network with two linear layers and NLL loss:
input = Param(None, var_name='input', shape=(1,28*28))
output = Param(None, var_name='output', shape=(1,10))
w1 = Param(None, shape=(32,28*28), var_name='w1', print_init=True)
w2 = Param(None, shape=(10,32), var_name='w2', print_init=True)
w_t = w1.t()
z = matmul(input,w_t)
a = sigmoid(z)
z2 = matmul(a, w2.t())
a2 = log_softmax(z2)
loss = nll_loss(a2, output)
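Under the hood, the example above builds a compute graph with static shapes, which is then walked to emit the C code shown next. Here is a minimal, micrograd-style sketch of how such graph recording could work; the names (`Param`, `matmul`, `sigmoid`, `shape`) follow the example above, but the internals are my assumption, not the project's actual implementation.

```python
# Sketch only: a node records its shape, the op that produced it, and
# its parents, so a later topological walk can emit one C call per node.
class Node:
    def __init__(self, shape, op=None, parents=(), var_name=None):
        self.shape = shape        # static shape, known before codegen
        self.op = op              # operator name used when emitting C
        self.parents = parents    # upstream nodes in the compute graph
        self.var_name = var_name

def Param(_, shape, var_name=None, print_init=False):
    return Node(shape, op="param", var_name=var_name)

def matmul(a, b):
    assert a.shape[1] == b.shape[0], "inner dimensions must match"
    return Node((a.shape[0], b.shape[1]), op="matmul", parents=(a, b))

def sigmoid(x):
    return Node(x.shape, op="sigmoid", parents=(x,))
```

Because every shape is fixed at graph-build time, the generator can assign each intermediate a fixed offset in one preallocated float buffer, which is what the `buf[...]` indices below are.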
Which generates the following C code ready for deployment:
float* buf = (float*)calloc(51828, sizeof(float));
mat_mul(&input[0] /* (1, 784) */, &buf[0] /* (784, 32) */, &buf[26202] /* (1, 32) */, 1, 784, 784, 1, 784, 32, 1, 784); // (1, 32) 5
sigmoid(&buf[26202] /* (1, 32) */, &buf[26234] /* (1, 32) */, 32); // (1, 32) 6
mat_mul(&buf[26234] /* (1, 32) */, &buf[25088] /* (32, 10) */, &buf[26266] /* (1, 10) */, 1, 32, 32, 1, 32, 10, 1, 32); // (1, 10) 8
log_softmax(&buf[26266], &buf[26276], 10); // (1, 10) 9
exp(&buf[26276], &buf[26286], 10); // (1, 10) 18
buf[26296] = nll_loss(&buf[26276], &y[0], 10); // (1, 10) 10
for(uint32_t k=0;k<10;++k){
buf[26306+k] = 1;
buf[26316+k] = -1; // (1, 10) 12
}
mul(&buf[26306], &buf[26316], &buf[26326], 10); // (1, 10) 16
mul(&buf[26326], &y[0], &buf[26336], 10); // (1, 10) 17
add(&buf[26286], &buf[26336], &buf[26346], 10); // (1, 10) 19
mat_mul(&buf[26346] /* (1, 10) */, &buf[25088] /* (10, 32) */, &buf[26356] /* (1, 32) */, 1, 10, 10, 1, 10, 32, 32, 1); // (1, 32) 23
sigmoid_diff(&buf[26202], &buf[26356], &buf[26388], 32); // (1, 32) 24
mat_mul(&buf[26388] /* (32, 1) */, &input[0] /* (1, 784) */, &buf[26420] /* (32, 784) */, 32, 1, 1, 32, 1, 784, 784, 1); // (32, 784) 27
mat_mul(&buf[26346] /* (10, 1) */, &buf[26234] /* (1, 32) */, &buf[51508] /* (10, 32) */, 10, 1, 1, 10, 1, 32, 32, 1); // (10, 32) 22
for (uint32_t k=0;k<25088;++k){
buf[0 + k] -= buf[26420 + k] * lr;
}
for (uint32_t k=0;k<320;++k){
buf[25088 + k] -= buf[51508 + k] * lr;
}
I'm sharing this project here on the forum to see if anyone would like to contribute or help optimize the C operators. Your input and contributions would be greatly appreciated! Feel free to try it out and join the development effort. Here's the link to the GitHub repository.