Item metadata

dc.contributor.author       Saikumar, Dhananjay
dc.contributor.author       Varghese, Blesson
dc.date.accessioned         2024-05-03T14:30:09Z
dc.date.available           2024-05-03T14:30:09Z
dc.date.issued              2024-04
dc.identifier               301891691
dc.identifier               5d7fa2fd-57f4-43ed-9072-549f51acb6be
dc.identifier               85191987987
dc.identifier.citation      Saikumar, D & Varghese, B 2024, NeuroFlux: memory-efficient CNN training using adaptive local learning. In EuroSys '24: Proceedings of the Nineteenth European Conference on Computer Systems. ACM, pp. 999-1015. The European Conference on Computer Systems, Athens, Greece, 22/04/24. https://doi.org/10.1145/3627703.3650067 [en]
dc.identifier.citation      conference [en]
dc.identifier.isbn          9798400704376
dc.identifier.uri           https://hdl.handle.net/10023/29805
dc.description.abstract     Efficient on-device Convolutional Neural Network (CNN) training in resource-constrained mobile and edge environments is an open challenge. Backpropagation is the standard approach adopted, but it is GPU memory intensive due to its strong inter-layer dependencies that demand intermediate activations across the entire CNN model to be retained in GPU memory. This necessitates smaller batch sizes to make training possible within the available GPU memory budget, but in turn, results in substantially high and impractical training time. We introduce NeuroFlux, a novel CNN training system tailored for memory-constrained scenarios. We develop two novel opportunities: firstly, adaptive auxiliary networks that employ a variable number of filters to reduce GPU memory usage, and secondly, block-specific adaptive batch sizes, which not only cater to the GPU memory constraints but also accelerate the training process. NeuroFlux segments a CNN into blocks based on GPU memory usage and further attaches an auxiliary network to each layer in these blocks. This disrupts the typical layer dependencies under a new training paradigm - 'adaptive local learning'. Moreover, NeuroFlux adeptly caches intermediate activations, eliminating redundant forward passes over previously trained blocks, further accelerating the training process. The results are twofold when compared to Backpropagation: on various hardware platforms, NeuroFlux demonstrates training speed-ups of 2.3× to 6.1× under stringent GPU memory budgets, and NeuroFlux generates streamlined models that have 10.9× to 29.4× fewer parameters. (An illustrative code sketch of this block-wise local-learning scheme appears after the metadata record below.)
dc.format.extent            17
dc.format.extent            833114
dc.language.iso             eng
dc.publisher                ACM
dc.relation.ispartof        EuroSys '24: Proceedings of the Nineteenth European Conference on Computer Systems [en]
dc.subject                  CNN training [en]
dc.subject                  Memory efficient training [en]
dc.subject                  Local learning [en]
dc.subject                  Edge computing [en]
dc.subject                  QA75 Electronic computers. Computer science [en]
dc.subject                  3rd-NDAS [en]
dc.subject.lcc              QA75 [en]
dc.title                    NeuroFlux: memory-efficient CNN training using adaptive local learning [en]
dc.type                     Conference item [en]
dc.contributor.institution  University of St Andrews. School of Computer Science [en]
dc.identifier.doi           10.1145/3627703.3650067
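
The abstract above describes the core mechanism: the CNN is segmented into blocks, small auxiliary networks are attached so each block can be trained locally without end-to-end backpropagation, and the activations of already-trained blocks are cached to avoid redundant forward passes. The PyTorch sketch below is a minimal illustration of that general pattern under simplifying assumptions; the AuxHead module, the train_block and cache_activations helpers, the block boundaries, and all hyperparameters are hypothetical and are not taken from the paper's implementation.

```python
# Minimal, hypothetical PyTorch sketch of block-wise local learning with
# auxiliary heads and activation caching, loosely following the ideas in the
# abstract above. Names, block boundaries, and hyperparameters are
# illustrative assumptions, not the actual NeuroFlux implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxHead(nn.Module):
    """Small auxiliary classifier attached to a block; aux_channels can be
    shrunk to trade a little accuracy for lower GPU memory usage."""
    def __init__(self, in_channels, aux_channels, num_classes):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, aux_channels, kernel_size=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(aux_channels, num_classes)

    def forward(self, x):
        x = F.relu(self.conv(x))
        return self.fc(self.pool(x).flatten(1))

def train_block(block, head, batches, device, lr=1e-3):
    """Train one block against its auxiliary head only; gradients never cross
    block boundaries, so only this block's activations are held in memory."""
    opt = torch.optim.SGD(list(block.parameters()) + list(head.parameters()), lr=lr)
    block.train()
    for x, y in batches:
        x, y = x.to(device), y.to(device)
        loss = F.cross_entropy(head(block(x)), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

def cache_activations(block, batches, device):
    """Run a trained block once in inference mode and cache its outputs, so
    later blocks never repeat forward passes over earlier ones."""
    block.eval()
    with torch.no_grad():
        return [(block(x.to(device)).cpu(), y) for x, y in batches]

if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Toy CNN split into two blocks; real block boundaries would be chosen
    # according to the available GPU memory budget.
    blocks = [
        nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).to(device),
        nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU()).to(device),
    ]
    heads = [AuxHead(16, 8, 10).to(device), AuxHead(32, 8, 10).to(device)]
    # Random stand-in data; each (x, y) tuple is one mini-batch. A block-
    # specific batch size could be applied by re-batching between blocks.
    batches = [(torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,)))
               for _ in range(4)]
    for block, head in zip(blocks, heads):
        train_block(block, head, batches, device)
        batches = cache_activations(block, batches, device)  # feeds next block
```

In the paper, the number of auxiliary filters and the per-block batch size are additionally adapted to the GPU memory budget; this sketch keeps both fixed for brevity.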

