Publications
Publications by category in reverse chronological order, generated by jekyll-scholar.
2024
- FlexNN: Efficient and Adaptive DNN Inference on Memory-Constrained Edge Devices
  Xiangyu Li, Yuanchun Li, Yuanzhe Li, Ting Cao, and Yunxin Liu
  In Proceedings of the 30th Annual International Conference on Mobile Computing and Networking (ACM MobiCom '24), Washington D.C., DC, USA, 2024
Due to the popularity of deep neural networks (DNNs) and considerations over network overhead, data privacy, and inference latency, there has been growing interest in recent years in deploying DNNs on edge devices. However, limited memory is a major bottleneck for on-device DNN deployment, making it crucial to reduce the memory footprint of DNNs. Mainstream model customization solutions require intensive deployment effort and may lead to severe accuracy degradation, and existing deep learning (DL) frameworks do not treat memory as a priority. Moreover, recent work on enhanced memory management schemes cannot be applied directly because of several challenges, including the unbalanced memory footprint across layers, the inevitable overhead of memory management, and the dynamicity of the memory budget. To tackle these challenges, we introduce FlexNN, an efficient and adaptive memory management framework for DNN inference on memory-constrained devices. FlexNN uses a slicing-loading-computing joint planning approach to achieve optimal memory utilization with minimal memory management overhead. We implemented FlexNN atop NCNN and conducted comprehensive evaluations with common model architectures on various devices. The results show that our approach adapts to different memory constraints with optimal latency-memory trade-offs. For example, FlexNN reduces memory consumption by 93.81% with only a 3.64% increase in latency, compared with the original NCNN on smartphones.
@inproceedings{li2024flexnn,
  author    = {Li, Xiangyu and Li, Yuanchun and Li, Yuanzhe and Cao, Ting and Liu, Yunxin},
  title     = {FlexNN: Efficient and Adaptive DNN Inference on Memory-Constrained Edge Devices},
  year      = {2024},
  isbn      = {9798400704895},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3636534.3649391},
  doi       = {10.1145/3636534.3649391},
  booktitle = {Proceedings of the 30th Annual International Conference on Mobile Computing and Networking},
  pages     = {709–723},
  numpages  = {15},
  keywords  = {edge device, deep learning, DNN inference, memory management},
  location  = {Washington D.C., DC, USA},
  series    = {ACM MobiCom '24}
}
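The slicing-loading-computing idea summarized in the abstract above can be pictured with a small scheduling sketch: a layer whose weights exceed the memory budget is executed slice by slice, with only one weight slice resident at a time. This is a minimal, editor-added sketch of the general idea under an assumed even-split heuristic and illustrative names; it is not FlexNN's actual planner.

```python
# Toy illustration of slice-wise weight loading under a memory budget.
# NOTE: editor-added sketch of the general idea, not FlexNN's actual
# slicing-loading-computing joint planner.

def plan_slices(layer_weight_bytes, budget_bytes):
    """Split a layer's weights into the fewest equal slices that fit the budget."""
    num_slices = -(-layer_weight_bytes // budget_bytes)   # ceiling division
    slice_size = -(-layer_weight_bytes // num_slices)
    return [min(slice_size, layer_weight_bytes - i * slice_size)
            for i in range(num_slices)]

def run_layer(layer_weight_bytes, budget_bytes, load, compute):
    """Load each weight slice, compute its partial output, then release it."""
    for size in plan_slices(layer_weight_bytes, budget_bytes):
        weights = load(size)   # bring only this slice into memory
        compute(weights)       # produce the partial output for this slice
        del weights            # slice memory is freed before the next load

# Example: a 48 MB layer under a 16 MB weight budget is split into 3 slices.
if __name__ == "__main__":
    print(plan_slices(48 * 2**20, 16 * 2**20))
```

In practice the interesting part is overlapping the load of the next slice with the computation of the current one, which is where a joint plan over slicing, loading, and computing pays off.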
- Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security
  Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, and 15 more authors
  arXiv preprint arXiv:2401.05459, 2024
Since the advent of personal computing devices, intelligent personal assistants (IPAs) have been one of the key technologies that researchers and engineers have focused on, aiming to help users efficiently obtain information and execute tasks, and to provide more intelligent, convenient, and rich interaction experiences. With the development of smartphones and the Internet of Things, computing and sensing devices have become ubiquitous, greatly expanding the functional boundaries of intelligent personal assistants. However, due to the lack of capabilities such as user intent understanding, task planning, tool use, and personal data management, existing intelligent personal assistants still have limited practicality and scalability. In recent years, the emergence of foundation models, represented by large language models (LLMs), has brought new opportunities for the development of intelligent personal assistants. With their powerful semantic understanding and reasoning capabilities, LLMs can enable intelligent agents to solve complex problems autonomously. In this paper, we focus on Personal LLM Agents, LLM-based agents that are deeply integrated with personal data and personal devices and used for personal assistance. We envision that Personal LLM Agents will become a major software paradigm for end-users in the upcoming era. To realize this vision, we take the first step by discussing several important questions about Personal LLM Agents, including their architecture, capability, efficiency, and security. We start by summarizing the key components and design choices in the architecture of Personal LLM Agents, followed by an in-depth analysis of the opinions collected from domain experts. Next, we discuss several key challenges in achieving intelligent, efficient, and secure Personal LLM Agents, followed by a comprehensive survey of representative solutions to address these challenges.
@article{li2024personal_llm_agents,
  title   = {Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security},
  author  = {Li, Yuanchun and Wen, Hao and Wang, Weijun and Li, Xiangyu and Yuan, Yizhen and Liu, Guohong and Liu, Jiacheng and Xu, Wenxing and Wang, Xiang and Sun, Yi and Kong, Rui and Wang, Yile and Geng, Hanfei and Luan, Jian and Jin, Xuefeng and Ye, Zilong and Xiong, Guanjing and Zhang, Fan and Li, Xiang and Xu, Mengwei and Li, Zhijun and Li, Peng and Liu, Yang and Zhang, Ya-Qin and Liu, Yunxin},
  year    = {2024},
  url     = {https://arxiv.org/abs/2401.05459},
  journal = {arXiv preprint arXiv:2401.05459}
}
2022
- DIMMining: pruning-efficient and parallel graph mining on near-memory-computing
  Guohao Dai, Zhenhua Zhu, Tianyu Fu, Chiyue Wei, Bangyan Wang, Xiangyu Li, Yuan Xie, Huazhong Yang, and Yu Wang
  In Proceedings of the 49th Annual International Symposium on Computer Architecture (ISCA '22), New York, New York, 2022
Graph mining, which finds specific patterns in a graph, is becoming increasingly important in various domains. We point out that accelerating graph mining faces the following challenges: (1) Heavy comparison for pruning: pruning is widely used to reduce the search space in graph mining; it applies constraints on vertex indices and involves massive index comparisons. (2) Low parallelism of set operations: typical graph mining algorithms can be expressed as a series of set operations between the neighbors of vertices, which suffer from low parallelism if vertices are streamed to the computation units. (3) Heavy data transfer: graph mining needs to transfer intermediate data two orders of magnitude larger than the original data volume between the CPU and memory. To tackle these challenges, we propose DIMMining with four techniques spanning the algorithm and architecture perspectives. The Index Pre-comparison scheme is proposed for efficient pruning: we introduce the self anchor and neighbor partition to enable pre-comparison of vertex indices, thus reducing comparisons at runtime. We propose a Flexible BCSR (Bitmap with Compressed Sparse Row) format to enable parallelism for set operations from the data structure perspective, which works on continuous vertices without memory space overhead. The Systolic Merge Array is designed to further exploit parallelism on discontinuous vertices from the architecture perspective. Finally, we propose a DIMM-based Near-Memory-Computing architecture, which eliminates the large-volume data transfer between computation and memory. Extensive experimental results on real-world graphs show that DIMMining achieves 222.23X and 139.51X speedup compared with FPGAs and CPUs, respectively, and 3.61X speedup over the state-of-the-art graph mining architecture.
@inproceedings{dai2022dimmining,
  author    = {Dai, Guohao and Zhu, Zhenhua and Fu, Tianyu and Wei, Chiyue and Wang, Bangyan and Li, Xiangyu and Xie, Yuan and Yang, Huazhong and Wang, Yu},
  title     = {DIMMining: pruning-efficient and parallel graph mining on near-memory-computing},
  year      = {2022},
  isbn      = {9781450386104},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3470496.3527388},
  doi       = {10.1145/3470496.3527388},
  booktitle = {Proceedings of the 49th Annual International Symposium on Computer Architecture},
  pages     = {130-145},
  numpages  = {16},
  keywords  = {systolic merge array, near-memory-computing, graph mining},
  location  = {New York, New York},
  series    = {ISCA '22}
}
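The neighbor-set operations that dominate graph mining (e.g., counting common neighbors when matching triangle or clique patterns) can be illustrated with a tiny bitmap-based intersection, which hints at why a bitmap representation admits word-level parallelism that a streaming merge of sorted lists does not. This is a minimal, editor-added sketch with assumed names; it is not the paper's Flexible BCSR layout or Systolic Merge Array.

```python
# Toy illustration of a bitmap-based neighbor-set intersection, the kind of
# set operation that graph-mining patterns reduce to.
# NOTE: editor-added sketch of the general idea, not the Flexible BCSR format
# or the Systolic Merge Array described in the paper.

def to_bitmap(neighbors):
    """Pack a neighbor list into an integer used as a bit vector."""
    bits = 0
    for v in neighbors:
        bits |= 1 << v
    return bits

def common_neighbor_count(bitmap_u, neighbors_v):
    """Count common neighbors by testing each neighbor of v against u's bitmap."""
    return sum(1 for v in neighbors_v if (bitmap_u >> v) & 1)

# Example: vertex 0 has neighbors {1, 2, 4}, vertex 1 has neighbors {0, 2, 3, 4};
# their common neighbors are {2, 4}, so the edge (0, 1) closes two triangles.
adj = {0: [1, 2, 4], 1: [0, 2, 3, 4]}
print(common_neighbor_count(to_bitmap(adj[0]), adj[1]))  # -> 2
```

Bitmaps pay off when vertex IDs are contiguous; for sparse, discontinuous neighborhoods a compressed representation plus a merge-style unit is the usual complement, which is the trade-off the paper's hybrid design targets.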