Enhancing GPTQv2 Format Support in vLLM: Analysis and Implementation
Deep technical analysis of GPTQv2 format limitations in vLLM, and implementation of CUDA kernel adaptations to enable efficient low-bit/asymmetric quantization inference.
Recent VLAs have evolved from discrete to continuous, and from single-system (System 1 only) to dual-system.
Reading notes on Dario Amodei's blog.
Quickly setting up Android smartphones for development.
Quickly setting up Termux on Android smartphones for development.
Quickly setting up new single-board computers like Raspberry Pi.
A casual chat about deploying large models on edge devices.