Janghwan Lee

I am currently in my sixth year of the integrated Ph.D. program in Electronic Engineering at Hanyang University. My research focuses on deep learning algorithms for efficient hardware, with an emphasis on quantization of Transformer models and reduced-precision numerical formats. I am also interested in sparsity for lightweight AI inference. I conduct my research in the Artificial Intelligence Hardware and Algorithm Lab under the guidance of Prof. Jungwook Choi.

selected publications

Preprint

ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

Sihwa Lee*, Janghwan Lee*, Donghoon Yoo, Jae Gon Kim, Hanyul Ryu, Soojung Ryu, and Jungwook Choi

Preprint, 2026

Large reasoning models (LRMs) improve complex problem-solving by generating long intermediate reasoning traces, but this substantially increases inference costs. NVFP4 inference offers a promising approach to reduce both computational and memory costs through hardware-supported low-precision execution. However, directly applying NVFP4 to LRMs introduces two practical limitations: reasoning accuracy degrades under quantization, and existing NVFP4 kernels do not fully realize latency benefits in small-batch autoregressive decoding. In this work, we analyze the effect of NVFP4 quantization on token-level uncertainty during reasoning. We show that quantization increases incorrect sampling at low-entropy symbolic tokens, while causing over-concentration on a small set of tokens in high-uncertainty reasoning steps. Based on this observation, we propose ReSET, a reasoning-step entropy-based temperature-scaling method that estimates step-level uncertainty online and adapts the decoding temperature using both token-level and step-level entropy signals. To address the latency gap, we further design a CUDA-core small-M NVFP4 kernel for latency-critical autoregressive decoding. Across reasoning benchmarks and model scales, ReSET improves NVFP4 reasoning accuracy by up to ∼2 points over the NVFP4 baseline. Our CUDA-core small-M kernel further improves latency-critical decoding, delivering up to 2.5× kernel-level speedup over NVFP4 vLLM and approximately 2× end-to-end decoding speedup over BF16.
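The idea of adapting the decoding temperature from token-level and step-level entropy can be illustrated with a minimal sketch. This is not the ReSET algorithm itself: the thresholds (`low`, `high`), the two temperatures (`t_sharp`, `t_flat`), the running-mean step estimate, and all function names are hypothetical choices made only for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp((x - peak) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

class StepEntropyTracker:
    """Running mean of token entropies inside the current reasoning step."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, token_entropy):
        self.total += token_entropy
        self.count += 1
        return self.total / self.count

    def reset(self):
        # Call at a step boundary (how boundaries are detected is not modeled here).
        self.total, self.count = 0.0, 0

def pick_temperature(token_entropy, step_entropy,
                     low=0.5, high=1.0, t_sharp=0.7, t_flat=1.2):
    """Hypothetical rule: sharpen confident tokens, flatten uncertain steps."""
    if token_entropy < low:
        return t_sharp   # low-entropy token: reduce the chance of a wrong sample
    if step_entropy > high:
        return t_flat    # uncertain step: counter over-concentration
    return 1.0

tracker = StepEntropyTracker()
logits = [5.0, 0.0, 0.0, 0.0]          # a confident, low-entropy token
h = entropy(softmax(logits))
t = pick_temperature(h, tracker.update(h))
probs = softmax(logits, t)
```

In this toy run the token is low-entropy, so the rule picks the sharper temperature and the resampled distribution becomes more peaked.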

ICML 2026 Oral

ReQAT: Achieving Full-Precision Reasoning Accuracy with 4-bit Floating-Point Quantization-Aware Training

Janghwan Lee, Sihwa Lee, Jinseok Kim, Yongjik Kim, Jieun Lim, Jinwook Oh, and Jungwook Choi

In Forty-third International Conference on Machine Learning (ICML, Oral), 2026

Large Reasoning Models (LRMs) achieve strong problem-solving through long chain-of-thought, but their deployment is constrained by the high cost of full-precision inference and growing KV cache footprints. Microscaled FP4 formats enable efficient FP4 deployment; however, fully quantizing weights, activations, and KV caches (W4A4KV4) causes severe reasoning degradation that existing PTQ and QAT methods fail to recover. We identify that FP4 failures concentrate on low-entropy tokens (precise symbolic commitments such as digits and operators), where quantization noise inflates sampling errors that cascade through reasoning traces. Based on this insight, we propose ReQAT, a reasoning-centric FP4 training framework with three components: (i) Trace-Aligned QAT (TAQ), which revisits identical reasoning traces to focus updates on critical low-entropy decisions; (ii) Selective Entropy Minimization (SEM), which reinforces confidence at low-entropy positions; and (iii) Q-FIT, a quantization-friendly initialization that jointly calibrates RoPE-consistent KV cache transformations to stabilize QAT. Under the same training budget, ReQAT not only recovers but surpasses BF16 fine-tuning accuracy while delivering up to 3.9× throughput speedup on NVIDIA DGX Spark and 3.1× on B200. This is the first demonstration that FP4 QAT can exceed full-precision accuracy for LRMs with over 3× speedup on production hardware.

ACL 2025 Findings

AMXFP4: Taming Activation Outliers with Asymmetric Microscaling Floating-Point for 4-bit LLM Inference

Janghwan Lee, Jiwoong Park, Jinseok Kim, Yongjik Kim, Jungju Oh, Jinwook Oh, and Jungwook Choi

In Findings of the Association for Computational Linguistics (ACL Findings), 2025

Scaling Large Language Models (LLMs) with extended context lengths has increased the need for efficient low-bit quantization to manage their substantial computational demands. However, reducing precision to 4 bits frequently degrades performance due to activation outliers. To address this, we propose Asymmetric Microscaling 4-bit Floating-Point (AMXFP4) for efficient LLM inference. This novel data format leverages asymmetric shared scales to mitigate outliers while naturally capturing the asymmetry introduced by group-wise quantization. Unlike conventional 4-bit quantization methods that rely on data rotation and costly calibration, AMXFP4 uses asymmetric shared scales for direct 4-bit casting, achieving near-ideal quantization accuracy across various LLM tasks, including multi-turn conversations, long-context reasoning, and visual question answering. Our AMXFP4 format significantly outperforms MXFP4 and other leading quantization techniques, enabling robust, calibration-free 4-bit inference.
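To make the idea of asymmetric shared scales concrete, the sketch below compares one shared scale per group against separate scales for the positive and negative values of that group. It is a simplified reading, not the AMXFP4 format definition: it assumes real-valued scales (microscaling formats typically store a shared power-of-two scale), a per-sign split, and hypothetical function names. Only the FP4 E2M1 magnitude grid is standard.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign is handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = FP4_GRID[-1]

def snap(x, scale):
    """Round x to the nearest FP4 magnitude under `scale`, keeping its sign."""
    nearest = min(FP4_GRID, key=lambda g: abs(g - abs(x) / scale))
    return math.copysign(nearest * scale, x)

def safe_scale(peak):
    """Map the largest magnitude onto FP4_MAX; fall back to 1.0 for an empty side."""
    return peak / FP4_MAX if peak > 0.0 else 1.0

def quantize_symmetric(group):
    """One shared scale for the whole group."""
    scale = safe_scale(max(abs(x) for x in group))
    return [snap(x, scale) for x in group]

def quantize_asymmetric(group):
    """Assumed variant: separate shared scales for positive and negative values."""
    pos_scale = safe_scale(max(group))
    neg_scale = safe_scale(-min(group))
    return [snap(x, pos_scale if x >= 0.0 else neg_scale) for x in group]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# A group with one large positive outlier and small negative values.
group = [-0.3, -0.2, 0.1, 6.0]
sym_err = mse(group, quantize_symmetric(group))
asym_err = mse(group, quantize_asymmetric(group))
```

With a single scale, the positive outlier sets the step size for every element, so the small negative values are rounded coarsely. With per-sign scales the negative side gets its own finer grid, and the group error drops in this example.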

AAAI 2025

RILQ: Rank-Insensitive LoRA-based Quantization Error Compensation for Boosting 2-bit Large Language Model Accuracy

Geonho Lee*, Janghwan Lee*, Sukjin Hong*, Minsoo Kim, Euijai Ahn, Du-Seong Chang, and Jungwook Choi

In The 39th Annual AAAI Conference on Artificial Intelligence (AAAI), 2025

Low-rank adaptation (LoRA) has become the dominant method for parameter-efficient LLM fine-tuning, with LoRA-based quantization error compensation (LQEC) emerging as a powerful tool for recovering accuracy in compressed LLMs. However, LQEC has underperformed in sub-4-bit scenarios, and no prior work has investigated this limitation. We propose RILQ (Rank-Insensitive LoRA-based Quantization Error Compensation) to understand this fundamental limitation and boost 2-bit LLM accuracy. Based on rank analysis revealing that the model-wise activation discrepancy loss is rank-insensitive, RILQ employs this loss to adjust adapters cooperatively across layers, enabling robust error compensation with low-rank adapters. Evaluations on LLaMA-2 and LLaMA-3 demonstrate RILQ’s consistent improvements in 2-bit quantized inference across various state-of-the-art quantizers and enhanced accuracy in task-specific fine-tuning. RILQ maintains computational efficiency comparable to existing LoRA methods, enabling adapter-merged weight-quantized LLM inference with significantly enhanced accuracy, making it a promising approach for boosting 2-bit LLM performance.
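For readers unfamiliar with LQEC, the sketch below shows the basic weight-space version of the idea: quantize a weight matrix, then fit a low-rank adapter to the quantization error. This is the generic baseline, not RILQ, which instead optimizes a model-wise activation discrepancy loss. The 2-bit uniform quantizer, the rank-1 power iteration, the example matrix, and all names are assumptions for illustration.

```python
import math

def quantize_2bit(w):
    """Uniform symmetric 2-bit quantizer with integer levels -2..1 (illustrative)."""
    scale = max(abs(x) for row in w for x in row) / 2.0
    return [[max(-2, min(1, round(x / scale))) * scale for x in row] for row in w]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius(m):
    return math.sqrt(sum(x * x for row in m for x in row))

def rank1_adapter(err, iters=50):
    """Power iteration: returns (u, v) with err approximately outer(u, v).

    Assumes `err` is not the zero matrix; u has unit norm and v carries the scale.
    """
    err_t = transpose(err)
    v = [1.0] * len(err[0])
    for _ in range(iters):
        u = matvec(err, v)
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]
        v = matvec(err_t, u)
    return u, v

W = [[0.9, -0.4, 0.1],
     [0.3, 0.8, -0.7],
     [-0.2, 0.5, 0.6]]
Q = quantize_2bit(W)
E = [[w - q for w, q in zip(rw, rq)] for rw, rq in zip(W, Q)]   # quantization error
u, v = rank1_adapter(E)
residual = [[e - ui * vj for e, vj in zip(row, v)] for row, ui in zip(E, u)]

error_before = frobenius(E)          # error of Q alone
error_after = frobenius(residual)    # error of Q plus the rank-1 adapter
```

The adapter removes the dominant direction of the weight error, so `error_after` is smaller than `error_before`. How much of the error a low-rank adapter can absorb depends on how concentrated the error spectrum is.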

ACL 2024 Oral

Improving Conversational Abilities of Quantized Large Language Models via Direct Preference Alignment

Janghwan Lee*, Seongmin Park*, Sukjin Hong, Minsoo Kim, Du-Seong Chang, and Jungwook Choi

In The 62nd Annual Meeting of the Association for Computational Linguistics (ACL, Oral), 2024

The rapid advancement of large language models (LLMs) has facilitated their transformation into conversational chatbots that can grasp contextual nuances and generate pertinent sentences, closely mirroring human values through advanced techniques such as instruction tuning and reinforcement learning from human feedback (RLHF). However, the computational efficiency required for LLMs, achieved through techniques like post-training quantization (PTQ), presents challenges such as token-flipping that can impair chatbot performance. In response, we propose a novel preference alignment approach, quantization-aware direct preference optimization (QDPO), that aligns quantized LLMs with their full-precision counterparts, improving conversational abilities. Evaluated on two instruction-tuned LLMs in various languages, QDPO demonstrated superior performance in improving conversational abilities compared to established PTQ and knowledge-distillation fine-tuning techniques, marking a significant step forward in the development of efficient and effective conversational LLMs.

EMNLP 2023

Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization

Janghwan Lee*, Minsoo Kim*, Seungcheol Baek, Seokjoong Hwang, Wonyong Sung, and Jungwook Choi

In The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Large Language Models (LLMs) are proficient in natural language processing tasks, but their deployment is often restricted by extensive parameter sizes and computational demands. This paper focuses on post-training quantization (PTQ) in LLMs, specifically 4-bit weight and 8-bit activation (W4A8) quantization, to enhance computational efficiency, a topic less explored than weight-only quantization. We present two innovative techniques, activation-quantization-aware scaling (AQAS) and sequence-length-aware calibration (SLAC), which enhance PTQ by considering the combined effects on weights and activations and by aligning calibration sequence lengths to target tasks. Moreover, we introduce dINT, a hybrid data format combining integer and denormal representations, to address the underflow issue in W4A8 quantization, where small values are rounded to zero. Through rigorous evaluations of LLMs, including OPT and LLaMA, we demonstrate that our techniques significantly boost task accuracies to levels comparable with full-precision models. By developing arithmetic units compatible with dINT, we further confirm that our methods yield a 2× hardware efficiency improvement compared to an 8-bit integer MAC unit.
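The underflow issue mentioned above can be seen in a small sketch: on a uniform integer grid, any value smaller than half a quantization step rounds to zero. The code does not implement dINT, whose encoding is not described here. It only measures how many nonzero values collapse to zero at a given bit width, using a hypothetical per-tensor symmetric quantizer and made-up example values.

```python
def quantize_int(values, bits):
    """Symmetric per-tensor integer quantization (illustrative)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]

def underflow_fraction(values, bits):
    """Fraction of nonzero inputs that are rounded to exactly zero."""
    quantized = quantize_int(values, bits)
    pairs = [(v, q) for v, q in zip(values, quantized) if v != 0.0]
    return sum(1 for _, q in pairs if q == 0.0) / len(pairs)

# A few large values set the scale; the small ones fall below half a step.
values = [4.0, 0.2, -0.15, 0.1, 0.05, -0.02, 1.0, -2.0]
lost_at_4bit = underflow_fraction(values, 4)
lost_at_8bit = underflow_fraction(values, 8)
```

In this example the five small entries all vanish on the 4-bit grid, while the 8-bit grid keeps every value nonzero.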

ICASSP 2023

Finding Optimal Numerical Format for Sub-8-Bit Post-Training Quantization of Vision Transformers

Janghwan Lee, Youngdeok Hwang, and Jungwook Choi

In 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023

Vision Transformers (ViTs) have gained significant attention for their exceptional accuracy on computer vision applications, but their demanding memory requirements and computational complexity have hindered active deployment. Post-training quantization (PTQ) is a practical method to tackle this challenge by directly reducing ViT’s bit-precision. However, the diverse data characteristics across different operations of a ViT cannot be well captured by a single numerical format (fixed or floating-point). This work proposes an analytical framework that optimizes the numerical format of each matrix multiplication of ViTs for mixed-format sub-8-bit quantization. The extensive evaluation demonstrates that the proposed method can reduce the PTQ error and achieve state-of-the-art accuracy for popular ViT models.