<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet href="/scripts/pretty-feed-v3.xsl" type="text/xsl"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:h="http://www.w3.org/TR/html4/"><channel><title>TheUnknownBlog</title><description>Stay Hungry, Stay Foolish</description><link>https://20051110.xyz</link><item><title>装一台新电脑</title><link>https://20051110.xyz/blog/arch-install</link><guid isPermaLink="true">https://20051110.xyz/blog/arch-install</guid><description>I use arch btw.</description><pubDate>Sat, 21 Feb 2026 21:57:00 GMT</pubDate><content:encoded>&lt;p&gt;本来以为装一台新电脑没什么好写的，后来仔细一想，发现其实还是有不少坑的。从选硬件，到装系统，然后我又折腾了好久才把 Headless mode 的 game streaming 搞定，再是桌面美化。索性就把整个过程写下来，给以后自己和大家做个参考。&lt;/p&gt;
&lt;h2&gt;Picking the hardware&lt;/h2&gt;
&lt;p&gt;The parts list first:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;CPU: Intel Core i5-13490F&lt;/li&gt;
&lt;li&gt;Motherboard: Colorful B760M-T WIFI DDR4&lt;/li&gt;
&lt;li&gt;RAM: Corsair Vengeance Pro DDR4 3200MHz 16GB + ADATA 万紫千红 DDR4 2400MHz 8GB&lt;/li&gt;
&lt;li&gt;GPU: ASRock RX 9070 GRE Steel Legend&lt;/li&gt;
&lt;li&gt;Storage: WD SN570 1TB NVMe SSD&lt;/li&gt;
&lt;li&gt;PSU: Thermalright SP750 750W 80PLUS Platinum, fully modular&lt;/li&gt;
&lt;li&gt;Case: just grabbed something random&lt;/li&gt;
&lt;li&gt;Cooler: just grabbed something random&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All of the above cost me about ¥4900. Note that this figure already subtracts what I earned by selling two RAM sticks from home; without that, the total is around ¥5450.&lt;/p&gt;
&lt;h3&gt;Answering some questions&lt;/h3&gt;
&lt;p&gt;To preempt the nitpickers, let me answer a few questions I expect to get.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why build a new PC at all?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;RAM prices are insane as I write this: a 16GB DDR5 6400MHz stick costs ¥1300, and even a 16GB DDR4 3200MHz stick runs ¥600 on JD.com. So why build now? As you may know, my daily machine is a MacBook Air M3 and I normally develop over SSH. The old dev box is an i5-4440S, thirteen years old now; not exactly unusable, but not a pleasant experience either. &lt;del&gt;Actually I mainly wanted a PC that can play games; everything above is just an excuse.&lt;/del&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why not just buy a prebuilt?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Only fools buy prebuilts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why mix RAM sticks?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Because it&apos;s cheap! The 8GB stick came out of an old PC at home and had no other use anyway. You might object: &quot;doesn&apos;t mixing sticks cap the frequency?&quot; Good question! But somehow that ADATA stick of mine is a golden sample. I flipped the one-click XMP switch in the BIOS and it ran straight at 3200MHz with 18-20-20-40 timings, no problems at all. (The timings are a bit loose, but still better than running at 2400MHz.) After it passed MemTest86 I stopped worrying.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Where did the SSD come from?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The NVMe SSD was the expansion drive I bought for my MacBook Air M3 over the summer for ¥330. Hilariously, the same drive now goes for ¥899.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;13th-gen Core degrades and you still bought one?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Good question! But guess what? The i5-13490F is a China-market-only SKU, and to save cost Intel built it on the same design as 12th-gen Alder Lake, so it doesn&apos;t degrade 🙂‍↔️ Its core configuration is nearly identical to the i5-12600KF. (The i5-13400F shipped with a mix of Alder Lake and Raptor Lake dies, and people used to hope to win the &quot;lottery&quot; and get a Raptor Lake die, which ironically turned out to be the one that degrades. Who saw that twist coming!)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why not AMD?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;On AMD, the only architecture that still takes DDR4 is Zen 3, which feels a bit too old by now. Also, AMD platform performance is tightly coupled to memory frequency (the Infinity Fabric clock is tied to the memory clock), and I simply cannot afford DDR5.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;With those questions answered, you should have a decent picture of how I picked parts. A few more things I paid attention to while choosing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Prefer a PSU that supports the ATX 3.1 standard. GPU power draw keeps climbing, and ATX 3.1 specifies supply requirements for the 12VHPWR connector as well as peak-power behavior, so it protects your hardware better.&lt;/li&gt;
&lt;li&gt;Prefer a motherboard with built-in WiFi and Bluetooth. This GPU is an upper-midrange model and physically large (it takes up 2.9 slots), which effectively blocks the bottom PCIe x1 slot on the board. Without onboard WiFi/Bluetooth there would be no slot left for a wireless card. (A USB WiFi adapter would still work, but I&apos;m not a fan of those.)&lt;/li&gt;
&lt;li&gt;Prefer a GPU whose warranty accepts RMAs directly from individuals. Not many GPU vendors do nowadays; ASUS, Gigabyte, MSI, Colorful and a few others do, but ASRock does not, so you would have to go through a reseller. (Which is why I bought from JD.com first-party: JD handles the RMA for me, with no finger-pointing.)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Assembling the PC&lt;/h2&gt;
&lt;p&gt;Assembly itself isn&apos;t hard; there are plenty of build videos and guides online. But the CPU installation tripped me up, because it differs from the machines I had built before.&lt;/p&gt;
&lt;p&gt;This build uses a 13th-gen Intel CPU on the LGA 1700 socket. I had only ever installed LGA 115x (6th-gen Core) and LGA 3647 (1st-gen Xeon Scalable) CPUs. Two differences stood out: (1) you don&apos;t simply drop the CPU in and close the retention lever; while pressing the lever down you also have to hold down the top-left corner of the bracket, or it won&apos;t latch. My guess is this is because the CPU is no longer square but rectangular. (2) When mounting the cooler bracket, you need to pull the four corners of the backplate outward to widen the standoff spacing before they will seat into the motherboard holes.&lt;/p&gt;
&lt;p&gt;Let&apos;s just admire the finished build 🤗
&lt;img src=&quot;https://20051110.xyz/_astro/finish.BiooLjn8_Z1Uxr7X.webp&quot; alt=&quot;The finished build&quot;&gt;&lt;/p&gt;
&lt;p&gt;btw, I genuinely don&apos;t like RGB, but the industry consensus is that a GPU without RGB is roughly the most stripped-down card of its tier, and I didn&apos;t want the bargain-bin model either. I&apos;ll figure out later how to turn off the RGB on the CPU cooler and the GPU; otherwise this PC becomes a walking disco.&lt;/p&gt;
&lt;p&gt;The glass side panel isn&apos;t on yet (I still plan to add three fans), so it looks a bit messy; I&apos;ll post another photo once the fans are in. The cables also need a proper tidy, but not now: I built this at home, and when I move it to school I&apos;ll have to pull the GPU and CPU cooler anyway, so cable management can wait.&lt;/p&gt;
&lt;h2&gt;Installing the OS&lt;/h2&gt;
&lt;p&gt;I installed Arch Linux, and tried something different this time: PXE Boot plus the Arch Linux Netboot image, with no USB stick involved at any point. Very satisfying. Let me share how I set up the PXE Boot environment.&lt;/p&gt;
&lt;h3&gt;Setting up a PXE Boot environment&lt;/h3&gt;
&lt;p&gt;PXE Boot is basically booting a computer over the network. I won&apos;t belabor how it works here (if you&apos;re curious: it is built on DHCP and TFTP, and I covered what DHCP is in my previous blogpost if you&apos;re interested). Operationally: first enable a firmware option called &quot;Network Stack&quot; (a name that sounds like it has nothing to do with network booting), reboot back into the BIOS, and you will most likely find a &quot;PXE Boot&quot; entry in the boot tab.&lt;/p&gt;
&lt;p&gt;That option alone does nothing, though; we still have to feed a specific bootloader file to the machine. In short, we need to run a server whose job is to serve that bootloader to the computer.&lt;/p&gt;
&lt;p&gt;I followed &lt;a href=&quot;https://jah.io/easy-mode-pxe-boot&quot;&gt;this article&lt;/a&gt;, with one tweak. The article says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Go Version 1.18 changed the way that go get works. As of that version it now manages module dependencies and no longer fetches, compiles and installs tools/binaries built in Go. You want go install for that as of 1.18, however, pixiecore won&apos;t build in 1.18, so you need to run go install using go 1.17, no newer. This is because the project is basically abandonware now for whatever reason.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There&apos;s actually no need to go to the trouble of installing a Go 1.17 environment. Given that most of us run a modern toolchain (mine is 1.24), we can clone the source ourselves and build it with a current Go. Concretely:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/danderson/netboot.git
cd netboot
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then build it:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;go build -o pixiecore-bin ./cmd/pixiecore
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With pixiecore built, we can use netboot.xyz, a very handy PXE boot menu. First download its UEFI bootloader:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;wget https://boot.netboot.xyz/ipxe/netboot.xyz.efi
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then run pixiecore:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo ./pixiecore-bin boot &quot;netboot.xyz.efi&quot; --bootmsg &quot;booting from pxe&quot; -d --ipxe-efi64 &quot;netboot.xyz.efi&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And that&apos;s the whole PXE Boot server. Now just pick PXE Boot on the target machine (both machines must be on the same LAN, and the machine being booted needs a wired connection). In the terminal running pixiecore you will see something like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20051110.xyz/_astro/netboot.D_rfV1B3_299HnA.webp&quot; alt=&quot;PXE Boot logs&quot;&gt;&lt;/p&gt;
&lt;p&gt;And on the target machine you will see a screen like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20051110.xyz/_astro/server.DpRktS2s_ZX1SQK.webp&quot; alt=&quot;PXE Boot screen&quot;&gt;&lt;/p&gt;
&lt;p&gt;Once the progress bar finishes, you land in the netboot.xyz main menu:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20051110.xyz/_astro/netboot.Bspj4UvV_Z21g2Ug.webp&quot; alt=&quot;netboot.xyz main menu&quot;&gt;&lt;/p&gt;
&lt;p&gt;Select &quot;Linux Network Installs&quot; -&gt; &quot;Arch Linux&quot;; it downloads the Arch Linux Netboot image and boots it, and from there you follow the installation steps on the Arch Wiki. You can of course install other distributions the same way, such as Ubuntu or Fedora. I haven&apos;t tried installing Windows with it (it does offer a Windows option); if you have, tell me in the comments how it went.&lt;/p&gt;
&lt;p&gt;If downloading the Arch Linux Netboot image is slow or fails for you, that is a mirror problem. I haven&apos;t found a way to point netboot.xyz&apos;s Arch netboot image at a Chinese mirror. If your connectivity is fine, just download it; it isn&apos;t big (about 700MB). If your network is poor, I&apos;d recommend Arch&apos;s official PXE instead: its front page asks you to pick a region, and choosing China offers domestic mirrors such as Aliyun, Tsinghua University, and Nanjing University, which are much faster. For details, see the &lt;a href=&quot;https://wiki.archlinux.org/title/Netboot&quot;&gt;Netboot page on the Arch Wiki&lt;/a&gt;. The logic is exactly the same; just replace &lt;code&gt;netboot.xyz.efi&lt;/code&gt; in the command with the .efi file downloaded from Arch.&lt;/p&gt;
&lt;p&gt;Under the Utilities tab there is also a &quot;memtest86+ (v8.0.0)&quot; entry for testing whether your RAM is healthy, which is very handy. That is what I used to validate my mixed-stick memory.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20051110.xyz/_astro/memtest.6hp9_tVl_ZTtGv.webp&quot; alt=&quot;Memtest Pass&quot;&gt;&lt;/p&gt;
&lt;p&gt;Here&apos;s a screenshot after installing Arch Linux (ricing comes later; this is not the final look):&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20051110.xyz/_astro/fastfetch.BMF_lr-d_Z1l4MOl.webp&quot; alt=&quot;Arch Linux&quot;&gt;&lt;/p&gt;
&lt;h2&gt;玩游戏&lt;/h2&gt;
&lt;p&gt;Obviously a GPU this nice is meant for gaming, and I deliberately picked an AMD card for the sake of Linux (So NVIDIA, Fuck You). AMD cards enjoy excellent community driver support; archinstall even prompts you to choose between AMD&apos;s proprietary driver and the community driver. The community driver usually performs better; the proprietary one targets big enterprise customers and is (perhaps) more stable.&lt;/p&gt;
&lt;h3&gt;Undervolting and raising the power limit&lt;/h3&gt;
&lt;p&gt;We want to both undervolt and raise the power limit here. As for why I don&apos;t call it &quot;overclocking&quot;, let me explain:&lt;/p&gt;
&lt;p&gt;Modern AMD GPUs ship with very conservative base clocks (my 9070 GRE&apos;s base clock tops out at 2350MHz), and at that frequency the card cannot come close to its power budget (mine draws only around 160~200W against a 240W TDP). With the Performance Level set to &quot;Automatic&quot; (see the screenshot below), the GPU will try to boost as high as it can in order to &lt;strong&gt;use up the power budget&lt;/strong&gt;. In demanding games (Cyberpunk 2077, say), the GPU never reaches the maximum clock the board vendor advertises. So what really limits the frequency is usually not &quot;how far you overclocked&quot; but &quot;whether the GPU can reach that frequency at all&quot;, i.e. whether the power budget allows it.&lt;/p&gt;
&lt;p&gt;With that in mind, all we need is undervolting plus a higher power limit. Undervolting lowers power draw at any given frequency, so the card can clock higher; raising the power limit serves the same goal.&lt;/p&gt;
&lt;p&gt;The exact values differ from card to card, so do some research on Bilibili first: search for &quot;your GPU model + overclocking&quot;, watch five or six videos plus their comment sections, and you&apos;ll get a sense of the &quot;average silicon quality&quot; of your model. Start from those settings and adjust.&lt;/p&gt;
&lt;p&gt;On Arch I use LACT. I do not recommend hand-editing the GPU&apos;s profile. A ready-made tuning tool brings two benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It ships with a daemon, so your settings apply automatically at startup and won&apos;t get overridden.&lt;/li&gt;
&lt;li&gt;It lists the options that are actually available. Editing the config files directly is easy to get wrong, and not every knob applies to every GPU generation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For my card, the configuration that ended up game-stable is: VRAM +100MHz, GPU voltage -85mV, power limit 240W -&gt; 264W. It also passed tests at VRAM +150MHz and -100mV, but I don&apos;t recommend daily-driving that. Passing one benchmark (Superposition, or 3DMark) does not mean every game is stable. On Windows, instability shows up as the infamous &quot;AMD driver timeout&quot;; on Linux, the desktop freezes, then recovers a moment later with a &quot;GPU has been reset&quot; notification. If that happens to you mid-game, your settings are not stable. I hit it a few times in the first days, and backing off the undervolt a little fixed it. Stability comes first; passing a stress test is only a reference point.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20051110.xyz/_astro/lact.eIpXKMzD_mIo8g.webp&quot; alt=&quot;Undervolt and overclock in LACT&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Enabling FSR4 support&lt;/h3&gt;
&lt;p&gt;The first time I played Cyberpunk 2077, the game offered FSR only up to version 3.1. A quick search showed that AMD announced FSR 4.0 support for Cyberpunk 2077 long ago (reports date back to April 2025), so why was 3.1 my ceiling?&lt;/p&gt;
&lt;p&gt;It turned out Steam&apos;s Proton was too old. Proton is a compatibility layer that lets Windows games run on Linux. It updates fairly quickly, but it can still lag behind the games&apos; own updates on Windows. Following community advice, we need to switch to CachyOS&apos;s Proton build (CachyOS is an Arch-based distribution optimized for gaming). After switching, FSR4 became selectable.&lt;/p&gt;
&lt;p&gt;Install the &lt;code&gt;proton-cachyos&lt;/code&gt; package:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;yay -S proton-cachyos
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The build takes about an hour... don&apos;t start it right before bed, or it will hold you hostage. Once it&apos;s done, open the game&apos;s &quot;Properties&quot; in Steam and switch the Proton version to the CachyOS one.&lt;/p&gt;
&lt;h3&gt;Headless game streaming&lt;/h3&gt;
&lt;p&gt;With the system installed, I started on game streaming in headless mode. I went with the Sunshine + Moonlight combo. Sunshine is an open-source, self-hosted game streaming server; Moonlight is an open-source NVIDIA GameStream client that runs on many kinds of devices.&lt;/p&gt;
&lt;p&gt;Installing Sunshine is easy; Arch Linux users can grab it straight from the AUR:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;yay -S sunshine-git
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once installed, run Sunshine and set your password and codecs in the web UI (note: access it over https). Then install Moonlight on your client device, pair it with your Sunshine server (the pairing PIN is also entered in the web UI), and start playing.&lt;/p&gt;
&lt;p&gt;But this way the monitor has to stay on while you play. Turn it off and Sunshine goes on strike with &quot;Streaming Error,... Is the host display turned on?&quot;. Leaving the monitor on forever is no solution; we must save energy! So we need Sunshine to work in headless mode too. There are two ways to solve this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Option 1: buy an HDMI dummy plug from your favorite shopping site and stick it into the GPU&apos;s HDMI port. The machine then believes a monitor is attached, and Sunshine stops complaining.&lt;/li&gt;
&lt;li&gt;Option 2: fake a display via kernel parameters. This one is a bit more technical, so let&apos;s go through it in detail:&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The idea is to make the Linux kernel believe, from boot, that &quot;a monitor is plugged into this connector&quot;. First, prepare an EDID file; you need one to tell the GPU &quot;this is the kind of monitor connected to me&quot;. Usually there is no need to generate one from scratch: the easiest route is to dump the EDID of the monitor you already own.&lt;/p&gt;
&lt;p&gt;Install &lt;code&gt;edid-decode&lt;/code&gt; first if you don&apos;t have it:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;yay -S edid-decode
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then take a look in this directory:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;ls /sys/class/drm
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Everyone&apos;s layout will differ here. Mine looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;card1  card1-DP-1  card1-DP-2  card1-DP-3  card1-HDMI-A-1  card1-Writeback-1  renderD128  version
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;card1&lt;/code&gt; holds some GPU-level configuration we don&apos;t care about. The rest: &lt;code&gt;card1-DP-1&lt;/code&gt;, &lt;code&gt;card1-DP-2&lt;/code&gt;, &lt;code&gt;card1-DP-3&lt;/code&gt; are the three DisplayPort connectors, and &lt;code&gt;card1-HDMI-A-1&lt;/code&gt; is the HDMI connector. My real monitor hangs off HDMI, so I dump the EDID from the &lt;code&gt;card1-HDMI-A-1&lt;/code&gt; directory:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cat /sys/class/drm/card1-HDMI-A-1/edid | edid-decode
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you see your monitor&apos;s details (resolution, refresh rate, vendor info, and so on), it worked. Now save the EDID to a file. Note that a plain &lt;code&gt;sudo cat ... &gt; file&lt;/code&gt; would fail, because the shell performs the redirect as your unprivileged user; use tee instead:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo mkdir -p /usr/lib/firmware/edid
cat /sys/class/drm/card1-HDMI-A-1/edid | sudo tee /usr/lib/firmware/edid/my_monitor.bin &gt; /dev/null
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, let&apos;s find an idle connector on the GPU. You could test each one with the &lt;code&gt;cat edid&lt;/code&gt; approach above (if &lt;code&gt;edid-decode&lt;/code&gt; complains about empty stdin, the connector is empty), but it is simpler to run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;for p in /sys/class/drm/*/status; do con=${p%/status}; echo -n &quot;${con#*/card?-}: &quot;; cat $p; done
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We just need a connector whose status is disconnected. For example, my output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;DP-1: disconnected
DP-2: disconnected
DP-3: disconnected
HDMI-A-1: connected
Writeback-1: unknown
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So I&apos;ll use the DP-3 connector for the fake display. Note: you&apos;d better not plug a real monitor into that port afterwards, or the two may conflict. Next we need to tell the kernel where the EDID file lives.&lt;/p&gt;
&lt;p&gt;Edit your bootloader configuration (GRUB or systemd-boot) and add the parameters that force-enable the display.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;If you use GRUB:
edit /etc/default/grub and append inside the GRUB_CMDLINE_LINUX_DEFAULT quotes:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;drm.edid_firmware=DP-3:edid/my_monitor.bin video=DP-3:e
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Remember to replace DP-3: with the connector name you found above. Likewise, &lt;code&gt;edid/my_monitor.bin&lt;/code&gt; is the file you saved in the first step (the path is relative to /usr/lib/firmware/ by default). The &lt;code&gt;video=DP-3:e&lt;/code&gt; part is the crucial bit: the :e means &quot;Enable&quot; (force-enable), telling the kernel to ignore the physical connection state.&lt;/p&gt;
&lt;p&gt;Then update GRUB:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo grub-mkconfig -o /boot/grub/grub.cfg
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If you use systemd-boot:
edit the matching .conf file under &lt;code&gt;/boot/loader/entries/&lt;/code&gt; and append the same string to the end of the options line. The file names and directory layout differ per person and per system, so you will need to look around. Mine is /boot/loader/entries/2026-01-20_14-25-25_linux-zen.conf, and I append to the end of its options line:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;drm.edid_firmware=DP-3:edid/my_monitor.bin video=DP-3:e
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Then, important ⚠️: regenerate the initramfs! The EDID file must be baked into the boot image so the GPU driver can read it when it loads.
Edit &lt;code&gt;/etc/mkinitcpio.conf&lt;/code&gt;, find the FILES=() line, and add your EDID file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;FILES=(/usr/lib/firmware/edid/my_monitor.bin)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then regenerate:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo mkinitcpio -P
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After a reboot, Sunshine keeps working even with the monitor switched off, and you can happily game in headless mode! Better yet, a clever desktop environment like KDE detects display changes automatically: with the real monitor attached you can disable the virtual display in the display settings, and when you switch the monitor off, KDE notices the current display is gone and fails over to the virtual one by itself. No manual intervention at all; very smart.&lt;/p&gt;
&lt;h2&gt;Switching desktops&lt;/h2&gt;
&lt;p&gt;Honestly, when I ran archinstall I picked niri as the desktop environment right away. (A big part of why I use Linux at all is that I covet niri! macOS has an analogue, PaperWM.spoon, but its animations aren&apos;t pretty and it is constrained by macOS&apos;s window management; it just can&apos;t be done well there.) But fresh out of the box niri was painfully barebones 😅 with all navigation relying entirely on the keyboard; unconfigured, it was simply unusable, so I went back and installed KDE Plasma 😅. After a round of ricing, though, I find today&apos;s niri perfectly usable, and it has become my daily-driver desktop. Here is how it looks now:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20051110.xyz/_astro/niri.By4jVghK_ZjfaP6.webp&quot; alt=&quot;niri&quot;&gt;&lt;/p&gt;
&lt;p&gt;When the Linux community shows off desktops, there is always a btop, a fastfetch, and something extra in the bottom-right corner. Honoring the tradition, I arranged mine the same way. Bottom-right is the Minecraft server my classmates and I run 😁.&lt;/p&gt;
&lt;p&gt;Taste differs from person to person, but let me get the ball rolling with a few ricing ideas I consider good:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Background blur: to me this is one of the most important ricing elements. It gives the UI a sense of depth and reduces visual fatigue. However, as of February 2026, niri does not support background blur; the terminal blur in the screenshot above comes from setting &quot;draw-border-with-background false&quot; in niri and letting the terminal draw its own blurred background. The good news: as I write this, niri&apos;s developer has already implemented background blur on his development branch; see this commit: &lt;a href=&quot;https://github.com/niri-wm/niri/commit/1d92c18aac07dc83e08e470ed315a6d36da3c19e&quot;&gt;Link to GitHub commit&lt;/a&gt;. If you want to try it, you can build the code on the https://github.com/niri-wm/niri/tree/wip/branch branch. Native background blur is coming to niri soon; stay tuned!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Waybar: a superb status bar tool, highly customizable. You can adjust its looks and features through its config files: system info, network status, battery status, and so on. There are plenty of ready-made themes online, or you can design your own; theming is just a pile of CSS, so in principle anyone who writes CSS can build the exact bar they want. A very nice base is &lt;a href=&quot;https://github.com/catppuccin/waybar&quot;&gt;this theme&lt;/a&gt;, which gives you that &quot;rounded pill&quot; look. Don&apos;t take my word for it; here is their official preview:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;preview.webp&quot; alt=&quot;Waybar&quot;&gt;&lt;/p&gt;
&lt;p&gt;You can add whatever modules you like; I added CPU usage, memory usage, and network status. You could add the currently playing track, a date-and-time module, even the current weather. Waybar&apos;s customizability is enormous; design the bar your needs call for. This is my Waybar right now:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20051110.xyz/_astro/my_waybar.CXyaqSH__Z1auyxd.webp&quot; alt=&quot;My Waybar&quot;&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Fuzzel: niri&apos;s default application launcher is fuzzel, and its default look is plain ugly. Thankfully its theming is simple: a standard .ini file defining a handful of color and font options. I went with the Tokyo Night palette, tweaked the fonts and borders a little, added background blur, and got this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20051110.xyz/_astro/fuzzel.CUlOvdfb_Z18Gzqx.webp&quot; alt=&quot;Fuzzel&quot;&gt;&lt;/p&gt;
&lt;p&gt;You can argue it still isn&apos;t pretty, but I think it is quite decent, and it is only a few lines of config. Feel free to copy my config file, or tweak it to fit your own taste.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-ini&quot;&gt;[main]
font=JetBrainsMono Nerd Font:size=13
prompt=&quot;❯   &quot;
icon-theme=Papirus-Dark
icons-enabled=yes
width=45
lines=10
horizontal-pad=20
vertical-pad=20
inner-pad=10
layer=overlay

[colors]
background=1a1b26e6
text=c0caf5ff
match=f7768eff
selection=414868ff
selection-text=c0caf5ff
selection-match=ff899dff
border=7aa2f7ff

[border]
width=2
radius=10
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Alacritty: niri&apos;s default terminal. Alacritty is a very fast GPU-accelerated terminal emulator and highly customizable. Its default &quot;white-on-black, zero padding&quot; look is spartan, but the config file fixes that. Note that recent Alacritty versions have dropped the old YAML format entirely in favor of TOML, so many older tutorials no longer work. If you want my config file, here it is:&lt;/p&gt;
&lt;p&gt;Alacritty automatically reads &lt;code&gt;~/.config/alacritty/alacritty.toml&lt;/code&gt;. Also remember to install the font (I use JetBrains Mono Nerd Font), or you may see garbled text or missing icons. Use&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo pacman -S ttf-jetbrains-mono-nerd
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;to install it.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[window]
padding = { x = 16, y = 16 }
opacity = 0.90
decorations = &quot;None&quot;
dynamic_title = true

[font]
normal = { family = &quot;JetBrainsMono Nerd Font&quot;, style = &quot;Regular&quot; }
bold = { family = &quot;JetBrainsMono Nerd Font&quot;, style = &quot;Bold&quot; }
italic = { family = &quot;JetBrainsMono Nerd Font&quot;, style = &quot;Italic&quot; }
size = 12.0

[cursor]
style = { shape = &quot;Beam&quot;, blinking = &quot;On&quot; }

[colors.primary]
background = &quot;#1a1b26&quot;
foreground = &quot;#c0caf5&quot;

[colors.normal]
black   = &quot;#15161e&quot;
red     = &quot;#f7768e&quot;
green   = &quot;#9ece6a&quot;
yellow  = &quot;#e0af68&quot;
blue    = &quot;#7aa2f7&quot;
magenta = &quot;#bb9af7&quot;
cyan    = &quot;#7dcfff&quot;
white   = &quot;#a9b1d6&quot;

[colors.bright]
black   = &quot;#414868&quot;
red     = &quot;#ff899d&quot;
green   = &quot;#b1e37b&quot;
yellow  = &quot;#f3c27b&quot;
blue    = &quot;#8cb6ff&quot;
magenta = &quot;#ceafff&quot;
cyan    = &quot;#8fe2ff&quot;
white   = &quot;#c0caf5&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Or maybe, by the time you read this far, you are already tired: &quot;I don&apos;t actually care about ricing&quot;, or &quot;why are there so many config files to write?&quot; Then I have one last trick: just use &lt;code&gt;dankmaterialshell&lt;/code&gt;. This project packages desktop ricing into a one-shot install script that makes your desktop gorgeous instantly, with a consistent Google Material 3 design style. Check the screenshots at the &lt;a href=&quot;https://github.com/AvengeMedia/DankMaterialShell&quot;&gt;project page&lt;/a&gt;. Installation is just one command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;curl -fsSL https://install.danklinux.com | sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It lets you pick your window manager (niri, hyprland, ...) and your terminal (ghostty, alacritty, ...). If you are not an artist, or you just need an &quot;out-of-the-box&quot; ricing solution, this project is an excellent choice. My own desktop is this project plus a few small tweaks.&lt;/p&gt;
&lt;p&gt;Ricing is a never-ending hobby. Once niri&apos;s native background blur lands, I want to rework my desktop again; stay tuned for the next blogpost!&lt;/p&gt;</content:encoded></item><item><title>What is the Internet Protocol?</title><link>https://20051110.xyz/blog/internet-protocol</link><guid isPermaLink="true">https://20051110.xyz/blog/internet-protocol</guid><description>Understanding the Invisible Envelope that Delivers Data Across the Globe</description><pubDate>Tue, 27 Jan 2026 18:32:00 GMT</pubDate><content:encoded>&lt;p&gt;Before you read further, take a moment to ask yourself: &lt;strong&gt;What is the Internet Protocol (IP)?&lt;/strong&gt; Can you explain its purpose &amp;#x26; what it consists of beyond just &quot;it&apos;s how devices get IP addresses&quot;?&lt;/p&gt;
&lt;p&gt;Like many developers, I had a functional understanding of the stack: I knew TCP provides reliable connections, UDP is for fast messages, and IP addresses are where packets go. But when I really stopped to ask, &lt;strong&gt;&quot;What &lt;em&gt;is&lt;/em&gt; the Internet Protocol?&quot;&lt;/strong&gt;, I realized I was stuck.&lt;/p&gt;
&lt;p&gt;For IP, the only thing I could picture was an IP Address. This led to a fundamental confusion: &lt;strong&gt;Is the Internet Protocol just a standard for addresses?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The answer, of course, is no. After digging into the architecture, I finally built a mental model that clicks. Here is what I learned.&lt;/p&gt;
&lt;h2&gt;The Protocol is Not the Address&lt;/h2&gt;
&lt;p&gt;First, let&apos;s look at the well-known OSI Model:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;+---------------------+
| Application Layer   |  (HTTP, FTP, DNS)
+---------------------+
| Transport Layer     |  (TCP, UDP)
+---------------------+
| Network Layer       |  (IP)
+---------------------+
| Data Link Layer     |  (Ethernet, Wi-Fi)
+---------------------+
| Physical Layer      |  (Cables, Radio Waves)
+---------------------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you are confused about why the diagram above contains only 5 layers instead of the usual 7, check out &lt;a href=&quot;/channel/osi&quot;&gt;my channel post&lt;/a&gt;. TL;DR: people often use a simplified “Internet stack” view where OSI’s Session/Presentation layers are folded into the Application layer, and Physical + Data Link are commonly treated together as “the link” (even though they’re distinct in the strict OSI model).&lt;/p&gt;
&lt;p&gt;From this model, it&apos;s easy to see the functionality of IP: it operates at the &lt;strong&gt;Network Layer&lt;/strong&gt;, responsible for routing packets from one host to another across different networks.&lt;/p&gt;
&lt;p&gt;The biggest mental block was separating the &lt;em&gt;address name&lt;/em&gt; (the IP address) from the &lt;em&gt;logic&lt;/em&gt; (the IP protocol). Once you start understanding IP as &lt;strong&gt;a set of rules and procedures&lt;/strong&gt; rather than just a label, the rest falls into place.&lt;/p&gt;
&lt;p&gt;The internet is a series of wires; &lt;strong&gt;IP is the logic that navigates the wires.&lt;/strong&gt; The IP address, on the other hand, is part of the protocol (like a street address in a mailing system) but not the whole story. I gave a short talk about &lt;a href=&quot;/src/assets/teaching/slides/Route.pdf&quot;&gt;How Computer Networks Route Your Packets&lt;/a&gt;, and the logic of &lt;code&gt;Routing&lt;/code&gt; is in fact also part of the Internet Protocol. What&apos;s more, IP also defines how packets are structured (the IP Header), how fragmentation works (large packets may need to be broken up; the Ethernet MTU is typically 1500 bytes), and how IP addresses are delegated (CIDR, DHCP, SLAAC, etc.).&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;post office&lt;/code&gt; analogy is one of the best mental models in networking (from my perspective). When you send a letter, the postal system functions much like IP does:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Envelope: The IP protocol defines how to package data into packets with headers containing source and destination addresses.&lt;/li&gt;
&lt;li&gt;Addressing: The national addressing system ensures each location has a unique identifier, just like IP addresses.&lt;/li&gt;
&lt;li&gt;Routing: The postal system routes letters through various post offices; similarly, IP routes packets through routers across networks.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And let&apos;s talk about routing in more detail with the above 3 components in mind.&lt;/p&gt;
&lt;h2&gt;The IP Header&lt;/h2&gt;
&lt;p&gt;The Internet Protocol has a single focus: getting a packet from Computer A to Computer B. It is completely agnostic about what is inside that packet. It doesn&apos;t care if it&apos;s a fragment of a 4K video, a move in a competitive game, or a simple text file.&lt;/p&gt;
&lt;p&gt;Let&apos;s first look at the IPv4 header (credit: UC Berkeley CS168):&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20051110.xyz/_astro/ipv4_header.5l95QlQQ_ZYaTk0.webp&quot; alt=&quot;IPv4 Header&quot;&gt;&lt;/p&gt;
&lt;p&gt;There are a lot of fields in the header, but when you take a closer look, you&apos;ll find many of them are pretty useful (a small parsing sketch follows the list). For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Version:&lt;/strong&gt; Indicates whether it&apos;s IPv4 or IPv6.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Source &amp;#x26; Destination IP:&lt;/strong&gt; The &quot;From&quot; and &quot;To&quot; addresses. We need these to know where to send the packet &amp;#x26; where to send back the response.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TTL (Time To Live):&lt;/strong&gt; Prevents packets from circulating forever.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Protocol:&lt;/strong&gt; Indicates whether the payload is TCP, UDP, ICMP, etc. This is for demultiplexing the Layer 4 protocol at the destination. Otherwise the payload would be gibberish.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Header Checksum:&lt;/strong&gt; Ensures the integrity of the header data. &lt;strong&gt;NOTE:&lt;/strong&gt; This checksum only covers the header, not the payload. This is due to the &lt;em&gt;end-to-end principle&lt;/em&gt;: payload integrity is handled by the end hosts, not the routers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fragmentation Fields:&lt;/strong&gt; Allow large packets to be broken down into smaller fragments for transmission and reassembled at the destination.&lt;/li&gt;
&lt;/ul&gt;
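&lt;p&gt;To make the byte layout concrete, here is a minimal Go sketch (mine, not from the slide) that pulls a few of these fields out of a raw IPv4 header. The sample bytes are hand-crafted for illustration: a 20-byte header with no options, carrying the offsets from the diagram above.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;package main

import (
	&quot;encoding/binary&quot;
	&quot;fmt&quot;
)

func main() {
	// A hand-crafted 20-byte IPv4 header (no options), for illustration only.
	h := []byte{
		0x45, 0x00, 0x00, 0x54, // Version+IHL, DSCP/ECN, Total Length
		0x00, 0x00, 0x40, 0x00, // Identification, Flags+Fragment Offset
		0x40, 0x01, 0x00, 0x00, // TTL, Protocol, Header Checksum
		10, 0, 0, 1, // Source IP
		10, 0, 0, 2, // Destination IP
	}

	fmt.Println(&quot;version:&quot;, h[0]&gt;&gt;4)               // 4
	fmt.Println(&quot;header bytes:&quot;, int(h[0]&amp;#x26;0x0f)*4) // IHL of 5 means 20 bytes
	fmt.Println(&quot;total length:&quot;, binary.BigEndian.Uint16(h[2:4]))
	fmt.Println(&quot;ttl:&quot;, h[8])      // 64 hops left
	fmt.Println(&quot;protocol:&quot;, h[9]) // 1 = ICMP payload
	fmt.Printf(&quot;src: %d.%d.%d.%d\n&quot;, h[12], h[13], h[14], h[15])
	fmt.Printf(&quot;dst: %d.%d.%d.%d\n&quot;, h[16], h[17], h[18], h[19])
}
&lt;/code&gt;&lt;/pre&gt;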
&lt;p&gt;The IPv6 header is simpler and more efficient (credit: UC Berkeley CS168):&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20051110.xyz/_astro/ipv6_header.DlbAZf7m_jD0kI.webp&quot; alt=&quot;IPv6 Header&quot;&gt;&lt;/p&gt;
&lt;p&gt;You can view it as an evolution of the IPv4 header that pushes the &lt;code&gt;End-to-End Principle&lt;/code&gt; even further by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Removing the Header Checksum (relying entirely on upper-layer protocols for error checking).&lt;/li&gt;
&lt;li&gt;Simplifying fragmentation (only the source handles fragmentation; intermediate routers don’t fragment).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It also ensures scalability with the &lt;code&gt;next header&lt;/code&gt; field, allowing extension headers without complicating the base header.&lt;/p&gt;
&lt;h2&gt;Addressing&lt;/h2&gt;
&lt;p&gt;Before we talk about routing, we must ask: how do devices get these addresses in the first place?&lt;/p&gt;
&lt;p&gt;We have moved away from the rigid &quot;Class A/B/C&quot; system of the 1980s to Classless Inter-Domain Routing (CIDR). If you want to learn about the old-school Class A/B/C system, you could view this &lt;a href=&quot;/channel/classful_addressing&quot;&gt;channel post&lt;/a&gt;. CIDR allows us to slice IP space at any bit boundary, creating subnets of any size to fit the need perfectly.&lt;/p&gt;
&lt;p&gt;The CIDR notation is nothing fancy; if you are not familiar with it, here is a quick refresher:&lt;/p&gt;
&lt;p&gt;An IP address is a 32-bit number (IPv4) or a 128-bit number (IPv6). For simplicity, let&apos;s take IPv4 as the example. What if we want to represent a block of addresses (e.g. 10.0.0.0 to 10.0.0.255)? We could write it as 10.0.0.*, of course, but this fails for cases such as 10.0.0.0 to 10.0.0.127. Instead, we use CIDR notation: &lt;code&gt;10.0.0.0/24&lt;/code&gt; for the first example and &lt;code&gt;10.0.0.0/25&lt;/code&gt; for the second. The &lt;code&gt;/24&lt;/code&gt; or &lt;code&gt;/25&lt;/code&gt; indicates how many bits are fixed as the network portion of the address. If you are still confused, let&apos;s write it in binary:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;10.0.0.0/24 means:
00001010.00000000.00000000.00000000
|----- 24 bits fixed -----| Hosts |

The first 24 bits define the network, fixed.
The last 8 bits are free for host addresses.
This gives us 2^8 = 256 addresses (10.0.0.0 to 10.0.0.255).
We write `/24` because 24 bits are fixed for the network.

10.0.0.0/25 means:
00001010.00000000.00000000.0|0000000
|------ 25 bits fixed ------| Hosts |

The first 25 bits define the network, fixed.
The last 7 bits are free for host addresses.
This gives us 2^7 = 128 addresses (10.0.0.0 to 10.0.0.127).
We write `/25` because 25 bits are fixed for the network.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The beauty of CIDR is its flexibility. You can carve up address space at any bit boundary, creating subnets of exactly the size you need. A &lt;code&gt;/30&lt;/code&gt; gives you 4 addresses, while a &lt;code&gt;/22&lt;/code&gt; gives you $2^{32-22}=1024$ addresses. This efficiency is what allows the internet to scale beyond the rigid Class A/B/C system. Similarly, for IPv6, the total bit length is 128 bits, and a very common LAN subnet size is &lt;code&gt;/64&lt;/code&gt;.&lt;/p&gt;
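&lt;p&gt;If you want to poke at CIDR blocks yourself, Go&apos;s standard library already does the mask arithmetic. A small sketch (the addresses below are arbitrary examples):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;package main

import (
	&quot;fmt&quot;
	&quot;net&quot;
)

func main() {
	// 10.0.0.0/25 covers 10.0.0.0 - 10.0.0.127.
	_, block, _ := net.ParseCIDR(&quot;10.0.0.0/25&quot;)

	fmt.Println(block.Contains(net.ParseIP(&quot;10.0.0.100&quot;))) // true
	fmt.Println(block.Contains(net.ParseIP(&quot;10.0.0.200&quot;))) // false

	// Mask.Size() returns the fixed (network) bits and the total bits.
	ones, bits := block.Mask.Size()
	fmt.Printf(&quot;%d host addresses\n&quot;, 1&lt;&lt;(bits-ones)) // 2^7 = 128
}
&lt;/code&gt;&lt;/pre&gt;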
&lt;p&gt;Now the practical question: &lt;strong&gt;who hands out addresses to hosts?&lt;/strong&gt; This is where DHCP and SLAAC come in, and they solve &lt;em&gt;slightly different problems&lt;/em&gt;.&lt;/p&gt;
&lt;h3&gt;DHCP (IPv4)&lt;/h3&gt;
&lt;p&gt;In IPv4, the default model is &lt;strong&gt;DHCP (Dynamic Host Configuration Protocol)&lt;/strong&gt;: a server “leases” configuration to clients for a limited time.&lt;/p&gt;
&lt;p&gt;What the client gets is usually more than just an IP:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;IPv4 address&lt;/li&gt;
&lt;li&gt;Subnet mask&lt;/li&gt;
&lt;li&gt;Default gateway&lt;/li&gt;
&lt;li&gt;DNS servers&lt;/li&gt;
&lt;li&gt;Lease time (and other options)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The handshake you’ll often see described is &lt;strong&gt;DORA&lt;/strong&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Discover&lt;/strong&gt;: client broadcasts “is there a DHCP server?”&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Offer&lt;/strong&gt;: server offers an address + options&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Request&lt;/strong&gt;: client requests that offer&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ack&lt;/strong&gt;: server confirms (lease is now active)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Two important operational details are easy to miss:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Leases expire and renew.&lt;/strong&gt; Clients try to renew partway through the lease (often around 50%), and if renewal fails they can “rebind” by broadcasting to any DHCP server.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DHCP works across subnets using relays.&lt;/strong&gt; Broadcasts don’t cross routers, so networks commonly deploy a &lt;strong&gt;DHCP relay&lt;/strong&gt; (often on the router) that forwards DHCP requests to a centralized server.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This “stateful” model is simple for admins: one place to audit, reserve static leases, and manage options.&lt;/p&gt;
&lt;h3&gt;SLAAC (IPv6)&lt;/h3&gt;
&lt;p&gt;IPv6 introduced &lt;strong&gt;SLAAC (Stateless Address Autoconfiguration)&lt;/strong&gt; because the world needed to number &lt;em&gt;billions&lt;/em&gt; of devices without a central server tracking every single assignment.&lt;/p&gt;
&lt;p&gt;In a SLAAC environment, the router doesn’t hand out individual addresses. Instead, it periodically sends &lt;strong&gt;Router Advertisements (RA)&lt;/strong&gt; (part of NDP/ICMPv6) that say:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;“Here is the &lt;strong&gt;prefix&lt;/strong&gt; for this link (often &lt;code&gt;/64&lt;/code&gt;).”&lt;/li&gt;
&lt;li&gt;“Here is the default gateway.”&lt;/li&gt;
&lt;li&gt;“Here are timing parameters (valid lifetime, preferred lifetime).”&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each host then creates its own address by combining:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the advertised &lt;strong&gt;prefix&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;a self-generated &lt;strong&gt;Interface ID&lt;/strong&gt; (lower 64 bits)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That Interface ID is &lt;em&gt;not&lt;/em&gt; always derived from the MAC address, due to privacy concerns. Modern OSes commonly use temporary, randomized addresses that rotate over time to reduce tracking. Before a host starts using the address, it runs &lt;strong&gt;Duplicate Address Detection (DAD)&lt;/strong&gt; to ensure no one else on the link is already using it.&lt;/p&gt;
&lt;p&gt;So SLAAC is “stateless” in the sense that &lt;em&gt;no server maintains a lease database of host addresses&lt;/em&gt;. The network advertises the prefix; hosts pick their own. (It’s also common to mix SLAAC + DHCPv6: SLAAC for the address, DHCPv6 for DNS/search domains, depending on network policy.)&lt;/p&gt;
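&lt;p&gt;The &quot;advertised prefix + self-generated Interface ID&quot; combination is easy to mimic in code. Here is a Go sketch that joins a /64 prefix (a documentation prefix, chosen arbitrarily) with random lower 64 bits, roughly in the spirit of RFC 4941 temporary addresses; real SLAAC also involves Router Advertisements and DAD, which are not shown:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;package main

import (
	&quot;crypto/rand&quot;
	&quot;fmt&quot;
	&quot;net/netip&quot;
)

func main() {
	// The prefix a router would advertise in an RA (example value).
	prefix := netip.MustParsePrefix(&quot;2001:db8:abcd:12::/64&quot;)

	// Upper 64 bits come from the prefix...
	addr := prefix.Addr().As16()

	// ...and the host fills the lower 64 bits (the Interface ID) itself.
	if _, err := rand.Read(addr[8:]); err != nil {
		panic(err)
	}

	fmt.Println(netip.AddrFrom16(addr)) // e.g. 2001:db8:abcd:12:xxxx:... (random)
}
&lt;/code&gt;&lt;/pre&gt;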
&lt;h3&gt;DHCPv6 Prefix Delegation (PD)&lt;/h3&gt;
&lt;p&gt;Here’s the missing piece that makes home IPv6 feel different from home IPv4. With IPv4, home networks often relied on &lt;strong&gt;NAT&lt;/strong&gt;: your router got &lt;em&gt;one&lt;/em&gt; public IP and hid everyone else behind it. In IPv6, NAT is discouraged: while it IS possible to set up NAT66 on OpenWRT or similar routers, it is not common practice (I mean, IPv6 has enough addresses, why bother?).&lt;/p&gt;
&lt;p&gt;IPv6 aims to restore &lt;strong&gt;end-to-end addressing&lt;/strong&gt;, so your home router needs not just one address, but a &lt;strong&gt;block&lt;/strong&gt; large enough to carve into multiple &lt;code&gt;/64&lt;/code&gt; LANs (guest Wi‑Fi, IoT VLAN, etc). That’s what PD is for.&lt;/p&gt;
&lt;p&gt;Think of it as a two-level process:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;ISP -&gt; Router (DHCPv6-PD):&lt;/strong&gt; your router requests a prefix using an &lt;strong&gt;IA_PD&lt;/strong&gt; (Identity Association for Prefix Delegation). The ISP “delegates” something like a &lt;code&gt;/56&lt;/code&gt; or &lt;code&gt;/60&lt;/code&gt; to your router.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Router -&gt; LAN Hosts (SLAAC via RA):&lt;/strong&gt; your router takes that delegated block, selects one &lt;code&gt;/64&lt;/code&gt; per LAN segment, and advertises each &lt;code&gt;/64&lt;/code&gt; via &lt;strong&gt;Router Advertisements&lt;/strong&gt;. Hosts then self-assign addresses using SLAAC.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Routing&lt;/h2&gt;
&lt;p&gt;So how does a packet find its way across the ocean? Routers make decisions based on a rule called Longest Prefix Match (LPM). A router doesn&apos;t memorize every single IP address on earth. Instead, it memorizes &quot;prefixes&quot; (recall the CIDR discussion above). If a packet matches multiple entries in the routing table, the router always picks the most specific one (e.g. if both a /24 and a /22 match, it picks the /24 route).&lt;/p&gt;
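&lt;p&gt;Here is a toy Go sketch of that rule: a linear scan that keeps the longest matching prefix. (Real routers use tries or TCAM for speed; the routes and next hops below are invented for illustration.)&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;package main

import (
	&quot;fmt&quot;
	&quot;net/netip&quot;
)

func main() {
	// A toy routing table: prefix -&gt; next hop.
	routes := map[netip.Prefix]string{
		netip.MustParsePrefix(&quot;10.0.0.0/22&quot;): &quot;via router A&quot;,
		netip.MustParsePrefix(&quot;10.0.1.0/24&quot;): &quot;via router B&quot;,
	}

	dst := netip.MustParseAddr(&quot;10.0.1.42&quot;)

	best, nextHop := netip.Prefix{}, &quot;default route&quot;
	for p, hop := range routes {
		// Keep the most specific (longest) matching prefix.
		if p.Contains(dst) &amp;#x26;&amp;#x26; p.Bits() &gt; best.Bits() {
			best, nextHop = p, hop
		}
	}
	fmt.Println(dst, &quot;goes&quot;, nextHop) // both prefixes match; the /24 wins
}
&lt;/code&gt;&lt;/pre&gt;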
&lt;p&gt;But &lt;strong&gt;where do these tables come from&lt;/strong&gt;? This is the domain of routing protocols, which fall into two distinct families based on their scope: IGPs and BGP.&lt;/p&gt;
&lt;h3&gt;IGP&lt;/h3&gt;
&lt;p&gt;Interior Gateway Protocols (IGP) operate within a single AS (Autonomous System).&lt;/p&gt;
&lt;p&gt;Remember: we view the whole Internet as a network of networks. It is not a centralized system, but rather a federated system. Within each small network we may have our own routing policies. To manage this complexity, the concept of an Autonomous System (AS) is introduced.&lt;/p&gt;
&lt;p&gt;An Autonomous System (AS) is a collection of IP networks and routers under the control of a single organization that presents a common routing policy to the internet. Each AS is assigned a unique AS number (ASN) for identification.&lt;/p&gt;
&lt;p&gt;There are 2 major IGP protocols:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Distance-Vector Protocols&lt;/li&gt;
&lt;li&gt;Link-State Protocols&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It is pretty common to have no idea what these two families are about. Before I dive deeper into them, let me give you a quick summary:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Distance-Vector Protocols build their view of the network by periodically receiving their neighbors&apos; routing tables.&lt;/li&gt;
&lt;li&gt;Link-State Protocols build a full view of the network by flooding link-state advertisements across it, then computing routes locally.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Maybe the above explanation is still too dense 😢 and the point hasn&apos;t clicked yet. Don&apos;t worry; I hope the details below will make it click.&lt;/p&gt;
&lt;h4&gt;Distance-Vector Protocols&lt;/h4&gt;
&lt;p&gt;Let’s first look at a picture to understand how distance-vector protocols work:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20051110.xyz/_astro/distance_vector.DjxJHieC_14mT77.webp&quot; alt=&quot;Distance Vector&quot;&gt;&lt;/p&gt;
&lt;p&gt;In distance-vector protocols, each router maintains a table (vector) that lists the best known distance to each destination and the next hop to reach that destination. Periodically, each router shares its table with its immediate neighbors. Formally, we can state the update process as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you hear about a path to some destination, update the table if:
&lt;ul&gt;
&lt;li&gt;You don&apos;t have a path to that destination yet, or&lt;/li&gt;
&lt;li&gt;The new path is shorter than your current known path.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Tell your neighbors about your updated table.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This process is exactly what we stated above: &quot;receiving their neighbors&apos; routing tables periodically&quot;. A router computes &quot;how far is each destination&quot; as &quot;the neighbor&apos;s advertised distance&quot; + &quot;the cost to reach that neighbor&quot;.&lt;/p&gt;
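&lt;p&gt;Here is a compact Go sketch of that update rule, on an invented three-router topology (A-B costs 1, B-C costs 2, A-C costs 10). Each &quot;round&quot; stands in for one periodic exchange of tables between neighbors:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;package main

import &quot;fmt&quot;

func main() {
	const INF = 1 &lt;&lt; 30
	routers := []string{&quot;A&quot;, &quot;B&quot;, &quot;C&quot;}
	// Direct link costs (undirected).
	links := map[[2]string]int{
		{&quot;A&quot;, &quot;B&quot;}: 1, {&quot;B&quot;, &quot;C&quot;}: 2, {&quot;A&quot;, &quot;C&quot;}: 10,
	}
	cost := func(a, b string) (int, bool) {
		if c, ok := links[[2]string{a, b}]; ok {
			return c, true
		}
		c, ok := links[[2]string{b, a}]
		return c, ok
	}

	// dist[r][d] = r's best known cost to reach d.
	dist := map[string]map[string]int{}
	for _, r := range routers {
		dist[r] = map[string]int{}
		for _, d := range routers {
			dist[r][d] = INF
		}
		dist[r][r] = 0
	}

	// Each round, every router relaxes its table using its neighbors' tables.
	for round := 0; round &lt; len(routers); round++ {
		for _, r := range routers {
			for _, n := range routers {
				c, adjacent := cost(r, n)
				if !adjacent {
					continue
				}
				for _, d := range routers {
					if c+dist[n][d] &lt; dist[r][d] {
						dist[r][d] = c + dist[n][d] // shorter path via n
					}
				}
			}
		}
	}
	fmt.Println(dist[&quot;A&quot;]) // A reaches C at cost 3 (via B), not 10
}
&lt;/code&gt;&lt;/pre&gt;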
&lt;p&gt;But as you might notice, this process has some problems. The major problem is: if a link goes down, it can take a long time for all routers to realize that the path is no longer valid and find the new optimal path.&lt;/p&gt;
&lt;p&gt;One simple optimization that addresses this problem is the &lt;code&gt;poison packets&lt;/code&gt; mechanism. When a router detects that a link is down, it immediately tells its neighbors that the distance to the affected destination is infinite (or some very large number). But poisoning also introduces new problems of its own (which I don&apos;t want to cover here). If you want the final algorithm with poison packets added, here it is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you hear an advertisement for that destination, update the table and reset the timer if:
&lt;ul&gt;
&lt;li&gt;The destination isn&apos;t in the table, or&lt;/li&gt;
&lt;li&gt;The advertised cost + link cost is better than the best-known cost, or&lt;/li&gt;
&lt;li&gt;The advertisement is from the current next-hop (includes poison advertisements).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Advertise updates to neighbors when the table changes, and periodically.
&lt;ul&gt;
&lt;li&gt;Don’t advertise back to the next-hop (split horizon), &lt;strong&gt;or&lt;/strong&gt; advertise poison back (poison reverse).&lt;/li&gt;
&lt;li&gt;Any cost ≥ a threshold (e.g., 16 in RIP) is treated as infinity.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;If a table entry expires, mark it poison and advertise it.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Link-State Protocols&lt;/h4&gt;
&lt;p&gt;Distance-vector spreads &lt;em&gt;results&lt;/em&gt; (“here’s my best distance”). Link-state spreads &lt;em&gt;facts&lt;/em&gt; (“here’s what I’m directly connected to”). See the picture below:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20051110.xyz/_astro/link_state.DOb10hgh_Zi6Kcf.webp&quot; alt=&quot;Link State&quot;&gt;&lt;/p&gt;
&lt;p&gt;In a link-state protocol (OSPF / IS-IS), each router does three big things (the last step is sketched in Go after this list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Discover neighbors&lt;/strong&gt;&lt;br&gt;
Routers exchange hello messages to find adjacent routers and form neighbor relationships.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Flood link-state advertisements&lt;/strong&gt;&lt;br&gt;
Each router advertises the state and cost of its &lt;em&gt;local links&lt;/em&gt; (e.g., “I have a link to R2 with cost 10”).&lt;br&gt;
These LSAs are &lt;strong&gt;flooded&lt;/strong&gt; (forwarded onward) until convergence.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Run Dijkstra&lt;/strong&gt;&lt;br&gt;
Once everyone has the same topology database, each router independently runs &lt;strong&gt;Dijkstra&lt;/strong&gt; to compute a shortest-path tree rooted at itself.&lt;br&gt;
The result becomes the router’s forwarding entries (“to reach prefix X, next hop is Y”).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
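&lt;p&gt;For the third step, here is a minimal Go sketch of Dijkstra over a flooded topology database. The three-router topology and costs are invented, and real OSPF implementations use priority queues and carry far more state; this is just the core idea:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;package main

import &quot;fmt&quot;

func main() {
	// The topology database all routers converge on: node -&gt; neighbor -&gt; cost.
	topo := map[string]map[string]int{
		&quot;R1&quot;: {&quot;R2&quot;: 10, &quot;R3&quot;: 1},
		&quot;R2&quot;: {&quot;R1&quot;: 10, &quot;R3&quot;: 2},
		&quot;R3&quot;: {&quot;R1&quot;: 1, &quot;R2&quot;: 2},
	}

	const INF = 1 &lt;&lt; 30
	src := &quot;R1&quot; // each router roots the computation at itself
	dist := map[string]int{}
	for n := range topo {
		dist[n] = INF
	}
	dist[src] = 0
	visited := map[string]bool{}

	for range topo {
		// Pick the closest unvisited node (simple O(V^2) Dijkstra).
		u, best := &quot;&quot;, INF
		for n, d := range dist {
			if !visited[n] &amp;#x26;&amp;#x26; d &lt; best {
				u, best = n, d
			}
		}
		if u == &quot;&quot; {
			break
		}
		visited[u] = true
		for v, c := range topo[u] {
			if dist[u]+c &lt; dist[v] {
				dist[v] = dist[u] + c
			}
		}
	}
	fmt.Println(dist) // R1 reaches R2 at cost 3 (via R3), not 10
}
&lt;/code&gt;&lt;/pre&gt;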
&lt;p&gt;This approach has several advantages over distance-vector:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Faster convergence:&lt;/strong&gt; link-state can react to changes more quickly since routers have a full view of the topology.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Routing Logic is consistent:&lt;/strong&gt; all routers compute the same shortest-path tree, it is less likely to have routing loops.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;However, there are tradeoffs. The major one is that it consumes &lt;strong&gt;more CPU/RAM&lt;/strong&gt;: maintaining a topology database and running Dijkstra is heavier than basic distance-vector.&lt;/p&gt;
&lt;h3&gt;BGP&lt;/h3&gt;
&lt;p&gt;IGPs are about “best path by metric” inside one organization. &lt;strong&gt;BGP is about policy&lt;/strong&gt; across organizations. Note that the network is federated: no one entity controls the whole Internet. Each AS has its own policies about which routes to accept, prefer, or advertise. Transferring packets across AS boundaries requires obeying these policies.&lt;/p&gt;
&lt;p&gt;BGP (Border Gateway Protocol) is the Internet’s inter-domain routing protocol. It’s often described as a &lt;strong&gt;path-vector&lt;/strong&gt; protocol:&lt;/p&gt;
&lt;p&gt;Remember that in a link-state protocol, each router floods the network with link-state advertisements to build a complete topology map. From a privacy perspective, this is not acceptable for BGP: each AS wants to keep its internal topology &amp;#x26; customers private. Instead, BGP shares only reachability information (which prefixes can be reached) along with path attributes, without revealing the internal structure of the AS.&lt;/p&gt;
&lt;p&gt;At a high level, the global Internet is many ASes (ISPs, cloud providers, enterprises, universities). Each AS can run its own IGP internally (OSPF/IS-IS/etc.). At the edges, ASes use &lt;strong&gt;BGP&lt;/strong&gt; to exchange which prefixes they can reach.&lt;/p&gt;
&lt;p&gt;For more details about &lt;code&gt;peering &amp;#x26; transit&lt;/code&gt;, &lt;code&gt;iBGP &amp;#x26; eBGP&lt;/code&gt;, and &lt;code&gt;hot-potato routing&lt;/code&gt;, please check my short talk about &lt;a href=&quot;/src/assets/teaching/slides/Route.pdf&quot;&gt;How Computer Networks Route Your Packets&lt;/a&gt;. I&apos;ll omit those details here.&lt;/p&gt;</content:encoded></item><item><title>How to Make Hot Chocolate in Your Dorm (WIP)</title><link>https://20051110.xyz/blog/hot-chocolate</link><guid isPermaLink="true">https://20051110.xyz/blog/hot-chocolate</guid><description>On a cold night, enjoy the warmth and sweetness of a homemade hot chocolate.</description><pubDate>Mon, 29 Dec 2025 21:27:55 GMT</pubDate><content:encoded>&lt;p&gt;This follows up on this Channel post: &lt;a href=&quot;/channel/hot_chocolate&quot;&gt;How to make hot chocolate in your dorm&lt;/a&gt;. Here I&apos;m updating the complete steps for making hot chocolate (complete? I&apos;m still exploring) along with some tips.&lt;/p&gt;
&lt;h2&gt;Choosing a container&lt;/h2&gt;
&lt;p&gt;Making hot chocolate in a dorm requires a microwave-safe mug or bowl. Make sure it is big enough to hold the drink plus room for stirring.&lt;/p&gt;
&lt;h3&gt;Caveats&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Use a microwave-safe container and avoid plastic cups; never, ever use a metal container! That will wreck the microwave!&lt;/li&gt;
&lt;li&gt;Can you use those disposable coffee cups / oden cups? The inner wall of almost every paper cup has a plastic lining. If it is PE-lined (the most common), it tolerates only about 90-100°C; microwaving milk easily produces local hot spots that melt the coating, treating you to a &quot;plastic-flavored&quot; hot chocolate and possibly some harmful substances. A PP lining tolerates around 120°C and is relatively safe, but such cups cost more and convenience stores don&apos;t necessarily use them.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Given all that, my top recommendation remains a ceramic or glass mug (ideally one whose product page explicitly says &quot;microwave safe&quot;, since those tend to have thicker walls that won&apos;t crack). I bought one on PDD for ¥12.1 shipped, stirring spoon included. Don&apos;t pinch pennies here; spending a little for peace of mind is what matters most.&lt;/p&gt;
&lt;h3&gt;How to wash the container?&lt;/h3&gt;
&lt;p&gt;Dorm dwellers rarely own dish soap (nobody cooks, after all), let alone baking soda or white vinegar (this isn&apos;t a chemistry lab 🤣). For a freshly bought mug, if the bacteria / leftover dust really worry you, consider cleaning it with toothpaste.&lt;/p&gt;
&lt;p&gt;Before I learned the &quot;toothpaste method&quot; I also just soaked mugs in boiling water. Soaking alone gets you 90% of the way there (mainly sterilizing and deodorizing), but a truly thorough clean (removing oil and fine dust) still needs toothpaste. Why?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Boiling water does kill the vast majority of surface bacteria; no argument there.
But a new mug may carry a thin layer of protective wax or industrial oil film. Boiling water can melt it, yet if the mug just sits soaking, the film floats on the surface and re-coats the walls as you pour the water out, so effectively nothing was removed.
Also, fine dust held on by static is hard to dislodge by soaking alone; it needs physical scrubbing.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Given those two points, the toothpaste method is the simplest effective option. Recalling high-school chemistry, toothpaste contains abrasives (calcium carbonate or silica) and surfactants, so it cleans powerfully and leaves a fresh scent. Better yet, toothpaste is meant to go into your mouth, so it is fairly safe; even if you don&apos;t rinse perfectly, the residue won&apos;t harm you.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Rinse the mug with water first.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Squeeze out a soybean-sized dab of toothpaste onto the walls and bottom.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Rub it all over the inside with a finger (or a clean toothbrush / face towel), especially the rim where your lips touch.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You will feel the friction; that is the toothpaste carrying away the grime. When done, rinse thoroughly with water.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Mugs cleaned this way come out genuinely spotless, I promise. I suggest combining the toothpaste method with the hot-water soak: toothpaste first, then a boiling-water soak (I did two rounds, each time filling with boiling water to near overflowing, waiting for the 90-something°C water to cool to about 40°C, then pouring it out).&lt;/p&gt;
&lt;p&gt;After this two-step treatment the mug is extremely clean and ready to use with confidence.&lt;/p&gt;
&lt;h2&gt;Choosing ingredients&lt;/h2&gt;
&lt;p&gt;For this experiment I used the following ingredients:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Plain milk (whole milk is better; nicer mouthfeel). 200ml single-serve cartons.&lt;/li&gt;
&lt;li&gt;Chocolate (I used 85% cocoa dark chocolate). You can substitute milk chocolate (then reduce the sugar accordingly) or cocoa powder (then add extra sugar and fat).&lt;/li&gt;
&lt;li&gt;Sugar (adjust to taste). More on this shortly.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;AI also suggested plenty of flavor-boosting extras: cinnamon / vanilla extract / marshmallows / whipped cream / mint syrup and so on. Given the limits of dorm life, I stuck to the most basic ingredients this time.&lt;/p&gt;
&lt;h3&gt;Which sugar?&lt;/h3&gt;
&lt;p&gt;A dorm is unlikely to have white granulated sugar lying around, and any sugar that is there is probably brown (which I don&apos;t think suits hot cocoa). Better to buy some. On e-commerce platforms your options form four quadrants:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;&lt;/th&gt;&lt;th&gt;Granulated sugar&lt;/th&gt;&lt;th&gt;Sugar cubes&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;White sugar&lt;/td&gt;&lt;td&gt;White / caster sugar&lt;/td&gt;&lt;td&gt;Coffee sugar cubes&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Brown sugar&lt;/td&gt;&lt;td&gt;Brown / yellow sugar&lt;/td&gt;&lt;td&gt;Yellow sugar cubes&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Here is my buying advice:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you need very little (say, you&apos;re making a single cup of hot chocolate), don&apos;t buy any; just mooch off McDonald&apos;s. Put on a brave face and ask the staff for a few sugar packets; it is completely fine. I tried twice (at McDonald&apos;s both on and off campus) and the staff were happy to hand them over 🍬. On quantity, spoiler: with 200mL of milk + 85% dark chocolate, 12g of sugar is sweet enough (4 packets is definitely too much! I&apos;ve tried). With milk chocolate, reduce further. McDonald&apos;s packets are 4g each, so 3 packets will do.&lt;/li&gt;
&lt;li&gt;If you need a lot, granulated sugar is the better buy. It dissolves in milk more readily (cubes take time to melt). It isn&apos;t necessarily cheaper than cubes, but it is more practical. Comparing my purchases: I bought 25 packets of 5g &quot;Taikoo golden coffee sugar&quot; at ¥0.22 per packet, ¥5.5 total shipped, a real bargain; &lt;del&gt;at that price the shipping alone must eat their margin&lt;/del&gt;. By contrast, three 250g boxes of cubes cost ¥20.0 shipped, a better deal per gram, but you&apos;ll never make that much hot chocolate 🤣.&lt;/li&gt;
&lt;li&gt;White or brown sugar?
&lt;ul&gt;
&lt;li&gt;White / caster sugar: pure sweetness that won&apos;t interfere with the chocolate&apos;s flavor.&lt;/li&gt;
&lt;li&gt;Yellow sugar: carries a caramel note that adds complexity to the hot chocolate. Try it if you like that sort of thing; I think it is great. White sugar is just &quot;plain sweet&quot;, so why not a sugar with character?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Which chocolate?&lt;/h3&gt;
&lt;p&gt;If you like an intense chocolate flavor, go for dark chocolate at 70% cocoa or above. If you prefer it sweeter, milk chocolate (roughly 30%-40% cocoa) works; note that milk chocolate carries more sugar, so use less added sugar when making the drink.&lt;/p&gt;
&lt;p&gt;Also note: since milk chocolate has less cocoa, if you later want a &quot;more intense flavor&quot;, increase the amount of chocolate a bit accordingly.&lt;/p&gt;
&lt;h2&gt;Experiment: Attempt 1&lt;/h2&gt;
&lt;h3&gt;Planned steps&lt;/h3&gt;
&lt;h4&gt;Making the &quot;chocolate paste&quot;&lt;/h4&gt;
&lt;ol&gt;
&lt;li&gt;Break it up: snap the chocolate bar into pieces, the smaller the better. Suggested ratio: about 20g of chocolate per 200mL of milk.&lt;/li&gt;
&lt;li&gt;Add a splash of milk, just barely covering the chocolate pieces. Add the sugar at this point too.&lt;/li&gt;
&lt;li&gt;First heating: microwave at medium-high power for 30-40 seconds.&lt;/li&gt;
&lt;li&gt;Stir to emulsify: take it out. The chocolate may not look melted yet, but the milk is hot. Stir continuously with a spoon, letting the residual heat melt the chocolate completely into a thick chocolate paste.&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;If stubborn lumps remain, heat for another 10 seconds. Keep stirring until it is smooth with no graininess.&lt;/li&gt;
&lt;li&gt;Why add the sugar at this step? Heated together, the sugar blends into the milk better and develops a kind of &quot;cooked milk&quot; aroma. Sugar added after heating still dissolves, but the flavors don&apos;t marry as well.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Incorporating the milk&lt;/h4&gt;
&lt;ol&gt;
&lt;li&gt;Top up with milk: pour the remaining milk into the chocolate paste (leave a little headroom against boil-over).&lt;/li&gt;
&lt;li&gt;Mix: stir briefly so the paste and milk combine.&lt;/li&gt;
&lt;li&gt;Second heating: microwave on high for 1 minute to 1 minute 30 seconds.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;The actual experiment&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://20051110.xyz/_astro/0.B1LVnnTd_q0h5N.webp&quot; alt=&quot;Chocolate portion&quot;&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;As shown, a full bar is 100g of chocolate; I broke off roughly 20g into the mug.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img src=&quot;https://20051110.xyz/_astro/0.5.D9lRomDl_ZDX0L5.webp&quot; alt=&quot;Adding milk&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note: here came the first problem. I still hadn&apos;t broken the chocolate up enough; the chunks were far too big, and as a result they didn&apos;t melt particularly well later!&lt;/strong&gt;&lt;/p&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;
&lt;p&gt;Add a splash of milk. Do not fill it up! Pour just enough to barely cover the chocolate pieces (about 30-50mL). Here I filled only about 1/4 of my mug&apos;s height.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Microwave on high for 1 minute. (&lt;strong&gt;That is what I did. Don&apos;t copy it! One minute is far, far too long!&lt;/strong&gt;) The milk boiled over! (I did wipe the microwave down afterwards, but still: watch out for boil-over!)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Take it out and stir. The chocolate still hadn&apos;t fully melted! (&lt;strong&gt;Here came problem number two!&lt;/strong&gt;) I think the chunks were simply too big to melt completely. Note: even when the chocolate looks &quot;seemingly melted&quot; like the picture below, keep stirring! There are still chunks in there! (See the next two pictures.)&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The chocolate paste that looked done:
&lt;img src=&quot;https://20051110.xyz/_astro/1.QdXpLtJ0_Z2osEsW.webp&quot; alt=&quot;Seemingly well stirred&quot;&gt;&lt;/p&gt;
&lt;p&gt;But in fact tiny chocolate bits remained (I only discovered this near the last sip):
&lt;img src=&quot;https://20051110.xyz/_astro/2.Cs3Ljb54_4yFnY.webp&quot; alt=&quot;Tiny chocolate bits remain&quot;&gt;&lt;/p&gt;
&lt;p&gt;Please ignore how extremely dirty the rim looks in the photo above 🤣. If it conjures up any unpleasant imagery, don&apos;t dwell on it; just trust that this cup of hot chocolate was delicious 😋.&lt;/p&gt;
&lt;h3&gt;Post-experiment reflections&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Break the chocolate as small as possible! Aim for roughly pea-sized pieces or smaller, to make sure they can melt completely.&lt;/li&gt;
&lt;li&gt;Keep the first heating short! Next time I plan to drop to medium-high power and cut the time to 20-30 seconds.&lt;/li&gt;
&lt;li&gt;Stir thoroughly! Make sure the chocolate is fully melted with no graininess.&lt;/li&gt;
&lt;li&gt;Moderate the sugar! This time I added 4 McDonald&apos;s packets (16g) to 200mL of milk, and it came out too sweet! Next time I&apos;ll cut back to 3 packets (12g). A sugar-to-milk ratio of about 6% seems right.&lt;/li&gt;
&lt;li&gt;Because I was busy cleaning the microwave, I skipped the &quot;second heating to incorporate the milk&quot; step this time. I&apos;ll try it in the next experiment and see how it goes; and of course, I&apos;ll be careful not to let the milk boil over 😅.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Washing the cup&lt;/h3&gt;
&lt;p&gt;Again, I highly recommend the &quot;toothpaste method&quot; for washing the cup. After steeping in hot chocolate, the inside will keep some chocolate residue. Toothpaste + hand scrubbing gets it spotless in 30 seconds, far better than rinsing with water alone. Behold:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20051110.xyz/_astro/3.D5YnxTu0_1QME5W.webp&quot; alt=&quot;Cup before and after washing&quot;&gt;&lt;/p&gt;</content:encoded></item><item><title>What is a Socket?</title><link>https://20051110.xyz/blog/socket</link><guid isPermaLink="true">https://20051110.xyz/blog/socket</guid><description>Understanding Sockets in Go Networking and the Underlying OS Mechanics</description><pubDate>Sat, 06 Dec 2025 21:57:00 GMT</pubDate><content:encoded>&lt;p&gt;If you’ve spent any time learning network programming in Go, you’ve likely marveled at how simple the &lt;code&gt;net&lt;/code&gt; package is. With just three lines of code, you can create a performant TCP server. I mean&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;ln, _ := net.Listen(&quot;tcp&quot;, &quot;:8080&quot;)
for {
    conn, _ := ln.Accept()
    go handle(conn)
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But this simplicity often hides the mechanics. You know &lt;em&gt;how&lt;/em&gt; to use &lt;code&gt;net.Dial&lt;/code&gt; or &lt;code&gt;net.Listen&lt;/code&gt;, but do you know what a &quot;socket&quot; actually is?&lt;/p&gt;
&lt;p&gt;Here is what I learned about file descriptors, the OS kernel, and why your listener socket is &quot;blind.&quot;&lt;/p&gt;
&lt;h2&gt;The Socket is Just a File&lt;/h2&gt;
&lt;p&gt;In Unix-like operating systems (Linux, macOS), there is a golden rule: &lt;strong&gt;Everything is a file.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When you write this in Go:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;conn, _ := net.Dial(&quot;tcp&quot;, &quot;google.com:80&quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You aren&apos;t magically holding a physical wire connected to Google. You are asking the Operating System to create a network endpoint. The OS does the heavy lifting in kernel memory and hands you back a simple integer, known as a &lt;strong&gt;File Descriptor&lt;/strong&gt;. Yeah, just like anything else you read or write on your computer: a file on a disk, or stdin/stdout.&lt;/p&gt;
&lt;p&gt;Your &lt;code&gt;net.Conn&lt;/code&gt; object is essentially a wrapper around that number.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When you call &lt;code&gt;conn.Write()&lt;/code&gt;, you are writing bytes to a file buffer.&lt;/li&gt;
&lt;li&gt;When you call &lt;code&gt;conn.Read()&lt;/code&gt;, you are reading bytes from a file buffer.&lt;/li&gt;
&lt;li&gt;The OS kernel takes care of actually pushing that data across the physical wires.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Why Do Listeners Create New Sockets?&lt;/h2&gt;
&lt;p&gt;When I first tried to write my SOCKS5 server in Go, the case below was the most confusing part. Look at this standard Go pattern:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;ln, _ := net.Listen(&quot;tcp&quot;, &quot;:8080&quot;)

for {
    conn, _ := ln.Accept()
    go handle(conn)
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So why does &lt;code&gt;ln.Accept()&lt;/code&gt; return a &lt;em&gt;new&lt;/em&gt; connection (&lt;code&gt;conn&lt;/code&gt;)? Why doesn&apos;t it just use the &lt;code&gt;ln&lt;/code&gt; object to talk to the client?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Answer: Concurrency.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Think of your server as a busy hotel. If the Receptionist had to personally escort every guest to their room and stay there to chat, the front desk would be empty. No new guests could check in.&lt;/p&gt;
&lt;p&gt;By design, the OS separates these roles. The Listener stays bound to Port 8080. When a Client arrives, the Listener performs the handshake, creates a &lt;em&gt;new&lt;/em&gt; file descriptor for that specific conversation, and immediately goes back to watching the door for the next guest.&lt;/p&gt;
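&lt;p&gt;To make the pattern concrete, here is a complete, runnable version of the loop above. The snippets leave &lt;code&gt;handle&lt;/code&gt; undefined, so I&apos;ve filled it in with the simplest possible service, an echo server; that choice is mine, purely for illustration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;package main

import (
	&quot;io&quot;
	&quot;log&quot;
	&quot;net&quot;
)

// handle owns exactly one conversation: it echoes bytes
// back until the client hangs up.
func handle(conn net.Conn) {
	defer conn.Close()
	io.Copy(conn, conn)
}

func main() {
	ln, err := net.Listen(&quot;tcp&quot;, &quot;:8080&quot;)
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept() // a brand-new fd per client
		if err != nil {
			continue
		}
		go handle(conn) // the listener goes straight back to the door
	}
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Try it with &lt;code&gt;nc localhost 8080&lt;/code&gt; from two terminals at once: each client gets its own conversation, while &lt;code&gt;ln&lt;/code&gt; itself never talks to either of them.&lt;/p&gt;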
&lt;p&gt;The same thing happens if you create sockets in C. Implementing your server in C just takes slightly more steps: &lt;code&gt;socket()&lt;/code&gt;, &lt;code&gt;bind()&lt;/code&gt;, &lt;code&gt;listen()&lt;/code&gt;, and then &lt;code&gt;accept()&lt;/code&gt;. The &lt;code&gt;accept()&lt;/code&gt; call is what creates a new socket file descriptor for the specific client connection.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c&quot;&gt;int server_fd = socket(AF_INET, SOCK_STREAM, 0);
bind(server_fd, (struct sockaddr *)&amp;#x26;address, sizeof(address));
listen(server_fd, 3);
int new_socket = accept(server_fd, (struct sockaddr *)&amp;#x26;address, (socklen_t*)&amp;#x26;addrlen);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;The Listener is Blind&lt;/h2&gt;
&lt;p&gt;A common misconception is that because the Listener creates the connection, it must &quot;see&quot; all the traffic. But is that true? For example, in the C code above, can you read data from &lt;code&gt;server_fd&lt;/code&gt;? Is something like this possible?&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c&quot;&gt;char buffer[1024] = {0};
read(server_fd, buffer, 1024); // Is this valid?
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you tried to read data from the listening socket, what would happen? &lt;strong&gt;It would fail.&lt;/strong&gt; The listening socket is essentially &lt;strong&gt;blind&lt;/strong&gt; to data payloads. It only understands one thing: &lt;strong&gt;the handshake&lt;/strong&gt; (SYN packets).&lt;/p&gt;
&lt;p&gt;When a packet of data arrives at your server&apos;s IP, the Operating System does the following:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Is this a new handshake?&lt;/strong&gt; Send it to the &lt;strong&gt;Listener&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Is this data for an active conversation?&lt;/strong&gt; Look up the specific &lt;strong&gt;Child Socket&lt;/strong&gt; (that &lt;code&gt;conn&lt;/code&gt; object you got earlier) and send the data there.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The Listener is just a forwarder for new connections. It doesn&apos;t handle data itself.&lt;/p&gt;
&lt;h2&gt;Why Does This Matter?&lt;/h2&gt;
&lt;p&gt;Understanding that &lt;code&gt;Accept()&lt;/code&gt; generates a new, independent file descriptor is exactly why Go is so good at networking.&lt;/p&gt;
&lt;p&gt;Because the new connection (&lt;code&gt;conn&lt;/code&gt;) is completely decoupled from the listener (&lt;code&gt;ln&lt;/code&gt;), we can immediately hand &lt;code&gt;conn&lt;/code&gt; over to a Goroutine.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-go&quot;&gt;go handle(conn) // This runs in the background
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The main loop stays unblocked, the Listener stays at the front desk, and Go&apos;s runtime manages thousands of these &quot;Room Keys&quot; concurrently.&lt;/p&gt;</content:encoded></item><item><title>What Stops L1 Cache from Being Larger?</title><link>https://20051110.xyz/blog/l1cache</link><guid isPermaLink="true">https://20051110.xyz/blog/l1cache</guid><description>Have you ever wondered why L1 cache sizes are relatively small compared to L2 and L3 caches? Actually, they stopped growing a long time ago.</description><pubDate>Thu, 27 Nov 2025 14:55:00 GMT</pubDate><content:encoded>&lt;p&gt;Let&apos;s take a look at a modern CPU, the AMD Ryzen 7950X: (all the below CPU-Z images are from &lt;a href=&quot;https://valid.x86.fr&quot;&gt;valid.x86.fr&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20051110.xyz/_astro/955htt.Cc7QM509_2bDzj.webp&quot; alt=&quot;Ryzen 7950X&quot;&gt;&lt;/p&gt;
&lt;p&gt;And you might wonder: huh, this looks normal. L1 cache is 32 + 32KB per core, L2 is 1MB per core, and L3 is 64MB shared. This seems reasonable, with each level being larger than the previous one.&lt;/p&gt;
&lt;p&gt;But what if I show you an older CPU, the Intel Core 2 Duo E8400 from 2008?&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20051110.xyz/_astro/7s2acr.CEbABV53_Cx75I.webp&quot; alt=&quot;Core 2 Duo E8400&quot;&gt;&lt;/p&gt;
&lt;p&gt;Surprisingly, it has the same L1 cache size: 32 + 32KB per core, L2 is 6MB shared, and there is no L3 cache. This means that for over 15 years, L1 cache sizes have not increased at all! Why is that?&lt;/p&gt;
&lt;p&gt;We all want larger caches to reduce memory latency and improve performance. AMD even came up with its 3D V-Cache technology to stack more cache on top of existing cache dies. So, what stops L1 cache from being larger?&lt;/p&gt;
&lt;p&gt;This question came to mind while I was working through the virtual memory chapters of CS:APP (Computer Systems: A Programmer&apos;s Perspective). As you can see in the course slide below, modern CPUs use a &quot;cute trick&quot; to speed up L1 cache access (credit: &lt;a href=&quot;http://csapp.cs.cmu.edu/&quot;&gt;CS:APP3e Slide&lt;/a&gt;):&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20051110.xyz/_astro/csapp.CWxw_0ph_c4mDO.webp&quot; alt=&quot;L1 Cache Cute Trick&quot;&gt;&lt;/p&gt;
&lt;p&gt;TL;DR of the trick is that L1 cache is &quot;&lt;strong&gt;Virtually Indexed, Physically Tagged&lt;/strong&gt;&quot;. It allows the processor to start looking for data in the L1 Cache before it has even finished translating the address from Virtual to Physical.&lt;/p&gt;
&lt;p&gt;Normally, if a CPU is built without this trick, accessing memory follows a strict sequence:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;The CPU translates the Virtual Address (VA) to a Physical Address (PA) using the TLB.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The CPU uses the Physical Address to check the L1 Cache.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is safe, but slow, because step 2 cannot start until step 1 is finished. With the VIPT trick, however, the CPU can do both things simultaneously, because the cache index comes from address bits that translation leaves unchanged. By the time the L1 Cache has found the candidate row using the Index, the Address Translation has finished, and the hardware compares the resulting physical tag against the tag stored in the cache slot we just looked up. Really smart!&lt;/p&gt;
&lt;p&gt;But wait... For this trick to work, the bits required for the &lt;strong&gt;Cache Index (CI) plus the Cache Offset (CO) must fit inside the Page Offset&lt;/strong&gt;. If the cache were larger, the CI bits would spill over into the VPN. If that happened, we couldn&apos;t use the virtual bits to index the cache because that part of the address does change during translation.&lt;/p&gt;
&lt;p&gt;Note that the cache size is calculated as:&lt;/p&gt;
&lt;p&gt;$$
\text{Cache Size} = 2^{CI + CO} \times \text{Associativity}
$$&lt;/p&gt;
&lt;p&gt;where $2^{CI}$ denotes the number of cache sets, $2^{CO}$ denotes the block size, and Associativity is how many blocks are in each set. As Page Offset size is fixed by the system architecture (e.g., 12 bits for 4KB pages), this actually &lt;em&gt;limits&lt;/em&gt; how large the L1 cache can be.&lt;/p&gt;
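&lt;p&gt;To make the limit concrete, take 4KB pages, so $CI + CO \le 12$, and the 8-way associativity that is common for x86 L1 data caches (illustrative numbers, but typical ones). The ceiling works out to:&lt;/p&gt;
&lt;p&gt;$$
\text{Max L1 Size} = 2^{12} \times 8 = 32\text{KB}
$$&lt;/p&gt;
&lt;p&gt;That is exactly the 32KB figure we keep seeing.&lt;/p&gt;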
&lt;p&gt;The standard way to cheat this limit is to increase Associativity. If we double the associativity, we can double the cache size without increasing CI. However, increasing associativity is not free. Remember that what we want from L1 is access speed? The more associative a cache is, the longer it takes to look up data.&lt;/p&gt;
&lt;p&gt;To select one entry from a cache, we need:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;$N$ comparators for an $N$-way set associative cache, running in parallel. So a 16-way associative cache needs 2x the comparator power and 2x the area of an 8-way one.&lt;/li&gt;
&lt;li&gt;A multiplexer to select the right data output from the $N$ entries. But note: This mux or its control logic is &lt;strong&gt;often on the critical path&lt;/strong&gt; for L1 hit latency. This mux consumes area and power, and more importantly, &lt;strong&gt;increases the access time&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Hardware is physical. You need to consider fan-out: the input address tag must be fanned out to 16 locations instead of 8.&lt;/li&gt;
&lt;li&gt;You also need to deal with the replacement policy! You need to track usage for 16 blocks instead of 8 to determine which one to evict. 16-way caches almost never use true LRU; they often use approximations (Pseudo-LRU) or random replacement.&lt;/li&gt;
&lt;li&gt;Once you have to physically stretch the cache across more of the core, the wire delays and clock tree load start dominating. This is one big reason designers would rather keep L1 small and very close to the pipelines.&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All these overheads add up. So back to our question: why has L1 cache not increased in size for 15 years? The answer is clear now: CPU designers have to make a trade-off between cache size and access speed. A 32 KB L1 is full of parallel tag comparators and big mux trees and is hit on nearly every cycle, so it’s a power hotspot. Still, it is a very reasonable sweet spot for many modern CPUs, and its size has stayed flat mostly because other things scaled instead (L2/L3, prefetching, OoO machinery, etc.). Modern CPUs leaned into this idea:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Keep L1 tiny but extremely fast.&lt;/li&gt;
&lt;li&gt;Grow L2/L3 aggressively for capacity and hit-rate.&lt;/li&gt;
&lt;li&gt;Use prefetchers, better branch prediction, bigger OoO windows, etc. to hide latency.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So instead of &quot;make L1 bigger&quot;, architects made the rest of the machine smarter. They bring things into L1 just in time with prefetch.&lt;/p&gt;
&lt;p&gt;But wait, before you close this article, let me show you one more thing. I have only told you part of the story. Let&apos;s look at another CPU:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20051110.xyz/_astro/69c8c6.BdurLi2k_1jicc2.webp&quot; alt=&quot;Intel Ultra 9 285K&quot;&gt;&lt;/p&gt;
&lt;p&gt;It has a 48KB L1 Data Cache and a 64KB L1 Instruction Cache per performance core!&lt;/p&gt;
&lt;p&gt;The Apple M5 CPU (I do not have a CPU-Z picture of it) is even more interesting: it has a &lt;strong&gt;192KB L1 Instruction Cache&lt;/strong&gt; 🤯 and a &lt;strong&gt;128KB L1 Data Cache&lt;/strong&gt; 🤯 per performance core! How? Well... let&apos;s break them down.&lt;/p&gt;
&lt;p&gt;Intel decided 32KB wasn&apos;t enough for their Data Cache. But remember the VIPT Limit (Page Size = 4KB)? To get 48KB without breaking VIPT, Intel had to pay the &quot;associativity tax&quot; we discussed earlier. They made the L1 Data Cache 12-way set associative (instead of the common 8-way).&lt;/p&gt;
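&lt;p&gt;The formula checks out: the VIPT limit still caps $2^{CI+CO}$ at $2^{12}$ bytes, so&lt;/p&gt;
&lt;p&gt;$$
2^{12} \times 12 = 48\text{KB}
$$&lt;/p&gt;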
&lt;p&gt;But what about Apple? Apple&apos;s M-series chips have L1 caches that are &lt;strong&gt;3x–6x larger than Intel or AMD&apos;s&lt;/strong&gt;. How???? Dude, there is no magic here.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Apple runs its CPUs at lower frequencies (~4.0 GHz) than AMD/Intel (~5.7 GHz, and they are pushing even higher!). A lower frequency makes it easier to access a large cache in 3 cycles.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Apple uses ARM, which has a fixed instruction length (mostly). This makes indexing and decoding slightly more predictable than x86&apos;s variable-length chaos, allowing them to optimize large cache access differently.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Most importantly: While x86 is stuck with standard 4KB memory pages, Apple Silicon is optimized for &lt;strong&gt;16KB pages&lt;/strong&gt;. By using a 16KB page size, the &apos;Page Offset&apos; becomes larger, effectively quadrupling the VIPT limit. This allows Apple to build massive L1 caches without needing complex hardware tricks or excessive associativity. (The arithmetic follows this list.)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
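&lt;p&gt;Run the numbers: a 16KB page means a 14-bit Page Offset, so $CI + CO \le 14$. Assuming the same common 8-way associativity (an illustrative assumption on my part; I have not verified Apple&apos;s actual cache geometry), the ceiling becomes&lt;/p&gt;
&lt;p&gt;$$
2^{14} \times 8 = 128\text{KB}
$$&lt;/p&gt;
&lt;p&gt;which lines up with the M5&apos;s 128KB L1 Data Cache quoted above.&lt;/p&gt;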
&lt;p&gt;So what about AMD then? ~~Would increasing the L1 cache size to 48KB make Ryzen CPUs even better than Intel&apos;s?~~ Well, it turns out that in their latest chip, the Ryzen 9 9950X, they increased the L1 DCache to 48KB.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20051110.xyz/_astro/x5e28p.d7vVVWVn_aVEl4.webp&quot; alt=&quot;AMD Ryzen 9 9950X&quot;&gt;&lt;/p&gt;
&lt;p&gt;Is that the complete story? Not really. Remember we said &quot;Apple uses ARM, which has a fixed instruction length (mostly). This makes indexing and decoding slightly more predictable than x86&apos;s variable-length chaos&quot;? So what about x86? Modern x86 instructions are complex and &quot;ugly.&quot; Before the CPU can execute them, it must decode them into simpler internal commands called &quot;micro-ops.&quot; It turns out that the Micro-op cache is a critical component in modern x86 CPUs.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Both Zen 4 and Zen 5 architectures feature an Op Cache, but Zen 5 has upgraded the design by utilizing two 6-wide Op Caches, as opposed to Zen 4’s single 9-wide Op Cache. The Op Cache is crucial because it stores pre-decoded micro-operations (uOps). When instructions are fetched repeatedly (such as in loops), the CPU can pull these uOps directly from the Op Cache instead of decoding the instructions again, which saves time and power.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The above text is from &lt;a href=&quot;https://medium.com/@jason890418123/exploring-zen-5-and-zen-4-microarchitectures-dive-into-op-cache-branch-prediction-and-more-f9da2469fb5e&quot;&gt;here&lt;/a&gt;. Because this exists, the Icache doesn&apos;t need to be huge; it just serves as a backup for the Op-Cache.&lt;/p&gt;
&lt;p&gt;Hope you enjoyed this article! If you have any questions or corrections, feel free to comment below.&lt;/p&gt;</content:encoded></item><item><title>A Simple Hack to Use Touch ID for sudo on macOS</title><link>https://20051110.xyz/blog/touchid-sudo</link><guid isPermaLink="true">https://20051110.xyz/blog/touchid-sudo</guid><description>Tired of typing your sudo password on macOS? Learn how to enable Touch ID for sudo commands in just a few simple steps.</description><pubDate>Sat, 06 Sep 2025 14:24:00 GMT</pubDate><content:encoded>&lt;p&gt;If you spend any amount of time in the macOS Terminal, you know the drill. You type a command with &lt;code&gt;sudo&lt;/code&gt;, press Enter, you type your long, secure password for the tenth time today, and you think, &quot;There has to be a better way.&quot;&lt;/p&gt;
&lt;p&gt;There is. And it&apos;s right at your fingertips.
It&apos;s a simple, reversible, and game-changing tweak that you&apos;ll appreciate every single day.&lt;/p&gt;
&lt;h2&gt;How It Works (The Quick Version)&lt;/h2&gt;
&lt;p&gt;macOS uses a flexible system called PAM (Pluggable Authentication Modules) to handle authentication. All we&apos;re going to do is edit the configuration file for &lt;code&gt;sudo&lt;/code&gt; to tell it: &quot;Hey, before you ask for a password, just check for a valid fingerprint from Touch ID first. If that works, we&apos;re good to go.&quot;&lt;/p&gt;
&lt;h2&gt;The 2-Minute Setup Guide&lt;/h2&gt;
&lt;h3&gt;Open the Terminal&lt;/h3&gt;
&lt;p&gt;You can find it in &lt;code&gt;Applications/Utilities&lt;/code&gt; or just search for it with Spotlight (&lt;code&gt;⌘ + Space&lt;/code&gt;).&lt;/p&gt;
&lt;h3&gt;Open the PAM Configuration File&lt;/h3&gt;
&lt;p&gt;We need to edit a protected system file, so we&apos;ll use the simple command-line editor &lt;code&gt;nano&lt;/code&gt; with &lt;code&gt;sudo&lt;/code&gt; privileges. Copy and paste the following command and press Enter. It will ask for your password (likely for the last time!).&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo nano /etc/pam.d/sudo
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;nano&lt;/code&gt; editor will open inside your Terminal window. You&apos;ll see a few lines of configuration text.&lt;/p&gt;
&lt;p&gt;The most important part is getting this next step right. On a &lt;strong&gt;new line right after the first commented line&lt;/strong&gt; (the one starting with &lt;code&gt;#&lt;/code&gt;), add the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;auth       sufficient     pam_tid.so
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Make sure it is the &lt;strong&gt;very first active rule&lt;/strong&gt;. For my system, the file looks like this after the edit:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# sudo: auth account password session

auth       sufficient     pam_tid.so   # &amp;#x3C;-- This is the line we added
auth       include        sudo_local
auth       sufficient     pam_smartcard.so
auth       required       pam_opendirectory.so
account    required       pam_permit.so
password   required       pam_deny.so
session    required       pam_permit.so
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The keyword &lt;code&gt;sufficient&lt;/code&gt; is what makes this work. It tells the system that if Touch ID authentication succeeds, it&apos;s enough to grant permission, and no other authentication methods (like your password) are needed.&lt;/p&gt;
&lt;h3&gt;Save and Exit&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Press &lt;code&gt;Control + O&lt;/code&gt; to Write Out (save) the file.&lt;/li&gt;
&lt;li&gt;Press &lt;code&gt;Enter&lt;/code&gt; to confirm the filename.&lt;/li&gt;
&lt;li&gt;Press &lt;code&gt;Control + X&lt;/code&gt; to exit &lt;code&gt;nano&lt;/code&gt; and return to your prompt.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Time to Test It!&lt;/h2&gt;
&lt;p&gt;For the change to take effect, you &lt;strong&gt;must open a new Terminal window or tab&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In your new Terminal session, type a simple &lt;code&gt;sudo&lt;/code&gt; command, like:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo ls
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Instead of a password prompt, you should be greeted by a Touch ID verification pop-up. Place your finger on the sensor, and your command will run. Welcome to the good life.&lt;/p&gt;
&lt;h2&gt;Good to Know&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;How do I undo this?&lt;/strong&gt; Simply edit the &lt;code&gt;/etc/pam.d/sudo&lt;/code&gt; file again and delete the &lt;code&gt;auth sufficient pam_tid.so&lt;/code&gt; line you added.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What if it still asks for my password?&lt;/strong&gt; You likely put the new line in the wrong place. Go back to Step 3 and make absolutely sure it&apos;s the first non-commented line in the file.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What about macOS updates?&lt;/strong&gt; Major system updates can sometimes overwrite this file, reverting it to the default. If Touch ID suddenly stops working for &lt;code&gt;sudo&lt;/code&gt; after an update, just repeat these steps.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That’s it! Enjoy the precious seconds you’ve reclaimed. Happy coding! 👍&lt;/p&gt;</content:encoded></item><item><title>Overview of the RISC-V Design with Tomasulo&apos;s Algorithm</title><link>https://20051110.xyz/blog/tomasulo-cpu</link><guid isPermaLink="true">https://20051110.xyz/blog/tomasulo-cpu</guid><description>An introduction to the RISC-V ISA, Verilog, and Tomasulo’s algorithm.</description><pubDate>Sun, 24 Aug 2025 18:09:00 GMT</pubDate><content:encoded>&lt;h2&gt;Disclaimer&lt;/h2&gt;
&lt;p&gt;A large portion of this blog post is &lt;strong&gt;AI-generated text&lt;/strong&gt; (Google Gemini 2.5 Pro Deep Research). Although I have reviewed and edited the text and fact-checked it, I cannot guarantee that it is 100% accurate or free of errors. Please use this content as a starting point for your own research and understanding, and verify any critical information independently.&lt;/p&gt;
&lt;p&gt;With that said, I believe this post is &lt;strong&gt;super well-written and informative&lt;/strong&gt;, and what really fascinates me is the &quot;problem-solving&quot; learning curve, which highlights the flaws and problems in every design choice and segues into the components that solve them.&lt;/p&gt;
&lt;h2&gt;Part I: The Language of Hardware - Verilog Fundamentals&lt;/h2&gt;
&lt;p&gt;The study of processor design requires a fundamental shift in perspective. The tools and languages used to design hardware, such as Verilog, represent a different paradigm of computation. Understanding this paradigm is the first and most crucial step toward grasping the intricate workings of a modern CPU.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Section 1.1: Thinking in Parallel&lt;/h3&gt;
&lt;p&gt;The most significant conceptual leap from software to hardware is the transition from sequential to concurrent execution. A typical software program, written in a language like C++, is a sequence of instructions executed one after another by a processor. The program&apos;s state changes in a predictable, linear progression. In contrast, a physical hardware circuit is a collection of components—gates, flip-flops, memory blocks—that, once powered on, operate continuously and in parallel. A thousand logic gates do not wait their turn; they all compute their output based on their current inputs simultaneously, every moment in time.&lt;/p&gt;
&lt;p&gt;It is for this reason that Verilog is classified as a &lt;strong&gt;Hardware Description Language (HDL)&lt;/strong&gt;, not a programming language in the traditional sense. Its primary purpose is not to provide a list of commands for a processor to execute, but to &lt;strong&gt;describe&lt;/strong&gt; the physical structure and behavior of a digital electronic circuit. This description serves two main purposes: it can be fed into a simulation tool to model how the described circuit will behave over time, or it can be used by a synthesis tool to generate a netlist, which is a detailed blueprint for manufacturing an Application-Specific Integrated Circuit (ASIC) or configuring a Field-Programmable Gate Array (FPGA).&lt;/p&gt;
&lt;p&gt;The fundamental unit of design in Verilog is the &lt;strong&gt;module&lt;/strong&gt;. A module is a self-contained block of hardware logic, analogous to a class in C++ or a physical integrated circuit (IC) chip. It encapsulates internal logic and defines a clear interface to the outside world through a set of ports, which are declared as &lt;strong&gt;input&lt;/strong&gt;, &lt;strong&gt;output&lt;/strong&gt;, or &lt;strong&gt;inout&lt;/strong&gt;. This modularity is essential for hierarchical design, allowing complex systems like an entire CPU to be built by connecting smaller, well-defined modules such as an Arithmetic Logic Unit (ALU), a register file, and a control unit.&lt;/p&gt;
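&lt;p&gt;As a minimal sketch (a hypothetical 2-to-1 multiplexer, invented here purely to make the port syntax concrete):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-verilog&quot;&gt;// A 2-to-1 multiplexer: selects between two 8-bit inputs.
module mux2 (
  input  wire       sel,   // select line
  input  wire [7:0] a,     // input chosen when sel == 0
  input  wire [7:0] b,     // input chosen when sel == 1
  output wire [7:0] y      // selected output
);
  // Continuous assignment: y follows the inputs at all times.
  assign y = sel ? b : a;
endmodule
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A larger design would instantiate &lt;code&gt;mux2&lt;/code&gt; and wire its ports to other modules&apos; signals, much like composing objects in C++.&lt;/p&gt;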
&lt;hr&gt;
&lt;h3&gt;Section 1.2: Describing Behavior - &lt;code&gt;initial&lt;/code&gt; and &lt;code&gt;always&lt;/code&gt; Blocks&lt;/h3&gt;
&lt;p&gt;Within a Verilog module, the behavior of the circuit is described primarily within two types of procedural blocks: &lt;code&gt;initial&lt;/code&gt; and &lt;code&gt;always&lt;/code&gt;. These blocks contain statements that define how the outputs and internal state of the module should change in response to inputs and time.&lt;/p&gt;
&lt;h4&gt;The &lt;code&gt;initial&lt;/code&gt; Block&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;initial&lt;/code&gt; block is the simpler of the two. As its name suggests, it contains a block of code that begins execution only once, at the very start of a simulation, at time zero. If multiple &lt;code&gt;initial&lt;/code&gt; blocks are defined within a module, they all start concurrently at time zero.&lt;/p&gt;
&lt;p&gt;This &quot;run-once&quot; behavior has a critical implication: &lt;code&gt;initial&lt;/code&gt; blocks are generally &lt;strong&gt;not synthesizable&lt;/strong&gt;. Real hardware does not have a concept of a &quot;beginning of time&quot; in the same way a simulation does; once powered on, it operates continuously. Therefore, an &lt;code&gt;initial&lt;/code&gt; block cannot be translated into a physical circuit that performs an action only at power-on. Its primary role is within the realm of simulation, specifically in the construction of a &lt;strong&gt;testbench&lt;/strong&gt;. A testbench is a separate Verilog module written to test the design (often called the &quot;Design Under Test&quot; or DUT). Within a testbench, &lt;code&gt;initial&lt;/code&gt; blocks are indispensable for generating clock signals, providing a sequence of input stimuli to the DUT, and setting up initial memory states to verify the design&apos;s correctness.&lt;/p&gt;
&lt;h4&gt;The &lt;code&gt;always&lt;/code&gt; Block&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;always&lt;/code&gt; block is the cornerstone of synthesizable Verilog code. It contains a block of statements that execute repeatedly throughout the simulation. The execution of an &lt;code&gt;always&lt;/code&gt; block is triggered by events specified in its &lt;strong&gt;sensitivity list&lt;/strong&gt;, denoted by &lt;code&gt;@(...)&lt;/code&gt;. This behavior directly models the nature of real hardware, which continuously reacts to changes in its input signals or to clock edges.&lt;/p&gt;
&lt;p&gt;The sensitivity list dictates what kind of hardware the &lt;code&gt;always&lt;/code&gt; block describes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;always @(posedge clk)&lt;/code&gt;: This syntax specifies that the block should execute only on the positive (rising) edge of the signal named &lt;code&gt;clk&lt;/code&gt;. This is the standard way to describe &lt;strong&gt;sequential logic&lt;/strong&gt;, such as flip-flops and registers, which are memory elements that capture and store a value at a specific moment defined by a clock signal.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;always @(*)&lt;/code&gt;: The asterisk is a shorthand that tells the simulator to execute the block whenever &lt;strong&gt;any&lt;/strong&gt; of the signals read on the right-hand side of assignments within the block changes its value. This describes &lt;strong&gt;combinational logic&lt;/strong&gt;—circuits like adders, multiplexers, or decoders whose outputs depend solely on their current inputs, with no memory of past states.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Because these constructs map directly to physical hardware components (clocked registers and logic gates), &lt;code&gt;always&lt;/code&gt; blocks are the primary tool for describing the synthesizable behavior of a digital design.&lt;/p&gt;
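&lt;p&gt;A pair of toy examples makes the mapping concrete (the signal names here are mine, chosen for illustration):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-verilog&quot;&gt;// Combinational: sum is recomputed whenever a or b changes (an adder).
always @(*) begin
  sum = a + b;
end

// Sequential: q captures d only on the rising edge of clk (a flip-flop).
always @(posedge clk) begin
  q &amp;#x3C;= d;
end
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note the two different assignment operators; the next section explains why that distinction matters.&lt;/p&gt;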
&lt;hr&gt;
&lt;h3&gt;Section 1.3: The Heart of Synthesis - Blocking vs. Non-blocking Assignments&lt;/h3&gt;
&lt;p&gt;Perhaps the most frequent and critical point of confusion for those transitioning from software to Verilog is the distinction between the two types of assignment operators: &lt;strong&gt;blocking (&lt;code&gt;=&lt;/code&gt;)&lt;/strong&gt; and &lt;strong&gt;non-blocking (&lt;code&gt;&amp;#x3C;=&lt;/code&gt;)&lt;/strong&gt;. This is not a matter of stylistic preference; the choice of operator is a direct instruction to the synthesis tool about the type of hardware circuit to create. Misunderstanding this distinction is the leading cause of simulation-synthesis mismatches, where a design works perfectly in simulation but fails when implemented in actual hardware.&lt;/p&gt;
&lt;h4&gt;Blocking Assignments (&lt;code&gt;=&lt;/code&gt;)&lt;/h4&gt;
&lt;p&gt;A blocking assignment is executed in the order it appears within a procedural block, much like in a C program. The execution of the current statement &quot;blocks&quot; the execution of any subsequent statements in the same &lt;code&gt;begin...end&lt;/code&gt; block until it is complete. The variable on the left-hand side is updated immediately, and this new value is used by all subsequent statements in the block.&lt;/p&gt;
&lt;p&gt;This immediate-update behavior models a chain of &lt;strong&gt;combinational logic&lt;/strong&gt;. Imagine a series of logic gates connected by wires. The output of the first gate is instantaneously available as the input to the second gate. Blocking assignments are therefore the correct choice for describing this type of logic, typically within an &lt;code&gt;always @(*)&lt;/code&gt; block.&lt;/p&gt;
&lt;h4&gt;Non-blocking Assignments (&lt;code&gt;&amp;#x3C;=&lt;/code&gt;)&lt;/h4&gt;
&lt;p&gt;A non-blocking assignment operates in a two-phase manner that is fundamentally different from any software assignment. Within a block triggered by an event (like a clock edge), all the right-hand side (RHS) expressions of the non-blocking assignments are evaluated and stored in temporary variables &lt;strong&gt;first&lt;/strong&gt;. Only after all RHS expressions have been evaluated does the second phase begin, where the left-hand side (LHS) variables are all updated &lt;strong&gt;simultaneously&lt;/strong&gt; with their corresponding temporary values. The execution of one non-blocking assignment does not block the evaluation of the next.&lt;/p&gt;
&lt;p&gt;This two-phase mechanism perfectly models the behavior of a bank of &lt;strong&gt;sequential logic&lt;/strong&gt; elements, such as D-type flip-flops, that share a common clock. On a clock edge, all the flip-flops simultaneously sample the data at their D inputs. A short time later (the clock-to-Q delay), all their Q outputs change to reflect the newly captured values. The value of one flip-flop&apos;s output at the beginning of the clock cycle determines the input of the next flip-flop, but the update doesn&apos;t happen until the end of the cycle. Non-blocking assignments are therefore the correct and safe way to model state changes in sequential logic, and they should be used exclusively for assignments within a clocked &lt;code&gt;always @(posedge clk)&lt;/code&gt; block.&lt;/p&gt;
&lt;h4&gt;Example: The Shift Register&lt;/h4&gt;
&lt;p&gt;The difference becomes crystal clear with a simple 3-bit shift register example. The goal is to have a value at &lt;code&gt;data_in&lt;/code&gt; shift one position to the right on each clock cycle: &lt;code&gt;data_in -&gt; q1 -&gt; q2 -&gt; q3&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Incorrect Version (using Blocking &lt;code&gt;=&lt;/code&gt;)&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-verilog&quot;&gt;always @(posedge clk) begin
  q1 = data_in;
  q2 = q1;
  q3 = q2;
end
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In simulation, on a single rising clock edge, &lt;code&gt;q1&lt;/code&gt; is immediately updated with &lt;code&gt;data_in&lt;/code&gt;. Because this is a blocking assignment, that new value of &lt;code&gt;q1&lt;/code&gt; is then immediately used to update &lt;code&gt;q2&lt;/code&gt;. And that new value of &lt;code&gt;q2&lt;/code&gt; is immediately used to update &lt;code&gt;q3&lt;/code&gt;. The result is that the value from &lt;code&gt;data_in&lt;/code&gt; propagates all the way to &lt;code&gt;q3&lt;/code&gt; within a single clock cycle. The synthesis tool will interpret this as a direct wire from &lt;code&gt;data_in&lt;/code&gt; to &lt;code&gt;q3&lt;/code&gt;, not a series of registers. This is not a shift register.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Correct Version (using Non-blocking &lt;code&gt;&amp;#x3C;=&lt;/code&gt;)&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-verilog&quot;&gt;always @(posedge clk) begin
  q1 &amp;#x3C;= data_in;
  q2 &amp;#x3C;= q1;
  q3 &amp;#x3C;= q2;
end
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On a rising clock edge, the simulator evaluates all RHS expressions first: &lt;code&gt;data_in&lt;/code&gt;, the &lt;strong&gt;current&lt;/strong&gt; value of &lt;code&gt;q1&lt;/code&gt;, and the &lt;strong&gt;current&lt;/strong&gt; value of &lt;code&gt;q2&lt;/code&gt;. Then, at the end of the simulation time step, it updates the LHS variables simultaneously. &lt;code&gt;q1&lt;/code&gt; gets the value of &lt;code&gt;data_in&lt;/code&gt;, &lt;code&gt;q2&lt;/code&gt; gets the &lt;strong&gt;old&lt;/strong&gt; value of &lt;code&gt;q1&lt;/code&gt;, and &lt;code&gt;q3&lt;/code&gt; gets the &lt;strong&gt;old&lt;/strong&gt; value of &lt;code&gt;q2&lt;/code&gt;. This correctly models three separate flip-flops, and it takes three clock cycles for a value to propagate from &lt;code&gt;data_in&lt;/code&gt; to &lt;code&gt;q3&lt;/code&gt;. This is a true shift register.&lt;/p&gt;
&lt;h4&gt;Pitfalls and Best Practices&lt;/h4&gt;
&lt;p&gt;Mixing blocking and non-blocking assignments in the same &lt;code&gt;always&lt;/code&gt; block, or using the wrong type for the logic intended, can lead to indeterminate behavior known as a &lt;strong&gt;race condition&lt;/strong&gt;. This occurs when the final state of a variable depends on the unpredictable order in which a simulator evaluates concurrent events. To avoid these issues and ensure a design that is both simulatable and synthesizable, designers adhere to strict rules of thumb:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When modeling sequential logic (clocked &lt;code&gt;always&lt;/code&gt; blocks), use &lt;strong&gt;non-blocking&lt;/strong&gt; assignments (&lt;code&gt;&amp;#x3C;=&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;When modeling combinational logic (&lt;code&gt;always @(*)&lt;/code&gt; blocks), use &lt;strong&gt;blocking&lt;/strong&gt; assignments (&lt;code&gt;=&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Do not mix blocking and non-blocking assignments in the same &lt;code&gt;always&lt;/code&gt; block.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The underlying reason for these rules is to bridge the gap between the discrete event-scheduling model of a simulator and the continuous, physical reality of hardware. A non-blocking assignment is a directive to the simulator to &lt;strong&gt;schedule&lt;/strong&gt; an update for the end of the current time step, which is how a synthesis tool understands the need for a memory element (a flip-flop) that holds a value across clock cycles. A blocking assignment directs the simulator to update a value &lt;strong&gt;immediately&lt;/strong&gt;, which is how a synthesis tool understands a direct connection of logic gates whose output changes as soon as the input changes. Using the wrong operator creates a mismatch between what is simulated and what is built, which is the root cause of many hardware design bugs.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Feature&lt;/th&gt;&lt;th&gt;Blocking Assignment (&lt;code&gt;=&lt;/code&gt;)&lt;/th&gt;&lt;th&gt;Non-blocking Assignment (&lt;code&gt;&amp;#x3C;=&lt;/code&gt;)&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Operator&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;code&gt;=&lt;/code&gt;&lt;/td&gt;&lt;td&gt;&lt;code&gt;&amp;#x3C;=&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Execution Model&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Sequential, in-order execution within a block. Updates are immediate.&lt;/td&gt;&lt;td&gt;Parallel evaluation of RHS, followed by simultaneous update of LHS.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Hardware Inference&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Combinational logic (wires, gates).&lt;/td&gt;&lt;td&gt;Sequential logic (flip-flops, registers).&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Typical &lt;code&gt;always&lt;/code&gt; Block&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;code&gt;always @(*)&lt;/code&gt;&lt;/td&gt;&lt;td&gt;&lt;code&gt;always @(posedge clk)&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Use Case Example&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;code&gt;always @(*) begin y = sel ? b : a; end&lt;/code&gt;&lt;/td&gt;&lt;td&gt;&lt;code&gt;always @(posedge clk) begin q &amp;#x3C;= d; end&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Part II: The Blueprint of a CPU - The RISC-V ISA&lt;/h2&gt;
&lt;p&gt;Having established the language for describing hardware, the next step is to understand the vocabulary that a processor speaks. This vocabulary is its Instruction Set Architecture (ISA), the fundamental interface between software and hardware. For this exploration, the RISC-V ISA provides an ideal foundation due to its modern, clean, and extensible design.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Section 2.1: An Introduction to Instruction Set Architectures (ISA)&lt;/h3&gt;
&lt;p&gt;An ISA is the abstract model of a computer that is visible to a machine-language programmer or compiler. It is the definitive contract between the software that runs on a processor and the hardware that executes it. This contract specifies a set of critical elements, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The set of available instructions (the &quot;opcodes&quot;).&lt;/li&gt;
&lt;li&gt;The native data types.&lt;/li&gt;
&lt;li&gt;The programmer-visible registers.&lt;/li&gt;
&lt;li&gt;The memory addressing modes.&lt;/li&gt;
&lt;li&gt;The handling of events like interrupts and exceptions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Any processor that correctly implements a given ISA will execute the same machine code and produce the same results, regardless of its internal microarchitectural design. An Intel Core i9 and an AMD Ryzen processor, for example, have vastly different internal designs but can both run Windows because they both implement the x86-64 ISA.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Section 2.2: The RISC-V Revolution - Openness and Modularity&lt;/h3&gt;
&lt;p&gt;RISC-V (pronounced &quot;risk-five&quot;) is not just another ISA; it represents a paradigm shift in how ISAs are developed and used. It was born at the University of California, Berkeley, in 2010 with the goal of creating a practical, high-quality ISA that was open, free, and suitable for a wide range of computing applications, from academic research to industrial deployment.&lt;/p&gt;
&lt;h4&gt;The RISC Philosophy&lt;/h4&gt;
&lt;p&gt;At its core, RISC-V is a pure embodiment of the &lt;strong&gt;Reduced Instruction Set Computer (RISC)&lt;/strong&gt; philosophy. This design approach contrasts with the Complex Instruction Set Computer (CISC) paradigm of architectures like x86. The core tenets of RISC, and by extension RISC-V, are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;A small number of simple instructions:&lt;/strong&gt; The instruction set is kept minimal, focusing on fundamental operations. More complex operations are built by combining these simple instructions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fixed-length instruction encoding:&lt;/strong&gt; All base instructions are the same length (32 bits), which dramatically simplifies the hardware required for instruction fetching and decoding.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Load/Store architecture:&lt;/strong&gt; The only instructions that access memory are explicit &lt;strong&gt;load&lt;/strong&gt; and &lt;strong&gt;store&lt;/strong&gt; operations. All arithmetic and logical operations are performed on operands held in processor registers. This simplifies the control logic and encourages efficient register usage by compilers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;One instruction per cycle:&lt;/strong&gt; The simplicity of the instructions is designed to allow for execution in a single clock cycle in a basic pipeline, which is key to achieving high performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This adherence to simplicity results in a more streamlined processor design, leading to improved performance, lower power consumption, and reduced design complexity.&lt;/p&gt;
&lt;h4&gt;Open and Free&lt;/h4&gt;
&lt;p&gt;Unlike proprietary ISAs such as x86 and ARM, the RISC-V specification is developed and maintained by the non-profit RISC-V International and is available under open-source licenses. This means anyone can design, manufacture, and sell RISC-V chips and software without paying royalties. This openness has catalyzed a global wave of innovation, enabling startups, academic institutions, and even large corporations to develop custom processors tailored for specific applications without the barrier of licensing fees or vendor lock-in.&lt;/p&gt;
&lt;h4&gt;Modular Design&lt;/h4&gt;
&lt;p&gt;A defining feature of RISC-V is its inherent modularity. The ISA is not a monolithic entity but is structured as a small, mandatory &lt;strong&gt;base integer ISA&lt;/strong&gt; with a rich set of optional &lt;strong&gt;standard extensions&lt;/strong&gt;. A processor&apos;s full ISA is specified by its base and the extensions it implements. For instance, a common configuration for a 64-bit general-purpose processor is denoted &lt;strong&gt;RV64GC&lt;/strong&gt;, which stands for &lt;strong&gt;RV64IMAFDC&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The base integer ISAs are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;RV32I&lt;/strong&gt;: The base 32-bit integer instruction set with 32 integer registers (x0-x31).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RV64I&lt;/strong&gt;: The base 64-bit integer instruction set, extending the registers and operations to 64 bits.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RV32E&lt;/strong&gt;: An embedded variant of &lt;code&gt;RV32I&lt;/code&gt; with only 16 integer registers, designed for the smallest microcontrollers.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The most common standard extensions, often grouped under the letter &apos;G&apos; for &quot;General-Purpose,&quot; are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;M&lt;/strong&gt;: Standard Extension for Integer Multiplication and Division. Adds instructions like &lt;code&gt;mul&lt;/code&gt;, &lt;code&gt;div&lt;/code&gt;, and &lt;code&gt;rem&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A&lt;/strong&gt;: Standard Extension for Atomic Instructions. Provides instructions for atomic memory operations (e.g., &lt;code&gt;amoswap&lt;/code&gt;), essential for synchronization in multi-core systems.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;F&lt;/strong&gt;: Standard Extension for Single-Precision Floating-Point. Adds a separate floating-point register file (f0-f31) and instructions for 32-bit floating-point arithmetic.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;D&lt;/strong&gt;: Standard Extension for Double-Precision Floating-Point. Extends the &lt;code&gt;F&lt;/code&gt; extension with support for 64-bit floating-point operations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;C&lt;/strong&gt;: Standard Extension for Compressed Instructions. Defines 16-bit versions of the most common 32-bit instructions. This can significantly reduce code size and improve instruction fetch bandwidth, which is critical in memory-constrained embedded systems and for performance in high-end cores.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This modularity allows designers to create highly optimized processors. A tiny microcontroller for an IoT sensor might only implement &lt;code&gt;RV32EMC&lt;/code&gt;, while a high-performance application processor in a data center might implement &lt;code&gt;RV64G&lt;/code&gt; plus extensions for vector processing (V) and bit manipulation (B).&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Section 2.3: Anatomy of a RISC-V Instruction&lt;/h3&gt;
&lt;p&gt;All base RISC-V instructions are 32 bits long and fall into one of a few well-defined formats. The regularity of these formats is a key design feature that enables the simple, high-performance pipelines for which RISC architectures are known. The primary formats are as follows (a worked encoding example comes after the list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;R-type (Register):&lt;/strong&gt; Used for register-to-register operations like &lt;code&gt;add&lt;/code&gt;, &lt;code&gt;sub&lt;/code&gt;, &lt;code&gt;and&lt;/code&gt;, &lt;code&gt;or&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;| funct7 (7) | rs2 (5) | rs1 (5) | funct3 (3) | rd (5) | opcode (7) |
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;opcode&lt;/code&gt;: Defines the instruction type (e.g., &lt;code&gt;OP&lt;/code&gt; for register-register arithmetic).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;rd&lt;/code&gt;: The destination register.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;funct3&lt;/code&gt;: Further specifies the operation (e.g., &lt;code&gt;ADD&lt;/code&gt;/&lt;code&gt;SUB&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;rs1&lt;/code&gt;, &lt;code&gt;rs2&lt;/code&gt;: The two source registers.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;funct7&lt;/code&gt;: An additional field to differentiate operations (e.g., &lt;code&gt;ADD&lt;/code&gt; from &lt;code&gt;SUB&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;I-type (Immediate):&lt;/strong&gt; Used for operations with an immediate value, including &lt;code&gt;addi&lt;/code&gt;, and for load instructions like &lt;code&gt;lw&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;| imm[11:0] (12) | rs1 (5) | funct3 (3) | rd (5) | opcode (7) |
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;imm[11:0]&lt;/code&gt;: A 12-bit signed immediate value.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;rs1&lt;/code&gt;: The source register.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;rd&lt;/code&gt;: The destination register.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;S-type (Store):&lt;/strong&gt; Used for store instructions like &lt;code&gt;sw&lt;/code&gt; (store word).&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;| imm[11:5] (7) | rs2 (5) | rs1 (5) | funct3 (3) | imm[4:0] (5) | opcode (7) |
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;The 12-bit immediate is split to accommodate the two source register fields.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;rs1&lt;/code&gt;: The base address register.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;rs2&lt;/code&gt;: The register containing the data to be stored.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;B-type (Branch):&lt;/strong&gt; Used for conditional branch instructions like &lt;code&gt;beq&lt;/code&gt; (branch if equal). Similar to S-type, the immediate is split.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;| imm[12|10:5] (7) | rs2 (5) | rs1 (5) | funct3 (3) | imm[4:1|11] (5) | opcode (7) |
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;rs1&lt;/code&gt;, &lt;code&gt;rs2&lt;/code&gt;: The registers to be compared.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;imm&lt;/code&gt;: The signed branch offset, encoded in multiples of 2 bytes and added to the PC.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;U-type (Upper Immediate):&lt;/strong&gt; Used for loading a 20-bit upper immediate value, as in &lt;code&gt;lui&lt;/code&gt; (load upper immediate).&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;| imm[31:12] (20) | rd (5) | opcode (7) |
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;J-type (Jump):&lt;/strong&gt; Used for unconditional jumps like &lt;code&gt;jal&lt;/code&gt; (jump and link).&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;| imm[20|10:1|11|19:12] (20) | rd (5) | opcode (7) |
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;
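&lt;p&gt;To tie the formats together, here is a worked R-type example, hand-assembled by me (worth double-checking against a real assembler): &lt;code&gt;add x5, x1, x2&lt;/code&gt; uses &lt;code&gt;opcode = 0110011&lt;/code&gt;, &lt;code&gt;funct3 = 000&lt;/code&gt;, and &lt;code&gt;funct7 = 0000000&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;| funct7  | rs2 (x2) | rs1 (x1) | funct3 | rd (x5) | opcode  |
| 0000000 | 00010    | 00001    | 000    | 00101   | 0110011 |

=&gt; 0000 0000 0010 0000 1000 0010 1011 0011  =  0x002082B3
&lt;/code&gt;&lt;/pre&gt;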
&lt;p&gt;The deliberate and consistent placement of the &lt;code&gt;opcode&lt;/code&gt;, &lt;code&gt;rs1&lt;/code&gt;, &lt;code&gt;rs2&lt;/code&gt;, and &lt;code&gt;rd&lt;/code&gt; fields across these formats is not an accident. It is a cornerstone of efficient RISC design. In a pipelined processor, the Instruction Decode (ID) stage must identify the source registers and read their values from the register file. Because &lt;code&gt;rs1&lt;/code&gt; and &lt;code&gt;rs2&lt;/code&gt; are always in the same bit positions for all instruction formats that use them (R, I, S, B), the decoder hardware is greatly simplified. It can begin reading from the register file before it has even finished fully decoding the instruction to determine the exact operation. This parallelism within the ID stage is a crucial enabler of the classic 5-stage RISC pipeline, a concept that forms the foundation of modern processor execution.&lt;/p&gt;
&lt;h2&gt;Part III: The Assembly Line - Pipelined Execution and Its Perils&lt;/h2&gt;
&lt;p&gt;To achieve high performance, modern processors do not execute instructions one at a time, waiting for each to complete before starting the next. Instead, they use a technique called &lt;strong&gt;pipelining&lt;/strong&gt;, which overlaps the execution of multiple instructions, much like an assembly line in a factory. This approach is fundamental to all high-performance CPUs, and the RISC-V ISA is explicitly designed to facilitate it.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Section 3.1: The Classic 5-Stage RISC Pipeline&lt;/h3&gt;
&lt;p&gt;Pipelining increases the &lt;strong&gt;instruction throughput&lt;/strong&gt;—the number of instructions completed per unit of time—without necessarily decreasing the &lt;strong&gt;latency&lt;/strong&gt; of any single instruction. The concept is best understood through the analogy of doing laundry. A sequential approach would be to wash, dry, fold, and put away one load of laundry completely before starting the next. A pipelined approach starts the washer on the second load as soon as the first load moves to the dryer. By keeping all stages (washer, dryer, folding table) busy, the total time to complete many loads is significantly reduced.&lt;/p&gt;
&lt;p&gt;Similarly, the execution of a RISC instruction can be broken down into a series of uniform steps. The classic RISC pipeline consists of five stages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;IF (Instruction Fetch):&lt;/strong&gt; The processor fetches the 32-bit instruction from the instruction memory (or cache) at the address currently held by the Program Counter (PC). Concurrently, the PC is updated to point to the next instruction, which is typically at address $PC+4$ since each instruction is 4 bytes long.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;ID (Instruction Decode and Register Fetch):&lt;/strong&gt; The fetched instruction is decoded by the control unit to determine what operation to perform. The format of the instruction is identified, and the required control signals for subsequent stages are generated. Simultaneously, the source register identifiers (&lt;code&gt;rs1&lt;/code&gt; and &lt;code&gt;rs2&lt;/code&gt;) are used to read their corresponding values from the processor&apos;s register file. Any immediate value in the instruction is also sign-extended and prepared for use.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;EX (Execute):&lt;/strong&gt; This is where the actual computation occurs. The Arithmetic Logic Unit (ALU) performs the operation specified by the instruction. This could be an arithmetic operation (&lt;code&gt;add&lt;/code&gt;, &lt;code&gt;sub&lt;/code&gt;), a logical operation (&lt;code&gt;and&lt;/code&gt;, &lt;code&gt;or&lt;/code&gt;), a memory address calculation for a load or store (by adding the base register and the immediate offset), or a comparison for a branch instruction.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;MEM (Memory Access):&lt;/strong&gt; This stage is active only for load and store instructions. For a &lt;strong&gt;load&lt;/strong&gt; instruction (&lt;code&gt;lw&lt;/code&gt;), the address calculated in the EX stage is used to read data from the data memory (or cache). For a &lt;strong&gt;store&lt;/strong&gt; instruction (&lt;code&gt;sw&lt;/code&gt;), the address and data are used to write to the data memory. For all other instructions (e.g., arithmetic or branch), this stage performs no operation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;WB (Write-Back):&lt;/strong&gt; The final stage writes the result of the operation back into the register file. For an arithmetic instruction, the result comes from the ALU. For a &lt;strong&gt;load&lt;/strong&gt; instruction, the result is the data read from memory. The destination register identifier (&lt;code&gt;rd&lt;/code&gt;) from the instruction determines which register is written.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In an ideal scenario, a new instruction enters the IF stage every clock cycle. After five cycles, the pipeline is full, and one instruction completes every cycle, achieving an ideal throughput of one instruction per cycle (IPC).&lt;/p&gt;
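&lt;p&gt;A simple timing diagram shows the overlap (each column is one clock cycle):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Cycle:   1    2    3    4    5    6    7
Inst 1:  IF   ID   EX   MEM  WB
Inst 2:       IF   ID   EX   MEM  WB
Inst 3:            IF   ID   EX   MEM  WB
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;From cycle 5 onward, one instruction completes every cycle, even though each individual instruction still takes five cycles from start to finish.&lt;/p&gt;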
&lt;hr&gt;
&lt;h3&gt;Section 3.2: When the Assembly Line Breaks - Pipeline Hazards&lt;/h3&gt;
&lt;p&gt;The simple, elegant model of the 5-stage pipeline breaks down when dependencies between instructions conflict with the overlapped execution model. These conflicts are known as &lt;strong&gt;pipeline hazards&lt;/strong&gt;, and they are the primary challenge in processor design. Hazards force the pipeline to stall, inserting &quot;bubbles&quot; where no useful work is done, thereby degrading performance. There are three main types of hazards.&lt;/p&gt;
&lt;h4&gt;Structural Hazards&lt;/h4&gt;
&lt;p&gt;A &lt;strong&gt;structural hazard&lt;/strong&gt; occurs when two or more instructions in the pipeline require the same hardware resource at the same time. A classic example is a processor with a single, unified memory for both instructions and data. In such a design, a &lt;strong&gt;load&lt;/strong&gt; instruction in its MEM stage would need to access memory simultaneously with a later instruction in its IF stage, which also needs to access memory to be fetched. This resource conflict would force one of the instructions to wait. The standard solution in RISC processors is to use a &lt;strong&gt;Harvard architecture&lt;/strong&gt;, which employs separate, independent memories or caches for instructions and data, thus eliminating this specific hazard. Another potential structural hazard is in the register file, which is accessed for reads in the ID stage and for writes in the WB stage. This is typically resolved by designing the register file with separate read and write ports, or by performing writes in the first half of the clock cycle and reads in the second half.&lt;/p&gt;
&lt;h4&gt;Data Hazards&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Data hazards&lt;/strong&gt; arise from data dependencies between instructions. They occur when an instruction&apos;s execution depends on the result of a preceding instruction that is still in the pipeline.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Read-After-Write (RAW):&lt;/strong&gt; This is the most common and intuitive data hazard. An instruction attempts to read a source register before a previous instruction has written its result back to that register. Consider the sequence:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;add x5, x1, x2  // Instruction 1
sub x6, x5, x3  // Instruction 2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;sub&lt;/code&gt; instruction needs the value of &lt;code&gt;x5&lt;/code&gt;, but the &lt;code&gt;add&lt;/code&gt; instruction only calculates it in its EX stage and writes it back in its WB stage. By the time the &lt;code&gt;sub&lt;/code&gt; instruction is in its ID stage ready to read &lt;code&gt;x5&lt;/code&gt;, the &lt;code&gt;add&lt;/code&gt; instruction has not yet completed its WB stage, so the register file contains an old, stale value for &lt;code&gt;x5&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Write-After-Read (WAR):&lt;/strong&gt; An instruction tries to write to a destination register before a preceding instruction has finished reading that register&apos;s original value. This is not a problem in the simple 5-stage pipeline because reads always happen in an earlier stage (ID) than writes (WB). However, it becomes a major issue in processors with out-of-order execution.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Write-After-Write (WAW):&lt;/strong&gt; Two instructions in the pipeline are scheduled to write to the same destination register. Similar to WAR, this is not an issue in a simple in-order pipeline where writes happen in program order, but it is a critical hazard that must be managed in more complex designs.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Control Hazards&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Control hazards&lt;/strong&gt;, also known as branch hazards, are caused by branch and jump instructions that change the normal flow of program execution. The processor does not know the outcome of a conditional branch (whether it is taken or not taken) until the comparison is performed in the EX stage. By that time, the processor has already fetched and started decoding the instructions that sequentially follow the branch (at $PC+4$). If the branch is taken, these fetched instructions are incorrect and must be flushed from the pipeline, and the fetch must restart from the branch target address. This flushing process introduces stalls, or bubbles, into the pipeline, reducing performance.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Section 3.3: Basic Hazard Resolution - Stalling and Forwarding&lt;/h3&gt;
&lt;p&gt;To ensure correct program execution, hazards must be detected and resolved by the processor&apos;s control logic.&lt;/p&gt;
&lt;h4&gt;Stalling (Pipeline Bubbles)&lt;/h4&gt;
&lt;p&gt;The most straightforward solution to a hazard is to &lt;strong&gt;stall&lt;/strong&gt; the pipeline. When the hazard detection logic in the ID stage identifies a dependency (e.g., a RAW hazard), it can freeze the early stages of the pipeline and insert no-operation instructions, or &quot;bubbles,&quot; into the later stages. For the &lt;code&gt;add&lt;/code&gt;/&lt;code&gt;sub&lt;/code&gt; example above, the &lt;code&gt;sub&lt;/code&gt; instruction would be held in the ID stage for several cycles until the &lt;code&gt;add&lt;/code&gt; instruction completes its WB stage and the new value of &lt;code&gt;x5&lt;/code&gt; is available in the register file. While simple and effective, stalling is inefficient as it directly reduces the pipeline&apos;s throughput.&lt;/p&gt;
&lt;h4&gt;Forwarding (Bypassing)&lt;/h4&gt;
&lt;p&gt;A much more efficient solution for most data hazards is &lt;strong&gt;forwarding&lt;/strong&gt;, also known as &lt;strong&gt;bypassing&lt;/strong&gt;. The key observation is that the result of an operation is often available within the pipeline long before it is written back to the register file. For example, the result of the &lt;code&gt;add&lt;/code&gt; instruction is available at the output of the ALU at the end of the EX stage. Forwarding logic adds extra data paths to send this result directly from the output of a later stage (like EX or MEM) back to the input of an earlier stage (like EX) for a subsequent, dependent instruction. This bypasses the need to wait for the result to be written to and then read from the register file. In the &lt;code&gt;add&lt;/code&gt;/&lt;code&gt;sub&lt;/code&gt; example, the result from the &lt;code&gt;add&lt;/code&gt; instruction&apos;s EX stage can be forwarded directly to the input of the &lt;code&gt;sub&lt;/code&gt; instruction&apos;s EX stage, completely eliminating the stall.&lt;/p&gt;
&lt;p&gt;However, forwarding cannot solve all data hazards. A classic case is the &lt;strong&gt;load-use hazard&lt;/strong&gt;. Consider this sequence:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;lw  x5, 0(x1)   // Instruction 1
add x6, x5, x2  // Instruction 2
&lt;/code&gt;&lt;/pre&gt;
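&lt;p&gt;Sketched as a timing diagram (the &lt;code&gt;**&lt;/code&gt; marks the unavoidable one-cycle bubble):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Cycle:           1    2    3    4    5    6    7
lw  x5, 0(x1):   IF   ID   EX   MEM  WB
add x6, x5, x2:       IF   ID   **   EX   MEM  WB
&lt;/code&gt;&lt;/pre&gt;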
&lt;p&gt;The &lt;code&gt;lw&lt;/code&gt; instruction only has the data from memory available at the end of its MEM stage. The &lt;code&gt;add&lt;/code&gt; instruction needs this data at the beginning of its EX stage. Even with a forwarding path from the MEM stage back to the EX stage, the data arrives one cycle too late. The &lt;code&gt;add&lt;/code&gt; instruction must be stalled for one cycle. This limitation, along with the performance penalty from control hazards and the inefficiency of handling long-latency operations like floating-point division, reveals the inherent performance ceiling of a rigid, in-order pipeline. It is this ceiling that motivates the development of more sophisticated, dynamic execution techniques that can look further ahead in the instruction stream to find independent work to do.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Hazard Type&lt;/th&gt;&lt;th&gt;Description&lt;/th&gt;&lt;th&gt;Example RISC-V Sequence&lt;/th&gt;&lt;th&gt;Simple Pipeline Effect&lt;/th&gt;&lt;th&gt;Solution(s)&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Structural&lt;/td&gt;&lt;td&gt;Two instructions need the same resource in the same cycle.&lt;/td&gt;&lt;td&gt;&lt;code&gt;lw&lt;/code&gt; in MEM stage, &lt;code&gt;add&lt;/code&gt; in IF stage, both needing a unified memory.&lt;/td&gt;&lt;td&gt;One instruction must stall.&lt;/td&gt;&lt;td&gt;Separate Instruction/Data Memories (Harvard Architecture).&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Data (RAW)&lt;/td&gt;&lt;td&gt;An instruction needs the result of a previous, unfinished instruction.&lt;/td&gt;&lt;td&gt;&lt;code&gt;add x5, x1, x2&lt;/code&gt; followed by &lt;code&gt;sub x6, x5, x3&lt;/code&gt;&lt;/td&gt;&lt;td&gt;&lt;code&gt;sub&lt;/code&gt; reads a stale value of &lt;code&gt;x5&lt;/code&gt; from the register file.&lt;/td&gt;&lt;td&gt;Stalling, Forwarding (Bypassing).&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Control&lt;/td&gt;&lt;td&gt;The address of the next instruction is unknown due to a branch.&lt;/td&gt;&lt;td&gt;&lt;code&gt;beq x1, x2, L1&lt;/code&gt; followed by &lt;code&gt;add x3, x4, x5&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Processor fetches &lt;code&gt;add&lt;/code&gt; before knowing if the branch to &lt;code&gt;L1&lt;/code&gt; is taken.&lt;/td&gt;&lt;td&gt;Stall until branch resolves, Branch Prediction, Flush incorrect path.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Part IV: The Brains of the Operation - Dynamic Scheduling with Tomasulo&apos;s Algorithm&lt;/h2&gt;
&lt;p&gt;The limitations of in-order pipelining become severe in the presence of long-latency operations (like floating-point arithmetic or cache misses) and frequent data dependencies. Stalls can quickly dominate the execution time, leaving valuable functional units idle. To overcome this, high-performance processors employ &lt;strong&gt;dynamic scheduling&lt;/strong&gt;, a technique that allows instructions to execute out of their original program order. The seminal hardware algorithm for this is Tomasulo&apos;s algorithm, first implemented in the IBM System/360 Model 91.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Section 4.1: Beyond In-Order Execution&lt;/h3&gt;
&lt;p&gt;The core idea behind dynamic scheduling is to shift from a control-flow-driven execution model to a &lt;strong&gt;dataflow-driven&lt;/strong&gt; one. In a simple pipeline, an instruction executes when it reaches the front of the line. In a dynamically scheduled machine, an instruction is allowed to execute as soon as all of its required operands are available, regardless of its position in the original program sequence. This decoupling of instruction issue (fetching and decoding) from execution allows the processor to look ahead in the instruction stream, find independent instructions, and execute them while a prior, dependent instruction is stalled waiting for its data. This significantly increases the utilization of the processor&apos;s multiple execution units and improves overall performance.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Section 4.2: Core Components of the Tomasulo Machine&lt;/h3&gt;
&lt;p&gt;Tomasulo&apos;s algorithm achieves this dataflow execution through three key hardware components that work in concert.&lt;/p&gt;
&lt;h4&gt;Reservation Stations (RS)&lt;/h4&gt;
&lt;p&gt;Instead of a single pipeline, a Tomasulo-based processor has a set of functional units (e.g., one or more adders, multipliers, load/store units), each equipped with its own set of buffers called &lt;strong&gt;Reservation Stations (RS)&lt;/strong&gt;. When an instruction is decoded, it is issued to a free reservation station associated with the required functional unit. The RS acts as a waiting area, holding the instruction until it is ready to execute.&lt;/p&gt;
&lt;p&gt;Each entry in a reservation station contains the following fields:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Busy:&lt;/strong&gt; A bit indicating whether the station is in use.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Op:&lt;/strong&gt; The operation to be performed (e.g., &lt;code&gt;ADD&lt;/code&gt;, &lt;code&gt;MUL&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vj, Vk:&lt;/strong&gt; The actual values of the two source operands. These fields are filled if the operand values are already available in the register file when the instruction is issued.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Qj, Qk:&lt;/strong&gt; The source operand &lt;strong&gt;tags&lt;/strong&gt;. If an operand is not yet available because it is being produced by another instruction currently in-flight, these fields will hold a tag that identifies which reservation station will produce the required result. A value of zero or null in these fields indicates that the corresponding &lt;code&gt;V&lt;/code&gt; field holds a valid operand.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dest:&lt;/strong&gt; A tag identifying the destination of the result (in modern implementations, this is a pointer to a Reorder Buffer entry).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The RS continuously monitors for its required operands. Once both &lt;code&gt;Qj&lt;/code&gt; and &lt;code&gt;Qk&lt;/code&gt; are zero (meaning &lt;code&gt;Vj&lt;/code&gt; and &lt;code&gt;Vk&lt;/code&gt; are both valid), the instruction is ready to be dispatched to its functional unit for execution.&lt;/p&gt;
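&lt;p&gt;As a rough illustration of these fields (a minimal sketch with my own naming, not any particular textbook&apos;s), a reservation-station entry and its readiness check might be modeled like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from dataclasses import dataclass

@dataclass
class RSEntry:
    busy: bool = False
    op: str = &quot;&quot;              # e.g. &quot;ADD&quot;, &quot;MUL&quot;
    vj: float = 0.0             # operand values, valid only when the matching Q is None
    vk: float = 0.0
    qj: str | None = None       # tag of the RS that will produce Vj (None = Vj valid)
    qk: str | None = None       # tag of the RS that will produce Vk (None = Vk valid)
    dest: str = &quot;&quot;            # result tag (a ROB entry in modern designs)

    def ready(self):
        # Dispatch to the functional unit once both operands are values, not tags.
        return self.busy and self.qj is None and self.qk is None
&lt;/code&gt;&lt;/pre&gt;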
&lt;h4&gt;The Common Data Bus (CDB)&lt;/h4&gt;
&lt;p&gt;The &lt;strong&gt;Common Data Bus (CDB)&lt;/strong&gt; is a broadcast bus that connects the outputs of all functional units to the inputs of all reservation stations and the register file. When a functional unit finishes its computation, it does not just write the result to a register. Instead, it places both the computed &lt;strong&gt;value&lt;/strong&gt; and its unique &lt;strong&gt;tag&lt;/strong&gt; (the name of the reservation station that produced it) onto the CDB.&lt;/p&gt;
&lt;p&gt;All reservation stations are &quot;snooping&quot; (monitoring) the CDB in every cycle. If an RS sees a tag on the CDB that matches a tag in its &lt;code&gt;Qj&lt;/code&gt; or &lt;code&gt;Qk&lt;/code&gt; field, it knows its long-awaited operand is now available. It grabs the value from the CDB, places it into the corresponding &lt;code&gt;Vj&lt;/code&gt; or &lt;code&gt;Vk&lt;/code&gt; field, and clears the &lt;code&gt;Qj&lt;/code&gt; or &lt;code&gt;Qk&lt;/code&gt; field to zero. This mechanism allows results to be forwarded directly from producer to consumer without ever needing to pass through the register file, dramatically reducing stalls from RAW dependencies.&lt;/p&gt;
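&lt;p&gt;Continuing the sketch above, a CDB broadcast is then just a loop in which every busy station compares the broadcast tag against its &lt;code&gt;Q&lt;/code&gt; fields and captures the value on a match:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def broadcast(cdb_tag, cdb_value, stations):
    # Every reservation station snoops the CDB each cycle.
    for rs in stations:
        if not rs.busy:
            continue
        if rs.qj == cdb_tag:
            rs.vj, rs.qj = cdb_value, None   # operand arrives, tag is cleared
        if rs.qk == cdb_tag:
            rs.vk, rs.qk = cdb_value, None
&lt;/code&gt;&lt;/pre&gt;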
&lt;h4&gt;Hardware Register Renaming&lt;/h4&gt;
&lt;p&gt;Out-of-order execution introduces the possibility of WAR and WAW hazards, which were not a problem in the simple in-order pipeline. Tomasulo&apos;s algorithm elegantly eliminates these hazards through a mechanism called &lt;strong&gt;hardware register renaming&lt;/strong&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Why are WAR and WAW hazards a problem in out-of-order execution? You can think about it yourself, or read &lt;a href=&quot;#appendix-a&quot;&gt;Appendix A&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The key is to decouple the architectural registers (the names visible to the programmer, e.g., F0, F2, F4) from the physical storage locations (the reservation stations). A mapping table, often called the &lt;strong&gt;Register Alias Table (RAT)&lt;/strong&gt; or Register Result Status, maintains the current mapping. For each architectural register, this table stores the tag of the reservation station that will produce the next value for that register.&lt;/p&gt;
&lt;p&gt;The process works as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Issue:&lt;/strong&gt; When an instruction like &lt;code&gt;ADD.D F6, F8, F2&lt;/code&gt; is issued, the control logic looks up &lt;code&gt;F8&lt;/code&gt; and &lt;code&gt;F2&lt;/code&gt; in the RAT.
&lt;ul&gt;
&lt;li&gt;If the RAT entry for a source register is empty, the value is ready in the main register file. This value is copied to the &lt;code&gt;V&lt;/code&gt; field of the reservation station.&lt;/li&gt;
&lt;li&gt;If the RAT entry contains a tag (e.g., &lt;code&gt;Add1&lt;/code&gt;), it means another instruction is currently computing the value. This tag is copied into the &lt;code&gt;Q&lt;/code&gt; field of the new reservation station.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rename:&lt;/strong&gt; After reading the source tags, the logic updates the RAT entry for the destination register, &lt;code&gt;F6&lt;/code&gt;, with the tag of the newly allocated reservation station (e.g., &lt;code&gt;Add2&lt;/code&gt;). Now, any subsequent instruction that needs &lt;code&gt;F6&lt;/code&gt; will be directed to get its value from &lt;code&gt;Add2&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This renaming process breaks false dependencies. If a later instruction also writes to &lt;code&gt;F6&lt;/code&gt; (a WAW hazard), it will simply be allocated a new reservation station (&lt;code&gt;Add3&lt;/code&gt;), and the RAT will be updated to point to &lt;code&gt;Add3&lt;/code&gt;. The original &lt;code&gt;ADD.D&lt;/code&gt; instruction is unaffected because it is already linked to &lt;code&gt;Add2&lt;/code&gt;. Similarly, WAR hazards are eliminated because source operands either get their value immediately or are linked to a specific producer via a tag; a subsequent write to that source register will be renamed to a new physical location and will not affect the original value needed by the earlier instruction.&lt;/p&gt;
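&lt;p&gt;To make the issue/rename steps concrete, here is a small sketch in the same spirit as the earlier one, where the RAT is a plain dict from architectural register name to producer tag (absent means the register file holds the value); again, the names are mine:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def issue(rat, regfile, rs, dest, src1, src2, new_tag):
    # Step 1: read sources, each either a value or a producer tag.
    for field, src in ((&quot;j&quot;, src1), (&quot;k&quot;, src2)):
        tag = rat.get(src)
        if tag is None:
            setattr(rs, &quot;v&quot; + field, regfile[src])   # value is ready now
        else:
            setattr(rs, &quot;q&quot; + field, tag)              # wait for the producer
    # Step 2: rename; later readers of dest are directed to this station.
    rat[dest] = new_tag
    rs.busy, rs.dest = True, new_tag
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note how a second write to the same destination would simply install a newer tag in the table, which is exactly how the WAW case described above is broken.&lt;/p&gt;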
&lt;hr&gt;
&lt;h3&gt;Section 4.3: A Cycle-by-Cycle Walkthrough of Tomasulo&apos;s Algorithm&lt;/h3&gt;
&lt;p&gt;To solidify these concepts, a detailed, cycle-by-cycle trace of a sequence of dependent instructions is invaluable. This walkthrough will demonstrate the dynamic interplay between the reservation stations, the RAT, and the CDB.&lt;/p&gt;
&lt;h4&gt;Simulation Setup:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Functional Units:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;1 Integer Unit (for effective address calculation): 1 cycle latency.&lt;/li&gt;
&lt;li&gt;2 FP Adders (for &lt;code&gt;ADD.D&lt;/code&gt;, &lt;code&gt;SUB.D&lt;/code&gt;): 2 cycles latency.&lt;/li&gt;
&lt;li&gt;2 FP Multipliers (for &lt;code&gt;MUL.D&lt;/code&gt;): 10 cycles latency.&lt;/li&gt;
&lt;li&gt;1 FP Divider (for &lt;code&gt;DIV.D&lt;/code&gt;): 40 cycles latency.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Instruction Issue:&lt;/strong&gt; 1 instruction per cycle.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CDB:&lt;/strong&gt; 1 result can be broadcast per cycle.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reservation Stations:&lt;/strong&gt; 3 for Add/Sub, 2 for Mult/Div.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example Instruction Sequence:&lt;/h4&gt;
&lt;pre&gt;&lt;code&gt;L.D F6, 34(R2)
L.D F2, 45(R3)
MUL.D F0, F2, F4
SUB.D F8, F6, F2
DIV.D F10, F0, F6
ADD.D F6, F8, F2
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Initial State:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Register File:&lt;/strong&gt; R2=100, R3=200, F4=2.0. All other FP registers have some initial value.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory:&lt;/strong&gt; Mem[134]=10.0, Mem[245]=5.0.&lt;/li&gt;
&lt;li&gt;All Reservation Stations are empty.&lt;/li&gt;
&lt;li&gt;Register Result Status is empty (all values are in the Register File).&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h4&gt;Cycle 1:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Events:&lt;/strong&gt; &lt;code&gt;L.D F6, 34(R2)&lt;/code&gt; is issued.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Actions:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;A Load buffer (Load1) is allocated.&lt;/li&gt;
&lt;li&gt;The value of &lt;code&gt;R2&lt;/code&gt; (100) is read from the integer register file.&lt;/li&gt;
&lt;li&gt;The effective address is calculated immediately: $100 + 34 = 134$.&lt;/li&gt;
&lt;li&gt;The Register Result Status for &lt;code&gt;F6&lt;/code&gt; is updated to point to &lt;code&gt;Load1&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;| Instruction      | Issue | Execute | Write Result |
| :--------------- | :---- | :------ | :----------- |
| &lt;code&gt;L.D F6, 34(R2)&lt;/code&gt; | 1     |         |              |&lt;/p&gt;
&lt;p&gt;| Reservation Stations | Busy | Op   | Vj  | Vk  | Qj  | Qk  | Address |
| :------------------- | :--- | :--- | :-- | :-- | :-- | :-- | :------ |
| Load1                | Yes  | Load | 100 |     |     |     | 34      |
| Load2                | No   |      |     |     |     |     |         |&lt;/p&gt;
&lt;p&gt;| Register Result Status | F0  | F2  | F4  | F6    | F8  | F10 |
| :--------------------- | :-- | :-- | :-- | :---- | :-- | :-- |
| Qi                     |     |     |     | Load1 |     |     |&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CDB Activity:&lt;/strong&gt; None.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h4&gt;Cycle 2:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Events:&lt;/strong&gt; &lt;code&gt;L.D F2, 45(R3)&lt;/code&gt; is issued. &lt;code&gt;Load1&lt;/code&gt; begins memory access.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Actions:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Load2 buffer is allocated.&lt;/li&gt;
&lt;li&gt;Value of &lt;code&gt;R3&lt;/code&gt; (200) is read. Effective address $200 + 45 = 245$ is calculated.&lt;/li&gt;
&lt;li&gt;Register Result Status for &lt;code&gt;F2&lt;/code&gt; is updated to &lt;code&gt;Load2&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;| Instruction      | Issue | Execute | Write Result |
| :--------------- | :---- | :------ | :----------- |
| &lt;code&gt;L.D F6, 34(R2)&lt;/code&gt; | 1     | 2       |              |
| &lt;code&gt;L.D F2, 45(R3)&lt;/code&gt; | 2     |         |              |&lt;/p&gt;
&lt;p&gt;| Reservation Stations | Busy | Op   | Vj  | Vk  | Qj  | Qk  | Address |
| :------------------- | :--- | :--- | :-- | :-- | :-- | :-- | :------ |
| Load1                | Yes  | Load | 100 |     |     |     | 34      |
| Load2                | Yes  | Load | 200 |     |     |     | 45      |&lt;/p&gt;
&lt;p&gt;| Register Result Status | F0  | F2    | F4  | F6    | F8  | F10 |
| :--------------------- | :-- | :---- | :-- | :---- | :-- | :-- |
| Qi                     |     | Load2 |     | Load1 |     |     |&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CDB Activity:&lt;/strong&gt; None.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h4&gt;Cycle 3:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Events:&lt;/strong&gt; &lt;code&gt;MUL.D F0, F2, F4&lt;/code&gt; is issued. &lt;code&gt;Load1&lt;/code&gt; completes memory access. &lt;code&gt;Load2&lt;/code&gt; begins memory access.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Actions:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;A multiplier RS (Mult1) is allocated.&lt;/li&gt;
&lt;li&gt;RAT is checked for sources &lt;code&gt;F2&lt;/code&gt; and &lt;code&gt;F4&lt;/code&gt;. &lt;code&gt;F2&lt;/code&gt; is being produced by &lt;code&gt;Load2&lt;/code&gt;, so &lt;code&gt;Qj&lt;/code&gt; of &lt;code&gt;Mult1&lt;/code&gt; gets tag &lt;code&gt;Load2&lt;/code&gt;. &lt;code&gt;F4&lt;/code&gt; is ready in the register file, so its value (2.0) is copied to &lt;code&gt;Vk&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;RAT for destination &lt;code&gt;F0&lt;/code&gt; is updated to &lt;code&gt;Mult1&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;| Instruction        | Issue | Execute | Write Result |
| :----------------- | :---- | :------ | :----------- |
| &lt;code&gt;L.D F6, 34(R2)&lt;/code&gt;   | 1     | 2       | 3            |
| &lt;code&gt;L.D F2, 45(R3)&lt;/code&gt;   | 2     | 3       |              |
| &lt;code&gt;MUL.D F0, F2, F4&lt;/code&gt; | 3     |         |              |&lt;/p&gt;
&lt;p&gt;| Reservation Stations | Busy | Op    | Vj  | Vk  | Qj    | Qk  | Address |
| :------------------- | :--- | :---- | :-- | :-- | :---- | :-- | :------ |
| Load1                | Yes  | Load  | ... |     |       |     |         |
| Load2                | Yes  | Load  | ... |     |       |     |         |
| Mult1                | Yes  | MUL.D |     | 2.0 | Load2 |     |         |
| Add1                 | No   |       |     |     |       |     |         |&lt;/p&gt;
&lt;p&gt;| Register Result Status | F0    | F2    | F4  | F6    | F8  | F10 |
| :--------------------- | :---- | :---- | :-- | :---- | :-- | :-- |
| Qi                     | Mult1 | Load2 |     | Load1 |     |     |&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CDB Activity:&lt;/strong&gt; &lt;code&gt;Load1&lt;/code&gt; broadcasts result Mem[134] (value 10.0) with tag &lt;code&gt;Load1&lt;/code&gt;.
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Snooping:&lt;/strong&gt; No waiting RS needs &lt;code&gt;Load1&lt;/code&gt; yet. The value is written into &lt;code&gt;F6&lt;/code&gt; in the register file, and the RAT tag for &lt;code&gt;F6&lt;/code&gt; is cleared.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h4&gt;Cycle 4:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Events:&lt;/strong&gt; &lt;code&gt;SUB.D F8, F6, F2&lt;/code&gt; is issued. &lt;code&gt;Load2&lt;/code&gt; completes memory access.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Actions:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;An adder RS (Add1) is allocated.&lt;/li&gt;
&lt;li&gt;RAT is checked for &lt;code&gt;F6&lt;/code&gt; and &lt;code&gt;F2&lt;/code&gt;. &lt;code&gt;F6&lt;/code&gt; is now ready (value 10.0 from &lt;code&gt;Load1&lt;/code&gt;&apos;s broadcast), so &lt;code&gt;Vj&lt;/code&gt; gets 10.0. &lt;code&gt;F2&lt;/code&gt; is still being produced by &lt;code&gt;Load2&lt;/code&gt;, so &lt;code&gt;Qk&lt;/code&gt; gets tag &lt;code&gt;Load2&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;RAT for &lt;code&gt;F8&lt;/code&gt; is updated to &lt;code&gt;Add1&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;| Instruction        | Issue | Execute | Write Result |
| :----------------- | :---- | :------ | :----------- |
| &lt;code&gt;L.D F6, 34(R2)&lt;/code&gt;   | 1     | 2       | 3            |
| &lt;code&gt;L.D F2, 45(R3)&lt;/code&gt;   | 2     | 3       | 4            |
| &lt;code&gt;MUL.D F0, F2, F4&lt;/code&gt; | 3     |         |              |
| &lt;code&gt;SUB.D F8, F6, F2&lt;/code&gt; | 4     |         |              |&lt;/p&gt;
&lt;p&gt;| Reservation Stations | Busy | Op    | Vj   | Vk  | Qj    | Qk    |
| :------------------- | :--- | :---- | :--- | :-- | :---- | :---- |
| Mult1                | Yes  | MUL.D |      | 2.0 | Load2 |       |
| Add1                 | Yes  | SUB.D | 10.0 |     |       | Load2 |&lt;/p&gt;
&lt;p&gt;| Register Result Status | F0    | F2    | F4  | F6  | F8   | F10 |
| :--------------------- | :---- | :---- | :-- | :-- | :--- | :-- |
| Qi                     | Mult1 | Load2 |     |     | Add1 |     |&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CDB Activity:&lt;/strong&gt; &lt;code&gt;Load2&lt;/code&gt; broadcasts result Mem[245] (value 5.0) with tag &lt;code&gt;Load2&lt;/code&gt;.
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Snooping:&lt;/strong&gt; Both &lt;code&gt;Mult1&lt;/code&gt; and &lt;code&gt;Add1&lt;/code&gt; are waiting for &lt;code&gt;Load2&lt;/code&gt;. They both snoop the CDB, capture the value 5.0, and clear their &lt;code&gt;Q&lt;/code&gt; fields. &lt;code&gt;Mult1&lt;/code&gt;&apos;s &lt;code&gt;Vj&lt;/code&gt; becomes 5.0. &lt;code&gt;Add1&lt;/code&gt;&apos;s &lt;code&gt;Vk&lt;/code&gt; becomes 5.0.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h4&gt;Cycle 5:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Events:&lt;/strong&gt; &lt;code&gt;DIV.D F10, F0, F6&lt;/code&gt; is issued. Both &lt;code&gt;Mult1&lt;/code&gt; and &lt;code&gt;Add1&lt;/code&gt; are now ready to execute.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Actions:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;A divider RS (Div1) is allocated.&lt;/li&gt;
&lt;li&gt;RAT is checked for &lt;code&gt;F0&lt;/code&gt; and &lt;code&gt;F6&lt;/code&gt;. &lt;code&gt;F0&lt;/code&gt; is being produced by &lt;code&gt;Mult1&lt;/code&gt;. &lt;code&gt;F6&lt;/code&gt; is ready. &lt;code&gt;Div1&lt;/code&gt; gets tag &lt;code&gt;Mult1&lt;/code&gt; in &lt;code&gt;Qj&lt;/code&gt; and value 10.0 in &lt;code&gt;Vk&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;RAT for &lt;code&gt;F10&lt;/code&gt; is updated to &lt;code&gt;Div1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Mult1&lt;/code&gt; begins its 10-cycle execution (5.0 * 2.0).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Add1&lt;/code&gt; begins its 2-cycle execution (10.0 - 5.0). Note the out-of-order execution: &lt;code&gt;SUB.D&lt;/code&gt; starts before &lt;code&gt;MUL.D&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;| Instruction         | Issue | Execute | Write Result |
| :------------------ | :---- | :------ | :----------- |
| ...                 | ...   | ...     | ...          |
| &lt;code&gt;MUL.D F0, F2, F4&lt;/code&gt;  | 3     | 5       |              |
| &lt;code&gt;SUB.D F8, F6, F2&lt;/code&gt;  | 4     | 5       |              |
| &lt;code&gt;DIV.D F10, F0, F6&lt;/code&gt; | 5     |         |              |&lt;/p&gt;
&lt;p&gt;| Reservation Stations | Busy | Op    | Vj   | Vk   | Qj    | Qk  |
| :------------------- | :--- | :---- | :--- | :--- | :---- | :-- |
| Mult1                | Yes  | MUL.D | 5.0  | 2.0  |       |     |
| Add1                 | Yes  | SUB.D | 10.0 | 5.0  |       |     |
| Div1                 | Yes  | DIV.D |      | 10.0 | Mult1 |     |&lt;/p&gt;
&lt;p&gt;| Register Result Status | F0    | F2  | F4  | F6  | F8   | F10  |
| :--------------------- | :---- | :-- | :-- | :-- | :--- | :--- |
| Qi                     | Mult1 |     |     |     | Add1 | Div1 |&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CDB Activity:&lt;/strong&gt; None.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;...This process continues. The &lt;code&gt;SUB.D&lt;/code&gt; will finish in cycle 6 and broadcast its result. The &lt;code&gt;ADD.D&lt;/code&gt; (instruction 6) will issue and wait for results from &lt;code&gt;Add1&lt;/code&gt; and &lt;code&gt;Load2&lt;/code&gt;. The &lt;code&gt;MUL.D&lt;/code&gt; will finish in cycle 14 and broadcast, allowing the &lt;code&gt;DIV.D&lt;/code&gt; to start its long 40-cycle execution. This detailed trace reveals how the hardware dynamically resolves dependencies and executes instructions as soon as their data is ready, maximizing parallelism.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Section 4.4: Taming the Chaos with the Reorder Buffer (ROB)&lt;/h3&gt;
&lt;p&gt;While Tomasulo&apos;s algorithm is brilliant at extracting instruction-level parallelism, its out-of-order completion creates a significant problem: it makes handling exceptions and branch mispredictions incredibly difficult. If &lt;code&gt;SUB.D&lt;/code&gt; completes before the earlier &lt;code&gt;MUL.D&lt;/code&gt;, and the &lt;code&gt;MUL.D&lt;/code&gt; then raises an arithmetic exception, the machine state is inconsistent: the processor has already modified state (F8) from an instruction that is logically &lt;strong&gt;after&lt;/strong&gt; the faulting instruction. This is called an &lt;strong&gt;imprecise exception&lt;/strong&gt;, and it makes operating systems and recovery mechanisms nearly impossible to implement correctly.&lt;/p&gt;
&lt;p&gt;The solution is to add a new hardware structure, the &lt;strong&gt;Reorder Buffer (ROB)&lt;/strong&gt;, which extends the original algorithm to ensure that while instructions may &lt;strong&gt;execute&lt;/strong&gt; out of order, they &lt;strong&gt;commit&lt;/strong&gt; their results to the architectural state (the main register file and memory) in strict program order.&lt;/p&gt;
&lt;h4&gt;ROB Mechanism&lt;/h4&gt;
&lt;p&gt;The ROB is a circular buffer that operates on a First-In, First-Out (FIFO) basis. It bridges the gap between out-of-order execution completion and in-order architectural update.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Issue:&lt;/strong&gt; When an instruction is decoded, it is allocated an entry at the tail of the ROB. This ROB entry number becomes the instruction&apos;s new tag. The register renaming table (RAT) now points to ROB entries, not reservation stations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Execute:&lt;/strong&gt; Instructions are still sent to reservation stations and execute out-of-order as before.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Write Result:&lt;/strong&gt; When a functional unit completes, it broadcasts its result and its ROB tag on the CDB. The result is written into the corresponding entry in the &lt;strong&gt;ROB&lt;/strong&gt;, not the register file. The ROB entry is marked as &quot;ready&quot;. Any waiting reservation stations also snoop the CDB and grab the result.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Commit:&lt;/strong&gt; The processor examines the instruction at the &lt;strong&gt;head&lt;/strong&gt; of the ROB. If its entry is marked &quot;ready,&quot; the instruction is committed. This means its result is finally written from the ROB to the architectural register file or memory. The instruction is then removed from the ROB (the head pointer advances). If the instruction at the head is not yet ready, the commit stage stalls, and no subsequent instructions can be committed, thus enforcing in-order retirement.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each entry in the ROB typically contains these fields:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Busy:&lt;/strong&gt; Indicates if the entry is valid.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Instruction Type:&lt;/strong&gt; Specifies if it&apos;s a branch, store, or register operation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;State:&lt;/strong&gt; Tracks the instruction&apos;s progress (e.g., &lt;code&gt;Issue&lt;/code&gt;, &lt;code&gt;Execute&lt;/code&gt;, &lt;code&gt;WriteResult&lt;/code&gt;, &lt;code&gt;Commit&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Destination:&lt;/strong&gt; The architectural register number or memory address to be written.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Value:&lt;/strong&gt; The computed result, held here until commit.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ready:&lt;/strong&gt; A bit indicating the result is valid in the &lt;code&gt;Value&lt;/code&gt; field.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Exception:&lt;/strong&gt; Stores any exception information generated during execution.&lt;/li&gt;
&lt;/ul&gt;
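&lt;p&gt;As a minimal sketch of the commit stage built on those fields (made-up names, ignoring stores and branches), the key property is that only the head of the FIFO may update architectural state:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from collections import deque

def commit_one(rob, regfile):
    # Commit at most one instruction per cycle, strictly in program order.
    if rob and rob[0][&quot;ready&quot;]:
        entry = rob.popleft()                       # head of the ROB
        if entry[&quot;exception&quot;] is not None:
            raise RuntimeError(entry[&quot;exception&quot;])  # precise: oldest fault first
        regfile[entry[&quot;dest&quot;]] = entry[&quot;value&quot;]
    # If the head is not ready, nothing commits this cycle, even if
    # younger entries further back already hold finished results.

rob = deque([
    {&quot;ready&quot;: True,  &quot;exception&quot;: None, &quot;dest&quot;: &quot;F8&quot;, &quot;value&quot;: 5.0},
    {&quot;ready&quot;: False, &quot;exception&quot;: None, &quot;dest&quot;: &quot;F0&quot;, &quot;value&quot;: None},
])
regs = {}
commit_one(rob, regs)   # commits the ready entry at the head
commit_one(rob, regs)   # head not ready: stalls, commits nothing
&lt;/code&gt;&lt;/pre&gt;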
&lt;hr&gt;
&lt;h3&gt;Section 4.5: Achieving Precise Exceptions and Speculation&lt;/h3&gt;
&lt;p&gt;The addition of the ROB is the key that unlocks two of the most powerful features of modern high-performance processors: precise exceptions and speculative execution.&lt;/p&gt;
&lt;h4&gt;Precise Exceptions&lt;/h4&gt;
&lt;p&gt;The ROB provides a simple and elegant mechanism for handling exceptions precisely. When an instruction (e.g., a &lt;code&gt;DIV.D&lt;/code&gt; by zero) encounters an exception during its execution, the exception is not acted upon immediately. Instead, the exception status is simply recorded in the &lt;code&gt;Exception&lt;/code&gt; field of the instruction&apos;s entry in the ROB. The processor continues to execute and complete other instructions out of order. The exception is only handled when the faulting instruction reaches the head of the ROB and is ready to be committed. At that point, the processor knows the exception is not speculative and is the next one to occur in the program&apos;s sequential order. It can then flush the entire pipeline and ROB, save a precise state, and jump to the operating system&apos;s exception handler.&lt;/p&gt;
&lt;h4&gt;Branch Speculation&lt;/h4&gt;
&lt;p&gt;The ROB is also the enabler of efficient &lt;strong&gt;branch speculation&lt;/strong&gt;. When the processor encounters a branch, a branch predictor guesses the outcome. The processor then &lt;strong&gt;speculatively&lt;/strong&gt; fetches, issues, and executes instructions from the predicted path, filling the ROB with these speculative instructions.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;If the prediction is correct:&lt;/strong&gt; The branch instruction eventually reaches the head of the ROB and is committed. The speculative instructions that follow it then commit normally as they reach the head. No time was lost.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;If the prediction is incorrect:&lt;/strong&gt; When the branch instruction is finally executed and the misprediction is discovered, the processor performs a recovery. It flushes all speculative instructions from the pipeline, reservation stations, and the ROB (in the simplest scheme, where recovery waits until the mispredicted branch reaches the head of the ROB, this amounts to clearing the ROB by resetting its tail pointer to its head pointer). No architectural state was corrupted because none of the speculative instructions were ever committed. The processor then begins fetching from the correct path. (A small sketch of this recovery follows below.)&lt;/li&gt;
&lt;/ul&gt;
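&lt;p&gt;In that simplified recovery scheme, squashing the speculative work is nothing more than discarding every ROB entry younger than the branch; none of them ever touched architectural state, so there is nothing to undo. A sketch, reusing the FIFO from above:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def recover_from_mispredict(rob, branch_pos):
    # Drop everything issued after the mispredicted branch (the tail of
    # the FIFO). No undo is needed: speculative entries never committed.
    while len(rob) &gt; branch_pos + 1:
        rob.pop()   # retract the youngest entry
&lt;/code&gt;&lt;/pre&gt;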
&lt;p&gt;The combination of a Tomasulo-style dataflow execution core with a Reorder Buffer for in-order commit forms the foundation of virtually all modern high-performance, out-of-order (OOO) processors. This two-part architecture elegantly solves the problems of data dependencies, false dependencies, and imprecise state, allowing for a massive increase in instruction-level parallelism.&lt;/p&gt;
&lt;h2&gt;Appendix A&lt;/h2&gt;
&lt;h3&gt;WAR Hazard&lt;/h3&gt;
&lt;p&gt;A WAR hazard, or &quot;anti-dependence,&quot; happens when an instruction wants to write to a register before an earlier instruction has finished reading that register&apos;s original value.&lt;/p&gt;
&lt;p&gt;Here is a simple example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Instruction 1 w/ long latency
1.  FMUL.D  F2, F4, F6   // Multiplies F4 and F6, result goes to F2

# Instruction 2 w/ short latency, independent operands
2.  FADD.D  F4, F8, F10  // Adds F8 and F10, result goes to F4
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A Tomasulo processor&apos;s goal is to maximize performance by executing instructions as soon as their operands are ready.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Instruction 1 (FMUL.D) is issued. Let&apos;s say it&apos;s a long operation that will take 10 cycles. It needs to read F4.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Instruction 2 (FADD.D) is issued right after. The processor sees its source operands (F8 and F10) are ready and that the addition functional unit is free. It&apos;s a short 2-cycle operation.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The processor executes FADD.D immediately, without waiting for the FMUL.D to finish.&lt;/p&gt;
&lt;p&gt;So here is the &lt;strong&gt;WAR&lt;/strong&gt; hazard: The FADD.D finishes in 2 cycles and wants to write its result to register F4. But the FMUL.D instruction hasn&apos;t even started its long execution yet and still needs the original value from F4! If the FADD.D were allowed to write to the actual architectural register F4, it would corrupt the input for the FMUL.D, leading to an incorrect program result.&lt;/p&gt;
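&lt;p&gt;Renaming (Section 4.2) avoids this precisely because the FMUL.D&apos;s reservation station captured its operands at issue time. A toy illustration of that value capture (the values here are hypothetical):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# At issue time the FMUL.D station copies F4&apos;s current value into Vj
# (or records a producer tag), so a later write to F4 cannot corrupt it.
regfile = {&quot;F4&quot;: 3.0, &quot;F6&quot;: 7.0, &quot;F8&quot;: 1.0, &quot;F10&quot;: 2.0}
mul_station = {&quot;vj&quot;: regfile[&quot;F4&quot;], &quot;vk&quot;: regfile[&quot;F6&quot;]}  # snapshot: 3.0, 7.0
regfile[&quot;F4&quot;] = regfile[&quot;F8&quot;] + regfile[&quot;F10&quot;]   # FADD.D writes F4 &quot;early&quot;
print(mul_station[&quot;vj&quot;])   # still 3.0: the multiply&apos;s input is safe
&lt;/code&gt;&lt;/pre&gt;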
&lt;h3&gt;WAW Hazard&lt;/h3&gt;
&lt;p&gt;A WAW hazard, or &quot;output dependence,&quot; happens when two different instructions want to write to the same destination register, and the instruction that came later in the program finishes execution first.&lt;/p&gt;
&lt;p&gt;Here is a simple example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;1.  FMUL.D  F2, F4, F6    // Writes to F2

2.  FADD.D  F2, F8, F10   // Also writes to F2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The correct final value in F2 should be the result of the FADD.D instruction, since it comes later in the program.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The FADD.D (Instruction 2) is short and finishes in 2 cycles. It&apos;s ready to write its result.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The FMUL.D (Instruction 1) is long and finishes 10 cycles later.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So here is the &lt;strong&gt;WAW&lt;/strong&gt; hazard: If the FADD.D writes its result to F2, and then 8 cycles later the FMUL.D also writes its result to F2, the final value in the register will be from the FMUL.D. This is incorrect! The result from the instruction that was supposed to happen first has overwritten the result from the instruction that was supposed to happen last.&lt;/p&gt;</content:encoded></item><item><title>Enabling KVM GPU Passthrough</title><link>https://20051110.xyz/blog/gpu-passthrough</link><guid isPermaLink="true">https://20051110.xyz/blog/gpu-passthrough</guid><description>How to enable GPU passthrough for KVM on Linux</description><pubDate>Sun, 27 Apr 2025 10:55:00 GMT</pubDate><content:encoded>&lt;h2&gt;Credits&lt;/h2&gt;
&lt;p&gt;In this article, the &quot;Enabling IOMMU&quot; and the &quot;GPU Passthrough&quot; sections are adapted from &lt;a href=&quot;https://drakeor.com/2022/02/16/kvm-gpu-passthrough-tutorial/&quot;&gt;Drakeor&apos;s Blog&lt;/a&gt; with some clarifications and modifications. The original article is very well written and I highly recommend reading it.&lt;/p&gt;
&lt;p&gt;If this article is helpful, make sure to check out Drakeor&apos;s blog and support him. Thanks to Drakeor for the great work!&lt;/p&gt;
&lt;h2&gt;Enabling IOMMU&lt;/h2&gt;
&lt;h3&gt;Setup&lt;/h3&gt;
&lt;p&gt;In my setup, I have a host machine with an NVIDIA GeForce RTX 4090 GPU and a guest machine running Ubuntu 24.04 Server for AI training. The host runs Ubuntu 24.04 LTS with the 6.11 kernel and also has an integrated Intel UHD Graphics 770, which drives the host display; the NVIDIA GPU is passed through to the guest machine.&lt;/p&gt;
&lt;p&gt;The host machine has the following hardware:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;CPU: Intel Core i9-14900K&lt;/li&gt;
&lt;li&gt;Motherboard: Gigabyte Z790 AORUS XTREME&lt;/li&gt;
&lt;li&gt;GPU: ZOTAC GeForce RTX 4090&lt;/li&gt;
&lt;li&gt;RAM: 64GB DDR5&lt;/li&gt;
&lt;li&gt;Storage: 2TB NVMe SSD&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Enabling IOMMU is a crucial step for GPU passthrough: it provides the DMA remapping and device isolation that let a PCI device be handed directly to a guest VM. It takes two steps to enable IOMMU: enabling it in the BIOS and enabling it in Linux.&lt;/p&gt;
&lt;h3&gt;BIOS Settings&lt;/h3&gt;
&lt;p&gt;This tutorial assumes that you have IOMMU support on both your motherboard and CPU. Most modern server motherboards should support it, but your mileage may vary with desktop motherboards. Here are the BIOS options corresponding to IOMMU-related features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Intel Based: Enable &quot;Intel VT-d&quot;. May also be called &quot;Intel Virtualization Technology&quot; or simply &quot;VT-d&quot; on some motherboards.&lt;/li&gt;
&lt;li&gt;AMD Based: Enable &quot;SVM&quot;. May also be called &quot;AMD Virtualization&quot; or simply &quot;AMD-V&quot;.
Note: I&apos;ve seen &quot;IOMMU&quot; as its own separate option on one of my motherboards, but not on any of my other motherboards. Make sure it&apos;s enabled if you do see it. If you don&apos;t see it, it&apos;s likely rolled into one of the VT-d or AMD-V options listed above.&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;Some modern computers may have IOMMU enabled by default, so you may first verify whether it is enabled or not. If you are not sure, you can check the BIOS settings.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4&gt;Checking for IOMMU Support on your CPU&lt;/h4&gt;
&lt;p&gt;On Ubuntu/Debian for my Intel processor, it&apos;s as easy as this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cat /proc/cpuinfo | grep --color vmx
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you see &lt;code&gt;vmx&lt;/code&gt; highlighted in the output, your CPU has the required virtualization support; if you see nothing, it does not. (Strictly speaking, &lt;code&gt;vmx&lt;/code&gt;/&lt;code&gt;svm&lt;/code&gt; indicate the VT-x/AMD-V virtualization extensions; VT-d/AMD-Vi IOMMU support is a separate platform feature, so check your CPU and motherboard documentation if you are unsure.)&lt;/p&gt;
&lt;p&gt;The AMD equivalent is this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cat /proc/cpuinfo | grep --color svm
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There is one other BIOS setting that I recommend changing before you move on to the next section.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Make sure the &lt;strong&gt;Primary GPU is set to integrated and not using your passthrough graphics card&lt;/strong&gt;. This is called &quot;Boot GPU&quot; and &quot;Primary Graphics&quot; in my BIOS. Also remember to plug your monitor into the integrated graphics port on your motherboard. This is important because the host machine will use the integrated graphics for display and the passthrough graphics card will be used by the guest machine.&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;It is also worth noting that some motherboards have settings called &quot;Above 4G Decoding&quot; and &quot;Resizable BAR Support&quot;. These are not the same as IOMMU; they are used for PCIe devices that require more than 4GB of address space. They are not required for IOMMU to work, but enabling them is recommended if you have a GPU with more than 4GB of VRAM.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Once you&apos;ve enabled the above settings, save and exit the BIOS. This is a one-time operation. You will not need to do this again unless you reset your BIOS settings.&lt;/p&gt;
&lt;h3&gt;Linux GRUB Settings&lt;/h3&gt;
&lt;p&gt;Add the following options to your GRUB_CMDLINE_LINUX option in the &lt;code&gt;/etc/default/grub&lt;/code&gt; file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo nano /etc/default/grub
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For Intel CPUs, add the following options:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;GRUB_CMDLINE_LINUX=&quot;... intel_iommu=on iommu=pt video=efifb:off&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;...&lt;/code&gt; in the lines above stands for your existing options; make sure to keep them.&lt;/p&gt;
&lt;p&gt;For AMD CPUs, add the following options:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;GRUB_CMDLINE_LINUX=&quot;... amd_iommu=on iommu=pt video=efifb:off&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And then update GRUB:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo grub-mkconfig -o /boot/grub/grub.cfg
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Make sure to reboot your system.&lt;/p&gt;
&lt;p&gt;Then, to check that IOMMU is enabled, run the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo dmesg | grep -i -e DMAR -e IOMMU
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You should see at least a message or two about it loading like below:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;Feb 10 17:55:23.119993 opaleye kernel: pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
Feb 10 17:55:23.123622 opaleye kernel: pci 0000:00:00.2: AMD-Vi: Found IOMMU cap 0x40
Feb 10 17:55:23.123691 opaleye kernel: perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank)
Feb 10 17:55:23.124108 opaleye kernel: AMD-Vi: AMD IOMMUv2 loaded and initialized
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;GPU Passthrough&lt;/h2&gt;
&lt;h3&gt;Find IOMMU Groups&lt;/h3&gt;
&lt;p&gt;Before looking at the IOMMU Groups, I want to make sure that my graphics card is visible to the OS. I run the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;lspci -nnk | grep VGA
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For me, this results in 2 graphics controllers being shown:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;00:02.0 VGA compatible controller [0300]: Intel Corporation Raptor Lake-S GT1 [UHD Graphics 770] [8086:a780] (rev 04)
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation AD102 [GeForce RTX 4090] [10de:2684] (rev a1)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first one is the integrated graphics card and the second one is the NVIDIA GPU. To list all the IOMMU groups they are part of, I&apos;ll run the following command (TheUnknownThing notes: I&apos;ve modified the command because the original one from drakeor&apos;s blog was not working for me):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;for d in /sys/kernel/iommu_groups/*/devices/*; do
  n=${d#*/iommu_groups/*}; n=${n%%/*}
  printf &apos;IOMMU Group %s &apos; &quot;$n&quot;
  lspci -nns &quot;${d##*/}&quot;
done | sort -V
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;https://20051110.xyz/_astro/IOMMU.B9WXifC2_ZI6v4s.webp&quot; alt=&quot;IOMMU Groups&quot;&gt;&lt;/p&gt;
&lt;p&gt;As is shown in the figure, my RTX 4090 is in IOMMU group 12.&lt;/p&gt;
&lt;h3&gt;Loading the Correct Kernel Modules&lt;/h3&gt;
&lt;p&gt;Okay, so now that we have IOMMU all set, we need to make sure to load the correct modules for our passthrough graphics card. By default, nouveau will try to grab the graphics card when we boot.&lt;/p&gt;
&lt;p&gt;I created a new file called &lt;code&gt;/etc/modprobe.d/vfio.conf&lt;/code&gt; and added the following lines:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;blacklist nouveau
options vfio_pci ids=10de:2684,10de:22ba
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;https://20051110.xyz/_astro/ID.CzJ56Oe7_226Akw.webp&quot; alt=&quot;ID&quot;&gt;&lt;/p&gt;
&lt;p&gt;Note that I got the IDs from the IOMMU Group above. I need to pass in EVERY device in that IOMMU group or it won&apos;t work! Even though I&apos;m not using audio, I still need to pass in the audio device in that group.&lt;/p&gt;
&lt;p&gt;Side note: why do we need to blacklist nouveau? Because otherwise it would grab the graphics card at boot, and we want vfio-pci to claim it instead.&lt;/p&gt;
&lt;p&gt;In &lt;code&gt;/etc/modules-load.d/modules.conf&lt;/code&gt;, we&apos;ll ensure vfio_pci is loaded at boot:&lt;/p&gt;
&lt;p&gt;Add &lt;code&gt;vfio_pci&lt;/code&gt; to the file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;echo &quot;vfio_pci&quot; | sudo tee -a /etc/modules-load.d/modules.conf
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Now reboot your system.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Now run the following to make sure the correct module is being used:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;lspci -nnk
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Make sure &lt;code&gt;vfio-pci&lt;/code&gt; is shown in the &quot;Kernel driver in use&quot; line for your graphics card.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20051110.xyz/_astro/VFIO.--QXy68v_1e9DcJ.webp&quot; alt=&quot;VFIO&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Passing the GPU to the Guest VM&lt;/h3&gt;
&lt;p&gt;If you haven&apos;t installed the virt-manager or created your VM yet, please move on to the &lt;a href=&quot;#creating-a-vm&quot;&gt;Creating a VM&lt;/a&gt; section.&lt;/p&gt;
&lt;p&gt;So recall that the PCI address is on the left side of the &lt;code&gt;lspci&lt;/code&gt; output from earlier (run &lt;code&gt;lspci -Dnn&lt;/code&gt; to include the PCI domain prefix):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;0000:01:00.0 VGA compatible controller [0300]: NVIDIA Corporation AD102 [GeForce RTX 4090] [10de:2684] (rev a1)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We want to take that value (0000:01:00.0) and convert all the colons and dots into underscores. So for 0000:01:00.0, it will be 0000_01_00_0.&lt;/p&gt;
&lt;p&gt;Now we need to detach the PCI device from the host machine. We can do this with the following virsh command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;virsh nodedev-detach pci_0000_01_00_0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then we&apos;ll edit the VM we want to attach the GPU to with the following virsh command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;virsh edit &amp;#x3C;vm_name&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Under the devices tag, we&apos;ll add the GPU. Note that address, bus, slot, and function matches the PCI address we saw earlier. You could add the following to wherever you want in the devices section, but I like to put it at the end.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-xml&quot;&gt;..
&amp;#x3C;devices&gt;
...
    &amp;#x3C;hostdev mode=&apos;subsystem&apos; type=&apos;pci&apos; managed=&apos;yes&apos;&gt;
        &amp;#x3C;driver name=&apos;vfio&apos;/&gt;
        &amp;#x3C;source&gt;
        &amp;#x3C;address domain=&apos;0x0000&apos; bus=&apos;0x01&apos; slot=&apos;0x00&apos; function=&apos;0x0&apos;/&gt;
        &amp;#x3C;/source&gt;
    &amp;#x3C;/hostdev&gt;
...
&amp;#x3C;/devices&gt;
...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now save the file and reboot your VM, and you should see the NVIDIA GPU in the VM. Remember to install the NVIDIA drivers in the guest machine. For a quick test, I will run the following command in the guest machine:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo apt update
sudo ubuntu-drivers autoinstall
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And run the following command to check if the NVIDIA drivers are installed correctly:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo nvidia-smi
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Creating a VM&lt;/h2&gt;
&lt;h3&gt;Prerequisites: Check Hardware Virtualization Support&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;KVM requires hardware virtualization extensions (Intel VT-x or AMD-V) to be enabled in your system&apos;s BIOS/UEFI.  As we discussed earlier, I&apos;ll assume you have this enabled.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Check if the KVM modules are loaded (after installation step below):&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;lsmod | grep kvm
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You should see &lt;code&gt;kvm_intel&lt;/code&gt; or &lt;code&gt;kvm_amd&lt;/code&gt; listed.&lt;/p&gt;
&lt;h3&gt;Install Libvirt&lt;/h3&gt;
&lt;p&gt;Ensure your package list is up-to-date:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo apt update
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You&apos;ll need the Libvirt daemon, the QEMU/KVM hypervisor, and management tools.&lt;/p&gt;
&lt;p&gt;The Libvirt package installation includes several components:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;qemu-kvm&lt;/code&gt;: The KVM hypervisor backend.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;libvirt-daemon-system&lt;/code&gt;: The main Libvirt daemon that runs as a system service.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;libvirt-clients&lt;/code&gt;: Command-line tools for managing Libvirt (like &lt;code&gt;virsh&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;bridge-utils&lt;/code&gt;: Utilities for creating and managing network bridges (often needed for VM networking).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;virtinst&lt;/code&gt;: Tools to create virtual machines (like &lt;code&gt;virt-install&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;virt-manager&lt;/code&gt;: (Optional, but Recommended) A graphical user interface for managing VMs.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo apt install qemu-kvm libvirt-daemon-system libvirt-clients bridge-utils virtinst virt-manager
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command installs all the essential components, including the graphical &lt;code&gt;virt-manager&lt;/code&gt;. If you are setting up a headless server, you can omit &lt;code&gt;virt-manager&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Add Your User to the &lt;code&gt;libvirt&lt;/code&gt; Group&lt;/h3&gt;
&lt;p&gt;By default, only the &lt;code&gt;root&lt;/code&gt; user can manage system-wide Libvirt virtual machines. To allow your regular user account to manage VMs without using &lt;code&gt;sudo&lt;/code&gt; for every command, add it to the &lt;code&gt;libvirt&lt;/code&gt; group.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo adduser &amp;#x3C;your_username&gt; libvirt
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Replace &lt;code&gt;&amp;#x3C;your_username&gt;&lt;/code&gt; with your actual username.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; You need to &lt;strong&gt;log out and log back in&lt;/strong&gt; for this group change to take effect. Alternatively, you can activate the group membership for your current shell session using &lt;code&gt;newgrp libvirt&lt;/code&gt; (but logging out/in is generally recommended).&lt;/p&gt;
&lt;h3&gt;Verify the Installation&lt;/h3&gt;
&lt;p&gt;Check the Libvirt daemon status by executing the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo systemctl status libvirtd
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It should show as &lt;code&gt;active (running)&lt;/code&gt;. If not, try starting and enabling it:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo systemctl start libvirtd
sudo systemctl enable libvirtd
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And check Libvirt connection (as your user, after logging back in):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;virsh list --all
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command should run without errors (even if it shows an empty list of VMs). If you get a permission error, double-check that you&apos;ve logged out and back in after adding your user to the &lt;code&gt;libvirt&lt;/code&gt; group.&lt;/p&gt;
&lt;h3&gt;Create a Virtual Machine&lt;/h3&gt;
&lt;p&gt;First, download the ISO image for the OS you want to install. For this tutorial, I will use Ubuntu 24.04 Server. You can download it from the &lt;a href=&quot;https://ubuntu.com/download/server&quot;&gt;official Ubuntu website&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I recommend using the &lt;code&gt;virt-manager&lt;/code&gt; GUI for creating and managing VMs, as it simplifies the process significantly. If you prefer command-line tools, you can use &lt;code&gt;virt-install&lt;/code&gt; instead; for this tutorial I will use &lt;code&gt;virt-manager&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Launching &lt;code&gt;virt-manager&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;To launch &lt;code&gt;virt-manager&lt;/code&gt;, run the following command in your terminal:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;virt-manager
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will open the graphical interface for managing virtual machines. And the experience is quite straightforward, so I won&apos;t go into detail here. Just follow the prompts to create a new VM.&lt;/p&gt;
&lt;h2&gt;Accessing VM through &lt;code&gt;virsh console&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;virsh console&lt;/code&gt; command connects you to a &lt;em&gt;serial console&lt;/em&gt; device that libvirt exposes to the virtual machine. For this to work bidirectionally (input and output), two things need to be properly configured:&lt;/p&gt;
&lt;h3&gt;Virsh console&lt;/h3&gt;
&lt;p&gt;In the Virtual Machine&apos;s Libvirt XML, it needs to have a &lt;code&gt;&amp;#x3C;console type=&apos;pty&apos;&gt;&lt;/code&gt; or similar device defined, connected to a serial port (like &lt;code&gt;target port=&apos;0&apos;&lt;/code&gt;). You can double-check this by running &lt;code&gt;virsh dumpxml ubuntu24.04&lt;/code&gt; and looking within the &lt;code&gt;&amp;#x3C;devices&gt;&lt;/code&gt; section for a &lt;code&gt;&amp;#x3C;console&gt;&lt;/code&gt; or &lt;code&gt;&amp;#x3C;serial&gt;&lt;/code&gt; entry.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-xml&quot;&gt;&amp;#x3C;serial type=&apos;pty&apos;&gt;
  &amp;#x3C;source path=&apos;/dev/pts/3&apos;/&gt;
  &amp;#x3C;target type=&apos;isa-serial&apos; port=&apos;0&apos;&gt;
    &amp;#x3C;model name=&apos;isa-serial&apos;/&gt;
  &amp;#x3C;/target&gt;
  &amp;#x3C;alias name=&apos;serial0&apos;/&gt;
&amp;#x3C;/serial&gt;
&amp;#x3C;console type=&apos;pty&apos; tty=&apos;/dev/pts/3&apos;&gt;
  &amp;#x3C;source path=&apos;/dev/pts/3&apos;/&gt;
  &amp;#x3C;target type=&apos;serial&apos; port=&apos;0&apos;/&gt;
  &amp;#x3C;alias name=&apos;serial0&apos;/&gt;
&amp;#x3C;/console&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If this is missing, you&apos;ll need to add it using &lt;code&gt;virsh edit ubuntu24.04&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Inside the Guest VM&lt;/h3&gt;
&lt;p&gt;Edit the GRUB configuration:&lt;/p&gt;
&lt;p&gt;Open the GRUB default file in a text editor:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo nano /etc/default/grub
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Find the line that starts with &lt;code&gt;GRUB_CMDLINE_LINUX_DEFAULT&lt;/code&gt;. It might look something like:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;GRUB_CMDLINE_LINUX_DEFAULT=&quot;quiet splash&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You need to add console redirection parameters: &lt;code&gt;console=tty0 console=ttyS0,115200&lt;/code&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;console=tty0&lt;/code&gt;: Ensures output also goes to the primary virtual console (if you still have one, which you likely do for initial setup).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;console=ttyS0,115200&lt;/code&gt;: Directs kernel and boot messages to the first serial port (&lt;code&gt;ttyS0&lt;/code&gt;) at a baud rate of 115200. This corresponds to the &lt;code&gt;port=&apos;0&apos;&lt;/code&gt; in the libvirt XML.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The line should become something like:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;GRUB_CMDLINE_LINUX_DEFAULT=&quot;quiet splash console=tty0 console=ttyS0,115200&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you already have other parameters in this line, just add the &lt;code&gt;console=...&lt;/code&gt; parts inside the quotes, separated by spaces.&lt;/p&gt;
&lt;p&gt;Enable a Serial Getty Service:&lt;/p&gt;
&lt;p&gt;Ubuntu uses &lt;code&gt;systemd&lt;/code&gt; to manage services. You need to enable the service that provides a login prompt on the serial port.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo systemctl enable serial-getty@ttyS0.service
sudo systemctl start serial-getty@ttyS0.service
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;enable&lt;/code&gt; command ensures it starts on boot, and &lt;code&gt;start&lt;/code&gt; attempts to start it immediately.&lt;/p&gt;
&lt;p&gt;After editing the GRUB configuration file, you must update the GRUB bootloader:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo update-grub
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Also update Initramfs (necessary for console changes to take full effect early in boot):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo update-initramfs -u
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And remember to reboot your VM, and it should now be accessible via the &lt;code&gt;virsh console&lt;/code&gt; command.&lt;/p&gt;</content:encoded></item><item><title>CS188 Notes 4 - Reinforcement Learning</title><link>https://20051110.xyz/blog/cs188-notes-4</link><guid isPermaLink="true">https://20051110.xyz/blog/cs188-notes-4</guid><description>Notes from UC Berkeley&apos;s CS188 course on Artificial Intelligence.</description><pubDate>Sun, 20 Apr 2025 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Note:&lt;/h2&gt;
&lt;p&gt;You could view previous notes on &lt;a href=&quot;/blog/cs188-notes-3&quot;&gt;CS188: Lecture 9 - Markov Decision Processes (MDPs)&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Also note that my notes are based on the &lt;strong&gt;Spring 2025&lt;/strong&gt; version of the course, and my understanding of the material. So they MAY NOT be 100% accurate or complete. Also, THIS IS NOT A SUBSTITUTE FOR THE COURSE MATERIAL. I would only take notes on parts of the lecture that I find interesting or confusing. I will NOT be taking notes on every single detail of the lecture.&lt;/p&gt;
&lt;h2&gt;Reinforcement Learning&lt;/h2&gt;
&lt;p&gt;In this note I will go through the key concepts in the Reinforcement Learning (RL) lecture. I will also try to clarify my understanding of the Q-learning algorithm, which is a key concept in RL.&lt;/p&gt;
&lt;p&gt;First let&apos;s categorize the topics. I&apos;ll use the same categories as in the lecture slides, adding some of my own notes.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Passive Learning
&lt;ul&gt;
&lt;li&gt;Model-based&lt;/li&gt;
&lt;li&gt;Model-free&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Active Learning&lt;/li&gt;
&lt;li&gt;Approximate Q-learning&lt;/li&gt;
&lt;li&gt;Policy Gradient&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;ONE SENTENCE SUMMARY:
Passive learning involves evaluating a fixed policy (the agent just follows actions chosen for it), while active learning seeks to improve the policy through exploration (the agent chooses its own actions); model-based methods learn and use environment models, model-free methods learn directly from experience, approximate Q-learning generalizes learning to large state spaces, and policy gradient methods optimize policies directly using gradient ascent.&lt;/p&gt;
&lt;p&gt;I believe this is a good summary of the key concepts in RL. I will go through each of these categories in detail below. Also, I will use the structure of &quot;HOW? -&gt; WHY? -&gt; PROBLEM&quot; to explain each concept.&lt;/p&gt;
&lt;h2&gt;Passive RL&lt;/h2&gt;
&lt;h3&gt;Model-based&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;How?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The agent learns a model of the environment (e.g., transition probabilities, rewards) and uses this model to evaluate the policy. This is done by estimating the expected value of each action in each state based on the model.&lt;/p&gt;
&lt;p&gt;Then solve for the values as if the learned model were correct. (Trust the model.)&lt;/p&gt;
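&lt;p&gt;As a minimal sketch of that counting idea (the naming is mine, assuming experience is given as (s, a, r, s&apos;) tuples):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from collections import defaultdict

def estimate_model(transitions):
    &quot;&quot;&quot;Estimate T-hat and R-hat from observed (s, a, r, s2) samples.&quot;&quot;&quot;
    counts = defaultdict(lambda: defaultdict(int))   # (s, a)  =&gt;  s2  =&gt;  count
    rewards = defaultdict(list)                      # (s, a, s2)  =&gt;  rewards seen
    for s, a, r, s2 in transitions:
        counts[(s, a)][s2] += 1
        rewards[(s, a, s2)].append(r)
    T = {sa: {s2: c / sum(nxt.values()) for s2, c in nxt.items()}
         for sa, nxt in counts.items()}              # normalize counts
    R = {sas: sum(rs) / len(rs) for sas, rs in rewards.items()}  # average rewards
    return T, R   # then evaluate the policy as if (T, R) were the true MDP
&lt;/code&gt;&lt;/pre&gt;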
&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Answering &quot;why&quot; in this section is basically answering &quot;why do we need a model?&quot; The answer is that we do not have a model of the environment, so we need to learn it. This is done by estimating the transition probabilities and rewards based on the observed data.
This is a key concept in RL, as it allows the agent to learn from its experiences and improve its policy over time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The problem with this approach is that it requires a lot of data to learn the model accurately. If the model is not accurate, the agent may make suboptimal decisions based on the learned model.&lt;/p&gt;
&lt;h3&gt;Model-free&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;How?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In this case there is no model to tell us &quot;what to do&quot;. We need to learn the value function directly from the data.&lt;/p&gt;
&lt;p&gt;The simplest thought is to &lt;strong&gt;average together observed sample values&lt;/strong&gt;. Every time you visit a state, write down what the sum of discounted rewards turned out to be, and average it out. What&apos;s bad about this is that it does not take the connections between states into account. For example, consider the chain A -&gt; B -&gt; C (end). How do we calculate $V$ for states $A$ and $B$? We would evaluate every single starting state separately: when evaluating A, we would NOT take the previous evaluation of B into account; we only care about the final outcome and average it. This is not a good idea, because we are wasting a lot of data. We could use the data from state B to help us evaluate state A. So we need to take the connections between states into account.&lt;/p&gt;
&lt;p&gt;So an evolution of this is to use the &lt;strong&gt;Bellman equation&lt;/strong&gt;. The idea is to use the value of the next state to help us evaluate the current state, with a Bellman equation similar to the one we used in the MDP lecture. However, it needs some modifications.&lt;/p&gt;
&lt;p&gt;The ORIGINAL Bellman equation is:&lt;/p&gt;
&lt;p&gt;$$
V(s) = \sum_{s&apos;} T(s, a, s&apos;)[R(s, a, s&apos;) + \gamma V(s&apos;)]
$$&lt;/p&gt;
&lt;p&gt;And its ADAPTED version is:&lt;/p&gt;
&lt;p&gt;$$
V(s) = \frac{1}{n}\sum_{s&apos;} \mathrm{sample}_{s&apos;} \quad \text{where} \ \mathrm{sample}_{s&apos;} = R(s, a, s&apos;) + \gamma V(s&apos;)
$$&lt;/p&gt;
&lt;p&gt;What&apos;s improved over the naive version is that we are utilizing the existing estimates when evaluating. However, there is still a problem:
we are waiting until the end of an episode to update values, since we are averaging over all samples. We could &lt;strong&gt;update values more frequently&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;So this is where the &lt;strong&gt;Temporal Difference (TD) Learning&lt;/strong&gt; comes in. The idea is to update the value of the current state based on the value of the next state, without waiting for the end of the episode.
Because updates happen after every transition, states and transitions that are experienced more frequently will have a greater influence on the learned values over time.&lt;/p&gt;
&lt;p&gt;The specific type of TD learning shown here is for &lt;strong&gt;policy evaluation&lt;/strong&gt;. This means we have a &lt;em&gt;fixed policy&lt;/em&gt; &lt;code&gt;π&lt;/code&gt; (a fixed way of choosing actions in each state), and we want to figure out the value function &lt;code&gt;Vπ(s)&lt;/code&gt; for that policy. We are &lt;em&gt;not&lt;/em&gt; trying to find the &lt;em&gt;best&lt;/em&gt; policy yet, just evaluating the current one.&lt;/p&gt;
&lt;p&gt;In TD, we have samples, and the update rule.&lt;/p&gt;
&lt;p&gt;$$
\mathrm{sample} = R(s, \pi(s), s&apos;) + \gamma V^\pi(s&apos;)
$$&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;sample&lt;/code&gt; (or TD Target) is: &quot;the reward I just got, plus the discounted value of where I landed (according to my current beliefs)&quot;.&lt;/p&gt;
&lt;p&gt;The update rule is:&lt;/p&gt;
&lt;p&gt;$$
V^\pi(s) \leftarrow (1 - \alpha)V^\pi(s) + \alpha \ \mathrm{sample}
$$&lt;/p&gt;
&lt;p&gt;We calculate the &lt;strong&gt;TD Error&lt;/strong&gt;: &lt;code&gt;sample - Vπ(s)&lt;/code&gt;. This error represents the difference between our target (&lt;code&gt;sample&lt;/code&gt;) and our current estimate (&lt;code&gt;Vπ(s)&lt;/code&gt;). We then adjust our current estimate &lt;code&gt;Vπ(s)&lt;/code&gt; by moving it a small step (&lt;code&gt;α&lt;/code&gt;) in the direction of that error.&lt;/p&gt;
&lt;p&gt;This shows that TD learning is essentially maintaining a running average of the TD targets it observes for each state.
It gradually &quot;forgets&quot; older, potentially less accurate information, which is desirable because the initial value estimates might be far off.
Using a learning rate &lt;code&gt;α&lt;/code&gt; that decreases over time can help the value estimates converge more stably.&lt;/p&gt;
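&lt;p&gt;Here is a minimal sketch of this update in code (my own illustration; the transition &lt;code&gt;(s, r, s_next)&lt;/code&gt; is assumed to come from following the fixed policy):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from collections import defaultdict

alpha, gamma = 0.1, 0.9   # learning rate and discount
V = defaultdict(float)    # V^pi(s), initialized to 0.0

def td_update(s, r, s_next):
    # One TD(0) update after observing (s, r, s&apos;) under the fixed policy.
    sample = r + gamma * V[s_next]  # the TD target
    td_error = sample - V[s]        # target minus current estimate
    V[s] += alpha * td_error        # move a small step toward the target

# Example: we were in A, received reward -1, and landed in B.
td_update(&quot;A&quot;, -1.0, &quot;B&quot;)
&lt;/code&gt;&lt;/pre&gt;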
&lt;p&gt;However, there are still problems. As mentioned in the previous lecture, what really GUIDES the agent is the $Q$-values. So we need to learn $Q$-values instead of $V$-values.&lt;/p&gt;
&lt;h2&gt;Q-Learning (Active RL)&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;How?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Q-Learning is a model-free reinforcement learning algorithm used to learn the optimal action-value function (Q-values). Unlike TD learning which focuses on state values, Q-learning focuses on (state, action) pairs.&lt;/p&gt;
&lt;p&gt;The Q-learning update rule is:&lt;/p&gt;
&lt;p&gt;$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a&apos;} Q(s&apos;, a&apos;) - Q(s, a) \right]
$$&lt;/p&gt;
&lt;p&gt;Similar to above, where: $Q(s, a)$ is the current estimate of the Q-value for state $s$ and action $a$, $\alpha$ is the learning rate, and the term $r + \gamma \max_{a&apos;} Q(s&apos;, a&apos;) - Q(s, a)$ is the TD error. You might wonder &quot;why do we need to use the max operator here?&quot; The answer is that we are trying to learn the optimal Q-value for each state-action pair. The max operator allows us to select the best action in the next state $s&apos;$ based on the current Q-values.&lt;/p&gt;
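&lt;p&gt;A minimal sketch of the tabular update (my own illustration; the action set is an assumption):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from collections import defaultdict

alpha, gamma = 0.1, 0.9
ACTIONS = [&quot;up&quot;, &quot;down&quot;, &quot;left&quot;, &quot;right&quot;]  # assumed action set
Q = defaultdict(float)                     # Q[(s, a)], initialized to 0.0

def q_update(s, a, r, s_next):
    # max over a&apos; of Q(s&apos;, a&apos;): the best we believe we can do from s&apos;
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
&lt;/code&gt;&lt;/pre&gt;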
&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Q-learning allows us to select the best action in each state, unlike TD learning which only evaluates a fixed policy. It&apos;s called &quot;off-policy&quot; because it learns the optimal policy regardless of how the agent is currently behaving (exploration). The agent can follow any exploratory policy during training while still learning the greedy optimal policy.&lt;/p&gt;
&lt;p&gt;With Q-values, we can derive our policy directly:&lt;/p&gt;
&lt;p&gt;$$
\pi(s) = \arg\max_a Q(s, a)
$$&lt;/p&gt;
&lt;p&gt;This means choosing the action that maximizes the expected future rewards for each state.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The main challenge in Q-learning is balancing exploration and exploitation, i.e., balancing &quot;trying new actions to discover potentially better rewards&quot; against &quot;using known Q-values to maximize rewards based on past experience&quot;.&lt;/p&gt;
&lt;p&gt;This is typically addressed using an &lt;strong&gt;$\epsilon$-greedy policy&lt;/strong&gt; or &lt;strong&gt;exploration functions&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;The $\epsilon$-greedy policy works as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;With probability $1-\epsilon$, choose the best action (exploit)&lt;/li&gt;
&lt;li&gt;With probability $\epsilon$, choose a random action (explore)&lt;/li&gt;
&lt;li&gt;Gradually decrease $\epsilon$ over time to favor exploitation as learning progresses&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And the exploration function works as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Define an &quot;exploration bonus&quot; based on the uncertainty of Q-values. Let $n(s, a)$ be the number of times action $a$ has been taken in state $s$. The exploration bonus can be defined as $\frac{1}{n(s, a)}$.&lt;/li&gt;
&lt;li&gt;When choosing actions, add the exploration bonus to the Q-value: $Q(s, a) + \frac{1}{n(s, a)}$.&lt;/li&gt;
&lt;li&gt;This encourages the agent to explore less frequently visited actions, balancing exploration and exploitation.&lt;/li&gt;
&lt;li&gt;Gradually decrease the exploration bonus over time to favor exploitation as learning progresses&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This approach can be more efficient than $\epsilon$-greedy, as it focuses exploration on less certain actions rather than uniformly random actions. So it IS used in practice.&lt;/p&gt;
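&lt;p&gt;To make both strategies concrete, here is a minimal sketch (my own illustration; I add 1 to the visit count so we never divide by zero):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import random
from collections import defaultdict

ACTIONS = [&quot;up&quot;, &quot;down&quot;, &quot;left&quot;, &quot;right&quot;]  # assumed action set
Q = defaultdict(float)   # learned Q-values
N = defaultdict(int)     # visit counts n(s, a)
epsilon = 0.1

def epsilon_greedy(s):
    # With probability epsilon explore randomly, otherwise exploit.
    if random.random() &amp;#x3C; epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def exploration_function(s):
    # Optimistic value: Q plus a bonus that shrinks as (s, a) is visited more.
    return max(ACTIONS, key=lambda a: Q[(s, a)] + 1.0 / (N[(s, a)] + 1))
&lt;/code&gt;&lt;/pre&gt;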
&lt;h3&gt;Experience Replay&lt;/h3&gt;
&lt;p&gt;Experience replay is an optimization technique used in reinforcement learning, particularly in deep Q-learning, so I&apos;ll add it as a subtopic here.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Experience replay enhances Q-learning by storing the agent&apos;s experiences (transitions) in a replay buffer. Instead of updating Q-values using only the most recent experience, the agent &lt;strong&gt;stores the recent experience to buffer, and randomly samples batches of past experiences&lt;/strong&gt; from this buffer for training.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In plain Q-learning, the agent learns from its experiences sequentially. Consecutive experiences are often similar and correlated, which makes learning inefficient. By using experience replay, the agent can break these correlations and learn from a more diverse set of experiences: random sampling produces more independent training examples.&lt;/p&gt;
&lt;p&gt;This is especially important in deep reinforcement learning where neural networks are used to approximate Q-values.&lt;/p&gt;
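&lt;p&gt;A minimal replay-buffer sketch (my own illustration, reusing the &lt;code&gt;q_update&lt;/code&gt; sketch from above):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import random
from collections import deque

buffer = deque(maxlen=10000)  # old transitions fall off the end

def store(s, a, r, s_next):
    buffer.append((s, a, r, s_next))

def replay(batch_size=32):
    # Uniform random sampling breaks the correlation between consecutive steps.
    for s, a, r, s_next in random.sample(buffer, batch_size):
        q_update(s, a, r, s_next)
&lt;/code&gt;&lt;/pre&gt;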
&lt;p&gt;Even with all the problems addressed above, we still cannot put the Q-learning algorithm into practice on real problems. The issue is that the state space is too large: we cannot store a Q-value for every single state-action pair. So we need function approximation to generalize across similar states, and this is where &lt;strong&gt;Approximate Q-Learning&lt;/strong&gt; comes in.&lt;/p&gt;
&lt;h2&gt;Hold on a second&lt;/h2&gt;
&lt;p&gt;But before that, I do believe I need to clarify some points here.&lt;/p&gt;
&lt;p&gt;You might think: &quot;Why is Q-learning discussed under active learning? Q-learning could be used in passive learning, while TD could also be used in active learning, is that correct?&quot;&lt;/p&gt;
&lt;p&gt;Yes, you are right. In CS188 (and many RL courses), the algorithms are typically presented in this order:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;TD Learning is introduced first as a way to learn value functions for passive learning.&lt;/li&gt;
&lt;li&gt;Q-Learning is introduced next as a way to extend these ideas to active learning.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This pedagogical approach sometimes creates the impression that these algorithms are strictly tied to their respective learning categories, but they&apos;re more flexible than that.&lt;/p&gt;
&lt;p&gt;The main difference is that TD learning (as typically presented) learns state values V(s) while Q-learning learns state-action values Q(s,a). Q-values naturally lend themselves to policy improvement (just take argmax), which is why Q-learning is often presented in the active learning context.&lt;/p&gt;
&lt;h2&gt;Approximate Q-Learning&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;How?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In environments with large or continuous state spaces, it&apos;s impractical to maintain a separate Q-value for each state-action pair. Approximate Q-learning uses function approximation to generalize across similar states.&lt;/p&gt;
&lt;p&gt;A simple solution is to recall the &quot;feature function&quot; we discussed in the game tree lecture. We describe a state using a vector of features (properties) $f_1, f_2, \ldots, f_n$ and learn a linear function of these features:&lt;/p&gt;
&lt;p&gt;$$
Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + \ldots + w_n f_n(s, a)
$$&lt;/p&gt;
&lt;p&gt;Where $w_1, w_2, \ldots, w_n$ are weights that we learn through experience.
This is a linear function approximation. We can also use non-linear function approximators like neural networks, but the basic idea is the same: learn a function that maps states (and actions) to Q-values.&lt;/p&gt;
&lt;p&gt;And you might wonder: &quot;How do we learn the weights?&quot; The answer is that we can use the same Q-learning update rule, but instead of updating the Q-value directly, we update the weights. The trick is a simple notion: &quot;if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state&apos;s features&quot;.&lt;/p&gt;
&lt;p&gt;So the update rule becomes:&lt;/p&gt;
&lt;p&gt;$$
w_i \leftarrow w_i + \alpha \left[ r + \gamma \max_{a&apos;} Q(s&apos;, a&apos;) - Q(s, a) \right] f_i(s, a)
$$&lt;/p&gt;
&lt;p&gt;Where $f_i(s, a)$ is the value of the $i$-th feature for state $s$ and action $a$. This means we are updating the weights based on the features that were present in the current state-action pair.&lt;/p&gt;
&lt;p&gt;The update rule of $Q$ is still the same:&lt;/p&gt;
&lt;p&gt;$$
Q(s, a) \leftarrow Q(s, a) + \alpha \ \mathrm{Difference}
$$&lt;/p&gt;
&lt;p&gt;Where $\mathrm{Difference} = r + \gamma \max_{a&apos;} Q(s&apos;, a&apos;) - Q(s, a)$&lt;/p&gt;
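&lt;p&gt;In code, the weight update is a one-liner per feature (my own sketch; the feature values are made up):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;alpha, gamma = 0.01, 0.9

def q_value(w, f):
    # Linear approximation: Q(s, a) = sum_i w_i * f_i(s, a)
    return sum(wi * fi for wi, fi in zip(w, f))

def approx_q_update(w, f_sa, r, best_next_q):
    # difference = [r + gamma * max_a&apos; Q(s&apos;, a&apos;)] - Q(s, a)
    difference = r + gamma * best_next_q - q_value(w, f_sa)
    # Each weight moves in proportion to how &quot;on&quot; its feature was.
    return [wi + alpha * difference * fi for wi, fi in zip(w, f_sa)]

w = [0.0, 0.0]
w = approx_q_update(w, f_sa=[1.0, 0.5], r=-1.0, best_next_q=0.0)
&lt;/code&gt;&lt;/pre&gt;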
&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Approximate Q-learning allows reinforcement learning to scale to complex environments with huge state spaces (like Atari games, robotics, etc.) where tabular methods would be impossible.&lt;/p&gt;
&lt;p&gt;It enables generalization across similar states, so learning in one state can improve performance in similar states, even those the agent hasn&apos;t encountered yet.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Approximate Q-learning faces these two challenges:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Forgetting - learning in one region of the state space can undo learning in another region&lt;/li&gt;
&lt;li&gt;Feature selection - choosing the right representation for states is critical for good generalization&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Policy Gradient Methods&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;How?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Instead of learning a value function and deriving a policy from it, policy gradient methods directly parameterize the policy itself. That is, the agent&apos;s behavior is described by a function $\pi(a|s; \theta)$, where $\theta$ are the parameters (often the weights of a neural network). The goal is to adjust $\theta$ so that the expected return (the sum of rewards) is maximized.&lt;/p&gt;
&lt;p&gt;The core idea is to use gradient ascent: we estimate how changing the parameters would affect the expected return, and then nudge the parameters in that direction. The update rule looks like this:&lt;/p&gt;
&lt;p&gt;$$
\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)
$$&lt;/p&gt;
&lt;p&gt;where $J(\theta)$ is the expected return under the current policy.&lt;/p&gt;
&lt;p&gt;But how do we compute this gradient? The answer is the &lt;strong&gt;policy gradient theorem&lt;/strong&gt;, which tells us that the gradient of the expected return can be estimated using samples from the environment:&lt;/p&gt;
&lt;p&gt;$$
\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot G \right]
$$&lt;/p&gt;
&lt;p&gt;Here, $G$ is the return (sum of discounted rewards) following the action $a$ in state $s$. In practice, we run episodes, collect rewards, and use these samples to estimate the gradient.&lt;/p&gt;
&lt;p&gt;This approach is called &lt;strong&gt;REINFORCE&lt;/strong&gt;, the simplest policy gradient algorithm. Each time the agent takes an action, it computes the gradient of the log-probability of that action, multiplies it by the return, and uses that as the update direction.&lt;/p&gt;
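&lt;p&gt;Here is a minimal REINFORCE sketch for a tabular softmax policy (my own illustration; for a softmax policy, $\nabla_\theta \log \pi_\theta(a|s)$ works out to $1\{b = a\} - \pi(b|s)$ for each parameter $\theta_{s,b}$):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import math
from collections import defaultdict

ACTIONS = [&quot;left&quot;, &quot;right&quot;]  # assumed action set
theta = defaultdict(float)   # policy parameters theta[(s, a)]
alpha, gamma = 0.01, 0.99

def pi(s):
    # Softmax policy: pi(a|s) is proportional to exp(theta[s, a]).
    exps = {a: math.exp(theta[(s, a)]) for a in ACTIONS}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def reinforce_update(episode):
    # episode is a list of (s, a, r); G is the return following each step.
    G = 0.0
    for s, a, r in reversed(episode):
        G = r + gamma * G
        probs = pi(s)
        for b in ACTIONS:
            # grad of log pi(a|s) w.r.t. theta[(s, b)]
            grad = (1.0 if b == a else 0.0) - probs[b]
            theta[(s, b)] += alpha * G * grad
&lt;/code&gt;&lt;/pre&gt;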
&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Policy gradient methods are powerful for several reasons. First, they allow us to optimize the policy directly, which is what we ultimately care about. This is especially useful in environments with continuous or high-dimensional action spaces, where value-based methods struggle. Policy gradients can also learn stochastic policies, which can be optimal in environments with inherent randomness or partial observability.&lt;/p&gt;
&lt;p&gt;Another advantage is that policy gradient methods can be combined with function approximation (e.g., neural networks) to handle very large or continuous state spaces. This is the foundation of modern deep reinforcement learning algorithms.&lt;/p&gt;</content:encoded><h:img src="undefined"/><enclosure url="undefined"/></item><item><title>CS188 Notes 2 - Markov Decision Processes (MDPs)</title><link>https://20051110.xyz/blog/cs188-notes-2</link><guid isPermaLink="true">https://20051110.xyz/blog/cs188-notes-2</guid><description>Notes from UC Berkeley&apos;s CS188 course on Artificial Intelligence.</description><pubDate>Sat, 19 Apr 2025 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Note:&lt;/h2&gt;
&lt;p&gt;You could view previous notes on &lt;a href=&quot;/blog/cs188-notes-1&quot;&gt;CS188: Lecture 4 - Constraint Satisfaction Problems (CSPs)&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Also note that my notes are based on the &lt;strong&gt;Spring 2025&lt;/strong&gt; version of the course, and my understanding of the material. So they MAY NOT be 100% accurate or complete. Also, THIS IS NOT A SUBSTITUTE FOR THE COURSE MATERIAL. I would only take notes on parts of the lecture that I find interesting or confusing. I will NOT be taking notes on every single detail of the lecture.&lt;/p&gt;
&lt;h2&gt;CS188: Lecture 8 - Markov Decision Processes (MDPs)&lt;/h2&gt;
&lt;h3&gt;Markov Decision Processes (MDPs)&lt;/h3&gt;
&lt;p&gt;A Markov Decision Process (MDP) represents sequential decision-making in environments where actions produce stochastic (random) outcomes, and an agent&apos;s goal is to maximize its cumulative reward over time. In an MDP, the agent faces uncertainty: it cannot always predict the result of its actions, but it must still try to act optimally.&lt;/p&gt;
&lt;p&gt;The key components of an MDP are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;States&lt;/strong&gt; $S$: Possible situations the agent can find itself in.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Actions&lt;/strong&gt; $A$: The set of possible moves or decisions the agent can make in each state.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Transition Function&lt;/strong&gt; $T(s, a, s&apos;)$: The probability that action $a$ in state $s$ leads to state $s&apos;$.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reward Function&lt;/strong&gt; $R(s, a, s&apos;)$: The reward received after transitioning from $s$ to $s&apos;$ via action $a$.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Discount Factor&lt;/strong&gt; $\gamma$: How much the agent values future rewards compared to immediate rewards.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We obtain the value function $V(s)$ for each &lt;strong&gt;state&lt;/strong&gt; $s$, which represents the expected cumulative reward starting from state $s$ and following the optimal policy thereafter, and the action-value function $Q(s, a)$ for each &lt;strong&gt;state-action pair&lt;/strong&gt; $(s,a)$, which represents the expected cumulative reward starting from state $s$, taking action $a$, and then following the optimal policy thereafter.&lt;/p&gt;
&lt;p&gt;You might think &quot;why not just use the value function $V(s)$?&quot; The reason is actions are easier to select from $Q$-values than values! You will see this in the following part of this lecture.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;goal&lt;/strong&gt; is to find an optimal &lt;strong&gt;policy&lt;/strong&gt; $\pi^*$, which is a mapping from states to actions ($\pi(s) = a$), maximizing the expected cumulative (usually discounted) reward from any state. In this sense, an MDP defines both the &quot;game rules&quot; and what it means to &quot;play well&quot; in that environment.&lt;/p&gt;
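&lt;p&gt;As a concrete picture, here is one way to write an MDP down as data (my own sketch, loosely based on the lecture&apos;s racing-car example; all the numbers are made up):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# T[(s, a)] is a list of (probability, s_next, reward) outcomes.
T = {
    (&quot;cool&quot;, &quot;fast&quot;): [(0.5, &quot;cool&quot;, 2.0), (0.5, &quot;warm&quot;, 2.0)],
    (&quot;cool&quot;, &quot;slow&quot;): [(1.0, &quot;cool&quot;, 1.0)],
    (&quot;warm&quot;, &quot;slow&quot;): [(0.5, &quot;cool&quot;, 1.0), (0.5, &quot;warm&quot;, 1.0)],
    (&quot;warm&quot;, &quot;fast&quot;): [(1.0, &quot;overheated&quot;, -10.0)],
}
gamma = 0.9

def q_value(V, s, a):
    # Q(s, a) = sum over s&apos; of T(s, a, s&apos;) * [R(s, a, s&apos;) + gamma * V(s&apos;)]
    return sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in T[(s, a)])
&lt;/code&gt;&lt;/pre&gt;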
&lt;h4&gt;Stationary Preferences&lt;/h4&gt;
&lt;p&gt;The assumption of &lt;strong&gt;stationary preferences&lt;/strong&gt; means that your relative preference between two future sequences of rewards doesn&apos;t change just because you receive the same immediate reward before both. This property imposes a recursive structure on the utility function for reward sequences.&lt;/p&gt;
&lt;p&gt;Formally, the utility $U$ of a sequence $[r_0, r_1, r_2, ...]$ must satisfy:&lt;/p&gt;
&lt;p&gt;$$
U([r_0, r_1, r_2, ...]) = f(r_0, U([r_1, r_2, ...]))
$$&lt;/p&gt;
&lt;p&gt;where $f$ is some consistent function. If we assume $f$ is linear, this recursion unrolls to only two possible forms for the utility function (after appropriate normalization):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Additive Utility:&lt;/strong&gt; $U = r_0 + r_1 + r_2 + \cdots$ (corresponds to $\gamma = 1$)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Discounted Utility:&lt;/strong&gt; $U = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$ (where $0 \leq \gamma &amp;#x3C; 1$)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The discounted utility is the standard in MDPs, as it ensures convergence for infinite horizons and reflects the diminishing importance of rewards further in the future.&lt;/p&gt;
&lt;h4&gt;Why MDPs?&lt;/h4&gt;
&lt;p&gt;MDPs are particularly suitable when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The environment is &lt;strong&gt;stochastic&lt;/strong&gt;: the same action in the same state can yield different results.&lt;/li&gt;
&lt;li&gt;Rewards may be &lt;strong&gt;delayed&lt;/strong&gt;: the value of an action now may be realized only after multiple future steps.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Unlike simple search algorithms (e.g., greedy or expectimax), MDPs explicitly model both uncertainty (via $T$) and the accumulation of rewards over time (via $R$ and $\gamma$). While solving an MDP requires knowledge of $T$ and $R$, reinforcement learning (RL) methods learn optimal policies directly from experience, using the MDP framework as a theoretical foundation.&lt;/p&gt;
&lt;h4&gt;MDPs vs Expectimax&lt;/h4&gt;
&lt;p&gt;Both MDPs and expectimax handle uncertainty and aim for maximum expected utility. Expectimax, however, is typically used to compute the expected value of actions from a specific starting point, often with a tree structure and a finite horizon. MDPs, in contrast, compute a &lt;strong&gt;policy&lt;/strong&gt;—the best action for every possible state—naturally handling cycles and infinite (discounted) horizons.&lt;/p&gt;
&lt;p&gt;In short: expectimax is a limited lookahead from the current state; solving an MDP finds a full strategy for all states.&lt;/p&gt;
&lt;h4&gt;MDPs and Multi-Agent Games&lt;/h4&gt;
&lt;p&gt;Standard MDPs are designed for a &lt;strong&gt;single agent&lt;/strong&gt; interacting with a stochastic environment. They do not directly accommodate multiple strategic agents whose actions affect each other&apos;s outcomes. Multi-agent situations typically require other formalisms, such as stochastic games or Markov games.&lt;/p&gt;
&lt;h4&gt;MDPs vs Greedy Search&lt;/h4&gt;
&lt;p&gt;Greedy algorithms make decisions based solely on immediate rewards, without considering long-term consequences. MDPs, by calculating the expected sum of (possibly discounted) future rewards, are inherently long-sighted. Optimizing for the value function $V^*(s)$ or the action-value function $Q^*(s,a)$, MDPs look ahead through the space of future possibilities, not just the next step.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;ONE SENTENCE SUMMARY:&lt;/strong&gt;&lt;br&gt;
Markov Decision Processes are mathematical models for sequential decision-making under uncertainty, aiming to find policies that maximize expected (possibly discounted) cumulative reward, and forming the theoretical foundation for reinforcement learning.&lt;/p&gt;</content:encoded><h:img src="undefined"/><enclosure url="undefined"/></item><item><title>CS188 Notes 3 - Markov Decision Processes (MDPs) II</title><link>https://20051110.xyz/blog/cs188-notes-3</link><guid isPermaLink="true">https://20051110.xyz/blog/cs188-notes-3</guid><description>Notes from UC Berkeley&apos;s CS188 course on Artificial Intelligence.</description><pubDate>Sat, 19 Apr 2025 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Note:&lt;/h2&gt;
&lt;p&gt;You could view previous notes on &lt;a href=&quot;/blog/cs188-notes-2&quot;&gt;CS188: Lecture 8 - Markov Decision Processes (MDPs)&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Also note that my notes are based on the &lt;strong&gt;Spring 2025&lt;/strong&gt; version of the course, and my understanding of the material. So they MAY NOT be 100% accurate or complete. Also, THIS IS NOT A SUBSTITUTE FOR THE COURSE MATERIAL. I would only take notes on parts of the lecture that I find interesting or confusing. I will NOT be taking notes on every single detail of the lecture.&lt;/p&gt;
&lt;h2&gt;Markov Decision Processes (MDPs)&lt;/h2&gt;
&lt;p&gt;After the previous lecture, I realized I had some misunderstandings about the Policy Iteration algorithm, especially when compared to Value Iteration. So here, I&apos;ll clarify my understanding of these two core approaches for solving MDPs.&lt;/p&gt;
&lt;h3&gt;Why use a &quot;fixed policy&quot; in Policy Iteration?&lt;/h3&gt;
&lt;p&gt;It can be confusing at first that Policy Iteration evaluates a fixed policy. You might ask: does using a fixed, possibly non-optimal policy ever lead to the optimal one?&lt;/p&gt;
&lt;p&gt;The answer is that evaluating a fixed policy is an essential &lt;em&gt;intermediate&lt;/em&gt; step towards finding the optimal policy. We might &quot;evaluate&quot; a policy that is not optimal, but doing so yields valuable information about the expected future rewards of that policy, so what we finally act on is the optimal policy.&lt;/p&gt;
&lt;p&gt;In Policy Iteration, we loop between two key phases:&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Step 1: Policy Evaluation&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;We begin with an initial policy $\pi$ (random, greedy, whatever). For this $\pi$, we compute the exact utility $V^{\pi}(s)$ for each state $s$ under the assumption that we &lt;em&gt;always&lt;/em&gt; follow $\pi$. The Bellman equation for this is:&lt;/p&gt;
&lt;p&gt;$$
V^{\pi}(s) = \sum_{s&apos;} T(s, \pi(s), s&apos;) [ R(s, \pi(s), s&apos;) + \gamma V^{\pi}(s&apos;) ]
$$&lt;/p&gt;
&lt;p&gt;This evaluates the policy&apos;s long-term value at every state, given that policy.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Step 2: Policy Improvement&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Now that we have $V^{\pi}$, we look at each state $s$ and ask: &quot;Is there an action $a$ that would improve my expected future rewards if I took it immediately, then continued with $\pi$?&quot;&lt;/p&gt;
&lt;p&gt;For each state, we consider:&lt;/p&gt;
&lt;p&gt;$$
Q^{\pi}(s, a) = \sum_{s&apos;} T(s, a, s&apos;) [ R(s, a, s&apos;) + \gamma V^{\pi}(s&apos;) ]
$$&lt;/p&gt;
&lt;p&gt;We then build a new policy by setting:&lt;/p&gt;
&lt;p&gt;$$
\pi_{\text{new}}(s) = \arg\max_a Q^{\pi}(s, a)
$$&lt;/p&gt;
&lt;p&gt;That is, for each state, choose the action that looks best based on the values under the old policy. This is the &lt;em&gt;policy improvement&lt;/em&gt; step.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Repeat:&lt;/strong&gt; We now re-evaluate the new policy $\pi_{\text{new}}$, and the process continues until the policy stops changing. This guarantees convergence to the optimal policy $\pi^*$ and optimal value function $V^*$. Evaluating a fixed policy at each stage is essential for knowing both how good our current strategy is and how to improve it.&lt;/p&gt;
&lt;hr&gt;
&lt;h2&gt;What is the difference between Policy Iteration and Value Iteration?&lt;/h2&gt;
&lt;p&gt;In short:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Value Iteration&lt;/strong&gt; is always searching for the best action at each step, directly refining the estimate of the &lt;em&gt;optimal&lt;/em&gt; value function.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Policy Evaluation&lt;/strong&gt; (as used in Policy Iteration) simply calculates the consequences of following a &lt;em&gt;predefined&lt;/em&gt; plan $\pi$, without improvement during evaluation itself. Policy improvement occurs as a separate step.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let&apos;s break down the differences in detail.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Value Iteration Equation:&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;$$
V_{k+1}(s) = \max_{a} \sum_{s&apos;} T(s, a, s&apos;) [ R(s, a, s&apos;) + \gamma V_{k}(s&apos;) ]
$$&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Directly compute the optimal value function $V^*(s)$.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How:&lt;/strong&gt; Each iteration, for each state $s$, considers all possible actions $a$. For each action, it calculates the expected value (reward + discounted future value), then takes the maximum over all actions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Policy:&lt;/strong&gt; Implicit. The $\max$ operation is finding the best action, and the final optimal policy $\pi^*$ is extracted after $V_k$ converges.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What it computes:&lt;/strong&gt; Iteratively refines the best possible long-term value from each state.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;Policy Evaluation Equation (for a fixed policy $\pi$):&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;$$
V^{\pi}_{k+1}(s) = \sum_{s&apos;} T(s, \pi(s), s&apos;) [ R(s, \pi(s), s&apos;) + \gamma V^{\pi}_k(s&apos;) ]
$$&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Compute the value function $V^\pi(s)$ for the given, fixed policy $\pi$ (which may not be optimal).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How:&lt;/strong&gt; Each iteration, for each state $s$, uses only the action prescribed by $\pi$: $a = \pi(s)$. Calculates the expected value (reward + discounted future value) following this fixed action. There is no $\max$ because the action is predetermined by $\pi$.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Policy:&lt;/strong&gt; Explicit and fixed throughout evaluation.&lt;/li&gt;
&lt;/ul&gt;
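&lt;p&gt;The contrast is easiest to see in code. Here is a minimal sketch (my own illustration, using the same &lt;code&gt;(probability, s_next, reward)&lt;/code&gt; transition encoding as in my previous MDP note): the only difference between the two steps is the &lt;code&gt;max&lt;/code&gt; versus the fixed $\pi(s)$.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;gamma = 0.9

def lookahead(T, V, s, a):
    # One-step expectation: sum over s&apos; of T(s,a,s&apos;) * [R(s,a,s&apos;) + gamma * V(s&apos;)]
    return sum(p * (r + gamma * V[s2]) for p, s2, r in T[(s, a)])

def value_iteration_step(T, V, states, actions):
    # The max over actions is what makes this compute OPTIMAL values.
    return {s: max(lookahead(T, V, s, a) for a in actions) for s in states}

def policy_evaluation_step(T, V, states, pi):
    # No max: the fixed policy pi dictates the action in every state.
    return {s: lookahead(T, V, s, pi[s]) for s in states}
&lt;/code&gt;&lt;/pre&gt;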
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Comparison Table&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;| Feature           | Value Iteration (VI)                        | Policy Evaluation (PE for fixed $\pi$)                    |
| :---------------- | :------------------------------------------ | :-------------------------------------------------------- |
| Equation Core     | $\max_a \sum T(s,a,s&apos;)[R + \gamma V_k(s&apos;)]$ | $\sum T(s, \pi(s), s&apos;)[R + \gamma V^\pi_k(s&apos;)]$           |
| $\max_a$ Present? | &lt;strong&gt;Yes&lt;/strong&gt;                                     | &lt;strong&gt;No&lt;/strong&gt;                                                    |
| Action Choice     | Considers all $a$, picks the best           | Only the action $\pi(s)$ given by policy                  |
| Policy Role       | Policy is implicit (via $\max$)             | Policy is explicit and fixed                              |
| Goal              | Compute optimal value function $V^*$        | Compute value function $V^\pi$ for the given policy $\pi$ |
| Used Where?       | Standalone algorithm to find $V^*$          | Subroutine within Policy Iteration                        |
| Convergence       | $V_k$ converges to $V^*$                    | $V^\pi_k$ converges to $V^\pi$                            |&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Does Policy Evaluation converge after more iterations than Value Iteration?&lt;/h3&gt;
&lt;p&gt;It&apos;s tempting to think that Policy Evaluation takes more iterations to converge, since it does not optimize at every step, but in practice, Policy Iteration often &lt;strong&gt;converges in fewer&lt;/strong&gt; outer iterations (policy updates) than Value Iteration, though the work per iteration can differ.&lt;/p&gt;
&lt;p&gt;The real power of Policy Iteration comes after Policy Evaluation. Once we have $V^\pi$ for our current policy, we can often make a large jump to a better policy by improving all states at once:&lt;/p&gt;
&lt;p&gt;$$
\pi_{\text{new}}(s) = \arg\max_a \sum_{s&apos;} T(s, a, s&apos;) [ R(s, a, s&apos;) + \gamma V^\pi(s&apos;) ]
$$&lt;/p&gt;
&lt;p&gt;We only repeat this process until the policy stops changing, which often happens quickly and requires fewer overall iterations than Value Iteration.&lt;/p&gt;</content:encoded><h:img src="undefined"/><enclosure url="undefined"/></item><item><title>CS188 Notes 1 - Constraint Satisfaction Problems (CSPs)</title><link>https://20051110.xyz/blog/cs188-notes-1</link><guid isPermaLink="true">https://20051110.xyz/blog/cs188-notes-1</guid><description>Notes from UC Berkeley&apos;s CS188 course on Artificial Intelligence.</description><pubDate>Fri, 18 Apr 2025 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Note:&lt;/h2&gt;
&lt;p&gt;This is a work in progress. I will be adding more notes and examples as I go through the course. The course is available on the &lt;a href=&quot;https://inst.eecs.berkeley.edu/~cs188/sp25/&quot;&gt;Berkeley website&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Note that my notes are based on the &lt;strong&gt;Spring 2025&lt;/strong&gt; version of the course, and my understanding of the material. So they MAY NOT be 100% accurate or complete. Also, THIS IS NOT A SUBSTITUTE FOR THE COURSE MATERIAL. I would only take notes on parts of the lecture that I find interesting or confusing. I will NOT be taking notes on every single detail of the lecture.&lt;/p&gt;
&lt;p&gt;I will begin my notes with Lec.4 (CSPs I) and continue from there.&lt;/p&gt;
&lt;h2&gt;CS188: Lecture 4 - Constraint Satisfaction Problems (CSPs)&lt;/h2&gt;
&lt;p&gt;The &lt;strong&gt;goal&lt;/strong&gt; is to find a &lt;strong&gt;complete assignment&lt;/strong&gt; (every variable has a value from its domain) such that &lt;strong&gt;all constraints&lt;/strong&gt; are satisfied. CSPs are a special kind of search problem where the path to the goal doesn&apos;t matter, only the final state.&lt;/p&gt;
&lt;h3&gt;Backtracking Search&lt;/h3&gt;
&lt;p&gt;The fundamental algorithm for solving CSPs systematically is &lt;strong&gt;Backtracking Search&lt;/strong&gt;. It works as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Start with an empty assignment.&lt;/li&gt;
&lt;li&gt;Select an unassigned variable.&lt;/li&gt;
&lt;li&gt;Try assigning a value from its domain.&lt;/li&gt;
&lt;li&gt;Check if this assignment violates any constraints with already assigned variables.
&lt;ul&gt;
&lt;li&gt;If &lt;strong&gt;no violation&lt;/strong&gt;, recursively call backtracking for the next variable. If the recursive call succeeds, we&apos;re done (or continue if finding all solutions).&lt;/li&gt;
&lt;li&gt;If &lt;strong&gt;violation&lt;/strong&gt;, or if the recursive call returns failure, try the next value for the current variable.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;If all values for the current variable have been tried and failed, &lt;strong&gt;backtrack&lt;/strong&gt;: return failure to the previous call, forcing it to try a different value.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This explores the space of partial assignments in a depth-first manner. While complete (guaranteed to find a solution if one exists), basic backtracking can be very slow.&lt;/p&gt;
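&lt;p&gt;A minimal sketch of the algorithm (my own illustration; &lt;code&gt;consistent(var, val, assignment)&lt;/code&gt; is a placeholder that checks the constraints of &lt;code&gt;var=val&lt;/code&gt; against the already-assigned variables):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def backtracking(assignment, variables, domains, consistent):
    if len(assignment) == len(variables):
        return assignment                # complete and consistent: done
    var = next(v for v in variables if v not in assignment)
    for val in domains[var]:
        if consistent(var, val, assignment):
            assignment[var] = val
            result = backtracking(assignment, variables, domains, consistent)
            if result is not None:
                return result
            del assignment[var]          # undo, then try the next value
    return None                          # all values failed: backtrack
&lt;/code&gt;&lt;/pre&gt;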
&lt;h3&gt;Filtering (Constraint Propagation)&lt;/h3&gt;
&lt;p&gt;Filtering techniques aim to prune the search space &lt;em&gt;before&lt;/em&gt; or &lt;em&gt;during&lt;/em&gt; backtracking by removing values from domains that cannot possibly lead to a solution.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Forward Checking:&lt;/strong&gt;
When a variable &lt;code&gt;X&lt;/code&gt; is assigned a value &lt;code&gt;v&lt;/code&gt;, look at all unassigned neighboring variables &lt;code&gt;Y&lt;/code&gt; connected to &lt;code&gt;X&lt;/code&gt; by a constraint. Remove any value &lt;code&gt;y&lt;/code&gt; from &lt;code&gt;Y&lt;/code&gt;&apos;s domain that is inconsistent with &lt;code&gt;X=v&lt;/code&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Why:&lt;/strong&gt; Simple, relatively cheap check that prevents immediate failures down the line.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Limitation:&lt;/strong&gt; It only checks constraints between the &lt;em&gt;newly assigned&lt;/em&gt; variable and its &lt;em&gt;future&lt;/em&gt; neighbors. It doesn&apos;t detect inconsistencies &lt;em&gt;between two unassigned variables&lt;/em&gt;, even if their domains have been reduced (e.g., if both NT and SA are reduced to only {Blue}, Forward Checking won&apos;t notice the NT-SA conflict until one of them is assigned).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Arc Consistency (2-Consistency):&lt;/strong&gt;
An arc &lt;code&gt;X -&gt; Y&lt;/code&gt; is consistent if &lt;em&gt;for every&lt;/em&gt; value &lt;code&gt;x&lt;/code&gt; remaining in &lt;code&gt;X&lt;/code&gt;&apos;s domain, there exists &lt;em&gt;at least one&lt;/em&gt; value &lt;code&gt;y&lt;/code&gt; remaining in &lt;code&gt;Y&lt;/code&gt;&apos;s domain such that &lt;code&gt;(x, y)&lt;/code&gt; satisfies the constraint between &lt;code&gt;X&lt;/code&gt; and &lt;code&gt;Y&lt;/code&gt;. So you could think of it as a &quot;two-way&quot; check, an upgrade of the previously mentioned Forward Checking.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;How (AC-3 Algorithm Idea):&lt;/strong&gt; Maintain a queue of all arcs. While the queue is not empty, pop an arc &lt;code&gt;X -&gt; Y&lt;/code&gt;. Check if it&apos;s consistent. If not, remove the inconsistent value(s) &lt;code&gt;x&lt;/code&gt; from &lt;code&gt;X&lt;/code&gt;&apos;s domain (&quot;delete from the tail&quot;). &lt;strong&gt;Crucially:&lt;/strong&gt; If any value was removed from &lt;code&gt;X&lt;/code&gt;, add all arcs &lt;code&gt;Z -&gt; X&lt;/code&gt; (where &lt;code&gt;Z&lt;/code&gt; is a neighbor of &lt;code&gt;X&lt;/code&gt;, other than &lt;code&gt;Y&lt;/code&gt;) back into the queue, because the removal might make some values in &lt;code&gt;Z&lt;/code&gt; inconsistent. Repeat until the queue is empty (no more values can be removed).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Why:&lt;/strong&gt; More powerful than Forward Checking. It propagates constraints between variables, potentially detecting failures much earlier (like the NT-SA {Blue} conflict). Can be used as preprocessing or maintained during search. It is more computationally expensive than Forward Checking, but often worth it. (A minimal sketch of AC-3 follows this list.)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;K-Consistency &amp;#x26; Strong K-Consistency:&lt;/strong&gt;
Generalizes consistency checks to &lt;code&gt;k&lt;/code&gt; variables. K-Consistency means any consistent assignment to &lt;code&gt;k-1&lt;/code&gt; variables can be extended to a &lt;code&gt;k&lt;/code&gt;-th variable.
1-Consistency = Node Consistency (unary constraints).
2-Consistency = Arc Consistency.
3-Consistency = Path Consistency.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Strong K-Consistency:&lt;/strong&gt; Means the CSP is J-Consistent for all &lt;code&gt;J&lt;/code&gt; from 1 to &lt;code&gt;K&lt;/code&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Fact&lt;/strong&gt;: &lt;strong&gt;Strong&lt;/strong&gt; n-Consistency (where n is the number of variables) guarantees a solution can be found &lt;strong&gt;without backtracking&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;My misunderstanding&lt;/strong&gt;: Why &quot;&lt;strong&gt;Strong&lt;/strong&gt;&quot;? Because the backtrack-free construction process requires the guarantee at &lt;em&gt;every&lt;/em&gt; step &lt;code&gt;k&lt;/code&gt;. Step &lt;code&gt;k&lt;/code&gt; requires k-Consistency &lt;em&gt;assuming&lt;/em&gt; the first &lt;code&gt;k-1&lt;/code&gt; assignments were consistent. Plain n-Consistency only guarantees the &lt;em&gt;last&lt;/em&gt; step (n-1 to n) works, but doesn&apos;t guarantee the intermediate steps (like 2 to 3) are possible if the problem isn&apos;t also 3-Consistent, etc. A problem could be n-Consistent (vacuously, if no consistent n-1 assignments exist) but fail lower levels of consistency, requiring backtracking or even having no solution. Strong n-Consistency ensures all necessary intermediate guarantees hold.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
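&lt;p&gt;As promised, here is a minimal AC-3 sketch (my own illustration; for simplicity it assumes a single binary constraint &lt;code&gt;satisfies&lt;/code&gt; shared by all arcs, e.g. &quot;neighboring regions must differ&quot; in map coloring):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from collections import deque

def ac3(domains, neighbors, satisfies):
    # domains: var -&gt; set of values; neighbors: var -&gt; iterable of vars.
    queue = deque((x, y) for x in domains for y in neighbors[x])
    while queue:
        x, y = queue.popleft()
        # Remove values of X with no supporting value in Y&apos;s domain.
        removed = {vx for vx in domains[x]
                   if not any(satisfies(vx, vy) for vy in domains[y])}
        if removed:
            domains[x] -= removed        # &quot;delete from the tail&quot; (X)
            if not domains[x]:
                return False             # a domain emptied: no solution
            for z in neighbors[x]:
                if z != y:
                    queue.append((z, x)) # re-check arcs pointing into X
    return True

# Map coloring usage: ac3(domains, neighbors, satisfies=lambda a, b: a != b)
&lt;/code&gt;&lt;/pre&gt;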
&lt;h3&gt;Speeding Up Backtracking&lt;/h3&gt;
&lt;p&gt;These heuristics don&apos;t prune the search space but guide the backtracking search to potentially find solutions faster or detect failures earlier.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Variable Ordering: Minimum Remaining Values (MRV):&lt;/strong&gt;
Choose the &lt;em&gt;next unassigned variable&lt;/em&gt; that has the &lt;strong&gt;fewest&lt;/strong&gt; legal values left in its domain.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Why (&quot;Fail-Fast&quot;):&lt;/strong&gt; If a variable has 0 values, failure is detected immediately. If it has 1 value, it&apos;s forced, simplifying the problem. Variables with few values are often bottlenecks; dealing with them early is likely to prune large parts of the search tree quickly if they lead to failure. Also called &quot;most constrained variable&quot;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Value Ordering: Least Constraining Value (LCV):&lt;/strong&gt;
Once a variable is selected (e.g., by MRV), try assigning values from its domain in an order. Choose the value that &lt;strong&gt;rules out the fewest&lt;/strong&gt; values in the domains of &lt;em&gt;neighboring unassigned variables&lt;/em&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Why (&quot;Succeed-First&quot;):&lt;/strong&gt; Tries to keep options open for the future, increasing the chance that the current path leads to a solution without immediate backtracking. It prioritizes choices that seem less likely to cause conflicts later.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;MRV and LCV often work very well together.&lt;/p&gt;</content:encoded><h:img src="undefined"/><enclosure url="undefined"/></item><item><title>The Hidden Cost of try-catch</title><link>https://20051110.xyz/blog/try-catch</link><guid isPermaLink="true">https://20051110.xyz/blog/try-catch</guid><description>Profiling revealed that using exceptions in C++ for expected control flow can lead to significant performance degradation.</description><pubDate>Sat, 12 Apr 2025 14:44:00 GMT</pubDate><content:encoded>&lt;h2&gt;The Problem&lt;/h2&gt;
&lt;p&gt;So I was implementing my own version of standard library containers like &lt;code&gt;std::map&lt;/code&gt;. It&apos;s a fantastic learning exercise! I got to &lt;code&gt;operator[]&lt;/code&gt;, the &lt;code&gt;access-or-insert&lt;/code&gt; function, looked at the existing &lt;code&gt;at()&lt;/code&gt; method (which provides bounds-checked access) and thought, &quot;Aha! I can reuse &lt;code&gt;at()&lt;/code&gt; and just catch the exception if the key isn&apos;t there!&quot;&lt;/p&gt;
&lt;p&gt;It seems elegant, right? I wrote something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-cpp&quot;&gt;T &amp;#x26;at(const Key &amp;#x26;key) {
    if (root == nullptr) {
        throw index_out_of_bound();
    }
    // find() also throws index_out_of_bound when the key is absent
    return find(key, root);
}

T &amp;#x26;operator[](const Key &amp;#x26;key) {
    try {
        return at(key);
    } catch (index_out_of_bound &amp;#x26;) {
        // insert
        value_type value(key, T());
        pair&amp;#x3C;iterator, bool&gt; result = insert(value);
        return result.first-&gt;second;
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I compiled it, feeling pretty good about the code reuse. Then I ran the benchmarks, comparing my &lt;code&gt;sjtu::operator[]&lt;/code&gt; against &lt;code&gt;std::map::operator[]&lt;/code&gt;, especially focusing on scenarios involving insertions (where the key doesn&apos;t initially exist), and boom - Time Limit Exceeded. Why? So I looked at the benchmark script, and it contained something like&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-cpp&quot;&gt;	//	test: erase()
	while (map.begin() != map.end()) {
		map.erase(map.begin());
	}
	assert(map.empty() &amp;#x26;&amp;#x26; map.size() == 0);
	//	test: operator[]
	for (int i = 0; i &amp;#x3C; 100000; ++i) {
		std::cout &amp;#x3C;&amp;#x3C; map[Integer(i)];
	}
	std::cout &amp;#x3C;&amp;#x3C; map.size() &amp;#x3C;&amp;#x3C; std::endl;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So you have probably already identified the problem by now, but I wasn&apos;t so lucky. I just thought, &quot;Oh, maybe the &lt;code&gt;insert&lt;/code&gt; function is slow.&quot;&lt;/p&gt;
&lt;h2&gt;The Profiling&lt;/h2&gt;
&lt;p&gt;The benchmark results are shocking. This implementation is &lt;em&gt;dramatically&lt;/em&gt; slower (in my case, 88% slower) than &lt;code&gt;std::map&lt;/code&gt; specifically when &lt;code&gt;operator[]&lt;/code&gt; results in inserting a new element. Accessing existing elements might be fine, but the insert path is killing performance.&lt;/p&gt;
&lt;p&gt;What gives? Is my tree balancing algorithm inefficient? Is memory allocation slow? This is where debugging tools become essential. Simple code inspection doesn&apos;t immediately reveal &lt;em&gt;why&lt;/em&gt; it&apos;s so much slower, as it DOES achieve $O(\log N)$ time complexity.&lt;/p&gt;
&lt;p&gt;Time to bring out the &lt;strong&gt;profilers&lt;/strong&gt;. Tools like &lt;code&gt;perf&lt;/code&gt; (on Linux) and &lt;code&gt;callgrind&lt;/code&gt; (part of the Valgrind suite) are designed to answer the question: &quot;Where is my program &lt;em&gt;actually&lt;/em&gt; spending its time?&quot;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Beginning with &lt;code&gt;perf record ./code&lt;/code&gt; followed by &lt;code&gt;perf report&lt;/code&gt; is a great start, as it already provides simple CLI views to see which functions are &quot;hot&quot; (consuming the most CPU cycles). The &lt;code&gt;perf report&lt;/code&gt; points towards functions with names like &lt;code&gt;_Unwind_Find_FDE&lt;/code&gt;, and various functions involved in stack unwinding and exception handling. This already hinted that the problem was how I was using the language rather than my tree algorithm itself. However, I was unfamiliar with things like &lt;code&gt;_Unwind_Find_FDE&lt;/code&gt;, so I used &lt;code&gt;callgrind&lt;/code&gt; to further inspect the instruction counts.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img src=&quot;https://20051110.xyz/_astro/perf-report-1.CJQNxDZ7_g2E47.webp&quot; alt=&quot;Perf Report&quot;&gt;&lt;/p&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;strong&gt;Running &lt;code&gt;callgrind&lt;/code&gt;:&lt;/strong&gt; I run &lt;code&gt;valgrind --tool=callgrind ./code&lt;/code&gt;. And I am using macOS, so I use &lt;code&gt;qcachegrind&lt;/code&gt; to visualize the results.
&lt;ul&gt;
&lt;li&gt;The visualization confirms &lt;code&gt;perf&lt;/code&gt;&apos;s findings but with more detail. I can see that when &lt;code&gt;sjtu::operator[]&lt;/code&gt; calls &lt;code&gt;sjtu::at&lt;/code&gt; and &lt;code&gt;at&lt;/code&gt; executes &lt;code&gt;throw&lt;/code&gt;, a massive cascade of function calls related to exception handling follows - costing 87% of execution time!!!!!&lt;/li&gt;
&lt;li&gt;Crucially, &lt;code&gt;callgrind&lt;/code&gt; shows the &lt;em&gt;cost&lt;/em&gt; associated not just with the &lt;code&gt;throw&lt;/code&gt; itself, but with the entire &lt;strong&gt;stack unwinding&lt;/strong&gt; process – the runtime searching for the &lt;code&gt;catch&lt;/code&gt; block and meticulously destroying any local objects created within the &lt;code&gt;try&lt;/code&gt; block and intervening function calls.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img src=&quot;https://20051110.xyz/_astro/qcachegrind-0.Cffbs33F_Z24XwlU.webp&quot; alt=&quot;Callgrind Report&quot;&gt;&lt;/p&gt;
&lt;h3&gt;The &quot;Aha!&quot; Moment&lt;/h3&gt;
&lt;p&gt;The profilers leave no doubt. The performance bottleneck &lt;strong&gt;is&lt;/strong&gt; the deliberate, designed-in overhead of the C++ exception handling mechanism being triggered repeatedly for a &lt;em&gt;non-exceptional&lt;/em&gt; condition (key not found during an insertion).&lt;/p&gt;
&lt;h2&gt;What &lt;em&gt;Actually&lt;/em&gt; Happens When C++ Throws an Exception? (And Why Profilers Flag It)&lt;/h2&gt;
&lt;p&gt;After chatting with some AI Chatbots and doing some googling, I realize that throwing and catching an exception isn&apos;t just a fancy &lt;code&gt;goto&lt;/code&gt;. Instead, it involves a complex runtime process that the profilers pick up as costly operations:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Exception Object Creation:&lt;/strong&gt; &lt;code&gt;throw std::out_of_range(...)&lt;/code&gt; creates an object, often involving dynamic memory allocation (heap allocation shows up in profilers).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stack Unwinding:&lt;/strong&gt; (The main cost flagged by profilers) The runtime walks backward up the call stack.
&lt;ul&gt;
&lt;li&gt;It destroys local objects (RAII cleanup). Profilers show time spent in destructors &lt;em&gt;during&lt;/em&gt; unwinding.&lt;/li&gt;
&lt;li&gt;It consults compiler-generated &quot;unwinding tables&quot;. Accessing and processing this data takes time/instructions.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Handler Matching:&lt;/strong&gt; The runtime checks &lt;code&gt;catch&lt;/code&gt; blocks using RTTI, adding overhead.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Control Transfer:&lt;/strong&gt; Jumping to the &lt;code&gt;catch&lt;/code&gt; block disrupts linear execution flow, potentially causing instruction cache misses and branch mispredictions (subtler effects seen in very low-level profiling).&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The profiling results, combined with understanding the mechanics, paint a clear picture:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Stack Unwinding Overhead:&lt;/strong&gt; As &lt;code&gt;callgrind&lt;/code&gt; showed, walking the stack, looking up cleanup actions, and calling destructors is expensive, especially compared to a simple &lt;code&gt;if&lt;/code&gt; check.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Runtime Machinery:&lt;/strong&gt; The hidden machinery (dynamic allocation, RTTI, table lookups) adds significant overhead absent in direct conditional logic.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Optimization Barriers:&lt;/strong&gt; Exception handling constructs can limit compiler optimizations compared to simpler control flow, contributing to higher instruction counts seen in &lt;code&gt;callgrind&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In our &lt;code&gt;operator[]&lt;/code&gt; example, the case where the key &lt;em&gt;doesn&apos;t&lt;/em&gt; exist is expected. By using exceptions here, we frequently trigger the heavyweight process the profilers flagged, leading to poor performance.&lt;/p&gt;
&lt;p&gt;So what does a normal &lt;code&gt;operator[]&lt;/code&gt; look like? It should be something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-cpp&quot;&gt;    T &amp;#x26;operator[](const Key &amp;#x26;key) {
        Node *node = find_node(key, root);
        if (node != nullptr) {
            return node-&gt;data.second;
        } else {
            // Insert new element
            value_type value(key, T());
            pair&amp;#x3C;iterator, bool&gt; result = insert(value);
            return result.first-&gt;second;
        }
    }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and the profiler results should look like something like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://20051110.xyz/_astro/perf-report-0.Dl7yfGEs_2mxqnc.webp&quot; alt=&quot;Normal Report&quot;&gt;&lt;/p&gt;
&lt;p&gt;As you can see in the image, the top CPU-consuming functions are now actual functions from my code, not the exception handling machinery. The &lt;code&gt;find_node&lt;/code&gt; function is now the most expensive operation, which is expected since it involves
$O(\log N)$ tree traversal.&lt;/p&gt;</content:encoded><h:img src="undefined"/><enclosure url="undefined"/></item><item><title>My First VSCode Extension - ACMOJ Helper from Scratch</title><link>https://20051110.xyz/blog/vscode-extension</link><guid isPermaLink="true">https://20051110.xyz/blog/vscode-extension</guid><description>As a student frequently using ACMOJ, constantly switching between VSCode and browser was tedious. Could I complete all these operations within VSCode?</description><pubDate>Sun, 06 Apr 2025 06:25:00 GMT</pubDate><content:encoded>&lt;p&gt;Constantly switching between the editor (VS Code) and browser was incredibly tedious. Looking at problem descriptions, examples, and outputs in the browser, then comparing results, copying code to VS Code, writing and debugging, copying back to browser for submission, and finally switching back to browser to check results... Although I could use split screen, the Stage Manager experience on macOS wasn&apos;t great. This process not only interrupted my thought flow but was also inefficient.&lt;/p&gt;
&lt;p&gt;Could I complete all these operations within VSCode? Seeing classmates in my class developing plugins, it didn&apos;t seem that difficult. Having recently learned Golang, TypeScript didn&apos;t seem too hard to learn either 😋 With this idea in mind, I began my first VSCode extension development journey, aiming to create a convenient assistant for ACMOJ. This article documents the process from conception to implementation, through pitfalls to the final working product.&lt;/p&gt;
&lt;h2&gt;Getting Started&lt;/h2&gt;
&lt;p&gt;VS Code extensions are primarily written in TypeScript (or JavaScript) and run in a Node.js environment. Before starting, the essential tools are:&lt;/p&gt;
&lt;p&gt;Node.js &amp;#x26; npm/yarn serve as the basic runtime environment and package manager. Yeoman &amp;#x26; generator-code are the official scaffolding tools recommended by VS Code for quickly generating project structure. Simply run &lt;code&gt;npm install -g yo generator-code&lt;/code&gt; followed by &lt;code&gt;yo code&lt;/code&gt; and select TypeScript Extension. VS Code itself is needed for developing and debugging the plugin.&lt;/p&gt;
&lt;p&gt;The generated project structure is clear and straightforward. The &lt;code&gt;src/extension.ts&lt;/code&gt; file serves as the plugin&apos;s entry point, containing &lt;code&gt;activate&lt;/code&gt; (called when activated) and &lt;code&gt;deactivate&lt;/code&gt; (called when deactivated) functions. The &lt;code&gt;package.json&lt;/code&gt; file is the core manifest file, defining the plugin&apos;s metadata, &lt;strong&gt;contributions&lt;/strong&gt; (such as commands, views, configurations), and &lt;strong&gt;activation events&lt;/strong&gt; (determining when to load the plugin). The &lt;code&gt;tsconfig.json&lt;/code&gt; file contains TypeScript configuration.&lt;/p&gt;
&lt;p&gt;My initial blueprint was to implement these core features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Authentication:&lt;/strong&gt; Connect to the ACMOJ API.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Problem/Assignment Browsing:&lt;/strong&gt; View problem lists in VS Code&apos;s sidebar.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Problem Details:&lt;/strong&gt; Display problem descriptions, examples, etc. in a Webview.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Code Submission:&lt;/strong&gt; Quickly submit code from the current editor.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Result Viewing:&lt;/strong&gt; View submission status and results in the sidebar or a Webview.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;API Interaction and Authentication&lt;/h2&gt;
&lt;p&gt;ACMOJ provides an OpenAPI-compliant API, which forms the foundation for implementing functionality.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;API Client Setup&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I chose &lt;code&gt;axios&lt;/code&gt; as the HTTP request library and encapsulated an &lt;code&gt;ApiClient&lt;/code&gt; class to uniformly handle request sending, Base URL configuration, and error handling. The key was setting up request interceptors to automatically attach &lt;code&gt;Bearer &amp;#x3C;token&gt;&lt;/code&gt; in the &lt;code&gt;Authorization&lt;/code&gt; Header.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Authentication &quot;Episode&quot; - OAuth vs PAT&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The API documentation mentioned both OAuth2 (Authorization Code Flow) and Personal Access Token (PAT) authentication methods.&lt;/p&gt;
&lt;p&gt;Initially, I tried implementing the OAuth2 flow. This involved directing users to browser authorization, then starting a temporary HTTP server locally to listen for callback URIs to obtain the &lt;code&gt;code&lt;/code&gt;, then using the &lt;code&gt;code&lt;/code&gt; and &lt;code&gt;client_secret&lt;/code&gt; to exchange for an &lt;code&gt;access_token&lt;/code&gt;. While this flow is standard for applications requiring multi-user authorization, it&apos;s quite complex to implement, especially handling &lt;code&gt;client_secret&lt;/code&gt; and local callbacks securely in a VS Code extension environment. (Actually, what stopped me initially was needing a &lt;code&gt;client secret&lt;/code&gt; from the admin team. At that time, I didn&apos;t know anyone on the admin team, though they seem to know me now after developing this plugin XD)&lt;/p&gt;
&lt;p&gt;Considering that target users (mainly myself and classmates) could easily generate PATs on the ACMOJ website, I decided to switch to the simpler PAT authentication. This greatly simplified the flow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Create an &lt;code&gt;AuthService&lt;/code&gt; (or &lt;code&gt;TokenManager&lt;/code&gt;), and provide an &lt;code&gt;acmoj.setToken&lt;/code&gt; command that uses &lt;code&gt;vscode.window.showInputBox({ password: true })&lt;/code&gt; to prompt users for PAT input.&lt;/li&gt;
&lt;li&gt;Use VS Code&apos;s &lt;code&gt;SecretStorage&lt;/code&gt; API (&lt;code&gt;context.secrets.store&lt;/code&gt; / &lt;code&gt;context.secrets.get&lt;/code&gt;) to securely store and read PATs.&lt;/li&gt;
&lt;li&gt;Provide an &lt;code&gt;acmoj.clearToken&lt;/code&gt; command to clear stored PATs.&lt;/li&gt;
&lt;li&gt;In &lt;code&gt;ApiClient&lt;/code&gt;&apos;s request interceptor, read the stored PAT from &lt;code&gt;AuthService&lt;/code&gt; and add it to the request headers.&lt;/li&gt;
&lt;li&gt;In the response interceptor, on a 401 Unauthorized error, call &lt;code&gt;AuthService&lt;/code&gt; methods to clear the invalid token and prompt the user to set it again.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Building User Interface with TreeView and Webview&lt;/h2&gt;
&lt;p&gt;To display information and provide interaction in VS Code, I mainly used TreeView and Webview.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;TreeView (Sidebar)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I used the &lt;code&gt;vscode.TreeDataProvider&lt;/code&gt; interface to create two views for the Activity Bar:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Problemsets (Contests/Assignments):&lt;/strong&gt; Initially, I simply listed all problems but quickly found the information overwhelming. I improved it to display Problemsets that users joined. Further improvement involved categorizing Problemsets into &quot;Ongoing&quot;, &quot;Upcoming&quot;, and &quot;Passed&quot; top-level nodes based on their start/end times. This required fetching all Problemsets, then filtering and sorting them in the &lt;code&gt;getChildren&lt;/code&gt; method based on current time and category nodes. I used two custom &lt;code&gt;TreeItem&lt;/code&gt; types: &lt;code&gt;CategoryTreeItem&lt;/code&gt; and &lt;code&gt;ProblemsetTreeItem&lt;/code&gt;. Each Problemset node was set as expandable (&lt;code&gt;vscode.TreeItemCollapsibleState.Collapsed&lt;/code&gt;), loading its contained problem list (&lt;code&gt;ProblemBriefTreeItem&lt;/code&gt;) when clicked.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Submissions (Submission Records):&lt;/strong&gt; This displays the user&apos;s submission list, including ID, problem, status, language, time, etc. I set different icons (&lt;code&gt;ThemeIcon&lt;/code&gt;) for different submission statuses (AC, WA, TLE, RE...) to make them more intuitive.&lt;/p&gt;
&lt;p&gt;The key to implementing TreeView lies in the &lt;code&gt;getChildren&lt;/code&gt; (get child nodes) and &lt;code&gt;getTreeItem&lt;/code&gt; (define node appearance and behavior) methods. Through &lt;code&gt;EventEmitter&lt;/code&gt; and &lt;code&gt;onDidChangeTreeData&lt;/code&gt; events, you can notify VS Code to refresh the view.&lt;/p&gt;
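&lt;p&gt;In stripped-down form, a provider looks roughly like this (the item shape and the API stub are illustrative, not the project&apos;s real code):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;import * as vscode from &apos;vscode&apos;;

interface Submission { id: number; problem: string; status: string; }

// Stand-in for the real call through ApiClient.
async function fetchSubmissions(): Promise&amp;#x3C;Submission[]&gt; {
  return []; // illustrative stub
}

export class SubmissionProvider
  implements vscode.TreeDataProvider&amp;#x3C;vscode.TreeItem&gt; {
  private readonly emitter =
    new vscode.EventEmitter&amp;#x3C;vscode.TreeItem | undefined | null | void&gt;();
  readonly onDidChangeTreeData = this.emitter.event;

  // Call after submitting: VS Code will re-query getChildren.
  refresh(): void {
    this.emitter.fire();
  }

  getTreeItem(element: vscode.TreeItem): vscode.TreeItem {
    return element;
  }

  async getChildren(): Promise&amp;#x3C;vscode.TreeItem[]&gt; {
    const submissions = await fetchSubmissions();
    return submissions.map((s) =&gt; {
      const item = new vscode.TreeItem(`#${s.id} ${s.problem}`);
      item.description = s.status;
      // Different ThemeIcons for AC vs. everything else.
      item.iconPath = new vscode.ThemeIcon(
        s.status === &apos;accepted&apos; ? &apos;check&apos; : &apos;error&apos;,
      );
      return item;
    });
  }
}
&lt;/code&gt;&lt;/pre&gt;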
&lt;p&gt;&lt;strong&gt;Webview (Detail Display)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When users click on problems or submission records in the TreeView, I use &lt;code&gt;vscode.window.createWebviewPanel&lt;/code&gt; to create a Webview for displaying detailed information. Why use a &lt;code&gt;webview&lt;/code&gt;? Because I needed to render TeX formulas, and the API&apos;s JSON responses return the problem content as Markdown.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Content Rendering:&lt;/strong&gt; Webview is essentially an embedded browser environment with HTML content. I used the &lt;code&gt;markdown-it&lt;/code&gt; library to convert Markdown-formatted problem descriptions, input/output formats, etc. obtained from the API into HTML.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Challenge: Mathematical Formula Rendering:&lt;/strong&gt; OJ problem descriptions often contain LaTeX formulas.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Attempt One (Failed):&lt;/strong&gt; Initially, I tried including KaTeX JS library and auto-render script in the Webview HTML for client-side rendering. However, this caused the strange issue of formulas being rendered twice (once as original text, once as KaTeX rendered result).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Attempt Two (Success):&lt;/strong&gt; I realized the problem was the duplicated rendering flow. The final solution was to use &lt;code&gt;markdown-it&lt;/code&gt;&apos;s KaTeX plugin, &lt;code&gt;@vscode/markdown-it-katex&lt;/code&gt;. (When installing via npm there was also another developer&apos;s package under this name, which was outdated and had security risks, but the good news is that VS Code officially noticed the project and made subsequent fixes, so that&apos;s the one I used.) When &lt;code&gt;md.render()&lt;/code&gt; runs on the &lt;strong&gt;extension side&lt;/strong&gt; (the Node.js environment), the plugin converts the LaTeX in the Markdown (&lt;code&gt;$...$&lt;/code&gt;, &lt;code&gt;$$...$$&lt;/code&gt;) directly into the final KaTeX HTML structure. The HTML sent to the Webview is therefore already pre-rendered, and the Webview side &lt;strong&gt;only needs&lt;/strong&gt; to include the KaTeX CSS (&lt;code&gt;katex.min.css&lt;/code&gt;) for the styles to display correctly; KaTeX JS and the auto-render script are no longer needed.&lt;/p&gt;
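&lt;p&gt;In essence, the extension-side rendering boils down to something like this (a sketch; the plugin&apos;s exact import shape may differ between versions):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;import MarkdownIt from &apos;markdown-it&apos;;
// Note: the plugin&apos;s export shape may vary between versions.
import katex from &apos;@vscode/markdown-it-katex&apos;;

const md = new MarkdownIt({ html: true }).use(katex);

// Runs on the extension side (Node.js). LaTeX becomes KaTeX HTML here,
// so the webview only needs katex.min.css, no KaTeX JS.
export function renderProblemHtml(markdown: string): string {
  return md.render(markdown);
}
&lt;/code&gt;&lt;/pre&gt;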
&lt;p&gt;&lt;strong&gt;Commands and Status Bar&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I used &lt;code&gt;vscode.commands.registerCommand&lt;/code&gt; to register various user operations (set Token, refresh views, submit code, view problems by ID, etc.). I used &lt;code&gt;vscode.window.createStatusBarItem&lt;/code&gt; to display current login status and username on the left side of the status bar, which can trigger corresponding commands (like showing user info or setting Token) when clicked.&lt;/p&gt;
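&lt;p&gt;For example, a status bar item wired to a command takes only a few lines (a sketch; the text and priority here are arbitrary):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;import * as vscode from &apos;vscode&apos;;

export function createStatusBar(context: vscode.ExtensionContext) {
  const item = vscode.window.createStatusBarItem(
    vscode.StatusBarAlignment.Left,
    100, // priority: higher values sit further to the left
  );
  item.text = &apos;$(account) ACMOJ: not logged in&apos;;
  item.command = &apos;acmoj.setToken&apos;; // clicking triggers this command
  item.show();
  context.subscriptions.push(item);
}
&lt;/code&gt;&lt;/pre&gt;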
&lt;h2&gt;Packaging and Publishing&lt;/h2&gt;
&lt;p&gt;Everything worked smoothly during development and debugging (&lt;code&gt;F5&lt;/code&gt;), but when I used &lt;code&gt;vsce package&lt;/code&gt; to package into a VSIX file and installed it on another computer, I encountered the classic problem: &lt;code&gt;Command &apos;acmoj.setToken&apos; not found&lt;/code&gt; or &lt;code&gt;Cannot find module &apos;axios&apos;&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Debugging Process&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;On the test computer, I opened the VS Code developer tools (&lt;code&gt;Developer: Toggle Developer Tools&lt;/code&gt;) and checked the Console: activating the extension immediately reported &lt;code&gt;Cannot find module &apos;axios&apos;&lt;/code&gt;. I then inspected the VSIX contents with the &lt;code&gt;vsce ls&lt;/code&gt; command (or by renaming the &lt;code&gt;.vsix&lt;/code&gt; to &lt;code&gt;.zip&lt;/code&gt; and extracting it) and discovered that the &lt;code&gt;node_modules&lt;/code&gt; folder wasn&apos;t packaged at all!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Root Cause&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I mistakenly placed runtime-required libraries (like &lt;code&gt;axios&lt;/code&gt;, &lt;code&gt;markdown-it&lt;/code&gt;, &lt;code&gt;katex&lt;/code&gt;, &lt;code&gt;@vscode/markdown-it-katex&lt;/code&gt;) under &lt;code&gt;devDependencies&lt;/code&gt; instead of &lt;code&gt;dependencies&lt;/code&gt; in &lt;code&gt;package.json&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dependencies&lt;/strong&gt; are libraries required for extension &lt;strong&gt;runtime&lt;/strong&gt; and will be packaged by &lt;code&gt;vsce package&lt;/code&gt;. &lt;strong&gt;DevDependencies&lt;/strong&gt; are libraries used during &lt;strong&gt;development&lt;/strong&gt; (compilers, type definitions, linters, packaging tools, etc.) and will &lt;strong&gt;not&lt;/strong&gt; be packaged.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I carefully checked &lt;code&gt;package.json&lt;/code&gt; and moved all runtime dependencies (&lt;code&gt;axios&lt;/code&gt;, etc.) to the &lt;code&gt;dependencies&lt;/code&gt; section, while keeping development tools (&lt;code&gt;typescript&lt;/code&gt;, &lt;code&gt;@types/*&lt;/code&gt;, &lt;code&gt;eslint&lt;/code&gt;, &lt;code&gt;@vscode/vsce&lt;/code&gt;, etc.) in &lt;code&gt;devDependencies&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &quot;dependencies&quot;: {
    &quot;@vscode/markdown-it-katex&quot;: &quot;...&quot;,
    &quot;axios&quot;: &quot;...&quot;,
    &quot;katex&quot;: &quot;...&quot;,
    &quot;markdown-it&quot;: &quot;...&quot;
  },
  &quot;devDependencies&quot;: {
    &quot;@types/vscode&quot;: &quot;...&quot;,
    &quot;@types/node&quot;: &quot;...&quot;,
    &quot;@types/markdown-it&quot;: &quot;...&quot;,
    &quot;@vscode/vsce&quot;: &quot;...&quot;, // The packaging tool itself is a dev dependency
    &quot;typescript&quot;: &quot;...&quot;,
    &quot;eslint&quot;: &quot;...&quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Key Step:&lt;/strong&gt; After modifying &lt;code&gt;package.json&lt;/code&gt;, it&apos;s essential to perform a &lt;strong&gt;&quot;clean &amp;#x26; reinstall&quot;&lt;/strong&gt;: delete &lt;code&gt;node_modules&lt;/code&gt; and &lt;code&gt;package-lock.json&lt;/code&gt;, then run &lt;code&gt;npm install&lt;/code&gt; again. I kept getting errors at first precisely because I hadn&apos;t cleared them.&lt;/p&gt;
&lt;p&gt;This time, the generated VSIX file finally contained the correct &lt;code&gt;node_modules&lt;/code&gt;, and after installation, commands could be found normally and the extension activated successfully.&lt;/p&gt;
&lt;h2&gt;TypeScript Interlude&lt;/h2&gt;
&lt;p&gt;As a TypeScript project, I also encountered some typical type issues:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Module/Type Not Found:&lt;/strong&gt; Errors like &lt;code&gt;Cannot find module &apos;vscode&apos;&lt;/code&gt;, caused by missing &lt;code&gt;@types&lt;/code&gt; packages and usually resolved by &lt;code&gt;npm install --save-dev @types/vscode @types/node ...&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Implicit &lt;code&gt;any&lt;/code&gt;:&lt;/strong&gt; After enabling &lt;code&gt;strict&lt;/code&gt; mode, I needed to explicitly add types for callback function parameters (like &lt;code&gt;progress&lt;/code&gt; in &lt;code&gt;withProgress&lt;/code&gt;, &lt;code&gt;text&lt;/code&gt; in &lt;code&gt;validateInput&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;API Signature Mismatch:&lt;/strong&gt; When calling &lt;code&gt;vscode.window.showQuickPick&lt;/code&gt; with option objects, you must pass &lt;code&gt;QuickPickItem[]&lt;/code&gt; rather than &lt;code&gt;string[]&lt;/code&gt;, which means mapping your raw values first, as in the sketch below.&lt;/p&gt;
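&lt;p&gt;For example (a hypothetical language picker):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;import * as vscode from &apos;vscode&apos;;

async function pickLanguage(): Promise&amp;#x3C;string | undefined&gt; {
  const languages = [&apos;cpp&apos;, &apos;python&apos;, &apos;git&apos;]; // illustrative values
  // Map raw strings into QuickPickItem objects.
  const items: vscode.QuickPickItem[] = languages.map((lang) =&gt; ({
    label: lang,
    description: `submit as ${lang}`,
  }));
  const picked = await vscode.window.showQuickPick(items, {
    placeHolder: &apos;Select a submission language&apos;,
  });
  return picked?.label;
}
&lt;/code&gt;&lt;/pre&gt;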
&lt;h2&gt;Is This the End?&lt;/h2&gt;
&lt;p&gt;While acmoj-helper can already run and has helped me considerably in daily use, during the development process, I gradually felt some &quot;growing pains.&quot; As features iterated (even with minor adjustments), I found the code becoming somewhat messy:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Unclear Responsibilities:&lt;/strong&gt; The &lt;code&gt;commands.ts&lt;/code&gt; file not only handled command registration but also contained substantial, complex business logic such as &lt;code&gt;submitCurrentFile&lt;/code&gt;. The file grew abnormally bloated, and any modification risked rippling into unrelated functionality.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;High Coupling:&lt;/strong&gt; Modifying one module (like &lt;code&gt;cache.ts&lt;/code&gt; handling API caching) might unexpectedly affect views (&lt;code&gt;submissionProvider.ts&lt;/code&gt;) or command handling. When I mentioned rewriting &lt;code&gt;submissionProvider&lt;/code&gt; earlier, that was a typical example - the view layer was too tightly coupled with data fetching and business logic.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Registration Chaos:&lt;/strong&gt; Command registration was scattered across &lt;code&gt;extension.ts&lt;/code&gt; and &lt;code&gt;commands.ts&lt;/code&gt;, lacking centralization and clarity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Extension Difficulties:&lt;/strong&gt; If I wanted to add new features like &quot;Contest&quot; view or more complex problem filtering logic, it would be extremely painful under the existing structure, requiring careful navigation through various files to ensure existing functionality wasn&apos;t broken.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Testing Obstacles:&lt;/strong&gt; Code mixing UI logic, API calls, and business processing was very difficult to unit test.&lt;/p&gt;
&lt;p&gt;These issues made me realize that while the current architecture works, it&apos;s not &quot;elegant&quot; and lacks long-term viability. To ensure this project can develop healthily and to improve my own code design skills, I decided to conduct a thorough refactoring.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Refactoring Goals: Decoupling, Layering, Single Responsibility&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The new architecture I&apos;m currently working on is roughly divided into these layers:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;VS Code Integration Layer (&lt;code&gt;extension.ts&lt;/code&gt;, &lt;code&gt;src/commands/index.ts&lt;/code&gt;)&lt;/strong&gt; - The entry point: activates the extension and centrally registers commands, views, and services.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Service Layer (&lt;code&gt;src/services/&lt;/code&gt;)&lt;/strong&gt; - Responsible for encapsulating core business logic and interactions with external resources (like APIs, caching). Each service corresponds to a clear domain.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Command Handling Layer (&lt;code&gt;src/commands/&lt;/code&gt;)&lt;/strong&gt; - Command handlers receive calls from VS Code and then &lt;strong&gt;use the service layer&lt;/strong&gt; to complete specific tasks. They serve as bridges between VS Code commands and business logic. Complex logic (like &lt;code&gt;submitCurrentFile&lt;/code&gt;) is now clearly encapsulated in corresponding command handlers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;UI Layer (&lt;code&gt;src/views/&lt;/code&gt;, &lt;code&gt;src/webviews/&lt;/code&gt;)&lt;/strong&gt; - Responsible for data display and UI interaction. The &lt;code&gt;views/&lt;/code&gt; directory contains TreeDataProviders (like &lt;code&gt;ProblemsetProvider&lt;/code&gt;, &lt;code&gt;SubmissionProvider&lt;/code&gt;) that get data from the &lt;strong&gt;service layer&lt;/strong&gt; and format it into structures needed by VS Code TreeView. The &lt;code&gt;webviews/&lt;/code&gt; directory contains Webview Panel logic. After refactoring, I created dedicated classes for problem details and submission details (&lt;code&gt;ProblemDetailPanel&lt;/code&gt;, &lt;code&gt;SubmissionDetailPanel&lt;/code&gt;), encapsulating their respective HTML generation, message handling, and lifecycle management. They also get data through the &lt;strong&gt;service layer&lt;/strong&gt;, and Webview operations (like &quot;copy code&quot;) now typically send messages to VS Code via &lt;code&gt;postMessage&lt;/code&gt;, responded to by corresponding &lt;strong&gt;command handlers&lt;/strong&gt;.&lt;/p&gt;
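&lt;p&gt;The message passing itself is tiny; here&apos;s a sketch of the extension-side half (the function and message names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-typescript&quot;&gt;import * as vscode from &apos;vscode&apos;;

// Extension side: respond to messages posted from the webview.
export function wireWebviewMessages(
  panel: vscode.WebviewPanel,
  context: vscode.ExtensionContext,
) {
  panel.webview.onDidReceiveMessage(
    async (message: { command: string; text?: string }) =&gt; {
      if (message.command === &apos;copyCode&apos; &amp;#x26;&amp;#x26; message.text) {
        await vscode.env.clipboard.writeText(message.text);
        vscode.window.showInformationMessage(&apos;Code copied.&apos;);
      }
    },
    undefined,
    context.subscriptions,
  );
}

// Webview side (inside the generated HTML) it would be something like:
//   const api = acquireVsCodeApi();
//   api.postMessage({ command: &apos;copyCode&apos;, text: code });
&lt;/code&gt;&lt;/pre&gt;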
&lt;p&gt;&lt;strong&gt;Core/Data Layer (&lt;code&gt;src/core/&lt;/code&gt;, &lt;code&gt;src/types.ts&lt;/code&gt;)&lt;/strong&gt; - Provides the most basic components and definitions. A typical example during refactoring was &lt;strong&gt;&lt;code&gt;core/apiClient.ts&lt;/code&gt;&lt;/strong&gt;: a purer HTTP client only responsible for sending requests, handling authentication headers, retry logic, and basic error interpretation. It no longer contains specific business endpoint logic. Previously, getUserProfile, getSubmission, etc. were all in there.&lt;/p&gt;
&lt;p&gt;While the refactoring process was quite challenging and temporarily introduced new bugs, it laid a solid foundation for ACMOJ Helper&apos;s long-term development. Now I can more confidently implement those more comprehensive features I envisioned at the end of version 1.0.&lt;/p&gt;
&lt;p&gt;If you&apos;re also interested in VSCode extension development or want to build integrations for tools or platforms you frequently use, don&apos;t hesitate - just start doing it! Begin with &lt;code&gt;yo code&lt;/code&gt;, encounter problems, solve problems - this process itself is the best learning experience.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Project Repository:&lt;/strong&gt; &lt;a href=&quot;https://github.com/TheUnknownThing/vscode-acmoj&quot;&gt;TheUnknownThing/vscode-acmoj&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thanks for reading! I hope my experience can be helpful to you.&lt;/p&gt;</content:encoded><h:img src="undefined"/><enclosure url="undefined"/></item><item><title>I&apos;ll Never Use memset Again...</title><link>https://20051110.xyz/blog/memset</link><guid isPermaLink="true">https://20051110.xyz/blog/memset</guid><description>The Pitfalls of the memset function</description><pubDate>Mon, 10 Mar 2025 05:11:40 GMT</pubDate><content:encoded>&lt;h2&gt;0. Foreword&lt;/h2&gt;
&lt;p&gt;This problem originated from my first programming exam during my freshman year... It was a question involving block decomposition (data chunking), and in my program, I had an operation like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-cpp&quot;&gt;memset(mul_tag, 1, sizeof(mul_tag));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Unsurprisingly, the program resulted in a WA (Wrong Answer). I spent a very, very long time debugging. This line looked completely harmless, didn&apos;t it? But as it turned out, simply changing this line fixed the program! Why??? The answer becomes clear when we look at the &lt;code&gt;memset&lt;/code&gt; function prototype.&lt;/p&gt;
&lt;h2&gt;1. &lt;code&gt;memset&lt;/code&gt; Function Introduction&lt;/h2&gt;
&lt;p&gt;The prototype for the &lt;code&gt;memset&lt;/code&gt; function is as follows:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c&quot;&gt;void *memset(void *s, int c, size_t n);
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;s&lt;/code&gt;: A pointer to the block of memory to fill.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;c&lt;/code&gt;: The value to be set. &lt;strong&gt;Note:&lt;/strong&gt; Although &lt;code&gt;c&lt;/code&gt; is of type &lt;code&gt;int&lt;/code&gt;, &lt;code&gt;memset&lt;/code&gt; actually converts &lt;code&gt;c&lt;/code&gt; to an &lt;code&gt;unsigned char&lt;/code&gt; before filling.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;n&lt;/code&gt;: The number of bytes to be set to the value.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The purpose of &lt;code&gt;memset&lt;/code&gt; is to set the first &lt;code&gt;n&lt;/code&gt; bytes of the memory block pointed to by &lt;code&gt;s&lt;/code&gt; to the value specified by &lt;code&gt;c&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;2. The Trap&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;memset&lt;/code&gt; performs its filling operation &lt;strong&gt;byte by byte&lt;/strong&gt;. When &lt;code&gt;a&lt;/code&gt; is an &lt;code&gt;int&lt;/code&gt; array (assuming &lt;code&gt;int&lt;/code&gt; occupies 4 bytes), &lt;code&gt;memset(a, 1, sizeof(a))&lt;/code&gt; will set &lt;em&gt;each byte&lt;/em&gt; of &lt;em&gt;each &lt;code&gt;int&lt;/code&gt; element&lt;/em&gt; to &lt;code&gt;1&lt;/code&gt;. This results in each &lt;code&gt;int&lt;/code&gt; element having the value &lt;code&gt;0x01010101&lt;/code&gt;, which is &lt;code&gt;16843009&lt;/code&gt; in decimal, not the &lt;code&gt;1&lt;/code&gt; we were hoping for.&lt;/p&gt;
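&lt;p&gt;To see where that number comes from, just expand the four bytes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;0x01010101 = 1*2^24 + 1*2^16 + 1*2^8 + 1
           = 16777216 + 65536 + 256 + 1
           = 16843009
&lt;/code&gt;&lt;/pre&gt;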
&lt;h2&gt;3. Exceptions&lt;/h2&gt;
&lt;p&gt;Using &lt;code&gt;memset(a, 1, sizeof(a))&lt;/code&gt; is dangerous in most scenarios. However, there are a few exceptional cases where it works as expected or is safe:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If &lt;code&gt;a&lt;/code&gt; is a &lt;code&gt;char&lt;/code&gt; array, &lt;code&gt;memset(a, 1, sizeof(a))&lt;/code&gt; is correct because the &lt;code&gt;char&lt;/code&gt; type occupies only one byte.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;memset(a, 0, sizeof(a))&lt;/code&gt; can be safely used for arrays of any type to initialize the entire array to 0. (This is what we typically do! And it&apos;s precisely why I initially thought &lt;code&gt;memset(a, 1, sizeof(a))&lt;/code&gt; would be fine!)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;memset(a, -1, sizeof(a))&lt;/code&gt; is safe for &lt;code&gt;int&lt;/code&gt; arrays and will correctly initialize the elements to -1. Why? Hint: Computers store negative numbers using two&apos;s complement representation. The two&apos;s complement of -1 (for a 32-bit int) is &lt;code&gt;11111111 11111111 11111111 11111111&lt;/code&gt;, which means every byte is &lt;code&gt;0xFF&lt;/code&gt;. Therefore, &lt;code&gt;memset(a, -1, sizeof(a))&lt;/code&gt; fills every byte with &lt;code&gt;0xFF&lt;/code&gt;, effectively setting each &lt;code&gt;int&lt;/code&gt; element to -1.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;4. You Should Use &lt;code&gt;std::fill&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;For initializations other than 0 and -1 (especially in C++), you should use &lt;code&gt;std::fill&lt;/code&gt; instead of &lt;code&gt;memset&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;std::fill&lt;/code&gt; Example (C++):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-cpp&quot;&gt;#include &amp;#x3C;algorithm&gt;
#include &amp;#x3C;array&gt; // Or use raw arrays

std::array&amp;#x3C;int, 10&gt; a;  // Or: int a[10];
std::fill(a.begin(), a.end(), 1); // Or: std::fill(a, a + 10, 1);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;std::fill&lt;/code&gt; operates on elements of the container or array, assigning the specified value (&lt;code&gt;1&lt;/code&gt; in this case) correctly to each element, regardless of its underlying byte representation.&lt;/p&gt;</content:encoded><h:img src="undefined"/><enclosure url="undefined"/></item><item><title>Installing Windows on an IPv6 VPS</title><link>https://20051110.xyz/blog/ipv6-vps-windows</link><guid isPermaLink="true">https://20051110.xyz/blog/ipv6-vps-windows</guid><description>If you happen to have a cloud server that does not provide Windows images, you might want to try installing Windows yourself.</description><pubDate>Wed, 15 Jan 2025 14:59:00 GMT</pubDate><content:encoded>&lt;p&gt;If you happen to have a high-configuration cloud server (like my Afly Black Friday VPS) that doesn&apos;t provide Windows images, you might want to try installing Windows yourself using the DD method.&lt;/p&gt;
&lt;h2&gt;What is DD System Installation?&lt;/h2&gt;
&lt;p&gt;As the name suggests, DD system installation uses the dd command to transfer a vhd file to a specific partition, then configures boot files to make it bootable. As scripts have evolved, many features have been added (like installation from img or iso images, system rescue). However, this isn&apos;t the main focus - this tutorial aims to cover the pitfalls I encountered while using such scripts, and how to solve them.&lt;/p&gt;
&lt;h2&gt;My Environment Configuration&lt;/h2&gt;
&lt;p&gt;First, let me introduce my environment (these configurations might seem unusual, but these specific characteristics led to some interesting problems):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;CPU: 3 Core AMD Ryzen 9 9950X&lt;/li&gt;
&lt;li&gt;RAM: 4.5GB&lt;/li&gt;
&lt;li&gt;SSD: 125GB&lt;/li&gt;
&lt;li&gt;Network: IPv6 /128 Only (Yes, pure IPv6 environment with no IPv4 access! And only a /128 IPv6 allocation, which becomes important later)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Preparation&lt;/h2&gt;
&lt;h3&gt;Script Used&lt;/h3&gt;
&lt;p&gt;I chose this script: &lt;a href=&quot;https://github.com/bin456789/reinstall&quot;&gt;https://github.com/bin456789/reinstall&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I strongly recommend carefully reading the README first, as the repository contains detailed instructions on how to use the script.&lt;/p&gt;
&lt;h3&gt;System Image Selection&lt;/h3&gt;
&lt;p&gt;I used an image from TeddySun&apos;s collection, which you can find by searching &lt;a href=&quot;https://teddysun.com/?s=DD&quot;&gt;https://teddysun.com/?s=DD&lt;/a&gt; to find your preferred image. I selected Windows 10 LTSC because it&apos;s relatively clean.&lt;/p&gt;
&lt;h3&gt;Quick Installation Commands&lt;/h3&gt;
&lt;p&gt;If you&apos;re in a hurry, here are the basic installation commands:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Download the script
curl -O https://raw.githubusercontent.com/bin456789/reinstall/main/reinstall.sh || wget -O reinstall.sh https://raw.githubusercontent.com/bin456789/reinstall/main/reinstall.sh

# Execute the installation
bash reinstall.sh dd --img https://dl.lamp.sh/vhd/zh-cn_windows10_ltsc.xz
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Remember to install curl beforehand (if your system doesn&apos;t have it)&lt;/p&gt;
&lt;h3&gt;First Issue: Incorrect DNS Configuration&lt;/h3&gt;
&lt;p&gt;This problem was mainly caused by my specific network environment. The DNS configuration in the script&apos;s Alpine environment was incorrect, preventing files from being downloaded. Here&apos;s my solution:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;#!/bin/sh

# Modify /etc/resolv.conf file
echo &quot;nameserver 2001:4860:4860::8888&quot; &gt; /etc/resolv.conf
echo &quot;nameserver 2001:4860:4860::8844&quot; &gt;&gt; /etc/resolv.conf

if [ -f /etc/systemd/resolved.conf ]; then
    echo &quot;[Resolve]&quot; &gt;&gt; /etc/systemd/resolved.conf
    echo &quot;DNS=2001:4860:4860::8888&quot; &gt;&gt; /etc/systemd/resolved.conf
    echo &quot;DNS=2001:4860:4860::8844&quot; &gt;&gt; /etc/systemd/resolved.conf
    systemctl restart systemd-resolved
fi

echo &quot;DNS successfully changed to Google IPv6 DNS&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Of course, if you have a normal dual-stack environment, you probably won&apos;t encounter this issue.&lt;/p&gt;
&lt;h3&gt;Second Issue: Password Setup&lt;/h3&gt;
&lt;p&gt;I found this particularly interesting: when the script first runs, it prompts you to enter a password, but this password is not the one you&apos;ll use to log into Windows! Despite the script&apos;s README mentioning this, I missed it.&lt;/p&gt;
&lt;p&gt;In fact, the Windows login password is determined by the image. For TeddySun&apos;s image that I used:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Username: Administrator&lt;/li&gt;
&lt;li&gt;Password: Teddysun.com&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Third Issue: Windows IPv6 Privacy Protection&lt;/h3&gt;
&lt;p&gt;This problem puzzled me for a long time. If you run &lt;code&gt;ipconfig /all&lt;/code&gt; on a Windows computer, you might notice something called &quot;temporary address.&quot; This is because Windows &quot;protects your online privacy,&quot; but in my environment with only a /128 IPv6 allocation, this became a problem: external access to your machine is through that fixed IP address, but your machine accesses external websites using a temporary address. This means you can connect via Remote Desktop but can&apos;t access the internet.&lt;/p&gt;
&lt;p&gt;The solution is simple - open Command Prompt as administrator:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-cmd&quot;&gt;netsh interface ipv6 set privacy state=disabled
:: Then restart the network adapter
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Fourth Issue: Workarounds for Pure IPv6 Environment&lt;/h3&gt;
&lt;p&gt;This issue also stems from my special network environment. Not having IPv4 access is quite inconvenient, so I used Cloudflare WARP to provide IPv4 access. However, note that if you directly use the Windows version of WARP, after enabling it, your IPv6 address will also change to WARP&apos;s address, preventing you from connecting to Remote Desktop!&lt;/p&gt;
&lt;p&gt;I used a solution provided by a user on the Nodeseek forum (&lt;a href=&quot;https://www.nodeseek.com/post-128008-1&quot;&gt;original post&lt;/a&gt;):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Download and install the official CloudFlare WARP client&lt;/li&gt;
&lt;li&gt;In WARP settings:
&lt;ul&gt;
&lt;li&gt;Click the gear icon in the bottom right → Preferences&lt;/li&gt;
&lt;li&gt;Advanced → Configure Proxy Mode&lt;/li&gt;
&lt;li&gt;Enable proxy mode and set a memorable port&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This effectively gives you a locally available Cloudflare-provided IPv4 exit socks proxy, which you can use however you like - with SwitchyOmega or other tools, configure as you prefer. This way, you can maintain Remote Desktop connections while gaining IPv4 access.&lt;/p&gt;
&lt;h3&gt;Fifth Issue: LTSC Minor Problem&lt;/h3&gt;
&lt;p&gt;If you chose the LTSC 2021 version of Windows like I did, you might notice that the &lt;code&gt;wsappx&lt;/code&gt; service is always running in the background. This issue has a solution on the PCbeta forum; if you&apos;re interested, check out this post: &lt;a href=&quot;https://bbs.pcbeta.com/viewthread-1912382-1-1.html&quot;&gt;LTSC Optimization Guide&lt;/a&gt;&lt;/p&gt;</content:encoded><h:img src="undefined"/><enclosure url="undefined"/></item><item><title>Caddy Configuration for Typecho — Revisited</title><link>https://20051110.xyz/blog/caddy-typecho</link><guid isPermaLink="true">https://20051110.xyz/blog/caddy-typecho</guid><description>I decided to stop using pre-built Docker images and instead manually configure PHP + Caddy + Typecho.</description><pubDate>Tue, 14 Jan 2025 17:12:00 GMT</pubDate><content:encoded>&lt;p&gt;It seems the very first post on this blog showed how to set up Caddy, but at that time I used someone else&apos;s Docker image which bundled nginx, PHP, and Typecho, and I simply reverse-proxied Caddy to that port.&lt;/p&gt;
&lt;p&gt;Now I rented a server on Alibaba Cloud with only 512MB RAM, which is a bit tight. To avoid the extra memory overhead of nginx and Docker, I decided to hand-build the Typecho environment.&lt;/p&gt;
&lt;h2&gt;Install the world&apos;s best programming language&lt;/h2&gt;
&lt;h3&gt;Add the Sury PPA repository&lt;/h3&gt;
&lt;p&gt;First, add the PPA that contains the latest PHP packages. You need to install some prerequisite packages.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo apt update
sudo apt install lsb-release apt-transport-https ca-certificates software-properties-common -y
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After installing the tools, import the Sury GPG key. Sury provides almost every PHP version. Typecho requires PHP &gt; 7.4, so we&apos;ll install 8.2.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo wget -O /etc/apt/trusted.gpg.d/php.gpg https://packages.sury.org/php/apt.gpg
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then add the repository to your sources list.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo sh -c &apos;echo &quot;deb https://packages.sury.org/php/ $(lsb_release -sc) main&quot; &gt; /etc/apt/sources.list.d/php.list&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Update the package list to verify it&apos;s working.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo apt update
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Install PHP 8.2 packages&lt;/h3&gt;
&lt;p&gt;Install PHP 8.2 and common extensions.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo apt install php8.2 php8.2-cli php8.2-fpm php8.2-mysql php8.2-curl php8.2-gd php8.2-mbstring php8.2-xml php8.2-zip php8.2-opcache php8.2-sqlite3 -y
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Install Caddy v2&lt;/h2&gt;
&lt;p&gt;I found many guides using Caddy v1 plus special rewrite rules, but one important upgrade in Caddy v2 is that you don&apos;t need extra rewrite rules for typical setups — v2 is simply more convenient. Don&apos;t try to force old patterns.&lt;/p&gt;
&lt;p&gt;Caddy provides an official script; this is what I recommend:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo apt install -y debian-keyring debian-archive-keyring apt-transport-https
curl -1sLf &apos;https://dl.cloudsmith.io/public/caddy/stable/gpg.key&apos; | sudo gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf &apos;https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt&apos; | sudo tee /etc/apt/sources.list.d/caddy-stable.list
sudo apt update
sudo apt install caddy
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you need extra plugins (for example, DNS providers), you can use xcaddy to build your own binary, but that&apos;s out of scope for this post.&lt;/p&gt;
&lt;h2&gt;Configure the Caddyfile&lt;/h2&gt;
&lt;p&gt;Again: I prefer you learn to use the Caddyfile rather than the JSON config.&lt;/p&gt;
&lt;p&gt;The following Caddyfile has been tested — paste and use it as-is:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;YOUR WEBSITE {
            encode gzip
            log
            tls YOUR EMAIL
            header Strict-Transport-Security max-age=31536000
            root * /var/www/YOUR WEBSITE
            php_fastcgi unix//run/php/php8.2-fpm.sock
            file_server
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Do you now appreciate Caddy v2&apos;s convenience? You don&apos;t need to configure php-fpm details or rewrite rules — it&apos;s basically ready out of the box.&lt;/p&gt;
&lt;p&gt;Everything is self-explanatory, but remember to replace &lt;code&gt;YOUR WEBSITE&lt;/code&gt; and &lt;code&gt;YOUR EMAIL&lt;/code&gt; with your actual domain and email address. If you want to read more about Caddyfile options, check the &lt;a href=&quot;https://caddyserver.com/docs/caddyfile&quot;&gt;official documentation&lt;/a&gt;, &quot;it is amazingly easy to read&quot; (my peer who studies Medical Sciences has said so).&lt;/p&gt;
&lt;h2&gt;Final step: add your Typecho site files&lt;/h2&gt;
&lt;p&gt;Download the latest Typecho release with wget:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;wget https://github.com/typecho/typecho/releases/latest/download/typecho.zip
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Unzip Typecho into /var/www:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Make sure /var/www exists:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo mkdir -p /var/www
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then create your site directory; for example, if your site is 20051110.xyz, create:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo mkdir /var/www/20051110.xyz
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Unzip typecho.zip into /var/www/your-site-directory:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cd /var/www/your-site-directory
sudo unzip /root/typecho.zip # remember to replace with the actual download location of Typecho
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Final step: change file ownership of /var/www and its subdirectories to the www-data user and group (the user typically used by web servers):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo chown -R www-data:www-data /var/www
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can check that your directory structure looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;/var/www/your-site
├── admin/
├── install/
├── usr/
├── var/
├── index.php
├── install.php
└── ...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If everything is correct, start Caddy:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;caddy run --config=Caddyfile # I ran this because my Caddyfile is in the same directory
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Caddy will automatically obtain certificates for you. Once that&apos;s done, visit the site and proceed with the Typecho installation (don&apos;t worry — it&apos;s a GUI, just click through).&lt;/p&gt;
&lt;p&gt;Thanks for reading! Hope this post helps.&lt;/p&gt;</content:encoded><h:img src="undefined"/><enclosure url="undefined"/></item><item><title>A Few Things About OpenWRT Compilation</title><link>https://20051110.xyz/blog/openwrt-compile</link><guid isPermaLink="true">https://20051110.xyz/blog/openwrt-compile</guid><description>I encountered some issues during OpenWRT compilation that are worth documenting.</description><pubDate>Sun, 20 Oct 2024 01:05:00 GMT</pubDate><content:encoded>&lt;p&gt;Let me answer a few questions first&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Why compile it myself? I&apos;m a mature computer science student (lol).&lt;/li&gt;
&lt;li&gt;Why use Github Actions? Because Github is really convenient. &lt;del&gt;I originally wanted to use my dedicated AMD 9950X as the build machine, but it failed after 15 minutes and I was too lazy to troubleshoot.&lt;/del&gt;&lt;/li&gt;
&lt;li&gt;Why can&apos;t I understand this? &lt;del&gt;If you don&apos;t understand, don&apos;t read it.&lt;/del&gt; Just download pre-compiled packages from others.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The following is based on the latest OpenWRT (23.05) + Github Actions online compilation&lt;/p&gt;
&lt;h2&gt;1. Preparation&lt;/h2&gt;
&lt;h3&gt;Clone the repository locally&lt;/h3&gt;
&lt;p&gt;First, you need to fork the LEDE source code: the &lt;a href=&quot;https://github.com/coolsnowwolf/lede&quot;&gt;LEDE repository on Github&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Clone the repository you just forked to your local machine:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/your-username/lede
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Don&apos;t download ZIP! The ZIP file is not a Git repository and doesn&apos;t contain the &lt;code&gt;.git&lt;/code&gt; folder, so you can&apos;t use &lt;code&gt;git&lt;/code&gt; commands on it.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;Update Feeds&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;cd lede
./scripts/feeds update -a
./scripts/feeds install -a
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you don&apos;t update the feeds, you won&apos;t see Luci apps later! &lt;strong&gt;This step is mandatory!&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;Enter the configuration menu&lt;/h3&gt;
&lt;p&gt;Use the following command to enter the configuration menu:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;make menuconfig
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configuration menu explanation&lt;/h3&gt;
&lt;p&gt;Generally, you only need to modify these:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Target System: Processor architecture&lt;/li&gt;
&lt;li&gt;Subtarget: Select processor&lt;/li&gt;
&lt;li&gt;Target Profile: Preconfigured profile&lt;/li&gt;
&lt;li&gt;LuCI: LuCI plugins
&lt;ul&gt;
&lt;li&gt;Applications: Applications&lt;/li&gt;
&lt;li&gt;Themes: Themes&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For example, I selected:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Target System: Mediatek-ARM&lt;/li&gt;
&lt;li&gt;Subtarget: Filogic&lt;/li&gt;
&lt;li&gt;Target Profile: ASR3000&lt;/li&gt;
&lt;li&gt;LuCI: LuCI plugins
&lt;ul&gt;
&lt;li&gt;Applications: Many fun plugins for you to explore!&lt;/li&gt;
&lt;li&gt;Themes: luci-theme-material&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;After making changes, select &lt;code&gt;Save&lt;/code&gt; to save as a &lt;code&gt;.config&lt;/code&gt; file.&lt;/p&gt;
&lt;p&gt;For LuCI plugins, please refer to &lt;a href=&quot;https://www.right.com.cn/forum/thread-3682029-1-1.html&quot;&gt;this article on the Enshan forum&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Commit to your forked repository&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Delete the &lt;code&gt;/.config&lt;/code&gt; line in the &lt;code&gt;.gitignore&lt;/code&gt; file to stop ignoring the config file.&lt;/strong&gt; Very important!!! Otherwise, the &lt;code&gt;.config&lt;/code&gt; file won&apos;t be included when you &lt;code&gt;commit&lt;/code&gt;!!!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Commit changes to GitHub:&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git add .
git commit -m &quot;upd: personal config&quot;
git push origin master
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;del&gt;The branch is called master, which has a master-servant flavor to it.&lt;/del&gt;&lt;/p&gt;
&lt;h2&gt;Pitfalls&lt;/h2&gt;
&lt;h3&gt;Enable WIFI before compilation&lt;/h3&gt;
&lt;p&gt;If you need to enable WIFI by default for easy management, I searched many tutorials online but they were useless, mostly from around 2015 with no reference value. I figured it out myself:
Go to the &lt;code&gt;package/lean/default-settings/files/&lt;/code&gt; directory, edit the file &lt;code&gt;zzz-default-settings&lt;/code&gt;
Comment out these two lines by adding &lt;code&gt;#&lt;/code&gt; at the beginning:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;sed -i &apos;/option disabled/d&apos; /etc/config/wireless
sed -i &apos;/set wireless.radio${devidx}.disabled/d&apos; /lib/wifi/mac80211.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Github Actions Compilation&lt;/h3&gt;
&lt;p&gt;Online guides still say &quot;submit a Release and it will automatically trigger Github Actions&quot; but that didn&apos;t work for me, so I needed to make some changes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;When using Github Actions for compilation, remember to go to the Workflow page and enable the Workflow, and also enable OpenWrt-CI (because Workflows in forked repositories are disabled by default)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Also modify the repository&apos;s &lt;code&gt;.github/workflows/openwrt-ci.yml&lt;/code&gt;, changing the &lt;code&gt;cron&lt;/code&gt; task at the beginning (line 10) to the following to allow manual workflow triggering:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;on:
  repository_dispatch:
  workflow_dispatch:
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It will take about two hours, &lt;del&gt;but what does that have to do with me since I&apos;m using Github&apos;s resources&lt;/del&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Modifying various miscellaneous settings&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Change the default theme:
&lt;pre&gt;&lt;code&gt;sed -i &quot;s/luci-theme-bootstrap/luci-theme-material/g&quot; feeds/luci/collections/luci/Makefile
&lt;/code&gt;&lt;/pre&gt;
(Nowadays people&apos;s aesthetics seem to prefer the argon theme; in any case, this should match what you installed in the &lt;code&gt;luci-themes&lt;/code&gt; section of your &lt;code&gt;.config&lt;/code&gt;.)
&lt;/li&gt;
&lt;li&gt;Add compiler information:
&lt;pre&gt;&lt;code&gt;sed -i &quot;s/OpenWrt /TheUnknownThing build $(TZ=UTC-8 date &quot;+%Y.%m.%d&quot;) @ OpenWrt /g&quot; package/lean/default-settings/files/zzz-default-settings
&lt;/code&gt;&lt;/pre&gt;
You probably don&apos;t want to keep &quot;TheUnknownThing&quot; as your builder name; change it to something else.
&lt;/li&gt;
&lt;li&gt;Modify the default management address:
The default management address is &lt;code&gt;192.168.1.1&lt;/code&gt;; if it conflicts with your upstream network segment, you can modify it:
&lt;pre&gt;&lt;code&gt;sed -i &apos;s/192.168.1.1/192.168.2.1/g&apos; package/base-files/files/bin/config_generate
&lt;/code&gt;&lt;/pre&gt;
This changes it to &lt;code&gt;192.168.2.1&lt;/code&gt;.
&lt;/li&gt;
&lt;/ul&gt;</content:encoded><h:img src="undefined"/><enclosure url="undefined"/></item><item><title>How to Elegantly Annotate PDFs with LaTeX</title><link>https://20051110.xyz/blog/latex-annotation-1</link><guid isPermaLink="true">https://20051110.xyz/blog/latex-annotation-1</guid><description>My professor shared a PDF Beamer file, and I want to add mathematical annotations to it. How can I do this elegantly?</description><pubDate>Wed, 09 Oct 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Let me cut to the chase—it&apos;s getting late and I need some sleep!&lt;/p&gt;
&lt;p&gt;The inspiration for this solution comes from &lt;a href=&quot;https://tex.stackexchange.com/questions/85651/is-there-are-way-to-annotate-pdfs-with-latex#:~:text=Okular%20can%20annotate%20PDFs%20nicely,%20e.g.&quot;&gt;Stackexchange&lt;/a&gt;. I&apos;ve tried several annotation software options before, but I really want to bring only my iPad to class. Using VNC or xrdp to remotely access Linux just feels clunky and has terrible latency. So I&apos;ve settled on using Overleaf + LaTeX with PDF page inclusion for annotations.&lt;/p&gt;
&lt;p&gt;The Stackexchange author provided this solution:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;\documentclass{article}
%\url{http://tex.stackexchange.com/q/85651/86}
\usepackage[svgnames]{xcolor}
\usepackage{pdfpages}
\usepackage{tikz}

\tikzset{
  every node/.style={
    anchor=mid west,
  }
}

\makeatletter
\pgfkeys{/form field/.code 2 args={\expandafter\global\expandafter\def\csname field@#1\expandafter\endcsname\expandafter{#2}}}

\newcommand{\place}[3][]{\node[#1] at (#2) {\csname field@#3\endcsname};}
\makeatother
\newcommand{\xmark}[1]{\node at (#1) {X};}

\begin{document}

\foreach \mykey/\myvalue in {
  ctsfn/{Defined in Week 1},
  metsp/{Defined in Week 3},
} {
  \pgfkeys{/form field={\mykey}{\myvalue}}
}

\includepdf[
  pages=1,
  picturecommand={%
    \begin{tikzpicture}[remember picture,overlay]
%%% The next lines draw a useful grid - get rid of them (comment them out) on the final version
    \draw[gray] (current page.south west) grid (current page.north east);
\foreach \k in {1,...,28} {
      \path (current page.south east) ++(-2,\k) node {\k};
}
\foreach \k in {1,...,20} {
      \path (current page.south west) ++(\k,2) node {\k};
}
%%% grid code ends here
\tikzset{every node/.append style={fill=Honeydew,font=\large}}
\place[name=ctsfn]{14cm,17cm}{ctsfn}
\place[name=metsp]{11cm,9cm}{metsp}
\draw[ultra thick,blue,-&gt;] (ctsfn) to[out=135,in=90] (9cm,17.3cm);
\draw[ultra thick,blue,-&gt;] (metsp) to[out=155,in=70] (6cm,9cm);
    \end{tikzpicture}
  }
]{tikzmark_example.pdf}

\end{document}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The original author&apos;s result:
&lt;img src=&quot;https://20051110.xyz/_astro/Z7cqd.DiFwxJ0H_Z4hInz.webp&quot; alt=&quot;Original author&amp;#x27;s result&quot;&gt;&lt;/p&gt;
&lt;p&gt;This immediately caught my eye because:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It has a coordinate grid, making annotation placement super convenient&lt;/li&gt;
&lt;li&gt;It&apos;s highly extensible—you can mix text and graphics, insert TikZ diagrams, mathematical formulas, you name it!&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;However, there were several issues to address:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The professor&apos;s Beamer slides are in landscape format, but this code produces portrait output&lt;/li&gt;
&lt;li&gt;The macro definitions are somewhat messy, and I don&apos;t need fancy connecting lines. Plus, the &lt;code&gt;includepdf&lt;/code&gt; call is too verbose and inelegant for repeated use&lt;/li&gt;
&lt;li&gt;The coordinate grid looks pretty ugly&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So here&apos;s how I solved these problems:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Fix the orientation&lt;/strong&gt;: Use &lt;code&gt;\usepackage[paperwidth=12cm, paperheight=16cm, landscape]{geometry}&lt;/code&gt; to make it landscape format.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Create a clean macro&lt;/strong&gt; to simplify &lt;code&gt;includepdf&lt;/code&gt; usage and support multiple annotations:&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;\newcommand{\includePDFWithAnnotations}[2]{
\includepdf[
  pages=#1,
  picturecommand={%
    \begin{tikzpicture}[remember picture,overlay]
    %%% The next lines draw a useful grid - get rid of them (comment them out) on the final version
    \draw[very thin, lightgray] (current page.south west) grid (current page.north east);
    \foreach \k in {0,...,11} {
      \path (current page.south east) ++(-0.55,\k + 0.2) node[font=\tiny] {\k};
    }
    \foreach \k in {0,...,14} {
      \path (current page.south west) ++(\k,0.2) node[font=\tiny] {\k};
    }
    %%% grid code ends here
    \tikzset{every node/.append style={fill=Honeydew,font=\huge}}
    % Iterate through annotation list and place annotations
    #2
    \end{tikzpicture}
  }
]{YOUR PDF NAME.pdf}
}
&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;&lt;strong&gt;Use the macro elegantly&lt;/strong&gt; to insert multiple annotations:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;\includePDFWithAnnotations{1}{
\place{5, 4}{$123avd$}
\place{7, 8}{$456xyz$}
}

\includePDFWithAnnotations{7}{
\place{5, 4}{$123avd$}
\place{7, 8}{$456xyz$}
}
&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;&lt;strong&gt;Improve the aesthetics&lt;/strong&gt;: Move the coordinate grid to the page edges, use tiny font size, make the lines thinner and lighter colored. Much more visually appealing!&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img src=&quot;https://20051110.xyz/_astro/1.Bn-d4_uE_ZkrUiy.webp&quot; alt=&quot;Final result&quot;&gt;&lt;/p&gt;
&lt;p&gt;Isn&apos;t that satisfying?&lt;/p&gt;
&lt;p&gt;Here&apos;s the complete TeX example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;\documentclass[UTF8]{ctexart}
\usepackage[svgnames]{xcolor}
\usepackage[paperwidth=12cm, paperheight=16cm, landscape]{geometry}
\usepackage{pdfpages}
\usepackage{tikz}
\usepackage{amsmath,amsfonts,amssymb,amsthm}

\tikzset{
  every node/.style={
    anchor=mid west,
  }
}

\makeatletter
\pgfkeys{/form field/.code 2 args={\expandafter\global\expandafter\def\csname field@#1\expandafter\endcsname\expandafter{#2}}}

\newcommand{\place}[2]{\node at (#1) {\large #2};}
\makeatother

\newcommand{\xmark}[1]{\node at (#1) {X};}

\newcommand{\NotePage}[2]{
  \includepdf[
    pages=#1,
    picturecommand={%
      \begin{tikzpicture}[remember picture,overlay]
      %%% The next lines draw a useful grid - get rid of them (comment them out) on the final version
      \draw[very thin, lightgray] (current page.south west) grid (current page.north east);
      \foreach \k in {0,...,11} {
        \path (current page.south east) ++(-0.45,\k + 0.2) node[font=\tiny] {\k};
      }
      \foreach \k in {0,...,14} {
        \path (current page.south west) ++(\k,0.2) node[font=\tiny] {\k};
      }
      \place{0,11.25}{Page #1}
      %%% grid code ends here
      \tikzset{every node/.append style={fill=Honeydew,font=\huge}}
      #2
      \end{tikzpicture}
    }
  ]{LA14.pdf}
}

\begin{document}

\NotePage{1}{
  \place{1,4.5}{That is because $\det{A} = \det{A^\top}$}
}
\NotePage{2}{}

\end{document}
&lt;/code&gt;&lt;/pre&gt;</content:encoded><h:img src="undefined"/><enclosure url="undefined"/></item><item><title>Use lsyncd for Real-Time File Synchronization</title><link>https://20051110.xyz/blog/rsync-lsyncd</link><guid isPermaLink="true">https://20051110.xyz/blog/rsync-lsyncd</guid><description>The limitations of rsync + inotify forced us to look for better solutions.</description><pubDate>Sun, 06 Oct 2024 10:56:00 GMT</pubDate><content:encoded>&lt;p&gt;Ever since I became an mjj (server hoarder), I’ve accumulated a lot of VPSs—I just can’t resist buying more. But to keep my blog data synchronized across multiple servers, I’ve put in quite a bit of effort. I got tired of using cron to automatically package my blog directory and then manually back it up. Ultimately, it’s laziness—I want a fully automated solution. Since my Typecho blog is deployed via Docker images, I figured I’d go the extra mile and tackle multi-end data synchronization, so that any change on one site is reflected on all sites.&lt;/p&gt;
&lt;p&gt;When it comes to synchronizing files between multiple servers, rsync is a commonly used tool. It achieves efficient directory synchronization through incremental transfers, compression, and deletion operations. However, the classic working mode of rsync is “manual or scheduled trigger,” which falls short for scenarios requiring real-time synchronization.&lt;/p&gt;
&lt;h2&gt;How rsync Works&lt;/h2&gt;
&lt;p&gt;rsync compares the differences between the source and target directories and only transfers changed files or data blocks, reducing bandwidth usage. This method is ideal for backing up and synchronizing large amounts of data, especially in bandwidth-constrained environments. However, rsync usually needs to be triggered manually or via scheduled tasks (like cron). For applications that require real-time updates, this approach leads to data lag and resource waste.&lt;/p&gt;
&lt;h2&gt;The Shortcomings of rsync + inotify&lt;/h2&gt;
&lt;p&gt;To address real-time synchronization, you can use inotify to monitor file system changes and trigger rsync when changes occur. However, this approach has several obvious drawbacks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;inotify requires additional scripts to work with rsync, increasing system complexity.&lt;/li&gt;
&lt;li&gt;This solution is usually one-way and cannot achieve multi-source real-time synchronization, which goes against my goals.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;The Advantages of lsyncd&lt;/h2&gt;
&lt;p&gt;To solve the above problems, lsyncd combines inotify’s real-time monitoring with rsync’s efficient transfer capabilities, providing a simple yet powerful solution for real-time synchronization. The advantages of lsyncd include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;lsyncd can handle complex real-time synchronization tasks with a simple configuration file, eliminating the need for extra scripts.&lt;/li&gt;
&lt;li&gt;It supports one-way synchronization between multiple servers, ensuring that data on every server is up to date. &lt;em&gt;Note: lsyncd does not natively support true bidirectional or multi-master sync with conflict resolution. But it doesn&apos;t matter in my case because I only need one-way sync.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step Guide to Configuring lsyncd&lt;/h2&gt;
&lt;p&gt;Here’s how to use lsyncd for real-time synchronization:&lt;/p&gt;
&lt;h3&gt;Install lsyncd and rsync:&lt;/h3&gt;
&lt;p&gt;On all servers involved in synchronization, run the following command to install the necessary tools:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;sudo apt-get install lsyncd rsync
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configure lsyncd:&lt;/h3&gt;
&lt;p&gt;On each server, create the configuration file &lt;code&gt;/etc/lsyncd.conf&lt;/code&gt; with the following content:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-lua&quot;&gt;settings {
    logfile = &quot;/var/log/lsyncd/lsyncd.log&quot;,
    statusFile = &quot;/var/log/lsyncd/lsyncd.status&quot;,
    inotifyMode  = &quot;CloseWrite or Modify&quot;,
    maxProcesses = 1,
    -- nodaemon = true,
}

sync {
    default.rsyncssh,
    source = &quot;/var/www&quot;,
    targetdir = &quot;/var/www&quot;,
    host = &quot;45.*.*.*&quot;,
    delete = true,
    rsync = {
        binary = &quot;/usr/bin/rsync&quot;,
        archive = true,
        compress = true,
        verbose = true,
    },
    delay = 1,
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;source&lt;/code&gt;: The local directory to monitor, &lt;code&gt;/var/www/&lt;/code&gt; (replace with your own).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;host&lt;/code&gt;: The remote target server (excluding itself).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;targetdir&lt;/code&gt;: The remote target directory, &lt;code&gt;/var/www/&lt;/code&gt; (replace with your own).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;delay&lt;/code&gt;: Sets the synchronization delay (in seconds) to prevent excessive syncing during frequent changes.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;delete&lt;/code&gt;: Deletes files on the target server that have been deleted on the source server.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Note: When using &lt;code&gt;rsyncssh&lt;/code&gt;, &lt;code&gt;maxProcesses&lt;/code&gt; must be 1. If using &lt;code&gt;rsync&lt;/code&gt;, you can set a higher value (e.g., 5).&lt;/p&gt;
&lt;p&gt;Tip: For troubleshooting, it’s recommended to start with &lt;code&gt;lsyncd /etc/lsyncd.conf&lt;/code&gt; to check for errors. Also, make sure to create the log directory first: &lt;code&gt;mkdir -p /var/log/lsyncd&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;One more thing: To allow servers to log in to each other without a password, you need to set up SSH key-based authentication.&lt;/p&gt;
&lt;p&gt;To automate real-time synchronization, ensure the source server can log in to the target server via SSH without a password.&lt;/p&gt;
&lt;p&gt;On the source server, generate an SSH key:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;ssh-keygen -t ed25519
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Follow the prompts; usually, you don’t set a passphrase.&lt;/p&gt;
&lt;p&gt;Copy the public key to the target server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;ssh-copy-id user@target-server
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This copies the generated public key to the target server, enabling passwordless login. Note: You need to configure this on both servers if you want mutual access.&lt;/p&gt;
&lt;p&gt;Once configured, start verification.&lt;/p&gt;
&lt;h3&gt;Start lsyncd:&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;lsyncd /etc/lsyncd.conf
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Verify the Configuration:&lt;/h3&gt;
&lt;p&gt;Perform file operations in the &lt;code&gt;/var/www/&lt;/code&gt; directory on any server and check synchronization on the others.&lt;/p&gt;
&lt;h2&gt;Handling Conflicts&lt;/h2&gt;
&lt;p&gt;When multiple servers modify the same file at the same time, conflicts may occur. However, my use case probably won’t encounter conflicts, so I’m leaving it as is :D&lt;/p&gt;</content:encoded><h:img src="undefined"/><enclosure url="undefined"/></item><item><title>Disabling Adobe Acrobat&apos;s OCR Feature</title><link>https://20051110.xyz/blog/delete-acrobat-ocr</link><guid isPermaLink="true">https://20051110.xyz/blog/delete-acrobat-ocr</guid><description>Acrobat’s OCR really annoys me. Every time I edit a PDF, it freezes for a moment. So I’m just going to disable it once and for all.</description><pubDate>Wed, 11 Sep 2024 19:11:00 GMT</pubDate><content:encoded>&lt;p&gt;Acrobat’s OCR really annoys me. Every time I edit a PDF, it freezes for a moment—I have to wait for the current page’s OCR to finish before I can turn off automatic text recognition. So now, I’m just going to disable it once and for all. Go to this directory:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;C:\Program Files (x86)\Adobe\Acrobat DC\Acrobat\plug_ins&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Do you see &quot;PaperCapture&quot; there? Just rename it to &quot;PaperCapture_disabled&quot; and you&apos;re done.&lt;/p&gt;</content:encoded><h:img src="undefined"/><enclosure url="undefined"/></item><item><title>方正书版 (Founder BookMaker) 10.0: From Installation to Getting Started</title><link>https://20051110.xyz/blog/founder-book-10</link><guid isPermaLink="true">https://20051110.xyz/blog/founder-book-10</guid><description>方正书版 (Founder BookMaker) 10.0: from installation to getting started</description><pubDate>Sat, 16 Mar 2024 21:05:00 GMT</pubDate><content:encoded>&lt;p&gt;I finally got 方正书版 (Founder BookMaker) 10.0 working after spending a whole afternoon on it today.&lt;/p&gt;
&lt;h2&gt;1. 安装PDF Creator&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;方正PDFCreator 3.0&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;重要提示：请务必在系统装完后第一时间就安装字库和PDF Creator（虽然我不信这个邪，但是确实这样会少很多乱七八糟字体的干扰或者你从另外什么地方安装了字库的干扰）&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;安装顺序如下：&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Install PDFCreator3108;&lt;/li&gt;
&lt;li&gt;Copy the crack files over the files in the installation directory, C:\Program Files\Founder\PDFCreator\Bin;&lt;/li&gt;
&lt;li&gt;Import the registry file (adjust the drive letter and path according to where you installed it). Only import this registry file if you have NOT installed the RIP software; its purpose is to “trick” the system into believing the RIP software is installed;&lt;/li&gt;
&lt;li&gt;First install the CID 5.01 (748_GB) font library, “方正CID V5.00 (full set)”: serial number 000000000, installation password 42C2D35B4735036B, font password 5918347506891A57 (the same for both GBK and GB/748!). Then install the CID 5.0 (GBK) font library: serial number 000000000, installation password ce9d84241294e529, font password 2e4965af7e74ad68. When installing the font libraries, choose “方正世纪RIP” (that option never popped up for me, but the installation still completed fine);&lt;/li&gt;
&lt;li&gt;The font path is C:\Program Files\Founder\PDFCreator\Font; at this point a Fonts directory should be generated at C:\Program Files\Founder\PDFCreator\Fonts. (It may not be generated! In that case you need another file to help out. Installing the two font libraries usually produces a FONTS folder, but on plenty of machines it simply never appears, and then some basic back-end fonts won’t be recognized, so you have to borrow the fonts from PSPNT’s FONTS folder to fill the gap.)&lt;/li&gt;
&lt;li&gt;Open PDFCreator 3108;&lt;/li&gt;
&lt;li&gt;Reset the fonts: the font path is C:\Program Files\Founder\PDFCreator\Resource\CIDfont, and the TrueType font path is C:\WINDOWS\Fonts;&lt;/li&gt;
&lt;li&gt;While PDFCreator is resetting the font library, do not click on it or touch it at all, or it will hang and you’ll have to reboot;&lt;/li&gt;
&lt;li&gt;It reported a little over 1,100 fonts installed successfully for me, and in the end it could output PDFs normally.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;2. Install 书版10.0&lt;/h2&gt;
&lt;p&gt;Not much to say here: after the installation finishes, copy the patched files into the installation directory.&lt;/p&gt;
&lt;p&gt;Note: at this step, install only 书版10.0 itself plus 女娲补字 (Founder’s tool for patching in missing characters), and nothing else, including any of the bundled fonts, to avoid problems later.&lt;/p&gt;
&lt;h2&gt;3. PS Output Settings&lt;/h2&gt;
&lt;p&gt;When outputting PS/EPS from the FBD file, open “Options” in the lower left. My approach here is fairly brute-force: I checked “All installed” for both the back-end 748 font library and the back-end GBK font library. In my testing, as long as your PDFCreator is set up properly, this is enough to output PDFs normally. Before outputting, it’s worth grabbing a known-good PS file from the internet and testing with it first, so you don’t blame PDFCreator for a problem in your own PS file.&lt;/p&gt;
&lt;h2&gt;4. Issues When Converting Word Files to FBD Proofs&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;书版10.0’s built-in doc conversion is known to be broken; don’t use it.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When using the converter that an expert on the internet developed, the proof output has a few issues, summarized below:&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you’re converting Word to FBD with version 6.0 of the tool, leave every MathType formula in the doc file exactly as it is; don’t convert them the way the online tutorials tell you to. If you’re on version 5.6, then do follow the online tutorials.&lt;/p&gt;
&lt;p&gt;After the MathType conversion, math notation such as sin, cos, &lt;code&gt;π&lt;/code&gt;, ln, and so on comes out italic and has to be corrected by hand. I wrote a regex for VSCode you can use as a reference (Ctrl+F find-and-replace, with regex mode enabled). (The replacement text looks garbled here, but what you actually paste in is that circled-z control character.)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;In the Find box:    (cos|sin|tan|π|lim|ln|i)
In the Replace box: $1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Likewise, where a radical sign and a character end up crammed together, you need to insert a circled 1/2 before &lt;code&gt;〖KF(〗&lt;/code&gt;, which can also be done with find-and-replace.&lt;/p&gt;
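&lt;p&gt;A sketch of that replacement (the circled character itself doesn’t survive this page’s encoding, so the bracketed placeholder below has to be swapped for the real symbol when you paste it in):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;In the Find box:    〖KF(〗
In the Replace box: [circled 1/2]〖KF(〗
&lt;/code&gt;&lt;/pre&gt;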
&lt;p&gt;For typesetting the options of multiple-choice questions, I wrote a small Python program; you only need to set up the WB for the first question yourself. It does the following:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Adds 〖ZK(〗 at the beginning and 〖ZK)〗 at the end (ZK plus a line break)&lt;/li&gt;
&lt;li&gt;Replaces each paragraph-break symbol, except the first, with 〖DW1〗 through 〖DW3〗&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;import pyperclip

# The paragraph-break symbol from the original post was lost in this feed&apos;s
# encoding; swap the placeholder below for the actual character before running.
BREAK = &apos;&lt;BREAK&gt;&apos;

def transform_text(input_text):
    # Wrap the whole block in 〖ZK(〗 ... 〖ZK)〗
    transformed_text = &apos;〖ZK(〗&apos; + input_text + &apos;〖ZK)〗&apos;

    # Replace every paragraph-break symbol except the first with 〖DW1〗, 〖DW2〗, ...
    parts = transformed_text.split(BREAK)
    transformed_text = BREAK.join(parts[:2])  # keep the first break symbol as-is
    for i, part in enumerate(parts[2:], 1):   # number the remaining breaks
        transformed_text += &apos;〖DW&apos; + str(i) + &apos;〗&apos; + part

    return transformed_text

# Read the source text from the clipboard, transform it, and copy it back
original_text = pyperclip.paste()
transformed_text = transform_text(original_text)

print(transformed_text)
pyperclip.copy(transformed_text)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And with that, go and enjoy the start of your typesetting career!&lt;/p&gt;</content:encoded></item><item><title>Starting to Write Again</title><link>https://20051110.xyz/blog/hello-world</link><guid isPermaLink="true">https://20051110.xyz/blog/hello-world</guid><description>It probably began with a sudden inspiration during the winter break, preparing to take care of my blog again.</description><pubDate>Sat, 09 Mar 2024 22:44:00 GMT</pubDate><content:encoded>&lt;p&gt;It probably began with a sudden inspiration during the winter break: I was getting ready to take care of my blog again.&lt;/p&gt;
&lt;p&gt;The last time I seriously ran a blog was perhaps in middle school. Back then, I thought having a blog on Blogger was cool; having your own space for your thoughts on the internet felt great and trendy. I guess it&apos;s not like that anymore?&lt;/p&gt;
&lt;p&gt;CNBlogs urges everyone to turn off their ad blockers, the atmosphere on CSDN keeps getting worse, and bloggers have turned into vloggers. Why am I starting to take care of my blog again at a time like this?&lt;/p&gt;
&lt;p&gt;I didn&apos;t expect to encounter so many difficulties deploying Typecho... My previous setup was Nginx+MySQL+BaoTa, so you can understand that BaoTa had taken care of everything; all I needed to do was click the deployment button and it was ready to go.&lt;/p&gt;
&lt;p&gt;So maybe the first thing to do to become a &lt;em&gt;True&lt;/em&gt; Blogger was to put all the pieces together myself.&lt;/p&gt;
&lt;p&gt;I dropped Nginx for Caddy2 (isn&apos;t this just asking for trouble? After days of searching I couldn&apos;t find a single working URL-rewrite configuration online!). Of course I thought about giving up (php-fpm was configured fine, MySQL was configured fine, the Caddy rewrite looked fine and the homepage was accessible, yet articles wouldn&apos;t open? The login page loaded but I couldn&apos;t log in?). In the end, the almighty Docker solved the problem, and I think the setup is worth pasting here for backup:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker run -d \
--name=typecho-blog \
--restart always \
--mount type=tmpfs,destination=/tmp \
-v /root/Typecho-Files:/data \
-e PHP_TZ=Asia/Shanghai \
-e PHP_MAX_EXECUTION_TIME=600 \
-p 127.0.0.1:9080:80 \
80x86/typecho:latest
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here I bound the container port to 127.0.0.1 only, rather than exposing it publicly, because I planned to use Caddy as a reverse proxy. When using Caddy as a reverse proxy, watch out for the two pitfalls I ran into (initially I could only access the homepage, and clicking on any content redirected me to localhost:9080, which was unreachable; it turned out I hadn&apos;t properly set X-Forwarded-Proto and X-Forwarded-Port):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Check Typecho&apos;s config.inc.php file to ensure that &lt;strong&gt;TYPECHO_SITE_URL&lt;/strong&gt; is set to your public domain.&lt;/li&gt;
&lt;li&gt;In the Caddy configuration, make sure to set the correct X-Forwarded-For and X-Forwarded-Proto headers so Typecho knows the actual request protocol and client IP.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Your Caddy configuration should look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;YOUR_DOMAIN_GOES_HERE {
    reverse_proxy http://localhost:9080 {
        header_up Host {host}
        header_up X-Forwarded-Host {host}
        header_up X-Forwarded-For {remote_host}
        header_up X-Forwarded-Proto {scheme}
    }
    tls YOUR_EMAIL_GOES_HERE
}
&lt;/code&gt;&lt;/pre&gt;
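&lt;p&gt;Once Caddy has reloaded, a quick sanity check of the proxy (a sketch; substitute your actual domain for the placeholder):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;curl -sI https://YOUR_DOMAIN_GOES_HERE | head -n 1
&lt;/code&gt;&lt;/pre&gt;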
&lt;p&gt;Great, I finally have my own blog again. Hope I can write more in the future.&lt;/p&gt;</content:encoded></item></channel></rss>