Linux Kernel Source Code Scenario Analysis (2), Part 3 of 3

Linux Kernel Source Code Scenario Analysis (2), Part 2 of 3: https://developer.aliyun.com/article/1597957

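5. The exit() and wait4() system calls

Almost all of the code for the exit() and wait4() system calls lives in kernel/exit.c. Unless stated otherwise, the code quoted below comes from that file.

We start with the implementation of exit().

(1) sys_exit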

// kernel/exit.c
asmlinkage long sys_exit(int error_code)
{
  do_exit((error_code&0xff)<<8);
}

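// First, note the NORET_TYPE annotation in front of the return type void. include/linux/kernel.h
// defines NORET_TYPE as "/**/", so it has no effect on compilation; it only reminds the reader
// that once the CPU enters do_exit(), the current process ends its life partway through and never
// returns from this function. What "never returns" really means, and why, becomes clear after
// reading the code below.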
NORET_TYPE void do_exit(long code)
{
  struct task_struct *tsk = current;
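// For now, just note that since the CPU never returns from do_exit(), it never returns from
// sys_exit() either, and therefore never returns from the exit() system call. Only in this way
// can the call achieve "exit", that is, leaving the system. On the other hand, only a process
// (or thread) can exit. An interrupt service routine must never call do_exit(), directly or
// indirectly. So the function first checks in_interrupt(); if the call came from an interrupt
// service routine, something has gone badly wrong.
// How does the kernel know whether it is running in an interrupt service routine? See
// in_interrupt(), defined in include/asm-i386/hardirq.h and listed under ⑴ in_interrupt below.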
  if (in_interrupt())
    panic("Aiee, killing interrupt handler!");
  if (!tsk->pid)
    panic("Attempted to kill the idle task!");
  if (tsk->pid == 1)
    panic("Attempted to kill init!");
  tsk->flags |= PF_EXITING;
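// Before deciding to exit, the process may have armed its real-time timer, that is, linked the
// real_timer member of its task_struct into the kernel's timer queue. Now that the process is
// leaving the system, the timer no longer serves any purpose. Moreover, the task_struct is about
// to be torn down, and real_timer, being a member of it, cannot outlive it. The timer therefore
// has to be taken off the queue first, which is what del_timer_sync() does.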
  del_timer_sync(&tsk->real_timer);

fake_volatile:
#ifdef CONFIG_BSD_PROCESS_ACCT
  acct_process(code);
#endif
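// As one would expect, a process must release all of its resources before it ends its life and
// leaves the system. In do_fork(), in the previous section, we saw that the resources "inherited"
// from the parent include the address space, open files, the working directory, the signal
// handler table, and so on. Correspondingly, we find __exit_mm(), __exit_files(), __exit_fs()
// and exit_sighand() here. One kind of resource, however, is not inherited and so does not
// appear in do_fork(): the semaphores a process creates and uses from user space. They are a
// resource for inter-process communication, and any semaphore state still outstanding when
// exit() is called has to be undone as well.
// A simple rule applies: look at each member of the task_struct. If a member is a pointer, if a
// data structure or buffer is allocated for it in the kernel when the process is created or while
// it runs, and if that pointer is the only way to reach the data structure or buffer, then it
// must be freed; otherwise kernel memory "leaks".
// For example, the pointer sig points to the process's signal handler table. The table was
// allocated specifically for sig, and sig is the only way to reach it, so it must be freed. The
// pointer p_pptr, by contrast, points to the parent's task_struct, which was not allocated for
// the child's p_pptr and can be reached in other ways. That structure must not be freed, or
// every other pointer to it would be left dangling.
// As for user-space semaphores: when a process creates and uses semaphores, the kernel allocates
// buffers for two pointers in its task_struct, semundo and semsleeping (a sem_undo structure and
// a sem_queue structure; see "Inter-Process Communication"). These two pointers are the only
// way to reach those structures, so they must be released.
// The code of sem_exit() is in ipc/sem.c.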
  __exit_mm(tsk);

  lock_kernel();
  sem_exit();
  __exit_files(tsk);
  __exit_fs(tsk);
  exit_sighand(tsk);
  exit_thread();

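// If the current process is the leader of a session (current->leader is nonzero), the link
// between the whole session and its controlling terminal must also be cut and the tty released
// (note that the task_struct has a pointer tty to the controlling terminal).
// The code of disassociate_ctty() is in drivers/char/tty_io.c.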
  if (current->leader)
    disassociate_ctty(1);

  put_exec_domain(tsk->exec_domain);
  if (tsk->binfmt && tsk->binfmt->module)
    __MOD_DEC_USE_COUNT(tsk->binfmt->module);

  tsk->exit_code = code;
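// Next, the state of the current process is changed to TASK_ZOMBIE (in this listing the
// assignment is made inside exit_notify()), meaning that its life is over and it will never be
// scheduled again. The remains of the process still hold a minimum of resources: the two pages
// that contain its task_struct and its kernel stack. When are these two pages freed? The process
// does not free them itself, just as people do not cancel their own household registration on
// their deathbed. Instead it calls exit_notify() to inform its parent and lets the parent settle
// its affairs.
// Why this arrangement, rather than letting the exiting child take care of everything? There are
// two reasons. First, the child's task_struct still holds useful accounting data; letting the
// parent clean up allows that data to be merged into the parent's own statistics instead of
// being lost. Second, and perhaps more important, once the system is running multiple processes
// there must be a "current process" at every instant. As Chapter 3 showed, interrupt service
// routines and exception handlers use the kernel stack of the current process. If the child
// freed its task_struct and kernel stack before the scheduler switched to another process, there
// would be a gap, and an interrupt or exception arriving in that gap would cause trouble.
// Interrupts can be disabled, true, but exceptions cannot be prevented that way, and there are
// non-maskable interrupts as well. So the child's task_struct and kernel stack must be kept until
// another process has started running, and leaving the cleanup to the parent is a sensible
// arrangement. It also keeps the code simple; otherwise schedule() would have to handle several
// more special cases. The source of exit_notify() in exit.c is listed under ⑷ exit_notify below.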
  exit_notify();

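// Back in do_exit(), only one major task remains: schedule(), that is, process scheduling. As
// noted above, do_exit() does not return, and it is this call to schedule() that makes it so. In
// other words, this particular call to schedule() does not return. Under normal conditions a call
// to schedule() does return, only the return is delayed until the process is scheduled to run
// again. schedule() picks the most suitable process in the system according to certain criteria.
// That process may be the one already running or a different one; if it is different, a context
// switch takes place. The current process, although temporarily deprived of the CPU, keeps its
// runnable state (task->state is unchanged) and waits until it is picked again in a later
// schedule() (triggered by another process, or on the way back to user space after an interrupt
// brought the CPU into the kernel); it then resumes and returns from schedule().
// So when schedule() returns depends on when the scheduler picks the process again. Here,
// however, the current process's task->state is already TASK_ZOMBIE, which guarantees that
// schedule() never picks it again, so it is gone for good. From the CPU's point of view this call
// to schedule() does return, but into a different process; only from the current process's point
// of view does it not return. Even so, up to this point the current process merely cannot return
// because it is never picked; in theory the return is only postponed indefinitely, and its
// task_struct still exists. Only when the parent, having received the child's signal, comes to
// settle its affairs and frees the child's task_struct does the child finally vanish from the
// system. In our scenario the parent is already waiting in wait4().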
  schedule();
  BUG();
/*
 * In order to get rid of the "volatile function does return" message
 * I did this little loop that confuses gcc to think do_exit really
 * is volatile. In fact it's schedule() that is volatile in some
 * circumstances: when current->state = ZOMBIE, schedule() never
 * returns.
 *
 * In fact the natural way to do all this is to have the label and the
 * goto right after each other, but I put the fake_volatile label at
 * the start of the function just in case something /really/ bad
 * happens, and the schedule returns. This way we can try again. I'm
 * not paranoid: it's just that everybody is out to get me.
 */
  goto fake_volatile;
}

⑴ in_interrupt

// include/asm-i386/hardirq.h
/*
 * Are we in an interrupt context? Either doing bottom half
 * or hardware interrupt processing?
 */
#define in_interrupt() ({ int __cpu = smp_processor_id(); \
  (local_irq_count(__cpu) + local_bh_count(__cpu) != 0); })

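On a single-CPU system __cpu is always 0. Chapter 3 discussed handle_IRQ_event(); the calls to irq_enter() and irq_exit() at its entry and exit increment and decrement the counter local_irq_count[__cpu]. So whenever this counter is nonzero, the CPU is inside handle_IRQ_event(). Likewise, whenever local_bh_count[__cpu] is nonzero, the CPU is executing a bh function, which counts as interrupt context just like an interrupt service routine. Conversely, if the CPU is not in interrupt context, it must be in the context of some process (or thread). Process 0 and process 1, the idle process and the init process, are not allowed to exit, however, so do_exit() goes on to check the pid of the current process.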
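
The counter logic described above can be imitated in user space. The following is only a rough model, not kernel code: the array size MODEL_NR_CPUS and all model_* names are hypothetical, and a plain int stands in for the value smp_processor_id() would return.

```c
#include <assert.h>

#define MODEL_NR_CPUS 4  /* hypothetical CPU count, chosen for the model only */

/* Per-CPU nesting counters, standing in for local_irq_count / local_bh_count. */
static unsigned int model_irq_count[MODEL_NR_CPUS];
static unsigned int model_bh_count[MODEL_NR_CPUS];

/* Entering a hardware interrupt handler on `cpu` bumps its counter. */
void model_irq_enter(int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    model_irq_count[cpu]++;
}

/* Leaving the handler drops the counter again; it must have been entered. */
void model_irq_exit(int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    assert(model_irq_count[cpu] > 0);
    model_irq_count[cpu]--;
}

/* The same pair for bottom-half processing. */
void model_bh_enter(int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    model_bh_count[cpu]++;
}

void model_bh_exit(int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    assert(model_bh_count[cpu] > 0);
    model_bh_count[cpu]--;
}

/* Nonzero while either counter of `cpu` is nonzero, like in_interrupt(). */
int model_in_interrupt(int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    return model_irq_count[cpu] + model_bh_count[cpu] != 0;
}
```

Because the counters are kept per CPU, entering a handler on CPU 0 leaves model_in_interrupt(1) at zero, which is the property do_exit() relies on when it asks only about the CPU it is running on.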

⑵ sem_exit

// ipc/sem.c
/*
 * add semadj values to semaphores, free undo structures.
 * undo structures are not freed when semaphore arrays are destroyed
 * so some of them may be out of date.
 * IMPLEMENTATION NOTE: There is some confusion over whether the
 * set of adjustments that needs to be done should be done in an atomic
 * manner or not. That is, if we are attempting to decrement the semval
 * should we queue up and wait until we can do so legally?
 * The original implementation attempted to do this (queue and wait).
 * The current implementation does not do so. The POSIX standard
 * and SVID should be consulted to determine what behavior is mandated.
 */
void sem_exit (void)
{
  struct sem_queue *q;
  struct sem_undo *u, *un = NULL, **up, **unp;
  struct sem_array *sma;
  int nsems, i;

  /* If the current process was sleeping for a semaphore,
   * remove it from the queue.
   */
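// If the current process is sleeping while waiting on a semaphore (waiting to enter a critical
// section), the semsleeping pointer in its task_struct points to its entry in the wait queue.
// There is clearly no need to wait any longer, so the entry is unlinked from that queue. Then
// comes a for loop over the current process's semundo list: for each undo record it applies the
// registered adjustments (semadj) to the semaphore set, then calls update_queue() so that
// processes queued on those semaphores get another chance to proceed, and finally frees the
// record. Readers are advised to come back and read this code on their own after studying the
// relevant parts of "Inter-Process Communication".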
  if ((q = current->semsleeping)) {
    int semid = q->id;
    sma = sem_lock(semid);
    current->semsleeping = NULL;

    if (q->prev) {
      if(sma==NULL)
        BUG();
      remove_from_queue(q->sma,q);
    }
    if(sma!=NULL)
      sem_unlock(semid);
  }

  for (up = &current->semundo; (u = *up); *up = u->proc_next, kfree(u)) {
    int semid = u->semid;
    if(semid == -1)
      continue;
    sma = sem_lock(semid);
    if (sma == NULL)
      continue;

    if (u->semid == -1)
      goto next_entry;

    if (sem_checkid(sma,u->semid))
      goto next_entry;

    /* remove u from the sma->undo list */
    for (unp = &sma->undo; (un = *unp); unp = &un->id_next) {
      if (u == un)
        goto found;
    }
    printk ("sem_exit undo list error id=%d\n", u->semid);
    goto next_entry;
found:
    *unp = un->id_next;
    /* perform adjustments registered in u */
    nsems = sma->sem_nsems;
    for (i = 0; i < nsems; i++) {
      struct sem * sem = &sma->sem_base[i];
      sem->semval += u->semadj[i];
      if (sem->semval < 0)
        sem->semval = 0; /* shouldn't happen */
      sem->sempid = current->pid;
    }
    sma->sem_otime = CURRENT_TIME;
    /* maybe some queued-up processes were waiting for this */
    update_queue(sma);
next_entry:
    sem_unlock(semid);
  }
  current->semundo = NULL;
}

⑶ __exit_mm

// kernel/exit.c
/*
 * Turn us into a lazy TLB process if we
 * aren't already..
 */
static inline void __exit_mm(struct task_struct * tsk)
{
  struct mm_struct * mm = tsk->mm;

  mm_release();
  if (mm) {
    atomic_inc(&mm->mm_count);
    if (mm != tsk->active_mm) BUG();
    /* more a memory barrier than a real lock */
    task_lock(tsk);
    tsk->mm = NULL;
    task_unlock(tsk);
    enter_lazy_tlb(mm, current, smp_processor_id());
    mmput(mm);
  }
}

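The actual release of the address space is done by mmput() (in fork.c), whose code we read in the previous section. What deserves attention here is the call to mm_release(). As the fork() and execve() sections showed, when do_fork() is called with the CLONE_VFORK flag set, the parent sleeps until the child performs an up() on a semaphore, and only then returns to user space. The child must perform that up() when it gives up its user address space, so mm_release() is called here to do the up() and wake the sleeping parent. Its code was listed in the execve() section and is not repeated here.

Once the mm pointer in a process's task_struct is cleared to 0, the process no longer has a user address space.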

⑷ exit_notify

// kernel/exit.c
/*
 * Send signals to all our closest relatives so that they know
 * to properly mourn us..
 */
static void exit_notify(void)
{
  struct task_struct * p, *t;
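// Like people, a process can have a "birth parent" and a "foster parent". The task_struct has a
// pointer p_opptr to the "original parent", the birth parent, and another pointer p_pptr to the
// foster parent. When a process is created the two are the same, so both pointers point to the
// same parent process. While the process runs, however, p_pptr can change temporarily. This
// happens when one process traces another through the ptrace() system call: the traced
// process's p_pptr is set to point to the tracer, which temporarily becomes its foster parent,
// while its p_opptr is unchanged and still points to the birth parent. If a process dies before
// its children, the children have to be entrusted to some other process. To whom? If the current
// process is a thread, they go to the next thread in the same thread group, and the children's
// p_opptr is made to point to that thread. Otherwise they can only be handed to the system's
// init process, which thus plays the role of an orphanage.
// So even the "original parent" is not fixed forever. The reason is that pids and the pages used
// for task_struct structures are recycled, so keeping a record of the true original parent would
// be pointless and also technically difficult. Now that the current process is about to exit(),
// all of its children must be sent to the "orphanage"; otherwise nobody would be there to settle
// their affairs when they exit() in turn. That is the purpose of the call to
// forget_original_parent() that follows (exit.c).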
  forget_original_parent(current);
  /*
   * Check to see if any process groups have become orphaned
   * as a result of our exiting, and if they have any stopped
   * jobs, send them a SIGHUP and then a SIGCONT.  (POSIX 3.2.2.2)
   *
   * Case i: Our father is in a different pgrp than we are
   * and we were the only connection outside, so our pgrp
   * is about to become orphaned.
   */
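// After a user logs in, many different processes may be started, all using the same controlling
// terminal (or a window emulating one). Processes sharing a controlling terminal belong to the
// same session. In addition, a single shell command or program may start several processes; the
// command "ls | wc -l", for example, starts two processes at once, and these form a process
// group (sessions and groups are distinct concepts). Every session or process group has a
// leader, its earliest-created process, whose pid serves as the id of the session or group. If
// the current process and its parent are in the same session but in different process groups
// (the condition tested in the code below), and the current process is the only link between its
// group and its parent, then once the current process is gone the whole group becomes
// "orphaned". In that case, if the group has stopped jobs, POSIX 3.2.2.2 requires sending every
// process in the group a SIGHUP followed by a SIGCONT, which kill_pg() does.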
  t = current->p_pptr;
  
  if ((t->pgrp != current->pgrp) &&
      (t->session == current->session) &&
      will_become_orphaned_pgrp(current->pgrp, current) &&
      has_stopped_jobs(current->pgrp)) {
    kill_pg(current->pgrp,SIGHUP,1);
    kill_pg(current->pgrp,SIGCONT,1);
  }

  /* Let father know we died 
   *
   * Thread signals are configurable, but you aren't going to use
   * that to send signals to arbitary processes. 
   * That stops right now.
   *
   * If the parent exec id doesn't match the exec id we saved
   * when we started then we know the parent has changed security
   * domain.
   *
   * If our self_exec id doesn't match our parent_exec_id then
   * we have changed execution domain as these two values started
   * the same after a fork.
   *  
   */
  
  if(current->exit_signal != SIGCHLD &&
      ( current->parent_exec_id != t->self_exec_id  ||
        current->self_exec_id != current->parent_exec_id) 
      && !capable(CAP_KILL))
    current->exit_signal = SIGCHLD;


  /*
   * This loop does two things:
   *
   * A.  Make init inherit all the child processes
   * B.  Check to see if any process groups have become orphaned
   *  as a result of our exiting, and if they have any stopped
   *  jobs, send them a SIGHUP and then a SIGCONT.  (POSIX 3.2.2.2)
   */

  write_lock_irq(&tasklist_lock);
  current->state = TASK_ZOMBIE;
  do_notify_parent(current, current->exit_signal);
  while (current->p_cptr != NULL) {
    p = current->p_cptr;
    current->p_cptr = p->p_osptr;
    p->p_ysptr = NULL;
    p->ptrace = 0;

    p->p_pptr = p->p_opptr;
    p->p_osptr = p->p_pptr->p_cptr;
    if (p->p_osptr)
      p->p_osptr->p_ysptr = p;
    p->p_pptr->p_cptr = p;
    if (p->state == TASK_ZOMBIE)
      do_notify_parent(p, p->exit_signal);
    /*
     * process group orphan check
     * Case ii: Our child is in a different pgrp
     * than we are, and it was the only connection
     * outside, so the child pgrp is now orphaned.
     */
    if ((p->pgrp != current->pgrp) &&
        (p->session == current->session)) {
      int pgrp = p->pgrp;

      write_unlock_irq(&tasklist_lock);
      if (is_orphaned_pgrp(pgrp) && has_stopped_jobs(pgrp)) {
        kill_pg(pgrp,SIGHUP,1);
        kill_pg(pgrp,SIGCONT,1);
      }
      write_lock_irq(&tasklist_lock);
    }
  }
  write_unlock_irq(&tasklist_lock);
}

⑸ do_notify_parent

// kernel/signal.c
/*
 * Let a parent know about a status change of a child.
 */

void do_notify_parent(struct task_struct *tsk, int sig)
{
  struct siginfo info;
  int why, status;

  info.si_signo = sig;
  info.si_errno = 0;
  info.si_pid = tsk->pid;
  info.si_uid = tsk->uid;

  /* FIXME: find out whether or not this is supposed to be c*time. */
  info.si_utime = tsk->times.tms_utime;
  info.si_stime = tsk->times.tms_stime;

  status = tsk->exit_code & 0x7f;
  why = SI_KERNEL;  /* shouldn't happen */
  switch (tsk->state) {
  case TASK_STOPPED:
    /* FIXME -- can we deduce CLD_TRAPPED or CLD_CONTINUED? */
    if (tsk->ptrace & PT_PTRACED)
      why = CLD_TRAPPED;
    else
      why = CLD_STOPPED;
    break;

  default:
    if (tsk->exit_code & 0x80)
      why = CLD_DUMPED;
    else if (tsk->exit_code & 0x7f)
      why = CLD_KILLED;
    else {
      why = CLD_EXITED;
      status = tsk->exit_code >> 8;
    }
    break;
  }
  info.si_code = why;
  info.si_status = status;

  send_sig_info(sig, &info, tsk->p_pptr);
  wake_up_parent(tsk->p_pptr);
}

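The parameter tsk points to the task_struct of the process whose status changed (the current process, when exit_notify() reports its own exit). do_notify_parent() may only be called when that process is in state TASK_ZOMBIE (it is in exit()) or TASK_STOPPED (it is being traced). As the code shows, "parent" here means the foster parent, not the birth parent: the process that p_pptr points to, not the one p_opptr points to. Earlier, forget_original_parent() changed each child's p_opptr to point to child_reaper, while do_notify_parent() signals the process that p_pptr points to. Will those children then signal a parent that no longer exists when they exit() in turn? No: the while loop in exit_notify() subsequently sets each child's p_pptr equal to its p_opptr (the assignment p->p_pptr = p->p_opptr).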
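
The bit layout that do_notify_parent() decodes can be tried out in isolation. The sketch below copies the masks and shifts from sys_exit() and from the default branch of the switch above; the enum, the struct and both function names are hypothetical stand-ins for the kernel's CLD_* codes and siginfo fields, not kernel API.

```c
#include <assert.h>

/* Hypothetical stand-ins for CLD_EXITED, CLD_KILLED and CLD_DUMPED. */
enum model_reason { MODEL_EXITED, MODEL_KILLED, MODEL_DUMPED };

struct model_report {
    enum model_reason why;  /* corresponds to info.si_code   */
    long status;            /* corresponds to info.si_status */
};

/* What sys_exit() hands to do_exit(): the low 8 bits, moved up one byte. */
long model_encode_exit(int error_code)
{
    return (long)(error_code & 0xff) << 8;
}

/* The default branch of the switch in do_notify_parent(), for a dead task. */
struct model_report model_classify(long exit_code)
{
    struct model_report r;

    assert(exit_code >= 0);
    r.status = exit_code & 0x7f;     /* terminating signal number, if any */
    if (exit_code & 0x80) {
        r.why = MODEL_DUMPED;        /* bit 7 set: a core dump was written */
    } else if (exit_code & 0x7f) {
        r.why = MODEL_KILLED;        /* low 7 bits: killed by that signal */
    } else {
        r.why = MODEL_EXITED;        /* normal exit: status is the next byte */
        r.status = exit_code >> 8;
    }
    return r;
}
```

For instance, model_classify(model_encode_exit(3)) reports a normal exit with status 3, while an exit_code of 9 (the number of SIGKILL on Linux) is reported as killed by signal 9.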

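Processes are linked by kinship into a network of relations. Besides p_opptr and p_pptr, the pointers used are:

p_cptr points to a child; the c stands for "child". p_cptr is the counterpart of p_pptr. When a process has several children, p_cptr points to the "youngest", that is, the most recently created one.

p_ysptr points to the process's younger sibling; the y stands for "younger" and the s for "sibling".

p_osptr points to the process's older sibling; the o stands for "older".

Thus all children of the current process are linked through p_ysptr and p_osptr into a doubly linked list. The p_pptr of every process in the list points to the current process, while the current process's p_cptr points to the most recently created child. Interestingly, a child deals only with its foster parent, and the birth parent that p_opptr points to hardly seems to matter. Of course, besides this kinship list a process sits on other queues at the same time, so the task_struct contains further task_struct pointers, and the resulting network is far from simple. A process enters this network when it is created, in do_fork(), through SET_LINKS, which is defined in include/linux/sched.h.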
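
The sibling list can be modeled with a stripped-down structure. This is a sketch under stated assumptions: struct task_model and the model_* functions are hypothetical, and they reproduce only the p_cptr / p_osptr / p_ysptr updates visible in the while loop of exit_notify(), not the full SET_LINKS and REMOVE_LINKS macros, which also maintain other lists.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cut-down task_struct holding only the kinship pointers. */
struct task_model {
    int pid;
    struct task_model *p_pptr;   /* parent */
    struct task_model *p_cptr;   /* youngest child */
    struct task_model *p_ysptr;  /* younger sibling */
    struct task_model *p_osptr;  /* older sibling */
};

/* Link p in as the youngest child of parent. */
void model_set_links(struct task_model *p, struct task_model *parent)
{
    assert(p != NULL && parent != NULL);
    p->p_pptr = parent;
    p->p_ysptr = NULL;               /* nobody is younger than the newest */
    p->p_osptr = parent->p_cptr;     /* the previous youngest is now older */
    if (p->p_osptr)
        p->p_osptr->p_ysptr = p;
    parent->p_cptr = p;
}

/* Unlink p from its parent's list of children. */
void model_remove_links(struct task_model *p)
{
    assert(p != NULL && p->p_pptr != NULL);
    if (p->p_osptr)
        p->p_osptr->p_ysptr = p->p_ysptr;
    if (p->p_ysptr)
        p->p_ysptr->p_osptr = p->p_osptr;
    else
        p->p_pptr->p_cptr = p->p_osptr;  /* p was the youngest */
    p->p_osptr = NULL;
    p->p_ysptr = NULL;
}

/* Walk from the youngest child along p_osptr, as the scan in sys_wait4() does. */
int model_count_children(const struct task_model *parent)
{
    int n = 0;
    const struct task_model *p;

    for (p = parent->p_cptr; p; p = p->p_osptr)
        n++;
    return n;
}
```

After three calls to model_set_links() for children a, b and c in that order, the parent's p_cptr is c and following p_osptr visits c, b, a: the children are reached from youngest to oldest.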

⑹ disassociate_ctty

// drivers/char/tty_io.c
/*
 * This function is typically called only by the session leader, when
 * it wants to disassociate itself from its controlling tty.
 *
 * It performs the following functions:
 *  (1)  Sends a SIGHUP and SIGCONT to the foreground process group
 *  (2)  Clears the tty from being controlling the session
 *  (3)  Clears the controlling tty for all processes in the
 *    session group.
 *
 * The argument on_exit is set to 1 if called when a process is
 * exiting; it is 0 if called by the ioctl TIOCNOTTY.
 */
void disassociate_ctty(int on_exit)
{
  struct tty_struct *tty = current->tty;
  struct task_struct *p;
  int tty_pgrp = -1;

  if (tty) {
    tty_pgrp = tty->pgrp;
    if (on_exit && tty->driver.type != TTY_DRIVER_TYPE_PTY)
      tty_vhangup(tty);
  } else {
    if (current->tty_old_pgrp) {
      kill_pg(current->tty_old_pgrp, SIGHUP, on_exit);
      kill_pg(current->tty_old_pgrp, SIGCONT, on_exit);
    }
    return;
  }
  if (tty_pgrp > 0) {
    kill_pg(tty_pgrp, SIGHUP, on_exit);
    if (!on_exit)
      kill_pg(tty_pgrp, SIGCONT, on_exit);
  }

  current->tty_old_pgrp = 0;
  tty->session = 0;
  tty->pgrp = -1;

  read_lock(&tasklist_lock);
  for_each_task(p)
    if (p->session == current->session)
      p->tty = NULL;
  read_unlock(&tasklist_lock);
}

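So how, and when, is the link between a process and its controlling terminal first established? Clearly, when a child is created, the parent's task_struct is copied to the child, and the tty pointer is copied with it, so the child has the same controlling terminal as its parent. A child can, however, change its controlling terminal through the ioctl() system call, or close the current controlling terminal and then open another one. Before doing so it must call setsid() to create a new session and become that session's leader. When a session leader severs its link to the controlling terminal, every process in the session loses that link as well, so signals have to be sent to the processes concerned. From then on these processes have no controlling terminal and become "background processes".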

(2) sys_wait4

// kernel/exit.c
asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru)
{
  int flag, retval;
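// The parameter pid is normally the pid of one particular child (the loop below also accepts 0,
// -1 and other negative values, which select children by process group or select any child).
// First, DECLARE_WAITQUEUE allocates and initializes a wait_queue_t on the kernel stack of the
// current process. The relevant macros and data structures are defined in include/linux/wait.h.
//
// In other words, sys_wait4() starts by allocating a wait_queue_t named wait on the current
// process's kernel stack. Its task pointer points to the current process's task_struct, and both
// pointers in its list_head member task_list are NULL. (The book's text gives the first field as
// compiler_warning with the value 0x1234567; in the wait.h listing reproduced in this article the
// first field is flags, initialized to 0x0.) Because the structure lives on the kernel stack, it
// ceases to exist as soon as sys_wait4() returns. Its counterpart is the wait_queue_head_t named
// wait_chldexit in the process's task_struct, which serves as the head of this queue.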
  DECLARE_WAITQUEUE(wait, current);
  struct task_struct *tsk;

  if (options & ~(WNOHANG|WUNTRACED|__WNOTHREAD|__WCLONE|__WALL))
    return -EINVAL;
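// Then add_wait_queue() adds this structure (wait) to the current process's wait_chldexit queue.
// Why this is done becomes clear once do_notify_parent() is revisited below. After that the
// function enters a loop, and not a small one.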
  add_wait_queue(&current->wait_chldexit,&wait);
repeat:
  flag = 0;
  current->state = TASK_INTERRUPTIBLE;
  read_lock(&tasklist_lock);
  tsk = current;
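// This goto-based loop ends only when the current process is scheduled to run and one of the
// following conditions holds (see the "goto end_wait4" statements in the code):
//  - the awaited child's state has become TASK_STOPPED or TASK_ZOMBIE;
//  - the awaited child exists but is in neither of those states, and either the WNOHANG flag
//    is set in options or the current process has received some other signal;
//  - no process with the given pid exists, or it is not a child of the current process.
// Otherwise the current process sets its own state to TASK_INTERRUPTIBLE (right after the repeat
// label) and calls schedule() to sleep and let other processes run. When it is woken by a signal
// and scheduled again, it returns from schedule() and the "goto repeat" statement takes it back
// to repeat, where it scans its children again with a for loop to see whether the awaited child
// now satisfies the conditions. The for loop scans all children of a process, starting from the
// youngest and following the chain formed by the p_osptr pointers in the task_struct structures,
// looking for a child whose pid matches the one awaited or that meets the other selection
// criteria.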
  do {
    struct task_struct *p;
    for (p = tsk->p_cptr ; p ; p = p->p_osptr) {
      if (pid>0) {
        if (p->pid != pid)
          continue;
      } else if (!pid) {
        if (p->pgrp != current->pgrp)
          continue;
      } else if (pid != -1) {
        if (p->pgrp != -pid)
          continue;
      }
      /* Wait for all children (clone and not) if __WALL is set;
       * otherwise, wait for clone children *only* if __WCLONE is
       * set; otherwise, wait for non-clone children *only*.  (Note:
       * A "clone" child here is one that reports to its parent
       * using a signal other than SIGCHLD.) */
      if (((p->exit_signal != SIGCHLD) ^ ((options & __WCLONE) != 0))
          && !(options & __WALL))
        continue;
      flag = 1;
      switch (p->state) {
      case TASK_STOPPED:
        if (!p->exit_code)
          continue;
        if (!(options & WUNTRACED) && !(p->ptrace & PT_PTRACED))
          continue;
        read_unlock(&tasklist_lock);
        retval = ru ? getrusage(p, RUSAGE_BOTH, ru) : 0; 
        if (!retval && stat_addr) 
          retval = put_user((p->exit_code << 8) | 0x7f, stat_addr);
        if (!retval) {
          p->exit_code = 0;
          retval = p->pid;
        }
        goto end_wait4;
      case TASK_ZOMBIE:
        current->times.tms_cutime += p->times.tms_utime + p->times.tms_cutime;
        current->times.tms_cstime += p->times.tms_stime + p->times.tms_cstime;
        read_unlock(&tasklist_lock);
        retval = ru ? getrusage(p, RUSAGE_BOTH, ru) : 0;
        if (!retval && stat_addr)
          retval = put_user(p->exit_code, stat_addr);
        if (retval)
          goto end_wait4; 
        retval = p->pid;
        if (p->p_opptr != p->p_pptr) {
          write_lock_irq(&tasklist_lock);
          REMOVE_LINKS(p);
          p->p_pptr = p->p_opptr;
          SET_LINKS(p);
          do_notify_parent(p, SIGCHLD);
          write_unlock_irq(&tasklist_lock);
        } else
          release_task(p);
        goto end_wait4;
      default:
        continue;
      }
    }
    if (options & __WNOTHREAD)
      break;
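// The for loop is itself nested in a do-while loop. Why the outer loop? Because the current
// process may be a thread, and the awaited process may actually be a child of another thread
// cloned from the same process. The do-while loop therefore checks the children of every thread
// in the same thread_group. next_thread() finds the task_struct of the next thread in the
// thread_group list, and the local variable tsk is set to point to it.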
    tsk = next_thread(tsk);
  } while (tsk != current);
  read_unlock(&tasklist_lock);
  if (flag) {
    retval = 0;
    if (options & WNOHANG)
      goto end_wait4;
    retval = -ERESTARTSYS;
    if (signal_pending(current))
      goto end_wait4;
    schedule();
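// In our scenario, when the parent calls wait4() and scans its children for the first time, the
// child is still running, so the parent goes to sleep through schedule(). When the child calls
// exit(), it sends the parent a signal and thereby wakes it. How? As we saw, the child in
// exit_notify() signals its parent through do_notify_parent(). That function fills in a siginfo
// structure, calls send_sig_info() to deliver it to the parent, and then wakes the parent
// (through wake_up_parent() in the listing above; the book's text names wake_up_process()).
// send_sig_info() is covered in the signals section of "Inter-Process Communication". Waking
// the parent changes its state from TASK_INTERRUPTIBLE to TASK_RUNNING and moves it to the run
// queue, so that schedule() can "see" the parent and schedule it.
// When the parent is woken by the signal the child sends in exit(), it goes back to repeat in
// sys_wait4() and scans its children once more. This time the child's state is TASK_ZOMBIE, so
// the parent merges the child's user-mode and kernel-mode run times into its own statistics.
// Then, in the typical case, it calls release_task() to free what remains of the child: its
// task_struct and kernel stack (exit.c).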
    goto repeat;
  }
  retval = -ECHILD;
end_wait4:
  current->state = TASK_RUNNING;
  remove_wait_queue(&current->wait_chldexit,&wait);
  return retval;
}

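sys_wait4() has to handle one more special case: the child's p_opptr and p_pptr may differ, that is, its foster parent is not its birth parent. As described above, when a process exits, do_notify_parent() notifies its foster parent. But when the two differ, the birth parent may be waiting too. So REMOVE_LINKS detaches the child's task_struct from the foster parent's list, its p_pptr is set equal to p_opptr, and SET_LINKS hands it back to the birth parent by linking it into that parent's list. The birth parent is then sent a signal and left to deal with the child itself.

In addition, depending on what the caller of wait4() requested, some status and accounting information may have to be copied to user space with put_user(). If the copy fails, the child's task_struct cannot be freed for the time being (the "goto end_wait4" skips the call to release_task()). The child's "corpse" then stays in the system, and a user inspecting process states with the "ps" command will see a process in state "ZOMBIE". This is also why, as seen earlier in exit_notify(), a parent that hands over its children before ending its own life checks whether each child is in state TASK_ZOMBIE and, if so, calls do_notify_parent() on its behalf to signal the new foster parent.

With that, once release_task() has run, the child is finally gone from the system.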
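
The same exchange can be watched from user space. The following is a minimal sketch for a POSIX system, not part of the kernel source: run_child_and_reap() is a hypothetical helper that forks a child, lets it exit, and reaps it with waitpid(), the library-level relative of wait4().

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits with `code`, reap it, and
 * return the exit status the parent observes, or -1 on any failure. */
int run_child_and_reap(int code)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;          /* fork failed */
    if (pid == 0)
        _exit(code);        /* child: stays a zombie until the parent reaps it */

    /* Parent: block until this child changes state, then collect it. */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFEXITED(status))
        return -1;          /* the child was terminated by a signal instead */
    return WEXITSTATUS(status);
}
```

A child that calls _exit(259) is seen by the parent with status 3, because only the low 8 bits survive the (error_code&0xff)<<8 encoding in sys_exit().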

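But what if the parent is not waiting in wait4()? That is no problem either. As Chapter 3 showed, every time a process returns from a system call, interrupt or exception, the kernel checks whether signals are pending; if so, control goes to signal_return in entry.S, which calls do_signal(). do_signal() (in arch/i386/kernel/signal.c) contains the fragment listed under ⑶ do_signal.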

⑴ DECLARE_WAITQUEUE

// include/linux/wait.h
struct __wait_queue {
  unsigned int flags;
#define WQ_FLAG_EXCLUSIVE 0x01
  struct task_struct * task;
  struct list_head task_list;
#if WAITQUEUE_DEBUG
  long __magic;
  long __waker;
#endif
};
typedef struct __wait_queue wait_queue_t;

// ------------------------------------------------------
struct __wait_queue_head {
  wq_lock_t lock;
  struct list_head task_list;
#if WAITQUEUE_DEBUG
  long __magic;
  long __creator;
#endif
};
typedef struct __wait_queue_head wait_queue_head_t;

#if WAITQUEUE_DEBUG
# define __WAITQUEUE_DEBUG_INIT(name) \
    , (long)&(name).__magic, 0
# define __WAITQUEUE_HEAD_DEBUG_INIT(name) \
    , (long)&(name).__magic, (long)&(name).__magic
#else
# define __WAITQUEUE_DEBUG_INIT(name)
# define __WAITQUEUE_HEAD_DEBUG_INIT(name)
#endif

#define __WAITQUEUE_INITIALIZER(name,task) \
  { 0x0, task, { NULL, NULL } __WAITQUEUE_DEBUG_INIT(name)}
#define DECLARE_WAITQUEUE(name,task) \
  wait_queue_t name = __WAITQUEUE_INITIALIZER(name,task)
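
What sys_wait4() does with this declaration can be imitated in user space. The sketch below is a simplified model, not the kernel's wait-queue API: all type, macro and function names are hypothetical, locking is left out, and only the initializer shape { 0x0, task, { NULL, NULL } } is taken from the listing above.

```c
#include <assert.h>
#include <stddef.h>

struct list_node { struct list_node *next, *prev; };
struct task_stub { int pid; };          /* stands in for task_struct */

struct waitq_entry {                    /* models wait_queue_t */
    unsigned int flags;
    struct task_stub *task;
    struct list_node task_list;
};

struct waitq_head {                     /* models wait_queue_head_t, no lock */
    struct list_node task_list;
};

/* Same shape as __WAITQUEUE_INITIALIZER: zero flags, owner, unlinked node. */
#define WAITQ_ENTRY_INITIALIZER(task) { 0x0, (task), { NULL, NULL } }
#define DECLARE_WAITQ_ENTRY(name, task) \
    struct waitq_entry name = WAITQ_ENTRY_INITIALIZER(task)

/* An empty queue is a node that points at itself in both directions. */
void waitq_head_init(struct waitq_head *h)
{
    h->task_list.next = &h->task_list;
    h->task_list.prev = &h->task_list;
}

/* Link an unqueued entry in right behind the head. */
void waitq_add(struct waitq_head *h, struct waitq_entry *e)
{
    struct list_node *first = h->task_list.next;

    assert(e->task_list.next == NULL && e->task_list.prev == NULL);
    e->task_list.next = first;
    e->task_list.prev = &h->task_list;
    first->prev = &e->task_list;
    h->task_list.next = &e->task_list;
}

/* Unlink a queued entry and mark it unqueued again. */
void waitq_remove(struct waitq_entry *e)
{
    assert(e->task_list.next != NULL && e->task_list.prev != NULL);
    e->task_list.prev->next = e->task_list.next;
    e->task_list.next->prev = e->task_list.prev;
    e->task_list.next = NULL;
    e->task_list.prev = NULL;
}

int waitq_is_empty(const struct waitq_head *h)
{
    return h->task_list.next == &h->task_list;
}
```

The entry lives in the caller's stack frame, so, as in sys_wait4(), it has to be removed from the queue before the function that declared it returns.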

⑵ release_task

// kernel/exit.c
static void release_task(struct task_struct * p)
{
  if (p != current) {
#ifdef CONFIG_SMP
    /*
     * Wait to make sure the process isn't on the
     * runqueue (active on some other CPU still)
     */
    for (;;) {
      task_lock(p);
      if (!p->has_cpu)
        break;
      task_unlock(p);
      do {
        barrier();
      } while (p->has_cpu);
    }
    task_unlock(p);
#endif
    atomic_dec(&p->user->processes);
    free_uid(p->user);
    unhash_process(p);

    release_thread(p);
    current->cmin_flt += p->min_flt + p->cmin_flt;
    current->cmaj_flt += p->maj_flt + p->cmaj_flt;
    current->cnswap += p->nswap + p->cnswap;
    /*
     * Potentially available timeslices are retrieved
     * here - this way the parent does not get penalized
     * for creating too many processes.
     *
     * (this cannot be used to artificially 'generate'
     * timeslices, because any timeslice recovered here
     * was given away by the parent in the first place.)
     */
    current->counter += p->counter;
    if (current->counter >= MAX_COUNTER)
      current->counter = MAX_COUNTER;
    free_task_struct(p);
  } else {
    printk("task releasing itself\n");
  }
}

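Here unhash_process() removes the child's task_struct from the hash table queues, and several more of the child's statistics are merged into the parent's. release_thread() merely checks that the process's LDT, if it had one, has indeed been freed. Finally free_task_struct() frees the two physical pages occupied by the task_struct and the kernel stack.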

⑶ do_signal

// arch/i386/kernel/signal.c
int do_signal(struct pt_regs *regs, sigset_t *oldset)
{
  // ...
    ka = &current->sig->action[signr-1];
    if (ka->sa.sa_handler == SIG_IGN) {
      if (signr != SIGCHLD)
        continue;
      /* Check for SIGCHLD: it's special.  */
      while (sys_wait4(-1, NULL, WNOHANG, NULL) > 0)
        /* nothing */;
      continue;
    }
  // ...  
}

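So a parent that ignores SIGCHLD (its handler is SIG_IGN) still ends up calling sys_wait4(), passively, after receiving the signal. The pid argument is -1 here, which means that any child of the current process qualifies, whatever its process group (see how the for loop in sys_wait4() tests pid).

Of course, if the parent has installed a different handler for SIGCHLD, that is another matter.

The reader may also ask what guarantees that a system call, interrupt or exception will ever force the parent to execute do_signal(). What if the running parent makes no system calls, touches no devices and does nothing that raises an exception? Remember that the timer interrupt fires periodically; without it even scheduling might never happen. That is exactly why the timer interrupt is regarded as the system's "heartbeat".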

To be continued
