Grad School (the good parts)

On Abstractions

Systems designers are abstraction merchants.1

Nothing is so difficult it cannot be solved by another level of indirection.2

On Files

The life of an average file is tedious3 and brief4. Sequential access is rarely sequential with multiple threads.

On Queueing

Little’s Law and its queueing-theory relatives have bizarre and counter-intuitive conclusions. Suppose bank customers take an average of 10 minutes to serve (exponentially distributed) and arrive at a rate of 5.8 per hour. With 1 teller, the average wait in line is nearly 5 hours. With 2 tellers, it is about 3 minutes.
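The teller numbers can be reproduced with a minimal sketch, assuming an M/M/c queue (Poisson arrivals, exponential service, `c` identical tellers) and the Erlang C formula. The function names `erlang_c` and `mean_wait_hours` are illustrative and not from the original notes.

```python
from math import factorial


def erlang_c(servers: int, offered_load: float) -> float:
    """Probability that an arriving customer has to wait (M/M/c, Erlang C)."""
    utilization = offered_load / servers
    if utilization >= 1:
        raise ValueError("unstable queue: utilization must be below 1")
    # Weight of the "all servers busy" states.
    busy = offered_load ** servers / factorial(servers) / (1 - utilization)
    # Weight of the states with at least one idle server.
    idle = sum(offered_load ** k / factorial(k) for k in range(servers))
    return busy / (idle + busy)


def mean_wait_hours(arrival_rate: float, service_rate: float, servers: int) -> float:
    """Mean time in line (excluding service); rates are per hour."""
    offered_load = arrival_rate / service_rate
    p_wait = erlang_c(servers, offered_load)
    return p_wait / (servers * service_rate - arrival_rate)


one_teller = mean_wait_hours(5.8, 6.0, 1)        # hours
two_tellers = mean_wait_hours(5.8, 6.0, 2) * 60  # minutes
```

Under these assumptions `one_teller` is about 4.8 hours and `two_tellers` is about 3 minutes. Little’s Law ($\bar{N} = \lambda \bar{T}$) then gives the mean number of people in line: roughly 28 with one teller, and well under one with two.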

Anything can be fixed with the words “assume i.i.d.”

$P_k$ (probability of $k$ in the system), $\bar{N}$ (mean number in system), and $\bar{T}$ (mean time in system) are independent of the service discipline, but the variance and distribution of $T$ are not. I.e., $\bar{T}_{FCFS} = \bar{T}_{LCFS}$, but $\mathrm{Var}(T_{FCFS}) \neq \mathrm{Var}(T_{LCFS})$, since $\mathrm{Var}(T) = E[T^2] - E^2[T]$ and the second moment $E[T^2]$ depends on the order of service.
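A small simulation makes the point concrete. This is a sketch under stated assumptions: a single server, Poisson arrivals, exponential service, non-preemptive service, and hypothetical parameter values (`arrival_rate=0.8`, `service_rate=1.0`) chosen only for illustration.

```python
import random
from statistics import mean, pvariance


def simulate(discipline, n_customers=20000, arrival_rate=0.8,
             service_rate=1.0, seed=1):
    """Return (mean, variance) of time in system for one single-server run."""
    rng = random.Random(seed)
    clock = 0.0
    arrivals = []
    for _ in range(n_customers):
        clock += rng.expovariate(arrival_rate)
        arrivals.append(clock)

    waiting = []          # arrival times of customers currently in line
    times_in_system = []
    clock = 0.0
    next_arrival = 0      # index of the next customer yet to arrive
    while next_arrival < n_customers or waiting:
        if not waiting:
            # Server is idle: jump ahead to the next arrival.
            clock = max(clock, arrivals[next_arrival])
        # Everyone who has arrived by now joins the line.
        while next_arrival < n_customers and arrivals[next_arrival] <= clock:
            waiting.append(arrivals[next_arrival])
            next_arrival += 1
        # FCFS serves the oldest arrival, LCFS the newest.
        arrived = waiting.pop(0) if discipline == "FCFS" else waiting.pop()
        clock += rng.expovariate(service_rate)
        times_in_system.append(clock - arrived)
    return mean(times_in_system), pvariance(times_in_system)


fcfs_mean, fcfs_var = simulate("FCFS")
lcfs_mean, lcfs_var = simulate("LCFS")
```

Both runs reuse the same arrival and service draws, so the two means agree up to rounding, while the LCFS variance comes out much larger: a few unlucky customers get buried at the bottom of the stack.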

Poisson only works if your events are truly independent.5

On Optimization

Put broadly, the object of study in mathematics is truth; the object of study in computer science is complexity. [I]t’s not enough for a problem to have a solution, if that problem is intractable.6

Everyone is afraid to touch [database] optimizers, because no one knows how they work.

Optimization works in cycles:

  1. Exploit assumptions and amortization.
  2. Divide and conquer until re-computation takes too long.
  3. Parallelize until Amdahl’s law caps the speedup.
  4. Dynamic programming (parenthesization, memoization) until state space complexity takes you back to #1.

On Distributed Systems

In distributed systems, it is impossible to tell whether a system is dead or arbitrarily delayed.

On Data

Curse of Dimensionality: as the number of attributes/dimensions in a dataset increases, the average distance between points grows and the distances become increasingly similar, making it more difficult to detect outliers7 and leading to overfitted models.8
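A rough way to see this, assuming points drawn uniformly at random from the unit hypercube (an illustrative choice, not a claim about any real dataset), is to compare pairwise Euclidean distances in low and high dimensions. The helper name `distance_stats` is hypothetical.

```python
import math
import random


def distance_stats(dim, n_points=100, seed=0):
    """Return (mean pairwise distance, relative spread) for random points."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    ]
    mean_dist = sum(dists) / len(dists)
    # Relative spread: how different the nearest and farthest pairs are.
    spread = (max(dists) - min(dists)) / mean_dist
    return mean_dist, spread


low_mean, low_spread = distance_stats(dim=2)
high_mean, high_spread = distance_stats(dim=1000)
```

In this sketch the mean distance is larger in 1000 dimensions than in 2, while the relative spread shrinks: every point ends up roughly equally far from every other, so "far away" stops being a useful signal for an outlier.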

The more you look, the more you overfit9.
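One way to put a number on this, assuming the looks are independent hypothesis tests each run at the same significance level (a simplification), is the chance of at least one false positive. The function name `family_wise_error` is illustrative.

```python
def family_wise_error(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across independent tests."""
    # Each test avoids a false positive with probability (1 - alpha).
    return 1 - (1 - alpha) ** n_tests


one_look = family_wise_error(0.05, 1)
twenty_looks = family_wise_error(0.05, 20)
```

With a 5% threshold, one look carries a 5% risk of a spurious finding; twenty looks carry roughly a 64% risk.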

Tricks in Data Analysis:

  1. Add more dimensions: Non-linear SVM
  2. Take away dimensions: Random forest, eigenvectors

Data-ink ratio = $\frac{\text{data ink}}{\text{total ink}}$

On Tools

Typing skill and comprehension are independent. Concurrent tasking does not affect typing much.10

\(\require{AMScd} \begin{CD} \text{CMF} @>{\text{smooth}}>> \text{CDF} \\ @AA{\sum}A @A{\int_0^x}AA \\ \text{PMF} @<{\text{bin}}<< \text{PDF} \end{CD}\) 11
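Two arrows of the diagram can be sketched in a few lines: summing a PMF into its cumulative form, and binning a PDF into a PMF. The function names are hypothetical, and the binning uses a simple midpoint rule rather than any particular library routine.

```python
from itertools import accumulate


def pmf_to_cmf(pmf):
    """Sum: running total of the probability masses."""
    return list(accumulate(pmf))


def bin_pdf(pdf, low, high, n_bins, steps_per_bin=1000):
    """Bin: integrate the density over each bin (midpoint rule)."""
    width = (high - low) / n_bins
    dx = width / steps_per_bin
    masses = []
    for b in range(n_bins):
        start = low + b * width
        area = sum(pdf(start + (s + 0.5) * dx) for s in range(steps_per_bin)) * dx
        masses.append(area)
    return masses


# Example density f(x) = 2x on [0, 1], binned into two halves.
pmf = bin_pdf(lambda x: 2 * x, 0.0, 1.0, 2)
cmf = pmf_to_cmf(pmf)
```

For this example density the two bins hold about 0.25 and 0.75 of the mass, and the running total ends at 1.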

A long list is no list. 12

When in doubt, draw it out. 13

On Process and Organization

After accounting for size, other code complexity metrics become noise.14

In a sufficiently regulated engineering process, the non-work deliverables themselves take on technical rigor with little connection to the product.

The second is that people tend to inconsistency. The prediction is that methodologies requiring disciplined consistency are fragile in practice… The fourth is that people like to be good citizens, are good at looking around and taking initiative. These combine to form that common success factor, «a few good people stepped in at key moments.»15

On Human Reliability

Human reliability follows a log-normal distribution.16

The only risks in life are the people and things you depend on.12 17

On Protocols

Even if a protocol seems great on paper, it may not be used for lots of reasons.18

On Psychophysics

Sound exists in time and over space, vision exists in space and over time. 19

  1. On the Design and Evaluation of Abstractions 

  2. An indirect quote of David Wheeler, later institutionalized as the fundamental theorem of software engineering (FTSE) and in RFC 1925. E.g., indirect block addressing.

  3. From “A File Is Not A File”; it is actually many many files. 

  4. From “Measurements of a Distributed File System”; most are small (~50% <1KB) and short lived (74% open for less than .1 second, 50% live ≤1s, ~97% live less than a day). 

  5. From “Wide Area Traffic: The Failure of Poisson Modeling”; TCP traffic and IP over AAL5 are rarely Poisson. 

  6. Algorithms to Live By 

  7. On anomaly detection: The Nimbus 7 satellite had anomaly detection software which was dropping valid measurement of ozone depletion over Antarctica. 

  8. “87% of the population in the United States had reported characteristics that likely made them unique based only on {5-digit ZIP, gender, date of birth}” Sweeney 2000

  9. Bonferroni’s Principle: “Who searches a lot, finds a lot.” 

  10. Salthouse 1986 

  11. Think Stats, chapter 6. 

  12. Robert (Bob) Bell, personal conversation

  13. M&R, chapter 6. 

  14. Khaled El Emam, Saida Benlarbi, Nishith Goel, and Shesh N. Rai: “The Confounding Effect of Class Size on the Validity of Object-Oriented Metrics”. IEEE Transactions on Software Engineering, 27(7), July 2001. 

  15. Alistair Cockburn, “Characterizing People as Non-Linear, First-Order Components in Software Development” 

  16. NUREG/CR-1278 

  17. Willoughby templates, institutionalized as DoD 4245.7-M 

  18. CSMA is more efficient at smaller packet sizes, and the 802.11 setting for RTS/CTS/Thres[hold] will revert to CSMA below a threshold size. But computing the optimum size is hard, and usually RTS/CTS is disabled. 

  19. Mountford, S.J., & Gaver, W.W. (1990). Talking and listening to computers. In B. Laurel (Ed.), The art of human-computer interface design (p. 322). Reading, Massachusetts: Addison-Wesley.