CS 647 Distributed Systems
3 credits
Spring 2023
General Information
Instructor:
- Dr. Colin S. Gordon
- csgordon@drexel.edu
- https://www.cs.drexel.edu/~csgordon/
- Office hours TBD, in-office (1174) and via Zoom. If you can’t make my normal office hours, feel free to email me or DM me on Discord for help over those channels, or to set up a physical or virtual meeting that you can make.
Student Learning Information
Course Description
In-depth discussion of fundamental concepts of distributed computer systems. Covers development techniques and runtime challenges, with a focus on reliability and system validation techniques. Subjects discussed include: interprocess communication, remote procedure calls and method invocation, middleware, distributed services, coordination, transactions, replication and weak data consistency models. Significant system-building term project in Java or similar language.
Course Purpose within a Program of Study
Within the revised MSSE program, this will serve as one of 6 possible CS electives (3 required along with 3/6 IS electives) to provide broader knowledge of software engineering. The MSSE degree emphasizes modern practices and techniques to produce reliable software that functions as desired, in a timely manner. Distributed systems are an increasingly important domain as more software systems move to shared cloud infrastructure.
Within the CS PhD program, the course will serve as an elective suitable for any graduate student, but particularly for those with research interests in software engineering, systems, or programming languages.
Statement of Expected Learning
The course objectives are to:
- Teach students the core aspects of distributed systems that make their design and implementation more challenging than sequential software running on a single machine
- Communicate core theoretical results (outcomes, not necessarily proof) establishing hard limits on what is possible in distributed systems (e.g., CAP and its limitations, FLP)
- Teach students how weakening notions of correctness (specifically, notions of serializability and consistency) can help programs achieve fault tolerance
- Prepare students to validate running distributed systems
As learning outcomes, students completing this course should be able to:
- Understand how independent network and machine failure conspire to make reliable distributed systems difficult to achieve
- Articulate and understand the trade-offs between different well-known data consistency models
- Be able to design a distributed system for a specified level of fault tolerance
- Understand how to reason about the execution of a distributed system
- Understand the basic concepts of how to validate a running distributed system
- Apply cutting-edge tools to investigate the behavior of systems under failure
- Understand practical design considerations for some modern distributed systems frameworks, and their limitations
- Recognize cases when distribution is not necessary
Course Materials
Required and Recommended Texts, Readings, and Resources
Required: No textbooks; instructor-selected research papers.
Recommended: Designing Data-Intensive Applications, by Martin Kleppmann. This book is optional, but will provide a nice extended discussion for much of the course material. In addition to the natural option to purchase a hardcopy, it is available in electronic form (with and without DRM, see the book’s site), and via O’Reilly’s Safari Books Online platform. You can access this via Drexel Libraries’ subscription by going here, clicking on “Full Text Online” and signing in with your Drexel credentials.
Required and Supplemental Materials and Technologies
The course will use class Discord channels for general course questions and material questions that may be of interest to everyone in the course (setup questions, assignment clarifications, etc.). If you have already joined the CCI Discord server, you should be added to the class channels automatically. If not, go here to connect a Discord account to CCI’s server.
This course will be programming-intensive; you should expect to write a moderate amount of very challenging code.
Assignments must be completed using the Akka library for (distributed) actors on the JVM.
You are welcome to use any JVM language you choose (Java, Scala, Kotlin, or any other well-maintained JVM language).
Previous iterations of the course did permit students to use the Akka.NET reimplementation of Akka for .NET (i.e., C# and F#). Unfortunately the two have now diverged quite a bit, so it’s no longer possible for me to give a single example structure that works across both platforms.
Projects in the course will require you to write full self-contained projects using a build file for a cross-platform build tool that handles fetching dependencies, compiling your code, and executing the code. On the JVM, any of the major tools is acceptable (ant+ivy, maven, gradle, or sbt).
Projects must build from the command line: projects that only build via an IDE will be penalized, though you are welcome to use whatever editors or IDEs you like for writing your code.
You are expected to use the Akka documentation as appropriate. There is a book in the works which covers the latest version of Akka well, but it will not be released in time for the class: Akka in Action, 2nd Edition. Early access to the latest draft (which will be upgraded to permanent access) is available via the publisher, but before deciding to buy that, keep in mind:
- You don’t need the book to do the class. Lectures will include some introductions to the core concepts, and other aspects of what you’ll need are well-covered by documentation.
- It is likely the book will be published during the term. In that case, shortly after release it should appear in O’Reilly’s online learning platform, which Drexel has a subscription to (so when this happens, you would be able to access it for free with your university login).
Getting Help
The best way to get help is to use the course Discord channels, email the professor, or show up to office hours.
Examples of questions that are best for Discord include:
- I’m doing the homework, and realized I don’t understand this line in the paper. What does it mean?
- I’m trying to set up Akka, but this part of the configuration file seems wrong, has anyone figured out what this “seeds” field should be?
- General questions about actor programming
- General questions about distributed systems material beyond what we cover in assignments and lecture
These are good for Discord because they’re questions of general interest which don’t go into too much detail about specific code you’re writing.
Examples of questions that are best for email or office hours only:
- Here’s my code, this line crashes, I think I misunderstand something
- Any time you need to share your code for an assignment, this is something for a private communication channel like email or office hours. Sharing in Discord makes it easy for someone else to copy off of your hard work (see academic integrity discussion below).
- I don’t understand this part of my assignment grade
Questions about personal circumstances (e.g., extension requests) should go to the professor only, via email.
Assignments, Assessments, and Evaluations
Graded Assignments and Learning Activities
The course grading is focused on responses to readings, as well as homework projects.
Readings
Each week (except the first) you will need to respond to two research papers, no later than midnight the day before class. Late responses are not accepted, but below there is a policy allowing you to skip a few of these during the term without penalty.
When reading each paper, consider whether you could write or answer each of the following:
- A short (3-4 sentence) summary
- A list of the paper’s major contributions (these may or may not match what the authors claim)
- What’s the greatest technical strength in the paper?
- What are some limitations of the techniques described?
- What are some ways the paper’s results could be extended, improved, or better evaluated?
- Does this paper suggest solutions (or approaches) to other problems?
- Are there parts you didn’t understand?
Not all of these will make sense for every paper, but most of them are sensible for most papers.
Each week there will be two discussion questions to respond to, one for each paper, to touch on some of the aspects above, and possibly relate papers to others discussed earlier in the term. In those responses, you are welcome and encouraged(!) to raise other issues you might have encountered, especially points of confusion.
Note that different people will often have different takes on the same paper, disagree on whether a choice made by the authors is a strength or weakness, or find different things clarifying or confusing. This is all okay! Everyone has different backgrounds.
If you are confused about some part of the paper, don’t be shy: almost certainly someone else found it confusing, too, and maybe they’re too timid to mention it. By pointing out what was confusing or difficult, we raise the opportunity to discuss it in class and help everyone understand better. I found some parts of these papers difficult or confusing the first time (or two) I read them - this is normal.
Professor Gordon will read the responses in time to adjust lecture to clarify common points of confusion or elaborate on things people found interesting.
You may skip up to 4 reading responses during the term by submitting, by the deadline, a sentence saying you are using a skip in place of a response.
Do not email me a request, just submit it.
You may do this for any 4 responses, distributed any time in the term.
You may skip one response in each of four weeks, both responses in each of two weeks, or take the in-between option of skipping one week completely and half the work in each of two additional weeks. Skipped responses will simply be omitted when calculating your reading response grade for the term.
A couple of things to consider about skips:
- The skip only means you do not need to write a response. If it’s a paper that is useful for your homework, you may still find it beneficial (or essential) to read the paper, even if you skip the response.
- Because using a skip reduces the number of reading grades you have, it makes the remaining grades worth slightly more.
You may apply skips retroactively; if at the end of the term you have a 0 or 1 you’d like to remove, and you have skips remaining, you can apply the skip to that response by emailing me a request in the final week of the term (I will not do this automatically). There is no bonus for having unused skips at the end of the term.
Reading responses are graded on a scale of 0-2, with a possible extra-credit score of 3:
- 0: You did not submit anything, or your answer does not reflect a serious attempt at the reading (e.g., it does not answer the question asked, or does not appear at all specific to the paper in question)
- 1: Your response is shallow, in that while it reflects the paper, it does not reflect a serious engagement with it.
- 2: Your response reflects serious consideration of the discussion question.
- 3 (extra credit, rare): Your response is particularly insightful
The discussion questions do not have unique right answers. You’re expected to answer the questions after some thought. If you feel you do not understand some aspect of the paper well enough to answer the question, you can instead explain your confusion or uncertainty in detail, and such answers can (and usually will) receive full credit if they reflect a serious attempt at understanding and pinpointing the sources of confusion. Even answers that reflect misunderstandings can sometimes earn full credit; the grading is based more on your answer reflecting a serious engagement with the paper and course material than on specific factual understandings.
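To make the effect of a skip concrete, here is a small sketch of the arithmetic in Java. It rests on an assumption that is not stated above: that the reading grade is the average of the counted responses, each measured against the 2-point full-credit mark. The class name, method name, and example scores are all made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

/** Hypothetical helper: averages reading-response scores, leaving out skipped responses. */
class ReadingGradeSketch {
    // Top of the regular 0-2 scale; a 3 (extra credit) counts above full credit.
    static final double FULL_CREDIT = 2.0;

    /**
     * scores holds one entry per response; null marks a skipped response.
     * Assumption: the grade is the mean of the counted scores over FULL_CREDIT.
     */
    static double readingPercent(List<Integer> scores) {
        long counted = scores.stream().filter(Objects::nonNull).count();
        if (counted == 0) {
            return 0.0; // nothing counted yet
        }
        int total = scores.stream()
                .filter(Objects::nonNull)
                .mapToInt(Integer::intValue)
                .sum();
        return 100.0 * total / (counted * FULL_CREDIT);
    }

    public static void main(String[] args) {
        // Six responses, one of which earned a 0.
        List<Integer> noSkip = Arrays.asList(2, 2, 1, 2, 0, 2);
        // The same term with a skip applied to the 0.
        List<Integer> withSkip = Arrays.asList(2, 2, 1, 2, null, 2);

        System.out.println(readingPercent(noSkip));   // 75.0
        System.out.println(readingPercent(withSkip)); // 90.0
    }
}
```

Under this assumption, a skip removes the response from both the numerator and the denominator, so each remaining response counts for 1/5 of the total in the example rather than 1/6.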
A Note on Proofs: Many of the papers you will read this term include formal proofs of correctness for an algorithm or protocol. You won’t be asked to produce proofs in this class. But, you will need to understand them. Some of your homework assignments will have you implementing algorithms from these papers, and understanding the proofs of correctness will help you think about the code you write. More broadly, outside this class, some of the proofs are fundamental impossibility results. It’s all well and good to understand that X is impossible, but you’re rarely asked to do X that is known to be impossible. Instead, you’re sometimes asked to do Y, where Y and X have some strong similarities. Sometimes Y is simpler than X in a key way that makes it possible. Sometimes Y is actually a variation on X. Understanding the proof for why X is impossible will help you recognize when you see variations on it.
Homeworks
The homeworks are tentatively on the following:
- Basic distributed systems programming model and storage consistency
- Distributed transactions
- Consensus
- Big data
The late policy for homeworks is as follows: for the term, you have 5 late days to distribute between homework assignments at your discretion, with one restriction: the final homework may not be submitted after the end of the final week of classes.
Each homework should only require a modest amount of code, but that code might be very difficult to write and debug. In addition to coding, each homework will include some kind of reflection or analysis of what you did, generally in an open-ended way.
Grading Matrix
- 50% Reading Assignment Assessments
- 50% Homework (split evenly across assignments)
The late/skip policies were described above.
In addition to the late and skip policies, extensions are possible for good reason with reasonable notice. I am aware that students have jobs, family matters, paper deadlines for their PhD, etc., which can interfere with completing assignments. I want your grade to reflect your mastery of the material and quality of work you hand in, not whether or not you were fortunate enough to avoid major life events during the term. If something comes up during the term, let me know. If it’s unexpected (e.g., you end up in the ER when you were planning to work on coursework), let me know when you can and we’ll figure it out. If it’s something you know about in advance (e.g., you must travel for work), let me know as soon as you know, and we can discuss whether we should give you an extension on an assignment. I reserve the right to request supporting evidence for your stated need for an extension (only so far as justifying the existence of a good excuse; e.g., I might ask for a note confirming existence of a health issue interfering with attendance or assignment completion, but I don’t need to know the details of the particular health issue).
Academic Integrity, Cheating, and Plagiarism
The list of links at the end of the syllabus includes a link to the University’s academic integrity policy. If you haven’t actually read it before, you should, because not meaning to plagiarize is not an excuse for plagiarism. This includes not realizing that something needed to be quoted, or being unfamiliar with the idea that paraphrased sentences still require citation (and possibly quotes), or opting to reuse someone else’s words or code because you’re not confident in the quality of your own.
The general idea is that you should not submit work (code or written prose) that is not your own and is not properly attributed. Proper attribution includes, but is not limited to, things like putting direct quotes from someone else’s writing in quotes and citing the source, and giving the source for small snippets of code you might have taken from StackOverflow or similar. Again, you should read the actual university integrity policy.
The University leaves the penalty for cheating, plagiarism, etc. in a course up to the professor.
If you cheat in this class, I will give you an F for the term.
I realize that most cheating is a consequence of poor time management, or unexpected or hard-to-manage obligations beyond the class.
That is exactly why you have late days for homeworks, skips for readings, and the course has a fairly flexible extension policy - I want you to succeed, but I want you to do so honestly. If you have any doubts about whether something might cross the line into cheating, please ask me before you do it. The worst I’ll say is “No, don’t do that.” And I’ll be glad you asked. This is far better than an F for the term.
To avoid misunderstandings, please do not share pieces of your assignment code via Discord. General questions like “How do I set up an Actor” or “How do I configure this setting in Akka” are great questions for Discord. “When this core part of my homework code executes it crashes” is not appropriate for Discord (your classmates should never see your code), but a great thing to email the professor with.
Two final notes:
- If you quote or reuse other sources (properly, with attribution) so heavily I feel like you haven’t actually done the work for the assignment/response, I’ll give you a 0 for the assignment because you didn’t do the work, but that’s not plagiarism so won’t result in academic misconduct procedures. But you are allowed to take small snippets (e.g., setting up an actor, the basis for a build file) from external sources for small things that are not central to the assignment, as long as you give credit appropriately.
- A popular question these days is what to do with ChatGPT. Submitting ChatGPT answers (or output from similar systems, even if edited) is considered cheating in this course. However, a more important reason not to do it is that these systems simply won’t be effective in helping you for this course. ChatGPT and similar systems simply won’t be able to give even reasonable-sounding answers to the paper discussion questions. And they really won’t be able to help you in substantial ways with the projects, for a number of reasons: a. They generally don’t deal well with subtle code, which is one reason code generated by LLMs such as ChatGPT tends to have more security vulnerabilities than human-written code. Distributed systems code is very subtle. b. You’re being asked to write code using the Akka library. Akka code is going to be a very small part of any such system’s training data, which means those systems are likely to generate more deeply broken Akka code than they would for general Java code (the problem will be compounded if you’re working in Kotlin or Scala). The short version: using ChatGPT and similar systems isn’t permitted in this class, but independent of that, trying to use them is almost certainly more work than just doing the work yourself, due to the nature of the course.
Grade Scale
The following scale will be used to convert points to letter grades:
Points | Grade | Points | Grade | Points | Grade |
---|---|---|---|---|---|
97-100 | A+ | 82-86.99 | B | 70-71.99 | C- |
92-96.99 | A | 80-81.99 | B- | 67-69.99 | D+ |
90-91.99 | A- | 77-79.99 | C+ | 60-66.99 | D |
87-89.99 | B+ | 72-76.99 | C | 0-59.99 | F |
Note that the instructor may revise this conversion if/when necessary.
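As a rough illustration of how the 50/50 grading matrix and the scale above fit together, the following Java sketch combines two component scores and looks up a letter grade. It assumes each component has already been expressed as a percentage (0-100), which the syllabus does not spell out, and it treats the lower bound of each band as the cutoff. The class and method names are hypothetical, and the instructor may revise the actual conversion.

```java
/** Hypothetical helper: combines the two 50% components and looks up a letter grade. */
class CourseGradeSketch {
    // Lower bound of each band in the grade scale, highest first.
    private static final double[] CUTOFFS = {97, 92, 90, 87, 82, 80, 77, 72, 70, 67, 60};
    private static final String[] LETTERS = {"A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D"};

    /** Both inputs are percentages (0-100); each component counts for half of the course grade. */
    static double coursePercent(double readingPercent, double homeworkPercent) {
        return 0.5 * readingPercent + 0.5 * homeworkPercent;
    }

    /** Returns the letter for the first band whose lower bound the score reaches, or F below 60. */
    static String letter(double percent) {
        for (int i = 0; i < CUTOFFS.length; i++) {
            if (percent >= CUTOFFS[i]) {
                return LETTERS[i];
            }
        }
        return "F";
    }

    public static void main(String[] args) {
        // Made-up component scores, for illustration only.
        double total = coursePercent(90.0, 84.0);
        System.out.println(total + " -> " + letter(total)); // 87.0 -> B+
    }
}
```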
Course Schedule
(This schedule is tentative and may change during the course.)
Most weeks attempt to pair:
- An older/classic paper with a modern paper
- A theory paper with a systems paper
Currently the syllabus is final up to and including week 6.
Week by week:
- Week 1: Introduction, Overview, Actors
  - No readings due
- Week 2: Challenges and Time in distributed systems
  - A Note on Distributed Computing. Sun Microsystems Laboratories, Inc. Technical Report SMLI TR-94-29. November 1994.
  - Lamport, Leslie. Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM, 21(7), July 1978.
- Week 3: Strong Consistency
  - Chandy, K. Mani and Lamport, Leslie. Distributed Snapshots: Determining Global States of Distributed Systems. ACM Transactions on Computer Systems 3(1), February 1985.
  - Spanner: Google’s Globally-Distributed Database. ACM Transactions on Computer Systems 31(3), August 2013.
  - Optional: Chain Replication for Supporting High Throughput and Availability
- Week 4: Consensus (Paxos, Raft, etc.)
  - Ongaro, Diego and Ousterhout, John. In Search of an Understandable Consensus Algorithm. USENIX Annual Technical Conference (ATC), 2014.
    - See “Ongaro PDF” link near the bottom of the page
  - Paxos Made Live - An Engineering Perspective. ACM Symposium on Principles of Distributed Computing (PODC), 2007.
- Week 5: CAP, FLP, and other impossibilities
  - Fischer, Michael J., and Lynch, Nancy A., and Paterson, Michael S. Impossibility of Distributed Consensus with One Faulty Process. Journal of the ACM 32(2), April 1985.
  - Brewer, Eric. CAP Twelve Years Later: How the “Rules” Have Changed. IEEE Computer 45, February 2012.
- Week 6: Weak and Eventual Consistency
  - Managing update conflicts in Bayou, a weakly connected replicated storage system. ACM Symposium on Operating Systems Principles (SOSP), 1995.
  - Conflict-Free Replicated Data Types. Symposium on Self-Stabilizing Systems (SSS), 2011.
- Week 7: Getting Things Right
  - Disciplined Inconsistency with Consistency Types. In ACM Symposium on Cloud Computing (SoCC), 2016.
  - Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems. In ACM Symposium on Operating Systems Principles (SOSP), 2015.
- Week 8: Large scale data storage and processing: Hadoop & Spark
  - Dean, Jeffrey and Ghemawat, Sanjay. MapReduce: Simplified Data Processing on Large Clusters. In 6th Symposium on Operating Systems Design and Implementation (OSDI), 2004.
  - Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012.
- Week 9: Distributed Resource Management
  - Apache Hadoop YARN: Yet Another Resource Manager. In the 4th Annual Symposium on Cloud Computing (SoCC), 2013.
  - Large-scale Cluster Management at Google with Borg. In the 10th European Conference on Computer Systems (EuroSys), 2015.
- Week 10: Getting Things Right, Part 2
This reading list is tentative. Some readings in Weeks 7-10 will probably change before it’s time to read the papers. Possible additional topics or papers include:
- Xen and the Art of Virtualization (SOSP’03)
- A View of Cloud Computing, Communications of the ACM 53(4), April 2010.
- A Comparison of Software and Hardware Techniques for x86 Virtualization (ASPLOS’06)
- Dynamo
- DryadLinq
- Secure Untrusted Data Repository (SUNDR)
- The Role of Distributed State
- Viewstamped Replication Revisited
- Chain Replication for Supporting High Throughput and Availability
- Building Consistent Transactions with Inconsistent Replication
- Orleans
- Large-scale Cluster Management at Google with Borg
- Paxos Made Moderately Complex
- Scalability! But at what COST?
- F.B. Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys 22(4), December 1990.
- Borg: The Next Generation
- Compartmentalized Paxos
- Energy Consumption of Cryptocurrencies Beyond Bitcoin
- Bitcoin emissions alone could push global warming above 2°C
- Practical Byzantine Fault Tolerance, paired with one of
- Augustus: Scalable and Robust Storage for Cloud Applications
- Byzantium: Byzantine-Fault-Tolerant Database Replication Providing Snapshot Isolation
- Callinicos: Robust Transactional Storage for Distributed Data Structures
- Byzantine fault-tolerant deferred update replication
- Bitcoin: A Peer-to-Peer Electronic Cash System
- Blockchains from a Distributed Computing Perspective. Communications of the ACM, February 2019.
- SoK: Research Perspectives and Challenges for Bitcoin and Cryptocurrencies. In IEEE Symposium on Security and Privacy (S&P), 2015.
- From Viewstamped Replication to Byzantine Fault Tolerance. Chapter from Replication: Theory and Practice, 2010. (Under the keywords look for “Download to read the full chapter text” to get just this chapter; the “Download book PDF” button in the top right downloads the entire book, courtesy of Drexel Library’s subscription.)
- Practical Byzantine Fault Tolerance
Academic Policies
This course follows university, college, and department policies, including but not limited to:
- Academic Integrity, Plagiarism, Dishonesty and Cheating Policy: http://www.drexel.edu/provost/policies/academic_dishonesty.asp
- Student Life Honesty Policy from Judicial Affairs: http://www.drexel.edu/provost/policies/academic-integrity
- Students with Disability Statement: http://drexel.edu/oed/disabilityResources/students/
- Course Add/Drop Policy: http://www.drexel.edu/provost/policies/course-add-drop
- Course Withdrawal Policy: http://drexel.edu/provost/policies/course-withdrawal
- Department Academic Integrity Policy: http://drexel.edu/cci/resources/current-students/undergraduate/policies/cs-academic-integrity/
- Drexel Student Learning Priorities: http://drexel.edu/provost/assessment/outcomes/dslp/
- Office of Disability Resources: http://www.drexel.edu/ods/student_reg.html
The instructor(s) may, at his/her/their discretion, change any part of the course before or during the term, including assignments, grade breakdowns, due dates, and schedule. Such changes will be communicated to students via the course web site. This web site should be checked regularly and frequently for such changes and announcements.
Students requesting accommodations due to a disability at Drexel University need to request a current Accommodations Verification Letter (AVL) in the ClockWork database before accommodations can be made. These requests are received by Disability Resources (DR), who then issues the AVL to the appropriate contacts. For additional information, visit the DR website at drexel.edu/oed/disabilityResources/overview/, or contact DR for more information by phone at 215.895.1401, or by email at disability@drexel.edu.