CS 647 Distributed Systems
CS647: Distributed Systems
3 credits
Spring 2021
General Information
Instructor:
- Dr. Colin S. Gordon
- csgordon@drexel.edu
- https://www.cs.drexel.edu/~csgordon/
- Office hours by appointment via Zoom. If you have anything you’d like to discuss, please do not hesitate to schedule an appointment! It really is not an imposition. There are no recurring hours because historically office hours associated with evening classes have not been well-attended because of student scheduling conflicts, and the presence of recurring hours tends to discourage students who can’t make them from reaching out.
TA:
- Yigit Can Alparslan
- ya332@drexel.edu
- Office Hours Monday 8-10PM US Eastern Time Zone, via Zoom: https://drexel.zoom.us/j/84097244271
- Please do not go to the CLC for office hours.
Student Learning Information
Course Description
In-depth discussion of fundamental concepts of distributed computer systems. Covers development techniques and runtime challenges, with a focus on reliability and system validation techniques. Subjects discussed include: interprocess communication, remote procedure calls and method invocation, middleware, distributed services, coordination, transactions, replication and weak data consistency models. Significant system-building term project in Java or similar language.
Course Purpose within a Program of Study
Within the revised MSSE program, this will serve as one of 6 possible CS electives (3 required along with 3/6 IS electives) to provide broader knowledge knowledge of software engineering. The MSSE degree emphasizes modern practices and techniques to produce reliable software that functions as desired, in a timely manner. Distributed systems are an increasingly important domain and more software systems move to shared cloud infrastructure.
Within the CS PhD program, the course will serve as an elective suitable for any graduate student, but particularly for those with research interests in software engineering, systems, or programming languages.
Statement of Expected Learning
The course objectives are to:
- Teach students the core aspects of distributed systems that make their design and implementation more challenging than sequential software running on a single machine
- Communicate core theoretical results (outcomes, not necessarily proof) establishing hard limits on what is possible in distributed systems (e.g., CAP and its limitations, FLP)
- Teach students how weakening notions of correctness (specifically, notions of serializability and consistency) can help programs achieve fault tolerance
- Prepare students to validate running distributed systems
As learning outcomes, students completing this course should be able to:
- Understand how independent network and machine failure conspire to make reliable distributed systems difficult to achieve
- Articulate and understand the trade-offs between different well-known data consistency models
- Be able to design a distributed system for a specified level of fault tolerance
- Understand how to reason about the execution of a distributed system
- Understand the basic concepts of how to validate a running distributed system
- Apply cutting edge tools to investigate the behavior of systems under failure
- Understand practical design considerations for some modern distributed systems frameworks, and their limitations
- Recognize cases when distribution is not necessary
Course Materials
Required and Recommended Texts, Readings, and Resources
Required: No textbooks, instructor-selected research papers.
Recommended: Designing Data-Intensive Applications, by Martin Kleppmann. This book is optional, but will provide a nice extended discussion for much of the course material. In addition to the natural option to purchase a hardcopy, it is available in electronic form (with and without DRM, see the book’s site), and via O’Reilly’s Safari Books Online platform. You can access this via Drexel Libraries’ subscription by going here, clicking on “Full Text Online” and signing in with your Drexel credentials.
We may also read excerpts from Concurrency Control and Recovery in Database Systems, which is freely available in PDF format from the authors.
Required and Supplemental Materials and Technologies
The course will use a Slack workspace for general course questions and material questions that may be of interest to everyone in the course (setup questions, assignment clarifications, etc.). To sign up click here and register with a Drexel email address.
This course will be programming-intensive; you should expect to write a moderate amount of very challenging code. Assignments must be completed using one of the following actor frameworks:
- You may use Akka on the JVM, through either Java or Scala
- You may use Akka.NET on Mono or .NET Core, through either C# or F#
Akka.NET is a .NET port of the original JVM implementation of Akka, so both provide nearly identical interfaces. I’m happy to read and grade code in any of the four languages above, though I’m less familiar with F# than the others, so I’ll be less helpful debugging F#-specific issues.
Projects in the course will require you to write full self-contained projects using a build file for a cross-platform build tool that handles fetching dependencies, compiling your code, and executing the code. On the JVM, any of the major tools is acceptable (ant
+ivy
, maven
, gradle
, or sbt
). For .NET, msbuild
via the console, or dotnet
for .NET Core projects are acceptable (update: also fake
for F# projects). Projects must build from the command line: projects that only build via an IDE (Eclipse, Visual Studio, etc.) will be penalized, though you are welcome to use whatever editors or IDEs you like for writing your code.
The JVM version of Akka has the best documentation. Akka.NET’s documentation seems a bit more sparse, but since it attempts to expose nearly the same API as the JVM version, you could largely adapt the Java version of the JVM Akka documentation for C# if you needed to.
Both versions of Akka have reasonable books available to you through the Drexel Libraries:
- Akka in Action for the JVM (I recommend using the second “Click to Access” to get the O’Reilly Online Learning version)
- Reactive Applications with Akka.NET for .NET.
- For F# specifically, the book Mastering F# (also on Safari) contains a brief introduction
Getting Help
The best way to get help is to use the course Slack workspace, email the TA and professor (both!), or show up to office hours (either the TA’s scheduled hours, or setting up an appointment with the professor).
Examples of questions that are best for the Slack workspace include:
- I’m doing the homework, and realized I don’t understand this line in the paper. What does it mean?
- I’m trying to set up Akka, but this part of the configuration file seems wrong, has anyone figured out what this “seeds” field should be?
- General questions about actor programming
- General questions about distributed systems material beyond what we cover in assignemnts and lecture
These are good for Slack because they’re questions of general interest which don’t go into too much detail about specific code you’re writing.
Examples of questions that are best for email to course staff only:
- Here’s my code, this line crashes, I think I misunderstand something
- Any time you need to share your code for an assignment, email the professor and TA. Sharing in Slack makes it easy for someone else to copy off of your hard work.
- I don’t understand this part of my assignment grade
Questions about personal circumstances (e.g, extension requests) should go to the professor only, via email.
Assignments, Assessments, and Evaluations
Graded Assignments and Learning Activities
The course grading is focused on responses to readings, as well as short homework assignments.
Readings
Each week (except the first) you will need to respond to two research papers, no later than 6pm the day before class. Late responses are not accepted, but below there is a policy allowing you to skip a few of these during the term without penalty.
For each paper you will submit a commentary that covers your understanding of the major points and contributions of the paper, the paper’s limitations, things you might not have understood (which we can explain in class!), ways to extend or improve the paper’s work, or other problems this paper’s solutions might provide insights to. Note that not all of these questions make sense for every paper we read. I mainly want to see (a) how well you understood the main ideas from the paper, and (b) that you’ve thought about how well or poorly it might really work.
You should consider:
- A short (3-4 sentence) summary
- List the paper’s major contributions (these may or may not match what the authors claim)
- What’s the greatest technical strength in the paper?
- What are some limitations of the techniques described?
- What are some ways the paper’s results could be extended, improved, or better evaluated?
- Does this paper suggest solutions (or approaches) to other problems?
- Are there parts you didn’t understand?
Not all of these will make sense for every paper, but most of them are sensible to cover for most papers.
Note that different people will often have different takes on the same paper, disagree on whether a choice made by the authors is a strength or weakness, or find different things clarifying or confusing. This is all okay! Everyone has different backgrounds. If you are confused about some part of the paper, don’t be shy: almost certainly someone else found it confusing, too, and maybe they’re too timid to mention it. By pointing out what was confusing or difficult, we raise the opportunity to discuss it in class and help everyone understand better. I found some parts of these papers difficult or confusing the first time (or two) I read them - this is normal.
You may skip up to 4 reading responses during the term, by submitting, instead of a response, a sentence saying you are using a skip — by the deadline. Do not email me a request, just submit it. You may do this for any 4 responses, distributed any time in the term. You may skip one response in each of four weeks, both responses in each of two weeks, or take the in-between option of skipping one week completely and half the work on two additional weeks. Assignments for those weeks will simply be omitted when calculating your reading response grade for the term. One thing to consider, though: the skip only means you do not need to write a response. If it’s a paper that is useful for your homework, you may still find it beneficial to read the paper, even if you skip the response. You may apply skips retroactively; if at the end of the term you have a 0 or 1 you’d like to remove, and you have skips remaining, you can apply the skip to that response by emailing me a request in the final week of the term (I will not do this automatically). There is no bonus for having unused skips at the end of the term.
A Note on Proofs: Many of the papers you will read this term include formal proofs of correctness for an algorithm or protocol. You won’t be asked to produce proofs in this class. But, you will need to understand them. Some of your homework assignments will have you implementing algorithms from these papers, and understanding the proofs of correctness will help you think about the code you write. More broadly, outside this class, some of the proofs are fundamental impossibility results. It’s all well and good to understand that X is impossible, but you’re rarely asked to do X that is known to be impossible. Instead, you’re sometimes asked to do Y, where Y and X have some strong similarities. Sometimes Y is simpler than X in a key way that makes it possible. Sometimes Y is actually a variation on X. Understanding the proof for why X is impossible will help you recognize when you see variations on it.
Homeworks
The homeworks are tentatively on the following:
- Basic distributed systems programming model and storage consistency
- Distributed transactions
- Consensus
- Big data
The late policy for homeworks is as follows: for the term, you have 5 late days to distribute between homework assignments at your discretion, with one restriction: the final homework may not be submitted after the end of the final week of classes.
Each homework should only require a modest amount of code, but that code might be very difficult to write and debug. in addition to coding, each homework will include some kind of reflection or analysis of what you did, generally in an open-ended way.
Grading Matrix
- 50% Reading Assignment Assessments
- 50% Homework (split evenly across assignments)
Responses are graded on a scale of 0 to 2, with the possibility earn a 3/2 on any summary:
- 0 - No summary, or your summary looks like you did not read the material.
- 1 - You read the material, but essentially summarized without reflecting
- 2 - You wrote a solid, thoughtful response
- 3 - You wrote a particularly insightful response
As noted above, the grading for responses emphasizes that you have made a serious effort to understand the paper and consider its strengths, limitations, etc. The response grading does not necessarily depend on correctness of your understanding, because most of you will lack the background to fully understand all parts of the paper, which can lead to misunderstandings and lack of understanding for parts. That’s okay and expected, and won’t hurt your grade unless your response suggests you have very fundamental misunderstandings of what the papers are even trying to accomplish. Historically, this has only occurred when people have not actually made a serious effort to read the papers. The responses are due the day before each lecture to allow me time to read the responses, and update the lecture material to both address common or important questions, and clarify any misunderstandings I see.
The late/skip policies were described above.
In addition the late and skip policies, extensions are possible for good reason with reasonable notice. I am aware that students have jobs, family matters, paper deadlines for their PhD, etc., which can interfere with completing assignments. I want your grade to reflect your mastery of the material and quality of work you hand in, not whether or not you were fortunate enough to avoid major life events during the term. If something comes up during the term, let me know. If it’s unexpected (e.g., you end up in the ER when you were planning to work on coursework), let me know when you can and we’ll figure it out. If it’s something you know about in advance (e.g., you must travel for work), let me know as soon as you know, and we can discuss whether we should give you an extension on an assignment. I reserve the right to request supporting evidence for your stated need for an extension (only so far as justifying the existence of a good excuse; e.g., I might ask for a note confirming existence of a health issue interfering with attendance or assignment completion, but I don’t need to know the details of the particular health issue).
Academic Integrity, Cheating, and Plagiarism
The list of links at the end of the syllabus include a link to the University’s academic integrity policy. If you haven’t actually read it before, you should, because not meaning to plagiarize is not an excuse for plagiarism. This includes not realizing that something needed to be quoted, or being unfamiliar with the idea that paraphrased sentences still require citation (and possibly quotes), or opting to reuse someone else’s words or code because you’re not confident in the quality of your own.
The general idea is that you should not submit work that is not your own — code or written prose — that is not properly attributed. This includes, but is not limited, to things like putting direct quotes from someone else’s writing in quotes and citing the source, and giving the source for small snippets of code you might have taken from StackOverflow or similar. Again, you should read the actual university integrity policy.
The University leaves the penalty for cheating, plagiarism, etc. in a course up to the professor. If you cheat in this class, I will give you an F for the term. I realize that most cheating is a consequence of poor time management, or unexpected or hard-to-manage obligations beyond the class. That is exactly why you have late days for homeworks, skips for readings, and the course has a fairly flexible extension policy - I want you to succeed, but I want you to do so honestly. If you have any doubts about whether something might cross the line into cheating, please ask me before you do it. The worst I’ll say is “No, don’t do that.” And I’ll be glad you asked. This is far better than an F for the term.
To avoid misunderstandings, please do not share pieces of your assignment code via Slack. General questions like “how do I set up an Actor” or “How do I configure this setting in Akka.NET” are great questions for the Slack workspace. “When this core part of my homework code executes it crashes” is not appropriate for Slack (your classmates should not see your code), but a great thing to email the professor and TA with.
A final note: If you quote or reuse other sources (properly, with attribution) so heavily I feel like you haven’t actually done the work for the assignment/response, I’ll give you a 0 for the assignment because you didn’t do the work, but that’s not plagiarism so won’t result in academic misconduct procedures. But you are allowed to take small snippets (e.g., setting up an actor, the basis for a build file) from external sources for small things that are not central to the assignment, as long as you give credit appropriately.
Grade Scale
The following scale will be used to convert points to letter grades:
Grade | |||||
---|---|---|---|---|---|
97-100 | A+ | 82-86.99 | B | 70-71.99 | C- |
92-96.99 | A | 80-81.99 | B- | 67-69.99 | D+ |
90-91.99 | A- | 77-79.99 | C+ | 60-66.99 | D |
87-89.99 | B+ | 72-76.99 | C | 0-59.99 | F |
Note that the instructor may revise this conversion if/when necessary.
Course Schedule
(This schedule is tentative and may change during the course.)
Most weeks attempt to pair:
- An older/classic paper with a modern paper
- A theory paper with a systems paper
Currently the syllabus is final up to and including week 9.
Week by week:
- Introduction, Overview, Actors
- No readings due
- Challenges and Time in distributed systems
- A Note on Distributed Computing. Sun Microsystems Laboratories, Inc. Technical Report SMLI TR-94-29. November 1994.
- Lamport, Leslie. Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM, 21(7), July 1978.
- Strong Consistency
- Chandy, K. Mani and Lamport, Leslie. Distributed Snapshots: Determining Global States of Distributed Systems. ACM Transactions on Computer Systems 3(1), February 1985.
- Spanner: Google’s Globally-Distributed Database. ACM Transactions on Computer Systems 31(3), August 2013.
- Optional: Chain Replication for Supporting High Throughput and Availability
- Consensus (Paxos, Raft, etc.)
- Ongaro, Diego and Ousterhout, John. In Search of an Understandable Consensus Algorithm. USENIX Annual Technical Conference (ATC), 2014.
- See “Ongaro PDF” link near the bottom of the page
- Paxos Made Live - An Engineering Perspective. ACM Symposium on Principles of Distributed Computing (PODC), 2007.
- Ongaro, Diego and Ousterhout, John. In Search of an Understandable Consensus Algorithm. USENIX Annual Technical Conference (ATC), 2014.
- CAP, FLP, and other impossibilities
- Fischer, Michael J., and Lynch, Nancy A., and Paterson, Michael S. Impossibility of Distributed Consensus with One Faulty Process. Journal of the ACM 32(2), April 1985.
- Brewer, Eric. CAP Twelve Years Later: How the “Rules” Have Changed. IEEE Computer 45, February 2012.
- Weak and Eventual Consistency
- Managing update conflicts in Bayou, a weakly connected replicated storage system. ACM Symposium on Operating Systems Principles (SOSP), 1995.
- Conflict-Free Replicated Data Types. Symposium on Self-Stabilizing Systems (SSS), 2011.
- Getting Things Right
- Disciplined Inconsistency with Consistency Types. In ACM Symposium on Cloud Computing (SoCC), 2016.
- Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems. In ACM Symposium on Operating Systems Principles (SOSP), 2015.
- Large scale data storage and processing: Hadoop & Spark
- Dean, Jeffrey and Ghemawat, Sanjay. MapReduce: Simplified Data Processing on Large Clusters. In 6th Symposium on Operating Systems Design and Implementation (OSDI), 2004.
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012.
- Distributed Resource Management
- Apache Hadoop YARN: Yet Another Resource Manager. In the 4th Annual Symposium on Cloud Computing (SoCC), 2013.
- Large-scale Cluster Management at Google with Borg. In the 10th European Conference on Computer Systems (EuroSys), 2015.
- Cryptocurrency (a.k.a., Consensus round 2)
- Choose 2 of the following for responses:
- Bitcoin: A Peer-to-Peer Electronic Cash System
- Blockchains from a Distributed Computing Perspective. Communications of the ACM, February 2019.
- SoK: Research Perspectives and Challenges for Bitcoin and Cryptocurrencies. In IEEE Symposium on Security and Privacy (S&P), 2015.
- From Viewstamped Replication to Byzantine Fault Tolerance. Chapter from Replication: Theory and Practice, 2010. (Under the keywords look for “Download to read the full chapter text” to get just this chapter; the “Download book PDF” button in the top right downloads the entire book, courtesy of Drexel Library’s subscription.)
- Practical Byzantine Fault Tolerance
- Choose 2 of the following for responses:
This reading list is tenative. Some readings in Weeks 7-10 will change before it’s time to read the papers. Possible additional topics or papers include:
- Xen and the Art of Virtualization (SOSP’03)
- A View of Cloud Computing, Communications of the ACM 53(4), April 2010.
- A Comparison of Software and Hardware Techniques for x86 Virtualization (ASPLOS’06)
- Dynamo
- DryadLinq
- Secure Untrusted Data Repository (SUNDR)
- The Role of Distributed State
- Viewstamped Replication Revisited
- Chain Replication for Supporting High Throughput and Availability
- Building Consistent Transactions with Inconsistent Replication
- Orleans
- Large-scale Cluster Management at Google with Borg
- Paxos Made Moderately Complex
- Scalability! But at what COST?
- F.B. Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys 22(4), December 1990.
- Borg: The Next Generation
- Compartmentalized Paxos
- Energy Consumption of Cryptocurrencies Beyond Bitcoin
- Bitcoin emissions alone could push global warming above 2°C
- Practical Byzantine Fault Tolerance, paired with one of
Academic Policies
This course follows university, college, and department policies, including but not limited to:
- Academic Integrity, Plagiarism, Dishonesty and Cheating Policy:Â http://www.drexel.edu/provost/policies/academic_dishonesty.asp
- Student Life Honesty Policy from Judicial Affairs:Â http://www.drexel.edu/provost/policies/academic-integrity
- Students with Disability Statement:Â http://drexel.edu/oed/disabilityResources/students/
- Course Add/Drop Policy:Â http://www.drexel.edu/provost/policies/course-add-drop
- Course Withdrawal Policy:Â http://drexel.edu/provost/policies/course-withdrawal
- Department Academic Integrity Policy:Â http://drexel.edu/cci/resources/current-students/undergraduate/policies/cs-academic-integrity/
- Drexel Student Learning Priorities:Â http://drexel.edu/provost/assessment/outcomes/dslp/
- Office of Disability Resources:Â http://www.drexel.edu/ods/student_reg.html
The instructor(s) may, at his/her/their discretion, change any part of the course before or during the term, including assignments, grade breakdowns, due dates, and schedule. Such changes will be communicated to students via the course web site. This web site should be checked regularly and frequently for such changes and announcements.
Students requesting accommodations due to a disability at Drexel University need to request a current Accommodations Verification Letter (AVL) in the ClockWork database before accommodations can be made. These requests are received by Disability Resources (DR), who then issues the AVL to the appropriate contacts. For additional information, visit the DR website at drexel.edu/oed/disabilityResources/overview/, or contact DR for more information by phone at 215.895.1401, or by email at disability@drexel.edu.