Email Domain Replication System Notes

Goals

Reliable, fault tolerant storage of email
High-speed local access to email
Efficient bandwidth usage across WAN
Secure communication
Domain-level (sparse) replication
Simple addition and removal of servers and domains

Architecture

See the diagram for a visual description of how the pieces fit together.

Replication model:

Each server runs a remote proxy that connects to as many other servers as possible.
A remote agent maintains a connection to a remote.
A local agent manages the file store.
The remote agent arranges for locally-initiated actions to be completed remotely, and for remotely-initiated actions to be completed locally.

Security model:

Each server has a public/private key pair.
When remote agents connect, each end of a connection verifies the other against the known public key.
No local encryption is done.
All remote connections are encrypted with a SSL wrapper.

Synchronization model:

All events are effectively recorded in a log.
Messages are named such that collisions between systems is impossible.
Once created, messages are immutable.
All systems are assumed to have perfect knowledge about accounts, permissions, etc.

Plan #1: Communicate between proxy and agents using pipes

Proxy uses a local agent to manipulate mail store
Pros:

Allows for the case of being a front end to a remote mail store (but is this a reasonable configuration?)

Cons:

One more protocol encode/decode stage
More complex in the case of multiple agents
Redo log will have multiple writers (remote agents, local proxy, remote proxy)

Plan #2: Communicate between agents using files (greatly preferred)

No local agent
Access and delivery agents manipulate the mail store directly
Pros:

Simplest interaction between the various stages
Access agent can be reduced to a simple library that intercepts I/O system calls
Remote agents can be started and stopped at will without any involvement from other processes

Cons:

Requires a local mail store
Creates problems with synchronizing multiple (remote) agents without polling
Redo log will have multiple writers (proxy and remote agents)

Development Plan

Development should follow a "zero, one, any" plan -- implementation should always either disallow a feature (zero), allow the feature in only one mode or instance (one), or allow totally arbitrary use of that feature (any). Development of a feature that is required to by "any" should if possible proceed from the "zero" and "one" stages first.

Single-server model (zero remotes)

Proof of concept with:
Local only mail store
Simplified local agent library (don't handle folders)
Delivery agent
POP server

Two-server model (one remote)

Implement proxy server that knows about one local and one remote

Implement the redo log
Transmit the redo log
Trivial (no) authentication
Total domain replication

Add public/private keys to remote proxy agents

DSA authentication

Full local protocol

Flesh out the access agent library to handle folders and message numbering
Modify Courier IMAP to use the access agent library

Multiple-server model (any remotes)

Simple multiple-server

No forwarding of redo activity

Implement partial domain replication

Each server will have a list of domains that they are responsible for
Each server will know the lists from the other servers
Only redo activity necessary for the target will be forwarded

Implement spanning-tree algorithms into proxy

Each server forwards on any activity on as necessary to partially-connected servers

Definitions

EID: Email IDentifier, uniquely identifies any message across all systems

Domain name
Account name
Folder name
UID (Unique IDentifier), which is composed of:

Timestamp (including microseconds?)
PID
Originating hostname

Written as "account@domain/folder/UID"

Redo Log

Data for each message needs to be temporally ordered.
Messages do not need to be temporally ordered.
The log is used to determine what actions to take on a remote.
Previous plan: store the journal as a set of files

Two files per EID:

The message itself
List of locals or remotes that:

still need to process the message, or
have processed the message

The files are hard linked to the local mail store for non-deleted messages.
The redo log must therefore be in the same partition as the mail store.
Deleted messages are truncated in the redo log (all valid messages have non-zero length).
Cases:

Addition of a new message

Create new message file in temporary area
Write message data to the file
Link the file into the redo log
Move the file into the mail store

Modifications to a message's flags

If a link to the message does not exist in the redo log: create it
Change the flags on the message in the redo log
Change the flags on the message in the mail store

Deleted message

If a link to the message does not exist in the redo log: create it
Truncate the file
Remove the link from the mail store

Message moved from one folder to another

If a link to the message does not exist in the redo log: create it
???

Problems:

Atomic operations are rather difficult and time consuming due to requiring multiple syncs.

New plan: linear journal files

Each journal file has a maximum size.
Each activity is appended as a transaction record to the journal.
Journal can be used to replace synchronous operations on the mail store.
Record contents:

Action type
Domain name
Account name
Folder
UID
Message flags (optional)
New folder (optional)
New message contents (optional)

Cases:

Addition of a new message

Create new message file in destination account
Create new journal record
Write message data to both new file and to journal
Commit journal

Modifications to a message's flags, or message moved between folders

Rename message file
Write & commit journal record

Deleted message

Message moved from one folder to another

Unresolved Issues and Future Ideas

How does a new server get added to a network with existing data? One idea is to first insert it into the tables of the remotes, then mark it as down so that it will start accumulating redo-log entries, then mass-rsync all the data over to the target system. After that's done, roll forwards the redo-log entries, and mark it as up. Can this cause unsynchronization, though?
This architecture doesn't address scalability. One possibility for accomplishing clustering to achieve scalability is to go a step further with sparse domain replication (each server has a subset of the accounts for a domain). This would require some kind of "director" device to front-end connections to the access agent and effect the load balancing.
Another step forward past the multi-server model is a multi-layer model, where there are client servers that effectively serve as caches to the actual mail accounts, and only store the accounts that are accessed.