Authors:
(1) Martin Kleppmann, University of Cambridge, Cambridge, UK (martin.kleppmann@cst.cam.ac.uk);
(2) Paul Frazee, Bluesky Social PBC, United States;
(3) Jake Gold, Bluesky Social PBC, United States;
(4) Jay Graber, Bluesky Social PBC, United States;
(5) Daniel Holmgren, Bluesky Social PBC, United States;
(6) Devin Ivy, Bluesky Social PBC, United States;
(7) Jeromy Johnson, Bluesky Social PBC, United States;
(8) Bryan Newbold, Bluesky Social PBC, United States;
(9) Jaz Volpert, Bluesky Social PBC, United States.
Table of Links
2.3 Custom Feeds and Algorithmic Choice
3 The AT Protocol Architecture
3.2 Personal Data Servers (PDS)
3.4 Labelers and Feed Generators
5 Conclusions, Acknowledgments, and References
3.3 Indexing Infrastructure
On the web, websites are crawled and indexed by search engines, which then provide web-wide search and discovery features that the websites alone cannot provide. The AT Protocol is inspired by this architecture: the repositories hosted by PDSes are analogous to websites, and the indexing infrastructure is analogous to a search engine. User repositories are primary data (the “source of truth”), and the indexes are derived from the content of the repositories.
At the time of writing, most of Bluesky’s indexing infrastructure is operated by Bluesky Social PBC (indicated by a shaded area in Figure 3). However, the company does not have any privileged access: since repositories are public, anybody can crawl and index them using the same protocols as our systems use. Client apps can switch to reading from a different index, or use a combination of multiple indexes.
While a small PDS is designed to be cheap to operate, an indexer that ingests the entire network requires considerably greater computing resources. We therefore expect that there will be fewer hobbyist indexers than self-hosted PDSes. Nevertheless, as Bluesky grows, there are likely to be multiple professionally run indexers serving various purposes. For example, a company that performs sentiment analysis on social media activity about brands could easily create a whole-network index that provides insights to its clients. Web search engines can incorporate Bluesky activity into their indexes, and archivists such as the Internet Archive can preserve the activity for posterity.
The indexing infrastructure operated by Bluesky Social PBC is illustrated in Figure 3. It is composed of multiple services that have integration points for external services.
3.3.1 The Relay. The first component is the Relay, which crawls the user repositories on all known PDSes and consumes the streams of updates that they produce. The Relay checks the signatures and Merkle tree proofs on updates, and maintains its own replica of each repository. From this information, the Relay creates the firehose: an aggregated stream of updates that notifies subscribers whenever records are added or deleted in any of the known repositories.
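To illustrate the verification step, the sketch below checks a Merkle inclusion proof against a known root hash. This is a deliberate simplification: the AT Protocol actually uses a Merkle Search Tree over CBOR-encoded records, whereas this example uses a plain binary Merkle tree with SHA-256, and all names (`verify_inclusion`, the `(sibling, side)` proof format) are illustrative assumptions rather than the protocol's wire format.

```python
import hashlib

def h(data: bytes) -> bytes:
    """SHA-256 digest, used for both leaves and internal nodes in this sketch."""
    return hashlib.sha256(data).digest()

def verify_inclusion(leaf: bytes, proof: list, root: bytes) -> bool:
    """Walk a Merkle inclusion proof from a leaf up to the root.

    Each proof step is a (sibling_hash, side) pair, where side indicates
    whether the sibling sits to the left ('L') or right ('R') of our node.
    """
    node = h(leaf)
    for sibling, side in proof:
        node = h(sibling + node) if side == "L" else h(node + sibling)
    return node == root

# Build a tiny 4-leaf tree to demonstrate.
leaves = [b"rec0", b"rec1", b"rec2", b"rec3"]
l = [h(x) for x in leaves]
n01, n23 = h(l[0] + l[1]), h(l[2] + l[3])
root = h(n01 + n23)

# Proof that "rec2" is in the tree: sibling l[3] on the right, then n01 on the left.
proof = [(l[3], "R"), (n01, "L")]
assert verify_inclusion(b"rec2", proof, root)
```

A Relay performing this kind of check for every update can maintain its replica of a repository without trusting the PDS that hosts it: any tampered or fabricated record fails the proof.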
The firehose is publicly available. Consuming the firehose is an easier way of building an index over the whole network, compared to directly subscribing to the source PDSes, since the Relay performs some initial data cleaning such as discarding malformed updates and filtering out high-volume spam. The firehose can optionally include Merkle proofs and signatures along with records, allowing subscribers to check that they are authentic.
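The aggregation and cleaning behavior described above can be sketched as a generator that merges per-PDS update streams into one firehose, discarding updates that fail a validity check. This is a toy single-threaded model (a real Relay consumes many PDS subscriptions concurrently over WebSockets), and the dict-shaped updates and `is_valid` predicate are illustrative assumptions, not the protocol's CBOR frame format.

```python
def firehose(pds_streams, is_valid):
    """Merge per-PDS update streams into one aggregated stream,
    dropping updates the relay deems malformed (simplified sketch)."""
    for stream in pds_streams:  # in reality: concurrent subscriptions, not a loop
        for update in stream:
            if is_valid(update):
                yield update

streams = [
    [{"repo": "did:plc:aaa", "op": "create"}, {"repo": "did:plc:aaa"}],  # 2nd is malformed
    [{"repo": "did:plc:bbb", "op": "delete"}],
]
valid = list(firehose(streams, lambda u: "op" in u and "repo" in u))
assert len(valid) == 2  # the malformed update was filtered out
```

Subscribers downstream of this stream see only well-formed updates, which is what makes the firehose a more convenient starting point for indexers than raw PDS subscriptions.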
The Relay does not interpret or index the records in repositories, but simply stores and forwards them. Any developers wanting to create a new social mode on top of atproto can define a new lexicon with new record types, and these records can be stored in existing repositories and aggregated in the firehose without requiring any changes to the Relay.
3.3.2 The App View. The App View is a service that consumes the firehose, and processes the records that are relevant to the Bluesky social app (records in the com.atproto and app.bsky lexicons). For example, the App View counts the number of likes on every post, and it collates the thread of replies to each post. The App View also maintains the set of followers for each user, and constructs the timeline containing the posts by the accounts that each user is following. It then offers a web service through which this information can be queried. When a record contains references to images, the App View fetches those files from the original PDS, resizes them if necessary to reduce the file size, and makes them available via a content delivery network (CDN).
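The App View's core job of turning a flat stream of records into queryable aggregates can be sketched as follows. The event shapes here are simplified assumptions (a `(collection, author, payload)` tuple rather than the real lexicon wire format), and the `AppView` class and its method names are illustrative, not part of any actual implementation.

```python
from collections import defaultdict

class AppView:
    """Toy App View: consumes firehose-style events and maintains
    like counts, follow graphs, and per-user timelines."""

    def __init__(self):
        self.likes = defaultdict(int)    # post URI -> like count
        self.follows = defaultdict(set)  # user -> accounts they follow
        self.posts = defaultdict(list)   # author -> post URIs, in arrival order

    def apply(self, event):
        collection, author, payload = event
        if collection == "app.bsky.feed.post":
            self.posts[author].append(payload)   # payload: the new post's URI
        elif collection == "app.bsky.feed.like":
            self.likes[payload] += 1             # payload: liked post's URI
        elif collection == "app.bsky.graph.follow":
            self.follows[author].add(payload)    # payload: followed account

    def timeline(self, user):
        """Posts by every account the user follows."""
        return [p for a in self.follows[user] for p in self.posts[a]]

av = AppView()
av.apply(("app.bsky.graph.follow", "alice", "bob"))
av.apply(("app.bsky.feed.post", "bob", "at://bob/post/1"))
av.apply(("app.bsky.feed.like", "alice", "at://bob/post/1"))
assert av.timeline("alice") == ["at://bob/post/1"]
assert av.likes["at://bob/post/1"] == 1
```

Because the primary data lives in the repositories, an index like this is entirely disposable: it can be rebuilt at any time by replaying the firehose from scratch.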
To display this information in the user’s client app, the client queries the user’s own PDS, which then fetches the necessary data from an App View. The App View is also responsible for enforcing moderation controls: for example, if one user has blocked another, and one of the users’ repositories contains a record of an interaction that should not have been allowed due to the block, then the App View drops that interaction so that nobody can see it in the client apps. This behavior is consistent with how blocking works on Twitter/X [61], and it is also the reason why blocks are public records in Bluesky: every protocol-conforming App View needs to know who is blocking whom in order to enforce the block [16, 41]. If users are unhappy with the moderation rules applied by the App View operated by Bluesky Social PBC, it is always possible for third parties to operate alternative App Views that index the same firehose and present the data in a different way.
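The block-enforcement rule can be captured in a few lines: an interaction is suppressed if either party blocks the other, regardless of which direction the interaction flows. The tuple shapes and the `enforce_blocks` helper below are simplified assumptions for illustration, not the real record format.

```python
def enforce_blocks(interactions, blocks):
    """Drop any interaction where either party has blocked the other.

    interactions: (actor, subject, kind) tuples, e.g. a reply to someone's post.
    blocks: set of (blocker, blocked) pairs derived from public block records.
    """
    def is_blocked(a, b):
        # A block applies symmetrically: neither side may interact with the other.
        return (a, b) in blocks or (b, a) in blocks

    return [i for i in interactions if not is_blocked(i[0], i[1])]

blocks = {("alice", "carol")}
interactions = [
    ("bob", "alice", "reply"),
    ("carol", "alice", "reply"),  # dropped: alice blocks carol
    ("alice", "carol", "like"),   # dropped: the block applies in both directions
]
visible = enforce_blocks(interactions, blocks)
assert visible == [("bob", "alice", "reply")]
```

Because the block set is derived from public records, any independently operated App View applying this same rule arrives at the same result, which is what makes the enforcement protocol-conforming rather than operator-specific.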
If the AT Protocol is used to implement another social mode besides microblogging, that application will most likely require an App View service of its own, which can be hosted by anyone. This service can then interpret and index the records in users’ repositories in whatever way is required for that application.
This paper is available on arxiv under CC BY 4.0 DEED license.