Constellation: Clustering Nihilistic Violent Extremist Telegram Networks

Fri, 19 Jun 2026 00:00:00 +0000

A technical account of a pipeline I built to collect Telegram entity data and model it as a node graph in Neo4j, supporting indicator clustering for threat actor investigations.

Foreword

This post describes a private research tool I created, “Constellation”, and the methodology behind it. It deliberately contains no target identifiers, no victim data, and no live findings. Collection of this kind should only be conducted under appropriate authorisation and legal advice, against a defined investigative scope with source material handled as sensitive evidence. The intent throughout is to support lawful disruption of threat groups and the protection of victims.

A note on the graphs below. The interactive node graphs embedded throughout this post are representations of Constellation’s output, rebuilt as self-contained visualisations for the web. They are not the live tool, and they use mock data throughout. The entities, names and handles in them are all fictional, arranged to mirror the structure of a real investigation without exposing any real target.

Background

A significant part of my threat intelligence work over the past four years has concerned a threat group known as The Com, short for “The Community”. The Com is a loosely connected, predominantly English-speaking group of minors involved in a wide range of criminal activity, from hacking to physical attacks and sexual extortion. The FBI released an alert on The Com in July 2025 that gives significant detail on the group’s crimes, motivations and subsets.

One subset of this threat group, Extortion Com, is a nihilistic violent extremist network. Threat actors in Extortion Com engage in the most depraved and evil acts, using threats, blackmail, and manipulation to coerce or extort victims into performing acts of self-harm, violence, and animal cruelty. Extortion Com is fragmented into dozens of subgroups, with thousands of members.

These groups do not operate from a single, stable online location. They are distributed across thousands of Telegram channels, groups and disposable accounts that are created, abandoned and rebranded continuously. When a channel is reported and removed, equivalents reappear shortly afterwards. Investigating any single account or channel in isolation yields very little. The information that matters is relational: which threat actor aliases operate which accounts, which accounts administer which channels, and which accounts recur across channels that otherwise appear unconnected.

This is the investigation problem that Constellation was built to address. It is not a scraping tool; it is a pipeline for turning Telegram entity data into a graph that an analyst can query, so that the structure of a network, rather than the content of any particular message, becomes the unit of analysis.

This post describes how it works end-to-end, and the two intelligence techniques it was designed to support: pivoting and clustering.

Why a node graph?

The questions that arise during a threat actor investigation are almost all questions about relationships:

Which accounts administer more than one channel within a given set?
Two channels present as unrelated: do they share administrators or members?
A newly observed account: has it co-administered a channel already attributed to a known group?
When was a channel created, and which others were established in the same period?

Relational tables answer these questions poorly. A property graph answers them directly because the relationships are first-class objects in the model. The architecture therefore follows the questions: collect the entities, represent them as nodes and edges, and query the graph.

The underlying technique is clustering: grouping multiple individually weak indicators (a profile name, communication nuances, channel memberships, group rosters) until, taken together, they support an attribution with materially higher confidence than any single indicator would. No single data point is used for identification. The combination is what carries weight.

Constellation has three stages. Figure 1: The collection plane feeds a normalised relational store, which is ingested into a Neo4j property graph for analysis.

Stage 1 - Collection

The collector is built on Telethon, the asyncio MTProto client for Telegram. Operating at the MTProto layer, as a user client rather than through the Bot API, was a deliberate choice: a user client observes the network as a member does, with access to dialogs, participant lists and administrative metadata that the Bot API does not expose.

Authentication uses QR-code login by default, with a phone-number fallback and two-factor handling where required. The session is persisted so that collection can resume without re-authenticating. The account used for collection is itself an indicator and is treated accordingly: dedicated, isolated, and never reused across contexts. Collection is not limited to a single account either. Constellation can load a small pool of accounts, each with its own credentials and session, authenticate them independently, and rotate between them with a configurable delay, all writing into one shared store. This spreads the work of a long collection across several identities, without leaning too heavily on any one of them.

Once authenticated, the discovery routine iterates the dialogs the account can see and classifies each entity as a channel, group, supergroup, megagroup or user. Telegram’s own service channel is excluded, and discovery runs concurrently under a bounded concurrency limit, so that large dialog lists are not processed serially.

Figure 2: The collector adapts to access level and respects Telegram’s rate limits.

Extraction produces three record types:

Accounts - ID, username, display name, the bot flag, and verified and premium status, with a fuller profile retrieved where the API permits.
Channels - ID, title, username and URL, member count, entity type, privacy status, and the is_verified, is_scam and is_fake flags Telegram itself assigns.
Relationships - the account-to-channel edges. Participant data is mapped to a permission level: ChannelParticipantCreator to OWNER, ChannelParticipantAdmin to ADMIN, and other participants to MEMBER.
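The permission mapping in the Relationships record can be sketched in a few lines. This is a minimal illustration rather than Constellation’s own code: the three participant classes are stand-ins named after the Telethon types so that the example runs without Telethon installed, and the Relationship dataclass and helper names are hypothetical.

```python
from dataclasses import dataclass


# Stand-ins for the Telethon participant types (telethon.tl.types), defined
# here only so the sketch is self-contained.
class ChannelParticipantCreator: ...
class ChannelParticipantAdmin: ...
class ChannelParticipant: ...


# Only the two privileged types are listed; everything else is a MEMBER.
PERMISSION_BY_TYPE = {
    "ChannelParticipantCreator": "OWNER",
    "ChannelParticipantAdmin": "ADMIN",
}


def permission_level(participant) -> str:
    """Map a participant object to OWNER, ADMIN or MEMBER by its class name."""
    return PERMISSION_BY_TYPE.get(type(participant).__name__, "MEMBER")


@dataclass(frozen=True)
class Relationship:
    """Hypothetical record shape for one account-to-channel edge."""
    user_id: int
    channel_id: int
    permission: str


def to_relationship(user_id: int, channel_id: int, participant) -> Relationship:
    return Relationship(user_id, channel_id, permission_level(participant))
```

Matching on the class name keeps the default case explicit: any participant type not listed falls through to MEMBER, which mirrors the mapping described above.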

For administrators and owners of channels, the individual administrative rights are preserved as discrete fields (e.g. the ability to delete messages, restrict members, promote other administrators, or change channel information, alongside any custom rank). This granularity is significant: an account able to promote administrators across several channels occupies a different position of trust than one able only to post, and that distinction becomes useful during analysis.

Where a channel restricts its participant list, the collector reconstructs membership from message history, recording a membership edge for each distinct account that has posted. Administrative detail is lost in this mode, but presence is retained. Presence across multiple otherwise unconnected channels is itself a clustering signal.
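As a rough sketch of that fallback, the function below turns a stream of message sender IDs into one membership edge per distinct account. It assumes the sender IDs have already been read from the channel history; the function name, the dictionary shape and the "source" field are hypothetical and chosen for illustration.

```python
from typing import Dict, Iterable, List, Optional


def members_from_history(channel_id: int,
                         sender_ids: Iterable[Optional[int]]) -> List[Dict]:
    """Build one MEMBER edge per distinct account seen posting in a channel."""
    seen = set()
    edges = []
    for sender_id in sender_ids:
        # Some messages carry no sender (assumed here to arrive as None);
        # repeat posters are recorded only once.
        if sender_id is None or sender_id in seen:
            continue
        seen.add(sender_id)
        edges.append({
            "user_id": sender_id,
            "channel_id": channel_id,
            "permission": "MEMBER",       # administrative detail is not recoverable
            "source": "message_history",  # hypothetical provenance marker
        })
    return edges
```

Recording how an edge was derived keeps reconstructed membership distinguishable from membership read directly from a participant list.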

Telegram’s API does not return a channel’s creation date. Because that value is useful for clustering, Constellation estimates it from the earliest available message and, where administrative access exists, the earliest admin-log event. The result is treated as an estimate and weighed alongside other indicators rather than relied upon in isolation.
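One way to express that estimate is to take the earlier of the two timestamps and keep a note of which one it came from. Taking the minimum is an assumption on my part for this sketch, and the function name and basis labels are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional, Tuple


def estimate_creation_date(
    earliest_message: Optional[datetime],
    earliest_admin_event: Optional[datetime] = None,
) -> Optional[Tuple[datetime, str]]:
    """Return (estimated date, basis) or None when nothing is available."""
    candidates = []
    if earliest_message is not None:
        candidates.append((earliest_message, "earliest_message"))
    if earliest_admin_event is not None:
        candidates.append((earliest_admin_event, "earliest_admin_log_event"))
    if not candidates:
        return None
    # The earlier timestamp is the tighter upper bound on the true creation date.
    return min(candidates, key=lambda pair: pair[0])


# Usage with illustrative timestamps.
message_seen = datetime(2024, 3, 10, tzinfo=timezone.utc)
admin_event = datetime(2024, 3, 2, tzinfo=timezone.utc)
estimate = estimate_creation_date(message_seen, admin_event)
```

Either source can only show that the channel existed by a given date, so the result is an upper bound on the creation date rather than the date itself.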

Aggressive collection is answered by Telegram with FloodWaitError. The collector honours the server-specified wait interval and backs off rather than retrying immediately. Collection is correspondingly slower, but the account remains usable over the long timeframes these investigations require.
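The back-off behaviour can be sketched as a small asyncio wrapper. This is an illustration under stated assumptions rather than the collector’s actual code: FloodWait is a stand-in for Telethon’s FloodWaitError (which exposes the wait as a seconds attribute), and the wrapper name, the attempt limit and the padding are hypothetical.

```python
import asyncio


class FloodWait(Exception):
    """Stand-in for telethon.errors.FloodWaitError, which exposes `.seconds`."""

    def __init__(self, seconds: int):
        super().__init__(f"flood wait of {seconds}s")
        self.seconds = seconds


async def call_with_flood_wait(make_call, *, max_attempts: int = 3,
                               padding: float = 1.0,
                               sleep=asyncio.sleep,
                               flood_error=FloodWait):
    """Await make_call(); on a flood wait, sleep for the server's interval."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except flood_error as exc:
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            # Honour the server-specified wait, plus a small safety margin.
            await sleep(exc.seconds + padding)
```

Passing the sleep function in as a parameter is a convenience for testing; in real use the default asyncio.sleep applies and the wrapper simply waits out the interval Telegram asks for.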

Stage 2 - Storage

Collected data is written to SQLite before it reaches the graph. Retaining a normalised relational copy is intentional: it is the source of truth, it exports cleanly as evidence, and it allows the graph to be rebuilt at any time without re-collection.

The entity schema comprises three core tables:

users, keyed on account identifier;
channels, keyed on channel identifier and including the estimated creation date; and
relationships, the account-to-channel edges, with a uniqueness constraint on the account–channel pair so that re-collection updates an existing edge rather than duplicating it.

A fourth table records the access level held for each entity: full, limited, or none. That provenance is not incidental: in intelligence work, the completeness and confidence of a collection is as important as its content.
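A minimal sketch of such a store, using Python’s built-in sqlite3 module. Only the table roles, the unique account–channel pair and the three access levels come from the description above; the column names, the access_levels table name and the helper functions are assumptions, and the upsert syntax needs SQLite 3.24 or later.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS users (
    user_id      INTEGER PRIMARY KEY,
    username     TEXT,
    display_name TEXT
);
CREATE TABLE IF NOT EXISTS channels (
    channel_id              INTEGER PRIMARY KEY,
    title                   TEXT,
    estimated_creation_date TEXT
);
CREATE TABLE IF NOT EXISTS relationships (
    user_id    INTEGER NOT NULL REFERENCES users(user_id),
    channel_id INTEGER NOT NULL REFERENCES channels(channel_id),
    permission TEXT NOT NULL,
    UNIQUE (user_id, channel_id)
);
CREATE TABLE IF NOT EXISTS access_levels (
    entity_type  TEXT NOT NULL,
    entity_id    INTEGER NOT NULL,
    access_level TEXT NOT NULL CHECK (access_level IN ('full', 'limited', 'none')),
    PRIMARY KEY (entity_type, entity_id)
);
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def upsert_relationship(conn, user_id: int, channel_id: int, permission: str) -> None:
    """Insert an edge, or update the existing one for the same pair."""
    conn.execute(
        """
        INSERT INTO relationships (user_id, channel_id, permission)
        VALUES (?, ?, ?)
        ON CONFLICT (user_id, channel_id)
        DO UPDATE SET permission = excluded.permission
        """,
        (user_id, channel_id, permission),
    )
    conn.commit()
```

The uniqueness constraint is what makes re-collection safe: a second sighting of the same pair updates the row in place instead of adding another.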

All data exports to CSV and JSON with Unicode normalisation, so that non-Latin channel names survive processing intact.
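A short sketch of what normalised export might look like with the standard library. The choice of NFC as the normalisation form is an assumption, as are the function names; the point is only that strings are normalised once and written without ASCII escaping.

```python
import csv
import io
import json
import unicodedata


def normalise(value, form: str = "NFC"):
    """Normalise strings to one Unicode form; pass other values through."""
    return unicodedata.normalize(form, value) if isinstance(value, str) else value


def to_json(rows) -> str:
    clean = [{key: normalise(val) for key, val in row.items()} for row in rows]
    # ensure_ascii=False keeps non-Latin text readable instead of \u escapes.
    return json.dumps(clean, ensure_ascii=False, indent=2)


def to_csv(rows, fieldnames) -> str:
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow({key: normalise(row.get(key)) for key in fieldnames})
    return buffer.getvalue()


# Usage: a title stored with a combining accent is composed into one code point.
rows = [{"channel_id": 1, "title": "Cafe\u0301"}]
json_text = to_json(rows)
csv_text = to_csv(rows, ["channel_id", "title"])
```

Normalising before export means two visually identical titles compare equal downstream, which matters when names are later used as matching keys.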

Where it runs is a deployment detail, not a design one. Constellation runs equally as a local command-line process or as a containerised service, and one variant packages the collector for Google Cloud Run behind a small HTTP API, writing each run’s CSV, JSON and database snapshot out to cloud storage. None of that changes the data model, only how collection is scheduled and where its output lands.

Stage 3 - Graph Ingestion

The final stage loads the relational data into Neo4j as a property graph. The model is intentionally small: two node labels and three relationship types.

Figure 3. Accounts and channels as nodes; ownership, administration and membership as typed, property-bearing edges.

(:TelegramAccount {user_id, username, version, ...})
(:TelegramChannel {channel_id, channel_title, creation_date, version, ...})

(:TelegramAccount)-[:OWNER_OF {admin_rights, ...}]->(:TelegramChannel)
(:TelegramAccount)-[:ADMIN_OF {admin_rights, ...}]->(:TelegramChannel)
(:TelegramAccount)-[:MEMBER_OF {...}]->(:TelegramChannel)

Uniqueness constraints on TelegramAccount.user_id and TelegramChannel.channel_id, together with indexes on the username and collection-date fields, are created before loading. They keep the graph free of duplicates and keep queries responsive.
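The setup step might look something like the following. The statements use the Neo4j 5 constraint and index syntax. The constraint and index names and the collection_date property name are hypothetical, since the post does not name them. The session argument is assumed to be a Neo4j driver session, or anything else with a run method.

```python
# Cypher statements for the setup step; IF NOT EXISTS makes them safe to re-run.
SCHEMA_STATEMENTS = [
    "CREATE CONSTRAINT telegram_account_user_id IF NOT EXISTS "
    "FOR (a:TelegramAccount) REQUIRE a.user_id IS UNIQUE",
    "CREATE CONSTRAINT telegram_channel_channel_id IF NOT EXISTS "
    "FOR (c:TelegramChannel) REQUIRE c.channel_id IS UNIQUE",
    "CREATE INDEX telegram_account_username IF NOT EXISTS "
    "FOR (a:TelegramAccount) ON (a.username)",
    # `collection_date` is a placeholder property name for this sketch.
    "CREATE INDEX telegram_account_collection_date IF NOT EXISTS "
    "FOR (a:TelegramAccount) ON (a.collection_date)",
]


def apply_schema(session) -> int:
    """Run each statement through the session; return how many were sent."""
    for statement in SCHEMA_STATEMENTS:
        session.run(statement)
    return len(SCHEMA_STATEMENTS)
```

Creating the constraints before the first load also gives MERGE an index to look up against, which is what keeps repeated ingestion fast as the graph grows.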

The ingestor is built to be run repeatedly against a target that keeps changing, so every write is a MERGE rather than a CREATE:

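// Upsert each account in the batch, keyed on its Telegram user ID.
UNWIND $batch AS account
MERGE (t:TelegramAccount {user_id: account.user_id})
// First sighting: start the version counter and record the creation time.
ON CREATE SET t.version = 1,
    t.created_date = datetime(),
    t += account.props
// Later sightings: bump the version and refresh the stored properties.
ON MATCH SET t.version = t.version + 1,
    t.updated_date = datetime(),
    t += account.props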

The first time an account is seen it is created at version one. Every later run that touches it increments the version and updates the timestamp, which gives me a record of change over time. An account whose membership, username or administrative footprint shifts between collections is one that is moving within the network, and that movement is itself worth flagging. Relationships are merged on the same principle, so re-ingestion never produces duplicate edges. Loading is batched, with validation and type coercion applied first, and any record missing a required field is logged and skipped rather than allowed to corrupt the load.
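The validate-then-batch step could be sketched as below. It produces records in the shape the MERGE statement above expects, with a user_id and a props map. The function names, the batch size and the choice of user_id as the only required field are assumptions made for the example.

```python
import logging

log = logging.getLogger("ingest")


def coerce_account(raw: dict):
    """Return a clean {user_id, props} record, or None if user_id is unusable."""
    try:
        user_id = int(raw["user_id"])  # type coercion, e.g. "42" -> 42
    except (KeyError, TypeError, ValueError):
        return None
    # Drop null values so they do not overwrite existing properties on merge.
    props = {k: v for k, v in raw.items() if k != "user_id" and v is not None}
    return {"user_id": user_id, "props": props}


def batches(records, size: int = 500):
    """Yield lists of validated records, logging and skipping invalid ones."""
    batch = []
    for raw in records:
        record = coerce_account(raw)
        if record is None:
            log.warning("skipping record without a usable user_id: %r", raw)
            continue
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Each yielded list can be passed as the $batch parameter of the MERGE statement, so a bad record costs one log line rather than a failed load.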

That is the whole automated model: one account, the channels it touches, and the three typed edges between them. The graph below shows that smallest unit as an interactive view. Hover or click a node or an edge to inspect the properties each one carries.

A representation of the Constellation graph model, reconstructed for this post and not a live system.

From a collection graph to an attribution graph

The pipeline above produces a faithful map of accounts and channels. What it does not do, and cannot do, is tell me who is behind them. That is the part no scraper can automate, and it is where the actual intelligence work happens.

On top of the collected TelegramAccount and TelegramChannel data, I maintain a second, manually curated layer in the same Neo4j graph. It introduces node types the collector never creates. An Actor is a tracked human operator, an Identity is a real-world person where one has been established, and a ThreatGroup is the group itself. Alongside these sit account nodes for the other platforms the same actors use, such as Discord, YouTube, Doxbin and TikTok. They are joined by a small, deliberate relationship grammar:

an Actor is IDENTIFIED_AS a real-world Identity;
an Actor OPERATES the accounts attributed to them, across any platform;
an Actor is a MEMBER_OF, FOUNDER_OF, RECRUITER_OF or STAFF_OF a ThreatGroup;
ThreatGroups are AFFILIATED with one another; and
a ThreatGroup OPERATES the channels that belong to it.

Every node in this layer carries its own provenance, recording who created it, when, a confidence level, and a version. This is because in attribution work the basis for a claim matters as much as the claim itself. The attribution layer is the join between collection and intelligence. The automated graph records which accounts administer which channels, and the attribution layer records what I have concluded about who is behind them, and how confident I am in that conclusion.

The graph below is a worked example built from mock data. A single tracked actor, here called “Driftwood”, sits at the centre, connected out to the groups they belong to, an identity behind them, and the accounts they operate across Telegram, Discord and YouTube. One Telegram account, palecedarback, acts as the operational hub for a cluster of channels. The names are fictional, but the shape is exactly what an attribution graph looks like in practice. Click any node to open its full property and provenance panel.

A worked attribution graph built from mock data, with fictional entities throughout.

Pivoting and clustering

With the network in Neo4j, the analysis can begin. The tool builds the substrate, but the intelligence comes from the questions I ask of it. The Cypher pivots below are the ones I lean on most, and I have paired each one with a small interactive view of the shape it surfaces.

Accounts administering multiple channels. A single-channel administrator is common and tells me very little. An account administering several is closer to infrastructure, and worth a closer look.

MATCH (a:TelegramAccount)-[:ADMIN_OF|OWNER_OF]->(c:TelegramChannel)
WITH a, count(DISTINCT c) AS channels, collect(c.channel_title) AS where_
WHERE channels > 1
RETURN a.username, channels, where_
ORDER BY channels DESC

A representation of the multi-channel-admin pivot: one account that owns or administers several channels.

Channels linked by shared administration. Two channels that present as unrelated, but are administered by the same accounts, are for attribution purposes a single operation.

MATCH (c1:TelegramChannel)<-[:ADMIN_OF|OWNER_OF]-(a:TelegramAccount)-[:ADMIN_OF|OWNER_OF]->(c2:TelegramChannel)
WHERE id(c1) < id(c2)
WITH c1, c2, count(DISTINCT a) AS shared_admins, collect(a.username) AS who
RETURN c1.channel_title, c2.channel_title, shared_admins, who
ORDER BY shared_admins DESC

A representation of the shared-administration pivot: two channels bridged by the accounts that administer both.

Channels linked by overlapping membership. This is a weaker signal than shared administration, but at volume it reliably outlines the satellite channels around a core. The view below is drawn from a mock collection of the same shape: two chatrooms and the accounts that hold membership in both.

MATCH (c1:TelegramChannel)<-[:MEMBER_OF]-(u:TelegramAccount)-[:MEMBER_OF]->(c2:TelegramChannel)
WHERE id(c1) < id(c2)
WITH c1, c2, count(DISTINCT u) AS shared_members
WHERE shared_members > 10
RETURN c1.channel_title, c2.channel_title, shared_members
ORDER BY shared_members DESC

A representation built from mock collection data: two channels and a sample of the members they share.

Expansion from a single indicator. This is the routine pivot. I begin from one attributed account and expand outwards to the channels it touches, and then to the accounts that touch those channels.

MATCH (seed:TelegramAccount {username: $known_actor})-[r]->(c:TelegramChannel)<-[r2]-(neighbour:TelegramAccount)
RETURN seed, r, c, r2, neighbour

A representation of single-seed expansion: one attributed account, the channels it touches, and the neighbours that appear alongside it.

The process is iterative. A surfaced cluster produces new accounts of interest, so I re-run collection around them, re-ingest the graph, and the picture sharpens. I then combine the structural signals with temporal ones, such as channels whose estimated creation dates fall in the same window, or accounts whose version history shows them moving together, so that weak indicators accumulate into a defensible assessment.
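As one example of a temporal signal, the sketch below groups channels whose estimated creation dates sit close together. It is an illustration only: the seven-day window, the dictionary shape and the function name are assumptions, and the grouping is chained, so each channel needs to be within the window of the previous one rather than of the first in the group.

```python
from datetime import datetime, timedelta


def creation_windows(channels, window: timedelta = timedelta(days=7)):
    """Group channels whose estimated creation dates fall close together."""
    dated = sorted(
        (c for c in channels if c.get("creation_date") is not None),
        key=lambda c: c["creation_date"],
    )
    groups = []
    for channel in dated:
        previous = groups[-1][-1]["creation_date"] if groups else None
        if previous is not None and channel["creation_date"] - previous <= window:
            groups[-1].append(channel)   # continues the current run
        else:
            groups.append([channel])     # starts a new run
    # A run of one channel carries no temporal signal, so drop it.
    return [group for group in groups if len(group) > 1]


# Usage with mock data.
mock_channels = [
    {"title": "alpha", "creation_date": datetime(2024, 1, 1)},
    {"title": "beta", "creation_date": datetime(2024, 1, 3)},
    {"title": "gamma", "creation_date": datetime(2024, 2, 1)},
    {"title": "delta", "creation_date": None},
]
windows = creation_windows(mock_channels)
```

Because the dates are estimates, a shared window is best read as one more weak indicator to set beside the structural ones, not as a finding in its own right.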

Figure 4: Clustering accumulates independent indicators around a seed. The assessment rests on their combination, not on any single edge.

This is the substance of clustering. No single edge in the graph identifies anyone. A username is deniable, an administrative right is deniable, and a creation timestamp can be coincidence. But an account that co-administers several channels alongside an already-attributed account, was established in the same narrow window as those channels, and shares a large proportion of its membership with them, is no longer plausibly a coincidence. The combination is the finding.

It is worth seeing what this looks like at scale. The graph below is a representation of a single mock collection: roughly 760 accounts across five channels, with a thin attribution layer laid over the top - the groups that operate the channels, and a handful of actors tied to the accounts that administer them, reaching out in turn to identities and to accounts on other platforms. The channels sit as bright hubs, privileged accounts are drawn larger, and the whole graph carries the same node types as the rest of the post. Hover or click any node to inspect it.

A representation of a mock collection of roughly 760 accounts across five channels, with a mock attribution layer over the top, rendered as a standalone visualisation rather than the live Neo4j graph.

Handling and limitations

A few constraints are worth stating plainly. Creation dates are estimates, not facts, and I treat them that way. Membership reconstructed from message history understates true membership and omits inactive participants. Clusters are hypotheses to be tested against further evidence, not conclusions, and the deliverable is the analyst’s judgement, not the raw query output. Collection scope, retention and access are governed accordingly, and the relational store is handled as sensitive evidence with its provenance attached. Mapping a network is an intelligence function, and doing it responsibly, by minimising collateral collection, protecting victim data, and passing actionable assessments to those positioned to act lawfully, is a requirement rather than an afterthought.

Conclusion

The difficulty in investigating networks like this was never reading their messages. It was holding on to the shape of a structure that fragments faster than I can track it by hand. Constellation does not automate attribution, because nothing can. What it does is collect entities faithfully, model them as a graph, and make pivoting cheap, so that a scattered set of channels and disposable accounts becomes something I can reason about systematically.

Clustering, in the end, is the discipline of refusing to trust any single indicator while taking seriously what many of them say together. The graph is simply where that reasoning is made explicit.

Neo4j | Jacob Larsen