Between February 2022 and September 2025, Bellingcat staff and volunteers collected, geolocated, and shared more than 2,500 incidents of civilian harm following Russia’s full-scale invasion of Ukraine.
As part of this effort, Bellingcat tested a new machine learning model intended to rank Telegram social media posts on their likelihood of containing incidents of civilian harm.
This novel methodology dramatically reduced the search and selection time required, freeing researchers to focus on verifying incidents of civilian harm – not just searching for them.
This piece documents our methodology, ethical considerations and lessons learned in the hope that others researching similar topics can benefit from our work.
Open source research into civilian harm is still a relatively new field and it presents many challenges. One of the biggest is organising and sorting through the huge volume of user-generated content to find what is relevant.
Machine learning, a form of artificial intelligence that uses algorithms to identify patterns from large amounts of data and make predictions, can make this task more efficient.
With ongoing conflicts causing large-scale civilian harm in Sudan and much of the Middle East, this guide aims to offer those covering these conflicts an example of how machine learning can be used to help find and sort incidents. You can also access the Code Notebook for our model here.
We defined “civilian harm” not just as civilian deaths or injuries resulting from armed conflict, but also the broader and delayed effects on civilians from mental trauma, loss of livelihood, displacement, destruction of infrastructure and more. This definition was informed by the Protection of Civilians book on civilian harm.
Initial Telegram Dataset
Every Telegram post that researchers had already manually verified as containing civilian harm went into an initial dataset of confirmed cases, which data scientists call positive instances. We collected a total of 5,848 unique URLs for these Telegram posts. For our manual collection, we reviewed posts on relevant Telegram channels, working from oldest to newest each day. If a post made it onto our geolocated incidents list, the researcher who flagged it had also looked at the posts that appeared before and after it on Telegram and had not flagged them. We therefore selected the 10 posts surrounding each verified civilian harm post as our additional dataset of posts that did not contain civilian harm. After excluding any deleted or duplicate posts, we ended up with 48,545 non-civilian harm posts, our negative instances.
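The neighbour-sampling step described above can be sketched as follows. This is a minimal illustration, not the code from our notebook: it assumes public post URLs of the form https://t.me/<channel>/<post_id> with sequential IDs, and it assumes "the 10 posts surrounding" means five before and five after. The function name, the `window` parameter and the channel name are hypothetical.

```python
def sample_negative_urls(positive_urls, window=5):
    """Collect the posts around each verified post as candidate negatives.

    Assumption: a URL ends in a numeric post ID and neighbouring IDs belong
    to neighbouring posts in the same channel.
    """
    positives = set(positive_urls)
    negatives = set()  # a set removes duplicate neighbours automatically
    for url in positives:
        channel_url, _, post_id = url.rpartition("/")
        post_id = int(post_id)
        for offset in range(-window, window + 1):
            neighbour_id = post_id + offset
            if offset == 0 or neighbour_id < 1:
                continue
            candidate = f"{channel_url}/{neighbour_id}"
            # Never label a verified civilian harm post as a negative.
            if candidate not in positives:
                negatives.add(candidate)
    return sorted(negatives)


positive_urls = [
    "https://t.me/example_channel/100",
    "https://t.me/example_channel/103",
]
negative_urls = sample_negative_urls(positive_urls)
print(len(negative_urls), "candidate negative posts")
```

Deleted posts cannot be detected from the URL alone, so in practice they would be dropped later, once each URL has been looked up.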
We chose to overrepresent negative instances to better reflect the real world and to increase the data available for model training.
We enriched each URL with metadata from the Telegram API, such as the time of publication, reactions and textual content. As some of these posts had been deleted, we filled in the missing data points with previously preserved versions from our Auto Archiver database, which were only available for the positive instances.
Feature Engineering
Training a machine learning model requires numerical data, as these models compute a prediction score based on mathematical operations.
We produced this numerical data by converting raw data from our initial dataset, such as keywords signalling potential civilian harm, into numerical scores (or “features”) that the model could interpret, with the aim of increasing the model’s ability to identify patterns. This process, known as feature engineering, can significantly improve model results because it allows data scientists to supply explicit domain knowledge.
A full list of features we used to train the model can be found in the code notebook accompanying this piece. Many features were directly inspired by researchers’ input from their experiences manually screening cases of civilian harm by sorting through a set number of Telegram channels and inspecting each post individually.
Several features were built directly from the metadata contained in each Telegram post, including media_type and day_of_week, as well as binary ones: forwarded, edited and reply_to.
Other features captured engagement information: views, forwards, total_reactions, and even individual features for the most used emojis, such as reaction_crying_face to count 😭 emoji.
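As a rough illustration of how such metadata features might be derived, the sketch below turns one post into a flat dictionary. The output keys are the feature names mentioned above, but the shape of the input `post` dictionary (keys such as `forwarded_from`, `edit_date` and `reply_to_id`) is a hypothetical simplification and not the actual Telegram API response.

```python
from datetime import datetime


def metadata_features(post):
    """Build metadata and engagement features from one post (assumed shape)."""
    published = datetime.fromisoformat(post["date"])
    reactions = post.get("reactions", {})  # assumed: emoji -> count
    return {
        "media_type": post.get("media_type", "none"),
        "day_of_week": published.weekday(),  # Monday is 0, Sunday is 6
        # Binary features: 1 if the property is present, else 0.
        "forwarded": int(post.get("forwarded_from") is not None),
        "edited": int(post.get("edit_date") is not None),
        "reply_to": int(post.get("reply_to_id") is not None),
        # Engagement features.
        "views": post.get("views", 0),
        "forwards": post.get("forwards", 0),
        "total_reactions": sum(reactions.values()),
        "reaction_crying_face": reactions.get("😭", 0),
    }


example_post = {
    "date": "2023-11-14T22:13:20+00:00",
    "media_type": "photo",
    "forwarded_from": "another_channel",
    "edit_date": None,
    "reply_to_id": None,
    "views": 12000,
    "forwards": 85,
    "reactions": {"😭": 40, "🙏": 15},
}
features = metadata_features(example_post)
print(features)
```

A categorical feature such as media_type would still need to be encoded as numbers, for example with one column per category, before a model could use it.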
Converting Text to Numbers
To embed the experience from the manual collection process, researchers put together a list of keywords in both Ukrainian and Russian that, to them, signalled posts likely to show civilian harm. For instance, “Шахед” and “КАБ” translate to “Shahed” and “guided aerial bomb” respectively. We created a numerical feature to count their frequency.
In addition, we included several generic English-language keywords that meaningfully signalled potential civilian harm, such as “injured”, “school affected” and “hospital affected”. These were only used for generating semantic similarity scores.
A semantic similarity score is a calculation used to determine the proximity in meaning between different words and phrases. To get the semantic similarity between the post text and each of our keywords, we represented each as a list of numbers via a Sentence Transformer model, which converts text into numerical representations called vectors that a computer can understand.
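A keyword frequency feature of this kind could be computed along these lines. This is a deliberately naive sketch: it assumes case-insensitive, whole-word matching, so inflected forms of a keyword would be missed, and it is not necessarily how the feature in our notebook is built. The function name and the example sentence are made up for illustration.

```python
import re


def count_keywords(text, keywords):
    """Count case-insensitive whole-word occurrences of each keyword."""
    # \w+ matches runs of letters and digits, including Cyrillic characters.
    tokens = re.findall(r"\w+", text.lower())
    return {kw: tokens.count(kw.lower()) for kw in keywords}


keywords = ["Шахед", "КАБ"]
post_text = "Шахед атакував місто, ще один шахед збито. КАБ влучив у будинок."
counts = count_keywords(post_text, keywords)
keyword_count = sum(counts.values())  # a single numerical feature
print(counts, keyword_count)
```

Matching whole tokens avoids counting a short keyword such as “КАБ” inside an unrelated longer word, at the cost of missing plurals and other grammatical forms.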
We then calculated the level of similarity between each vector using cosine similarity, one of the most popular methods for measuring similarity between two pieces of text.
Due to how embeddings work, this calculation results in a figure on a scale from -1 to 1, where 1 indicates the same meaning and values near 0 or below indicate little or no semantic proximity. For example, the words “hurt” and “injured” would have a high similarity score, while “residential” and “injured” would have a low, possibly negative, score as the words are not semantically similar.
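To make the cosine similarity calculation concrete, here is a minimal sketch using toy two-dimensional vectors. The vectors are invented for illustration; real sentence embeddings have hundreds of dimensions and would come from the Sentence Transformer model rather than being typed in by hand.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score close to 1, whatever their length.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Vectors at right angles score 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
# Vectors pointing in opposite directions score close to -1.
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(same_direction, orthogonal, opposite)
```

Because the score depends only on direction and not on vector length, a long post and a short keyword can still be compared on the same scale.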
Finally, to enable the model to identify the relevance of each post to civilian harm in Ukraine, we used a multilingual text transformer from the BERT family of language models to represent the entire post’s text as a vector of 768 numerical values. This model can efficiently represent text from many languages in a way that captures meaning: the same sentence in different languages will generate similar embeddings, and trained machine learning models can detect patterns in the embeddings.
It is important to note that for this initial prototype of a civilian harm detection model, we did not include any features derived from media content such as photos and videos, although that would be a logical next step in attempting to improve model performance.
Selecting, Training and Evaluating Models
With 54,393 rows of 893 numerical features each, we selected four machine learning algorithms to train our predictive models.
We chose Logistic Regression as a baseline algorithm due to its simplicity. We also selected three other “best in class” models, Random Forest, XGBoost, and LightGBM. These choices centred on the interpretability of the models and their ability to work on tabular data of this size. For example, we avoided neural networks due to a lack of interpretability and because those models work best with a larger dataset.
To genuinely assess the performance of the trained models, we split our dataset into three parts.
We used a stratified split to divide the dataset instead of a random split. This method ensured the proportion of positive instances (i.e. confirmed cases of civilian harm) remained consistent across all three sets at about 11 percent.
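The idea behind a stratified split can be sketched in plain Python as below. The 70/15/15 proportions, the part names and the function name are assumptions made for this example only; the toy labels simply mimic a dataset with about 11 percent positive instances.

```python
import random
from collections import defaultdict


def stratified_three_way_split(labels, fractions=(0.7, 0.15, 0.15), seed=42):
    """Split row indices into three parts, preserving the class balance.

    Assumption: the fractions are illustrative, not the ones we used.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    first, second, third = [], [], []
    # Split each class separately so every part keeps the same proportions.
    for indices in by_class.values():
        rng.shuffle(indices)
        n_first = int(len(indices) * fractions[0])
        n_second = int(len(indices) * fractions[1])
        first += indices[:n_first]
        second += indices[n_first:n_first + n_second]
        third += indices[n_first + n_second:]
    return first, second, third


def positive_rate(labels, indices):
    return sum(labels[i] for i in indices) / len(indices)


labels = [1] * 110 + [0] * 890  # toy data: 11 percent positives
train_idx, val_idx, test_idx = stratified_three_way_split(labels)
for name, part in [("train", train_idx), ("validation", val_idx), ("test", test_idx)]:
    print(name, len(part), round(positive_rate(labels, part), 3))
```

With a purely random split, a small part could by chance end up with noticeably more or fewer positives than the others, which would make evaluation scores harder to compare.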
To measure the performance of machine learning models, we ran them on the test set and counted the correct and incorrect predictions. Models output a likelihood between 0 and 1 that each Telegram post contains civilian harm, and we looked for a cut-off threshold that strikes a good balance between flagging almost every post (0.1) and flagging very few (0.9).
Two main evaluation metrics gauge a model’s predictive power. Recall measures the fraction of positive instances (i.e. known civilian harm posts) that were correctly flagged as such. Precision measures the fraction of posts flagged as civilian harm that are indeed civilian harm posts.
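The trade-off between the cut-off threshold, precision and recall can be illustrated with a small worked example. The labels and scores below are invented toy values, not model output, and the function name is hypothetical.

```python
def precision_recall_at(labels, scores, threshold):
    """Precision and recall when flagging every post scored >= threshold."""
    flagged = [score >= threshold for score in scores]
    tp = sum(1 for y, f in zip(labels, flagged) if f and y == 1)
    fp = sum(1 for y, f in zip(labels, flagged) if f and y == 0)
    fn = sum(1 for y, f in zip(labels, flagged) if not f and y == 1)
    # Convention chosen here: report 0.0 when the denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data: 1 = civilian harm post, 0 = other post.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.6, 0.2, 0.7, 0.3, 0.15, 0.08, 0.05, 0.02, 0.01]

for threshold in (0.1, 0.5, 0.9):
    precision, recall = precision_recall_at(labels, scores, threshold)
    print(threshold, round(precision, 2), round(recall, 2))
```

On this toy data the low threshold catches every positive post but half of the flagged posts are irrelevant, while the high threshold flags only one post, which is correct, and misses the other two.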
During the training phase, we tuned the models to maximise average precision (PR-AUC), a metric that summarises precision across all recall levels. While this method also accounts for precision, it prioritises recall, which is preferable for this use case as it steers model selection to reduce the number of civilian harm posts that are skipped.
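For readers unfamiliar with average precision, the sketch below computes it on invented toy data. It assumes there are no tied scores, in which case average precision is the mean of the precision values measured at the rank of each positive instance; library implementations handle ties more carefully.

```python
def average_precision(labels, scores):
    """Average precision for binary labels, assuming no tied scores."""
    ranked = sorted(zip(scores, labels), reverse=True)  # best score first
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision among the top `rank` posts, taken at each positive.
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


# Toy data: 1 = civilian harm post, 0 = other post.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.6, 0.2, 0.7, 0.3, 0.15, 0.08, 0.05, 0.02, 0.01]

ap = average_precision(labels, scores)
prevalence = sum(labels) / len(labels)
print(round(ap, 3), prevalence)
```

A predictor that ranks posts at random is expected to score roughly the share of positive instances in the data, so the prevalence printed alongside gives a sense of what an uninformative model would achieve.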
The following table sorts models from best to worst PR-AUC against a baseline of a coin-flip predictor. ROC-AUC and F1 are two other evaluation metrics included as sanity checks. Simply put, ROC-AUC measures the probability of correctly ranking a pair of instances, one negative and one positive; F1 balances precision and recall equally and is reported at its best cut-off threshold value.
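The two sanity-check metrics can be sketched directly from these definitions. The code below uses invented toy data and hypothetical function names; it compares every positive-negative pair for ROC-AUC, which is fine for a small example but too slow for a full dataset.

```python
def roc_auc_pairwise(labels, scores):
    """Share of (positive, negative) pairs in which the positive scores higher."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


def best_f1(labels, scores):
    """Highest F1 over all candidate thresholds, with the threshold reaching it."""
    best_score, best_threshold = 0.0, None
    for threshold in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_score:
            best_score, best_threshold = f1, threshold
    return best_score, best_threshold


# Toy data: 1 = civilian harm post, 0 = other post.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.6, 0.2, 0.7, 0.3, 0.15, 0.08, 0.05, 0.02, 0.01]

auc = roc_auc_pairwise(labels, scores)
f1, f1_threshold = best_f1(labels, scores)
print(round(auc, 3), f1, f1_threshold)
```

On this toy data, 18 of the 21 positive-negative pairs are ranked correctly, and the best F1 is reached by flagging every post scored 0.2 or higher.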




