SNDM Project
Brief description
The main goal of the project was to investigate approaches to mining, monitoring and analyzing communications between the employees of a given company. The project resulted in a prototype of a DLP-like (data loss prevention) system. The main functions of the prototype include mining the information flow, building and classifying user profiles, and detecting unusual events inside the community.
Dataset
Data collected from the corporate intranet instant messenger was used as input.
Main characteristics of the dataset:
- Logs for the period of 142 days;
- 242 active users;
- On average, >500 messages/user.
Some results
Messaging intensity statistics
The average daily message flow looks like this:
Using this chart as a baseline, several text-based patterns can be revealed by plotting subsets of messages that contain specific words. For example, adding the statistics for greetings and farewells to the chart above gave the following chart:
The three peaks of "hello" messages revealed the three-shift working cycle of the organization.
The search for the words "yesterday" and "tomorrow" gave the following chart:
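The baseline and keyword curves described above can be sketched as hourly histograms. This is only a minimal illustration: the `(timestamp, text)` message format, the function name `hourly_counts`, the keyword sets and the toy messages are all assumptions made for the example and are not taken from the project's data or code.

```python
import re
from collections import Counter
from datetime import datetime

def hourly_counts(messages, keywords=None):
    """Count messages per hour of day (0..23).

    messages: iterable of (timestamp, text) pairs (assumed format).
    keywords: optional set of lowercase words; if given, only messages
              containing at least one of them are counted.
    """
    counts = Counter()
    for timestamp, text in messages:
        words = set(re.findall(r"[a-z']+", text.lower()))
        if keywords is None or words & keywords:
            counts[timestamp.hour] += 1
    return [counts.get(hour, 0) for hour in range(24)]

# Invented toy messages, for illustration only.
messages = [
    (datetime(2000, 1, 3, 8, 5), "hello everyone"),
    (datetime(2000, 1, 3, 8, 40), "the report is ready"),
    (datetime(2000, 1, 3, 16, 2), "hello, are you there?"),
    (datetime(2000, 1, 4, 8, 15), "see you tomorrow"),
]

baseline = hourly_counts(messages)                             # all messages
greetings = hourly_counts(messages, keywords={"hello", "hi"})  # greeting subset
tomorrow = hourly_counts(messages, keywords={"tomorrow"})
```

Plotting `greetings` or `tomorrow` on top of `baseline` gives the kind of overlay chart discussed in this section.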
Behaviour segmentation
Users were clustered by their intraday activity curves, which revealed two groups: the core "typical" users and the anomalous ones (with large deviations in behaviour).
A closer examination showed that everybody in the anomalous group was either an IT specialist or an employee of the internal security department :)
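One simple way to separate "typical" from anomalous users is to compare each user's normalized 24-hour activity profile with the mean profile. The sketch below uses a plain distance threshold instead of a real clustering algorithm; the source does not say which method was used, so the approach, the function names, the threshold and the toy events are all assumptions.

```python
from collections import defaultdict
from math import sqrt

def activity_profiles(events):
    """events: iterable of (user, hour) pairs (assumed format).
    Returns user -> 24-bin profile normalized to sum to 1."""
    raw = defaultdict(lambda: [0] * 24)
    for user, hour in events:
        raw[user][hour] += 1
    return {u: [c / sum(bins) for c in bins] for u, bins in raw.items()}

def split_by_deviation(profiles, threshold):
    """Split users by Euclidean distance from the mean profile.
    Users farther than `threshold` are reported as anomalous."""
    n = len(profiles)
    mean = [sum(p[h] for p in profiles.values()) / n for h in range(24)]
    typical, anomalous = [], []
    for user, profile in profiles.items():
        dist = sqrt(sum((a - b) ** 2 for a, b in zip(profile, mean)))
        (anomalous if dist > threshold else typical).append(user)
    return sorted(typical), sorted(anomalous)

# Invented toy events: three daytime users and one night-time user.
events = [(u, h) for u in ("a", "b", "c") for h in (9, 10, 14)]
events += [("night", 2), ("night", 3)]

typical, anomalous = split_by_deviation(activity_profiles(events), threshold=0.5)
```

On real data the threshold would have to be tuned, and a proper clustering method (for example k-means over the same profiles) may separate the groups more robustly.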
Communication graph microclustering
As the next step, link thresholding was applied, revealing the narrow communication circle of each user. Overlaps of such circles could form stable closures (such as couples or more complex groups of users) or bring out one-way dependencies between users.
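Link thresholding and the two kinds of overlap can be illustrated as follows. The thresholding rule used here (keep a contact if it receives at least a given share of the user's outgoing messages) is an assumption, since the source does not state the actual criterion; the function names and toy messages are hypothetical as well.

```python
from collections import Counter, defaultdict

def narrow_circles(messages, min_share=0.3):
    """messages: iterable of (sender, recipient) pairs (assumed format).
    Returns user -> set of contacts that receive at least `min_share`
    of that user's outgoing messages."""
    sent = defaultdict(Counter)
    for sender, recipient in messages:
        sent[sender][recipient] += 1
    circles = {}
    for user, counts in sent.items():
        total = sum(counts.values())
        circles[user] = {peer for peer, n in counts.items() if n / total >= min_share}
    return circles

def classify_links(circles):
    """Mutual links: each user is in the other's circle (a closure of size two).
    One-way links: only the first user has the second in their circle."""
    mutual, one_way = set(), set()
    for user, peers in circles.items():
        for peer in peers:
            if user in circles.get(peer, set()):
                mutual.add(frozenset((user, peer)))
            else:
                one_way.add((user, peer))
    return mutual, one_way

# Invented toy traffic.
messages = (
    [("ann", "bob")] * 5 + [("bob", "ann")] * 4 + [("bob", "eve")]
    + [("carl", "ann")] * 3 + [("ann", "carl")] + [("carl", "dave")]
)
mutual, one_way = classify_links(narrow_circles(messages))
```

In this toy example `ann` and `bob` form a mutual pair, while `carl` depends on `ann` without reciprocation.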
Users were additionally coloured by gender, as predicted by a naive Bayes classifier built on text and behaviour features (the measured precision was about 89%, while the recall was about 68%).
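For reference, precision and recall for one class can be computed as in the sketch below. It only illustrates how the two metrics are defined; which class was treated as positive in the reported figures is not stated in the source, and the labels here are invented.

```python
def precision_recall(predicted, actual, positive):
    """Precision and recall of `positive` given parallel label lists."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)  # true positives
    fp = sum(p == positive and a != positive for p, a in pairs)  # false positives
    fn = sum(p != positive and a == positive for p, a in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented toy labels.
predicted = ["f", "f", "m", "m", "f"]
actual = ["f", "m", "f", "m", "f"]
precision, recall = precision_recall(predicted, actual, positive="f")
```

High precision with lower recall, as reported above, means that predictions of the positive class were usually correct but a noticeable share of its members was missed.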
Stable closures
Most of the revealed pairs were later classified by the community expert as married couples or people bound by friendship or other out-of-office relationships.
One-way dependencies
Microclusters dynamics
As the next step, a special multilayered layout of sequenced time slices of the communication graph was built to visualize the dynamics of these structures.
This technique was helpful for investigating the processes of couple formation and disruption.
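The data behind such a time-sliced view can be sketched by cutting the log into fixed windows and recording in which windows a pair communicates in both directions. The weekly window, the `(day_index, sender, recipient)` format and the function name are assumptions for illustration; the layout itself is not reproduced here.

```python
from collections import defaultdict

def pair_timeline(messages, slice_days=7):
    """messages: iterable of (day_index, sender, recipient) (assumed format).
    Returns (user_a, user_b) -> sorted list of slice indices in which
    messages were observed in both directions."""
    directed = defaultdict(set)  # slice index -> set of (sender, recipient)
    for day, sender, recipient in messages:
        directed[day // slice_days].add((sender, recipient))
    timeline = defaultdict(list)
    for index in sorted(directed):
        for a, b in directed[index]:
            # Visit each unordered pair once and require both directions.
            if a < b and (b, a) in directed[index]:
                timeline[(a, b)].append(index)
    return dict(timeline)

# Invented toy log covering three weekly slices.
messages = [
    (0, "ann", "bob"), (1, "bob", "ann"),
    (8, "ann", "bob"), (9, "bob", "ann"),
    (15, "ann", "bob"),
    (16, "carl", "dave"), (17, "dave", "carl"),
]
timeline = pair_timeline(messages)
```

A pair that appears in early slices and then vanishes suggests disruption, while a pair that first appears in a late slice suggests formation.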
Information percolation
Several approaches to clustering the users based on percolation processes on the communication graph were studied. As a result, several main clusters of information exchange were identified, along with the most important members of each.
Individual rhythms of communication
Using a slightly modified authorlines formalism helped to discover several distinct communication styles among the given set of users. Here are some of them:
Work communication
Key features:
- Frequent dialogues with different users.
- Short dialogues prevail.
Personal communication
Key features:
- The majority of communications are with the same user.
- The presence of protracted dialogues.
"Search for a partner"
Key features:
- Irregular communication.
- Conversations with lots of different people (the opposite sex prevails).
- Protracted dialogues are frequent.
- The number of outbound messages in the dialogue is often significantly higher than the number of incoming messages.
- Dialogues are on average more often initiated by the given user than by the interlocutor.
Nonreciprocal interest
Key features:
- Irregular dialogues with the same companion.
- Almost all of the dialogues are initiated by the user.
- The number of outbound messages in the dialogue is often significantly higher than the number of incoming messages.
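The key features listed for these styles can be approximated with a few per-user counters. The sketch below is not the authorlines formalism itself; it is a simplified illustration that assumes a `(minute, sender, recipient)` log sorted by time and treats a silence longer than `gap_minutes` between two users as the start of a new dialogue. All names and the toy log are hypothetical.

```python
from collections import Counter

def dialogue_features(messages, user, gap_minutes=30):
    """Summarize the dialogues of `user` from a time-sorted message log."""
    last_seen = {}         # partner -> minute of the last message exchanged
    dialogues = Counter()  # partner -> number of dialogues with that partner
    initiated = outbound = inbound = 0
    for minute, sender, recipient in messages:
        if user not in (sender, recipient):
            continue
        partner = recipient if sender == user else sender
        if partner not in last_seen or minute - last_seen[partner] > gap_minutes:
            dialogues[partner] += 1   # a new dialogue starts here
            if sender == user:
                initiated += 1        # and `user` opened it
        last_seen[partner] = minute
        if sender == user:
            outbound += 1
        else:
            inbound += 1
    total = sum(dialogues.values())
    return {
        "dialogues": total,
        "partners": len(dialogues),
        "top_partner_share": max(dialogues.values()) / total if total else 0.0,
        "initiated_share": initiated / total if total else 0.0,
        "outbound": outbound,
        "inbound": inbound,
    }

# Invented toy log: ann keeps writing to bob and rarely gets an answer.
messages = [
    (0, "ann", "bob"), (1, "ann", "bob"), (2, "bob", "ann"),
    (100, "ann", "bob"), (101, "ann", "bob"),
    (300, "ann", "bob"),
]
features = dialogue_features(messages, "ann")
```

In this toy case a single partner, a high `initiated_share` and many more outbound than inbound messages match the "nonreciprocal interest" pattern, whereas many partners with short dialogues would point to work communication.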
Overall results
- The behaviour of the community's users was studied; different types of users were detected and described.
- Several periodic cycles of the users' activity were determined and analyzed.
- The common paths of the information propagation through the community were detected.
- The stable microsegments of users were discovered and stable bonds between users were classified.
- A number of internal enterprise events were detected.