Tag Archives: Gephi

Week 3 #dalmooc assignment – Twitter vs Blogs in CCK11 SNA in Gephi

Interpreting SNA feels like telling “just so” stories about how leopard got its spots. Or is it just me:)

Time to bring the week 3 exercises (Gephi SNA part 1 and part 2) together for week 3 assessment, asking us to “compare the two networks (Twitter vs. blogs) for week 12 of CCK11” (for description of the data set see part 1).

Health warning – this is a rather amateurish effort and felt more like telling stories than drawing hard conclusions. But – hey – from what I have seen so far much SNA stuff is very much about “just so stories” and rather overimaginative interpretation of patterns and associations. If you were to believe the rumours, such an approach should be familiar territory for an evolutionary biologist like me…and so the process made me feel almost at home;)

Density and centrality measures

The Twitter network had slightly lower density but slightly higher average degree than the blog network (density 0.001 vs 0.03; average degree 2.7 vs 2.01).

Density was quite low in both networks, pointing to a low level of overall community integration and low potential for the spread of information. However, it seems that the blog network would have a marginal advantage over Twitter here.

Low density indicates that using either system for important announcements would be inefficient. Participants or subgroups may feel isolated from the others and may find it difficult to find relevant information or connections. The slight differences in density and centrality may be a result of the inherent properties of the two tools. Twitter is inherently easier to use for quick communication and making connections due to the low level of effort required to post in comparison to blogs. The latter allow for more in-depth exchanges.

But I immediately want to ask – what were the course team strategies for use of the two communication tools? Did these strategies have an impact on the network characteristics? Did they override or aid the influence of the tool affordances?

Broker and central node overlap

Identify any overlap in the main brokers and central nodes across the two networks?

What does it mean to have brokers and central nodes in common for both networks?

If you haven’t identified central nodes in common for both networks, what does it mean in terms of integration of different social media used in a single course?

Now – I had fun with this one (and of course spent disproportionately too much time on wrangling the data and the software into semi-submission). As formal analysis for this type of question was not covered in the lectures, I assumed that identification of “any overlap in the main brokers and central nodes across the two networks” was to be done by eyeballing alone.

OK – so I discovered that the unique id for each participant was coded in the label column within Data Lab.

I could display Labels on nodes with the highest betweenness scores (the brokers) and highest centrality scores in the visualisation of each network and simply check if they are the same.
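For anyone wondering what the betweenness score actually counts: a minimal Python sketch, assuming an undirected, unweighted network stored as a plain adjacency dictionary. Gephi computes this for you, so the function name `betweenness` and the toy `star` network are purely illustrative. It follows the Brandes approach and returns unnormalised scores, i.e. the number of shortest paths between other pairs of nodes that pass through each node.

```python
from collections import deque


def betweenness(adj):
    """Unnormalised betweenness for an undirected, unweighted graph.

    adj maps each node to an iterable of its neighbours.
    """
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each pair was counted from both ends, so halve the totals.
    return {v: b / 2 for v, b in score.items()}


# Toy network: one hub connected to three otherwise unconnected nodes.
star = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
scores = betweenness(star)
```

In the toy star network the hub sits on every path between the three outer nodes, which is exactly the "broker" position.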

But anonymisation meant that the identifiers were long strings of alphanumerics. Not so easy to eyeball and compare for multiple brokers and central nodes then. Especially as Gephi does not support side-by-side display of the network plots! Wouldn’t it be nice if highlighting a node with a particular ID (or in this case Label) in one network also highlighted it in the second one?

OK – another idea was to dump both datasets in Excel and then use a simple sort by value of betweenness and centrality and check if there are any nodes in common among the highest values. Notwithstanding the issue of comparing long strings by eye, this would definitely lose the power of visualisation in presentation of my conclusions.
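The sort-and-compare idea is easy to sketch in Python, which at least avoids comparing long strings by eye. This is only an illustration: the labels and scores below are made up, and `top_k` and `shared_top_nodes` are hypothetical helper names, not anything from Gephi or Excel.

```python
def top_k(scores, k):
    """Return the labels of the k highest-scoring nodes."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def shared_top_nodes(scores_a, scores_b, k=10):
    """Labels that are in the top k of both networks."""
    return top_k(scores_a, k) & top_k(scores_b, k)


# Made-up betweenness values keyed by participant Label.
twitter_betweenness = {"id_a": 120.0, "id_b": 80.5, "id_c": 3.0, "id_d": 0.0}
blog_betweenness = {"id_b": 40.0, "id_x": 35.0, "id_a": 1.0, "id_y": 0.0}

shared = shared_top_nodes(twitter_betweenness, blog_betweenness, k=2)
```

The choice of `k` is arbitrary, so it is worth trying a few values before reading anything into the overlap.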

Ideally, I would want to display the nodes shared between the two networks within each diagram. Even more ideally, I would like to heatmap betweenness of the second network on the first one where nodes were sized according to the same measure. I would simply have to import the values for the second network into the Data Lab spreadsheet of the first one and use them for node colours!

This did not turn out to be as easy as it sounded… mostly because of idiosyncrasies of Gephi (oh the joys of hands-on problem-based discovery of new software!). This is what I did to produce a heatmap of blogger betweenness onto the Twitter network (and it does sound so basic when you put it down like this):

1. I exported the spreadsheets for the Twitter and blog networks from Gephi Data Lab.

2. I joined the two in Tableau with a left join (i.e. all Twitter nodes were preserved and data added only for the blog nodes which occurred in BOTH networks), making sure that the join was on Label (unique ID for each participant).

3. Selected the whole joined data table in Tableau and copied into Excel.

4. In Excel I deleted all the columns corresponding to the Twitter data (except for the Label, in order to cross-check IDs after import to Gephi). I also replaced all the nulls in the missing blog data with -40. Replacement was necessary as Gephi does not recognise columns with empty cells as valid for visualisation, and you cannot use letters as it misclassifies the data. I chose the value of -40 as this was not a real value and was low enough to be able to select a cut-off when colouring the nodes. Horrible hack I know! Saved as csv.

5. Imported spreadsheet into Gephi’s existing Data Lab spreadsheet for Twitter. Surprise, surprise, had to change data types for the imported columns (during the import!). It turns out that Degree, Component ID and Modularity Class are integers, but Eccentricity, Closeness and Betweenness are double. Don’t ask how I figured it out!

6. Now I sized the nodes of the Twitter network on Twitter network betweenness, and used betweenness values for blog network to colour the nodes (placing cut off point above -40 and as close as I could get it to +1ish). Here is the pretty picture:

Twitter network in CCK11 MOOC with nodes sized for betweenness centrality within Twitter network. Node colour corresponds to betweenness centrality of the node in the blog network. Dark blue=highest, light blue=lowest, yellow=nodes not shared between the two networks.
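The Tableau and Excel steps above (left join on Label, then fill the gaps with -40) could also be sketched in a few lines of Python using only the standard library. This is not what was done for the figure, just an alternative under stated assumptions: the column name `Betweenness`, the tiny inline CSV snippets and the helper `left_join_on_label` are all hypothetical, and the real Data Lab export has more columns.

```python
import csv
import io

MISSING = -40  # same sentinel as in the post: not a real betweenness value


def left_join_on_label(twitter_rows, blog_rows, blog_columns, missing=MISSING):
    """Keep every Twitter node; attach blog values where the Label matches."""
    blog_by_label = {row["Label"]: row for row in blog_rows}
    joined = []
    for row in twitter_rows:
        match = blog_by_label.get(row["Label"])
        out = {"Label": row["Label"]}
        for col in blog_columns:
            # Nodes absent from the blog network get the sentinel value.
            out["blog_" + col] = float(match[col]) if match else missing
        joined.append(out)
    return joined


# Stand-ins for the two exported spreadsheets.
twitter_csv = "Label,Betweenness\nid_a,120.0\nid_b,80.5\nid_c,3.0\n"
blog_csv = "Label,Betweenness\nid_b,40.0\nid_x,35.0\n"

twitter_rows = list(csv.DictReader(io.StringIO(twitter_csv)))
blog_rows = list(csv.DictReader(io.StringIO(blog_csv)))
joined = left_join_on_label(twitter_rows, blog_rows, ["Betweenness"])
```

Writing `joined` back out with `csv.DictWriter` would give a file with one numeric column per blog measure and no empty cells, which is what the import step needed.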


Now – all this fiddling did not leave much time for interpretation. (I also did not have time to do the same for blogs or centrality)

But here goes.

62/194 (32%) of participants who either blogged or commented on blogs also tweeted using the course hashtag. This indicates that at least some participants actively communicating via online social networks tended to use more than one network. The bloggers were more likely to use both (32% of bloggers overlapping vs 7% of Twitter users). This is not surprising, since Twitter is often used to advertise new blog posts by bloggers and many Twitter participants would visit and comment on those blogs. Tweeting is much less demanding in terms of time and effort, hence the majority of participants chose to only tweet.
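The overlap percentages are just set arithmetic on the participant Labels. A small sketch, with made-up labels and a hypothetical helper `overlap_share`; the only real numbers are the 62 and 194 quoted above.

```python
def overlap_share(labels_a, labels_b):
    """Return (shared count, share of network A, share of network B)."""
    a, b = set(labels_a), set(labels_b)
    shared = a & b
    return len(shared), len(shared) / len(a), len(shared) / len(b)


# Made-up labels, just to show the mechanics.
blog_labels = ["id_a", "id_b", "id_c", "id_d"]
twitter_labels = ["id_b", "id_c", "id_x", "id_y", "id_z", "id_w", "id_v", "id_u"]
count, blog_share, twitter_share = overlap_share(blog_labels, twitter_labels)

# The figure quoted in the post: 62 shared participants out of 194 bloggers.
blog_percent = round(100 * 62 / 194)
```

The same shared count gives a much smaller share of the larger network, which is why the two percentages differ so much.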

Now to the brokers. It appears that the participants with the highest betweenness centrality within Twitter did not have the highest betweenness in the blog network. There were 2 or 3 exceptions for slightly lower values – shown in mid-green within the figure.

Without knowing more about the nodes it is difficult to draw any meaningful conclusions. One conclusion may be that nodes with high centrality and betweenness on Twitter are in fact course coordinators who try to engage in exchanges with a wide range of participants and retweet their contributions to spread their ideas to their own networks. It is not common for course leaders to also blog extensively, hence they would not have a similar position in the blog network.

A couple of participants with relatively high centrality in two networks may be course leaders who engaged in extensive commentary on multiple blogs. Or course participants whose communication strategy was similar within both networks.

Not really possible to confirm this without knowing more about the nodes here…

Either way – I like the idea of having different “dominant” nodes in different social spaces. It allows for greater diversity of conversation across them. For example, if Twitter was in fact course leader-centric, blogs would be providing a more student-led environment. It would really be interesting to see what the course organisers envisaged for these spaces!

And can we really draw any conclusions without seeing the lurkers – followers who do not comment or retweet? Excluding them from the analysis instantly devalues the act of witnessing or reading, treating it as an invalid form of interaction and learning. And yet the lurking rule of thumb seems pretty invariable across various media. I wonder how lurkers with high betweenness centrality measure up to active communicators on the creativity and innovation scales;)


Compare the number of the communities across the two networks and check if there are some potential similarities;

Try to provide interpretation for the findings

What implications can you draw based on the modularity analysis and the communities identified in the two networks?

OK – I think I ran out of steam here. Twitter has many more communities as measured by modularity analysis as described here (12 for Twitter and 6 for blogs on default settings). I suspect this has to do more with the relative size of the two networks but I may be wrong. It is also possible that the blog community network is a reflection of a few bloggers receiving many comments on their posts this week. In this case each blog post would form a mini-community hub. Again – hard to check this without having more insight into the data.

In terms of overlap in communities for Twitter and blogs – I would say there is very little evidence of this. I used the combined dataset as per description above to get this pretty picture.

Overlap between Twitter and blog communities in CCK11. Colours=Twitter communities, size + numbers = blog communities of shared members.

It looks like many of the blog “communities” are represented within the largest “community” on Twitter. Some of the Twitter communities are not represented within blogs at all. As per above – no time for the equivalent analysis within blogs.
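The eyeballing above could be backed up with a simple cross-tabulation of Twitter community against blog community for the shared members. A minimal sketch, assuming each network's modularity classes have been exported as a Label-to-class mapping; the class numbers below are invented and `community_crosstab` is a hypothetical helper.

```python
from collections import Counter


def community_crosstab(twitter_class, blog_class):
    """Count shared members for each (Twitter community, blog community) pair."""
    shared = twitter_class.keys() & blog_class.keys()
    return Counter((twitter_class[n], blog_class[n]) for n in shared)


# Invented modularity classes keyed by participant Label.
twitter_class = {"id_a": 0, "id_b": 0, "id_c": 1, "id_d": 2}
blog_class = {"id_a": 3, "id_b": 1, "id_c": 1, "id_x": 0}

table = community_crosstab(twitter_class, blog_class)

# Twitter communities with no shared member in any blog community.
represented = {t for t, _ in table}
unrepresented = set(twitter_class.values()) - represented
```

If one Twitter community pairs up with several different blog communities, as in the picture, that shows up as several non-zero cells in the same row.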

As per the centrality measure discussion earlier, this indicates that the Twitter and blog networks are different. As for the underlying reasons and effects of this difference – hard to say…

Application of the analysis to other educational contexts

Reflect on the possible educational contexts in which you would apply similar analysis types. Of special importance would be to identify the learning contexts that are of direct relevance to you such as work (e.g., teaching), study (e.g., courses you are taking), or research (e.g., projects). Discuss possible questions that you would be able to address in those contexts by applying the analyses used in this assignment.

Some defend evolutionary “just so stories” as valid hypothesis making – the trick is to get at the ones which can actually be falsified, and then get some hard data behind them. So for it to be believable to me I would like any SNA married up with multiple sources of data from the same experimental/observational set up – and across some contrasting ones. So for example, in addition to the communication pattern data available here I would like to have access to the media use strategy in the course, characteristics of the course cohort, some additional data from surveys/interviews and participation/survival logs. Oh – and wouldn’t it be nice to have data from multiple instances of the cMOOC, perhaps with some careful variation in the baseline conditions:)

Most importantly, there should be a clear question – why are we looking for patterns here, what patterns would we expect to see? Perhaps the question could be related to aspects of social presence (as per the Community of Inquiry model) and impact of the use of Twitter vs blogs as a communication medium in a cMOOC course. Apparently, there is already an existing and validated questionnaire relating to this framework which could come in quite handy for collection of data additional to logs of social interactions:)

Oh – and I would definitely be interested in exploring any formal comparison measures/methods/statistics which facilitate multi-network comparisons and their visual presentation within SNA graphs… Just sayin’


Top image: By Tambako The Jaguar

#dalmooc wk3 homework: Twitter and blog networks in CCK11 – SNA in Gephi part 2

Feels a bit like learning finger painting…

Hey – and I am back from fiddling with pretty pictures (had a bit of a pause to consider meanings of “social capital” on the way – there will be a post about that, don’t you worry!)

So Part 2 of this activity/homework was to get some visualizations going on the Twitter and blog data sets from the CCK11 (Part 1 – prelim analysis here).

Exploring layouts

I used the larger Twitter set to have a play around with the layouts.

The Yifan Hu (YH) algorithm seemed to produce a vis which pulls external nodes out/away from the centre of the vis so that they form a jagged circle. Fruchterman Reingold (FR) laid it out so that a smooth circle was created with less differentiation between sub-clusters.

YH seems to have more properties to tweak but I had no idea what some of them meant, e.g. Quadtree max level or theta (despite a handy definition tip appearing at the bottom when you click on any of them). In fact, changing values of most did not seem to have any substantial effect on the overall shape of the visualisation – at first glance anyway. The two which had the most effect were relative strength and optimal distance.

  • Relative strength – the relative strength between electrical force (repulsion) and spring force (attraction). Smaller values produce a tighter central cluster. If you want to see inside the central cluster – make it larger!
  • Optimal distance – the natural length of the springs. Bigger values mean nodes will be further apart. Again – if you want to see individual nodes – increase the value. To tighten individual clusters/make communities more visible make it smaller.
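To see why both settings push nodes apart, here is a minimal sketch of the two forces for a single connected pair of nodes. It assumes the spring-electrical model usually associated with the Yifan Hu layout (repulsion proportional to C·K²/d, attraction proportional to d²/K, with C the relative strength and K the optimal distance). Gephi's actual implementation adds approximations such as the quadtree, so treat the function names and the numbers as illustrative only.

```python
def repulsion(distance, relative_strength, optimal_distance):
    """Electrical (repulsive) force between two nodes at the given distance."""
    return relative_strength * optimal_distance ** 2 / distance


def attraction(distance, optimal_distance):
    """Spring (attractive) force along an edge of the given length."""
    return distance ** 2 / optimal_distance


def balance_distance(relative_strength, optimal_distance):
    """Distance at which the two forces cancel for one connected pair.

    Setting C*K**2/d equal to d**2/K gives d = K * C**(1/3).
    """
    return optimal_distance * relative_strength ** (1 / 3)


# Illustrative settings: the pair settles closer together with the smaller
# relative strength, and further apart with the larger optimal distance.
tight = balance_distance(0.2, 100)
looser = balance_distance(0.4, 100)
stretched = balance_distance(0.2, 200)
```

Under this model the resting distance grows linearly with optimal distance but only with the cube root of relative strength, which fits the impression that optimal distance is the blunter of the two controls.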

FR had fewer properties so was easier to understand at first glance (area, gravity and speed). In FR, lower gravity (force attracting nodes to the centre) values made the cluster less tight, making it easier to distinguish individual nodes. FR took much longer to run and needed to be manually stopped.

Without the use of colour to highlight modularity I found it difficult to see any structure within the FR layout, so I opted for using the YH method for the analysis. It seems that OpenOrd (modification of FR) would be best for detection of distinct clusters – but this must be an imported add-on as I do not see it in the default layout list. Something to explore at another time:)

I also found Gephi Tutorials on Slideshare re:layout helpful:

The choice of methods depends on topology you want to emphasise (also size of your network though)

Explanation of FR method

  • Area = graph size area
  • Gravity = increasing gravity reduces dispersion of disconnected components by pulling them into the centre

IMPORTANT: When the algorithm does not converge – need to reduce speed to gain precision (unstable nodes position/unstable graph)

Explanation of YH method:
Including demystification of some of the more obscure parameters:

Sizing the network nodes based on centrality measures

Playing around with sizing the nodes based on their centrality measures gave a quick overview of the visual overlap between the nodes ranking highest on the different measures. For example, in the Twitter network, nodes with the highest values of betweenness centrality also looked like the ones with the highest values of degree centrality (see below).

Size=degree centrality
Size=betweenness centrality

Ultimately I sized the nodes based on betweenness centrality and inserted the degree centrality score as a label (purely a matter of convenience as degree centrality values were just too large to fit into the node circles!).

Visualising communities identified in the networks

The tutorial by Jen Golbeck, suggested by Dagan, was useful here, although I have still not worked out how to highlight the selected nodes in the Data Lab spreadsheet from the right click.

I fiddled with the size range for the nodes by increasing the minimum size so that the circles are large enough to show the colours.
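Sizing by a measure with a minimum and maximum size boils down to a linear rescaling. A minimal sketch of that idea, assuming a straight-line mapping (Gephi also offers other interpolation options); `scale_sizes` and the values are illustrative only.

```python
def scale_sizes(values, min_size, max_size):
    """Map each node's value linearly onto the range [min_size, max_size]."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # All nodes have the same value, so give them all the minimum size.
        return {node: float(min_size) for node in values}
    span = hi - lo
    return {
        node: min_size + (value - lo) / span * (max_size - min_size)
        for node, value in values.items()
    }


# Made-up betweenness values; raising min_size keeps the smallest circles visible.
betweenness_values = {"id_a": 0.0, "id_b": 50.0, "id_c": 100.0}
sizes = scale_sizes(betweenness_values, min_size=10, max_size=50)
```

Raising the minimum compresses the visible differences between nodes, so it is a trade-off between legible colours and honest proportions.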

I also played around with the modularity resolution setting, increasing it to 2 for Twitter and to 1.8 for blogs in order to decrease the number of communities for clarity (from 12 to 8 and from 6 to 4 respectively).

I also changed some of the colours to provide better contrast between the different communities.

NOTE: any overlap in colours for Twitter and blog visualisation is accidental and does not indicate overlap in communities across these two environments.

And voilà – two pretty pictures!

Blog network in CCK11 course. Node size=betweenness; node label=centrality.
Twitter network in CCK11 course. Node size=betweenness, node label=degree.

What does it all mean? I think I may have to cover that in the next post – the actual Assignment for week 3 demonstrating my competencies…It will be nice to finally have some questions to answer:)

Top image source: Flickr by Maegan under CC license

Twitter and blog networks in CCK11 MOOC – SNA in Gephi part 1 for #dalmooc

Twitter vs Blog networks at CCK11 MOOC – SNA in Gephi

We are onto Social Network Analysis in week 3 and now actually doing it rather than talking about it (I am running behind, of course, as this is very much the start of week 4 now. Yikes!)

So, we were given two datasets collected during the Connectivism and Connected Knowledge MOOC in 2011, encompassing exchanges between participants on Twitter and via blogs (communication via comments). Each data set had a version collected in week 6 and week 12 of the course. The data was pre-processed into the format directly importable into Gephi (a lovely open-source and free SNA visualisation tool).

For Twitter: “graphs included all authors and mentions as nodes of the network, and the edges between them were created if an author or an account were tagged within the tweet. For example, if a course participant @Learner1 mentioned @Learner2 and @Learner3 in a tweet, then the course Twitter network would contain @Learner1, @Learner2, and @Learner3 with the following edges: @Learner1 – @Learner2, and @Learner1 – @Learner3.”

For blogs: “[graphs] included authors of the blog posts (i.e. blog owners) and the authors of the comments to individual blog posts. If a learner A1 created a blog post, and then learners B1 and C1 added comments to that post, then the corresponding network would contain nodes A1, B1, and C1 with the following edges: A1-B1, and A1-C1. All the four networks are undirected.”
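Both descriptions amount to the same rule: one "source" person (tweet author or blog owner) gets an undirected edge to each person attached to that item (mentions or commenters). A minimal sketch using the examples from the quotes. Dropping self-links and collapsing repeated pairs into a single edge are my assumptions, since the pre-processing details are not given, and `star_edges` is a hypothetical name.

```python
def star_edges(records):
    """Build undirected edges from (source, [others]) records.

    Each edge is a frozenset so that A-B and B-A count as the same edge.
    """
    edges = set()
    for source, others in records:
        for other in others:
            if other != source:  # assumption: ignore self-mentions/self-comments
                edges.add(frozenset((source, other)))
    return edges


# Twitter: @Learner1 mentions @Learner2 and @Learner3 in one tweet.
tweets = [("@Learner1", ["@Learner2", "@Learner3"])]
twitter_edges = star_edges(tweets)

# Blogs: A1 writes a post, B1 and C1 comment on it.
posts = [("A1", ["B1", "C1"])]
blog_edges = star_edges(posts)
```

Note that in both cases the people on the receiving end (the two mentioned accounts, or the two commenters) are not linked to each other, only to the source.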

The pre-processed data is only available to the course participants and its use restricted to completing our assignments so I cannot share it here.

In this analysis step I imported each set into Gephi without any glitches and performed basic analyses and filtering for each of the sets at week 12 of the course. As per instructions this included computing the density measure and centrality measures (betweenness and degree) introduced in the course, followed by applying the Giant Component filter to filter out all the disconnected nodes and identifying communities by using the modularity algorithm. Dagan’s walkthroughs in Gephi were very useful here (introduction + modularity analysis).
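For reference, the simpler of these measures are short enough to sketch in plain Python. This is an illustration on a toy edge list rather than the course data (which I cannot share), and the function names are my own. Nodes with no edges at all would not appear in an edge list, so this sketch ignores them.

```python
from collections import deque


def build_adjacency(edges):
    """Turn a list of undirected (a, b) edges into a neighbour-set dictionary."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def density(adj):
    """Share of all possible undirected ties that are actually present."""
    n = len(adj)
    m = sum(len(neighbours) for neighbours in adj.values()) / 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def average_degree(adj):
    """Mean number of neighbours per node."""
    return sum(len(neighbours) for neighbours in adj.values()) / len(adj)


def giant_component(adj):
    """Return the node set of the largest connected component."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in component:
                    component.add(w)
                    queue.append(w)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Toy network: a triangle plus one separate pair.
adj = build_adjacency([("a", "b"), ("b", "c"), ("a", "c"), ("d", "e")])
toy_density = density(adj)
toy_average_degree = average_degree(adj)
giant = giant_component(adj)
```

In the toy network the separate pair "d"–"e" is what a giant component filter would drop, leaving only the triangle.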

I report the key numbers in the table above.

It was nice to see some numbers:) And it was immediately obvious that there were some differences between Twitter and blogs, e.g. more nodes and edges in the former and twice as many “communities”.

But instantly I was concerned – what do these numbers actually mean?

Is network density of 0.003 good or bad? What does it actually mean in terms of e.g. speed of information flow in minutes/days? Has this stuff been quantified? Or is it just a matter of getting a feel for it with experience?

Or perhaps it only works for comparisons? If so – how do I tell if the difference in network density between Twitter and blogs (0.003 vs 0.01, respectively) is actually significant? And would this significant difference in a network measure value have any meaningful effect on the ease of information flow within each network?
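One partial handle on the density numbers: for an undirected network without self-loops, density equals the average degree divided by (number of nodes – 1), so density mostly tells you how large the network is relative to how many contacts people typically have. A rough sketch of that arithmetic, assuming the average degrees quoted in the week 3 assignment post (2.7 for Twitter, 2.01 for blogs) were computed on the same graphs as these densities. The densities are heavily rounded, so the results are ballpark figures at best.

```python
def implied_node_count(average_degree, density):
    """Node count implied by density = average_degree / (n - 1)."""
    return average_degree / density + 1


# Density from this post, average degree from the week 3 assignment post.
twitter_nodes = implied_node_count(2.7, 0.003)
blog_nodes = implied_node_count(2.01, 0.01)
```

This gives roughly 900 nodes for Twitter and roughly 200 for blogs, so the lower Twitter density may say more about the size of the Twitter network than about how well connected its members are.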

Clearly, still a lot to learn. Onto making some pretty pictures (oh – sorry – visualisations;) with the said data for part 2 of this task. Won’t be long I hope:)