
#dalmooc wk3 homework: Twitter and blog networks in CCK11 – SNA in Gephi part 2

Feels a bit like learning finger painting…

Hey – and I am back from fiddling with pretty pictures (had a bit of a pause to consider meanings of “social capital” on the way – there will be a post about that, don’t you worry!)

So Part 2 of this activity/homework was to get some visualizations going on the Twitter and blog data sets from the CCK11 (Part 1 – prelim analysis here).

Exploring layouts

I used the larger Twitter set to have a play around with the layouts.

The Yifan Hu (YH) algorithm seemed to produce a visualisation which pulls the outer nodes away from the centre so that they form a jagged circle. Fruchterman Reingold (FR) laid the network out as a smooth circle, with less differentiation between sub-clusters.

YH seems to have more properties to tweak but I had no idea what some of them meant, e.g. Quadtree max level or theta (despite a handy definition tip appearing at the bottom when you click on any of them). In fact, changing the values of most of them did not seem to have any substantial effect on the overall shape of the visualisation – at first glance anyway. The two which had the most effect were relative strength and optimal distance.

  • Relative strength – the relative strength between electrical force (repulsion) and spring force (attraction). Smaller values produce a tighter central cluster. If you want to see inside the central cluster – make it larger!
  • Optimal distance – the natural length of the springs. Bigger values mean nodes will be further apart. Again – if you want to see individual nodes – increase the value. To tighten individual clusters/make communities more visible, make it smaller.
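To see why both knobs spread the nodes out, here is a minimal sketch. It assumes the force model described in Yifan Hu's paper (repulsion = C·K²/d, attraction = d²/K, with K the optimal distance and C the relative strength). Whether Gephi implements exactly this is my assumption, and the function name is made up for illustration.

```python
def equilibrium_distance(optimal_distance, relative_strength):
    """Distance at which attraction and repulsion cancel for ONE pair of
    connected nodes (an idealised case, not a whole network).

    Assumed force model (taken from Hu's paper, not checked against Gephi):
        repulsion  = relative_strength * optimal_distance**2 / d
        attraction = d**2 / optimal_distance
    Setting the two equal gives d = optimal_distance * relative_strength**(1/3).
    """
    return optimal_distance * relative_strength ** (1.0 / 3.0)


# Hypothetical values, chosen only to show the direction of the effect.
for strength in (0.2, 1.0, 8.0):
    print(strength, round(equilibrium_distance(100.0, strength), 1))
```

Under that assumption the balance point grows in direct proportion to optimal distance but only with the cube root of relative strength, which would fit with optimal distance being the more dramatic of the two to play with.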

FR had fewer properties (area, gravity and speed) so was easier to understand at first glance. In FR, lower values of gravity (the force attracting nodes to the centre) made the cluster less tight, making it easier to distinguish individual nodes. FR also took much longer to run than YH and needed to be stopped manually.

Without the use of colour to highlight modularity I found it difficult to see any structure within the FR layout, so I opted for the YH method for the analysis. It seems that OpenOrd (a modification of FR) would be best for detecting distinct clusters – but this must be an imported add-on as I do not see it in the default layout list. Something to explore at another time:)

I also found Gephi Tutorials on Slideshare re:layout helpful:

The choice of method depends on the topology you want to emphasise (and also on the size of your network):

LayoutMethodChoice
Explanation of FR method:
FGLayout
Here:

  • Area = graph size area
  • Gravity = increasing gravity reduces dispersion of disconnected components by pulling them into the centre

IMPORTANT: When the algorithm does not converge (node positions keep shifting and the graph looks unstable), reduce the speed to gain precision.
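The roles of area, gravity and speed can be sketched with a toy Fruchterman-Reingold-style layout. This is an illustration only: the parameter names echo the ones in Gephi's panel, but the scaling constants, the cooling schedule and the tiny two-triangle network are all my assumptions and will not reproduce Gephi's output.

```python
import math
import random


def toy_fr_layout(nodes, edges, area=10000.0, gravity=10.0, speed=1.0,
                  iterations=200, seed=1):
    """Simplified FR-style layout; `nodes` is a list, `edges` a list of pairs."""
    rng = random.Random(seed)
    side = math.sqrt(area)
    k = math.sqrt(area / (1 + len(nodes)))  # ideal edge length for this area
    pos = {n: [rng.uniform(-side / 2, side / 2),
               rng.uniform(-side / 2, side / 2)] for n in nodes}
    for it in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}
        # Repulsion between every pair of nodes: k^2 / distance.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                d = math.hypot(dx, dy) or 1e-9
                f = k * k / d
                disp[a][0] += dx / d * f
                disp[a][1] += dy / d * f
                disp[b][0] -= dx / d * f
                disp[b][1] -= dy / d * f
        # Attraction along edges: distance^2 / k.
        for a, b in edges:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            d = math.hypot(dx, dy) or 1e-9
            f = d * d / k
            disp[a][0] -= dx / d * f
            disp[a][1] -= dy / d * f
            disp[b][0] += dx / d * f
            disp[b][1] += dy / d * f
        # Gravity: a pull toward the origin, proportional to distance from it.
        for n in nodes:
            disp[n][0] -= 0.01 * k * gravity * pos[n][0]
            disp[n][1] -= 0.01 * k * gravity * pos[n][1]
        # Move each node, capped by a step that shrinks over time.
        # A lower speed means a smaller cap, so smaller and steadier moves.
        max_step = speed * (side / 10.0) * (1.0 - it / iterations)
        for n in nodes:
            d = math.hypot(disp[n][0], disp[n][1]) or 1e-9
            step = min(d, max_step)
            pos[n][0] += disp[n][0] / d * step
            pos[n][1] += disp[n][1] / d * step
    return pos


def mean_radius(pos):
    """Average distance of the nodes from the centre of the layout."""
    return sum(math.hypot(x, y) for x, y in pos.values()) / len(pos)


# Two disconnected triangles (made-up data).
NODES = ["a1", "a2", "a3", "b1", "b2", "b3"]
EDGES = [("a1", "a2"), ("a2", "a3"), ("a1", "a3"),
         ("b1", "b2"), ("b2", "b3"), ("b1", "b3")]
loose = mean_radius(toy_fr_layout(NODES, EDGES, gravity=0.0))
tight = mean_radius(toy_fr_layout(NODES, EDGES, gravity=10.0))
print(f"mean distance from centre: gravity=0 -> {loose:.0f}, gravity=10 -> {tight:.0f}")
```

With gravity switched off nothing holds the two disconnected triangles together, so they simply push each other away; with gravity on they settle near the centre.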

Explanation of YH method:
YHLayout
Including demystification of some of the more obscure parameters:

YHparameters

Sizing the network nodes based on centrality measures

Playing around with sizing the nodes based on their centrality measures gave a quick overview of the visual overlap between the nodes ranking highest on the different measures. For example, in the Twitter network, the nodes with the highest values of betweenness centrality also looked like the ones with the highest values of degree centrality (see below).

Size=degree centrality
Size=betweenness centrality

Ultimately I sized the nodes based on betweenness centrality and inserted the degree centrality score as a label (purely a matter of convenience, as degree centrality values were just too large to fit into the node circles!).
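For anyone wondering how the two measures can agree at the top and still differ further down, here is a minimal sketch on a made-up seven-node network. The betweenness function follows Brandes' shortest-path counting approach for unweighted, undirected graphs; it is not Gephi's code, the values are unnormalised, and the node names are hypothetical.

```python
from collections import deque


def degree(adj):
    """Number of neighbours per node."""
    return {v: len(neigh) for v, neigh in adj.items()}


def betweenness(adj):
    """Unnormalised betweenness for an unweighted, undirected graph."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        order = []                      # nodes in order of distance from s
        preds = {v: [] for v in adj}    # predecessors on shortest paths
        sigma = {v: 0 for v in adj}     # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()             # farthest nodes first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each pair was counted from both ends, so halve the totals.
    return {v: b / 2 for v, b in bc.items()}


# Made-up network: a hub with three leaves, plus a chain hub-bridge-x-y.
EDGE_LIST = [("hub", "a"), ("hub", "b"), ("hub", "c"),
             ("hub", "bridge"), ("bridge", "x"), ("x", "y")]
adj = {}
for u, v in EDGE_LIST:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

deg = degree(adj)
bc = betweenness(adj)
for node in sorted(adj, key=lambda n: -bc[n]):
    print(node, "degree:", deg[node], "betweenness:", bc[node])
```

In this toy case the hub tops both rankings, but "bridge" and "x" have the same degree and very different betweenness, which is the kind of difference the node sizing can reveal.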

Visualising communities identified in the networks

The tutorial suggested by Dragan (by Jen Golbeck) was useful here, although I have still not worked out how to highlight the selected nodes in the Data Laboratory spreadsheet from the right-click menu.

I fiddled with size range for the nodes by increasing the minimum size so that the circles are large enough to show the colours.

I also played around with the modularity resolution, increasing it to 2 for Twitter and to 1.8 for blogs in order to decrease the number of communities for clarity (from 12 to 8 and from 6 to 4 respectively).
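Why would a higher value give fewer communities? A rough sketch, under a loud assumption: I treat the resolution as a value t that scales the reward for edges falling inside a community (at t=1 this is ordinary modularity). I have not checked that Gephi computes it this way. Gephi's algorithm also searches for the best partition, whereas this sketch only scores two hand-made partitions of a made-up ring of six triangles.

```python
def resolution_score(edges, community, t=1.0):
    """Modularity-style score with an ASSUMED resolution value t.

    score = sum over communities of  t * L_c / m  -  (d_c / 2m)^2
    where L_c = edges inside community c, d_c = total degree of c,
    and m = number of edges. At t = 1 this is standard modularity.
    """
    m = len(edges)
    internal = {}
    degree_sum = {}
    for a, b in edges:
        ca, cb = community[a], community[b]
        degree_sum[ca] = degree_sum.get(ca, 0) + 1
        degree_sum[cb] = degree_sum.get(cb, 0) + 1
        if ca == cb:
            internal[ca] = internal.get(ca, 0) + 1
    return sum(t * internal.get(c, 0) / m - (d / (2.0 * m)) ** 2
               for c, d in degree_sum.items())


# Made-up network: six triangles joined into a ring by single edges.
N_TRIANGLES = 6
ring_edges = []
for i in range(N_TRIANGLES):
    ring_edges += [((i, 0), (i, 1)), ((i, 1), (i, 2)), ((i, 0), (i, 2))]
    ring_edges.append(((i, 1), ((i + 1) % N_TRIANGLES, 0)))
ring_nodes = [(i, j) for i in range(N_TRIANGLES) for j in range(3)]

six_communities = {n: n[0] for n in ring_nodes}         # one per triangle
three_communities = {n: n[0] // 2 for n in ring_nodes}  # neighbouring pairs merged

for t in (1.0, 2.0):
    print("t =", t,
          "| six:", round(resolution_score(ring_edges, six_communities, t), 3),
          "| three:", round(resolution_score(ring_edges, three_communities, t), 3))
```

Under this assumption the six-community split scores higher at t=1 and the merged three-community split scores higher at t=2, which is at least the same direction as what I saw when raising the value.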

I also changed some of the colours to provide better contrast between the different communities.

NOTE: any overlap in colours for Twitter and blog visualisation is accidental and does not indicate overlap in communities across these two environments.

And voilà – two pretty pictures!

CCK11BlogCommunitiesLABEL
Blog network in CCK11 course. Node size=betweenness; node label=degree.
Twitter network in CCK11 course. Node size=betweenness; node label=degree.

What does it all mean? I think I may have to cover that in the next post – the actual Assignment for week 3 demonstrating my competencies… It will be nice to finally have some questions to answer:)

Top image source: Flickr by Maegan under CC license

Twitter and blog networks in CCK11 MOOC – SNA in Gephi part 1 for #dalmooc

Twitter vs Blog networks at CCK11 MOOC – SNA in Gephi

We are onto Social Network Analysis in week 3 and now actually doing it rather than talking about it (I am running behind, of course, as this is very much the start of week 4 now. Yikes!)

So, we were given two datasets collected during the Connectivism and Connected Knowledge MOOC in 2011 (CCK11), encompassing exchanges between participants on Twitter and via blogs (communication via comments). Each data set came in two versions, one collected in week 6 and one in week 12 of the course. The data was pre-processed into a format directly importable into Gephi (a lovely open-source and free SNA visualisation tool).

For Twitter: “graphs included all authors and mentions as nodes of the network, and the edges between them were created if an author or an account were tagged within the tweet. For example, if a course participant @Learner1 mentioned @Learner2 and @Learner3 in a tweet, then the course Twitter network would contain @Learner1, @Learner2, and @Learner3 with the following edges: @Learner1 – @Learner2, and @Learner1 – @Learner3.”

For blogs: “[graphs] included authors of the blog posts (i.e. blog owners) and the authors of the comments to individual blog posts. If a learner A1 created a blog post, and then learners B1 and C1 added comments to that post, then the corresponding network would contain nodes A1, B1, and C1 with the following edges: A1-B1, and A1-C1. All the four networks are undirected.”
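The two quoted rules boil down to the same thing: one person (tweet author or blog owner) gets an undirected edge to everyone they mention or who comments on their post. A minimal sketch of that rule, with a hypothetical input format of (source, list of targets) pairs; dropping self-mentions is my assumption, as the course description does not say how they were handled.

```python
def undirected_edges(records):
    """Build a set of undirected edges from (source, [targets]) records.

    For Twitter, source = tweet author and targets = mentioned accounts.
    For blogs, source = blog owner and targets = comment authors.
    """
    edges = set()
    for source, targets in records:
        for target in targets:
            if target != source:  # assumption: ignore self-mentions/self-comments
                # frozenset makes A-B and B-A the same edge
                edges.add(frozenset((source, target)))
    return edges


# The examples from the course description.
tweets = [("@Learner1", ["@Learner2", "@Learner3"])]
posts = [("A1", ["B1", "C1"])]

print(sorted(sorted(e) for e in undirected_edges(tweets)))
print(sorted(sorted(e) for e in undirected_edges(posts)))
```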

The pre-processed data is only available to the course participants and its use restricted to completing our assignments so I cannot share it here.

In this analysis step I imported each set into Gephi without any glitches and performed basic analyses and filtering for each of the sets at week 12 of the course. As per the instructions, this included computing the density measure and the centrality measures (betweenness and degree) introduced in the course, followed by applying the Giant Component filter to filter out all the disconnected nodes, and identifying communities by using the modularity algorithm. Dragan’s walkthroughs in Gephi were very useful here (introduction + modularity analysis).
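Gephi does all of this at the click of a button, but the first steps are simple enough to sketch by hand. The code below is a plain-Python illustration on made-up data, not what Gephi runs internally: density for an undirected network, plus a "giant component" step that keeps only the largest connected group of nodes.

```python
from collections import deque


def build_adjacency(nodes, edges):
    """Undirected adjacency sets; isolated nodes are kept with no neighbours."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def density(adj):
    """Share of possible undirected edges that actually exist: 2m / (n(n-1))."""
    n = len(adj)
    m = sum(len(neigh) for neigh in adj.values()) // 2
    return 2.0 * m / (n * (n - 1)) if n > 1 else 0.0


def giant_component(adj):
    """Largest connected set of nodes, found by breadth-first search."""
    seen = set()
    best = set()
    for start in adj:
        if start in seen:
            continue
        component = {start}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in component:
                    component.add(w)
                    queue.append(w)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Made-up network: a chain a-b-c, a separate pair d-e, and a loner f.
toy = build_adjacency(["a", "b", "c", "d", "e", "f"],
                      [("a", "b"), ("b", "c"), ("d", "e")])
print("density:", density(toy))
print("giant component:", sorted(giant_component(toy)))
```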

I report the key numbers in the table above.

It was nice to see some numbers:) And it was immediately obvious that there were some differences between Twitter and blogs, e.g. more nodes and edges in the former and twice as many “communities”.

But instantly I was concerned – what do these numbers actually mean?

Is a network density of 0.003 good or bad? What does it actually mean in terms of e.g. speed of information flow in minutes/days? Has this stuff been quantified? Or is it just a matter of getting a feel for it with experience?

Or perhaps it only works for comparisons? If so – how do I tell if the difference in network density between Twitter and blogs (0.003 vs 0.01, respectively) is actually significant? And would this significant difference in a network measure value have any meaningful effect on the ease of information flow within each network?
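One thing that does help me read these numbers is writing out the standard definition of density for an undirected network with n nodes and m edges (this is the textbook formula; I am assuming it is the one Gephi reports):

```latex
D = \frac{2m}{n(n-1)}, \qquad \bar{k} = \frac{2m}{n} = D\,(n-1)
```

So a density of 0.003 means the average node is connected to about 0.3% of the other nodes. It also shows that density depends on network size: for the same average number of connections per node, a bigger network has a lower density. The Twitter network has more nodes than the blog one, so part of the gap between 0.003 and 0.01 may simply reflect size rather than how well connected people are.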

Clearly, still a lot to learn. Onto making some pretty pictures (oh – sorry – visualisations;) with the said data for part 2 of this task. Won’t be long I hope:)