Week 3 #dalmooc assignment – Twitter vs Blogs in CCK11 SNA in Gephi

November 28, 2014 Kay 1 Comment

Interpreting SNA feels like telling “just so” stories about how the leopard got its spots. Or is it just me:)

Time to bring the week 3 exercises (Gephi SNA part 1 and part 2) together for week 3 assessment, asking us to “compare the two networks (Twitter vs. blogs) for week 12 of CCK11” (for description of the data set see part 1).

Health warning – this is a rather amateurish effort and felt more like telling stories than drawing hard conclusions. But – hey – from what I have seen so far much SNA stuff is very much about “just so stories” and rather overimaginative interpretation of patterns and associations. If you were to believe the rumours, such an approach should be familiar territory for an evolutionary biologist like me…and so the process made me feel almost at home;)

Density and centrality measures

Twitter network had slightly lower density but slightly higher average degree than the blog network (0.001 vs 0.03; 2.7 vs 2.01).

Density was quite low in both networks, pointing to low level of overall community integration and low potential for spread of information. However, it seems that blog network would have a marginal advantage over Twitter here.

Low density indicates that using either system for important announcements would be inefficient. Participants or subgroups may feel isolated from the others and may find it difficult to find relevant information or connections. The slight differences in density and centrality may be a result of the inherent properties of the two tools. Twitter is inherently easier to use for quick communication and making connections due to the low level of effort required to post in comparison to blogs. The latter allows for more in depth exchanges.
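A quick worked example helps to see how a network can have a higher average degree and still a lower density: density divides by the number of possible ties, which grows roughly with the square of the network size. This is only a sketch for undirected networks, and the node and edge counts are made up for illustration (they are not the CCK11 numbers).

```python
def density(n_nodes: int, n_edges: int) -> float:
    """Share of all possible undirected ties that actually exist."""
    possible = n_nodes * (n_nodes - 1) / 2
    return n_edges / possible


def average_degree(n_nodes: int, n_edges: int) -> float:
    """Each undirected edge adds one to the degree of two nodes."""
    return 2 * n_edges / n_nodes


# Hypothetical (nodes, edges) pairs, NOT the real Twitter/blog counts.
small = (100, 100)
large = (1000, 1500)

for name, (n, m) in {"small": small, "large": large}.items():
    print(name, "avg degree:", average_degree(n, m), "density:", round(density(n, m), 4))
```

In this toy case the larger network has the higher average degree (3 vs 2) but a much lower density, simply because there are so many more possible pairs to connect.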

But I immediately want to ask – what were the course team strategies for use of the two communication tools? Did these strategies have an impact on the network characteristics? Did they override or aid the influence of the tool affordances?

Broker and central node overlap

Identify any overlap in the main brokers and central nodes across the two networks.

What does it mean to have brokers and central nodes in common for both networks?

If you haven’t identified central nodes in common for the both networks, what does it mean in terms of integration of different social media used in a single course?

Now – I had fun with this one (and of course spent disproportionately too much time on wrangling the data and the software into semi-submission). As formal analysis for this type of question was not covered in the lectures, I assumed that identification of “any overlap in the main brokers and central nodes across the two networks” was to be done by eyeballing alone.

OK – so I discovered that the unique id for each participant was coded in the label column within Data Lab.

I could display Labels on nodes with the highest betweenness scores (the brokers) and the highest centrality scores in the visualisation of each network and simply check if they are the same.

But anonymisation meant that the identifiers were long alphanumeric strings. Not so easy to eyeball and compare for multiple brokers and central nodes then. Especially as Gephi does not support side-by-side display of the network plots! Wouldn’t it be nice if highlighting a node with a particular ID (or in this case Label) in one network also highlighted it in the second one?

OK – another idea was to dump both datasets in Excel, sort by betweenness and centrality values, and check if there are any nodes in common among the highest values. Notwithstanding the issue of comparing long strings by eye, this would definitely lose the power of visualisation in presentation of my conclusions.
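The sort-and-compare idea does not actually need any eyeballing of long strings. Here is a minimal sketch in Python, assuming the betweenness scores of each network have already been exported as label-to-score mappings; the labels and scores below are invented placeholders, not the course data.

```python
def top_k(scores: dict[str, float], k: int) -> set[str]:
    """Labels of the k nodes with the highest score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


# Hypothetical anonymised labels and betweenness values.
twitter_btw = {"a1f3": 120.0, "9c2e": 85.5, "77b0": 40.0, "e4d1": 3.0}
blog_btw = {"9c2e": 60.0, "5aa8": 33.0, "a1f3": 1.0, "c0de": 0.0}

# Brokers that rank in the top 2 of BOTH networks.
shared_brokers = top_k(twitter_btw, 2) & top_k(blog_btw, 2)
print(sorted(shared_brokers))
```

The choice of k is arbitrary here; in practice one would probably try a few cut-offs to see how stable the overlap is.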

Ideally, I would want to display the nodes shared between the two networks within each diagram. Even more ideally, I would like to heatmap betweenness of the second network on the first one where nodes were sized according to the same measure. I would simply have to import the values for the second network into the Data Lab spreadsheet of the first one and use them for node colours!

This did not turn out to be as easy as it sounded….mostly because of idiosyncrasies of Gephi (oh the joys of hands-on problem-based discovery of new software!). This is what I did to produce a heatmap of blogger betweenness onto the Twitter network (and it does sound so basic when you put it down like this):

1. I exported the spreadsheets of the Twitter and blog networks from Gephi Data Lab.

2. I joined the two in Tableau with left hand join (i.e. all Twitter nodes were preserved and data added only for the blog nodes which occurred in BOTH networks). Making sure that the join was on Label (unique ID for each participant).

3. Selected the whole joined data table in Tableau and copied into Excel.

4. In Excel I deleted all the columns corresponding to the Twitter data (except for the Label in order to cross check IDs after import to Gephi). I also replaced all the nulls in the missing blog data with -40. Replacement was necessary as Gephi does not recognise columns with empty cells as valid for visualisation, and you cannot use letters as it misclassifies the data. I chose a value of -40 as this was not a real value and was low enough to be able to select a cut off when colouring the nodes. Horrible hack I know! Saved as csv.

5. Imported spreadsheet into Gephi’s existing Data Lab spreadsheet for Twitter. Surprise, surprise, had to change data types for the imported columns (during the import!). It turns out that Degree, Component ID and Modularity Class are integers, but Eccentricity, Closeness and Betweenness are double. Don’t ask how I figured it out!

6. Now I sized the nodes of the Twitter network on Twitter network betweenness, and used betweenness values for blog network to colour the nodes (placing cut off point above -40 and as close as I could get it to +1ish). Here is the pretty picture:

Twitter network in CCK11 MOOC with nodes sized for betweenness centrality within Twitter network. Node colour corresponds to betweenness centrality of the node in the blog network. Dark blue=highest, light blue=lowest, yellow=nodes not shared between the two networks.
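For what it is worth, the join-and-placeholder part of the recipe (steps 2 to 4) could probably be done in one go with a short script instead of Tableau plus Excel. This is only a sketch using Python's standard csv module: the column names, the tiny inline tables and the output layout are assumptions standing in for the real Data Lab exports, and only the -40 placeholder is taken from the steps above.

```python
import csv
import io

# Hypothetical stand-ins for the two Data Lab exports.
TWITTER_CSV = """Label,Betweenness
a1f3,120.0
9c2e,85.5
77b0,40.0
"""
BLOG_CSV = """Label,Betweenness
9c2e,60.0
5aa8,33.0
"""

MISSING = -40.0  # placeholder for nodes absent from the blog network


def read_scores(text: str) -> dict[str, float]:
    """Map each Label to its betweenness value."""
    return {row["Label"]: float(row["Betweenness"])
            for row in csv.DictReader(io.StringIO(text))}


twitter = read_scores(TWITTER_CSV)
blog = read_scores(BLOG_CSV)

# Left join on Label: keep every Twitter node, add the blog value where one exists.
joined = [
    {"Label": label,
     "TwitterBetweenness": score,
     "BlogBetweenness": blog.get(label, MISSING)}
    for label, score in twitter.items()
]

# Write the result as CSV text (to a file in real use) ready for re-import.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["Label", "TwitterBetweenness", "BlogBetweenness"])
writer.writeheader()
writer.writerows(joined)
print(out.getvalue())
```

Blog-only nodes (such as the made-up "5aa8") drop out, which is exactly what a left join on the Twitter table does.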

Now – all this fiddling did not leave much time for interpretation. (I also did not have time to do the same for blogs or centrality)

But here goes.

62/194 (32%) of participants who either blogged or commented on blogs also tweeted using the course hashtag. This indicates that at least some participants actively communicating via online social networks, tended to use more than one network. The bloggers were more likely to use both (32% of bloggers overlapping vs 7% of Twitter users). This is not surprising, since Twitter is often used to advertise new blog posts by bloggers and many Twitter participants would visit and comment on those blogs. Tweeting is much less demanding in terms of time and effort, hence majority of participants chose to only Tweet.
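The overlap percentages are just set arithmetic on the participant labels: the shared group divided by the size of each network in turn. A toy sketch, with invented labels rather than the real identifiers:

```python
# Hypothetical participant labels for each network.
blog_labels = {"9c2e", "5aa8", "c0de", "a1f3"}
twitter_labels = {"a1f3", "9c2e", "77b0", "e4d1", "b00c", "d3ad", "f00d", "beef"}

shared = blog_labels & twitter_labels


def share(part: set[str], whole: set[str]) -> float:
    """Percentage of `whole` that is also in `part`."""
    return 100 * len(part) / len(whole)


print(f"{share(shared, blog_labels):.0f}% of blog participants also tweeted")
print(f"{share(shared, twitter_labels):.0f}% of Twitter participants also blogged")
```

The same shared group is a large slice of the small network and a small slice of the big one, which is the asymmetry described above.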

Now to the brokers. It appears that the participants with the highest betweenness centrality within Twitter did not have the highest betweenness in the blog network. There were 2 or 3 exceptions for slightly lower values – shown in mid-green within the figure.

Without knowing more about the nodes it is difficult to draw any meaningful conclusions. One conclusion may be that nodes with high centrality and betweenness on Twitter are in fact course coordinators who try to engage in exchanges with wide range of participants and retweet their contributions to spread their ideas to their own networks. It is not common for course leaders to also blog extensively, hence they would not have a similar position in the blog network.

A couple of participants with relatively high centrality in two networks may be course leaders who engaged in extensive commentary on multiple blogs. Or course participants whose communication strategy was similar within both networks.

Not really possible to confirm this without knowing more about the nodes here…

Either way – I like the idea of having different “dominant” nodes in different social spaces. It allows for greater diversity of conversation across them. For example, if Twitter was in fact course leader-centric, blogs would be providing a more student-led environment. It would really be interesting to see what the course organisers envisaged for these spaces!

And can we really draw any conclusions without seeing the lurkers – followers who do not comment or retweet? Excluding them from the analysis instantly dismisses the act of witnessing or reading as an invalid form of interaction and learning. And yet the lurking rule of thumb seems pretty invariable across various media. I wonder how lurkers with high betweenness centrality measure up to active communicators on the creativity and innovation scales;)

Communities

Compare the number of the communities across the two networks and check if there are some potential similarities;

Try to provide interpretation for the findings

What implications can you draw based on the modularity analysis and the communities identified in the two networks?

OK – I think I ran out of steam here. Twitter has many more communities as measured by modularity analysis as described here (12 for Twitter and 6 for blogs on default settings). I suspect this has to do more with the relative size of the two networks but I may be wrong. It is also possible that the blog community network is a reflection of a few bloggers receiving many comments on their posts this week. In this case each blog post would form a mini-community hub. Again – hard to check this without having more insight into the data.

In terms of overlap in communities for Twitter and blogs – I would say there is very little evidence of this. I used the combined dataset as per description above to get this pretty picture.

Overlap between Twitter and blog communities in CCK11. Colours=Twitter communities, size + numbers = blog communities of shared members.

It looks like many of the blog “communities” are represented within the largest “community” on Twitter. Some of the Twitter communities are not represented within blogs at all. As per above – no time for the equivalent analysis within blogs.

As per centrality measure discussion earlier this indicates that Twitter and blogs networks are different. As for underlying reasons and effects of this difference – hard to say…

Application of the analysis to other educational contexts

Reflect on the possible educational contexts in which you would apply similar analysis types. Of special importance would be to identify the learning contexts that are of direct relevance to you such as work (e.g., teaching), study (e.g., courses you are taking), or research (e.g., projects). Discuss possible questions that you would be able to address in those contexts by applying the analyses used in this assignment.

Some defend evolutionary “just so stories” as valid hypothesis making – the trick is to get at the ones which can actually be falsified, and then get some hard data behind them. So for me to be believable I would like any SNA married up with multiple sources of data from the same experimental/observational set up – and across some contrasting ones. So for example, in addition to the communication pattern data available here I would like to have access to the media use strategy in the course, characteristics of the course cohort as well as some additional data from surveys/interviews as well as participation/survival logs. Oh – and wouldn’t it be nice to have data from multiple instances of the cMOOC, perhaps with some careful variation in the baseline conditions:)

Most importantly, there should be a clear question – why are we looking for patterns here, what patterns would we expect to see? Perhaps the question could be related to aspects of social presence (as per the Community of Inquiry model) and impact of the use of Twitter vs blogs as a communication medium in a cMOOC course. Apparently, there is already an existing and validated questionnaire relating to this framework which could come in quite handy for collection of data additional to logs of social interactions:)

Oh – and I would definitely be interested in exploring any formal comparison measures/methods/statistics which facilitate multi-network comparisons and their visual presentation within SNA graphs… Just sayin’

Top image: By Tambako The Jaguar

Uncategorized

Apologies for the look and feel

November 28, 2014 Kay 1 Comment

It appears that WordPress does not like me. As of a couple of weeks back it started having issues with displaying sidebar widgets in my chosen theme and refused to display all my posts.

I have played around with a couple of other themes and there seem to be similar issues with widget display. Unfortunately, I don’t have time to play around with it more at the moment so stuck with the most well behaved free theme for now.

So – apologies for the mess peeps. I hope to find more time soon to fix it all up…

DALMOOC

Social capital in SNA for LA – too much focus on individuals at a cost of the group?

November 14, 2014 Kay Leave a comment

Individualistic conceptions of social capital bring to mind the tragedy of the commons…

Dragan used the term “social capital” several times in #DALMOOC’s week 3 materials. He spoke of a person accumulating social capital and how it correlates positively with their node’s high centrality and betweenness centrality values produced by SNA. We were encouraged to build or optimise our own social capital by “positioning” ourselves in the DALMOOC’s social networks (e.g. retweeting was suggested as a good strategy based on DALMOOC’s own data analysis presented at the start of week 4). Social capital after all, we were told, correlates with an individual’s academic success, career prospects, influence…

At first it seemed like quite an intuitive concept, but the more it was mentioned, the more I yearned for a clear definition of the term. In fact, something was not sitting right with me in the way that the concept seemed to be atomised to a level of a node/individual in pursuit of their own interests. Surely, an individual would have no social capital on their own, without their network?

When I looked it up in Wikipedia (lazy, I know;), the complexity of the concept belied the initial intuitive impression.

The basic definition seemed pretty straight forward:

“social capital is the expected collective or economic benefits derived from the preferential treatment and cooperation between individuals and groups”

But it turns out that there are many ways of conceptualising social capital (SC). There are also many views on its value – high SC can be seen as positive (Putnam sees it as means for civic engagement and democracy), negative (Bourdieu’s “old boys networks” as means to produce/reproduce inequality) or neutral (Coleman’s “neutral resource that facilitates any manner of action” )! But my niggling feeling was confirmed – many of the conceptions seemed to focus on social capital as a feature of a group, a community.

Although the definition agreed that “social networks have value” which can “affect the productivity of individuals AND groups”, there was a palpable tension between conceptions of SC from these two perspectives:

Early attempts to define social capital focused on the degree to which social capital as a resource should be used for public good or for the benefit of individuals. Putnam suggested that social capital would facilitate co-operation and mutually supportive relations in communities and nations and would therefore be a valuable means of combating many of the social disorders inherent in modern societies, for example crime. In contrast to those focusing on the individual benefit derived from the web of social relationships and ties individual actors find themselves in, attribute social capital to increased personal access to information and skill sets and enhanced power. According to this view, individuals could use social capital to further their own career prospects, rather than for the good of organisations.

It is the latter conception which is reflected in Nan Lin’s definition (developed in late 1990s-early 2000s):

“Investment in social relations with expected returns in the marketplace” and “access to resources through network ties”

This definition makes the heretofore elusive SC concept quantifiable, allowing for empirical testing of SC theory predictions, which contributes to its widespread use, in economic study of SC and beyond.

A quick read of one of Dragan’s papers confirmed that he was likely using Lin’s definition during live sessions in Week 3 of the course (Gašević, D., Zouaq, A., Janzen, R. (2013). ‘Choose your Classmates, your GPA is at Stake!’ The Association of Cross-Class Social Ties and Academic Performance. American Behavioral Scientist, 57(10), 1459-1478. doi: 10.1177/0002764213479362 from week 4 Additional Resources).

Thus network brokers with high betweenness centrality scores and super connectors with high centrality scores maximise individual’s potential for drawing on information/resources from many/diverse others in their SN. This is inevitably a simplified and unidimensional proxy for social capital embodying an individualistic view. Using it in LA makes this conception a central message in our classrooms. Is this really the lesson we want our learners to come away with?

Thankfully, I also found Borgatti, Jones and Everett’s 1998 attempt at mapping more of SC facets onto SNA measures, explicitly trying to tackle the individual AND the group perspective (Borgatti, Stephen P., Candace Jones, and Martin G. Everett. “Network measures of social capital.” Connections 21, no. 2 (1998): 27-36). So perhaps the group perspective can be salvaged in application of SNA to LA after all. Phew!

I know this would be a great point to finish on. But just a couple more reflections.

An interesting tidbit from the Hangout with Shane Dawson, SNAPP co-creator, in relation to the meaning of betweenness centrality scores. You see, it really depends on the context. Even if you just look at information flow/communications or who-talks-to-whom in the discussion forums for simplicity (and exclude other forms of relationships), it can mean drastically different things. In one of his studies of a distance learning cohort high betweenness scores correlated with student dissatisfaction (the students were those who tried to talk to many groups looking for help and never got it). In other studies of campus, blended courses, betweenness positively correlated with creativity. So, as Shane emphasised, interpretation of the SNA outputs is even more important than understanding the underlying maths. The same SC proxy values can mean entirely different things!

I have to admit to a greater-communal-good bias in my conception of SC. Social learning does not end in how much an individual can draw on the knowledge of others. Social learning is also about working well in collaborative teams and organisations and contributing to their social capital (be it via creation of positive atmosphere through provision of cakes, or bringing in new ideas through your external connections). How would this look in SNA?

I designed many of my activities for students as cooperative tasks – they were to work as a team and their individual performance was linked to how well they performed as a group. Now, Shane suggested that on-the-go SNA analysis of discussion forums can provide early feedback for the instructor to diagnose issues and intervene (change the shape of the community). Neat. Now – if in my teams I saw student behaviour resulting in high centrality (star shape) – single node maximising their social capital by maximising their centrality value some would say – I would be worried. We do not want a single student to dominate the conversation just like we do not want the instructor to do so! This is a common issue in any f2f classroom and most instructors would have many strategies to discourage such behaviour as it takes away opportunities to contribute from other participants. To take this further – any healthy collaborative team would be expected to have high social capital – lots of connections within SN (high network density). And each member would expect to benefit from it. Yet I would not expect each member to look the same in SNA (just like team roles are not all the same – i.e. you have one person chairing the meeting, one taking notes, one bringing cookies and yet another making sure that minuted actions are accomplished to the timetable), especially when zooming out to inter-team relationships. Would a team consisting solely of high betweenness scorers be successful? While they spend all their time as social butterflies, scouting the horizon for new ideas and positioning themselves between groups in the classroom, who would actually be doing the work of implementing those ideas within the team?

Very keen to get into some literature in search of the group SC conceptions in SNA for LA! Perhaps week 4 material will bring some goodies to sink my teeth into:)

Top image source: Flickr by mike baker under CC license.

DALMOOC

#dalmooc wk3 homework: Twitter and blog networks in CCK11 – SNA in Gephi part 2

November 13, 2014 Kay 1 Comment

Feels a bit like learning finger painting…

Hey – and I am back from fiddling with pretty pictures (had a bit of a pause to consider meanings of “social capital” on the way – there will be a post about that, don’t you worry!)

So Part 2 of this activity/homework was to get some visualizations going on the Twitter and blog data sets from the CCK11 (Part 1 – prelim analysis here).

Exploring layouts

I used the larger Twitter set to have a play around with the layouts.

The Yifan Hu (YH) algorithm seemed to produce a vis which pulls external nodes out/away from the centre of the vis so that they form a jagged circle. Fruchterman Reingold (FR) laid it out so that a smooth circle was created with less differentiation between sub-clusters.

YH seems to have more properties to tweak but I had no idea what some of them meant, e.g. Quadtree max level or theta (despite a handy definition tip appearing at the bottom when you click on any of them). In fact, changing values of most did not seem to have any substantial effect on the overall shape of the visualisation – at first glance anyway. Two which had most effect were relative strength and optimal distance.

Relative strength – The relative strength between electrical force (repulsion) and spring force (attraction). Smaller values produce tighter central cluster. If you want to see inside the central cluster – make it larger!
Optimal distance – the natural length of the springs. Bigger values mean nodes will be further apart. Again – if you want to see individual nodes – increase the value. To tighten individual clusters/make communities more visible make it smaller.

FR had fewer properties so was easier to understand at a first glance (area, gravity and speed). In FR lower gravity (force attracting nodes to the centre) values made the cluster less tight, making it easier to distinguish individual nodes. It took much longer to run and needed to be manually stopped.

Without use of colour to highlight modularity I found it difficult to see any structure within FR layout, so I opted for using YH method for the analysis. It seems that OpenOrd (modification of FR) would be best for detection of distinct clusters – but this must be an imported add on as I do not see it in the default layout list. Something to explore at another time:)

I also found Gephi Tutorials on Slideshare re:layout helpful:

The choice of methods depends on topology you want to emphasise (also size of your network though)

Explanation of FR method:

Area = graph size area
Gravity = increasing gravity reduces dispersion of disconnected components by pulling them into the centre

IMPORTANT: When the algorithm does not converge – need to reduce speed to gain precision (unstable nodes position/unstable graph)

Explanation of YH method:

Including demystification of some of the more obscure parameters:

Sizing the network nodes based on centrality measures

Playing around with sizing the nodes based on their centrality measures gave a quick overview of the visual overlap between the nodes ranking highest on the different measures. For example, in the Twitter network, nodes with the highest values of betweenness centrality also looked like the ones with the highest values of degree centrality (see below).

Ultimately I sized the nodes based on betweenness centrality and inserted the degree centrality score as a label (purely a matter of convenience as degree centrality values were just too large to fit into the node circles!).

Visualising communities identified in the networks

The tutorial suggested by Dragan – by Jen Golbeck – was useful here, although I have still not worked out how to highlight the selected nodes in the Data Lab spreadsheet from the right click.

I fiddled with size range for the nodes by increasing the minimum size so that the circles are large enough to show the colours.

I also played around with modularity factor, increasing it to 2 for Twitter and to 1.8 for blogs in order to decrease the number of communities for clarity (from 12 to 8 and from 6 to 4 respectively).

I also changed some of the colours to provide better contrast between the different communities.

NOTE: any overlap in colours for Twitter and blog visualisation is accidental and does not indicate overlap in communities across these two environments.

And voilà – two pretty pictures!

Blog network in CCK11 course. Node size=betweenness; node label=centrality.

Twitter network in CCK11 course. Node size=betweenness, node label=degree.

What does it all mean? I think I may have to cover that in the next post – the actual Assignment for week 3 demonstrating my competencies…It will be nice to finally have some questions to answer:)

Top image source: Flickr by Maegan under CC license

DALMOOC

Social Network Analysis basics in 300 words – #DALMOOC Assignment week 3

November 10, 2014 Kay 1 Comment

Simple social network graph example (actors=nodes; relationships=lines)

Week 3 was a useful refresher about SNA for me (after all I “successfully” completed a whole MOOC about it before;). The succinct presentation of the basic network structures and measures was probably more useful than delving into the mathematical details behind each of them – and it certainly made it easier to see the applicability and meaning of such measures in IRL settings.

Summarising it AND reflecting on it all in 300 words – that’s a bit of a challenge! But here goes (word count starts now):

SNA aims to study relationships (e.g. communication, advice networks, hindrance) among actors that interact with one another in social networks via a combination of graphical representation and parameters describing the network structure. Dragan emphasised that, currently, SNA is the most popular form of LA.

Network elements

Each actor is represented as a node (vertex) and existing relationships with others as lines (edges/ties/arcs/links). Relationships can be undirected or directional, and can be weighted (e.g. by volume of exchanges).

Network measures

Connectivity of the entire network (ease of communication of information within the network):

Diameter – the longest of the shortest-path distances between any pair of nodes within the network
Density – actual number of connections/potential number of connections

Centrality measures (identifying importance of individual actors within the network):

Degree centrality = overall number of actor’s connections
For directional networks:
- In-degree centrality aka popularity or prestige
- Out-degree centrality aka gregariousness

Betweenness centrality aka network brokers
Closeness centrality – how short the distances are from an individual node to everybody else in the network
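To make these definitions a bit more concrete, here is a small sketch that computes density, diameter, degree and closeness for a made-up five-node undirected network using nothing but breadth-first search. The network is hypothetical, the closeness formula used is the common "(n - 1) divided by the sum of distances" variant, and it assumes a connected network; betweenness is left out because it needs the extra step of counting shortest paths through each node.

```python
from collections import deque

# Hypothetical undirected toy network as an adjacency mapping.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}


def distances_from(graph, source):
    """Breadth-first search: shortest path length (in hops) to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


n = len(graph)
n_edges = sum(len(nbrs) for nbrs in graph.values()) // 2  # each edge is stored twice

density = n_edges / (n * (n - 1) / 2)                      # actual / potential ties
degree = {node: len(nbrs) for node, nbrs in graph.items()}
all_dist = {node: distances_from(graph, node) for node in graph}
diameter = max(max(d.values()) for d in all_dist.values())
# Closeness (assumes a connected network): higher means closer to everybody else.
closeness = {node: (n - 1) / sum(d.values()) for node, d in all_dist.items()}

print("density:", density, "diameter:", diameter)
print("highest degree:", max(degree, key=degree.get))
print("highest closeness:", max(closeness, key=closeness.get))
```

In this toy network node C comes out on top for both degree and closeness, while the diameter is set by the long route from A or B out to E.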

Network modularity

Analysis aimed at identification of communities within the overall social network – or groups of actors with higher density within them vs between them – and critical connectors between those communities.
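The "higher density within than between" idea is usually scored with the modularity value Q: for each community, the share of edges that fall inside it minus the share expected if ties were placed at random given the node degrees. The sketch below only scores a partition that is handed to it, on a made-up network; as far as I understand, Gephi's Modularity statistic additionally searches for a good partition, which is not attempted here.

```python
def modularity(edges, community):
    """Q = sum over communities of (internal edges / m) - (degree sum / 2m)^2."""
    m = len(edges)
    internal = {}    # edges with both ends in the same community
    degree_sum = {}  # total degree of the nodes in each community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


# Hypothetical network: two triangles joined by a single bridging tie (C-D).
edges = [("A", "B"), ("A", "C"), ("B", "C"),
         ("C", "D"),
         ("D", "E"), ("D", "F"), ("E", "F")]

two_groups = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
one_group = {node: 0 for node in two_groups}

print("two communities:", round(modularity(edges, two_groups), 3))
print("one community:", round(modularity(edges, one_group), 3))
```

Splitting the toy network at the bridge gives a clearly positive Q, while lumping everybody into one community gives zero; C and D are the critical connectors between the two groups.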

Benefits of SNA

Social learning via discussion, collaboration and cooperation is currently a very popular paradigm in education and it has indeed been demonstrated to contribute to development of higher order skills such as critical thinking. Networking itself is considered an essential skill for a contemporary workplace (and they certainly want their potential employees to be able to demonstrate the ability to do it and be able to capitalise on the employees’ personal learning networks – PLNs). Applying SNA to ubiquitous data created through online interactions among learners and faculty may allow us to understand which types of online interactions are most beneficial for learning. But it should also benefit the learners directly by helping them with documentation and development of their lifelong online professional/personal learning networks.

Applications of SNA

In my experience, the richest source of Learning Analytics data within HE institutions is the institutional VLE. I had worked with distance learning postgraduate students using discussion board tools to complete discussion-based and collaborative learning activities. We provided some advice on how to effectively collaborate within such environments but this was largely based on common sense and personal experience.

SNA of discussion board interactions in correlation with student marks for the project and overall achievement could help provide better a priori advice to students in how to tackle such collaborations. Use of SNAPP tool may be a good start here.

This student cohort had minimal interactions outside the VLE, as they were not meeting f2f and shunned social media, hence the data would capture the majority of social interactions among the learners. This would not have been the case for a campus-based or younger cohort, where data from other social spaces, e.g. social media, may need to be included. I believe that at most HE institutions in the UK students do sign off on their data use for improvement of learning, which would cover data collected within institutional VLEs. Even in this case students may perceive such use as an invasion of privacy and some educators see it as unethical grooming of students into surveillance culture. The ethical and legal implications of using data from students’ personal social networks such as Facebook (even when formally used for teaching) are likely to be more complex. The EU data protection legislation makes this particularly so.

End of word count. That will be around 600 words folks!

My first impression of SNA application to learning is that this is very much still a work in progress – i.e. SNA is being used for discovery/research rather than as means of routine monitoring and of effective social networking patterns for learning. Perhaps I will be proved wrong over the next week;)

DALMOOC

Twitter and blog networks in CCK11 MOOC – SNA in Gephi part 1 for #dalmooc

November 10, 2014 Kay 2 Comments

Twitter vs Blog networks at CCK11 MOOC – SNA in Gephi

We are onto Social Network Analysis in week 3 and now actually doing it rather than talking about it (I am running behind, of course, as this is very much a start of week 4 now. Yikes!)

So, we were given two datasets collected during the Connectivism and Connected Knowledge MOOC in 2011, encompassing exchanges between participants on Twitter and via blogs (communication via comments). Each data set had a version collected in week 6 and week 12 of the course. The data was pre-processed into the format directly importable into Gephi (a lovely open-source and free SNA visualisation tool).

For Twitter: “graphs included all authors and mentions as nodes of the network, and the edges between them were created if an author or an account were tagged within the tweet. For example, if a course participant @Learner1 mentioned @Learner2 and @Learner3 in a tweet, then the course Twitter network would contain @Learner1, @Learner2, and @Learner3 with the following edges: @Learner1 – @Learner2, and @Learner1 – @Learner3.”

For blogs: “[graphs] included authors of the blog posts (i.e. blog owners) and the authors of the comments to individual blog posts. If a learner A1 created a blog post, and then learners B1 and C1 added comments to that post, then the corresponding network would contain nodes A1, B1, and C1 with the following edges: A1-B1, and A1-C1. All the four networks are undirected.”
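To make the edge rule in the two quotes concrete, here is a minimal Python sketch of how such an undirected network could be assembled. It is only an illustration: the sample records, the `mention_edges` helper, the decision to collapse repeated mentions into a single edge, and the skipping of self-mentions are all my assumptions, not a description of how the course team actually pre-processed the data.

```python
# Hypothetical sample records: (author, [accounts mentioned in the tweet]).
# The same shape works for blogs: (post author, [comment authors]).
tweets = [
    ("@Learner1", ["@Learner2", "@Learner3"]),
    ("@Learner2", ["@Learner1"]),
]


def mention_edges(records):
    """Return (nodes, edges) for an undirected network built from the records."""
    nodes, edges = set(), set()
    for author, mentioned in records:
        nodes.add(author)
        for account in mentioned:
            nodes.add(account)
            if account != author:  # assumption: ignore self-mentions
                # frozenset makes A-B and B-A the same undirected edge
                edges.add(frozenset((author, account)))
    return nodes, edges


nodes, edges = mention_edges(tweets)
print(sorted(nodes))
print(sorted(tuple(sorted(edge)) for edge in edges))
```

With these sample records the reply from @Learner2 to @Learner1 adds no new edge, because the network is undirected and the pair is already connected.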

The pre-processed data is only available to the course participants and its use restricted to completing our assignments so I cannot share it here.

In this analysis step I imported each set into Gephi without any glitches and performed basic analyses and filtering for each of the sets at week 12 of the course. As per instructions this included computing the density measure and centrality measures (betweenness and degree) introduced in the course, followed by applying the Giant Component filter to filter out all the disconnected nodes and identifying communities using the modularity algorithm. Dagan’s walkthroughs in Gephi were very useful here (introduction + modularity analysis).
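For anyone wondering what sits behind some of these buttons, here is a rough Python sketch of three of the measures for an undirected network: density, average degree and the giant component. The toy edge list and the `giant_component` helper are made up for illustration (the course data cannot be shared), and I am assuming the standard textbook definitions rather than quoting Gephi's internals. Betweenness and modularity are left out, as they take considerably more code.

```python
from collections import defaultdict, deque

# Hypothetical toy network, not the CCK11 data.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "E")]
isolated = ["F"]  # a node that has no edges at all

adjacency = defaultdict(set)
for u, v in edges:
    adjacency[u].add(v)
    adjacency[v].add(u)
for node in isolated:
    adjacency[node]  # touching the key registers the node with no neighbours

n = len(adjacency)  # number of nodes
m = len(edges)      # number of undirected edges

# Density: edges present divided by edges possible, n * (n - 1) / 2.
density = 2 * m / (n * (n - 1))
# Average degree: every edge adds one to the degree of two nodes.
average_degree = 2 * m / n


def giant_component(adjacency):
    """Return the largest connected component as a set of nodes (breadth-first search)."""
    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


print(f"density={density:.3f}, average degree={average_degree:.2f}")
print("giant component:", sorted(giant_component(adjacency)))
```

One thing the formula makes obvious: density divides by the number of possible edges, which grows with the square of the number of nodes, so a larger network can easily have a lower density even when its average degree is higher.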

I report the key numbers in the table above.

It was nice to see some numbers:) And it was immediately obvious that there were some differences between Twitter and blogs, e.g. more nodes and edges in the former and twice as many “communities”.

But instantly I was concerned – what do these numbers actually mean?

Is network density of 0.003 good or bad? What does it actually mean in terms of e.g. speed of information flow in minutes/days? Has this stuff been quantified? Or is it just a matter of getting a feel for it with experience?

Or perhaps it only works for comparisons? If so – how do I tell if the difference in network density between Twitter and blogs (0.003 vs 0.01, respectively) is actually significant? And would this significant difference in a network measure value have any meaningful effect on the ease of information flow within each network?

Clearly, still a lot to learn. Onto making some pretty pictures (oh – sorry – visualisations;) with the said data for part 2 of this task. Won’t be long I hope:)

DALMOOC

Getting to grips with educational data sources in the UK – and the NSS mini-case study

November 6, 2014 Kay 1 Comment

Skipping ahead to week 2 of #DALMOOC and some data wrangling. In one of the assignments (assignment56) George Siemens challenged us to go forth and find some educational data…so I did.

I am not embedded within an institution at the moment (and even if I were, I’d imagine there would be ethical/legal issues in exposing institutional data like this, even with aggregation/anonymisation). So I had to go foraging for some publicly available sources.

There were some suggestions from George Siemens – covering the US and the world:

- Integrated Postsecondary Education Data System (IPEDS)
- US Department of Education
- Organisation for Economic Co-operation and Development (OECD) – which has a very nice Data portal (beta) with interactive graphs and data download options
- The World Bank

Of course, at that level the data really speaks more to the academic analytics level – looking at student demographics, institution or programme completion rates, fees, staff-student ratios etc. Not as exciting as getting to the nitty-gritty of learning but will have to do!

At that level the data is also heavily summarised and aggregated so the opportunities to drill down for your own analysis are limited.

Never one for making it easy for myself, I thought I’d have a wee look at the data sources available locally, in the UK.

For a novice like me, it was a bit of a challenge…

Thankfully, I came across some “data journalism” blogs with handy advice:

Help me investigate…Education (based on Help Me Investigate.com) has a nice summary of the main national Higher Education data sources, including:
- Universities and Colleges Admissions Service (UCAS) – Data on university applications and acceptances
- Higher Education Statistics Agency (HESA) – Data on performance and destinations. Data is available via the HEIDI database.
- Regional Funding Councils – e.g. HE Funding Council for England (HEFCE)

FullFact had a summary of sources with a broader education focus, but it concentrated on published analyses and reports rather than raw data.

The UK Data Service provides a centralised Discover portal to a number of education-related data sets, including the international ones (e.g. OECD or EU-based).

There is also the brand new and so far limited but rather ambitious data.ac.uk, aimed at aggregating openly available HE data (thanks to helpmeinvestigateeducation and Tony Hirst for pointing me to this source).

Rather than seeing what’s there, I decided to look for a particular data set – National Student Survey. The survey started in 2005 or so and was aimed at documenting student satisfaction with their courses at their graduation, using a very simple questionnaire (PDF). It started in England but now extends to some Scottish institutions in my own backyard. Seemed like a great idea to give university applicants some simple info on quality of teaching. But it has been causing quite a controversy, especially when some more esteemed establishments within the Russell Group Universities scored close to the bottom of the rankings on the quality of student feedback. There was much grumbling about methodology of course (some of it justified) but also institutional action to address the shortcomings in assessment strategies (or students’ perceptions of feedback).

I thought it would be fun to see how the Russell Group is doing these days:)

I found it in two places: on the HEFCE website in simple, poorly documented Excel spreadsheets, and via HESA in a more aggregated format as part of a wider and well-documented XML data set underlying the Unistats website (the latter includes data on student retention, salaries, careers, staff/support etc.).

I went for the HEFCE dataset from 2013 (2013 NSS results by teaching institution for all institutions). It was in a familiar Excel format and much of it in “human”, easily understandable language. The results were granular to subject area/degree within each institution. Some clean-up/reformatting was required, but it was likely to be minimal (I think I will write about it in another post).

I thought it would be neat to use a map for some of the visualisations (who doesn’t love a map;). Geolocation data for the institutions was missing from the HEFCE NSS data – but data.ac.uk came to the rescue here with their list of registered learning providers. They even threw in institutional groupings (e.g. Russell Group etc.) for good measure (see augmented data set). Both sets included the UKRLP code, which should make for an easy join.

HEFCE set only contained question numbers so I needed to create another table containing question text as well as the evaluated course aspect – and I used the exemplar questionnaire from NSS website as an input here. I would use question number as a join with the NSS set.
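The two joins described above (NSS results to learning providers on the UKRLP code, and NSS results to question text on the question number) could be sketched in Python roughly as follows. Everything here is a placeholder: the field names, codes, coordinates and values are invented for illustration and do not come from the HEFCE or data.ac.uk files, and the plan in this post was to do the joining in Tableau rather than in code.

```python
# All field names and values are hypothetical placeholders, not the real data.
nss_rows = [
    {"ukrlp": "10000001", "question": "Q1", "pct_agree": 90},
    {"ukrlp": "10000002", "question": "Q2", "pct_agree": 65},
    {"ukrlp": "19999999", "question": "Q1", "pct_agree": 80},  # no matching provider
]
providers = {
    "10000001": {"name": "Example University A", "lat": 55.9, "lon": -3.2, "group": "Example Group"},
    "10000002": {"name": "Example University B", "lat": 51.5, "lon": -0.1, "group": None},
}
questions = {
    "Q1": {"text": "Example question text 1", "aspect": "Example aspect A"},
    "Q2": {"text": "Example question text 2", "aspect": "Example aspect B"},
}


def join_nss(nss_rows, providers, questions):
    """Inner-join NSS rows to providers (on UKRLP code) and questions (on question number)."""
    joined = []
    for row in nss_rows:
        provider = providers.get(row["ukrlp"])
        question = questions.get(row["question"])
        if provider is None or question is None:
            continue  # drop rows without a match on either key
        joined.append({**row, **provider, **question})
    return joined


for record in join_nss(nss_rows, providers, questions):
    print(record["name"], record["aspect"], record["pct_agree"])
```

Dropping unmatched rows silently is a design choice made for brevity here; with real data it would be worth listing them instead, since a missing UKRLP match usually points to a clean-up problem.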

Phew – it was rather hard work all this rummaging for data. And this is even before the clean up and playing around with it in Tableau!

What I learned:

- much of the data (especially in the less aggregated and more valuable format) is available via subscription only to the members of educational institutions (e.g. HEIDI or Discover portal)
- some of the databases have in-built interactive visualisation tools, e.g. Discover
- each dataset has its own terms and conditions for use – you must read (or click through) a lot of bumph even before you get started, especially on portals aggregating datasets across sources!
- data derived from the same data collection exercise can often be found via different sources varying in degree of aggregation, data integration and documentation – it looks like it is worth looking around for something that fits your needs
- getting to know even well-documented and structured data can be hard work, especially for a database novice like me (labels etc. are rarely written in human-readable language and you have to digest a lot of definitions)
- it is likely that you will have to find more than one data source to cover all the aspects you need for your analysis
- even highly curated data sets may need some clean up

Image source: Flickr by ttcopley under CC license
