Experimental Options for Analyzing Social Networks in Messaging Systems

Social network analysis is the study of connections, flows, and structure among people, groups, organizations, and systems. The points or nodes in the network may include people, routers, or even disease vectors. The ability to analyze communication patterns and social networks has become a major component of eDiscovery systems. Packages from Autonomy’s Zantaz, Cataphora, and Seagate’s i365 MetaLINCS all feature social network analysis functionality. Research, development, and experimentation in social network analysis tools are likely to make significant contributions to commercial eDiscovery systems in the future. Community, communication, and collaboration services, such as LinkedIn, Twitter, Facebook, and MySpace, are now commonly used in conjunction with institutional systems, but most compliance and archiving systems do not yet integrate with these external services. In this article I discuss the NodeXL and Maltego applications. Each of these tools has a specialized feature set that could offer insight into how future eDiscovery platforms might handle external data and the analysis of social networks.

Research in social network analysis and network theory has a rich literature that spans many disciplines including anthropology, criminology, economics, epidemiology, political science, psychology, sociology, and statistics. Social scientists in anthropology, psychology, and sociology developed modern social network analysis methods from the 1930s to the 1960s. Starting in the 1970s, social network analysis attracted researchers from a growing array of fields and a rapidly increasing number of subspecialties. Mark Granovetter’s 1973 paper “The Strength of Weak Ties” and Stanley Milgram’s small world project (the source of the idea of “six degrees of separation”) were both fundamental to the growth of the field. Euler’s 1736 paper on the “Seven Bridges of Koenigsberg” problem is considered the first paper on graph theory, which underlies much of the mathematics behind social network analysis. An excellent place to begin looking for more information is the Web site of the International Network for Social Network Analysis (INSNA), a professional organization dedicated to advancing the field of social network analysis.

NodeXL

NodeXL is a free and open source social network analysis and visualization tool that is an add-on package for Microsoft Excel 2007. Its ease of use, combined with integrated import functionality for multiple common types of data, makes NodeXL particularly compelling. NodeXL is straightforward to use and quick to set up, and anyone who is familiar with Excel can analyze data with it without programming experience. The software is in use by both academics and professionals. Several university classes teach social network analysis with NodeXL. The primary NodeXL documentation is a well-written tutorial developed for the courses. Update: the book Analyzing Social Media Networks with NodeXL: Insights from a Connected World is available as of September 2010.

One major difficulty with many experimental and research systems is that mechanisms for importing real world data are often limited or nonexistent. In these systems, data extraction, normalization, and cleaning may involve significant effort. Commercial electronic discovery systems typically include an integrated set of components that are capable of managing the entire lifecycle of the eDiscovery process including preservation, collection, processing, review, and analysis. Most eDiscovery systems are able to import common document and messaging formats. NodeXL is able to import real world data from multiple sources including email messages, Twitter messages, Flickr tags, and YouTube user networks. NodeXL relies on Windows Desktop Search in XP or Windows Search in Vista to import email. The software can also import formats used by other social network analysis tools, including UCINET, GraphML, Pajek, and CSV. The software requires Windows XP or Vista, Office 2007, and several other updated system components from Microsoft, but no other third-party software.
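To make the extraction and normalization step concrete, the following is a minimal sketch of turning email headers into a weighted edge list and writing it out as CSV, one of the generic formats mentioned above. It uses only the Python standard library. The sample messages, the function names, and the column names (source, target, weight) are assumptions chosen for illustration; they are not NodeXL's import schema or code from NodeXL itself.

```python
import csv
import io
from collections import Counter
from email import message_from_string
from email.utils import getaddresses

# Hypothetical sample messages; real input would come from a mail archive.
RAW_MESSAGES = [
    "From: alice@example.com\nTo: bob@example.com, carol@example.com\n"
    "Subject: Q3\n\nBody",
    "From: bob@example.com\nTo: alice@example.com\nSubject: Re: Q3\n\nBody",
    "From: alice@example.com\nTo: bob@example.com\nCc: dave@example.com\n"
    "Subject: Plan\n\nBody",
]


def email_edges(raw_messages):
    """Count directed sender -> recipient pairs across all messages."""
    edges = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        senders = getaddresses(msg.get_all("From", []))
        recipients = getaddresses(
            msg.get_all("To", []) + msg.get_all("Cc", [])
        )
        for _, sender in senders:
            for _, recipient in recipients:
                if sender and recipient:
                    # Normalize case so Alice@ and alice@ are one vertex.
                    edges[(sender.lower(), recipient.lower())] += 1
    return edges


def edges_to_csv(edges):
    """Serialize the edge counts as CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["source", "target", "weight"])
    for (sender, recipient), weight in sorted(edges.items()):
        writer.writerow([sender, recipient, weight])
    return out.getvalue()


EDGES = email_edges(RAW_MESSAGES)
CSV_TEXT = edges_to_csv(EDGES)
print(CSV_TEXT)
```

Real mail archives need considerably more cleaning than this, for example merging multiple addresses that belong to one person and removing mailing-list traffic.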

Commercial eDiscovery systems include a much wider array of analysis features, such as the ability to reconstruct conversations and threads as well as a history of connections between messages, documents, and access logs. Dedicated SNA packages, such as NodeXL, typically contain more specialized network metrics, network layouts, and visualizations than general eDiscovery systems, although they support a more limited set of data types. These specialized packages allow for experimentation with different types of analyses and for comparison with existing analyses. The NodeXL team plans to include support for additional popular social network services such as Facebook and enterprise information sources such as Active Directory. Access to the data is provided through official APIs from each of the services. The terms of service for an API typically restrict how much data may be collected and how the data may be used. The use of these APIs to collect data for legal action will almost certainly require a court order to remain compliant with the terms of service.
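As an illustration of what a network metric is, the sketch below computes degree centrality, one of the simplest such measures: the fraction of the other vertices that each vertex is directly connected to. The edge list and function name are made up for the example, and this is not how NodeXL or any particular package implements the metric.

```python
from collections import defaultdict

# Hypothetical undirected ties, e.g. pairs of people who exchanged email.
EDGES = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("alice", "dave"),
    ("bob", "carol"),
]


def degree_centrality(edges):
    """Return, for each vertex, its neighbor count divided by (n - 1)."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            neighbors[a].add(b)
            neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {vertex: 0.0 for vertex in neighbors}
    return {vertex: len(adj) / (n - 1) for vertex, adj in neighbors.items()}


CENTRALITY = degree_centrality(EDGES)
for vertex, score in sorted(CENTRALITY.items()):
    print(f"{vertex}: {score:.2f}")
```

In this toy network alice is tied to every other person and so scores highest, while dave, with a single tie, scores lowest.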

Maltego

Maltego from Paterva is a unique tool that bills itself as an “open source intelligence” application that could be viewed as an eDiscovery system for the Internet at large. Paterva is based in South Africa. Roelof Temmingh, who is active and vocal in the security community, formed the company in 2007. Maltego launched as a commercial product in 2008.

The application helps you to collect information about people, documents, network resources, and the trails of information we increasingly leave around the Internet. Once you have gathered your information, Maltego provides methods to analyze the data, infer relationships among the items, and then visualize these connections. Maltego excels at enumerating Internet infrastructure such as information about IP addresses, netblocks, autonomous system numbers, DNS records, and WHOIS records. There are also methods to collect information about email addresses, mail servers, URLs, phone numbers, document metadata, and social network services.

Maltego relies on specialized data connectors called “transforms” that interact with online services to gather information from many sources. For example, one set of transforms connects to the WikiScanner service and allows a user to query if a particular IP address or netblock has made edits to Wikipedia or query which IP address made a particular edit. Another set of transforms allows users to discover connections between phone numbers, email addresses, URLs, and IP addresses. In the first version of the application, known then as Evolution, the transforms ran directly on the Maltego client machine. After a legal threat from a large social network, Paterva reexamined its data sources and determined that several of the transforms violated the terms of service.
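The general idea of a transform, a connector that takes one entity and returns related entities, can be sketched in a few lines. Everything in the example below is hypothetical: the Entity type, the registry, and the two transforms are invented for illustration and are not Maltego's transform API. Real transforms query online services, whereas these two only parse the input string so that the sketch runs offline.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple
from urllib.parse import urlsplit


class Entity(NamedTuple):
    kind: str   # e.g. "email", "url", "domain"
    value: str


Transform = Callable[[Entity], List[Entity]]


def email_to_domain(entity: Entity) -> List[Entity]:
    """Derive the mail domain from an email address."""
    local, _, domain = entity.value.rpartition("@")
    return [Entity("domain", domain.lower())] if local and domain else []


def url_to_domain(entity: Entity) -> List[Entity]:
    """Derive the host name from a URL."""
    host = urlsplit(entity.value).hostname
    return [Entity("domain", host)] if host else []


# Registry mapping an entity kind to the transforms that accept it.
TRANSFORMS: Dict[str, List[Transform]] = {
    "email": [email_to_domain],
    "url": [url_to_domain],
}


def expand(entity: Entity) -> List[Tuple[Entity, Entity]]:
    """Run every registered transform and return (input, result) links."""
    links = []
    for transform in TRANSFORMS.get(entity.kind, []):
        for result in transform(entity):
            links.append((entity, result))
    return links


SEEDS = [
    Entity("email", "alice@example.com"),
    Entity("url", "https://www.example.com/about"),
]
LINKS = [link for seed in SEEDS for link in expand(seed)]
for source, target in LINKS:
    print(f"{source.kind}:{source.value} -> {target.kind}:{target.value}")
```

Each (input, result) pair is an edge, so repeatedly expanding the results builds up the kind of graph that a tool like Maltego then visualizes.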

Paterva then redesigned the software and released a second version called Maltego that ran the transforms from Paterva’s servers rather than on the client. Paterva eliminated all the transforms that potentially violated terms of service for various providers, including all social network service transforms. In addition, Paterva changed the primary search engine from Google to Yahoo!, which allows automated queries under its terms of service. The new architecture also allowed Paterva to quickly add new transforms to the service and to manage the number of API calls so that users do not reach the limits defined by the services. Recent versions of Maltego once again include the ability to perform local transforms, so that users can create and share their own transforms, and add the ability to extract data from local files or databases.

Paterva offers a commercial edition and a free community edition of the Maltego client. The commercial edition costs $430 USD for the first year and $320 USD per year for renewals. The community edition is free, but places some limitations on use including a limited number of queries per day, no export of data, and limited levels of detail. Paterva also offers a Maltego Transform Application Server (TAS) that allows customers to run transforms from their own server. This allows customers to integrate Maltego with their own infrastructure and, for privacy reasons, eliminate their reliance on Paterva’s servers. Maltego Mesh is an experimental Firefox plugin that automatically extracts entities from Web pages, such as names, companies, email addresses, phone numbers, dates, and IP addresses. Mesh can then save these entities along with the source of each entity for later analysis in Maltego.
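Entity extraction of the kind described for Mesh can be approximated, for the simplest entity types, with pattern matching. The sketch below pulls email addresses and IPv4 addresses out of a block of text and records the source alongside each one. The patterns are deliberately simple, the page text and URL are made up, and none of this reflects how Mesh is actually implemented; names, companies, and dates would need far more than regular expressions.

```python
import re

# Simplified patterns: good enough for a demonstration, not for production.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_entities(text, source):
    """Return (kind, value, source) tuples for emails and IPv4 addresses."""
    found = []
    for match in EMAIL_RE.finditer(text):
        found.append(("email", match.group(0), source))
    for match in IPV4_RE.finditer(text):
        # Reject dotted numbers whose octets are out of range, e.g. 999.1.1.1.
        if all(int(part) <= 255 for part in match.group(0).split(".")):
            found.append(("ipv4", match.group(0), source))
    return found


# Hypothetical page content and URL.
PAGE = (
    "Contact admin@example.org or ops@example.org. "
    "Server 192.0.2.10 replaced 999.1.1.1."
)
ENTITIES = extract_entities(PAGE, "http://example.org/contact")
for kind, value, source in ENTITIES:
    print(kind, value, source)
```

Keeping the source with every entity is what makes the later analysis step possible, since two entities found on the same page can be linked to each other.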

Acquiring datasets of real world examples that are of significant size and do not carry significant legal or privacy restrictions can be a serious problem when evaluating systems for eDiscovery, social network analysis, anti-spam, and email content analysis. Enron’s investigation by the Federal Energy Regulatory Commission (FERC) led to the email messages used in the trial entering the public record. Academics acquired, processed, and cleaned these emails and made them available for others to analyze. The result is known as the Enron Email Corpus, which contains approximately a half million messages from more than 150 individuals. The corpus has resulted in a significant number of publications and has been a boon to researchers and practitioners alike, as it is a tremendous resource to experiment with and test against.

Communications applications, and therefore eDiscovery systems, will increasingly be designed with an awareness of social network systems and social network analysis. The tools presented in this article offer some insight into potential future developments for these systems.

* This article originally appeared as Experimental Options for Analyzing Social Networks in Messaging Systems in the November 2009 issue of Messaging News magazine. Minor updates and link to NodeXL book added on September 13, 2010.