10 Best Node.js Frameworks For Developers


With the arrival of Node.js, JavaScript has come to the forefront. This was bound to happen, since JavaScript was already a well-known programming language used by developers in browsers. With Node.js it has found its way to the server side, which removes the complexity of using two different languages at the two ends. Today, Node.js offers one of the most innovative approaches to building servers and web/mobile applications. Its single-threaded event loop and asynchronous, non-blocking input/output (I/O) processing distinguish it from other runtime environments. Its scope is rapidly increasing with valuable contributions from the developer community and technology giants. Right now, several performance-driven frameworks are being developed on the core principles and approaches of Node.js. These frameworks have extended the functionality of Node.js considerably and have also added newer features.

Today, frameworks like Express.js and Hapi.js are gaining prominence for designing better websites and mobile applications. Hence, it has become important to embrace the latest innovations that are being brought into the tech world by these Node.js frameworks. With this intention in mind, I decided to compile a list of popular Node.js frameworks and their useful applications. I am presenting a list of 10 Best Node.js Frameworks, which are currently redefining the application development field.

1) Hapi.js

Hapi.js is a powerful Node.js web framework for building application programming interfaces (APIs) and other software applications. The framework has a robust plugin system and numerous key features, including input validation, configuration-based functionality, caching, error handling, logging and more. Hapi.js is used for building useful applications, such as Postmile, a collaborative list-making tool. It is also used by several large-scale companies, such as Disney, Concrete, PayPal, Walmart and more.

2) Socket.io

Socket.io is a Node.js server framework for building real-time web applications. As a JavaScript library, it allows event-driven, bidirectional communication between web clients and the server. Socket.io works as a client-side library running in the browser and as a server-side library for Node.js. The framework allows real-time concurrency for document collaboration and data exchange. Its key features also include asynchronous input/output (I/O) processing, binary streaming, instant messaging (the ‘Hello World’ chat application) and more.

3) Express.js

Express.js is one of the most essential web frameworks for Node.js. It is a minimalist framework for building a host of web and mobile applications as well as application programming interfaces (APIs). Many popular applications and websites, like MySpace, Geekli.st, Klout, Segment.io and Yummly, are powered by Express.js. Express.js offers various features, like template engines, simplified multiple routing, database integration and more.

4) Mojito

Mojito is a JavaScript framework based on Yahoo! Cocktails, a mobile application development platform introduced by Yahoo! Developer Network. JavaScript is the only programming language used on the Yahoo! Cocktails platform. Since client and server components are both written in JavaScript, Mojito can run on both the client side (browser) and the server (Node.js).
Mojito is a model-view-controller (MVC) framework offering a gamut of features, such as:

  • Convenient data fetching.
  • Local development environment and tools (Yahoo! independent stack).
  • Integrated unit testing.
  • Library for simplifying internationalization & localization.

5) Meteor

Meteor is an open-source, model-view-controller (MVC) framework for building websites and web/mobile applications. The framework supports OS X, Windows and Linux. It allows writing both the client and server parts of an application in JavaScript. Meteor’s built-in set of pre-written, self-contained modules makes writing application code easier. Moreover, its reactive programming model allows creating applications with less JavaScript code. Meteor is also a powerful framework for building real-time applications. Popular applications built using Meteor include Respondly (a team collaboration app), Blonk (a job search mobile app) and more.

6) Derby

Derby is a model-view-controller (MVC) JavaScript framework for both the client side and the server side. It is ideal for creating real-time mobile and web applications. Derby’s Racer, a real-time data synchronization engine for Node.js, allows multi-site, real-time concurrency and data synchronization across clients and servers. By leveraging ShareJS, Racer optimizes its conflict resolution algorithm and allows real-time editing within an application. Moreover, Derby’s server rendering allows fast page loads, search engine support, and HTML templates that render in the browser or on the server.

7) Mean.js

Mean.js is a full-fledged JavaScript framework for building web applications using the NoSQL database MongoDB, Angular.js for the front end, and Express.js/Node.js for the back end (server). It also leverages the Grunt tool to enable automated testing. Mean.js and Mean.io are both considered part of the Mean stack. Mean stands for MongoDB, Express.js, Angular.js and Node.js. Ziploop, a popular mobile shopping application, is one example of an application built on the Mean stack.

8) Sails.js

Sails.js is one of the most popular real-time frameworks around for building Node.js applications. Sails.js offers a model-view controller (MVC) pattern for implementing data-driven application programming interfaces (APIs). The framework has gained ground for building real-time chat applications, dashboards and multiplayer games. It uses Waterline for object-relational mapping and providing database solutions. Sails.js is built on top of Node.js and uses Express.js for handling HTTP requests. It is ideal for creating browser-based applications as it is compatible with all the Grunt modules, including LESS, SASS, Stylus, CoffeeScript, Jade, Dust, and more. Sails.js supports any front-end approach, such as Angular, Backbone, iOS/ObjC, Android/Java or anything else.

9) Koa.js

Koa.js is a powerful server framework for Node.js for building efficient web applications and application programming interfaces (APIs). Koa.js uses generators to avoid nested callbacks and to improve error handling. This also improves the readability of the application.
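To illustrate why generators help with callbacks and error handling, here is a minimal sketch of a generator-driven runner. It is not Koa's actual implementation; `run`, `thunk`, `readConfig` and `readUser` are hypothetical names invented for this example, and the callbacks fire synchronously only to keep the sketch small and deterministic.

```javascript
// Hypothetical callback-style functions (Node style: callback(err, value)).
// They call back synchronously here only to keep the sketch deterministic.
function readConfig(callback) {
  callback(null, { userId: 7 });
}

function readUser(id, callback) {
  if (id !== 7) return callback(new Error('unknown user ' + id));
  callback(null, { id: id, name: 'Ada' });
}

// Wrap a callback-style call in a "thunk": a function that only needs a callback.
function thunk(fn, ...args) {
  return function (callback) {
    fn(...args, callback);
  };
}

// Drive a generator: every yielded thunk is executed, and its result is
// sent back into the generator (or its error is thrown inside it).
function run(generatorFn, done) {
  const gen = generatorFn();

  function step(err, value) {
    let result;
    try {
      result = err ? gen.throw(err) : gen.next(value);
    } catch (e) {
      return done(e);
    }
    if (result.done) return done(null, result.value);
    result.value(step);
  }

  step();
}

// The flow reads top to bottom, and try/catch works across callback steps.
function* loadGreeting() {
  const config = yield thunk(readConfig);
  let failures = 0;
  try {
    yield thunk(readUser, -1);
  } catch (e) {
    // The callback error surfaces as an ordinary exception.
    failures += 1;
  }
  const user = yield thunk(readUser, config.userId);
  return 'Hello, ' + user.name + ' (' + failures + ' failed lookup)';
}

let greeting = null;
run(loadGreeting, function (err, value) {
  greeting = err ? 'failed: ' + err.message : value;
});
```

The generator body contains no nested callbacks, and the failed lookup is handled with an ordinary try/catch, which is the readability and error-handling benefit described above.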

10) Total.js

Total.js is a modern, modular Node.js framework supporting the model-view-controller (MVC) software architecture. It is fully compatible with various client-side frameworks, like Angular.js, Polymer, Backbone.js, Bootstrap and more. Total.js is fully extensible and asynchronous. One great feature of Total.js is that you don’t need tools like Grunt to compress JavaScript, HTML and CSS. Additionally, the framework has an embedded NoSQL database and supports array and other prototypes. It also supports a RESTful routing mechanism, web sockets, media streaming and more.

Conclusion:

Today, these Node.js frameworks are shaping the future of web and application development technology. Some of them like Express.js, Mean.js and Sails.js are here to stay for a while and will also matter in the long run. I am sure that with the on-going development in the field there will be a lot more like these ones in the near future. There are a lot of Node.js frameworks that have still not become very popular. But, if you know any such frameworks which are still relatively rare then do not forget to share some information about them in the comments section below. Lastly, if you are currently working or have previously worked on Node.js frameworks and have experience developing cool web or mobile applications, then you can share your insights or useful links with us. Till then, signing off.

(referenced from: http://www.devsaran.com)

9 JavaScript tools aimed at easing development

Once considered a script-kiddie toy, JavaScript has become a stalwart scripting language for developing Web applications, and vendors and open source organizations alike are pushing out IDEs and tools targeted at making JavaScript development easier and more reliable.

These tools move beyond familiar JavaScript technologies like jQuery and the growing set of JavaScript libraries to provide Web developers with a plethora of functionality, including debugging and support for HTML5 and other popular scripting languages.

Here is a look at some of the more compelling tools and frameworks for taking your JavaScript Web development projects to the next level.

ActiveState Komodo IDE 7

ActiveState Komodo IDE 7 supports JavaScript development, in addition to other popular Web development languages like PHP, Python, and Ruby. Version 7 backs the Node.js server-side JavaScript environment, offering such capabilities as editing, syntax-checking, code intelligence, and debugging. Support for CoffeeScript, which compiles to JavaScript, is featured, too. Improved syntax checking in Version 7 enables developers to check JavaScript or CSS within HTML.

Appcelerator Aptana Studio 3

Appcelerator’s Aptana Studio 3 is an open source Web development IDE that supports JavaScript, HTML, and CSS Code Assist, to aid in authoring. Integrated JavaScript debugging is offered, as is debugging for Ruby on Rails applications. Other features include support for HTML5, Git repository integration, and the ability to customize the IDE. Aptana Studio can be installed as an Eclipse IDE plug-in. Appcelerator, which recently acquired Aptana Studio, also offers the Titanium Studio IDE, which provides similar functionality.

4D Wakanda

4D Wakanda is a JavaScript development platform for building Web and mobile business applications. The included Wakanda Server features a datastore for housing application data and models, and it’s run by the WakandaDB NoSQL object engine. WakandaDB leverages JavaScript and classes for an application’s business logic. Also featured are Wakanda Studio (a visual designer and code editor) and Wakanda Framework (a client-side framework comprised of interface widgets for the browser front end, a data source layer, and a data provider to communicate with the server).

dotCloud JS

The dotCloud JS software development kit is for building Web applications rapidly with JavaScript and HTML. Applications built with dotCloud JS can be deployed on the dotCloud PaaS cloud. Developers gain access to a selection of cloud APIs without having to write back-end code or deal with servers. APIs access capabilities like data storage and real-time data synchronization. Integration with Twilio and Twitter APIs is also featured. Developers can access 14 cloud services via dotCloud JS. A software stack based on Node.js, MongoDB, Redis, and WebSocket is included as well.

Telerik Kendo UI

Telerik Kendo UI is a framework for building HTML5 and JavaScript mobile applications and sites. It incorporates adaptive rendering and leverages JavaScript, HTML5, and CSS3 to adapt a mobile application’s native look and feel on any smartphone or tablet while supporting all major browsers. Controls and widgets for building iPad user interfaces are also part of the tool. Kendo also features themes, templates, and an MVVM (Model View View Model) framework.

SproutCore

The open source SproutCore JavaScript framework is intended to enable development of Web applications with less code. SproutCore applications move business logic to the browser to provide immediate responses to users’ taps and clicks; there is no need for round trips across network connections. A binding system is featured for building data-centric applications, leveraging application state and data flow descriptions. Semantic templates allow developers to write HTML and CSS that automatically update when models change. An in-memory database is provided for managing and querying data and synchronizing with a server. Applications build a directory of static assets that can be deployed to any server.

Alpha Five v11

The Alpha Five rapid application development tool is aimed at building AJAX business applications for Web and mobile devices. Developers can include charts, graphs, and analytics in their applications, enabling users to summarize trends. The tool features JavaScript classes and libraries, and it supports jQuery. JavaScript is generated for the developer, enabling usage by those who are not able to write the code. The tool also leverages CSS3.

Eclipse JavaScript Development Tools

The Eclipse JSDT (JavaScript Development Tools) is a set of plug-ins for Eclipse aimed at supporting JavaScript application and Web development. JSDT adds a JavaScript project type to the Eclipse Workbench, along with views, editors, wizards, and builders. The tool set features JSDT Core, including a parser and compiler DOM; JSDT UI, with user interface code to create the JavaScript IDE; and JSDT Debug, for debugging JavaScript using Rhino and Crossfire. Also featured is JSDT Web, which supports client-side JavaScript implemented in the Eclipse Web Tools Platform’s Source Editing project.

Oracle NetBeans IDE

Oracle’s NetBeans IDE supports error-checking for JavaScript, CSS, and HTML, including JavaScript 1.7. The editor also recognizes JavaScript and HTML in XHTML, PHP, and JSP (JavaServer Pages) files. Browser compatibility is provided when developers specify browser types and versions in the JavaScript Options panel. The IDE features an AJAX-ready environment for choosing a server-side scripting language like PHP or Groovy, and integration is enabled for third-party JavaScript toolkits and Web frameworks. Code completion and integrated documentation for JavaScript toolkits like jQuery and Script.aculo.us are provided via the NetBeans JavaScript editor.

Drupal architecture – how to implement loosely coupled communication across modules

Drupal is a free and open source Content Management System which is used for building websites. It is based on the LAMP architecture, with PHP as the implementation language. You can read success stories on the website and on the personal blog of its author, Dries Buytaert. Most of Drupal’s success derives from its modular architecture, which simplifies development, collaboration and user contribution. Drupal’s history can be read here.

A bird’s-eye view of the Drupal architecture is sketched in the following diagram:

At the ground level there is the Drupal API. It implements the basic functionality of the module system and of the CMS. Physically it is made up of a folder called includes/ which contains a set of PHP files (named with the .inc extension). Each of these PHP files implements an API that can be used by the modules in the upper levels.
The Core package contains all the Drupal modules that implement the CMS engine. In this package you can find modules for node management, blogging, commenting, forums, menus and so on. Here is a list of Drupal v7 core modules as they appear on the filesystem:

On top of the Core modules there are Community modules. These are modules contributed by the open source community which are not in the main distribution. For example, you can find modules for AdSense, Amazon integration, voting and many more (here you can find a complete list of community modules available).
Finally, there are User modules, which are custom private modules built by developers to meet a specific project’s needs. A typical website is deployed using the Drupal API and Core modules.

Drupal API

The following diagram shows how the Drupal API is structured (content of the /includes directory):

There is an entry point file called index.php, which is called on every Drupal request. Depending on the request, it includes several .inc files. Every file implements a particular feature: for example, database.inc exports the API to access the database, module.inc does the hidden work of the module system, theme.inc implements the theming subsystem, and so on. Every include exports a set of constants and functions in the form of:

<?php
// module sample.inc
define('SAMPLE_CONSTANT', 'sample value');
// ... more constants ...

function a_method_1($param1, $param2) {
	// code here ...
}

function a_method_2() {
	// code here ...
}

function a_method_3($input_param1, $input_param2, $input_param3) {
	// code here
}
// ... more functions ...

As an example this is the code extracted from includes/bootstrap.inc:

Here you can find a description on how drupal boots and processes each request.

Drupal Modules

A module is a set of PHP and configuration files deployed in a particular folder of the Drupal installation. These files follow a naming convention.

Each module has a

  • .info file (example: mymodule.info) which contains version and dependency information about the module
  • .module file (example: mymodule.module) which contains the PHP code implementing the module’s functionality. Generally a module uses the Drupal API
  • .api.php file (example: mymodule.api.php) which documents the hooks exposed by the module: the events in which other modules may be interested
  • .install file (example: mymodule.install) which contains code to execute when module is installed/uninstalled

One of the first things I asked myself is how Drupal implements loosely coupled communication between modules. At an abstract level it uses the Observer/Observable pattern, based principally on a naming convention. Every module emits a set of internal events which carry data. These events are called hooks. When a module is interested in an event, it implements a function whose name is the hook name prefixed with the module name, and the module subsystem calls that function automatically. The dynamics are sketched in the following diagram:

When Module_A wants to send an event to other interested modules, it invokes module_invoke_all() from the Drupal API (file module.inc). module_invoke_all() finds all modules implementing that particular hook and calls, for each one, a function named <modulename>_<hookname>(params).
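The dispatch-by-naming-convention idea can be sketched in a few lines. The example below is written in JavaScript to match the other examples in this collection, not in Drupal's PHP, and it is only an illustration of the pattern: the `modules` registry, the `invokeAll` function, and the module and hook names are all hypothetical.

```javascript
// Registry of enabled modules. Each module is a plain object whose functions
// follow the <modulename>_<hookname> naming convention described above.
const modules = {
  blog: {
    blog_node_saved(node) {
      return 'blog indexed "' + node.title + '"';
    },
  },
  search: {
    search_node_saved(node) {
      return 'search reindexed node ' + node.id;
    },
    search_cron() {
      return 'search ran cron';
    },
  },
  forum: {
    // Implements no node_saved hook, so it is skipped for that event.
    forum_cron() {
      return 'forum ran cron';
    },
  },
};

// Call <modulename>_<hookname>(...args) on every module that defines it
// and collect the return values.
function invokeAll(hookName, ...args) {
  const results = [];
  for (const [moduleName, mod] of Object.entries(modules)) {
    const fn = mod[moduleName + '_' + hookName];
    if (typeof fn === 'function') {
      results.push(fn(...args));
    }
  }
  return results;
}

// The sender knows only the hook name, never the receiving modules.
const saved = invokeAll('node_saved', { id: 42, title: 'Hello' });
const cron = invokeAll('cron');
```

The sender and the receivers never reference each other directly; the only contract between them is the function name, which is what keeps the modules loosely coupled.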

Difference between AngularJS and jQuery

AngularJS and jQuery are both JavaScript tools, but they work very differently, so avoid mixing AngularJS and jQuery code in your project. Use only one of them at a time. If you are starting a new project, consider AngularJS over jQuery. If you are an experienced jQuery developer, you will have to invest some time in learning to work the AngularJS way. There are many differences between AngularJS and jQuery.

1. Web designing approach in jQuery and AngularJS

In jQuery, you design a page, and then you make it dynamic. This is because jQuery was designed for augmentation and has grown incredibly from that simple premise.

But in AngularJS, you must start from the ground up with your architecture in mind. Instead of starting by thinking “I have this piece of the DOM and I want to make it do X”, you have to start with what you want to accomplish, then go about designing your application, and then finally go about designing your view.

2. Don’t augment jQuery with AngularJS

Similarly, don’t start with the idea that jQuery does X, Y, and Z, so I’ll just add AngularJS on top of that for models and controllers. This is really tempting when you’re just starting out, which is why I always recommend that new AngularJS developers don’t use jQuery at all, at least until they get used to doing things the “Angular Way”.

I’ve seen many developers here and on the mailing list create these elaborate solutions with jQuery plugins of 150 or 200 lines of code that they then glue into AngularJS with a collection of callbacks and $applys that are confusing and convoluted; but they eventually get it working! The problem is that in most cases that jQuery plugin could be rewritten in AngularJS in a fraction of the code, where suddenly everything becomes comprehensible and straightforward.

The bottom line is this: when solutioning, first “think in AngularJS”; if you can’t think of a solution, ask the community; if after all of that there is no easy solution, then feel free to reach for the jQuery. But don’t let jQuery become a crutch or you’ll never master AngularJS.

3. Always think in terms of architecture

First know that single-page applications are applications. They’re not webpages. So we need to think like a server-side developer in addition to thinking like a client-side developer. We have to think about how to divide our application into individual, extensible, testable components.

So then how do you do that? How do you “think in AngularJS”? Here are some general principles, contrasted with jQuery.

The view is the “official record”

In jQuery, we programmatically change the view. We could have a dropdown menu defined as a ul like so:

<ul class="main-menu">
    <li class="active">
        <a href="#/home">Home</a>
    </li>
    <li>
        <a href="#/menu1">Menu 1</a>
        <ul>
            <li><a href="#/sm1">Submenu 1</a></li>
            <li><a href="#/sm2">Submenu 2</a></li>
            <li><a href="#/sm3">Submenu 3</a></li>
        </ul>
    </li>
    <li>
        <a href="#/home">Menu 2</a>
    </li>
</ul>

In jQuery, in our application logic, we would activate it with something like:

$('.main-menu').dropdownMenu();

When we just look at the view, it’s not immediately obvious that there is any functionality here. For small applications, that’s fine. But for non-trivial applications, things quickly get confusing and hard to maintain.

In AngularJS, though, the view is the official record of view-based functionality. Our ul declaration would look like this instead:

<ul class="main-menu" dropdown-menu>
    …
</ul>

These two do the same thing, but in the AngularJS version anyone looking at the template knows what’s supposed to happen. Whenever a new member of the development team comes on board, he can look at this and then know that there is a directive called dropdownMenu operating on it; he doesn’t need to intuit the right answer or sift through any code. The view told us what was supposed to happen. Much cleaner.

Developers new to AngularJS often ask a question like: how do I find all links of a specific kind and add a directive onto them. The developer is always flabbergasted when we reply: you don’t. But the reason you don’t do that is that this is like half-jQuery, half-AngularJS, and no good. The problem here is that the developer is trying to “do jQuery” in the context of AngularJS. That’s never going to work well. The view is the official record. Outside of a directive (more on this below), you never, ever, never change the DOM. And directives are applied in the view, so intent is clear.

Remember: don’t design, and then mark up. You must architect, and then design.

Data binding

This is by far one of the most awesome features of AngularJS and cuts out a lot of the need to do the kinds of DOM manipulations I mentioned in the previous section. AngularJS will automatically update your view so you don’t have to! In jQuery, we respond to events and then update content. Something like:

$.ajax({
  url: '/myEndpoint.json',
  success: function ( data, status ) {
    $('ul#log').append('<li>Data Received!</li>');
  }
});

For a view that looks like this:

<ul class="messages" id="log">
</ul>

Apart from mixing concerns, we also have the same problems of signifying intent that I mentioned before. But more importantly, we had to manually reference and update a DOM node. And if we want to delete a log entry, we have to code against the DOM for that too. How do we test the logic apart from the DOM? And what if we want to change the presentation?

This is a little messy and a trifle frail. But in AngularJS, we can do this:

// $http.get() issues the GET request and returns a promise
$http.get( '/myEndpoint.json' ).then( function ( response ) {
    $scope.log.push( { msg: 'Data Received!' } );
});

And our view can look like this:

<ul class="messages">
    <li ng-repeat="entry in log">{{ entry.msg }}</li>
</ul>

But for that matter, our view could look like this:

<div class="messages">
    <div class="alert" ng-repeat="entry in log">
        {{ entry.msg }}
    </div>
</div>

And now instead of using an unordered list, we’re using Bootstrap alert boxes. And we never had to change the controller code! But more importantly, no matter where or how the log gets updated, the view will change too. Automatically. Neat!

Though I didn’t show it here, the data binding is two-way. So those log messages could also be editable in the view just by doing this: 

<input ng-model="entry.msg" />

And there was much rejoicing.

Distinct model layer

In jQuery, the DOM is kind of like the model. But in AngularJS, we have a separate model layer that we can manage in any way we want, completely independently from the view. This helps for the above data binding, maintains separation of concerns, and introduces far greater testability. Other answers mentioned this point, so I’ll just leave it at that.
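A minimal sketch can make the separate model layer concrete. The code below is plain JavaScript, not AngularJS: it does not use `$scope`, `ng-repeat`, or the digest cycle, and `createLog`, `renderList` and `renderAlerts` are hypothetical names invented for illustration. It only shows that when the model holds no DOM references, the logic can be exercised without a browser and the presentation can be swapped freely.

```javascript
// The model: plain data plus change notification. No DOM access here.
function createLog() {
  const entries = [];
  const listeners = [];
  const notify = () => listeners.forEach((listener) => listener(entries));
  return {
    entries,
    add(msg) {
      entries.push({ msg });
      notify();
    },
    remove(index) {
      entries.splice(index, 1);
      notify();
    },
    onChange(listener) {
      listeners.push(listener);
    },
  };
}

// Two interchangeable views: pure functions from model data to markup.
function renderList(entries) {
  return '<ul>' + entries.map((e) => '<li>' + e.msg + '</li>').join('') + '</ul>';
}

function renderAlerts(entries) {
  return entries.map((e) => '<div class="alert">' + e.msg + '</div>').join('');
}

// "Binding": re-render whenever the model changes, wherever the change came from.
const log = createLog();
let listHtml = renderList(log.entries);
let alertHtml = renderAlerts(log.entries);
log.onChange((entries) => {
  listHtml = renderList(entries);
  alertHtml = renderAlerts(entries);
});

log.add('Data Received!');
log.add('Second message');
log.remove(0);
```

Adding or deleting a log entry touches only the model, and both presentations stay in sync without any code that references a DOM node.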

Separation of concerns

And all of the above tie into this over-arching theme: keep your concerns separate. Your view acts as the official record of what is supposed to happen (for the most part); your model represents your data; you have a service layer to perform reusable tasks; you do DOM manipulation and augment your view with directives; and you glue it all together with controllers. This was also mentioned in other answers, and the only thing I would add pertains to testability, which I discuss in another section below.

Dependency injection

To help us out with separation of concerns is dependency injection (DI). If you come from a server-side language (from Java to PHP) you’re probably familiar with this concept already, but if you’re a client-side guy coming from jQuery, this concept can seem anything from silly to superfluous to hipster. But it’s not. 

From a broad perspective, DI means that you can declare components very freely and then from any other component, just ask for an instance of it and it will be granted. You don’t have to know about loading order, or file locations, or anything like that. The power may not immediately be visible, but I’ll provide just one (common) example: testing.

Let’s say in our application, we require a service that implements server-side storage through a REST API and, depending on application state, local storage as well. When running tests on our controllers, we don’t want to have to communicate with the server – we’re testing the controller, after all. We can just add a mock service of the same name as our original component, and the injector will ensure that our controller gets the fake one automatically – our controller doesn’t and needn’t know the difference.
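The mock-swapping scenario can be sketched with a toy injector. This is not AngularJS's injector or its API; `createInjector`, `register`, `get`, `notesController` and `storage` are hypothetical names used only to show the mechanism, under the assumption that components are looked up by name and created lazily.

```javascript
// A toy injector: components are registered by name with a list of the
// names they depend on, and instantiated lazily on first request.
function createInjector() {
  const factories = {};
  const instances = {};
  const api = {
    register(name, deps, factory) {
      factories[name] = { deps, factory };
      delete instances[name]; // a re-registration replaces any cached instance
    },
    get(name) {
      if (!(name in instances)) {
        const entry = factories[name];
        if (!entry) throw new Error('Unknown component: ' + name);
        const args = entry.deps.map((dep) => api.get(dep));
        instances[name] = entry.factory(...args);
      }
      return instances[name];
    },
  };
  return api;
}

const injector = createInjector();

// Declaration order does not matter: the controller is registered before
// the service it depends on.
injector.register('notesController', ['storage'], function (storage) {
  return {
    addNote(text) {
      return storage.save({ text });
    },
  };
});

// The "real" service would talk to the server.
injector.register('storage', [], function () {
  return {
    save() {
      throw new Error('would call the REST API');
    },
  };
});

// In a test, register a mock under the same name. The controller code is
// unchanged and never learns the difference.
const savedNotes = [];
injector.register('storage', [], function () {
  return {
    save(note) {
      savedNotes.push(note);
      return savedNotes.length;
    },
  };
});

const controller = injector.get('notesController');
const count = controller.addNote('buy milk');
```

Because the controller only asks for something named `storage`, the test can exercise its logic without any network access.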

4. Test-driven development 

This is really part of section 3 on architecture, but it’s so important that I’m putting it as its own top-level section.

Out of all of the many jQuery plugins you’ve seen, used, or written, how many of them had an accompanying test suite? Not very many because jQuery isn’t very amenable to that. But AngularJS is.

In jQuery, the only way to test is often to create the component independently with a sample/demo page against which our tests can perform DOM manipulation. So then we have to develop a component separately and then integrate it into our application. How inconvenient! So much of the time, when developing with jQuery, we opt for iterative instead of test-driven development. And who could blame us?

But because we have separation of concerns, we can do test-driven development iteratively in AngularJS! For example, let’s say we want a super-simple directive to indicate in our menu what our current route is. We can declare what we want in our view:

<a href="/hello" when-active>Hello</a>

Okay, now we can write a test:

it( 'should add "active" when the route changes', inject(function() {
    var elm = $compile( '<a href="/hello" when-active>Hello</a>' )( $scope );

    $location.path('/not-matching');
    expect( elm.hasClass('active') ).toBeFalsy();

    $location.path( '/hello' );
    expect( elm.hasClass('active') ).toBeTruthy();
}));

We run our test and confirm that it fails. So now we can write our directive:

.directive( 'whenActive', function ( $location ) {
    return {
        scope: true,
        link: function ( scope, element, attrs ) {
            scope.$on( '$routeChangeSuccess', function () {
                if ( $location.path() == element.attr( 'href' ) ) {
                    element.addClass( 'active' );
                }
                else {
                    element.removeClass( 'active' );
                }
            });
        }
    };
});

Our test now passes and our menu performs as requested. Our development is both iterative and test-driven. 

5. Conceptually, directives are not packaged jQuery

You’ll often hear “only do DOM manipulation in a directive”. This is a necessity. Treat it with due deference!

But let’s dive a little deeper…

Some directives just decorate what’s already in the view (think ngClass) and therefore sometimes do DOM manipulation straight away and then are basically done. But if a directive is like a “widget” and has a template, it should also respect separation of concerns. That is, the template too should remain largely independent from its implementation in the link and controller functions.

AngularJS comes with an entire set of tools to make this very easy; with ngClass we can dynamically update the class; ngModel allows two-way data binding; ngShow and ngHide programmatically show or hide an element; and many more, including the ones we write ourselves. In other words, we can do all kinds of awesomeness without DOM manipulation. The less DOM manipulation, the easier directives are to test, the easier they are to style, the easier they are to change in the future, and the more re-usable and distributable they are.

I see lots of developers new to AngularJS using directives as the place to throw a bunch of jQuery. In other words, they think “since I can’t do DOM manipulation in the controller, I’ll take that code and put it in a directive”. While that certainly is much better, it’s often still wrong.

Think of the logger we programmed in section 3. Even if we put that in a directive, we still want to do it the “Angular Way”. It still doesn’t take any DOM manipulation! There are lots of times when DOM manipulation is necessary, but it’s a lot rarer than you think! Before doing DOM manipulation anywhere in your application, ask yourself if you really need to. There might be a better way.

Here’s a quick example that shows the pattern I see most frequently. We want a toggleable button. (Note: this example is a little contrived and a skosh verbose to represent more complicated cases that are solved in exactly the same way.)

.directive( 'myDirective', function () {
    return {
        template: '<a class="btn">Toggle me!</a>',
        link: function ( scope, element, attrs ) {
            var on = false;

            $(element).click( function () {
                if ( on ) {
                    $(element).removeClass( 'active' );
                }
                else {
                    $(element).addClass( 'active' );
                }

                on = !on;
            });
        }
    };
});

There are a few things wrong with this. First, jQuery was never necessary. There’s nothing we did here that needed jQuery at all! Second, even if we already have jQuery on our page, there’s no reason to use it here; we can simply use angular.element and our component will still work when dropped into a project that doesn’t have jQuery. Third, even assuming jQuery was required for this directive to work, jqLite (angular.element) will always use jQuery if it was loaded! So we needn’t use the $ – we can just use angular.element. Fourth, closely related to the third, is that jqLite elements needn’t be wrapped in $ – the element that is passed to the link function would already be a jQuery element! And fifth, which we’ve mentioned in previous sections, why are we mixing template stuff into our logic?

This directive can be rewritten (even for very complicated cases!) much more simply like so:

.directive( 'myDirective', function () {
    return {
        scope: true,
        template: '<a class="btn" ng-class="{active: on}" ng-click="toggle()">Toggle me!</a>',
        link: function ( scope, element, attrs ) {
            scope.on = false;

            scope.toggle = function () {
                scope.on = !scope.on;
            };
        }
    };
});

Again, the template stuff is in the template, so you (or your users) can easily swap it out for one that meets any style necessary, and the logic never had to be touched. 

And there are still all those other benefits, like testing – it’s easy! No matter what’s in the template, the directive’s internal API is never touched, so refactoring is easy. You can change the template as much as you want without touching the directive. And no matter what you change, your tests still pass.

So if directives aren’t just collections of jQuery-like functions, what are they? Directives are actually extensions of HTML. If HTML doesn’t do something you need it to do, you write a directive to do it for you, and then use it just as if it was part of HTML.

Put another way, if AngularJS doesn’t do something out of the box, think how the team would accomplish it to fit right in with ngClick, ngClass, et al.

Summary

Don’t even use jQuery. Don’t even include it. It will hold you back. And when you come to a problem that you think you know how to solve in jQuery already, before you reach for the $, try to think about how to do it within the confines of AngularJS. If you don’t know, ask! 19 times out of 20, the best way to do it doesn’t need jQuery, and trying to solve it with jQuery results in more work for you.

Federated Identities: OpenID vs SAML vs OAuth

Single sign-on (SSO) started it all. Organizations needed a way to unify authentication systems in the enterprise for easier management and better security. Single sign-on was widely adopted and provided a solution for keeping one repository of usernames and passwords that could be used transparently across several internal applications.

Service-oriented software kicked off the next wave of change. Organizations wanted to open APIs in their software so partners and independent developers could use them. Managing authentication and authorization for entities looking to consume these APIs was obviously a challenge. 

Social media moved things even further. Various platforms spread far and wide on a plethora of devices, and many applications were built on top of those platforms. Now we have countless apps and services hooked into Twitter, Facebook, and LinkedIn. 

The problem? How to bring together user login information across many applications and platforms to simplify sign-on and increase security. The solution? Federated identities . . .

What is federated identity?

Federated identity means linking and using the electronic identities a user has across several identity management systems. In simpler terms, an application does not necessarily need to obtain and store users’ credentials in order to authenticate them. Instead, the application can use an identity management system that is already storing a user’s electronic identity to authenticate the user—given, of course, that the application trusts that identity management system. 

This approach allows the decoupling of the authentication and authorization functions. It also makes it easier to centralize these two functions in the enterprise to avoid a situation where every application has to manage a set of credentials for every user. It is also very convenient for users, since they don’t have to keep a set of usernames and passwords for every single application that they use. 

There are three major protocols for federated identity: OpenID, SAML, and OAuth.

OpenID:

OpenID is an open standard sponsored by Facebook, Microsoft, Google, PayPal, Ping Identity, Symantec, and Yahoo. OpenID allows users to be authenticated by third-party services called identity providers. Users can choose their preferred OpenID provider to log in to websites that accept the OpenID authentication scheme.

The OpenID specification defines three roles:

  • The end user or the entity that is looking to verify its identity
  • The relying party (RP), which is the entity looking to verify the identity of the end user
  • The OpenID provider (OP), which is the entity that registers the OpenID URL and can verify the end user’s identity


The following diagram explains a use case for an OpenID scenario: 

[Diagram: OpenID use case]
Security Considerations
OpenID had a few interesting vulnerabilities in the past, for example:

  • Phishing Attacks: Since the relying party controls the redirect to the OpenID provider for authentication, a rogue relying party can forward the user to a bogus OpenID provider and collect the user’s credentials for the legitimate OpenID provider.
  • Authentication Flaws: In March 2012, three researchers presented a paper that highlighted two vulnerabilities in OpenID. Both vulnerabilities allow an attacker to impersonate any user to a website if the website doesn’t properly check whether the response from the OpenID provider contains a properly signed email address.

SAML:

Security Assertion Markup Language (SAML) is a product of the OASIS Security Services Technical Committee. Dating from 2001, SAML is an XML-based open standard for exchanging authentication and authorization data between parties. 

The SAML specification defines three roles:

  • The principal, which is typically the user looking to verify his or her identity
  • The identity provider (IdP), which is the entity that is capable of verifying the identity of the end user
  • The service provider (SP), which is the entity looking to use the identity provider to verify the identity of the end user

The following diagram explains a use case for a SAML scenario: 

[Diagram: SAML use case]
Security Considerations

OAuth:

OAuth is another open standard. Dating back to 2006, OAuth is different than OpenID and SAML in being exclusively for authorization purposes and not for authentication purposes. 

The OAuth specifications define the following roles:

  • The end user or the entity that owns the resource in question
  • The resource server (OAuth Provider), which is the entity hosting the resource
  • The client (OAuth Consumer), which is the entity that is looking to consume the resource after getting authorization from the end user

The following diagram explains a use case for an OAuth scenario: 
[Diagram: OAuth use case]
Security Considerations

  • A session fixation flaw was found in OAuth 1.0. An attacker can fix a token for the victim, wait for the victim to authorize it, and then use the fixated token.
  • OAuth 2.0 was described as an inherently insecure protocol since it does not support signature, encryption, channel binding, or client verification. The protocol relies entirely on the underlying transport layer security (for example, SSL/TLS) to provide confidentiality and integrity.


This table explains the major differences between the three protocols:

 

                      OpenID                         OAuth                                    SAML
Dates from            2005                           2006                                     2001
Current version       OpenID 2.0                     OAuth 2.0                                SAML 2.0
Main purpose          Single sign-on for consumers   API authorization between applications   Single sign-on for enterprise users
Protocols used        XRDS, HTTP                     JSON, HTTP                               SAML, XML, HTTP, SOAP
No. of related CVEs   24                             3                                        17

Other protocols

There is a growing number of other federated identity options. Here are a few examples. 
Higgins: Higgins is a new open source protocol that allows users to control which identity information is released to an enterprise. 

Windows CardSpace: CardSpace is Microsoft’s identity metasystem, which provides interoperability between identity providers and relying parties with the user in control. The protocol has been retired, though, and Microsoft is working on a replacement called U-Prove. 

MicroID: MicroID is a new identity layer for the web and microformats that allows anyone to claim verifiable ownership of their own pages and content hosted anywhere. 

Liberty Alliance: Liberty Alliance is a large commercially oriented protocol providing inter-enterprise identity trust. It is the largest existing identity trust protocol deployed around the world.

Conclusion:

In a world with increased interconnectivity between hybrid systems, protocols, and devices, federated identity seems to be here to stay. Although federated identity is much more convenient for users who don’t have to remember so many different usernames and passwords, it comes with a security price. However, proper implementation of OAuth, SAML, OpenID, or any other federated identity protocol adds convenience without extra threat surface.

10 Programming Languages You Should Learn in 2014

The tech sector is booming. If you’ve used a smartphone or logged on to a computer at least once in the last few years, you’ve probably noticed this.

As a result, coding skills are in high demand, with programming jobs paying significantly more than the average position. Even beyond the tech world, an understanding of at least one programming language makes an impressive addition to any resumé.

The in-vogue languages vary by employment sector. Financial and enterprise systems need to perform complicated functions and remain highly organized, requiring languages like Java and C#. Media- and design-related webpages and software will require dynamic, versatile and functional languages with minimal code, such as Ruby, PHP, JavaScript and Objective-C.

With some help from Lynda.com, we’ve compiled a list of 10 of the most sought-after programming languages to get you up to speed.

1. Java

What it is: Java is a class-based, object-oriented programming language developed by Sun Microsystems in the 1990s. It’s one of the most in-demand programming languages, a standard for enterprise software, web-based content, games and mobile apps, as well as the Android operating system. Java is designed to work across multiple software platforms, meaning a program written on Mac OS X, for example, could also run on Windows.

Where to learn it: Udemy, Lynda.com, Oracle.com, LearnJavaOnline.org.

2. C Language

What it is: A general-purpose, imperative programming language developed in the early ’70s, C is one of the oldest and most widely used languages, providing the building blocks for other popular languages, such as C#, Java, JavaScript and Python. C is mostly used for implementing operating systems and embedded applications.

Because it provides the foundation for many other languages, it is advisable to learn C (and C++) before moving on to others.

Where to learn it: Learn-C, Introduction To Programming, Lynda.com, CProgramming.com, Learn C The Hard Way.

3. C++

What it is: C++ is an intermediate-level language with object-oriented programming features, originally designed to enhance the C language. C++ powers major software like Firefox, Winamp and Adobe programs. It’s used to develop systems software, application software, high-performance server and client applications and video games.

Where to learn it: Udemy, Lynda.com, CPlusPlus.com, LearnCpp.com, CProgramming.com.

4. C#

What it is: Pronounced “C-sharp,” C# is a multi-paradigm language developed by Microsoft as part of its .NET initiative. Combining principles from C and C++, C# is a general-purpose language used to develop software for Microsoft and Windows platforms.

Where to learn it: Udemy, Lynda.com, Microsoft Virtual Academy, TutorialsPoint.com.

5. Objective-C

What it is: Objective-C is a general-purpose, object-oriented programming language used by Apple’s operating systems. It powers Apple’s OS X and iOS, as well as its APIs, and can be used to create iPhone apps, which has generated a huge demand for this once-outmoded programming language.

Where to learn it: Udemy, Lynda.com, Mac Developer Library, Cocoa Dev Central, Mobile Tuts+.

6. PHP

What it is: PHP (Hypertext Preprocessor) is a free, server-side scripting language designed for dynamic websites and app development. It can be directly embedded into an HTML source document rather than an external file, which has made it a popular programming language for web developers. PHP powers more than 200 million websites, including WordPress, Digg and Facebook.

Where to learn it: Udemy, Codecademy, Lynda.com, Treehouse, Zend Developer Zone, PHP.net.

7. Python

What it is: Python is a high-level, server-side scripting language for websites and mobile apps. It’s considered a fairly easy language for beginners due to its readability and compact syntax, meaning developers can use fewer lines of code to express a concept than they would in other languages. It powers the web apps for Instagram, Pinterest and Rdio through its associated web framework, Django, and is used by Google, Yahoo! and NASA.

Where to learn it: Udemy, Codecademy, Lynda.com, LearnPython.org, Python.org.

8. Ruby

What it is: A dynamic, object-oriented scripting language for developing websites and mobile apps, Ruby was designed to be simple and easy to write. It powers the Ruby on Rails (or Rails) framework, which is used on Scribd, GitHub, Groupon and Shopify. Like Python, Ruby is considered a fairly user-friendly language for beginners.

Where to learn it: Codecademy, Code School, TryRuby.org, RubyMonk.

9. JavaScript

What it is: JavaScript is a client and server-side scripting language developed by Netscape that derives much of its syntax from C. It can be used across multiple web browsers and is considered essential for developing interactive or animated web functions. It is also used in game development and writing desktop applications. JavaScript interpreters are embedded in Google’s Chrome extensions, Apple’s Safari extensions, Adobe Acrobat and Reader, and Adobe’s Creative Suite.

Where to learn it: Codecademy, Lynda.com, Code School, Treehouse, Learn-JS.org.

10. SQL

What it is: Structured Query Language (SQL) is a special-purpose language for managing data in relational database management systems. It is most commonly used for its “Query” function, which searches informational databases. SQL was standardized by the American National Standards Institute (ANSI) and the International Organization for Standardization (ISO) in the 1980s.

 

How Streaming Video and Audio Work

Streaming Servers

If you work in an office that shares files over a network, you might think of a server as a computer that holds lots of data. But when it comes to streaming video and audio, a server is more than just a massive hard drive. It’s also the software that delivers data to your computer. Some streaming servers can handle multiple file types, but others work only with specific formats. For example, Apple QuickTime Streaming Server can stream QuickTime files but not Windows Media files.

Streaming servers typically deliver files to you with a little help from a Web server. First, you go to a Web page, which is stored on the Web server. When you click the file you want to use, the Web server sends a message to the streaming server, telling it which file you want. The streaming server sends the file directly to you, bypassing the Web server.

All of this data gets to where it needs to go because of sets of rules known as protocols, which govern the way data travels from one device to another. You’ve probably heard of one protocol — hypertext transfer protocol (HTTP) deals with hypertext documents, or Web pages. Every time you surf the Web, you’re using HTTP.

Many protocols, such as transmission control protocol (TCP) and file transfer protocol (FTP), break data into packets. These protocols can re-send lost or damaged packets, and they allow randomly ordered packets to be reassembled later. This is convenient for downloading files and surfing the Web — if Web traffic slows down or some of your packets disappear, you’ll still get your file. But these protocols won’t work as well for streaming media. With streaming media, data needs to arrive quickly and with all the pieces in the right order.
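To make the idea of reassembling randomly ordered packets concrete, here is a minimal sketch in Ruby. It is only an illustration: the Packet structure, the reassemble helper and the use of a simple sequence number are assumptions made for this example, not how any particular protocol stack is implemented.

```ruby
# Hypothetical packet: a sequence number plus a chunk of the payload.
Packet = Struct.new(:seq, :payload)

# Hypothetical helper: puts packets back in order and reports gaps.
# Returns [data, missing]; data is nil while any packet is still missing,
# which is the point where a protocol like TCP would ask for a re-send.
def reassemble(packets, expected_count)
  by_seq = packets.to_h { |packet| [packet.seq, packet.payload] }
  missing = (0...expected_count).reject { |seq| by_seq.key?(seq) }
  return [nil, missing] unless missing.empty?

  [(0...expected_count).map { |seq| by_seq[seq] }.join, missing]
end

# Packets arrive out of order but can still be reassembled.
arrived = [Packet.new(2, 'lo'), Packet.new(0, 'he'), Packet.new(1, 'l')]
data, missing = reassemble(arrived, 3)
puts data

# With a packet lost, nothing can be played until it is re-sent.
partial_data, partial_missing = reassemble(arrived.first(2), 3)
puts "Still waiting for packets: #{partial_missing.inspect}"
```

Waiting for a lost packet is harmless for a download, but for a stream it means the picture or sound stalls.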

Too many outgoing streams can overload a server, causing users to see an error message.

For this reason, streaming video and audio use protocols that allow the transfer of data in real time. They break files into very small pieces and send them to a specific location in a specific order. These protocols include:

  • Real-time transport protocol (RTP)
  • Real-time streaming protocol (RTSP)
  • Real-time transport control protocol (RTCP)

These protocols act like an added layer to the protocols that govern Web traffic. So when the real-time protocols are streaming the data where it needs to go, the other Web protocols are still working in the background. These protocols also work together to balance the load on the server. If too many people try to access a file at the same time, the server can delay the start of some streams until others have finished.

Scraping Data from HTML

Web-scraping is essentially the task of finding out what input a website expects and understanding the format of its response. For example, Recovery.gov takes a user’s zip code as input before returning a page showing federal stimulus contracts and grants in the area.

This tutorial will teach you how to identify the inputs for a website and how to design a program that automatically sends requests and downloads the resulting web pages.

Pfizer disclosed its doctor payments in March as part of a $2.3 billion settlement – the largest health care fraud settlement in U.S. history – of allegations that it illegally promoted its drugs for unapproved uses.

Of the disclosing companies so far, Pfizer’s disclosures are the most detailed and its site is well-designed for users looking up individual doctors. However, its doctor list is not downloadable, or easily aggregated.

So we will write a scraper to download Pfizer’s list and record the data in spreadsheet form. The example code is written in Ruby and uses the Nokogiri parsing library.

You may also find Firefox’s Firebug plugin useful for inspecting the source HTML. Safari and Chrome have similar tools built in.

Scouting the Parameters

Pfizer’s website presents us with a list of health care providers – but not all of them at once. There are links to lists that show payees by the first letter of their names, with links to jump deeper into the lists, as each page only shows 10 entries at a time.

Other than the actual list entries themselves, Pfizer’s website looks pretty much the same no matter where in the list you are. It’s safe to assume that there’s one template with a slot open for the varied list results.

How does the site know what part of the list to serve up? The answer is in your browser URL bar. Clicking on the link for ‘D’ gives us this address:

http://www.pfizer.com/responsibility/working_with_hcp/payments_report.jsp?enPdNm=D

For the letter ‘E’, the address is this:

http://www.pfizer.com/responsibility/working_with_hcp/payments_report.jsp?enPdNm=E

So the only difference in the address is the letters themselves. Instead of physically clicking the links for each letter, you can just change them in the address bar.

If you click through other pages of the list, you’ll see that the one constant is this:

http://www.pfizer.com/responsibility/working_with_hcp/payments_report.jsp

With this as the base address, what follows the question mark are the parameters that tell Pfizer’s website what results to show. For the list of alphabetized names, the name of the parameter is enPdNm.

So just as it’s possible to skip the link-clicking part to navigate the list, it’s possible to automate the changing of these parameters instead of manually typing them.
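As a rough sketch of what automating the parameters could look like, the snippet below builds the address for every letter without any typing. The report_url helper and the constant name are made up for this illustration; only the base address and the enPdNm parameter come from the site as described above.

```ruby
require 'uri'

# The base address described above; everything after '?' is the query string.
BASE_REPORT_URL = 'http://www.pfizer.com/responsibility/working_with_hcp/payments_report.jsp'

# Hypothetical helper: builds a full address from a hash of parameters.
def report_url(params)
  "#{BASE_REPORT_URL}?#{URI.encode_www_form(params)}"
end

# One address per letter of the alphabet, using the enPdNm parameter.
letter_urls = ('A'..'Z').map { |letter| report_url('enPdNm' => letter) }
puts letter_urls.first(2)
```

Nothing is downloaded here; the script only produces the addresses that a later step could request.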

How long is this list anyway?

The Pfizer website gives us the option of paging through their entire list, from start to finish, 10 entries at a time. Clicking on the “Last” link reveals a couple of things about the website:

  • There are 4,850 entries total.
  • There are 485 pages total. You can see that even without dividing 10 into 4,850 by looking at the page’s url. There’s a parameter there called iPageNo and it’s set to 485. Changing this number allows you to jump to any of the numbered pages of the list.
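The page count can also be checked with a line of arithmetic. This is a small sketch using the numbers quoted above; the constant names are invented for the example, and the address format assumes the iPageNo parameter behaves as just described.

```ruby
TOTAL_ENTRIES = 4850
ENTRIES_PER_PAGE = 10

# Integer division rounded up, so a partly filled last page still counts.
last_page = (TOTAL_ENTRIES + ENTRIES_PER_PAGE - 1) / ENTRIES_PER_PAGE
puts "Pages to fetch: #{last_page}"

# The address of any numbered page, assuming iPageNo selects the page.
LIST_URL = 'http://www.pfizer.com/responsibility/working_with_hcp/payments_report.jsp?enPdNm=All&iPageNo='
page_urls = (1..last_page).map { |number| "#{LIST_URL}#{number}" }
puts page_urls.last
```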

Reading the site URL in the browser’s address bar.

Data Structure

Upon closer inspection of the page, you’ll notice that the names in the left-most column don’t refer to health care providers, but to the person or entity to whom the check was made out. In most cases, it’s the name of the doctor; in others, it’s the name of the clinic, company, or university represented by the doctor.

The right column lists the actual doctor and provides a link to a page that includes the breakdown of payments for that doctor.

In this list view, however, the payment details are not always the same as what’s shown on each doctor’s page. Click on a doctor’s name to visit his/her page. In your browser’s address bar, you’ll notice that the website incorporates a new parameter, called “hcpdisplayName”. Changing its value gets you to another doctor’s page. Caveat: after scraping the site, you’ll see that this page does not always include just one doctor. See the caveats at the end of this section.

The Scraping Strategy

I’ll describe a reasonably thorough if not quick scraping method. Without knowing beforehand the details of how Pfizer has structured their data, we can at least assume from scouting out their site’s navigation that:

  • Every entity is listed once in the 485 pages.
  • Every doctor who’s worked with Pfizer in a disclosed capacity is connected to at least one of these entities.

So, our scraping script will take the following steps:

  1. Using the parameter iPageNo (with enPdNm set to All), iterate through pages 1 to 485 and collect the doctors’ names and page links.
  2. Using the links collected in step 1, visit each doctor’s page – using the parameter hcpdisplayName – and copy the payment details.

Downloading the Lists

We can write a script to help us save the list pages to our hard drive, which we’ll read later with another script.

	# Load Ruby's OpenURI module, which lets us open a webpage by its address
	require 'open-uri'

	# This is the basic address of Pfizer's all-inclusive list. Adding on the iPageNo parameter will get us from page to page.
	BASE_LIST_URL = 'http://www.pfizer.com/responsibility/working_with_hcp/payments_report.jsp?enPdNm=All&iPageNo='

	# We found this by looking at Pfizer's listing
	LAST_PAGE_NUMBER = 485

	# the subdirectory in which we'll save the list pages
	LIST_PAGES_SUBDIR = 'pfizer-list-pages'

	# create the subdirectory if it doesn't already exist
	Dir.mkdir(LIST_PAGES_SUBDIR) unless File.exist?(LIST_PAGES_SUBDIR)

	# So, from 1 to 485, we'll open the same address on Pfizer's site, but change the last number

	for page_number in 1..LAST_PAGE_NUMBER
		# URI.open is needed on Ruby 3.0 and later, where a plain open no longer fetches URLs
		page = URI.open("#{BASE_LIST_URL}#{page_number}")

		# create a new file into which we copy the webpage contents,
		# and then write the contents of the downloaded page (with the read method) to this
		# new file on our hard drive
		file = File.open("#{LIST_PAGES_SUBDIR}/pfizer-list-page-#{page_number}.html", 'w')

		# write to this new html file
		file.write(page.read)

		# close the file
		file.close

		# the previous three commands could be condensed to:
		# File.open("#{LIST_PAGES_SUBDIR}/pfizer-list-page-#{page_number}.html", 'w'){|f| f.write(page.read)}

		puts "Copied page #{page_number}"
		# wait 4 seconds before getting the next page, to not overburden the website.
		sleep 4
	end

And that’s it for collecting the webpages.

Parsing HTML with Ruby’s Nokogiri

A webpage is nothing more than text, and then more text describing how it’s all structured. To make your job much easier, you’ll want to use a programming library that can quickly sort through HTML’s structure.

Nokogiri is the recommended library for Ruby users. Here’s an example of it in use and how to install it. I describe its use for some basic parsing in my tutorial on reading data from Flash sites.

Data-heavy websites like Pfizer’s pull data from a database and then use a template to display that data. For our purposes, that means every page will have the same HTML structure. The particular datapoint you want will always be nested within the same tags.

Finding a webpage’s structure is best done through your browser’s web development tools, or plugins such as Firefox’s immensely useful Firebug. Right clicking on a doctor’s name (and link) in Google Chrome will bring up a pop-up menu.

Inspecting Pfizer page structure.

The doctor’s name is wrapped up in an <a> tag, which is what HTML uses to determine a link. The ‘href’ bit is the address of that doctor’s page.

So these <a> tags contain all the information we need right now. The problem is that there are non-doctor links on every page, too, and we don’t want those.

Here’s a quick, inelegant solution: Use Nokogiri to select all of the links and then filter out the ones we don’t want. Looking at our browser’s web inspector, we see that every doctor link has an href attribute that includes the parameter hcpdisplayName:

	
	http://www.pfizer.com/responsibility/working_with_hcp/payments_report.jsp?hcpdisplayName=SACKS,+GERALD+MICHAEL

So using the Nokogiri library (here’s a primer to its basic syntax), this is how you open a page:

	require 'rubygems'
	require 'nokogiri'
	require 'open-uri'

	url = 'http://www.pfizer.com/responsibility/working_with_hcp/payments_report.jsp?enPdNm=All&iPageNo=1'
	page = Nokogiri::HTML(URI.open(url))
	# page now contains the HTML that your browser would have read had you visited the url with your browser

The following code example assumes you downloaded the Pfizer list pages as described in the previous step to your hard drive into the specified directory.

The variable n_page contains the same thing that the open command gave us in the pre-step, but in a special Nokogiri construct.

Nokogiri’s library now lets us navigate the HTML in a more orderly fashion than if it were plain text. So, to grab all the links (<a> tags), we use Nokogiri’s css method, which returns an array of all those links:

	links = page.css('a')
	puts links.length  # this prints out '201'

We know there aren’t 201 doctor links on this page. So let’s filter this collection of links using Ruby’s map and select methods.

Remember that each “link” contains more than an address. Each member of the links array is a Nokogiri object that contains numerous properties. The only one we want is ‘href’. This can be done by calling Ruby’s map method, which will return an array of just what was in each corresponding link’s href attribute (i.e. the array will contain the same number of entries as before: 201).

	hrefs = links.map{ |link|   link['href'] }

The select method will keep members of a collection that meet a certain true/false test. We want that test to be: does the link’s href attribute contain the word hcpdisplayName?

	doc_hrefs = hrefs.select{ |href| 
		href.match('hcpdisplayName') != nil
	}

	doc_hrefs = doc_hrefs.uniq

The match method returns nil – i.e. Ruby’s term for ‘nothing/nada/zilch’ – if there was no match. So, we want all href’s that don’t return nil.

We finish by calling the uniq method, which returns only unique values. Remember that a doctor’s name can be repeated on any given list page, depending on the kind of payment records he/she is listed under. We only have to worry about finding the one link to his/her individual page.

You can narrow the number of links to consider by using Nokogiri’s css method to limit the links to just the fourth column in the table. If you view the page’s source (made easier by using Firebug), then you’ll see that the payments table has an id of “hcpPayments.” Use the CSS pseudo-class nth-child to specify the fourth column.

The full script to read our previously downloaded list pages and write the doctor page URLs to a text file is:

  require 'rubygems'
  require 'nokogiri'
  LIST_PAGES_SUBDIR = 'pfizer-list-pages'
  DOC_URL_FILE = "doc-pages-urls.txt"
  # Since a doctor's name could appear on different pages, we use a temporary array
  # to hold all the links as we iterate through each page and then call its uniq method 
  # to filter out repeated URLs before writing to file.
  all_links = []

  # This Dir.entries method will pick up the .html files we created in the last step
  Dir.entries(LIST_PAGES_SUBDIR).select{|p| p.match('pfizer-list-page')}.each do |filename|

    puts "Reading #{filename}"
    n_page = Nokogiri::HTML(open("#{LIST_PAGES_SUBDIR}/#{filename}"))
    all_links += n_page.css('table#hcpPayments td:nth-child(4) a').map{|link| link['href'] }.select{|href| href.match('hcpdisplayName') != nil}
  end

  File.open(DOC_URL_FILE, 'w'){|f| f.puts all_links.uniq}

If you open the doc-pages-urls.txt that you just created, you should have a list of more than 4,500 (relative) urls. The next step will be to request each of these pages and save them to your hard drive, which is what we did earlier, just with a lot more pages.

For beginning coders: There’s a couple of new things here. We call the match method using a regular expression to match the doctor’s name that’s used as the parameter. If you’ve never used regular expressions, we highly recommend learning them, as they’re extremely useful for both programming and non-programming tasks (such as searching documents or cleaning data).
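For a feel of what the regular expression does before running it against thousands of pages, here is a small stand-alone sketch. The relative address is modeled on the doctor link shown earlier and is only an assumed example of a line in doc-pages-urls.txt; decoding the name is an optional extra step that is not part of the script that follows.

```ruby
require 'uri'

# An assumed example of one (relative) doctor-page address.
href = 'payments_report.jsp?hcpdisplayName=SACKS,+GERALD+MICHAEL'

# The parentheses capture everything after "hcpdisplayName=".
match = href.match(/hcpdisplayName=(.+)/)
raw_name = match[1]
puts raw_name

# Optional: turn the "+" signs used in URLs back into spaces.
readable_name = URI.decode_www_form_component(raw_name)
puts readable_name

# A link without the parameter gives nil instead of a match.
no_match = 'payments_report.jsp?enPdNm=D'.match(/hcpdisplayName=(.+)/)
puts no_match.inspect
```

Calling [1] on a nil result raises an error, so it pays to make sure every line in the file really is a doctor link.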

Because this script has a long execution time, we’ve added some very rudimentary error-handling so that the script doesn’t fail if you have a network outage. In the rescue block, the script will sleep for a long period before resuming. Read more about error-handling.

	require 'open-uri'

	# This is the base address for the doctor pages. The relative urls we collected get appended to it.
	BASE_URL = 'http://www.pfizer.com/responsibility/working_with_hcp/'
	DOC_URL_FILE = "doc-pages-urls.txt"
	DOC_PAGES_SUBDIR = 'pfizer-doc-pages'

	Dir.mkdir(DOC_PAGES_SUBDIR) unless File.exist?(DOC_PAGES_SUBDIR)

	File.open(DOC_URL_FILE).readlines.each do |url|
		# strip the trailing newline that readlines leaves on each url
		url = url.strip
		# using regular expression to get the doctor name parameter to name our local files
		doc_name = url.match(/hcpdisplayName=(.+)/)[1]
		begin
			puts "Retrieving #{url}"
			# URI.open is needed on Ruby 3.0 and later, where a plain open no longer fetches URLs
			page = URI.open("#{BASE_URL}#{url}")
		rescue StandardError => e
			puts "Error trying to retrieve page #{url}: #{e.message}"
			puts "Sleeping..."
			sleep 100
			retry # go back to the begin clause
		else
			File.open("#{DOC_PAGES_SUBDIR}/pfizer-doc-page--#{doc_name}.html", 'w'){|f| f.write(page.read)}
		ensure
			sleep 4
		end
	end
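One drawback of a bare retry is that a page which never loads will keep the script sleeping and retrying forever. A possible variation, sketched here with a made-up helper called with_retries, gives up after a fixed number of attempts. The failure is simulated so the example runs without a network connection.

```ruby
# Hypothetical helper: runs the block, retrying after a pause when it
# raises, and re-raises the error once max_attempts have been used up.
def with_retries(max_attempts: 3, wait_seconds: 100)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StandardError => e
    raise if attempts >= max_attempts
    puts "Attempt #{attempts} failed (#{e.message}); sleeping..."
    sleep wait_seconds
    retry # go back to the begin clause
  end
end

# Simulated download that fails twice before succeeding.
calls = 0
result = with_retries(max_attempts: 3, wait_seconds: 0) do
  calls += 1
  raise 'simulated network outage' if calls < 3
  'page contents'
end
puts "Got #{result.inspect} after #{calls} tries"
```

In the real script the block would contain the call that fetches the page, and wait_seconds would stay high enough to ride out a short outage.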

Reading Every Doctor Page

This step is basically a combination of everything we’ve done so far, with the added step of writing the data for each doctor in a normalized format. So:

  1. For each address listed in the ‘doc-pages-urls.txt’ file, retrieve the page and save it.
  2. Read each of the saved pages and parse out the tables for the relevant data.
  3. Delimit, normalize and print the data to a textfile

Normalizing the Data

We’ll be storing Pfizer’s data in a single text file, and so we want every kind of payment to have its own line. This means we’ll be repeating some fields, such as the doctor’s name, for each payment. It may look ugly and redundant, but it’s the only way to make this data sortable in a spreadsheet. Here’s a diagram of how the web site information will translate into the text file.

How data from the Pfizer site translates to a database table.

The most important nuance involves how you keep track of payments per category (speaking, consulting, etc.). Other companies’ disclosures (such as Eli Lilly’s) have column headers for each category, so you know exactly how many types of payment to expect.

But in Pfizer’s case, you don’t know for sure that the above example covers all the possible categories of payment. If you’ve programmed your parser to find a dollar value and place it in either the “Meals”, “Business Related Travel”, “Professional Advising”, or “Forums”, what happens when it reads a page that has a different category, like “Research”?

Rather than design a scraper to add unanticipated columns on the fly, let’s just plan on our compiled textfile to handle categories with two columns, one for the category of service, and one for amount:

Normalized, but not flexible:

Forums   Travel   Advising   Meals
-        -        -          872
-        2813     -          -
-        -        4875       -
5750     -        -          -
-        133      -          -

Normalized and flexible:

Service                    Amount
Meals                      872
Business Related Travel    2813
Professional Advising      4875
Expert-Led Forums          5750
Business Related Travel    133
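
As a rough sketch of the two-column layout, the snippet below writes one tab-delimited line per payment and repeats the doctor's name on each line. The `to_rows` helper, the `payments` array and the name "DOE, JANE" are hypothetical, made up for illustration; only the services and amounts come from the table above.

```ruby
# Hypothetical sample data: each payment is a [service, amount] pair,
# so an unanticipated category such as "Research" needs no new column.
payments = [
  ['Meals', 872],
  ['Business Related Travel', 2813],
  ['Professional Advising', 4875],
  ['Expert-Led Forums', 5750],
  ['Business Related Travel', 133]
]

# Repeat the doctor's name on every line and join the fields with tabs.
def to_rows(doc_name, payments)
  payments.map { |service, amount| [doc_name, service, amount].join("\t") }
end

rows = to_rows('DOE, JANE', payments)
puts rows
```

Because the category is just a value in the Service column, sorting or filtering by category in a spreadsheet works the same way no matter how many categories turn up.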

Copying the information from a website is straightforward, but actually parsing HTML that’s meant for easy reading into normalized data can be an exercise in patience. For example, in Pfizer’s case, each entity-payment-record spans an unpredictable number of rows. The fourth column contains the details of the payments but structures them as lists within lists.

So it takes a little trial-and-error, which is why it’s important to save web pages to your hard drive so you don’t have to re-download them when retrying your parsing code.
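
One way to avoid re-downloading is to check for a saved copy before touching the network. The sketch below assumes a hypothetical helper named `cached_page`, which is not part of the scraper in this article: it returns the saved file if there is one, and otherwise runs the block you give it and saves the result.

```ruby
require 'fileutils'

# Return the saved copy of a page if it exists; otherwise call the block
# to fetch the page, save what it returns, and hand that back.
# `cached_page` is a hypothetical helper used only for illustration.
def cached_page(path)
  return File.read(path) if File.exist?(path)

  html = yield
  FileUtils.mkdir_p(File.dirname(path)) # create the subdirectory if needed
  File.write(path, html)
  html
end
```

With this in place, the download step would go inside the block, so the parsing code can be rerun as often as needed while each page is fetched only once.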

Again, we are providing the data we’ve collected for Dollars for Docs, including Pfizer’s disclosures, upon request. The following code is presented as a learning example for Pfizer’s particular data format.

		require 'rubygems'
		require 'nokogiri'

		class String
		  # a helper function to turn all tabs, carriage returns, nbsp into regular spaces
		  def astrip
		    # \u00A0 is the non-breaking space (bytes \302\240 in UTF-8)
		    self.gsub(/[\u00A0\s]|(?:&nbsp;?)+/, ' ').strip
		  end
		end

		# Keep track of doc_name so that we only grab items listed under that doctor's exact name.
		# hcpdisplayName shows ALL doctors who share the same root name, so this avoids double-counting payments.
		DOC_URL_FILE = "doc-pages-urls.txt"
		DOC_PAGES_SUBDIR = 'pfizer-doc-pages'
		COMPILED_FILE_NAME = 'all-payments.txt'

		compiled_file = File.open(COMPILED_FILE_NAME, 'w')
		payment_array = []

		Dir.entries(DOC_PAGES_SUBDIR).select{|f| f.match(/\.html/)}.each do |url|
		  doc_name=url.match(/pfizer-doc-page--(.+?)\.html/)[1]
		  puts "Reading #{doc_name} page"

		  page=Nokogiri::HTML(open("#{DOC_PAGES_SUBDIR}/#{url}")).css('#hcpPayments')

		  # All paid entities are in rows in which the first cell is filled. The rows for each entity's
		  # associated payments leave this first cell blank
		  rows = page.css('tr')
		  if rows.length > 0 
		    entity_paid,city,state = nil
		    rows[1..-1].each do |row| # row 0 is the header row, so skip it

		    # iterate through each row. Rows that have a name in the first cell (entity paid) denote distinct entities to which payments went
		      cells = row.css('td')
		      if !cells[0].text.astrip.empty?
		      # we have a new entity, and city and state here
		        entity_paid,city,state = cells[0..2].map{|c| c.text.astrip}
		      end

		     # the fourth column should always have at least one link that corresponds to the doc_name
		      doc_link = cells[3].css('a').select{|a| a['href']=="payments_report.jsp?hcpdisplayName=#{doc_name}"}

		      if doc_link.length  == 1
		        # this is a cell that contains a doctor name and a payment
		        # it should also contain exactly one link that describes the service provided in a tooltip
		        service_link = cells[3].css('a').select{|a| a.attributes['onmouseover'] && a.attributes['onmouseover'].value.match(/showTooltip/)}

		        raise "Not exactly one service link for payee #{entity_paid}: #{url}" if service_link.length != 1

		        # Now, capture the cash/non-cash cells:
		        cash,non_cash = cells[4..5]

		        ##
		        ## Write this row to the file
		        ##
		        compiled_file.puts([entity_paid,city,state, doc_link[0].text, service_link[0].text, cash.text, non_cash.text].map{|t| t.astrip}.join("\t"))

		      else
		        ## This means that none, or more than one doctor's name was found here
		        ## So the cell was either blank, or it could contain an unexpected name.
		        ## You should write some test conditions here and log the records
		        ##  that don't fit your assumptions about the data

		      end # end of if doc_link.length==1

		    end # end of rows.each

		  end 
		  #end of if  rows.length > 0 

		end # end of Dir.entries(DOC_PAGES_SUBDIR).select

		compiled_file.close

Scraping Caveats

No matter how well you plan out your data collection, you still might not be able to predict all the idiosyncrasies in the data source. In fact, two basic assumptions I’ve had regarding the Pfizer site weren’t correct:

Each entity paid has an associated doctor – In fact, some universities list only the amount of research funding, with no names. If you’re only interested in doctors with Pfizer connections, this isn’t a huge deal. But you might run into problems if you’ve hard-coded your scraper to expect at least one name there.

The doctor pages only list one doctor – You might assume that a url that reads:

	http://www.pfizer.com/responsibility/working_with_hcp/payments_report.jsp?hcpdisplayName=PATEL,+MUKESH

…would belong to a single doctor named “PATEL, MUKESH.” However, it turns out that the hcpdisplayName parameter will retrieve every name that begins with, in this case, “PATEL, MUKESH”. In nearly every case, this returns a page with just one actual doctor. But there are apparently two doctors in Pfizer’s database whose names begin that way, “PATEL, MUKESH” and “PATEL, MUKESH MANUBHAI”.

So, in the main listing, both PATEL, MUKESH and PATEL, MUKESH MANUBHAI will be listed. Visiting the page with a hcpdisplayName of the shorter-named Patel will also bring up the other doctor’s payment records. So to avoid double-counting, you’ll want to program your scraper to only copy the data associated with the exact hcpdisplayName that you’re querying.
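
The exact-match rule can be sketched on its own, apart from the HTML parsing. Here each record is a hypothetical hash with a `:doc_name` key, and `exact_matches` is an illustrative helper rather than part of the scraper above; the `:row` values are placeholders.

```ruby
# Keep only the records whose doctor name is exactly the one we queried,
# dropping longer names that merely begin with it.
def exact_matches(records, queried_name)
  records.select { |record| record[:doc_name] == queried_name }
end

# Hypothetical records, as if parsed from the page for "PATEL, MUKESH".
records = [
  { doc_name: 'PATEL, MUKESH', row: 1 },
  { doc_name: 'PATEL, MUKESH MANUBHAI', row: 2 }
]

kept = exact_matches(records, 'PATEL, MUKESH')
puts kept.map { |record| record[:doc_name] }
```

A prefix test such as `start_with?` would keep both records, which is exactly the double-counting the comparison with `==` avoids.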

It’s possible to set hcpdisplayName to something very broad. ‘F’, for example, brings up a single page of records for the 166 doctors whose last names start with ‘F’.

So, in your pre-scrape scouting of a website, you should test whether queries require exact parameters or just partial matches. Instead of collecting the 485 pages of 10 doctors each, you could save yourself and Pfizer some time by collecting the 26 alphabetical lists with the hcpdisplayName parameter.
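
As a sketch of that shortcut, the 26 addresses could be generated like this. It reuses the BASE_URL and the payments_report.jsp path seen earlier, and it assumes every letter behaves the way ‘F’ does, which you would want to confirm before relying on it.

```ruby
BASE_URL = 'http://www.pfizer.com/responsibility/working_with_hcp/'

# Build one listing URL per letter of the alphabet, relying on the
# partial matching of hcpdisplayName described above.
letter_urls = ('A'..'Z').map do |letter|
  "#{BASE_URL}payments_report.jsp?hcpdisplayName=#{letter}"
end

puts letter_urls
```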

Once you’re well acquainted with Nokogiri and its powerful HTML parsing methods, you may want to familiarize yourself with mechanize, which uses Nokogiri for parsing, but also provides a convenient set of methods to work with sites that require filling out forms.

(Originated from http://www.propublica.org/nerds/item/scraping-websites)

Sample of PHP: Design Patterns Adapter

About the Adapter

In the Adapter Design Pattern, a class converts the interface of one class to be what another class expects.

In this example we have a SimpleBook class that has getAuthor() and getTitle() methods. The client, testAdapter.php, expects a getAuthorAndTitle() method. To “adapt” SimpleBook for testAdapter we have an adapter class, BookAdapter, which takes in an instance of SimpleBook, and uses the SimpleBook getAuthor() and getTitle() methods in its own getAuthorAndTitle() method.

Adapters are helpful if you want to use a class that doesn’t have quite the exact methods you need, and you can’t change the original class. The adapter can take the methods you can access in the original class, and adapt them into the methods you need.

SimpleBook.php

//copyright Lawrence Truett and FluffyCat.com 2006, all rights reserved

  class SimpleBook {

    private $author;
    private $title;

    function __construct($author_in, $title_in) {
      $this->author = $author_in;
      $this->title  = $title_in;
    }

    function getAuthor() {return $this->author;}

    function getTitle() {return $this->title;}

  }

BookAdapter.php

//copyright Lawrence Truett and FluffyCat.com 2006, all rights reserved

  include_once('SimpleBook.php');

  class BookAdapter {

    private $book;

    function __construct(SimpleBook $book_in) {
      $this->book = $book_in;
    }

    function getAuthorAndTitle() {
      return $this->book->getTitle() . ' by ' . $this->book->getAuthor();
    }

  }

testAdapter.php

//copyright Lawrence Truett and FluffyCat.com 2006, all rights reserved

  include_once('SimpleBook.php');
  include_once('BookAdapter.php');

  define('BR', '<'.'BR'.'>');

  echo 'BEGIN TESTING ADAPTER PATTERN'.BR;
  echo BR;

  $book = new SimpleBook("Gamma, Helm, Johnson, and Vlissides",
                         "Design Patterns");

  $bookAdapter = new BookAdapter($book);

  echo 'Author and Title: '.$bookAdapter->getAuthorAndTitle();

  echo BR.BR;
  echo 'END TESTING ADAPTER PATTERN'.BR;

output of testAdapter.php

BEGIN TESTING ADAPTER PATTERN

Author and Title: Design Patterns by Gamma, Helm, Johnson, and Vlissides

END TESTING ADAPTER PATTERN

Sample of PHP: Design Patterns Facade

About the Facade

In the facade pattern a class hides a complex subsystem from a calling class. In turn, the complex subsystem will know nothing of the calling class.

In this example, the CaseReverseFacade class will call a subsystem to reverse the case of a string passed from the Book class. The subsystem is controlled by the reverseStringCase function in the CaseReverseFacade, which in turn calls functions in the ArrayCaseReverse and ArrayStringFunctions classes. As written, the CaseReverseFacade can reverse the case of any string, but it could easily be changed to only reverse a single element of a single class.

In my example I make all elements of the Facade and the subsystem static. This could also easily be changed.

Book.php

//copyright Lawrence Truett and FluffyCat.com 2005, all rights reserved
  class Book {
    private $author;
    private $title;
    function __construct($title_in, $author_in) {
      $this->author = $author_in;
      $this->title  = $title_in;
    }
    function getAuthor() {return $this->author;}
    function getTitle() {return $this->title;}
    function getAuthorAndTitle() {
      return $this->getTitle() . ' by ' . $this->getAuthor();
    }
  }

CaseReverseFacade.php

//copyright Lawrence Truett and FluffyCat.com 2005, all rights reserved
  class CaseReverseFacade {
    public static function reverseStringCase($stringIn) {
      $arrayFromString = 
	    ArrayStringFunctions::stringToArray($stringIn);
      $reversedCaseArray = 
	    ArrayCaseReverse::reverseCase($arrayFromString);
      $reversedCaseString = 
	    ArrayStringFunctions::arrayToString($reversedCaseArray);
	  return $reversedCaseString;
    }
  }

ArrayCaseReverse.php

//copyright Lawrence Truett and FluffyCat.com 2005, all rights reserved
  class ArrayCaseReverse {
	private static $uppercase_array = 
	  array('A', 'B', 'C', 'D', 'E', 'F',
	        'G', 'H', 'I', 'J', 'K', 'L',
	        'M', 'N', 'O', 'P', 'Q', 'R',
	        'S', 'T', 'U', 'V', 'W', 'X',
	        'Y', 'Z');
	private static $lowercase_array = 
	  array('a', 'b', 'c', 'd', 'e', 'f',
	        'g', 'h', 'i', 'j', 'k', 'l',
	        'm', 'n', 'o', 'p', 'q', 'r',
	        's', 't', 'u', 'v', 'w', 'x',
	        'y', 'z');
    public static function reverseCase($arrayIn) {
      $array_out = array();

	  for ($x = 0; $x < count($arrayIn); $x++) {
	    if (in_array($arrayIn[$x], self::$uppercase_array)) {
          $key = array_search($arrayIn[$x], self::$uppercase_array);
		  $array_out[$x] = self::$lowercase_array[$key];
	    } elseif (in_array($arrayIn[$x], self::$lowercase_array)) {
          $key = array_search($arrayIn[$x], self::$lowercase_array);
		  $array_out[$x] = self::$uppercase_array[$key];
		} else {
		  $array_out[$x] = $arrayIn[$x];
		}
	  }
	  return $array_out;
    }
  }

ArrayStringFunctions.php

//copyright Lawrence Truett and FluffyCat.com 2005, all rights reserved
  class ArrayStringFunctions {
    public static function arrayToString($arrayIn) {
      $string_out = NULL;
	  foreach ($arrayIn as $oneChar) {
	    $string_out .= $oneChar;
	  }
	  return $string_out;
    }
    public static function stringToArray($stringIn) {
      return str_split($stringIn);
    }
  }

testFacade.php

//copyright Lawrence Truett and FluffyCat.com 2005, all rights reserved
  include_once('ArrayCaseReverse.php');  
  include_once('ArrayStringFunctions.php');
  include_once('Book.php');  
  include_once('CaseReverseFacade.php');
  echo tagins("html");
  echo tagins("head");  
  echo tagins("/head");  
  echo tagins("body");
  echo "BEGIN TESTING FACADE PATTERN";
  echo tagins("br").tagins("br");

  $book = 
    new Book("Design Patterns",
	            "Gamma, Helm, Johnson, and Vlissides");
  echo "Original book title: ".$book->getTitle();
  echo tagins("br").tagins("br");
  $bookTitleReversed = 
    CaseReverseFacade::reverseStringCase($book->getTitle());  

  echo "Reversed book title: ".$bookTitleReversed;
  echo tagins("br").tagins("br");
  echo "END TESTING FACADE PATTERN";
  echo tagins("br");
  echo tagins("/body");
  echo tagins("/html");
  //doing this so code can be displayed without breaks
  function tagins($stuffing) {
    return "<".$stuffing.">";
  }

output of testFacade.php

BEGIN TESTING FACADE PATTERN
Original book title: Design Patterns
Reversed book title: dESIGN pATTERNS
END TESTING FACADE PATTERN