A Quick Start On GitLab Continuous Integration

GitLab offers a continuous integration service. If you add a .gitlab-ci.yml file to the root directory of your repository, and configure your GitLab project to use a Runner, then each merge request or push triggers your CI pipeline.

The .gitlab-ci.yml file tells the GitLab runner what to do. By default it runs a pipeline with three stages: build, test, and deploy. You don’t need to use all three stages; stages with no jobs are simply ignored.
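As a sketch, writing those default stages out explicitly would look like this (the job name below is an arbitrary example, not a required name):

```yaml
# The implicit defaults, written out explicitly.
stages:
  - build
  - test
  - deploy

run_tests:        # example job name; names are arbitrary
  stage: test     # jobs that omit `stage` land in the test stage by default
  script: echo "testing"
```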

If everything runs OK (no non-zero return values), you’ll get a nice green checkmark associated with the pushed commit or merge request. This makes it easy to see whether a merge request caused any of the tests to fail before you even look at the code.

Most projects use GitLab’s CI service to run the test suite so that developers get immediate feedback if they broke something.

There’s a growing trend to use continuous delivery and continuous deployment to automatically deploy tested code to staging and production environments.

So in brief, the steps needed to get a working CI setup can be summed up as:

  1. Add .gitlab-ci.yml to the root directory of your repository
  2. Configure a Runner

From then on, every push to your Git repository will automagically trigger the pipeline, and the pipeline will appear under the project’s /pipelines page.

This guide assumes that you:

  • have a working GitLab instance of version 8.0 or higher or are using GitLab.com
  • have a project in GitLab that you would like to use CI for

Let’s break it down to pieces and work on solving the GitLab CI puzzle.

Creating a .gitlab-ci.yml file

Before you create .gitlab-ci.yml, let’s first explain briefly what this is all about.

What is .gitlab-ci.yml

The .gitlab-ci.yml file is where you configure what CI does with your project. It lives in the root of your repository.

On any push to your repository, GitLab will look for the .gitlab-ci.yml file and start builds on Runners according to the contents of the file, for that commit.

Because .gitlab-ci.yml is in the repository and is version controlled, old versions still build successfully, forks can easily make use of CI, branches can have different pipelines and jobs, and you have a single source of truth for CI. You can read more about the reasons why we are using .gitlab-ci.yml in our blog about it.

Note: .gitlab-ci.yml is a YAML file so you have to pay extra attention to indentation. Always use spaces, not tabs.
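For example, a minimal job indented with two spaces parses fine, while the same fragment indented with a tab is rejected by the YAML parser (the job name here is illustrative):

```yaml
# Two-space indentation throughout; a tab anywhere here would be a syntax error
job_example:
  script:
    - echo "Hello from CI"
```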

Creating a simple .gitlab-ci.yml file

You need to create a file named .gitlab-ci.yml in the root directory of your repository. Below is an example for a Ruby on Rails project.

before_script:
  - apt-get update -qq && apt-get install -y -qq sqlite3 libsqlite3-dev nodejs
  - ruby -v
  - which ruby
  - gem install bundler --no-ri --no-rdoc
  - bundle install --jobs $(nproc)  "${FLAGS[@]}"

rspec:
  script:
    - bundle exec rspec

rubocop:
  script:
    - bundle exec rubocop

This is the simplest possible build configuration that will work for most Ruby applications:

  1. Define two jobs, rspec and rubocop (the names are arbitrary), with different commands to be executed.
  2. Before every job, the commands defined by before_script are executed.

The .gitlab-ci.yml file defines sets of jobs with constraints of how and when they should be run. The jobs are defined as top-level elements with a name (in our case rspec and rubocop) and always have to contain the script keyword. Jobs are used to create builds, which are then picked by Runners and executed within the environment of the Runner.

What is important is that each job runs independently of the others.

If you want to check whether your .gitlab-ci.yml file is valid, there is a Lint tool under the page /ci/lint of your GitLab instance. You can also find a “CI Lint” button to go to this page under Pipelines > Pipelines and Pipelines > Builds in your project.

For more information and a complete .gitlab-ci.yml syntax, please read the documentation on .gitlab-ci.yml.

Push .gitlab-ci.yml to GitLab

Once you’ve created .gitlab-ci.yml, you should add it to your git repository and push it to GitLab.

git add .gitlab-ci.yml
git commit -m "Add .gitlab-ci.yml"
git push origin master

Now if you go to the Pipelines page you will see that the pipeline is pending.

You can also go to the Commits page and notice the little clock icon next to the commit SHA.

New commit pending

Clicking on the clock icon you will be directed to the builds page for that specific commit.

Single commit builds page

Notice that there are two jobs pending which are named after what we wrote in .gitlab-ci.yml. The red triangle indicates that there is no Runner configured yet for these builds.

The next step is to configure a Runner so that it picks the pending builds.

Configuring a Runner

In GitLab, Runners run the builds that you define in .gitlab-ci.yml. A Runner can be a virtual machine, a VPS, a bare-metal machine, a Docker container, or even a cluster of containers. GitLab and the Runners communicate through an API, so the only requirement is that the Runner’s machine has Internet access.

A Runner can be specific to a certain project or serve multiple projects in GitLab. If it serves all projects it’s called a Shared Runner.

Find more information about different Runners in the Runners documentation.

You can find whether any Runners are assigned to your project by going to Settings > Runners. Setting up a Runner is easy and straightforward. The official Runner supported by GitLab is written in Go and can be found at https://gitlab.com/gitlab-org/gitlab-ci-multi-runner.

In order to have a functional Runner you need to follow two steps:

  1. Install it
  2. Configure it

Follow the links above to set up your own Runner or use a Shared Runner as described in the next section.

For other types of unofficial Runners written in other languages, see the instructions for the various GitLab Runners.

Once the Runner has been set up, you should see it on the Runners page of your project, following Settings > Runners.

Activated runners

Shared Runners

If you use GitLab.com you can use Shared Runners provided by GitLab Inc.

These are special virtual machines that run on GitLab’s infrastructure and can build any project.

To enable Shared Runners you have to go to your project’s Settings > Runners and click Enable shared runners.

Read more on Shared Runners.

Seeing the status of your pipeline and builds

After configuring the Runner successfully, you should see the status of your last commit change from pending to either running, success, or failed.

You can view all pipelines by going to the Pipelines page in your project.

Commit status

Or you can view all builds by going to the Pipelines > Builds page.

Commit status

By clicking on a Build ID, you will be able to see the log of that build. This is important to diagnose why a build failed or acted differently than you expected.

Build log

You are also able to view the status of any commit in the various pages in GitLab, such as Commits and Merge Requests.

Enabling build emails

If you want to receive e-mail notifications about the result status of the builds, you should explicitly enable the Builds Emails service under your project’s settings.

For more information read the Builds emails service documentation.


Visit the examples README to see a list of examples using GitLab CI with various languages.

Awesome! You started using CI in GitLab!

GitLab Continuous Integration

What Are The Advantages?

  • Integrated: GitLab CI is part of GitLab. You can use it for free on GitLab.com
  • Easy to learn: See our Quick Start guide
  • Beautiful: GitLab CI offers the same great experience as GitLab. Familiar, easy to use, and beautiful.
  • Scalable: Tests run distributed on separate machines, and you can add as many as you want
  • Faster results: Each build can be split in multiple jobs that run in parallel on multiple machines
  • Continuous Delivery (CD): multiple stages, manual deploys, environments, and variables
  • Open source: CI is included with both the open source GitLab Community Edition and the proprietary GitLab Enterprise Edition


  • Multi-platform: you can execute builds on Unix, Windows, OSX, and any other platform that supports Go.
  • Multi-language: build scripts are command line driven and work with Java, PHP, Ruby, C, and any other language.
  • Stable: your builds run on a different machine than GitLab.
  • Parallel builds: GitLab CI splits builds over multiple machines, for fast execution.
  • Realtime logging: a link in the merge request takes you to the current build log that updates dynamically.
  • Versioned tests: a .gitlab-ci.yml file that contains your tests, allowing everyone to contribute changes and ensuring every branch gets the tests it needs.
  • Pipeline: you can define multiple jobs per stage and you can trigger other builds.
  • Autoscaling: you can automatically spin up and down VMs to make sure your builds get processed immediately and minimize costs.
  • Build artifacts: you can upload binaries and other build artifacts to GitLab and browse and download them.
  • Test locally: there are multiple executors and you can reproduce tests locally.
  • Docker support: you can easily spin up other Docker containers as a service as part of the test and build docker images.

Fully integrated with GitLab

  • Fully integrated with GitLab.
  • Quick project setup: Add projects with a single click, all hooks are setup automatically via the GitLab API.
  • Merge request integration: See the status of each build within the Merge Request in GitLab.


GitLab CI is a part of GitLab, a web application with an API that stores its state in a database. It manages projects/builds and provides a nice user interface, besides all the features of GitLab.

GitLab Runner is an application which processes builds. It can be deployed separately and works with GitLab CI through an API.

In order to run tests, you need at least one GitLab instance and one GitLab Runner.

GitLab Runner

To perform the actual build, you need to install GitLab Runner which is written in Go.

It can run on any platform for which you can build Go binaries, including Linux, OSX, Windows, FreeBSD and Docker.

It can test any programming language including .Net, Java, Python, C, PHP and others.

GitLab Runner has many features, including autoscaling.

Install GitLab Runner

Jenkins to GitLab CI

The railway world is a fast-moving environment. To bring you the latest improvements and fixes as quick as possible, Captain Train’s web-app is often updated, sometimes several times per day.

Did you always wonder how we manage building and deploying all of this without a jolt? Then read-on: here is a technical peek into our engineering process.

Note: this post tells the customer story of Captain Train.

From Jenkins to GitLab CI

We used to build our web-app using Jenkins. A robust and proven solution—which was polling our repositories every minute, and built the appropriate integration and production branches.

However we recently switched to a new system for building our web-app. To host our source-code and perform merge-requests, we’re using a self-hosted instance of GitLab. It’s nice, open-source—and features an integrated build system: GitLab CI.

See it like Travis, but integrated: just add a custom .gitlab-ci.yml file at the root of your repository, and GitLab will automatically start building your app in the way you specified.

Now what’s cool about this?

Reliable dockerized builds

Jenkins builds were all executed on a resource-constrained server—and this made builds slow and unreliable. For instance, we observed several times PhantomJS crashing randomly during tests: apparently it didn’t like several builds running on the same machine at the same time—and a single PhantomJS process crashing would bring all of the others down.

So the first step of our migration was to insulate builds into Docker containers. In this way:

  • Every build is isolated from the others, and processes don’t crash each other randomly.
  • Building the same project on different architectures is easy, and that’s good news, because we need this to support multiple Debian versions.
  • Project maintainers have greater control on the setup of their build environment: no need to bother an admin when upgrading an SDK on the shared build machine.
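In .gitlab-ci.yml terms, pinning a build environment per project is a one-line change. A hedged sketch, where the image name and build command are placeholders rather than Captain Train’s actual configuration:

```yaml
# Each project declares its own container image; the runner host stays generic.
image: debian:8        # placeholder: pin whichever Debian version this project targets

build:
  script:
    - ./build.sh       # hypothetical build entry point
```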

It scales

GitLab CI allows us to add more runners very easily. And now that builds are performed in Docker containers, we don’t have to configure the runners specifically with our build tools: any out-of-the-box server will do.

Once a new runner is declared, scaling is automatic: the most available runner will be picked to start every new build. It’s so simple that you can even add your own machine to build locally.

We’ve already reduced our build time by switching to a more powerful runner—a migration that would have been more difficult to do using Jenkins. Although we regularly optimize the run time of our test suite, sometimes you also need to just throw more CPU at it.

Easier to control

With Jenkins, the configuration of the build job is stored in an external admin-restricted tool. You need the right credentials to edit the build configuration, and it’s not obvious how to do it.

Using GitLab CI, the build jobs are determined solely from the .gitlab-ci.yml file in the repository. This makes it really simple to edit, and you get all the niceties of your usual git work-flow: versioning, merge requests, and so on. You don’t need to ask permission to add CI to your project. Lowering the barrier to entry for CI is definitely a good thing for engineering quality and developer happiness.

Tests on merge requests

GitLab CI makes it really easy to build and test the branch of a merge request (or a “Pull request” in GitHub slang). Just a few lines added to our .gitlab-ci.yml file, and we were running tests for every push to a merge request.
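A sketch of what those few lines can look like (the job name and command are illustrative, not Captain Train’s actual setup); since GitLab CI runs jobs for every push to every branch, the branch behind a merge request gets tested automatically:

```yaml
test:
  script:
    - ./run-tests.sh   # hypothetical test entry point
  only:
    - branches         # run on every branch, including merge-request branches
```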

Merge automatically when the build succeeds

We get nice red-or-green-status, the quite useful “Merge automatically when the build succeeds” button — and, as branches are now tested before being merged, much less build breakage.

Build Passed

A slick UI

GitLab CI provides “Pipelines”, an overview of all your build jobs. This points you quickly to a failing build, and the stage where the problem occurs. Plus it gets you this warm and fuzzy feeling of safeness when everything is green.


In a nutshell

We found the overall experience quite positive. Once we cleared the initial hurdle of making the build pass in a Docker container, integrating it into GitLab CI was really easy. And it gave us tons of positive signals, new features and neat integrations. 10/10, would build again.👍

Our Android team also migrated their pipeline, and are now building the integration and production Android APK with GitLab CI.

For further reading, you can find on the official website a nice overview of GitLab CI features, and some examples of .gitlab-ci.yml files.

Continuous Integration Newbie

Let’s assume that you don’t know anything about what Continuous Integration is and why it’s needed. Or, you just forgot. Anyway, we’re starting from scratch here.

Imagine that you work on a project, where all the code consists of two text files. Moreover, it is super-critical that the concatenation of these two files contains the phrase “Hello world.”

If there’s no such phrase, the whole development team stays without a salary for a month. Yeah, it is that serious!

The most responsible developer wrote a small script to run every time we are about to send our code to customers. The code is pretty sophisticated:

cat file1.txt file2.txt | grep -q "Hello world"

The problem is that there are ten developers in the team, and, you know, human factors can hit hard.

A week ago, a new guy forgot to run the script and three clients got broken builds. So you decided to solve the problem once and for all. Luckily, your code is already on GitLab, and you remember that there is a built-in CI system. Moreover, you heard at a conference that people use CI to run tests…

Run our first test inside CI

After a couple of minutes spent finding and reading the docs, it seems like all we need are these two lines of code in a file called .gitlab-ci.yml:

test:
  script: cat file1.txt file2.txt | grep -q 'Hello world'

Committing it, and hooray! Our build is successful: Build succeeded

Let’s change “world” to “Africa” in the second file and check what happens: Build failed

The build fails as expected!

Okay, we now have automated tests here! GitLab CI will run our test script every time we push new code to the repository.

Make results of builds downloadable

The next business requirement is to package the code before sending it to our customers. Let’s automate that as well!

All we need to do is define another job for CI. Let’s name the job “package”:

test:
  script: cat file1.txt file2.txt | grep -q 'Hello world'

package:
  script: cat file1.txt file2.txt | gzip > package.gz

We have two tabs now: Two tabs - generated from two jobs

However, we forgot to specify that the new file is a build artifact, so that it could be downloaded. We can fix it by adding an artifacts section:

test:
  script: cat file1.txt file2.txt | grep -q 'Hello world'

package:
  script: cat file1.txt file2.txt | gzip > packaged.gz
  artifacts:
    paths:
    - packaged.gz

Checking… It is there: Checking the download buttons

Perfect! However, we have a problem to fix: the jobs are running in parallel, but we do not want to package our application if our tests fail.

Run jobs sequentially

We only want to run the ‘package’ job if the tests are successful. Let’s define the order by specifying stages:

stages:
  - test
  - package

test:
  stage: test
  script: cat file1.txt file2.txt | grep -q 'Hello world'

package:
  stage: package
  script: cat file1.txt file2.txt | gzip > packaged.gz
  artifacts:
    paths:
    - packaged.gz

That should be good!

Also, we forgot to mention that compilation (which is represented by concatenation in our case) takes a while, so we don’t want to run it twice. Let’s define a separate step for it:

stages:
  - compile
  - test
  - package

compile:
  stage: compile
  script: cat file1.txt file2.txt > compiled.txt
  artifacts:
    paths:
    - compiled.txt

test:
  stage: test
  script: cat compiled.txt | grep -q 'Hello world'

package:
  stage: package
  script: cat compiled.txt | gzip > packaged.gz
  artifacts:
    paths:
    - packaged.gz

Let’s take a look at our artifacts:

Unnecessary artifact

Hmm, we do not need that “compile” file to be downloadable. Let’s make our temporary artifacts expire by setting expire_in to '20 minutes':

compile:
  stage: compile
  script: cat file1.txt file2.txt > compiled.txt
  artifacts:
    paths:
    - compiled.txt
    expire_in: 20 minutes

Now our config looks pretty impressive:

  • We have three sequential stages to compile, test, and package our application.
  • We are passing the compiled app to the next stages so that there’s no need to run compilation twice (so it will run faster).
  • We are storing a packaged version of our app in build artifacts for further usage.

Learning which Docker image to use

So far so good. However, it appears our builds are still slow. Let’s take a look at the logs.

Ruby 2.1 in the logs

Wait, what is this? Ruby 2.1?

Why do we need Ruby at all? Oh, GitLab.com uses Docker images to run our builds, and by default it uses the ruby:2.1 image. For sure, this image contains many packages we don’t need. After a minute of googling, we figure out that there’s an image called alpine, which is an almost blank Linux image.

OK, let’s explicitly specify that we want to use this image by adding image: alpine to .gitlab-ci.yml. Now we’re talking! We shaved almost 3 minutes off:

Build speed improved

It looks like there’s a lot of public images around. So we can just grab one for our technology stack. It makes sense to specify an image which contains no extra software because it minimizes download time.
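Note that `image` can also be set per job, so a pipeline can keep a slim default image and pull a heavier one only where it is needed. A sketch (the rspec job is illustrative, not part of our two-file project):

```yaml
image: alpine          # slim default used by most jobs

rspec:                 # hypothetical job that actually needs Ruby
  image: ruby:2.1      # per-job override of the global image
  script: bundle exec rspec
```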

Dealing with complex scenarios

So far so good. However, let’s suppose we have a new client who wants us to package our app into an .iso image instead of .gz. Since CI does all the work, we can just add one more job to it. ISO images can be created using the mkisofs command. Here’s how our config should look:

image: alpine

stages:
  - compile
  - test
  - package

# ... "compile" and "test" jobs are skipped here for the sake of compactness

pack-gz:
  stage: package
  script: cat compiled.txt | gzip > packaged.gz
  artifacts:
    paths:
    - packaged.gz

pack-iso:
  stage: package
  script:
  - mkisofs -o ./packaged.iso ./compiled.txt
  artifacts:
    paths:
    - packaged.iso

Note that job names don’t have to match stage names. In fact, if both jobs in the package stage had the same name, it wouldn’t be possible to make them run in parallel inside that stage, since job names must be unique. Hence, think of matching job and stage names as a coincidence.

Anyhow, the build is failing: Failed build because of missing mkisofs

The problem is that mkisofs is not included in the alpine image, so we need to install it first.

Dealing with missing software/packages

According to the Alpine Linux website, mkisofs is part of the xorriso and cdrkit packages. These are the magic commands that we need to run to install a package:

echo "ipv6" >> /etc/modules  # enable networking
apk update                   # update packages list
apk add xorriso              # install package

For CI, these are just like any other commands. The full list of commands we need to pass to the script section should look like this:

- echo "ipv6" >> /etc/modules
- apk update
- apk add xorriso
- mkisofs -o ./packaged.iso ./compiled.txt

However, to make it semantically correct, let’s put the commands related to package installation in before_script. Note that if you use before_script at the top level of a configuration, then the commands will run before all jobs. In our case, we just want it to run before one specific job.
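The difference between the two placements can be sketched like this: a top-level before_script applies to every job, while a job-level one applies (and takes precedence) only inside that job:

```yaml
# Top-level: these commands would run before *every* job in the pipeline
before_script:
  - echo "runs before all jobs"

pack-iso:
  stage: package
  before_script:       # job-level: overrides the global list for this job only
    - apk update
    - apk add xorriso
  script:
    - mkisofs -o ./packaged.iso ./compiled.txt
```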

Our final version of .gitlab-ci.yml:

image: alpine

stages:
  - compile
  - test
  - package

compile:
  stage: compile
  script: cat file1.txt file2.txt > compiled.txt
  artifacts:
    paths:
    - compiled.txt
    expire_in: 20 minutes

test:
  stage: test
  script: cat compiled.txt | grep -q 'Hello world'

pack-gz:
  stage: package
  script: cat compiled.txt | gzip > packaged.gz
  artifacts:
    paths:
    - packaged.gz

pack-iso:
  stage: package
  before_script:
  - echo "ipv6" >> /etc/modules
  - apk update
  - apk add xorriso
  script:
  - mkisofs -o ./packaged.iso ./compiled.txt
  artifacts:
    paths:
    - packaged.iso

Wow, it looks like we have just created a pipeline! We have three sequential stages, but the jobs pack-gz and pack-iso, inside the package stage, are running in parallel:

Pipelines illustration


There’s much more to cover but let’s stop here for now. I hope you liked this short story. All examples were made intentionally trivial so that you could learn the concepts of GitLab CI without being distracted by an unfamiliar technology stack. Let’s wrap up what we have learned:

  1. To delegate some work to GitLab CI you should define one or more jobs in .gitlab-ci.yml.
  2. Jobs should have names and it’s your responsibility to come up with good ones.
  3. Every job contains a set of rules & instructions for GitLab CI, defined by special keywords.
  4. Jobs can run sequentially, in parallel, or you can define a custom pipeline.
  5. You can pass files between jobs and store them in build artifacts so that they can be downloaded from the interface.
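Point 5 can be sketched with the `dependencies` keyword, which restricts which earlier jobs’ artifacts a job downloads (the job names and commands here are illustrative, not from the tutorial above):

```yaml
build:
  stage: build
  script: ./compile.sh            # hypothetical build step
  artifacts:
    paths:
      - bin/

test:
  stage: test
  dependencies:
    - build                       # fetch only the build job's artifacts
  script: ./bin/run-tests         # hypothetical test binary
```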

Below is the last section containing a more formal description of terms and keywords we used, as well as links to the detailed description of GitLab CI functionality.

Keywords description & links to the documentation

  • .gitlab-ci.yml: File containing all definitions of how your project should be built
  • script: Defines a shell script to be executed
  • before_script: Used to define the commands that should be run before (all) jobs
  • image: Defines what Docker image to use
  • stage: Defines a pipeline stage (default: test)
  • artifacts: Defines a list of build artifacts
  • artifacts:expire_in: Used to delete uploaded artifacts after the specified time
  • pipelines: A pipeline is a group of builds that get executed in stages (batches)

A Simple Explanation Of ‘The Internet Of Things’

The “Internet of things” (IoT) is becoming a rapidly growing topic of conversation both in the workplace and outside of it. It’s a concept that not only has the potential to impact how we live but also how we work. But what exactly is the “Internet of things” and what impact is it going to have on you, if any? There are a lot of complexities around the “Internet of things” but I want to stick to the basics. Lots of technical and policy-related conversations are being had but many people are still just trying to grasp the foundation of what the heck these conversations are about.

Let’s start with understanding a few things.

Broadband Internet is becoming more widely available, the cost of connecting is decreasing, more devices are being created with Wi-Fi capabilities and sensors built into them, technology costs are going down, and smartphone penetration is sky-rocketing. All of these things are creating a “perfect storm” for the IoT.

So What Is The Internet Of Things?

Simply put, this is the concept of basically connecting any device with an on and off switch to the Internet (and/or to each other). This includes everything from cellphones, coffee makers, washing machines, headphones, lamps, wearable devices and almost anything else you can think of.  This also applies to components of machines, for example a jet engine of an airplane or the drill of an oil rig. As I mentioned, if it has an on and off switch then chances are it can be a part of the IoT.  The analyst firm Gartner says that by 2020 there will be over 26 billion connected devices… That’s a lot of connections (some even estimate this number to be much higher, over 100 billion).  The IoT is a giant network of connected “things” (which also includes people).  The relationship will be between people-people, people-things, and things-things.

How Does This Impact You?

The new rule for the future is going to be, “Anything that can be connected, will be connected.” But why on earth would you want so many connected devices talking to each other? There are many examples for what this might look like or what the potential value might be. Say for example you are on your way to a meeting; your car could have access to your calendar and already know the best route to take. If the traffic is heavy your car might send a text to the other party notifying them that you will be late. What if your alarm clock wakes you up at 6 a.m. and then notifies your coffee maker to start brewing coffee for you? What if your office equipment knew when it was running low on supplies and automatically re-ordered more? What if the wearable device you used in the workplace could tell you when and where you were most active and productive and shared that information with other devices that you used while working?

On a broader scale, the IoT can be applied to things like transportation networks: “smart cities” can help us reduce waste and improve efficiency for things such as energy use, helping us understand and improve how we work and live. Take a look at the visual below to see what something like that can look like.


The reality is that the IoT allows for virtually endless opportunities and connections to take place, many of which we can’t even think of or fully understand the impact of today. It’s not hard to see how and why the IoT is such a hot topic today; it certainly opens the door to a lot of opportunities but also to many challenges. Security is a big issue that is oftentimes brought up. With billions of devices being connected together, what can people do to make sure that their information stays secure? Will someone be able to hack into your toaster and thereby get access to your entire network? The IoT also opens up companies all over the world to more security threats. Then we have the issue of privacy and data sharing. This is a hot-button topic even today, so one can only imagine how the conversation and concerns will escalate when we are talking about many billions of devices being connected. Another issue that many companies specifically are going to be faced with is around the massive amounts of data that all of these devices are going to produce. Companies need to figure out a way to store, track, analyze and make sense of the vast amounts of data that will be generated.

So what now?

Conversations about the IoT are (and have been for several years) taking place all over the world as we seek to understand how this will impact our lives. We are also trying to understand what the many opportunities and challenges are going to be as more and more devices start to join the IoT. For now the best thing that we can do is educate ourselves about what the IoT is and the potential impacts that can be seen on how we work and live.

This article is taken from Jacob Morgan of Forbes Magazine

The Three Breakthroughs That Have Finally Unleashed AI on the World

A FEW MONTHS ago I made the trek to the sylvan campus of the IBM research labs in Yorktown Heights, New York, to catch an early glimpse of the fast-arriving, long-overdue future of artificial intelligence. This was the home of Watson, the electronic genius that conquered Jeopardy! in 2011. The original Watson is still here—it’s about the size of a bedroom, with 10 upright, refrigerator-shaped machines forming the four walls. The tiny interior cavity gives technicians access to the jumble of wires and cables on the machines’ backs. It is surprisingly warm inside, as if the cluster were alive.

Today’s Watson is very different. It no longer exists solely within a wall of cabinets but is spread across a cloud of open-standard servers that run several hundred “instances” of the AI at once. Like all things cloudy, Watson is served to simultaneous customers anywhere in the world, who can access it using their phones, their desktops, or their own data servers. This kind of AI can be scaled up or down on demand. Because AI improves as people use it, Watson is always getting smarter; anything it learns in one instance can be immediately transferred to the others. And instead of one single program, it’s an aggregation of diverse software engines—its logic-deduction engine and its language-parsing engine might operate on different code, on different chips, in different locations—all cleverly integrated into a unified stream of intelligence.

Consumers can tap into that always-on intelligence directly, but also through third-party apps that harness the power of this AI cloud. Like many parents of a bright mind, IBM would like Watson to pursue a medical career, so it should come as no surprise that one of the apps under development is a medical-diagnosis tool. Most of the previous attempts to make a diagnostic AI have been pathetic failures, but Watson really works. When, in plain English, I give it the symptoms of a disease I once contracted in India, it gives me a list of hunches, ranked from most to least probable. The most likely cause, it declares, is Giardia—the correct answer. This expertise isn’t yet available to patients directly; IBM provides access to Watson’s intelligence to partners, helping them develop user-friendly interfaces for subscribing doctors and hospitals. “I believe something like Watson will soon be the world’s best diagnostician—whether machine or human,” says Alan Greene, chief medical officer of Scanadu, a startup that is building a diagnostic device inspired by the Star Trek medical tricorder and powered by a cloud AI. “At the rate AI technology is improving, a kid born today will rarely need to see a doctor to get a diagnosis by the time they are an adult.”

Medicine is only the beginning. All the major cloud companies, plus dozens of startups, are in a mad rush to launch a Watson-like cognitive service. According to quantitative analysis firm Quid, AI has attracted more than $17 billion in investments since 2009. Last year alone more than $2 billion was invested in 322 companies with AI-like technology. Facebook and Google have recruited researchers to join their in-house AI research teams. Yahoo, Intel, Dropbox, LinkedIn, Pinterest, and Twitter have all purchased AI companies since last year. Private investment in the AI sector has been expanding 62 percent a year on average for the past four years, a rate that is expected to continue.

Amid all this activity, a picture of our AI future is coming into view, and it is not the HAL 9000—a discrete machine animated by a charismatic (yet potentially homicidal) humanlike consciousness—or a Singularitan rapture of superintelligence. The AI on the horizon looks more like Amazon Web Services—cheap, reliable, industrial-grade digital smartness running behind everything, and almost invisible except when it blinks off. This common utility will serve you as much IQ as you want but no more than you need. Like all utilities, AI will be supremely boring, even as it transforms the Internet, the global economy, and civilization. It will enliven inert objects, much as electricity did more than a century ago. Everything that we formerly electrified we will now cognitize. This new utilitarian AI will also augment us individually as people (deepening our memory, speeding our recognition) and collectively as a species. There is almost nothing we can think of that cannot be made new, different, or interesting by infusing it with some extra IQ. In fact, the business plans of the next 10,000 startups are easy to forecast: Take X and add AI. This is a big deal, and now it’s here.

AROUND 2002 I attended a small party for Google—before its IPO, when it only focused on search. I struck up a conversation with Larry Page, Google’s brilliant cofounder, who became the company’s CEO in 2011. “Larry, I still don’t get it. There are so many search companies. Web search, for free? Where does that get you?” My unimaginative blindness is solid evidence that predicting is hard, especially about the future, but in my defense this was before Google had ramped up its ad-auction scheme to generate real income, long before YouTube or any other major acquisitions. I was not the only avid user of its search site who thought it would not last long. But Page’s reply has always stuck with me: “Oh, we’re really making an AI.”

I’ve thought a lot about that conversation over the past few years as Google has bought 14 AI and robotics companies. At first glance, you might think that Google is beefing up its AI portfolio to improve its search capabilities, since search contributes 80 percent of its revenue. But I think that’s backward. Rather than use AI to make its search better, Google is using search to make its AI better. Every time you type a query, click on a search-generated link, or create a link on the web, you are training the Google AI. When you type “Easter Bunny” into the image search bar and then click on the most Easter Bunny-looking image, you are teaching the AI what an Easter bunny looks like. Each of the 12.1 billion queries that Google’s 1.2 billion searchers conduct each day tutor the deep-learning AI over and over again. With another 10 years of steady improvements to its AI algorithms, plus a thousand-fold more data and 100 times more computing resources, Google will have an unrivaled AI. My prediction: By 2024, Google’s main product will not be search but AI.

This is the point where it is entirely appropriate to be skeptical. For almost 60 years, AI researchers have predicted that AI is right around the corner, yet until a few years ago it seemed as stuck in the future as ever. There was even a term coined to describe this era of meager results and even more meager research funding: the AI winter. Has anything really changed?

Yes. Three recent breakthroughs have unleashed the long-awaited arrival of artificial intelligence:

1. Cheap parallel computation

Thinking is an inherently parallel process, billions of neurons firing simultaneously to create synchronous waves of cortical computation. To build a neural network—the primary architecture of AI software—also requires many different processes to take place simultaneously. Each node of a neural network loosely imitates a neuron in the brain—mutually interacting with its neighbors to make sense of the signals it receives. To recognize a spoken word, a program must be able to hear all the phonemes in relation to one another; to identify an image, it needs to see every pixel in the context of the pixels around it—both deeply parallel tasks. But until recently, the typical computer processor could only ping one thing at a time.

That began to change more than a decade ago, when a new kind of chip, called a graphics processing unit, or GPU, was devised for the intensely visual—and parallel—demands of videogames, in which millions of pixels had to be recalculated many times a second. That required a specialized parallel computing chip, which was added as a supplement to the PC motherboard. The parallel graphical chips worked, and gaming soared. By 2005, GPUs were being produced in such quantities that they became much cheaper. In 2009, Andrew Ng and a team at Stanford realized that GPU chips could run neural networks in parallel.

That discovery unlocked new possibilities for neural networks, which can include hundreds of millions of connections between their nodes. Traditional processors required several weeks to calculate all the cascading possibilities in a 100 million-parameter neural net. Ng found that a cluster of GPUs could accomplish the same thing in a day. Today neural nets running on GPUs are routinely used by cloud-enabled companies such as Facebook to identify your friends in photos or, in the case of Netflix, to make reliable recommendations for its more than 50 million subscribers.

2. Big Data

Every intelligence has to be taught. A human brain, which is genetically primed to categorize things, still needs to see a dozen examples before it can distinguish between cats and dogs. That’s even more true for artificial minds. Even the best-programmed computer has to play at least a thousand games of chess before it gets good. Part of the AI breakthrough lies in the incredible avalanche of collected data about our world, which provides the schooling that AIs need. Massive databases, self-tracking, web cookies, online footprints, terabytes of storage, decades of search results, Wikipedia, and the entire digital universe became the teachers making AI smart.

3. Better algorithms

Digital neural nets were invented in the 1950s, but it took decades for computer scientists to learn how to tame the astronomically huge combinatorial relationships between a million—or 100 million—neurons. The key was to organize neural nets into stacked layers. Take the relatively simple task of recognizing that a face is a face. When a group of bits in a neural net are found to trigger a pattern—the image of an eye, for instance—that result is moved up to another level in the neural net for further parsing. The next level might group two eyes together and pass that meaningful chunk onto another level of hierarchical structure that associates it with the pattern of a nose. It can take many millions of these nodes (each one producing a calculation feeding others around it), stacked up to 15 levels high, to recognize a human face. In 2006, Geoff Hinton, then at the University of Toronto, made a key tweak to this method, which he dubbed “deep learning.” He was able to mathematically optimize results from each layer so that the learning accumulated faster as it proceeded up the stack of layers. Deep-learning algorithms accelerated enormously a few years later when they were ported to GPUs. The code of deep learning alone is insufficient to generate complex logical thinking, but it is an essential component of all current AIs, including IBM’s Watson, Google’s search engine, and Facebook’s algorithms.
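The stacked-layer idea described above can be sketched in a few lines of plain Python: each layer transforms the output of the layer below it, so low-level patterns feed progressively higher-level ones. This is an illustrative toy with fixed, made-up weights, not Hinton's actual deep-learning method (real networks learn their weights and have millions of them):

```python
import math

def layer(inputs, weights, biases):
    """One fully connected layer: weighted sums pushed through a sigmoid."""
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b
        outputs.append(1.0 / (1.0 + math.exp(-z)))  # squash into (0, 1)
    return outputs

def forward(x, layers):
    """Pass the input up the stack: each layer parses the one below it."""
    for weights, biases in layers:
        x = layer(x, weights, biases)
    return x

# Two stacked layers with fixed toy weights: 3 inputs -> 2 hidden -> 1 output.
net = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),  # layer 1: 2 units
    ([[1.0, -1.0]], [0.2]),                              # layer 2: 1 unit
]
score = forward([1.0, 0.0, 1.0], net)[0]
print(score)
```

Hinton's tweak was in how the learning signal is propagated through such a stack; the forward pass shown here is the part the article describes.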

This perfect storm of parallel computation, bigger data, and deeper algorithms generated the 60-years-in-the-making overnight success of AI. And this convergence suggests that as long as these technological trends continue—and there’s no reason to think they won’t—AI will keep improving.

As it does, this cloud-based AI will become an increasingly ingrained part of our everyday life. But it will come at a price. Cloud computing obeys the law of increasing returns, sometimes called the network effect, which holds that the value of a network increases much faster as it grows bigger. The bigger the network, the more attractive it is to new users, which makes it even bigger, and thus more attractive, and so on. A cloud that serves AI will obey the same law. The more people who use an AI, the smarter it gets. The smarter it gets, the more people use it. The more people that use it, the smarter it gets. Once a company enters this virtuous cycle, it tends to grow so big, so fast, that it overwhelms any upstart competitors. As a result, our AI future is likely to be ruled by an oligarchy of two or three large, general-purpose cloud-based commercial intelligences.

AI Everywhere

IN 1997, WATSON’S precursor, IBM’s Deep Blue, beat the reigning chess grand master Garry Kasparov in a famous man-versus-machine match. After machines repeated their victories in a few more matches, humans largely lost interest in such contests. You might think that was the end of the story (if not the end of human history), but Kasparov realized that he could have performed better against Deep Blue if he’d had the same instant access to a massive database of all previous chess moves that Deep Blue had. If this database tool was fair for an AI, why not for a human? To pursue this idea, Kasparov pioneered the concept of man-plus-machine matches, in which AI augments human chess players rather than competes against them.

Now called freestyle chess matches, these are like mixed martial arts fights, where players use whatever combat techniques they want. You can play as your unassisted human self, or you can act as the hand for your supersmart chess computer, merely moving its board pieces, or you can play as a “centaur,” which is the human/AI cyborg that Kasparov advocated. A centaur player will listen to the moves whispered by the AI but will occasionally override them—much the way we use GPS navigation in our cars. In the championship Freestyle Battle in 2014, open to all modes of players, pure chess AI engines won 42 games, but centaurs won 53 games. Today the best chess player alive is a centaur: Intagrand, a team of humans and several different chess programs.

But here’s the even more surprising part: The advent of AI didn’t diminish the performance of purely human chess players. Quite the opposite. Cheap, supersmart chess programs inspired more people than ever to play chess, at more tournaments than ever, and the players got better than ever. There are more than twice as many grand masters now as there were when Deep Blue first beat Kasparov. The top-ranked human chess player today, Magnus Carlsen, trained with AIs and has been deemed the most computer-like of all human chess players. He also has the highest human grand master rating of all time.

If AI can help humans become better chess players, it stands to reason that it can help us become better pilots, better doctors, better judges, better teachers. Most of the commercial work completed by AI will be done by special-purpose, narrowly focused software brains that can, for example, translate any language into any other language, but do little else. Drive a car, but not converse. Or recall every pixel of every video on YouTube but not anticipate your work routines. In the next 10 years, 99 percent of the artificial intelligence that you will interact with, directly or indirectly, will be nerdily autistic, supersmart specialists.

In fact, this won’t really be intelligence, at least not as we’ve come to think of it. Indeed, intelligence may be a liability—especially if by “intelligence” we mean our peculiar self-awareness, all our frantic loops of introspection and messy currents of self-consciousness. We want our self-driving car to be inhumanly focused on the road, not obsessing over an argument it had with the garage. The synthetic Dr. Watson at our hospital should be maniacal in its work, never wondering whether it should have majored in English instead. As AIs develop, we might have to engineer ways to prevent consciousness in them—and our most premium AI services will likely be advertised as consciousness-free.

What we want instead of intelligence is artificial smartness. Unlike general intelligence, smartness is focused, measurable, specific. It also can think in ways completely different from human cognition. A cute example of this nonhuman thinking is a cool stunt that was performed at the South by Southwest festival in Austin, Texas, in March of this year. IBM researchers overlaid Watson with a culinary database comprising online recipes, USDA nutritional facts, and flavor research on what makes compounds taste pleasant. From this pile of data, Watson dreamed up novel dishes based on flavor profiles and patterns from existing dishes, and willing human chefs cooked them. One crowd favorite generated from Watson’s mind was a tasty version of fish and chips using ceviche and fried plantains. For lunch at the IBM labs in Yorktown Heights I slurped down that one and another tasty Watson invention: Swiss/Thai asparagus quiche. Not bad! It’s unlikely that either one would ever have occurred to humans.

Nonhuman intelligence is not a bug, it’s a feature. The chief virtue of AIs will be their alien intelligence. An AI will think about food differently than any chef, allowing us to think about food differently. Or to think about manufacturing materials differently. Or clothes. Or financial derivatives. Or any branch of science and art. The alienness of artificial intelligence will become more valuable to us than its speed or power.

As it does, it will help us better understand what we mean by intelligence in the first place. In the past, we would have said only a superintelligent AI could drive a car, or beat a human at Jeopardy! or chess. But once AI did each of those things, we considered that achievement obviously mechanical and hardly worth the label of true intelligence. Every success in AI redefines it.

But we haven’t just been redefining what we mean by AI—we’ve been redefining what it means to be human. Over the past 60 years, as mechanical processes have replicated behaviors and talents we thought were unique to humans, we’ve had to change our minds about what sets us apart. As we invent more species of AI, we will be forced to surrender more of what is supposedly unique about humans. We’ll spend the next decade—indeed, perhaps the next century—in a permanent identity crisis, constantly asking ourselves what humans are for. In the grandest irony of all, the greatest benefit of an everyday, utilitarian AI will not be increased productivity or an economics of abundance or a new way of doing science—although all those will happen. The greatest benefit of the arrival of artificial intelligence is that AIs will help define humanity. We need AIs to tell us who we are.

This article is taken from Kevin Kelly of Wired Magazine

8 big trends in big data analytics

Bill Loconzolo, vice president of data engineering at Intuit, jumped into a data lake with both feet. Dean Abbott, chief data scientist at Smarter Remarketer, made a beeline for the cloud. The leading edge of big data and analytics, which includes data lakes for holding vast stores of data in its native format and, of course, cloud computing, is a moving target, both say. And while the technology options are far from mature, waiting simply isn’t an option.

“The reality is that the tools are still emerging, and the promise of the [Hadoop] platform is not at the level it needs to be for business to rely on it,” says Loconzolo. But the disciplines of big data and analytics are evolving so quickly that businesses need to wade in or risk being left behind. “In the past, emerging technologies might have taken years to mature,” he says. “Now people iterate and drive solutions in a matter of months — or weeks.” So what are the top emerging technologies and trends that should be on your watch list — or in your test lab? Computerworld asked IT leaders, consultants and industry analysts to weigh in. Here’s their list.

1. Big data analytics in the cloud

Hadoop, a framework and set of tools for processing very large data sets, was originally designed to work on clusters of physical machines. That has changed. “Now an increasing number of technologies are available for processing data in the cloud,” says Brian Hopkins, an analyst at Forrester Research. Examples include Amazon’s Redshift hosted BI data warehouse, Google’s BigQuery data analytics service, IBM’s Bluemix cloud platform and Amazon’s Kinesis data processing service. “The future state of big data will be a hybrid of on-premises and cloud,” he says.

Smarter Remarketer, a provider of SaaS-based retail analytics, segmentation and marketing services, recently moved from an in-house Hadoop and MongoDB database infrastructure to Amazon Redshift, a cloud-based data warehouse. The Indianapolis-based company collects online and brick-and-mortar retail sales data, customer demographic data and real-time behavioral data, then analyzes that information to help retailers create targeted messaging that elicits a desired response from shoppers, in some cases in real time.

Redshift was more cost-effective for Smarter Remarketer’s data needs, Abbott says, especially since it has extensive reporting capabilities for structured data. And as a hosted offering, it’s both scalable and relatively easy to use. “It’s cheaper to expand on virtual machines than buy physical machines to manage ourselves,” he says.

For its part, Mountain View, Calif.-based Intuit has moved cautiously toward cloud analytics because it needs a secure, stable and auditable environment. For now, the financial software company is keeping everything within its private Intuit Analytics Cloud. “We’re partnering with Amazon and Cloudera on how to have a public-private, highly available and secure analytic cloud that can span both worlds, but no one has solved this yet,” says Loconzolo. However, a move to the cloud is inevitable for a company like Intuit that sells products that run in the cloud. “It will get to a point where it will be cost-prohibitive to move all of that data to a private cloud,” he says.

2. Hadoop: The new enterprise data operating system

Distributed analytic frameworks, such as MapReduce, are evolving into distributed resource managers that are gradually turning Hadoop into a general-purpose data operating system, says Hopkins. With these systems, he says, “you can perform many different data manipulations and analytics operations by plugging them into Hadoop as the distributed file storage system.”

What does this mean for the enterprise? As SQL, MapReduce, in-memory, stream processing, graph analytics and other types of workloads are able to run on Hadoop with adequate performance, more businesses will use Hadoop as an enterprise data hub. “The ability to run many different kinds of [queries and data operations] against data in Hadoop will make it a low-cost, general-purpose place to put data that you want to be able to analyze,” Hopkins says.

Intuit is already building on its Hadoop foundation. “Our strategy is to leverage the Hadoop Distributed File System, which works closely with MapReduce and Hadoop, as a long-term strategy to enable all types of interactions with people and products,” says Loconzolo.

3. Big data lakes

Traditional database theory dictates that you design the data set before entering any data. A data lake, also called an enterprise data lake or enterprise data hub, turns that model on its head, says Chris Curran, principal and chief technologist in PricewaterhouseCoopers’ U.S. advisory practice. “It says we’ll take these data sources and dump them all into a big Hadoop repository, and we won’t try to design a data model beforehand,” he says. Instead, it provides tools for people to analyze the data, along with a high-level definition of what data exists in the lake. “People build the views into the data as they go along. It’s a very incremental, organic model for building a large-scale database,” Curran says. On the downside, the people who use it must be highly skilled.
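Curran's "dump first, model later" approach is often called schema-on-read. A minimal sketch of the idea in Python: heterogeneous raw records land in the lake as-is, and structure is imposed only at query time by whoever builds a view (the record fields here are invented for illustration):

```python
import json

# "Dump everything in raw": heterogeneous records enter the lake as-is,
# with no up-front data model (schema-on-read).
lake = [
    json.dumps({"type": "clickstream", "user": "u1", "page": "/home"}),
    json.dumps({"type": "sale", "user": "u1", "amount": 19.99}),
    json.dumps({"type": "sensor", "shelf": 42, "dwell_seconds": 7}),
]

def view_sales(raw_records):
    """A 'view' built as you go: impose structure only when you query."""
    for raw in raw_records:
        record = json.loads(raw)
        if record.get("type") == "sale":
            yield record["user"], record["amount"]

print(list(view_sales(lake)))  # [('u1', 19.99)]
```

This is also why Curran notes the users must be highly skilled: without a designed schema, every consumer of the lake has to know how to interpret the raw records.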

As part of its Intuit Analytics Cloud, Intuit has a data lake that includes clickstream user data and enterprise and third-party data, says Loconzolo, but the focus is on “democratizing” the tools surrounding it to enable business people to use it effectively. Loconzolo says one of his concerns with building a data lake in Hadoop is that the platform isn’t really enterprise-ready. “We want the capabilities that traditional enterprise databases have had for decades — monitoring, access control, encryption, securing the data and tracing the lineage of data from source to destination,” he says.

4. More predictive analytics

With big data, analysts have not only more data to work with, but also the processing power to handle large numbers of records with many attributes, Hopkins says. Traditional machine learning uses statistical analysis based on a sample of a total data set. “You now have the ability to do very large numbers of records and very large numbers of attributes per record” and that increases predictability, he says.

The combination of big data and compute power also lets analysts explore new behavioral data throughout the day, such as websites visited or location. Hopkins calls that “sparse data,” because to find something of interest you must wade through a lot of data that doesn’t matter. “Trying to use traditional machine-learning algorithms against this type of data was computationally impossible. Now we can bring cheap computational power to the problem,” he says. “You formulate problems completely differently when speed and memory cease being critical issues,” Abbott says. “Now you can find which variables are best analytically by thrusting huge computing resources at the problem. It really is a game changer.”

“To enable real-time analysis and predictive modeling out of the same Hadoop core, that’s where the interest is for us,” says Loconzolo. The problem has been speed, with Hadoop taking up to 20 times longer to get questions answered than did more established technologies. So Intuit is testing Apache Spark, a large-scale data processing engine, and its associated SQL query tool, Spark SQL. “Spark has this fast interactive query as well as graph services and streaming capabilities. It is keeping the data within Hadoop, but giving enough performance to close the gap for us,” Loconzolo says.

5. SQL on Hadoop: Faster, better

If you’re a smart coder and mathematician, you can drop data in and do an analysis on anything in Hadoop. That’s the promise — and the problem, says Mark Beyer, an analyst at Gartner. “I need someone to put it into a format and language structure that I’m familiar with,” he says. That’s where SQL for Hadoop products come in, although any familiar language could work, says Beyer. Tools that support SQL-like querying let business users who already understand SQL apply similar techniques to that data. SQL on Hadoop “opens the door to Hadoop in the enterprise,” Hopkins says, because businesses don’t need to make an investment in high-end data scientists and business analysts who can write scripts using Java, JavaScript and Python — something Hadoop users have traditionally needed to do.

These tools are nothing new. Apache Hive has offered a structured, SQL-like query language for Hadoop for some time. But commercial alternatives from Cloudera, Pivotal Software, IBM and other vendors not only offer much higher performance, but also are getting faster all the time. That makes the technology a good fit for “iterative analytics,” where an analyst asks one question, receives an answer, and then asks another one. That type of work has traditionally required building a data warehouse. SQL on Hadoop isn’t going to replace data warehouses, at least not anytime soon, says Hopkins, “but it does offer alternatives to more costly software and appliances for certain types of analytics.”
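The iterative ask-answer-ask workflow looks like ordinary SQL. A sketch of that loop, using Python's built-in SQLite purely as a stand-in for a SQL-on-Hadoop engine such as Hive (the engine and scale differ; the analyst's workflow is the same, and the table and data are invented):

```python
import sqlite3

# SQLite stands in for a SQL-on-Hadoop engine here; the point is the
# iterative query workflow, not the storage layer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, page TEXT, ms INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("u1", "/home", 120), ("u1", "/buy", 340),
    ("u2", "/home", 90),  ("u2", "/home", 110),
])

# Question 1: which pages are slowest on average?
q1 = conn.execute(
    "SELECT page, AVG(ms) FROM events GROUP BY page ORDER BY AVG(ms) DESC"
).fetchall()

# The answer prompts question 2: who is hitting the slowest page?
q2 = conn.execute(
    "SELECT user, COUNT(*) FROM events WHERE page = ? GROUP BY user",
    (q1[0][0],)
).fetchall()
print(q1, q2)
```

Each answer shapes the next question, which is exactly the pattern that used to require a purpose-built data warehouse.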

6. More, better NoSQL

Alternatives to traditional SQL-based relational databases, called NoSQL (short for “Not Only SQL”) databases, are rapidly gaining popularity as tools for use in specific kinds of analytic applications, and that momentum will continue to grow, says Curran. He estimates that there are 15 to 20 open-source NoSQL databases out there, each with its own specialization. For example, a NoSQL product with graph database capability, such as ArangoDB, offers a faster, more direct way to analyze the network of relationships between customers or salespeople than does a relational database.

Open-source NoSQL databases “have been around for a while, but they’re picking up steam because of the kinds of analyses people need,” Curran says. One PwC client in an emerging market has placed sensors on store shelving to monitor what products are there, how long customers handle them and how long shoppers stand in front of particular shelves. “These sensors are spewing off streams of data that will grow exponentially,” Curran says. “A NoSQL key-value pair database is the place to go for this because it’s special-purpose, high-performance and lightweight.”
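The key-value model Curran describes is the simplest of the NoSQL families. A toy sketch in Python shows why it fits a sensor stream: writes are cheap appends under a composite key, with no schema or joins involved (real stores such as Redis or Riak add persistence, replication and networking on top of the same idea; the keys and fields below are invented):

```python
# A minimal key-value store: the "special-purpose, lightweight" model
# described above, reduced to a dictionary.
store = {}

def put(key, value):
    store[key] = value

def get(key, default=None):
    return store.get(key, default)

# Sensor readings keyed by shelf and timestamp: each write is a cheap,
# schema-free insert.
put("shelf:42:2014-10-01T09:00", {"product": "cereal", "dwell_seconds": 7})
put("shelf:42:2014-10-01T09:05", {"product": "cereal", "dwell_seconds": 12})

print(get("shelf:42:2014-10-01T09:05")["dwell_seconds"])  # 12
```

The trade-off is that you give up ad hoc querying: you can only ask for values whose keys you can construct, which is why these stores stay special-purpose.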

7. Deep learning

Deep learning, a set of machine-learning techniques based on neural networking, is still evolving but shows great potential for solving business problems, says Hopkins. “Deep learning . . . enables computers to recognize items of interest in large quantities of unstructured and binary data, and to deduce relationships without needing specific models or programming instructions,” he says.

In one example, a deep learning algorithm that examined data from Wikipedia learned on its own that California and Texas are both states in the U.S. “It doesn’t have to be modeled to understand the concept of a state and country, and that’s a big difference between older machine learning and emerging deep learning methods,” Hopkins says.

“Big data will do things with lots of diverse and unstructured text using advanced analytic techniques like deep learning to help in ways that we only now are beginning to understand,” Hopkins says. For example, it could be used to recognize many different kinds of data, such as the shapes, colors and objects in a video — or even the presence of a cat within images, as a neural network built by Google famously did in 2012. “This notion of cognitive engagement, advanced analytics and the things it implies . . . are an important future trend,” Hopkins says.

8. In-memory analytics

The use of in-memory databases to speed up analytic processing is increasingly popular and highly beneficial in the right setting, says Beyer. In fact, many businesses are already leveraging hybrid transaction/analytical processing (HTAP) — allowing transactions and analytic processing to reside in the same in-memory database.
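The HTAP idea, in miniature: transactional writes and analytic reads hit the same in-memory database, with no ETL step between them. The sketch below uses SQLite's `:memory:` mode purely as a stand-in for a commercial in-memory engine, with invented order data:

```python
import sqlite3

# Transactions and analytics against the same in-memory database:
# the HTAP pattern in miniature.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, region TEXT, total REAL)")

# Transactional side: individual orders arrive and commit.
db.executemany("INSERT INTO orders VALUES (?, ?, ?)",
               [(1, "east", 20.0), (2, "west", 35.0), (3, "east", 15.0)])
db.commit()

# Analytical side: aggregate the very same rows, no copy to a warehouse.
totals = dict(db.execute(
    "SELECT region, SUM(total) FROM orders GROUP BY region"))
print(totals)  # {'east': 35.0, 'west': 35.0}
```

Beyer's caveat applies even to this toy: the aggregation only works because all the transactions already live in this one database, which is rarely true across a real enterprise.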

But there’s a lot of hype around HTAP, and businesses have been overusing it, Beyer says. For systems where the user needs to see the same data in the same way many times during the day — and there’s no significant change in the data — in-memory is a waste of money.

And while you can perform analytics faster with HTAP, all of the transactions must reside within the same database. The problem, says Beyer, is that most analytics efforts today are about putting transactions from many different systems together. “Just putting it all on one database goes back to this disproven belief that if you want to use HTAP for all of your analytics, it requires all of your transactions to be in one place,” he says. “You still have to integrate diverse data.”

Moreover, bringing in an in-memory database means there’s another product to manage, secure, and figure out how to integrate and scale.

For Intuit, the use of Spark has taken away some of the urge to embrace in-memory databases. “If we can solve 70% of our use cases with Spark infrastructure and an in-memory system could solve 100%, we’ll go with the 70% in our analytic cloud,” Loconzolo says. “So we will prototype, see if it’s ready and pause on in-memory systems internally right now.”

Staying one step ahead

With so many emerging trends around big data and analytics, IT organizations need to create conditions that will allow analysts and data scientists to experiment. “You need a way to evaluate, prototype and eventually integrate some of these technologies into the business,” says Curran.

“IT managers and implementers cannot use lack of maturity as an excuse to halt experimentation,” says Beyer. Initially, only a few people — the most skilled analysts and data scientists — need to experiment. Then those advanced users and IT should jointly determine when to deliver new resources to the rest of the organization. And IT shouldn’t necessarily rein in analysts who want to move ahead full-throttle. Rather, Beyer says, IT needs to work with analysts to “put a variable-speed throttle on these new high-powered tools.”

This article is taken from Robert L. Mitchell of ComputerWorld

Evolving Toward a Persistence Layer

One of the most confusing design problems is persistence. The need for an application to persist its internal state and data is so tremendous that there are likely tens – if not hundreds – of different technologies to address this single problem. Unfortunately, no technology is a magic bullet. Each application, and sometimes each component of the application, is unique in its own way – thus requiring a unique solution.

In this tutorial, I will teach you some best practices to help you determine which approach to take when working on future applications. I will briefly discuss some high level design concerns and principles, followed by a more detailed look at the Active Record design pattern and a few words about the Table Data Gateway design pattern.

Of course, I will not merely teach you the theory behind the design, but I will also guide you through an example that begins as random code and transforms into a structured persistence solution.

The oldest project I have to work on began in the year 2000. Back then, a team of programmers started a new project by evaluating different requirements, thought about the workloads the application would have to handle, tested different technologies and reached a conclusion: all the PHP code of the application, except the index.php file, should reside in a MySQL database. Their decision may sound outrageous today, but it was acceptable twelve years ago (OK… maybe not).

They started by creating their base tables, and then other tables for each web page. The solution worked… for a time. The original authors knew how to maintain it, but then, one by one, each author left, leaving the code base in the hands of newcomers.

Today, no programmer can understand this archaic system. Everything starts with a MySQL query from index.php. The result of that query returns some PHP code that executes even more queries. The simplest scenario involves at least five database tables. Naturally, there are no tests or specifications. Modifying anything is a no-go, and we simply have to rewrite the entire module if something goes wrong.

The original developers ignored the fact that a database should only contain data, not business logic or presentation. They mixed PHP and HTML code with MySQL and ignored high level design concepts.

As time passed, the new programmers needed to add additional features to the system while, at the same time, fixing old bugs. There was no way to continue using MySQL tables for everything, and everyone involved in maintaining the code agreed that its design was horribly flawed. So the new programmers evaluated different requirements, thought about the workloads the application would have to handle, tested different technologies and reached a conclusion: they decided to move as much code as possible into the final presentation. Again, this decision may sound outrageous today, but it was light years ahead of the previous outrageous design.

The developers adopted a templating framework and based the application around it, starting every new feature and module with a new template. It was easy; the template was descriptive and they knew where to find the code that performs a specific task. But that’s how they ended up with template files containing the engine’s Domain Specific Language (DSL), HTML, PHP and of course MySQL queries.

Today, my team just watches and wonders. It is a miracle that many of the views actually work. It can take a hefty amount of time just to determine how information gets from the database to the view. Like its predecessor, it’s all a big mess!

Those developers ignored the fact that a view should not contain business or persistence logic. They mixed PHP and HTML code with MySQL and ignored high level design concepts.

High level schema

All applications should concentrate on respecting a clean, high level design. This is not always achievable, but it should be a high priority. A good high level design has well-isolated business logic. Object creation, persistence, and delivery are outside of the core and dependencies point only toward the business logic.

Isolating the business logic opens the door to great possibilities, and everything becomes somewhat of a plugin, if the external dependencies always point towards the business logic. For example, you could swap the heavy MySQL database with a lightweight SQLite3 database.

  • Imagine being able to drop your current MVC framework and replacing it with another, without touching the business logic.
  • Imagine delivering the results of your application through a third party API and not over HTTP, or changing any third party technology you use today (except the programming language of course) without touching the business logic (or without much hassle).
  • Imagine making all these changes and your tests would still pass.

To better identify the problems with a bad, albeit working, design, I will start with a simple example of, you guessed it, a blog. Throughout this tutorial, I will follow some test-driven development (TDD) principles and make the tests easily understandable, even if you don’t have TDD experience. Let’s imagine that you use an MVC framework. When saving a blog post, a controller named BlogPostController executes a save() method. This method connects to an SQLite database to store the blog post.

Let’s create a folder called Data in our code’s folder and browse to that directory in the console. Create a database and a table, like this:
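The original commands are not reproduced in this excerpt; a sketch of what they might have looked like (the database file name and column names are assumptions based on the rest of the article):

```shell
# Create the Data directory and an SQLite database with a BlogPosts table.
# Columns (title, content, timestamp) are assumed for illustration.
mkdir -p Data
sqlite3 Data/blog.db "CREATE TABLE IF NOT EXISTS BlogPosts (
    title     TEXT,
    content   TEXT,
    timestamp TEXT
);"

# List the tables to confirm it worked.
sqlite3 Data/blog.db ".tables"
```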

Our save() method gets the values from the form as an array, called $data:
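The original listing is omitted from this excerpt. A minimal sketch of what such a method might look like (the class name, table schema, and column names are assumptions drawn from the surrounding text):

```php
<?php

class BlogPostController
{
    // "Before" version: the persistence logic lives right inside the
    // controller, which is exactly the coupling this article removes.
    function save($data)
    {
        $sqlite = new SQLite3('Data/blog.db');
        $query = "INSERT INTO BlogPosts (title, content, timestamp) "
               . "VALUES ('" . $data['title'] . "', '"
               . $data['content'] . "', '"
               . $data['timestamp'] . "')";
        $sqlite->exec($query);
    }
}
```

Calling it from another class with a predefined array, for example $data = array('title' => 'First post', 'content' => 'Hello', 'timestamp' => '2013-02-15 10:00:00'), should then produce a row in the BlogPosts table.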

This code works, and you can verify it by calling it from another class, passing a predefined $data array, like this:

The content of the $data variable was indeed saved in the database:

A characterization test describes and verifies the current behavior of preexisting code. It is most frequently used to characterize legacy code, and it makes refactoring that code much easier.

A characterization test can test a module, a unit, or go all the way from the UI to the database; it all depends on what we want to test. In our case, such a test should exercise the controller and verify the contents of the database. This is not a typical unit, functional, or integration test, and it usually cannot be associated with either of those testing levels.

Characterization tests are a temporary safety net, and we typically delete them after the code is properly refactored and unit tested. Here is an implementation of a test, placed in the Test folder:
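The test listing itself is not reproduced in this excerpt; based on the description that follows, a PHPUnit-style sketch (file paths and field names are assumptions) might look like:

```php
<?php

require_once __DIR__ . '/../BlogPostController.php';

class BlogPostControllerTest extends PHPUnit_Framework_TestCase
{
    function testSave()
    {
        $data = array(
            'title'     => 'Characterization test',
            'content'   => 'Some content',
            'timestamp' => '2013-02-15 12:00:00',
        );

        $controller = new BlogPostController();
        $controller->save($data);

        // Read the row back and compare it to the predefined $data array.
        $sqlite = new SQLite3(__DIR__ . '/../Data/blog.db');
        $row = $sqlite->querySingle(
            'SELECT title, content, timestamp FROM BlogPosts LIMIT 1',
            true
        );
        $this->assertEquals($data, $row);

        // Clean the BlogPosts table each time the test runs.
        $sqlite->exec('DELETE FROM BlogPosts');
    }
}
```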

This test creates a new controller object and executes its save() method. The test then reads the information from the database and compares it with the predefined $data[] array. We perform this comparison with the $this->assertEquals() method, an assertion that expects its two parameters to be equal; if they differ, the test fails. We also clean the BlogPosts database table each time we run the test.

Legacy code is untested code. – Michael Feathers

With our test up and running, let’s clean up the code a little. Open the database using its full directory path and use sprintf() to compose the query string. This results in much simpler code:

Gateway Pattern

We recognize that our code needs to be moved from the controller to the business logic and persistence layer, and the Gateway Pattern can help us get started down that path. Here is the revised testSave() method:

This represents how we want to use the save() method on the controller. We expect the controller to call a method named persist($blogPostObject) on the gateway object. Let’s change our BlogPostController to do that:
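The revised controller listing is omitted here; a sketch of what it might look like (constructor injection with a default gateway is an assumption, chosen to match the “either supplied or instantiated” behavior described in the article):

```php
<?php

class BlogPostController
{
    private $gateway;

    // The gateway is either supplied (e.g. a mock in tests) or
    // instantiated as the production SQLite implementation.
    function __construct($gateway = null)
    {
        $this->gateway = $gateway ? $gateway : new SqlitePost();
    }

    function save($data)
    {
        $blogPost = new BlogPost(
            $data['title'], $data['content'], $data['timestamp']
        );

        // The controller no longer knows HOW persistence happens.
        $this->gateway->persist($blogPost);
    }
}
```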

Nice! Our BlogPostController became much simpler. It uses the gateway (either supplied or instantiated) to persist the data by calling its persist() method. There is absolutely no knowledge about how the data is persisted; the persistence logic became modular.

In the previous test, we created the controller with a mock persistence object, ensuring that data never gets written to the database when running the test. In production code, the controller creates its own persisting object to persist the data using a SqlitePost object. A mock is an object that acts like its real counterpart, but it doesn’t execute the real code.

Now let’s retrieve a blog post from the data store. It’s just as easy as saving data, but please note that I refactored the test a bit.

And the implementation in the BlogPostController is just a one-statement method:

Isn’t this cool? The BlogPost class is now part of the business logic (remember the high level design schema from above). The UI/MVC creates BlogPost objects and uses concrete Gateway implementations to persist the data. All dependencies point to the business logic.

There’s only one step left: create a concrete implementation of Gateway. Following is the SqlitePost class:
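The listing is not included in this excerpt; a minimal sketch of a concrete implementation (the Gateway interface name and the exact SQL are assumptions) could be:

```php
<?php

class SqlitePost implements Gateway
{
    private $sqlite;

    function __construct($dbPath = 'Data/blog.db')
    {
        $this->sqlite = new SQLite3($dbPath);
    }

    // Persist a BlogPost object as a row in the BlogPosts table.
    function persist($blogPost)
    {
        $query = sprintf(
            "INSERT INTO BlogPosts (title, content, timestamp) "
            . "VALUES ('%s', '%s', '%s')",
            SQLite3::escapeString($blogPost->title),
            SQLite3::escapeString($blogPost->content),
            SQLite3::escapeString($blogPost->timestamp)
        );
        $this->sqlite->exec($query);
    }
}
```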

Note: The test for this implementation is also available in the source code, but, due to its complexity and length, I did not include it here.

Active Record is one of the most controversial patterns. Some embrace it (like Rails and CakePHP), and others avoid it. Many Object Relational Mapping (ORM) applications use this pattern to save objects in tables. Here is its schema:

Active record pattern

As you can see, Active Record-based objects can persist and retrieve themselves. This is usually achieved by extending an ActiveRecordBase class, a class that knows how to work with the database.

The biggest problem with Active Record is the extends dependency. As we all know, inheritance is the strongest type of dependency, and it’s best to avoid it most of the time.

Before we go further, here is where we are right now:

Gateway in high level schema

The gateway interface belongs to the business logic, and its concrete implementations belong to the persistence layer. Our BlogPostController has two dependencies, both pointing toward the business logic: the SqlitePost gateway and BlogPost class.


If we were to follow the Active Record pattern exactly as it is presented by Martin Fowler in his 2003 book, Patterns of Enterprise Application Architecture, then we would need to move the SQL queries into the BlogPost class. This, however, has the problem of violating both the Dependency Inversion Principle and the Open Closed Principle. The Dependency Inversion Principle states that:

  • High-level modules should not depend on low-level modules. Both should depend on abstractions.
  • Abstractions should not depend upon details. Details should depend upon abstractions.

And the Open Closed Principle states: software entities (classes, modules, functions, etc.) should be open for extension but closed for modification. We will take a more interesting approach and integrate the gateway into our Active Record solution.

If you try to do this on your own, you have probably already realized that adding the Active Record pattern to the code will mess things up. For this reason, I opted to disable the controller and SqlitePost tests so we can concentrate only on the BlogPost class. The first steps are to make BlogPost load itself by making its constructor private, and to connect it to the gateway interface. Here is the first version of the BlogPostTest file:

It tests that a blog post is correctly initialized and that it can have a gateway if set. It is a good practice to use multiple asserts when they all test the same concept and logic.

Our second test has several assertions, but all of them refer to the same common concept of an empty blog post. Of course, the BlogPost class has also been modified:

It now has a load() method that returns a new object with a valid gateway. From this point on, we will continue with the implementation of a load($title) method to create a new BlogPost with information from the database. For easy testing, I implemented an InMemoryPost class for persistence. It just keeps a list of objects in memory and returns information as desired:

Next, I realized that the initial idea of connecting the BlogPost to a gateway via a separate method was useless. So, I modified the tests, accordingly:

As you can see, I radically changed the way BlogPost is used.

The load() method checks the $content parameter for a value and creates a new BlogPost if a value was supplied. If not, the method tries to find a blog post with the given title. If a post is found, it is returned; if there is none, the method creates an empty BlogPost object.
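As a sketch, the load() method just described might look like this (the constructor shape and default parameter values are assumptions):

```php
<?php

class BlogPost
{
    private $title;
    private $content;
    private $gateway;

    // The constructor is private; load() is the only way in.
    private function __construct($title, $content, $gateway)
    {
        $this->title   = $title;
        $this->content = $content;
        $this->gateway = $gateway;
    }

    static function load($title, $content = null, $gateway = null)
    {
        $gateway = $gateway ? $gateway : new SqlitePost();

        // A supplied $content means we are creating a brand new post.
        if ($content) {
            return new self($title, $content, $gateway);
        }

        // Otherwise, try to find an existing post with the given title.
        $found = $gateway->retrieve($title);
        if ($found) {
            return new self($found['title'], $found['content'], $gateway);
        }

        // No match: fall back to an empty BlogPost object.
        return new self(null, null, $gateway);
    }
}
```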

In order for this code to work, we will also need to change how the gateway works. Our implementation needs to return an associative array with title, content, and timestamp elements instead of the object itself. This is a convention I’ve chosen. You may find other variants, like a plain array, more attractive. Here are the modifications in SqlitePostTest:

And the implementation changes are:

We are almost done. Add a persist() method to the BlogPost and call all the newly implemented methods from the controller. Here is the persist() method that will just use the gateway’s persist() method:

And the controller:

The BlogPostController became so simple that I removed all of its tests. It simply calls the BlogPost object’s persist() method. Naturally, you’ll want to add tests if, and when, you have more code in the controller. The code download still contains a test file for the BlogPostController, but its content is commented out.

You’ve seen two different persistence implementations: the Gateway and Active Record patterns. From this point, you can implement an ActiveRecordBase abstract class to extend for all your classes that need persistence. This abstract class can use different gateways in order to persist data, and each implementation can even use different logic to fit your needs.

But this is just the tip of the iceberg. There are many other patterns, such as the Proxy Pattern, which are closely related to persistence; each pattern works for a particular situation. I recommend that you always implement the simplest solution first, and then implement another pattern when your needs change.

An Introduction to Securing your Linux VPS


Taking control of your own Linux server is an opportunity to try new things and leverage the power and flexibility of a great platform. However, Linux server administrators must take the same caution that is appropriate with any network-connected machine to keep it secure and safe.

There are many different security topics that fall under the general category of “Linux security” and many opinions as to what an appropriate level of security looks like for a Linux server.

The main thing to take away from this is that you will have to decide for yourself what security protections are necessary. Before you do this, you should be aware of the risks and the trade-offs, and decide on the balance between usability and security that makes sense for you.

This article is meant to help orient you with some of the most common security measures to take in a Linux server environment. This is not an exhaustive list, and does not cover recommended configurations, but it will provide links to more thorough resources and discuss why each component is an important part of many systems.

Blocking Access with Firewalls

One of the easiest steps to recommend to all users is to enable and configure a firewall. Firewalls act as a barrier between the general traffic of the internet and your machine. They examine traffic headed in and out of your server and decide whether the information should be delivered.

They do this by checking the traffic in question against a set of rules that are configured by the user. Usually, a server will only be using a few specific networking ports for legitimate services. The rest of the ports are unused, and should be safely protected behind a firewall, which will deny all traffic destined for these locations.

This allows you to drop data that you are not expecting and, in some cases, even place conditions on how your real services are used. Sane firewall rules provide a good foundation for network security.

There are quite a few firewall solutions available. We’ll briefly discuss some of the more popular options below.


UFW

UFW stands for Uncomplicated Firewall. Its goal is to provide good protection without the complicated syntax of other solutions.

UFW, as well as most Linux firewalls, is actually a front-end to control the netfilter firewall included with the Linux kernel. This is usually a simple firewall to use for people not already familiar with Linux firewall solutions and is generally a good choice.

You can learn how to enable and configure the UFW firewall and find out more by clicking this link.
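To give a feel for the syntax, a first ruleset on a fresh server might look something like this (a sketch only; these commands require root, and you should allow SSH before enabling the firewall on a remote machine):

```shell
# Deny everything inbound by default, allow outbound traffic,
# then open only the ports you actually serve on.
ufw default deny incoming
ufw default allow outgoing
ufw allow 22/tcp    # SSH -- open this first or you may lock yourself out
ufw allow 80/tcp    # example: a web server
ufw enable
ufw status verbose
```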


IPTables

Perhaps the most well-known Linux firewall solution is iptables. IPTables is another component used to administer the netfilter firewall included in the Linux kernel. It has been around for a long time and has undergone intense security audits to ensure its safety. There is a version of iptables called ip6tables for creating IPv6 restrictions.

You will likely come across iptables configurations during your time administering Linux machines. The syntax can be complicated to grasp at first, but it is an incredibly powerful tool that can be configured with very flexible rule sets.

You can learn more about how to implement some iptables firewall rules on Ubuntu or Debian systems here, or learn how to use iptables on CentOS/Fedora/RHEL-based distros here.
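To give a flavor of the rule syntax, here is a hypothetical minimal ruleset in iptables-restore format (the ports and policies are examples only, not a recommendation):

```
# Example /etc/iptables/rules.v4 (sketch)
*filter
:INPUT DROP [0:0]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [0:0]

# Accept loopback traffic and packets belonging to established connections
-A INPUT -i lo -j ACCEPT
-A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT

# Allow SSH and HTTP from anywhere
-A INPUT -p tcp --dport 22 -j ACCEPT
-A INPUT -p tcp --dport 80 -j ACCEPT
COMMIT
```

Loading such a file with iptables-restore replaces the running ruleset atomically.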


IP6Tables

As mentioned above, iptables is used to manipulate the tables that contain IPv4 rules. If you have IPv6 enabled on your server, you will also need to pay attention to the IPv6 equivalent: ip6tables.

The netfilter firewall that is included in the Linux kernel keeps IPv4 and IPv6 traffic completely separate. These are stored in different tables. The rules that dictate the ultimate fate of a packet are determined by the protocol version that is being used.

What this means for the server’s administrator is that a separate ruleset must be maintained when version 6 is enabled. The ip6tables command shares the same syntax as the iptables command, so implementing the same set of restrictions in the version 6 table is usually straightforward. You must be sure to match traffic directed at your IPv6 addresses, however, for this to work correctly.


NFTables

Although iptables has long been the standard for firewalls in a Linux environment, a new firewall called nftables has recently been added into the Linux kernel. This is a project by the same team that makes iptables, and is intended to eventually replace iptables.

The nftables firewall attempts to implement more readable syntax than that found in its iptables predecessor, and implements IPv4 and IPv6 support in the same tool. While most versions of Linux at this time do not ship with a kernel new enough to implement nftables, it will soon be very commonplace, and you should try to familiarize yourself with its usage.
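For comparison, a hypothetical /etc/nftables.conf fragment expressing the same idea as a basic inbound policy shows how both address families share one ruleset:

```
# Example nftables ruleset (sketch); the "inet" family covers
# IPv4 and IPv6 at once, so no separate ip6tables-style tool is needed.
table inet filter {
    chain input {
        type filter hook input priority 0; policy drop;
        iif "lo" accept
        ct state established,related accept
        tcp dport { 22, 80 } accept
    }
}
```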

Using SSH to Securely Login Remotely

When administering a server where you do not have local access, you will need to log in remotely. The standard, secure way of accomplishing this on a Linux system is through a protocol called SSH, which stands for secure shell.

SSH provides end-to-end encryption, the ability to tunnel insecure traffic over a secure connection, X-forwarding (graphical user interface over a network connection), and much more. Basically, if you do not have access to a local connection or out-of-band management, SSH should be your primary way of interacting with your machine.

While the protocol itself is very secure and has undergone extensive research and code review, your configuration choices can either aid or hinder the security of the service. We will discuss some options below.

Password vs SSH-Key Logins

SSH has a flexible authentication model that allows you to sign in using a number of different methods. The two most popular choices are password and SSH-key authentication.

While password authentication is probably the most natural model for most users, it is also the less secure of these two choices. Password logins allow a potential intruder to continuously guess passwords until a successful combination is found. This is known as brute-forcing and can easily be automated by would-be attackers with modern tools.

SSH-keys, on the other hand, operate by generating a secure key pair. A public key is created as a type of test to identify a user. It can be shared publicly without issues, and cannot be used for anything other than identifying a user and allowing a login to the user with the matching private key. The private key should be kept secret and is used to pass the test of its associated public key.

Basically, you can add your public SSH key on a server, and it will allow you to log in by using the matching private key. These keys are so complex that brute-forcing is not practical. Furthermore, you can optionally add a long passphrase to your key, which adds even more security.
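Generating such a key pair is typically a single command (a sketch; the file name and comment are arbitrary, and in practice you would accept the default path under ~/.ssh and set a passphrase):

```shell
# Create a 4096-bit RSA key pair; -N "" sets an empty passphrase
# for demonstration only -- prefer a long passphrase in real use.
ssh-keygen -t rsa -b 4096 -f ./id_rsa_demo -N "" -C "demo key"

# The private half stays secret; the .pub half is what you append
# to ~/.ssh/authorized_keys on the server.
ls id_rsa_demo id_rsa_demo.pub
```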

To learn more about how to use SSH click here, and check out this link to learn how to set up SSH keys on your server.

Implement fail2ban to Ban Malicious IP Addresses

One step that will help with the general security of your SSH configuration is to implement a solution like fail2ban. Fail2ban is a service that monitors log files in order to determine if a remote system is likely not a legitimate user, and then temporarily bans future traffic from the associated IP address.

Setting up a sane fail2ban policy can allow you to flag computers that are continuously trying to log in unsuccessfully and add firewall rules to drop traffic from them for a set period of time. This is an easy way of hindering commonly used brute-force methods, because attackers will have to take a break for quite a while when banned. This is usually enough to discourage further brute-force attempts.
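Such a policy usually lives in a jail file; a hypothetical /etc/fail2ban/jail.local fragment might look like this (the section name, log path, and values are illustrative and vary between versions and distributions):

```
[ssh]
enabled  = true
port     = ssh
filter   = sshd
logpath  = /var/log/auth.log
maxretry = 5      ; failed attempts allowed before a ban
findtime = 600    ; ...counted within this many seconds
bantime  = 3600   ; length of the ban, in seconds
```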

You can learn how to implement a fail2ban policy on Ubuntu here. There are similar guides for Debian and CentOS here.

Implement an Intrusion Detection System to Detect Unauthorized Entry

One important consideration to keep in mind is developing a strategy for detecting unauthorized usage. You may have preventative measures in place, but you also need to know if they’ve failed or not.

An intrusion detection system, also known as an IDS, catalogs configuration and file details when in a known-good state. It then runs comparisons against these recorded states to find out if files have been changed or settings have been modified.

There are quite a few intrusion detection systems. We’ll go over a few below.


Tripwire

One of the most well-known IDS implementations is Tripwire. Tripwire compiles a database of system files and protects its configuration files and binaries with a set of keys. After configuration details are chosen and exceptions are defined, subsequent runs notify you of any alterations to the files that it monitors.

The policy model is very flexible, allowing you to shape its properties to your environment. You can then configure tripwire runs via a cron job and even implement email notifications in the event of unusual activity.

Learn more about how to implement tripwire here.


Aide

Another option for an IDS is Aide. Similar to Tripwire, Aide operates by building a database and comparing the current system state to the known-good values it has stored. When a discrepancy arises, it can notify the administrator of the problem.

Aide and tripwire both offer similar solutions to the same problem. Check out the documentation and try out both solutions to find out which you like better.

For a guide on how to use Aide as an IDS, check here.


psad

The psad tool is concerned with a different portion of the system than the tools listed above. Instead of monitoring system files, psad keeps an eye on the firewall logs to try to detect malicious activity.

If a user is trying to probe for vulnerabilities with a port scan, for instance, psad can detect this activity and dynamically alter the firewall rules to lock out the offending user. This tool can register different threat levels and base its response on the severity of the problem. It can also optionally email the administrator.

To learn how to use psad as a network IDS, follow this link.


Bro

Another option for a network-based IDS is Bro. Bro is actually a network monitoring framework that can be used as a network IDS or for other purposes like collecting usage stats, investigating problems, or detecting patterns.

The Bro system is divided into two layers. The first layer monitors activity and generates what it considers events. The second layer runs the generated events through a policy framework that dictates what should be done, if anything, with the traffic. It can generate alerts, execute system commands, simply log the occurrence, or take other paths.

To find out how to use Bro as an IDS, click here.


RKHunter

While not technically an intrusion detection system, rkhunter operates on many of the same principles as host-based intrusion detection systems in order to detect rootkits and known malware.

While viruses are rare in the Linux world, malware and rootkits do exist that can compromise your box or allow continued access to a successful exploiter. RKHunter downloads a list of known exploits and then checks your system against that database. It also alerts you if it detects unsafe settings in some common applications.

You can check out this article to learn how to use RKHunter on Ubuntu.

General Security Advice

While the above tools and configurations can help you secure portions of your system, good security does not come from just implementing a tool and forgetting about it. Good security manifests itself in a certain mindset and is achieved through diligence, scrutiny, and engaging in security as a process.

There are some general rules that can help set you in the right direction in regards to using your system securely.

Pay Attention to Updates and Update Regularly

Software vulnerabilities are found all of the time in just about every kind of software that you might have on your system. Distribution maintainers generally do a good job of keeping up with the latest security patches and pushing those updates into their repositories.

However, having security updates available in the repository does your server no good if you have not downloaded and installed the updates. Although many servers benefit from relying on stable, well-tested versions of system software, security patches should not be put off and should be considered critical updates.

Most distributions provide security mailing lists and separate security repositories that let you download and install only security patches.

Take Care When Downloading Software Outside of Official Channels

Most users will stick with the software available from the official repositories for their distribution, and most distributions offer signed packages. Users generally can trust the distribution maintainers and focus their concern on the security of software acquired outside of official channels.

You may choose to trust packages from your distribution or software that is available from a project’s official website, but be aware that unless you are auditing each piece of software yourself, there is risk involved. Most users feel that this is an acceptable level of risk.

On the other hand, software acquired from random repositories and PPAs that are maintained by people or organizations that you don’t recognize can be a huge security risk. There are no set rules, and the majority of unofficial software sources will likely be completely safe, but be aware that you are taking a risk whenever you trust another party.

Make sure you can explain to yourself why you trust the source. If you cannot do this, consider weighing your security risk as more of a concern than the convenience you’ll gain.

Know your Services and Limit Them

Although the entire point of running a server is likely to provide services that you can access, limit the services running on your machine to those that you use and need. Consider every enabled service to be a possible threat vector and try to eliminate as many threat vectors as you can without affecting your core functionality.

This means that if you are running a headless (no monitor attached) server and don’t run any graphical (non-web) programs, you should disable and probably uninstall your X display server. Similar measures can be taken in other areas. No printer? Disable the “lp” service. No Windows network shares? Disable the “samba” service.

You can discover which services you have running on your computer through a variety of means. This article covers how to detect enabled services under the “create a list of requirements” section.

Do Not Use FTP; Use SFTP Instead

This might be a hard one for many people to come to terms with, but FTP is a protocol that is inherently insecure. All authentication is sent in plain-text, meaning that anyone monitoring the connection between your server and your local machine can see your login details.

There are only a few instances where FTP is probably okay to implement. If you are running an anonymous, public, read-only download mirror, FTP is a decent choice. Another case where FTP is an okay choice is when you are simply transferring files between two computers that are behind a NAT-enabled firewall, and you trust your network is secure.

In almost all other cases, you should use a more secure alternative. The SSH suite includes an alternative protocol called SFTP that, on the surface, operates in a similar way, but is built on the same security as the SSH protocol.

This allows you to transfer information to and from your server in the same way that you would traditionally use FTP, but without the risk. Most modern FTP clients can also communicate with SFTP servers.
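A session looks almost identical to classic FTP (the host name and paths below are hypothetical):

```shell
# Connect over SSH (port 22 by default); authentication uses your
# SSH keys or password, never plain-text FTP credentials.
sftp deploy@example.com
# Once connected, the familiar commands work:
#   sftp> put local-report.txt /srv/reports/
#   sftp> get /srv/logs/app.log
#   sftp> bye
```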

To learn how to use SFTP to transfer files securely, check out this guide.

Implement Sensible User Security Policies

There are a number of steps that you can take to better secure your system when administering users.

One suggestion is to disable root logins. Since the root user is present on any POSIX-like system and is an all-powerful account, it is an attractive target for many attackers. Disabling root logins is often a good idea after you have configured sudo access, or if you are comfortable using the su command. Many people disagree with this suggestion, but examine whether it is right for you.

It is possible to disable remote root logins within the SSH daemon; to disable local logins, you can add restrictions in the /etc/securetty file. You can also set the root user’s shell to a non-shell to disable root shell access, and set up PAM rules to restrict root logins as well. RedHat has a great article on how to disable root logins.
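In OpenSSH, disabling remote root logins is a single directive in the SSH daemon’s configuration (a fragment; the AllowUsers names are placeholders, and the daemon must be reloaded after editing):

```
# /etc/ssh/sshd_config (fragment)
PermitRootLogin no

# Optionally, also whitelist the accounts allowed to log in at all:
AllowUsers deploy admin
```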

Another good policy to implement with user accounts is to create unique accounts for each user and service, and to give them only the bare minimum permissions to get the job done. Lock down everything they don’t need access to, and take away all privileges short of crippling them.

This is an important policy because if one user or service gets compromised, it doesn’t lead to a domino effect that allows the attacker to gain access to even more of the system. This system of compartmentalization helps you to isolate problems, much like a system of bulkheads and watertight doors can help prevent a ship from sinking when there is a hull breach.

In a similar vein to the services policies we discussed above, you should also take care to disable any user accounts that are no longer necessary. This may happen when you uninstall software, or if a user should no longer have access to the system.

Pay Attention to Permission Settings

File permissions are a huge source of frustration for many users. Finding a balance for permissions that allow you to do what you need to do while not exposing yourself to harm can be difficult and demands careful attention and thought in each scenario.

Setting up a sane umask policy (the property that defines default permissions for new files and directories) can go a long way in creating good defaults. You can learn about how permissions work and how to adjust your umask value here.
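As a quick illustration of how umask shapes defaults (a sketch using temporary files; 027 is a common restrictive choice):

```shell
# With a umask of 027: files default to 666 & ~027 = 640,
# directories to 777 & ~027 = 750.
umask 027
demo_dir=$(mktemp -d)
touch "$demo_dir/report.txt"
mkdir "$demo_dir/logs"
stat -c '%a' "$demo_dir/report.txt"   # prints 640
stat -c '%a' "$demo_dir/logs"         # prints 750
rm -rf "$demo_dir"
```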

In general, you should think twice before setting anything to be world-writable, especially if it is accessible in any way from the internet. This can have extreme consequences. Additionally, you should not set the SGID or SUID bits in permissions unless you absolutely know what you are doing. Also, check that your files have a valid owner and group.

Your file permissions settings will vary greatly based on your specific usage, but you should always try to see if there is a way to get by with fewer permissions. This is one of the easiest things to get wrong and an area where there is a lot of bad advice floating around on the internet.
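One low-privilege way to audit for risky permissions is find. The sketch below builds a throwaway directory so the search has something to flag; on a real server you would scan from /, which takes time and may require root:

```shell
# Build a demo directory containing one deliberately world-writable file.
audit_dir=$(mktemp -d)
touch "$audit_dir/safe.txt" "$audit_dir/open.txt"
chmod 666 "$audit_dir/open.txt"

# World-writable regular files (others-write bit set):
find "$audit_dir" -type f -perm -0002

# Files with the SUID bit set (none expected here):
find "$audit_dir" -type f -perm -4000

# Files lacking a valid owner or group:
find "$audit_dir" \( -nouser -o -nogroup \)

rm -rf "$audit_dir"
```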

Regularly Check for Malware on your Servers

While Linux is generally less targeted by malware than Windows, it is by no means immune to malicious software. In conjunction with implementing an IDS to detect intrusion attempts, scanning for malware can help identify traces of activity that indicate that illegitimate software is installed on your machine.

There are a number of malware scanners available for Linux systems that can be used to regularly validate the integrity of your servers. Linux Malware Detect, also known as maldet or LMD, is one popular option that can be easily installed and configured to scan for known malware signatures. It can be run manually to perform one-off scans and can also be daemonized to run regularly scheduled scans. Reports from these scans can be emailed to the server administrators.

How To Secure the Specific Software you are Using

Although this guide is not large enough to go through the specifics of securing every kind of service or application, there are many tutorials and guidelines available online. You should read the security recommendations of every project that you intend to implement on your system.

Furthermore, popular server software like web servers or database management systems have entire websites and databases devoted to security. In general, you should read up on and secure every service before putting it online.

You can check our security section for more specific advice for the software you are using.


Conclusion

You should now have a decent understanding of general security practices you can implement on your Linux server. While we’ve tried hard to mention many areas of high importance, at the end of the day, you will have to make many decisions on your own. When you administer a server, you have to take responsibility for your server’s security.

This is not something that you can configure in one quick spree in the beginning; it is an ongoing process of auditing your system, implementing solutions, evaluating logs and alerts, and reassessing your needs. You need to be vigilant in protecting your system and always evaluate and monitor the results of your solutions.

Changing Password of Specific User Account In Linux

Rules for changing passwords for user accounts

  1. A normal user may only change the password for his/her own account.
  2. The superuser (root) may change the password for any account.
  3. The passwd command can also change the validity period of an account’s password.

First, log in as the root user, using either the sudo -s or the su - command. To change the password of a specific user account, use the following syntax:

passwd userNameHere

To change the password for user called vivek, enter:
# passwd vivek
Sample outputs:

Change Users Local Linux Password Command Line

To see password status of any user account, enter:
# passwd -S userNameHere
# passwd -S vivek

Sample outputs:

vivek P 05/05/2012 0 99999 7 -1

The status information consists of 7 fields as follows:

  1. vivek : Account login name (username).
  2. P : Indicates whether the account has a locked password (L), no password (NP), or a usable password (P).
  3. 05/05/2012 : Date of the last password change.
  4. 0 : Minimum password age, in days.
  5. 99999 : Maximum password age, in days.
  6. 7 : Password expiry warning period, in days.
  7. -1 : Inactivity period for the password (see the chage command for more information).

To get more info about password aging for a specific user called vivek, enter:
# chage -l vivek
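Building on that, password aging for a specific account can also be set with chage (a sketch; this requires root, and the values are illustrative):

```shell
# Force a password change every 90 days, warn 14 days in advance,
# and allow at most one change per day.
chage -M 90 -W 14 -m 1 vivek

# Review the resulting policy:
chage -l vivek
```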