PlaNet: Classifying Planets with Neural Networks

After completing Jeremy Howard’s Deep Learning course, I wanted to put my skills to the test on something fun and interesting, so I set out to train a neural network that classified planets. I’m happy with the end result (and its cheeky name): plaNet.

I wanted to classify major solar system planets based on salient features. The issue with this approach is that there isn’t very much data to train a neural network on. I scraped AstroBin for amateur photos of planets, but I found that most of them simply looked like smudges, and the outer planets were either unrecognizable or missing entirely.

Some of the unaugmented training data used for Jupiter, mostly from NASA.

To get around these issues, I based my approach on two methods: data augmentation of my small dataset, and fine-tuning an existing neural network. Data augmentation is simple in Keras, so I dramatically increased my dataset size simply by applying random transformations to my initial images. For the network itself, I fine-tuned VGG’s ImageNet model (a classic approach to transfer learning): I dropped the final fully-connected layers, which were trained to classify everyday objects, and kept the convolutional layers. These layers are great at identifying features — edges, shapes, and patterns — that can still be found in my images of planets. I then pre-calculated the output of the convolutional layers on the initial and augmented datasets in order to easily combine them into one feature set, and trained a new fully-connected classifier on top to a relatively solid test accuracy (~90%). I used a high dropout rate to avoid overfitting to my small training dataset, and it seems to have worked.
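
For the curious, here’s a minimal sketch of the pipeline in Keras (2.x API). This isn’t my exact notebook; the directory layout, augmentation parameters, and layer sizes below are illustrative.

import numpy as np
from keras.applications.vgg16 import VGG16
from keras.layers import Dense, Dropout, Flatten
from keras.models import Sequential
from keras.preprocessing.image import ImageDataGenerator

# Random flips, rotations, shifts, and zooms multiply the effective dataset size.
datagen = ImageDataGenerator(rotation_range=180, width_shift_range=0.1,
                             height_shift_range=0.1, zoom_range=0.2,
                             horizontal_flip=True, vertical_flip=True,
                             rescale=1. / 255)
batches = datagen.flow_from_directory('data/train', target_size=(224, 224),
                                      batch_size=32, shuffle=False)

# Keep only VGG16's ImageNet convolutional layers and pre-calculate their
# output once, so the small classifier on top trains quickly.
conv_layers = VGG16(include_top=False, input_shape=(224, 224, 3))
features = conv_layers.predict_generator(batches, steps=len(batches))

# A small fully-connected head with heavy dropout to limit overfitting.
top_model = Sequential([
    Flatten(input_shape=features.shape[1:]),
    Dense(256, activation='relu'),
    Dropout(0.6),
    Dense(batches.num_classes, activation='softmax'),
])
top_model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])

# One-hot labels in file order (shuffle=False above keeps them aligned with
# the features), shuffled together before the validation split.
labels = np.eye(batches.num_classes)[batches.classes]
perm = np.random.permutation(len(labels))
top_model.fit(features[perm], labels[perm], epochs=20, validation_split=0.2)

With the convolutional output cached, each epoch is just a pass over a tiny dense network, which is what makes iterating on the classifier so fast.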

I want to highlight the simplicity of this approach. Because we’re simply fine-tuning a pre-trained neural network, we can access what is essentially the state of the art in deep learning with just a few lines of code and a small amount of computing time and power (compared to training an entire network from scratch). My work was mostly in preparing the datasets and fine-tuning different parameters until I was happy with the results. If you haven’t already, I encourage you to take a look at the course online. Many thanks to Jeremy Howard for giving me a practical approach to something I’ve only had theoretical backing for so far.

Installing Python and TensorFlow on Yeti

UPDATED 12/24/2016 to support TensorFlow r0.12.

Prepare to fall down a rabbit hole of Linux compiler errors — here’s a guide on how to set up a proper Python and TensorFlow development environment on Columbia’s Yeti HPC cluster. This should also work for other RHEL 6.7 and certain CentOS HPC systems where GLIBC and other dependencies are out of date and you don’t have root access to dig deep into the system. A living, breathing guide is on my GitHub here, and I will keep this post updated in case future versions of TensorFlow are easier to install.

Python Setup

Set an environment variable for the directory where we’ll do our installation and computing (add it to your ~/.profile so it persists across sessions).

export WORK=/vega/<group>/users/<username>

Now, install and set up the latest version of Python (2 or 3).

cd $WORK
mkdir applications
cd applications
mkdir python
cd python
wget https://www.python.org/ftp/python/2.7.12/Python-2.7.12.tgz
tar -xvzf Python-2.7.12.tgz
find Python-2.7.12 -type d | xargs chmod 0755
cd Python-2.7.12
./configure --prefix=$WORK/applications/python --enable-shared
make && make install
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$WORK/applications/python/lib"

You can add this Python to your path, but I am just going to work entirely out of virtual environments and will leave the default path as-is. If you’re particular about folder structure, you can install specific Python versions in (for example) $WORK/applications/python/Python-2.7.12 to keep separate versions well-organized and easily available.

Now, we’ll install pip.

cd $WORK/applications
wget https://bootstrap.pypa.io/get-pip.py
$WORK/applications/python/bin/python get-pip.py

Now to install and set up a virtualenv:

$WORK/applications/python/bin/pip install virtualenv
cd $WORK/applications
$WORK/applications/python/bin/virtualenv pythonenv

Now, create an alias in your ~/.profile to allow easy access to the virtualenv.

alias pythonenv="source $WORK/applications/pythonenv/bin/activate"

There you have it! Your own local Python installation in a virtualenv, just a pythonenv command away. You can also install multiple Python versions and pick which one you want for a particular virtualenv. Nice and self-contained.

Bazel Setup

The TensorFlow binary requires GLIBC 2.14, but Yeti runs RHEL 6.7, which ships with GLIBC 2.12. Installing a new GLIBC from source will lead you down a rabbit hole of system dependencies and compilation errors, but we have another option: installing Bazel and compiling TensorFlow from source. Bazel requires Java 8 (we’ll use the Oracle JDK):

# Do this in an interactive session because submit queues don't have enough memory.
cd $WORK/applications
wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u112-b15/jdk-8u112-linux-x64.tar.gz
tar -xzf jdk-8u112-linux-x64.tar.gz

Add these two lines to your ~/.profile:

export PATH=$WORK/applications/jdk1.8.0_112/bin:$PATH
export JAVA_HOME=$WORK/applications/jdk1.8.0_112

Now, get a copy of Bazel. We also need to load a newer version of gcc to compile it:

wget https://github.com/bazelbuild/bazel/releases/download/0.4.2/bazel-0.4.2-dist.zip
unzip bazel-0.4.2-dist.zip -d bazel
cd bazel
module load gcc/4.9.1
./compile.sh

Add the following to your ~/.profile:

export PATH=$WORK/applications/bazel/output:$PATH

TensorFlow Setup

We’re going to install TensorFlow from source using Bazel.
Make sure numpy is installed in your pythonenv: pip install numpy.
Clone the TensorFlow repository.

cd $WORK/applications
git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow
git checkout r0.12

We also need to install swig:

cd $WORK/applications
# Get swig-3.0.10.tar.gz from SourceForge.
tar -xzf swig-3.0.10.tar.gz
mkdir swig
cd swig-3.0.10
./configure --prefix=$WORK/applications/swig
make
make install

Add the following to your ~/.profile:

export PATH=$WORK/applications/swig/bin:$PATH

We need to set the following environment variables. Add them to your ~/.profile:

export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-7.5/extras/CUPTI/lib64"
export CUDA_HOME=/usr/local/cuda-7.5

Note that /usr/local/cuda-7.5/lib64 is automatically added to $LD_LIBRARY_PATH when you run module load cuda, so we only need to add the CUPTI directory ourselves. Also note that /usr/local/cuda is symlinked to /usr/local/cuda-7.5, so you don’t need to include the version in these paths, but I’m doing it to be explicit.

To install TensorFlow, we need GPU nodes and their libraries, which we can access in an interactive session. Running module load cuda loads CUDA 7.5 and cuDNN. Then we can configure the build:

# This gives you a 1-hour interactive session with GPU support.
# It may take a while to start the interactive session, depending on current wait times.
qsub -I -W group_list=<yetigroup> -l walltime=01:00:00,nodes=1:gpus=1:exclusive_process
# Use latest available gcc for compatibility.
# CUDA loads 7.5 by default.
# Load the proxy to allow TF to download and install protobuf and other dependencies.
module load gcc/4.9.1 cuda proxy 
pythonenv
cd $WORK/applications/tensorflow
./configure
# I used all the default settings except for CUDA compute capabilities, which I set to 3.5 for our K20 and K40 GPUs.

Once that is done, make the following change to third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl to add the -fno-use-linker-plugin compiler flag:

index 20449a1..48a4e60 100755
--- a/third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl
+++ b/third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl
@@ -309,6 +309,7 @@ def main():
     # TODO(eliben): rename to a more descriptive name.
     cpu_compiler_flags.append('-D__GCUDACC_HOST__')

+  cpu_compiler_flags.append('-fno-use-linker-plugin')
   return subprocess.call([CPU_COMPILER] + cpu_compiler_flags)

 if __name__ == '__main__':

Now we can build with Bazel:

bazel build -c opt --config=cuda --verbose_failures //tensorflow/cc:tutorials_example_trainer

The build should fail with an error that goes something like undefined reference to symbol 'ceil@@GLIBC_2.2.5' or undefined reference to symbol 'clock_gettime@@GLIBC_2.2.5'. To fix this, modify LINK_OPTS in bazel-tensorflow/external/protobuf/BUILD by adding the -lm and -lrt flags to //conditions:default:

LINK_OPTS = select({
    ":android": [],
    "//conditions:default": ["-lpthread", "-lm", "-lrt"],
})

Restart the build and run the sample trainer:

bazel build -c opt --config=cuda --verbose_failures //tensorflow/cc:tutorials_example_trainer
bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu

If everything goes okay, build the pip wheel:

bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package $WORK/applications/tensorflow
pip install $WORK/applications/tensorflow/tensorflow-0.12.0-cp27-cp27m-linux_x86_64.whl

Testing TensorFlow

Try training on MNIST data to see if your installation works:

cd tensorflow/models/image/mnist
python convolutional.py
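
As one more sanity check (using the r0.12 graph-and-session API), you can confirm from Python that ops actually land on the GPU:

import tensorflow as tf

# Log device placement so each op reports the device it runs on; look for
# "gpu:0" in the output.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    a = tf.constant([1.0, 2.0, 3.0])
    b = tf.constant([4.0, 5.0, 6.0])
    print(sess.run(a + b))  # expect [ 5.  7.  9.]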


Observations from the New York Scientific Data Summit

Deep learning impresses and disappoints

Multiple talks discussed results from deep learning techniques, especially convolutional neural networks, and the effectiveness of the methods varied wildly. Some experiments yielded only 50% classification accuracy, which hardly seems helpful or effective at all. I’m unsure whether other techniques were attempted or considered, but it’s clear that deep learning isn’t the most effective approach for every single problem. It’s a shiny new hammer that makes every problem look like a nail. Libraries like TensorFlow make it more accessible, but there is still a visible gap between those who can implement it and those who can implement it effectively.

Re-inventing the wheel

A few groups demonstrated in-house tools that already have excellent open-source alternatives. I’m not sure whether they were unaware of the existing libraries or just wanted something more finely tuned for their own purposes, but it seems that a lot of scientific time is spent on problems that are already solved. Regardless, there were plenty of examples of people using open-source libraries effectively, so the progress there is something to be proud of.

Sagan Exoplanet Workshop Day 1

Morning

Eric Feigelson (Penn State) discusses Statistics and the Astronomical Enterprise, and why statistics is so essential to the discovery and study of exoplanets. He covers the history and development of statistics in relation to astronomy, the present state of the field, key examples of essential statistical applications in astronomy and astrophysics, an outlook on potential future developments, and practical computing implementations in R.

Jessi Cisewski (Yale) goes over Bayesian Methods. She covers the basics of Bayesian analysis (Bayes’ Theorem, prior and posterior distributions, and inference with posteriors). Examples of different analyses using different models and distributions are given. Classical/Frequentist approaches are contrasted with the Bayesian approach, and best practices are covered.
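
For reference (my notation, not her slides), Bayes’ Theorem ties all of these pieces together:

p(\theta \mid d) = \frac{p(d \mid \theta) \, p(\theta)}{p(d)}

where \theta are the model parameters, d is the data, and the evidence p(d) simply normalizes the posterior.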

Xavier Dumusque (Université de Genève) and Nikole Lewis (STScI) introduce the hands-on sessions of the week. Xavier discusses separating planetary signals from other radial velocity signals, and Nikole discusses detecting exoplanets in upcoming JWST data and analyzing their spectra.

Afternoon

David Kipping (Columbia) gives A Beginner’s Guide to Markov Chain Monte Carlo (MCMC) Analysis. MCMC is presented as a way to sample a posterior distribution, with worked examples. The Metropolis and Metropolis-Hastings algorithms are discussed, along with situations where other algorithms may be more or less effective.
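
To make the idea concrete, here’s a toy Metropolis sampler (my illustration, not from the talk): propose a random step, accept it with probability min(1, posterior ratio), and the chain of accepted states ends up sampling the posterior itself.

import numpy as np

def log_posterior(theta):
    return -0.5 * theta ** 2  # stand-in: a standard normal posterior

def metropolis(n_samples, step_size=0.5):
    samples = np.empty(n_samples)
    theta = 0.0
    for i in range(n_samples):
        proposal = theta + np.random.normal(scale=step_size)
        # Accept with probability min(1, p(proposal) / p(theta)).
        if np.log(np.random.rand()) < log_posterior(proposal) - log_posterior(theta):
            theta = proposal
        samples[i] = theta
    return samples

print(metropolis(100000).std())  # should be close to 1 for this posterior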

POPs Session I

Slides here.

  • Ines Juvan (Space Research Institute Graz) – PyTranSpot – A tool for combined transit and stellar spot light-curve modeling
  • Anthony Gai (Univ. at Albany) – Bayesian Model Testing of Models for Ellipsoidal Variation on Stars Due to Hot Jupiters
  • Emiliano Jofre (OAC-CONICET) – Searching for planets in southern stars via transit timing variations
  • Sean McCloat (Univ. of North Dakota) – Follow-up Observations of Recently Discovered Hot Jupiters
  • Romina Petrucci (OAC-CONICET) – A search for orbital decay in southern transiting planets
  • Luke Bouma (MIT) – What should we do next with the Transiting Exoplanet Survey Satellite?
  • Akshata Krishnamurthy (MIT) – A precision optical test bench to measure the absolute quantum efficiency of the Transiting Exoplanet Survey Satellite CCD detectors

David Kipping (Columbia) teaches us about using Bayesian Priors for Transits and RVs. Selection, implementation, strengths and weaknesses, and ideal use cases for different types of priors are discussed.

POPs Session II

Slides here.

  • Allen Davis (Yale) – Assessing the information content of spectra using PCA
  • Matteo Pinamonti (Univ. of Trieste) – Searching for planetary signals in Doppler time series: a performance evaluation of tools for periodogram analysis
  • Keara Wright (Univ. of Florida) – Stellar Parameters for FGK MARVELS Targets
  • Richard Hall (Univ. of Cambridge) – Measuring the Effective Pixel Positions of the HARPS3 CCD
  • Sarah Millholland (UCSC) – A Search for Non-Transiting Hot Jupiters with Transiting Super-Earth Companions
  • Tarun Kumar (Thapar Univ.) – Radial Velocity Curves of Polytropic Models of Stars of Polytropic Index N=1.5
  • Fei Dai (MIT) – The K2-ESPRINT Collaboration

Eric Feigelson (Penn State) covers Statistical Approaches to Exoplanetary Science, focusing on time series analysis. Parametric and nonparametric methods are shown for time-domain and frequency-domain problems, along with code examples and other potentially useful methods.

dotAstronomy Day 1

James Webb Space Telescope and Astronomy

Sarah Kendrew (ESA, STScI)

  • JWST goes well into the infrared.
  • Launch autumn/winter 2018 — lots of things that can go wrong, but these engineers are awesome.
  • Science proposals start November 2017.
  • Routine science observations start six months after launch.
  • Compared to next-gen observatories, JWST is an old-school telescope. We can bring it into the 21st century with better tools for research.
  • Development tools are being coordinated with the Astropy developers.
  • Watch the clean room live on the WebbCam (ha!).

Bruno Merin (ESA)

ESASky – a Multi-Mission Interface

Open Source Hardware in Astronomy

Carl Ferkinhoff (Winona State University)

  • hardware.astronomy: bringing the open hardware movement to astronomy. Four goals:
    1) Develop low(er)-cost astronomical instruments.
    2) Invest undergrads in the development (helps keep costs low).
    3) Make hardware available to the broader community.
    4) Develop an open standard for hardware in astronomy.

Citizen Science with the Zooniverse: turning data into discovery (Oxford)

Ali Swanson

  • Crowdsourcing has proven effective at dealing with large, messy data in many cases across different fields.
  • Amateur consensus agrees with experts 97% of the time (experts agree with each other 98% of the time), and the remaining 3% are deemed “impossible” even by the experts.
  • Create your own Zooniverse!

Gaffa tape and string: Professional hardware hacking (in astronomy)

James Gilbert (Oxford)

  • Spectroscopy with fiber-optic cables positioned on a focal plane; the cables must be moved to new target locations.
  • Used a ring magnet and piezoelectric movement to move “Starbugs” around — messy, inefficient.
  • Prototyped a vacuum solution that worked fine! This is now the final design.
  • Hacking, lean prototypes, and live demos are effective in showing and proving results to people. Kinks can be ironed out later, but faith is won by showing something can work.

Open Science with K2

Geert Barentsen (NASA Ames)

  • Science is woefully underfunded: compare the Qatar World Cup (~$220 billion) with the Kepler mission ($0.6 billion).
  • Open science disseminates research and data to all levels of society. We need more than a bunch of papers on the arXiv; Zooniverse promotes active participation.
  • The K2 mission shows the impact of extreme openness. Kepler contributed immensely to science, but it was closed.
  • Large missions are too valuable to give exclusively to the PI team — don’t build a wall.
  • Proprietary data slows down science, misses opportunities for limited-lifetime missions, blocks early-career researchers, and reduces diversity by favoring rich universities.
  • People are afraid of getting scooped, but we can have more than one paper. Putting work on GitHub is publishing, and getting “scooped” is actually plagiarism.
  • K2 is basically a huge hack — using solar photon pressure to balance an axis after Kepler broke.
  • The open approach: no proprietary data, funding other groups to do the same science, and requiring large programs to keep their data open.
  • K2 vs. Kepler: the broken spacecraft with a 5x smaller budget has more authors and more publications, with more early-career researchers because all the data is open: a 2x increase, and a fairer representation of the astro community.
  • Call to action: question restrictive policies and proprietary periods. Question the idea of one paper per dataset or discovery. Don’t fear each other as competition — fear losing public support.
  • Thanks to K2, the next mission will have open data from day 0.

Lightning Talks

#foundthem

Aleks Scholz (University of St Andrews)

SETI, closed science vs open science and communicating with the public.

astrobites

Ashley Villar (Harvard)

Send your undergrads to Astrobites! Advice, articles, tutorials.

“There is no such thing as a stupid question” comic book

Edward Gomez (Las Cumbres Observatory)

Neat astro comic book for kiddos.

Astronomy projects for the blind and visually impaired

Coleman Krawczyk (University of Portsmouth)

3D printing galaxies as a tool for the blind.

NOAO Data Lab

Matthew Graham (Caltech/NOAO)

Classifying Stellar Bubbles

Justyn Campbell-White (University of Kent)

Citizen science data being used in a PhD project.

The Pynterferometer

Adam Avison (ALMA)

A short history of JavaScript

William Roby (Caltech)

JavaScript is more usable thanks to ES6, and it follows functional principles. Give it another try if you’ve written it off!

Asteroid Day – June 30th, 2016

Edward Gomez (Las Cumbres Observatory)

International effort to observe NEAs with Las Cumbres.

Science on Supercomputers

A slice of the universe, created on Stampede.
Our simulations are far too computationally intensive to run on normal computers, so they’re run on the Stampede supercomputer at the Texas Advanced Computing Center. Stampede is a massive cluster of computers. It’s made up of 6400 nodes, and each node has multiple processors and 32GB to 1TB of RAM. The total system has 270TB of RAM and 14PB of storage. It’s hard to put these numbers into terms we can compare to our laptops, but essentially, this is enough computing power to simulate the universe with.
Stampede
Sometimes people ask if I use the telescope on top of Pupin, and when I answer “No,” they wonder what on earth I’m doing with my time. Mostly I write code and run scripts. This sort of astrophysics sounds unglamorous, but it amazes me. All I need is my computer and an internet connection, and I have the real universe and any theoretical universe I can dream up at my fingertips. Computers and the internet have completely changed the way we do science, and Stampede is just one reminder of the capability and potential of these new scientific tools.

The Importance of Data Visualization in Astronomy

It is difficult to overstate the importance of data visualization in astronomy and astrophysics. Behind every great discovery, there is some simple visualization of complex data that makes the science behind it seem obvious. As good as computers are becoming at fitting models and finding patterns, the human eye and mind are still unparalleled at picking out interesting patterns in data and reaching new conclusions. Here are a few of my favorite visualizations that simply illustrate complex concepts.

Large Scale Structure

As we’ve mapped increasing portions of the known universe, we’ve discovered astounding structures on the largest scales. Visualizing this structure in 2D or 3D maps gives us an intuitive grasp of the arrangement of galaxies within the universe and the forces behind the creation of that structure.

Galaxy filaments

Sloan Digital Sky Survey
The Sloan Digital Sky Survey is a massive scientific undertaking to map the objects of the known universe. Hundreds of millions of objects have been observed, reaching back billions of years. It may seem overwhelming to even begin processing this data, but a simple map of the objects in the sky provides immediate insight into the large-scale structure of our universe. We find that galaxies are bound by gravity into massive filaments, and that these filaments must contain mass beyond what we can see (in the form of dark matter) to hold these web-like structures together.

Fingers of God

Fingers of God
If you plot galaxies to observe large-scale structure, a peculiar pattern emerges: the structures seem to point toward and away from our position in the universe. This violates the Cosmological Principle, which states that no position in the universe should be favored over any other. So why do these filaments seem to point at us? The cause of these “Fingers of God” is an observational effect called redshift-space distortion. Galaxies move under the gravity of their clusters, on top of the expansion of the universe, so their light picks up an extra Doppler shift and their redshift-inferred distances are stretched along our line of sight. Correcting for this effect recovers the more random filaments we see above.
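
To first order, the distortion is just the line-of-sight peculiar velocity leaking into the redshift-inferred distance:

s = r + v_\parallel / H_0

where r is the true distance, v_\parallel is the galaxy’s peculiar velocity along our line of sight, and s is the distance we infer from its redshift.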

Expansion of the Universe

Hubble's Law
In 1929, Edwin Hubble published a simple yet revolutionary plot: the distances of galaxies from us against the velocities at which they move toward or away from us. He found that the farther away a galaxy is, the faster it moves away from us. This could not happen in what was then thought to be a static universe. Hubble’s Law showed that our universe is in fact expanding.
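
In modern notation, the relation is simply:

v = H_0 d

where H_0, the Hubble constant, is measured today at roughly 70 km/s/Mpc.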

Galaxy Rotation Curves

Rotation Curve of NGC3198
When we plot the rotational velocity of galaxies against radius, we expect the velocity to fall off with increasing radius based on the mass we can observe: outside the dense center, the enclosed mass stops growing, so orbits should slow down, just as the outer planets orbit the Sun more slowly than the inner ones. However, plotting rotation curves reveals something peculiar — the rotational velocity remains roughly constant with radius as you leave the center. This means there must be matter we aren’t seeing: dark matter.
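
For scale, the visible mass alone predicts a Keplerian falloff (a textbook expectation, not a result from our data):

v(r) = \sqrt{G M(<r) / r}

so a flat curve at large radius implies M(<r) \propto r: the enclosed mass keeps growing even where the starlight runs out.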

Our own research

Density Plot
Visualization has proved important in our own research as well. Simple sanity checks on the large-scale structure of our simulations help us make sure they are running properly, and plots of different parameters reveal simple relationships that arise from the physics of the simulations.