Ph.D. Candidate - Computation and Neural Systems - Caltech


"Create something creative."

My goal is to model human intelligence and build intelligent machines. My primary interests include:

Machine Learning: active learning, deep learning, graphical models

Neuroscience: neural substrates of sequential decision making

Computer Vision: real-time object detection, scotopic vision

Curriculum Vitae



    Vision without the Image
    B. Chen and P. Perona.
    Sensors: Special Issue on Photon-Counting Sensors (new)
    There is a disconnect between current computer vision algorithms and next-generation sensor technology. Sensors in the near future will report the world as a collection of single-photon arrival events, whereas computer vision still views the world as a static image. We bridge this disconnect by considering classification, search and tracking algorithms coupled with recently developed photon-counting sensors.

    Scotopic Visual Recognition
    B. Chen and P. Perona.
    ICCV Extreme Imaging Workshop 2015 (Oral)
    How many photons do we need to collect from a scene before we can identify the objects in it? In scenarios where the input to a recognition system is incremental and sparse, the system must make wise use of the input in order to respond as quickly and as accurately as possible. Every bit counts. These scenarios include driving under low light, passive surveillance, live cell imaging, competitive trading, etc. We present the first study of the near-optimal trade-off between the amount of input and the accuracy of visual recognition. The trade-off may be implemented by a self-recurrent deep ConvNet.

    Speed versus accuracy in visual search: optimal performance and neural architecture.
    B. Chen and P. Perona.
    Journal of Vision 2015
    You are picking apples at a farm. How long should you spend inspecting a tree before moving on? How often will you miss the juiciest Red Delicious? How does the brain trade off speed versus accuracy in an environment where the number and variety of apples are unknown and changing? How does the brain transform sensory information encoded in trains of action potentials, using neurons that communicate with action potentials, into a decision about when to stop inspection and which apple to pick? Our nominal visual search model is the first attempt to answer all the questions above.

    Hierarchical cascade of classifiers for efficient poselet evaluation.
    B. Chen, P. Perona and L. Bourdev.
    BMVC 2014 (Oral)
    Scaling up object detection systems to handle more objects and more complex categories is key to industrial applications. We present the first attempt to speed up Poselets, the state of the art (as of 2013) in general object detection, by marrying the ideas of cascades and hierarchical organization. Our framework may be adapted to scale Deformable Part Models and deep networks to large numbers of categories and parts.

    Learning fine-grained image similarity with deep ranking.
    J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen and Y. Wu.
    CVPR 2014
    How do we evaluate whether two images are similar? Dr. Song and I pioneered the use of deep learning to study fine-grained image similarity. Thanks to the efforts of the Google Image team, especially my fellow intern Jiang Wang, we present the first ConvNet that thinks Keira Knightley resembles Natalie Portman more than Taylor Swift.

    Joint optimization and variable selection of high-dimensional Gaussian processes.
    B. Chen, R. Castro and A. Krause.
    ICML 2012
    Active optimization of a high-dimensional stochastic function is hard: every evaluation is costly and noisy, and dimensionality is a curse. We present the first no-regret solution for optimizing a free-form smooth function in high dimensions, provided that only a small subset (<6) of the dimensions is relevant.

    Predicting response time and error rates in visual search.
    B. Chen, V. Navalpakkam and P. Perona.
    NIPS 2011
    When we search for things under time pressure, do we zoom in to the most likely item (MAX), or do we aggregate noisy information over the entire visual field (Bayesian)? We suggest psychophysics experiments that may distinguish between the two.

    Deep learning of invariant spatio-temporal features from video.
    B. Chen, J. Ting, B. Marlin and N. de Freitas.
    NIPS 2010 Deep Learning and Unsupervised Feature Learning Workshop
    This is the first time a ConvNet was used to learn features from video (in the pre-GPU and pre-Caffe era!). We found that the features are more invariant to transformations such as scaling and rotation, and are able, in the short term, to generate predictions about the future and reminisce about the past.

    Inductive principles for Restricted Boltzmann Machine learning.
    B. Marlin, K. Swersky, B. Chen and N. de Freitas.
    AISTATS 2010
    Restricted Boltzmann Machines are the basic building blocks of deep networks (as of 2010). We present a unifying framework to describe current learning strategies for RBMs, and compare them on denoising and classification tasks.