
Sunday, November 4, 2012

Why is HoG good?

Well, this land has lain fallow for almost a year...

Recently I have been thinking about why the Histogram of Oriented Gradients (HoG) is such a good feature. Is it because we, as human beings, by our nature tend to differentiate objects into different perceptual classes by the variations of their color gradients?

In other words, the question is whether perceptually different objects are also conceptually different, and which mode of perception differentiates objects conceptually. Take the zebra as an example: we call the horse-like animal with striped texture a zebra, but we never coin a new word for horses of different uniform colors, even when the color is extremely eccentric, like snow-white. Is this a cue that human beings, by their nature, tend to classify objects into different conceptual classes by gradient distribution? What if there existed a kind of animal like the one shown in the picture below? I think we would already have a fancy new name for this thing, maybe Zirog?



Continuing this line of thinking, an infant gaze experiment is one possible way to test this presumption. For example, prepare two sets of images: in one set, two apples with different colors (say one green, one red); in the other set, two apples with different textures (say one uniformly colored, the other striped). The objects in the first set have similar HoG representations, while those in the second set have dramatically different HoGs. We show the infant one of the images from each set at the beginning, then use a box to occlude it. From [1], we know that even a five-month-old child knows that an object continues to exist when occluded. Now we secretly swap the occluded object for the other image in its set. By comparing how surprised the infants are, or how long they gaze at the revealed object, we can measure how different the infants consider these objects to be, and whether they think the object has changed dramatically, violating their expectation, so that they tend to put the two into different conceptual classes. If infants reliably gaze longer in the second condition, it suggests that from the beginning of our lives we tend to differentiate objects into different classes by their gradient distribution (at least more than by color alone). If that is the case, it in some way supports why HoG is so successful in object recognition tasks.
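To make the intuition concrete, here is a minimal sketch (assuming numpy and scikit-image are available; the patch sizes and intensity values are arbitrary stand-ins of my own, not taken from any experiment). A pure color/intensity change barely moves the HoG descriptor, while a striped texture changes it dramatically:

```python
import numpy as np
from skimage.feature import hog

def hog_descriptor(img):
    # 9 orientation bins over 8x8-pixel cells, the standard Dalal-Triggs setup.
    return hog(img, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)

size = 64
uniform_dark = np.full((size, size), 0.3)                  # "green apple" stand-in
uniform_light = np.full((size, size), 0.8)                 # "red apple" stand-in
stripes = np.tile(np.repeat([0.3, 0.8], 4), size // 8)     # 8-pixel stripe pattern
striped = np.broadcast_to(stripes, (size, size)).copy()    # "zebra" stand-in

d_dark, d_light, d_striped = map(hog_descriptor,
                                 (uniform_dark, uniform_light, striped))
print(np.linalg.norm(d_dark - d_light))    # ~0: intensity change alone barely moves HoG
print(np.linalg.norm(d_dark - d_striped))  # large: the striped texture changes HoG a lot
```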

[1] R. Baillargeon, E. Spelke, and S. Wasserman, "Object permanence in five-month-old infants," Cognition, 20 (1985), 191-208.

Saturday, November 19, 2011

Two years later, I went back to Gestalt Theory

Two years later, reading the chapter on the Helmholtz Principle again, I am once more astonished by this statement:

"We immediately perceive whatever could not happen by chance"

It gives us a simple, so simple, criterion for checking whether a visual event is meaningful: if an event could not have happened by chance, it is meaningful.

There exists a whole framework developed to measure meaningfulness in this sense, the so-called $\delta$-meaningfulness.
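As a rough reminder of how that framework quantifies it (my paraphrase of the a contrario / number-of-false-alarms idea from Desolneux, Moisan, and Morel; the threshold is usually written $\varepsilon$ in that literature, but I keep $\delta$ to match the line above):

\[
\mathrm{NFA}(e) \;=\; N_{\mathrm{tests}} \cdot \Pr\!\left[\, e \mid \text{background (noise) model} \,\right],
\qquad
e \text{ is } \delta\text{-meaningful} \iff \mathrm{NFA}(e) \le \delta .
\]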

It then connects smoothly to several probability-bounding techniques from theoretical computer science.

I found that F. Cao is doing very interesting work along these lines:

 
http://www.irisa.fr/vista/Equipe/People/Frederic.Cao.html

Wednesday, April 20, 2011

Virtual layer of ROS

The Operating Systems class I am TAing has reached the file system part, where a virtual file system layer above the different concrete file systems is introduced.

On the way driving to Qualcomm New Jersey, a question came to my mind: is there a virtual action-manipulation layer between our mind and our motor system? For example, our mind may issue a command: grab the cup on the desk. After interpretation by the virtual action-manipulation layer, the same command is translated into several different low-level commands for different motor systems (a toy code sketch follows the list):

To the eyes, the command becomes Eye_Grabcup, which actually focuses on the cup;

To the legs, the command becomes Leg_Grabcup, which may drive the legs to walk towards the cup;

And to the hands, the command becomes Hand_Grabcup, which is: go and grab the cup!
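Here is that toy sketch (the class and method names are purely hypothetical, made up for illustration, modeled on how a virtual file system dispatches one call to different file-system drivers):

```python
# Purely hypothetical names: a "virtual action layer" dispatching one high-level
# command to different motor "drivers", by analogy with a VFS dispatching one
# system call to different file systems.
class Effector:
    def grab(self, target):
        raise NotImplementedError  # the virtual-layer operation

class Eyes(Effector):
    def grab(self, target):
        return f"Eye_Grab{target}: fixate on the {target}"

class Legs(Effector):
    def grab(self, target):
        return f"Leg_Grab{target}: walk towards the {target}"

class Hands(Effector):
    def grab(self, target):
        return f"Hand_Grab{target}: reach out and grasp the {target}"

def mind_command(effectors, target):
    # The mind issues one command; the layer fans it out to every registered driver.
    return [e.grab(target) for e in effectors]

for line in mind_command([Eyes(), Legs(), Hands()], "cup"):
    print(line)
```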

Now, when we talk about the Vision Executive and the Language Executive, is it reasonable to put a virtual action-manipulation layer over them? For example, "focus" is an actual action of the VE, and "search related knowledge" is an action of the LE, yet both of them share the same virtual-layer function: "Attend!"



Finally we completed the Qualcomm presentation~~ It always feels good to have a chance to share the idea and our preliminary results. The researchers from Qualcomm and the other students attending the finals also gave us a lot of valuable suggestions. We appreciate every comment, and there is still a loooot of work ahead~~

Thursday, April 14, 2011

Something about TA

I am TAing Operating Systems this semester.

Every time I see students struggling with and being tortured by the projects (such as implementing a paging system or a file system on the toy OS GeekOS), I seriously feel the difference between the US education system and the Chinese one. Undergraduate students here are soooo busy working on all kinds of projects, which gives them first-hand experience in implementing systems, while in China the OS class is more about the "concepts".

Even if you know all the "concepts", you still won't know how an OS is actually implemented. The only "projects" are peeking into an old version of the Linux kernel, never actually implementing a function... Thinking about it a step deeper, what we learned was how to "copy", not how to "create". That is a huge difference.

BTW, playing around with GeekOS is interesting (as long as you don't actually have to implement those headache-inducing projects).

Friday, April 1, 2011

Michael A. Arbib's talk

Michael A. Arbib

Template construction grammar and the generation of descriptions of visual scenes.




At the beginning of the lecture, Prof. Arbib showed us a static image of three women racing; one had fallen down, and each of them has only one natural leg (the other is prosthetic). He showed the image for about 5 seconds and asked people to describe it. The interesting thing is that people tend to attend to the parts of the image in a hierarchical order, following an attention mechanism.

In other words, people tend to focus first on the women racing, then "ah, someone fell down", and then "ah! they all have a prosthetic leg!".

Then he introduced the construction grammar, which basically works like this: whenever an object or action is observed, a node is automatically added to the graph structure; once every salient thing and event has been perceived, the structure is parsed up to higher levels until a sentence or description is generated.

For example, a woman punches a man in the face. If you attend to the woman's fist first, then the man's face, the description tends to be: a fist punches the man. On the contrary, if you attend to the man's face before the fist, the description tends to be: a man is hit by a fist.
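A toy sketch of that effect (my own hypothetical simplification, not Arbib's actual Template Construction Grammar implementation): the order in which the parts are attended decides which argument surfaces first, and hence active versus passive voice.

```python
# Hypothetical toy code: one scene-graph event punch(fist, man), verbalized
# differently depending on which part was attended first.
def describe(attention_order):
    agent, patient = "a fist", "the man"
    if attention_order.index("fist") < attention_order.index("face"):
        return f"{agent} punches {patient}"  # agent attended first -> active voice
    return f"{patient} is hit by {agent}"    # patient attended first -> passive voice

print(describe(["fist", "face"]))  # a fist punches the man
print(describe(["face", "fist"]))  # the man is hit by a fist
```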

And all of this construction-grammar generation is based on a strong assumption: the vision system gives perfect output.

***********************************************************************************

Now back down to earth: at least for now, vision systems are far from giving perfect output. All we can do is assign some probability that a certain object or action exists in the image or video. Then what happens to this grammar? It becomes probabilisticalized (I made that word up...)! Just like what we did for the EMNLP paper, we take noisy vision output as input and use a tweaked HMM system to generate the most likely description of the image. I believe that if we can combine the uncertainty of the vision output with the language construction Prof. Arbib introduced here, a robust scene-description generator is not far away.
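For concreteness, here is a minimal Viterbi decoding sketch (toy states and made-up probabilities of my own, not the numbers from the actual EMNLP system): the noisy detector confidences play the role of emission probabilities, and the decoder returns the most likely sequence of description tokens.

```python
import numpy as np

states = ["person", "cup", "grasp"]            # hypothetical description tokens
trans = np.array([[0.6, 0.2, 0.2],             # P(next state | current state)
                  [0.3, 0.4, 0.3],
                  [0.3, 0.3, 0.4]])
start = np.array([0.5, 0.3, 0.2])
# Each row: the detector's (noisy) confidence for every state at one frame.
emissions = np.array([[0.7, 0.2, 0.1],
                      [0.2, 0.6, 0.2],
                      [0.1, 0.2, 0.7]])

def viterbi(start, trans, emissions):
    T, S = emissions.shape
    logp = np.log(start) + np.log(emissions[0])
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        # scores[i, j]: best log-prob of being in state i at t-1 and state j at t.
        scores = logp[:, None] + np.log(trans) + np.log(emissions[t])[None, :]
        back[t] = scores.argmax(axis=0)
        logp = scores.max(axis=0)
    path = [int(logp.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [states[s] for s in reversed(path)]

print(viterbi(start, trans, emissions))  # ['person', 'cup', 'grasp']
```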

PS: Prof. Arbib is a really humorous senior professor:
He referred to the traditional "S -> NP VP ..." grammar as the "Cheerleader" grammar. Why? Give me an S! Give me an N! Give me a V!
He also said he is planning to retire in five years, although the number five is a constant~

^^

Saturday, March 26, 2011

Qualcomm Innovation Fellowship finalists

I feel extremely excited to have a chance to deliver a presentation and to meet professors and PhD students from these top CS and EE departments. It always feels good to get an opportunity to share research ideas and results with the community. Having worked on language-and-vision stuff for almost a year, the beauty of combining these two major cognitive processes within a computational framework has so far kept me from getting bored with PhD life. So far so good.



QInF 2011 East Coast Finals: Bridgewater, NJ

School | Students | Recommenders | Innovation Title
UMD | Ching L Teo, Yezhou Yang | Yiannis Aloimonos, Hal Daumé III | Robots Need Language: A computational model for the integration of vision, language and action
Princeton | Mohammed Shoaib, Kyong Ho Lee | Naveen Verma, Niraj K. Jha | Algorithm-driven Platforms for Low-energy Intelligent Biomedical Systems
MIT | Soheil Feizi, Georgios Angelopoulos | Muriel Medard, Vivek Goyal | Energy-Efficient Time-Stampless Adaptive Nonuniform Sampling
UMD | Timir Datta, Filiz Yesilkoy | Martin Peckerar, Pamela Abshire | An Ultra-low Power Infrared Scavenging Autonomous Silicon Micro-robot
MIT | Adam Marcus, Eugene Wu | Samuel Madden | Qurk: A Crowdsourced Database for Query Processing with People
Princeton | Zhen Xiang, Hao Xu | Peter Ramadge | A Cloud-based Low Bandwidth Machine Learning Service
Rutgers | Sejong Yoon, Shahriar Shariat | Vladimir Pavlovic | A Novel Multimodal Mobile Media Rank System for Faster Media Flow
Rutgers | Akash Baid, Tam Vu | Dipankar Raychaudhuri | NASCOR: Network Assisted Spectrum Coordination Service for Coexistence between Heterogeneous Radio Systems
UCB | Reza Naima, Pablo Paredes | John Canny | The Mobile Stress Platform: Detection, Feedback & Mitigation
MIT | Ahmed Kirmani, Andrea Colaco | Vivek K Goyal, Franco Wong | Single Pixel Depth Sensing and 3D Camera
MIT | Sushmit Goswami, Sungwon Chung | Joel Dawson, Anantha Chandrakasan | A Frequency Agile Architecture for Fully Integrated Radio Frontends

  • Venue: Qualcomm Research Center New Jersey
  • Address: 500 Somerset Corporate Blvd, Bridgewater, NJ 08807
  • Date: Tue Apr 1

We are planning to drive to NJ, visit Princeton, and enjoy the presentations. ^^


Friday, March 18, 2011

Hal's Legacy

I am reading the book "Hal's Legacy" (2001) and found something interesting on page 8, where the author tries to summarize Prof. Rosenfeld's vision of computer vision in one paragraph:

" Imagine, for example, a computer that could look at an arbitrary scene - anything from a sunset over a fishing village to Grand Central Station at rush hour - and produce a verbal description. This is a problem of overwhelming difficulty, relying as it does on finding solutions to both vision and language and then integrating them. I suspect that scene analysis will be one of the last cognitive tasks to be performed well by computers."