Well, this patch of land has lain fallow for almost a year...
Recently I was thinking about why the Histogram of Oriented Gradients (HoG) is such a good feature. Is it because we human beings, by our nature, tend to differentiate objects into different perceptual classes by the variations in their color gradients?
In other words, the question is whether perceptually different objects are also conceptually different, and which mode of perception differentiates objects conceptually. Take the zebra as an example: we call the horse-like animal with a striped texture a zebra, but we never make up a new word for horses of different solid colors, even when the color is extremely eccentric, like snow white. Is this a cue that human beings, by their nature, tend to classify objects into different conceptual classes by gradient distribution? What if there existed the kind of animal shown in the picture below? I think we would already have a fancy new name for this thing, maybe "Zirog"?
Continuing this line of thinking, an infant gaze experiment is a possible way to test this presumption. For example, prepare two sets of images: one set is two apples with different colors (say one green, one red), and the other set is two apples with different textures (say one solid-colored, the other striped). The objects in the first set have similar HoG representations, while those in the second set have dramatically different HoGs. We show the infant one image from each set at the beginning, then use a box to occlude it. From [1], we know that even a five-month-old infant knows that an occluded object continues to exist. Now we secretly swap the occluded object for the other image in its set. By comparing how surprised the infants are, or how long they gaze at the revealed object, we can measure how different the infants think those objects are, and whether they think the object has changed so dramatically that it violates their assumption and thus belongs to a different conceptual class. If infants gaze reliably longer in the second set, it suggests that from the very beginning of our lives we tend to differentiate objects into classes by their gradient distribution (at least more than by color alone). If that is the case, it in some way supports why HoG is successful in object recognition tasks.
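A quick sanity check of that HoG intuition, sketched below with synthetic patches standing in for the apples (the images, parameters, and distance measure are my own illustrative choices, not from any experiment):

```python
import numpy as np
from skimage.feature import hog

# Synthetic 64x64 "apples": two solid patches with different gray levels
# (stand-ins for the red and green apples) and one vertically striped patch.
solid_a = np.full((64, 64), 0.3)
solid_b = np.full((64, 64), 0.7)
striped = np.tile(np.r_[np.zeros(4), np.ones(4)], 8)[None, :].repeat(64, axis=0)

def hog_vec(img):
    # Standard HoG descriptor; parameter values are arbitrary but shared by all patches.
    return hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

# Set 1 (color change only): the HoG descriptors are essentially identical.
print(np.linalg.norm(hog_vec(solid_a) - hog_vec(solid_b)))
# Set 2 (texture change): the HoG descriptors differ dramatically.
print(np.linalg.norm(hog_vec(solid_a) - hog_vec(striped)))
```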
[1] R. Baillargeon, E. Spelke, and S. Wasserman, "Object permanence in five-month-old infants," Cognition, 20 (1985), 191-208.
Yang's Blog
THIS IS THE BLOG OF YEZHOU YANG, A PH.D. STUDENT AT THE UNIVERSITY OF MARYLAND, COLLEGE PARK. MOST OF THE POSTS HERE ARE MY STUDY AND RESEARCH NOTES, KEPT FOR QUICK ONLINE ACCESS. OCCASIONALLY, MY STUPID IDEAS WILL ALSO BE SHARED HERE.
Sunday, November 4, 2012
Saturday, November 19, 2011
Two years later, I went back to Gestalt Theory
Two years later, reading the chapter on the Helmholtz Principle again, I am astonished by this statement once more.
"We immediately perceive whatever could not happen by chance"
It gives us a simple, so simple, criterion for checking whether a visual event is meaningful: if an event could not have happened by chance, it is meaningful.
A whole system has been developed to measure meaningfulness in this sense, the so-called \( \varepsilon \)-meaningful events.
It then relates smoothly to several bounding results from theoretical computer science.
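In a nutshell (my paraphrase of the a-contrario framework, so the exact normalization may differ from the book's): an event \( e \) is declared meaningful when its expected number of accidental occurrences under a background noise model is small,
\[ \mathrm{NFA}(e) \;=\; N_{\mathrm{tests}} \cdot \mathbb{P}\big[\, e \text{ occurs under the noise model} \,\big] \;\le\; \varepsilon , \]
and the probability term is typically a binomial tail, which is exactly where those bounding results come in.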
I found that F. Cao is doing very interesting work along this line:
http://www.irisa.fr/vista/Equipe/People/Frederic.Cao.html
"We immediately perceive whatever could not happen by chance"
It gives us a simple, so simple, criteria to check whether a visual event is meaningful. If some event does not happen by chance, it is meaningful.
There exists a whole system developed to measure meaningfulness in this sense, aka \[ \delta \] meaningful.
It then relates to several bounding theories from Theoretical Computer Science smoothly.
I found F. Cao is doing very interesting work under this trend:
http://www.irisa.fr/vista/Equipe/People/Frederic.Cao.html
Wednesday, April 20, 2011
Virtual layer of ROS
The Operating Systems class I am TAing has reached the file system part, in which a virtual file system layer above the different concrete file systems is introduced.
On the drive to Qualcomm New Jersey, a question came to my mind: is there a virtual action manipulation layer between our mind and our motor system? For example, the mind may issue a command: grab the cup on the desk. After interpretation by the virtual action manipulation layer, the same command is translated into several different low-level commands for the different motor systems:
To the eyes, the command becomes Eye_Grabcup, which in practice means focusing the gaze on the cup;
To the legs, the command becomes Leg_Grabcup, which may drive the legs to walk towards the cup;
And to the hands, the command becomes Hand_Grabcup, which means reaching out and grabbing the cup!
Now, when we talk about the Vision Executive and the Language Executive, is it reasonable to put a virtual action manipulation layer over them as well? For example, "focus" is an actual action of the VE, and "search related knowledge" is an action of the LE, yet both of them could share the same virtual-layer function: "Attend!"
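To make the analogy concrete, here is a toy sketch of such a dispatch layer (the class and command names are hypothetical, made up for this post; this is neither ROS nor any real motor-control API):

```python
from abc import ABC, abstractmethod


class Actuator(ABC):
    """One concrete subsystem hidden behind the virtual layer."""

    @abstractmethod
    def handle(self, command: str, target: str) -> str:
        ...


class Eye(Actuator):
    def handle(self, command: str, target: str) -> str:
        return f"Eye_{command}{target}: focus the gaze on the {target}"


class Leg(Actuator):
    def handle(self, command: str, target: str) -> str:
        return f"Leg_{command}{target}: walk towards the {target}"


class Hand(Actuator):
    def handle(self, command: str, target: str) -> str:
        return f"Hand_{command}{target}: reach out and grab the {target}"


class VirtualActionLayer:
    """Translates one high-level command into per-subsystem low-level commands,
    the way a VFS translates one read() into filesystem-specific operations."""

    def __init__(self, actuators):
        self.actuators = actuators

    def dispatch(self, command: str, target: str):
        return [a.handle(command, target) for a in self.actuators]


layer = VirtualActionLayer([Eye(), Leg(), Hand()])
for low_level in layer.dispatch("Grab", "cup"):
    print(low_level)   # Eye_Grabcup..., Leg_Grabcup..., Hand_Grabcup...
```

The same layer could sit over the Vision Executive and the Language Executive, with "Attend" dispatching to "focus" and "search related knowledge" respectively.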
Finally, we completed the Qualcomm presentation~~ It always feels good to have a chance to share our ideas and preliminary results. The researchers from Qualcomm and the other students attending the finals also gave us a lot of valuable suggestions. We appreciate every comment, and there is still a loooot of work ahead~~
Thursday, April 14, 2011
Something about TA
I am TAing Operating Systems this semester.
Every time I saw students struggling with and being tortured by the projects (such as implementing a paging system or a file system on the toy OS GeekOS), I seriously felt the difference between the US education system and the Chinese one. Undergraduate students here are soooo busy working on all kinds of projects, which grants them first-hand experience in system implementation, while in China the OS class is more about the "concepts".
Even if you know all the "concepts", you still won't understand how an OS is actually implemented. The only "projects" there are peeking into an old version of the Linux kernel, never actually implementing a single function... If we think about it a step deeper, what we learned is how to "copy", not how to "create". That is a huge difference.
BTW, playing around with GeekOS is interesting (as long as you don't actually have to implement those headache-inducing projects).
Friday, April 1, 2011
Michael A. Arbib's talk
Michael A. Arbib
Template construction grammar and the generation of descriptions of visual scenes.
At the beginning of the lecture, Prof. Arbib showed us a static image of three women racing; one has fallen down, and each of them has only one natural leg (the other is prosthetic). He showed the image for about five seconds and asked people to describe it. The interesting thing is that people tend to attend to the parts of the image in a hierarchy, following an attention mechanism.
In other words, people first notice the women racing, then "ah, someone fell down", and then "ah! they all have a prosthetic leg!".
Then he introduced the construction grammar, which basically works like this: whenever an object or action is observed, a node is automatically added to a graph structure; once every salient thing and event has been perceived, the structure is parsed upward to higher levels until a sentence or description is generated.
For example, suppose a woman punches a man in the face. If you attend to the woman's fist first and then to the man's face, the description tends to be: "a fist punches the man." On the contrary, if you attend to the man's face before the fist, the description tends to be: "a man is hit by a fist."
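A toy illustration of that attention-order effect (these role names and templates are my own invention, not Prof. Arbib's actual Template Construction Grammar):

```python
# Whichever node is attended first decides which construction template wins,
# yielding an active or a passive description. Purely illustrative.
def describe(attended_nodes):
    """attended_nodes: (role, word) pairs in the order they were attended."""
    words = dict(attended_nodes)
    if attended_nodes[0][0] == "agent_part":      # fist attended first -> active voice
        return f"a {words['agent_part']} punches the {words['patient']}"
    return f"a {words['patient']} is hit by a {words['agent_part']}"  # face first -> passive


print(describe([("agent_part", "fist"), ("patient", "man")]))  # a fist punches the man
print(describe([("patient", "man"), ("agent_part", "fist")]))  # a man is hit by a fist
```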
And all of this construction grammar generation is based on a strong assumption: the vision system gives perfect output.
***********************************************************************************
Now back down to earth: at least for now, vision systems are far from giving perfect output. All we can do is assign some probability that an object or action exists in the image or video. Then what happens to this grammar? It becomes probabilisticized (I made that word up....)! Just like what we did for the EMNLP paper, we take noisy vision output as input and use a tweaked HMM to generate the most likely description of the image. I believe that if we can combine the uncertainty of the vision output with the language constructions Prof. Arbib introduced here, a robust scene description generator is not far away.
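For flavor, here is a minimal Viterbi-style sketch of that decoding idea; the states, transition matrix, and detector scores below are invented for illustration and have nothing to do with the actual EMNLP system:

```python
import numpy as np

# Toy HMM decoding over noisy vision output. States are candidate words;
# "emissions" are per-step detector confidences (all numbers made up).
states = ["woman", "fist", "man"]
start = np.array([0.5, 0.3, 0.2])
trans = np.array([[0.1, 0.6, 0.3],
                  [0.2, 0.1, 0.7],
                  [0.4, 0.4, 0.2]])
emissions = np.array([[0.7, 0.2, 0.1],   # step 1: detector favors "woman"
                      [0.2, 0.6, 0.2],   # step 2: detector favors "fist"
                      [0.1, 0.2, 0.7]])  # step 3: detector favors "man"

def viterbi(start, trans, emissions):
    T, S = emissions.shape
    logp = np.log(start) + np.log(emissions[0])
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = logp[:, None] + np.log(trans) + np.log(emissions[t])[None, :]
        back[t] = scores.argmax(axis=0)     # best previous state for each current state
        logp = scores.max(axis=0)
    path = [int(logp.argmax())]
    for t in range(T - 1, 0, -1):           # backtrack the most likely sequence
        path.append(int(back[t, path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi(start, trans, emissions))     # ['woman', 'fist', 'man']
```

A real system would get the emission scores from object and action detectors and would need far more than three states, but the decoding step stays this simple.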
PS: Prof. Arbib is a really humorous senior professor:
He referred to the traditional "S -> NP VP ..." grammar as "cheerleader" grammar. Why? Give me an S! Give me an N! Give me a V!
He also said he is planning to retire in five years, although the number five is a constant~
^^
Saturday, March 26, 2011
Qualcomm Innovation Fellowship finalists
I feel extremely excited to have the chance to deliver a presentation and meet professors and PhD students from those top CS and EE departments. It always feels good to get an opportunity to share research ideas and results with the community. Having worked on language-and-vision stuff for almost a year, the beauty of combining these two main cognitive processes within a computational framework has so far kept me from getting bored with PhD life. So far so good.
QInF 2011 East Coast Finals: Bridgewater, NJ
| School | Students | Recommenders | Innovation Title |
|---|---|---|---|
| UMD | Ching L Teo, Yezhou Yang | Yiannis Aloimonos, Hal Daumé III | Robots Need Language: A computational model for the integration of vision, language and action |
| Princeton | Mohammed Shoaib, Kyong Ho Lee | Naveen Verma, Niraj K. Jha | Algorithm-driven Platforms for Low-energy Intelligent Biomedical Systems |
| MIT | Soheil Feizi, Georgios Angelopoulos | Muriel Medard, Vivek Goyal | Energy-Efficient Time-Stampless Adaptive Nonuniform Sampling |
| UMD | Timir Datta, Filiz Yesilkoy | Martin Peckerar, Pamela Abshire | An Ultra-low Power Infrared Scavenging Autonomous Silicon Micro-robot |
| MIT | Adam Marcus, Eugene Wu | Samuel Madden | Qurk: A Crowdsourced Database for Query Processing with People |
| Princeton | Zhen Xiang, Hao Xu | Peter Ramadge | A Cloud-based Low Bandwidth Machine Learning Service |
| Rutgers | Sejong Yoon, Shahriar Shariat | Vladimir Pavlovic | A Novel Multimodal Mobile Media Rank System for Faster Media Flow |
| Rutgers | Akash Baid, Tam Vu | Dipankar Raychaudhuri | NASCOR: Network Assisted Spectrum Coordination Service for Coexistence between Heterogeneous Radio Systems |
| UCB | Reza Naima, Pablo Paredes | John Canny | The Mobile Stress Platform: Detection, Feedback & Mitigation |
| MIT | Ahmed Kirmani, Andrea Colaco | Vivek K Goyal, Franco Wong | Single Pixel Depth Sensing and 3D Camera |
| MIT | Sushmit Goswami, Sungwon Chung | Joel Dawson, Anantha Chandrakasan | A Frequency Agile Architecture for Fully Integrated Radio Frontends |
- Venue: Qualcomm Research Center New Jersey
- Address: 500 Somerset Corporate Blvd, Bridgewater, NJ 08807
- Date: Tue Apr 1
We are planning to drive to NJ, visit Princeton, and enjoy the presentations. ^^
Friday, March 18, 2011
Hal's Legacy
I am reading the book "Hal's Legacy" (2001) and found something interesting on page 8, where the author tried to summarize Prof. Rosenfeld's vision of computer vision in one paragraph:
" Imagine, for example, a computer that could look at an arbitrary scene - anything from a sunset over a fishing village to Grand Central Station at rush hour - and produce a verbal description. This is a problem of overwhelming difficulty, relying as it does on finding solutions to both vision and language and then integrating them. I suspect that scene analysis will be one of the last cognitive tasks to be performed well by computers."
" Imagine, for example, a computer that could look at an arbitrary scene - anything from a sunset over a fishing village to Grand Central Station at rush hour - and produce a verbal description. This is a problem of overwhelming difficulty, relying as it does on finding solutions to both vision and language and then integrating them. I suspect that scene analysis will be one of the last cognitive tasks to be performed well by computers."
