I am fascinated by manipulating video with code. I decided to do a small project in which I could create a supercut based not on the words that are spoken, as is traditional, but on the visuals on screen. For example, what would happen if we could see all the "indoor" shots in a Wes Anderson movie mashed together? Would we see a pattern? Would it reveal a secret of his style? Could we learn something from watching all the shots with men in them and comparing them with all the shots with women in them? I decided to make a quick tool for this kind of exploration. Here is a quick demo video of how it works:
- The user enters a link to the video they want to analyze
- The video gets downloaded using the youtube-dl node package
- It is then read with Node's built-in fs module, using base64 encoding
- The video data is sent to Clarifai for analysis, and we get a response like this:
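I don't have the exact response handy, but Clarifai's video prediction output is roughly shaped like this (concept ids and values are illustrative, not real output):

```json
{
  "outputs": [{
    "data": {
      "frames": [{
        "frame_info": { "index": 0, "time": 0 },
        "data": {
          "concepts": [
            { "id": "ai_...", "name": "indoors", "value": 0.98 },
            { "id": "ai_...", "name": "people", "value": 0.95 }
          ]
        }
      }]
    }
  }]
}
```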
- The value is a confidence score. Clarifai gives concepts for every second of video, and "frame_info" gives the details of the timeframe. For the purposes of our application it becomes important to clean up this data: if a concept appears for just one second it might not be correct even if the confidence score was high, so we have to strike a balance between how many seconds a concept should persist and what confidence threshold to require.
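The cleanup step can be sketched as a simple filter over the frames. The function name and the threshold/duration defaults are my own assumptions, not the project's actual values:

```javascript
// For one concept, collect the seconds where its confidence clears a
// threshold, merge consecutive seconds into segments, and drop segments
// shorter than minDuration. Assumes frame_info.time is in seconds here;
// the real Clarifai API may report milliseconds.
function cleanConcept(frames, concept, threshold = 0.9, minDuration = 3) {
  // Seconds at which the concept appears with enough confidence.
  const times = frames
    .filter(f => f.data.concepts.some(c => c.name === concept && c.value >= threshold))
    .map(f => f.frame_info.time);

  // Merge consecutive seconds into { start, end } segments.
  const segments = [];
  for (const t of times) {
    const last = segments[segments.length - 1];
    if (last && t - last.end <= 1) {
      last.end = t;
    } else {
      segments.push({ start: t, end: t });
    }
  }

  // Keep only segments that persist long enough to be trustworthy.
  return segments.filter(s => s.end - s.start + 1 >= minDuration);
}
```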
- Once the data is cleaned up, we get a JSON file like so:
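Roughly this shape, mapping each concept to its surviving segments (the concept names and timestamps below are illustrative, not the actual project output):

```json
{
  "indoor": [
    { "start": 12, "end": 19 },
    { "start": 143, "end": 151 }
  ],
  "outdoor": [
    { "start": 54, "end": 70 }
  ]
}
```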
- Finally, once we have such a clean JSON file, we want to load the video and make it jump between the required timestamps. This is quite a challenge too and requires a good understanding of the HTML5 video element. I used the 'seeked' and 'timeupdate' events to manage the skipping.
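One way the skipping can be sketched: a pure helper decides which segment the playhead should be in, and a 'timeupdate' listener seeks forward when playback drifts outside a segment. The helper and element lookup are my own sketch, not the project's exact code:

```javascript
// Given cleaned segments sorted by start time and the current playback
// time, return the segment we should be in (or jump to next), or null
// if playback is past the last segment.
function currentSegment(segments, t) {
  for (const s of segments) {
    if (t < s.start) return s; // before this segment: jump to it
    if (t <= s.end) return s;  // inside this segment: keep playing
  }
  return null;                 // past the last segment
}

// Browser-only wiring, guarded so the helper stays testable in Node.
if (typeof document !== 'undefined') {
  const video = document.querySelector('video'); // hypothetical element
  const segments = [{ start: 12, end: 19 }, { start: 54, end: 70 }];

  video.addEventListener('timeupdate', () => {
    const seg = currentSegment(segments, video.currentTime);
    if (!seg) {
      video.pause();                     // no segments left to play
    } else if (video.currentTime < seg.start) {
      video.currentTime = seg.start;     // skip ahead to the next segment
    }
  });
}
```

The 'seeked' event can additionally be used to know when the jump has actually landed before resuming playback.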
All the code for this project is available here: https://github.com/kshivanku/whatifvids