"title"=>"Hey Gemini, explain me these pictures — in bash ",
"summary"=>nil,
"content"=>"
Hey Gemini, explain me these pictures — in bash 🧑🏽💻
And… it’s multimodal! That means you can combine images and text in your questions (the output is text only, for now 😉).
Note: all the code snippets (and images) in here are available in my GitHub repo.
Let’s get started!
Hey Gemini, describe what you see...
Let’s start with an album by my favourite band:
$ ./gemini-generic.sh images/genesis-selling-england.jpg Describe what you see
# 🤌 QUESTION: Describe what you see
# 🌡️ TEMPERATURE: 0.2
# 👀 Examining image images/genesis-selling-england.jpg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 96x96, segment length 16, baseline, precision 8, 536x528, components 3.
# ♊ Gemini no Saga answer for you:
The cover of Genesis album Selling England by the Pound features a painting
by British artist Paul Whitehead. The painting depicts a group of people in
a park, with a man sleeping on a bench in the foreground. The people are all
wearing clothes from the 1920s or 1930s, and the painting has a nostalgic,
almost surreal feel to it. The colors are muted and the figures are slightly
blurred, which gives the painting a dreamlike quality. The painting is also
full of symbolism, with the sleeping man representing England and the people
around him representing the different aspects of English society. The painting
has been interpreted in many different ways, but it is generally seen as a
commentary on the state of England in the 1970s.
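A quick Google search shows that Paul Whitehead did paint several Genesis covers, although the cover of Selling England by the Pound, one of my favourite albums of all time, is credited to Betty Swanwick. If you love Genesis too and want to see me play Firth of Fifth, please check my YouTube video.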
Let’s try this cat now:
Gemini: A cat is walking in the snow. The cat is brown and white and has a long tail. The cat is looking at the camera. The snow is white and is covering the ground.
How did I do it?
If you check my bash script (gemini-generic.sh), you can see the request looks like this:
data=$(_base64_encode_mac_or_linux "$IMAGE") # Mac or Linux should both work!
# ...
cat > "$REQUEST_FILE" <<EOF
{"contents": {
  "role": "USER",
  "parts": [
    {"text": "$QUESTION"},
    {"inline_data": {
      "data": "$data",
      "mime_type": "image/jpeg"}}]
  }
}
EOF
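Since $QUESTION is pasted straight into the request body, a quote or a backslash in the question can break the JSON. A minimal sketch of one way to guard against that follows; json_escape is a hypothetical helper of mine, not something the original script defines, and it only covers the most common characters.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the original script): escape a string so it can
# sit between double quotes in a JSON document.
json_escape() {
  local s="$1"
  s=${s//\\/\\\\}      # backslash first, so later escapes are not doubled
  s=${s//\"/\\\"}      # double quote
  s=${s//$'\n'/\\n}    # newline
  s=${s//$'\t'/\\t}    # tab
  printf '%s' "$s"
}

QUESTION='Do you recognize the "quote" on this t-shirt?'
printf '{"text": "%s"}\n' "$(json_escape "$QUESTION")"
```

The backslash is handled first on purpose: doing it last would double the backslashes introduced by the other replacements.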
The hardest part is actually the base64 conversion, and making it work on both my Mac and my Linux box (as our grandparents say in Italy: “You wanted /bin/bash? Now you pedal!”).
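For the curious, here is a minimal sketch of a portable encoder. It is a hypothetical stand-in, not the actual _base64_encode_mac_or_linux from the repo, and it assumes that reading from stdin and stripping newlines is enough on both platforms.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the author's _base64_encode_mac_or_linux helper.
base64_one_line() {
  # Reading from stdin works with both GNU (Linux) and BSD (macOS) base64;
  # tr removes the line wrapping that GNU base64 adds by default.
  base64 < "$1" | tr -d '\n'
}

sample=$(mktemp)
printf 'hello' > "$sample"
base64_one_line "$sample"; echo   # prints aGVsbG8=
rm -f "$sample"
```

Avoiding flags such as -w or -i sidesteps the differences between the two base64 implementations altogether.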
An Italian image, explained in Italian
./gemini-explain-image.sh images/italian-town.jpg
[..]
This is a view of the city of Trento, Italy from the Buonconsiglio Castle.
Let’s ask the same in Italian:
$ GENERATE_MP3=true ./gemini-explain-image-italian.sh images/italian-town.jpg
# 🤌 QUESTION: Descrivimi cosa vedi in questa immagine
# 🌡️ TEMPERATURE: 0.2
# 👀 Examining image images/italian-town.jpg: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, Exif Standard: [TIFF image data, little-endian, direntries=1, software=Google], baseline, precision 8, 926x1230, components 3.
# ♊ Gemini no Saga answer for you:
La foto mostra una loggia con delle colonne in pietra che incorniciano la vista di una città.
La città è circondata da montagne e si possono vedere i tetti delle case e le torri delle chiese.
Il cielo è azzurro e ci sono delle nuvole bianche.
# TTS_LANG: it-IT
Written .tmp.tts-output.json. curl_ret=0
t.audio.encoded: ASCII text, with very long lines (65536), with no line terminators
t.mp3: MPEG ADTS, layer III, v2, 32 kbps, 24 kHz, Monaural
t.mp3: MPEG ADTS, layer III, v2, 32 kbps, 24 kHz, Monaural
All good. MP3 created: 't.La foto mostra una loggia con delle colonne in pie.mp3'
Gemini (🇮🇹), translated: The photo shows a loggia with stone columns that frame the view of a city. The city is surrounded by mountains, and you can see the roofs of the houses and the towers of the churches. The sky is blue and there are white clouds.
As you can see, the Italian answer is more verbose (my wife says the same about me!) but it doesn’t mention the Buonconsiglio Castle. Interesting! I presume the model has seen less Italian training material than English.
Makes sense.
Comparing two images
How about comparing two images?
Some sample code is in my Makefile for convenience. Try this:
ricc@derek:~/git/genai-googlecloud-scripts/07-bash-gemini-multimodal-jam$ 🐼 !mak
make compare-two-geminis
./gemini-generic-two-pics.sh images/gemini-constellation.png images/saga-blue-hair.jpg
♊️ Question: Can you highlight similarity and differences between the two? Also, do you recognize the same person in both of them?
👀 Examining image1 images/gemini-constellation.png: images/gemini-constellation.png: PNG image data, 1675 x 1302, 8-bit/color RGBA, non-interlaced.
👀 Examining image2 images/saga-blue-hair.jpg: images/saga-blue-hair.jpg: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 193x261, components 3.
♊️ Describing attached image:
The two images are of the constellation Gemini and the anime character Gemini Saga. The constellation is said to represent the twins Castor and Pollux, while the character Gemini Saga is a Gemini Saint in the anime series Saint Seiya. Both images depict two figures that are connected to each other. The constellation is made up of stars, while the character is a human.
Today we’ve learnt something we certainly didn’t know: a constellation is made of stars, while the guy on the right is human 😃
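Jokes aside, sending two pictures mostly means adding a second inline_data part, each with its own MIME type (one of the images here is a PNG, the other a JPEG). The sketch below is my guess at the shape of such a request; mime_type_for and two_image_request are hypothetical names, and the real gemini-generic-two-pics.sh may well do it differently.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the real gemini-generic-two-pics.sh may differ.

# Guess the MIME type from the file extension (only JPEG and PNG handled).
mime_type_for() {
  case "${1##*.}" in
    jpg|jpeg|JPG|JPEG) echo "image/jpeg" ;;
    png|PNG)           echo "image/png" ;;
    *)                 echo "application/octet-stream" ;;
  esac
}

# Print a request body with one text part and two inline images.
# Args: question, mime1, base64-data1, mime2, base64-data2
two_image_request() {
  cat <<EOF
{"contents": {
  "role": "USER",
  "parts": [
    {"text": "$1"},
    {"inline_data": {"mime_type": "$2", "data": "$3"}},
    {"inline_data": {"mime_type": "$4", "data": "$5"}}
  ]
}}
EOF
}

# Placeholders stand in for the real base64 payloads.
two_image_request "Do you recognize the same person in both?" \
  "$(mime_type_for images/gemini-constellation.png)" "BASE64_OF_IMAGE_1" \
  "$(mime_type_for images/saga-blue-hair.jpg)" "BASE64_OF_IMAGE_2"
```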
Introducing Audio!
Why don’t we throw some audio in the mix?
I’ve added a tts.sh script, which creates an MP3 out of an English (or Italian!) text passed as command-line arguments. Convenient, huh?
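The interesting bit of a script like this is turning the API response back into a binary file. Below is a rough sketch of that step, assuming the response is JSON with a base64 audioContent field on a single line; extract_audio is a hypothetical name, and the real tts.sh may do this differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the decode step; the real tts.sh may differ.
# Assumes a response shaped like {"audioContent": "<base64 audio>"}.
extract_audio() {
  local json_file="$1" out_file="$2"
  # Pull out the base64 string, then decode it into the output file.
  # --decode is accepted by both GNU and BSD base64.
  sed -n 's/.*"audioContent"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$json_file" \
    | base64 --decode > "$out_file"
}

# Demo with a fake response whose payload is the text "hello", not real audio.
tmpdir=$(mktemp -d)
printf '{\n  "audioContent": "aGVsbG8="\n}\n' > "$tmpdir/response.json"
extract_audio "$tmpdir/response.json" "$tmpdir/out.bin"
cat "$tmpdir/out.bin"; echo   # prints hello
rm -rf "$tmpdir"
```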
A few weeks ago my kids went to visit Santa Claus. That’s the American Santa: they’ll tell you there’s another one who speaks German, called Samiklaus, and they still can’t decide if it’s the same guy. My kids are X and Y years old. How old? Let’s ask Gemini!
$ make age-test
# equivalent to:
$ GENERATE_MP3=true ./gemini-generic.sh images/ricc-family-with-santa.jpg Tell me the age of the people you see, from left to right.
# 🤌 QUESTION: Tell me the age of the people you see, from left to right.
# 🌡️ TEMPERATURE: 0.2
# 👀 Examining image images/ricc-family-with-santa.jpg: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, Exif Standard: [TIFF image data, little-endian, direntries=3, software=Google], baseline, precision 8, 1164x826, components 3.
# ♊ Gemini no Saga answer for you:
1. 30-35
2. 2-3
3. 40-45
4. 2-3
5. 60-65
[..]
All good. MP3 created [..]
I’m actually 46, so technically the 40-45 estimate is a mistake, but as a human interaction I’ll count it as a success 😉
Now this is not the best of MP3s, but it works! Check out ricc-family-with-santa.jpg.mp3.
You can also find Trento in Italian here: italian-town.jpg.mp3.
Something useful now: understand a diagram!
A colleague told me yesterday that the power of Gemini is that “he” understands not only text and images, but also the complex relationships between items in a diagram. This morning my Acrux earphones wouldn’t work, so I decided to find out how they work. I downloaded the user manual, uploaded it to Gemini as an image, and asked a few questions.
From the UI, this is easy peasy. Upload the image and ask a question:
And what about bash? Let’s do the same from the CLI:
$ make read-instruction-manual-for-me
[..]
./gemini-generic.sh images/instruction-manuals/Acrux-User-Manual-4700503.png '1. How do i TURN it on? 2. Where is the power button located? 3. Is this the one called ANC?'
# 🤌 QUESTION: 1. How do i TURN it on? 2. Where is the power button located? 3. Is this the one called ANC?
# 🌡️ TEMPERATURE: 0.2
# 👀 Examining image images/instruction-manuals/Acrux-User-Manual-4700503.png: PNG image data, 1664 x 929, 8-bit/color RGBA, non-interlaced.
# ♊ Gemini no Saga answer for you:
1. Long press the power button for 2 seconds.
2. The power button is located on the right earcup.
3. Yes, this is the one called ANC.
# Note: No mp3 file generated (use GENERATE_MP3=true to generate one)
My question (I actually asked the questions one by one, so think of it as an avalanche-chat):
- How do i TURN it on?
- Where is the power button located?
- Is this the one called ANC?
# ♊ Gemini answer:
1. Long press the power button for 2 seconds.
2. The power button is located on the right earcup.
3. Yes, this is the one called ANC.
Bingo!
An unexpected Game of Thrones plot twist
This is the avatar I use in Google. I’m really curious to see what Gemini can infer from it, especially since I used some filters.
$ ./gemini-explain-image.sh images/ricc-logo.png
[..]
This is a photo of a man standing behind a Google Cloud Platform cutout.
The man is smiling and wearing a shirt that says, “That’s what I do,
I drink and I know things.” The background is a brick wall with blue
and white accents.
And I thought: of course, this is my favourite Game of Thrones t-shirt! Let’s see if he knows:
Q: Do you recognize the quote in this person tshirt?
$ GENERATE_MP3=true ./gemini-generic.sh images/ricc-logo.png Do you recognize the quote in this person tshirt
[..]
"That's what I do, I drink and I know things" is a quote from the TV show Game of Thrones,
said by the character Tyrion Lannister.
And you also have the MP3 if you wish: images/ricc-logo.png.mp3
Chapeau, Gemini!
Conclusions
Calling Gemini is easy, fun and powerful!
- Easy. If you can invoke Gemini via curl, imagine how easy it is to invoke it from your favourite language! Try out our SDKs for Python, Node, Java and Go. For everything else there’s REST/cURL!
- Fun. Look at how many use cases I’ve shown on this page: comparing two images, asking questions about an image, recognizing a written quote, guessing people’s ages... the possibilities are endless.
- Powerful. Multimodal is a game changer. And from what I hear, more modes are coming. 😉
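To make the "via curl" point concrete, here is a rough sketch of what the call itself could look like. It assumes the Vertex AI REST endpoint and gcloud for authentication; PROJECT_ID, LOCATION and MODEL are placeholders, gemini_url and call_gemini are hypothetical names, and the scripts in the repo may build the request differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: PROJECT_ID, LOCATION and MODEL are placeholders,
# and the endpoint shape assumes the Vertex AI REST API.
gemini_url() {
  local project="$1" location="$2" model="$3"
  echo "https://${location}-aiplatform.googleapis.com/v1/projects/${project}/locations/${location}/publishers/google/models/${model}:generateContent"
}

# Send a prepared JSON request file; requires gcloud and curl to be installed.
# Defined here for illustration only, it is not called below.
call_gemini() {
  local request_file="$1"
  curl -s -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @"$request_file" \
    "$(gemini_url "$PROJECT_ID" "$LOCATION" "$MODEL")"
}

# Only print the URL, so the sketch runs without credentials or network access.
gemini_url "my-project" "us-central1" "gemini-pro-vision"
```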
Did you like this? Want more of this? Write it in the comments!
Riccardo
Hey Gemini, explain me these pictures — in bash 🧑🏽💻 was originally published in DevOps.dev on Medium, where people are continuing the conversation by highlighting and responding to this story.
","author"=>"Riccardo Carlesso",
"link"=>"https://blog.devops.dev/hey-gemini-explain-me-these-pictures-in-bash-06c03d0d0512?source=rss-b5293b96912f------2",
"published_date"=>Thu, 14 Dec 2023 15:16:16.000000000 UTC +00:00,
"image_url"=>nil,
"feed_url"=>"https://blog.devops.dev/hey-gemini-explain-me-these-pictures-in-bash-06c03d0d0512?source=rss-b5293b96912f------2",
"language"=>nil,
"active"=>true,
"ricc_source"=>"feedjira::v1",
"created_at"=>Sun, 31 Mar 2024 09:22:22.307697000 UTC +00:00,
"updated_at"=>Mon, 21 Oct 2024 15:33:53.788239000 UTC +00:00,
"newspaper"=>"Riccardo Carlesso - Medium",
"macro_region"=>"Blogs"}