bobby_dreamer

Git Theory - 5 - Internals

notes, git (6 min read)

# Git Architecture

Git maintains two primary data structures:

  • Index
    • Stores information about the working directory and the changes staged for the next commit
  • Object Database
    • Blobs (files)
      • Stored in .git/objects
      • All files are stored as blobs
      • A blob holds only the file's data; it contains no metadata, not even the file name.
      • A blob's name is simply the hash of the data it contains (prefixed with a short header)
    • Trees
      • Represents one level of directory information. It can be pictured as a simple table.
      • Records object details such as hash ID, mode, type and filename. These can be pictured as the columns of that table.
      • The only types of object it can contain are blobs and trees (sub-folders).
    • Commits
      • One object for every commit
      • Contains the hash of the parent commit(s), the name of the author, the time of the commit, and the hash of the current tree
    • Tags
      • Human-readable name for a commit

To use disk space and network bandwidth efficiently, git compresses objects and stores them in pack-files, which are also placed in the .git/objects directory.

Put simply, just remember:

  • Git stores the content of your files as blob objects
  • Your folders become tree objects, which contain blob objects (files) and other tree objects (sub-folders)
  • A commit is a type of object that always points to a tree. Every commit creates a commit metadata object and, unless an identical tree already exists, a new top-level tree object.
  • Branches are pointers to commit metadata objects.

# SHA1 hash

  • SHA1 values are 160 bits (20 bytes), represented as 40 hexadecimal characters.

  • Git uses the SHA1 hash of the content as the file name. The hash is 40 characters long: the first 2 characters become the folder name and the remaining 38 characters the file name inside the .git/objects/ directory.

    • For example, if 8ab686eafeb1f44702738c8b0f24f2567c36da6d is the hash of the content, it is stored as .git/objects/8a/b686eafeb1f44702738c8b0f24f2567c36da6d.
  • It is considered globally unique because there are 2^160, or roughly 10^48, possible SHA1 hashes (i.e., 1 with 48 zeros after it)

  • An important characteristic of SHA1 hash computation is that it always produces the same hash for identical content, regardless of where the content is. In other words, the same file content in different directories, and even on different machines, yields the exact same SHA1 hash ID. Thus, the SHA1 hash ID of a file is effectively a globally unique identifier.

  • Any change to the file changes its SHA1 hash, thus creating a new version of the file.

  • A collision is very rare but possible (expected only after hashing about 2^80 random blobs)

  • A SHA1 hash can point to a blob, a tree, a commit or an annotated tag.
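
The 2 + 38 split and the size of the hash space are easy to check with a few lines of Python. This is only a sketch: the helper name loose_object_path is made up for illustration and is not part of Git.

```python
from pathlib import PurePosixPath


def loose_object_path(object_id: str) -> PurePosixPath:
    """Map a 40-character object ID to its path under .git/objects.

    Hypothetical helper for illustration; Git does this internally.
    """
    if len(object_id) != 40 or any(c not in "0123456789abcdef" for c in object_id):
        raise ValueError("expected 40 lowercase hex characters")
    # The first 2 characters name the folder, the remaining 38 name the file.
    return PurePosixPath(".git/objects") / object_id[:2] / object_id[2:]


path = loose_object_path("8ab686eafeb1f44702738c8b0f24f2567c36da6d")
print(path)  # .git/objects/8a/b686eafeb1f44702738c8b0f24f2567c36da6d

# 160 bits give 2**160 possible values, a 49-digit number (about 1.46 x 10^48).
print(len(str(2 ** 160)))  # 49
```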

# How git hashes content

Sushanth@Sushanth-VAIO MINGW64 /d/GITs/test1 (master)
$ echo "Hello, World!" | git hash-object --stdin
8ab686eafeb1f44702738c8b0f24f2567c36da6d
Git hash-object
  • This command computes the hash of the content and can optionally write it to the object database
  • It is one of the plumbing commands
Git has two types of commands
  • Porcelain – User-facing commands/functions
  • Plumbing – Low-level commands/functions
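
To see what hash-object is doing, here is a minimal Python sketch, assuming the standard blob layout: the word "blob", a space, the content length, a NUL byte, then the content. The function name git_blob_id is invented for this example.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the ID Git assigns to a blob with this content (illustrative helper)."""
    # Git hashes a small header plus the content, not the content alone.
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# echo appends a newline, so the hashed content is "Hello, World!\n".
print(git_blob_id(b"Hello, World!\n"))  # 8ab686eafeb1f44702738c8b0f24f2567c36da6d
```

The header is why a plain sha1sum of the file gives a different value than git hash-object.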

# Contents of .git directory

[Figure: Contents of .git directory]

# Git Objects

Git is a simple key-value data store: for any content you add to a git repo, you get back a unique key. Later, that object/content can be retrieved using the key.

Git has 4 types of object:

  • blob : Binary large object. In git terms, all files containing data are called blobs.
  • tree : Trees are like folders, which can contain files or sub-folders. So in git terms, trees can contain blobs or other trees. Each entry in a tree object consists of the SHA-1 hash of a blob or subtree together with its mode, type, and filename.
  • commit : When you perform a commit, the commit object stores details like tree, parent(s), author, committer, commit date and message. The tree object references both newly updated objects and objects that haven't changed. For example, if a file is updated and the change is committed, git creates a new tree object containing links to the newly updated file and to the other objects that haven't changed.
  • tag : A tag just refers to a commit point. This is much easier than remembering a commit hash. It can be thought of as a bookmark or a user-friendly commit name, usually a version number.
Initial commit

[Figure: Git Objects – Initial Commit]

Second commit

[Figure: Git Objects – Second Commit]

Overall, this is the structure of git's internal objects:

[Figure: Git Objects]

Objects, Hashes & Blobs

For testing purposes, we can either create a new file or create an object directly, like below:

$ echo 'test content' | git hash-object -w --stdin
d670460b4b4aece5915caf5c68d12f560a9fe3e4
  • git hash-object would take the content you handed to it and merely return the unique key that would be used to store it in your Git database.
  • The -w option then tells the command to not simply return the key, but to write that object to the database.
  • Finally, the --stdin option tells git hash-object to get the content to be processed from stdin; otherwise, the command would expect a filename argument at the end of the command containing the content to be used.
    $ git hash-object -w test.txt
    83baae61804e65cc73a7201a7252750c76066a30
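
What -w adds on top of hashing can be sketched in Python. The assumption here is the loose-object format: header plus content, zlib-compressed, written under a folder named after the first two characters of the hash. The write_blob helper and the temporary directory are illustrative only; this does not touch a real repository.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_blob(objects_dir: Path, content: bytes) -> str:
    """Store content as a loose blob under objects_dir and return its ID (sketch)."""
    store = b"blob %d\0" % len(content) + content
    object_id = hashlib.sha1(store).hexdigest()
    target = objects_dir / object_id[:2] / object_id[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is only ever stored once
        target.write_bytes(zlib.compress(store))
    return object_id


with tempfile.TemporaryDirectory() as tmp:
    objects_dir = Path(tmp) / "objects"  # stand-in for .git/objects
    object_id = write_blob(objects_dir, b"test content\n")
    stored = objects_dir / object_id[:2] / object_id[2:]
    restored = zlib.decompress(stored.read_bytes())

print(object_id)  # d670460b4b4aece5915caf5c68d12f560a9fe3e4
print(restored)   # b'blob 13\x00test content\n'
```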

All the git objects can be found in the .git/objects folder. Below, we are in a new repo, so it's empty.

Sushanth@Sushanth-VAIO MINGW64 /d/GITs/Internals (master)
$ find .git/objects
.git/objects
.git/objects/info
.git/objects/pack

Create a new file with content & commit

Sushanth@Sushanth-VAIO MINGW64 /d/GITs/Internals (master)
$ echo "Hello, World!" > HW.txt

Sushanth@Sushanth-VAIO MINGW64 /d/GITs/Internals (master)
$ git add .
warning: LF will be replaced by CRLF in HW.txt.
The file will have its original line endings in your working directory.

Sushanth@Sushanth-VAIO MINGW64 /d/GITs/Internals (master)
$ git commit -m "Initial Commit"
[master (root-commit) 003c678] Initial Commit
 1 file changed, 1 insertion(+)
 create mode 100644 HW.txt

Sushanth@Sushanth-VAIO MINGW64 /d/GITs/Internals (master)
$ git log
commit 003c6781e4475888b59f248e5e76d3334d278f99 (HEAD -> master)
Author: Sushanth Bobby Lloyds <bobby.dreamer@gmail.com>
Date: Mon Oct 12 22:34:54 2020 +0530

    Initial Commit

Note : In the commit output above, the number after create mode is the file mode of the object:

  • 100644 : Regular non-executable file
  • 100755 : Executable file
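
These modes are ordinary Unix octal mode values, so Python's stat module can decode them. A small sketch, nothing Git-specific:

```python
import stat

# Decode the two file modes Git records for regular files.
for mode in (0o100644, 0o100755):
    print(oct(mode), stat.filemode(mode), stat.S_ISREG(mode))

# 0o100644 -rw-r--r-- True
# 0o100755 -rwxr-xr-x True
```

The leading 100 marks a regular file; the last three digits are the familiar owner/group/other permission bits.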

Now objects directory has 3 files – 3 Objects. They are Commit, Tree & Blob

Sushanth@Sushanth-VAIO MINGW64 /d/GITs/Internals (master)
$ find .git/objects
.git/objects
.git/objects/00
.git/objects/00/3c6781e4475888b59f248e5e76d3334d278f99
.git/objects/8a
.git/objects/8a/b686eafeb1f44702738c8b0f24f2567c36da6d
.git/objects/ee
.git/objects/ee/929cd9cd862b204986cf94ab23853b4c98cb97
.git/objects/info
.git/objects/pack

Now let's map what is what

  • From git log we know the commit hash is 003c6781e4475888b59f248e5e76d3334d278f99

    .git/objects/00
    .git/objects/00/3c6781e4475888b59f248e5e76d3334d278f99
  • Using the command git ls-files -s, we can find the hash of the files

    Sushanth@Sushanth-VAIO MINGW64 /d/GITs/Internals (master)
    $ git ls-files -s
    100644 8ab686eafeb1f44702738c8b0f24f2567c36da6d 0 HW.txt

    Now we can say that the file is the one below

    .git/objects/8a
    .git/objects/8a/b686eafeb1f44702738c8b0f24f2567c36da6d
  • Now we can easily guess that the remaining one has to be the tree

    .git/objects/ee
    .git/objects/ee/929cd9cd862b204986cf94ab23853b4c98cb97

Instead of guessing, we can use git cat-file to find the type, content and size of an object from its hash.

git cat-file -t <hash>

  • Shows the type of the object

git cat-file -p <hash>

  • Pretty-prints the content of the object

git cat-file -s <hash>

  • Shows the size of the object
Knowing object type
Sushanth@Sushanth-VAIO MINGW64 /d/GITs/Internals (master)
$ git cat-file -t ee929
tree

Sushanth@Sushanth-VAIO MINGW64 /d/GITs/Internals (master)
$ git cat-file -t 8ab68
blob

Sushanth@Sushanth-VAIO MINGW64 /d/GITs/Internals (master)
$ git cat-file -t 003c6
commit
Knowing content in the object
Sushanth@Sushanth-VAIO MINGW64 /d/GITs/Internals (master)
$ git cat-file -p ee929
100644 blob 8ab686eafeb1f44702738c8b0f24f2567c36da6d HW.txt

Sushanth@Sushanth-VAIO MINGW64 /d/GITs/Internals (master)
$ git cat-file -p 8ab68
Hello, World!

Sushanth@Sushanth-VAIO MINGW64 /d/GITs/Internals (master)
$ git cat-file -p 003c6
tree ee929cd9cd862b204986cf94ab23853b4c98cb97
author Sushanth Bobby Lloyds <bobby.dreamer@gmail.com> 1602522294 +0530
committer Sushanth Bobby Lloyds <bobby.dreamer@gmail.com> 1602522294 +0530

Initial Commit
Knowing the size of the object
Sushanth@Sushanth-VAIO MINGW64 /d/GITs/Internals (master)
$ git cat-file -s ee929
34

Sushanth@Sushanth-VAIO MINGW64 /d/GITs/Internals (master)
$ git cat-file -s 8ab68
14

Sushanth@Sushanth-VAIO MINGW64 /d/GITs/Internals (master)
$ git cat-file -s 003c6
209
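
The sizes reported for the blob and the tree can be reproduced by hand. The sketch below assumes the usual binary layout of a tree entry: mode, space, file name, NUL byte, then the 20 raw bytes of the hash (not the 40 hex characters that cat-file -p prints).

```python
import hashlib

blob_id = "8ab686eafeb1f44702738c8b0f24f2567c36da6d"

# One tree entry: "<mode> <name>\0" followed by the 20-byte binary hash.
entry = b"100644 HW.txt\0" + bytes.fromhex(blob_id)
tree_body = entry  # this tree has a single entry

print(len(b"Hello, World!\n"))  # 14, the blob size
print(len(tree_body))           # 34, the tree size

# Hashing "tree <size>\0" + body gives the tree's ID; if the layout assumed
# here is right, it should match the tree hash shown by git cat-file.
tree_id = hashlib.sha1(b"tree %d\0" % len(tree_body) + tree_body).hexdigest()
print(tree_id)
```

So the size is the length of the object body without the header, before compression.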

Note : The cat command cannot be used to print the contents of an object file directly, because the objects are stored compressed.

[Figure: cat output of an object file]
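
The compression is plain zlib, so a loose object can be decoded outside of git as well. A minimal sketch, with parse_loose_object as a made-up helper name; the sample bytes are generated in the script rather than read from a real repository.

```python
import zlib


def parse_loose_object(raw: bytes):
    """Decompress a loose object and split it into (type, size, body). Sketch only."""
    data = zlib.decompress(raw)
    header, _, body = data.partition(b"\0")  # header ends at the first NUL byte
    kind, size = header.split(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return kind.decode(), int(size), body


# Simulate the bytes stored on disk for the HW.txt blob.
raw = zlib.compress(b"blob 14\0Hello, World!\n")
print(parse_loose_object(raw))  # ('blob', 14, b'Hello, World!\n')
```

With a real repository, raw would be the bytes of a file under .git/objects/xx/. Packed objects use a different format and are not covered by this sketch.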

Knowing more about tags

Let's see the basic difference between a light-weight tag and an annotated tag. In the example below,

  • v1.0 : Light-weight tag
  • v2.0 : Annotated tag
Sushanth@Sushanth-VAIO MINGW64 /d/GITs/test1 (master)
$ git cat-file -t v1.0
commit

Sushanth@Sushanth-VAIO MINGW64 /d/GITs/test1 (master)
$ git cat-file -t v2.0
tag

Let's pretty-print v1.0 and v2.0. Here you can see what's in both tags.

  • v1.0 : Just refers to the commit 0500b45. This tag is a sort of alias for that commit.
  • v2.0 : This is a tag object of its own, carrying additional information (tagger, date and message).
Sushanth@Sushanth-VAIO MINGW64 /d/GITs/test1 (master)
$ git lol
* 83ce55e - (HEAD -> master) Adding g.txt (1 year, 10 months ago) <Sushanth Bobby Lloyds>
* 804e1db - Added f.txt - Tag Testing (1 year, 10 months ago) <Sushanth Bobby Lloyds>
* 25dc023 - (tag: v2.0) Revert "f7.txt Update 1" (1 year, 11 months ago) <Sushanth Bobby Lloyds>
* d9e798a - f7.txt Update 2 (1 year, 11 months ago) <Sushanth Bobby Lloyds>
* 2f161e1 - f7.txt Update 1 (1 year, 11 months ago) <Sushanth Bobby Lloyds>
* 431be32 - f7.txt Initial (1 year, 11 months ago) <Sushanth Bobby Lloyds>
* 4d60b51 - Adding e.txt (1 year, 11 months ago) <Sushanth Bobby Lloyds>
* 34013d4 - Adding d.txt (1 year, 11 months ago) <Sushanth Bobby Lloyds>
* f891fb4 - Adding c.txt (1 year, 11 months ago) <Sushanth Bobby Lloyds>
* 3cee413 - Adding b.txt (1 year, 11 months ago) <Sushanth Bobby Lloyds>
* 080f76f - Adding a.txt (1 year, 11 months ago) <Sushanth Bobby Lloyds>
* 4fd2b57 - Revert "Adding f5.txt" (1 year, 11 months ago) <Sushanth Bobby Lloyds>
* 0500b45 - (tag: v1.0) Adding f6.txt (1 year, 11 months ago) <Sushanth Bobby Lloyds>
...

Sushanth@Sushanth-VAIO MINGW64 /d/GITs/test1 (master)
$ git cat-file -p v1.0
tree 10aa603d8807f825e542e351421d82784119b542
parent 5e01aa2fd80af3f7ac30013f41df6fee105f9c90
author Sushanth Bobby Lloyds <bobby.dreamer@gmail.com> 1542990955 +0530
committer Sushanth Bobby Lloyds <bobby.dreamer@gmail.com> 1542990955 +0530

Adding f6.txt

Sushanth@Sushanth-VAIO MINGW64 /d/GITs/test1 (master)
$ git cat-file -p v2.0
object 25dc023c91c8a2ae63b2f7d92f93b094347e9bec
type commit
tag v2.0
tagger Sushanth Bobby Lloyds <bobby.dreamer@gmail.com> 1543898924 +0530

O.O Version 2

To confirm that v1.0 just refers to the commit, we pretty-print the hash 0500b45 below

Sushanth@Sushanth-VAIO MINGW64 /d/GITs/test1 (master)
$ git cat-file -p 0500b45
tree 10aa603d8807f825e542e351421d82784119b542
parent 5e01aa2fd80af3f7ac30013f41df6fee105f9c90
author Sushanth Bobby Lloyds <bobby.dreamer@gmail.com> 1542990955 +0530
committer Sushanth Bobby Lloyds <bobby.dreamer@gmail.com> 1542990955 +0530

Adding f6.txt

The output is exactly the same. So for ease of use, you can reference the commit through the light-weight tag instead of the commit hash.

Knowing the content in merged commit

Let's look at this git graph.

  • HEAD is pointing to the tip of the branch, which is a commit created by a merge
  • The parents of the merge commit are:
    • branch : add b1988f4
    • branch : sub f608a17
Sushanth@Sushanth-VAIO MINGW64 /d/GITs/test3 (master)
$ git lol
* 335604e - (HEAD -> master) Merge branches 'add' and 'sub' (8 days ago) <Sushanth Bobby Lloyds>
|\
| * f608a17 - (sub) Update sub() (8 days ago) <Sushanth Bobby Lloyds>
* | b1988f4 - (add) Updated add() (8 days ago) <Sushanth Bobby Lloyds>
|/
* c35ab3e - Added both add and sub (8 days ago) <Sushanth Bobby Lloyds>
|\
| * 447bd6a - Added sub feature (8 days ago) <Sushanth Bobby Lloyds>
* | 8671031 - Added add feature (8 days ago) <Sushanth Bobby Lloyds>
|/
* 216acda - Initial commit (8 days ago) <Sushanth Bobby Lloyds>

Let's pretty-print commit 335604e (HEAD) and confirm the parents

Sushanth@Sushanth-VAIO MINGW64 /d/GITs/test3 (master)
$ git cat-file -p HEAD
tree ec0367d3524177d7b5350200217a227677b5b9e0
parent b1988f45e9ef5f1717762df1d39ea409eb63cb4d
parent f608a17f5bbccf7bc5b5154ec0c8299e03933364
author Sushanth Bobby Lloyds <bobby.dreamer@gmail.com> 1601826795 +0530
committer Sushanth Bobby Lloyds <bobby.dreamer@gmail.com> 1601826795 +0530

Merge branches 'add' and 'sub'
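
The pretty-printed commit is simple enough to parse by hand: header lines, a blank line, then the message. Here is a rough Python sketch; parse_commit is a hypothetical helper and it ignores multi-line headers such as signatures.

```python
def parse_commit(text: str) -> dict:
    """Split pretty-printed commit text into headers and message (simplified sketch)."""
    head, _, message = text.partition("\n\n")  # a blank line separates the two parts
    commit = {"parents": [], "message": message.strip()}
    for line in head.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)  # a merge commit has several parent lines
        else:
            commit[key] = value
    return commit


merge_commit = """\
tree ec0367d3524177d7b5350200217a227677b5b9e0
parent b1988f45e9ef5f1717762df1d39ea409eb63cb4d
parent f608a17f5bbccf7bc5b5154ec0c8299e03933364
author Sushanth Bobby Lloyds <bobby.dreamer@gmail.com> 1601826795 +0530
committer Sushanth Bobby Lloyds <bobby.dreamer@gmail.com> 1601826795 +0530

Merge branches 'add' and 'sub'
"""

info = parse_commit(merge_commit)
print(len(info["parents"]))  # 2
print(info["tree"])          # ec0367d3524177d7b5350200217a227677b5b9e0
```

A root commit would give an empty parents list, and an ordinary commit exactly one parent.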

git merge-base

  • It traces backwards from the two given points until it finds a commit that both branches share
  • This helps in analysis
Sushanth@Sushanth-VAIO MINGW64 /d/GITs/test3 (master)
$ git merge-base add sub
c35ab3e9d7611576ac203473d5c75946b336810b
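
The idea of walking back until both sides meet can be sketched on the small graph from the git lol output. The PARENTS table below was typed in by hand from that graph, using short hashes, and the search is a simplified breadth-first walk. Git's own merge-base algorithm is more careful about choosing the best common ancestor, so treat this as an illustration only.

```python
from collections import deque

# Commit -> parents, transcribed by hand from the graph (short hashes).
PARENTS = {
    "335604e": ["b1988f4", "f608a17"],
    "b1988f4": ["c35ab3e"],
    "f608a17": ["c35ab3e"],
    "c35ab3e": ["8671031", "447bd6a"],
    "8671031": ["216acda"],
    "447bd6a": ["216acda"],
    "216acda": [],
}


def ancestors(commit, parents):
    """Return the commit itself plus everything reachable through parent links."""
    seen = set()
    queue = deque([commit])
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(parents[current])
    return seen


def common_ancestor(a, b, parents):
    """Walk back from b and return the first commit also reachable from a."""
    reachable_from_a = ancestors(a, parents)
    visited = set()
    queue = deque([b])
    while queue:
        current = queue.popleft()
        if current in reachable_from_a:
            return current
        if current not in visited:
            visited.add(current)
            queue.extend(parents[current])
    return None


print(common_ancestor("b1988f4", "f608a17", PARENTS))  # c35ab3e
```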

# git rev-parse

  • Most git commands internally run "git rev-parse" to get the full SHA1 hash
  • It basically converts a short hash into the full hash
  • Below you can see rev-parse turn a 4-letter hash into the actual hash
  • It also resolves names such as tags and branches to their hashes

Let's take the example below for our rev-parse test.

Sushanth@Sushanth-VAIO MINGW64 /d/GITs/test1 (master)
$ git lol
* 83ce55e - (HEAD -> master) Adding g.txt (1 year, 10 months ago) <Sushanth Bobby Lloyds>
* 804e1db - Added f.txt - Tag Testing (1 year, 10 months ago) <Sushanth Bobby Lloyds>
* 25dc023 - (tag: v2.0) Revert "f7.txt Update 1" (1 year, 11 months ago) <Sushanth Bobby Lloyds>
...
* 0500b45 - (tag: v1.0) Adding f6.txt (1 year, 11 months ago) <Sushanth Bobby Lloyds>
...

git rev-parse takes a short hash and converts it to the full hash

Sushanth@Sushanth-VAIO MINGW64 /d/GITs/test1 (master)
$ git rev-parse 83ce
83ce55e716284a03e8bcf20d732f3df90799d77c
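
Conceptually, expanding a short hash is a unique-prefix lookup over the object IDs in the repository. The sketch below runs on a hand-made list of IDs copied from the outputs in this post; resolve_prefix is an invented name and this is not how git implements the lookup internally.

```python
# A few full object IDs copied from the test1 repository outputs.
KNOWN_IDS = [
    "83ce55e716284a03e8bcf20d732f3df90799d77c",
    "25dc023c91c8a2ae63b2f7d92f93b094347e9bec",
    "10aa603d8807f825e542e351421d82784119b542",
    "5e01aa2fd80af3f7ac30013f41df6fee105f9c90",
]


def resolve_prefix(prefix, object_ids):
    """Expand a short hash to the full ID, insisting on exactly one match."""
    if len(prefix) < 4:
        raise ValueError("git needs at least 4 characters of the hash")
    matches = [oid for oid in object_ids if oid.startswith(prefix.lower())]
    if not matches:
        raise KeyError(f"no object starts with {prefix!r}")
    if len(matches) > 1:
        raise ValueError(f"short hash {prefix!r} is ambiguous")
    return matches[0]


print(resolve_prefix("83ce", KNOWN_IDS))  # 83ce55e716284a03e8bcf20d732f3df90799d77c
```

In a large repository a 4-character prefix is often ambiguous, which is why git prints longer abbreviations such as 83ce55e.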

Here we rev-parse the two tags:

  • v1.0 : We know that it refers to a commit, so we get the commit hash
  • v2.0 : An annotated tag is an object with its own hash.
Sushanth@Sushanth-VAIO MINGW64 /d/GITs/test1 (master)
$ git rev-parse v1.0
0500b45503db6409bc2dc2d2c27a8d09a86150f8

Sushanth@Sushanth-VAIO MINGW64 /d/GITs/test1 (master)
$ git rev-parse v2.0
8a128853cc7f76c0243331b49aed36f8100cbabf

Sushanth@Sushanth-VAIO MINGW64 /d/GITs/test1 (master)
$ git cat-file -t 8a1288
tag

Sushanth@Sushanth-VAIO MINGW64 /d/GITs/test1 (master)
$ git cat-file -p 8a1288
object 25dc023c91c8a2ae63b2f7d92f93b094347e9bec
type commit
tag v2.0
tagger Sushanth Bobby Lloyds <bobby.dreamer@gmail.com> 1543898924 +0530

O.O Version 2

With rev-parse, we can easily get the hash of a commit or tree using the form git rev-parse commit-ish^{type}

git rev-parse HEAD^{tree}

  • Shows the current HEAD's tree hash

git rev-parse HEAD^{commit}

  • Shows the current HEAD's commit hash
Sushanth@Sushanth-VAIO MINGW64 /d/GITs/Internals (master)
$ git lol
* 003c678 - (HEAD -> master) Initial Commit (76 minutes ago) <Sushanth Bobby Lloyds>

Sushanth@Sushanth-VAIO MINGW64 /d/GITs/Internals (master)
$ git rev-parse master^{commit}
003c6781e4475888b59f248e5e76d3334d278f99

Sushanth@Sushanth-VAIO MINGW64 /d/GITs/Internals (master)
$ git rev-parse master^{tree}
ee929cd9cd862b204986cf94ab23853b4c98cb97

Sushanth@Sushanth-VAIO MINGW64 /d/GITs/Internals (master)
$ git cat-file -p ee92
100644 blob 8ab686eafeb1f44702738c8b0f24f2567c36da6d HW.txt

Sushanth@Sushanth-VAIO MINGW64 /d/GITs/Internals (master)
$ git ls-files -s
100644 8ab686eafeb1f44702738c8b0f24f2567c36da6d 0 HW.txt

# Git FileSystemChecK

git fsck

  • Verifies the connectivity and validity of the objects in the repository

Typical output looks like this

Sushanth@Sushanth-VAIO MINGW64 /d/GITs/git (master)
$ git fsck
Checking object directories: 100% (256/256), done.
dangling commit 1972cb1c728ed3c120ed4ea41b1ff421d9eb7604
dangling blob 2ccc9d4b364a7f69544839b78e223c482508919f
dangling tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904
dangling commit 5060f314ceef6ed57af5a9c0bc96eee39d925e04
dangling commit 69643d82b2951bbe46e4c613446fe5424319cb5a
dangling blob 9686686012038d0e769708df79668e4a83afdccb
dangling blob 9fec48973ca630bb0869b04f42957f23f40e3d2e
dangling commit a34ce3582f54e198d8b050871ef5bab970a0c9c4
dangling blob d62cb1ccedd70cd9b1c1ce0a19962389f0d2b4a5
dangling tag 2725b352a7d635e37761ddb0e6070bb9ce5f40c0
dangling tag 4129fcbafa2cc682fc9fecef0304bedce94f7bbe
dangling commit 8e1f25a9ada69c9ae4b5d56c16722e5ffe2d8fb7
dangling tag 9a19fca3bb272e12b138ab3b43bbf45d5516eaa8
dangling commit 9a21716ed23c6f9049e90dfa1d86838fa3a22d44
dangling tag b3abe249b2b1e7d0cf65d77c276a3c77556db162
dangling commit f0871d07baf443ce8915d28e3cbdf1d658fec211
  • Dangling blob : A change that made it to the staging area/index but never got committed. One amazing thing about git is that once content has been added to the staging area, you can usually get it back (until garbage collection prunes it), because these blobs, like commits, have a hash too.

  • Dangling commit/tag : A commit or tag that is not associated with any reference, i.e. there is no way to reach it. For example, if we delete the branch featureX without merging its changes, the commit in featureX becomes a dangling commit because no reference is associated with it. Had it been merged into master, the HEAD and master references would have led to the commit in featureX and it would not be dangling, even if we deleted featureX.

You can think of branches (master/main, featureX) and HEAD as just references to specific commits. The featureX and master labels refer to the latest commits on their respective branches. HEAD generally refers to the tip of the currently checked-out branch (master in this case).
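
The featureX scenario can be played through on a toy graph. Everything in this sketch is hypothetical (the commit names c1, c2, f1, f2 and the refs), and it only models commits and branch references. Real git fsck also considers the index, reflogs and other object types.

```python
def reachable(refs, parents):
    """Collect every commit reachable from the given references."""
    seen = set()
    stack = list(refs.values())
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def dangling(refs, parents):
    """Unreachable commits that no other unreachable commit points to."""
    unreachable = set(parents) - reachable(refs, parents)
    referenced = {p for commit in unreachable for p in parents[commit]}
    return unreachable - referenced


# Hypothetical history: featureX (f1, f2) branched off c1 and was never merged.
PARENTS = {"c1": [], "c2": ["c1"], "f1": ["c1"], "f2": ["f1"]}

print(dangling({"master": "c2", "featureX": "f2"}, PARENTS))  # set()
print(dangling({"master": "c2"}, PARENTS))                    # {'f2'}
```

After the branch is deleted, both f1 and f2 are unreachable, but only the tip f2 counts as dangling, since f1 is still referenced by f2.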

# Garbage Collection

git gc

  • Executes a lot of housekeeping activities
    • Compresses all the objects and stores them in a pack file
    • Removes unreachable objects (dangling commits)

The command below prunes all dangling objects from the repository immediately, instead of waiting for the default grace period

git gc --prune=now

# Cleaning the repo

Command         Description
git clean -n    List which files would be removed (dry run)
git clean -f    Remove untracked files
git clean -dfx  (d): remove untracked folders too, (f): force, (x): remove ignored files as well

Caution : Most projects list key files and folders in .gitignore. git clean -dfx deletes those as well, and they cannot be recovered through git.

# Un git

rm -rf .git .gitignore

# Next steps

  • Collaboration : Git remote repository
  • Git Everyday : Git flowchart, shortcuts and references
  • Origin : How it all began. What is git ? and Terminologies used in this series.
  • Basics : config, init, add, rm, .gitignore, commit, log, blame, diff, tag, describe, show and stash
  • Undos : checkout, reset, revert and restore
  • Branching : Git Branching