
Just a short blog post about the HDFS directory item limit and NameNode memory limits.

I recently faced an issue with Hive that produced the following error message

org.apache.hadoop.ipc.RemoteException:
The directory item limit of /tmp/hive/hive is exceeded: limit=1048576 items=1048576

The parameter which caused the error above is called

dfs.namenode.fs-limits.max-directory-items

dfs.namenode.fs-limits.max-directory-items
Default: 1048576
Description: Defines the maximum number of items that a directory may contain.

In other words, the parameter controls how many items (files and subdirectories) a single directory on HDFS may contain.

To check the current setting of the parameter above you could run

hdfs getconf -confKey dfs.namenode.fs-limits.max-directory-items
1048576

The parameter’s default value is 1048576, which seems to be quite enough. But imagine you run thousands of jobs a day and each job writes a single log file to the same directory – the limit will be reached soon.

A quick fix is to delete the files that are no longer needed in the directory causing the error (e.g. /tmp/hive/hive).
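A minimal sketch of such a cleanup, assuming the affected directory is /tmp/hive/hive (as in the error above) and that whatever you put in place of the <obsolete-entries> placeholder is really safe to delete:

# show how many items the directory currently holds
# (output columns: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME)
hdfs dfs -count /tmp/hive/hive

# delete entries that are no longer needed; -skipTrash removes them
# immediately instead of moving them into the .Trash directory first
hdfs dfs -rm -r -skipTrash /tmp/hive/hive/<obsolete-entries>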

I’ve also discovered a script on GitHub for cleaning up the /tmp directory on HDFS.

https://github.com/nmilford/clean-hadoop-tmp

Please test carefully before running in production environments 😉

To prevent this issue you could set the parameter to “0”, which disables the check. However, the parameter needs to be specified in hdfs-site.xml, and changing it requires a restart of the whole HDFS stack.
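As a sketch, the entry in hdfs-site.xml would look like this (the value 0 follows the “disable the check” approach described above; you could just as well put a higher limit there instead):

<property>
  <name>dfs.namenode.fs-limits.max-directory-items</name>
  <!-- 0 disables the check as described above; a larger number simply raises the limit -->
  <value>0</value>
</property>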

 

Nevertheless, do not forget that the maximum number of files in HDFS depends on the amount of memory available to the NameNode.

As a rule of thumb, the NameNode needs about 1000 MB of heap per million blocks stored in HDFS.

So with a block size of 128 MB, a million blocks correspond to

128 MB * 1,000,000 blocks = 128,000,000 MB = 128 TB

So with 1000 MB allocated to the NameNode you could manage a cluster with roughly 128 TB of raw disk space.
Keep in mind the 1000 MB are just what the NameNode needs to hold the block metadata in memory. The node itself requires additional memory for other running services and the OS itself.
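To make the arithmetic explicit, here is a minimal back-of-the-envelope sketch in shell, assuming an average block size of 128 MB and the 1000 MB per million blocks rule of thumb from above (illustrative numbers, not a sizing recommendation):

RAW_TB=128                                       # raw data to store, in TB (example value)
BLOCK_MB=128                                     # average HDFS block size in MB
BLOCKS=$(( RAW_TB * 1000 * 1000 / BLOCK_MB ))    # ~1,000,000 blocks
HEAP_MB=$(( BLOCKS / 1000 ))                     # rule of thumb: 1000 MB per million blocks
echo "${BLOCKS} blocks -> roughly ${HEAP_MB} MB of NameNode heap"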

For reference, here are some links with examples of how to calculate the NameNode’s memory:

https://www.cloudera.com/documentation/enterprise/5-10-x/topics/admin_nn_memory_config.html

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.5/bk_command-line-installation/content/ref-80953924-1cbf-4655-9953-1e744290a6c3.1.html

 
